Learning Structured Sparsity in Deep Neural Networks

Wei Wen, University of Pittsburgh
Chunpeng Wu, University of Pittsburgh
Yandan Wang, University of Pittsburgh
Yiran Chen, University of Pittsburgh
Hai Li, University of Pittsburgh

Abstract

High demand for computation resources severely hinders the deployment of large-scale Deep Neural Networks (DNN) on resource-constrained devices. In this work, we propose a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNN’s evaluation. Experimental results show that SSL achieves on average 5.1× and 3.1× speedups of convolutional layer computation of AlexNet on CPU and GPU, respectively, with off-the-shelf libraries. These speedups are about twice those of non-structured sparsity; (3) regularize the DNN structure to improve classification accuracy. The results show that for CIFAR-10, regularization on layer depth can reduce a 20-layer Deep Residual Network (ResNet) to 18 layers while improving the accuracy from 91.25% to 92.60%, which is still slightly higher than that of the original ResNet with 32 layers. For AlexNet, structure regularization by SSL also reduces the error by about 1%. Our source code can be found at https://github.com/wenwei202/caffe/tree/scnn

1 Introduction

Deep neural networks (DNN), especially deep convolutional neural networks (CNN), have achieved remarkable success in visual tasks [1][2][3][4][5] by leveraging large-scale networks learning from a huge volume of data. Deployment of such big models, however, is computation-intensive and memory-intensive. To reduce computation cost, many studies have sought to compress the scale of DNNs, including sparsity regularization [6], connection pruning [7][8] and low rank approximation [9][10][11][12][13]. Sparsity regularization and connection pruning approaches, however, often produce non-structured random connectivity in DNN and thus irregular memory access that adversely impacts practical acceleration on hardware platforms. Figure 1 depicts the practical speedup of each layer of an AlexNet that is non-structurally sparsified by the ℓ1-norm. Compared to the original model, the accuracy loss of the sparsified model is controlled within 2%. Because of the poor data locality associated with the scattered weight distribution, the achieved speedups are either very limited or negative even when the actual sparsity is high, say, >95%. We define sparsity as the ratio of zeros in this paper. In recently proposed low rank approximation approaches, the DNN is trained first and then each trained weight tensor is decomposed and approximated by a product of smaller factors. Finally, fine-tuning is performed to restore the model accuracy. Low rank approximation is able to achieve practical speedups because it coordinates model parameters in dense matrices and avoids the locality problem of non-structured sparsity regularization. However, low rank approximation can only obtain the compact structure within each layer, and the structures of the layers are fixed during fine-tuning, so that costly reiterations of decomposing and fine-tuning are required to find an optimal weight approximation for performance speedup and accuracy retention.

arXiv:1608.03665v4 [cs.NE] 18 Oct 2016

[Figure 1 plot. x-axis: conv1, conv2, conv3, conv4, conv5; left y-axis: speedup; right y-axis: sparsity; series: Quadro K600, Tesla K40c, GTX Titan, Sparsity]

Figure 1: Evaluation speedups of AlexNet on GPU platforms and the sparsity. conv1 refers to convolutional layer 1, and so forth. Baseline is profiled by GEMM of cuBLAS. The sparse matrices are stored in the format of Compressed Sparse Row (CSR) and accelerated by cuSPARSE.

Inspired by the facts that (1) there is redundancy across filters and channels [11]; (2) shapes of filters are usually fixed as cuboid, but enabling arbitrary shapes can potentially eliminate unnecessary computation imposed by this fixation; and (3) depth of the network is critical for classification, but deeper layers cannot always guarantee a lower error because of the exploding gradients and degradation problem [5], we propose the Structured Sparsity Learning (SSL) method to directly learn a compressed structure of deep CNNs by group Lasso regularization during training. SSL is a generic regularization to adaptively adjust multiple structures in DNN, including structures of filters, channels, and filter shapes within each layer, and structure of depth beyond the layers. SSL combines structure regularization (on DNN for classification accuracy) with locality optimization (on memory access for computation efficiency), offering not only well-regularized big models with improved accuracy but also greatly accelerated computation (e.g., 5.1× on CPU and 3.1× on GPU for AlexNet).

2 Related works

Connection pruning and weight sparsifying. Han et al. [7][8] reduced the number of parameters of AlexNet by 9× and VGG-16 by 13× using connection pruning. Since most of the reduction is achieved on fully-connected layers, the authors obtained 3× to 4× layer-wise speedup for fully-connected layers. However, no practical speedups of convolutional layers are observed because of the issue shown in Figure 1. As convolution is the computational bottleneck and many new DNNs use fewer fully-connected layers (e.g., only 3.99% of the parameters of ResNet-152 in [5] are from fully-connected layers), compression and acceleration of convolutional layers become essential. Liu et al. [6] achieved >90% sparsity of convolutional layers in AlexNet with 2% accuracy loss, and bypassed the issue shown in Figure 1 by hardcoding the sparse weights into the program, achieving a layer-wise 4.59× speedup on a CPU. In this work, we also focus on convolutional layers. Compared to the above techniques, our SSL method can coordinate sparse weights in adjacent memory space and achieve higher speedups with the same accuracy. Note that hardware and program optimizations can further boost the system performance on top of SSL but are not covered in this work.

Low rank approximation. Denil et al. [9] predicted 95% of the parameters in a DNN by exploiting the redundancy across filters and channels. Inspired by it, Jaderberg et al. [11] achieved 4.5× speedup on CPUs for scene text character recognition and Denton et al. [10] achieved 2× speedups on both CPUs and GPUs for the first two layers. Both works used Low Rank Approximation (LRA) with about 1% accuracy drop. [13][12] improved and extended LRA to larger DNNs. However, the network structure compressed by LRA is fixed; reiterations of decomposing, training/fine-tuning, and cross-validating are still needed to find an optimal structure for the accuracy and speed trade-off. As the number of hyper-parameters in LRA methods increases linearly with layer depth [10][13], the search space increases linearly or even polynomially for very deep DNNs. Compared to LRA, our contributions are: (1) SSL can dynamically optimize the compactness of DNN structure with only one hyper-parameter and no reiterations; (2) besides the redundancy within the layers, SSL also exploits the necessity of deep layers and reduces them; (3) DNN filters regularized by SSL have lower rank approximation, so SSL can work together with LRA for more efficient model compression.

Model structure learning. Group Lasso [14] is an efficient regularization to learn sparse structures. Kim et al. [15] used group Lasso to regularize the structure of the correlation tree for a multi-task regression problem and reduced prediction errors. Liu et al. [6] utilized group Lasso to constrain the scale of the structure of LRA. To adapt DNN structure to different databases, Feng et al. [16] learned the appropriate number of filters in DNN. Different from this prior art, we apply group Lasso to regularize multiple DNN structures (filters, channels, filter shapes, and layer depth). Our source code can be found at https://github.com/wenwei202/caffe/tree/scnn.

[Figure 2 diagram. Labeled groups: filter-wise, $W^{(l)}_{n_l,:,:,:}$ (1); channel-wise, $W^{(l)}_{:,c_l,:,:}$ (2); shape-wise, $W^{(l)}_{:,c_l,m_l,k_l}$ (3); depth-wise, $W^{(l)}$ (4), with a shortcut across the layer]

Figure 2: The proposed structured sparsity learning (SSL) for DNNs. Weights in filters are split into multiple groups. Through group Lasso regularization, a more compact DNN is obtained by removing some groups. The figure illustrates the filter-wise, channel-wise, shape-wise, and depth-wise structured sparsity that were explored in the work.

3 Structured Sparsity Learning Method for DNNs

We focus mainly on Structured Sparsity Learning (SSL) on convolutional layers to regularize the structure of DNNs. We first propose a generic method to regularize structures of DNN in Section 3.1, and then specialize the method to structures of filters, channels, filter shapes and depth in Section 3.2. Variants of the formulations are also discussed from the computational efficiency viewpoint in Section 3.3.

3.1 Proposed structured sparsity learning for generic structures

Suppose the weights of convolutional layers in a DNN form a sequence of 4-D tensors $W^{(l)} \in \mathbb{R}^{N_l \times C_l \times M_l \times K_l}$, where $N_l$, $C_l$, $M_l$ and $K_l$ are the dimensions of the $l$-th ($1 \le l \le L$) weight tensor along the axes of filter, channel, spatial height and spatial width, respectively. $L$ denotes the number of convolutional layers. Then the proposed generic optimization target of a DNN with structured sparsity regularization can be formulated as:

$$E(W) = E_D(W) + \lambda \cdot R(W) + \lambda_g \cdot \sum_{l=1}^{L} R_g\left(W^{(l)}\right). \tag{1}$$

Here $W$ represents the collection of all weights in the DNN; $E_D(W)$ is the loss on data; $R(\cdot)$ is non-structured regularization applied to every weight, e.g., the ℓ2-norm; and $R_g(\cdot)$ is the structured sparsity regularization on each layer. Because group Lasso can effectively zero out all weights in some groups [14][15], we adopt it in our SSL. The regularization of group Lasso on a set of weights $w$ can be represented as $R_g(w) = \sum_{g=1}^{G} \|w^{(g)}\|_g$, where $w^{(g)}$ is a group of partial weights in $w$ and $G$ is the total number of groups. Different groups may overlap. Here $\|\cdot\|_g$ is the group Lasso, or

$$\|w^{(g)}\|_g = \sqrt{\sum_{i=1}^{|w^{(g)}|} \left(w_i^{(g)}\right)^2},$$

where $|w^{(g)}|$ is the number of weights in $w^{(g)}$.
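The group Lasso term defined above can be illustrated with a minimal NumPy sketch. The function names `group_lasso_norm` and `group_lasso_penalty` and the toy groups are assumptions made for illustration; they are not taken from the authors' Caffe implementation.

```python
import numpy as np

def group_lasso_norm(group):
    """Return the (non-squared) l2 norm of one group of weights w^(g)."""
    return float(np.sqrt(np.sum(np.square(group))))

def group_lasso_penalty(groups):
    """Sum the group norms: R_g(w) = sum over g of ||w^(g)||_g."""
    return sum(group_lasso_norm(g) for g in groups)

# Toy example with three hypothetical groups, one of which is entirely zero.
groups = [np.array([3.0, 4.0]), np.zeros(3), np.array([1.0, 2.0, 2.0])]
penalty = group_lasso_penalty(groups)
print(penalty)  # 5.0 + 0.0 + 3.0 = 8.0
```

Because each group contributes its norm rather than its squared norm, an all-zero group adds nothing to the penalty, which is consistent with the text's statement that group Lasso can zero out all weights in some groups.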

3.2 Structured sparsity learning for structures of filters, channels, filter shapes and depth

In SSL, the learned “structure” is decided by the way the groups $w^{(g)}$ are split. We investigate and formulate the filter-wise, channel-wise, shape-wise, and depth-wise structured sparsity shown in Figure 2. For simplicity, the $R(\cdot)$ term of Eq. (1) is omitted in the following formulations.

Penalizing unimportant filters and channels. Suppose $W^{(l)}_{n_l,:,:,:}$ is the $n_l$-th filter and $W^{(l)}_{:,c_l,:,:}$ is the $c_l$-th channel of all filters in the $l$-th layer. The optimization target of learning the filter-wise and channel-wise structured sparsity can be defined as

$$E(W) = E_D(W) + \lambda_n \cdot \sum_{l=1}^{L} \left( \sum_{n_l=1}^{N_l} \left\| W^{(l)}_{n_l,:,:,:} \right\|_g \right) + \lambda_c \cdot \sum_{l=1}^{L} \left( \sum_{c_l=1}^{C_l} \left\| W^{(l)}_{:,c_l,:,:} \right\|_g \right). \tag{2}$$

As indicated in Eq. (2), our approach tends to remove less important filters and channels. Note that zeroing out a filter in the $l$-th layer results in a dummy zero output feature map, which in turn makes the corresponding channel in the $(l+1)$-th layer useless. Hence, we combine the filter-wise and channel-wise structured sparsity in the learning simultaneously.
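As a rough illustration of the regularization terms in Eq. (2), the sketch below computes the filter-wise and channel-wise penalties for weight tensors stored as NumPy arrays of shape (N, C, M, K). The function names, the toy layer shapes, and the values of `lambda_n` and `lambda_c` are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

def filter_wise_penalty(W):
    """Sum of l2 norms of the filters W[n, :, :, :] of an (N, C, M, K) tensor."""
    return float(np.sqrt(np.sum(W ** 2, axis=(1, 2, 3))).sum())

def channel_wise_penalty(W):
    """Sum of l2 norms of the channels W[:, c, :, :] of an (N, C, M, K) tensor."""
    return float(np.sqrt(np.sum(W ** 2, axis=(0, 2, 3))).sum())

def filter_channel_regularizer(weights, lambda_n, lambda_c):
    """Regularization part of Eq. (2), summed over all layers."""
    return sum(
        lambda_n * filter_wise_penalty(W) + lambda_c * channel_wise_penalty(W)
        for W in weights
    )

# Toy two-layer network with hypothetical shapes.
W1 = np.ones((4, 3, 2, 2))  # layer l: 4 filters
W2 = np.ones((5, 4, 2, 2))  # layer l+1: 4 input channels, one per filter of layer l

# Zero filter 1 in layer l. Channel 1 of layer l+1 then only sees a zero
# feature map, so it can be zeroed as well.
W1[1] = 0.0
W2[:, 1] = 0.0

reg = filter_channel_regularizer([W1, W2], lambda_n=0.1, lambda_c=0.1)
remaining_filters_l1 = int(np.count_nonzero(np.sum(W1 ** 2, axis=(1, 2, 3))))
print(remaining_filters_l1)  # 3
```

The paired zeroing of `W1[1]` and `W2[:, 1]` mirrors the observation in the text that a removed filter in one layer makes the matching channel in the next layer useless.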

Learning arbitrary shapes of filters. As illustrated in Figure 2, $W^{(l)}_{:,c_l,m_l,k_l}$ denotes the vector of all corresponding weights located at spatial position $(m_l, k_l)$ in the 2D filters across the $c_l$-th channel. Thus, we define $W^{(l)}_{:,c_l,m_l,k_l}$ as the shape fiber related to learning arbitrary filter shapes, because a homogeneous non-cubic filter shape can be learned by zeroing out some shape fibers. The optimization target of learning shapes of filters becomes:

$$E(W) = E_D(W) + \lambda_s \cdot \sum_{l=1}^{L} \left( \sum_{c_l=1}^{C_l} \sum_{m_l=1}^{M_l} \sum_{k_l=1}^{K_l} \left\| W^{(l)}_{:,c_l,m_l,k_l} \right\|_g \right). \tag{3}$$

Regularizing layer depth. We also explore the depth-wise sparsity to regularize the depth of DNNs in order to improve accuracy and reduce computation cost. The corresponding optimization target is $E(W) = E_D(W) + \lambda_d \cdot \sum_{l=1}^{L} \|W^{(l)}\|_g$. Different from the other sparsification techniques discussed, zeroing out all the filters in a layer will cut off the message propagation in the DNN so that the output neurons cannot perform any classification. Inspired by the structure of highway networks [17] and deep residual networks [5], we propose to leverage the shortcuts across layers to solve this issue. As illustrated in Figure 2, even when SSL removes entire unimportant layers, feature maps will still be forwarded through the shortcut.
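The shape-wise and depth-wise groupings can be sketched in the same way. In the snippet below, `toy_residual_block` is a hypothetical stand-in (an identity shortcut plus a single linear map) used only to show why a shortcut keeps information flowing when a whole layer is zeroed; it is not the ResNet block used in the experiments.

```python
import numpy as np

def shape_wise_penalty(W):
    """Sum of l2 norms of the shape fibers W[:, c, m, k] (Eq. (3), one layer)."""
    return float(np.sqrt(np.sum(W ** 2, axis=0)).sum())

def depth_wise_penalty(W):
    """l2 norm of the whole layer W, treated as a single group."""
    return float(np.sqrt(np.sum(W ** 2)))

def toy_residual_block(x, W):
    """Hypothetical block: identity shortcut plus a linear layer, y = x + W x."""
    return x + W @ x

W = np.ones((2, 3, 2, 2))
shape_pen = shape_wise_penalty(W)  # 12 fibers, each with norm sqrt(2)
depth_pen = depth_wise_penalty(W)  # sqrt(24)

x = np.array([1.0, -2.0, 3.0])
W_removed = np.zeros((3, 3))       # a layer that has been zeroed out entirely
y = toy_residual_block(x, W_removed)
print(np.array_equal(y, x))        # True: the input passes through the shortcut
```

Without the `x +` shortcut term, the zeroed layer would output all zeros and nothing downstream could distinguish between inputs.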

3.3 Structured sparsity learning for computationally efficient structures

All proposed schemes in Section 3.2 can learn a compact DNN for computation cost reduction. Moreover, some variants of the formulations of these schemes can directly learn structures that can be efficiently computed.

2D-filter-wise sparsity for convolution. 3D convolution in DNNs is essentially a composition of 2D convolutions. To perform efficient convolution, we explored a fine-grained variant of filter-wise sparsity, namely 2D-filter-wise sparsity, to spatially enforce group Lasso on each 2D filter $W^{(l)}_{n_l,c_l,:,:}$. The saved convolution is proportional to the percentage of removed 2D filters. The fine-grained version of filter-wise sparsity can more efficiently reduce the computation associated with convolution: because the group sizes are much smaller and thus the weight-updating gradients are sharper, it helps group Lasso to quickly obtain a high ratio of zero groups for a large-scale DNN.
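A minimal sketch of the 2D-filter-wise grouping follows, assuming weights stored as an (N, C, M, K) NumPy array. The helper names and the toy tensor are illustrative assumptions.

```python
import numpy as np

def filter2d_norms(W):
    """l2 norm of every 2D filter W[n, c, :, :]; the result has shape (N, C)."""
    return np.sqrt(np.sum(W ** 2, axis=(2, 3)))

def filter2d_penalty(W):
    """Group Lasso penalty with one group per 2D filter."""
    return float(filter2d_norms(W).sum())

def removed_2d_filter_ratio(W):
    """Fraction of 2D filters that are entirely zero."""
    return float(np.mean(filter2d_norms(W) == 0.0))

W = np.ones((2, 3, 2, 2))  # 6 hypothetical 2D filters of size 2x2
W[0, 1] = 0.0
W[1, 2] = 0.0

penalty = filter2d_penalty(W)       # 4 remaining filters, each with norm 2
ratio = removed_2d_filter_ratio(W)  # 2 of 6 filters removed
print(penalty, ratio)
```

Since each zero 2D filter corresponds to one skipped 2D convolution, `ratio` is the quantity the text describes as proportional to the saved convolution.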

Combination of filter-wise and shape-wise sparsity for GEMM. Convolutional computation in DNNs is commonly converted to GEneral Matrix Multiplication (GEMM) by lowering weight tensors and feature tensors to matrices [18]. For example, in Caffe [19], a 3D filter $W^{(l)}_{n_l,:,:,:}$ is reshaped to a row in the weight matrix, where each column is the collection of weights $W^{(l)}_{:,c_l,m_l,k_l}$ related to shape-wise sparsity. Combining filter-wise and shape-wise sparsity can directly reduce the dimension of the weight matrix in GEMM by removing zero rows and columns. In this context, we use row-wise and column-wise sparsity as interchangeable terminology for filter-wise and shape-wise sparsity, respectively.
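The effect of removing zero rows and columns from the lowered weight matrix can be sketched as follows. This is a simplified illustration under stated assumptions: the feature matrix `X` is random stand-in data rather than a real lowered feature map, and the helper names are hypothetical rather than Caffe functions.

```python
import numpy as np

def lower_weights(W):
    """Reshape an (N, C, M, K) tensor to an N x (C*M*K) matrix, one filter per row."""
    return W.reshape(W.shape[0], -1)

def compact_gemm_operand(W_mat):
    """Drop all-zero rows and columns; also return the masks of what was kept."""
    keep_rows = np.any(W_mat != 0.0, axis=1)
    keep_cols = np.any(W_mat != 0.0, axis=0)
    return W_mat[keep_rows][:, keep_cols], keep_rows, keep_cols

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2, 3, 3))
W[1] = 0.0           # filter-wise sparsity: one all-zero row
W[:, 0, 0, 0] = 0.0  # shape-wise sparsity: shape fiber at (c=0, m=0, k=0)
W[:, 1, 2, 2] = 0.0  # shape-wise sparsity: shape fiber at (c=1, m=2, k=2)

W_mat = lower_weights(W)  # 4 x 18
W_small, keep_rows, keep_cols = compact_gemm_operand(W_mat)  # 3 x 16

X = rng.normal(size=(18, 7))    # stand-in lowered feature matrix (7 positions)
full = W_mat @ X                # dense GEMM on the original matrix
small = W_small @ X[keep_cols]  # smaller GEMM on the compacted operands

print(W_small.shape)                       # (3, 16)
print(np.allclose(full[keep_rows], small)) # True
```

The smaller product reproduces every non-zero output row of the full product, and the dropped rows of `full` are exactly zero, so the compacted GEMM loses nothing while both operands shrink.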

4 Experiments

We evaluated the effectiveness of our SSL using published models on three databases: MNIST, CIFAR-10, and ImageNet. Unless stated otherwise, SSL starts with the network whose weights are initialized by the baseline, and speedups are measured in matrix-matrix multiplication by Caffe in a single thread on an Intel Xeon E5-2630 CPU.


Table 1: Results after penalizing unimportant filters and channels in LeNet

LeNet #        Error   Filter # §   Channel # §   FLOP §       Speedup §
1 (baseline)   0.9%    20–50        1–20          100%–100%    1.00×–1.00×
2              0.8%    5–19         1–4           25%–7.6%     1.64×–5.23×
3              1.0%    3–12         1–3           15%–3.6%     1.99×–7.44×

§ In the order of conv1–conv2

Table 2: Results after learning filter shapes in LeNet

LeNet #        Error   Filter size §   Channel #   FLOP         Speedup
1 (baseline)   0.9%    25–500          1–20        100%–100%    1.00×–1.00×
4              0.8%    21–41           1–2         8.4%–8.2%    2.33×–6.93×
5              1.0%    7–14            1–1         1.4%–2.8%    5.19×–10.82×

§ The sizes of filters after removing zero shape fibers, in the order of conv1–conv2

4.1 LeNet and multilayer perceptron on MNIST

In the experiment of MNIST, we examined the effectiveness of SSL in two types of networks: LeNet [20] implemented by Caffe and a multilayer perceptron (MLP) network. Both networks were trained without data augmentation.

LeNet: When applying SSL to LeNet, we constrain the network with filter-wise and channel-wise sparsity in convolutional layers to penalize unimportant filters and channels. Table 1 summarizes the remaining filters and channels, floating-point operations (FLOP), and practical speedups. In the table, LeNet 1 is the baseline and the others are the results after applying SSL with different strengths of structured sparsity regularization. The results show that our method achieves a similar error (±0.1%) with far fewer filters and channels, and saves significant FLOP and computation time. To demonstrate the impact of SSL on the structures of filters, we present all learned conv1 filters in Figure 3. It can be seen that most filters in LeNet 2 are entirely zeroed out, except for the five most important detectors of stroke patterns, which are sufficient for feature extraction. The accuracy of LeNet 3 (which further removes the weakest and redundant stroke detector) drops only 0.2% from that of LeNet 2. Compared to the random and blurry filter patterns in LeNet 1 that result from the high freedom of the parameter space, the filters in LeNet 2 and 3 are regularized and converge to smoother and more natural patterns. This explains why our proposed SSL obtains the same level of accuracy with far fewer filters. The smoothness of the filters is also observed in the deeper layers.

The effectiveness of the shape-wise sparsity on LeNet is summarized in Table 2. The baseline LeNet 1 has conv1 filters with a regular 5×5 square shape (size = 25), while LeNet 5 reduces them to a shape that fits within a 2×4 rectangle (size = 7). The 3D shape of conv2 filters in the baseline is also regularized to a 2D shape in LeNet 5 within only one channel, indicating that only one filter in conv1 is needed. This significantly saves FLOP and computation time.

Figure 3: Learned conv1 filters in LeNet 1 (top), LeNet 2 (middle) and LeNet 3 (bottom)

MLP: Besides convolutional layers, our proposed SSL can be extended to learn the structure (i.e., the number of neurons) of fully-connected layers. We enforce the group Lasso regularization on all the input (or output) connections of each neuron. A neuron whose input connections are all zeroed out degenerates to a bias neuron in the next layer; similarly, a neuron degenerates to a removable dummy neuron if all of its output connections are zeroed out. Figure 4(a) summarizes the learned structure and FLOP of different MLP networks. The results show that SSL can not only remove hidden neurons but also discover the sparsity of images. For example, Figure 4(b) depicts the number of connections of each input neuron in MLP 2, where 40.18% of input neurons have zero connections and they concentrate at the boundary of the image. Such a distribution is consistent with our intuition: handwritten digits are usually written in the center, and pixels close to the boundary contain little discriminative classification information.

(a) Learning the number of neurons in multi-layer perceptron

MLP #          Error   Neuron # per layer §   FLOP per layer §
1 (baseline)   1.43%   784–500–300–10         100%–100%–100%
2              1.34%   469–294–166–10         35.18%–32.54%–55.33%
3              1.53%   434–174–78–10          19.26%–9.05%–26.00%

§ In the order of input layer–hidden layer 1–hidden layer 2–output layer

(b) [Image: 28×28 grid of input pixels (axes 1 to 28); color scale from 0 to 291]

Figure 4: (a) Results of learning the number of neurons in MLP. (b) The connection numbers of input neurons (i.e., pixels) in MLP 2 after SSL.

Table 3: Learning row-wise and column-wise sparsity of ConvNet on CIFAR-10

ConvNet #      Error   Row sparsity §     Column sparsity §   Speedup §
1 (baseline)   17.9%   12.5%–0%–0%        0%–0%–0%            1.00×–1.00×–1.00×
2              17.9%   50.0%–28.1%–1.6%   0%–59.3%–35.1%      1.43×–3.05×–1.57×
3              16.9%   31.3%–0%–1.6%      0%–42.8%–9.8%       1.25×–2.01×–1.18×

§ In the order of conv1–conv2–conv3
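For fully-connected layers, grouping the output connections of each neuron can be sketched as below. The weight layout (rows index the receiving neurons, columns index the sending neurons), the function names, and the tiny layer sizes are assumptions made for illustration.

```python
import numpy as np

def output_connection_penalty(W):
    """Group Lasso over the output connections of each sending neuron.

    W has shape (n_out, n_in); the output connections of sending
    neuron j form the column W[:, j].
    """
    return float(np.sqrt(np.sum(W ** 2, axis=0)).sum())

def unused_inputs(W):
    """Boolean mask of sending neurons whose output connections are all zero."""
    return np.all(W == 0.0, axis=0)

# Toy first layer: 4 input pixels feeding 3 hidden neurons (hypothetical sizes).
W1 = np.ones((3, 4))
W1[:, 0] = 0.0  # pixel 0 has no remaining connections
W1[:, 3] = 0.0  # pixel 3 has no remaining connections

mask = unused_inputs(W1)
zero_fraction = float(mask.mean())       # 0.5
W1_compact = W1[:, ~mask]                # smaller matrix-vector operand, shape (3, 2)
penalty = output_connection_penalty(W1)  # 2 non-zero columns, each with norm sqrt(3)
print(zero_fraction, W1_compact.shape)
```

Here `zero_fraction` plays the role of the share of input neurons with zero connections reported for MLP 2, and dropping the masked columns is what shrinks the matrix-vector product.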

4.2 ConvNet and ResNet on CIFAR-10

We implemented the ConvNet of [1] and deep residual networks (ResNet) [5] on CIFAR-10. When regularizing filters, channels, and filter shapes, the results and observations for both networks are similar to those of the MNIST experiment. Moreover, we simultaneously learn the filter-wise and shape-wise sparsity to reduce the dimension of the weight matrix in GEMM of ConvNet. We also learn the depth-wise sparsity of ResNet to regularize the depth of the DNNs.

ConvNet: We use the network from Krizhevsky et al. [1] as the baseline and implement it using Caffe. All the configurations remain the same as in the original implementation, except that we added a dropout layer with a ratio of 0.5 in the fully-connected layer to avoid over-fitting. ConvNet is trained without data augmentation. Table 3 summarizes the results of three ConvNet networks. Here, the row/column sparsity of a weight matrix is defined as the percentage of all-zero rows/columns. Figure 5 shows their learned conv1 filters. In Table 3, SSL can reduce the size of the weight matrix in ConvNet 2 by 50%, 70.7% and 36.1% for each convolutional layer and achieve good speedups without accuracy drop. Surprisingly, without SSL, four conv1 filters of the baseline are actually all-zeros as shown in Figure 5, demonstrating the great potential of filter sparsity. When SSL is applied, half of the conv1 filters in ConvNet 2 can be zeroed out without accuracy drop.
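The size reductions quoted above follow from the row and column sparsities in Table 3: removing a fraction r of the rows and a fraction c of the columns leaves (1 - r)(1 - c) of the matrix. The sketch below checks this arithmetic for ConvNet 2. The helper names are illustrative, and the 32×75 matrix used for the baseline check is an assumed conv1 shape (32 rows is consistent with four all-zero filters giving 12.5% row sparsity; the column count is a guess).

```python
import numpy as np

def row_sparsity(W_mat):
    """Fraction of all-zero rows of a 2D weight matrix."""
    return float(np.mean(np.all(W_mat == 0.0, axis=1)))

def column_sparsity(W_mat):
    """Fraction of all-zero columns of a 2D weight matrix."""
    return float(np.mean(np.all(W_mat == 0.0, axis=0)))

def size_reduction(row_sp, col_sp):
    """Fraction of the weight matrix removed when zero rows and columns are dropped."""
    return 1.0 - (1.0 - row_sp) * (1.0 - col_sp)

# (row sparsity, column sparsity) of ConvNet 2, copied from Table 3.
convnet2 = {"conv1": (0.500, 0.000), "conv2": (0.281, 0.593), "conv3": (0.016, 0.351)}
reductions = {
    name: round(100 * size_reduction(r, c), 1) for name, (r, c) in convnet2.items()
}
print(reductions)  # {'conv1': 50.0, 'conv2': 70.7, 'conv3': 36.1}

# Baseline check with an assumed 32 x 75 conv1 matrix and four all-zero filters.
W_base = np.ones((32, 75))
W_base[:4] = 0.0
print(row_sparsity(W_base))  # 0.125
```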

On the other hand, in ConvNet 3, SSL achieves 1.0% (±0.16%) lower error with a model even smaller than the baseline. In this scenario, SSL acts as a structure regularization that dynamically learns a better network structure (including the number of filters and filter shapes) to reduce the error.

Figure 5: Learned conv1 filters in ConvNet 1 (top), ConvNet 2 (middle) and ConvNet 3 (bottom)

ResNet: To investigate the necessary depth of DNNs required by SSL, we use the 20-layer deep residual network (ResNet-20) proposed in [5] as the baseline. The network has 19 convolutional layers and 1 fully-connected layer. Identity shortcuts are utilized to connect the feature maps with the same dimension, while 1×1 convolutional layers are chosen as shortcuts between the feature maps with different dimensions. Batch normalization [21] is adopted after convolution and before activation. We use the same data augmentation and training hyper-parameters as those in [5]. The final error of the baseline is 8.82%. In SSL, the depth of ResNet-20 is regularized by depth-wise sparsity. Group Lasso regularization is only enforced on the convolutional layers between each pair of shortcut endpoints, excluding the first convolutional layer and all convolutional shortcuts. After SSL converges, layers with all zero weights are …

[Figure 6 plots. Left panel: x-axis SSL-ResNet-# (12 to 20), y-axis % error; series: SSL, ResNet-20, ResNet-32. Right panel: x-axis SSL-ResNet-# (12 to 20), y-axis # conv layers; series: 32×32, 16×16, 8×8]

Figure 6: Error vs. layer number after depth regularization by SSL. ResNet-# is the original ResNet in [5] with # layers. SSL-ResNet-# is the depth-regularized ResNet by SSL with # layers, including the last fully-connected layer. 32×32 indicates the convolutional layers with an output map size of 32×32, and so forth.
