Network Sparsity

This article covers some methods that sparsify neural networks during training.

Recently I read two papers published by Sung Ju Hwang's group at ICML 2017. The two papers share some ideas, so I put them here to compare them.

Combined Group and Exclusive Sparsity for Deep Neural Networks

Paper in one sentence

This paper trains neural networks with two added regularizers: group sparsity and exclusive sparsity.

Motivation

  • In the optimal case, the weights at each layer would be fully orthogonal to each other, thus forming an orthogonal basis set.
  • To approach this, enforce the network weights at each layer to fit different sets of input features as much as possible (exclusive sparsity).
  • However, it is neither practical nor desirable to restrict each weight to be completely disjoint from the others, since some features still need to be shared (think of low-level features). Therefore, an additional regularizer based on the (2,1)-norm is introduced (group sparsity).

Idea Visualization

  • Group sparsity tries to delete entire input neurons.
  • Exclusive sparsity tries to make output neurons compete for input neurons; that is, each input neuron ends up belonging to only one output neuron.

Formulation

Group Sparsity

\(\Omega (\mathbf{W}) = \sum_g ||\mathbf{W}_g^l||_2 = \sum_g \sqrt{\sum_i (w_{g,i}^l)^2}\)
where:

  • \(g\) represents a group; in this paper, each input neuron and its corresponding weights form a group.
  • \(l\): the layer index of the network

Clearly, this is a group lasso over input neurons: take the 2-norm within each group first, then the 1-norm across groups.
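To make the grouping concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code) of the group sparsity term for a single layer, assuming the weight matrix has shape (inputs, outputs) so that each row holds one input neuron's outgoing weights:

```python
import numpy as np

def group_sparsity(W):
    """(2,1)-norm of W: 2-norm within each group (a row, i.e. one input
    neuron's outgoing weights), then summed (1-norm) across groups."""
    return np.sum(np.linalg.norm(W, axis=1))

# Toy layer with 4 input neurons and 3 output neurons.
W = np.array([[ 0.5, -0.2,  0.1],
              [ 0.0,  0.0,  0.0],   # an input neuron that has been pruned away
              [ 0.3,  0.4, -0.1],
              [-0.2,  0.1,  0.2]])
print(group_sparsity(W))
```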

Exclusive Sparsity

\(\Omega (\mathbf{W}) = \frac{1}{2} \sum_g ||\mathbf{W}_g^l||_1^2 = \frac{1}{2} \sum_g (\sum_i |w_{g,i}^l|)^2\)
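Under the same (assumed) row-wise grouping, the exclusive sparsity term can be sketched as:

```python
import numpy as np

def exclusive_sparsity(W):
    """(1,2)-norm penalty: 1-norm within each group (row), then squared and
    summed across groups, scaled by 1/2 as in the formula above."""
    return 0.5 * np.sum(np.sum(np.abs(W), axis=1) ** 2)
```

Unlike the group term, this penalty grows quadratically with how much 1-norm mass a single group accumulates, which is what drives the competition discussed next.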

Some thoughts

It is easy to see that the group lasso results in group sparsity, but it confused me why the (1,2)-norm can “make output neurons compete for inputs”.

As the paper says:

Applying 2-norm over these 1-norm groups will result in even weights among the groups; that is, all groups should have similar number of non-sparse weights, and thus no group can have large number of non-sparse weight.
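Here is a tiny worked example of the quoted argument (my own, not from the paper). Suppose two groups share a fixed total 1-norm mass of 4. Putting all of it into one group versus splitting it evenly gives

\[\frac{1}{2}(4^2 + 0^2) = 8 \qquad \text{vs.} \qquad \frac{1}{2}(2^2 + 2^2) = 4,\]

so the outer squared 2-norm prefers spreading the 1-norm mass evenly across groups, while the inner 1-norm still promotes sparsity within each group. With our grouping, every input neuron keeps some outgoing weight, but it connects to only a few output neurons, which is the competition described above.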

Also, exclusive sparsity was first introduced for multi-task learning in “Hierarchical Classification via Orthogonal Transfer”; more details can be found in that paper.

Optimization

The optimization uses proximal gradient descent:

  1. First obtain the intermediate solution \(\mathbf{W}_{t+\frac{1}{2}}\) by taking a gradient step using the gradient of the loss only.
  2. Then optimize the regularization term while performing a Euclidean projection onto the solution space (the proximal step; see the sketch below): \(\mathbf{W}_{t+1} = \arg\min_{\mathbf{W}} \{\Omega (\mathbf{W}) + \frac{1}{2\lambda} || \mathbf{W} - \mathbf{W}_{t+\frac{1}{2}} ||_2^2 \}\)
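A minimal sketch of this two-step update (my own illustration; `loss_grad` is the gradient of the data loss at the current weights, and `prox` stands for one of the proximal operators derived below; I fold the step size into the regularization strength, a common convention):

```python
import numpy as np

def proximal_gradient_step(W, loss_grad, prox, lr, lam):
    """One proximal gradient descent step.

    1) Gradient step on the loss only -> intermediate solution W_half.
    2) Proximal step: closed-form solution of
       argmin_V  Omega(V) + (1 / (2 * lr * lam)) * ||V - W_half||_2^2
    """
    W_half = W - lr * loss_grad       # step 1: plain gradient descent on the loss
    return prox(W_half, lr * lam)     # step 2: shrinkage induced by the regularizer
```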

Group Sparsity proximal step

Set \(f(\mathbf{W}_{t+1}) = \frac{1}{2} ||\mathbf{W}_{t+1} - \mathbf{W}_{t + \frac{1}{2}}||^2_2 + \underbrace{\lambda \left[ ||\mathbf{W}_{t+1}^1||_2 + ||\mathbf{W}_{t+1}^2||_2 + \cdots + ||\mathbf{W}_{t+1}^g||_2 \right]}_{\text{Regularizer}}\)

\[\frac{\partial f(\mathbf{W}_{t+1})}{\partial \mathbf{W}_{t+1}} = \mathbf{W}_{t+1} - \mathbf{W}_{t + \frac{1}{2}} + \lambda \left[\frac{\mathbf{W}_{t+1}^1}{||\mathbf{W}_{t+1}^1||_2}, \frac{\mathbf{W}_{t+1}^2}{||\mathbf{W}_{t+1}^2||_2}, \cdots ,\frac{\mathbf{W}_{t+1}^g}{||\mathbf{W}_{t+1}^g||_2}\right] = 0\]

Setting this to zero and solving per group (the equation decouples across groups), we finally get: \(\mathbf{W}^{t+1}_{g,i} = \left(1 - \frac{\lambda}{||\mathbf{W}^{t + \frac{1}{2}}_g||_2} \right)_{+} \mathbf{W}^{t + \frac{1}{2}}_{g,i}\)
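This group-wise shrinkage is easy to implement; a minimal NumPy sketch under the same row-grouping assumption as before:

```python
import numpy as np

def prox_group_sparsity(W_half, lam):
    """Group soft-thresholding: scale each row (group) of the intermediate
    solution by (1 - lam / ||W_g||_2)_+, zeroing out weak groups
    (i.e. eliminating whole input neurons)."""
    norms = np.linalg.norm(W_half, axis=1, keepdims=True)          # ||W_g||_2 per row
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)  # (.)_+ with safe division
    return scale * W_half
```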

Exclusive Sparsity proximal step

\(\mathbf{W}^{t+1}_{g,i} = \left(1 - \frac{\lambda ||\mathbf{W}_g||_1}{|w_{g,i}|} \right)_{+} \mathbf{W}^{t+\frac{1}{2}}_{g,i}\)
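And a direct NumPy transcription of the exclusive sparsity update as written above (again assuming rows as groups): weights that are small relative to their group's total 1-norm are set exactly to zero.

```python
import numpy as np

def prox_exclusive_sparsity(W_half, lam):
    """Per-weight shrinkage: scale w_{g,i} by (1 - lam * ||W_g||_1 / |w_{g,i}|)_+,
    so within each group (input neuron) only the relatively large weights survive."""
    group_l1 = np.sum(np.abs(W_half), axis=1, keepdims=True)   # ||W_g||_1 per row
    abs_w = np.maximum(np.abs(W_half), 1e-12)                  # avoid division by zero
    scale = np.maximum(1.0 - lam * group_l1 / abs_w, 0.0)      # (.)_+
    return scale * W_half
```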

Experiments

Visualization–Fully Connected

  • Group sparsity regularizer results in the total elimination of certain features.
  • Exclusive sparsity regularizer, when used on its own, results in disjoint feature selection for each class.
  • When combined, they allow a certain degree of feature reuse.

Visualization–Convolution

  • The combined group and exclusive sparsity regularizer results in filters that are much sharper than those obtained with the other regularizers.
  • Some spatial features are dropped altogether due to competition with other filters.