Name: TCL
Paper: [Link]
Code: N/A
Features:
- First work to apply tensor decomposition as a general-purpose layer, either added to the network or used to replace fully-connected layers.
- Uses tensor-times-matrix (TTM) operations, i.e. mode-n products.
- Tensor modes, in order: height, width, channel.
- In TCLs, the height size (128-512 in the paper) is much larger than the width size (always 3 in the paper).
- Good references
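The TTM operation mentioned above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the paper's implementation: the helper name `mode_n_product` and the tensor sizes are assumptions made for the example.

```python
import numpy as np

def mode_n_product(tensor, matrix, mode):
    """Tensor-times-matrix (TTM): multiply `tensor` by `matrix` along axis `mode`.

    tensor: array whose axis `mode` has size D
    matrix: array of shape (R, D)
    returns: array of the same rank, with axis `mode` resized from D to R
    """
    # Contract the matrix's second axis with the tensor's `mode` axis.
    # tensordot places the new axis (size R) first, so move it back to `mode`.
    out = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(out, 0, mode)

# Hypothetical sizes: one activation with height 8, width 6 and 4 channels.
x = np.random.default_rng(0).standard_normal((8, 6, 4))
v = np.random.default_rng(1).standard_normal((3, 8))  # maps height 8 -> 3

y = mode_n_product(x, v, mode=0)
print(y.shape)  # (3, 6, 4)
```

Only the chosen mode changes size; every other mode passes through untouched, which is what lets a layer shrink some modes while leaving others alone.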
Findings:
- TCLs reduce the dimensionality of the activation tensors (only along the two spatial modes of the images, leaving the channel and batch modes untouched) and thus the number of model parameters, while preserving high accuracy.
- Fully-connected layers are optimized with tensor factorizations in two ways:
- TCL as an additional layer: reduce the dimensionality of the activation tensor before feeding it to the subsequent two (or more) fully-connected layers and the softmax output of the network. This approach preserves or even increases accuracy.
- TCL as a replacement for fully-connected layers (partial or full): this approach costs a little accuracy but significantly reduces the number of parameters.
- Taking the input to the fully-connected layers as an activation tensor X of size (D1, …, DN), we seek a low-dimensional core tensor G of smaller size (R1, …, RN).
- Both the number of parameters and the time complexity of a TCL are smaller than those of a fully-connected layer. (The paper gives a detailed comparison of complexity and parameter counts.)
- To avoid vanishing or exploding gradients, and to make the TCL more robust to changes in the initialization of the factors, we added a batch normalization layer [8] before and after the TCL.
- Future work:
- We plan to extend our work to more network architectures, especially in settings where raw data or learned representations exhibit natural multi-modal structure that we might capture via high-order tensors.
- We also endeavor to advance our experimental study of TCLs for large-scale, high-resolution vision datasets.
- Plan to integrate new extended BLAS primitives which can avoid transpositions needed to compute the tensor contractions.
- We will look into methods to induce and exploit sparsity in the TCL, to understand the parameter reductions this method can yield over existing state-of-the-art pruning methods.
- We are working on an extension to the TCL: a tensor regression layer to replace both the fully-connected and final output layers, potentially yielding increased accuracy with even greater parameter reductions.
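As a rough illustration of the findings above, a TCL forward pass can be sketched as a sequence of mode-n products, one per contracted mode, turning an activation tensor X of size (D1, …, DN) into a core G of size (R1, …, RN). This is not the authors' code (the notes list none): the function name `tcl_forward`, the sizes and the random factors are all assumptions chosen for the example. The sketch also compares the parameter count with a fully-connected layer mapping the same flattened input to the same flattened output.

```python
import numpy as np

def tcl_forward(x, factors):
    """Contract each mode of `x` that has a factor matrix.

    x: activations of shape (batch, D1, ..., DN)
    factors: dict mapping an axis index k of `x` (axis 0 is the batch) to a
             matrix of shape (Rk, Dk); axes without an entry are left untouched
    """
    out = x
    for mode, factor in factors.items():
        # Mode-n product: contract the factor's columns with axis `mode`,
        # then move the new axis (size Rk) back into position `mode`.
        contracted = np.tensordot(factor, out, axes=(1, mode))
        out = np.moveaxis(contracted, 0, mode)
    return out

rng = np.random.default_rng(0)
# Hypothetical sizes: (batch, height, width, channel)
x = rng.standard_normal((2, 16, 16, 8))
# Contract only the two spatial modes, 16 -> 4 each; batch and channel stay.
factors = {1: rng.standard_normal((4, 16)), 2: rng.standard_normal((4, 16))}

g = tcl_forward(x, factors)
print(g.shape)  # (2, 4, 4, 8)

# Parameters: the TCL stores one small matrix per contracted mode, whereas a
# fully-connected layer needs one weight per (input unit, output unit) pair.
tcl_params = sum(f.size for f in factors.values())
fc_params = int(np.prod(x.shape[1:])) * int(np.prod(g.shape[1:]))
print(tcl_params, fc_params)  # 128 262144
```

In this toy setting the TCL uses 128 weights where a dense layer of the same input and output size would use 262144 (biases ignored), which is the kind of gap the parameter comparison in the paper refers to.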
Other Knowledge:
- Fully-connected layers hold over 80% of the parameters.
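The 80% figure can be sanity-checked with simple arithmetic. The sketch below assumes the standard VGG-16 layout (13 convolutions with 3x3 kernels, three fully-connected layers, 224x224 RGB input, 1000 classes); the notes do not say which VGG variant the paper used, so treat that choice as an assumption.

```python
# Parameter count for VGG-16 (assumed variant), weights plus biases.
conv_channels = [64, 64, 128, 128, 256, 256, 256,
                 512, 512, 512, 512, 512, 512]

conv_params = 0
in_ch = 3  # RGB input
for out_ch in conv_channels:
    # 3x3 kernel per (input channel, output channel) pair, plus one bias each
    conv_params += in_ch * out_ch * 3 * 3 + out_ch
    in_ch = out_ch

# After five 2x2 poolings a 224x224 input becomes a 512 x 7 x 7 activation.
fc_sizes = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]
fc_params = sum(n_in * n_out + n_out for n_in, n_out in fc_sizes)

total = conv_params + fc_params
fc_fraction = fc_params / total
print(conv_params, fc_params, total)  # 14714688 123642856 138357544
print(round(fc_fraction, 3))          # 0.894
```

Under these assumptions the fully-connected layers account for roughly 89% of the weights, with the first of them (25088 to 4096) dominating.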
Useful references:
- Recently, tensor methods have been used in attempts to better understand the success of deep neural networks [On the Expressive Power of Deep Learning: A Tensor Analysis] [Global optimality in tensor factorization, deep learning, and beyond.]
- Other lines of research have investigated practical applications of tensor decomposition to deep neural networks with aims including multi-task learning [Deep Multi-task Representation Learning: A Tensor Factorisation Approach], sharing residual units [Sharing residual units through collective tensor factorization in deep neural networks], and speeding up convolutional neural networks [Post: SPEEDING-UP CONVOLUTIONAL NEURAL NETWORKS USING FINE-TUNED CP-DECOMPOSITION]. Several recent papers apply decompositions for either initialization [Deep Multi-task Representation Learning: A Tensor Factorisation Approach] or post-training [Post: Tensorizing Neural Networks]. These techniques then often require additional fine-tuning to compensate for the loss of information.
- However, to our knowledge, no attempt has been made to apply tensor contractions as a generic layer directly on the activations or weights of a deep neural network and to train the resulting network end-to-end.
Software:
- AlexNet
- VGG
Dataset:
- CIFAR100
- ImageNet