Name: TCL
Paper: [Link]
Code: N/A
Features:
 First work to apply tensor contractions as a generic layer, either in addition to or as a replacement for fully-connected layers.
 Uses TTM (tensor-times-matrix, i.e. mode-n product) operations.
 Tensor modes correspond, in order, to height, width, and channel.
 In the TCL layers, the height dimension (128–512 in the paper) is much larger than the width dimension (always 3 in the paper).
 Good references
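The TTM (mode-n product) operation the layer builds on can be sketched in NumPy; this is a minimal illustration, and the function name and tensor sizes here are mine, not the paper's:

```python
import numpy as np

def mode_n_product(tensor, matrix, mode):
    """Mode-n product (TTM): contract `matrix` (R x D_n) with axis
    `mode` of `tensor`, replacing that dimension D_n with R."""
    t = np.moveaxis(tensor, mode, 0)            # bring contracted axis to front
    res = np.tensordot(matrix, t, axes=(1, 0))  # (R, ...remaining axes...)
    return np.moveaxis(res, 0, mode)            # restore axis order

# A batch of activations: (batch, height, width, channels)
X = np.random.randn(4, 32, 32, 16)
W = np.random.randn(8, 32)          # projects the height mode: 32 -> 8
Y = mode_n_product(X, W, mode=1)
print(Y.shape)  # (4, 8, 32, 16)
```

Contracting along one mode leaves all other modes (including the batch axis) untouched, which is exactly the property the TCL relies on.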
Findings:
 TCLs reduce the dimensionality of the activation tensors (only the two spatial modes of the images; the channel and batch-size modes are left untouched) and thus the number of model parameters, while preserving high accuracy.
 Optimize fully-connected layers using tensor factorizations, via two approaches:
  TCL as an additional layer: reduces the dimensionality of the activation tensor before it is fed to the subsequent two (or more) fully-connected layers and the softmax output of the network. This approach preserves or even increases accuracy.
  TCL as a replacement of a fully-connected layer (partial or full replacement of the fully-connected layers): this slightly affects accuracy but significantly reduces the number of parameters.
 Taking the input to the fully-connected layers as an activation tensor X of size (D1, …, DN), we seek a low-dimensional core tensor G of smaller size (R1, …, RN).
 Both the number of parameters and the time complexity of a TCL are smaller than those of a fully-connected layer. (A detailed comparison of the complexity and parameter counts is in the paper.)
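A minimal sketch of the contraction from (D1, …, DN) down to (R1, …, RN), with an illustrative parameter comparison against a dense layer of the same output size. The sizes and function names are mine, chosen for illustration, not taken from the paper:

```python
import numpy as np

def tcl(X, factors):
    """Tensor contraction layer: contract each non-batch mode of X
    with its factor matrix, shrinking (D1, ..., DN) to (R1, ..., RN)."""
    G = X
    for mode, W in enumerate(factors, start=1):  # axis 0 is the batch; skip it
        G = np.moveaxis(np.tensordot(W, np.moveaxis(G, mode, 0), axes=(1, 0)), 0, mode)
    return G

# Activation tensor per sample: 12 x 12 x 256, contracted to 6 x 6 x 128
batch, D, R = 8, (12, 12, 256), (6, 6, 128)
X = np.random.randn(batch, *D)
factors = [np.random.randn(r, d) for r, d in zip(R, D)]
G = tcl(X, factors)
print(G.shape)  # (8, 6, 6, 128)

# Parameter comparison: one dense matrix vs. N small factor matrices
fc_params = np.prod(D) * np.prod(R)            # fully-connected weight matrix
tcl_params = sum(r * d for r, d in zip(R, D))  # three small factors
print(fc_params, tcl_params)
```

With these illustrative sizes the dense mapping needs over 10^8 weights while the factorized contraction needs only a few tens of thousands, which is the parameter saving the notes refer to.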

To avoid vanishing or exploding gradients, and to make the TCL more robust to changes in the initialization of the factors, we added a batch normalization layer [8] before and after the TCL.
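The batch-normalization-before-and-after arrangement can be sketched as follows. This uses a simplified batch norm without the learned scale/shift parameters, and the sizes are illustrative:

```python
import numpy as np

def batchnorm(x, eps=1e-5):
    """Normalize each feature over the batch axis. Full batch norm
    also learns a scale and shift; they are omitted here for brevity."""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

X = np.random.randn(8, 12, 12, 16)  # (batch, height, width, channels)
W = np.random.randn(6, 12)          # contracts the height mode: 12 -> 6

h = batchnorm(X)                    # BN before the contraction
h = np.moveaxis(np.tensordot(W, np.moveaxis(h, 1, 0), axes=(1, 0)), 0, 1)
h = batchnorm(h)                    # BN after the contraction
print(h.shape)  # (8, 6, 12, 16)
```

Normalizing on both sides keeps the scale of the activations stable regardless of how the factor matrices are initialized, which is the robustness the quote describes.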
 Future work

We plan to extend our work to more network architectures, especially in settings where raw data or learned representations exhibit natural multi-modal structure that we might capture via higher-order tensors.

We also endeavor to advance our experimental study of TCLs for large-scale, high-resolution vision datasets.
 Plan to integrate new extended BLAS primitives that can avoid the transpositions needed to compute the tensor contractions.

We will look into methods to induce and exploit sparsity in the TCL, to understand the parameter reductions this method can yield over existing state-of-the-art pruning methods.

We are working on an extension to the TCL: a tensor regression layer to replace both the fully-connected and final output layers, potentially yielding increased accuracy with even greater parameter reductions.

Other Knowledge:
 Fully-connected layers hold over 80% of the parameters.
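A back-of-envelope check of that claim for AlexNet. The fully-connected layer sizes below are the standard AlexNet dimensions; biases are ignored and the convolutional total is approximated:

```python
# Rough parameter count for AlexNet's fully-connected layers (biases ignored)
fc6 = 6 * 6 * 256 * 4096   # flattened conv5 output (9216) -> 4096
fc7 = 4096 * 4096
fc8 = 4096 * 1000          # 1000 ImageNet classes
fc_total = fc6 + fc7 + fc8

conv_total = 3_700_000     # approximate total across the five conv layers
share = fc_total / (fc_total + conv_total)
print(fc_total, round(share, 2))  # the fc layers dominate the parameter budget
```

For AlexNet the fully-connected share comes out well above the 80% figure quoted in the notes, which is why these layers are the natural target for compression.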
Useful reference:
 Recently, tensor methods have been used in attempts to better understand the success of deep neural networks [On the Expressive Power of Deep Learning: A Tensor Analysis] [Global Optimality in Tensor Factorization, Deep Learning, and Beyond].
 Other lines of research have investigated practical applications of tensor decomposition to deep neural networks with aims including multi-task learning [Deep Multi-task Representation Learning: A Tensor Factorisation Approach], sharing residual units [Sharing Residual Units Through Collective Tensor Factorization in Deep Neural Networks], and speeding up convolutional neural networks [Post: Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition]. Several recent papers apply decompositions for either initialization [Deep Multi-task Representation Learning: A Tensor Factorisation Approach] or post-training [Post: Tensorizing Neural Networks]. These techniques then often require additional fine-tuning to compensate for the loss of information.
 However, to our knowledge, no attempt has been made to apply tensor contractions as a generic layer directly on the activations or weights of a deep neural network and to train the resulting network endtoend.
Software:
 AlexNet
 VGG
Dataset:
 CIFAR100
 ImageNet