“Tensor Contraction Layers for Parsimonious Deep Nets” — Jean Kossaifi et al. 2017

Name: TCL

Paper: [Link]

Code: N/A

Features:

  • First work on applying tensor decomposition as a general layer and on using it to replace fully-connected layers.
  • Uses TTM (tensor-times-matrix, i.e. mode-n product) operations.
  • Tensor modes correspond to height, width, and channel, in that order.
  • In the TCLs, the height mode size (128-512 in the paper) is much larger than the width mode size (always 3 in the paper).
  • Good references

Findings:

  • TCLs reduce the dimensionality of the activation tensors (only along the two spatial modes of the images, leaving the channel and batch-size modes untouched) and thus the number of model parameters, while preserving high accuracy.
  • Optimize fully-connected layers using tensor factorizations, via two approaches:
    • TCL as an additional layer: reduce the dimensionality of the activation tensor before feeding it to the subsequent two (or more) fully-connected layers and the softmax output of the network. This approach preserves or even increases accuracy.
    • TCL as a replacement of a fully-connected layer (partial or full replacement of the fully-connected layers): this approach reduces accuracy slightly but significantly reduces the number of parameters.
    • Taking the input to the fully-connected layers as an activation tensor X of size (D1, …, DN), we seek a low-dimensional core tensor G of smaller size (R1, …, RN), obtained by contracting X with a factor matrix along each mode (see the sketch after this list).
  • Both the number of parameters and the time complexity of a TCL are smaller than those of a fully-connected layer. (A detailed comparison of the complexity and parameter counts is in the paper.)
  • To avoid vanishing or exploding gradients, and to make the TCL more robust to changes in the initialization of the factors, the authors add a batch normalization layer before and after the TCL.

  • Future work:
    • Extend the work to more network architectures, especially in settings where raw data or learned representations exhibit natural multi-modal structure that might be captured via higher-order tensors.
    • Advance the experimental study of TCLs on large-scale, high-resolution vision datasets.
    • Integrate new extended BLAS primitives that avoid the transpositions needed to compute the tensor contractions.
    • Look into methods to induce and exploit sparsity in the TCL, to understand the parameter reductions this method can yield over existing state-of-the-art pruning methods.
    • Extend the TCL to a tensor regression layer that replaces both the fully-connected and final output layers, potentially yielding increased accuracy with even greater parameter reductions.
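A minimal NumPy sketch of the contraction described in the Findings above (not the authors' implementation; the activation shape and the contracted sizes R1, R2, R3 below are made up): each non-batch mode of the activation tensor is multiplied by a small factor matrix via a TTM / mode-n product, producing the smaller core tensor G.

```python
# Sketch of a Tensor Contraction Layer (TCL) forward pass in plain NumPy.
# The activation tensor X of size (batch, D1, D2, D3) is contracted along each
# non-batch mode with a factor matrix V_k of size (R_k, D_k), giving a core
# tensor G of size (batch, R1, R2, R3).  Shapes here are illustrative only.
import numpy as np

def mode_dot(tensor, matrix, mode):
    """TTM / mode-n product: contract `tensor` with `matrix` along `mode`."""
    t = np.moveaxis(tensor, mode, -1)        # bring the contracted mode last
    t = t @ matrix.T                         # (..., D_mode) -> (..., R_mode)
    return np.moveaxis(t, -1, mode)          # move the new mode back in place

def tcl_forward(X, factors):
    """Apply one factor matrix per non-batch mode (modes 1..N)."""
    G = X
    for k, V in enumerate(factors, start=1):
        G = mode_dot(G, V, mode=k)
    return G

batch, D1, D2, D3 = 32, 14, 14, 256                        # hypothetical activation shape
R1, R2, R3 = 7, 7, 64                                      # hypothetical contracted sizes
X = np.random.randn(batch, D1, D2, D3)
factors = [np.random.randn(R, D) for R, D in ((R1, D1), (R2, D2), (R3, D3))]
print(tcl_forward(X, factors).shape)                       # (32, 7, 7, 64)
```

In the actual layer the factor matrices are learned end-to-end by backpropagation along with the rest of the network.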

Other Knowledge:

  • In networks such as AlexNet and VGG, the fully-connected layers hold over 80% of the parameters.

Useful reference:

Software:

  • AlexNet
  • VGG

Dataset:

  • CIFAR100
  • ImageNet

Study of Deep Learning

Materials:

  • Video:

Knowledge

  • Softmax layer
    • the output layer
    • output a probability for each class
  • forward evaluation
  • backward propagation
    • Update weights
    • E.g. Gradient descent
  • ground truth
  • FFN: Feed Forward Neural Net
  • Set initial weights
    • Autoencoder
  • Data representation
    •  features
      • categorical features
        • no intrinsic ordering
        • require additional encoding, usually one-hot encoding (see the sketches after this list)
      • ordinal features
    • Pre-process dataset:
      • min-max normalization
        • use when the min and max are known
        • the network learns faster
        • prevents numerical error
      • Standardization
        • use when the min and max are unknown
    • Overfitting
      • Avoid:
        • Dropout: only for deep learning
        • regularization
  • Network structure
    • Depth: #Hidden layers
      • width and #parameters determine the depth
    • Width: the dimension of each layer
      • usually < 1000; at most a few hundred neurons per hidden layer
    • Connectivity: how neurons are connected among each other
    • #Parameters: determined by the above three factors.
      • Too many will overfit.
      • “Sample/parameter” ratio: usually between 5 and 30.
    • Shape: “tower” vs “pyramid” shape
      • Usually “pyramid” shape
      • Deeper is better.
      • Thin-tall is better than fat-short.
  • Activation function
    • Like a switch
    • Usually non-linear functions
    • E.g. (see the sketches after this list):
      • sigmoid, ranging from 0 to 1
        • In deep networks: vanishing gradient
        • Used in recurrent NNs (RNN, LSTM), not in feed-forward nets
      • ReLU, i.e. max(0, x), ranging from 0 upward
        • Avoids vanishing gradient
        • Most commonly used
        • Used in feed-forward NNs
      • tanh, ranging from -1 to 1
        • Commonly used when features take negative values
        • Used in NLP
  • Loss (or cost) function
    • cross-entropy
      • More suitable for predicting categorical labels
    • squared error
      • More suitable for predicting continuous values
    • Why:
      • Compare the loss surfaces of different loss functions
      • They differ in “steepness”, which affects how quickly gradient descent learns
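A small NumPy sketch of the pre-processing items above (one-hot encoding, min-max normalization, standardization); the data values are made up for illustration.

```python
# Sketch of the pre-processing steps from the notes above (illustrative data).
import numpy as np

# One-hot encoding of a categorical feature with no intrinsic ordering,
# e.g. {red, green, blue} mapped to indices {0, 1, 2}.
labels = np.array([0, 2, 1, 2])
one_hot = np.eye(3)[labels]                      # shape (4, 3)

# Min-max normalization: use when the min and max are known; maps values to [0, 1].
x = np.array([10.0, 20.0, 35.0, 50.0])
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: use when the min and max are unknown; zero mean, unit variance.
x_std = (x - x.mean()) / x.std()

print(one_hot)
print(x_minmax)
print(x_std)
```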
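And a small NumPy sketch of the activation and loss functions above (sigmoid, ReLU, tanh; softmax with cross-entropy vs. squared error); the logits and targets are made up.

```python
# Sketch of the activation and loss functions from the notes above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))              # range (0, 1); gradients can vanish in deep nets

def relu(z):
    return np.maximum(0.0, z)                    # 0 for negative inputs, identity otherwise

def tanh(z):
    return np.tanh(z)                            # range (-1, 1)

def softmax(logits):
    e = np.exp(logits - logits.max())            # subtract max for numerical stability
    return e / e.sum()                           # a probability for each class

def cross_entropy(probs, true_class):
    return -np.log(probs[true_class])            # for categorical labels

def squared_error(pred, target):
    return np.mean((pred - target) ** 2)         # for continuous values

logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)                          # e.g. the softmax output layer
print(probs, cross_entropy(probs, true_class=0))
print(squared_error(np.array([1.2, 0.8]), np.array([1.0, 1.0])))
```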

“SPEEDING-UP CONVOLUTIONAL NEURAL NETWORKS USING FINE-TUNED CP-DECOMPOSITION” — Vadim Lebedev et al. 2015

Paper: [Link]

Code: N/A

Features:

  • Use CP decomposition with NLS (non-linear least squares) method
    • minimizes the L2-norm of the approximation residual (for a user-defined fixed R) using Gauss-Newton optimization.
  • Decompose the 4D kernel tensor
    • The convolution kernel itself constitutes a 4D tensor with dimensions corresponding to the two spatial dimensions, the input image maps, and the output image maps.

Findings:

  • CP-decomposition approximates the convolution with a 4D kernel tensor by a sequence of four convolutions with small 2D kernel tensors. This decomposition is used to replace the original convolutional layer with a sequence of four convolutional layers with small kernels (see the sketch below).
  • fine-tune the entire network on training data using back-propagation.
    • This discriminative fine-tuning works well, even when CP-decomposition has large approximation error.
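A minimal PyTorch sketch of that four-convolution replacement (not the authors' code; the channel counts S, T, kernel size d, and CP rank R are made up, and the weights here are random rather than obtained from an actual CP decomposition of a trained kernel followed by fine-tuning): a 1x1 convolution maps the S input channels to R CP components, two grouped convolutions apply the d x 1 and 1 x d spatial factors independently per component, and a final 1x1 convolution maps the R components to the T output channels.

```python
# Sketch of replacing one conv layer (d x d x S x T kernel) with the sequence of
# four small convolutions implied by a rank-R CP decomposition (PyTorch; sizes made up).
import torch
import torch.nn as nn

S, T, d, R = 64, 128, 3, 16                       # in/out channels, kernel size, CP rank

cp_conv = nn.Sequential(
    nn.Conv2d(S, R, kernel_size=1, bias=False),                    # S -> R, 1x1
    nn.Conv2d(R, R, kernel_size=(d, 1), padding=(d // 2, 0),
              groups=R, bias=False),                               # d x 1, one filter per component
    nn.Conv2d(R, R, kernel_size=(1, d), padding=(0, d // 2),
              groups=R, bias=False),                               # 1 x d, one filter per component
    nn.Conv2d(R, T, kernel_size=1, bias=True),                     # R -> T, 1x1
)

x = torch.randn(1, S, 32, 32)
print(cp_conv(x).shape)                           # torch.Size([1, 128, 32, 32])
```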

Other Knowledge:

  • On the theoretical side, these results confirm the intuition that modern CNNs are over-parameterized, i.e. that the sheer number of parameters in modern CNNs is not needed to store the information about the classification task but, rather, serves to facilitate convergence to good local minima of the loss function.

Useful reference:

  • Suggested a scheme based on the CP-decomposition of parts of the kernel tensor obtained by biclustering (along with different decompositions for the first convolutional layer and the fully-connected layers). The CP-decompositions of the kernel tensor parts are computed with a greedy approach. Only fine-tunes the layers above the approximated one.
    • Denton, Emily, Zaremba, Wojciech, Bruna, Joan, LeCun, Yann, and Fergus, Rob. Exploiting linear structure within convolutional networks for efficient evaluation. arXiv preprint arXiv:1404.0736, 2014.
  • Effectively approximate the 4D kernel tensor as a composition (product) of two 3D tensors, perform “local” fine-tuning that minimizes the deviation between the full and the approximated convolutions outputs on the training data.
    • Jaderberg, Max, Vedaldi, Andrea, and Zisserman, Andrew. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference (BMVC), 2014a.
  • There is no finite algorithm for determining the canonical rank of a tensor. [Paper]

Dataset:

  • ImageNet

“Tensorizing Neural Networks” — Alexander Novikov et al. 2015

Paper: [Link]

TensorNet Code: [Link]

Features:

  • Use tensorization to build a higher-order tensor from a vector or a matrix (by reshaping it)
  • Optimize fully-connected layers

Findings:

  • Convert the dense weight matrices of the fully-connected layers to the Tensor Train (TT) format; the TT-format is then used for both the TT-layer forward pass and the learning steps (see the sketch after this list).
  • Potentially addresses the two issues of wide and shallow networks (see below for the issues).
  • Result: for the Very Deep VGG networks, the compression factor of the dense weight matrix of a fully-connected layer is up to 200,000, leading to a compression factor of up to 7 for the whole network.
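A rough NumPy sketch of the compression idea (illustrative matrix shape, tensorized dimensions, and TT-rank; not the authors' TensorNet code, which uses the TT-matrix format that pairs a row index and a column index of W in each core, whereas this sketch simply applies a truncated TT-SVD to a tensorized view of W):

```python
# Sketch: compress a dense FC weight matrix by reshaping it into a higher-order
# tensor and applying a simple truncated TT-SVD.  Shapes and ranks are made up.
import numpy as np

def tt_svd(tensor, max_rank):
    """Plain TT-SVD: returns a list of 3D cores G_k of shape (r_{k-1}, n_k, r_k)."""
    dims = tensor.shape
    cores, r_prev = [], 1
    mat = tensor.reshape(r_prev * dims[0], -1)
    for k in range(len(dims) - 1):
        U, s, Vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))                               # truncate to the TT-rank
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        mat = (np.diag(s[:r]) @ Vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores

W = np.random.randn(1024, 1024)                   # dense FC weight matrix
T = W.reshape(4, 4, 8, 8, 4, 4, 8, 8)             # tensorized view of W
cores = tt_svd(T, max_rank=8)
print(W.size, sum(c.size for c in cores))         # dense vs. TT parameter count
```

For a random matrix the truncated TT representation is a poor approximation; the point of the sketch is only the storage format and the parameter count (roughly one million dense entries versus a few thousand TT-core entries here). In the paper the TT-format weights are trained directly with backpropagation.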

Other Knowledge:

  • These advances of Deep neural networks have become possible because of algorithmic advances, large amounts of available data, and modern hardware.
  • State-of-the-art neural networks have reached the hardware limits both in terms of computational power and memory. A large number of works have tried to reduce both hardware requirements (e.g. memory demands) and running times.
  • One of the most straightforward approaches is to use a low-rank representation of the weight matrices. Recent studies show that the weight matrix of the fully-connected layer is highly redundant and by restricting its matrix rank it is possible to greatly reduce the number of parameters without significant drop in the predictive accuracy.
  • Matrix and tensor decompositions were recently used to speed up the inference time of CNNs
    • Matrix: E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems 27 (NIPS), 2014, pp. 1269–1277.
    • Tensor: V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, “Speeding-up convolutional neural networks using fine-tuned CP-decomposition,” in International Conference on Learning Representations (ICLR), 2014.
  • TT-format is immune to the curse of dimensionality and its algorithms are robust.
  • For an arbitrary tensor A, a TT-representation exists but is not unique (see the sketch after this list).
  • An attractive property of the TT-decomposition is the ability to efficiently perform several types of operations on tensors if they are in the TT-format:
    • basic linear algebra operations, such as the addition of a constant, the multiplication by a constant, the summation, and the entrywise product of tensors (the results of these operations are tensors in the TT-format, generally with increased ranks)
    • computation of global characteristics of a tensor, such as the sum of all elements and the Frobenius norm
  • Traditionally, very wide shallow networks are not considered because of high computational and memory demands and the over-fitting risk.
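A tiny self-contained sketch of what the TT-format is (random cores with made-up shapes and ranks): an element of the tensor is the product of one slice taken from each core, so a d-way tensor is stored as d small 3D cores instead of one exponentially large array.

```python
# Sketch of the TT-format: A[i1, ..., id] = G1[:, i1, :] @ G2[:, i2, :] @ ... @ Gd[:, id, :]
# Cores below are random, only to illustrate the indexing; shapes are (r_{k-1}, n_k, r_k).
import numpy as np

shapes = [(1, 4, 3), (3, 5, 3), (3, 6, 1)]        # a 4 x 5 x 6 tensor with TT-ranks (1, 3, 3, 1)
cores = [np.random.randn(*s) for s in shapes]

def tt_element(cores, index):
    out = np.eye(1)
    for core, i in zip(cores, index):
        out = out @ core[:, i, :]                 # multiply the selected core slices
    return out[0, 0]

print(tt_element(cores, (0, 2, 4)))               # element A[0, 2, 4]
```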

Useful reference:

  • Tensor Train paper: I. V. Oseledets, “Tensor-Train decomposition,” SIAM J. Scientific Computing, vol. 33, no. 5, pp. 2295– 2317, 2011.
    • Application: A. Novikov, A. Rodomanov, A. Osokin, and D. Vetrov, “Putting MRFs on a Tensor Train,” in International Conference on Machine Learning (ICML), 2014, pp. 811–819.
  • Hierarchical Tucker paper: W. Hackbusch and S. Kuhn, “A new scheme for the tensor representation,” J. Fourier Anal. Appl., vol. 15, pp. 706–722, 2009.

Dataset:

  • MNIST: small, handwritten-digit recognition
    • Y. LeCun, C. Cortes, and C. J. C. Burges, “The MNIST database of handwritten digits,” 1998.
  • CIFAR-10: small, 50,000 train and 10,000 test 32*32 3-channel images, assigned to 10 different classes
    • A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, Computer Science Department, University of Toronto, 2009
  • ImageNet: large, 1000-class ImageNet ILSVRC-2012 dataset, 1.2 million training images and 50,000 validation images.
    • A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25 (NIPS), 2012, pp. 1097–1105.

“Learning Spatiotemporal Features with 3D Convolutional Networks” — Du Tran et al. 2015

Paper: Learning Spatiotemporal Features with 3D Convolutional Networks

Website: C3D

Features:

  • 3D
  • Supervised learning
  • For video dataset

Findings:

  • 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets.
  • 3D ConvNet has the ability to model temporal information better owing to 3D convolution and 3D pooling operations.
  • A homogeneous architecture with small 3*3*3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets.
    • Fix the spatial receptive field to 3*3 and vary only the temporal depth of the 3D convolution kernels.
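A minimal PyTorch sketch of a homogeneous 3D-convolutional block in the spirit of this finding (the channel counts and pooling schedule are illustrative, not the exact C3D architecture): every convolution kernel is 3x3x3, and 3D pooling handles the temporal dimension, with the first pooling layer keeping the temporal resolution.

```python
# Sketch of a homogeneous 3D conv block with 3x3x3 kernels (PyTorch; sizes made up).
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),   # 3x3x3 kernel over (time, height, width)
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),          # pool only spatially early on
    nn.Conv3d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=2),                  # later pooling also covers time
)

clip = torch.randn(1, 3, 16, 112, 112)            # (batch, channels, frames, H, W)
print(block(clip).shape)                          # torch.Size([1, 128, 8, 28, 28])
```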

Useful reference:

  • 3D ConvNets: S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. IEEE TPAMI, 35(1):221–231, 2013. 1, 2
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014. 1, 2, 3, 4, 5, 6
    • “slow fusion model” uses 3D convolutions and averaging pooling in its first 3 convolution layers. It still loses all temporal information after the third convolution layer.

Dataset:

  • UCF101: medium-scale
    • K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. In CRCV-TR-12-01, 2012. 5, 7
  • Sports-1M: the largest video classification benchmark
    • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014. 1, 2, 3, 4, 5, 6
  • ASLAN: 3631 videos from 432 action classes, for action similarity labeling
  • YUPENN: 420 videos of 14 scene categories
    • K. Derpanis, M. Lecce, K. Daniilidis, and R. Wildes. Dynamic scene understanding: The role of orientation features in space and time in scene classification. In CVPR, 2012. 8
  • Maryland:
    • N. Shroff, P. K. Turaga, and R. Chellappa. Moving vistas: Exploiting motion for describing scenes. In CVPR, 2010. 8
  • egocentric: 42 types of everyday objects
    • X. Ren and M. Philipose. Egocentric recognition of handled objects: Benchmark and analysis. In Egocentric Vision workshop, 2009. 2, 8