“Tensor Contraction Layers for Parsimonious Deep Nets” — Jean Kossaifi et al. 2017

Name: TCL

Paper: [Link]

Code: N/A


  • First work to apply tensor contractions as a general layer, replacing fully-connected layers.
  • Uses TTM (tensor-times-matrix) operations.
  • Tensor modes' meaning: height, width, channel, in that order.
  • In TCL layers, the height size (128-512 in the paper) is much larger than the width size (always 3 in the paper).
  • Good references
  • Good references


  • TCLs reduce the dimensionality of the activation tensors (only the two spatial modes of images; the modes corresponding to the channel and the batch size are left untouched) and thus the number of model parameters, while preserving high accuracy.
  • Optimize fully-connected layers using tensor factorizations, via two approaches:
    • TCL as an additional layer: reduce the dimensionality of the activation tensor before feeding it to the subsequent two (or more) fully-connected layers and the softmax output of the network. This approach preserves or even increases accuracy.
    • TCL as a replacement for a fully-connected layer (partial or full replacement): this approach affects accuracy slightly, but significantly reduces the number of parameters.
    • Taking the input to the fully-connected layers as an activation tensor X of size (D1, …, DN), we seek a low-dimensional core tensor G of smaller size (R1, …, RN).
  • Both the number of parameters and the time complexity of a TCL are smaller than those of a fully-connected layer. (A detailed comparison of complexity and parameter counts is in the paper.)
  • To avoid vanishing or exploding gradients, and to make the TCL more robust to changes in the initialization of the factors, we added a batch normalization layer [8] before and after the TCL.

  • Future work
    • we plan to extend our work to more network architectures, especially in settings where raw data or learned representations exhibit natural multi-modal structure that we might capture via high-order tensors.

    • We also endeavor to advance our experimental study of TCLs for large-scale, high-resolution vision datasets.

    • Plan to integrate new extended BLAS primitives that avoid the transpositions needed to compute the tensor contractions.
    • we will look into methods to induce and exploit sparsity in the TCL, to understand the parameter reductions this method can yield over existing state-of-the-art pruning methods.

    • we are working on an extension to the TCL: a tensor regression layer to replace both the fully-connected and final output layers, potentially yielding increased accuracy with even greater parameter reductions.
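The tensor contraction described above (contracting each mode of the activation tensor X with a learned factor matrix to obtain a smaller core G) can be sketched with numpy's tensordot. All sizes here are made up for illustration; in the actual TCL the factor matrices are learned by back-propagation:

```python
import numpy as np

# Hypothetical sizes: activation tensor X of shape (D1, D2, D3),
# contracted to a core G of shape (R1, R2, R3) via factor matrices V_k.
D, R = (8, 8, 4), (3, 3, 2)
rng = np.random.default_rng(0)
X = rng.standard_normal(D)
V = [rng.standard_normal((R[k], D[k])) for k in range(3)]

def mode_n_product(T, M, n):
    """Contract mode n of tensor T with the rows of matrix M (TTM operation)."""
    out = np.tensordot(T, M, axes=(n, 1))   # contracted mode moves to the end
    return np.moveaxis(out, -1, n)          # put it back in position n

G = X
for k in range(3):
    G = mode_n_product(G, V[k], k)

print(G.shape)  # (3, 3, 2)
```

The parameter count of this layer is just the sum of the factor-matrix sizes, which is what makes the TCL cheaper than a dense layer over the flattened tensor.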

Other Knowledge:

  • In CNNs such as AlexNet and VGG, fully-connected layers hold over 80% of the parameters.

Useful reference:


  • AlexNet
  • VGG


  • CIFAR100
  • ImageNet

Review of Numerical Methods


  •  Vectors:
    • plus, minus, scalar multiplication
    • magnitude and direction
    • inner product and angle
      • Cauchy-Schwarz inequality
    • parallel and orthogonal
    • projection
    • cross product
      • orthogonal to both the two input vectors
      • right-hand rule
      • magnitude equals the area of the parallelogram spanned by the two vectors, or twice the area of the triangle
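The cross-product facts above, checked numerically (the two vectors are arbitrary example values):

```python
import numpy as np

u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 2.0, 0.0])

w = np.cross(u, v)                       # orthogonal to both u and v
assert np.dot(w, u) == 0 and np.dot(w, v) == 0

area_parallelogram = np.linalg.norm(w)   # |u x v| = 2.0
area_triangle = area_parallelogram / 2   # 1.0
```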
  • Linear equations
    • Lines in two-dimensions
      • Parameterization: define a line: basepoint, direction vector
      • Direction vector for Ax+By=k is [B, -A]. A normal vector is [A, B].
      • Two lines are parallel if their normal vectors are parallel vectors.
      • If two lines are not parallel, then they have a unique intersection.
      • If two lines are parallel, they either do not intersect at all or are the same line with infinitely many intersections.
      • Two parallel lines are equal <=> the vector connecting one point on each line is orthogonal to the lines’ normal vectors.
      • For two non-parallel lines Ax+By=k1, Cx+Dy=k2: at most one of A and C is zero, and AD-BC != 0.
        • The intersection is x=(Dk1-Bk2)/(AD-BC); y=(-Ck1+Ak2)/(AD-BC).
      • Using normal vectors generalizes better to high dimensions.
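The intersection formula above as a small function (coefficient names follow the bullet; the example system is made up):

```python
# Intersection of Ax + By = k1 and Cx + Dy = k2.
def intersect(A, B, k1, C, D, k2):
    det = A * D - B * C
    if det == 0:
        return None  # lines are parallel (possibly identical)
    x = (D * k1 - B * k2) / det
    y = (-C * k1 + A * k2) / det
    return x, y

print(intersect(1, 1, 3, 1, -1, 1))  # x+y=3, x-y=1  ->  (2.0, 1.0)
```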
    • Planes in three dimensions
      • Ax+By+Cz=k,
      • Normal vector: [A,B,C]
      • Two planes are equal <=> the vector connecting one point on each plane is orthogonal to the planes’ normal vectors.
      • Given Ax+By+Cz=k1, Dx+Ey+Fz=k2, possible solutions sets are:
        • a line with direction vector [A,B,C] x [D,E,F], if planes are not parallel;
        • no solutions, if planes are parallel but not equal;
        • a plane, if the planes are the same.
      • With three or more planes, the intersection can be a single point.
      • We need at least two lines in two variables to obtain a unique intersection; We need at least three planes in three variables to obtain a unique intersection.
      • Rules for manipulating equations
        • Should preserve the solution, should be reversible
        • swap order of equations
        • multiply an equation by a nonzero number
        • Add a multiple of an equation to another
      • A system is inconsistent <=> we find 0=k for k nonzero during Gaussian elimination.
      • It’s not enough to count the number of equations (usually) or to look for 0=0 to determine whether there are infinitely many solutions.
      • A consistent system has a unique solution <=> each variable is a pivot variable.
      • #Free variables = dimension of solution set
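The elimination rules and the three-way classification above (inconsistent / unique / infinitely many) can be sketched as a small Gaussian-elimination routine. This is a simplified sketch, not a production solver; the tolerance handling is a choice of this example:

```python
import numpy as np

def classify(aug, tol=1e-12):
    """Classify a linear system given its augmented matrix [A | b]:
    returns 'inconsistent', 'unique', or 'infinite'."""
    A = np.array(aug, dtype=float)
    rows, cols = A.shape
    n_vars = cols - 1
    pivots = 0
    for col in range(n_vars):
        # find a pivot at or below the current pivot row (rule: swap rows)
        candidates = np.where(np.abs(A[pivots:, col]) > tol)[0]
        if candidates.size == 0:
            continue                      # this column gives a free variable
        r = pivots + candidates[0]
        A[[pivots, r]] = A[[r, pivots]]   # swap
        A[pivots] /= A[pivots, col]       # scale pivot to 1 (nonzero multiply)
        for i in range(rows):
            if i != pivots:
                A[i] -= A[i, col] * A[pivots]   # add multiple of a row
        pivots += 1
    # a row 0 = k with k nonzero means the system is inconsistent
    for row in A:
        if np.all(np.abs(row[:-1]) <= tol) and abs(row[-1]) > tol:
            return "inconsistent"
    # consistent: unique iff every variable is a pivot variable
    return "unique" if pivots == n_vars else "infinite"

print(classify([[1, 1, 3], [1, -1, 1]]))   # unique
print(classify([[1, 1, 3], [2, 2, 6]]))    # infinite
print(classify([[1, 1, 3], [1, 1, 4]]))    # inconsistent
```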

Study of Deep Learning


  • Video:


  • Softmax layer
    • the output layer
    • output a probability for each class
  • forward evaluation
  • backward propagation
    • Update weights
    • E.g. Gradient descent
  • ground truth
  • FFN: Feed Forward Neural Net
  • Set initial weights
    • Auto encoder
  • Data representation
    •  features
      • categorical features
        • no intrinsic ordering
        • require additional encoding, usually one-hot encoding (illustration)
      • ordinal features
    • Pre-process dataset:
      • min-max normalization
        • when know min and max
        • Learn faster
        • prevent numerical error
      • Standardization
        • when don’t know min and max
    • Overfitting
      • Avoid:
        • Dropout: only for deep learning
        • regularization
  • Network structure
    • Depth: #Hidden layers
      • width and #parameters determine the depth
    • Width: the dimension of each layer
      • usually < 1000; at most a few hundred neurons per hidden layer
    • Connectivity: how neurons are connected among each other
    • #Parameters: determined by the above three factors.
      • Too many will overfit.
      • “Sample/parameter” ratio: usually between 5 to 30.
    • Shape: “tower” vs “pyramid” shape
      • Usually “pyramid” shape
      • Deeper is better.
      • Thin-tall is better than fat-short.
  • Activation function
    • Like a switch
    • Usually non-linear functions
    • E.g.
      • sigmoid, ranging from 0 to 1
        • Deep network: vanishing gradient
        • Used in recurrent NNs (RNN, LSTM), not in feed-forward nets
      • ReLU, outputting max(0, x), i.e. ranging from 0 to +∞
        • Avoids vanishing gradient
        • Most commonly used
        • Used in feed-forward NN
      • tanh, ranging from -1 to 1
        • Commonly used when features take negative values
        • In NLP
  • Loss (or cost) function
    • cross-entropy
      • More suitable for predicting categorical labels
    • squared error
      • More suitable for predicting continuous values
    • Why
      • Compare the surface of different loss functions
      • The difference of “steepness”
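The softmax output and cross-entropy loss covered above can be sketched as follows (logits and the ground-truth label are arbitrary example values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()             # one probability per class

def cross_entropy(p, y):
    return -np.log(p[y])           # y is the ground-truth class index

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)                # probabilities sum to 1
loss = cross_entropy(p, 0)         # small when class 0 gets high probability
```

In training, the backward pass propagates the gradient of this loss to update the weights, e.g. via gradient descent.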

“Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory” — Shaden Smith et al. 2017

Paper: TODO



  • maintains load balance and low synchronization overhead
  • explores architectural features, e.g. vectorization, synchronization (mutexes, compare-and-swap, transactional memory, privatization), and managing high-bandwidth memory (MCDRAM).
  • Platform: One KNL processor
  • Speedup: 1.8x speedup over a dual socket Intel Xeon 44-core system.


Other Knowledge:

  • HPC systems are increasingly used for data intensive computations which exhibit irregular memory accesses, non-uniform work distributions, large memory footprints, and high memory bandwidth demands.
  • sparse, unstructured tensors
  • Challenges of optimizing algorithms on many-core processors: exposing a high degree of parallelism, load balancing tens to hundreds of parallel threads, and effectively utilizing the high-bandwidth memory.

Useful reference:



“Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition” — Vadim Lebedev et al. 2015

Paper: [Link]

Code: N/A


  • Use CP decomposition with NLS (non-linear least squares) method
    • minimizes the L2-norm of the approximation residual (for a user-defined fixed R) using Gauss-Newton optimization.
  • Decompose the 4D kernel tensor
    • The convolution kernel itself constitutes a 4D tensor with dimensions corresponding to the two spatial dimensions, the input image maps, and the output image maps.


  • CP-decomposition approximates the convolution with a 4D kernel tensor by the sequence of four convolutions with small 2D kernel tensors. This decomposition is used to replace the original convolutional layer with a sequence of four convolutional layers with small kernels.
  • fine-tune the entire network on training data using back-propagation.
    • This discriminative fine-tuning works well, even when CP-decomposition has large approximation error.
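A quick parameter count for the four-convolution replacement described above. The sizes below are hypothetical; the rank-R CP sequence costs SR + dR + dR + RT parameters versus d²ST for the full d×d kernel over S input and T output maps:

```python
# Hypothetical sizes: d x d spatial kernel, S input maps, T output maps, rank R.
d, S, T, R = 3, 256, 256, 16

full_params = d * d * S * T                  # original 4D convolution kernel
cp_params = S * R + d * R + d * R + R * T    # sequence of four small convolutions

print(full_params, cp_params)                # ~71x fewer parameters here
```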

Other Knowledge:

  • On the theoretical side, these results confirm the intuition that modern CNNs are over-parameterized, i.e. that the sheer number of parameters in modern CNNs is not needed to store the information about the classification task but, rather, serves to facilitate convergence to good local minima of the loss function.

Useful reference:

  • Suggested a scheme based on the CP-decomposition of parts of the kernel tensor obtained by biclustering (along with different decompositions for the first convolutional layer and the fully-connected layers). CP-decompositions of the kernel tensor parts were computed with a greedy approach. Only fine-tunes the layers above the approximated one.
    • Denton, Emily, Zaremba, Wojciech, Bruna, Joan, LeCun, Yann, and Fergus, Rob. Exploiting linear structure within convolutional networks for efficient evaluation. arXiv preprint arXiv:1404.0736, 2014.
  • Effectively approximate the 4D kernel tensor as a composition (product) of two 3D tensors, perform “local” fine-tuning that minimizes the deviation between the full and the approximated convolutions outputs on the training data.
    • Jaderberg, Max, Vedaldi, Andrea, and Zisserman, Andrew. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference (BMVC), 2014a.
  • There is no finite algorithm for determining the canonical rank of a tensor. [Paper]


  • ImageNet

“Tensorizing Neural Networks” — Alexander Novikov et al. 2015

Paper: [Link]

TensorNet Code: [Link]


  • Use Tensorization to build a tensor from a vector and a matrix
  • Optimize fully-connected layers


  • Convert the dense weight matrices of the fully-connected layers to the Tensor Train (TT) format. Use the TT-format for the TT-layer forward pass and the learning steps.
  • Potentially addresses the two issues of wide and shallow networks (see below for the issues).
  • Result: for the Very Deep VGG networks [21] we report the compression factor of the dense weight matrix of a fully-connected layer up to 200000 times leading to the compression factor of the whole network up to 7 times.

Other Knowledge:

  • These advances in deep neural networks have become possible because of algorithmic advances, large amounts of available data, and modern hardware.
  • State-of-the-art neural networks have reached hardware limits in terms of both computational power and memory. A large number of works try to reduce both hardware requirements (e.g. memory demands) and running times.
  • One of the most straightforward approaches is to use a low-rank representation of the weight matrices. Recent studies show that the weight matrix of the fully-connected layer is highly redundant and by restricting its matrix rank it is possible to greatly reduce the number of parameters without significant drop in the predictive accuracy.
  • Matrix and tensor decompositions were recently used to speed up the inference time of CNNs
    • Matrix: E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems 27 (NIPS), 2014, pp. 1269–1277.
    • Tensor: V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, “Speeding-up convolutional neural networks using fine-tuned CP-decomposition,” in International Conference on Learning Representations (ICLR), 2014.
  • TT-format is immune to the curse of dimensionality and its algorithms are robust.
  • For an arbitrary tensor A, a TT-representation exists but is not unique.
  • An attractive property of the TT-decomposition is the ability to efficiently perform several types of operations on tensors if they are in the TT-format:
    • basic linear algebra operations, such as the addition of a constant and the multiplication by a constant, the summation and the entrywise product of tensors (the results of these operations are tensors in the TT-format, generally with increased ranks);
    • computation of global characteristics of a tensor, such as the sum of all elements and the Frobenius norm.
  • Traditionally, very wide shallow networks are not considered because of high computational and memory demands and the over-fitting risk.
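The TT-format described above can be sketched by storing a tensor as a chain of 3-way cores G_k of shape (r_{k-1}, n_k, r_k), with boundary ranks r_0 = r_d = 1; an element is the product of the corresponding core slices. Mode sizes and ranks here are made up to show the compression:

```python
import numpy as np

# Hypothetical TT-representation of a 4-way tensor with mode sizes n
# and TT-ranks r (r[0] = r[-1] = 1).
n = [4, 5, 6, 7]
r = [1, 3, 3, 3, 1]
rng = np.random.default_rng(0)
cores = [rng.standard_normal((r[k], n[k], r[k + 1])) for k in range(4)]

def tt_element(cores, idx):
    """A(i1, ..., id) = G1[:, i1, :] @ G2[:, i2, :] @ ... @ Gd[:, id, :]."""
    m = cores[0][:, idx[0], :]
    for G, i in zip(cores[1:], idx[1:]):
        m = m @ G[:, i, :]
    return m[0, 0]                 # 1x1 matrix after the chain of products

tt_params = sum(c.size for c in cores)   # 12 + 45 + 54 + 21 = 132
full_params = int(np.prod(n))            # 4*5*6*7 = 840
```

The storage grows linearly in the number of modes (given bounded ranks), which is why the TT-format is said to be immune to the curse of dimensionality.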

Useful reference:

  • Tensor Train paper: I. V. Oseledets, “Tensor-Train decomposition,” SIAM J. Scientific Computing, vol. 33, no. 5, pp. 2295– 2317, 2011.
    • Application: A. Novikov, A. Rodomanov, A. Osokin, and D. Vetrov, “Putting MRFs on a Tensor Train,” in International Conference on Machine Learning (ICML), 2014, pp. 811–819.
  • Hierarchical Tucker paper: W. Hackbusch and S. Kuhn, “A new scheme for the tensor representation,” J. Fourier Anal. Appl., vol. 15, pp. 706–722, 2009.


  • MNIST: small, handwritten-digit recognition
    • Y. LeCun, C. Cortes, and C. J. C. Burges, “The MNIST database of handwritten digits,” 1998.
  • CIFAR-10: small, 50,000 train and 10,000 test 32*32 3-channel images, assigned to 10 different classes
    • A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, Computer Science Department, University of Toronto, 2009
  • ImageNet: large, 1000-class ImageNet ILSVRC-2012 dataset, 1.2 million training images and 50,000 validation images.
    • A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25 (NIPS), 2012, pp. 1097–1105.

Tensor Summary

Why use tensors:

  • natural data representation
  • get better compression when constructing a tensor from a vector or matrix, and efficient operations on the compressed tensor formats (e.g. canonical, Tucker, TT formats)


  • No library to support fast tensor operations and tensor decompositions [source]
  • Dimensionality curse (needs clarification)
    • Space
    • Running time

Tensor Decompositions:

  • CP Decomposition:
    • The decomposition of tensor T is unique (up to scaling and permutation) if none of the vector pairs are co-linear.
    • Matrix decomposition (e.g. SVD) is not unique.
    • Algorithm: CP-ALS, CP-APR
  • Tucker Decomposition:
  • tensor power method:
  • Tensor Train: [Paper]
  • Hierarchical Tucker: [Paper]
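A rank-R CP model as in the list above expresses a tensor as a sum of R rank-one outer products; sketched with einsum, with arbitrary sizes:

```python
import numpy as np

# Rank-R CP model of a 3-way tensor: T = sum_r a_r (outer) b_r (outer) c_r,
# with factor matrices A, B, C holding the vectors as columns.
R = 2
rng = np.random.default_rng(0)
A = rng.standard_normal((4, R))
B = rng.standard_normal((5, R))
C = rng.standard_normal((6, R))

T = np.einsum('ir,jr,kr->ijk', A, B, C)   # sum of R rank-one terms
```

Algorithms such as CP-ALS fit A, B, C to a given tensor by alternating least squares.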

Tensor Decomposition Applications:

  • Healthcare:
  • Deep Learning:
  • Machine Learning:
    • design learning algorithms for estimating the parameters of latent variable models such as Hidden Markov Models, Mixtures of Gaussians, Latent Dirichlet Allocation, community models, probabilistic context-free grammars, and two-layer neural networks. [source]
    • Tensor methods are very competitive for unsupervised learning of large-scale probabilistic latent variable models, as opposed to traditional methods such as expectation maximization (EM) or Markov chain Monte Carlo (MCMC). The main gain is in terms of computation: (i) tensor methods are embarrassingly parallel and scalable to  large problems, (ii) they can build on efficient linear algebraic libraries, but are much more powerful and informative compared to matrix methods. On the other hand, tensor methods are not sample efficient, meaning they require more samples than EM to reach the same level of accuracy (assuming computation is not an issue). Improving statistical efficiency of spectral methods is an ongoing research topic. [source]

  • Data compression

Build tensors:

  • Build tensors from an algorithm’s properties, then apply tensor decomposition
  • Build tensors from the application’s nature, then apply tensor approximation
  • Build tensors from vectors or matrices, then apply tensor approximation for data compression

Tensor Researchers

Machine Learning:

  • Animashree Anandkumar, UC Irvine
  • Jimeng Sun, GaTech
  • Joyce Ho, Emory
  • Nicholas D. Sidiropoulos, UMN
  • Christos Faloutsos, CMU
  • Lieven De Lathauwer, KU Leuven
  • Evangelos E. Papalexakis, UC Riverside


  • Tamara Kolda, Sandia NL


  • George Karypis, UMN
  • Bora Uçar, Inria and LIP