“SPEEDING-UP CONVOLUTIONAL NEURAL NETWORKS USING FINE-TUNED CP-DECOMPOSITION” — Vadim Lebedev et al. 2015

Paper: [Link]

Code: N/A

Features:

  • Use CP decomposition computed with the NLS (non-linear least squares) method
    • minimizes the L2-norm of the approximation residual (for a user-defined fixed R) using Gauss-Newton optimization.
  • Decompose the 4D kernel tensor
    • The convolution kernel itself constitutes a 4D tensor with dimensions corresponding to the two spatial dimensions, the input image maps, and the output image maps.

Findings:

  • CP-decomposition approximates the convolution with a 4D kernel tensor by a sequence of four convolutions with small 2D kernel tensors. This decomposition is used to replace the original convolutional layer with a sequence of four convolutional layers with small kernels (see the sketch after this list).
  • fine-tune the entire network on training data using back-propagation.
    • This discriminative fine-tuning works well, even when CP-decomposition has large approximation error.
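
A minimal sketch (mine, not the authors' code) of what the four-convolution replacement looks like once the rank-R CP factors of a d*d kernel with S input and T output channels are known. It assumes a PyTorch-style API; the function name cp_decomposed_conv and its arguments are illustrative, and the actual factor values would come from the NLS solver and be copied into these layers before fine-tuning.

```python
import torch
import torch.nn as nn

def cp_decomposed_conv(S, T, d, R, stride=1, padding=0):
    return nn.Sequential(
        # 1x1 conv: project S input channels onto the R CP components
        nn.Conv2d(S, R, kernel_size=1, bias=False),
        # d x 1 conv applied to each component separately (groups=R)
        nn.Conv2d(R, R, kernel_size=(d, 1), stride=(stride, 1),
                  padding=(padding, 0), groups=R, bias=False),
        # 1 x d conv, again component-wise
        nn.Conv2d(R, R, kernel_size=(1, d), stride=(1, stride),
                  padding=(0, padding), groups=R, bias=False),
        # 1x1 conv: recombine the R components into T output channels
        nn.Conv2d(R, T, kernel_size=1, bias=True),
    )

# Example: replace a 64 -> 128 channel 5x5 convolution with a rank-16 version
layer = cp_decomposed_conv(S=64, T=128, d=5, R=16, padding=2)
y = layer(torch.randn(1, 64, 32, 32))   # -> shape (1, 128, 32, 32)
```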

Other Knowledge:

  • On the theoretical side, these results confirm the intuition that modern CNNs are over-parameterized, i.e. that the sheer number of parameters in modern CNNs is not needed to store the information about the classification task but rather serves to facilitate convergence to good local minima of the loss function

Useful reference:

  • Suggested a scheme based on the CP-decomposition of parts of the kernel tensor obtained by biclustering (alongside different decompositions for the first convolutional layer and the fully-connected layers). CP-decompositions of the kernel tensor parts were computed with a greedy approach. Only fine-tunes the layers above the approximated one.
    • Denton, Emily, Zaremba, Wojciech, Bruna, Joan, LeCun, Yann, and Fergus, Rob. Exploiting linear structure within convolutional networks for efficient evaluation. arXiv preprint arXiv:1404.0736, 2014.
  • Effectively approximates the 4D kernel tensor as a composition (product) of two 3D tensors, and performs “local” fine-tuning that minimizes the deviation between the outputs of the full and the approximated convolutions on the training data.
    • Jaderberg, Max, Vedaldi, Andrea, and Zisserman, Andrew. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference (BMVC), 2014a.
  • there is no finite algorithm for determining the canonical rank of a tensor. [Paper]

Dataset:

  • ImageNet

“Tensorizing Neural Networks” — Alexander Novikov et al. 2015

Paper: [Link]

TensorNet Code: [Link]

Features:

  • Use tensorization to reshape vectors and matrices into higher-order tensors
  • Optimize fully-connected layers

Findings:

  • Convert the dense weight matrices of the fully-connected layers to the Tensor Train (TT) format, and run both the TT-layer forward pass and the learning steps directly in the TT-format (see the sketch after this list).
  • Potentially addresses the two issues of wide and shallow networks (see below for the issues).
  • Result: for the Very Deep VGG networks [21], the compression factor of the dense weight matrix of a fully-connected layer reaches up to 200,000, which translates to a compression factor of up to 7 for the whole network.
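
A minimal numpy sketch of the tensorization idea, under my own simplifications (not the TensorNet code): the dense weight matrix is reshaped into a high-order tensor with interleaved row/column indices and then compressed with the TT-SVD algorithm of Oseledets (2011). The real TT-matrix format keeps the row and column indices separate inside each 4-way core and learns the cores by back-propagation; here the two indices are merged into 3-way cores just to show the compression.

```python
import numpy as np

def tensorize(W, ms, ns):
    """Reshape a (prod(ms) x prod(ns)) matrix into a d-way tensor whose k-th
    mode has size ms[k] * ns[k] (row/column indices interleaved)."""
    d = len(ms)
    T = W.reshape(ms + ns)                                # (m1,...,md,n1,...,nd)
    perm = [ax for pair in zip(range(d), range(d, 2 * d)) for ax in pair]
    return T.transpose(perm).reshape([m * n for m, n in zip(ms, ns)])

def tt_svd(T, max_rank):
    """Compress a tensor into TT cores by successive truncated SVDs."""
    dims, cores, r_prev = T.shape, [], 1
    C = T
    for n_k in dims[:-1]:
        C = C.reshape(r_prev * n_k, -1)
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(U[:, :r].reshape(r_prev, n_k, r))
        C = s[:r, None] * Vt[:r]                          # remainder, carried on
        r_prev = r
    cores.append(C.reshape(r_prev, dims[-1], 1))
    return cores

# Example: a 256 x 4096 weight matrix tensorized as (4,4,4,4) x (8,8,8,8)
W = np.random.randn(256, 4096)
cores = tt_svd(tensorize(W, [4, 4, 4, 4], [8, 8, 8, 8]), max_rank=8)
print(sum(c.size for c in cores), "TT parameters vs", W.size, "dense ones")
```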

Other Knowledge:

  • These advances in deep neural networks have become possible because of algorithmic advances, large amounts of available data, and modern hardware.
  • State-of-the-art neural networks have reached the hardware limits in terms of both computational power and memory. A large number of works have tried to reduce both hardware requirements (e.g. memory demands) and running times.
  • One of the most straightforward approaches is to use a low-rank representation of the weight matrices. Recent studies show that the weight matrix of the fully-connected layer is highly redundant and by restricting its matrix rank it is possible to greatly reduce the number of parameters without significant drop in the predictive accuracy.
  • Matrix and tensor decompositions were recently used to speed up the inference time of CNNs
    • Matrix: E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems 27 (NIPS), 2014, pp. 1269–1277.
    • Tensor: V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, “Speeding-up convolutional neural networks using fine-tuned CP-decomposition,” in International Conference on Learning Representations (ICLR), 2015.
  • TT-format is immune to the curse of dimensionality and its algorithms are robust.
  • For an arbitrary tensor A, a TT-representation exists but is not unique.
  • An attractive property of the TT-decomposition is the ability to efficiently perform several types of operations on tensors if they are in the TT-format:
    • basic linear algebra operations, such as the addition of a constant, the multiplication by a constant, the summation, and the entrywise product of tensors (the results of these operations are tensors in the TT-format, generally with increased ranks; see the addition sketch after this list)
    • computation of global characteristics of a tensor, such as the sum of all elements and the Frobenius norm
  • Traditionally, very wide shallow networks are not considered because of high computational and memory demands and the over-fitting risk.
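
As a small illustration of how these operations stay inside the format (my own sketch, not from the paper), the sum of two TT tensors can be formed core-by-core, and every internal TT-rank of the result is the sum of the input ranks. The cores use the same (r_prev, n, r_next) layout as in the tt_svd sketch above.

```python
import numpy as np

def tt_add(cores_a, cores_b):
    """Sum of two TT tensors given as lists of (r_prev, n, r_next) cores.
    Middle cores are stacked block-diagonally, boundary cores are concatenated."""
    d, out = len(cores_a), []
    for k, (A, B) in enumerate(zip(cores_a, cores_b)):
        if k == 0:                                   # first core: 1 x n x (ra+rb)
            out.append(np.concatenate([A, B], axis=2))
        elif k == d - 1:                             # last core: (ra+rb) x n x 1
            out.append(np.concatenate([A, B], axis=0))
        else:                                        # middle cores: block-diagonal
            ra1, n, ra2 = A.shape
            rb1, _, rb2 = B.shape
            C = np.zeros((ra1 + rb1, n, ra2 + rb2))
            C[:ra1, :, :ra2] = A
            C[ra1:, :, ra2:] = B
            out.append(C)
    return out
```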

Useful reference:

  • Tensor Train paper: I. V. Oseledets, “Tensor-Train decomposition,” SIAM J. Scientific Computing, vol. 33, no. 5, pp. 2295– 2317, 2011.
    • Application: A. Novikov, A. Rodomanov, A. Osokin, and D. Vetrov, “Putting MRFs on a Tensor Train,” in International Conference on Machine Learning (ICML), 2014, pp. 811–819.
  • Hierarchical Tucker paper: W. Hackbusch and S. Kuhn, “A new scheme for the tensor representation,” J. Fourier Anal. Appl., vol. 15, pp. 706–722, 2009.

Dataset:

  • MNIST: small, handwritten-digit recognition
    • Y. LeCun, C. Cortes, and C. J. C. Burges, “The MNIST database of handwritten digits,” 1998.
  • CIFAR-10: small, 50,000 train and 10,000 test 32*32 3-channel images, assigned to 10 different classes
    • A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, Computer Science Department, University of Toronto, 2009
  • ImageNet: large, 1000-class ImageNet ILSVRC-2012 dataset, 1.2 million training images and 50,000 validation images.
    • A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25 (NIPS), 2012, pp. 1097–1105.

Tensor Summary

Why use tensors:

  • natural data representation
  • get better compression when constructing a tensor from a vector or matrix, and efficient operation on the compressed tensor formats (e.g. canonical, Tucker, TT formats)

Problems:

  • No library to support fast tensor operations and tensor decompositions [source]
  • Dimensionality curse: a d-way tensor with mode size n has n^d entries, so naive handling scales exponentially in d (see the worked example after this list)
    • Space: storing the full tensor requires n^d values
    • Running time: algorithms that touch every entry also cost at least n^d operations
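
A quick worked example of the space side (the numbers are illustrative): the full tensor needs n^d entries, while a compressed format such as TT needs only about d·n·r² entries for TT-rank r.

```python
n, d, r = 10, 8, 5                 # illustrative mode size, order, and TT-rank
full_entries = n ** d              # 10^8 = 100,000,000 values in the dense tensor
tt_entries = d * n * r * r         # roughly 2,000 values across the TT cores
print(full_entries, tt_entries)
```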

Tensor Decompositions:

  • CP Decomposition:
    • The decomposition of tensor T is unique (up to scaling and permutation) if none of the vector pairs are co-linear.
    • Matrix decomposition (e.g. SVD) is not unique.
    • Algorithms: CP-ALS, CP-APR (see the CP-ALS sketch after this list)
  • Tucker Decomposition:
  • tensor power method:
  • Tensor Train: [Paper]
  • Hierarchical Tucker: [Paper]
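
A minimal numpy sketch of CP-ALS for a 3-way tensor, written here for illustration (a real workload would use a library such as TensorLy or the Tensor Toolbox); the unfolding convention and the pseudo-inverse-based update are one standard way to write the alternating least-squares step.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n matricization of a 3-way tensor (C-ordered columns)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker product of two factor matrices."""
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def cp_als(T, R, n_iter=100, seed=0):
    """Alternating least squares for a rank-R CP model T ~ sum_r a_r o b_r o c_r."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((s, R)) for s in T.shape)
    for _ in range(n_iter):
        A = unfold(T, 0) @ np.linalg.pinv(khatri_rao(B, C)).T
        B = unfold(T, 1) @ np.linalg.pinv(khatri_rao(A, C)).T
        C = unfold(T, 2) @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C

# Example: a synthetic rank-3 tensor is recovered up to scaling/permutation;
# the relative residual should be near zero.
rng = np.random.default_rng(1)
T = np.einsum('ir,jr,kr->ijk', *(rng.standard_normal((s, 3)) for s in (20, 15, 10)))
A, B, C = cp_als(T, R=3)
print(np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', A, B, C)) / np.linalg.norm(T))
```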

Tensor Decomposition Applications:

  • Healthcare:
  • Deep Learning:
  • Machine Learning:
    • design learning algorithms for estimating parameters of latent variable models such as Hidden Markov Models, Mixtures of Gaussians, Latent Dirichlet Allocation, community models, probabilistic Context-Free Grammars, and two-layer neural networks. [source]
    • Tensor methods are very competitive for unsupervised learning of large-scale probabilistic latent variable models, as opposed to traditional methods such as expectation maximization (EM) or Markov chain Monte Carlo (MCMC). The main gain is in terms of computation: (i) tensor methods are embarrassingly parallel and scalable to large problems, (ii) they can build on efficient linear algebraic libraries, but are much more powerful and informative compared to matrix methods. On the other hand, tensor methods are not sample efficient, meaning they require more samples than EM to reach the same level of accuracy (assuming computation is not an issue). Improving statistical efficiency of spectral methods is an ongoing research topic. [source]

  • Data compression

Build tensors:

  • Build tensors from algorithm property, then do tensor decomposition
  • Build tensors from applications nature, then do tensor approximation
  • Build tensors from vectors or matrices, then do tensor approximation for data compression

Tensor Researchers

Machine Learning:

  • Animashree Anandkumar, UC Irvine
  • Jimeng Sun, GaTech
  • Joyce Ho, Emory
  • Nicholas D. Sidiropoulos, UMN
  • Christos Faloutsos, CMU
  • Lieven De Lathauwer, KU Leuven
  • Evangelos E. Papalexakis, UC Riverside

Numerical:

  • Tamara Kolda, Sandia NL

HPC:

  • George Karypis, UMN
  • Bora Uçar, Inria and LIP

“Marble: High-throughput Phenotyping from Electronic Health Records via Sparse Nonnegative Tensor Factorization” — Joyce Ho et al. 2014

Paper: link

Slide: link

Features:

  • sparse tensor
  • sparse factor matrices
  • Poisson regression, based on CP-APR
  • For count data

Phenotyping application background:

  • A major limitation of existing phenotyping efforts is the need for human annotation of case and control samples, which requires substantial time, effort, and expert knowledge to develop.
  • phenotyping can be viewed as a form of dimensionality reduction, where each phenotype forms a latent space
    • citation: G. Hripcsak and D. J. Albers. Next-generation phenotyping of electronic health records. JAMIA, 20(1):117–121, Dec. 2012.

Tensor factorization vs matrix factorization:

  • Matrix factorization, a common dimensionality reduction approach, is insufficient as it cannot concisely capture structured EHR source interactions, such as multiple medications prescribed to treat a single disease

Findings:

  • Constraints on the factor matrices to minimize the number of non-zero elements
  • augmentation of the tensor approximation
    • Marble decomposes an observed tensor into two terms, a bias (or offset) tensor and an interaction (or signal) tensor (similar to CP-APR factorized tensor).
    • The bias tensor represents the baseline characteristics common amongst the overall population and also provides computational stability. The interaction term comprises concise, intuitive, and interpretable phenotypes in the data (see the model-form sketch after this list).
  • Marble achieves at least a 42.8% reduction in the number of non-zero elements compared to CP-APR without sacrificing the quality of the tensor decomposition.
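
A minimal numpy sketch of the model form just described, based on my reading of the paper rather than its code: the approximation is a rank-one bias (offset) tensor built from baseline per-mode distributions plus a rank-R nonnegative interaction (signal) tensor whose columns are the candidate phenotypes. The CP-APR-style Poisson fitting, the sparsity projection, and the exact weighting of the bias term are omitted; marble_model, alpha, and the random stand-in factors below are illustrative assumptions.

```python
import numpy as np

def marble_model(u_modes, lambdas, factors, alpha=0.1):
    """u_modes: per-mode baseline probability vectors (bias-tensor factors);
    lambdas:   weights of the R interaction components;
    factors:   list of nonnegative (mode_size x R) matrices, one per mode;
    alpha:     weight of the bias term (illustrative)."""
    bias = alpha * np.einsum('i,j,k->ijk', *u_modes)            # rank-one offset
    signal = np.einsum('r,ir,jr,kr->ijk', lambdas, *factors)    # rank-R phenotypes
    return bias + signal

# Random stand-ins: 100 patients (scaled down), 129 diagnoses, 115 procedures
# as in the dataset noted below.
shape, R = (100, 129, 115), 5
rng = np.random.default_rng(0)
u_modes = [rng.dirichlet(np.ones(s)) for s in shape]
factors = [rng.dirichlet(np.ones(s), size=R).T for s in shape]  # columns sum to 1
M_hat = marble_model(u_modes, rng.random(R), factors)           # Poisson-rate tensor
```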

CP Applications:

  • concept discovery: U. Kang, E. Papalexakis, A. Harpale, and C. Faloutsos. Gigatensor: Scaling tensor analysis up by 100 times-algorithms and discoveries. In KDD 2012, pages 316–324, 2012.
  • network analysis of fMRI data: I. Davidson, S. Gilpin, O. Carmichael, and P. Walker. Network discovery via constrained tensor analysis of fMRI data. In KDD 2013, Aug. 2013.
  • community discovery: Y.-R. Lin, J. Sun, H. Sundaram, A. Kelliher, P. Castro, and R. Konuru. Community discovery via metagraph factorization. ACM Transactions on Knowledge Discovery from Data, 5(3), Aug. 2011

Sparsity constrained factor matrices for sparse tensor decomposition:

  • Traditional sparsity-inducing penalties such as ℓ1 and ℓ2 regularization only deal with the standard least-squares minimization.
    • D. Wang and S. Kong. Feature selection from high-order tensorial data via sparse decomposition. Pattern Recognition Letters, 33(13):1695–1702, 2012.
  • Non-parametric Bayesian approaches to sparse Tucker decomposition
    • Z. Xu, F. Yan, and Y. Qi. Infinite Tucker decomposition: Nonparametric Bayesian models for multiway data analysis. In ICML 2012, pages 1023–1030, 2012.
  • A multi-layer NTF has been proposed to achieve sparse representations for various cost functions, including KL divergence, using a nonlinearly transformed gradient descent approach
    • Proposed sHOPCA, based on HOOI algorithm
    • A. Cichocki, R. Zdunek, S. Choi, R. Plemmons, and S.-I. Amari. Novel multi-layer non-negative tensor factorization with sparsity constraints. In ICANNGA 2007, pages 271–280. Springer, 2007.

Useful reference:

  • CP-APR: E. C. Chi and T. G. Kolda. On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications, 33(4):1272–1299, 2012.
  • Survey: A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari. Nonnegative matrix and tensor factorizations: Applications to exploratory multi-way data analysis and blind source separation. Wiley, 2009.

Dataset:

  • CMS data: the Centers for Medicare and Medicaid Services (CMS) provides the CMS Linkable 2008–2010 Medicare Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF), a publicly available dataset.
    • 10,000 patients, 129 diagnoses, 115 procedures

“Learning Spatiotemporal Features with 3D Convolutional Networks” — Du Tran et al. 2015

Paper: Learning Spatiotemporal Features with 3D Convolutional Networks

Website: C3D

Features:

  • 3D
  • Supervised learning
  • For video dataset

Findings:

  • 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets.
  • 3D ConvNet has the ability to model temporal information better owing to 3D convolution and 3D pooling operations.
  • A homogeneous architecture with small 3*3*3 convolution kernels in all layers is among the best-performing architectures for 3D ConvNets (see the sketch after this list).
    • Fix the spatial receptive field to 3*3 and vary only the temporal depth of the 3D convolution kernels.
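
A minimal PyTorch-style sketch of the homogeneous 3*3*3 design noted above (my illustration, not the released C3D weights or exact architecture):

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),    # 3*3*3 kernel over (frames, H, W)
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),           # first pool shrinks only spatially,
                                                   # preserving early temporal information
    nn.Conv3d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=2),                   # later pools shrink time as well
)

clip = torch.randn(1, 3, 16, 112, 112)             # (batch, channels, frames, H, W)
print(block(clip).shape)                           # -> torch.Size([1, 128, 8, 28, 28])
```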

Useful reference:

  • 3D ConvNets: S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. IEEE TPAMI, 35(1):221–231, 2013. 1, 2
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014. 1, 2, 3, 4, 5, 6
    • The “slow fusion model” uses 3D convolutions and average pooling in its first 3 convolution layers; it still loses all temporal information after the third convolution layer.

Dataset:

  • UCF101: medium-scale
    • K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. In CRCV-TR-12-01, 2012. 5, 7
  • Sports-1M: the largest video classification benchmark
    • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014. 1, 2, 3, 4, 5, 6
  • ASLAN: 3631 videos from 432 action classes, for action similarity labeling
  • YUPENN: 420 videos of 14 scene categories
    • K. Derpanis, M. Lecce, K. Daniilidis, and R. Wildes. Dynamic scene understanding: The role of orientation features in space and time in scene classification. In CVPR, 2012. 8
  • Maryland:
    • N. Shroff, P. K. Turaga, and R. Chellappa. Moving vistas: Exploiting motion for describing scenes. In CVPR, 2010. 8
  • egocentric: 42 types of everyday objects
    • X. Ren and M. Philipose. Egocentric recognition of handled objects: Benchmark and analysis. In Egocentric Vision workshop, 2009. 2, 8