Review of Numerical Methods


  •  Vectors:
    • plus, minus, scalar multiplication
    • magnitude and direction
    • inner product and angle
      • Cauchy-Schwarz inequality
    • parallel and orthogonal
    • projection
    • cross product
      • orthogonal to both the two input vectors
      • right-hand rule
      • magnitude is the area size of the parallelogram, or twice of the area size of the triangle
  • Linear equations
    • Lines in two-dimensions
      • Parameterization: define a line: basepoint, direction vector
      • Direction vector for Ax+By=k is [B, -A]. A normal vector is [A, B].
      • Two lines are parallel if their normal vectors are parallel vectors.
      • If two lines are not parallel, then they have a unique intersection.
      • If two lines are parallel, they may not have intersection at all or the same line with infinitely many intersections.
      • Two parallel lines are equal, <=> the vector connecting one point on each line is orthogonal to the lines’ normal vectors.
      • If two non-parallel lines Ax+By=k1, Cx+Dy=k2; then A and C have one zero at most, AD-BC!=0.
        • The intersection is x=(Dk1-Bk2)/(AD-BC); y=(-Ck1+Ak2)/(AD-BC).
      • Use normal vectors is better for high dimensions.
    • Planes in three dimensions
      • Ax+By+Cz=k,
      • Normal vector: [A,B,C]
      • If two planes are equal <=> the vector connecting one point on each plane is orthogonal to the planes’ normal vectors.
      • Given Ax+By+Cz=k1, Dx+Ey+Fz=k2, possible solutions sets are:
        • a line with direction vector [A,B,C] x [D,E,F], if planes are not parallel;
        • no solutions, if planes are parallel but not equal;
        • a plane, if the planes are the same.
      • More planes could intersect in a single point.
      • We need at least two lines in two variables to obtain a unique intersection; We need at least three planes in three variables to obtain a unique intersection.
      • Rules for manipulating equations
        • Should preserve the solution, should be reversible
        • swap order of equations
        • multiply an equation by a nonzero number
        • Add a multiple of an equation to another
      • A system is inconsistent <=> we find 0=k for k nonzero during Gaussian elimination.
      • It’s not enough to count # of equations (usually) or look for 0=0 to determine if infinitely many solutions.
      • A consistent system has a unique solution <=> each variable is a pivot variable.
      • #Free variables = dimension of solution set

Study of Deep Learning


  • Video:


  • Softmax layer
    • the output layer
    • output a probability for each class
  • forward evaluation
  • backward propagation
    • Update weights
    • E.g. Gradient descent
  • ground truth
  • FFN: Feed Forward Neural Net
  • Set initial weights
    • Auto encoder
  • Data representation
    •  features
      • categorical features
        • no intrinsic ordering
        • require additional encoding, usually one-hot encoding (illustration)
      • ordinal features
    • Pre-process dataset:
      • min-max normalization
        • when know min and max
        • Learn faster
        • prevent numerical error
      • Standardization
        • when don’t know min and max
    • Overfitting
      • Avoid:
        • Dropout: only for deep learning
        • regularization
  • Network structure
    • Depth: #Hidden layers
      • width and #parameters determine the depth
    • Width: the dimension of each layer
      • < 1000 usually, max few hundreds neurons per hidden layers
    • Connectivity: how neurons are connected among each other
    • #Parameters: determined by the above three factors.
      • Too many will overfit.
      • “Sample/parameter” ratio: usually between 5 to 30.
    • Shape: “tower” vs “pyramid” shape
      • Usually “pyramid” shape
      • Deeper is better.
      • Thin-tall is better than fat-short.
  • Activation function
    • Like a switch
    • Usually non-linear functions
    • E.g.
      • sigmoid, ranging from 0 to 1
        • Deep network: vanishing gradient
        • Used in Recurrent NN (RNN), RSTM, not in feed forward
      • ReLU, ranging from 0 to x
        • Avoid vanishing gradient
        • Mostly commonly-used
        • Used in feed-forward NN
      • tanh, ranging from -1 to 1
        • Commonly-used when the features range from negatives
        • In NLP
  • Loss (or cost) function
    • cross-entropy
      • More suitable for predicting categorical labels
    • squared error
      • More suitable for predicting continuous values
    • Why
      • Compare the surface of different loss functions
      • The difference of “steepness”

“Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory” — Shaden Smith et al. 2017

Paper: TODO



  • maintain load balance and low synchronization
  • explore of architectural features, e.g. vectorization, synchronization (mutexes, compare-and-swap, transactional memory, privatization), managing high-bandwidth memory (MCDRAM).
  • Platform: One KNL processor
  • Speedup: 1.8x speedup over a dual socket Intel Xeon 44-core system.


Other Knowledge:

  • HPC systems are increasingly used for data intensive computations which exhibit irregular memory accesses, non-uniform work distributions, large memory footprints, and high memory bandwidth demands.
  • sparse, unstructured tensors
  • Challenges of optimization algorithms on many-core processors:
    a high degree of parallelism, load balance tens to hundreds of parallel threads, and effectively utilize the high-bandwidth memory.

Useful reference:



Paper: [Link]

Code: N/A


  • Use CP decomposition with NLS (non-linear least squares) method
    • minimizes the L2-norm of the approximation residual (for a user-defined fixed R) using Gauss-Newton optimization.
  • Decompose the 4D kernel tensor
    • The convolution kernel itself constitutes a 4D tensor with dimensions corresponding to the two spatial dimensions, the input image maps, and the output image maps.


  • CP-decomposition approximates the convolution with a 4D kernel tensor by the sequence of four convolutions with small 2D kernel tensors. This decomposition is used to replace the original convolutional layer with a sequence of four convolutional layers with small kernels.
  • fine-tune the entire network on training data using back-propagation.
    • This discriminative fine-tuning works well, even when CP-decomposition has large approximation error.

Other Knowledge:

  • On the theoretical side, these results confirm the intuition that modern CNNs are over-parameterized, i.e. that the sheer number of parameters in the modern CNNs are not needed to store the information about the classification task but, rather, serve to facilitate convergence to good local minima of the loss function

Useful reference:

  • Suggested a scheme based on the CP-decomposition of parts of the kernel tensor obtained by biclustering (alongside with a different decompositions for the first convolutional layer and the fully-connected layers). CP-decompositions of the kernel tensor parts have been computed with the greedy approach. Only fine-tunes the layers above the approximated one.
    • Denton, Emily, Zaremba, Wojciech, Bruna, Joan, LeCun, Yann, and Fergus, Rob. Exploiting linear structure within convolutional networks for efficient evaluation. arXiv preprint arXiv:1404.0736, 2014.
  • Effectively approximate the 4D kernel tensor as a composition (product) of two 3D tensors, perform “local” fine-tuning that minimizes the deviation between the full and the approximated convolutions outputs on the training data.
    • Jaderberg, Max, Vedaldi, Andrea, and Zisserman, Andrew. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference (BMVC), 2014a.
  • there is no finite algorithm for determining canonical rank of a tensor. [Paper]


  • ImageNet

“Tensorizing Neural Networks” — Alexander Novikov et al. 2015

Paper: [Link]

TensorNet Code: [Link]


  • Use Tensorization to build a tensor from a vector and a matrix
  • Optimize fully-connected layers


  • Convert the dense weight matrices of the fully-connected layers to the Tensor Train format. Use TT-format to do TT-layer and learning steps.
  • Potentially address the two issues of wide and shallow network. (see below for the issues)
  • Result: for the Very Deep VGG networks [21] we report the compression factor of the dense weight matrix of a fully-connected layer up to 200000 times leading to the compression factor of the whole network up to 7 times.

Other Knowledge:

  • These advances of Deep neural networks have become possible because of algorithmic advances, large amounts of available data, and modern hardware.
  • State-of-the-art neural networks reached the hardware limits both in terms the computational power and the memory. A large number of works tried to reduce both hardware requirements (e. g. memory demands) and running times.
  • One of the most straightforward approaches is to use a low-rank representation of the weight matrices. Recent studies show that the weight matrix of the fully-connected layer is highly redundant and by restricting its matrix rank it is possible to greatly reduce the number of parameters without significant drop in the predictive accuracy.
  • Matrix and tensor decompositions were recently used to speed up the inference time of CNNs
    • Matrix: E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems 27 (NIPS), 2014, pp. 1269–1277.
    • Tensor: V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, “Speeding-up convolutional neural networks using fine-tuned CP-decomposition,” in International Conference on Learning Representations (ICLR), 2014.
  • TT-format is immune to the curse of dimensionality and its algorithms are robust.
  • An arbitrary tensor A a TT-representation exists but is not unique.
  • An attractive property of the TT-decomposition is the ability to efficiently perform several types of operations on tensors if they are in the TT-format:
    • basic linear algebra operations, such as the addition of a constant and the multiplication by a constant, the summation and the entrywise product of tensors (the results of these operations are tensors in the TT-format generally with the increased ranks); computation of global characteristics of a tensor, such as the sum of all elements and the Frobenius norm.
  • Traditionally, very wide shallow networks are not considered because of high computational and memory demands and the over-fitting risk.

Useful reference:

  • Tensor Train paper: I. V. Oseledets, “Tensor-Train decomposition,” SIAM J. Scientific Computing, vol. 33, no. 5, pp. 2295– 2317, 2011.
    • Application: A. Novikov, A. Rodomanov, A. Osokin, and D. Vetrov, “Putting MRFs on a Tensor Train,” in International Conference on Machine Learning (ICML), 2014, pp. 811–819.
  • Hierarchical Tucker paper: W. Hackbusch and S. Kuhn, “A new scheme for the tensor representation,” J. Fourier Anal. Appl., vol. 15, pp. 706–722, 2009.


  • MNIST: small, handwritten-digit recognition
    • Y. LeCun, C. Cortes, and C. J. C. Burges, “The MNIST database of handwritten digits,” 1998.
  • CIFAR-10: small, 50,000 train and 10,000 test 32*32 3-channel images, assigned to 10 different classes
    • A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, Computer Science Department, University of Toronto, 2009
  • ImageNet: large, 1000-class ImageNet ILSVRC-2012 dataset, 1.2 million training images and 50,000 validation images.
    • A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25 (NIPS), 2012, pp. 1097–1105.

Tensor Summary

Why use tensors:

  • natural data representation
  • get better compression when constructing a tensor from a vector or matrix, and efficient operation on the compressed tensor formats (e.g. canonical, Tucker, TT formats)


  • No library to support fast tensor operations and tensor decompositions [source]
  • Dimensionality curse (Need to be more clear)
    • Space
    • Running time

Tensor Decompositions:

  • CP Decomposition:
    • The decomposition of tensor T is unique (up to scaling and permutation) if none of the vector pairs are co-linear.
    • Matrix decomposition (e.g. SVD) is not unique.
    • Algorithm: CP-ALS, CP-APR
  • Tucker Decomposition:
  • tensor power method:
  • Tensor Train: [Paper]
  • Hierarchical Tucker: [Paper]

Tensor Decomposition Applications:

  • Healthcare:
  • Deep Learning:
  • Machine Learning:
    • design learning algorithms for estimating parameters of latent variable models like Hidden Markov Model, Mixture of Gaussians and Latent Dirichlet Allocation, community models, probabilistic Context-Free-Grammars, and two-layer neural networks. [source]
    • Tensor methods are very competitive for unsupervised learning of large-scale probabilistic latent variable models, as opposed to traditional methods such as expectation maximization (EM) or Markov chain Monte Carlo (MCMC). The main gain is in terms of computation: (i) tensor methods are embarrassingly parallel and scalable to  large problems, (ii) they can build on efficient linear algebraic libraries, but are much more powerful and informative compared to matrix methods. On the other hand, tensor methods are not sample efficient, meaning they require more samples than EM to reach the same level of accuracy (assuming computation is not an issue). Improving statistical efficiency of spectral methods is an ongoing research topic. [source]

  • Data compression

Build tensors:

  • Build tensors from algorithm property, then do tensor decomposition
  • Build tensors from applications nature, then do tensor approximation
  • Build tensors from vectors or matrices, then do tensor approximation for data compression

Tensor Researchers

Machine Learning:

  • Animashree Anandkumar, UC Irvine
  • Jimeng Sun, GaTech
  • Joyce Ho, Emory
  • Nicholas D. Sidiropoulos, UMN
  • Christos Faloutsos, CMU
  • Lieven De Lathauwer, KU Leuven
  • Evangelos E. Papalexakis, UC Riverside


  • Tamara Kolda, Sandia NL


  • George Karypis, UMN
  • Bora Uçar, Inria and LIP

“Marble: High-throughput Phenotyping from Electronic Health Records via Sparse Nonnegative Tensor Factorization” — Joyce Ho et al. 2014

Paper: link

Slide: link


  • sparse tensor
  • sparse factor matrices
  • Poisson regression, based on CP-APR
  • For count data

Phenotyping application background:

  • A major limitation of existing phenotype efforts is the need for human annotation of case and control samples, which require substantial time, effort, and expert knowledge to develop.
  • phenotyping can be viewed as a form of dimensionality reduction, where each phenotype forms a latent space
    • citation: G. Hripcsak and D. J. Albers. Next-generation phenotyping of electronic health records. JAMIA, 20(1):117–121, Dec. 2012.

Tensor factorization vs matrix factorization:

  • Matrix factorization, a common dimensionality reduction approach, is insufficient as it cannot concisely capture structured EHR source interactions, such as multiple medications prescribed to treat a single disease


  • Constraints on the factor matrices to minimize the number of non-zero elements
  • augmentation of the tensor approximation
    • Marble decomposes an observed tensor into two terms, a bias (or offset) tensor and an interaction (or signal) tensor (similar to CP-APR factorized tensor).
    • The bias tensor represents the baseline characteristics common amongst the overall population and also provides computational stability. The interaction term is compromised of concise, intuitive, and interpretable phenotypes in the data.
  • Marble achieves at least a 42.8% reduction in the number of non-zero elements compared to CP-APR without sacrificing the quality of the tensor decomposition.

CP Applications:

  • concept discovery: U. Kang, E. Papalexakis, A. Harpale, and C. Faloutsos. Gigatensor: Scaling tensor analysis up by 100 times-algorithms and discoveries. In KDD 2012, pages 316–324, 2012.
  • network analysis of fMRI data: I. Davidson, S. Gilpin, O. Carmichael, and P. Walker. Network discovery via constrained tensor analysis of fMRI data. In KDD 2013, Aug. 2013.
  • community discovery: Y.-R. Lin, J. Sun, H. Sundaram, A. Kelliher, P. Castro, and R. Konuru. Community discovery via metagraph factorization. ACM Transactions on Knowledge Discovery from Data, 5(3), Aug. 2011

Sparsity constrained factor matrices for sparse tensor decomposition:

  • Traditional sparsity-inducing penalties such as ℓ1 and ℓ2 regularization only deal with the standard least-squares minimization.
    • D. Wang and S. Kong. Feature selection from high-order tensorial data via sparse decomposition. Pattern Recognition Letters, 33(13):1695–1702, 2012.
  • Non-parametric Bayesian approaches to sparse Tucker decomposition
    • Z. Xu, F. Yan, Yuan, and Qi. Infinite Tucker Decomposition: Nonparametric Bayesian Models for Multiway Data Analysis. In ICML 2012, pages 1023–1030. Alan, 2012.
  • A multi-layer NTF has been proposed to achieve sparse representations for various cost functions including KL divergence using a nonlinearly transformed gradient decent approach
    • Proposed sHOPCA, based on HOOI algorithm
    • A. Cichocki, R. Zdunek, S. Choi, R. Plemmons, and S.-I. Amari. Novel multi-layer non-negative tensor factorization with sparsity constraints. In ICANNGA 2007, pages 271–280. Springer, 2007.

Useful reference:

  • CP-APR: E. C. Chi and T. G. Kolda. On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications, 33(4):1272–1299, 2012.
  • Survey: A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari. Nonnegative matrix and tensor factorizations: Applications to exploratory multi-way data analysis and blind source separation. Wiley, 2009.


  • CMS data: the Centers for Medicare and Medicaid Services (CMS) provides the CMS Linkable 2008-2010 Medicare Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF), a public available dataset.
    • 10,000 pateints, 129 diagnoses, 115 procedures