“Accelerating the Tucker Decomposition with Compressed Sparse Tensors” — Shaden Smith et al., 2017

Paper: N/A

Code: Will be in SPLATT

Features:

  • sparse tensors
  • TTMc in CSF format on multi-core CPU architectures (no Tucker decomposition results shown; TTMc semantics are sketched after this list)
  • Achieves a 20.7x speedup while mostly using less memory (up to 28.5x less)
  • Fig. 7: good analysis for time and memory!
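
TTMc (tensor-times-matrix chain) multiplies the tensor by a factor matrix in every mode except the one being computed. A minimal dense sketch for a 3rd-order tensor, only to fix the semantics (numpy; the paper performs this on sparse CSF tensors, and the names here are illustrative):

```python
import numpy as np

# Mode-1 TTMc for a 3rd-order tensor X: Y = X x2 B^T x3 C^T, i.e.
#   Y[i, r, s] = sum_{j,k} X[i, j, k] * B[j, r] * C[k, s].
# Dense toy version only; the paper computes this on sparse CSF tensors.
I, J, K, R = 4, 5, 6, 3
X = np.random.rand(I, J, K)
B = np.random.rand(J, R)   # mode-2 factor matrix
C = np.random.rand(K, R)   # mode-3 factor matrix

Y = np.einsum('ijk,jr,ks->irs', X, B, C)   # shape (I, R, R)
# In Tucker/HOOI this result is matricized to Y(1) of shape (I, R*R).
Y1 = Y.reshape(I, R * R)
```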

Findings:

  • Similar to SPLATT, the time complexity decreases along the tree levels.
  • The intermediate memory is limited to a single row of Y(n).
  • Parallelized by distributing the I1 (first-mode) trees to threads.
  • Synchronization uses mutexes to handle write conflicts when updating any mode other than the first.
  • Using multiple CSF representations
  • Heuristically sort the mode order, assuming K CSF representations can be stored (sketched after this list):
    • For X1, sort the modes by their lengths, with the shortest mode placed at the top level.
    • The remaining K−1 representations are selected greedily: at step k, use Eq. (4) to examine the TTMc cost of each mode given X1, …, Xk−1. The mode with the highest cost is placed at the top level of Xk, and the remaining modes are sorted by increasing length.
    • At the end of this procedure, each mode has a choice of K representations for its TTMc computation. Each mode is assigned to the representation with the lowest cost, and that representation is used for its TTMc.
    • Importantly, if ties are broken in a consistent manner, then in practice several modes are often assigned to the same Xk, meaning that fewer than K representations need be kept in memory for computation.
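
A minimal sketch of this greedy selection as I read it (Python; `ttmc_cost` stands in for the paper's cost model, Eq. (4), which is not reproduced here):

```python
def choose_csf_orderings(mode_lengths, K, ttmc_cost):
    """Greedy choice of K CSF mode orderings (my reading of the heuristic).
    ttmc_cost(ordering, mode) is a stand-in for the paper's cost model (Eq. (4)):
    the cost of computing TTMc for `mode` with a CSF whose levels follow
    `ordering` (top level first)."""
    modes = list(range(len(mode_lengths)))

    # X1: shortest mode at the top level, remaining modes by increasing length.
    orderings = [sorted(modes, key=lambda m: mode_lengths[m])]

    for _ in range(1, K):
        # Cheapest cost currently achievable for each mode.
        best = {m: min(ttmc_cost(o, m) for o in orderings) for m in modes}
        # The most expensive mode goes on top of the next representation.
        top = max(modes, key=lambda m: best[m])
        rest = sorted((m for m in modes if m != top), key=lambda m: mode_lengths[m])
        orderings.append([top] + rest)

    # Assign every mode to its cheapest representation; ties broken consistently
    # (lowest index), so fewer than K representations may actually be needed.
    assignment = {m: min(range(K), key=lambda k: ttmc_cost(orderings[k], m))
                  for m in modes}
    return orderings, assignment
```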

Useful References:

Dataset:

  • Nell-2, Enron [FROSTT]
  • Netflix (not public), Alzheimer (From Muthu Baskaran)
  • Synthetic: Poisson3D, Poisson4D

Platform & Software:

  • Intel Haswell E5-2680v3
  • OpenMP
  • Intel compiler with Intel MKL

Comparison Software:

  • HyperTensor: HT-FLAT, HT-BTREE

 

“GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication” — Yuan Tao et al., 2014

Paper: [Link]

Code: N/A

Features:

  • This work extends CSB format to GPU
  • Construct an eCSB (expanded CSB) format
  • eCSB SpMV and SpMTV achieve similar performance.
  • Speedup
    • Compared to multicore CSB, 13x faster.
    • eCSB-SpMV is comparable to CUSP’s HYB-SpMV and outperforms cuSPARSE’s CSR-SpMV by 20%. For SpMTV, eCSB outperforms both by 31%-365%.
    • For Bi-conjugate gradient (BCG), eCSB outperforms cuSPARSE and CUSP by 6% and 25% respectively.
  • eCSB-SpMV shows a higher utilization ratio of the GPU.
  • This paper doesn’t present the work well; the work deserves better.

Findings:

  • A straightforward porting of CSB to GPUs is inefficient, because
    • “CSB algorithm performs a complex recursive partition process, uses multiple forms of dynamic parallelism existing among different rows, sub-rows, and sub-blocks.” This may just be a programming issue. “Such a combination of various types of dynamic parallelism suggests that this scheme is more proper for coarse grain multithreaded execution on multi-core CPUs.”
    • Such blocks have a varying number of non-zeros and have to be processed as a single unit. As a result, we generally choose to use one warp to handle one block. Given a block with few non-zeros, however, the computing resource is substantially wasted.
    • “It depends on Cilk++ to distribute computing tasks to cores at runtime for dynamic load balancing.” How does GPU dynamic parallelism perform?
  • eCSB format (construction sketched after this list)
    • blk_ptr (newly added): the offset of the first nonzero in each blockrow; when COO format is used for the blocks, no blk_ptr is needed.
    • block_ind: the block-row and block-column indices concatenated into a single variable
    • comb_lowbits: the intra-block row and column indices of each nonzero concatenated into a single variable
    • val: nonzero values
    • Its memory footprint is larger than CSB’s according to the paper, but maybe not when the number of blocks is much smaller than n^2/β^2.
  • hybrid storage format to organize sub-blocks
    • can be selected from ELL, COO, and HYB dynamically at runtime
  • Z-Morton ordering (bit-interleaving) is used inside each block.
  • GPU implementation
    • one thread block to handle one blockrow
    • The threads in a block iterate through the single blockrow.
    • Use atomic operations on shared memory (SpMV) or global memory (SpMTV)
  • Give how to determine the storage format in Sec 5.3
  • Matrices that get lower performance with eCSB do so because their most populated rows have far more nonzeros than the average, leading to more frequent collisions in the atomic additions.
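
A minimal sketch of building the eCSB index arrays from COO triples (numpy; the packing widths and blockrow pointer follow the field descriptions above, but the names and details are my assumptions, not the authors' code; the within-block Z-Morton ordering and the per-block ELL/COO/HYB choice are omitted):

```python
import numpy as np

def build_ecsb(rows, cols, vals, n, beta):
    """Sketch of eCSB index construction for an n x n matrix with block size beta."""
    rows, cols, vals = map(np.asarray, (rows, cols, vals))
    brow, bcol = rows // beta, cols // beta          # block coordinates
    lrow, lcol = rows % beta, cols % beta            # intra-block coordinates

    n_blockrows = -(-n // beta)                      # ceil(n / beta)
    bits_block = max(1, int(np.ceil(np.log2(n_blockrows))))
    bits_local = max(1, int(np.ceil(np.log2(beta))))

    # Row-major ordering over blockrows, then block columns.
    order = np.lexsort((bcol, brow))
    block_ind = (brow[order] << bits_block) | bcol[order]      # packed block id
    comb_lowbits = (lrow[order] << bits_local) | lcol[order]   # packed local id

    # blk_ptr: offset of the first nonzero in each blockrow.
    blk_ptr = np.searchsorted(brow[order], np.arange(n_blockrows + 1))
    return blk_ptr, block_ind, comb_lowbits, vals[order]
```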

Other Knowledge:

  • Existing CSR, ELL, COO formats cannot enable efficient access to nonzeros along both row and column directions.

Useful References:

Dataset:

  • 11 matrices from UF collection

Platform & Software:

  • Intel Core i7-3770
  • Nvidia GTX Titan
  • Nvidia C2075

Comparison Software:

  • CSB for multicore CPU
  • Nvidia cuSPARSE’s CSR-SpMV
  • Nvidia CUSP’s HYB-SpMV
  • SpMTV is implemented by modifying the code of CUSP

“Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks” — Aydın Buluç et al., 2009

Paper: [Link]

Code: [Link]

Slides: [Link]

Features:

  • Introduce CSB (Compressed Sparse Blocks) format
    • CSB does not favor rows over columns or vice versa.
    • In contrast to BCSR, which stores a sparse collection of (small) dense blocks, CSB stores a dense collection of sparse blocks.
    • It is motivated by multicore and manycore architectures, in which parallelism and memory bandwidth are key resources.
    • Storage: the storage requirement for CSB is essentially the same as that for CSR, for which computing Ax in parallel is easy but computing Aᵀx (or equivalently xᵀA) is difficult.
  • CSB-SpMV Performance:
    • On one processor, the CSB algorithms for Ax and Aᵀx run just as fast as the CSR algorithm (from OSKI) for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
    • On multicore platforms, CSB-SpMV runs faster than Star-P for all the matrices tested, including the structured ones.
    • “Its performance is limited by the memory-system bandwidth, not by lack of parallelism.”
    • The parallel performance of CSB_SPMV and CSB_SPMV_T is generally not affected by highly uneven row and column nonzero counts.
  • Did not employ speculative low-level optimizations such as software prefetching, pipelining, or matrix-specific optimizations such as index and/or value compression, but we believe that CSB and our algorithms should not adversely affect incorporation of these approaches.

Findings:

  • CSR-SpMV:
    • work: Θ(nnz);
    • span: Θ(nr + lg n), where nr is the average number of nonzeros per row;
    • storage: n lg nnz + nnz lg n bits for index data, for an n×n sparse matrix with nnz nonzeros.
  • CSB format
    • block-size parameter β:
      • a block size slightly larger than √n delivers reasonable performance, which is at most 10%-15% degradation from the best performance.
      • the overall best performance was achieved when β satisfies the equation ⌈lg√n⌉ ≤ lgβ ≤ 3+⌈lg√n⌉.
      • CSB constructor can automatically select a reasonable β in this range.
    • val[nnz]:
      • Store nonzero values in the order of contiguous blocks.
      • The ordering among blocks is flexible; row-major ordering is used in the experiments.
      • Within a block, Z-Morton ordering is used.
        • The choice of layout within the block (Z-Morton or row/column orderings) becomes less significant for sparse blocks as they already do not take full advantage of such features.
        • More importantly, a recursive ordering allows us to efficiently determine the four quadrants of a block using binary search, which is crucial for parallelizing individual blocks.
    • row_ind[nnz]
    • col_ind[nnz]
      • They range from 0 to β-1
      • As a practical matter, we can pack a corresponding pair of elements of row_ind and col_ind into a single integer word of 2lgβ bits so that they make a single array of length nnz, which is comparable to the storage needed by CSR for the col_ind array
    • blk_ptr[n^2/β^2]
      • stores the index of each block in the val array, which is analogous to the row_ptr array for CSR
    • storage: (n^2/β^2) lg nnz + 2 nnz lg β bits for index data
  • Parallelization of CSB-SpMV
    • blockrow splitting, with further dividing long blockrows.
    • Specifically, when a blockrow contains Ω(β) nonzeros, we recursively divide it “in half,” yielding two subblockrows, each containing roughly half the nonzeros.
      • To facilitate fast subblockrow divisions, we first partition the blockrow into “chunks” of consecutive blocks. By dividing a blockrow “in half,” we mean assigning to each subblockrow roughly half the chunks.
      • If this chunk contains many blocks, it is guaranteed to contain at most Θ(β) nonzeros, which is sufficiently sparse to perform the serial multiplication;
      • If, on the other hand, the chunk is a single block, it may contain as many as β² ≈ n nonzeros. In that case, we perform the parallel block-vector multiplication CSB_BLOCKV.
    • For each block, use function CSB_BLOCKV
      • The block-vector multiplication proceeds by recursively dividing the (sub)block M into quadrants M00, M01, M10, and M11, each of which is conveniently stored contiguously in the Z-Morton-ordered val, row_ind, and col_ind arrays between indices start and end. We perform binary searches to find the appropriate dividing points in the array (sketched after this list).
    • To avoid the races that might arise due to write conflicts between the subblockrows, we allocate a temporary vector to store the result of one of the subblockrows and allow the other subblockrow to use the output vector. After both subblockrow multiplications complete, we serially add the temporary vector into the output vector.
    • When the nonzeros are evenly distributed across the blockrows, our CSB-SpMV algorithm directly calls the serial multiplication instead.
  • CSB-SpMV
    • work: Θ(n^2/β^2 + nnz)
    • span: O(β lg(n/β) + n/β)
      • choosing β = Θ(√n), CSB_SPMV runs with work Θ(nnz) and span O(√n lg n), achieving a parallelism of Ω(nnz/(√n lg n)).
    • extra storage (for the chunk array and the temporary vector z): O(P√n lg n) space when β = Θ(√n)
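
A minimal sketch (numpy; names are mine) of the Z-Morton point above: because nonzeros inside a block are stored in Z-Morton order, each quadrant occupies a contiguous range of the val/row_ind/col_ind arrays, so CSB_BLOCKV can locate the quadrant boundaries with binary searches:

```python
import numpy as np

def morton_key(r, c, lg_beta):
    """Interleave the bits of an intra-block (row, col) pair into a Z-Morton code."""
    key = 0
    for b in range(lg_beta):
        key |= ((r >> b) & 1) << (2 * b + 1)   # row bit
        key |= ((c >> b) & 1) << (2 * b)       # column bit
    return key

def quadrant_bounds(keys, start, end, lg_beta):
    """Given ascending Morton codes keys[start:end] of one (sub)block, return the
    boundaries of the four quadrants M00 | M01 | M10 | M11. Because the Z-Morton
    order is recursive, each quadrant is a contiguous range, so three binary
    searches on the top row/column bits suffice (the divide step of CSB_BLOCKV)."""
    quad = np.asarray(keys[start:end]) >> (2 * (lg_beta - 1))  # top two bits
    q1, q2, q3 = (start + np.searchsorted(quad, v) for v in (1, 2, 3))
    # M00 = [start, q1), M01 = [q1, q2), M10 = [q2, q3), M11 = [q3, end)
    return start, q1, q2, q3, end
```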

Other Knowledge:

  • Introduce CSR format history in P234 right column third paragraph
  • Review CSR-SpMV parallelization strategies on Page 235 left column last five paragraphs and right column first three paragraphs.
  • Many or most of the individual blocks Aij are hypersparse, meaning that the ratio of nonzeros to matrix dimension is asymptotically 0. Thus, the space to store a block should depend only on its nonzero count, not on its dimension.
  • Morton-hybrid layout: stores elements in row-major order within blocks and a Morton ordering among blocks. [Paper: Analyzing block locality in Morton-order and Morton-hybrid matrices]

Useful References:

Dataset:

  • 14 matrices from UF collection

Platform & Software:

  • Platforms: AMD Opteron 8214 (Santa Rosa), Intel Xeon X5460 (Harpertown), Intel Core i7 920 (Nehalem)
  • values: double-precision floating points; indices: 32-bit unsigned integers.
  • Language: Cilk++

Comparison Software:

  • “pure” serial OSKI: without enabling OSKI’s preprocessing step, which chooses blockings for cache and register usage that are tuned to a specific matrix.
  • Star-P: distributed-memory code

Cited by:

“A Unified Optimization Approach for Sparse Tensor Operations on GPUs” — Bangtian Liu et al. 2017

Paper: [Link]

Code: N/A

Features:

  • Optimized sparse MTTKRP and TTM on GPUs
  • Can be extended to high-order tensor computations
  • Speedup:
    • SpTTM and SpMTTKRP are up to 3.7x and 30.6x faster, respectively, than ParTI-GPU on NVIDIA Titan-X GPUs.
    • CANDECOMP/PARAFAC (CP) decomposition achieves up to 14.9 times speedup over SPLATT.
  • My thought:
    • F-COO cannot replace the COO format; a separate F-COO is needed for each mode.
    • sf may need to record every starting index in the F-COO format.

Findings:

  • propose a unified sparse tensor representation called F-COO (Flagged-COOrdinate), which is based on the tensor modes for sparse tensor computations.
    • memory efficient
    • can be used as a unified storage format across different sparse tensor operations
    • Each computation needs a correspondingly constructed F-COO representation.
    • F-COO is preprocessed for different modes on the host and will only be transferred once in the beginning to the GPU.
  • F-COO details:
    • take mode-3 SpTTM as an example
      • “val”: store nonzero values
      • “k”: Product modes’ indices are stored as in COO format, to guide the Kronecker or Hadamard product operations
      • bf (bit-flag): represent any changes in the index modes which consequently shows the computation has switched to another fiber (in SpTTM) or to another slice (in SpMTTKRP).
        • bf value will change from 1 to 0 when a change in the i or j value occurs.
        • The number of non-zeros processed per thread depends on the data type selected for the bf array. For example, if we use uint8_t or unsigned char for bf, the number of non-zeros processed per thread will be 8.
      • sf (start-flag): indicate whether partitions processed by the current thread start new indices on index mode over previous thread.
  • Unified parallel algorithms for different sparse tensor operations, not sensitive to mode changes
    • The one-shot algorithm for MTTKRP seems to have been used before.
    • They avoid large intermediate data because each thread block processes one column of the factor matrices instead of a row.
    • And each thread block writes to different global memory locations.
  • Use segmented scan with the bf and sf information (a host-side sketch follows this list).
  • GPU parallelization:
    • launch two-dimensional thread grids with one-dimensional thread blocks, instead of one-dimensional thread grids with two-dimensional thread blocks (as in ParTI), to avoid thread divergence inside a warp and strided memory accesses.
    • This partitioning strategy makes the computation imbalance exist between thread blocks instead of inside threads.
  • Use GPU-specific optimizations: kernel fusion, warp shuffle, and data reuse
    • Use Read-Only Data-Cache for factor matrices
    • fuse product kernel, segmented scan, and the accumulation kernels together to increase data reuse and keep intermediate data in shared memory
    • Warp shuffle is used to increase data sharing inside a warp. Warp shuffle enables register to register data exchange and thus reduces the shared memory footprint and avoids overusing shared memory.
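
A minimal sketch of how the bit flags drive a segmented reduction for mode-3 SpTTM (a host-side numpy stand-in; the real kernels do this with warp-level segmented scans on the GPU, and the flag convention and names below are mine, not the authors'):

```python
import numpy as np

def spttm_mode3_fcoo(val, k_idx, bf, U):
    """Mode-3 SpTTM over an F-COO-like layout (sketch, not the authors' kernel).

    val[p]   : p-th nonzero value of the sparse tensor X
    k_idx[p] : its mode-3 (product-mode) index, stored as in COO
    bf[p]    : 1 when nonzero p starts a new (i, j) fiber, else 0 (my convention;
               the paper packs these flags into machine words and also keeps a
               per-thread start flag sf)
    U        : dense factor matrix of shape (K, R)
    Returns one output row per fiber: out[f, :] = sum_k X[i_f, j_f, k] * U[k, :].
    """
    val, k_idx, bf = map(np.asarray, (val, k_idx, bf))
    partial = val[:, None] * U[k_idx]       # per-nonzero products, shape (nnz, R)
    seg_id = np.cumsum(bf) - 1              # fiber (segment) id of each nonzero
    out = np.zeros((seg_id[-1] + 1, U.shape[1]))
    np.add.at(out, seg_id, partial)         # segmented reduction driven by bf
    return out
```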

Other Knowledge:

  • Challenges in optimizing sparse tensor operations on GPUs, such as i) finding a good parallelization granularity, ii) reducing storage costs (large intermediate data) and irregularities in memory accesses, and iii) dealing with atomic updates.
  • Prior work has used fiber- or slice-level computations as the granularity for parallelization.
    • However, such approaches lead to noticeable load imbalance between threads on the GPU because of the sparse nature of the tensors.
    • Also the optimizations do not deliver consistent performance for different modes and ranks

Dataset:

  •  FROSTT: brainq, nell1, nell2, delicious

Platform:

  • Intel Core i7-5820K CPU
  • NVIDIA GeForce GTX Titan X platform

Comparison Software:

  • ParTI
  • SPLATT

“Merge-based Parallel Sparse Matrix-Vector Multiplication” — Duane Merrill et al. 2016

Paper: [Link]

Slide: [Link]

Code: [Link]

Features:

  • Overall, better average performance compared to Nvidia cuSPARSE, Intel MKL and other implementations. Performance is worse for some matrices with evenly sized rows.
  • The first work to adapt the merge-based parallel decomposition for computing a well-balanced SpMV directly on CSR matrices without offline analysis, format conversion, or the construction of side-band data (aka preprocessing, inspection, reformatting, or supplemental encoding). (The nz_indices used in the merge-based algorithm are the nonzero location indices, i.e. the natural numbers from 0 to nnz-1, so no extra storage is needed for them.) This decomposition is particularly useful for bounding workloads across multi-scale processors and systems with fixed-size local memories.
  • The merge-based decomposition is orthogonal to (and yet supportive of) bandwidth optimization strategies such as index compression, blocking, and relabeling. In fact, the elimination of workload imbalance from our method is likely to amplify the benefits of these techniques.
  • Explore NUMA opportunities for multi-socket systems.
  • Merge-based method allows us to make maximal utilization of the GPU’s fixed-size shared memory resources, regardless of the ratio between row-offsets and nonzeros consumed during each path-chunk. This is not possible for row-based or nonzero-splitting strategies that are unable to provide tight bounds on local workloads.
  • Binary searching and fix-up are counted as part of SpMV running time, while NUMA page migration as preprocessing overhead.
  • The row-length imperviousness (the correlation between GFLOPs throughput vs row-length variation) and the performance predictability (the correlation between elapsed running time vs matrix size nnz) are GOOD (Fig. 9).

Findings:

  • Adapt a simple parallelization strategy originally developed for efficient merging of sorted sequences [Paper]
  • The central idea is to frame the parallel CsrMV decomposition as a logical merger of two lists: the row descriptors and the indices of the nonzero values.
  • Merge-based CsrMV
    • Two stages: partitioning and merging stages.
    • The two elements Ai and Bj scheduled to be compared at diagonal k can be found via constrained binary search along that diagonal: find the first (i,j) where Ai is greater than all of the items consumed before Bj, given that i+j=k.
    • A key property of this parallelization strategy is that the decision path can be partitioned hierarchically, trivially enabling parallelization across large, multi-scale systems. Work complexity: O(N + p log N); time complexity: O(N/p + log N). (A sketch of the diagonal search and the per-thread merge loop follows this list.)
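
A minimal sketch of the two stages (Python, serial simulation of the decomposition; the names are mine; list A is the CSR row end-offsets and list B is the natural nonzero indices 0..nnz-1, as described above):

```python
import numpy as np

def merge_path_search(diagonal, row_end, nnz):
    """Constrained binary search along a merge-path diagonal: return (i, j) with
    i + j = diagonal, i = #row-end-offsets consumed, j = #nonzeros consumed."""
    lo, hi = max(0, diagonal - nnz), min(diagonal, len(row_end))
    while lo < hi:
        mid = (lo + hi) // 2
        if row_end[mid] <= diagonal - 1 - mid:
            lo = mid + 1
        else:
            hi = mid
    return lo, diagonal - lo

def merge_csrmv(row_ptr, col_idx, vals, x, num_threads):
    """Merge-based CsrMV: the merge of the row end-offsets with the nonzero
    indices is split into equal chunks, one per 'thread', by diagonal searches."""
    m, nnz = len(row_ptr) - 1, len(vals)
    row_end = row_ptr[1:]                       # list A of the merge
    y = np.zeros(m)
    items = m + nnz                             # total merge-path length
    per = -(-items // num_threads)              # ceil(items / num_threads)
    for t in range(num_threads):
        i, j = merge_path_search(min(t * per, items), row_end, nnz)
        i_end, j_end = merge_path_search(min((t + 1) * per, items), row_end, nnz)
        acc = 0.0
        while i < i_end or j < j_end:
            if i < i_end and (j >= j_end or j >= row_end[i]):
                y[i] += acc                     # row i ends inside this chunk
                acc, i = 0.0, i + 1
            else:
                acc += vals[j] * x[col_idx[j]]  # consume one nonzero
                j += 1
        if i < m:
            y[i] += acc                         # carry-out "fix-up" for a partial row
    return y
```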

Other Knowledge:

  • Specialized or supplementary formats ultimately burden the application with both
    • (1) preprocessing time for inspection and formatting, which may be tens of thousands of times greater than the SpMV operation itself;
    • (2) excess storage overhead because the original CSR matrix is likely required by other routines and cannot be discarded.
  • Specialized sparse matrix formats:
    • padding (e.g. ELL) and/or reorganization strategies (map similarly-sized rows onto co-scheduled thread groups, e.g. PKT, SELL),
    • reduce memory I/O via index compression (e.g. blocking, bit-encoding, BCCOO),
    • hybrid and multi-level formats (e.g. pOSKI, CSB, HYB, Cocktail).
  • CsrMV parallelization strategies:
    • Row splitting:
      • The splitting of long rows can be done statically via a preprocessing step that encodes the dataset into an alternative format, or dynamically using a work-sharing runtime. The dynamic variant requires runtime task distribution, a behavior likely to incur processor contention and limited scalability on massively parallel systems.
      • Vectorization is a common variant of row-splitting in which a group of threads is assigned to process each row. This can cause significant underutilization in the form of inactive SIMD lanes when row-lengths do not align with SIMD widths; adaptively vectorizing based on row length is a further variant.
    • Nonzero splitting: need to perform the searching within the row-offsets array as an offline preprocessing step.

COO SpMV optimizations:

Dataset:

  • 4201 non-vector, non-complex sparse matrices from University of Florida Sparse Matrix Collection

Platform:

  • dual-socket NUMA CPU platform, with two Intel Xeon E5-2690v2 CPUs
  • one NVIDIA Tesla K40 GPU
  • Test bandwidth: stream triad score [Webpage]

Comparison Software:

  • Intel MKL v11.3
  • NVIDIA’s cuSPARSE v7.5
  • Other non-CSR formats expressly designed to tolerate row-length variation:
    • CSB (Compressed Sparse Blocks) [Paper]
    • HYB, also from NVIDIA’s cuSPARSE library, HybMV [Paper]
    • pOSKI [Webpage]
    • yaSpMV: BCCOO (blocked compressed common coordinate) [Paper]

“Tensor Contraction Layers for Parsimonious Deep Nets” — Jean Kossaifi et al. 2017

Name: TCL

Paper: [Link]

Code: N/A

Features:

  • First work on applying tensor decomposition as a general layer or to replace fully-connected layers.
  • Use TTM operations (mode-n products; sketched after this list).
  • Tensor modes’ meaning: height, width, channel, in that order.
  • In TCL layers, the height size (128-512 in the paper) is much larger than the width size (always 3 in the paper).
  • Good references
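
A minimal sketch of a TCL forward pass as a chain of mode-n (TTM) products (numpy; the function names, shapes, and choice of contracted modes are illustrative assumptions, not the paper's code):

```python
import numpy as np

def mode_n_product(X, M, n):
    """Mode-n (TTM) product X x_n M: contracts mode n of X (size D_n) with a
    matrix M of shape (R_n, D_n); mode n of the result has size R_n."""
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)   # mode-n unfolding
    Y = M @ Xn
    new_shape = (M.shape[0],) + tuple(np.delete(X.shape, n))
    return np.moveaxis(Y.reshape(new_shape), 0, n)

def tcl_forward(X, factors):
    """TCL forward sketch: contract selected modes of the activation tensor
    X (batch, D1, ..., DN) with learned factor matrices; a None entry leaves
    that mode untouched (e.g. the channel mode, per the Findings below)."""
    G = X
    for n, V in enumerate(factors, start=1):            # mode 0 is the batch
        if V is not None:
            G = mode_n_product(G, V, n)
    return G

# Toy example: batch of 8 activations of size 32 x 32 x 64; contract the two
# spatial modes down to 16 x 16 and leave the channel mode untouched.
X = np.random.rand(8, 32, 32, 64)
factors = [np.random.rand(16, 32), np.random.rand(16, 32), None]
print(tcl_forward(X, factors).shape)   # (8, 16, 16, 64)
```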

Findings:

  • TCLs reduce the dimensionality of the activation tensors (only for the two spatial modes of images, leaving the modes corresponding to the channel and the batch size untouched) and thus the number of model parameters, while at the same time preserving high accuracy.
  • Optimize fully-connected layers using tensor factorizations, using two approaches:
    • TCL as an additional layer: reduce the dimensionality of the activation tensor before feeding it to the subsequent two (or more) fully-connected layers and the softmax output of the network. This approach preserves or even increases the accuracy.
    • TCL as replacement of a fully-connected layer (partial or full replacement of fully-connected layers): this approach affects the accuracy a bit, but significantly reduces the number of parameters.
    • Taking the input to the fully-connected layers as an activation tensor X of size (D1, …, DN), we seek a low-dimensional core tensor G of smaller size (R1, …, RN).
  • Both the number of parameters and the time complexity of a TCL are smaller than those of a fully-connected layer. (A detailed comparison of the complexity and the number of parameters is in the paper.)
  • To avoid vanishing or exploding gradients, and to make the TCL more robust to changes in the initialization of the factors, we added a batch normalization layer [8] before and after the TCL.

  • Future work
    • Extend the work to more network architectures, especially in settings where raw data or learned representations exhibit natural multi-modal structure that might be captured via higher-order tensors.
    • Advance the experimental study of TCLs for large-scale, high-resolution vision datasets.
    • Integrate new extended BLAS primitives which can avoid the transpositions needed to compute the tensor contractions.
    • Look into methods to induce and exploit sparsity in the TCL, to understand the parameter reductions this method can yield over existing state-of-the-art pruning methods.
    • An extension to the TCL is in progress: a tensor regression layer to replace both the fully-connected and final output layers, potentially yielding increased accuracy with even greater parameter reductions.

Other Knowledge:

  • Fully-connected layers hold over 80% of the parameters.

Useful reference:

Software:

  • AlexNet
  • VGG

Dataset:

  • CIFAR100
  • ImageNet

“SPEEDING-UP CONVOLUTIONAL NEURAL NETWORKS USING FINE-TUNED CP-DECOMPOSITION” — Vadim Lebedev et al. 2015

Paper: [Link]

Code: N/A

Features:

  • Use CP decomposition with NLS (non-linear least squares) method
    • minimizes the L2-norm of the approximation residual (for a user-defined fixed R) using Gauss-Newton optimization.
  • Decompose the 4D kernel tensor
    • The convolution kernel itself constitutes a 4D tensor with dimensions corresponding to the two spatial dimensions, the input image maps, and the output image maps.

Findings:

  • CP-decomposition approximates the convolution with a 4D kernel tensor by a sequence of four convolutions with small 2D kernel tensors. This decomposition is used to replace the original convolutional layer with a sequence of four convolutional layers with small kernels (sketched after this list).
  • fine-tune the entire network on training data using back-propagation.
    • This discriminative fine-tuning works well, even when CP-decomposition has large approximation error.
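
A minimal sketch of the rank-R CP form of the 4D kernel that the four small convolutions implement (numpy einsum; the sizes and symbol names are illustrative, not the paper's notation):

```python
import numpy as np

# 4D convolution kernel of shape (d, d, S, T): two spatial dims, S input maps,
# T output maps. A rank-R CP approximation writes it as
#   K[x, y, s, t] ≈ sum_r Kx[x, r] * Ky[y, r] * Ks[s, r] * Kt[t, r],
# which is roughly what the sequence of four convolutions realizes: a 1x1 conv
# over input maps (Ks), a d x 1 and a 1 x d spatial conv (Kx, Ky), then a 1x1
# conv to the output maps (Kt).
d, S, T, R = 3, 48, 64, 16
Kx, Ky = np.random.rand(d, R), np.random.rand(d, R)
Ks, Kt = np.random.rand(S, R), np.random.rand(T, R)

K_approx = np.einsum('xr,yr,sr,tr->xyst', Kx, Ky, Ks, Kt)
print(K_approx.shape)   # (3, 3, 48, 64)
```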

Other Knowledge:

  • On the theoretical side, these results confirm the intuition that modern CNNs are over-parameterized, i.e. that the sheer number of parameters in modern CNNs is not needed to store the information about the classification task but, rather, serves to facilitate convergence to good local minima of the loss function.

Useful reference:

  • Suggested a scheme based on the CP-decomposition of parts of the kernel tensor obtained by biclustering (along with different decompositions for the first convolutional layer and the fully-connected layers). CP-decompositions of the kernel tensor parts are computed with a greedy approach. Only fine-tunes the layers above the approximated one.
    • Denton, Emily, Zaremba, Wojciech, Bruna, Joan, LeCun, Yann, and Fergus, Rob. Exploiting linear structure within convolutional networks for efficient evaluation. arXiv preprint arXiv:1404.0736, 2014.
  • Effectively approximate the 4D kernel tensor as a composition (product) of two 3D tensors, and perform “local” fine-tuning that minimizes the deviation between the outputs of the full and the approximated convolutions on the training data.
    • Jaderberg, Max, Vedaldi, Andrea, and Zisserman, Andrew. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference (BMVC), 2014a.
  • there is no finite algorithm for determining canonical rank of a tensor. [Paper]

Dataset:

  • ImageNet

“Tensorizing Neural Networks” — Alexander Novikov et al. 2015

Paper: [Link]

TensorNet Code: [Link]

Features:

  • Use Tensorization to build a tensor from a vector and a matrix
  • Optimize fully-connected layers

Findings:

  • Convert the dense weight matrices of the fully-connected layers to the Tensor Train (TT) format, and use the TT-format for the TT-layer and the learning steps (TT-format sketched after this list).
  • Potentially addresses the two issues of wide and shallow networks (see below for the issues).
  • Result: for the very deep VGG networks [21], the compression factor of the dense weight matrix of a fully-connected layer is up to 200,000x, leading to a compression factor of the whole network of up to 7x.
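
A minimal sketch of the TT-format idea (numpy): a d-way tensor is represented by d three-way cores, and each element is a product of matrix slices. This is generic TT, not the paper's TT-layer code.

```python
import numpy as np

def tt_element(cores, index):
    """Evaluate one element of a tensor stored in TT-format.
    cores[k] has shape (r_{k-1}, n_k, r_k) with r_0 = r_d = 1, so
    A[i_1, ..., i_d] = G_1[:, i_1, :] @ G_2[:, i_2, :] @ ... @ G_d[:, i_d, :]."""
    result = np.eye(1)
    for core, i in zip(cores, index):
        result = result @ core[:, i, :]
    return result.item()

# A 4-way tensor of shape (4, 5, 6, 7) with TT-ranks (1, 3, 3, 3, 1).
shape, ranks = (4, 5, 6, 7), (1, 3, 3, 3, 1)
cores = [np.random.rand(ranks[k], shape[k], ranks[k + 1]) for k in range(4)]
print(tt_element(cores, (1, 2, 3, 4)))
# Storage: sum_k r_{k-1} * n_k * r_k numbers instead of prod_k n_k.
```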

Other Knowledge:

  • These advances of Deep neural networks have become possible because of algorithmic advances, large amounts of available data, and modern hardware.
  • State-of-the-art neural networks have reached hardware limits both in terms of computational power and memory. A large number of works have tried to reduce both hardware requirements (e.g. memory demands) and running times.
  • One of the most straightforward approaches is to use a low-rank representation of the weight matrices. Recent studies show that the weight matrix of the fully-connected layer is highly redundant and by restricting its matrix rank it is possible to greatly reduce the number of parameters without significant drop in the predictive accuracy.
  • Matrix and tensor decompositions were recently used to speed up the inference time of CNNs
    • Matrix: E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems 27 (NIPS), 2014, pp. 1269–1277.
    • Tensor: V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, “Speeding-up convolutional neural networks using fine-tuned CP-decomposition,” in International Conference on Learning Representations (ICLR), 2014.
  • TT-format is immune to the curse of dimensionality and its algorithms are robust.
  • For an arbitrary tensor A, a TT-representation exists but is not unique.
  • An attractive property of the TT-decomposition is the ability to efficiently perform several types of operations on tensors if they are in the TT-format:
    • basic linear algebra operations, such as the addition of a constant and the multiplication by a constant, the summation and the entrywise product of tensors (the results of these operations are tensors in the TT-format generally with the increased ranks); computation of global characteristics of a tensor, such as the sum of all elements and the Frobenius norm.
  • Traditionally, very wide shallow networks are not considered because of high computational and memory demands and the over-fitting risk.

Useful reference:

  • Tensor Train paper: I. V. Oseledets, “Tensor-Train decomposition,” SIAM J. Scientific Computing, vol. 33, no. 5, pp. 2295– 2317, 2011.
    • Application: A. Novikov, A. Rodomanov, A. Osokin, and D. Vetrov, “Putting MRFs on a Tensor Train,” in International Conference on Machine Learning (ICML), 2014, pp. 811–819.
  • Hierarchical Tucker paper: W. Hackbusch and S. Kuhn, “A new scheme for the tensor representation,” J. Fourier Anal. Appl., vol. 15, pp. 706–722, 2009.

Dataset:

  • MNIST: small, handwritten-digit recognition
    • Y. LeCun, C. Cortes, and C. J. C. Burges, “The MNIST database of handwritten digits,” 1998.
  • CIFAR-10: small, 50,000 train and 10,000 test 32*32 3-channel images, assigned to 10 different classes
    • A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, Computer Science Department, University of Toronto, 2009
  • ImageNet: large, 1000-class ImageNet ILSVRC-2012 dataset, 1.2 million training images and 50,000 validation images.
    • A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25 (NIPS), 2012, pp. 1097–1105.

“Marble: High-throughput Phenotyping from Electronic Health Records via Sparse Nonnegative Tensor Factorization” — Joyce Ho et al. 2014

Paper: link

Slide: link

Features:

  • sparse tensor
  • sparse factor matrices
  • Poisson regression, based on CP-APR
  • For count data

Phenotyping application background:

  • A major limitation of existing phenotyping efforts is the need for human annotation of case and control samples, which requires substantial time, effort, and expert knowledge to develop.
  • phenotyping can be viewed as a form of dimensionality reduction, where each phenotype forms a latent space
    • citation: G. Hripcsak and D. J. Albers. Next-generation phenotyping of electronic health records. JAMIA, 20(1):117–121, Dec. 2012.

Tensor factorization vs matrix factorization:

  • Matrix factorization, a common dimensionality reduction approach, is insufficient as it cannot concisely capture structured EHR source interactions, such as multiple medications prescribed to treat a single disease

Findings:

  • Constraints on the factor matrices to minimize the number of non-zero elements
  • augmentation of the tensor approximation
    • Marble decomposes an observed tensor into two terms, a bias (or offset) tensor and an interaction (or signal) tensor (similar to the CP-APR factorized tensor); a sketch of the two-term model follows this list.
    • The bias tensor represents the baseline characteristics common to the overall population and also provides computational stability. The interaction term is composed of concise, intuitive, and interpretable phenotypes in the data.
  • Marble achieves at least a 42.8% reduction in the number of non-zero elements compared to CP-APR without sacrificing the quality of the tensor decomposition.
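
A minimal sketch of the two-term model as I read it (numpy; a rank-1 bias tensor plus a rank-R CP interaction tensor; Marble's actual weighting, normalization, and sparsity constraints on the factors are not reproduced here):

```python
import numpy as np

I, J, K, R = 10, 6, 5, 3                         # patients x diagnoses x procedures
u = [np.random.rand(n) for n in (I, J, K)]       # one baseline factor per mode
bias = np.einsum('i,j,k->ijk', *u)               # rank-1 bias (offset) tensor

weights = np.random.rand(R)
A, B, C = (np.random.rand(n, R) for n in (I, J, K))
signal = np.einsum('r,ir,jr,kr->ijk', weights, A, B, C)   # rank-R interaction tensor

rate = bias + signal                             # Poisson rate tensor approximating X
```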

CP Applications:

  • concept discovery: U. Kang, E. Papalexakis, A. Harpale, and C. Faloutsos. Gigatensor: Scaling tensor analysis up by 100 times-algorithms and discoveries. In KDD 2012, pages 316–324, 2012.
  • network analysis of fMRI data: I. Davidson, S. Gilpin, O. Carmichael, and P. Walker. Network discovery via constrained tensor analysis of fMRI data. In KDD 2013, Aug. 2013.
  • community discovery: Y.-R. Lin, J. Sun, H. Sundaram, A. Kelliher, P. Castro, and R. Konuru. Community discovery via metagraph factorization. ACM Transactions on Knowledge Discovery from Data, 5(3), Aug. 2011

Sparsity constrained factor matrices for sparse tensor decomposition:

  • Traditional sparsity-inducing penalties such as ℓ1 and ℓ2 regularization only deal with the standard least-squares minimization.
    • D. Wang and S. Kong. Feature selection from high-order tensorial data via sparse decomposition. Pattern Recognition Letters, 33(13):1695–1702, 2012.
  • Non-parametric Bayesian approaches to sparse Tucker decomposition
    • Z. Xu, F. Yan, Yuan, and Qi. Infinite Tucker Decomposition: Nonparametric Bayesian Models for Multiway Data Analysis. In ICML 2012, pages 1023–1030. Alan, 2012.
  • A multi-layer NTF has been proposed to achieve sparse representations for various cost functions, including KL divergence, using a nonlinearly transformed gradient descent approach
    • Proposed sHOPCA, based on HOOI algorithm
    • A. Cichocki, R. Zdunek, S. Choi, R. Plemmons, and S.-I. Amari. Novel multi-layer non-negative tensor factorization with sparsity constraints. In ICANNGA 2007, pages 271–280. Springer, 2007.

Useful reference:

  • CP-APR: E. C. Chi and T. G. Kolda. On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications, 33(4):1272–1299, 2012.
  • Survey: A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari. Nonnegative matrix and tensor factorizations: Applications to exploratory multi-way data analysis and blind source separation. Wiley, 2009.

Dataset:

  • CMS data: the Centers for Medicare and Medicaid Services (CMS) provides the CMS Linkable 2008-2010 Medicare Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF), a publicly available dataset.
    • 10,000 patients, 129 diagnoses, 115 procedures