“Parallel Tensor Compression for Large-Scale Scientific Data” — Woody Austin et al., 2016

Paper: [Link]

Code: [Link]


  • Dense tensors
  • The first distributed Tucker decomposition using MPI
  • Tucker decomposition (optimized ST-HOSVD and HOOI algorithms)
  • Uses nonstandard data layouts: the approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel.
  • Achieves near-peak performance: as high as 66% of peak on a single 24-core node, and up to 17% of peak on over 1000 nodes.
  • Gives a thorough analysis of computation and communication time


  • The algorithm works for dense tensors of any order (i.e., number of dimensions) and size, given adequate memory, e.g., three times the size of the data.
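The core sequential kernel the paper parallelizes is ST-HOSVD: truncate each mode in turn, shrinking the core tensor before later modes are processed. A minimal single-node numpy sketch (function names and the rank choice are mine, not the paper's code; the paper's contribution is distributing these steps with MPI):

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move `mode` to the front, flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def st_hosvd(T, ranks):
    """Sequentially truncated HOSVD: truncate after each mode so that
    later modes operate on an already-compressed core."""
    core = T
    factors = []
    for mode, r in enumerate(ranks):
        M = unfold(core, mode)
        # Leading r left singular vectors of the mode-n unfolding.
        U, _, _ = np.linalg.svd(M, full_matrices=False)
        U = U[:, :r]
        factors.append(U)
        # Contract the mode with U^T, then move the axis back in place.
        core = np.moveaxis(np.tensordot(U.T, core, axes=(1, mode)), 0, mode)
    return core, factors

# Usage: compress a random 3-way tensor to multilinear ranks (4, 4, 4).
T = np.random.rand(10, 12, 8)
core, factors = st_hosvd(T, (4, 4, 4))
# Reconstruct: core x_0 U0 x_1 U1 x_2 U2
R = core
for mode, U in enumerate(factors):
    R = np.moveaxis(np.tensordot(U, R, axes=(1, mode)), 0, mode)
```

Note how the "three times the data size" memory estimate above is plausible here: the input, the shrinking core, and unfolding/workspace copies coexist during the sweep.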

Useful References:


Platform & Software:

Comparison Software:

“Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory” — Shaden Smith et al., 2017

Paper: TODO



  • Maintains load balance and low synchronization overhead
  • Explores architectural features, e.g., vectorization, synchronization primitives (mutexes, compare-and-swap, transactional memory, privatization), and management of high-bandwidth memory (MCDRAM).
  • Platform: one KNL (Knights Landing) processor
  • Speedup: 1.8x over a dual-socket, 44-core Intel Xeon system.
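The kernel that dominates sparse CP factorization (and that the paper vectorizes and parallelizes on KNL) is MTTKRP. A hedged sequential sketch over COO-format nonzeros; the storage layout and names are mine, not the paper's actual code:

```python
import numpy as np

def mttkrp_coo(coords, vals, factors, mode, rank):
    """For each nonzero, scale the elementwise product of the factor
    rows from every mode except `mode`, and accumulate into the output
    row indexed by that nonzero's `mode` coordinate."""
    out = np.zeros((factors[mode].shape[0], rank))
    for idx, v in zip(coords, vals):
        row = np.full(rank, v)
        for m, A in enumerate(factors):
            if m != mode:
                row *= A[idx[m]]
        # With many threads, each thread would accumulate into a private
        # copy of `out` (privatization) or use atomics, then reduce.
        out[idx[mode]] += row
    return out
```

The irregular, data-dependent indexing into `out` and the factor matrices is exactly why load balance and synchronization strategy matter on tens to hundreds of threads.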


Other Knowledge:

  • HPC systems are increasingly used for data-intensive computations, which exhibit irregular memory accesses, non-uniform work distributions, large memory footprints, and high memory bandwidth demands.
  • sparse, unstructured tensors
  • Challenges for optimization algorithms on many-core processors: exposing a high degree of parallelism, balancing load across tens to hundreds of parallel threads, and effectively utilizing the high-bandwidth memory.
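Of the synchronization strategies listed above, privatization is the simplest to illustrate: each thread accumulates into its own copy of the output and the copies are summed in a final reduction, so no locks or atomics are needed. A toy sketch (structure and names are mine; real many-core code would use OpenMP or similar, not Python threads):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def accumulate_private(updates, out_len, n_threads=4):
    """updates: list of (index, value) pairs with colliding indices;
    returns the summed output without any locking."""
    chunks = [updates[t::n_threads] for t in range(n_threads)]

    def worker(chunk):
        priv = np.zeros(out_len)      # thread-private buffer
        for i, v in chunk:
            priv[i] += v              # no synchronization needed
        return priv

    with ThreadPoolExecutor(n_threads) as ex:
        privs = list(ex.map(worker, chunks))
    return np.sum(privs, axis=0)      # final reduction across threads
```

The trade-off is memory: one private output per thread, which is why the paper also weighs atomics (compare-and-swap) and transactional memory for large outputs.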

Useful References: