“The Tensor Algebra Compiler” — Fredrik Kjolstad et al., 2017

Paper: [Link]

Code: [Link]

Features:

  • Handles sparse and semi-sparse tensors and sparse and semi-sparse matrices
    • semi-sparse: a mix of dense and sparse modes (each mode is designated independently as dense or sparse).
  • Supports CSR, DCSR, BCSR (for matrices) and CSF (for tensors)
    • The storage order of the dimensions must be specified (see the format sketch after this list).
    • Supports MTTKRP with sparse matrices; this can be used later.
  • No auto-tuning yet; format and ordering decisions are left to the user.
  • Evaluates SpMV, compound linear algebra, and higher-order tensor algebra (e.g. MTTKRP, TTM, TTV, INNERPROD, ELEMENTWISE-PLUS)
  • Doesn’t support parallel execution if the output is sparse.
  • The “assemble” method builds the sparse index structure of the output tensor and preallocates memory for it.
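
The format descriptors and the assemble/compute split map directly onto taco’s C++ API. Below is a minimal sketch of a sparse tensor-times-vector kernel modeled on the examples in the paper; the exact constructor signatures, sizes, and version details are assumptions and may differ from the released library.

    // Minimal taco C++ sketch: per-dimension formats, index notation, assemble/compute.
    #include "taco.h"
    using namespace taco;

    int main() {
      Format csr({Dense, Sparse});           // dense rows, sparse columns (CSR)
      Format csc({Dense, Sparse}, {1, 0});   // same level types, column-major storage order (CSC)
      Format csf({Sparse, Sparse, Sparse});  // compressed sparse fiber for a 3-tensor
      Format dv({Dense});                    // dense vector

      Tensor<double> A({512, 64}, csr);      // sparse output matrix
      Tensor<double> B({512, 64, 32}, csf);  // sparse 3-tensor operand
      Tensor<double> c({32}, dv);            // dense vector operand

      B.insert({0, 0, 0}, 1.0);              // populate the operands ...
      c.insert({0}, 2.0);
      B.pack();                              // ... and pack them into their storage formats
      c.pack();

      IndexVar i, j, k;
      A(i, j) = B(i, j, k) * c(k);           // tensor-times-vector in index notation

      A.compile();                           // generate the fused sparse kernel
      A.assemble();                          // build A’s sparse index structure and preallocate memory
      A.compute();                           // compute the values
      return 0;
    }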

Findings:

  • “tensor storage formats that separately designate each dimension as dense or sparse and specify their storage order, which can describe several widely used formats but generalizes to many more (Section 3);
  • iteration graphs that describe how to co-iterate through the multi-level index data structures of the sparse operands in a compound tensor expression (Section 4);
    • Each node corresponds to a loop in a loop nest.
    • A tensor path is a tuple of index variables.
    • Iteration graphs do not support cycles, because back edges would require traversing a format in the wrong direction. The current workaround is a provided function for changing tensor formats, which can be used to break cycles and to transpose tensors.
  • merge lattices that describe how to merge the index data structures of sparse tensors that are used in the same tensor expression (Section 5);
    • A hand-written sketch of the two-way merge (co-iteration) pattern follows this list.
  • a code generation algorithm that uses the above concepts to generate efficient code that computes a tensor expression with dense, sparse, and mixed operands (Section 6).”
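
To make the iteration graph and merge lattice machinery concrete, here is a hand-written sketch of the two-way co-iteration (merge) pattern that generated code must follow when two sparse operands are combined, e.g. a sparse vector addition a(i) = b(i) + c(i). This is only an illustration of the pattern, not taco’s actual generated code, and the array names are made up. The three branches correspond to the points of the merge lattice: both operands have a nonzero, only b does, or only c does.

    // Two-way merge (co-iteration) over the coordinate/value arrays of two
    // sparse vectors b and c, producing the sparse result a = b + c.
    #include <cstddef>
    #include <vector>

    void sparse_add(const std::vector<int>& b_idx, const std::vector<double>& b_vals,
                    const std::vector<int>& c_idx, const std::vector<double>& c_vals,
                    std::vector<int>& a_idx, std::vector<double>& a_vals) {
      std::size_t pb = 0, pc = 0;
      // While both operands may still contribute, advance whichever is behind.
      while (pb < b_idx.size() && pc < c_idx.size()) {
        int ib = b_idx[pb], ic = c_idx[pc];
        if (ib == ic) {                       // both b and c have a nonzero here
          a_idx.push_back(ib);
          a_vals.push_back(b_vals[pb] + c_vals[pc]);
          ++pb; ++pc;
        } else if (ib < ic) {                 // only b contributes
          a_idx.push_back(ib);
          a_vals.push_back(b_vals[pb]);
          ++pb;
        } else {                              // only c contributes
          a_idx.push_back(ic);
          a_vals.push_back(c_vals[pc]);
          ++pc;
        }
      }
      // One operand is exhausted; copy the tail of the other.
      for (; pb < b_idx.size(); ++pb) { a_idx.push_back(b_idx[pb]); a_vals.push_back(b_vals[pb]); }
      for (; pc < c_idx.size(); ++pc) { a_idx.push_back(c_idx[pc]); a_vals.push_back(c_vals[pc]); }
    }

For a multiplication a(i) = b(i) * c(i) the lattice collapses to the single “both nonzero” case, so only the ib == ic branch would produce output.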

Motivation:

  • “First, sparse input tensors are compressed into indexed data structures that the kernels must operate on. Second, only non-zero output values should be produced, and the calculations differ between computed components depending on the input tensors that contribute non-zero values. Finally, sparse tensor data structures typically do not support O(1) random access, so kernels must carefully orchestrate co-iteration over multiple tensors.”
    • “Sparse data structures can only be accessed efficiently in one direction.” (Then why does the SPLATT CSF code achieve similar performance for each mode?)
    • “The number of formats increases exponentially as dimensionality increases by d!2^d.” (A quick worked count follows this list.)
    • “A sparse storage dimension does not allow efficient random access to indices and values but is optimized for iteration in a specific order.”
  • “Libraries provide a limited set of hand-optimized operations and programmers compute compound operations through a sequence of supported operations using temporary tensors. This reduces locality and efficiency, and for some compound operations the temporaries are much larger than the kernel’s inputs and output. On the other hand, it is infeasible to write optimized code for every compound operation by hand because of the combinatorial explosion of all possible compound operations, tensor orders, and tensor formats.”
  • “Our technique generates code entirely from tensor index notation and simple format descriptors that fully specify the compressed data structures. Thus, we do not perform pointer alias or dependency analysis at compile time, nor at runtime as in inspector-executor systems.”
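
A quick check of the d!2^d count quoted above: each of the d dimensions is independently stored dense or sparse (2^d choices), and the dimensions can be stored in any of d! orders, giving

    d! \cdot 2^d:\qquad d=2:\ 2!\cdot 2^2 = 8,\qquad d=3:\ 3!\cdot 2^3 = 48,\qquad d=4:\ 4!\cdot 2^4 = 384.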

Useful References:

Dataset:

  • SuiteSparse Matrix Collection
  • FROSTT

Platform & Software:

  • C++ library

Comparison Software:

  • A batch of sparse matrix libraries (e.g. OSKI, pOSKI, MKL, Eigen, uBLAS, Gmm++)
  • Tensor: SPLATT, Tensor Toolbox

Questions:

  • Why are there “384 possible kernels for all the combinations of formats and implementations” (p.3)?
  • Why is the serial CSR SpMV code from taco faster than that of other libraries? taco doesn’t apply any particular optimization.
  • Why is the serial dense-blocked SpMV taco code in Fig. 14 slower than OSKI, while its parallel code is faster?

“Parallel Tensor Compression for Large-Scale Scientific Data” — Woody Austin et al., 2016

Paper: [Link]

Code: [Link]

Features:

  • Dense tensors
  • The first distributed Tucker decomposition using MPI
  • Tucker decomposition (optimized ST-HOSVD and HOOI algorithms; the form of the decomposition is recalled after this list)
  • Uses nonstandard data layouts: the tensor data distribution is chosen so that no tensor data redistribution is needed, either locally or in parallel.
  • Achieves near-peak performance: as high as 66% of peak on a single 24-core node, and up to 17% of peak on over 1000 nodes.
  • Gives a thorough analysis of computation and communication time.
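
For reference (standard background, not this paper’s exact notation), the Tucker decomposition approximates a 3-way tensor by a small core tensor multiplied along each mode by a factor matrix:

    \mathcal{X} \approx \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)},
    \qquad
    x_{i_1 i_2 i_3} \approx \sum_{r_1, r_2, r_3} g_{r_1 r_2 r_3}\, u^{(1)}_{i_1 r_1}\, u^{(2)}_{i_2 r_2}\, u^{(3)}_{i_3 r_3}.

ST-HOSVD computes the factor matrices one mode at a time and shrinks the tensor after each mode; HOOI then refines the factors iteratively.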

Findings:

  • The algorithm works for dense tensors of any order (i.e., number of dimensions) and size, given adequate memory, e.g., three times the size of the data.

Useful References:

Dataset:

Platform & Software:

Comparison Software:

“Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory” — Shaden Smith et al., 2017

Paper: TODO

Code: [Link]

Features:

  • Maintains load balance with low synchronization overhead.
  • Explores architectural features, e.g. vectorization, synchronization strategies (mutexes, compare-and-swap, transactional memory, privatization), and management of high-bandwidth memory (MCDRAM); a sketch of two accumulation strategies follows this list.
  • Platform: one KNL processor
  • Speedup: 1.8x over a dual-socket, 44-core Intel Xeon system.
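
A hypothetical sketch (not the paper’s code) of two of the accumulation strategies listed above, for the situation where several threads may add into the same output location, as happens in parallel MTTKRP: fine-grained atomics versus privatization with a final reduction. The input arrays rows/vals are illustrative stand-ins for per-nonzero updates.

    // (a) Synchronization: every update is an atomic read-modify-write.
    #include <cstddef>
    #include <vector>

    void accumulate_atomic(const std::vector<int>& rows, const std::vector<double>& vals,
                           std::vector<double>& out) {
      #pragma omp parallel for schedule(static)
      for (long n = 0; n < (long)rows.size(); ++n) {
        #pragma omp atomic
        out[rows[n]] += vals[n];
      }
    }

    // (b) Privatization: each thread accumulates into a private buffer, so no
    // per-update synchronization is needed; only the final merge is serialized.
    void accumulate_privatized(const std::vector<int>& rows, const std::vector<double>& vals,
                               std::vector<double>& out) {
      #pragma omp parallel
      {
        std::vector<double> priv(out.size(), 0.0);
        #pragma omp for schedule(static) nowait
        for (long n = 0; n < (long)rows.size(); ++n)
          priv[rows[n]] += vals[n];
        #pragma omp critical
        for (std::size_t r = 0; r < out.size(); ++r)
          out[r] += priv[r];
      }
    }

Privatization trades extra memory (one buffer per thread) for synchronization-free updates, which is usually a win when the output is small relative to the number of updates.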

Findings:

Other Knowledge:

  • HPC systems are increasingly used for data-intensive computations, which exhibit irregular memory accesses, non-uniform work distributions, large memory footprints, and high memory-bandwidth demands.
  • Targets sparse, unstructured tensors.
  • Challenges for optimization algorithms on many-core processors: exposing a high degree of parallelism, load-balancing tens to hundreds of parallel threads, and effectively utilizing the high-bandwidth memory.

Useful References:

Dataset: