“An Efficient Fill Estimation Algorithm for Sparse Matrices and Tensors in Blocked Formats” — Peter Ahrens et al., 2017

Paper: [Link]

Code: [Link]

Features:

Findings:

Useful References:

Dataset:

Platform & Software:

Comparison Software:

  • OSKI

“Parallel Tensor Compression for Large-Scale Scientific Data” — Woody Austin et al., 2016

Paper: [Link]

Code: [Link]

Features:

  • dense tensors
  • The first distributed Tucker decomposition using MPI
  • Tucker decomposition (optimized ST-HOSVD and HOOI algorithms)
  • Uses nonstandard data layouts: the approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel.
  • Achieves near-peak performance: as high as 66% of peak on a single 24-core node, and up to 17% of peak on over 1000 nodes.
  • Gives a thorough computation and communication time analysis.

Findings:

  • The algorithm works for dense tensors of any order (i.e., number of dimensions) and size, given adequate memory, e.g., three times the size of the data.

Useful References:

Dataset:

Platform & Software:

Comparison Software:

“Accelerating the Tucker Decomposition with Compressed Sparse Tensors” — Shaden Smith et al., 2017

Paper: [Link]

Code: Will be in SPLATT

Features:

  • sparse tensors
  • TTMc in CSF format on multi-core CPU architecture (no Tucker decomposition result shown)
  • Achieves a 20.7x speedup while mostly using less memory (up to 28.5x less)
  • Fig. 7: good analysis for time and memory!

Findings:

  • Similar to SPLATT: the time complexity decreases along the tree levels.
  • The intermediate memory is limited to a single row of Y(n).
  • Parallelized by distributing the I1 (first mode) trees to threads.
  • Presents synchronization using mutexes for write conflicts when updating any mode other than the first.
  • Using multiple CSF representations
  • Heuristic mode ordering, assuming K CSF representations can be stored (a sketch follows this list):
    • For X1, sort the modes by their lengths, with the shortest mode placed at the top level.
    • The remaining K−1 representations are selected in a greedy fashion: at step k, use (4) to examine the costs associated with TTMc for each mode when provided with X1, …, Xk−1. The mode with the highest cost is placed at the top level of Xk, and the remaining modes are sorted by increasing length.
    • At the end of this procedure, each mode has the choice of K representations to use for TTMc computation. We assign each mode to the representation with the lowest cost, and use that representation for TTMc.
    • Importantly, if ties are broken in a consistent manner, then in practice several modes can be assigned to the same Xk, meaning that fewer than K representations need be kept in memory for computation.
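
A minimal sketch of this greedy mode-ordering heuristic, assuming the per-mode TTMc cost model of Eq. (4) is available as a black-box function (`ttmc_cost` below is a hypothetical placeholder, not SPLATT's actual API):

```cpp
// Hypothetical sketch of the greedy CSF mode-ordering heuristic described above.
// `ttmc_cost` stands in for the cost model of Eq. (4) in the paper, which is not
// reproduced here; all names are illustrative.
#include <algorithm>
#include <functional>
#include <vector>

struct CsfRep {
    std::vector<int> mode_order;  // modes from top (root) level to bottom (leaf) level
};

using CostFn = std::function<double(int mode, const std::vector<CsfRep>& reps)>;

std::vector<CsfRep> choose_csf_reps(const std::vector<long>& mode_lengths, int K,
                                    const CostFn& ttmc_cost) {
    const int N = static_cast<int>(mode_lengths.size());
    auto by_length = [&](int a, int b) { return mode_lengths[a] < mode_lengths[b]; };

    // X1: all modes sorted by length, shortest mode at the top level.
    CsfRep first;
    for (int m = 0; m < N; ++m) first.mode_order.push_back(m);
    std::sort(first.mode_order.begin(), first.mode_order.end(), by_length);
    std::vector<CsfRep> reps{first};

    // X2..XK: place the mode with the highest TTMc cost (given X1..Xk-1) at the
    // top level, then sort the remaining modes by increasing length.
    for (int k = 1; k < K; ++k) {
        int worst_mode = 0;
        double worst_cost = -1.0;
        for (int m = 0; m < N; ++m) {
            const double c = ttmc_cost(m, reps);
            if (c > worst_cost) { worst_cost = c; worst_mode = m; }
        }
        CsfRep rep;
        rep.mode_order.push_back(worst_mode);
        for (int m = 0; m < N; ++m)
            if (m != worst_mode) rep.mode_order.push_back(m);
        std::sort(rep.mode_order.begin() + 1, rep.mode_order.end(), by_length);
        reps.push_back(rep);
    }
    // Each mode is then assigned to the representation with the lowest ttmc_cost;
    // with consistent tie-breaking, fewer than K representations may need to be kept.
    return reps;
}
```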

Useful References:

Dataset:

  • Nell-2, Enron [FROSTT]
  • Netflix (not public), Alzheimer (From Muthu Baskaran)
  • Synthetic: Poisson3D, Poisson4D

Platform & Software:

  • Intel Haswell E5-2680v3
  • OpenMP
  • Intel compiler with Intel MKL

Comparison Software:

  • HyperTensor: HT-FLAT, HT-BTREE

 

“GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication” — Yuan Tao et al., 2014

Paper: [Link]

Code: N/A

Features:

  • This work extends CSB format to GPU
  • Construct an eCSB (expanded CSB) format
  • eCSB SpMV and SpMTV achieve comparable performance.
  • Speedup
    • Up to 13x faster than multicore CSB.
    • eCSB-SpMV is comparable to CUSP’s HYB-SpMV and outperforms cuSPARSE’s CSR-SpMV by 20%; for SpMTV, eCSB outperforms both by 31%–365%.
    • For biconjugate gradient (BCG), eCSB outperforms cuSPARSE and CUSP by 6% and 25%, respectively.
  • eCSB-SpMV shows a higher utilization ratio of GPU.
  • This paper doesn’t present the work well; the work deserves better.

Findings:

  • A straightforward porting of CSB to GPUs is inefficient, because
    • “CSB algorithm performs a complex recursive partition process, uses multiple forms of dynamic parallelism existing among different rows, sub-rows, and sub-blocks.” This may just be an implementation issue. “Such a combination of various types of dynamic parallelism suggests that this scheme is more proper for coarse grain multithreaded execution on multi-core CPUs.”
    • Such blocks have a varying number of non-zeros and have to be processed as a single unit. As a result, we generally choose to use one warp to handle one block. Given a block with few non-zeros, however, the computing resource is substantially wasted.
    • “It depends on Cilk++ to distribute computing tasks to cores at runtime for dynamic load balancing.” How does GPU dynamic parallelism perform?
  • eCSB format (a sketch of these arrays follows this list)
    • blk_ptr (newly added): the offset of the first nonzero in each blockrow; when COO format is used for the blocks, no blk_ptr is needed.
    • block_ind: the block row index and block column index concatenated into a single variable.
    • comb_lowbits: the intra-block row and column indices of each nonzero concatenated into a single variable.
    • val: nonzero values.
    • According to the paper, its memory footprint is larger than CSB’s; but perhaps not when the number of blocks is much smaller than n^2/β^2.
  • hybrid storage format to organize sub-blocks
    • can be selected from ELL, COO, and HYB dynamically at runtime
  • Z-Morton ordering (bit-interleaving) is used inside each block.
  • GPU implementation
    • one thread block to handle one blockrow
    • The threads in a block iterate through a single blockrow.
    • Use atomic operations on shared memory (SpMV) or global memory (SpMTV)
  • Sec. 5.3 describes how to determine the storage format.
  • Matrices that get lower performance with eCSB do so because their most populated rows have far more nonzeros than the average, leading to more frequent collisions in the atomic additions.
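
A minimal struct-level sketch of the eCSB arrays as I read them from the bullets above (field names follow the paper's terminology, but this is an interpretation, not the authors' code; the dynamic ELL/COO/HYB sub-block organization is omitted):

```cpp
// Illustrative sketch of the eCSB index arrays described above.
#include <cstdint>
#include <vector>

struct ECSB {
    int n;                           // matrix dimension (n x n)
    int beta;                        // block size (beta x beta blocks)

    // Offset of the first nonzero of each blockrow (not needed when the
    // sub-blocks inside a blockrow are stored in COO form).
    std::vector<uint32_t> blk_ptr;

    // Per nonzero block: block-row and block-column indices packed into one word.
    std::vector<uint32_t> block_ind;

    // Per nonzero: intra-block row and column indices packed into one word
    // (each index is < beta, so lg(beta) bits per component suffice).
    std::vector<uint32_t> comb_lowbits;

    std::vector<double> val;         // nonzero values, Z-Morton ordered within a block
};

// Example of packing/unpacking the intra-block coordinates, assuming beta is a
// power of two so that the lg(beta) low bits hold the column index.
inline uint32_t pack_lowbits(uint32_t r, uint32_t c, uint32_t lg_beta) {
    return (r << lg_beta) | c;
}
inline void unpack_lowbits(uint32_t packed, uint32_t lg_beta, uint32_t& r, uint32_t& c) {
    c = packed & ((1u << lg_beta) - 1u);
    r = packed >> lg_beta;
}
```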

Other Knowledge:

  • Existing CSR, ELL, and COO formats do not enable efficient access to nonzeros along both the row and column directions.

Useful References:

Dataset:

  • 11 matrices from UF collection

Platform & Software:

  • Intel Core i7-3770
  • Nvidia GTX Titan
  • Nvidia C2075

Comparison Software:

  • CSB for multicore CPU
  • Nvidia cuSPARSE’s CSR-SpMV
  • Nvidia CUSP’s HYB-SpMV
  • SpMTV is implemented by modifying the code of CUSP

“Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks” — Aydın Buluç et al., 2009

Paper: [Link]

Code: [Link]

Slides: [Link]

Features:

  • Introduce CSB (Compressed Sparse Blocks) format
    • CSB does not favor rows over columns or vice versa.
    • In contrast to BCSR, which stores a sparse collection of (small) dense blocks, CSB stores a dense collection of sparse blocks.
    • It is motivated by multicore and manycore architectures, in which parallelism and memory bandwidth are key resources.
    • Storage: the storage requirement for CSB is essentially the same as that for the CSR format, for which computing Ax in parallel is easy but computing A^T x (or, equivalently, x^T A) is difficult.
  • CSB-SpMV Performance:
    • On one processor, the CSB algorithms for Ax and A^T x run just as fast as the CSR algorithm (from OSKI) for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
    • On multicore platforms, CSB-SpMV runs faster than Star-P for all the matrices tested, including the structured ones.
    • “Its performance is limited by the memory-system bandwidth, not by lack of parallelism.”
    • The parallel performance of CSB_SPMV and CSB_SPMV_T is generally not affected by highly uneven row and column nonzero counts.
  • The authors did not employ speculative low-level optimizations such as software prefetching or pipelining, or matrix-specific optimizations such as index and/or value compression, but believe that CSB and their algorithms should not adversely affect the incorporation of these approaches.

Findings:

  • CSR-SpMV:
    • work: Θ(nnz);
    • span: Θ(nr + lg n), where nr is the average number of nonzeros per row;
    • storage: n lg(nnz) + nnz lg(n) bits of index data, for an n×n sparse matrix with nnz nonzeros.
  • CSB format (a struct sketch follows this Findings list)
    • block-size parameter β:
      • a block size slightly larger than √n delivers reasonable performance, which is at most 10%-15% degradation from the best performance.
      • the overall best performance was achieved when β satisfies the equation ⌈lg√n⌉ ≤ lgβ ≤ 3+⌈lg√n⌉.
      • CSB constructor can automatically select a reasonable β in this range.
    • val[nnz]:
      • Store nonzero values in the order of contiguous blocks.
      • The ordering among blocks is flexible; row-major ordering is used in the experiments.
      • Within a block, Z-Morton ordering is used.
        • The choice of layout within the block (Z-Morton or row/column orderings) becomes less significant for sparse blocks as they already do not take full advantage of such features.
        • More importantly, a recursive ordering allows us to efficiently determine the four quadrants of a block using binary search, which is crucial for parallelizing individual blocks.
    • row_ind[nnz]
    • col_ind[nnz]
      • They range from 0 to β-1
      • As a practical matter, a corresponding pair of elements of row_ind and col_ind can be packed into a single integer word of 2 lg β bits, so that they form a single array of length nnz, which is comparable to the storage needed by CSR for its col_ind array.
    • blk_ptr[n^2/β^2]
      • stores the index of each block in the val array, which is analogous to the row_ptr array for CSR
    • storage: (n^2/β^2) lg(nnz) + 2 nnz lg(β) bits for index data
  • Parallelization of CSB-SpMV
    • blockrow splitting, with further dividing long blockrows.
    • Specifically, when a blockrow contains Ω(β) nonzeros, we recursively divide it “in half,” yielding two subblockrows, each containing roughly half the nonzeros.
      • To facilitate fast subblockrow divisions, we first partition the blockrow into “chunks” of consecutive blocks. By dividing a blockrow “in half,” we mean assigning to each subblockrow roughly half the chunks.
      • If a chunk contains many blocks, it is guaranteed to contain at most Θ(β) nonzeros, which is sufficiently sparse to perform the serial multiplication;
      • If, on the other hand, the chunk is a single block, it may contain as many as β^2 ≈ n nonzeros. In that case, we perform the parallel block-vector multiplication CSB_BLOCKV.
    • For each block, use function CSB_BLOCKV
      • The block-vector multiplication proceeds by recursively dividing the (sub)block M into quadrants M00, M01, M10, and M11, each of which is conveniently stored contiguously in the Z-Morton-ordered val, row_ind, and col_ind arrays between indices start and end. We perform binary searches to find the appropriate dividing points in the array.
    • To avoid the races that might arise due to write conflicts between the subblockrows, we allocate a temporary vector to store the result of one of the subblockrows and allow the other subblockrow to use the output vector. After both subblockrow multiplications complete, we serially add the temporary vector into the output vector.
    • When the nonzeros are evenly distributed across the blockrows, our CSB-SpMV algorithm directly calls the serial multiplication instead.
  • CSB-SpMV
    • work: Θ(n^2/β^2 + nnz)
    • span: O(β lg(n/β) + n/β)
      • Choosing β = Θ(√n), CSB_SPMV runs with work Θ(nnz) and span O(√n lg n), achieving a parallelism of Ω(nnz / (√n lg n)).
    • extra storage (for the chunk array and the temporary vector z): O(P √n lg n) space when β = Θ(√n), where P is the number of processors
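
To make the layout concrete, here is a hedged sketch of the CSB index arrays and a serial blockrow-times-vector loop reconstructed from the description above. The paper's actual implementation additionally splits long blockrows into chunks and parallelizes within blocks via CSB_BLOCKV; that machinery is omitted here, and the names are illustrative.

```cpp
// Sketch of the CSB arrays and a serial y += A*x over one blockrow.
#include <cstdint>
#include <vector>

struct CSB {
    int n;                              // matrix dimension (n x n)
    int beta;                           // block dimension, roughly sqrt(n)
    std::vector<uint64_t> blk_ptr;      // start of each block in val, row-major over
                                        // the n^2/beta^2 blocks (+1 sentinel here)
    std::vector<uint16_t> row_ind;      // intra-block row index, in [0, beta)
    std::vector<uint16_t> col_ind;      // intra-block column index, in [0, beta)
                                        // (the paper packs both into one 2*lg(beta)-bit word)
    std::vector<double>   val;          // nonzeros, Z-Morton ordered inside each block
};

// Serial multiplication of one blockrow with x, accumulating into y. The parallel
// algorithm would split long blockrows into chunks and use a temporary output
// vector to avoid write races between sub-blockrows.
void blockrow_spmv(const CSB& A, int blockrow, const std::vector<double>& x,
                   std::vector<double>& y) {
    const int nblkcols = (A.n + A.beta - 1) / A.beta;
    const int row_off = blockrow * A.beta;
    for (int bc = 0; bc < nblkcols; ++bc) {
        const int b = blockrow * nblkcols + bc;          // row-major block ordering
        const uint64_t start = A.blk_ptr[b], end = A.blk_ptr[b + 1];
        const int col_off = bc * A.beta;
        for (uint64_t k = start; k < end; ++k) {
            y[row_off + A.row_ind[k]] += A.val[k] * x[col_off + A.col_ind[k]];
        }
    }
}
```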

Other Knowledge:

  • CSR format history is introduced on p. 234, right column, third paragraph.
  • CSR-SpMV parallelization strategies are reviewed on p. 235, left column (last five paragraphs) and right column (first three paragraphs).
  • Many or most of the individual blocks Aij are hypersparse, i.e., the ratio of nonzeros to the matrix dimension is asymptotically 0. The space to store a block should therefore depend only on its nonzero count, not on its dimension.
  • Morton-hybrid layout: stores elements in row-major order within blocks and a Morton ordering among blocks. [Paper: Analyzing block locality in Morton-order and Morton-hybrid matrices]

Useful References:

Dataset:

  • 14 matrices from UF collection

Platform & Software:

  • Platforms: AMD Opteron 8214 (Santa Rosa), Intel Xeon X5460 (Harpertown), Intel Core i7 920 (Nehalem)
  • values: double-precision floating points; indices: 32-bit unsigned integers.
  • Language: Cilk++

Comparison Software:

  • “pure” serial OSKI: without enabling OSKI’s preprocessing step, which chooses blockings for cache and register usage that are tuned to a specific matrix.
  • Star-P: distributed-memory code

Cited by:

Summary of Tensor Decomposition from a High Performance Computing View (Updating)

Survey:

Tensor Formats:

Sparse Tensor Parallelization:

Current Development:

  • Fundamental tensor operations
    • TTM
    • MTTKRP
    • Tensor contraction
  • Tensor Decompositions
    • CP decomposition
    • Tucker decomposition
    • (Anima’s )
    • Tensor Train decomposition
    • Hierarchical Tucker decomposition

Applications:

  • Healthcare
  • Deep Learning
  • Traditional Machine Learning
  • Social Networks

Software:

  • Matlab
    • Tensor Toolbox
    • N-way Toolbox
  • C++
    • CTF (Cyclops Tensor Framework)
    • SPLATT
    • ParTI

“A Unified Optimization Approach for Sparse Tensor Operations on GPUs” — Bangtian Liu et al. 2017

Paper: [Link]

Code: N/A

Features:

  • Optimized sparse MTTKRP and TTM on GPUs
  • Can be extended to high-order tensor computations
  • Speedup:
    • SpTTM and SpMTTKRP are up to 3.7x and 30.6x faster, respectively, than ParTI-GPU on an NVIDIA Titan X GPU.
    • CANDECOMP/PARAFAC (CP) decomposition achieves up to 14.9 times speedup over SPLATT.
  • My thought:
    • F-COO cannot replace the COO format; a separate F-COO representation is needed for each mode.
    • sf may need to record every starting index in the F-COO format.

Findings:

  • Proposes a unified sparse tensor representation called F-COO (Flagged-COOrdinate), which is based on the tensor modes used in sparse tensor computations.
    • memory efficient
    • can be used as a unified storage format across different sparse tensor operations
    • Each computation needs a correspondingly constructed F-COO representation.
    • F-COO is preprocessed for different modes on the host and will only be transferred once in the beginning to the GPU.
  • F-COO details (a sketch of these arrays follows this list):
    • take mode-3 SpTTM as an example
      • “val”: store nonzero values
      • “k”: Product modes’ indices are stored as in COO format, to guide the Kronecker or Hadamard product operations
      • bf (bit-flag): represent any changes in the index modes which consequently shows the computation has switched to another fiber (in SpTTM) or to another slice (in SpMTTKRP).
        • The bf value changes from 1 to 0 when a change in the i or j value occurs.
        • The number of nonzeros processed per thread depends on the data type selected for the bf array. For example, if uint8_t or unsigned char is used for bf, the number of nonzeros processed per thread will be 8.
      • sf (start-flag): indicates whether the partition processed by the current thread starts new indices on the index modes relative to the previous thread.
  • Unified parallel algorithms for different sparse tensor operations, not sensitive to mode changes
    • The one-shot algorithm for MTTKRP seems to have already been used in prior work.
    • They avoid large intermediate data because they handle the factor matrices one column per thread block, instead of a row.
    • And each thread block writes to different global memory locations.
  • Use segmented scan with bf and sf information.
  • GPU parallelization:
    • Launch two-dimensional thread grids with one-dimensional thread blocks, instead of one-dimensional thread grids with two-dimensional thread blocks (as in ParTI), to avoid thread divergence inside a warp and strided memory accesses.
    • This partitioning strategy places the computation imbalance between thread blocks instead of inside threads.
  • Use GPU-specific optimizations: kernel fusion, warp shuffle, and data reuse
    • Use Read-Only Data-Cache for factor matrices
    • fuse product kernel, segmented scan, and the accumulation kernels together to increase data reuse and keep intermediate data in shared memory
    • Warp shuffle is used to increase data sharing inside a warp. Warp shuffle enables register to register data exchange and thus reduces the shared memory footprint and avoids overusing shared memory.
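
A rough, illustrative sketch of the F-COO arrays for mode-3 SpTTM, pieced together from the bullets above. This is my interpretation rather than the authors' code; in particular, how the index-mode coordinates are recovered from sf is only hinted at in the comments, and the value type is arbitrary.

```cpp
// Sketch of the F-COO arrays for a mode-3 SpTTM on a 3rd-order tensor X(i, j, k).
#include <cstdint>
#include <vector>

struct FCOOMode3 {
    // One entry per nonzero of the sparse tensor.
    std::vector<float>    val;   // nonzero values
    std::vector<uint32_t> k;     // product-mode index, stored COO-style, used to
                                 // drive the Kronecker/Hadamard products

    // bf: one bit per nonzero; flips from 1 to 0 when (i, j) changes, i.e. when the
    // computation switches to another fiber (SpTTM) or slice (SpMTTKRP). With
    // uint8_t, each element covers 8 nonzeros, which also fixes the number of
    // nonzeros processed per thread.
    std::vector<uint8_t>  bf;

    // sf: one flag per thread partition, indicating whether that partition starts
    // new indices on the index modes relative to the previous thread. Per the note
    // above, sf may additionally need to record the starting index of each partition.
    std::vector<uint8_t>  sf;

    // The index-mode coordinates (i, j) are not stored per nonzero; results for a
    // fiber/slice are accumulated with a segmented scan driven by bf and sf.
};
```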

Other Knowledge:

  • Challenges in optimizing sparse tensor operations on GPUs, such as i) finding a good parallelization granularity, ii) reducing storage costs (large intermediate data) and irregularities in memory accesses, and iii) dealing with atomic updates.
  • Prior work has used fiber- or slice-level computations as the granularity for parallelization.
    • However, such approaches lead to noticeable load imbalance between threads on the GPU because of the sparse nature of the tensors.
    • Also the optimizations do not deliver consistent performance for different modes and ranks

Dataset:

  • FROSTT: brainq, nell1, nell2, delicious

Platform:

  • Intel Core i7-5820K CPU
  • NVIDIA GeForce GTX Titan X platform

Comparison Software:

  • ParTI
  • SPLATT

“Merge-based Parallel Sparse Matrix-Vector Multiplication” — Duane Merrill et al. 2016

Paper: [Link]

Slide: [Link]

Code: [Link]

Features:

  • Overall, better average performance compared to Nvidia cuSPARSE, Intel MKL, and other implementations; performance is worse for some matrices with evenly sized rows.
  • The first work to adapt a merge-based parallel decomposition for computing a well-balanced SpMV directly on CSR matrices without offline analysis, format conversion, or the construction of side-band data (aka preprocessing, inspection, reformatting, or supplemental encoding). (The nz_indices used in the merge-based algorithm are nonzero location indices, i.e., the natural numbers from 0 to nnz−1, so no extra storage is needed for them.) This decomposition is particularly useful for bounding workloads across multi-scale processors and systems with fixed-size local memories.
  • The merge-based decomposition is orthogonal to (and yet supportive of) bandwidth optimization strategies such as index compression, blocking, and relabeling. In fact, the elimination of workload imbalance from our method is likely to amplify the benefits of these techniques.
  • Explore NUMA opportunities for multi-socket systems.
  • Merge-based method allows us to make maximal utilization of the GPU’s fixed-size shared memory resources, regardless of the ratio between row-offsets and nonzeros consumed during each path-chunk. This is not possible for row-based or nonzero-splitting strategies that are unable to provide tight bounds on local workloads.
  • Binary searching and fix-up are counted as part of SpMV running time, while NUMA page migration as preprocessing overhead.
  • The row-length imperviousness (the correlation between GFLOPS throughput and row-length variation) and the performance predictability (the correlation between elapsed running time and matrix size nnz) are good (Fig. 9).

Findings:

  • Adapt a simple parallelization strategy originally developed for efficient merging of sorted sequences [Paper]
  • The central idea is to frame the parallel CsrMV decomposition as a logical merger of two lists: the row descriptors and the indices of the nonzero values.
  • Merge-based CsrMV
    • Two stages: partitioning and merging stages.
    • The two elements Ai and Bj scheduled to be compared at diagonal k can be found via constrained binary search along that diagonal: find the first (i,j) where Ai is greater than all of the items consumed before Bj, given that i+j=k.
    • A benefit of this parallelization strategy is that the decision path can be partitioned hierarchically, trivially enabling parallelization across large, multi-scale systems. Work complexity: O(N + p log N); time complexity: O(N/p + log N). (A sketch of the diagonal search follows this list.)
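
A sketch of the constrained binary search along a diagonal, in the standard merge-path formulation where list A is the CSR row-end offsets and list B is simply the nonzero indices 0..nnz−1. This is adapted from the general merge-path idea, not the authors' exact code.

```cpp
// Merge-path partition search for CsrMV: split diagonal `diag` into
// (rows consumed, nonzeros consumed) so that every thread gets an equal
// share of the combined row-offset + nonzero workload.
#include <algorithm>
#include <cstdint>
#include <utility>

std::pair<int64_t, int64_t> merge_path_search(int64_t diag,
                                              const int64_t* row_end_offsets,
                                              int64_t num_rows,
                                              int64_t num_nonzeros) {
    int64_t lo = std::max<int64_t>(diag - num_nonzeros, 0);
    int64_t hi = std::min<int64_t>(diag, num_rows);
    while (lo < hi) {
        const int64_t i = (lo + hi) / 2;   // candidate count of consumed row-ends
        const int64_t j = diag - i;        // implied count of consumed nonzeros
        if (row_end_offsets[i] <= j - 1)   // row i already finished before nonzero j-1
            lo = i + 1;
        else
            hi = i;
    }
    return {lo, diag - lo};                // (rows consumed, nonzeros consumed)
}
```

Each thread t then searches the diagonal at offset t × items_per_thread to obtain its starting (row, nonzero) coordinate and processes exactly items_per_thread merge-path steps, with a later fix-up pass for rows that straddle thread boundaries.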

Other Knowledge:

  • Specialized or supplementary formats ultimately burden the application with both
    • (1) preprocessing time for inspection and formatting, which may be tens of thousands of times greater than the SpMV operation itself;
    • (2) excess storage overhead because the original CSR matrix is likely required by other routines and cannot be discarded.
  • Specialized sparse matrix formats:
    • padding (e.g. ELL) and/or reorganization strategies (map similarly-sized rows onto co-scheduled thread groups, e.g. PKT, SELL),
    • reduce memory I/O via index compression (e.g. blocking, bit-encoding, BCCOO),
    • hybrid and multi-level formats (e.g. pOSKI, CSB, HYB, Cocktail).
  • CsrMV parallelization strategies:
    • Row splitting:
      • The splitting of long rows can be done statically via a preprocessing step that encodes the dataset into an alternative format, or dynamically using a work-sharing runtime. The dynamic variant requires runtime task distribution, a behavior likely to incur processor contention and limited scalability on massively parallel systems.
      • Vectorization is a common variant of row splitting in which a group of threads is assigned to process each row. It may cause significant underutilization in the form of inactive SIMD lanes when row lengths do not align with SIMD widths, while adaptively choosing the vectorization width based on row length requires an inspection/preprocessing step.
    • Nonzero splitting: need to perform the searching within the row-offsets array as an offline preprocessing step.

COO SpMV optimizations:

Dataset:

  • 4201 non-vector, non-complex sparse matrices from University of Florida Sparse Matrix Collection

Platform:

  • dual-socket NUMA CPU platform, with two Intel Xeon E5-2690v2 CPUs
  • one NVIDIA Tesla K40 GPU
  • Test bandwidth: stream triad score [Webpage]

Comparison Software:

  • Intel MKL v11.3
  • NVIDIA’s cuSPARSE v7.5
  • Other non-CSR formats expressly designed to tolerate row-length variation:
    • CSB (Compressed Sparse Blocks) [Paper]
    • HYB, also from NVIDIA’s cuSPARSE library, HybMV [Paper]
    • pOSKI [Webpage]
    • yaSpMV: BCCOO (blocked compressed common coordinate) [Paper]

“Tensor Contraction Layers for Parsimonious Deep Nets” — Jean Kossaifi et al. 2017

Name: TCL

Paper: [Link]

Code: N/A

Features:

  • First work to apply tensor decomposition as a general layer or to replace fully-connected layers.
  • Uses TTM operations.
  • Tensor mode meanings, in order: height, width, channel.
  • In the TCL layers, the height dimension (128–512 in the paper) is much larger than the width dimension (always 3 in the paper).
  • Good references

Findings:

  • TCLs reduce the dimensionality of the activation tensors (only for the two spatial modes of images, leaving the modes corresponding to the channel and the batch size untouched) and thus the number of model parameters, while preserving high accuracy.
  • Optimizes fully-connected layers using tensor factorizations, via two approaches:
    • TCL as an additional layer: reduce the dimensionality of the activation tensor before feeding it to the subsequent two (or more) fully-connected layers and the softmax output of the network. This approach preserves or even increases the accuracy.
    • TCL as a replacement of a fully-connected layer (partial or full replacement of the fully-connected layers): this approach affects the accuracy slightly but significantly reduces the number of parameters.
    • Taking the input to the fully-connected layers as an activation tensor X of size (D1, …, DN), we seek a low-dimensional core tensor G of smaller size (R1, …, RN). (A minimal mode-n TTM sketch follows this list.)
  • Both the number of parameters and the time complexity of a TCL are smaller than those of a fully-connected layer. (A detailed comparison of the complexity and number of parameters is in the paper.)
  • To avoid vanishing or exploding gradients, and to make the TCL more robust to changes in the initialization of the factors, we added a batch normalization layer [8] before and after the TCL.

  • Future work
    • we plan to extend our work to more network architectures, especially in settings where raw data or learned representations exhibit natural multi-modal structure that we might capture via high-order tensors.

    • We also endeavor to advance our experimental study of TCLs for large-scale, high-resolution vision datasets.

    • Plan to integrate new extended BLAS primitives which can avoid transpositions needed to compute the tensor contractions.
    • we will look into methods to induce and exploit sparsity in the TCL, to understand the parameter reductions this method can yield over existing state-of-the-art pruning methods.

    • we are working on an extension to the TCL: a tensor regression layer to replace both the fully-connected and final output layers, potentially yielding increased accuracy with even greater parameter reductions.
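
As a minimal illustration of the TTM building block mentioned above, here is a dense mode-1 tensor-times-matrix product for a 3rd-order tensor. It is unoptimized and for exposition only; the dimension names are illustrative, not the paper's notation.

```cpp
// Mode-1 TTM: Y = X x_1 V, where X is D1 x D2 x D3 and V is R x D1,
// giving Y of size R x D2 x D3. Tensors are stored flattened in row-major order.
#include <vector>

using Tensor3 = std::vector<double>;  // x[(i*D2 + j)*D3 + k]

Tensor3 ttm_mode1(const Tensor3& X, int D1, int D2, int D3,
                  const std::vector<double>& V, int R) {
    Tensor3 Y(static_cast<size_t>(R) * D2 * D3, 0.0);
    for (int r = 0; r < R; ++r)
        for (int i = 0; i < D1; ++i) {
            const double v = V[static_cast<size_t>(r) * D1 + i];
            for (int j = 0; j < D2; ++j)
                for (int k = 0; k < D3; ++k)
                    Y[(static_cast<size_t>(r) * D2 + j) * D3 + k] +=
                        v * X[(static_cast<size_t>(i) * D2 + j) * D3 + k];
        }
    return Y;
}
// A TCL applies such contractions (with learned factor matrices) along the modes
// it shrinks, reducing (D1, D2, ...) to a smaller core (R1, R2, ...), while the
// batch and channel modes are left untouched.
```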

Other Knowledge:

  • Fully-connected layers hold over 80% of the parameters.

Useful reference:

Software:

  • AlexNet
  • VGG

Dataset:

  • CIFAR100
  • ImageNet