“A Unified Optimization Approach for Sparse Tensor Operations on GPUs” — Bangtian Liu et al. 2017

Paper: [Link]

Code: N/A

Features:

  • Optimized sparse MTTKRP and TTM on GPUs
  • Can be extended to high-order tensor computations
  • Speedup:
    • SpTTM and SpMTTKRP run up to 3.7 and 30.6 times faster, respectively, than ParTI-GPU on an NVIDIA Titan X GPU.
    • CANDECOMP/PARAFAC (CP) decomposition achieves up to 14.9 times speedup over SPLATT.
  • My thought:
    • F-COO cannot replace the COO format; a separate F-COO copy is needed for each mode.
    • sf may need to record every starting index in the F-COO format.

Findings:

  • Proposes a unified sparse tensor representation called F-COO (Flagged-COOrdinate), whose layout is based on the tensor mode of the sparse tensor computation.
    • memory efficient
    • can be used as a unified storage format across different sparse tensor operations
    • Each computation mode needs its own correspondingly constructed F-COO representation.
    • The F-COO representations for the different modes are preprocessed on the host and transferred to the GPU only once, at the beginning.
  • F-COO details:
    • Take mode-3 SpTTM as an example (a construction sketch follows the Findings list):
      • “val”: stores the nonzero values.
      • “k”: the product-mode indices, stored as in COO, to guide the Kronecker or Hadamard product operations.
      • bf (bit-flag): marks changes in the index-mode indices, which show that the computation has switched to another fiber (in SpTTM) or another slice (in SpMTTKRP).
        • The bf value changes from 1 to 0 when the i or j value changes.
        • The number of nonzeros processed per thread depends on the data type chosen for the bf array; for example, with uint8_t (unsigned char) each thread processes 8 nonzeros.
      • sf (start-flag): indicates whether the partition processed by the current thread starts new index-mode indices relative to the previous thread's partition.
  • Unified parallel algorithms for different sparse tensor operations, not sensitive to mode changes
    • The one-shot algorithm for MTTKRP seems to have been used in prior work already.
    • They avoid large intermediate data because each thread block handles one column of the factor matrices instead of a row.
    • Each thread block also writes to a different global memory location.
  • Segmented scan is used, driven by the bf and sf information (see the second sketch after this list).
  • GPU parallelization:
    • Launch two-dimensional thread grids with one-dimensional thread blocks, instead of one-dimensional thread grids with two-dimensional thread blocks (as in ParTI), to avoid thread divergence inside a warp and strided memory accesses.
    • This partitioning strategy moves load imbalance to between thread blocks instead of between threads within a block.
  • Use GPU-specific optimizations: kernel fusion, warp shuffle, and data reuse
    • Use Read-Only Data-Cache for factor matrices
    • The product, segmented-scan, and accumulation kernels are fused to increase data reuse and keep intermediate data in shared memory.
    • Warp shuffle is used to increase data sharing inside a warp: it enables register-to-register data exchange, which reduces the shared-memory footprint and avoids overusing shared memory.
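
To make the F-COO layout concrete, here is a minimal NumPy sketch of how the mode-3 arrays (val, k, bf, sf) could be built from sorted COO data. The function name, the 0/1 polarity of bf (1 marking the first nonzero of a fiber), and the fiber_ids array used to place output rows are my own illustrative choices, not the paper's exact implementation.

    import numpy as np

    def build_fcoo_mode3(coords, values, nnz_per_thread=8):
        """Build F-COO-style arrays for mode-3 SpTTM from COO data.

        coords: (nnz, 3) int array of (i, j, k) indices, assumed sorted so
                that nonzeros of the same (i, j) fiber are contiguous.
        values: (nnz,) array of nonzero values.
        nnz_per_thread: nonzeros handled per thread (8 when bf is packed
                into one uint8_t per thread).
        """
        nnz = values.shape[0]
        i_idx, j_idx, k_idx = coords[:, 0], coords[:, 1], coords[:, 2]

        # bf: here 1 marks the first nonzero of each (i, j) fiber, 0 otherwise.
        # The paper packs these bits into a compact bit array; an unpacked 0/1
        # vector is used for readability, and the exact 0/1 convention may differ.
        bf = np.ones(nnz, dtype=np.uint8)
        same_fiber = (i_idx[1:] == i_idx[:-1]) & (j_idx[1:] == j_idx[:-1])
        bf[1:][same_fiber] = 0

        # sf: one flag per thread-sized partition, 1 if the partition begins a
        # new fiber (so no partial result carries over from the previous thread).
        n_threads = (nnz + nnz_per_thread - 1) // nnz_per_thread
        sf = np.array([bf[t * nnz_per_thread] for t in range(n_threads)],
                      dtype=np.uint8)

        # val and the product-mode index k stay exactly as in COO.  The (i, j)
        # of each fiber's first nonzero is kept only to place output rows; how
        # the paper recovers output positions is not reproduced here.
        fiber_ids = coords[bf.astype(bool), :2]
        return values.copy(), k_idx.copy(), bf, sf, fiber_ids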
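
A second sketch below shows how bf and sf could drive the segmented, per-thread accumulation for mode-3 SpTTM, i.e. Y(i, j, :) = sum_k X(i, j, k) * U(k, :) with the dense factor U stored one row per k index. This is a sequential simulation of the algorithmic idea only; the real GPU kernel performs the cross-thread combination with a segmented scan plus atomic updates, warp shuffles, and kernel fusion, none of which are modeled here. It consumes arrays shaped like those returned by build_fcoo_mode3 in the previous sketch.

    import numpy as np

    def spttm_mode3_fcoo(vals, k_idx, bf, sf, fiber_ids, U, nnz_per_thread=8):
        """Sequentially simulate segmented mode-3 SpTTM over F-COO-style arrays.

        Each simulated thread takes nnz_per_thread nonzeros, multiplies every
        nonzero by the matching row of the dense factor U (the product step),
        and accumulates within a fiber; bf marks fiber boundaries, and sf tells
        a thread whether its first segment continues the previous thread's
        fiber (the case that needs an atomic update or cross-thread scan on
        the GPU).
        """
        nnz, rank = vals.shape[0], U.shape[1]
        Y = np.zeros((fiber_ids.shape[0], rank))  # one output row per fiber

        fiber = -1                                # fiber currently being built
        n_threads = (nnz + nnz_per_thread - 1) // nnz_per_thread
        for t in range(n_threads):
            start = t * nnz_per_thread
            stop = min(start + nnz_per_thread, nnz)
            assert sf[t] == bf[start]             # sf mirrors bf at chunk starts
            partial = np.zeros(rank)
            for n in range(start, stop):
                if bf[n] == 1:                    # a new fiber starts here
                    if fiber >= 0:
                        Y[fiber] += partial       # flush the finished segment
                        partial[:] = 0.0
                    fiber += 1
                partial += vals[n] * U[k_idx[n]]  # product step
            # Trailing segment of the chunk: if the next thread's sf is 0 it
            # continues this same fiber, so on the GPU this add must be atomic
            # (or handled by a segmented scan across threads).
            Y[fiber] += partial
        return Y

A quick way to sanity-check both sketches is to compare Y against a direct loop over the COO nonzeros of a small random tensor.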

Other Knowledge:

  • Challenges in optimizing sparse tensor operations on GPUs, such as i) finding a good parallelization granularity, ii) reducing storage costs (large intermediate data) and irregularities in memory accesses, and iii) dealing with atomic updates.
  • Prior work has used fiber- or slice-level computations as the granularity for parallelization.
    • However, such approaches lead to noticeable load imbalance between threads on the GPU because of the sparse nature of the tensors.
    • Also, these optimizations do not deliver consistent performance across different modes and ranks.

Dataset:

  •  FROSTT: brainq, nell1, nell2, delicious

Platform:

  • Intel Core i7-5820K CPU
  • NVIDIA GeForce GTX Titan X platform

Comparison Software:

  • ParTI
  • SPLATT