- Optimized sparse MTTKRP and TTM on GPUs
- Can be extended to high-order tensor computations
- Speeds up SpTTM and SpMTTKRP by up to 3.7 and 30.6 times, respectively, on NVIDIA Titan X GPUs over ParTI-GPU.
- CANDECOMP/PARAFAC (CP) decomposition achieves up to a 14.9 times speedup over SPLATT.
- My thought:
- F-COO cannot replace the COO format outright; a separate F-COO must be built for each mode.
- sf may need to record every starting index in the F-COO format.
- Proposes a unified sparse tensor representation called F-COO (Flagged-COOrdinate), which is organized around the tensor mode being computed on.
- memory efficient
- can be used as a unified storage format across different sparse tensor operations
- Each computation needs a correspondingly constructed F-COO representation.
- F-COO is preprocessed for the different modes on the host and transferred to the GPU only once, at the beginning.
- F-COO details:
- take mode-3 SpTTM as an example
- “val”: store nonzero values
- “k”: the product mode’s indices, stored as in COO format to guide the Kronecker or Hadamard product operations
- bf (bit-flag): represents changes in the index modes, which signal that the computation has switched to another fiber (in SpTTM) or another slice (in SpMTTKRP).
- The bf value changes from 1 to 0 when a change in the i or j index occurs.
- The number of nonzeros processed per thread depends on the data type chosen for the bf array. For example, with uint8_t (unsigned char) for bf, each thread processes 8 nonzeros.
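- A concrete illustration of that point (my own sketch, not the paper's code): packing one flag per nonzero into a uint8 bf array gives each thread an 8-nonzero partition.

```python
import numpy as np

def pack_bf(flags):
    """Pack per-nonzero bit-flags into uint8 words (8 flags per byte)."""
    pad = (-len(flags)) % 8                      # pad to a multiple of 8
    bits = np.concatenate([np.asarray(flags, dtype=np.uint8),
                           np.zeros(pad, dtype=np.uint8)])
    return np.packbits(bits)                     # big-endian bit order

flags = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]           # 10 nonzeros
bf = pack_bf(flags)
# 10 nonzeros at 8 per byte -> 2 bytes, i.e. 2 thread partitions
assert len(bf) == 2
```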
- sf (start-flag): indicates whether the partition processed by the current thread starts new index-mode indices relative to the previous thread.
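- A minimal host-side sketch of the preprocessing for mode-3 SpTTM (my reconstruction from the description above; the exact flag conventions are assumptions): nonzeros are sorted by the index modes (i, j), bf is flipped to 0 where the fiber changes, and sf records whether each fixed-size thread partition starts a new fiber.

```python
def build_fcoo_mode3(coords, vals, nnz_per_thread=8):
    """coords: (i, j, k) tuples sorted by index modes (i, j); vals: nonzeros.
    Assumed conventions: bf flips 1 -> 0 where a new fiber starts; sf is 1
    when a thread partition starts a new fiber, 0 when it continues one."""
    n = len(vals)
    bf = [1] * n
    for p in range(1, n):
        if coords[p][:2] != coords[p - 1][:2]:   # (i, j) changed -> new fiber
            bf[p] = 0
    sf = []
    for start in range(0, n, nnz_per_thread):    # one partition per thread
        new = start == 0 or coords[start][:2] != coords[start - 1][:2]
        sf.append(1 if new else 0)
    return {"val": list(vals),
            "k": [c[2] for c in coords],         # product-mode indices (COO style)
            "bf": bf, "sf": sf}
```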
- Unified parallel algorithms for different sparse tensor operations, not sensitive to mode changes
- It seems the one-shot algorithm for MTTKRP has already been used before.
- They avoid large intermediate data because each thread block handles one column of the factor matrices instead of a row.
- And each thread block writes to different global memory locations.
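- A sequential sketch of that column-at-a-time idea (my own illustration, not the authors' kernel; on the GPU each r would map to a thread block writing its own output column):

```python
import numpy as np

def mttkrp_mode1_columnwise(coords, vals, B, C, num_rows):
    """Mode-1 MTTKRP: M[i, r] += X(i,j,k) * B[j, r] * C[k, r] over nonzeros,
    processed one factor column r at a time."""
    R = B.shape[1]
    M = np.zeros((num_rows, R))
    for r in range(R):                          # disjoint output column per "block"
        for (i, j, k), v in zip(coords, vals):
            M[i, r] += v * B[j, r] * C[k, r]    # scalar intermediate, not a length-R row
    return M
```

The per-nonzero intermediate is a scalar rather than a length-R row, and no two "blocks" write the same memory location.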
- Use segmented scan with bf and sf information.
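- A scalar sketch of how the flags drive that scan (my own illustration; the real kernel runs in parallel and uses sf to stitch partial sums across thread partitions). Here bf == 0 is assumed to mark the start of a new fiber, matching the convention noted above.

```python
import numpy as np

def spttm_segmented(vals, k_idx, bf, U):
    """Mode-3 SpTTM, Y(i,j,:) = sum_k X(i,j,k) * U(k,:), as a flag-driven
    segmented sum. Returns one dense row per fiber, in fiber order."""
    rows, acc = [], None
    for p, v in enumerate(vals):
        prod = v * U[k_idx[p]]            # elementwise product step
        if acc is None or bf[p] == 0:     # bf == 0: a new fiber starts here
            if acc is not None:
                rows.append(acc)          # flush the finished fiber
            acc = prod
        else:
            acc = acc + prod              # accumulate within the fiber
    rows.append(acc)
    return np.vstack(rows)
```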
- GPU parallelization:
- Launch two-dimensional thread grids with one-dimensional thread blocks, instead of one-dimensional thread grids with two-dimensional thread blocks (as in ParTI), to avoid thread divergence inside a warp and strided memory accesses.
- This partitioning strategy moves the computational imbalance to between thread blocks instead of between threads within a block.
- Use GPU-specific optimizations: kernel fusion, warp shuffle, and data reuse
- Use Read-Only Data-Cache for factor matrices
- Fuse the product, segmented-scan, and accumulation kernels to increase data reuse and keep intermediate data in shared memory.
- Warp shuffle is used to increase data sharing inside a warp: it enables register-to-register data exchange, which reduces the shared memory footprint and avoids overusing shared memory.
- Challenges in optimizing sparse tensor operations on GPUs include: i) finding a good parallelization granularity, ii) reducing storage costs (large intermediate data) and irregular memory accesses, and iii) dealing with atomic updates.
- Prior work has used fiber- or slice-level computations as the granularity for parallelization.
- However, such approaches lead to noticeable load imbalance between threads on the GPU because of the sparse nature of the tensors.
- Also, the optimizations do not deliver consistent performance across different modes and ranks.
- FROSTT: brainq, nell1, nell2, delicious
- Intel Core i7-5820K CPU
- NVIDIA GeForce GTX Titan X platform