“GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication” — Yuan Tao et al., 2014

Paper: [Link]

Code: N/A


  • This work extends CSB format to GPU
  • Construct an eCSB (expanded CSB) format
  • eCSB SpMV and SpMTV performance is close.
  • Speedup
    • Compared to multicore CSB, 13x faster.
    • eCSB-SpMV is comparable to CUSP’s HYB-SpMV, and outperforms cuSPARSE’s CSR-SpMV by 20%. While for SpMTV, eCSB behaves much better than both of them by 31%-365%.
    • For Bi-conjugate gradient (BCG), eCSb outperforms cuSPARSE and CUSP by 6% and 25% respectively.
  • eCSB-SpMV shows a higher utilization ratio of GPU.
  • This paper doesn’t express the work well, the work deserves better.


  • A straightforward porting of CSB to GPUs is inefficient, because
    • “CSB algorithm performs a complex recursive partition process, uses multiple forms of dynamic parallelism existing among different rows, sub-rows, and sub-blocks.” This may be just programming problems. “Such a combination of various types of dynamic parallelism suggests that this scheme is more proper for coarse grain multithreaded execution on multi-core CPUs.”
    • Such blocks have a varying number of non-zeros and have to be processed as a single unit. As a result, we generally choose to use one warp to handle one block. Given a block with few non-zeros, however, the computing resource is substantially wasted.
    • “It depends on Cilk++ to distribute computing tasks to cores at runtime for dynamic load balancing.” How does GPU dynamic parallelism perform?
  • eCSB format
    • blk_ptr (new added): The offset of the first nonzero in each blockrow; When COO format is used for blocks, no blk_ptr needed.
    • block_ind: Concatenated block row’s index and block column’s index as a single variable
    • comb_lowbits: The intra-block row and column indices for nonzeros concatenated as a single variable.
    • val: nonzeros
    • Its memory footprint is larger than CSB’s according to the paper. But maybe not, when the number of blocks is much smaller than n^2/β^2.
  • hybrid storage format to organize sub-blocks
    • can be selected from ELL, COO, and HYB dynamically at runtime
  • Z-Morton ordering (bit-interleaving) is used inside each block.
  • GPU implementation
    • one thread block to handle one blockrow
    • The threads in a block integrate through the single blockrow.
    • Use atomic operations on shared memory (SpMV) or global memory (SpMTV)
  • Give how to determine the storage format in Sec 5.3
  • Matrices gets lower performance on eCSB is because their most populated rows are much higher than the average value, leading to more frequent collisions in the atomic additions.

Other Knowledge:

  • Existing CSR, ELL, COO formats cannot enable efficient access to nonzeros along both row and column directions.

Useful References:


  • 11 matrices from UF collection

Platform & Software:

  • Intel Core i7-3770
  • Nvidia GTX Titan
  • Nvidia C2075

Comparison Software:

  • CSB for multicore CPU
  • Nvidia cuSPARSE’s CSR-SpMV
  • Nvidia CUSP’s HYB-SpMV
  • SpMTV is implemented by modifying the code of CUSP

One thought on ““GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication” — Yuan Tao et al., 2014

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s