Paper: [Link]

Code: N/A

**Features:**

- This work extends CSB format to GPU
- Construct an eCSB (expanded CSB) format
- eCSB delivers close performance for SpMV and SpMTV, i.e., it supports efficient access along both row and column directions.
- Speedup
- Compared to multicore CSB, 13x faster.
- eCSB-SpMV is comparable to CUSP’s HYB-SpMV and outperforms cuSPARSE’s CSR-SpMV by 20%; for SpMTV, eCSB outperforms both of them by 31%-365%.
- For the bi-conjugate gradient (BCG) solver, eCSB outperforms cuSPARSE and CUSP by 6% and 25%, respectively.

- eCSB-SpMV shows a higher utilization ratio of GPU.
- The paper doesn’t present the work well; the work deserves a better write-up.

**Findings:**

- A straightforward porting of CSB to GPUs is inefficient, because
- “CSB algorithm performs a complex recursive partition process, uses multiple forms of dynamic parallelism existing among different rows, sub-rows, and sub-blocks.” This may just be an engineering issue rather than a fundamental one. “Such a combination of various types of dynamic parallelism suggests that this scheme is more proper for coarse grain multithreaded execution on multi-core CPUs.”
- Such blocks have a varying number of nonzeros and must each be processed as a single unit, so the paper generally uses one warp to handle one block. For a block with few nonzeros, however, the computing resources are substantially wasted.
- “It depends on Cilk++ to distribute computing tasks to cores at runtime for dynamic load balancing.” How does GPU dynamic parallelism perform?

- eCSB format
- blk_ptr (newly added): the offset of the first nonzero in each blockrow; when the COO format is used for blocks, no blk_ptr is needed.
- block_ind: the block-row and block-column indices concatenated into a single variable.
- comb_lowbits: the intra-block row and column indices of each nonzero concatenated into a single variable.
- val: the nonzero values.
- Its memory footprint is larger than CSB’s according to the paper. But maybe not when the number of blocks is much smaller than n^2/β^2.
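The four arrays above can be sketched in a few lines. This is an illustrative construction from COO triplets, assuming sub-blocks are kept in COO-like order inside each blockrow; the bit widths used to pack the concatenated indices (16+16 here) are my assumption, not the paper’s.

```python
def to_ecsb(rows, cols, vals, n, beta=4, lobits=16):
    """Build eCSB-like arrays from COO triplets (illustrative sketch).

    blk_ptr      -- offset of the first nonzero of each blockrow
    block_ind    -- block-row and block-column index packed into one int
    comb_lowbits -- intra-block row and column index packed into one int
    val          -- nonzero values, sorted by (blockrow, blockcol)
    """
    nb = (n + beta - 1) // beta  # number of blockrows
    # Sort nonzeros by (blockrow, blockcol) so each block is contiguous.
    order = sorted(range(len(vals)),
                   key=lambda k: (rows[k] // beta, cols[k] // beta))
    blk_ptr = [0] * (nb + 1)
    block_ind, comb_lowbits, val = [], [], []
    for k in order:
        r, c = rows[k], cols[k]
        br, bc = r // beta, c // beta
        blk_ptr[br + 1] += 1
        block_ind.append((br << lobits) | bc)
        comb_lowbits.append(((r % beta) << lobits) | (c % beta))
        val.append(vals[k])
    for i in range(nb):  # prefix-sum the per-blockrow counts
        blk_ptr[i + 1] += blk_ptr[i]
    return blk_ptr, block_ind, comb_lowbits, val
```

Packing two indices into one variable halves the number of index loads per nonzero, which is the point of the concatenated fields.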

- hybrid storage format to organize sub-blocks
- can be selected from ELL, COO, and HYB dynamically at runtime
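The paper’s actual selection criterion (Sec 5.3) is not reproduced in these notes, so the following threshold-based rule is only a plausible stand-in: ELL pads every row of a block to the maximum row length, so it wins when row lengths are uniform; COO wins when the block is very sparse; HYB splits the difference. The `ell_waste_limit` parameter is my invention.

```python
def choose_block_format(row_lengths, ell_waste_limit=0.5):
    """Illustrative heuristic for picking a per-blockrow storage format.

    row_lengths -- number of nonzeros in each row of the blockrow
    Returns "ELL", "COO", or "HYB".
    """
    nnz = sum(row_lengths)
    if not row_lengths or nnz == 0:
        return "COO"
    padded = max(row_lengths) * len(row_lengths)  # ELL storage size
    waste = 1.0 - nnz / padded                    # fraction of padding
    if waste <= ell_waste_limit:
        return "ELL"              # rows are uniform: little padding
    if nnz < len(row_lengths):
        return "COO"              # too sparse even for HYB's ELL part
    return "HYB"                  # ELL for the regular part, COO for the tail
```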

- Z-Morton ordering (bit-interleaving) is used inside each block.
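Z-Morton ordering interleaves the bits of the row and column index so that sorting by the resulting key follows a Z-shaped space-filling curve, keeping spatially close nonzeros close in memory for both row-wise and column-wise traversal. A minimal sketch (16-bit indices assumed for illustration):

```python
def morton_key(row, col, bits=16):
    """Interleave the bits of (row, col) into a single Z-Morton key."""
    key = 0
    for i in range(bits):
        key |= ((row >> i) & 1) << (2 * i + 1)  # row bit -> odd position
        key |= ((col >> i) & 1) << (2 * i)      # col bit -> even position
    return key
```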
- GPU implementation
- one thread block to handle one blockrow
- The threads in a thread block iterate through the single blockrow.
- Use atomic operations on shared memory (SpMV) or global memory (SpMTV)

- How to determine the storage format is given in Sec 5.3.
- Matrices that get lower performance on eCSB are those whose most populated rows are much denser than the average, leading to more frequent collisions in the atomic additions.

**Other Knowledge:**

- Existing CSR, ELL, COO formats cannot enable efficient access to nonzeros along both row and column directions.
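The asymmetry is easy to see in CSR: a row is a contiguous slice, but a column requires scanning every stored nonzero. A small sketch:

```python
def csr_row(indptr, indices, data, i):
    """Row access in CSR is a contiguous slice: O(nnz in the row)."""
    lo, hi = indptr[i], indptr[i + 1]
    return list(zip(indices[lo:hi], data[lo:hi]))

def csr_col(indptr, indices, data, n_rows, j):
    """Column access in CSR must scan all stored nonzeros: O(nnz)."""
    out = []
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            if indices[k] == j:
                out.append((i, data[k]))
    return out
```

This is why SpMTV on CSR is slow without an explicitly stored transpose, and why a format like (e)CSB that serves both directions is attractive.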

**Useful References:**

- Data reordering before SpMV for very sparse matrices: [Paper: Taming irregular EDA applications on GPUs], ICCAD, 2009
- compress row/column index to reduce memory traffic: [Paper: Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes], SC, 2013

**Dataset:**

- 11 matrices from the University of Florida (UF) sparse matrix collection

**Platform & Software:**

- Intel Core i7-3770
- Nvidia GTX Titan
- Nvidia C2075

**Comparison Software:**

- CSB for multicore CPU
- Nvidia cuSPARSE’s CSR-SpMV
- Nvidia CUSP’s HYB-SpMV
- SpMTV is implemented by modifying the code of CUSP

Paper: “GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication”, Yuan Tao et al., 2014