Paper: [Link]

Code: [Link]

Slides: [Link]

**Features:**

- Introduce CSB (Compressed Sparse Blocks) format
- CSB does not favor rows over columns or vice versa.
- Compared to BCSR which stores a sparse collection of (small) dense blocks, whereas CSB stores a dense collection of sparse blocks.
- It is motivated by multicore and manycore architectures, in which parallelism and memory bandwidth are key resources.
- Storage: The storage requirement for CSB is esssentially the same as that for CSR format, for which computing Ax in parallel is easy but AT x (or equivalently computing xT A) is difficult.

- CSB-SpMV Performance:
- On one processor, the CSB algorithms for Ax and AT x run just as fast as the CSR algorithm (from OSKI) for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
- On multicore platforms, CSB-SpMV runs faster for all the matrices we tested, including the structured ones over Star-P.
- “Its performance is limited by the memory-system bandwidth, not by lack of parallelism.”
- The parallel performance of CSB_SPMV and CSB_SPMV_T is generally not affected by highly uneven row and column nonzero counts.

- Did not employ speculative low-level optimizations such as software prefetching, pipelining, or matrix-specific optimizations such as index and/or value compression, but we believe that CSB and our algorithms should not adversely affect incorporation of these approaches.

**Findings:**

- CSR-SpMV:
- work: Θ(nnz);
- span: Θ(nr+lgn), nr: the average number of nonzeros per row;
- storage: n lg nnz + nnz lg n bits for index data, for n*n sparse matrix with nnz nonzeros.

- CSB format
- block-size parameter β:
- a block size slightly larger than √n delivers reasonable performance, which is at most 10%-15% degradation from the best performance.
- the overall best performance was achieved when β satisfies the equation ⌈lg√n⌉ ≤ lgβ ≤ 3+⌈lg√n⌉.
- CSB constructor can automatically select a reasonable β in this range.

- val[nnz]:
- Store nonzero values in the order of contiguous blocks.
- The ordering among blocks is flexible, row-major ordering is used in the experiments.
- Within a block, Z-Morton ordering is used.
- The choice of layout within the block (Z-Morton or row/column orderings) becomes less significant for sparse blocks as they already do not take full advantage of such features.
- More importantly, a recursive ordering allows us to efficiently determine the four quadrants of a block using binary search, which is crucial for parallelizing individual blocks.

- row_ind[nnz]
- col_ind[nnz]
- They range from 0 to β-1
- As a practical matter, we can pack a corresponding pair of elements of row_ind and col_ind into a single integer word of 2lgβ bits so that they make a single array of length nnz, which is comparable to the storage needed by CSR for the col_ind array

- blk_ptr[n^2/β^2]
- stores the index of each block in the val array, which is analogous to the row_ptr array for CSR

- storage: (n^2/β^2) lg nnz+2 nnz lgβ bits for index data

- block-size parameter β:
- Parallelization of CSB-SpMV
- blockrow splitting, with further dividing long blockrows.
- Specifically, when a blockrow contains Ω(β) nonzeros, we re- cursively divide it “in half,” yielding two subblockrows, each containing roughly half the nonzeros.
- To facilitate fast subblockrow divisions, we first partition the blockrow into “chunks” of consecutive blocks. By dividing a blockrow “in half,” we mean assigning to each subblockrow roughly half the chunks.
- If this chunk contains many blocks, it is guaranteed to contain at most Θ(β) nonzeros, which is sufficiently sparse to perform the serial multiplication;
- If, on the other hand, the chunk is a single block, it may contain as many as β2 ≈ n nonzeros. Instead, we perform the parallel block-vector multiplication CSB_BLOCKV.

- For each block, use function CSB_BLOCKV
- The block-vector multiplication proceeds by recursively dividing the (sub)block M into quadrants M00, M01, M10, and M11, each of which is conveniently stored contiguously in the Z-Morton-ordered val, row_ind, and col_ind arrays between indices start and end. We perform binary searches to find the appropriate dividing points in the array.

- To avoid the races that might arise due to write conflicts between the subblockrows, we allocate a temporary vector to store the result of one of the subblockrows and allow the other subblockrow to use the output vector. After both subblockrow multiplications complete, we serially add the tempo- rary vector into the output vector.
- when the nonzeros are evenly distributed across the block- rows, our CSB-SpMV algorithm directly calls the serial multiplication instead.

- CSB-SpMV
- work: Θ(n^2/β^2 + nnz)
- span: O(β lg(n/β) + n/β)
- choosing β = Θ(√n), CSB_SPMV runs with work Θ(nnz) and span O(√nlgn), achieving a parallelism of Ω(nnz /√n lg n). storage:

- extra storage (for chunk array and the temporary vector z): O(P√nlgn) space when β = Θ(√n)

**Other Knowledge:**

- Introduce CSR format history in P234 right column third paragraph
- Review CSR-SpMV parallelization strategies on Page 235 left column last five paragraphs and right column first three paragraphs.
- Many or most of the individual blocks Aij are hypersparse, that the ratio of nonzeros to matrix dimension is asymptotically 0. Thus, the space to store a block should therefore depend only on its nonzero count, not on its dimension.
- Morton-hybrid layout: stores elements in row-major order within blocks and a Morton ordering among blocks. [Paper: Analyzing block locality in Morton-order and Morton-hybrid matrices]

**Useful References:**

- A data structure similar to CSB in the context of cache blocking: [Paper: When cache blocking of sparse matrix vector multiply works and why]

**Dataset:**

- 14 matrices from UF collection

**Platform & Software:**

- Platforms: AMD Opteron 8214 (Santa Rosa), Intel Xeon X5460 (Harpertown), Intel Core i7 920 (Nehalem)
- values: double-precision floating points; indices: 32-bit unsigned integers.
- Language: Cilk++

**Comparison Software:**

- “pure” serial OSKI: without enabling OSKI’s preprocessing step, which chooses blockings for cache and register usage that are tuned to a specific matrix.
- Star-P: distributed-memory code

**Cited by:**

- “Merge-based Parallel Sparse Matrix-Vector Multiplication” — Duane Merrill et al. 2016
- “GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication” — Yuan Tao et al., 2014
- Aydin Buluc et al., [Paper: Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication], IPDPS, 2011
- modify CSB by using bitmasked register blocks
- pack the nonzeros of a register block contiguously, but use a bitmask to mark their positions in the dense block
- perform flops on explicit zeros, but never transfer them from memory, => increase the number of arithmetic operations, decrease the bandwidth requirements, the latency of accessing vectors stays constant.
- data structure changes:
- low_ind holds only one entry per register block, irrespective of the number of actual nonzeros it contains
- An additional array with the same length, names mask_arr, hols the bit mask structure of the register block

- Use SSE to implement this.
- Testing matrices get similar performance or performance improvement from bitmasked register blocks by up to 3x for non-symmetric matrices.

- modify CSB by using bitmasked register blocks

Advertisements