“Merge-based Parallel Sparse Matrix-Vector Multiplication” — Duane Merrill et al. 2016

Paper: [Link]

Slide: [Link]

Code: [Link]


  • Overall, better average performance compared to Nvidia cuSPARSE, Intel MKL and other implementations. Performance is worse for some matrices with even rows.
  • The first work adapted merge-based parallel decomposition for computing a well-balanced SpMV directly on CSR matrices without offline analysis, format conversion, or the construction of side-band data (aka preprocessing, inspection, reformatting, or supplemental encoding). (nz_indices used in the merge-based algorithm is nnz location indices, which is actually natural numbers from 0 to nnz-1, not extra storage is needed for it.) This decomposition is particularly useful for bounding workloads across multi-scale processors and systems with fixed-size local memories.
  • The merge-based decomposition is orthogonal to (and yet supportive of) bandwidth optimization strategies such as index compression, blocking, and relabeling. In fact, the elimination of workload imbalance from our method is likely to amplify the benefits of these techniques.
  • Explore NUMA opportunities for multi-socket systems.
  • Merge-based method allows us to make maximal utilization of the GPU’s fixed-size shared memory resources, regardless of the ratio between row-offsets and nonzeros consumed during each path-chunk. This is not possible for row-based or nonzero-splitting strategies that are unable to provide tight bounds on local workloads.
  • Binary searching and fix-up are counted as part of SpMV running time, while NUMA page migration as preprocessing overhead.
  • The row-length imperviousness (the correlation between GFLOPs throughput vs row-length variation) and the performance predictability (the correlation between elapsed running time vs matrix size nnz) are GOOD (Fig. 9).


  • Adapt a simple parallelization strategy originally developed for efficient merging of sorted sequences [Paper]
  • The central idea is to frame the parallel CsrMV decomposition as a logical merger of two lists: the row descriptors and the indices of the nonzero values.
  • Merge-based CsrMV
    • Two stages: partitioning and merging stages.
    • The two elements Ai and Bj scheduled to be compared at diagonal k can be found via constrained binary search along that diagonal: find the first (i,j) where Ai is greater than all of the items consumed before Bj, given that i+j=k.
    • this parallelization strategy is that the decision path can be partitioned hierarchically, trivially enabling parallelization across large, multi-scale systems. work complexity: O(N+plogN); time complexity: O(N/p+logN).

Other Knowledge:

  • Specialized or supplementary formats ultimately burden the application with both
    • (1) preprocessing time for inspection and formatting, which may be tens of thousands of times greater than the SpMV operation itself;
    • (2) excess storage overhead because the original CSR matrix is likely required by other routines and cannot be discarded.
  • Specialized sparse matrix formats:
    • padding (e.g. ELL) and/or reorganization strategies (map similarly-sized rows onto co-scheduled thread groups, e.g. PKT, SELL),
    • reduce memory I/O via index compression (e.g. blocking, bit-encoding, BCCOO),
    • hybrid and multi-level formats (e.g. pOSKI, CSB, HYB, Cocktail).
  • CsrMV parallelization strategies:
    • Row splitting:
      • The splitting of long rows can be done statically via a preprocessing step that encodes the dataset into an alternative format, or dynamically using a work-sharing runtime. The dynamic variant requires runtime task distribu- tion, a behavior likely to incur processor contention and limited scalability on massively parallel systems.
      • Vectorization is a common variant of row-splitting in which a group of threads is assigned to process each row. Maybe significant underutilization in the form of inactive SIMD lanes when row-lengths do not align with SIMD widths. While adaptively vectorize based on row length require significant underutilization in the form of inactive SIMD lanes when row-lengths do not align with SIMD widths.
    • Nonzero splitting: need to perform the searching within the row-offsets array as an offline preprocessing step.

COO SpMV optimizations:


  • 4201 non-vector, non-complex sparse matrices from University of Florida Sparse Matrix Collection


  • dual-socket NUMA CPU platform, with two Intel Xeon E5-2690v2 CPUs
  • one NVIDIA Tesla K40 GPU
  • Test bandwidth: stream triad score [Webpage]

Comparison Software:

  • Intel MKL v11.3
  • NVIDIA’s cuSPARSE v7.5
  • Other non-CSR formats expressly designed to tolerate row-length variation:
    • CSB (Compressed Sparse Blocks) [Paper]
    • HYB, also from NVIDIA’s cuSPARSE library, HybMV [Paper]
    • pOSKI [Webpage]
    • yaSpMV: BCCOO (blocked compressed common coordinate) [Paper]

One thought on ““Merge-based Parallel Sparse Matrix-Vector Multiplication” — Duane Merrill et al. 2016

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s