“Highly Scalable Near Memory Processing with Migrating Threads on the Emu System Architecture” — Timothy Dysart et al., 2016

Paper: [Link]

Optimization Techniques:

  • Launch a large number of threads
  • Careful thread control: spawning and synchronization
  • Load balancing
  • Identify program hotspots
  • Manage data locality
  • Have each thread migration handle a fair amount of data.

Features (Introduce EMU architecture):

  • System architecture
    • Nodelet architecture
      • multiple Gossamer Cores (GCs)
        • supports 64 concurrent threads
        • OS launches the main() user program on a GC; threads return to main() upon completion, which then returns to the OS
        • Internally, the GCs implement fine-grained multithreading and issue a single instruction each cycle. Each thread is limited to one active instruction at any given time.
          • deeply pipelined
          • accumulator + 16 registers
          • Since the system is expected to maintain large numbers of active threads, limiting the execution rate of a single thread should have a minimal performance impact.
          • Limiting each thread to a single active instruction reduces the amount of logic needed by each GC, as features like branch prediction and data hazard mitigation are not required.
        • Do not include data caches; memory transactions are either atomics or done as loads and stores between the thread registers and memory.
      • a Nodelet Queue Manager (NQM)
        • moving packets to and from the Migration Engine, GCs, and Memory Front End.
        • All packets coming into the nodelet are processed by the NQM and placed into one of three queues: Run, Migrate, or Memory.
          • Run Queue holds threads awaiting processing on the nodelet. The Run Queue is continuously sorted by the NQM to keep it approximately ordered by thread priority with higher priority threads moving towards the head of the queue.
          • Migrate Queue holds packets leaving the nodelet until accepted by the Migration Engine,
          • Memory Queue holds remote operations until accepted by the Memory Front End. Once a remote operation is accepted by the Memory Front End, the NQM generates the (optional) acknowledgment packet and sends it to the Migration Engine.
      • a Memory Front End (MFE)
        • When a thread performs a memory transaction on a GC, the transaction is submitted to MFE.
        • Atomic operations are performed within the MFE
      • the Narrow Channel DRAM (NCDRAM) system.
        • Avoids the inefficiency of a wide interface for fine-grained accesses by using an 8-bit interface instead of a 64-bit interface.
    • Node architecture
      • ME: Eight nodelets are replicated and combined with a Migration Engine (ME) to form the majority of the NLS (NodeLet Set) portion of the node.
        • a crossbar with a direct connection to the Nodelet Queue Manager within each nodelet, a Stationary Core Proxy, and six RapidIO interfaces.
      • The NLS also maintains two PCI Express interfaces: one provides an interface to the Stationary Processor (SP) and the other can optionally be used for connecting other PCI Express cards (e.g., InfiniBand).
      • The Stationary Processor portion maintains the Stationary Cores (SCs), a solid state drive, a boot NVRAM, and a DRAM for code and data that is private to the SP.
        • SP contains conventional 64-bit processing cores running Linux
    • system interconnect and organization
      • The system interconnect between nodes in the Emu architecture utilizes the Serial RapidIO (SRIO) network standard
      • The Emu system has been architected to guarantee in-order delivery of remote operations to ensure that remote operations sent by a single thread to the same address occur in the correct order.
    • System specification
      • The NodeLet Set is currently implemented on an Altera Arria10 FPGA with an expectation of 4 Gossamer Cores per nodelet
      • The NCDRAM system is implemented with memory devices using DDR4- 2133 and provides a capacity of 8GB per nodelet (64 GB/node, 512 GB in the 8 node system)
      • The Stationary Processors are NXP (née Freescale) T1024 processors with PowerPC e5500 cores.
      • The system interconnect uses 4-lane RapidIO Gen2 which provides a full-duplex bandwidth of 5 GB/s per link (2.5 GB/s each direction).
      • Future generations will replace the FPGA with an ASIC, use RapidIO Gen 3 with additional ports per NLS, and use new memory technologies (e.g., Hybrid Memory Cubes).
  • Programming model
    • Address space
      • PGAS (global address space) system with memory contributed by each nodelet
      • “Replicated” space:
        • ensure data locality
        • is particularly valuable for providing local copies of global variables, important constants, and pointers into application data structures that are dispersed across the system.
        • Instruction code is always maintained in replicated memory so that a thread can always find its code locally without knowledge of where the thread is executing
    • Threads and ISA features
      • Thread migration:
        • completely in hardware, normally requires no software intervention
        • one thread returns values to its parent and then dies.
        • #Threads limited only by available memory
        • When a thread blocks, it may voluntarily place itself at the back of the run queue instead of busy-waiting.
        • if the memory access is not local to the current nodelet, the thread context is removed from the current nodelet, transmitted across the system interconnect to the correct nodelet, and restarted there.
      • Thread contexts in the Emu system are compact, typically 10-20 64-bit words, and consist of the thread status word (TSW), address register, and assorted data registers.
      • ISA supports
        • single instruction spawns
        • a rich set of atomic instructions
        • Remote operations:
          • Stores and atomic instructions that do not require returning a value to the thread can be handled as remote operations.
          • The thread does not migrate but instead generates a short packet with the data, operation to be executed, and the address to be updated.
          • If desired, the remote nodelet sends an acknowledgment packet back to the sending thread. This is done when the sender wishes to know that an operation or set of operations is complete and globally visible before proceeding to a new phase of the algorithm.
          • Remote operations are especially advantageous for threads that need to update information in a “fire-and-forget” manner, as they do not return any data to the sending thread.
    • Cilk parallelism
      • To fully utilize the Emu system, mechanisms for expressing parallelism and creating large numbers of threads are required.
      • cilk_for loops; grain size can be controlled with #pragma cilk grainsize = 4 (see the sketch after this list)
      • Dynamic spawning (cilk_spawn) is also supported
    • Intrinsics and libraries
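
A minimal sketch (mine, not from the paper) of a Cilk parallel loop using the grain-size pragma noted above; Emu-specific memory-placement intrinsics are omitted and the array update is a made-up placeholder:

```cpp
#include <cilk/cilk.h>

// Scale every element of a (distributed) array in parallel. On Emu, touching
// data[i] may migrate the executing thread to the nodelet that owns that
// element; the code itself contains no explicit communication.
void scale(long *data, long n, long factor) {
    #pragma cilk grainsize = 4          // each spawned thread handles ~4 iterations
    cilk_for (long i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}
```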

Benchmarks:

  • Random access (aka “GUPS”), mainly used to show the results.
  • Breadth-First Search (BFS)
  • sparse matrix-vector multiply (SpMV)
  • particle swarm

Supported Language:

  • C, Cilk

Comparison Architectures:

  • LCDM: loosely coupled distributed memory
  • TCDM: tightly coupled distributed memory
  • SM: shared memory
  • GUPS per kWatt, compared against IBM POWER7, BG/Q, and x86 systems.

Questions:

  • Floating-point support?
  • How does cilk_for behave on this system?
  • Can remote operations be employed automatically?

“The Tensor Algebra Compiler” — Fredrik Kjolstad et al., 2017

Paper: [Link]

Code: [Link]

Features:

  • For sparse, semi-sparse tensors and sparse, semi-sparse matrices
    • semi-sparse: each mode is independently designated as either sparse or dense, so some modes may be dense while others are sparse.
  • Support CSR, DCSR, BCSR (for matrices) and CSF (for tensors)
    • Need to specify the order of dimensions
    • Supports MTTKRP with sparse matrices; this may be useful later.
  • No auto-tuning now, decision is left to the user.
  • Evaluate SpMV, compound linear algebra, high-order tensor algebra (e.g. MTTKRP, TTM, TTV, INNERPROD, ELEMENTWISE-PLUS)
  • Doesn’t support parallel execution if the output is sparse.
  • The “assemble” method assembles the sparse index structure of the output tensor and preallocates memory for it (see the API sketch below).
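
A minimal sketch of how these pieces fit together using the taco C++ API (names follow the taco documentation; exact calls may differ between versions, and the inserted values are placeholders):

```cpp
#include "taco.h"
using namespace taco;

int main() {
    // Per-dimension format descriptors: CSR = (dense rows, sparse columns).
    Format csr({Dense, Sparse});
    Format dense_vec({Dense});

    Tensor<double> A("A", {3, 4}, csr);
    Tensor<double> x("x", {4}, dense_vec);
    Tensor<double> y("y", {3}, dense_vec);

    A.insert({0, 1}, 2.0);   // populate a few nonzeros
    A.insert({2, 3}, 5.0);
    A.pack();                // build the compressed index structures
    x.insert({1}, 1.0);
    x.pack();

    IndexVar i, j;
    y(i) = A(i, j) * x(j);   // tensor index notation: SpMV

    y.compile();   // generate the fused kernel
    y.assemble();  // assemble y's index structure and preallocate memory
    y.compute();   // run the generated kernel
    return 0;
}
```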

Findings:

  • “tensor storage formats that separately designate each dimension as dense or sparse and specify their storage order, which can describe several widely used formats but generalizes to many more (Section 3);
  • iteration graphs that describe how to co-iterate through the multi-level index data structures of the sparse operands in a compound tensor expression (Section 4);
    • Each node corresponds to a loop in a loop nest.
    • A tensor path is a tuple of index variables.
    • Do not support cycles, because back edges require traversal in the wrong direction of a format. The solution now is to provide a function to change tensor formats that can be used to break cycles and transpose tensors.
  • merge lattices that describe how to merge the index data structures of sparse tensors that are used in the same tensor expression (Section 5);
  • a code generation algorithm that uses the above concepts to generate efficient code that computes a tensor expression with dense, sparse, and mixed operands (Section 6).”
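
To make the iteration-graph and merge-lattice ideas concrete, here is a hand-written sketch (not actual taco output) of the two-way co-iteration pattern such a compiler generates for adding two sparse vectors stored in compressed index/value form:

```cpp
#include <algorithm>
#include <vector>

// c = a + b for two sparse vectors stored as (idx, val) arrays sorted by index.
// The branches correspond to the merge-lattice points: both inputs have a
// nonzero at i, only a does, or only b does.
void sparse_vec_add(const int *a_idx, const double *a_val, int a_nnz,
                    const int *b_idx, const double *b_val, int b_nnz,
                    std::vector<int> &c_idx, std::vector<double> &c_val) {
    int pa = 0, pb = 0;
    while (pa < a_nnz && pb < b_nnz) {            // co-iterate both sparse levels
        int i = std::min(a_idx[pa], b_idx[pb]);
        double v = 0.0;
        if (a_idx[pa] == i) v += a_val[pa];
        if (b_idx[pb] == i) v += b_val[pb];
        c_idx.push_back(i);
        c_val.push_back(v);
        if (a_idx[pa] == i) ++pa;                 // advance the operands that produced i
        if (b_idx[pb] == i) ++pb;
    }
    while (pa < a_nnz) { c_idx.push_back(a_idx[pa]); c_val.push_back(a_val[pa]); ++pa; }  // tail: only a
    while (pb < b_nnz) { c_idx.push_back(b_idx[pb]); c_val.push_back(b_val[pb]); ++pb; }  // tail: only b
}
```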

Motivation:

  • “First, sparse input tensors are compressed into indexed data structures that the kernels must operate on. Second, only non-zero output values should be produced, and the calculations differ between computed components depending on the input tensors that contribute non-zero values. Finally, sparse tensor data structures typically do not support O (1) random access, so kernels must carefully orchestrate co-iteration over multiple tensors.”
    • “Sparse data structures can only be accessed efficiently in one direction.” (Why does the SPLATT CSF code achieve similar performance for each mode?)
    • “The number of formats increases exponentially as dimensionality increases by d!2^d.”
    • “A sparse storage dimension does not allow efficient random access to indices and values but is optimized for iteration in a specific order.”
  • “Libraries provide a limited set of hand-optimized operations and programmers compute compound operations through a sequence of supported operations using temporary tensors. This reduces locality and efficiency, and for some compound operations the temporaries are much larger than the kernel’s inputs and output. On the other hand, it is infeasible to write optimized code for every compound operation by hand because of the combinatorial explosion of all possible compound operations, tensor orders, and tensor formats.”
  • “Our technique generates code entirely from tensor index notation and simple format descriptors that fully specify the compressed data structures. Thus, we do not perform pointer alias or dependency analysis at compile time, nor at runtime as in inspector-executor systems.”

Useful References:

Dataset:

  • SuiteSparse Matrix Collection
  • FROSTT

Platform & Software:

  • C++ library

Comparison Software:

  • A batch of sparse matrix libraries (e.g. OSKI, pOSKI, MKL, Eigen, uBLAS, Gmm++)
  • Tensor: SPLATT, Tensor Toolbox

Questions:

  • why “384 possible kernels for all the combinations of formats and implementations” (p.3)?
  • Why is the serial CSR SpMV code from taco faster than other libraries, given that taco does not apply any particular optimization?
  • Why is the serial dense-blocked SpMV taco code in Fig. 14 slower than OSKI, while its parallel code is faster?

“Parallel Tensor Compression for Large-Scale Scientific Data” — Woody Austin et al., 2016

Paper: [Link]

Code: [Link]

Features:

  • dense tensors
  • The first distributed Tucker decomposition using MPI
  • Tucker decomposition (optimized ST-HOSVD and HOOI algorithms)
  • Uses nonstandard data layouts: the specified data distribution for tensors avoids any tensor data redistribution, either locally or in parallel.
  • It achieves near peak performance, as high as 66%, on a single node consisting of 24 cores and up to 17% of peak on over 1000 nodes.
  • Give thorough computation and communication time analysis

Findings:

  • The algorithm works for dense tensors of any order (i.e., number of dimensions) and size, given adequate memory, e.g., three times the size of the data.

Useful References:

Dataset:

Platform & Software:

Comparison Software:

“Accelerating the Tucker Decomposition with Compressed Sparse Tensors” — Shaden Smith et al., 2017

Paper: [Link]

Code: Will be in SPLATT

Features:

  • sparse tensors
  • TTMc in CSF format on multi-core CPU architecture (no Tucker decomposition result shown)
  • Achieves a 20.7x speedup while mostly using less memory (up to 28.5x)
  • Fig. 7: good analysis for time and memory!

Findings:

  • Similar to SPLATT, it reduces time complexity by factoring the computation along the CSF tree levels.
  • The intermediate memory is limited to a single row of Y(n).
  • Parallelized by distributing the I1 (first mode) trees to threads.
  • Synchronization uses mutexes to handle write conflicts when updating any mode other than the first.
  • Using multiple CSF representations
  • Heuristic mode ordering, assuming K CSF representations can be stored:
    • For X1, sort the modes by their lengths, with the shortest mode placed at the top level.
    • The remaining K−1 representations are selected greedily: at step k, use (4) to examine the cost of TTMc for each mode when provided with X1, ..., X(k−1). The mode with the highest cost is placed at the top level of Xk, and the remaining modes are sorted by increasing length.
    • At the end of this procedure, each mode has a choice of K representations to use for TTMc; each mode is assigned to the representation with the lowest cost.
    • Importantly, if ties are broken in a consistent manner, in practice several modes end up assigned to the same Xk, so fewer than K representations need to be kept in memory for computation.

Useful References:

Dataset:

  • Nell-2, Enron [FROSTT]
  • Netflix (not public), Alzheimer (From Muthu Baskaran)
  • Synthetic: Poisson3D, Poisson4D

Platform & Software:

  • Intel Haswell E5-2680v3
  • OpenMP
  • Intel compiler with Intel MKL

Comparison Software:

  • HyperTensor: HT-FLAT, HT-BTREE

 

“GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication” — Yuan Tao et al., 2014

Paper: [Link]

Code: N/A

Features:

  • This work extends CSB format to GPU
  • Construct an eCSB (expanded CSB) format
  • eCSB SpMV and SpMTV performance is close.
  • Speedup
    • Compared to multicore CSB, 13x faster.
    • eCSB-SpMV is comparable to CUSP’s HYB-SpMV, and outperforms cuSPARSE’s CSR-SpMV by 20%. While for SpMTV, eCSB behaves much better than both of them by 31%-365%.
    • For Bi-conjugate gradient (BCG), eCSB outperforms cuSPARSE and CUSP by 6% and 25%, respectively.
  • eCSB-SpMV shows a higher utilization ratio of GPU.
  • The paper does not present the work well; the work deserves better.

Findings:

  • A straightforward porting of CSB to GPUs is inefficient, because
    • “CSB algorithm performs a complex recursive partition process, uses multiple forms of dynamic parallelism existing among different rows, sub-rows, and sub-blocks.” This may just be an implementation issue. “Such a combination of various types of dynamic parallelism suggests that this scheme is more proper for coarse grain multithreaded execution on multi-core CPUs.”
    • Such blocks have a varying number of non-zeros and have to be processed as a single unit. As a result, we generally choose to use one warp to handle one block. Given a block with few non-zeros, however, the computing resource is substantially wasted.
    • “It depends on Cilk++ to distribute computing tasks to cores at runtime for dynamic load balancing.” How does GPU dynamic parallelism perform?
  • eCSB format (see the struct sketch after this list)
    • blk_ptr (newly added): the offset of the first nonzero in each blockrow; when COO format is used for the blocks, no blk_ptr is needed.
    • block_ind: the block-row and block-column indices concatenated into a single variable
    • comb_lowbits: the intra-block row and column indices of the nonzeros concatenated into a single variable.
    • val: the nonzero values
    • Its memory footprint is larger than CSB’s according to the paper. But maybe not when the number of blocks is much smaller than n^2/β^2.
  • hybrid storage format to organize sub-blocks
    • can be selected from ELL, COO, and HYB dynamically at runtime
  • Z-Morton ordering (bit-interleaving) is used inside each block.
  • GPU implementation
    • one thread block to handle one blockrow
    • The threads in a block iterate through the single blockrow.
    • Use atomic operations on shared memory (SpMV) or global memory (SpMTV)
  • Give how to determine the storage format in Sec 5.3
  • Matrices that get lower performance with eCSB are those whose most populated rows are much longer than average, leading to more frequent collisions in the atomic additions.
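
A hypothetical sketch of the eCSB arrays as described above; field widths and exact packing are my assumptions, not taken from the paper:

```cpp
#include <cstdint>
#include <vector>

// Sketch of the expanded CSB (eCSB) storage described in the notes.
struct eCSB {
    // Offset of the first nonzero of each blockrow (omitted when the blocks
    // themselves are stored in COO form).
    std::vector<uint32_t> blk_ptr;
    // Block-row and block-column index packed into one word per block.
    std::vector<uint32_t> block_ind;
    // Intra-block row and column indices packed into one word per nonzero,
    // laid out in Z-Morton order inside each block.
    std::vector<uint32_t> comb_lowbits;
    // Nonzero values, in the same order as comb_lowbits.
    std::vector<double>   val;
};
```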

Other Knowledge:

  • Existing CSR, ELL, COO formats cannot enable efficient access to nonzeros along both row and column directions.

Useful References:

Dataset:

  • 11 matrices from UF collection

Platform & Software:

  • Intel Core i7-3770
  • Nvidia GTX Titan
  • Nvidia C2075

Comparison Software:

  • CSB for multicore CPU
  • Nvidia cuSPARSE’s CSR-SpMV
  • Nvidia CUSP’s HYB-SpMV
  • SpMTV is implemented by modifying the code of CUSP

“Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks” — Aydın Buluç et al., 2009

Paper: [Link]

Code: [Link]

Slides: [Link]

Features:

  • Introduce CSB (Compressed Sparse Blocks) format
    • CSB does not favor rows over columns or vice versa.
    • In contrast to BCSR, which stores a sparse collection of (small) dense blocks, CSB stores a dense collection of sparse blocks.
    • It is motivated by multicore and manycore architectures, in which parallelism and memory bandwidth are key resources.
    • Storage: The storage requirement for CSB is essentially the same as that for the CSR format, for which computing Ax in parallel is easy but A^T x (or equivalently computing x^T A) is difficult.
  • CSB-SpMV Performance:
    • On one processor, the CSB algorithms for Ax and A^T x run just as fast as the CSR algorithm (from OSKI) for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
    • On multicore platforms, CSB-SpMV runs faster than Star-P for all the matrices tested, including the structured ones.
    • “Its performance is limited by the memory-system bandwidth, not by lack of parallelism.”
    • The parallel performance of CSB_SPMV and CSB_SPMV_T is generally not affected by highly uneven row and column nonzero counts.
  • Did not employ speculative low-level optimizations such as software prefetching, pipelining, or matrix-specific optimizations such as index and/or value compression, but we believe that CSB and our algorithms should not adversely affect incorporation of these approaches.

Findings:

  • CSR-SpMV:
    • work: Θ(nnz);
    • span: Θ(nr + lg n), where nr is the average number of nonzeros per row;
    • storage: n lg nnz + nnz lg n bits of index data, for an n×n sparse matrix with nnz nonzeros.
  • CSB format
    • block-size parameter β:
      • a block size slightly larger than √n delivers reasonable performance, which is at most 10%-15% degradation from the best performance.
      • the overall best performance was achieved when β satisfies the equation ⌈lg√n⌉ ≤ lgβ ≤ 3+⌈lg√n⌉.
      • CSB constructor can automatically select a reasonable β in this range.
    • val[nnz]:
      • Store nonzero values in the order of contiguous blocks.
      • The ordering among blocks is flexible, row-major ordering is used in the experiments.
      • Within a block, Z-Morton ordering is used.
        • The choice of layout within the block (Z-Morton or row/column orderings) becomes less significant for sparse blocks as they already do not take full advantage of such features.
        • More importantly, a recursive ordering allows us to efficiently determine the four quadrants of a block using binary search, which is crucial for parallelizing individual blocks.
    • row_ind[nnz]
    • col_ind[nnz]
      • They range from 0 to β-1
      • As a practical matter, we can pack a corresponding pair of elements of row_ind and col_ind into a single integer word of 2lgβ bits so that they make a single array of length nnz, which is comparable to the storage needed by CSR for the col_ind array
    • blk_ptr[n^2/β^2]
      • stores the index of each block in the val array, which is analogous to the row_ptr array for CSR
    • storage: (n^2/β^2) lg nnz + 2 nnz lg β bits of index data (see the struct sketch after this list)
  • Parallelization of CSB-SpMV
    • blockrow splitting, with further dividing long blockrows.
    • Specifically, when a blockrow contains Ω(β) nonzeros, we recursively divide it “in half,” yielding two subblockrows, each containing roughly half the nonzeros.
      • To facilitate fast subblockrow divisions, we first partition the blockrow into “chunks” of consecutive blocks. By dividing a blockrow “in half,” we mean assigning to each subblockrow roughly half the chunks.
      • If this chunk contains many blocks, it is guaranteed to contain at most Θ(β) nonzeros, which is sufficiently sparse to perform the serial multiplication;
      • If, on the other hand, the chunk is a single block, it may contain as many as β^2 ≈ n nonzeros. Instead, we perform the parallel block-vector multiplication CSB_BLOCKV.
    • For each block, use function CSB_BLOCKV
      • The block-vector multiplication proceeds by recursively dividing the (sub)block M into quadrants M00, M01, M10, and M11, each of which is conveniently stored contiguously in the Z-Morton-ordered val, row_ind, and col_ind arrays between indices start and end. We perform binary searches to find the appropriate dividing points in the array.
    • To avoid the races that might arise due to write conflicts between the subblockrows, we allocate a temporary vector to store the result of one of the subblockrows and allow the other subblockrow to use the output vector. After both subblockrow multiplications complete, we serially add the temporary vector into the output vector.
    • When the nonzeros are evenly distributed across the blockrows, our CSB-SpMV algorithm directly calls the serial multiplication instead.
  • CSB-SpMV
    • work: Θ(n^2/β^2 + nnz)
    • span: O(β lg(n/β) + n/β)
      • choosing β = Θ(√n), CSB_SPMV runs with work Θ(nnz) and span O(√n lg n), achieving a parallelism of Ω(nnz / (√n lg n)).
    • extra storage (for the chunk array and the temporary vector z): O(P √n lg n) space when β = Θ(√n)
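
A hypothetical sketch of the CSB arrays described above; in the paper a row_ind/col_ind pair is packed into a single word of 2 lg β bits, which is left unpacked here for clarity:

```cpp
#include <cstdint>
#include <vector>

// Sketch of Compressed Sparse Blocks storage for an n x n matrix with
// block-size parameter beta (blocks are beta x beta, stored contiguously).
struct CSB {
    int n;                          // matrix dimension
    int beta;                       // block size, typically ~sqrt(n)
    std::vector<double>   val;      // nonzeros, Z-Morton order within each block
    std::vector<uint16_t> row_ind;  // intra-block row index, 0..beta-1
    std::vector<uint16_t> col_ind;  // intra-block column index, 0..beta-1
    std::vector<uint64_t> blk_ptr;  // one offset into val per block
                                    // ((n/beta)^2 entries), analogous to CSR's row_ptr
};
```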

Other Knowledge:

  • Introduce CSR format history in P234 right column third paragraph
  • Review CSR-SpMV parallelization strategies on Page 235 left column last five paragraphs and right column first three paragraphs.
  • Many or most of the individual blocks Aij are hypersparse, meaning the ratio of nonzeros to matrix dimension is asymptotically 0. Thus, the space to store a block should depend only on its nonzero count, not on its dimension.
  • Morton-hybrid layout: stores elements in row-major order within blocks and in Morton order among blocks (see the bit-interleaving sketch below). [Paper: Analyzing block locality in Morton-order and Morton-hybrid matrices]
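
For reference, a small illustration (mine, not from the paper) of the bit interleaving that defines the Z-Morton order used within CSB blocks:

```cpp
#include <cstdint>

// Interleave the bits of (row, col) so that sorting positions by the result
// yields the Z-Morton (bit-interleaved) order inside a block.
uint64_t morton_index(uint32_t row, uint32_t col) {
    uint64_t z = 0;
    for (int b = 0; b < 32; ++b) {
        z |= ((uint64_t)((row >> b) & 1)) << (2 * b + 1);  // row bit -> odd position
        z |= ((uint64_t)((col >> b) & 1)) << (2 * b);      // col bit -> even position
    }
    return z;
}
```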

Useful References:

Dataset:

  • 14 matrices from UF collection

Platform & Software:

  • Platforms: AMD Opteron 8214 (Santa Rosa), Intel Xeon X5460 (Harpertown), Intel Core i7 920 (Nehalem)
  • values: double-precision floating points; indices: 32-bit unsigned integers.
  • Language: Cilk++

Comparison Software:

  • “pure” serial OSKI: without enabling OSKI’s preprocessing step, which chooses blockings for cache and register usage that are tuned to a specific matrix.
  • Star-P: distributed-memory code

Cited by:

Summary of Tensor Decomposition from High Performance Computing View (Updating)

Survey:

Tensor Formats:

Sparse Tensor Parallelization:

Current Development:

  • Fundamental tensor operations
    • TTM
    • MTTKRP
    • Tensor contraction
  • Tensor Decompositions
    • CP decomposition
    • Tucker decomposition
    • (Anima’s )
    • Tensor Train decomposition
    • Hierarchical Tucker decomposition

Applications:

  • Healthcare
  • Deep Learning
  • Traditional Machine Learning
  • Social Networks

Software:

  • Matlab
    • Tensor Toolbox
    • N-way Toolbox
  • C++
    • CTF (Cyclops Tensor Framework)
    • SPLATT
    • ParTI

“A Unified Optimization Approach for Sparse Tensor Operations on GPUs” — Bangtian Liu et al. 2017

Paper: [Link]

Code: N/A

Features:

  • Optimized sparse MTTKRP and TTM on GPUs
  • Can be extended to high-order tensor computations
  • Speedup:
    • SpTTM and SpMTTKRP up to 3.7 and 30.6 times respectively on NVIDIA Titan-X GPUs over ParTI-GPU.
    • CANDECOMP/PARAFAC (CP) decomposition achieves up to 14.9 times speedup over SPLATT.
  • My thought:
    • F-COO cannot replace the COO format; a separate F-COO representation is needed for each mode.
    • sf may need to record every starting index in the F-COO format.

Findings:

  • propose a unified sparse tensor representation called F-COO (Flagged-COOrdinate), which is based on the tensor modes for sparse tensor computations.
    • memory efficient
    • can be used as a unified storage format across different sparse tensor operations
    • Each computation needs to construct a corresponding F-COO representation.
    • F-COO is preprocessed for different modes on the host and will only be transferred once in the beginning to the GPU.
  • F-COO details (see the struct sketch after this list):
    • take mode-3 SpTTM as an example
      • “val”: store nonzero values
      • “k”: Product modes’ indices are stored as in COO format, to guide the Kronecker or Hadamard product operations
      • bf (bit-flag): represent any changes in the index modes which consequently shows the computation has switched to another fiber (in SpTTM) or to another slice (in SpMTTKRP).
        • bf value will change from 1 to 0 when a change in the i or j value occurs.
        • The number of nonzeros processed per thread depends on the data type selected for the bf array. For example, if uint8_t or unsigned char is used for bf, the number of nonzeros processed per thread will be 8.
      • sf (start-flag): indicate whether partitions processed by the current thread start new indices on index mode over previous thread.
  • Unified parallel algorithms for different sparse tensor operations, not sensitive to mode changes
    • The one-shot algorithm for MTTKRP seems to have been used before.
    • They avoid large intermediate data because each thread block processes one column of the factor matrices instead of a row.
    • And each thread block writes to different global memory locations.
  • Use segmented scan with bf and sf information.
  • GPU parallelization:
    • Launch two-dimensional thread grids with one-dimensional thread blocks, instead of one-dimensional thread grids with two-dimensional thread blocks (as in ParTI), to avoid thread divergence inside a warp and strided memory accesses.
    • This partitioning strategy makes the computation imbalance exist between thread blocks instead of inside threads.
  • Use GPU-specific optimizations: kernel fusion, warp shuffle, and data reuse
    • Use Read-Only Data-Cache for factor matrices
    • fuse product kernel, segmented scan, and the accumulation kernels together to increase data reuse and keep intermediate data in shared memory
    • Warp shuffle is used to increase data sharing inside a warp. Warp shuffle enables register to register data exchange and thus reduces the shared memory footprint and avoids overusing shared memory.
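
A hypothetical host-side sketch of the F-COO arrays for mode-3 SpTTM as described above; names and types are assumptions based on the notes, not the authors’ code:

```cpp
#include <cstdint>
#include <vector>

// Flagged-COOrdinate (F-COO) storage for mode-3 SpTTM. The product mode k is
// stored explicitly (COO-style); the index modes (i, j) are replaced by
// per-nonzero bit flags that mark fiber boundaries.
struct FCOO_Mode3 {
    std::vector<double>   val;  // nonzero values
    std::vector<uint32_t> k;    // product-mode indices, guide the Hadamard/Kronecker products
    // bf: one bit per nonzero; a 0 marks that (i, j) changed, i.e. the
    // computation has switched to a new fiber. With uint8_t elements, each
    // element covers 8 nonzeros, so each thread processes 8 nonzeros.
    std::vector<uint8_t>  bf;
    // sf: one flag per thread partition, indicating whether the partition
    // starts new index-mode values relative to the previous thread.
    std::vector<uint8_t>  sf;
};
```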

Other Knowledge:

  • Challenges in optimizing sparse tensor operations on GPUs, such as i) finding a good parallelization granularity, ii) reducing storage costs (large intermediate data) and irregularities in memory accesses, and iii) dealing with atomic updates.
  • Prior work has used fiber- or slice-level computations as the granularity for parallelization.
    • However, such approaches lead to noticeable load imbalance between threads on the GPU because of the sparse nature of the tensors.
    • Also the optimizations do not deliver consistent performance for different modes and ranks

Dataset:

  •  FROSTT: brainq, nell1, nell2, delicious

Platform:

  • Intel Core i7-5820K CPU
  • NVIDIA GeForce GTX Titan X platform

Comparison Software:

  • ParTI
  • SPLATT

“Merge-based Parallel Sparse Matrix-Vector Multiplication” — Duane Merrill et al. 2016

Paper: [Link]

Slide: [Link]

Code: [Link]

Features:

  • Overall, better average performance compared to Nvidia cuSPARSE, Intel MKL, and other implementations. Performance is worse for some matrices with evenly distributed (regular) row lengths.
  • The first work to adapt merge-based parallel decomposition for computing a well-balanced SpMV directly on CSR matrices without offline analysis, format conversion, or the construction of side-band data (aka preprocessing, inspection, reformatting, or supplemental encoding). (The nz_indices list used in the merge-based algorithm contains the nonzero location indices, which are simply the natural numbers 0 to nnz−1, so no extra storage is needed for it.) This decomposition is particularly useful for bounding workloads across multi-scale processors and systems with fixed-size local memories.
  • The merge-based decomposition is orthogonal to (and yet supportive of) bandwidth optimization strategies such as index compression, blocking, and relabeling. In fact, the elimination of workload imbalance from our method is likely to amplify the benefits of these techniques.
  • Explore NUMA opportunities for multi-socket systems.
  • Merge-based method allows us to make maximal utilization of the GPU’s fixed-size shared memory resources, regardless of the ratio between row-offsets and nonzeros consumed during each path-chunk. This is not possible for row-based or nonzero-splitting strategies that are unable to provide tight bounds on local workloads.
  • Binary searching and fix-up are counted as part of SpMV running time, while NUMA page migration as preprocessing overhead.
  • The row-length imperviousness (the correlation between GFLOPs throughput vs row-length variation) and the performance predictability (the correlation between elapsed running time vs matrix size nnz) are GOOD (Fig. 9).

Findings:

  • Adapt a simple parallelization strategy originally developed for efficient merging of sorted sequences [Paper]
  • The central idea is to frame the parallel CsrMV decomposition as a logical merger of two lists: the row descriptors and the indices of the nonzero values.
  • Merge-based CsrMV
    • Two stages: a partitioning stage and a merging stage.
    • The two elements Ai and Bj scheduled to be compared at diagonal k can be found via constrained binary search along that diagonal: find the first (i,j) where Ai is greater than all of the items consumed before Bj, given that i+j=k.
    • A benefit of this parallelization strategy is that the decision path can be partitioned hierarchically, trivially enabling parallelization across large, multi-scale systems. Work complexity: O(N + p log N); time complexity: O(N/p + log N). (See the search sketch after this list.)
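
A hypothetical sketch of the constrained binary search along a diagonal (here list A holds the CSR row-end offsets and list B is the natural numbers 0..nnz−1, so B[j] = j); it follows the description above rather than the paper’s code:

```cpp
#include <algorithm>
#include <cstdint>

// Find the merge-path coordinate (i, j) on diagonal k (i + j == k) for the
// logical merger of A (row-end offsets, length m) and B (0..nnz-1). Thread t
// would call this with k = t * items_per_thread to find where its work starts.
void merge_path_search(int64_t k, const int64_t *A, int64_t m, int64_t nnz,
                       int64_t &i, int64_t &j) {
    int64_t lo = std::max<int64_t>(0, k - nnz);  // smallest feasible i
    int64_t hi = std::min(k, m);                 // largest feasible i
    while (lo < hi) {
        int64_t mid = (lo + hi) / 2;
        // B[k - mid - 1] == k - mid - 1, since B is just the natural numbers.
        if (A[mid] <= k - mid - 1)
            lo = mid + 1;   // A[mid] is consumed before B[k - mid - 1]
        else
            hi = mid;
    }
    i = lo;        // rows consumed so far
    j = k - lo;    // nonzeros consumed so far
}
```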

Other Knowledge:

  • Specialized or supplementary formats ultimately burden the application with both
    • (1) preprocessing time for inspection and formatting, which may be tens of thousands of times greater than the SpMV operation itself;
    • (2) excess storage overhead because the original CSR matrix is likely required by other routines and cannot be discarded.
  • Specialized sparse matrix formats:
    • padding (e.g. ELL) and/or reorganization strategies (map similarly-sized rows onto co-scheduled thread groups, e.g. PKT, SELL),
    • reduce memory I/O via index compression (e.g. blocking, bit-encoding, BCCOO),
    • hybrid and multi-level formats (e.g. pOSKI, CSB, HYB, Cocktail).
  • CsrMV parallelization strategies:
    • Row splitting:
      • The splitting of long rows can be done statically via a preprocessing step that encodes the dataset into an alternative format, or dynamically using a work-sharing runtime. The dynamic variant requires runtime task distribution, a behavior likely to incur processor contention and limited scalability on massively parallel systems.
      • Vectorization is a common variant of row-splitting in which a group of threads is assigned to process each row. It can suffer significant underutilization in the form of inactive SIMD lanes when row lengths do not align with SIMD widths, and adaptively vectorizing based on row length does not fully eliminate this underutilization.
    • Nonzero splitting: need to perform the searching within the row-offsets array as an offline preprocessing step.

COO SpMV optimizations:

Dataset:

  • 4201 non-vector, non-complex sparse matrices from University of Florida Sparse Matrix Collection

Platform:

  • dual-socket NUMA CPU platform, with two Intel Xeon E5-2690v2 CPUs
  • one NVIDIA Tesla K40 GPU
  • Test bandwidth: stream triad score [Webpage]

Comparison Software:

  • Intel MKL v11.3
  • NVIDIA’s cuSPARSE v7.5
  • Other non-CSR formats expressly designed to tolerate row-length variation:
    • CSB (Compressed Sparse Blocks) [Paper]
    • HYB, also from NVIDIA’s cuSPARSE library, HybMV [Paper]
    • pOSKI [Webpage]
    • yaSpMV: BCCOO (blocked compressed common coordinate) [Paper]