Paper: [Link]

Code: [Link]

**Features:**

**Findings:**

**Useful References:**

**Dataset:**

**Platform & Software:**

**Comparison Software:**

- OSKI

# Category: Uncategorized

# “An Efficient Fill Estimation Algorithm for Sparse Matrices and Tensors in Blocked Formats” — Peter Ahrens et al., 2017

# “Parallel Tensor Compression for Large-Scale Scientific Data” — Woody Austin et al., 2016

# “Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory” — Shaden Smith et al., 2017

Paper: [Link]

Code: [Link]

**Features:**

- dense tensors
- The first distributed Tucker decomposition using MPI
- Tucker decomposition (optimized ST-HOSVD and HOOI algorithms)
- Uses nonstandard data layouts: the approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel.
- Achieves as high as 66% of peak performance on a single node with 24 cores, and up to 17% of peak on over 1000 nodes.
- Gives a thorough analysis of computation and communication time
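
As a rough illustration of why Tucker decomposition compresses dense data, the sketch below counts the values stored by a Tucker model (one core tensor plus one factor matrix per mode) and compares that with the full tensor. The function names, dimensions, and ranks are hypothetical and chosen only for the example; they are not taken from the paper.

```python
from math import prod


def tucker_storage(dims, ranks):
    """Values stored by a Tucker model: core tensor plus one factor matrix per mode."""
    if len(dims) != len(ranks):
        raise ValueError("dims and ranks must have the same length")
    core = prod(ranks)                                  # R_1 * R_2 * ... * R_N
    factors = sum(i * r for i, r in zip(dims, ranks))   # sum of I_n * R_n
    return core + factors


def compression_ratio(dims, ranks):
    """Ratio of full dense storage to Tucker storage."""
    return prod(dims) / tucker_storage(dims, ranks)


# Hypothetical sizes for illustration only (not from the paper).
dims = (100, 100, 100)
ranks = (10, 10, 10)
print(tucker_storage(dims, ranks))     # 1000 core values + 3 * 1000 factor values
print(compression_ratio(dims, ranks))
```

The ratio grows quickly when the ranks are small relative to the tensor dimensions, which is the regime where compressing large scientific data pays off.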

**Findings:**

- The algorithm works for dense tensors of any order (i.e., number of dimensions) and size, given adequate memory, e.g., three times the size of the data.

**Useful References:**

**Dataset:**

**Platform & Software:**

**Comparison Software:**

Paper: TODO

Code: [Link]

**Features:**

- Maintains load balance and low synchronization
- Explores architectural features, e.g., vectorization, synchronization (mutexes, compare-and-swap, transactional memory, privatization), and management of high-bandwidth memory (MCDRAM)
- Platform: one KNL processor
- Speedup: 1.8x over a dual-socket Intel Xeon 44-core system
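
Of the synchronization options listed above, privatization is the simplest to sketch: each worker accumulates into its own private output buffer, and the buffers are summed at the end, so no locks or atomics are needed during accumulation. The example below applies this idea to a mode-0 MTTKRP over coordinate-format nonzeros. It is a minimal sketch under assumptions, not the paper's implementation: the function name, the `(i, j, k, value)` tuple layout, and the round-robin work split are all hypothetical, and Python threads only illustrate the structure rather than deliver real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def mttkrp_mode0_private(nonzeros, B, C, n_rows, rank, n_workers=4):
    """Mode-0 MTTKRP over COO nonzeros (i, j, k, value) with privatized outputs."""

    def work(chunk):
        # Private buffer: no other worker ever writes to it.
        local = [[0.0] * rank for _ in range(n_rows)]
        for i, j, k, v in chunk:
            for r in range(rank):
                local[i][r] += v * B[j][r] * C[k][r]
        return local

    # Round-robin split of nonzeros across workers (an assumed, simple policy).
    chunks = [nonzeros[w::n_workers] for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(work, chunks))

    # Reduction step: sum the private buffers into the final output.
    out = [[0.0] * rank for _ in range(n_rows)]
    for local in partials:
        for i in range(n_rows):
            for r in range(rank):
                out[i][r] += local[i][r]
    return out


# Tiny hypothetical tensor with three nonzeros and rank-2 factor matrices.
nonzeros = [(0, 0, 0, 1.0), (0, 1, 1, 2.0), (1, 0, 1, 3.0)]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[1.0, 1.0], [2.0, 0.5]]
print(mttkrp_mode0_private(nonzeros, B, C, n_rows=2, rank=2, n_workers=2))
```

The trade-off is memory: every worker holds a full copy of the output, so privatization suits short modes, while mutexes or compare-and-swap avoid the extra copies at the cost of contention.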

**Findings:**

**Other Knowledge:**

- HPC systems are increasingly used for data intensive computations which exhibit irregular memory accesses, non-uniform work distributions, large memory footprints, and high memory bandwidth demands.
- sparse, unstructured tensors
- Challenges of optimizing algorithms on many-core processors: exposing a high degree of parallelism, load balancing tens to hundreds of parallel threads, and effectively utilizing the high-bandwidth memory.

**Useful References:**

**Dataset:**

- FROSTT [Link]
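
Sparse tensors of this kind are commonly distributed as plain-text coordinate files, one nonzero per line. The sketch below reads such text into index and value lists. It assumes each line holds whitespace-separated 1-based indices followed by the value, and that lines starting with `#` are comments; check the dataset's own format description before relying on it. The function name `parse_tns` is hypothetical.

```python
def parse_tns(lines):
    """Parse coordinate-format text: N 1-based indices, then a value, per line.

    Returns (indices, values, dims) with indices converted to 0-based tuples.
    """
    indices, values = [], []
    order = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        fields = line.split()
        if order is None:
            order = len(fields) - 1
        elif len(fields) - 1 != order:
            raise ValueError("inconsistent number of indices: " + line)
        indices.append(tuple(int(f) - 1 for f in fields[:-1]))
        values.append(float(fields[-1]))
    # Infer each mode's length from the largest index seen in that mode.
    dims = tuple(max(idx[m] for idx in indices) + 1 for m in range(order)) if indices else ()
    return indices, values, dims


sample = ["# hypothetical 3-mode tensor", "1 1 1 1.0", "2 3 1 2.5"]
indices, values, dims = parse_tns(sample)
print(indices, values, dims)
```

For a real file, pass an open file handle, for example `parse_tns(open(path))`, since the function only iterates over lines.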