Paper: [Link]
Optimization Techniques:
- Launch a sufficiently large number of threads
- Careful thread control: spawning and synchronization
- Load balancing
- Identify program hotspots
- Manage data locality
- Ensure each thread migration handles a fair amount of work, so migration cost is amortized
Features (Introduce EMU architecture):
- System architecture
- Nodelet architecture
- multiple Gossamer Cores (GCs)
- supports 64 concurrent threads
- OS launches main() of the user program on a GC; threads return to main() upon completion, and main() then returns to the OS
- Internally, the GCs implement fine-grained multithreading and issue a single instruction each cycle. Each thread is limited to one active instruction at any given time.
- deeply pipelined
- accumulator + 16 registers
- Since the system is expected to maintain large numbers of active threads, limiting the execution rate of a single thread should have a minimal performance impact.
- Limiting each thread to a single active instruction reduces the amount of logic needed by each GC, as features like branch prediction and data hazard mitigation are not required.
- GCs do not include data caches; memory transactions are either atomics or performed as loads and stores between the thread registers and memory.
- a Nodelet Queue Manager (NQM)
- moving packets to and from the Migration Engine, GCs, and Memory Front End.
- All packets coming into the nodelet are processed by the NQM and placed into one of three queues: Run, Migrate, or Memory.
- Run Queue holds threads awaiting processing on the nodelet. The Run Queue is continuously sorted by the NQM to keep it approximately ordered by thread priority with higher priority threads moving towards the head of the queue.
- Migrate Queue holds packets leaving the nodelet until accepted by the Migration Engine,
- Memory Queue holds remote operations until accepted by the Memory Front End. Once a remote operation is accepted by the Memory Front End, the NQM generates the (optional) acknowledgment packet and sends it to the Migration Engine.
- a Memory Front End (MFE)
- When a thread performs a memory transaction on a GC, the transaction is submitted to MFE.
- Atomic operations are performed within the MFE
- the Narrow Channel DRAM (NCDRAM) system
- conventional 64-bit DIMM interfaces move a full 64-byte burst even for a single-word access; NCDRAM removes this inefficiency by using an 8-bit interface, so one burst delivers a single 64-bit word
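The interface-width comparison above can be made concrete with a back-of-envelope calculation. This is our own illustration (the formula and numbers are not quoted from the paper), assuming DDR4's standard burst length of 8 transfers:

```c
#include <assert.h>

/* Rough sketch (our arithmetic, not a figure from the paper):
 * bytes moved per DRAM access = (interface width in bits / 8) * burst length.
 * With DDR4's burst length of 8, a conventional 64-bit DIMM interface moves
 * 64 bytes per access, while NCDRAM's 8-bit interface moves just 8 bytes --
 * exactly one 64-bit word, a good match for fine-grained accesses. */
static int bytes_per_access(int interface_bits, int burst_length) {
    return interface_bits / 8 * burst_length;
}
```

So for workloads dominated by single-word accesses, the narrow channel avoids moving 56 wasted bytes per access.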
- Node architecture
- ME: Eight nodelets are replicated and combined with a Migration Engine (ME) to form the majority of the NLS (NodeLet Set) portion of the node.
- a crossbar with a direct connection to the Nodelet Queue Manager within each nodelet, a Stationary Core Proxy, and six RapidIO interfaces.
- The NLS also maintains two PCI Express interfaces: one provides an interface to the Stationary Processor (SP), and the other can optionally be used for connecting other PCI Express cards (e.g., InfiniBand).
- The Stationary Processor portion maintains the Stationary Cores (SCs), a solid state drive, a boot NVRAM, and a DRAM for code and data that is private to the SP.
- SP contains conventional 64-bit processing cores running Linux
- system interconnect and organization
- The system interconnect between nodes in the Emu architecture utilizes the Serial RapidIO (SRIO) network standard
- The Emu system has been architected to guarantee in-order delivery of remote operations to ensure that remote operations sent by a single thread to the same address occur in the correct order.
- System specification
- The NodeLet Set is currently implemented on an Altera Arria10 FPGA with an expectation of 4 Gossamer Cores per nodelet
- The NCDRAM system is implemented with memory devices using DDR4-2133 and provides a capacity of 8 GB per nodelet (64 GB/node, 512 GB in the 8-node system)
- The Stationary Processors are NXP (née Freescale) T1024 processors with PowerPC e5500 cores.
- The system interconnect uses 4-lane RapidIO Gen2 which provides a full-duplex bandwidth of 5 GB/s per link (2.5 GB/s each direction).
- Future generations will replace the FPGA with an ASIC, use RapidIO Gen 3 with additional ports per NLS, and use new memory technologies (e.g., Hybrid Memory Cubes).
- Programming model
- Address space
- PGAS (global address space) system with memory contributed by each nodelet
- “Replicated” space:
- ensure data locality
- is particularly valuable for providing local copies of global variables, important constants, and pointers into application data structures that are dispersed across the system.
- Instruction code is always maintained in replicated memory so that a thread can always find its code locally without knowledge of where the thread is executing
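The replicated-space idea above can be sketched in plain C. This is a simulation with made-up names, not Emu's actual mechanism: each nodelet holds its own copy of a replicated variable at the same offset, so a thread can read "its" copy knowing only which nodelet it is currently on:

```c
/* Minimal sketch of replicated space (names are ours, not Emu's):
 * one copy per nodelet, all at the same logical offset. */
#define NODELETS 8

static long replicated_nthreads[NODELETS]; /* e.g. a replicated global */

/* Writing a replicated variable updates every nodelet's copy. */
static void replicated_store(long *copies, long value) {
    for (int n = 0; n < NODELETS; n++)
        copies[n] = value;
}

/* A thread reads the copy local to the nodelet it is running on,
 * so no migration is triggered by the access. */
static long replicated_load(const long *copies, int my_nodelet) {
    return copies[my_nodelet];
}
```

This is why code and hot read-mostly globals live in replicated space: reads stay local no matter where the thread has migrated to.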
- Threads and ISA features
- Thread migration:
- completely in hardware, normally requires no software intervention
- a thread returns values to its parent and then dies
- #Threads limited only by available memory
- When a thread blocks, it may voluntarily place itself at the back of the run queue instead of busy-waiting.
- if the memory access is not local to the current nodelet, the thread context is removed from the current nodelet, transmitted across the system interconnect to the correct nodelet, and restarted there.
- Thread contexts in the Emu system are compact, typically 10-20 64-bit words, and consist of the thread status word (TSW), address register, and assorted data registers.
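The migration behavior in the notes above can be sketched as a small simulation. The struct layout mirrors the components listed (TSW, address register, data registers), but the field widths and the assumption that the owning nodelet is encoded in bits [31:29] of the address are ours for illustration; the real address map is not given in these notes:

```c
#include <stdint.h>

/* Illustrative thread context, not the real Emu format: the notes say a
 * context is compact, typically 10-20 64-bit words. */
typedef struct {
    uint64_t tsw;       /* thread status word */
    uint64_t addr_reg;  /* address register */
    uint64_t regs[16];  /* data registers (accumulator omitted) */
    int      nodelet;   /* nodelet the thread currently resides on */
} thread_ctx;

/* Assumed (hypothetical) placement of the nodelet id in the address. */
static int nodelet_of(uint64_t addr) {
    return (int)((addr >> 29) & 0x7);
}

/* On a load/store: if the address is remote, the hardware would ship the
 * whole context to the owning nodelet and restart the thread there. */
static int maybe_migrate(thread_ctx *t, uint64_t addr) {
    int target = nodelet_of(addr);
    if (target == t->nodelet)
        return 0;            /* local access: no migration */
    t->nodelet = target;     /* context moves across the interconnect */
    return 1;                /* migrated */
}
```

Because the context is only 10-20 words, shipping it to the data is cheap compared to pulling cache lines back and forth.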
- ISA supports
- single instruction spawns
- a rich set of atomic instructions
- Remote operations:
- Stores and atomic instructions that do not require returning a value to the thread can be handled as remote operations.
- The thread does not migrate but instead generates a short packet with the data, operation to be executed, and the address to be updated.
- If desired, the remote nodelet sends an acknowledgment packet back to the sending thread. This is done when the sender wishes to know that an operation or set of operations is complete and globally visible before proceeding to a new phase of the algorithm.
- Remote operations are especially advantageous for threads that need to update information in a “fire-and-forget” manner, as they do not return any data to the sending thread.
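The fire-and-forget behavior described above can be sketched as follows. The packet layout and names are our invention; the notes only say a remote-op packet carries the data, the operation, and the target address, with an optional acknowledgment:

```c
#include <stdint.h>

/* Hypothetical remote-operation packet (layout is ours, not Emu's). */
typedef enum { REMOTE_STORE, REMOTE_ADD } remote_op;

typedef struct {
    remote_op op;       /* operation to execute at the destination */
    int64_t  *addr;     /* target word on the remote nodelet */
    int64_t   value;    /* operand carried in the packet */
    int       want_ack; /* sender wants to know when it is globally visible */
} remote_packet;

/* What the remote Memory Front End would do on receipt; returns 1 if an
 * acknowledgment packet should be sent back to the originating thread. */
static int apply_remote(const remote_packet *p) {
    switch (p->op) {
    case REMOTE_STORE: *p->addr  = p->value; break;
    case REMOTE_ADD:   *p->addr += p->value; break;
    }
    return p->want_ack;
}
```

The key point: the issuing thread never migrates and never blocks on the result, which is exactly what makes this attractive for scattered counter updates.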
- Cilk parallelism
- To fully utilize the Emu system, mechanisms for expressing parallelism and creating large numbers of threads are required.
- cilk_for, with a controllable grain size: #pragma cilk grainsize = 4
- supports dynamic spawning (cilk_spawn)
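What the grainsize pragma above controls can be illustrated in plain C (no Cilk runtime needed): a cilk_for with grainsize 4 splits the iteration space into chunks of 4 iterations, and each chunk becomes one spawnable task. Here we just count the resulting tasks rather than spawning them:

```c
/* Serial illustration of "#pragma cilk grainsize = 4" on a cilk_for:
 * the loop body runs in chunks of `grainsize` iterations, one task per
 * chunk. Counting tasks shows the spawn overhead/parallelism trade-off. */
static int count_tasks(int iterations, int grainsize) {
    int tasks = 0;
    for (int i = 0; i < iterations; i += grainsize)
        tasks++; /* in Cilk, this chunk would run as one spawned task */
    return tasks;
}
```

A larger grainsize means fewer spawns (less overhead) but fewer threads to keep the GCs busy, which matters on a machine that relies on massive thread counts.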
- Intrinsics and libraries
Benchmarks:
- Random access (aka “GUPS”); the main benchmark used to present results
- Breadth-First Search (BFS)
- sparse matrix-vector multiply (SpMV)
- particle swarm
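The random-access (GUPS) benchmark listed above can be sketched as a minimal serial kernel. This captures the benchmark's spirit, not the official reference implementation, and the LCG constants are our choice; on Emu, each update to a remote table entry would trigger a thread migration or a remote atomic:

```c
#include <stdint.h>

/* Minimal serial GUPS-style kernel: xor-update table entries at
 * pseudorandom indices. The pattern defeats caching and locality,
 * which is exactly what the Emu design targets. */
#define TABLE_BITS 10
#define TABLE_SIZE (1u << TABLE_BITS)

static uint64_t table[TABLE_SIZE];

static void gups(uint64_t updates) {
    uint64_t x = 0x123456789abcdef0ULL;       /* arbitrary seed */
    for (uint64_t i = 0; i < updates; i++) {
        /* step a 64-bit LCG to get the next pseudorandom value */
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        /* use the high bits as the index; xor in the value */
        table[x >> (64 - TABLE_BITS)] ^= x;
    }
}
```

Performance is reported as giga-updates per second (GUPS): updates divided by elapsed time.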
Supported Language:
- C, Cilk
Comparison Architectures:
- LCDM: loosely coupled distributed memory
- TCDM: tightly coupled distributed memory
- SM: shared memory
- GUPS per kilowatt, comparing against IBM POWER7, BG/Q, and x86
Questions:
- floating-point number support
- question of cilk_for behavior
- Can remote operations be employed automatically (e.g., by the compiler)?