15418

CMU 15-418/618 23sp
https://www.bilibili.com/video/BV1SM4y1j7XL/

1 Why parallelism

A Brief History of Processor Performance

Initial focus:
1. Supercomputers for scientific computing (1970s)
2. Databases (1990s)

Sources of single-thread performance gains:
Wider data paths: 4-bit to 64-bit
More efficient pipelining: lower CPI (cycles per instruction)
Exploiting ILP: superscalar processing
Examples of ILP techniques:
pipelining, superscalar execution, VLIW (Very Long Instruction Word), vector processing, out-of-order execution
Faster clock rates: up to ~3 GHz

Obstacles (why single-thread scaling ended):
power density wall, no further benefit from ILP, processor clock rates stopped increasing

What is a parallel computer

Definition: a collection of processing elements that cooperate to solve problems quickly (efficiency & performance)

Motivation: Speedup
Speedup (P cores) = execution time (1 core) / execution time (P cores)
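
A quick worked example (the numbers are invented for illustration, not from the lecture): if one core takes 10 s and four cores take 4 s, then

    Speedup(4 cores) = 10 s / 4 s = 2.5x

well short of the ideal 4x; the factors below explain the gap.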

Factors that influence speedup:

  • Communication (often dominating)
  • Work assignment (load balance)
  • Communication interval

Course theme 1: Designing and writing parallel programs ("parallel thinking")
  • Decomposing work into pieces
  • Assigning work to workers
  • Managing communication/synchronization
  • Abstractions
Course theme 2: Parallel computer hardware implementation
  • Mechanisms
Course theme 3: Efficiency
  • FAST != EFFICIENT
  • Hardware view: silicon area, power, efficiency

2 A modern multicore processor

Four key concepts: two about parallel execution, two about accessing memory
Superscalar processor: executes multiple instructions per cycle by exploiting ILP within a single instruction stream.

Idea 1: Use the increasing transistor count to add more cores to the processor.
forall declaration: a parallel for loop whose iterations may run on different cores (see the sketch below).
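
A minimal sketch of the forall/parallel-for abstraction, assuming an OpenMP implementation (the function and array names are my own, not from the lecture):

    #include <cmath>

    // Apply sin(x) element-wise; the pragma asks the runtime to split
    // the independent iterations across the available cores.
    void sinx_parallel(int n, const float* x, float* result) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            result[i] = std::sin(x[i]);  // no iteration depends on another
        }
    }

Compile with -fopenmp (gcc/clang). The abstraction only declares that iterations are independent; the compiler/runtime decides how to map them to cores.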

Idea 2: Amortize the cost/complexity of managing an instruction stream across many ALUs (e.g. SIMD processing, vector processing).
Notes:

  1. The abstraction facilitates automatic generation of multicore parallel code and vector instructions that use the SIMD processing capabilities within a core.
  2. Vector processing with conditional execution: all lanes must walk through both sides of a branch (with inactive lanes masked off), so efficiency is lost; see the sketch after this list.
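
A minimal sketch of masked conditional execution using x86 AVX intrinsics (the kernel itself is made up for illustration): both branches are computed for all eight lanes, then a per-lane mask selects the result. Computing both paths is exactly where the efficiency loss comes from.

    #include <immintrin.h>

    // Per lane: result = (x > 0) ? x * 2 : x * -1
    // SIMD lanes cannot branch independently, so we evaluate BOTH
    // branches for all 8 float lanes and blend by a comparison mask.
    // (Tail elements when n % 8 != 0 are omitted for brevity.)
    void conditional_simd(const float* x, float* result, int n) {
        __m256 zero = _mm256_setzero_ps();
        __m256 two  = _mm256_set1_ps(2.0f);
        __m256 neg  = _mm256_set1_ps(-1.0f);
        for (int i = 0; i + 8 <= n; i += 8) {
            __m256 v    = _mm256_loadu_ps(x + i);
            __m256 mask = _mm256_cmp_ps(v, zero, _CMP_GT_OQ); // lanes where x > 0
            __m256 a    = _mm256_mul_ps(v, two);  // "then" branch, all lanes
            __m256 b    = _mm256_mul_ps(v, neg);  // "else" branch, all lanes
            _mm256_storeu_ps(result + i, _mm256_blendv_ps(b, a, mask));
        }
    }

Compile with -mavx. With coherent input (all lanes take the same branch) the wasted work is one multiply per lane; with divergent input, every lane pays for both branches every iteration.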

Terminology:

  • Instruction stream coherence: necessary for efficient use of SIMD processing; unnecessary for efficient parallelization across cores
  • Divergent execution: a lack of instruction stream coherence

Note: instruction stream coherence is unrelated to cache coherence, despite the shared word.

  • explicit SIMD: SIMD parallelization is performed at compile time (vector instructions appear directly in the binary)
  • implicit SIMD (GPU): the compiler generates a scalar binary, and the interface to the hardware is data-parallel ("run this program on N data elements")

Summary: Several forms of parallel execution:

  • multicore: use multiple processing cores, providing thread-level parallelism (e.g. pthreads; a sketch follows this list)
  • SIMD: use multiple ALUs
  • superscalar: exploit ILP within an instruction stream
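
A minimal sketch of thread-level parallelism with pthreads (the names and the fixed two-thread split are my own): each thread is an independent instruction stream that can run on its own core.

    #include <pthread.h>
    #include <cmath>

    struct Args { int start, end; const float* x; float* result; };

    // Each thread computes sin over its assigned slice of the array.
    void* work(void* p) {
        Args* a = static_cast<Args*>(p);
        for (int i = a->start; i < a->end; i++)
            a->result[i] = std::sin(a->x[i]);
        return nullptr;
    }

    void sinx_two_threads(int n, const float* x, float* result) {
        pthread_t t;
        Args lo{0, n / 2, x, result};   // first half on a spawned thread
        Args hi{n / 2, n, x, result};   // second half on the calling thread
        pthread_create(&t, nullptr, work, &lo);
        work(&hi);
        pthread_join(t, nullptr);
    }

Unlike the forall abstraction, here the programmer explicitly decomposes and assigns the work.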

Part 2: accessing memory

Terminology:

  • Memory latency: the time between a request for a value and the return of that value
  • Memory bandwidth: the rate at which data can be read from or stored into memory by a processor
  • Stall: a processor waiting for data to be returned from memory

Caches reduce the length of stalls (i.e., they reduce effective memory access latency).
Prefetching and multithreading do not avoid the memory accesses; they hide the latency of stalls.

Idea 1: interleave processing of multiple threads on one core to hide stalls:
switch to another thread when the current thread is waiting for memory.

Hardware-supported multithreading:

  • The core manages the execution contexts of multiple threads
  • Interleaved multithreading: switch between threads on each cycle
  • Simultaneous multithreading (SMT): execute instructions from multiple threads in the same cycle

Multithreading summary:
Benefits: uses a core's ALUs more effectively by hiding memory latency
Costs: additional storage for thread state; increased run time of any single thread; heavy reliance on memory bandwidth

CPU vs GPU memory hierarchy:

  • CPU: L1, L2, L3 caches, main memory
  • GPU: L1, L2 caches, global memory, texture memory, constant memory, shared memory, register file
    Although the GPU can do the computation faster, it is often limited by memory bandwidth.
    Bandwidth is critical!
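
A back-of-the-envelope illustration of a bandwidth-bound kernel (the 500 GB/s figure is an assumed round number, not a spec from the notes):

    // C[i] = A[i] * B[i]: per element, load 8 bytes (A[i], B[i]) and
    // store 4 bytes (C[i]) -> 12 bytes of traffic per 1 multiply.
    // At an assumed 500 GB/s of memory bandwidth, the kernel sustains
    // at most ~500e9 / 12 ≈ 42 GFLOP/s, no matter how many TFLOP/s of
    // ALU throughput the chip has: it is bandwidth-bound.
    void multiply(int n, const float* A, const float* B, float* C) {
        for (int i = 0; i < n; i++)
            C[i] = A[i] * B[i];
    }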

Summary:

  • three major ideas that all modern processors employ:
    1. employ multiple processing cores
    2. amortize instruction stream management across many ALUs
    3. use multi-threading to hide memory latency
  • Because of the high arithmetic capability of modern chips, many parallel programs are limited by memory bandwidth
  • GPU architectures use the same throughput-computing principles as multicore CPUs, but GPUs push these ideas to the extreme

3 Parallel programming abstractions
