15418
CMU 15-418/618 23sp
https://www.bilibili.com/video/BV1SM4y1j7XL/
1 Why parallelism
A Brief History of Processor Performance
Initial focus:
1. supercomputers for scientific computing (1970s)
2. databases (1990s)
Wider data paths: 4-bit to 64-bit
More efficient pipelining: decreasing CPI (cycles per instruction)
Exploiting ILP: superscalar processing
Examples of ILP techniques:
pipelining, superscalar execution, VLIW (Very Long Instruction Word), vector processing, out-of-order execution
Faster clock rates: ~3 GHz
Obstacles:
- power density wall
- no further benefit from exploiting ILP
- processor clock rates stop increasing
What is a parallel computer
definition: a collection of processing elements that cooperate to solve problems quickly (efficiency & performance)
Motivation: Speedup
Speedup(P cores) = execution time (1 core) / execution time (P cores)
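Worked example (hypothetical numbers): if one core takes 12 s and four cores take 4 s, then Speedup(4 cores) = 12 / 4 = 3x, i.e. sub-linear, because of the factors below.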
Factors that limit speedup:
- communication (often the dominating cost)
- work assignment (load imbalance)
- communication interval
Course theme 1: Designing and writing parallel programs
Parallel thinking:
- decomposing work into pieces that can run in parallel
- assigning work to processors
- managing communication/synchronization
Abstractions for doing the above
Course theme 2: Parallel computer hardware implementation
Mechanisms
Course theme 3: Efficiency
FAST != EFFICIENT
hardware: silicon area, power, efficiency
2 A modern multicore processor
Four key concepts: two about parallel execution, two about accessing memory
superscalar processor: executes multiple instructions per clock cycle by exploiting ILP within a single instruction stream
Idea 1:
Use increasing transistor count to add more cores to the processor
forall declaration: a parallel for loop whose iterations are independent and may run concurrently
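A minimal sketch of the forall abstraction, using OpenMP's parallel for as one concrete realization (hypothetical saxpy kernel; the iterations must be independent):

```cpp
#include <cstddef>

// forall-style loop: iterations are independent, so the runtime may
// distribute them across all cores. Compile with -fopenmp (GCC/Clang).
void saxpy(float a, const float* x, float* y, std::size_t n) {
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)  // OpenMP wants a signed loop index
        y[i] = a * x[i] + y[i];
}
```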
Idea 2: Amortize cost/complexity of managing an instruction stream across many ALUs (e.g. SIMD processing, vector processing)
Note:
- Abstraction facilitates automatic generation of multicore parallel code and vector instructions to make use of SIMD processing capabilities within a core.
- vector processing with conditional execution: the hardware walks over both sides of the conditional for all lanes (masking off inactive ones), losing efficiency; see the sketch below
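A minimal explicit-SIMD sketch of that efficiency loss, using AVX intrinsics on a hypothetical kernel: both sides of the branch are computed for all eight lanes, and a per-lane mask selects the results:

```cpp
#include <immintrin.h>

// For each element: x > 0 ? x * 2 : x - 1, eight floats at a time.
// Both branches execute for every lane; the blend keeps the right result.
void conditional_kernel(float* x, int n) {
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 v    = _mm256_loadu_ps(x + i);
        __m256 mask = _mm256_cmp_ps(v, _mm256_setzero_ps(), _CMP_GT_OQ);
        __m256 t    = _mm256_mul_ps(v, _mm256_set1_ps(2.0f));  // "then" side
        __m256 f    = _mm256_sub_ps(v, _mm256_set1_ps(1.0f));  // "else" side
        // take lanes from t where the mask is set, from f elsewhere
        _mm256_storeu_ps(x + i, _mm256_blendv_ps(f, t, mask));
    }
    // (scalar remainder loop omitted for brevity)
}
```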
Terminology:
- Instruction stream coherence: necessary for efficient use of SIMD processing, but unnecessary for efficient parallelization across cores
- Divergent execution: a lack of instruction stream coherence
instruction stream coherence vs cache coherence: despite the similar names, these are unrelated concepts
- explicit SIMD: SIMD parallelization is performed at compile time
- implicit SIMD (GPU): the compiler generates a scalar binary, and the interface to the hardware is data-parallel
Summary: Several forms of parallel execution:
- multicore: use multiple processing cores, providing thread-level parallelism (e.g. pthreads; see the sketch after this list)
- SIMD: use multiple ALUs
- superscalar: exploit ILP within an instruction stream
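Since the summary names pthreads as the mechanism for multicore parallelism, here is a minimal sketch (hypothetical partial-sum workload) of spawning one extra worker thread:

```cpp
#include <pthread.h>
#include <cstdio>

struct Args { const float* data; int begin, end; float sum; };

// Each thread sums its assigned range of the array.
void* partial_sum(void* p) {
    Args* a = static_cast<Args*>(p);
    a->sum = 0.0f;
    for (int i = a->begin; i < a->end; i++) a->sum += a->data[i];
    return nullptr;
}

int main() {
    const int N = 1 << 20;
    static float data[N];
    for (int i = 0; i < N; i++) data[i] = 1.0f;

    Args lo{data, 0, N / 2, 0.0f};   // spawned thread's half
    Args hi{data, N / 2, N, 0.0f};   // main thread's half

    pthread_t tid;
    pthread_create(&tid, nullptr, partial_sum, &lo);  // thread-level parallelism
    partial_sum(&hi);                                 // main thread works too
    pthread_join(tid, nullptr);

    std::printf("sum = %f\n", lo.sum + hi.sum);  // expect 1048576.0
    return 0;
}
```

Compile with -lpthread.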
Part 2: accessing memory
Terminology:
- Memory latency: time between a request for a value and the return of that value
- Memory bandwidth: the rate at which data can be read from or stored into a semiconductor memory by a processor
- stall: the processor waits for data to be returned from memory before it can proceed
Caches:
- caches reduce the length of stalls (they reduce memory access latency)
- prefetching and multithreading hide stalls (they hide memory access latency); see the prefetching sketch below
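A minimal software-prefetching sketch (GCC/Clang __builtin_prefetch, hypothetical prefetch distance): data is requested a few iterations ahead so load latency overlaps with useful work instead of stalling the core:

```cpp
#include <cstddef>

float sum_with_prefetch(const float* data, std::size_t n) {
    const std::size_t dist = 16;  // prefetch distance: tune for the machine
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&data[i + dist]);  // hint: start the load early
        sum += data[i];
    }
    return sum;
}
```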
Idea 1: interleave processing of multiple threads on the same core to hide stalls
switch to another thread when the current thread is waiting for memory
Hardware-supported multithreading:
- the core manages execution contexts for multiple threads
- Interleaved multithreading: switch between threads on each cycle
- Simultaneous multithreading (SMT): execute instructions from multiple threads in the same cycle
Multithreading summary:
benefits: a core's ALUs are used more effectively
costs: additional storage for thread execution contexts, increased run time of any single thread, heavier reliance on memory bandwidth
CPU vs GPU memory hierarchy:
- CPU: L1, L2, L3 cache, main memory
- GPU: L1, L2 cache, global memory, texture memory, constant memory, shared memory, register file
Although the GPU can perform the computation faster, it is limited by memory bandwidth.
Bandwidth is critical!
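A back-of-the-envelope worked example (hypothetical bandwidth number) of why such programs are bandwidth-bound:

```cpp
// Element-wise multiply: a textbook bandwidth-bound kernel.
// Per element: 2 loads + 1 store of 4-byte floats = 12 bytes of traffic
// for one multiply (arithmetic intensity = 1 op / 12 bytes).
// At a hypothetical 200 GB/s, memory can feed at most
// 200e9 / 12 ~= 16.7 billion multiplies per second -- far below the
// multi-teraflop ALU peak of a modern GPU, so the ALUs sit idle.
void mul(const float* a, const float* b, float* c, long n) {
    for (long i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}
```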
Summary:
- three major ideas that all modern processors employ:
- employ multiple processing cores
- amortize instruction stream management across many ALUs
- use multi-threading to hide memory latency
- Due to high arithmetic capability on modern chips, many parallel programs are limited by memory bandwidth
- GPU architecture use the same throughput computing principles as multicore CPUs, but GPUs push these ideas to the extreme