15418

CMU 15-418/618 23sp
https://www.bilibili.com/video/BV1SM4y1j7XL/

1 Why parallelism

A Brief History of Processor Performance

Initial focus:
1. Supercomputers for scientific computing (1970s)
2. Databases (1990s)

Sources of single-thread performance gains:
Wider data paths: 4-bit to 64-bit
More efficient pipelining: lower CPI (cycles per instruction)
Exploiting ILP: superscalar processing
Examples of ILP techniques:
pipelining, superscalar execution, VLIW (Very Long Instruction Word), vector processing, out-of-order execution
Faster clock rates: up to ~3 GHz

Obstacles (why single-thread scaling ended):
power density wall, no further benefit from ILP, processor clock rates stopped increasing

What is a parallel computer

Definition: a collection of processing elements that cooperate to solve problems quickly (efficiency & performance)

Motivation: Speedup
Speedup (P cores) = execution time (1 core) / execution time (P cores)
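
A quick worked example (the numbers are invented for illustration, not from the lecture): if one core takes 10 s and four cores take 4 s, then

    Speedup(4 cores) = 10 s / 4 s = 2.5x

well short of the ideal 4x; the factors below explain the gap.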

Factors that influence speedup:

  • Communication (often dominating)
  • Work assignment (load balance)
  • Communication interval

Course theme 1: Designing and writing parallel programs ("parallel thinking")
  • Decomposing work into pieces
  • Assigning work to workers
  • Managing communication/synchronization
  • Abstractions
Course theme 2: Parallel computer hardware implementation
  • Mechanisms
Course theme 3: Efficiency
  • FAST != EFFICIENT
  • Hardware view: silicon area, power, efficiency

2 A modern multicore processor

Four key concepts: two about parallel execution, two about accessing memory
Superscalar processor: executes multiple instructions per cycle by exploiting ILP within a single instruction stream.

Idea 1: Use the increasing transistor count to add more cores to the processor.
forall declaration: a parallel for loop whose iterations may run on different cores (see the sketch below).
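
A minimal sketch of the forall/parallel-for abstraction, assuming an OpenMP implementation (the function and array names are my own, not from the lecture):

    #include <cmath>

    // Apply sin(x) element-wise; the pragma asks the runtime to split
    // the independent iterations across the available cores.
    void sinx_parallel(int n, const float* x, float* result) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            result[i] = std::sin(x[i]);  // no iteration depends on another
        }
    }

Compile with -fopenmp (gcc/clang). The abstraction only declares that iterations are independent; the compiler/runtime decides how to map them to cores.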

Idea 2: Amortize the cost/complexity of managing an instruction stream across many ALUs (e.g. SIMD processing, vector processing).
Notes:

  1. The abstraction facilitates automatic generation of multicore parallel code and vector instructions that use the SIMD processing capabilities within a core.
  2. Vector processing with conditional execution: all lanes must walk through both sides of a branch (with inactive lanes masked off), so efficiency is lost; see the sketch after this list.
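
A minimal sketch of masked conditional execution using x86 AVX intrinsics (the kernel itself is made up for illustration): both branches are computed for all eight lanes, then a per-lane mask selects the result. Computing both paths is exactly where the efficiency loss comes from.

    #include <immintrin.h>

    // Per lane: result = (x > 0) ? x * 2 : x * -1
    // SIMD lanes cannot branch independently, so we evaluate BOTH
    // branches for all 8 float lanes and blend by a comparison mask.
    // (Tail elements when n % 8 != 0 are omitted for brevity.)
    void conditional_simd(const float* x, float* result, int n) {
        __m256 zero = _mm256_setzero_ps();
        __m256 two  = _mm256_set1_ps(2.0f);
        __m256 neg  = _mm256_set1_ps(-1.0f);
        for (int i = 0; i + 8 <= n; i += 8) {
            __m256 v    = _mm256_loadu_ps(x + i);
            __m256 mask = _mm256_cmp_ps(v, zero, _CMP_GT_OQ); // lanes where x > 0
            __m256 a    = _mm256_mul_ps(v, two);  // "then" branch, all lanes
            __m256 b    = _mm256_mul_ps(v, neg);  // "else" branch, all lanes
            _mm256_storeu_ps(result + i, _mm256_blendv_ps(b, a, mask));
        }
    }

Compile with -mavx. With coherent input (all lanes take the same branch) the wasted work is one multiply per lane; with divergent input, every lane pays for both branches every iteration.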

Terminology:

  • Instruction stream coherence: necessary for efficient use of SIMD processing; unnecessary for efficient parallelization across cores
  • Divergent execution: a lack of instruction stream coherence

Note: instruction stream coherence is unrelated to cache coherence, despite the shared word.

  • explicit SIMD: SIMD parallelization is performed at compile time (vector instructions appear directly in the binary)
  • implicit SIMD (GPU): the compiler generates a scalar binary, and the interface to the hardware is data-parallel ("run this program on N data elements")

Summary: Several forms of parallel execution:

  • multicore: use multiple processing cores, providing thread-level parallelism (e.g. pthreads; a sketch follows this list)
  • SIMD: use multiple ALUs
  • superscalar: exploit ILP within an instruction stream
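
A minimal sketch of thread-level parallelism with pthreads (the names and the fixed two-thread split are my own): each thread is an independent instruction stream that can run on its own core.

    #include <pthread.h>
    #include <cmath>

    struct Args { int start, end; const float* x; float* result; };

    // Each thread computes sin over its assigned slice of the array.
    void* work(void* p) {
        Args* a = static_cast<Args*>(p);
        for (int i = a->start; i < a->end; i++)
            a->result[i] = std::sin(a->x[i]);
        return nullptr;
    }

    void sinx_two_threads(int n, const float* x, float* result) {
        pthread_t t;
        Args lo{0, n / 2, x, result};   // first half on a spawned thread
        Args hi{n / 2, n, x, result};   // second half on the calling thread
        pthread_create(&t, nullptr, work, &lo);
        work(&hi);
        pthread_join(t, nullptr);
    }

Unlike the forall abstraction, here the programmer explicitly decomposes and assigns the work.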

Part 2: accessing memory

Terminology:

  • Memory latency: the time between a request for a value and the return of that value
  • Memory bandwidth: the rate at which data can be read from or stored into memory by a processor
  • Stall: a processor waiting for data to be returned from memory

Caches reduce the length of stalls (i.e., they reduce effective memory access latency).
Prefetching and multithreading do not avoid the memory accesses; they hide the latency of stalls.

Idea 1: interleave processing of multiple threads on one core to hide stalls:
switch to another thread when the current thread is waiting for memory.

Hardware-supported multithreading:

  • The core manages the execution contexts of multiple threads
  • Interleaved multithreading: switch between threads on each cycle
  • Simultaneous multithreading (SMT): execute instructions from multiple threads in the same cycle

Multithreading summary:
Benefits: uses a core's ALUs more effectively by hiding memory latency
Costs: additional storage for thread state; increased run time of any single thread; heavy reliance on memory bandwidth

CPU vs GPU memory hierarchy:

  • CPU: L1, L2, L3 caches, main memory
  • GPU: L1, L2 caches, global memory, texture memory, constant memory, shared memory, register file
    Although the GPU can do the computation faster, it is often limited by memory bandwidth.
    Bandwidth is critical!
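
A back-of-the-envelope illustration of a bandwidth-bound kernel (the 500 GB/s figure is an assumed round number, not a spec from the notes):

    // C[i] = A[i] * B[i]: per element, load 8 bytes (A[i], B[i]) and
    // store 4 bytes (C[i]) -> 12 bytes of traffic per 1 multiply.
    // At an assumed 500 GB/s of memory bandwidth, the kernel sustains
    // at most ~500e9 / 12 ≈ 42 GFLOP/s, no matter how many TFLOP/s of
    // ALU throughput the chip has: it is bandwidth-bound.
    void multiply(int n, const float* A, const float* B, float* C) {
        for (int i = 0; i < n; i++)
            C[i] = A[i] * B[i];
    }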

Summary:

  • three major ideas that all modern processors employ:
    1. employ multiple processing cores
    2. amortize instruction stream management across many ALUs
    3. use multi-threading to hide memory latency
  • Because of the high arithmetic capability of modern chips, many parallel programs are limited by memory bandwidth
  • GPU architectures use the same throughput-computing principles as multicore CPUs, but GPUs push these ideas to the extreme

3 Parallel programming abstractions
