Outline of Material for Final
Architectural Models of Parallel Computers
- Flynn's Taxonomy (SISD, SIMD, MIMD)
- Understand discussion on pages 25-27 in text.
- Also understand SPMD (Single Program Multiple Data)
- Memory Access Architectures
Understand:
- Shared Memory Multiprocessor System [Figure 1.3]
- single address space
- uniform memory access (UMA)/ nonuniform memory access (NUMA)
- distributed shared memory system
- Message Passing Multicomputer
- How is this different from a shared memory system?
- What are the advantages/disadvantages of message passing vs. shared memory?
- Network Models
- Be able to compute Diameter, Bisection Width, and
Cost for Linear, Ring, Mesh, Torus, Tree, and
Hypercube networks.
- Understand how to construct a d-dimensional
hypercube recursively.
- Be able to define and compute Diameter,
Connectivity, Bisection Width, Bisection Bandwidth,
and Cost for any arbitrary network.
- Communication Cost Models for Static Interconnection
Networks
Know and Understand:
- communication latency, ts + m tw (worked example below)
- startup time, ts
- per-word transfer time, tw
- How does bisection width relate to tw in the above formula?
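  A quick worked example of the latency formula above (hypothetical
  parameter values): with ts = 10 and tw = 2 time units, sending a
  single message of m = 100 words over one link costs
  ts + m tw = 10 + 100*2 = 210 time units. Note how the startup term
  dominates for short messages and the per-word term for long ones.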
Performance Models of Parallel Computers
- PRAM Models
- Be able to discuss the PRAM machine architecture and
assumptions (See Handout from Algorithms text)
- Understand Types of PRAM models:
Combining CRCW, Priority CRCW, Arbitrary CRCW,
Common CRCW, CREW, and EREW.
- Understand and be able to apply Brent's theorem.
- Review Homework problem!
- LogP Model
- Be able to define the Model Parameters, L, o, g, and
P.
- Understand the discussion of estimating the running time
  of the broadcast/summing under the LogP model as discussed in
  class and in the LogP paper (a worked example follows this list).
- Be able to use the LogP model to compute running
times of programs.
- Review Homework problem!
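  A worked example, using the parameter values from the broadcast
  example in the LogP paper (P = 8, L = 6, g = 4, o = 2): one
  point-to-point message is fully received 2o + L = 10 units after the
  send begins, and a processor may start a new send every g = 4 units.
  The root injects messages at times 0, 4, 8, 12, which are received at
  times 10, 14, 18, 22; each recipient immediately retransmits, so the
  node finishing at 10 reaches two more nodes at 20 and 24, the node
  finishing at 14 reaches one more at 24, and the optimal broadcast
  tree covers all 8 processors by time 24.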
- BSP Model
- Be able to discuss the BSP execution model. (What is
the BSP execution model and what does it consist of?)
- Be able to use the cost model for BSP [w + hg + l] to
  estimate running times of BSP algorithms (a worked example follows this list).
- What is an h-relation? A 1-relation? A p-relation?
- What is a super-step in the BSP model?
- What is parallel slackness?
- What role does the barrier operation play in the BSP model?
- Review Homework problem!
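  A worked BSP example (made-up parameter values): for a superstep in
  which the largest local computation is w = 1000 operations, the
  largest number of words any processor sends or receives is h = 20,
  the per-word communication cost is g = 5, and the barrier cost is
  l = 100, the superstep costs w + hg + l = 1000 + 100 + 100 = 1200
  time units. The total running time is the sum of these costs over
  all supersteps.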
MPI and Message Passing
- Be able to discuss how deadlock is possible when using MPI_Send()
  and MPI_Recv() as discussed in class (a code sketch follows this list).
- Be able to describe the minimum set of calls required of any MPI program, e.g. MPI_Init, MPI_Finalize, etc.
- Be able to discuss the role of buffering in message passing computations.
- Review Homework problems!
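  The sketch below (not from the course materials) shows the classic
  exchange deadlock: if both ranks call a blocking MPI_Send() first and
  the implementation does not buffer the message, neither rank ever
  reaches its MPI_Recv(). MPI_Sendrecv() is one safe alternative;
  reordering the calls so that one rank receives first is another.

      /* Two-rank exchange: the commented-out ordering can deadlock. */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
          int rank, partner, sendbuf = 42, recvbuf;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          partner = 1 - rank;           /* assumes exactly two ranks */

          /* Deadlock-prone: both ranks send first; whether this hangs
             depends on whether the implementation buffers the message.
          MPI_Send(&sendbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
          MPI_Recv(&recvbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
          */

          /* Safe: MPI pairs the send and the receive internally. */
          MPI_Sendrecv(&sendbuf, 1, MPI_INT, partner, 0,
                       &recvbuf, 1, MPI_INT, partner, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("rank %d received %d\n", rank, recvbuf);
          MPI_Finalize();
          return 0;
      }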
Programming Patterns
- Background Information
- What is a task-dependency graph?
- What is a decomposition of a task-dependency graph?
- What is fine-grained vs. coarse-grained decomposition?
- What is degree of concurrency?
- What is the critical path of a task graph?
- What is a task-interaction graph and how does it relate to a task-dependency graph?
- What are a process and a processor? How are they different?
- Decomposition Techniques
- Be able to define and give examples of:
- recursive decomposition
- data decomposition
- exploratory decomposition
- speculative decomposition
- In data-decomposition, how does data partitioning relate to task partitioning? What is the meaning of partitioning input data or output data? What is the owner computes rule?
- What problems are suitable to exploratory decomposition?
- What problems are suitable for speculative decomposition?
- Static Mapping Techniques
- What are static mapping techniques?
- Be able to define and describe (see the sketch after this list):
- block distributions
- cyclic distributions
- block-cyclic
- random
- How do we partition irregular data structures?
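  A small sketch (hypothetical helper functions, not from the text)
  showing which process owns element i of an n-element array under each
  distribution, for p processes and block size b:

      #include <stdio.h>

      /* Block: each process owns one contiguous chunk of ~n/p elements. */
      int block_owner(int i, int n, int p)  { return i / ((n + p - 1) / p); }

      /* Cyclic: elements are dealt out round-robin, one at a time. */
      int cyclic_owner(int i, int p)        { return i % p; }

      /* Block-cyclic: blocks of b elements are dealt out round-robin. */
      int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }

      int main(void) {
          int n = 16, p = 4, b = 2;
          for (int i = 0; i < n; i++)
              printf("i=%2d  block=%d  cyclic=%d  block-cyclic=%d\n", i,
                     block_owner(i, n, p), cyclic_owner(i, p),
                     block_cyclic_owner(i, b, p));
          return 0;
      }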
- Dynamic Mapping Techniques
- What are the differences between centralized and distributed schemes? What are the advantages/disadvantages of each?
- What is self scheduling and chunk scheduling?
- Why not use dynamic mapping for all problems?
- Methods for containing overheads
- Be able to discuss, compare, and contrast the following:
- Maximizing Data Locality
- Minimizing Volume of Data-Exchange
- Minimizing Frequency of Interactions
- Minimizing Contention and Hot Spots
- Overlapping Computations and Communications
- Replicating Data or Computations
- Using Optimized Collective Interaction Operations
- Overlapping Interactions with Other Interactions
- Parallel Programming Patterns
- Be able to discuss, compare, and contrast the following:
- The Data Parallel Model
- The Task Graph Model
- The Work Pool Model
- The Master-Slave Model (an MPI master-worker sketch follows this list)
- The Pipeline or Producer-Consumer Model
- Hybrid models (be able to give an example of a hybrid approach)
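  A minimal master-worker (work pool) sketch in MPI, assuming integer
  tasks and a made-up task count NTASKS; the master hands out work on
  demand, which is what gives the pattern its automatic load balancing:

      #include <mpi.h>
      #define NTASKS 100                /* made-up task count */
      #define TAG_WORK 1
      #define TAG_STOP 2

      int main(int argc, char **argv) {
          int rank, size;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);
          if (rank == 0) {              /* master (assumes size-1 <= NTASKS) */
              int next = 0, result;
              MPI_Status st;
              for (int w = 1; w < size; w++) {  /* seed one task per worker */
                  MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                  next++;
              }
              for (int done = 0; done < NTASKS; done++) {
                  MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                           MPI_COMM_WORLD, &st);
                  int tag = (next < NTASKS) ? TAG_WORK : TAG_STOP;
                  MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, tag,
                           MPI_COMM_WORLD);
                  if (tag == TAG_WORK) next++;
              }
          } else {                      /* worker: loop until told to stop */
              int task, result;
              MPI_Status st;
              while (1) {
                  MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG,
                           MPI_COMM_WORLD, &st);
                  if (st.MPI_TAG == TAG_STOP) break;
                  result = task * task; /* stand-in for real work */
                  MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
              }
          }
          MPI_Finalize();
          return 0;
      }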
Parallel Collective Communication Primitives
- Be able to define One-to-All, All-to-All, scatter, gather
and All-to-All Personalized communication
primitives. (In other words, be able to reconstruct Figures
4.1, 4.8, 4.14, and 4.16 from the text.)
- One-to-All Broadcast
- Be able to describe and analyze the Hypercube One-to-All
broadcast algorithm.
- Be prepared to answer questions about the algorithms as
listed in program form in Algorithm 4.1, and 4.2 from the text.
- Understand how the One-to-All broadcast algorithm can be
used as a basis for performing the Single-Node
Accumulation.
- Understand and be prepared to discuss the use of relabeling processor IDs using XOR to achieve the generalized algorithm (see the sketch below)
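  A C/MPI sketch along the lines of the textbook's Algorithm 4.1
  (assuming MPI ranks coincide with the hypercube node labels): a
  one-to-all broadcast of one int from node 0 on a d-dimensional
  hypercube. For a source s other than 0, run the same code on the
  relabeled ID my_id XOR s when computing masks and partners.

      #include <mpi.h>

      /* At step k = d-1 .. 0, every node whose lower k bits are zero is
         active and exchanges across dimension k, doubling the set of
         nodes that hold the message. */
      void one_to_all_broadcast(int my_id, int d, int *msg) {
          int mask = (1 << d) - 1;              /* all d bits set        */
          for (int k = d - 1; k >= 0; k--) {
              mask ^= (1 << k);                 /* clear bit k           */
              if ((my_id & mask) == 0) {        /* lower k bits are zero */
                  int partner = my_id ^ (1 << k);
                  if ((my_id & (1 << k)) == 0)  /* I already hold msg    */
                      MPI_Send(msg, 1, MPI_INT, partner, k, MPI_COMM_WORLD);
                  else                          /* partner holds it      */
                      MPI_Recv(msg, 1, MPI_INT, partner, k, MPI_COMM_WORLD,
                               MPI_STATUS_IGNORE);
              }
          }
      }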
- All-to-All Broadcast
- Understand the All-to-All broadcast algorithms for the
ring and Hypercube topologies. In particular, pay attention to message
sizes at each step of the algorithm.
- Be prepared to answer questions about the algorithms as
listed in program form in Algorithm 4.4, 4.5, and 4.6 from the text.
- Be able to describe how the All-to-All broadcast can be
  used as a basis for performing the Reduction or
  Multinode Accumulation operations.
- Know the difference between an All-to-All reduction and an All-Reduce
- One-to-All Personalized Communication
- Understand how this algorithm is a variation of the
One-to-All broadcast except with shrinking message
sizes.
- Be able to derive and analyze this algorithm on a
Hypercube network by generalizing the
One-to-All algorithm results.
- All-to-All Personalized Communications
- Be able to describe and analyze the Hypercube algorithm.
- Be able to describe and analyze the E-cube routing algorithm.
- Be able to generalize the ts + m tw performance model to take into account network bisection width.
- Know Tables 4.1 and 4.2 for the algorithms we have covered.
- Be able to derive timing measurements for any of the above algorithms!
- Study the Homework Assignments!
Performance and Scalability of Parallel Systems
- Be able to define Speedup, Efficiency, Cost, Cost Optimal, Problem Size,
Overhead Function, Isoefficiency Function, and sequential fraction.
- Be able to discuss Amdahl's Law in terms of serial fraction (a worked example follows this list).
- Be able to determine if an algorithm is Cost Optimal.
- Be able to determine if an algorithm has an Isoefficiency
function
- Be able to derive the overhead function and use this function
to determine the existence of an Isoefficiency function. Be
able to show that the overhead function is related to the
isoefficiency function when one exists.
- Understand why there is a lower bound on the isoefficiency
function as described in section 5.4.4 from text.
- What is the degree of concurrency of a parallel algorithm?
- What is the relationship between degree of concurrency and the
isoefficiency function? See section 5.4.5 from text.
- What are the sources of parallel overhead? ANS: Communication
time, Load Imbalance, and additional computations performed by
the parallel algorithm.
- What is scaled speedup? How do we measure it?
- What is memory and time-constrained scaled speedup? How do we measure these?
- Study the Homework Assignments!
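  A worked Amdahl's Law example (made-up numbers): with serial fraction
  f = 0.1 and p = 16 processors, speedup
  S = 1 / (f + (1 - f)/p) = 1 / (0.1 + 0.9/16) = 1 / 0.15625 = 6.4,
  and no matter how large p grows, S can never exceed 1/f = 10.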
Parallel Sorting Algorithms
- Fundamentals
- Be able to define internal and external sorting algorithms.
- What is the compare-exchange operation? How is it implemented in
  parallel?
- What is the compare-split operation? How much time does it
  require? (A code sketch follows this sub-list.)
- What is a comparator? How do you draw one? What is an increasing
  or decreasing comparator and how do you represent (draw) it?
- What does it mean to enumerate processors with respect to
sorting algorithms? Why is it important for parallel sorting
algorithms?
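  A sketch (hypothetical buffer handling, not from the text) of the
  compare-split step between two processes that each hold a sorted
  block of k elements: after exchanging blocks, the lower-ranked
  process keeps the k smallest of the combined 2k elements and its
  partner keeps the k largest.

      #include <mpi.h>
      #include <stdlib.h>

      static int cmp_int(const void *a, const void *b) {
          int x = *(const int *)a, y = *(const int *)b;
          return (x > y) - (x < y);
      }

      void compare_split(int *mine, int k, int partner, int keep_small) {
          int *theirs = malloc(k * sizeof *theirs);
          int *all    = malloc(2 * k * sizeof *all);
          /* Exchange whole blocks: Theta(k) communication. */
          MPI_Sendrecv(mine, k, MPI_INT, partner, 0,
                       theirs, k, MPI_INT, partner, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          for (int i = 0; i < k; i++) { all[i] = mine[i]; all[k+i] = theirs[i]; }
          qsort(all, 2 * k, sizeof *all, cmp_int); /* an O(k) merge would
                                                      suffice, since both
                                                      blocks arrive sorted */
          for (int i = 0; i < k; i++)              /* keep low or high half */
              mine[i] = keep_small ? all[i] : all[k + i];
          free(theirs);
          free(all);
      }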
- Sorting Networks
- Be able to define a bitonic sequence, bitonic split and bitonic
merge.
- What is a bitonic merging network? What does it do? What is its
size? What is its depth? What does it look like?
- What is a bitonic sorting network? What does it do? What is its
  size? What is its depth? What does it look like? (Worked sizes for
  n = 16 follow this sub-list.)
- How do we map a bitonic sorting network onto a hypercube? What
  do the communication patterns look like?
- Be able to derive execution times for the bitonic sorting algorithm
on a hypercube architecture. Be able to discuss both mapping one
element per processor and many elements per processor.
- Be able to perform a scalability analysis of the bitonic sorting
algorithm on the hypercube architecture.
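  A quick size/depth check for n = 16 inputs: a bitonic merging network
  has depth log n = 4 and (n/2) log n = 32 comparators; the full
  bitonic sorting network has depth (log n)(log n + 1)/2 = 10 stages
  and (n/2) * 10 = 80 comparators. In the hypercube mapping, each
  comparator connects wires whose labels differ in exactly one bit, so
  every compare-exchange travels along a single hypercube link.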
- Quick-sort
- Why is it difficult to develop an efficient parallel quick-sort
algorithm by simply running independent subproblems in
parallel?
- What steps are involved in the shared memory quick-sort algorithm?
- Why is pivot selection critical for an efficient sorting
algorithm?
- What is the scalability of the quick-sort algorithm?
- Be prepared to estimate parallel running times for shared memory and distributed memory based quick-sort algorithms as discussed in the text.
- Bucket Sort and Sample Sort
- What is the bucket sort? What assumptions does it make? What is
its serial run time?
- What steps are involved in the parallel bucket sort algorithm?
- What is a Sample Sort, how does it differ from Bucket Sort?
- How do we form a sample set in the Sample Sort? How big
  is the sample? (A worked example follows this sub-list.)
- How do we sort the Sample in Sample Sort?
- What is the performance bottleneck for the Sample Sort?
- Be able to estimate running time for these algorithms.
- Be able to perform a scalability analysis of these algorithms.
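  A small worked example of the sampling step (made-up numbers, using
  the regular-sampling scheme): with p = 4 processes and n = 64 keys,
  each process sorts its 16 local keys and contributes p - 1 = 3 evenly
  spaced samples, for p(p - 1) = 12 samples in all; the combined sample
  is sorted and p - 1 = 3 evenly spaced splitters are drawn from it to
  define the 4 buckets. Gathering and sorting the sample on one process
  is the usual bottleneck as p grows.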
Dense Matrix Algorithms
- What is one-dimensional and two-dimensional partitioning?
- Be able to describe and analyze one-dimensional partitioned dense matrix-vector multiplication for both row and column decompositions.
- Review the concurrency constraints on the isoefficiency function. Be able to derive the isoefficiency bounds due to concurrency for all of the parallel matrix algorithms.
- Be able to describe and analyze the two-dimensional partitioned dense matrix-vector multiplication algorithm.
- Be able to describe and analyze the simple dense matrix-matrix multiplication algorithm.
- Be able to describe and analyze Cannon's algorithm for dense matrix-matrix multiplication (a sketch follows this list).
- Be able to describe and analyze the DNS algorithm for matrix-matrix multiplication.
- Be able to compare and contrast various algorithms (from the set above) for dense matrix-matrix multiply.
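  A sketch of Cannon's algorithm on a q x q process grid, q = sqrt(p),
  where shift_left(), shift_up(), and local_matmul() are hypothetical
  helpers; in MPI the cyclic shifts would typically be
  MPI_Sendrecv_replace() calls on a Cartesian communicator.

      /* Each process at grid position (row, col) holds one
         (n/q) x (n/q) block of A, B, and C. */
      void cannon(int row, int col, int q,
                  double *A, double *B, double *C) {
          shift_left(A, row);          /* align: shift A's block left by row */
          shift_up(B, col);            /* align: shift B's block up by col   */
          for (int step = 0; step < q; step++) {
              local_matmul(C, A, B);   /* C += A * B on the local blocks     */
              shift_left(A, 1);        /* rotate A one position left         */
              shift_up(B, 1);          /* rotate B one position up           */
          }
      }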
Study Homework Assignments!