Outline of Material for Final
Architectural Models of Parallel Computers
- Flynn's Taxonomy (SISD, SIMD, MIMD)
- Understand discussion on pages 25-27 in text.
- Also understand SPMD (Single Program Multiple Data)
- Memory Access Architectures
Understand:
- Shared Memory Multiprocessor System [Figure 1.3]
- single address space
- uniform memory access (UMA)/ nonuniform memory access (NUMA)
- distributed shared memory system
- Message Passing Multicomputer
- How is this different from a shared memory system?
- What are the advantages/disadvantages of message passing vs. shared memory?
- Network Models
- Be able to compute Diameter, Bisection Width, and
Cost for Linear, Ring, Mesh, Torus, Tree, and
Hypercube networks.
- Understand how to construct a d-dimensional
hypercube recursively.
- Be able to define and compute Diameter,
Connectivity, Bisection Width, Bisection Bandwidth,
and Cost for any arbitrary network.
- Communication Cost Models for Static Interconnection
Networks
Know and Understand:
- communication latency, ts + m tw (worked example below)
- startup time, ts
- per-word transfer time, tw
- How does bisection width relate to tw in the above formula?
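  A quick worked example of the latency formula above (hypothetical
  parameter values): with ts = 10 and tw = 2 time units, sending a
  single message of m = 100 words over one link costs
  ts + m tw = 10 + 100*2 = 210 time units. Note how the startup term
  dominates for short messages and the per-word term for long ones.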
Performance Models of Parallel Computers
- PRAM Models
- Be able to discuss the PRAM machine architecture and
assumptions (See Handout from Algorithms text)
- Understand Types of PRAM models:
Combining CRCW, Priority CRCW, Arbitrary CRCW,
Common CRCW, CREW, and EREW.
- Understand and be able to apply Brent's theorem.
- Review Homework problem!
- LogP Model
- Be able to define the Model Parameters, L, o, g, and
P.
- Understand the discussion of estimating the running time
  of the broadcast/summing under the LogP model as discussed in
  class and in the LogP paper (a worked example follows this list).
- Be able to use the LogP model to compute running
times of programs.
- Review Homework problem!
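  A worked example, using the parameter values from the broadcast
  example in the LogP paper (P = 8, L = 6, g = 4, o = 2): one
  point-to-point message is fully received 2o + L = 10 units after the
  send begins, and a processor may start a new send every g = 4 units.
  The root injects messages at times 0, 4, 8, 12, which are received at
  times 10, 14, 18, 22; each recipient immediately retransmits, so the
  node finishing at 10 reaches two more nodes at 20 and 24, the node
  finishing at 14 reaches one more at 24, and the optimal broadcast
  tree covers all 8 processors by time 24.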
- BSP Model
- Be able to discuss the BSP execution model. (What is
the BSP execution model and what does it consist of?)
- Be able to use the cost model for BSP [w + hg + l] to
  estimate running times of BSP algorithms (a worked example follows this list).
- What is an h-relation? A 1-relation? A p-relation?
- What is a super-step in the BSP model?
- What is parallel slackness?
- What role does the barrier operation play in the BSP model?
- Review Homework problem!
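  A worked BSP example (made-up parameter values): for a superstep in
  which the largest local computation is w = 1000 operations, the
  largest number of words any processor sends or receives is h = 20,
  the per-word communication cost is g = 5, and the barrier cost is
  l = 100, the superstep costs w + hg + l = 1000 + 100 + 100 = 1200
  time units. The total running time is the sum of these costs over
  all supersteps.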
MPI and Message Passing
- Be able to discuss how deadlock is possible when using MPI_Send()
  and MPI_Recv() as discussed in class (a code sketch follows this list).
- Be able to describe the minimum set of calls required of any MPI program, e.g. MPI_Init, MPI_Finalize, etc.
- Be able to discuss the role of buffering in message passing computations.
- Review Homework problems!
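  The sketch below (not from the course materials) shows the classic
  exchange deadlock: if both ranks call a blocking MPI_Send() first and
  the implementation does not buffer the message, neither rank ever
  reaches its MPI_Recv(). MPI_Sendrecv() is one safe alternative;
  reordering the calls so that one rank receives first is another.

      /* Two-rank exchange: the commented-out ordering can deadlock. */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
          int rank, partner, sendbuf = 42, recvbuf;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          partner = 1 - rank;           /* assumes exactly two ranks */

          /* Deadlock-prone: both ranks send first; whether this hangs
             depends on whether the implementation buffers the message.
          MPI_Send(&sendbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
          MPI_Recv(&recvbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
          */

          /* Safe: MPI pairs the send and the receive internally. */
          MPI_Sendrecv(&sendbuf, 1, MPI_INT, partner, 0,
                       &recvbuf, 1, MPI_INT, partner, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("rank %d received %d\n", rank, recvbuf);
          MPI_Finalize();
          return 0;
      }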
Programming Patterns
- Background Information
- What is a task-dependency graph?
- What is a decomposition of a task-dependency graph?
- What is fine-grained vs. coarse-grained decomposition?
- What is degree of concurrency?
- What is the critical path of a task graph?
- What is a task-interaction graph and how does it relate to a task-dependency graph?
- What are a process and a processor? How are they different?
- Decomposition Techniques
- Be able to define and give examples of:
- recursive decomposition
- data decomposition
- exploratory decomposition
- speculative decomposition
- In data-decomposition, how does data partitioning relate to task partitioning? What is the meaning of partitioning input data or output data? What is the owner computes rule?
- What problems are suitable to exploratory decomposition?
- What problems are suitable for speculative decomposition?
- Static Mapping Techniques
- What are static mapping techniques?
- Be able to define and describe (see the sketch after this list):
- block distributions
- cyclic distributions
- block-cyclic
- random
- How do we partition irregular data structures?
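  A small sketch (hypothetical helper functions, not from the text)
  showing which process owns element i of an n-element array under each
  distribution, for p processes and block size b:

      #include <stdio.h>

      /* Block: each process owns one contiguous chunk of ~n/p elements. */
      int block_owner(int i, int n, int p)  { return i / ((n + p - 1) / p); }

      /* Cyclic: elements are dealt out round-robin, one at a time. */
      int cyclic_owner(int i, int p)        { return i % p; }

      /* Block-cyclic: blocks of b elements are dealt out round-robin. */
      int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }

      int main(void) {
          int n = 16, p = 4, b = 2;
          for (int i = 0; i < n; i++)
              printf("i=%2d  block=%d  cyclic=%d  block-cyclic=%d\n", i,
                     block_owner(i, n, p), cyclic_owner(i, p),
                     block_cyclic_owner(i, b, p));
          return 0;
      }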
- Dynamic Mapping Techniques
- What are the differences between centralized and distributed schemes? What are the advantages/disadvantages of each?
- What is self scheduling and chunk scheduling?
- Why not use dynamic mapping for all problems?
- Methods for containing overheads
- Be able to discuss, compare, and contrast the following:
- Maximizing Data Locality
- Minimizing Volume of Data-Exchange
- Minimizing Frequency of Interactions
- Minimizing Contention and Hot Spots
- Overlapping Computations and Communications
- Replicating Data or Computations
- Using Optimized Collective Interaction Operations
- Overlapping Interactions with Other Interactions
- Parallel Programming Patterns
- Be able to discuss, compare, and contrast the following:
- The Data Parallel Model
- The Task Graph Model
- The Work Pool Model
- The Master-Slave Model (an MPI master-worker sketch follows this list)
- The Pipeline or Producer-Consumer Model
- Hybrid models (be able to give an example of a hybrid approach)
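  A minimal master-worker (work pool) sketch in MPI, assuming integer
  tasks and a made-up task count NTASKS; the master hands out work on
  demand, which is what gives the pattern its automatic load balancing:

      #include <mpi.h>
      #define NTASKS 100                /* made-up task count */
      #define TAG_WORK 1
      #define TAG_STOP 2

      int main(int argc, char **argv) {
          int rank, size;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);
          if (rank == 0) {              /* master (assumes size-1 <= NTASKS) */
              int next = 0, result;
              MPI_Status st;
              for (int w = 1; w < size; w++) {  /* seed one task per worker */
                  MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                  next++;
              }
              for (int done = 0; done < NTASKS; done++) {
                  MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                           MPI_COMM_WORLD, &st);
                  int tag = (next < NTASKS) ? TAG_WORK : TAG_STOP;
                  MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, tag,
                           MPI_COMM_WORLD);
                  if (tag == TAG_WORK) next++;
              }
          } else {                      /* worker: loop until told to stop */
              int task, result;
              MPI_Status st;
              while (1) {
                  MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG,
                           MPI_COMM_WORLD, &st);
                  if (st.MPI_TAG == TAG_STOP) break;
                  result = task * task; /* stand-in for real work */
                  MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
              }
          }
          MPI_Finalize();
          return 0;
      }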
Parallel Collective Communication Primitives
- Be able to define One-to-All, All-to-All, scatter, gather
and All-to-All Personalized communication
primitives. (In other words, be able to reconstruct Figures
4.1, 4.8, 4.14, and 4.16 from the text.)
- One-to-All Broadcast
- Be able to describe and analyze the Hypercube One-to-All
broadcast algorithm.
- Be prepared to answer questions about the algorithms as
listed in program form in Algorithm 4.1, and 4.2 from the text.
- Understand how the One-to-All broadcast algorithm can be
used as a basis for performing the Single-Node
Accumulation.
- Understand and be prepared to discuss the use of relabeling processor IDs using XOR to achieve the generalized algorithm (see the sketch below)
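  A C/MPI sketch along the lines of the textbook's Algorithm 4.1
  (assuming MPI ranks coincide with the hypercube node labels): a
  one-to-all broadcast of one int from node 0 on a d-dimensional
  hypercube. For a source s other than 0, run the same code on the
  relabeled ID my_id XOR s when computing masks and partners.

      #include <mpi.h>

      /* At step k = d-1 .. 0, every node whose lower k bits are zero is
         active and exchanges across dimension k, doubling the set of
         nodes that hold the message. */
      void one_to_all_broadcast(int my_id, int d, int *msg) {
          int mask = (1 << d) - 1;              /* all d bits set        */
          for (int k = d - 1; k >= 0; k--) {
              mask ^= (1 << k);                 /* clear bit k           */
              if ((my_id & mask) == 0) {        /* lower k bits are zero */
                  int partner = my_id ^ (1 << k);
                  if ((my_id & (1 << k)) == 0)  /* I already hold msg    */
                      MPI_Send(msg, 1, MPI_INT, partner, k, MPI_COMM_WORLD);
                  else                          /* partner holds it      */
                      MPI_Recv(msg, 1, MPI_INT, partner, k, MPI_COMM_WORLD,
                               MPI_STATUS_IGNORE);
              }
          }
      }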
- All-to-All Broadcast
- Understand the All-to-All broadcast algorithms for the
ring and Hypercube topologies. In particular, pay attention to message
sizes at each step of the algorithm.
- Be prepared to answer questions about the algorithms as
listed in program form in Algorithm 4.4, 4.5, and 4.6 from the text.
- Be able to describe how the All-to-All broadcast can be
  used as a basis for performing the Reduction or
  Multinode Accumulation operations.
- Know the difference between an All-to-All reduction and an All-Reduce
- One-to-All Personalized Communication
- Understand how this algorithm is a variation of the
One-to-All broadcast except with shrinking message
sizes.
- Be able to derive and analyze this algorithm on a
Hypercube network by generalizing the
One-to-All algorithm results.
- All-to-All Personalized Communications
- Be able to describe and analyze the Hypercube algorithm.
- Be able to describe and analyze the E-cube routing algorithm.
- Be able to generalize the ts + m tw performance model to take into account network bisection width.
- Know Tables 4.1 and 4.2 for the algorithms we have covered.
- Be able to derive timing measurements for any of the above algorithms!
- Study the Homework Assignments!
Performance and Scalability of Parallel Systems
- Be able to define Speedup, Efficiency, Cost, Cost Optimal, Problem Size,
Overhead Function, Isoefficiency Function, and sequential fraction.
- Be able to discuss Amdahl's Law in terms of serial fraction (a worked example follows this list).
- Be able to determine if an algorithm is Cost Optimal.
- Be able to determine if an algorithm has an Isoefficiency
function
- Be able to derive the overhead function and use this function
to determine the existence of an Isoefficiency function. Be
able to show that the overhead function is related to the
isoefficiency function when one exists.
- Understand why there is a lower bound on the isoefficiency
function as described in section 5.4.4 from text.
- What is the degree of concurrency of a parallel algorithm?
- What is the relationship between degree of concurrency and the
isoefficiency function? See section 5.4.5 from text.
- What are the sources of parallel overhead? ANS: Communication
time, Load Imbalance, and additional computations performed by
the parallel algorithm.
- What is scaled speedup? How do we measure it?
- What is memory and time-constrained scaled speedup? How do we measure these?
- Study the Homework Assignments!
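  A worked Amdahl's Law example (made-up numbers): with serial fraction
  f = 0.1 and p = 16 processors, speedup
  S = 1 / (f + (1 - f)/p) = 1 / (0.1 + 0.9/16) = 1 / 0.15625 = 6.4,
  and no matter how large p grows, S can never exceed 1/f = 10.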
Parallel Sorting Algorithms
- Fundamentals
- Be able to define internal and external sorting algorithms.
- What is the compare-exchange operation? How is it implemented in
  parallel?
- What is the compare-split operation? How much time does it
  require? (A code sketch follows this sub-list.)
- What is a comparator? How do you draw one? What is an increasing
  or decreasing comparator and how do you represent (draw) it?
- What does it mean to enumerate processors with respect to
sorting algorithms? Why is it important for parallel sorting
algorithms?
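  A sketch (hypothetical buffer handling, not from the text) of the
  compare-split step between two processes that each hold a sorted
  block of k elements: after exchanging blocks, the lower-ranked
  process keeps the k smallest of the combined 2k elements and its
  partner keeps the k largest.

      #include <mpi.h>
      #include <stdlib.h>

      static int cmp_int(const void *a, const void *b) {
          int x = *(const int *)a, y = *(const int *)b;
          return (x > y) - (x < y);
      }

      void compare_split(int *mine, int k, int partner, int keep_small) {
          int *theirs = malloc(k * sizeof *theirs);
          int *all    = malloc(2 * k * sizeof *all);
          /* Exchange whole blocks: Theta(k) communication. */
          MPI_Sendrecv(mine, k, MPI_INT, partner, 0,
                       theirs, k, MPI_INT, partner, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          for (int i = 0; i < k; i++) { all[i] = mine[i]; all[k+i] = theirs[i]; }
          qsort(all, 2 * k, sizeof *all, cmp_int); /* an O(k) merge would
                                                      suffice, since both
                                                      blocks arrive sorted */
          for (int i = 0; i < k; i++)              /* keep low or high half */
              mine[i] = keep_small ? all[i] : all[k + i];
          free(theirs);
          free(all);
      }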
- Sorting Networks
- Be able to define a bitonic sequence, bitonic split and bitonic
merge.
- What is a bitonic merging network? What does it do? What is its
size? What is its depth? What does it look like?
- What is a bitonic sorting network? What does it do? What is its
  size? What is its depth? What does it look like? (Worked sizes for
  n = 16 follow this sub-list.)
- How do we map a bitonic sorting network onto a hypercube? What
  do the communication patterns look like?
- Be able to derive execution times for the bitonic sorting algorithm
on a hypercube architecture. Be able to discuss both mapping one
element per processor and many elements per processor.
- Be able to perform a scalability analysis of the bitonic sorting
algorithm on the hypercube architecture.
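  A quick size/depth check for n = 16 inputs: a bitonic merging network
  has depth log n = 4 and (n/2) log n = 32 comparators; the full
  bitonic sorting network has depth (log n)(log n + 1)/2 = 10 stages
  and (n/2) * 10 = 80 comparators. In the hypercube mapping, each
  comparator connects wires whose labels differ in exactly one bit, so
  every compare-exchange travels along a single hypercube link.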
- Quick-sort
- Why is it difficult to develop an efficient parallel quick-sort
algorithm by simply running independent subproblems in
parallel?
- What steps are involved in the shared memory quick-sort algorithm?
- Why is pivot selection critical for an efficient sorting
algorithm?
- What is the scalability of the quick-sort algorithm?
- Be prepared to estimate parallel running times for shared memory and distributed memory based quick-sort algorithms as discussed in the text.
- Bucket Sort and Sample Sort
- What is the bucket sort? What assumptions does it make? What is
its serial run time?
- What steps are involved in the parallel bucket sort algorithm?
- What is a Sample Sort, how does it differ from Bucket Sort?
- How do we form a sample set in the Sample Sort? How big
  is the sample? (A worked example follows this sub-list.)
- How do we sort the Sample in Sample Sort?
- What is the performance bottleneck for the Sample Sort?
- Be able to estimate running time for these algorithms.
- Be able to perform a scalability analysis of these algorithms.
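  A small worked example of the sampling step (made-up numbers, using
  the regular-sampling scheme): with p = 4 processes and n = 64 keys,
  each process sorts its 16 local keys and contributes p - 1 = 3 evenly
  spaced samples, for p(p - 1) = 12 samples in all; the combined sample
  is sorted and p - 1 = 3 evenly spaced splitters are drawn from it to
  define the 4 buckets. Gathering and sorting the sample on one process
  is the usual bottleneck as p grows.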
Dense Matrix Algorithms
- What is one-dimensional and two-dimensional partitioning?
- Be able to describe and analyze one-dimensional partitioned dense matrix-vector multiplication for both row and column decompositions.
- Review the concurrency constraints on the isoefficiency function. Be able to derive the isoefficiency bounds due to concurrency for all of the parallel matrix algorithms.
- Be able to describe and analyze the two-dimensional partitioned dense matrix-vector multiplication algorithm.
- Be able to describe and analyze the simple dense matrix-matrix multiplication algorithm.
- Be able to describe and analyze Cannon's algorithm for dense matrix-matrix multiplication (a sketch follows this list).
- Be able to describe and analyze the DNS algorithm for matrix-matrix multiplication.
- Be able to compare and contrast various algorithms (from the set above) for dense matrix-matrix multiply.
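  A sketch of Cannon's algorithm on a q x q process grid, q = sqrt(p),
  where shift_left(), shift_up(), and local_matmul() are hypothetical
  helpers; in MPI the cyclic shifts would typically be
  MPI_Sendrecv_replace() calls on a Cartesian communicator.

      /* Each process at grid position (row, col) holds one
         (n/q) x (n/q) block of A, B, and C. */
      void cannon(int row, int col, int q,
                  double *A, double *B, double *C) {
          shift_left(A, row);          /* align: shift A's block left by row */
          shift_up(B, col);            /* align: shift B's block up by col   */
          for (int step = 0; step < q; step++) {
              local_matmul(C, A, B);   /* C += A * B on the local blocks     */
              shift_left(A, 1);        /* rotate A one position left         */
              shift_up(B, 1);          /* rotate B one position up           */
          }
      }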
Study Homework Assignments!