Outline of Material for Test #1
Parallel Programming Platforms
- Flynn's Taxonomy (SISD, SIMD, MIMD)
- Understand discussion in section 2.3.1
- Also understand SPMD (Single Program Multiple Data)
- Memory Access Architectures
Understand:
- Shared Memory Multiprocessor System [Figure 2.5]
- single address space
- uniform memory access (UMA)/ nonuniform memory access (NUMA)
- distributed shared memory system
- Message Passing Multicomputer (Distributed Memory)
- Network Models
- Be able to discuss the implementation, cost, and expected execution times of the following switching networks
- crossbar switch
- bus
- Omega network
- Be prepared to identify and discuss blocking in an Omega network (Figure 2.13)
- Be prepared to draw a small omega network configuration
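  A quick self-check on sizes (my own arithmetic, not from the text): an Omega
  network connecting p = 8 inputs to 8 outputs has log2(8) = 3 stages of
  8/2 = 4 two-by-two switches, 12 switches in all; an 8 x 8 crossbar needs
  8 * 8 = 64 crosspoints; a bus needs only one shared link but serializes all
  traffic.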
- Review homework!
- Be able to discuss static network topologies
- Star Connected
- Completely Connected
- Linear Arrays, Meshes, and k-d Meshes
- Hypercube
- Tree and Fat Tree Networks
- Be able to compute Diameter, Bisection Width, and
Cost for Linear, Ring, Mesh, Torus, Tree, and
Hypercube networks.
- Understand how to construct a d-dimensional
hypercube recursively.
- Be able to define and compute Diameter,
Connectivity, Bisection Width, Bisection Bandwidth,
and Cost for any arbitrary network.
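  A quick worked example (standard formulas; double-check against the text's
  table of network measures): a 3-dimensional hypercube has p = 8 nodes, so
  its diameter is log2(8) = 3, its bisection width is 8/2 = 4, and its cost
  (link count) is (8 * 3)/2 = 12. An 8-node ring, by contrast, has diameter 4,
  bisection width 2, and cost 8.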
- Communication Cost Models for Static Interconnection
Networks
Know and Understand:
- communication latency
- Communication times for
- Store and Forward Routing
- Cut-Through Routing
- Using the simplified cost model (and the rationale for the simplification)
- startup time, ts
- Per-word transfer time tw
- hop time th
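  Worked forms of these cost models in the outline's notation (verify against
  the text's derivation), for a message of m words traversing l links:
      store-and-forward:  tcomm = ts + (m*tw + th)*l, usually simplified to ts + l*m*tw
      cut-through:        tcomm = ts + l*th + m*tw
      simplified model:   tcomm = ts + m*tw  (drop the th term; the per-hop time
                                              is small and congestion is ignored)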
- Understand Mapping and its impact on performance
- Be able to define and discuss congestion, dilation, and expansion. How does each of these measures of graph mappings impact communication performance?
- PRAM Models
- Be able to discuss the PRAM machine architecture and
assumptions
- Understand Types of PRAM models:
Combining CRCW, Priority CRCW, Arbitrary CRCW,
Common CRCW, CREW, and EREW.
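  A small worked example of the write-conflict rules (my own numbers): if
  processors 2 and 5 both write to the same cell, with values 10 and 20, then
  Common CRCW requires the written values to be equal (so this write is
  illegal), Arbitrary CRCW keeps one of them (either 10 or 20), Priority CRCW
  keeps the value from the higher-priority (typically lower-numbered)
  processor, i.e. 10, and Combining (Sum) CRCW stores 10 + 20 = 30. CREW
  forbids the concurrent write entirely, and EREW forbids concurrent reads as
  well.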
- Review Homework problems!
MPI and Message Passing
- Be prepared to describe or discuss the role or function of the following MPI primitives (a usage sketch appears after this group of calls)
- MPI_Init()
- What Arguments are passed to this routine?
- What is the purpose of this call?
- MPI_Finalize()
- MPI_COMM_WORLD
- What are communicators and what role does this identifier play in the creation and use of communicators in MPI?
- What calls are related to communicators in the MPI library? What do these calls do?
- MPI_Comm_size()
- MPI_Comm_rank()
- MPI_Comm_dup()
- MPI_Comm_split()
- MPI_Send()
- What is passed to this call and what purpose do these arguments serve?
- MPI_Recv()
- What arguments are passed to this call? What arguments are different from MPI_Send()? Why are there differences?
- MPI_Sendrecv()
- What purpose does this call serve?
- What are the arguments that this call takes?
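  A minimal sketch (my own example, not from the course materials) showing the
  blocking primitives above in context; the tag and data values are arbitrary:

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char *argv[]) {
          int rank, size, value = 42, tag = 0;
          MPI_Status status;

          MPI_Init(&argc, &argv);                  /* must be the first MPI call; takes &argc, &argv */
          MPI_Comm_size(MPI_COMM_WORLD, &size);    /* number of processes in the communicator */
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id, 0..size-1 */

          if (rank == 0 && size > 1) {
              MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
          } else if (rank == 1) {
              /* unlike MPI_Send, MPI_Recv takes a status; source/tag may also be
                 MPI_ANY_SOURCE / MPI_ANY_TAG */
              MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
              printf("rank 1 received %d\n", value);
          }

          MPI_Finalize();                          /* must be the last MPI call */
          return 0;
      }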
- MPI non-blocking operations (a usage sketch follows this sub-list)
- Be able to describe how to perform non-blocking operations in MPI.
- Why do we need non-blocking communication operations?
- MPI_Isend()
- How is the Isend different from send? What does it do that is different? How are the arguments different?
- MPI_Irecv()
- How is the Irecv different from the recv? What does it do that is different? How are the arguments different?
- MPI_Test()
- What does this call do? Why would it be useful?
- MPI_Wait()
- What does this call do? How is it different from a MPI_Test()?
- MPI_Request_free()
- What is a MPI_Request? Why would you want to free it?
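  A minimal non-blocking sketch (my own example; assumes the rank setup from
  the sketch above and exactly two processes). Posting the receives before the
  sends and waiting afterward avoids the blocking-send deadlock and lets
  computation overlap communication:

      int other = 1 - rank, sendval = rank, recvval;
      MPI_Request reqs[2];
      MPI_Status  stats[2];

      MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);  /* returns immediately */
      MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);  /* buffer not reusable yet */

      /* ...do useful computation here, overlapped with the communication... */

      MPI_Waitall(2, reqs, stats);   /* or poll with MPI_Test(&reqs[i], &flag, &stats[i]) */
      /* after the wait completes, recvval is valid and sendval may be reused */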
- MPI collective Communications
- What are collective communications in MPI?
- Can we perform a collective communication over a subset of the processes? If so, how do we do that?
- Do collective communications use tags? Why or why not?
- Know the following collective calls: what they do and what arguments they take (a usage sketch follows this list).
- MPI_Barrier()
- MPI_Bcast()
- MPI_Reduce()
- MPI_Allreduce()
How is this different from MPI_Reduce()?
- MPI_Gather()
- MPI_Scatter()
What chapter 4 communication algorithm is this implementing?
- MPI_Allgather()
What chapter 4 communication algorithm is this implementing?
- MPI_Alltoall()
What chapter 4 communication algorithm is this implementing?
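  A minimal sketch (my own example) of the most common collectives; note that
  every process in the communicator must make the matching call and that no
  tags are used:

      int rank, p, root = 0;
      int value = 0, local, sum;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &p);

      if (rank == root) value = 17;                           /* data only the root has */
      MPI_Bcast(&value, 1, MPI_INT, root, MPI_COMM_WORLD);    /* one-to-all broadcast */

      local = rank;
      MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);   /* result at root only */
      MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);      /* result on every rank */
      MPI_Barrier(MPI_COMM_WORLD);                            /* synchronize all processes */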
Be able to discuss how deadlock is possible when using MPI_Send()
and MPI_Recv() as discussed in class.
Be able to discuss buffering in message passing protocols.
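  A sketch of the deadlock hazard (my own example): if every rank calls a
  blocking MPI_Send() before its MPI_Recv(), and the messages exceed whatever
  buffering the implementation provides, no send can complete and all ranks
  wait forever:

      /* every rank executes this -- unsafe, correctness depends on buffering */
      MPI_Send(sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
      MPI_Recv(recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &status);

      /* safe alternatives: reverse the order on one rank, use MPI_Sendrecv(),
         or use MPI_Isend()/MPI_Irecv() followed by MPI_Wait() */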
Review Homework!
Parallel Collective Communication Primitives
- Be able to define One-to-All, All-to-All, scatter, gather
and All-to-All Personalized communication
primitives. (In other words, be able to reconstruct Figures
4.1, 4.8, 4.14, and 4.16 from the text)
- One-to-All Broadcast
- Be able to describe and analyze the Hypercube One-to-All
broadcast algorithm.
- Be prepared to answer questions about the algorithms as
listed in program form in Algorithms 4.1 and 4.2 from the text.
- Understand how the One-to-All broadcast algorithm can be
used as a basis for performing the Single-Node
Accumulation.
- Understand and be prepared to discuss the use of relabeling processor IDs using XOR to achieve the generalized algorithm
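  A timing sketch worth memorizing (standard result; check it against the
  text's derivation): the hypercube One-to-All broadcast of an m-word message
  takes log p steps, each costing ts + m*tw, so T = (ts + m*tw) * log p. The
  Single-Node Accumulation is the same pattern run in reverse (with a local
  reduction at each step), so it has the same communication cost.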
- All-to-All Broadcast
- Understand the All-to-All broadcast algorithms for the
ring and Hypercube topology. In particular, pay attention to message
sizes at each step of the algorithm.
- Be prepared to answer questions about the algorithms as
listed in program form in Algorithms 4.4, 4.5, and 4.6 from the handout.
- Be able to describe how the All-to-All broadcast can be
used as a basis for performing the Reduction or
Multinode Accumulation operations.
- Know the difference between an All-to-All reduction and an All-Reduce
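  Timing sketches worth memorizing (standard results; verify against the
  text): on a p-node ring the All-to-All broadcast takes p - 1 steps with
  messages of size m, so T = (ts + m*tw)(p - 1); on a hypercube the message
  size doubles at each of the log p steps (m, 2m, 4m, ...), giving
  T = ts*log p + m*tw*(p - 1). The m*tw*(p - 1) term is the same on both
  topologies, since every node must receive (p - 1)*m words either way.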
- One-to-All Personalized Communication
- Understand how this algorithm is a variation of the
One-to-All broadcast except with shrinking message
sizes.
- Be able to derive and analyze this algorithm on a Hypercube network by generalizing the One-to-All (scatter, gather) algorithm results.
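  A timing sketch (standard result): in the hypercube scatter, the root starts
  with p messages of m words each and forwards half of what it holds at every
  one of the log p steps (p*m/2 words, then p*m/4, ...), giving
  T = ts*log p + m*tw*(p - 1). The gather is the same pattern in reverse with
  the same cost.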
- All-to-All Personalized Communications
- Be able to describe and analyze the Hypercube algorithm.
- Be able to describe and analyze the E-cube routing algorithm.
- Be able to generalize the ts + m tw performance model to take into account network bisection width.
- Be able to describe how the all-to-all personalized algorithm relates to matrix transpose.
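  Timing sketches (standard results; verify against the text): the log p-step
  hypercube algorithm moves p/2 consolidated messages of m words in every
  step, so T = (ts + (p/2)*m*tw) * log p, while the optimal algorithm pairs
  the processors for p - 1 exchange steps of m words each, routed without
  contention via E-cube routing, so T = (ts + m*tw)(p - 1). The operation is
  equivalent to transposing a p x p "matrix" of messages in which processor i
  initially holds row i and must end up holding column i.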
- Prefix-Sum algorithm
- Know what a prefix-sum is and how it is used
- Be able to describe the hypercube algorithm
- Be able to estimate running time using the ts + m tw communication cost model
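  A small worked example (my own numbers): for inputs 3, 1, 4, 1 on processors
  0 through 3, the prefix sums are 3, 4, 8, 9. The hypercube algorithm uses
  the same log p pairwise-exchange steps as the All-to-All broadcast, but each
  processor keeps a separate buffer for its own result and only folds in
  values arriving from lower-numbered processors, so for single-word operands
  the cost is roughly (ts + tw) * log p.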
- Communication optimizations
- Be able to determine if an algorithm is optimal based on bisection bandwidth arguments.
- Be able to discuss and analyze the optimizations of communication algorithms as described in section 4.7
- Know Tables 4.1 and 4.2 for the algorithms we have covered.
- Be able to derive timing measurements for any of the above algorithms!
- Study the Homework Assignments!
Principles of Parallel Algorithm Design
- Algorithm Decomposition
- Be able to define Tasks and Task Dependency Graphs
- Be able to discuss Task Interactions
- granularity (fine or coarse grained)
- degree of concurrency (what does it measure?)
- maximum degree of concurrency
- average degree of concurrency
- What is a Critical Path? Why is it important to the analysis of parallel algorithm performance?
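  A quick worked example (my own numbers): if a task dependency graph contains
  10 unit-weight tasks and its longest (critical) path contains 4 of them,
  then no schedule can finish in fewer than 4 steps regardless of processor
  count, and the average degree of concurrency is total work / critical path
  length = 10/4 = 2.5.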
- What is a task interaction graph? How is it different from a
task dependency graph? What do we use a task interaction graph for?
- Processes, Mapping, and Processors: Understand the role each of these plays in the design of parallel algorithms.
- Decomposition Techniques
- Recursive Decomposition
- What is recursive decomposition?
- Well suited to divide-and-conquer algorithms (such as quicksort)
- Data Decomposition
- Be able to define data decomposition.
- Be able to discuss: mapping data partitions to task partitions, partitioning input data vs. output data, and the owner-computes rule.
- Exploratory Decomposition
- What is Exploratory Decomposition?
- When would you apply Exploratory Decomposition?
- What types of problems are well suited for exploratory decomposition?
- Speculative Decomposition
- What is Speculative Decomposition?
- How is Speculative Decomposition different from Exploratory Decomposition?
- When would you apply speculative decomposition?
- Hybrid Decompositions
- What are hybrid decompositions?
- Be able to discuss an example of hybrid decompositions used to design a parallel algorithm.
- Characteristics of Tasks and Interactions
- Be able to define and discuss dynamic task generation. What problems does this present for parallel programs? How do we deal with these problems?
- How does the size of data associated with tasks affect our decisions regarding parallel algorithm implementation?
- How can we take advantage of static task interactions to improve performance? How is this question related to one-way vs. two-way (one-sided vs. two-sided) communication?
- What is the difference between regular and irregular interactions? What is the difference between irregular and dynamic interactions?
- What is the impact of read-only access to data vs. read-write access?
- What do we mean when we say an interaction is two-way or two-sided?
- What are some static mapping techniques? (e.g., array distribution schemes: block, cyclic, block-cyclic) Why are there so many mappings? (A small example follows below.)
- What is graph partitioning? When is it important for parallel algorithm design?
- What are hierarchical Mappings?
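  A small example of why there are several array distributions (my own
  numbers): for 16 array elements on 4 processes, a block distribution gives
  P0 elements 0-3, a cyclic distribution gives P0 elements 0, 4, 8, 12, and a
  block-cyclic distribution with block size 2 gives P0 elements 0-1 and 8-9.
  Block keeps neighboring elements together (good locality), cyclic balances
  load when the work per element is uneven, and block-cyclic is a compromise
  between the two.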
- Schemes for Dynamic Mapping
- What is the tension between centralized and distributed schemes?
- What are self-scheduling and chunk scheduling?
- Methods for Containing Interaction Overheads
Be able to define and discuss the eight methods for containing
interaction overheads in parallel programs.
- Maximizing Data Locality
- Minimizing Volume of Data Exchange
- Minimize Frequency of Interactions
- Minimizing Contention and Hot Spots
- Overlapping Computations with Interactions
- Replicating Data or Computations
- Using Optimized Collective Interaction Operations
- Overlapping Interactions with Other Interactions
- Parallel Algorithm Models
Be able to define and compare/contrast the different parallel algorithm models as discussed in section 3.6.
- The Data-Parallel model
- The Task-Graph model
- The Work Pool Model
- The Master-Slave Model
- The Pipeline or Producer-Consumer model
- Review Homework!