Overview

I took this course in the Fall of 2010 with Professor David Bindel.

Description

How do we solve the large-scale problems of science quickly on modern computers? How do we measure the performance of new or existing simulation codes, and what things can we do to make them run faster? How can we best take advantage of features like multicore processors, vector units, and graphics co-processors? These are the types of questions we will address in CS 5220, Applications of Parallel Computers. Topics include:

  • Single-processor architecture, caches, and serial performance tuning

  • Basics of parallel machine organization

  • Distributed memory programming with MPI

  • Shared memory programming with OpenMP

  • Parallel patterns: data partitioning, synchronization, and load balancing

  • Examples of parallel numerical algorithms

  • Applications from science and engineering

Because our examples will be drawn primarily from engineering and scientific computations, we will assume some prior exposure to numerical methods (e.g. at the level of CS 3220). Students should also be able to read and write serial programs written in C. Prior exposure to parallel programming is not required, and non-CS students from fields involving simulation are particularly welcome!

— Course Website

Deliverables

Project 1: Matrix Multiply Optimization

Project Description

For this assignment, you will optimize a routine to multiply two double-precision square matrices. As discussed in class, the naive implementation is short, sweet, and horrifyingly slow. A naive blocked code is only marginally better. You will need to use what you have learned about tuning to get your code to run as fast as possible on a single core on one node of the crocus cluster (Intel Xeon E5504).

— Course Website
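
The core idea of the assignment is cache blocking: multiply the matrices tile by tile so that the working set of A, B, and C stays in cache. The sketch below illustrates just that idea, not our tuned submission; the routine name, the column-major layout, and the BLOCK tile size are assumptions for illustration, and a real submission would layer copy optimization, alignment, and vector intrinsics on top of it.

    #define BLOCK 64  /* assumed tile size; tune for the E5504's cache hierarchy */

    /* C += A * B for n-by-n column-major matrices with leading dimension n. */
    void square_dgemm_blocked(int n, const double *A, const double *B, double *C)
    {
        for (int jj = 0; jj < n; jj += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int ii = 0; ii < n; ii += BLOCK) {
                    int jmax = jj + BLOCK < n ? jj + BLOCK : n;
                    int kmax = kk + BLOCK < n ? kk + BLOCK : n;
                    int imax = ii + BLOCK < n ? ii + BLOCK : n;
                    for (int j = jj; j < jmax; ++j)
                        for (int k = kk; k < kmax; ++k) {
                            double bkj = B[k + j*n];   /* reused across the i loop */
                            for (int i = ii; i < imax; ++i)
                                C[i + j*n] += A[i + k*n] * bkj;
                        }
                }
    }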

Tuning the Matrix Multiply Algorithm. Saugata Ghose and Jonathan Tse. PDF

Project 2: Smoothed Particle Hydrodynamics

Project Description

The purpose of this assignment is to get a somewhat more serious parallel programming experience. Your goal is to parallelize a toy smoothed particle hydrodynamics simulation. A reference serial implementation is documented here. The current version uses a naive algorithm that does not use spatial locality. Your mission, should you choose to accept it, is to fix that. You should report on your progress in two weeks, including three things:

  1. Improve the complexity by changing from the naive O(n²) algorithm to an O(n) (roughly) force evaluation algorithm based on spatial partitioning. You will not necessarily need to repartition at every step!

  2. Parallelize your code using OpenMP, and spend some time tuning the parallel implementation. You will need to go beyond just adding pragmas before your for loops in order to get reasonable speed!

  3. Do a scaling study to characterize the performance of both your serial and parallel codes as a function of the number of particles in the system and the number of processors. Try to explain what you see. Where are the bottlenecks? What are the parallel overheads? Do you have good load balance? Are you making good use of the memory system? For several of these questions, you may find it useful to try timing your code with HPCToolkit and TAU (both of which are now installed on the cluster).

If you have extra time, play a little! Improve or extend the code in some way that appeals to you, either by doing something clever with the time integrator (e.g. automatic adaptation of the time step), or adding a feature (I deliberately left out the surface tension calculations), or by doing something else. If this project appeals to you, extensions to it could easily be the basis of a class project.

— Course Website
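
For a concrete picture of the spatial partitioning the assignment asks for, here is a minimal sketch: particles are hashed into a uniform grid of cells whose side length is the interaction radius h, so each particle examines only the nine surrounding cells rather than all n particles, and the per-particle loop parallelizes naturally with OpenMP. Everything here (the particle_t layout, the fixed 64-by-64 grid covering [0, NX*h) by [0, NY*h), and the density-only kernel) is a simplification for illustration rather than the reference code's actual interface.

    #define NX 64
    #define NY 64

    typedef struct particle_t {
        double x, y;              /* position */
        double rho;               /* accumulated density */
        struct particle_t *next;  /* next particle in the same cell */
    } particle_t;

    static particle_t *cells[NX][NY];   /* head of each cell's linked list */

    static int clampi(int c, int nmax) {
        if (c < 0) return 0;
        if (c >= nmax) return nmax - 1;
        return c;
    }

    /* Rebuild the cell lists: O(n), and it need not happen every time step. */
    void bin_particles(particle_t *p, int n, double h)
    {
        for (int i = 0; i < NX; ++i)
            for (int j = 0; j < NY; ++j)
                cells[i][j] = 0;
        for (int k = 0; k < n; ++k) {
            int ci = clampi((int)(p[k].x / h), NX);
            int cj = clampi((int)(p[k].y / h), NY);
            p[k].next = cells[ci][cj];
            cells[ci][cj] = &p[k];
        }
    }

    /* Density pass over neighboring cells only.  Each OpenMP thread owns a
     * disjoint range of k, so the writes to p[k].rho do not race. */
    void compute_density(particle_t *p, int n, double h)
    {
        #pragma omp parallel for schedule(static)
        for (int k = 0; k < n; ++k) {
            int ci = clampi((int)(p[k].x / h), NX);
            int cj = clampi((int)(p[k].y / h), NY);
            double rho = 0;
            for (int di = -1; di <= 1; ++di)
                for (int dj = -1; dj <= 1; ++dj) {
                    int ni = ci + di, nj = cj + dj;
                    if (ni < 0 || ni >= NX || nj < 0 || nj >= NY)
                        continue;
                    for (particle_t *q = cells[ni][nj]; q; q = q->next) {
                        double dx = q->x - p[k].x, dy = q->y - p[k].y;
                        double r2 = dx*dx + dy*dy;
                        if (r2 < h*h)            /* un-normalized quartic kernel */
                            rho += (h*h - r2) * (h*h - r2);
                    }
                }
            p[k].rho = rho;
        }
    }

Since binning is cheap and O(n), it can be redone every few steps rather than on every step, as the assignment hints.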

Smoothed Particle Hydrodynamics. Saugata Ghose, Shreesha Srinath, and Jonathan Tse. PDF

Project 3: All Pairs Shortest Path

Project Description

The purpose of this assignment is to get some more experience with MPI programming, and to do some simple comparisons between an OpenMP code and an MPI code. Your goal is to profile a reference implementation of an all-pairs shortest path algorithm using OpenMP, and to convert that algorithm to use MPI. The reference implementation is documented here. For this assignment, you should do the following:

  1. Do a scaling study to characterize the performance of the OpenMP code as a function of the graph size and the number of nodes. Can you explain the performance you see? Note that different graphs will require different numbers of outer iterations; you should probably take this into consideration in your analysis.

  2. Write a naive MPI implementation in which each processor keeps a complete copy of the data and communicates it with neighboring processors using MPI_Allgatherv.

  3. Write a more sophisticated MPI implementation in which you improve the memory scalability by not keeping a full copy of the distance matrix at every processor. Instead, you should rotate the matrices around from one processor to another in some systematic way. I suggest adapting the simple 1D ring matrix multiply from the dense linear algebra lectures, but you are allowed to do something more sophisticated if you wish.

  4. Repeat your scaling study for your two MPI implementations. Again, try to explain the performance in terms of a simplified model. Make sure you do at least a few runs that involve more than one node!

The current OpenMP version is fairly naive (I didn’t attempt any blocking of the matrix-multiply-like kernel), and I’m sure there is room for significant improvement if you feel so inclined. However, I’m really more interested in what you get working in MPI; if you find you have time left over, it is better spent on the final project.

— Course Website
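
As an illustration of step 2 above, here is a sketch of the MPI_Allgatherv approach: every rank keeps the full n-by-n distance matrix, performs a (min,+) matrix-squaring pass on its own block of columns, and the updated blocks are then gathered back onto every rank. The sketch runs a fixed ceil(log2 n) squarings, whereas the reference code's outer iteration count depends on the graph; the function names, column ownership scheme, and column-major layout are all assumptions for illustration.

    #include <mpi.h>
    #include <stdlib.h>

    /* One (min,+) squaring pass, restricted to the columns
     * [jstart, jstart + jcount) owned by this rank. */
    static void square_block(int n, int jstart, int jcount,
                             const double *l, double *lnew)
    {
        for (int j = jstart; j < jstart + jcount; ++j)
            for (int i = 0; i < n; ++i) {
                double best = l[i + j*n];
                for (int k = 0; k < n; ++k) {
                    double cand = l[i + k*n] + l[k + j*n];
                    if (cand < best) best = cand;
                }
                lnew[i + (j - jstart)*n] = best;
            }
    }

    /* l is the full n-by-n distance matrix, replicated on every rank and
     * overwritten in place. */
    void shortest_paths_allgather(int n, double *l, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Column ownership: rank r owns counts[r]/n consecutive columns. */
        int *counts = malloc(size * sizeof(int));
        int *displs = malloc(size * sizeof(int));
        for (int r = 0, off = 0; r < size; ++r) {
            int cols = n / size + (r < n % size ? 1 : 0);
            counts[r] = cols * n;
            displs[r] = off;
            off += cols * n;
        }
        int jstart = displs[rank] / n;
        int jcount = counts[rank] / n;
        double *lnew = malloc((size_t)counts[rank] * sizeof(double));

        /* ceil(log2 n) squarings cover paths of up to n-1 edges. */
        for (int done = 1; done < n; done *= 2) {
            square_block(n, jstart, jcount, l, lnew);
            MPI_Allgatherv(lnew, counts[rank], MPI_DOUBLE,
                           l, counts, displs, MPI_DOUBLE, comm);
        }

        free(lnew);
        free(displs);
        free(counts);
    }

The memory-scalable variant of step 3 would drop the replicated matrix and instead circulate column blocks around a ring of ranks, in the spirit of the 1D ring matrix multiply from the lectures.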

All Pairs Shortest Path. Saugata Ghose, Shreesha Srinath, and Jonathan Tse. PDF

Final Project: Accelerating a PARSEC Benchmark Using Portable Subword SIMD

Abstract

We present a case study of the GNU Compiler Collection (GCC) Vector Extensions in GCC 4.7. In particular, we examine the performance of explicit vector code written with the GCC Vector Extensions relative to that of automatically vectorized code from the Intel C++ Compiler (ICC). Our analysis focuses on the interactions between data-level and thread-level parallelism in the streamcluster benchmark from the PARSEC benchmark suite, with particular attention to the tradeoffs between portability and performance across different vectorization techniques.
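
To give a flavor of what explicit vector code with the GCC Vector Extensions looks like, here is a small sketch of a squared-distance kernel of the kind streamcluster spends its time in. The vector_size attribute declares a portable four-wide float type, and ordinary arithmetic on that type compiles down to SSE, AltiVec, or NEON instructions as the target allows. The function name and the assumption that dim is a multiple of four are simplifications for illustration; this is not the code from the report.

    #include <string.h>

    typedef float v4sf __attribute__((vector_size(16)));  /* four packed floats */

    /* Squared Euclidean distance between two dim-length coordinate arrays
     * (dim assumed to be a multiple of 4 for brevity). */
    float dist_sq(const float *a, const float *b, int dim)
    {
        v4sf acc = {0.0f, 0.0f, 0.0f, 0.0f};
        for (int i = 0; i < dim; i += 4) {
            v4sf va, vb;
            memcpy(&va, a + i, sizeof va);   /* unaligned-safe loads */
            memcpy(&vb, b + i, sizeof vb);
            v4sf d = va - vb;                /* element-wise vector ops */
            acc += d * d;
        }
        return acc[0] + acc[1] + acc[2] + acc[3];  /* horizontal reduction */
    }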

Accelerating a PARSEC Benchmark Using Portable Subword SIMD. Saugata Ghose, Shreesha Srinath, and Jonathan Tse. PDF