Programming of Parallel Computers/Lecture 10

From PBDN


This content is a student-produced work-in-progress and may be incomplete or contain errors.



OpenMP

Continued from Lecture 9


Slides

OpenMP presentation slides are now online (PDF 474kB).


OpenMP

Overview

  • Variables are shared among all threads by default unless they are listed in a private() clause on the #pragma omp directive.
  • The program runs in a single thread outside parallel regions. The number of threads used in a parallel region can be set via an environment variable (OMP_NUM_THREADS).
  • OpenMP requires a global (shared) address space, e.g. a NUMA machine or a multiprocessor node. It is of no use across the uniprocessor nodes of a COW/NOW.
  • Portable between parallel and serial platforms: directives are comments in Fortran and #pragmas in C/C++, both of which are ignored by non-OpenMP compilers.
  • The number of threads can also be specified in the OpenMP directive itself (num_threads clause).
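
As a minimal sketch of the points above (not taken from the lecture), the following C function marks one variable as private and requests a thread count in the directive. The function name square_plus_one and the choice of 4 threads are assumptions made for illustration. A compiler without OpenMP support ignores the pragma and runs the loop serially with the same result.

```c
/* Hypothetical helper: out[i] = in[i] * in[i] + 1.0 for i = 0 .. n-1. */
void square_plus_one(const double *in, double *out, int n)
{
    double tmp;  /* declared outside the loop, so it would be shared by default */
    int i;       /* loop counter: implicitly private */

    /* 'in', 'out' and 'n' stay shared; private(tmp) gives each thread its
       own 'tmp'. num_threads(4) is an arbitrary value for illustration. */
    #pragma omp parallel for private(tmp) num_threads(4)
    for (i = 0; i < n; i++) {
        tmp = in[i] * in[i];
        out[i] = tmp + 1.0;
    }
}
```

With GCC, OpenMP is enabled by compiling with -fopenmp. Without that flag the same source builds as an ordinary serial program.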

Directives

Data Sharing

  • shared(…): the default for all variables (except the loop iteration variable, which is implicitly thread-private).
  • private(…): each thread gets its own copy, which is stack allocated, uninitialized on entry, and removed on exit from the parallel region.
  • firstprivate: like private, but each thread's copy is initialized on entry from the master variable of the same name (for every thread, not only in the first iteration).
  • lastprivate: like private, but the value from the sequentially last iteration is copied back to the master variable of the same name on exit.
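
The data-sharing clauses above can be sketched in C as follows. This is an illustrative example under assumptions, not code from the lecture: the function fill_with_offset and its variable names are hypothetical, and it assumes n is at least 1 so that a last iteration exists.

```c
/* Hypothetical helper: writes offset + i into out[i] and returns the value
   produced by the sequentially last iteration (assumes n >= 1). */
int fill_with_offset(int *out, int n, int offset)
{
    int base = offset;  /* firstprivate: each thread's copy starts as 'offset' */
    int last = -1;      /* lastprivate: receives the value from iteration n-1 */
    int i;

    #pragma omp parallel for firstprivate(base) lastprivate(last)
    for (i = 0; i < n; i++) {
        last = base + i;  /* with plain private(base), 'base' would be uninitialized here */
        out[i] = last;
    }
    return last;          /* value written by iteration i == n-1 */
}
```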


Work Sharing

The loop counter (often i) is always private. Unless a schedule is specified, its range is typically divided into np roughly equal chunks, where np is the number of threads. Threads are synchronised at the end of the work-sharing loop.

  • reduction: combines the threads' partial results with a reduction operation (as in MPI), including +, -, *, etc.
  • schedule(type[,size]):
    • type=static: assign chunks cyclically to threads, fixed before the loop runs (good cache behaviour, but potentially poor load balancing)
    • type=dynamic: assign the next chunk to a thread whenever it is ready (good load balancing, but poor cache behaviour)
    • type=guided: like dynamic, except that the chunk size is decreased (toward finer granularity) toward the end of execution (good load balancing, but poor cache behaviour)
    • type=runtime: the schedule type is read at run time from an environment variable (OMP_SCHEDULE)
  • ordered: only one thread at a time is allowed into the block, in sequential loop order; with the default schedule an ordered section gives serial performance.
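
A short C sketch of the reduction and schedule clauses listed above. The two functions and the chunk size of 16 are assumptions chosen for illustration. The first loop has uniform work per iteration, so a static schedule fits. The second has work that grows with i, which is the kind of case where a dynamic schedule may balance better.

```c
/* Hypothetical example 1: 1^2 + 2^2 + ... + n^2 with a static schedule. */
long sum_of_squares(int n)
{
    long sum = 0;
    int i;

    /* Each thread accumulates into a private 'sum'; the partial sums
       are added together when the loop ends. */
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (i = 1; i <= n; i++)
        sum += (long)i * i;
    return sum;
}

/* Hypothetical example 2: count the primes in 2..n by trial division. */
long count_primes(int n)
{
    long count = 0;
    int i;

    /* Chunks of 16 iterations are handed out as threads become free. */
    #pragma omp parallel for reduction(+:count) schedule(dynamic, 16)
    for (i = 2; i <= n; i++) {
        int is_prime = 1;  /* declared inside the loop body, so private */
        for (int d = 2; (long)d * d <= i; d++) {
            if (i % d == 0) {
                is_prime = 0;
                break;
            }
        }
        count += is_prime;
    }
    return count;
}
```

Replacing the schedule clause with schedule(runtime) would let the schedule be chosen through OMP_SCHEDULE without recompiling.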

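You can also schedule the work explicitly, using your own function to decide which iterations each thread executes (for example, based on its thread number inside a parallel region).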
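
One possible form of explicit scheduling, sketched in C under assumptions: each thread computes its own contiguous block of the index range from its thread number. The function scale_manual and the block partition formula are illustrative choices, not the lecture's method. The _OPENMP guards keep the code valid for a serial compiler, in which case a single "thread" processes the whole range.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: multiply a[0..n-1] by 'factor', with the index
   range split by hand instead of by a work-sharing directive. */
void scale_manual(double *a, int n, double factor)
{
    #pragma omp parallel
    {
        int tid = 0;       /* this thread's number */
        int nthreads = 1;  /* threads in the current team */
#ifdef _OPENMP
        tid = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
        /* Block partition: thread tid handles indices [lo, hi). */
        int lo = (int)((long)n * tid / nthreads);
        int hi = (int)((long)n * (tid + 1) / nthreads);

        for (int i = lo; i < hi; i++)
            a[i] *= factor;
    }
}
```

Variables declared inside the parallel block (tid, nthreads, lo, hi, i) are private to each thread, while a, n and factor are shared.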

Task Parallelism

! must be enclosed in a parallel region (or use !$omp parallel sections)
!$omp sections
    !$omp section
        ! task 1

    !$omp section
        ! task 2

    !$omp section
        ! task 3

    ! ... (each section ends at the next !$omp section; there is no "end section")
!$omp end sections

There is no load balancing between sections: each section is executed once, by one thread. To get load balancing, the sections can be rewritten as a for-loop with dynamic scheduling.
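
The rewrite mentioned above might look like the following C sketch. The three task functions and run_tasks are hypothetical stand-ins for real work. Each loop iteration runs one task, and schedule(dynamic, 1) hands the next task to whichever thread becomes free.

```c
typedef int (*task_fn)(void);

/* Hypothetical tasks; real ones would do different amounts of work. */
static int task_a(void) { return 1; }
static int task_b(void) { return 2; }
static int task_c(void) { return 3; }

/* Runs the three tasks and stores each result in results[t]. */
void run_tasks(int results[3])
{
    task_fn tasks[3] = { task_a, task_b, task_c };
    int t;

    /* One task per iteration, assigned to threads as they become ready. */
    #pragma omp parallel for schedule(dynamic, 1)
    for (t = 0; t < 3; t++)
        results[t] = tasks[t]();
}
```

This pays off mainly when there are more tasks than threads and their run times differ.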


Nested Parallelism

Performance Limits

Continued in Lecture 11.

