5.2: Message-Passing Environments


    Introduction

    A message-passing interface is a set of function and subroutine calls for C or FORTRAN that give you a way to split an application for parallel execution. Data is divided and passed out to other processors as messages. The receiving processors unpack them, do some work, and send the results back or pass them along to other processors in the parallel computer.

    In some ways, message passing is the “assembly language” of parallel processing. You get ultimate responsibility, and if you are talented (and your problem cooperates), you get ultimate performance. If you have a nice scalable problem and are not satisfied with the resulting performance, you pretty much have yourself to blame. The compiler is completely unaware of the parallel aspects of the program.

    The two most popular message-passing environments are parallel virtual machine (PVM) and message-passing interface (MPI). Most of the important features are available in either environment. Once you have mastered message passing, moving from PVM to MPI won’t cause you much trouble. You may also operate on a system that provides only a vendor-specific message-passing interface. However, once you understand message passing concepts and have properly decomposed your application, usually it’s not that much more effort to move from one message-passing library to another.1

    Parallel Virtual Machine

    The idea behind PVM is to assemble a diverse set of network-connected resources into a “virtual machine.” A user could marshal the resources of 35 idle workstations on the Internet and have their own personal scalable processing system. The work on PVM started in the early 1990s at Oak Ridge National Labs. PVM was pretty much an instant success among computer scientists. It provided a rough framework in which to experiment with using a network of workstations as a parallel processor.

    In PVM Version 3, your virtual machine can consist of single processors, shared-memory multiprocessors, and scalable multiprocessors. PVM attempts to knit all of these resources into a single, consistent, execution environment.

    To run PVM, you simply need a login account on a set of network computers that have the PVM software installed. You can even install it in your home directory. To create your own personal virtual machine, you would create a list of these computers in a file:

          % cat hostfile
          frodo.egr.msu.edu
          gollum.egr.msu.edu
          mordor.egr.msu.edu
          %

    After some nontrivial machinations with paths and environment variables, you can start the PVM console:

          % pvm hostfile
          pvmd already running.
          pvm> conf
          1 host, 1 data format
                              HOST     DTID     ARCH   SPEED
                            frodo     40000 SUN4SOL2    1000
                            gollum    40001 SUN4SOL2    1000
                            mordor    40002 SUN4SOL2    1000
          pvm> ps
                             HOST     TID   FLAG 0x COMMAND
                              frodo    40042     6/c,f pvmgs
          pvm> reset
          pvm> ps
                             HOST     TID   FLAG 0x COMMAND
          pvm>

    Many different users can be running virtual machines using the same pool of resources. Each user has their own view of an empty machine. The only way you might detect other virtual machines using your resources is in the percentage of the time your applications get the CPU.

    There is a wide range of commands you can issue at the PVM console. The ps command shows the running processes in your virtual machine. It’s quite possible to have more processes than computer systems. Each process is time-shared on a system along with all the other load on the system. The reset command performs a soft reboot on your virtual machine. You are the virtual system administrator of the virtual machine you have assembled.

    To execute programs on your virtual computer, you must compile and link your programs with the PVM library routines:2

          % aimk mast slav
          making in SUN4SOL2/ for SUN4SOL2
          cc -O -I/opt/pvm3/include -DSYSVBFUNC -DSYSVSTR -DNOGETDTBLSIZ
                -DSYSVSIGNAL -DNOWAIT3 -DNOUNIXDOM -o mast
                ../mast.c -L/opt/pvm3/lib/SUN4SOL2 -lpvm3 -lnsl -lsocket
          mv mast ~crs/pvm3/bin/SUN4SOL2
          cc -O -I/opt/pvm3/include -DSYSVBFUNC -DSYSVSTR -DNOGETDTBLSIZ
                -DSYSVSIGNAL -DNOWAIT3 -DNOUNIXDOM -o slav
                ../slav.c -L/opt/pvm3/lib/SUN4SOL2 -lpvm3 -lnsl -lsocket
          mv slav ~crs/pvm3/bin/SUN4SOL2
          %

    When the first PVM call is encountered, the application contacts your virtual machine and enrolls itself in the virtual machine. At that point it should show up in the output of the ps command issued at the PVM console.

    From that point on, your application issues PVM calls to create more processes and interact with those processes. PVM takes the responsibility for distributing the processes on the different systems in the virtual machine, based on the load and your assessment of each system’s relative performance. Messages are moved across the network using user datagram protocol (UDP) and delivered to the appropriate process.

    Typically, the PVM application starts up some additional PVM processes. These can be additional copies of the same program or each PVM process can run a different PVM application. Then the work is distributed among the processes, and results are gathered as necessary.
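
    As a minimal sketch of this startup sequence (not part of the book's examples), the following C program enrolls itself in the virtual machine, spawns a few copies of a hypothetical "worker" executable, and exits. The executable name and the number of tasks are arbitrary choices for illustration:

          #include <stdio.h>
          #include "pvm3.h"

          #define NTASKS 4   /* arbitrary number of workers for this sketch */

          int main()
          {
            int mytid, tids[NTASKS], started;

            /* The first PVM call enrolls this process in the virtual machine */
            mytid = pvm_mytid();

            /* Ask PVM to start NTASKS copies of the (hypothetical) "worker"
               executable anywhere in the virtual machine; the task IDs of the
               new processes are returned in tids[] */
            started = pvm_spawn("worker", (char **)0, PvmTaskDefault, "", NTASKS, tids);
            printf("Enrolled as t%x, spawned %d of %d workers\n", mytid, started, NTASKS);

            pvm_exit();   /* leave the virtual machine */
            return 0;
          }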

    There are several basic models of computing that are typically used when working with PVM:

    • Master/Slave: When operating in this mode, one process (usually the initial process) is designated as the master that spawns some number of worker processes. Work units are sent to each worker process, and the results are returned to the master. Often the master maintains a queue of work to be done and as a slave finishes, the master delivers a new work item to the slave. This approach works well when there is little data interaction and each work unit is independent. This approach has the advantage that the overall problem is naturally load-balanced even when there is some variation in the execution time of individual processes.
    • Broadcast/Gather: This type of application is typically characterized by a shared data structure that is relatively small and can easily be copied into every process. At the beginning of the time step, all the global data structures are broadcast from the master process to all of the processes. Each process then operates on its portion of the data and produces a partial result that is sent back and gathered by the master process. This pattern is repeated for each time step.
    • SPMD/Data decomposition: When the overall data structure is too large to have a copy stored in every process, it must be decomposed across multiple processes. Generally, at the beginning of a time step, all processes must exchange some data with each of their neighboring processes. Then, with their local data augmented by the necessary subset of the remote data, they perform their computations. At the end of the time step, necessary data is again exchanged between neighboring processes, and the cycle is repeated.

    The most complicated applications have nonuniform data flows and data that migrates around the system as the application changes and the load changes on the system.

    In this section, we have two example programs: one is a master-slave operation, and the other is a data decomposition-style solution to the heat flow problem.

    Queue of Tasks

    In this example, one process (mast) creates five slave processes (slav) and doles out 20 work units (double a number). As a slave process responds, it’s given new work or told that all of the work units have been exhausted:

          % cat mast.c
          #include <stdio.h>
          #include "pvm3.h"
    
          #define MAXPROC 5
          #define JOBS 20
    
          main()
          {
            int mytid,info;
            int tids[MAXPROC];
            int tid,input,output,answers,work;
    
            mytid = pvm_mytid();
            info=pvm_spawn("slav", (char**)0, 0, "", MAXPROC, tids);
    
          /* Send out the first work */
            for(work=0;work<MAXPROC;work++) {
              pvm_initsend(PvmDataDefault);
              pvm_pkint(&work, 1, 1 ) ;
              pvm_send(tids[work],1) ;/* 1 = msgtype */
            }
    
          /* Send out the rest of the work requests */
            work = MAXPROC;
            for(answers=0; answers < JOBS ; answers++) {
              pvm_recv( -1, 2 ); /* -1 = any task 2 = msgtype */
              pvm_upkint( &tid, 1, 1 );
              pvm_upkint( &input, 1, 1 );
              pvm_upkint( &output, 1, 1 );
              printf("Thanks to %d 2*%d=%d\n",tid,input,output);
              pvm_initsend(PvmDataDefault);
              if ( work < JOBS ) {
                 pvm_pkint(&work, 1, 1 ) ;
                 work++;
              } else {
                input = -1;
                pvm_pkint(&input, 1, 1 ) ; /* Tell them to stop */
              }
              pvm_send(tid,1) ;
            }
    
            pvm_exit();
          }
          %

    One of the interesting aspects of the PVM interface is the separation of calls to prepare a new message, pack data into the message, and send the message. This is done for several reasons. PVM has the capability to convert between different floating-point formats, byte orderings, and character formats. This also allows a single message to have multiple data items with different types.
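
    As a hedged illustration (not taken from the mast/slav example), a single PVM message can carry several differently typed items; the receiver simply unpacks them in the same order they were packed. The variable tid is assumed to hold a task ID obtained earlier, for example from pvm_spawn:

          /* Sketch: pack an integer count, n doubles, and a string into one
             message and send it to PVM task "tid" with message type 4 (arbitrary) */
          int    n         = 3;
          double values[3] = { 1.0, 2.0, 3.0 };
          char   label[]   = "partial results";

          pvm_initsend(PvmDataDefault);   /* start a fresh send buffer */
          pvm_pkint(&n, 1, 1);            /* one integer, stride 1     */
          pvm_pkdouble(values, n, 1);     /* n doubles, stride 1       */
          pvm_pkstr(label);               /* a null-terminated string  */
          pvm_send(tid, 4);               /* 4 = message type          */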

    The purpose of the message type in each PVM send or receive is to allow the sender to wait for a particular type of message. In this example, we use two message types. Type one is a message from the master to the slave, and type two is the response.

    When performing a receive, a process can either wait for a message from a specific process or a message from any process.
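
    In C, the two forms look like the following sketch; -1 is the wildcard that matches any sender, and pvm_bufinfo applied to the buffer ID returned by pvm_recv reveals who actually sent the message. The variable worker_tid is assumed to hold the task ID of a particular slave:

          int bufid, bytes, msgtag, sender_tid;

          /* Sketch: wait for a type-2 message from one specific task... */
          bufid = pvm_recv(worker_tid, 2);

          /* ...or from any task, then ask PVM who actually sent it */
          bufid = pvm_recv(-1, 2);
          pvm_bufinfo(bufid, &bytes, &msgtag, &sender_tid);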

    In the second phase of the computation, the master waits for a response from any slave, prints the response, and then doles out another work unit to the slave or tells the slave to terminate by sending a message with a value of -1.

    The slave code is quite simple — it waits for a message, unpacks it, checks to see if it is a termination message, returns a response, and repeats:

          % cat slav.c
          #include <stdio.h>
          #include "pvm3.h"
    
          /* A simple program to double integers */
          main()
          {
            int mytid;
            int input,output;
            mytid = pvm_mytid();
    
            while(1) {
              pvm_recv( -1, 1 ); /* -1 = any task 1=msgtype */
              pvm_upkint(&input, 1, 1);
              if ( input == -1 ) break; /* All done */
    
              output = input * 2;
              pvm_initsend( PvmDataDefault );
              pvm_pkint( &mytid, 1, 1 );
              pvm_pkint( &input, 1, 1 );
              pvm_pkint( &output, 1, 1 );
              pvm_send( pvm_parent(), 2 );
            }
            pvm_exit();
          }
          %

    When the master program is executed, it produces the following output:

          % pheat
          Thanks to 262204 2*0=0
          Thanks to 262205 2*1=2
          Thanks to 262206 2*2=4
          Thanks to 262207 2*3=6
          Thanks to 262204 2*5=10
          Thanks to 262205 2*6=12
          Thanks to 262206 2*7=14
          Thanks to 262207 2*8=16
          Thanks to 262204 2*9=18
          Thanks to 262205 2*10=20
          Thanks to 262206 2*11=22
          Thanks to 262207 2*12=24
          Thanks to 262205 2*14=28
          Thanks to 262207 2*16=32
          Thanks to 262205 2*17=34
          Thanks to 262207 2*18=36
          Thanks to 262204 2*13=26
          Thanks to 262205 2*19=38
          Thanks to 262206 2*15=30
          Thanks to 262208 2*4=8
          %

    Clearly the processes are operating in parallel, and the order of execution is somewhat random. This code is an excellent skeleton for handling a wide range of computations. In the next example, we perform an SPMD-style computation to solve the heat flow problem using PVM.

    Heat Flow in PVM

    This next example is a rather complicated application that implements the heat flow problem in PVM. In many ways, it gives some insight into the work that is performed by the HPF environment. We will solve a heat flow problem in a two-dimensional plate with four heat sources and the edges in zero-degree water, as shown in [Figure 1].

    Figure 1: A two-dimensional plate with four constant heat sources

    The data will be spread across all of the processes using a (*, BLOCK) distribution. Columns are distributed to processes in contiguous blocks, and all the row elements in a column are stored on the same process. As with HPF, the process that “owns” a data cell performs the computations for that cell after retrieving any data necessary to perform the computation.

    We use a red-black approach, but for simplicity we copy the data back at the end of each iteration. For a true red-black, you would perform the computation in the opposite direction every other time step.

    Note that instead of spawning slave processes, the parent process spawns additional copies of itself. This is typical of SPMD-style programs. Once the additional processes have been spawned, all the processes wait at a barrier before they look for the process numbers of the members of the group. Once the processes have arrived at the barrier, they all retrieve a list of the different process numbers:

          % cat pheat.f
                PROGRAM PHEAT
                INCLUDE ’../include/fpvm3.h’
                INTEGER NPROC,ROWS,COLS,TOTCOLS,OFFSET
                PARAMETER(NPROC=4,MAXTIME=200)
                PARAMETER(ROWS=200,TOTCOLS=200)
                PARAMETER(COLS=(TOTCOLS/NPROC)+3)
                REAL*8 RED(0:ROWS+1,0:COLS+1), BLACK(0:ROWS+1,0:COLS+1)
                LOGICAL IAMFIRST,IAMLAST
                INTEGER INUM,INFO,TIDS(0:NPROC-1),IERR
                INTEGER I,R,C
                INTEGER TICK,MAXTIME
                CHARACTER*30 FNAME
    
          *     Get the SPMD thing going - Join the pheat group
                CALL PVMFJOINGROUP(’pheat’, INUM)
    
          * If we are the first in the pheat group, make some helpers
                IF ( INUM.EQ.0 ) THEN
                  DO I=1,NPROC-1
                    CALL PVMFSPAWN(’pheat’, 0, ’anywhere’, 1, TIDS(I), IERR)
                  ENDDO
                ENDIF
    
          *     Barrier to make sure we are all here so we can look them up
                CALL PVMFBARRIER( ’pheat’, NPROC, INFO )
    
          * Find my pals and get their TIDs - TIDS are necessary for sending
                DO I=0,NPROC-1
                  CALL PVMFGETTID(’pheat’, I, TIDS(I))
                ENDDO

    At this point in the code, we have NPROC processes executing in an SPMD mode. The next step is to determine which subset of the array each process will compute. This is driven by the INUM variable, which ranges from 0 to 3 and uniquely identifies these processes.

    We decompose the data and store only one quarter of the data on each process. Using the INUM variable, we choose our contiguous set of columns to store and compute. The OFFSET variable maps between a “global” column in the entire array and a local column in our local subset of the array. [Figure 2] shows a map that indicates which processors store which data elements. The values marked with a B are boundary values and won’t change during the simulation. They are all set to 0. This code is often rather tricky to figure out. Performing a (BLOCK, BLOCK) distribution would require a two-dimensional decomposition and exchanging data with the neighbors above and below, in addition to the neighbors to the left and right:

    Figure 2: Assigning grid elements to processors
          * Compute my geometry - What subset do I process? (INUM=0 values)
          * Actual Column = OFFSET + Column (OFFSET = 0)
          *     Column 0 = neighbors from left
          *     Column 1 = send to left
          *     Columns 1..mylen My cells to compute
          *     Column mylen = Send to right (mylen=50)
          *     Column mylen+1 = Neighbors from Right (Column 51)
    
                IAMFIRST = (INUM .EQ. 0)
                IAMLAST = (INUM .EQ. NPROC-1)
                OFFSET = (ROWS/NPROC * INUM )
                MYLEN = ROWS/NPROC
                IF ( IAMLAST ) MYLEN = TOTCOLS - OFFSET
                PRINT *,’INUM:’,INUM,’ Local’,1,MYLEN,
               +                  ’ Global’,OFFSET+1,OFFSET+MYLEN
    
          * Start Cold
                DO C=0,COLS+1
                  DO R=0,ROWS+1
                    BLACK(R,C) = 0.0
                  ENDDO
                ENDDO

    Now we run the time steps. The first act in each time step is to reset the heat sources. In this simulation, we have four heat sources placed near the middle of the plate. We must restore all the values each time through the simulation as they are modified in the main loop:

          * Begin running the time steps
                DO TICK=1,MAXTIME
    
          * Set the heat persistent sources
                  CALL STORE(BLACK,ROWS,COLS,OFFSET,MYLEN,
               +     ROWS/3,TOTCOLS/3,10.0,INUM)
                  CALL STORE(BLACK,ROWS,COLS,OFFSET,MYLEN,
               +     2*ROWS/3,TOTCOLS/3,20.0,INUM)
                  CALL STORE(BLACK,ROWS,COLS,OFFSET,MYLEN,
               +     ROWS/3,2*TOTCOLS/3,-20.0,INUM)
                  CALL STORE(BLACK,ROWS,COLS,OFFSET,MYLEN,
               +     2*ROWS/3,2*TOTCOLS/3,20.0,INUM)

    Now we perform the exchange of the “ghost values” with our neighboring processes. For example, Process 0 contains the elements for global column 50. To compute the next time step values for column 50, we need column 51, which is stored in Process 1. Similarly, before Process 1 can compute the new values for column 51, it needs Process 0’s values for column 50.

    [Figure 3] shows how the data is transferred between processors. Each process sends its leftmost column to the left and its rightmost column to the right. Because the first and last processes border unchanging boundary values on the left and right respectively, this is not necessary for columns one and 200. If all is done properly, each process can receive its ghost values from its left and right neighbors.

    Figure 3: Pattern of communication for ghost values

    The net result of all of the transfers is that each cell that must be computed is surrounded by one layer of either boundary values or ghost values from the right or left neighbors:

          * Send left and right
                  IF ( .NOT. IAMFIRST ) THEN
                    CALL PVMFINITSEND(PVMDEFAULT,TRUE)
                    CALL PVMFPACK( REAL8, BLACK(1,1), ROWS, 1, INFO )
                    CALL PVMFSEND( TIDS(INUM-1), 1, INFO )
                  ENDIF
                  IF ( .NOT. IAMLAST ) THEN
                    CALL PVMFINITSEND(PVMDEFAULT,TRUE)
                    CALL PVMFPACK( REAL8, BLACK(1,MYLEN), ROWS, 1, INFO )
                    CALL PVMFSEND( TIDS(INUM+1), 2, INFO )
                  ENDIF
          * Receive right, then left
                  IF ( .NOT. IAMLAST ) THEN
                     CALL PVMFRECV( TIDS(INUM+1), 1, BUFID )
                     CALL PVMFUNPACK ( REAL8, BLACK(1,MYLEN+1), ROWS, 1, INFO )
                  ENDIF
                  IF ( .NOT. IAMFIRST ) THEN
                    CALL PVMFRECV( TIDS(INUM-1), 2, BUFID )
                    CALL PVMFUNPACK ( REAL8, BLACK(1,0), ROWS, 1, INFO)
                  ENDIF

    This next segment is the easy part. All the appropriate ghost values are in place, so we must simply perform the computation in our subspace. At the end, we copy back from the RED to the BLACK array; in a real simulation, we would perform two time steps, one from BLACK to RED and the other from RED to BLACK, to save this extra copy:

          * Perform the flow
                  DO C=1,MYLEN
                    DO R=1,ROWS
                      RED(R,C) = ( BLACK(R,C) +
               +                   BLACK(R,C-1) + BLACK(R-1,C) +
               +                   BLACK(R+1,C) + BLACK(R,C+1) ) / 5.0
                    ENDDO
                  ENDDO
    
          * Copy back - Normally we would do a red and black version of the loop
                  DO C=1,MYLEN
                    DO R=1,ROWS
                      BLACK(R,C) = RED(R,C)
                    ENDDO
                  ENDDO
                ENDDO

    Now we find the center cell and send it to the master process (if necessary) so it can be printed out. We also dump out the data into files for debugging or later visualization of the results. Each file is made unique by appending the instance number to the filename. Then the program terminates:

                CALL SENDCELL(RED,ROWS,COLS,OFFSET,MYLEN,INUM,TIDS(0),
               +       ROWS/2,TOTCOLS/2)
    
          * Dump out data for verification
                IF ( ROWS .LE. 20 ) THEN
                  FNAME = ’/tmp/pheatout.’ // CHAR(ICHAR(’0’)+INUM)
                  OPEN(UNIT=9,NAME=FNAME,FORM=’formatted’)
                  DO C=1,MYLEN
                    WRITE(9,100)(BLACK(R,C),R=1,ROWS)
          100        FORMAT(20F12.6)
                  ENDDO
                 CLOSE(UNIT=9)
               ENDIF
          * Lets all go together
                CALL PVMFBARRIER( ’pheat’, NPROC, INFO )
                CALL PVMFEXIT( INFO )
    
                END

    The SENDCELL routine finds a particular cell and prints it out on the master process. This routine is called in an SPMD style: all the processes enter this routine, although not all at precisely the same time. Depending on the INUM and the cell that we are looking for, each process may do something different.

    If the cell in question is in the master process, and we are the master process, print it out. All other processes do nothing. If the cell in question is stored in another process, the process with the cell sends it to the master process. The master process receives the value and prints it out. All the other processes do nothing.

    This is a simple example of the typical style of SPMD code. All the processes execute the code at roughly the same time, but, based on information local to each process, the actions performed by different processes may be quite different:

    SUBROUTINE SENDCELL(RED,ROWS,COLS,OFFSET,MYLEN,INUM,PTID,R,C)
                INCLUDE ’../include/fpvm3.h’
                INTEGER ROWS,COLS,OFFSET,MYLEN,INUM,PTID,R,C
                REAL*8 RED(0:ROWS+1,0:COLS+1)
                REAL*8 CENTER
    
          * Compute local row number to determine if it is ours
                I = C - OFFSET
                IF ( I .GE. 1 .AND. I.LE. MYLEN ) THEN
                  IF ( INUM .EQ. 0 ) THEN
                    PRINT *,’Master has’, RED(R,I), R, C, I
                  ELSE
                    CALL PVMFINITSEND(PVMDEFAULT,TRUE)
                    CALL PVMFPACK( REAL8, RED(R,I), 1, 1, INFO )
                    PRINT *, ’INUM:’,INUM,’ Returning’,R,C,RED(R,I),I
                    CALL PVMFSEND( PTID, 3, INFO )
                  ENDIF
                ELSE
                  IF ( INUM .EQ. 0 ) THEN
                    CALL PVMFRECV( -1 , 3, BUFID )
                    CALL PVMFUNPACK ( REAL8, CENTER, 1, 1, INFO)
                    PRINT *, ’Master Received’,R,C,CENTER
                  ENDIF
                ENDIF
                RETURN
                END

    Like the previous routine, the STORE routine is executed on all processes. The idea is to store a value into a global row and column position. First, we must determine if the cell is even in our process. If the cell is in our process, we must compute the local column (I) in our subset of the overall matrix and then store the value:

          SUBROUTINE STORE(RED,ROWS,COLS,OFFSET,MYLEN,R,C,VALUE,INUM)
          REAL*8 RED(0:ROWS+1,0:COLS+1)
          REAL VALUE
          INTEGER ROWS,COLS,OFFSET,MYLEN,R,C,I,INUM
          I = C - OFFSET
          IF ( I .LT. 1 .OR. I .GT. MYLEN ) RETURN
          RED(R,I) = VALUE
          RETURN
          END

    When this program executes, it has the following output:

          % pheat
           INUM: 0 Local 1 50 Global 1 50
           Master Received 100 100 3.4722390023541D-07
          %

    We see two lines of print. The first line indicates the values that Process 0 used in its geometry computation. The second line is the output from the master process of the temperature at cell (100,100) after 200 time steps.

    One interesting technique that is useful for debugging this type of program is to change the number of processes that are created. If the program is not quite moving its data properly, you usually get different results when different numbers of processes are used. If you look closely, the above code performs correctly with one process or 30 processes.

    Notice that there is no barrier operation at the end of each time step. This is in contrast to the way parallel loops operate on shared uniform memory multiprocessors that force a barrier at the end of each loop. Because we have used an “owner computes” rule, and nothing is computed until all the required ghost data is received, there is no need for a barrier. The receipt of the messages with the proper ghost values allows a process to begin computing immediately without regard to what the other processes are currently doing.

    This example can be used either as a framework for developing other grid-based computations, or as a good excuse to use HPF and appreciate the hard work that the HPF compiler developers have done. A well-done HPF implementation of this simulation should outperform the PVM implementation because HPF can make tighter optimizations. Unlike us, the HPF compiler doesn’t have to keep its generated code readable.

    PVM Summary

    PVM is a widely used tool because it affords portability across every architecture other than SIMD. Once the effort has been invested in making a code message passing, it tends to run well on many architectures.

    The primary complaints about PVM include:

    • The need for a pack step separate from the send step
    • The fact that it is designed to work in a heterogeneous environment, which may incur some overhead
    • The lack of automation for common tasks such as geometry computations

    But all in all, for a certain set of programmers, PVM is the tool to use. If you would like to learn more about PVM see PVM — A User’s Guide and Tutorial for Networked Parallel Computing, by Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidy Sunderam (MIT Press). Information is also available at www.netlib.org/pvm3/.

    Message-Passing Interface

    The Message-Passing Interface (MPI) was designed to be an industrial-strength message-passing environment that is portable across a wide range of hardware environments.

    Much like High Performance FORTRAN, MPI was developed by a group of computer vendors, application developers, and computer scientists. The idea was to take the strengths of many of the existing proprietary message-passing environments on a wide variety of architectures and produce a single specification that could be implemented on architectures ranging from SIMD systems with thousands of small processors to MIMD networks of workstations and everything in between.

    Interestingly, the MPI effort was completed a year after the High Performance FORTRAN (HPF) effort. Some viewed MPI as a portable message-passing interface that could support a good HPF compiler. Having MPI makes the compiler more portable. Also, having the compiler use MPI as its message-passing environment ensures that MPI is heavily tested and that sufficient resources are invested in the MPI implementation.

    PVM Versus MPI

    While many of the folks involved in PVM participated in the MPI effort, MPI is not simply a follow-on to PVM. PVM was developed in a university/research lab environment and evolved over time as new features were needed. For example, the group capability was not designed into PVM at a fundamental level. Some of the underlying assumptions of PVM were based on a “network of workstations connected via Ethernet” model and didn’t export well to scalable computers.3 In some ways, MPI is more robust than PVM, and in other ways, MPI is simpler than PVM. MPI doesn’t specify the system management details as PVM does; MPI doesn’t specify how a virtual machine is to be created, operated, and used.

    MPI Features

    MPI has a number of useful features beyond the basic send and receive capabilities. These include:

    • Communicators: A communicator is a subset of the active processes that can be treated as a group for collective operations such as broadcast, reduction, barriers, sending, or receiving. Within each communicator, a process has a rank that ranges from zero to the size of the group minus one. A process may be a member of more than one communicator and have a different rank within each communicator. There is a default communicator, called MPI_COMM_WORLD, that refers to all the MPI processes.
    • Topologies: A communicator can have a topology associated with it. This arranges the processes that belong to a communicator into some layout. The most common layout is a Cartesian decomposition. For example, 12 processes may be arranged into a 3×4 grid.4 Once these topologies are defined, they can be queried to find the neighboring processes in the topology. In addition to the Cartesian (grid) topology, MPI also supports a graph-based topology.
    • Communication modes: MPI supports multiple styles of communication, including blocking and nonblocking. Users can also choose to use explicit buffers for sending or allow MPI to manage the buffers. The nonblocking capabilities allow the overlap of communication and computation. MPI can support a model in which there is no available memory space for buffers and the data must be copied directly from the address space of the sending process to the memory space of the receiving process. MPI also supports a single call to perform a send and receive that is quite useful when processes need to exchange data.
    • Single-call collective operations: Some of the calls in MPI automate collective operations in a single call. For example, the broadcast operation sends values from the master to the slaves and receives the values on the slaves in the same operation. The net result is that the values are updated on all processes. Similarly, there is a single call to sum a value across all of the processes to a single value. By bundling all this functionality into a single call, systems that have support for collective operations in hardware can make best use of this hardware. Also, when MPI is operating on a shared-memory environment, the broadcast can be simplified as all the slaves simply make a local copy of a shared variable.

    Clearly, the developers of the MPI specification had significant experience with developing message-passing applications and added many widely used features to the message-passing library. Without these features, each programmer needed to use more primitive operations to construct their own versions of the higher-level operations.
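
    To make a couple of these features concrete, here is a small hedged sketch in C (the MPI examples in this section are written in FORTRAN; this fragment is purely illustrative). It splits MPI_COMM_WORLD into "row" communicators, assuming four processes per row for the sake of the example, and sums a value within each row using a single collective call:

          #include <stdio.h>
          #include <mpi.h>

          int main(int argc, char **argv)
          {
            int world_rank, row, row_rank;
            double mine, row_sum;
            MPI_Comm row_comm;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

            /* Processes with the same color (row number) land in the same
               new communicator; the key orders the ranks within it */
            row = world_rank / 4;
            MPI_Comm_split(MPI_COMM_WORLD, row, world_rank, &row_comm);
            MPI_Comm_rank(row_comm, &row_rank);   /* may differ from world_rank */

            /* Single-call collective: sum "mine" across this row only */
            mine = (double) world_rank;
            MPI_Allreduce(&mine, &row_sum, 1, MPI_DOUBLE, MPI_SUM, row_comm);
            printf("World rank %d is rank %d in row %d; row sum = %f\n",
                   world_rank, row_rank, row, row_sum);

            MPI_Comm_free(&row_comm);
            MPI_Finalize();
            return 0;
          }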

    Heat Flow in MPI

    In this example, we implement our heat flow problem in MPI using a similar decomposition to the PVM example. There are several ways to approach the problem. We could almost translate PVM calls to corresponding MPI calls using the MPI_COMM_WORLD communicator. However, to showcase some of the MPI features, we create a Cartesian communicator:

                PROGRAM MHEATC
                INCLUDE ’mpif.h’
                INCLUDE ’mpef.h’
                INTEGER ROWS,COLS,TOTCOLS
                PARAMETER(MAXTIME=200)
          * This simulation can be run on MINPROC or greater processes.
          * It is OK to set MINPROC to 1 for testing purposes
          * For a large number of rows and columns, it is best to set MINPROC
          * to the actual number of runtime processes
                PARAMETER(MINPROC=2)
                PARAMETER(ROWS=200,TOTCOLS=200,COLS=TOTCOLS/MINPROC)
                DOUBLE PRECISION RED(0:ROWS+1,0:COLS+1),BLACK(0:ROWS+1,0:COLS+1)
                INTEGER S,E,MYLEN,R,C
                INTEGER TICK,MAXTIME
                CHARACTER*30 FNAME

    The basic data structures are much the same as in the PVM example. We allocate a subset of the heat arrays in each process. In this example, the amount of space allocated in each process is set by the compile-time variable MINPROC. The simulation can execute on more than MINPROC processes (wasting some space in each process), but it can’t execute on less than MINPROC processes, or there won’t be sufficient total space across all of the processes to hold the array:

          INTEGER COMM1D,INUM,NPROC,IERR
          INTEGER DIMS(1),COORDS(1)
          LOGICAL PERIODS(1)
          LOGICAL REORDER
          INTEGER NDIM
          INTEGER STATUS(MPI_STATUS_SIZE)
          INTEGER RIGHTPROC, LEFTPROC

    These data structures are used for our interaction with MPI. As we will be doing a one-dimensional Cartesian decomposition, our arrays are dimensioned to one. If you were to do a two-dimensional decomposition, these arrays would need two elements:

          PRINT *,’Calling MPI_INIT’
          CALL MPI_INIT( IERR )
          PRINT *,’Back from MPI_INIT’
          CALL MPI_COMM_SIZE( MPI_COMM_WORLD, NPROC, IERR )

    The call to MPI_INIT creates the appropriate number of processes. Note that in the output, the PRINT statement before the call only appears once, but the second PRINT appears once for each process. We call MPI_COMM_SIZE to determine the size of the global communicator MPI_COMM_WORLD. We use this value to set up our Cartesian topology:

          * Create new communicator that has a Cartesian topology associated
          * with it - MPI_CART_CREATE returns COMM1D - A communicator descriptor
    
                DIMS(1) = NPROC
                PERIODS(1) = .FALSE.
                REORDER = .TRUE.
                NDIM = 1
    
                CALL MPI_CART_CREATE(MPI_COMM_WORLD, NDIM, DIMS, PERIODS,
                +     REORDER, COMM1D, IERR)

    Now we create a one-dimensional (NDIM=1) arrangement of all of our processes (MPI_COMM_WORLD). All of the parameters on this call are input values except for COMM1D and IERR. COMM1D is an integer “communicator handle.” If you print it out, it will be a value such as 134. It is not actually data; it is merely a handle that is used in other calls. It is quite similar to a file descriptor or unit number used when performing input-output to and from files.

    The topology we use is a one-dimensional decomposition that isn’t periodic. If we specified that we wanted a periodic decomposition, the far-left and far-right processes would be neighbors in a wrapped-around fashion making a ring. Given that it isn’t periodic, the far-left and far-right processes have no neighbors.

    In our PVM example above, we arranged the processes so that Process 0 was the far-left process, Process NPROC-1 was the far-right process, and the other processes were arranged linearly between those two. If we set REORDER to .FALSE., MPI also chooses this arrangement. However, if we set REORDER to .TRUE., MPI may choose to arrange the processes in some other fashion to achieve better performance, assuming that you are communicating with close neighbors.

    Once the communicator is set up, we use it in all of our communication operations:

          * Get my rank in the new communicator
    
                CALL MPI_COMM_RANK( COMM1D, INUM, IERR)

    Within each communicator, each process has a rank from zero to the size of the communicator minus 1. The MPI_COMM_RANK tells each process its rank within the communicator. A process may have a different rank in the COMM1D communicator than in the MPI_COMM_WORLD communicator because of some reordering.

    Given a Cartesian topology communicator,5 we can extract information from the communicator using the MPI_CART_GET routine:

          * Given a communicator handle COMM1D, get the topology, and my position
          * in the topology
    
                CALL MPI_CART_GET(COMM1D, NDIM, DIMS, PERIODS, COORDS, IERR)

    In this call, all of the parameters are output values rather than input values as in the MPI_CART_CREATE call. The COORDS variable tells us our coordinates within the communicator. This is not so useful in our one-dimensional example, but in a two-dimensional process decomposition, it would tell our current position in that two-dimensional grid:

          * Returns the left and right neighbors 1 unit away in the zeroth dimension
          * of our Cartesian map - since we are not periodic, our neighbors may
          * not always exist - MPI_CART_SHIFT handles this for us
    
                CALL MPI_CART_SHIFT(COMM1D, 0, 1, LEFTPROC, RIGHTPROC, IERR)
                CALL MPE_DECOMP1D(TOTCOLS, NPROC, INUM, S, E)
                MYLEN = ( E - S ) + 1
                IF ( MYLEN.GT.COLS ) THEN
                  PRINT *,’Not enough space, need’,MYLEN,’ have ’,COLS
                  PRINT *,TOTCOLS,NPROC,INUM,S,E
                  STOP
                ENDIF
                PRINT *,INUM,NPROC,COORDS(1),LEFTPROC,RIGHTPROC, S, E

    We can use MPI_CART_SHIFT to determine the rank number of our left and right neighbors, so we can exchange our common points with these neighbors. This is necessary because we can’t simply send to INUM-1 and INUM+1 if MPI has chosen to reorder our Cartesian decomposition. If we are the far-left or far-right process, the neighbor that doesn’t exist is set to MPI_PROC_NULL, which indicates that we have no neighbor. Later, when we are performing the message sends, MPI checks this value and sends messages only to real processes. By not sending the message to the “null process,” MPI has saved us an IF test.
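
    The same sequence looks like this in C, as a hedged sketch rather than a translation of the program above; the variable names mirror the FORTRAN code, and MPI_Cart_shift hands back MPI_PROC_NULL at the ends of the nonperiodic decomposition:

          /* Sketch: one-dimensional, nonperiodic Cartesian communicator */
          int dims[1], periods[1] = {0}, reorder = 1;
          int nproc, inum, leftproc, rightproc;
          MPI_Comm comm1d;

          MPI_Comm_size(MPI_COMM_WORLD, &nproc);
          dims[0] = nproc;
          MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, reorder, &comm1d);
          MPI_Comm_rank(comm1d, &inum);

          /* Shift by one in dimension 0: neighbors that do not exist come back
             as MPI_PROC_NULL, and sends/receives to MPI_PROC_NULL are no-ops */
          MPI_Cart_shift(comm1d, 0, 1, &leftproc, &rightproc);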

    To determine which strip of the global array we store and compute in this process, we call a utility routine called MPE_DECOMP1D that simply does several calculations to evenly split our 200 columns among our processes in contiguous strips. In the PVM version, we need to perform this computation by hand.
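
    The arithmetic inside such a routine is straightforward. The following C sketch is not the actual MPE source, just one plausible way to split n columns into contiguous strips, handing the first (n mod nproc) processes one extra column:

          /* Sketch: compute the 1-based start (s) and end (e) columns owned by
             process "rank" when n columns are split among nproc processes */
          void decomp1d(int n, int nproc, int rank, int *s, int *e)
          {
            int base  = n / nproc;   /* columns every process gets        */
            int extra = n % nproc;   /* leftover columns to spread around */

            if (rank < extra) {
              *s = rank * (base + 1) + 1;
              *e = *s + base;          /* base + 1 columns */
            } else {
              *s = rank * base + extra + 1;
              *e = *s + base - 1;      /* base columns     */
            }
          }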

    The MPE_DECOMP1D routine is an example of an extended MPI library call (hence the MPE prefix). These extensions include graphics support and logging tools in addition to some general utilities. The MPE library consists of routines that were useful enough to standardize but not required to be supported by all MPI implementations. You will find the MPE routines supported on most MPI implementations.

    Now that we have our communicator group set up, and we know which strip each process will handle, we begin the computation:

          * Start Cold
    
                DO C=0,COLS+1
                  DO R=0,ROWS+1
                    BLACK(R,C) = 0.0
                  ENDDO
                ENDDO

    As in the PVM example, we set the plate (including boundary values) to zero.

    All processes begin the time step loop. Interestingly, like in PVM, there is no need for any synchronization. The messages implicitly synchronize our loops.

    The first step is to store the permanent heat sources. We need to use a routine because we must make the store operations relative to our strip of the global array:

          * Begin running the time steps
                DO TICK=1,MAXTIME
    
          * Set the persistent heat sources
                  CALL STORE(BLACK,ROWS,COLS,S,E,ROWS/3,TOTCOLS/3,10.0,INUM)
                  CALL STORE(BLACK,ROWS,COLS,S,E,2*ROWS/3,TOTCOLS/3,20.0,INUM)
                  CALL STORE(BLACK,ROWS,COLS,S,E,ROWS/3,2*TOTCOLS/3,-20.0,INUM)
                  CALL STORE(BLACK,ROWS,COLS,S,E,2*ROWS/3,2*TOTCOLS/3,20.0,INUM)

    All of the processes set these values independently depending on which process has which strip of the overall array.

    Now we exchange the data with our neighbors as determined by the Cartesian communicator. Note that we don’t need an IF test to determine if we are the far-left or far-right process. If we are at the edge, our neighbor setting is MPI_PROC_NULL and the MPI_SEND and MPI_RECV calls do nothing when given this as a source or destination value, thus saving us an IF test.

    Note that we specify the communicator COMM1D because the rank values we are using in these calls are relative to that communicator:

          * Send left and receive right
                  CALL MPI_SEND(BLACK(1,1),ROWS,MPI_DOUBLE_PRECISION,
               +                LEFTPROC,1,COMM1D,IERR)
                  CALL MPI_RECV(BLACK(1,MYLEN+1),ROWS,MPI_DOUBLE_PRECISION,
               +                RIGHTPROC,1,COMM1D,STATUS,IERR)
    
          * Send Right and Receive left in a single statement
                  CALL MPI_SENDRECV(
               +       BLACK(1,MYLEN),ROWS,MPI_DOUBLE_PRECISION,RIGHTPROC,2,
               +       BLACK(1,0),ROWS,MPI_DOUBLE_PRECISION,LEFTPROC, 2,
               +       COMM1D, STATUS, IERR)

    Just to show off, we use both the separate send and receive, and the combined send and receive. When given a choice, it’s probably a good idea to use the combined operations to give the runtime environment more flexibility in terms of buffering. One downside to this that occurs on a network of workstations (or any other high-latency interconnect) is that you can’t do both send operations first and then do both receive operations to overlap some of the communication delay.
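
    For comparison, the same exchange might look like the following in C. This is a hedged sketch, not a translation of the program above: it assumes the grid is kept in a flat array black of (mylen+2)*(rows+2) doubles stored column-major like the FORTRAN arrays, and that comm1d, leftproc, and rightproc were set up as in the earlier topology calls:

          /* Sketch: column c starts at index c*(rows+2); skip its boundary row 0 */
          #define COL(c) (&black[(c) * (rows + 2) + 1])
          MPI_Status status;

          /* send my first real column left, receive a ghost column from the right */
          MPI_Sendrecv(COL(1),         rows, MPI_DOUBLE, leftproc,  1,
                       COL(mylen + 1), rows, MPI_DOUBLE, rightproc, 1,
                       comm1d, &status);

          /* send my last real column right, receive a ghost column from the left */
          MPI_Sendrecv(COL(mylen),     rows, MPI_DOUBLE, rightproc, 2,
                       COL(0),         rows, MPI_DOUBLE, leftproc,  2,
                       comm1d, &status);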

    Once we have all of our ghost points from our neighbors, we can perform the algorithm on our subset of the space:

          * Perform the flow
                  DO C=1,MYLEN
                    DO R=1,ROWS
                      RED(R,C) = ( BLACK(R,C) +
               +                   BLACK(R,C-1) + BLACK(R-1,C) +
               +                   BLACK(R+1,C) + BLACK(R,C+1) ) / 5.0
                    ENDDO
                  ENDDO
    
          * Copy back - Normally we would do a red and black version of the loop
                 DO C=1,MYLEN
                   DO R=1,ROWS
                     BLACK(R,C) = RED(R,C)
                   ENDDO
                 ENDDO
               ENDDO

    Again, for simplicity, we don’t do the complete red-black computation.6 We have no synchronization at the bottom of the loop because the messages implicitly synchronize the processes at the top of the next loop.

    Again, we dump out the data for verification. As in the PVM example, one good test of basic correctness is to make sure you get exactly the same results for varying numbers of processes:

          * Dump out data for verification
                IF ( ROWS .LE. 20 ) THEN
                  FNAME = ’/tmp/mheatcout.’ // CHAR(ICHAR(’0’)+INUM)
                  OPEN(UNIT=9,NAME=FNAME,FORM=’formatted’)
                  DO C=1,MYLEN
                    WRITE(9,100)(BLACK(R,C),R=1,ROWS)
          100        FORMAT(20F12.6)
                  ENDDO
                 CLOSE(UNIT=9)
               ENDIF

    To terminate the program, we call MPI_FINALIZE:

          * Lets all go together
                CALL MPI_FINALIZE(IERR)
                END

    As in the PVM example, we need a routine to store a value into the proper strip of the global array. This routine simply checks to see if a particular global element is in this process and if so, computes the proper location within its strip for the value. If the global element is not in this process, this routine simply returns doing nothing:

                SUBROUTINE STORE(RED,ROWS,COLS,S,E,R,C,VALUE,INUM)
                REAL*8 RED(0:ROWS+1,0:COLS+1)
                REAL VALUE
                INTEGER ROWS,COLS,S,E,R,C,I,INUM
                IF ( C .LT. S .OR. C .GT. E ) RETURN
                I = ( C - S ) + 1
          *     PRINT *,’STORE, INUM,R,C,S,E,R,I’,INUM,R,C,S,E,R,I,VALUE
                RED(R,I) = VALUE
                RETURN
                END

    When this program is executed, it has the following output:

          % mpif77 -c mheatc.f
          mheatc.f:
          MAIN mheatc:
          store:
          % mpif77 -o mheatc mheatc.o -lmpe
          % mheatc -np 4
          Calling MPI_INIT
          Back from MPI_INIT
          Back from MPI_INIT
          Back from MPI_INIT
          Back from MPI_INIT
          0  4  0  -1  1  1  50
          2  4  2  1  3  101  150
          3  4  3  2  -1  151  200
          1  4  1  0  2  51  100
         %

    As you can see, we call MPI_INIT to activate the four processes. The PRINT statement immediately after the MPI_INIT call appears four times, once for each of the activated processes. Then each process prints out the strip of the array it will process. We can also see the neighbors of each process, including -1 when a process has no neighbor to the left or right. Notice that Process 0 has no left neighbor, and Process 3 has no right neighbor. MPI has provided the utilities that simplify the message-passing code we need to add to implement this type of grid-based application.

    When you compare this example with a PVM implementation of the same problem, you can see some of the contrasts between the two approaches. Patterns that required the same six lines of code over and over in PVM are combined into a single call in MPI. In MPI, you can think “data parallel” and express your program in a more data-parallel fashion.

    In some ways, MPI feels less like assembly language than PVM. However, MPI does take a little getting used to when compared to PVM. The concept of a Cartesian communicator may seem foreign at first, but with understanding, it becomes a flexible and powerful tool.

    Heat in MPI Using Broadcast/Gather

    One style of parallel programming that we have not yet seen is the broadcast/gather style. Not all applications can be naturally solved using this style of programming. However, if an application can use this approach effectively, the amount of modification that is required to make a code run in a message-passing environment is minimal.

    Applications that most benefit from this approach generally do a lot of computation using some small amount of shared information. One requirement is that one complete copy of the “shared” information must fit in each of the processes.

    If we keep our grid size small enough, we can actually program our heat flow application using this approach. This is almost certainly a less efficient implementation than any of the earlier implementations of this problem because the core computation is so simple. However, if the core computations were more complex and needed access to values farther than one unit away, this might be a good approach.

    The data structures are simpler for this approach and, actually, are no different than the single-process FORTRAN 90 or HPF versions. We will allocate a complete RED and BLACK array in every process:

          PROGRAM MHEAT
          INCLUDE ’mpif.h’
          INCLUDE ’mpef.h’
          INTEGER ROWS,COLS
          PARAMETER(MAXTIME=200)
          PARAMETER(ROWS=200,COLS=200)
          DOUBLE PRECISION RED(0:ROWS+1,0:COLS+1),BLACK(0:ROWS+1,0:COLS+1)

    We need fewer variables for the MPI calls because we aren’t creating a communicator. We simply use the default communicator MPI_COMM_WORLD. We start up our processes, and find the size and rank of our process group:

          INTEGER INUM,NPROC,IERR,SRC,DEST,TAG
          INTEGER S,E,LS,LE,MYLEN
          INTEGER STATUS(MPI_STATUS_SIZE)
          INTEGER I,R,C
          INTEGER TICK,MAXTIME
          CHARACTER*30 FNAME
    
          PRINT *,’Calling MPI_INIT’
          CALL MPI_INIT( IERR )
          CALL MPI_COMM_SIZE( MPI_COMM_WORLD, NPROC, IERR )
          CALL MPI_COMM_RANK( MPI_COMM_WORLD, INUM, IERR)
          CALL MPE_DECOMP1D(COLS, NPROC, INUM, S, E, IERR)
          PRINT *,’My Share ’, INUM, NPROC, S, E

    Since we are broadcasting initial values to all of the processes, we only have to set things up on the master process:

          * Start Cold
    
                IF ( INUM.EQ.0 ) THEN
                  DO C=0,COLS+1
                    DO R=0,ROWS+1
                      BLACK(R,C) = 0.0
                    ENDDO
                  ENDDO
                ENDIF

    As we run the time steps (again with no synchronization), we set the persistent heat sources directly. Since the shape of the data structure is the same in the master and all other processes, we can use the real array coordinates rather than mapping them as with the previous examples. We could skip the persistent settings on the nonmaster processes, but it doesn’t hurt to do it on all processes:

          * Begin running the time steps
                DO TICK=1,MAXTIME
    
          * Set the heat sources
                  BLACK(ROWS/3, COLS/3)= 10.0
                  BLACK(2*ROWS/3, COLS/3) = 20.0
                  BLACK(ROWS/3, 2*COLS/3) = -20.0
                  BLACK(2*ROWS/3, 2*COLS/3) = 20.0

    Now we broadcast the entire array from process rank zero to all of the other processes in the MPI_COMM_WORLD communicator. Note that this single call does the sending on the rank zero process and the receiving on the other processes. The net result is that, after the call, all the processes hold the values that were in the master process:

          * Broadcast the array
                 CALL MPI_BCAST(BLACK,(ROWS+2)*(COLS+2),MPI_DOUBLE_PRECISION,
               +                 0,MPI_COMM_WORLD,IERR)

    Now we perform the subset computation on each process. Note that we are using global coordinates because the array has the same shape on each of the processes. All we need to do is make sure we set up our particular strip of columns according to S and E:

          * Perform the flow on our subset
    
                  DO C=S,E
                    DO R=1,ROWS
                      RED(R,C) = ( BLACK(R,C) +
               +                   BLACK(R,C-1) + BLACK(R-1,C) +
               +                   BLACK(R+1,C) + BLACK(R,C+1) ) / 5.0
                    ENDDO
                  ENDDO

    Now we need to gather the appropriate strips from the processes into the appropriate strip in the master array for rebroadcast in the next time step. We could change the loop in the master to receive the messages in any order and check the STATUS variable to see which strip it received:

          * Gather back up into the BLACK array in master (INUM = 0)
                 IF ( INUM .EQ. 0 ) THEN
                   DO C=S,E
                     DO R=1,ROWS
                       BLACK(R,C) = RED(R,C)
                     ENDDO
                   ENDDO
                 DO I=1,NPROC-1
                   CALL MPE_DECOMP1D(COLS, NPROC, I, LS, LE, IERR)
                   MYLEN = ( LE - LS ) + 1
                   SRC = I
                   TAG = 0
                   CALL MPI_RECV(BLACK(0,LS),MYLEN*(ROWS+2),
               +                     MPI_DOUBLE_PRECISION, SRC, TAG,
               +                     MPI_COMM_WORLD, STATUS, IERR)
          *         Print *,’Recv’,I,MYLEN
                 ENDDO
               ELSE
                 MYLEN = ( E - S ) + 1
                 DEST = 0
                 TAG = 0
                 CALL MPI_SEND(RED(0,S),MYLEN*(ROWS+2),MPI_DOUBLE_PRECISION,
               +                DEST, TAG, MPI_COMM_WORLD, IERR)
                    Print *,’Send’,INUM,MYLEN
                ENDIF
               ENDDO

    We use MPE_DECOMP1D to determine which strip we’re receiving from each process.
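
    As a hedged alternative (not the book's code), the same gather could be expressed with a single MPI_Gatherv call, since each process contributes a different number of columns. The names strip, recvcounts, and displs are hypothetical: strip would point at the first owned column, recvcounts(i) would be mylen_i*(rows+2), and displs(i) would be ls_i*(rows+2), all computed from the same MPE_DECOMP1D arithmetic, and the root contributes its own strip as well:

          /* Sketch: gather variable-length strips of columns onto rank 0 in one
             call; counts and displacements are measured in MPI_DOUBLE elements */
          MPI_Gatherv(strip, mylen * (rows + 2), MPI_DOUBLE,
                      black, recvcounts, displs, MPI_DOUBLE,
                      0, MPI_COMM_WORLD);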

    In some applications, the value that must be gathered is a sum or another single value. To accomplish this, you can use one of the MPI reduction routines that coalesce a set of distributed values into a single value using a single call.
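
    For example, a global sum becomes one line in C; this is a hedged sketch with hypothetical variable names rather than part of the program above:

          /* Sketch: combine each process's partial sum into a single total on
             rank 0; MPI_Allreduce would instead leave the total on every process */
          MPI_Reduce(&partial_sum, &total_sum, 1, MPI_DOUBLE, MPI_SUM,
                     0, MPI_COMM_WORLD);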

    Again at the end, we dump out the data for testing. However, since it has all been gathered back onto the master process, we only need to dump it on one process:

          * Dump out data for verification
                IF ( INUM .EQ.0 .AND. ROWS .LE. 20 ) THEN
                  FNAME = ’/tmp/mheatout’
                  OPEN(UNIT=9,NAME=FNAME,FORM=’formatted’)
                  DO C=1,COLS
                    WRITE(9,100)(BLACK(R,C),R=1,ROWS)
          100        FORMAT(20F12.6)
                  ENDDO
                 CLOSE(UNIT=9)
                ENDIF
    
                CALL MPI_FINALIZE(IERR)
                END

    When this program executes with four processes, it produces the following output:

          % mpif77 -c mheat.f
          mheat.f:
          MAIN mheat:
          % mpif77 -o mheat mheat.o -lmpe
          % mheat -np 4
          Calling MPI_INIT
          My Share 1 4 51 100
          My Share 0 4 1 50
          My Share 3 4 151 200
          My Share 2 4 101 150
          %

    The ranks of the processes and the subsets of the computations for each process are shown in the output.

    So that is a somewhat contrived example of the broadcast/gather approach to parallelizing an application. If the data structures are the right size and the amount of computation relative to communication is appropriate, this can be a very effective approach that may require the smallest number of code modifications compared to a single-processor version of the code.

    MPI Summary

    Whether you choose PVM or MPI depends on which library the vendor of your system prefers. Sometimes MPI is the better choice because it contains the newest features, such as support for hardware-supported multicast or broadcast, that can significantly improve the overall performance of a scatter-gather application.

    A good text on MPI is Using MPI — Portable Parallel Programming with the Message-Passing Interface, by William Gropp, Ewing Lusk, and Anthony Skjellum (MIT Press). You may also want to retrieve and print the MPI specification from http://www.netlib.org/mpi/.

    Closing Notes

    In this chapter we have looked at the “assembly language” of parallel programming. While it can seem daunting to rethink your application, there are often some simple changes you can make to port your code to message passing. Depending on the application, a master-slave, broadcast-gather, or decomposed data approach might be most appropriate.

    It’s important to realize that some applications just don’t decompose into message passing very well. You may be working with just such an application. Once you have some experience with message passing, it becomes easier to identify the critical points where data must be communicated between processes.

    While HPF, PVM, and MPI are all mature and popular technologies, it’s not clear whether any of these technologies will be the long-term solution that we will use 10 years from now. One possibility is that we will use FORTRAN 90 (or FORTRAN 95) without any data layout directives or that the directives will be optional. Another interesting possibility is simply to keep using FORTRAN 77. As scalable, cache-coherent, non-uniform memory systems become more popular, they will evolve their own data allocation primitives. For example, the HP/Exemplar supports the following data storage attributes: shared, node-private, and thread-private. As dynamic data structures are allocated, they can be placed in any one of these classes. Node-private memory is shared across all the threads on a single node but not shared beyond those threads. Perhaps we will only have to declare the storage class of the data but not the data layout for these new machines.

    PVM and MPI still need the capability of supporting a fault-tolerant style of computing that allows an application to complete as resources fail or otherwise become unavailable. The amount of compute power that will be available to applications that can tolerate some unreliability in the resources will be very large. There have been a number of moderately successful attempts in this area, such as Condor, but none have really caught on in the mainstream.

    To run the most powerful computers in the world at their absolute maximum performance levels, the need to be portable is somewhat reduced. Making your particular application go ever faster and scale to ever higher numbers of processors is a fascinating activity. May the FLOPS be with you!


    Footnotes

    1. Notice I said “not that much more effort.”
    2. Note: the exact compilation may be different on your system.
    3. One should not diminish the positive contributions of PVM, however. PVM was the first widely available portable message-passing environment. PVM pioneered the idea of heterogeneous distributed computing with built-in format conversion.
    4. Sounds a little like HPF, no?
    5. Remember, each communicator may have a topology associated with it. A topology can be grid, graph, or none. Interestingly, the MPI_COMM_WORLD communicator has no topology associated with it.
    6. Note that you could do two time steps (one black-red-black iteration) if you exchanged two ghost columns at the top of the loop.

    This page titled 5.2: Message-Passing Environments is shared under a CC BY license and was authored, remixed, and/or curated by Chuck Severance.
