
3.3: Eliminating Clutter

    Introduction

    We have looked at code from the compiler’s point of view and at how to profile code to find the trouble spots. This is good information, but if you are dissatisfied with a code’s performance, you might still be wondering what to do about it. One possibility is that your code is too obtuse for the compiler to optimize properly. Excess code, too much modularization, or even previous optimization-related “improvements” can clutter up your code and confuse the compilers. Clutter is anything that contributes to the runtime without contributing to the answer. It comes in two forms:

    Things that contribute to overhead

    Subroutine calls, indirect memory references, tests within loops, wordy tests, type conversions, variables preserved unnecessarily

    Things that restrict compiler flexibility

    Subroutine calls, indirect memory references, tests within loops, ambiguous pointers

    It’s not a mistake that some of the same items appear in both lists. Subroutine calls or if-statements within loops can both bite and scratch you by taking too much time and by creating fences — places in the program where instructions that appear before can’t be safely intermixed with instructions that appear after, at least not without a great deal of care. The goal of this chapter is to show you how to eliminate clutter, so you can restructure what’s left over for the fastest execution. We save a few specific topics that might fit here, especially those regarding memory references, for later chapters where they are treated as subjects by themselves.

    Before we start, we’ll remind you: as you look for ways to improve what you have, keep your eyes and mind open to the possibility that there might be a fundamentally better way to do something—a more efficient sorting technique, random number generator, or solver. A different algorithm may buy you far more speed than tuning. Algorithms are beyond the scope of this book, but what we are discussing here should help you recognize “good” code, or help you to code a new algorithm to get the best performance.

    Subroutine Calls

    A typical corporation is full of frightening examples of overhead. Say your department has prepared a stack of paperwork to be completed by another department. What do you have to do to transfer that work? First, you have to be sure that your portion is completed; you can’t ask them to take over if the materials they need aren’t ready. Next, you need to package the materials — data, forms, charge numbers, and the like. And finally comes the official transfer. Upon receiving what you sent, the other department has to unpack it, do their job, repackage it, and send it back.

    A lot of time gets wasted moving work between departments. Of course, if the overhead is minimal compared to the amount of useful work being done, it won’t be that big a deal. But it might be more efficient for small jobs to stay within one department. The same is true of subroutine and function calls. If you only enter and exit modules once in a relative while, the overhead of saving registers and preparing argument lists won’t be significant. However, if you are repeatedly calling a few small subroutines, the overhead can buoy them to the top of the profile. It might be better if the work stayed where it was, in the calling routine.

    Additionally, subroutine calls inhibit compiler flexibility. Given the right opportunity, you’d like your compiler to have the freedom to intermix instructions that aren’t dependent upon each other. These are found on either side of a subroutine call, in the caller and callee. But the opportunity is lost when the compiler can’t peer into subroutines and functions. Instructions that might overlap very nicely have to stay on their respective sides of the artificial fence.

    It helps if we illustrate the challenge that subroutine boundaries present with an exaggerated example. The following loop runs very well on a wide range of processors:

          DO I=1,N
            A(I) = A(I) + B(I) * C
          ENDDO

    The code below performs the same calculations, but look at what we have done:

          DO I=1,N
            CALL MADD (A(I), B(I), C)
          ENDDO

          SUBROUTINE MADD (A,B,C)
          A = A + B * C
          RETURN
          END

    Each iteration calls a subroutine to do a small amount of work that was formerly within the loop. This is a particularly painful example because it involves floating-point calculations. The resulting loss of parallelism, coupled with the procedure call overhead, might produce code that runs 100 times slower. Remember, these operations are pipelined, and it takes a certain amount of “wind-up” time before the throughput reaches one operation per clock cycle. If there are few floating-point operations to perform between subroutine calls, the time spent winding up and winding down pipelines figures prominently.
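    The same effect is easy to sketch in C (the `madd` helper here is our own illustrative stand-in, not a library routine). Unless the compiler inlines it, the call-per-element version pays the call overhead N times and loses the chance to pipeline the arithmetic:

```c
#include <stddef.h>

/* Illustrative helper: one multiply-add per call. Each call costs
   argument setup and a branch, and acts as an optimization fence. */
static void madd(double *a, const double *b, double c)
{
    *a = *a + *b * c;
}

/* Clutter: a subroutine call inside the loop for a tiny piece of work. */
void axpy_calls(double *a, const double *b, double c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        madd(&a[i], &b[i], c);
}

/* Clean: the work stays in the loop, so the compiler can pipeline it. */
void axpy_inline(double *a, const double *b, double c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + b[i] * c;
}
```

    A modern compiler can often inline a small static function like this on its own, which is exactly the cure discussed under procedure inlining later in this section.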

    Subroutine and function calls complicate the compiler’s ability to efficiently manage COMMON and external variables, delaying until the last possible moment actually storing them in memory. The compiler uses registers to hold the “live” values of many variables. When you make a call, the compiler cannot tell whether the subroutine will be changing variables that are declared as external or COMMON. Therefore, it’s forced to store any modified external or COMMON variables back into memory so that the callee can find them. Likewise, after the call has returned, the same variables have to be reloaded into registers because the compiler can no longer trust the old, register-resident copies. The penalty for saving and restoring variables can be substantial, especially if you are using lots of them. It can also be unwarranted if variables that ought to be local are specified as external or COMMON, as in the following code:

          COMMON /USELESS/ K
          DO K=1,1000
            IF (K .EQ. 1) CALL AUX
          ENDDO

    In this example, K has been declared as a COMMON variable. It is used only as a do-loop counter, so there really is no reason for it to be anything but local. However, because it is in a COMMON block, the call to AUX forces the compiler to store and reload K each iteration. This is because the side effects of the call are unknown.

    So far, it looks as if we are preparing a case for huge main programs without any subroutines or functions! Not at all. Modularity is important for keeping source code compact and understandable. And frankly, the need for maintainability and modularity is always more important than the need for small performance improvements. However, there are a few approaches for streamlining subroutine calls that don’t require you to scrap modular coding techniques: macros and procedure inlining.

    Remember, if the function or subroutine does a reasonable amount of work, procedure call overhead isn’t going to matter very much. However, if one small routine appears as a leaf node in one of the busiest sections of the call graph, you might want to think about inserting it in appropriate places in the program.


    Macros

    Macros are little procedures that are substituted inline at compile time. Unlike subroutines or functions, which are included once during the link, macros are replicated every place they are used. When the compiler makes its first pass through your program, it looks for patterns that match previous macro definitions and expands them inline. In fact, in later stages, the compiler sees an expanded macro as source code.

    Macros are part of both C and FORTRAN (although the FORTRAN notion of a macro, the statement function, is reviled by the FORTRAN community, and won’t survive much longer).1 For C programs, macros are created with a #define construct, as demonstrated here:

          #define average(x,y) ((x+y)/2)
          main ()
          {
             float q = 100, p = 50;
             float a;
             a = average(p,q);
             printf ("%f\n",a);
          }

    The first compilation step for a C program is a pass through the C preprocessor, cpp. This happens automatically when you invoke the compiler. cpp expands #define statements inline, replacing the pattern matched by the macro definition. In the program above, the statement:

          a = average(p,q);

    gets replaced with:

          a = ((p+q)/2);

    You have to be careful how you define the macro because it literally replaces the pattern located by cpp. For instance, if the macro definition said:

          #define multiply(a,b) (a*b)

    and you invoked it as:

          c = multiply(x+t,y+v);

    the resulting expansion would be x+t*y+v — probably not what you intended.

    If you are a C programmer you may be using macros without being conscious of it. Many C header files (.h) contain macro definitions. In fact, some “standard” C library functions are really defined as macros in the header files. For instance, the function getchar can be linked in when you build your program. If you have a statement:

          #include <stdio.h>

    in your file, getchar is replaced with a macro definition at compile time, replacing the C library function.

    You can make cpp macros work for FORTRAN programs too.2 For example, a FORTRAN version of the C program above might look like this:

          #define AVERAG(X,Y) ((X+Y)/2)
                PROGRAM MAIN
                REAL A,P,Q
                DATA P,Q /50.,100./
                A = AVERAG(P,Q)
                WRITE (*,*) A

    Without a little preparation, the #define statement is rejected by the FORTRAN compiler. The program first has to be preprocessed through cpp to replace the use of AVERAG with its macro definition. It makes compilation a two-step procedure, but that shouldn’t be too much of a burden, especially if you are building your programs under the control of the make utility. We would also suggest you store FORTRAN programs containing cpp directives under filename.F to distinguish them from unadorned FORTRAN. Just be sure you make your changes only to the .F files and not to the output from cpp. This is how you would preprocess FORTRAN .F files by hand:

          % /lib/cpp -P < average.F > average.f
          % f77 average.f -c

    The FORTRAN compiler never sees the original code. Instead, the macro definition is substituted inline as if you had typed it yourself:

                PROGRAM MAIN
                REAL A,P,Q
                DATA P,Q /50.,100./
                A = ((P+Q)/2)
                WRITE (*,*) A

    By the way, some FORTRAN compilers recognize the .F extension already, making the two-step process unnecessary. If the compiler sees the .F extension it invokes cpp automatically, compiles the output, and throws away the intermediate .f file. Try compiling a .F on your computer to see if it works.

    Also, be aware that macro expansions may make source lines extend past column 72, which will probably make your FORTRAN compiler complain (or worse: it might pass unnoticed). Some compilers support input lines longer than 72 characters. On the Sun compilers the -e option allows extended input lines up to 132 characters long.

    Procedure Inlining

    Macro definitions tend to be pretty short, usually just a single statement. Some- times you have slightly longer (but not too long) bits of code that might also benefit from being copied inline, rather than called as a subroutine or function. Again, the reason for doing this is to eliminate procedure call overhead and expose parallelism. If your compiler is capable of inlining subroutine and function definitions into the modules that call them, then you have a very natural, very portable way to write modular code without suffering the cost of subroutine calls.

    Depending on the vendor, you can ask the compiler for procedure inlining by:

    • Specifying which routines should be inlined on the compiler’s command line
    • Putting inlining directives into the source program
    • Letting the compiler inline automatically

    The directives and compile line options are not standard, so you have to check your compiler documentation. Unfortunately, you may learn that there is no such feature (“yet,” always yet), or that it’s an expensive extra. The third form of inlining in the list, automatic, is available from just a few vendors. Automatic inlining depends on a sophisticated compiler that can view the definitions of several modules at once.

    There are some words of caution with regard to procedure inlining. You can easily do too much of it. If everything and anything is ingested into the body of its parents, the resulting executable may be so large that it repeatedly spills out of the instruction cache and becomes a net performance loss. Our advice is that you use the caller/callee information profilers give you and make some intelligent decisions about inlining, rather than trying to inline every subroutine available. Again, small routines that are called often are generally the best candidates for inlining.


    Branches

    People sometimes take a week to make a decision, so we can’t fault a computer if it takes a few tens of nanoseconds. However, if an if-statement appears in some heavily traveled section of the code, you might get tired of the delay. There are two basic approaches to reducing the impact of branches:

    • Streamline them.
    • Move them out to the computational suburbs. Particularly, get them out of loops.

    In [Section 2.3.4] we show you some easy ways to reorganize conditionals so they execute more quickly.

    Branches Within Loops

    Numerical codes usually spend most of their time in loops, so you don’t want anything inside a loop that doesn’t have to be there, especially an if-statement. Not only do if-statements gum up the works with extra instructions, they can force a strict order on the iterations of a loop. Of course, you can’t always avoid conditionals. Sometimes, though, people place them in loops to process events that could have been handled outside, or even ignored.

    To take you back a few years, the following code shows a loop with a test for a value close to zero:

          PARAMETER (SMALL = 1.E-20)
          DO I=1,N
            IF (ABS(A(I)) .GE. SMALL) THEN
              B(I) = B(I) + A(I) * C
            ENDIF
          ENDDO

    The idea was that if the multiplier, A(I), were reasonably small, there would be no reason to perform the math in the center of the loop. Because floating-point operations weren’t pipelined on many machines, a comparison and a branch was cheaper; the test would save time. On an older CISC or early RISC processor, a comparison and branch is probably still a savings. But on other architectures, it costs a lot less to just perform the math and skip the test. Eliminating the branch eliminates a control dependency and allows the compiler to pipeline more arithmetic operations. Of course, the answer could change slightly if the test is eliminated. It then becomes a question of whether the difference is significant. Here’s another example where a branch isn’t necessary. The loop finds the absolute value of each element in an array:

          DO I=1,N
            IF (A(I) .LT. 0.) A(I) = -A(I)
          ENDDO

    But why perform the test at all? On most machines, it’s quicker to perform the abs() operation on every element of the array.

    We do have to give you a warning, though: if you are coding in C, the absolute value, fabs(), is a subroutine call. In this particular case, you are better off leaving the conditional in the loop.3
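    Footnote 3 hints at the underlying trick: an IEEE-754 float keeps its sign in a single bit, so clearing that bit is a branch-free absolute value. Here is a sketch of the idea in C (it assumes 32-bit IEEE-754 floats; modern compilers usually turn fabsf() into exactly this single instruction, so measure before hand-coding it):

```c
#include <stdint.h>
#include <string.h>

/* Branch-free absolute value: clear the IEEE-754 sign bit.
   memcpy avoids strict-aliasing violations and is optimized away.
   Assumes float is 32-bit IEEE-754. */
static float abs_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    u &= 0x7FFFFFFFu;            /* "and" out the sign bit */
    memcpy(&x, &u, sizeof u);
    return x;
}

/* The loop from the text, with neither a branch nor a library call. */
void abs_array(float *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = abs_bits(a[i]);
}
```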

    When you can’t always throw out the conditional, there are things you can do to minimize negative performance. First, we have to learn to recognize which conditionals within loops can be restructured and which cannot. Conditionals in loops fall into several categories:

    • Loop invariant conditionals
    • Loop index dependent conditionals
    • Independent loop conditionals
    • Dependent loop conditionals
    • Reductions
    • Conditionals that transfer control

    Let’s look at these types in turn.

    Loop Invariant Conditionals

    The following loop contains an invariant test:

          DO I=1,K
            IF (N .EQ. 0) THEN
              A(I) = A(I) + B(I) * C
            ELSE
              A(I) = 0.
            ENDIF
          ENDDO

    “Invariant” means that the outcome is always the same. Regardless of what happens to the variables A, B, C, and I, the value of N won’t change, so neither will the outcome of the test.

    You can recast the loop by making the test outside and replicating the loop body twice — once for when the test is true, and once for when it is false, as in the following example:

          IF (N .EQ. 0) THEN
            DO I=1,K
              A(I) = A(I) + B(I) * C
            ENDDO
          ELSE
            DO I=1,K
              A(I) = 0.
            ENDDO
          ENDIF

    The effect on the runtime is dramatic. Not only have we eliminated K-1 copies of the test, we have also assured that the computations in the middle of the loop are not control-dependent on the if-statement, and are therefore much easier for the compiler to pipeline.

    We remember helping someone optimize a program with loops containing similar conditionals. They were checking to see whether debug output should be printed each iteration inside an otherwise highly optimizable loop. We can’t fault the person for not realizing how much this slowed the program down. Performance wasn’t important at the time. The programmer was just trying to get the code to produce good answers. But later on, when performance mattered, by cleaning up invariant conditionals, we were able to speed up the program by a factor of 100.
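    A C sketch of that cleanup, often called loop unswitching (the debug flag and loop body here are hypothetical stand-ins for the real program): test the invariant once, outside, and keep the hot loop branch-free:

```c
#include <stdio.h>

/* Loop unswitching by hand: the invariant `debug` test is evaluated
   once instead of n times, and the hot path carries no control
   dependency that would block pipelining. */
void update(double *a, const double *b, double c, int n, int debug)
{
    if (debug) {
        for (int i = 0; i < n; i++) {
            a[i] = a[i] + b[i] * c;
            printf("a[%d] = %f\n", i, a[i]);
        }
    } else {
        for (int i = 0; i < n; i++)
            a[i] = a[i] + b[i] * c;
    }
}
```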

    Loop Index Dependent Conditionals

    For loop index dependent conditionals, the test is true for certain ranges of the loop index variables. It isn’t always true or always false, like the conditional we just looked at, but it does change with a predictable pattern, and one that we can use to our advantage. The following loop has two index variables, I and J.

          DO I=1,N
            DO J=1,N
              IF (J .LT. I) THEN
                A(J,I) = A(J,I) + B(J,I) * C
              ELSE
                A(J,I) = 0.0
              ENDIF
            ENDDO
          ENDDO

    Notice how the if-statement partitions the iterations into distinct sets: those for which it is true and those for which it is false. You can take advantage of the predictability of the test to restructure the loop into several loops — each custom-made for a different partition:

          DO I=1,N
            DO J=1,I-1
              A(J,I) = A(J,I) + B(J,I) * C
            ENDDO
            DO J=I,N
              A(J,I) = 0.0
            ENDDO
          ENDDO

    The new version will almost always be faster. A possible exception is when N is a small value, like 3, in which case we have created more clutter. But then, the loop probably has such a small impact on the total runtime that it won’t matter which way it’s coded.

    Independent Loop Conditionals

    It would be nice if you could optimize every loop by partitioning it. But more often than not, the conditional doesn’t directly depend on the value of the index variables. Although an index variable may be involved in addressing an array, it doesn’t create a recognizable pattern in advance — at least not one you can see when you are writing the program. Here’s such a loop:

          DO I=1,N
            DO J=1,N
              IF (B(J,I) .GT. 1.0) A(J,I) = A(J,I) + B(J,I) * C
            ENDDO
          ENDDO

    There is not much you can do about this type of conditional. But because every iteration is independent, the loop can be unrolled or can be performed in parallel.

    Dependent Loop Conditionals

    When the conditional is based on a value that changes with each iteration of the loop, the compiler has no choice but to execute the code exactly as written. For instance, the following loop has an if-statement with built-in scalar recursion:

          DO I=1,N
            IF (X .LT. A(I)) X = X + B(I)*2.
          ENDDO

    You can’t know which way the branch will go for the next iteration until you are done with the current iteration. To recognize the dependency, try to unroll the loop slightly by hand. If you can’t start the second test until the first has finished, you have a dependent loop conditional. You may want to look at these types of loops to see if you can eliminate the iteration-to-iteration dependency.


    Reductions

    Keep an eye out for loops in which the if-statement is performing a max or min function on an array. This is a reduction, so called because it reduces an array to a scalar result (the previous example was a reduction too, by the way). Again, we are getting a little bit ahead of ourselves, but since we are talking about if-statements in loops, we want to introduce a trick for restructuring max and min reductions to expose more parallelism. The following loop searches for the maximum value, z, in the array a by going through the elements one at a time:

          for (i=0; i<n; i++)
              z = a[i] > z ? a[i] : z;

    As written, it’s recursive like the loop from the previous section. You need the result of a given iteration before you can proceed to the next. However, since we are looking for the greatest element in the whole array, and since that will be the same element (essentially) no matter how we go about looking for it, we can restructure the loop to check several elements at a time (we assume n is evenly divisible by 2 and do not include the preconditioning loop):

          z0 = 0.;
          z1 = 0.;
          for (i=0; i< n-1; i+=2) {
            z0 = z0 < a[i] ? a[i] : z0;
            z1 = z1 < a[i+1] ? a[i+1] : z1;
          }
          z = z0 < z1 ? z1 : z0;

    Do you see how the new loop calculates two new maximum values each iteration? These maximums are then compared with one another, and the winner becomes the new official max. It’s analogous to a play-off arrangement in a Ping-Pong tournament. Whereas the old loop was more like two players competing at a time while the rest sat around, the new loop runs several matches side by side. In general this particular optimization is not a good one to code by hand. On parallel processors, the compiler performs the reduction in its own way. If you hand-code a reduction like this one, you may inadvertently limit the compiler’s flexibility on a parallel system.
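    For completeness, here is a sketch of the two-accumulator version with the preconditioning (cleanup) step included, so n need not be even. It also seeds the accumulators from the array itself; the 0.0 seeds in the loop above quietly assume the maximum is nonnegative:

```c
/* Two-way unrolled max reduction with a cleanup iteration for odd n.
   Seeding both accumulators from a[0] keeps the result correct even
   when every element is negative. Assumes n >= 1. */
float max_reduce(const float *a, int n)
{
    float z0 = a[0], z1 = a[0];
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        z0 = z0 < a[i]   ? a[i]   : z0;
        z1 = z1 < a[i+1] ? a[i+1] : z1;
    }
    if (i < n)                       /* leftover element when n is odd */
        z0 = z0 < a[i] ? a[i] : z0;
    return z0 < z1 ? z1 : z0;
}
```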

    Conditionals That Transfer Control

    Let’s step back a second. Have you noticed a similarity among all the loops so far? We have looked only at a particular type of conditional, conditional assignments — based on the outcome of the test, a variable gets reassigned. Of course, not every conditional ends up in an assignment. You can have statements that transfer flow of control, such as subroutine calls or goto statements. In the following example, the programmer is carefully checking before dividing by zero.

    However, this test has an extremely negative impact on the performance because it forces the iterations to be done precisely in order:

          DO I=1,N
            DO J=1,N
              IF (B(J,I) .EQ. 0) THEN
                PRINT *,I,J
                STOP
              ENDIF
              A(J,I) = A(J,I) / B(J,I)
            ENDDO
          ENDDO

    Avoiding these tests is one of the reasons that the designers of the IEEE floating-point standard added the trap feature for operations such as dividing by zero. These traps allow the programmer in a performance-critical section of the code to achieve maximum performance yet still detect when an error occurs.

    Other Clutter

    Clutter comes in many forms. Consider the previous sections as having dealt with large pieces of junk you might find in the front hall closet: an ironing board, hockey sticks, and pool cues. Now we are down to the little things: a widowed checker, a tennis ball, and a hat nobody owns. We want to mention a few of them here. We apologize in advance for changing subjects a lot, but that’s the nature of cleaning out a closet!

    Data Type Conversions

    Statements that contain runtime type conversions suffer a little performance penalty each time the statement is executed. If the statement is located in a portion of the program where there is a lot of activity, the total penalty can be significant.

    People have their reasons for writing applications with mixed typing. Often it is a matter of saving memory space, memory bandwidth, or time. In the past, for instance, double-precision calculations took twice as long as their single-precision counterparts, so if some of the calculations could be arranged to take place in single precision, there could be a performance win.4 But any time saved by performing part of the calculations in single precision and part in double precision has to be measured against the additional overhead caused by the runtime type conversions. In the following code, the addition of A(I) to B(I) is mixed type:

          INTEGER NUMEL, I
          PARAMETER (NUMEL = 1000)
          REAL*8 A(NUMEL)
          REAL*4 B(NUMEL)
          DO I=1,NUMEL
            A(I) = A(I) + B(I)
          ENDDO

    In each iteration, B(I) has to be promoted to double precision before the addition can occur. You don’t see the promotion in the source code, but it’s there, and it takes time.

    C programmers beware: in Kernighan and Ritchie (K&R) C, all floating-point calculations in C programs take place in double precision — even if all the variables involved are declared as float. It is possible for you to write a whole K&R application in one precision, yet suffer the penalty of many type conversions.

    Another data type–related mistake is to use character operations in IF tests. On many systems, character operations have poorer performance than integer operations since they may be done via procedure calls. Also, the optimizers may not look at code using character variables as a good candidate for optimization. For example, the following code:

          DO I=1,10000
            IF ( CHVAR(I) .EQ. 'Y' ) THEN
              A(I) = A(I) + B(I)*C
            ENDIF
          ENDDO

    might be better written using an integer variable to indicate whether or not a computation should be performed:

          DO I=1,10000
            IF ( IFLAG(I) .EQ. 1 ) THEN
              A(I) = A(I) + B(I)*C
            ENDIF
          ENDDO

    Another way to write the code, assuming the IFLAG variable was 0 or 1, would be as follows:

          DO I=1,10000
            A(I) = A(I) + B(I)*C*IFLAG(I)
          ENDDO

    The last approach might actually perform slower on some computer systems than the approach using the IF and the integer variable.
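    A quick C sketch of the two alternatives (IFLAG assumed to hold only 0 or 1, as the text requires). Which one wins is machine-dependent, so time both:

```c
/* Conditional accumulate, branch version. */
void accum_branch(double *a, const double *b, double c,
                  const int *iflag, int n)
{
    for (int i = 0; i < n; i++)
        if (iflag[i] == 1)
            a[i] = a[i] + b[i] * c;
}

/* Branch-free version: multiplying by 0 or 1 folds the test into
   the arithmetic. Equivalent only when iflag[i] is 0 or 1. */
void accum_multiply(double *a, const double *b, double c,
                    const int *iflag, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i] * c * iflag[i];
}
```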

    Doing Your Own Common Subexpression Elimination

    So far we have given your compiler the benefit of the doubt. Common subexpression elimination — the ability of the compiler to recognize repeated patterns in the code and replace all but one with a temporary variable — probably works on your machine for simple expressions. In the following lines of code, most compilers would recognize a+b as a common subexpression:

          c = a + b + d
          e = q + a + b

    becomes:

          temp = a + b
          c = temp + d
          e = q + temp

    Substituting for a+b eliminates some of the arithmetic. If the expression is reused many times, the savings can be significant. However, a compiler’s ability to recognize common subexpressions is limited, especially when there are multiple components, or their order is permuted. A compiler might not recognize that a+b+c and c+b+a are equivalent.5 For important parts of the program, you might consider doing common subexpression elimination of complicated expressions by hand. This guarantees that it gets done. It compromises beauty somewhat, but there are some situations where it is worth it.

    Here’s another example in which the function sin is called twice with the same argument:

          x = r*sin(a)*cos(b);
          y = r*sin(a)*sin(b);
          z = r*cos(a);

    becomes:

          temp = r*sin(a);
          x = temp*cos(b);
          y = temp*sin(b);
          z = r*cos(a);

    We have replaced one of the calls with a temporary variable. We agree, the savings for eliminating one transcendental function call out of five won’t win you a Nobel prize, but it does call attention to an important point: compilers typically do not perform common subexpression elimination over subroutine or function calls. The compiler can’t be sure that the subroutine call doesn’t change the state of the argument or some other variables that it can’t see.

    The only time a compiler might eliminate common subexpressions containing function calls is when they are intrinsics, as in FORTRAN. This can be done because the compiler can assume some things about their side effects. You, on the other hand, can see into subroutines, which means you are better qualified than the compiler to group together common subexpressions involving subroutines or functions.

    Doing Your Own Code Motion

    All of these optimizations have their biggest payback within loops because that’s where all of a program’s activity is concentrated. One of the best ways to cut down on runtime is to move unnecessary or repeated (invariant) instructions out of the main flow of the code and into the suburbs. For loops, it’s called hoisting instructions when they are pulled out from the top and sinking when they are pushed down below. Here’s an example:

          DO I=1,N
            A(I) = A(I) / SQRT(X*X + Y*Y)
          ENDDO

    becomes:

          TEMP = 1 / SQRT(X*X + Y*Y)
          DO I=1,N
            A(I) = A(I) * TEMP
          ENDDO

    We hoisted an expensive, invariant operation out of the loop and assigned the result to a temporary variable. Notice, too, that we made an algebraic simplification when we exchanged a division for multiplication by an inverse. The multiplication will execute much more quickly. Your compiler might be smart enough to make these transformations itself, assuming you have instructed the compiler that these are legal transformations; but without crawling through the assembly language, you can’t be positive. Of course, if you rearrange code by hand and the runtime for the loop suddenly goes down, you will know that the compiler has been sandbagging all along.

    Sometimes you want to sink an operation below the loop. Usually, it’s some calculation performed each iteration but whose result is only needed for the last. To illustrate, here’s a sort of loop that is different from the ones we have been looking at. It searches for the final character in a character string:

          while (*p != ' ')
            c = *p++;

    becomes:

          while (*p++ != ' ');
          c = *(p-2);

    The new version of the loop moves the assignment of c beyond the last iteration. Admittedly, this transformation would be a reach for a compiler and the savings wouldn’t even be that great. But it illustrates the notion of sinking an operation very well.

    Again, hoisting or sinking instructions to get them out of loops is something your compiler should be capable of doing. But often you can slightly restructure the calculations yourself when you move them to get an even greater benefit.

    Handling Array Elements in Loops

    Here’s another area where you would like to trust the compiler to do the right thing. When making repeated use of an array element within a loop, you want to be charged just once for loading it from memory. Take the following loop as an example. It reuses X(I) twice:

          DO I=1,N
            XOLD(I) = X(I)
            X(I) = X(I) + XINC(I)
          ENDDO

    In reality, the steps that go into retrieving X(I) are just additional common subexpressions: an address calculation (possibly) and a memory load operation. You can see that the operation is repeated by rewriting the loop slightly:

          DO I=1,N
            TEMP = X(I)
            XOLD(I) = TEMP
            X(I) = TEMP + XINC(I)
          ENDDO

    FORTRAN compilers should recognize that the same X(I) is being used twice and that it only needs to be loaded once, but compilers aren’t always so smart. You sometimes have to create a temporary scalar variable to hold the value of an array element over the body of a loop. This is particularly true when there are subroutine calls or functions in the loop, or when some of the variables are external or COMMON. Make sure to match the types between the temporary variables and the other variables. You don’t want to incur type conversion overhead just because you are “helping” the compiler. For C compilers, the same kind of indexed expressions are an even greater challenge. Consider this code:

          void doinc(int xold[], int x[], int xinc[], int n)
          {
            int i;
            for (i=0; i<n; i++) {
              xold[i] = x[i];
              x[i] = x[i] + xinc[i];
            }
          }
    Unless the compiler can see the definitions of x, xinc, and xold, it has to assume that they are pointers leading back to the same storage, and repeat the loads and stores. In this case, introducing temporary variables to hold the values of x[i], xinc[i], and xold[i] is an optimization the compiler wasn’t free to make.
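    Since C99, the restrict qualifier gives you a standard way to make that promise to the compiler yourself: it asserts that, within the function, each pointer is the only route to its storage, so x[i] can be loaded once and held in a register. A sketch (behavior is undefined if the arrays actually do overlap):

```c
/* C99 `restrict`: the programmer promises these arrays don't alias,
   restoring the compiler's freedom to keep x[i] in a register. */
void doinc_restrict(int * restrict xold, int * restrict x,
                    const int * restrict xinc, int n)
{
    for (int i = 0; i < n; i++) {
        xold[i] = x[i];
        x[i] = x[i] + xinc[i];
    }
}
```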

    Interestingly, while putting scalar temporaries in the loop is useful for RISC and superscalar machines, it doesn’t help code that runs on parallel hardware. A parallel compiler looks for opportunities to eliminate the scalars or, at the very least, to replace them with temporary vectors. If you run your code on a parallel machine from time to time, you might want to be careful about introducing scalar temporary variables into a loop. A dubious performance gain in one instance could be a real performance loss in another.

    Closing Notes

    In this chapter, we introduced tuning techniques for eliminating program clutter — anything that contributes to the runtime without contributing to the answer. We saw many examples of tuning techniques — enough that you may be asking yourself, “What’s left?” Well, as we will see in the upcoming chapters, there are a couple of ways we can help the compiler:

    • Find more parallelism
    • Use memory as effectively as possible

    Sometimes this means we make changes that are not beautiful. However, they are often quick.


    Exercise \(\PageIndex{1}\)

    How would you simplify the following loop conditional?
          DO I=1,N
            A(I) = A(I) * B
            IF (I .EQ. N/2) A(I) = 0.
          ENDDO

    Exercise \(\PageIndex{2}\)

    Time this loop on your computer, both with and without the test. Run it with three sets of data: one with all A(I)s less than SMALL, one with all A(I)s greater than SMALL, and one with an even split. When is it better to leave the test in the loop, if ever?

    Exercise \(\PageIndex{3}\)

    Write a simple program that calls a simple subroutine in its inner loop. Time the program execution. Then tell the compiler to inline the routine and test the performance again. Finally, modify the code to perform the operations in the body of the loop and time the code. Which option ran faster? You may have to look at the generated machine code to figure out why.


    1. The statement function has been eliminated in FORTRAN 90.
    2. Some programmers use the standard UNIX m4 preprocessor for FORTRAN.
    3. The machine representation of a floating-point number starts with a sign bit. If the bit is 0, the number is positive. If it is 1, the number is negative. The fastest absolute value function is one that merely “ands” out the sign bit. See macros in /usr/include/macros.h and /usr/include/math.h.
    4. Nowadays, single-precision calculations can take longer than double-precision calculations from register to register.
    5. And because of overflow and round-off errors in floating-point, in some situations they might not be equivalent.