# 6.2: Appendix B- Looking at Assembly Language

## Assembly Language

In this appendix, we take a look at the assembly language produced by a number of different compilers on a number of different architectures. In this survey we revisit some of the issues of CISC versus RISC, and the strengths and weaknesses of different architectures.

For this survey, two roughly identical segments of code were used. The code was a relatively long loop adding two arrays and storing the result in a third array. The loops were written both in FORTRAN and C.

The FORTRAN loop was as follows:

REAL A(10000),B(10000),C(10000)
INTEGER N,I
DO 10 I=1,N
A(I) = B(I) + C(I)
ENDDO
END

The C version was:

for(i=0;i<n;i++) a[i] = b[i] + c[i];

We have gathered these examples over the years from a number of different compilers, and the results are not particularly scientific. This is not intended to review a particular architecture or compiler version, but rather just to show an example of the kinds of things you can learn from looking at the output of the compiler.

### Intel 8088

The Intel 8088 processor used in the original IBM Personal Computer is a very traditional CISC processing system with features severely limited by its transistor count. It has very few registers, and the registers generally have rather specific functions. To support a large memory model, it must set its segment register leading up to each memory operation. This limitation means that every memory access takes a minimum of three instructions. Interestingly, a similar pattern occurs on RISC processors.

You notice that at one point, the code moves a value from the ax register to the bx register because it needs to perform another computation that can only be done in the ax register. Note that this is only an integer computation, as the Intel [sic].

mov     word ptr -2[bp],0          # bp is I
$11: mov ax,word ptr -2[bp] # Load I cmp ax,word ptr 18[bp] # Check I>=N bge$10
shl     ax,1                       # Multiply I by 2
mov     bx,ax                      # Done - now move to bx
add     bx,word ptr 10[bp]         # bx = Address of B + Offset
mov     es,word ptr 12[bp]         # Top part of address
mov     ax,es: word ptr [bx]       # Load B(i)
mov     bx,word ptr -2[bp]         # Load I
shl     bx,1                       # Multiply I by 2
add     bx,word ptr 14[bp]         # bx = Address of C + Offset
mov     es,word ptr 16[bp]         # Top part of address
mov     bx,word ptr -2[bp]         # Load I
shl     bx,1                       # Multiply I by 2
add     bx,word ptr 6[bp]          # bx = Address of A + Offset
mov     es,word ptr 8[bp]          # Top part of address
mov     es: word ptr [bx],ax       # Store
$9: inc word ptr -2[bp] # Increment I in memory jmp$11
\$10:

Because there are so few registers, the variable I is kept in memory and loaded several times throughout the loop. The inc instruction at the end of the loop actually updates the value in memory. Interestingly, at the top of the loop, the value is then reloaded from memory.

In this type of architecture, the available registers put such a strain on the flexibility of the compiler, there is often not much optimization that is practical.

### Motorola MC68020

In this section, we examine another classic CISC processor, the Motorola MC68020, which was used to build Macintosh computers and Sun workstations. We happened to run this code on a BBN GP-1000 Butterfly parallel processing system made up of 96 MC68020 processors.

The Motorola architecture is relatively easy to program in assembly language. It has plenty of 32-bit registers, and they are relatively easy to use. It has a CISC instruction set that keeps assembly language programming quite simple. Many instructions can perform multiple operations in a single instruction.

We use this example to show a progression of optimization levels, using a f77 compiler on a floating-point version of the loop. Our first example is with no optimization:

! Note d0 contains the value I
L5:
movl     d0,L13               ! Store I to memory if loop ends
lea      a1@(-4),a0           ! a1 = address of B
fmoves   a0@(0,d0:l:4),fp0    ! Load of B(I)
lea      a3@(-4),a0           ! a3 = address of C
lea      a2@(-4),a0           ! a2 = address of A
fmoves   fp0,a0@(0,d0:l:4)    ! Store of A(I)
subql    #1,d1                ! Decrement "N"
tstl     d1
bnes     L5

The value for I is stored in the d0 register. Each time through the loop, it’s incremented by 1. At the same time, register d1 is initialized to the value for N and decremented each time through the loop. Each time through the loop, I is stored into memory, so the proper value for I ends up in memory when the loop terminates. Registers a1, a2, and a3 are preloaded to be the first address of the arrays B, A, and C respectively. However, since FORTRAN arrays begin at 1, we must subtract 4 from each of these addresses before we can use I as the offset. The lea instructions are effectively subtracting 4 from one address register and storing it in another.

The following instruction performs an address computation that is almost a one-to- one translation of an array reference:

fmoves a0@(0,d0:l:4),fp0 ! Load of B(I)

This instruction retrieves a floating-point value from the memory. The address is computed by first multiplying d0 by 4 (because these are 32-bit floating-point numbers) and adding that value to a0. As a matter of fact, the lea and fmoves instructions could have been combined as follows:

fmoves a1@(-4,d0:l:4),fp0 ! Load of B(I)

To compute its memory address, this instruction multiplies d0 by 4, adds the contents of a1, and then subtracts 4. The resulting address is used to load 4 bytes into floating-point register fp0. This is almost a literal translation of fetching B(I). You can see how the assembly is set up to track high-level constructs.

It is almost as if the compiler were “trying” to show off and make use of the nifty assembly language instructions.

Like the Intel, this is not a load-store architecture. The fadds instruction adds a value from memory to a value in a register (fp0) and leaves the result of the addition in the register. Unlike the Intel 8088, we have enough registers to store quite a few of the values used throughout the loop (I, N, the address of A, B, and C) in registers to save memory operations.

#### C on the MC68020

In the next example, we compiled the C version of the loop with the normal optimization (-O) turned on. We see the C perspective on arrays in this code. C views arrays as extensions to pointers in C; the loop index advances as an offset from a pointer to the beginning of the array:

! d3 = I
! d1 = Address of A
! d2 = Address of B
! d0 = Address of C
! a6@(20) = N
moveq   #0,d3                 ! Initialize I
L1:     movl    d3,a1                 ! Make copy of I
movl    a1,d4                 ! Again
asll    #2,d4                 ! Multiply by 4 (word size)
movl    d4,a1                 ! Put back in an address register
movl    a6@(16),d0            ! Get address of C
fmoves  fp0,a1@(0,d1:l)       ! Store into A(I)
L5:
cmpl    a6@(20),d3
bits    L1

We first see the value of I being copied into several registers and multiplied by 4 (using a left shift of 2, strength reduction). Interestingly, the value in register a1 is I multiplied by 4. Registers d0, d1, and d2 are the addresses of C, B, and A respectively. In the load, add, and store, a1 is the base of the address computation and d0, d1, and d2 are added as an offset to a1 to compute each address.

This is a simplistic optimization that is primarily trying to maximize the values that are kept in registers during loop execution. Overall, it’s a relatively literal translation of the C language semantics from C to assembly. In many ways, C was designed to generate relatively efficient code without requiring a highly sophisticated optimizer.

#### More optimization

In this example, we are back to the FORTRAN version on the MC68020. We have compiled it with the highest level of optimization (-OLM) available on this compiler. Now we see a much more aggressive approach to the loop:

! a0 = Address of C(I)
! a1 = Address of B(I)
! a2 = Address of A(I)
L3:
fmoves  fp0,a2@                    ! Store A(I)
subql   #1,d0                      ! Decrement I
tstl    d0
bnes    L3

First off, the compiler is smart enough to do all of its address adjustment outside the loop and store the adjusted addresses of A, B, and C in registers. We do the load, add, and store in quick succession. Then we advance the array addresses by 4 and perform the subtraction to determine when the loop is complete.

This is very tight code and bears little resemblance to the original FORTRAN code.

### SPARC Architecture

These next examples were performed using a SPARC architecture system using FORTRAN. The SPARC architecture is a classic RISC processor using load-store access to memory, many registers and delayed branching. We first examine the code at the lowest optimization:

.L18:                                 ! Top of the loop
ld      [%fp-4],%l2           ! Address of B
sll     %l0,2,%l1             ! Multiply by 4
ld      [%fp-8],%l2           ! Address of C
sll     %l0,2,%l1             ! Multiply by 4
ld      [%fp-12],%l2          ! Address of A
sll     %l0,2,%l1             ! Multiply by 4
st      %f2,[%l0+0]           ! Store A(I)
st      %l1,[%l0+0]           ! Store I
cmp     %l1,%l0               ! Compare
ble     .L18
nop                           ! Branch Delay Slot

This is some pretty poor code. We don’t need to go through it line by line, but there are a few quick observations we can make. The value for I is loaded from memory five times in the loop. The address of I is computed six times throughout the loop (each time takes two instructions). There are no tricky memory addressing modes, so multiplying I by 4 to get a byte offset is done explicitly three times (at least they use a shift). To add insult to injury, they even put a NO-OP in the branch delay slot.

One might ask, “Why do they ever generate code this bad?” Well, it’s not because the compiler isn’t capable of generating efficient code, as we shall see below. One explanation is that in this optimization level, it simply does a one-to-one translation of the tuples (intermediate code) into machine language. You can almost draw lines in the above example and precisely identify which instructions came from which tuples.

One reason to generate the code using this simplistic approach is to guarantee that the program will produce the correct results. Looking at the above code, it’s pretty easy to argue that it indeed does exactly what the FORTRAN code does. You can track every single assembly statement directly back to part of a FORTRAN statement.

It’s pretty clear that you don’t want to execute this code in a high performance production environment without some more optimization.

#### Moderate optimization

In this example, we enable some optimization (-O1):

save    %sp,-120,%sp           ! Rotate the register window
st      %o0,[%fp-12]           ! Store on the stack
st      %o0,[%fp-4]            ! Store on the stack
st      %o0,[%fp-8]            ! Store on the stack
ld      [%i3],%o0              ! %o0 = N (fourth parameter)
or      %g0,1,%o1              ! %o1 = 1 (for addition)
st      %o0,[%fp-20]           ! store N on the stack
st      %o1,[%o2]              ! Set memory copy of I to 1
ld      [%o2],%o1              ! o1 = I (kind of redundant)
cmp     %o1,%o0                ! Check I > N (zero-trip?)
bg      .L12                   ! Don’t do loop at all
nop                             ! Delay Slot
ld      [%o2],%o0              ! Pre-load for Branch Delay Slot
.L900000110:                          ! Top of the loop
ld      [%fp-4],%o1            ! o1 = Address of B(0)
sll     %o0,2,%o0              ! Multiply I by 4
ld      [%o1+%o0],%f2          ! f2 = B(I)
ld      [%o2],%o0              ! Load I from memory
ld      [%fp-8],%o1            ! o1 = Address of C(0)
sll     %o0,2,%o0              ! Multiply I by 4
ld      [%o1+%o0],%f3          ! f3 = C(I)
ld      [%o2],%o0              ! Load I from memory (not again!)
ld      [%fp-12],%o1           ! o1 = Address of A(0)
sll     %o0,2,%o0              ! Multiply I by 4 (yes, again)
st      %f2,[%o1+%o0]          ! A(I) = f2
ld      [%o2],%o0              ! Load I from memory
add     %o0,1,%o0              ! Increment I in register
st      %o0,[%o2]              ! Store I back into memory
ld      [%o2],%o0              ! Load I back into a register
ld      [%fp-20],%o1           ! Load N into a register
cmp     %o0,%o1                ! I > N ??
ble,a   .L900000110
ld      [%o2],%o0              ! Branch Delay Slot

This is a significant improvement from the previous example. Some loop constant computations (subtracting 4) were hoisted out of the loop. We only loaded I 4 times during a loop iteration. Strangely, the compiler didn’t choose to store the addresses of A(0), B(0), and C(0) in registers at all even though there were plenty of registers. Even more perplexing is the fact that it loaded a value from memory immediately after it had stored it from the exact same register!

But one bright spot is the branch delay slot. For the first iteration, the load was done before the loop started. For the successive iterations, the first load was done in the branch delay slot at the bottom of the loop.

Comparing this code to the moderate optimization code on the MC68020, you can begin to get a sense of why RISC was not an overnight sensation. It turned out that an unsophisticated compiler could generate much tighter code for a CISC processor than a RISC processor. RISC processors are always executing extra instructions here and there to compensate for the lack of slick features in their instruction set. If a processor has a faster clock rate but has to execute more instructions, it does not always have better performance than a slower, more efficient processor.

But as we shall soon see, this CISC advantage is about to evaporate in this particular example.

#### Higher optimization

We now increase the optimization to -O2. Now the compiler generates much better code. It’s important you remember that this is the same compiler being used for all three examples.

At this optimization level, the compiler looked through the code sufficiently well to know it didn’t even need to rotate the register windows (no save instruction). Clearly the compiler looked at the register usage of the entire routine:

! Note, didn’t even rotate the register Window
! We just use the %o registers from the caller

! %o0 = Address of first element of A (from calling convention)
! %o1 = Address of first element of B (from calling convention)
! %o2 = Address of first element of C (from calling convention)
! %o3 = Address of N (from calling convention)

cmp     %g2,1                     ! Check to see if it is <1
bl      .L77000006                ! Check for zero trip loop
or      %g0,1,%g1                 ! Delay slot - Set I to 1
.L77000003:
ld      [%o1],%f0                 ! Load B(I) First time Only
.L900000109:
cmp     %g1,%g2                   ! Check Loop Termination
st      %f0,[%o0]                 ! Store A(I)
ble,a   .L900000109               ! Branch w/ annul
ld      [%o1],%f0                 ! Load the B(I)
.L77000006:
retl                              ! Leaf Return (No window)
nop                               ! Branch Delay Slot

This is tight code. The registers o0, o1, and o2 contain the addresses of the first elements of A, B, and C respectively. They already point to the right value for the first iteration of the loop. The value for I is never stored in memory; it is kept in global register g1. Instead of multiplying I by 4, we simply advance the three addresses by 4 bytes each iteration.

The branch delay slots are utilized for both branches. The branch at the bottom of the loop uses the annul feature to cancel the following load if the branch falls through.

The most interesting observation regarding this code is the striking similarity to the code and the code generated for the MC68020 at its top optimization level:

L3:
fmoves  fp0,a2@                   ! Store A(I)
subql   #1,d0                     ! Decrement I
tstl    d0
bnes    L3

The two code sequences are nearly identical! For the SPARC, it does an extra load because of its load-store architecture. On the SPARC, I is incremented and compared to N, while on the MC68020, I is decremented and compared to zero.

This aptly shows how the advancing compiler optimization capabilities quickly made the “nifty” features of the CISC architectures rather useless. Even on the CISC processor, the post-optimization code used the simple forms of the instructions because they produce they fastest execution time.

Note that these code sequences were generated on an MC68020. An MC68060 should be able to eliminate the three addql instructions by using post-increment, saving three instructions. Add a little loop unrolling, and you have some very tight code. Of course, the MC68060 was never a broadly deployed workstation processor, so we never really got a chance to take it for a test drive.

### Convex C-240

This section shows the results of compiling on the Convex C-Series of parallel/vector supercomputers. In addition to their normal registers, vector computers have vector registers that contain up to 256 64-bit elements. These processors can perform operations on any subset of these registers with a single instruction.

It is hard to claim that these vector supercomputers are more RISC or CISC. They have simple lean instruction sets and, hence, are RISC-like. However, they have instructions that implement loops, and so they are somewhat CISC-like.

The Convex C-240 has scalar registers (s2), vector registers (v2), and address registers (a3). Each vector register has 128 elements. The vector length register controls how many of the elements of each vector register are processed by vector instructions. If vector length is above 128, the entire register is processed.

The code to implement our loop is as follows:

L4:      mov.ws   2,vl                      ; Set the Vector length to N
ld.w     0(a5),v0                  ; Load B into Vector Register
ld.w     0(a2),v1                  ; Load C into Vector Register
st.w     v2,0(a3)                  ; Store results into A
lt.w     #0,s2                     ; Check to see if "N" is < 0
jbrs.t   L4

Initially, the vector length register is set to N. We assume that for the first iteration, N is greater than 128. The next instruction is a vector load instruction into register v0. This loads 128 32-bit elements into this register. The next instruction also loads 128 elements, and the following instruction adds those two registers and places the results into a third vector register. Then the 128 elements in Register v2 are stored back into memory. After those elements have been processed, N is decremented by 128 (after all, we did process 128 elements). Then we add 512 to each of the addresses (4 bytes per element) and loop back up. At some point, during the last iteration, if N is not an exact multiple of 128, the vector length register is less than 128, and the vector instructions only process those remaining elements up to N.

One of the challenges of vector processors is to allow an instruction to begin executing before the previous instruction has completed. For example, once the load into Register v1 has partially completed, the processor could actually begin adding the first few elements of v0 and v1 while waiting for the rest of the elements of v1 to arrive. This approach of starting the next vector instruction before the previous vector instruction has completed is called chaining. Chaining is an important feature to get maximum performance from vector processors.

### IBM RS-6000

The IBM RS-6000 is generally credited as the first RISC processor to have cracked the Linpack 100×100 benchmark. The RS-6000 is characterized by strong floating-point performance and excellent memory bandwidth among RISC workstations. The RS-6000 was the basis for IBM’s scalable parallel processor: the IBM-SP1 and SP2.

When our example program is run on the RS-6000, we can see the use of a CISC- style instruction in the middle of a RISC processor. The RS-6000 supports a branch- on-count instruction that combines the decrement, test, and branch operations into a single instruction. Moreover, there is a special register (the count register) that is part of the instruction fetch unit that stores the current value of the counter. The fetch unit also has its own add unit to perform the decrements for this instruction.

These types of features creeping into RISC architectures are occurring because there is plenty of chip space for them. If a wide range of programs can run faster with this type of instruction, it’s often added.

The assembly code on the RS-6000 is:

ai       r3,r3,-4                # Address of A(0)
ai       r5,r5,-4                # Address of B(0)
ai       r4,r4,-4                # Address of C(0)
bcr      BO_IF_NOT,CR0_GT
mtspr    CTR,r6                  # Store in the Counter Register
__L18:
lfsu     fp0,4(r4)               # Pre Increment Load
lfsu     fp1,4(r5)               # Pre Increment Load
fa       fp0,fp0,fp1
frsp     fp0,fp0
stfsu    fp0,4(r3)               # Pre-increment Store
bc       BO_dCTR_NZERO,CR0_LT,__L18    # Branch on Counter

The RS-6000 also supports a memory addressing mode that can add a value to its address register before using the address register. Interestingly, these two features (branch on count and pre-increment load) eliminate several instructions when compared to the more “pure” SPARC processor. The SPARC processor has 10 instructions in the body of its loop, while the RS-6000 has 6 instructions.

The advantage of the RS-6000 in this particular loop may be less significant if both processors were two-way superscalar. The instructions were eliminated on the RS-6000 were integer instructions. On a two-way superscalar processor, those integer instructions may simply execute on the integer units while the floating-point units are busy performing the floating-point computations.

### Conclusion

In this section, we have attempted to give you some understanding of the variety of assembly language that is produced by compilers at different optimization levels and on different computer architectures. At some point during the tuning of your code, it can be quite instructive to take a look at the generated assembly language to be sure that the compiler is not doing something really stupid that is slowing you down.

Please don’t be tempted to rewrite portions in assembly language. Usually any problems can be solved by cleaning up and streamlining your high-level source code and setting the proper compiler flags.

It is interesting that very few people actually learn assembly language any more. Most folks find that the compiler is the best teacher of assembly language. By adding the appropriate option (often -S), the compiler starts giving you lessons. I suggest that you don’t print out all of the code. There are many pages of useless variable declarations, etc. For these examples, I cut out all of that useless information. It is best to view the assembly in an editor and only print out the portion that pertains to the particular loop you are tuning.