Optimisation processes

The improvements in execution speed that are obtained depend upon the style and content of the source program, for example, whether one- or multi-dimensioned arrays are used, whether nested loops appear, and so on.

As optimisation can involve source code re arrangement and a change in the way that registers and store locations are used, it is possible that numerical results produced by an optimised program may differ in some way from those produced by the unoptimised version of the same program. This effect may be more noticeable with iterative algorithms and is due to the fact that a more accurate value can be held in a coprocessor floating point register than can be held in the corresponding store location.

Some programs may actually execute more slowly when optimised due to non-executed loops that cannot be detected by the compiler, for example:

DO 10 I=1,N

where N is zero or negative at run time. In this case code that is moved out of the loop will be executed once, rather than not at all as would happen if this optimisation had not been made.

When the compiler option /OPTIMISE is used, the compiler performs code optimisation based on rearranging the order of execution of statements which constitute a program unit (see below). If /OPTIMISE is not used, the following optimisations are typical of those performed by default.

Constant 'folding' and conversion of Fortran type at compile time. Constant folding is the process of taking a statement such as:

A = I + 3 + 7

and producing code which is the same as for the statement:

A = I + 10

This might not appear to be of much use at first glance, since you might not think that you would write expressions with multiple constants in this way. However, consider the expression 2*PI*Rwhere PI is a parameter - the 2*PI part would be evaluated at compile time. In addition to this however, a number of situations arise for the implicit arithmetic which the compiler plants code for (chiefly array subscript calculation) where this technique results in considerable reduction in the amount of arithmetic done at run time.

Related to this is the conversion of the type of constants where appropriate. For example, the statement:

X = 4

is compiled as:

X = 4.0

thus the need for a type conversion at run time is obviated.

Elimination of common subexpressions within a statement. Again, this applies equally to expressions which form subscript calculations. Consider the following assignment:

A(I,J+K) = A(I,J+K) + 3

The code necessary to calculate the offset represented by (I,J+K) is only performed once.

The contents of registers are "remembered" between statements so that redundant load and store operations are avoided. For example, consider the following sequence of statements:

K = I + J
L = K * I

For the second statement, the compiler recognises that it has the value of K in a register, so it does not need to load K from store.

Note, however, that it will probably need to reference the value of I from memory, since the calculation of I + J will have resulted in the loss of the value of I from a register.

Even if there were some statements interspersed between the statements above, this optimisation could still take place, so long as:

the register in question was not used for another purpose in the interim, and
none of the interim statements were GOTOs, and
none of the executable statements were labelled (a good reason to dispense with unused labels in your code).

The compiler tries to avoid using registers which might contain something useful in a subsequent calculation.

A related technique is used for the coprocessor floating point registers, although due to the limited size of the hardware register stack, it is not possible to leave a value in a register just in case it might be useful. Instead, if a recently calculated floating point value proves to be useful for a subsequent calculation, the instruction which places the result in the corresponding memory location is converted from "store and pop" to "store and don't pop". The value is then available somewhere in the register stack for the subsequent calculation.

Note that this floating point register tracking is not performed when the /DEBUG compiler option is used or implied.

Full use is made of the instruction set. For example, an integer addition or subtraction of 1 is performed by the appropriate increment or decrement instruction. Also, some optimisations can be used to perform certain arithmetic operations a little quicker. For example, evaluation of I*5, where I is of integer type, can be performed with the instruction sequence:

MOV EAX%,I
LEA EAX%,[EAX%+EAX%*4]

which is faster than the corresponding integer multiply. Note however, that this optimisation is not done in CHECK mode, since any overflow would go undetected.

When the /OPTIMISE option is used, optimisations performed include the following:

Loop invariant motion. This means that any calculations which are constant with respect to the loop variable may be moved, so that they are performed once and for all on entry to the loop. This leads to the actual degradation in performance mentioned earlier, for the case where the loop is not executed at all. However, in most cases, particularly when the loop is executed a large number of times, considerable savings can result.
Loop induction weakening. This means that, instead of using multiples of the loop index, a constant is added to a pseudo variable each time round the loop. For example, consider the following loop:

DO I = 1, N
A(1,I) = 0
END DO

The offset into the array A will be a constant multiple of the loop variable I. The constant is related to the size of the first dimension of the array A. Induction weakening will replace multiplication by this constant to produce the array offset at each iteration of the loop by a faster addition of the constant at each iteration.

Elimination of common subexpressions across statements. This is often a consequence of the optimisations in (1) and (2) : expressions which are taken out of the loop as either loop invariant, or as candidates for induction weakening, can themselves be sub-parts of larger expressions.
In some loops, particularly useful quantities can be "locked" into registers. "Locking" means that, for the duration of a loop, the value of a program variable, or perhaps a derived quantity such as an offset into an array, is kept in a register, and is not stored into its associated store location (if indeed it has one) until exit from the loop.

Obviously, this requires that exit from the loop cannot be by means of a GOTO from within itself, and that no subroutine or function is called from within the loop, as these statements could destroy any value held in the register.

Also, there is some trade-off involved in tying up a register in this way, so generally locking will only occur for relatively short loops.

Optimisation of the loop in the example given in 2) above involves induction weakening and locking the array offset in a register.

Some additional optimisations based on the 80486 and Pentium instruction set. In some cases integer instructions are used instead of floating point instructions. This often results in different behaviour where the operands are invalid (for example where they should cause an overflow), but it is assumed that, if optimisation is being employed, problems such as this have been eliminated.
Many cases of a "dot-product" construction are spotted and replaced with faster code, for example:

DO I = 1, N
SUM = SUM + A(I)*B(I)
END DO

Many cases of redundant combinations of instructions are eliminated, for example, jumps to the next line, loads from a register to itself which sometimes are generated as a result of register locking (see 4 above).

The above list is not exhaustive, and new optimisations will be added during the course of compiler development.

Optimisation processes

Fortran Links

News