# Difference between revisions of "Case Compiler Optimization"

CuriousKit (talk | contribs) m (Some minor simplification) |
CuriousKit (talk | contribs) m (Some more mathematical explanation) |
||

Line 180: | Line 180: | ||

====Buffer Overrun Trap==== | ====Buffer Overrun Trap==== | ||

− | One drawback to this highly optimised binary search is that it doesn't trap the situation where the input key is greater than the largest key in the table, and if this happens, the loop may exit with %lo + 1 = %hi = list length, then "lea 1(%lo, %hi), %mid" sets %mid to a value which is out of bounds. To get around this, we recall that in normal operation, %lo and %hi have a difference of 2 if the flow reaches the end of the loop, and are equal if a match was found early, therefore their sum is a multiple 2, and so the "+1" part of the LEA instruction has no effect after dividing the sum by 2, so the simple fix is to replace the LEA instruction with "lea (%lo, %hi), %mid". In other words, for i386: | + | One drawback to this highly optimised binary search is that it doesn't trap the situation where the input key is greater than the largest key in the table, and if this happens, the loop may exit with %lo + 1 = %hi = list length, then "lea 1(%lo, %hi), %mid" sets %mid to a value which is out of bounds. To get around this, we recall that in normal operation, %lo and %hi have a difference of 2 if the flow reaches the end of the loop, and are equal if a match was found early, therefore being both odd or both even, their sum is a multiple 2, and so the "+1" part of the LEA instruction has no effect after dividing the sum by 2, so the simple fix is to replace the LEA instruction with "lea (%lo, %hi), %mid", which stops %mid from ever exceeding the upper bound without affecting normal behaviour. In other words, for i386: |

<syntaxhighlight lang="asm" highlight="1"> | <syntaxhighlight lang="asm" highlight="1"> | ||

lea (%lo,%hi), %mid | lea (%lo,%hi), %mid |

## Revision as of 20:33, 5 August 2018

An editor has declared this article to be a stub, meaning that it needs more information. Can you help out and add some? If you have some useful information, you can help the Free Pascal Wiki by clicking on the edit box on the left and expanding this page.

## Contents

## x86

### Binary Search Lookup

Main article: Binary search algorithm

#### Requirements

- 32 or more items.
- All branches call an identical function. Branches that have multiple case values or ranges are treated as individual branches for each single value.
- The range of case labels is large, which would make a direct index or hash lookup impractical.

##### Most suitable case

A case block that evaluates an ordinal value with many identical branches that share a common header design. Example below taken from **compiler\x86_64\aoptcpu.pas**, lines 68-119:

```
{ All subroutines share the form "function(var p: tai): Boolean;" - note that if
one of them is, say "function(const p: tai): Boolean;", this system won't work }
case taicpu(p).opcode of
A_AND:
Result:=OptPass1AND(p);
A_MOV:
Result:=OptPass1MOV(p);
A_MOVSX,
A_MOVZX:
Result:=OptPass1Movx(p);
A_VMOVAPS,
A_VMOVAPD,
A_VMOVUPS,
A_VMOVUPD:
result:=OptPass1VMOVAP(p);
A_MOVAPD,
A_MOVAPS,
A_MOVUPD,
A_MOVUPS:
result:=OptPass1MOVAP(p);
A_VDIVSD,
A_VDIVSS,
A_VSUBSD,
A_VSUBSS,
A_VMULSD,
A_VMULSS,
A_VADDSD,
A_VADDSS,
A_VANDPD,
A_VANDPS,
A_VORPD,
A_VORPS,
A_VXORPD,
A_VXORPS:
result:=OptPass1VOP(p);
A_MULSD,
A_MULSS,
A_ADDSD,
A_ADDSS:
result:=OptPass1OP(p);
A_VMOVSD,
A_VMOVSS,
A_MOVSD,
A_MOVSS:
result:=OptPass1MOVXX(p);
A_LEA:
result:=OptPass1LEA(p);
A_SUB:
result:=OptPass1Sub(p);
A_SHL,A_SAL:
result:=OptPass1SHLSAL(p);
A_SETcc:
result:=OptPass1SETcc(p);
end;
```

#### Table Design

Internally, each entry of the lookup table contains two ordinal fields equal to the machine word size, totalling 8 bytes for 32-bit and 16 bytes for 64-bit.

```
TCaseTableEntry = packed record
Key: NativeInt;
ProcAddr: CodePointer;
end;
TCaseTable = array[0..ListSize-1] of TCaseTableEntry;
```

The most important element with this table, as with all binary searches, is that it must be sorted by key in ascending order. Given such case tables are static during runtime, this can be done during compilation using any general-purpose sorting algorithm.

#### Iteration Count

A binary search divides the eligible list in half with each iteration, meaning its complexity grows with the base-2 logarithm of the list size -- that is, the algorithm is O(log n). While this gives us a rough guide as to the number of times the critical loop will run, it is not an exact answer, and **log2**(*n*) is only integral when *n* is a power of 2. However, if we analyse the workings of the binary search, we can deduce the iteration count:

If we take a list size of 8 and evaluate the midpoint (either one of the two middle values will suffice) and determine it is not a match, we can exclude this value and hence are left with one of two sublists to evaluate, one of size 3 and one of size 4. Given the search may take us to the larger of the two, we have to assume worst-case behaviour. Similarly, at size 4, the sublists are of size 1 and 2, and then at size 2, we are left with either a sublist of 1, a match, or a mismatch, since there is no second sublist. At size 1, we just check if this is the value we are looking for, and if not, then we can conclude that the value does not appear in the list.

This evaluates to 3 iterations and a final comparison with the 1-sized sublist. This agrees with the value of **log2**(8) = 3. This is the simplest case though where the list size is a power of two. If we go one higher to a list size of 9...

Because we can exclude the midpoint, both sublists have a size of 4, and following the worst-case process listed above, this also results in 3 iterations. If we instead reduce the size to 7...

Both sublists are now only size 3, and taking the midpoint of one of these sublists yields a pair of 1-sized sublists, thereby reducing the iteration count to just 2 plus the final comparison.

From this, we can deduce that the largest sublist size from a list of size n is equal to **floor**(*n*/2). Taken to the logical extreme where the sublist size is an odd number in each division stage, we find that only 3 iterations are required up to an initial list size of 15, since this divides down to 7, then 3, and then we are left with one of two 1-sized sublists that just need a comparison. Such number sequences are all 1s in binary, and this corresponds with values of the form 2^{a} - 1, one below the argument that causes **log2** to reach its next integral value. Therefore, we can conclude that the total iteration count is a remarkably simple equation:

**floor**(**log2**(*n*))

This is relatively fast to calculate, as it's equivalent to the position of the most significant 1-bit in the binary value of *n*.

#### Loop Code

Start with %lo = -1, %hi = number of case labels, %key = value to be searched (assumed to be a signed integer), %table being the address of the map that contains the pointers (which may have to be set beforehand due to the requirement of relocatable code on x86-64), and %mid and %temp undefined. The reason for "lea 2(%lo), %temp" is explained later.

```
@LoopBegin:
lea 1(%lo,%hi,1), %mid
shr %mid
cmp (%table,%mid,8), %key
cmovge %mid, %lo
cmovle %mid, %hi
lea 2(%lo), %temp
cmp %temp, %hi
jg @LoopBegin
```

With loop unrolling, this can be simplified further:

```
lea 1(%lo,%hi,1), %mid
shr %mid
cmp (%table,%mid,8), %key
je @CaseMatch
cmovge %mid, %lo
cmovle %mid, %hi
```

... with “@CaseMatch” appearing later. This block of code is repeated for each iteration. If there is concern that the conditional jump may incur a performance penalty, it can be removed without incident, especially during the last couple of iterations (an exact match will set %lo and %hi to the same value).

For 64-bit systems, the comparison is more complicated because each entry in the table is 16 bytes, and there is no facility to multiply by 16 in reference operands. Nevertheless, with some rearrangement of the commands to take advantage of a CPU core's multiple ports and the availability of additional registers, this can be resolved relatively painlessly:

```
lea 1(%lo,%hi,1), %index
lea 1(%lo,%hi,1), %mid
and $-2, %index ; Clear lsb
shr %mid
cmp (%table,%index,8), %key
je @CaseMatch
cmovge %mid, %lo
cmovle %mid, %hi
```

During the search, the value of %lo is one below the lowest valid index, while %hi is one above the highest valid index. If at any point that %lo + 1 ≥ %hi, the entire list has been searched. Due to how %mid is calculated, if execution passes through the unrolled loop without jumping out, %lo and %hi will have a difference of 2 and sit either side of the index of the last remaining entry in the list that hasn't been checked, hence one final lookup in the table against %key will confirm a match, with a jump to the case's "else" block (or the end of the block if one isn't present) if it's not.

For i386:

```
lea 1(%lo,%hi), %mid
shr %mid
cmp (%table,%mid,8), %key
jne @ElseBlock
@CaseMatch:
mov 4(%table,%mid,8), %procaddr
; Do common parameter set-up
call %procaddr
jmp @CaseBlockEnd
@ElseBlock:
; Else block code
...
@CaseBlockEnd:
...
```

For x86-64 (note the use of %index in the @CaseMatch setion):

```
lea 1(%lo,%hi), %index
and $-2, %index
cmp (%table,%index,8), %key
jne @ElseBlock
@CaseMatch:
mov 8(%table,%index,8), %procaddr
; Do common parameter set-up
call %procaddr
jmp @CaseBlockEnd
@ElseBlock:
; Else block code
...
@CaseBlockEnd:
...
```

#### Buffer Overrun Trap

One drawback to this highly optimised binary search is that it doesn't trap the situation where the input key is greater than the largest key in the table, and if this happens, the loop may exit with %lo + 1 = %hi = list length, then "lea 1(%lo, %hi), %mid" sets %mid to a value which is out of bounds. To get around this, we recall that in normal operation, %lo and %hi have a difference of 2 if the flow reaches the end of the loop, and are equal if a match was found early, therefore being both odd or both even, their sum is a multiple 2, and so the "+1" part of the LEA instruction has no effect after dividing the sum by 2, so the simple fix is to replace the LEA instruction with "lea (%lo, %hi), %mid", which stops %mid from ever exceeding the upper bound without affecting normal behaviour. In other words, for i386:

```
lea (%lo,%hi), %mid
shr %mid
cmp (%table,%mid,8), %key
jne @ElseBlock
@CaseMatch:
mov 4(%table,%mid,8), %procaddr
; Do common parameter set-up
call %procaddr
jmp @CaseBlockEnd
@ElseBlock:
; Else block code
...
@CaseBlockEnd:
...
```

For x86-64:

```
lea (%lo,%hi), %index
and $-2, %index
cmp (%table,%index,8), %key
jne @ElseBlock
@CaseMatch:
mov 8(%table,%index,8), %procaddr
; Do common parameter set-up
call %procaddr
jmp @CaseBlockEnd
@ElseBlock:
; Else block code
...
@CaseBlockEnd:
...
```

#### Putting it all together

Now that the registers and exceptional cases are explained, we can put the entire thing together.

For i386:

```
@LoopBegin:
lea 1(%lo,%hi,1), %mid
shr %mid
cmp (%table,%mid,8), %key
cmovge %mid, %lo
cmovle %mid, %hi
lea 2(%lo), %temp
cmp %temp, %hi
jg @LoopBegin
; Unroll the above if desired
lea (%lo,%hi), %mid
shr %mid
cmp (%table,%mid,8), %key
jne @ElseBlock
@CaseMatch:
mov 4(%table,%mid,8), %procaddr
; Do common parameter set-up
call %procaddr
jmp @CaseBlockEnd
@ElseBlock:
; Else block code
...
@CaseBlockEnd:
...
```

For x86-64:

```
@LoopBegin:
lea 1(%lo,%hi,1), %index
lea 1(%lo,%hi,1), %mid
and $-2, %index ; Clear lsb
shr %mid
cmp (%table,%index,8), %key
je @CaseMatch
cmovge %mid, %lo
cmovle %mid, %hi
lea 2(%lo), %temp
cmp %temp, %hi
jg @LoopBegin
; Unroll the above if desired
lea (%lo,%hi), %index
and $-2, %index
cmp (%table,%index,8), %key
jne @ElseBlock
@CaseMatch:
mov 8(%table,%index,8), %procaddr
; Do common parameter set-up
call %procaddr
jmp @CaseBlockEnd
@ElseBlock:
; Else block code
...
@CaseBlockEnd:
...
```