Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB. If not, write to
the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA.



                        AMD K6 MPN SUBROUTINES


This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
K6-3.

The mmx subdirectory has MMX code suiting plain K6, and the k62mmx
subdirectory has MMX code suiting K6-2 and K6-3. All chips in the K6 family
have MMX; the separate directories exist only so that ./configure can omit
them if the assembler doesn't support MMX.



STATUS

Times for the loops, with all code and data in L1 cache, are as follows.

                                 cycles/limb

        mpn_add_n/sub_n            3.25 normal, 2.75 in-place

        mpn_mul_1                  6.25
        mpn_add/submul_1           7.65-8.4 (varying with data values)

        mpn_mul_basecase           9.25 cycles/crossproduct (approx)
        mpn_sqr_basecase           4.7  cycles/crossproduct (approx)
                                or 9.2  cycles/triangleproduct (approx)

        mpn_l/rshift               3.0

        mpn_divrem_1              20.0
        mpn_mod_1                 20.0
        mpn_divexact_by3          11.0

        mpn_copyi                  1.0
        mpn_copyd                  1.0


K6-2 and K6-3 have dual-issue MMX and get the following improvements.

        mpn_l/rshift               1.75


Prefetching of sources hasn't yet given any joy. With the 3DNow "prefetch"
instruction, code seems to run slower, and with just "mov" loads it doesn't
seem faster. Results so far are inconsistent. The K6 does a hardware
prefetch of the second cache line in a sector, so the penalty for not
prefetching in software is reduced.



NOTES

All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.

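As a rough illustration of how a program could tell these cases apart at
run time, the following C sketch decodes the relevant cpuid feature bits.
It assumes the caller has already executed cpuid and passes in the EDX
values from standard function 1 and extended function 0x80000001. The
struct and function names are invented for this example and are not part
of GMP.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this example, not part of GMP. */
struct k6_features {
    int has_mmx;
    int has_3dnow;
};

/* std_edx: EDX returned by cpuid function 1.
   ext_edx: EDX returned by cpuid function 0x80000001, or 0 if the
            caller found that extended function to be unsupported. */
static struct k6_features k6_decode_features(uint32_t std_edx,
                                             uint32_t ext_edx)
{
    struct k6_features f;

    f.has_mmx = (int) ((std_edx >> 23) & 1);    /* MMX: EDX bit 23 */
    f.has_3dnow = (int) ((ext_edx >> 31) & 1);  /* 3DNow: EDX bit 31 */
    return f;
}
```

A plain K6 would be expected to report has_mmx set and has_3dnow clear,
while K6-2 and K6-3 would report both set.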

Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
execute them in both X and Y (and in both together).

Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
chapter 6 table 12).

Write-allocate L1 data cache means prefetching of destinations is unnecessary.
Store queue is 7 entries of 64 bits each.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead. The unrolling is
configurable up to 32 limbs/loop for most routines, up to 64 for some.

Sometimes computed jumps into the unrolling are used to handle sizes not a
multiple of the unrolling. An attractive feature of this is that times
smoothly increase with operand size, but an indirect jump is about 6 cycles
and the setups about another 6, so whether a computed jump ought to be used
depends on how much faster the unrolled code is than a simple loop.

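The idea of jumping into the middle of an unrolled loop can be sketched in
C with a switch statement (Duff's device). This is only an analogy for the
assembly computed jumps described above, not the code GMP uses; the
function names, the 4-limb unrolling and the 32-bit limb type are
assumptions made for the example. The second function works through the
break-even arithmetic, with the jump and setup overhead (about 12 cycles
per the paragraph above) and the per-limb costs supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumed 32-bit limb, as on x86 */

/* Copy n limbs with a 4-way unrolled loop.  The switch enters the loop
   body part way through, so a size that is not a multiple of 4 needs
   no separate cleanup loop. */
static void copy_unrolled(limb_t *dst, const limb_t *src, size_t n)
{
    size_t passes;

    if (n == 0)
        return;
    passes = (n + 3) / 4;   /* loop passes, counting the partial one */

    switch (n % 4) {
    case 0: do { *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}

/* Smallest size in limbs at which the unrolled loop, including the
   fixed overhead, is no slower than the simple loop.  Costs are in
   cycles/limb.  Returns 0 if the unrolled loop never catches up. */
static size_t breakeven_limbs(double simple_cpl, double unrolled_cpl,
                              double overhead)
{
    size_t n;

    assert(overhead >= 0.0);
    if (unrolled_cpl >= simple_cpl)
        return 0;
    for (n = 1; n * simple_cpl < n * unrolled_cpl + overhead; n++)
        ;
    return n;
}
```

With made-up figures of 3.0 cycles/limb for a simple loop and 2.0 for the
unrolled one, breakeven_limbs(3.0, 2.0, 12.0) gives 12, so below about 12
limbs the simple loop would be the better choice.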

Position independent code is implemented using a call to get eip for
computed jumps, and a ret is always done, rather than an addl $4,%esp or a
popl, so the CPU return address branch prediction stack stays synchronised
with the actual stack in memory. Such a call however still costs 4 to 7
cycles.

Branch prediction, in absence of any history, will guess forward jumps are
not taken and backward jumps are taken. Where possible it's arranged that
the less likely or less important case is under a taken forward jump.



MMX

Putting emms or femms as late as possible in a routine seems to be fastest.
Perhaps an emms or femms stalls until all outstanding MMX instructions have
completed, so putting it later gives them a chance to complete on their own,
in parallel with other operations (like register popping).

The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
at the start of a routine, in case it's been preceded by x87 floating point
operations. This isn't done because in gmp programs it's expected that x87
floating point won't be much used and that chances are an mpn routine won't
have been preceded by any x87 code.



CODING

Instructions in general code are shown paired if they can decode and execute
together, meaning two short decode instructions with the second not
depending on the first, only the first using the shifter, no more than one
load, and no more than one store.

K6 does some out of order execution, so the pairings aren't essential; they
just show what slots might be available. When decoding is the limiting
factor things can be scheduled that might not execute until later.



NOTES

Code alignment

- if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
  short decode is inhibited. The cross.pl script detects this.

- loops and branch targets should be aligned to 16 bytes, or ensure at least
  2 instructions before a 32 byte boundary. This makes use of the 16 byte
  cache in the BTB.

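The first condition is simple address arithmetic. The C sketch below shows
one way such a check could be expressed; it is not a transcription of
cross.pl, and the function name and the 32 byte line size used in the
example afterwards are assumptions for illustration. The caller supplies
the offset of the first byte and the number of leading bytes that matter
(2 for opcode/modrm, 3 for 0Fh/opcode/modrm).

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if the len bytes starting at offset straddle a boundary of
   line_size bytes, 0 otherwise.  line_size must be a power of 2. */
static int crosses_line(uint32_t offset, uint32_t len, uint32_t line_size)
{
    assert(len > 0);
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);

    /* The span crosses a boundary exactly when its first and last
       bytes fall in different lines. */
    return (offset / line_size) != ((offset + len - 1) / line_size);
}
```

For example, assuming 32 byte lines, a two byte opcode/modrm at offset 31
straddles the boundary, while the same pair at offset 30 does not.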

Addressing modes

- (%esi) degrades decoding from short to vector. 0(%esi) doesn't have this
  problem, and can be used as an equivalent, or easier is just to use a
  different register, like %ebx.

- K6 and pre-CXT core K6-2 have the following problem. (K6-2 CXT and K6-3
  have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).

  If more than 3 bytes are needed to determine instruction length then
  decoding degrades from direct to long, or from long to vector. This
  happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
  with mod=00 the sib determines whether there's a displacement.

  This affects all MMX and 3DNow instructions, and others with an 0F prefix,
  like movzbl. The modes affected are anything with an index and no
  displacement, or an index but no base, and this includes (%esp) which is
  really (,%esp,1).

  The cross.pl script detects problem cases. The workaround is to always
  use a displacement, and to do this with Zdisp if it's zero so the
  assembler doesn't discard it.

  See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
  13-14 and 36-37.

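The mod/rm pattern in question can be tested with a couple of bit
operations. The C sketch below is an illustration written for this note,
not code taken from cross.pl, and the function name is made up. It reports
whether a mod/rm byte has mod=00 and r/m=100, the case where the sib byte
has to be examined to learn whether a displacement follows.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if modrm has the form 00-xxx-100, meaning mod=00 with a sib
   byte following, 0 otherwise.  The middle three bits are ignored. */
static int modrm_is_00_xxx_100(uint8_t modrm)
{
    unsigned mod = (modrm >> 6) & 3;   /* top two bits */
    unsigned rm = modrm & 7;           /* bottom three bits */

    return mod == 0 && rm == 4;
}
```

For instance 0x04 (00-000-100) matches, whereas 0x44 (01-000-100), which
carries an 8-bit displacement, does not. That is the effect the
always-use-a-displacement workaround relies on.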

Calls

- indirect jumps and calls are not branch predicted; they measure about 6
  cycles.

Various

- adcl      2 cycles of decode, maybe 2 cycles executing in the X pipe
- bsf       12-27 cycles
- emms      5 cycles
- femms     3 cycles
- jecxz     2 cycles taken, 13 not taken (optimization manual says 7 not taken)
- divl      20 cycles back-to-back
- imull     2 decode, 3 execute
- mull      2 decode, 3 execute (optimization manual decoding sample)
- prefetch  2 cycles
- rcll/rcrl implicit by one bit: 2 cycles
            immediate or %cl count: 11 + 2 per bit for dword
                                    13 + 4 per bit for byte
- setCC     2 cycles
- xchgl     %eax,reg  1.5 cycles, back-to-back (strange)
            reg,reg   2 cycles, back-to-back

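The rcll/rcrl figures amount to a small cost formula, worked through in
the C sketch below. The function name and parameters are invented for the
example, and the numbers are simply the measurements listed above, so the
result should be treated as an estimate only.

```c
#include <assert.h>

/* Estimated cycles for one rcl/rcr, using the figures listed above.
   count:        rotate count in bits, at least 1
   explicit_cnt: nonzero for the immediate or %cl count form, zero for
                 the implicit rotate-by-one form
   is_byte:      nonzero for a byte operand, zero for a dword operand;
                 ignored for the implicit form, for which the list
                 gives a single figure */
static int rc_cycles(int count, int explicit_cnt, int is_byte)
{
    assert(count >= 1);

    if (!explicit_cnt) {
        assert(count == 1);   /* implicit form rotates by one bit only */
        return 2;
    }
    return is_byte ? 13 + 4 * count : 11 + 2 * count;
}
```

On these figures a one bit dword rotate with an explicit count comes to
11 + 2 = 13 cycles against 2 for the implicit form, which suggests
preferring the implicit encoding when rotating by one.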


REFERENCES

"AMD-K6 Processor Code Optimization Application Note", AMD publication
number 21924, revision D amendment 0, January 2000. This describes K6-2 and
K6-3. Available on-line,

http://vip.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21924.pdf

"AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
publication number 21828, revision A amendment 0, August 1997. This is an
older edition of the above document, describing plain K6. Available
on-line,

http://vip.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21828.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions, but nothing else from
3DNow has been used. Available on-line,

http://vip.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999. This has some notes on general K6 optimizations as well as
3DNow. Available on-line,

http://vip.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf



----------------
Local variables:
mode: text
fill-column: 76
End: