Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA.




                      AMD K7 MPN SUBROUTINES


This directory contains code optimized for the AMD Athlon CPU.

The mmx subdirectory has routines using MMX instructions.  All Athlons have
MMX; the separate directory is just so that configure can omit it if the
assembler doesn't support MMX.



STATUS

Times for the loops, with all code and data in L1 cache.

                               cycles/limb
        mpn_add/sub_n             1.6

        mpn_copyi                 0.75 or 1.0   \ varying with data alignment
        mpn_copyd                 0.75 or 1.0   /

        mpn_divrem_1             17.0 integer part, 15.0 fractional part
        mpn_mod_1                17.0
        mpn_divexact_by3          8.0

        mpn_l/rshift              1.2

        mpn_mul_1                 3.4
        mpn_addmul/submul_1       3.9

        mpn_mul_basecase          4.42 cycles/crossproduct (approx)
        mpn_sqr_basecase          2.3 cycles/crossproduct (approx)
                               or 4.55 cycles/triangleproduct (approx)

Prefetching of sources hasn't yet been tried.
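
As a point of reference for the cycles/limb figures, the following is a
minimal portable C sketch of the work an mpn_add_n style loop does for each
limb.  It is not the K7 assembler.  The names ref_add_n and limb_t are
hypothetical, and a 32-bit limb and a size of at least one limb are
assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on 32-bit x86 */

/* Hypothetical reference routine: {rp,n} = {ap,n} + {bp,n}, with the
   least significant limb first.  Returns the carry out, 0 or 1.  */
limb_t
ref_add_n (limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
  limb_t carry = 0;
  size_t i;

  assert (n >= 1);               /* assumption: sizes are non-zero */
  for (i = 0; i < n; i++)
    {
      limb_t a = ap[i];
      limb_t s = a + bp[i];      /* may wrap modulo 2^32 */
      limb_t c1 = s < a;         /* carry out of a + b */
      limb_t r = s + carry;
      limb_t c2 = r < s;         /* carry out of adding the carry-in */
      rp[i] = r;
      carry = c1 | c2;           /* at most one of c1, c2 can be set */
    }
  return carry;
}
```

At the 1.6 cycles/limb shown for mpn_add/sub_n, a 1000 limb addition would
spend roughly 1600 cycles in the loop, ignoring call and setup overhead.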



NOTES

cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.

Write-allocate L1 data cache means prefetching of destinations is unnecessary.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.

Unsigned "mul"s can be issued every 3 cycles.  This suggests 3 is a limit on
the speed of the multiplication routines.  The documentation shows mul
executing in IEU0 (or maybe in IEU0 and IEU1 together), so it might be that,
to get near 3 cycles, code has to be arranged so that nothing else is issued
to IEU0.  A busy IEU0 could explain why some code takes 4 cycles and other
apparently equivalent code takes 5.
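
The reason the mul issue rate bounds the multiplication routines can be seen
from a portable C sketch of an mpn_mul_1 style loop: every limb needs exactly
one widening multiply, so a loop of this shape cannot run faster than one
limb per mul issued.  This is only an illustration, not the K7 code.  The
names ref_mul_1 and limb_t are hypothetical and 32-bit limbs are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on 32-bit x86 */

/* Hypothetical reference routine: {rp,n} = {ap,n} * b, with the least
   significant limb first.  Returns the high limb carried out.  */
limb_t
ref_mul_1 (limb_t *rp, const limb_t *ap, size_t n, limb_t b)
{
  limb_t carry = 0;
  size_t i;

  assert (n >= 1);               /* assumption: sizes are non-zero */
  for (i = 0; i < n; i++)
    {
      /* One 32x32->64 multiply per limb.  The sum cannot overflow, since
         (2^32-1)^2 + (2^32-1) is less than 2^64.  */
      uint64_t p = (uint64_t) ap[i] * b + carry;
      rp[i] = (limb_t) p;           /* low word is the result limb */
      carry = (limb_t) (p >> 32);   /* high word feeds the next limb */
    }
  return carry;
}
```

With one mul issued every 3 cycles, such a loop is bounded below by 3
cycles/limb, which is consistent with the 3.4 measured for mpn_mul_1.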



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines and up to 64 for some.
The K7 has 64k L1 code cache so quite big unrolling is allowable.

Computed jumps into the unrolling are used to handle sizes not a multiple of
the unrolling.  An attractive feature of this is that times increase
smoothly with operand size, but it may be that some routines should just
have simple loops to finish up, especially when PIC adds between 2 and 16
cycles to get %eip.
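
The nearest portable C analogue of a computed jump into an unrolled loop is
a switch into the loop body (Duff's device).  The sketch below uses it for
an incrementing copy.  It is an illustration only, not the assembler used
here: ref_copyi_unrolled and limb_t are hypothetical names, and the
unrolling of 4 is chosen for brevity rather than the 32 or 64 described
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on 32-bit x86 */

/* Hypothetical sketch: copy {ap,n} to {rp,n}, low to high, unrolled by 4.
   The switch plays the part of the computed jump, entering the unrolled
   body part way through when n is not a multiple of 4.  */
void
ref_copyi_unrolled (limb_t *rp, const limb_t *ap, size_t n)
{
  size_t passes;

  assert (n >= 1);               /* assumption: sizes are non-zero */
  passes = (n + 3) / 4;          /* trips round the loop, first one partial */

  switch (n % 4)
    {
    case 0: do { *rp++ = *ap++;
    case 3:      *rp++ = *ap++;
    case 2:      *rp++ = *ap++;
    case 1:      *rp++ = *ap++;
            } while (--passes > 0);
    }
}
```

There is no separate clean-up loop, so each extra limb costs one more copy
and the time grows smoothly with n.  The price is the entry point
calculation, which is where PIC adds its overhead in the assembler.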
Position independent code is implemented using a call to get %eip for the
computed jumps.  A ret is always done, rather than discarding the return
address with an addl $4,%esp or a popl, so the CPU return address branch
prediction stack stays synchronised with the actual stack in memory.

Branch prediction, in the absence of any history, will guess forward jumps
are not taken and backward jumps are taken.  Where possible it's arranged
that the less likely or less important case is under a taken forward jump.
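
In C the same layout preference can only be hinted at.  The sketch below
assumes GCC or a compatible compiler, where __builtin_expect marks a
condition as unlikely so the compiler can keep the common path as the
fall-through.  Whether it actually does so depends on the compiler and
options.  UNLIKELY, ref_incr and limb_t are hypothetical names for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC-compatible compiler.  Elsewhere the hint is dropped.  */
#if defined (__GNUC__)
#define UNLIKELY(cond)  __builtin_expect ((cond) != 0, 0)
#else
#define UNLIKELY(cond)  (cond)
#endif

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on 32-bit x86 */

/* Hypothetical sketch: add 1 to {p,n} in place, returning the carry out.
   A carry out of the low limb is rare for typical data, so that case is
   marked unlikely and the common case falls straight through.  */
limb_t
ref_incr (limb_t *p, size_t n)
{
  size_t i;

  assert (n >= 1);               /* assumption: sizes are non-zero */
  p[0]++;
  if (UNLIKELY (p[0] == 0))
    {
      /* Rare path: propagate the carry through the higher limbs.  */
      for (i = 1; i < n; i++)
        if (++p[i] != 0)
          return 0;
      return 1;
    }
  return 0;
}
```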



CODING

Instructions in general code have been shown grouped if they can execute
together, which means up to three direct-path instructions which have no
successive dependencies.  K7 always decodes three and has out-of-order
execution, but the groupings show what slots might be available and what
dependency chains exist.

When there are vector-path instructions an effort is made to get triplets of
direct-path instructions in between them, even if there are dependencies,
since this maximizes decoding throughput and might save a cycle or two if
decoding is the limiting factor.



INSTRUCTIONS

adcl       direct
divl       39 cycles back-to-back
lodsl,etc  vector
loop       1 cycle vector (decl/jnz opens up one decode slot)
movd reg   vector
movd mem   direct
mull       issue every 3 cycles, latency 4 cycles low word, 6 cycles high word
popl       vector (use movl for more than one pop)
pushl      direct, will pair with a load
shrdl %cl  vector, 3 cycles, seems to be 3 decode too
xorl r,r   false read dependency recognised



REFERENCES

"AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
22007, revision K, February 2002.  Available on-line,

        http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions.  Available on-line,

        http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
publication number 22466, revision D, March 2000.  This describes
instructions added in the Athlon processor, such as pswapd and the extra
prefetch forms.  Available on-line,

        http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22466.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general Athlon optimizations as well as
3DNow.  Available on-line,

        http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf




----------------
Local variables:
mode: text
fill-column: 76
End: