Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA.





                          INTEL P6 MPN SUBROUTINES



This directory contains code optimized for Intel P6 class CPUs, meaning
PentiumPro, Pentium II and Pentium III.  The mmx and p3mmx subdirectories
have routines using MMX instructions.



STATUS

Times for the loops, with all code and data in L1 cache, are as follows.
Some of these might be able to be improved.

                                cycles/limb

        mpn_add_n/sub_n            3.7

        mpn_copyi                  0.75
        mpn_copyd                  1.75  (or 0.75 if no overlap)

        mpn_divrem_1              39.0
        mpn_mod_1                 21.5
        mpn_divexact_by3           8.5

        mpn_mul_1                  5.5
        mpn_addmul/submul_1        6.35

        mpn_l/rshift               2.5

        mpn_mul_basecase           8.2   cycles/crossproduct (approx)
        mpn_sqr_basecase           4.0   cycles/crossproduct (approx)
                                or 7.75  cycles/triangleproduct (approx)

Pentium II and III have MMX and get the following improvements.

        mpn_divrem_1              25.0   integer part, 17.5 fractional part

        mpn_l/rshift               1.75



NOTES

The L1 data cache is write-allocate, so prefetching of destinations is
unnecessary.

Mispredicted branches have a penalty of between 9 and 15 cycles, and even up
to 26 cycles depending on how far speculative execution has gone.  The 9
cycle minimum penalty comes from the issue pipeline being 9 stages long.

A copy with rep movs seems to copy 16 bytes at a time, since speeds for 4,
5, 6 or 7 limb operations are all the same.  The 0.75 cycles/limb would be 3
cycles per 16 byte block.



CODING

Instructions in general code are shown grouped when they can execute
together, which means up to three instructions with no successive
dependencies, of which only the first may be a multiple micro-op
instruction.

P6 has out-of-order execution, so the groupings are really only showing
dependent paths where some shuffling might allow some latencies to be
hidden.



REFERENCES

"Intel Architecture Optimization Reference Manual", 1999, revision 001 dated
02/99, order number 245127 (order number 730795-001 is in the document too).
Available on-line:

        http://download.intel.com/design/PentiumII/manuals/245127.htm

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is an older document mostly about P5 and not as good as the above.
Available on-line:

        http://download.intel.com/design/PentiumII/manuals/242816.htm



----------------
Local variables:
mode: text
fill-column: 76
End: