Copyright 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA.



                      INTEL PENTIUM-4 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium-4.

The mmx subdirectory has routines using MMX instructions; the sse2
subdirectory has routines using SSE2 instructions.  All P4s have both; the
separate directories exist only so that configure can omit that code if the
assembler doesn't support it.


STATUS

                                cycles/limb

        mpn_add_n/sub_n            4 normal, 6 in-place

        mpn_mul_1                  4 normal, 6 in-place
        mpn_addmul_1               6
        mpn_submul_1               7

        mpn_mul_basecase           6 cycles/crossproduct (approx)

        mpn_sqr_basecase           3.5 cycles/crossproduct (approx)
                                   or 7.0 cycles/triangleproduct (approx)

        mpn_l/rshift               1.75


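The two mpn_sqr_basecase figures describe the same speed counted two ways.
The sketch below turns the approximate figures into rough cycle estimates.
It assumes that "crossproducts" means all n*n limb products of a schoolbook
multiply and that "triangleproducts" means about half of them; neither
definition is given in this file, and the function names are hypothetical.

```c
#include <assert.h>

/* Assumed definition: every limb of one n-limb operand times every limb
   of the other, as in a schoolbook multiply. */
static double crossproducts(unsigned long n)
{
    assert(n >= 1);
    return (double)n * (double)n;
}

/* Assumed definition: a square needs each off-diagonal product only once,
   which is roughly half of the crossproducts. */
static double triangleproducts(unsigned long n)
{
    return crossproducts(n) / 2.0;
}

/* Estimated cycles for an n x n limb mpn_mul_basecase. */
static double mul_basecase_cycles(unsigned long n)
{
    return 6.0 * crossproducts(n);
}

/* Estimated cycles for an n-limb mpn_sqr_basecase, counted both ways. */
static double sqr_basecase_cycles(unsigned long n)
{
    return 3.5 * crossproducts(n);
}

static double sqr_basecase_cycles_tri(unsigned long n)
{
    return 7.0 * triangleproducts(n);
}
```

Under these assumptions both squaring estimates agree for any n, and a
square costs a little under 60% of a same-sized multiply (3.5 against 6).
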
The shifts ought to be able to go at 1.5 cycles/limb, but not much effort
has been applied to them yet.

In-place operations, and all addmul, submul, mul_basecase and sqr_basecase
calls, suffer from pipeline anomalies associated with write combining and
movd reads and writes to the same or nearby locations.  The movq
instructions do not trigger the same hardware problems.  Unfortunately,
using movq and splitting/combining seems to require too many extra
instructions to help.  Perhaps future chip steppings will be better.



NOTES

The Pentium-4 pipeline, "Netburst", provides for quite a number of
surprises.  Many traditional x86 instructions run very slowly, requiring
use of alternative instructions for acceptable performance.

adcl and sbbl are quite slow at 8 cycles for reg->reg.  paddq on 32-bit
values held in 64-bit mmx registers seems better, though the paddq/psrlq
combination used to propagate a carry still has a 4 cycle latency.

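The idea of adding 32-bit limbs in a 64-bit register and shifting the carry
down can be modelled in portable C.  This is only a sketch of the
arithmetic, not the assembly in this directory: the 64-bit accumulator
stands in for an mmx register, and add_n_model is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two n-limb numbers with 32-bit limbs, least significant limb first.
   Returns the carry out (0 or 1).  Hypothetical helper, not GMP code. */
static uint32_t add_n_model(uint32_t *rp, const uint32_t *up,
                            const uint32_t *vp, size_t n)
{
    uint64_t acc = 0;   /* holds the carry from the previous limb */

    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        acc += (uint64_t)up[i];   /* models paddq */
        acc += (uint64_t)vp[i];   /* models paddq */
        rp[i] = (uint32_t)acc;    /* low 32 bits are the result limb */
        acc >>= 32;               /* models psrlq $32: only the carry stays */
    }
    return (uint32_t)acc;
}
```

The add and the shift form the loop-carried dependency, which is where the
4 cycle latency mentioned above would apply.
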
incl and decl should be avoided; use add $1 and sub $1 instead.  Apparently
the carry flag is not separately renamed, so incl and decl depend on all
previous flags-setting instructions.

shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest
integer instructions (addl, subl, orl, andl, and some more).  shldl and
shrdl seem to have 13 and 15 cycles latency, respectively.  Bizarre.

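Given those latencies, a multi-limb shift can be built from two single
shifts and an OR per limb rather than a double shift.  The C sketch below
shows only that arithmetic; it is an assumption about one possible
approach, the routines in this directory may well use mmx shifts instead,
and lshift_model is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift an n-limb number (32-bit limbs, least significant first) left by
   cnt bits, 1 <= cnt <= 31.  Returns the bits shifted out of the top limb.
   Works from the high end down, so rp == up is allowed. */
static uint32_t lshift_model(uint32_t *rp, const uint32_t *up,
                             size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 31);

    unsigned tnc = 32 - cnt;          /* complementary shift count */
    uint32_t high = up[n - 1];
    uint32_t out = high >> tnc;       /* bits leaving the top limb */

    for (size_t i = n - 1; i > 0; i--) {
        uint32_t low = up[i - 1];
        /* Two single shifts and an OR in place of one double shift. */
        rp[i] = (high << cnt) | (low >> tnc);
        high = low;
    }
    rp[0] = high << cnt;
    return out;
}
```
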
movq mmx -> mmx does have 6 cycle latency, as noted in the documentation.
A pxor/por or similar combination at 2 cycles latency can be used instead.
The movq, however, executes in the float unit, thereby saving MMX execution
resources.  With the right juggling, data moves shouldn't be on a dependent
chain.

L1 is write-through, but the write combining appears to do enough that
explicit destination prefetching is not required.

No use has been found for the xmm registers so far, but not much effort has
been expended.  A configure test for whether the operating system knows
fxsave/fxrstor will be needed if they're used.



REFERENCES

Intel Pentium-4 processor manuals,

        http://developer.intel.com/design/pentium4/manuals

"Intel Pentium 4 Processor Optimization Reference Manual", Intel, 2001,
order number 248966.  Available on-line:

        http://developer.intel.com/design/pentium4/manuals/248966.htm



----------------
Local variables:
mode: text
fill-column: 76
End: