2012-11-25 17:13:44 -05:00
|
|
|
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
|
|
|
|
<html>
|
|
|
|
|
<head>
|
|
|
|
|
<title>GMP Development Projects</title>
|
|
|
|
|
<link rel="shortcut icon" href="favicon.ico">
|
|
|
|
|
<link rel="stylesheet" href="gmp.css">
|
|
|
|
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
|
|
|
|
</head>
|
|
|
|
|
|
|
|
|
|
<center>
|
|
|
|
|
<h1>
|
|
|
|
|
GMP Development Projects
|
|
|
|
|
</h1>
|
|
|
|
|
</center>
|
|
|
|
|
|
|
|
|
|
<font size=-1>
|
|
|
|
|
Copyright 2000, 2001, 2002, 2003, 2004, 2005, 2006 Free Software Foundation,
|
|
|
|
|
Inc. <br><br>
|
|
|
|
|
Copyright 2008 William Hart. <br><br>
|
|
|
|
|
This file is part of the MPIR Library. <br><br>
|
|
|
|
|
The MPIR Library is free software; you can redistribute it and/or modify
|
|
|
|
|
it under the terms of the GNU Lesser General Public License as published
|
|
|
|
|
by the Free Software Foundation; either version 2.1 of the License, or (at
|
|
|
|
|
your option) any later version. <br><br>
|
|
|
|
|
The MPIR Library is distributed in the hope that it will be useful, but
|
|
|
|
|
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
|
|
|
|
|
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
|
|
|
|
|
License for more details. <br><br>
|
|
|
|
|
You should have received a copy of the GNU Lesser General Public License
|
|
|
|
|
along with the MPIR Library; see the file COPYING.LIB. If not, write to
|
|
|
|
|
the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
|
|
|
|
|
MA 02110-1301, USA.
|
|
|
|
|
</font>
|
|
|
|
|
|
|
|
|
|
<hr>
|
|
|
|
|
<!-- NB. timestamp updated automatically by emacs -->
|
|
|
|
|
This file current as of 21 Apr 2006. Please send comments
|
|
|
|
|
to http://groups.google.co.uk/group/mpir-devel/.
|
|
|
|
|
|
|
|
|
|
<p> This file lists projects suitable for volunteers. Please see the
|
|
|
|
|
<a href="tasks.html">tasks file</a> for smaller tasks.
|
|
|
|
|
|
|
|
|
|
<p> If you want to work on any of the projects below, please let us know at
|
|
|
|
|
http://groups.google.co.uk/group/mpir-devel/. If you want to help with a project
|
|
|
|
|
that already somebody else is working on, we will help you will get in touch.
|
|
|
|
|
(There are no email addresses of
|
|
|
|
|
volunteers below, due to spamming problems.)
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
<li> <strong>Faster multiplication</strong>
|
|
|
|
|
|
|
|
|
|
<p> The current multiplication code uses Karatsuba, 3-way Toom, and Fermat
|
|
|
|
|
FFT. Several new developments are desirable:
|
|
|
|
|
|
|
|
|
|
<ol>
|
|
|
|
|
|
|
|
|
|
<li> Handle multiplication of operands with different digit count better
|
|
|
|
|
than today. We now split the operands in a very inefficient way, see
|
|
|
|
|
mpn/generic/mul.c. The best operands splitting strategy depends on
|
|
|
|
|
the underlying multiplication algorithm to be used.
|
|
|
|
|
|
|
|
|
|
<li> Implement an FFT variant computing the coefficients mod m different
|
|
|
|
|
limb size primes of the form l*2^k+1. i.e., compute m separate FFTs.
|
|
|
|
|
The wanted coefficients will at the end be found by lifting with CRT
|
|
|
|
|
(Chinese Remainder Theorem). If we let m = 3, i.e., use 3 primes, we
|
|
|
|
|
can split the operands into coefficients at limb boundaries, and if
|
|
|
|
|
our machine uses b-bit limbs, we can multiply numbers with close to
|
|
|
|
|
2^b limbs without coefficient overflow. For smaller multiplication,
|
|
|
|
|
we might perhaps let m = 1, and instead of splitting our operands at
|
|
|
|
|
limb boundaries, split them in much smaller pieces. We might also use
|
|
|
|
|
4 or more primes, and split operands into bigger than b-bit chunks.
|
|
|
|
|
By using more primes, the gain in shorter transform length, but lose
|
|
|
|
|
in having to do more FFTs, but that is a slight total save. We then
|
|
|
|
|
lose in more expensive CRT. <br><br>
|
|
|
|
|
|
|
|
|
|
An nearly complete implementation has been done by Tommy F<>rnqvist.
|
|
|
|
|
|
|
|
|
|
<li> Perhaps consider N-way Toom, N > 3. See Knuth's Seminumerical
|
|
|
|
|
Algorithms for details on the method. Code implementing it exists.
|
|
|
|
|
This is asymptotically inferior to FFTs, but is finer grained. A
|
|
|
|
|
Toom-4 might fit in between Toom-3 and the FFTs (or it might not).
|
|
|
|
|
|
|
|
|
|
<li> Add support for partial products, either a given number of low limbs
|
|
|
|
|
or high limbs of the result. A high partial product can be used by
|
|
|
|
|
<code>mpf_mul</code> and by Newton approximations, a low half partial
|
|
|
|
|
product might be of use in a future sub-quadratic REDC. On small
|
|
|
|
|
sizes a partial product will be faster simply through fewer
|
|
|
|
|
cross-products, similar to the way squaring is faster. But work by
|
|
|
|
|
Thom Mulders shows that for Karatsuba and higher order algorithms the
|
|
|
|
|
advantage is progressively lost, so for large sizes partial products
|
|
|
|
|
turn out to be no faster.
|
|
|
|
|
|
|
|
|
|
</ol>
|
|
|
|
|
|
|
|
|
|
<p> Another possibility would be an optimized cube. In the basecase that
|
|
|
|
|
should definitely be able to save cross-products in a similar fashion to
|
|
|
|
|
squaring, but some investigation might be needed for how best to adapt
|
|
|
|
|
the higher-order algorithms. Not sure whether cubing or further small
|
|
|
|
|
powers have any particularly important uses though.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Assembly routines</strong>
|
|
|
|
|
|
|
|
|
|
<p> Write new and improve existing assembly routines. The tests/devel
|
|
|
|
|
programs and the tune/speed.c and tune/many.pl programs are useful for
|
|
|
|
|
testing and timing the routines you write. See the README files in those
|
|
|
|
|
directories for more information.
|
|
|
|
|
|
|
|
|
|
<p> Please make sure your new routines are fast for these three situations:
|
|
|
|
|
<ol>
|
|
|
|
|
<li> Operands that fit into the cache.
|
|
|
|
|
<li> Small operands of less than, say, 10 limbs.
|
|
|
|
|
<li> Huge operands that does not fit into the cache.
|
|
|
|
|
</ol>
|
|
|
|
|
|
|
|
|
|
<p> The most important routines are mpn_addmul_1, mpn_mul_basecase and
|
|
|
|
|
mpn_sqr_basecase. The latter two don't exist for all machines, while
|
|
|
|
|
mpn_addmul_1 exists for almost all machines.
|
|
|
|
|
|
|
|
|
|
<p> Standard techniques for these routines are unrolling, software
|
|
|
|
|
pipelining, and specialization for common operand values. For machines
|
|
|
|
|
with poor integer multiplication, it is often possible to improve the
|
|
|
|
|
performance using floating-point operations, or SIMD operations such as
|
|
|
|
|
MMX or Sun's VIS.
|
|
|
|
|
|
|
|
|
|
<p> Using floating-point operations is interesting but somewhat tricky.
|
|
|
|
|
Since IEEE double has 53 bit of mantissa, one has to split the operands
|
|
|
|
|
in small prices, so that no result is greater than 2^53. For 32-bit
|
|
|
|
|
computers, splitting one operand into 16-bit pieces works. For 64-bit
|
|
|
|
|
machines, one operand can be split into 21-bit pieces and the other into
|
|
|
|
|
32-bit pieces. (A 64-bit operand can be split into just three 21-bit
|
|
|
|
|
pieces if one allows the split operands to be negative!)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Faster GCD</strong>
|
|
|
|
|
|
|
|
|
|
<p> Work on Sch<63>nhage GCD and GCDEXT for large numbers is in progress.
|
|
|
|
|
Contact Niels M<>ller if you want to help.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Math functions for the mpf layer</strong>
|
|
|
|
|
|
|
|
|
|
<p> Implement the functions of math.h for the GMP mpf layer! Check the book
|
|
|
|
|
"Pi and the AGM" by Borwein and Borwein for ideas how to do this. These
|
|
|
|
|
functions are desirable: acos, acosh, asin, asinh, atan, atanh, atan2,
|
|
|
|
|
cos, cosh, exp, log, log10, pow, sin, sinh, tan, tanh.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Faster sqrt</strong>
|
|
|
|
|
|
|
|
|
|
<p> The current code uses divisions, which are reasonably fast, but it'd be
|
|
|
|
|
possible to use only multiplications by computing 1/sqrt(A) using this
|
|
|
|
|
formula:
|
|
|
|
|
<pre>
|
|
|
|
|
2
|
|
|
|
|
x = x (3 − A x )/2
|
|
|
|
|
i+1 i i </pre>
|
|
|
|
|
The square root can then be computed like this:
|
|
|
|
|
<pre>
|
|
|
|
|
sqrt(A) = A x
|
|
|
|
|
n </pre>
|
|
|
|
|
<p> That final multiply might be the full size of the input (though it might
|
|
|
|
|
only need the high half of that), so there may or may not be any speedup
|
|
|
|
|
overall.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Nth root</strong>
|
|
|
|
|
|
|
|
|
|
<p> Improve mpn_rootrem. The current code is really naive, using full
|
|
|
|
|
precision from the first iteration. Also, calling mpn_pow_1 isn't very
|
|
|
|
|
clever, as only 1/n of the result bits will be used; truncation after
|
|
|
|
|
each multiplication would be better. Avoiding division might also be
|
|
|
|
|
possible.
|
|
|
|
|
|
|
|
|
|
Work mostly done, see
|
|
|
|
|
<a href="http://gmplib.org/devel/</a>.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Quotient-Only Division</strong>
|
|
|
|
|
|
|
|
|
|
<p> Some work can be saved when only the quotient is required in a division,
|
|
|
|
|
basically the necessary correction -0, -1 or -2 to the estimated quotient
|
|
|
|
|
can almost always be determined from only a few limbs of multiply and
|
|
|
|
|
subtract, rather than forming a complete remainder. The greatest savings
|
|
|
|
|
are when the quotient is small compared to the dividend and divisor.
|
|
|
|
|
|
|
|
|
|
<p> Some code along these lines can be found in the current
|
|
|
|
|
<code>mpn_tdiv_qr</code>, though perhaps calculating bigger chunks of
|
|
|
|
|
remainder than might be strictly necessary. That function in its current
|
|
|
|
|
form actually then always goes on to calculate a full remainder.
|
|
|
|
|
Burnikel and Zeigler describe a similar approach for the divide and
|
|
|
|
|
conquer case.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Sub-Quadratic REDC and Exact Division</strong>
|
|
|
|
|
|
|
|
|
|
<p> See also
|
|
|
|
|
<a href="http://gmplib.org/devel/">http://gmplib.org/devel/</a>
|
|
|
|
|
for some new code for divexact.
|
|
|
|
|
|
|
|
|
|
<p> <code>mpn_bdivmod</code> and the <code>redc</code> in
|
|
|
|
|
<code>mpz_powm</code> should use some sort of divide and conquer
|
|
|
|
|
algorithm. This would benefit <code>mpz_divexact</code>, and
|
|
|
|
|
<code>mpn_gcd</code> on large unequal size operands. See "Exact Division
|
|
|
|
|
with Karatsuba Complexity" by Jebelean for a (brief) description.
|
|
|
|
|
|
|
|
|
|
<p> Failing that, some sort of <code>DIVEXACT_THRESHOLD</code> could be added
|
|
|
|
|
to control whether <code>mpz_divexact</code> uses
|
|
|
|
|
<code>mpn_bdivmod</code> or <code>mpn_tdiv_qr</code>, since the latter is
|
|
|
|
|
faster on large divisors.
|
|
|
|
|
|
|
|
|
|
<p> For the REDC, basically all that's needed is Montgomery's algorithm done
|
|
|
|
|
in multi-limb integers. R is multiple limbs, and the inverse and
|
|
|
|
|
operands are multi-precision.
|
|
|
|
|
|
|
|
|
|
<p> For exact division the time to calculate a multi-limb inverse is not
|
|
|
|
|
amortized across many modular operations, but instead will probably
|
|
|
|
|
create a threshold below which the current style <code>mpn_bdivmod</code>
|
|
|
|
|
is best. There's also Krandick and Jebelean, "Bidirectional Exact
|
|
|
|
|
Integer Division" to basically use a low to high exact division for the
|
|
|
|
|
low half quotient, and a quotient-only division for the high half.
|
|
|
|
|
|
|
|
|
|
<p> It will be noted that low-half and high-half multiplies, and a low-half
|
|
|
|
|
square, can be used. These ought to each take as little as half the time
|
|
|
|
|
of a full multiply, or square, though work by Thom Mulders shows the
|
|
|
|
|
advantage is progressively lost as Karatsuba and higher algorithms are
|
|
|
|
|
applied.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Exceptions</strong>
|
|
|
|
|
|
|
|
|
|
<p> Some sort of scheme for exceptions handling would be desirable.
|
|
|
|
|
Presently the only thing documented is that divide by zero in GMP
|
|
|
|
|
functions provokes a deliberate machine divide by zero (on those systems
|
|
|
|
|
where such a thing exists at least). The global <code>gmp_errno</code>
|
|
|
|
|
is not actually documented, except for the old <code>gmp_randinit</code>
|
|
|
|
|
function. Being currently just a plain global means it's not
|
|
|
|
|
thread-safe.
|
|
|
|
|
|
|
|
|
|
<p> The basic choices for exceptions are returning an error code or having a
|
|
|
|
|
handler function to be called. The disadvantage of error returns is they
|
|
|
|
|
have to be checked, leading to tedious and rarely executed code, and
|
|
|
|
|
strictly speaking such a scheme wouldn't be source or binary compatible.
|
|
|
|
|
The disadvantage of a handler function is that a <code>longjmp</code> or
|
|
|
|
|
similar recovery from it may be difficult. A combination would be
|
|
|
|
|
possible, for instance by allowing the handler to return an error code.
|
|
|
|
|
|
|
|
|
|
<p> Divide-by-zero, sqrt-of-negative, and similar operand range errors can
|
|
|
|
|
normally be detected at the start of functions, so exception handling
|
|
|
|
|
would have a clean state. What's worth considering though is that the
|
|
|
|
|
GMP function detecting the exception may have been called via some third
|
|
|
|
|
party library or self contained application module, and hence have
|
|
|
|
|
various bits of state to be cleaned up above it. It'd be highly
|
|
|
|
|
desirable for an exceptions scheme to allow for such cleanups.
|
|
|
|
|
|
|
|
|
|
<p> The C++ destructor mechanism could help with cleanups both internally and
|
|
|
|
|
externally, but being a plain C library we don't want to depend on that.
|
|
|
|
|
|
|
|
|
|
<p> A C++ <code>throw</code> might be a good optional extra exceptions
|
|
|
|
|
mechanism, perhaps under a build option. For GCC
|
|
|
|
|
<code>-fexceptions</code> will add the necessary frame information to
|
|
|
|
|
plain C code, or GMP could be compiled as C++.
|
|
|
|
|
|
|
|
|
|
<p> Out-of-memory exceptions are expected to be handled by the
|
|
|
|
|
<code>mp_set_memory_functions</code> routines, rather than being a
|
|
|
|
|
prospective part of divide-by-zero etc. Some similar considerations
|
|
|
|
|
apply but what differs is that out-of-memory can arise deep within GMP
|
|
|
|
|
internals. Even fundamental routines like <code>mpn_add_n</code> and
|
|
|
|
|
<code>mpn_addmul_1</code> can use temporary memory (for instance on Cray
|
|
|
|
|
vector systems). Allowing for an error code return would require an
|
|
|
|
|
awful lot of checking internally. Perhaps it'd still be worthwhile, but
|
|
|
|
|
it'd be a lot of changes and the extra code would probably be rather
|
|
|
|
|
rarely executed in normal usages.
|
|
|
|
|
|
|
|
|
|
<p> A <code>longjmp</code> recovery for out-of-memory will currently, in
|
|
|
|
|
general, lead to memory leaks and may leave GMP variables operated on in
|
|
|
|
|
inconsistent states. Maybe it'd be possible to record recovery
|
|
|
|
|
information for use by the relevant allocate or reallocate function, but
|
|
|
|
|
that too would be a lot of changes.
|
|
|
|
|
|
|
|
|
|
<p> One scheme for out-of-memory would be to note that all GMP allocations go
|
|
|
|
|
through the <code>mp_set_memory_functions</code> routines. So if the
|
|
|
|
|
application has an intended <code>setjmp</code> recovery point it can
|
|
|
|
|
record memory activity by GMP and abandon space allocated and variables
|
|
|
|
|
initialized after that point. This might be as simple as directing the
|
|
|
|
|
allocation functions to a separate pool, but in general would have the
|
|
|
|
|
disadvantage of needing application-level bookkeeping on top of the
|
|
|
|
|
normal system <code>malloc</code>. An advantage however is that it needs
|
|
|
|
|
nothing from GMP itself and on that basis doesn't burden applications not
|
|
|
|
|
needing recovery. Note that there's probably some details to be worked
|
|
|
|
|
out here about reallocs of existing variables, and perhaps about copying
|
|
|
|
|
or swapping between "permanent" and "temporary" variables.
|
|
|
|
|
|
|
|
|
|
<p> Applications desiring a fine-grained error control, for instance a
|
|
|
|
|
language interpreter, would very possibly not be well served by a scheme
|
|
|
|
|
requiring <code>longjmp</code>. Wrapping every GMP function call with a
|
|
|
|
|
<code>setjmp</code> would be very inconvenient.
|
|
|
|
|
|
|
|
|
|
<p> Another option would be to let <code>mpz_t</code> etc hold a sort of NaN,
|
|
|
|
|
a special value indicating an out-of-memory or other failure. This would
|
|
|
|
|
be similar to NaNs in mpfr. Unfortunately such a scheme could only be
|
|
|
|
|
used by programs prepared to handle such special values, since for
|
|
|
|
|
instance a program waiting for some condition to be satisfied could
|
|
|
|
|
become an infinite loop if it wasn't also watching for NaNs. The work to
|
|
|
|
|
implement this would be significant too, lots of checking of inputs and
|
|
|
|
|
intermediate results. And if <code>mpn</code> routines were to
|
|
|
|
|
participate in this (which they would have to internally) a lot of new
|
|
|
|
|
return values would need to be added, since of course there's no
|
|
|
|
|
<code>mpz_t</code> etc structure for them to indicate failure in.
|
|
|
|
|
|
|
|
|
|
<p> Stack overflow is another possible exception, but perhaps not one that
|
|
|
|
|
can be easily detected in general. On i386 GNU/Linux for instance GCC
|
|
|
|
|
normally doesn't generate stack probes for an <code>alloca</code>, but
|
|
|
|
|
merely adjusts <code>%esp</code>. A big enough <code>alloca</code> can
|
|
|
|
|
miss the stack redzone and hit arbitrary data. GMP stack usage is
|
|
|
|
|
normally a function of operand size, which might be enough for some
|
|
|
|
|
applications to know they'll be safe. Otherwise a fixed maximum usage
|
|
|
|
|
can probably be obtained by building with
|
|
|
|
|
<code>--enable-alloca=malloc-reentrant</code> (or
|
|
|
|
|
<code>notreentrant</code>). Arranging the default to be
|
|
|
|
|
<code>alloca</code> only on blocks up to a certain size and
|
|
|
|
|
<code>malloc</code> thereafter might be a better approach and would have
|
|
|
|
|
the advantage of not having calculations limited by available stack.
|
|
|
|
|
|
|
|
|
|
<p> Actually recovering from stack overflow is of course another problem. It
|
|
|
|
|
might be possible to catch a <code>SIGSEGV</code> in the stack redzone
|
|
|
|
|
and do something in a <code>sigaltstack</code>, on systems which have
|
|
|
|
|
that, but recovery might otherwise not be possible. This is worth
|
|
|
|
|
bearing in mind because there's no point worrying about tight and careful
|
|
|
|
|
out-of-memory recovery if an out-of-stack is fatal.
|
|
|
|
|
|
|
|
|
|
<p> Operand overflow is another exception to be addressed. It's easy for
|
|
|
|
|
instance to ask <code>mpz_pow_ui</code> for a result bigger than an
|
|
|
|
|
<code>mpz_t</code> can possibly represent. Currently overflows in limb
|
|
|
|
|
or byte count calculations will go undetected. Often they'll still end
|
|
|
|
|
up asking the memory functions for blocks bigger than available memory,
|
|
|
|
|
but that's by no means certain and results are unpredictable in general.
|
|
|
|
|
It'd be desirable to tighten up such size calculations. Probably only
|
|
|
|
|
selected routines would need checks, if it's assumed say that no input
|
|
|
|
|
will be more than half of all memory and hence size additions like say
|
|
|
|
|
<code>mpz_mul</code> won't overflow.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Performance Tool</strong>
|
|
|
|
|
|
|
|
|
|
<p> It'd be nice to have some sort of tool for getting an overview of
|
|
|
|
|
performance. Clearly a great many things could be done, but some primary
|
|
|
|
|
uses would be,
|
|
|
|
|
|
|
|
|
|
<ol>
|
|
|
|
|
<li> Checking speed variations between compilers.
|
|
|
|
|
<li> Checking relative performance between systems or CPUs.
|
|
|
|
|
</ol>
|
|
|
|
|
|
|
|
|
|
<p> A combination of measuring some fundamental routines and some
|
|
|
|
|
representative application routines might satisfy these.
|
|
|
|
|
|
|
|
|
|
<p> The tune/time.c routines would be the easiest way to get good accurate
|
|
|
|
|
measurements on lots of different systems. The high level
|
|
|
|
|
<code>speed_measure</code> may or may not suit, but the basic
|
|
|
|
|
<code>speed_starttime</code> and <code>speed_endtime</code> would cover
|
|
|
|
|
lots of portability and accuracy questions.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Using <code>restrict</code></strong>
|
|
|
|
|
|
|
|
|
|
<p> There might be some value in judicious use of C99 style
|
|
|
|
|
<code>restrict</code> on various pointers, but this would need some
|
|
|
|
|
careful thought about what it implies for the various operand overlaps
|
|
|
|
|
permitted in GMP.
|
|
|
|
|
|
|
|
|
|
<p> Rumour has it some pre-C99 compilers had <code>restrict</code>, but
|
|
|
|
|
expressing tighter (or perhaps looser) requirements. Might be worth
|
|
|
|
|
investigating that before using <code>restrict</code> unconditionally.
|
|
|
|
|
|
|
|
|
|
<p> Loops are presumably where the greatest benefit would be had, by allowing
|
|
|
|
|
the compiler to advance reads ahead of writes, perhaps as part of loop
|
|
|
|
|
unrolling. However critical loops are generally coded in assembler, so
|
|
|
|
|
there might not be very much to gain. And on Cray systems the explicit
|
|
|
|
|
use of <code>_Pragma</code> gives an equivalent effect.
|
|
|
|
|
|
|
|
|
|
<p> One thing to note is that Microsoft C headers (on ia64 at least) contain
|
|
|
|
|
<code>__declspec(restrict)</code>, so a <code>#define</code> of
|
|
|
|
|
<code>restrict</code> should be avoided. It might be wisest to setup a
|
|
|
|
|
<code>gmp_restrict</code>.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Nx1 Division</strong>
|
|
|
|
|
|
|
|
|
|
<p> The limb-by-limb dependencies in the existing Nx1 division (and
|
|
|
|
|
remainder) code means that chips with multiple execution units or
|
|
|
|
|
pipelined multipliers are not fully utilized.
|
|
|
|
|
|
|
|
|
|
<p> One possibility is to follow the current preinv method but taking two
|
|
|
|
|
limbs at a time. That means a 2x2->4 and a 2x1->2 multiply for
|
|
|
|
|
each two limbs processed, and because the 2x2 and 2x1 can each be done in
|
|
|
|
|
parallel the latency will be not much more than 2 multiplies for two
|
|
|
|
|
limbs, whereas the single limb method has a 2 multiply latency for just
|
|
|
|
|
one limb. A version of <code>mpn_divrem_1</code> doing this has been
|
|
|
|
|
written in C, but not yet tested on likely chips. Clearly this scheme
|
|
|
|
|
would extend to 3x3->9 and 3x1->3 etc, though with diminishing
|
|
|
|
|
returns.
|
|
|
|
|
|
|
|
|
|
<p> For <code>mpn_mod_1</code>, Peter L. Montgomery proposes the following
|
|
|
|
|
scheme. For a limb R=2^<code>bits_per_mp_limb</code>, pre-calculate
|
|
|
|
|
values R mod N, R^2 mod N, R^3 mod N, R^4 mod N. Then take dividend
|
|
|
|
|
limbs and multiply them by those values, thereby reducing them (moving
|
|
|
|
|
them down) by the corresponding factor. The products can be added to
|
|
|
|
|
produce an intermediate remainder of 2 or 3 limbs to be similarly
|
|
|
|
|
included in the next step. The point is that such multiplies can be done
|
|
|
|
|
in parallel, meaning as little as 1 multiply worth of latency for 4
|
|
|
|
|
limbs. If the modulus N is less than R/4 (or is it R/5?) the summed
|
|
|
|
|
products will fit in 2 limbs, otherwise 3 will be required, but with the
|
|
|
|
|
high only being small. Clearly this extends to as many factors of R as a
|
|
|
|
|
chip can efficiently apply.
|
|
|
|
|
|
|
|
|
|
<p> The logical conclusion for powers R^i is a whole array "p[i] = R^i mod N"
|
|
|
|
|
for i up to k, the size of the dividend. This could then be applied at
|
|
|
|
|
multiplier throughput speed like an inner product. If the powers took
|
|
|
|
|
roughly k divide steps to calculate then there'd be an advantage any time
|
|
|
|
|
the same N was used three or more times. Suggested by Victor Shoup in
|
|
|
|
|
connection with chinese-remainder style decompositions, but perhaps with
|
|
|
|
|
other uses.
|
|
|
|
|
|
|
|
|
|
<p> <code>mpn_modexact_1_odd</code> calculates an x in the range 0<=x<d
|
|
|
|
|
satisfying a = q*d + x*b^n, where b=2^bits_per_limb. The factor b^n
|
|
|
|
|
needed to get the true remainder r could be calculated by a powering
|
|
|
|
|
algorithm, allowing <code>mpn_modexact_1_odd</code> to be pressed into
|
|
|
|
|
service for an <code>mpn_mod_1</code>. <code>modexact_1</code> is
|
|
|
|
|
simpler and on some chips can run noticeably faster than plain
|
|
|
|
|
<code>mod_1</code>, on Athlon for instance 11 cycles/limb instead of 17.
|
|
|
|
|
Such a difference could soon overcome the time to calculate b^n. The
|
|
|
|
|
requirement for an odd divisor in <code>modexact</code> can be handled by
|
|
|
|
|
some shifting on-the-fly, or perhaps by an extra partial-limb step at the
|
|
|
|
|
end.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Factorial</strong>
|
|
|
|
|
|
|
|
|
|
<p> The removal of twos in the current code could be extended to factors of 3
|
|
|
|
|
or 5. Taking this to its logical conclusion would be a complete
|
|
|
|
|
decomposition into powers of primes. The power for a prime p is of
|
|
|
|
|
course floor(n/p)+floor(n/p^2)+... Conrad Curry found this is quite fast
|
|
|
|
|
(using simultaneous powering as per Handbook of Applied Cryptography
|
|
|
|
|
algorithm 14.88).
|
|
|
|
|
|
|
|
|
|
<p> A difficulty with using all primes is that quite large n can be
|
|
|
|
|
calculated on a system with enough memory, larger than we'd probably want
|
|
|
|
|
for a table of primes, so some sort of sieving would be wanted. Perhaps
|
|
|
|
|
just taking out the factors of 3 and 5 would give most of the speedup
|
|
|
|
|
that a prime decomposition can offer.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Binomial Coefficients</strong>
|
|
|
|
|
|
|
|
|
|
<p> An obvious improvement to the current code would be to strip factors of 2
|
|
|
|
|
from each multiplier and divisor and count them separately, to be applied
|
|
|
|
|
with a bit shift at the end. Factors of 3 and perhaps 5 could even be
|
|
|
|
|
handled similarly.
|
|
|
|
|
|
|
|
|
|
<p> Conrad Curry reports a big speedup for binomial coefficients using a
|
|
|
|
|
prime powering scheme, at least for k near n/2. Of course this is only
|
|
|
|
|
practical for moderate size n since again it requires primes up to n.
|
|
|
|
|
|
|
|
|
|
<p> When k is small the current (n-k+1)...n/1...k will be fastest. Some sort
|
|
|
|
|
of rule would be needed for when to use this or when to use prime
|
|
|
|
|
powering. Such a rule will be a function of both n and k. Some
|
|
|
|
|
investigation is needed to see what sort of shape the crossover line will
|
|
|
|
|
have, the usual parameter tuning can of course find machine dependent
|
|
|
|
|
constants to fill in where necessary.
|
|
|
|
|
|
|
|
|
|
<p> An easier possibility also reported by Conrad Curry is that it may be
|
|
|
|
|
faster not to divide out the denominator (1...k) one-limb at a time, but
|
|
|
|
|
do one big division at the end. Is this because a big divisor in
|
|
|
|
|
<code>mpn_bdivmod</code> trades the latency of
|
|
|
|
|
<code>mpn_divexact_1</code> for the throughput of
|
|
|
|
|
<code>mpn_submul_1</code>? Overheads must hurt though.
|
|
|
|
|
|
|
|
|
|
<p> Another reason a big divisor might help is that
|
|
|
|
|
<code>mpn_divexact_1</code> won't be getting a full limb in
|
|
|
|
|
<code>mpz_bin_uiui</code>. It's called when the n accumulator is full
|
|
|
|
|
but the k may be far from full. Perhaps the two could be decoupled so k
|
|
|
|
|
is applied when full. It'd be necessary to delay consideration of k
|
|
|
|
|
terms until the corresponding n terms had been applied though, since
|
|
|
|
|
otherwise the division won't be exact.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Perfect Power Testing</strong>
|
|
|
|
|
|
|
|
|
|
<p> <code>mpz_perfect_power_p</code> could be improved in a number of ways,
|
|
|
|
|
for instance p-adic arithmetic to find possible roots.
|
|
|
|
|
|
|
|
|
|
<p> Non-powers can be quickly identified by checking for Nth power residues
|
|
|
|
|
modulo small primes, like <code>mpn_perfect_square_p</code> does for
|
|
|
|
|
squares. The residues to each power N for a given remainder could be
|
|
|
|
|
grouped into a bit mask, the masks for the remainders to each divisor
|
|
|
|
|
would then be "and"ed together to hopefully leave only a few candidate
|
|
|
|
|
powers. Need to think about how wide to make such masks, ie. how many
|
|
|
|
|
powers to examine in this way.
|
|
|
|
|
|
|
|
|
|
<p> Any zero remainders found in residue testing reveal factors which can be
|
|
|
|
|
divided out, with the multiplicity restricting the powers that need to be
|
|
|
|
|
considered, as per the current code. Further prime dividing should be
|
|
|
|
|
grouped into limbs like <code>PP</code>. Need to think about how much
|
|
|
|
|
dividing to do like that, probably more for bigger inputs, less for
|
|
|
|
|
smaller inputs.
|
|
|
|
|
|
|
|
|
|
<p> <code>mpn_gcd_1</code> would probably be better than the current private
|
|
|
|
|
GCD routine. The use it's put to isn't time-critical, and it might help
|
|
|
|
|
ensure correctness to just use the main GCD routine.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Prime Testing</strong>
|
|
|
|
|
|
|
|
|
|
<p> GMP is not really a number theory library and probably shouldn't have
|
|
|
|
|
large amounts of code dedicated to sophisticated prime testing
|
|
|
|
|
algorithms, but basic things well-implemented would suit. Tests offering
|
|
|
|
|
certainty are probably all too big or too slow (or both!) to justify
|
|
|
|
|
inclusion in the main library. Demo programs showing some possibilities
|
|
|
|
|
would be good though.
|
|
|
|
|
|
|
|
|
|
<p> The present "repetitions" argument to <code>mpz_probab_prime_p</code> is
|
|
|
|
|
rather specific to the Miller-Rabin tests of the current implementation.
|
|
|
|
|
Better would be some sort of parameter asking perhaps for a maximum
|
|
|
|
|
chance 1/2^x of a probable prime in fact being composite. If
|
|
|
|
|
applications follow the advice that the present reps gives 1/4^reps
|
|
|
|
|
chance then perhaps such a change is unnecessary, but an explicitly
|
|
|
|
|
described 1/2^x would allow for changes in the implementation or even for
|
|
|
|
|
new proofs about the theory.
|
|
|
|
|
|
|
|
|
|
<p> <code>mpz_probab_prime_p</code> always initializes a new
|
|
|
|
|
<code>gmp_randstate_t</code> for randomized tests, which unfortunately
|
|
|
|
|
means it's not really very random and in particular always runs the same
|
|
|
|
|
tests for a given input. Perhaps a new interface could accept an rstate
|
|
|
|
|
to use, so successive tests could increase confidence in the result.
|
|
|
|
|
|
|
|
|
|
<p> <code>mpn_mod_34lsub1</code> is an obvious and easy improvement to the
|
|
|
|
|
trial divisions. And since the various prime factors are constants, the
|
|
|
|
|
remainder can be tested with something like
|
|
|
|
|
<pre>
|
|
|
|
|
#define MP_LIMB_DIVISIBLE_7_P(n) \
|
|
|
|
|
((n) * MODLIMB_INVERSE_7 <= MP_LIMB_T_MAX/7)
|
|
|
|
|
</pre>
|
|
|
|
|
Which would help compilers that don't know how to optimize divisions by
|
|
|
|
|
constants, and is even an improvement on current gcc 3.2 code. This
|
|
|
|
|
technique works for any modulus, see Granlund and Montgomery "Division by
|
|
|
|
|
Invariant Integers" section 9.
|
|
|
|
|
|
|
|
|
|
<p> The trial divisions are done with primes generated and grouped at
|
|
|
|
|
runtime. This could instead be a table of data, with pre-calculated
|
|
|
|
|
inverses too. Storing deltas, ie. amounts to add, rather than actual
|
|
|
|
|
primes would save space. <code>udiv_qrnnd_preinv</code> style inverses
|
|
|
|
|
can be made to exist by adding dummy factors of 2 if necessary. Some
|
|
|
|
|
thought needs to be given as to how big such a table should be, based on
|
|
|
|
|
how much dividing would be profitable for what sort of size inputs. The
|
|
|
|
|
data could be shared by the perfect power testing.
|
|
|
|
|
|
|
|
|
|
<p> Jason Moxham points out that if a sqrt(-1) mod N exists then any factor
|
|
|
|
|
of N must be == 1 mod 4, saving half the work in trial dividing. (If
|
|
|
|
|
x^2==-1 mod N then for a prime factor p we have x^2==-1 mod p and so the
|
|
|
|
|
jacobi symbol (-1/p)=1. But also (-1/p)=(-1)^((p-1)/2), hence must have
|
|
|
|
|
p==1 mod 4.) But knowing whether sqrt(-1) mod N exists is not too easy.
|
|
|
|
|
A strong pseudoprime test can reveal one, so perhaps such a test could be
|
|
|
|
|
inserted part way though the dividing.
|
|
|
|
|
|
|
|
|
|
<p> Jon Grantham "Frobenius Pseudoprimes" (www.pseudoprime.com) describes a
|
|
|
|
|
quadratic pseudoprime test taking about 3x longer than a plain test, but
|
|
|
|
|
with only a 1/7710 chance of error (whereas 3 plain Miller-Rabin tests
|
|
|
|
|
would offer only (1/4)^3 == 1/64). Such a test needs completely random
|
|
|
|
|
parameters to satisfy the theory, though single-limb values would run
|
|
|
|
|
faster. It's probably best to do at least one plain Miller-Rabin before
|
|
|
|
|
any quadratic tests, since that can identify composites in less total
|
|
|
|
|
time.
|
|
|
|
|
|
|
|
|
|
<p> Some thought needs to be given to the structure of which tests (trial
|
|
|
|
|
division, Miller-Rabin, quadratic) and how many are done, based on what
|
|
|
|
|
sort of inputs we expect, with a view to minimizing average time.
|
|
|
|
|
|
|
|
|
|
<p> It might be a good idea to break out subroutines for the various tests,
|
|
|
|
|
so that an application can combine them in ways it prefers, if sensible
|
|
|
|
|
defaults in <code>mpz_probab_prime_p</code> don't suit. In particular
|
|
|
|
|
this would let applications skip tests it knew would be unprofitable,
|
|
|
|
|
like trial dividing when an input is already known to have no small
|
|
|
|
|
factors.
|
|
|
|
|
|
|
|
|
|
<p> For small inputs, combinations of theory and explicit search make it
|
|
|
|
|
relatively easy to offer certainty. For instance numbers up to 2^32
|
|
|
|
|
could be handled with a strong pseudoprime test and table lookup. But
|
|
|
|
|
it's rather doubtful whether a smallnum prime test belongs in a bignum
|
|
|
|
|
library. Perhaps if it had other internal uses.
|
|
|
|
|
|
|
|
|
|
<p> An <code>mpz_nthprime</code> might be cute, but is almost certainly
|
|
|
|
|
impractical for anything but small n.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li> <strong>Intra-Library Calls</strong>
|
|
|
|
|
|
|
|
|
|
<p> On various systems, calls within libmpir still go through the PLT, TOC or
|
|
|
|
|
other mechanism, which makes the code bigger and slower than it needs to
|
|
|
|
|
be.
|
|
|
|
|
|
|
|
|
|
<p> The theory would be to have all GMP intra-library calls resolved directly
|
|
|
|
|
to the routines in the library. An application wouldn't be able to
|
|
|
|
|
replace a routine, the way it can normally, but there seems no good
|
|
|
|
|
reason to do that, in normal circumstances.
|
|
|
|
|
|
|
|
|
|
<p> The <code>visibility</code> attribute in recent gcc is good for this,
|
|
|
|
|
because it lets gcc omit unnecessary GOT pointer setups or whatever if it
|
|
|
|
|
finds all calls are local and there's no global data references.
|
|
|
|
|
Documented entrypoints would be <code>protected</code>, and purely
|
|
|
|
|
internal things not wanted by test programs or anything can be
|
|
|
|
|
<code>internal</code>.
|
|
|
|
|
|
|
|
|
|
<p> Unfortunately, on i386 it seems <code>protected</code> ends up causing
|
|
|
|
|
text segment relocations within libmpir.so, meaning the library code can't
|
|
|
|
|
be shared between processes, defeating the purpose of a shared library.
|
|
|
|
|
Perhaps this is just a gremlin in binutils (debian packaged
|
|
|
|
|
2.13.90.0.16-1).
|
|
|
|
|
|
|
|
|
|
<p> The linker can be told directly (with a link script, or options) to do
|
|
|
|
|
the same sort of thing. This doesn't change the code emitted by gcc of
|
|
|
|
|
course, but it does mean calls are resolved directly to their targets,
|
|
|
|
|
avoiding a PLT entry.
|
|
|
|
|
|
|
|
|
|
<p> Keeping symbols private to libmpir.so is probably a good thing in general
|
|
|
|
|
too, to stop anyone even attempting to access them. But some
|
|
|
|
|
undocumented things will need or want to be kept visible, for use by
|
|
|
|
|
mpfr, or the test and tune programs. Libtool has a standard option for
|
|
|
|
|
selecting public symbols (used now for libmp).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
</ul>
|
|
|
|
|
<hr>
|
|
|
|
|
|
|
|
|
|
</body>
|
|
|
|
|
</html>
|
|
|
|
|
|
|
|
|
|
<!--
|
|
|
|
|
Local variables:
|
|
|
|
|
eval: (add-hook 'write-file-hooks 'time-stamp)
|
|
|
|
|
time-stamp-start: "This file current as of "
|
|
|
|
|
time-stamp-format: "%:d %3b %:y"
|
|
|
|
|
time-stamp-end: "\\."
|
|
|
|
|
time-stamp-line-limit: 50
|
|
|
|
|
End:
|
|
|
|
|
-->
|