mpir/pa32 at c22d37cecd7fadc1c6f7905f7ad7fa4b9c77959e - mpir

cheng/mpir
History
wbhart d7eed06c0c Added tuning values for hppa2.0 (ABI=1.0).		2009-06-03 03:12:18 +00:00
..
hppa1_1
hppa2_0	Added tuning values for hppa2.0 (ABI=1.0).	2009-06-03 03:12:18 +00:00
add_n.asm
gmp-mparam.h
lshift.asm
pa-defs.m4
README
rshift.asm
sub_n.asm
udiv.asm
README

Copyright 1996, 1999, 2001, 2002, 2004 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA.






This directory contains mpn functions for various HP PA-RISC chips.  Code
that runs faster on the PA7100 and later implementations, is in the pa7100
directory.

RELEVANT OPTIMIZATION ISSUES

  Load and Store timing

On the PA7000 no memory instructions can issue the two cycles after a store.
For the PA7100, this is reduced to one cycle.

The PA7100 has a lookup-free cache, so it helps to schedule loads and the
dependent instruction really far from each other.

STATUS

1. mpn_mul_1 could be improved to 6.5 cycles/limb on the PA7100, using the
   instructions below (but some sw pipelining is needed to avoid the
   xmpyu-fstds delay):

	fldds	s1_ptr

	xmpyu
	fstds	N(%r30)
	xmpyu
	fstds	N(%r30)

	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)

	addc
	stws	res_ptr
	addc
	stws	res_ptr

	addib	Loop

2. mpn_addmul_1 could be improved from the current 10 to 7.5 cycles/limb
   (asymptotically) on the PA7100, using the instructions below.  With proper
   sw pipelining and the unrolling level below, the speed becomes 8
   cycles/limb.

	fldds	s1_ptr
	fldds	s1_ptr

	xmpyu
	fstds	N(%r30)
	xmpyu
	fstds	N(%r30)
	xmpyu
	fstds	N(%r30)
	xmpyu
	fstds	N(%r30)

	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)
	addc
	addc
	addc
	addc
	addc	%r0,%r0,cy-limb

	ldws	res_ptr
	ldws	res_ptr
	ldws	res_ptr
	ldws	res_ptr
	add
	stws	res_ptr
	addc
	stws	res_ptr
	addc
	stws	res_ptr
	addc
	stws	res_ptr

	addib

3. For the PA8000 we have to stick to using 32-bit limbs before compiler
   support emerges.  But we want to use 64-bit operations whenever possible,
   in particular for loads and stores.  It is possible to handle mpn_add_n
   efficiently by rotating (when s1/s2 are aligned), masking+bit field
   inserting when (they are not).  The speed should double compared to the
   code used today.




LABEL SYNTAX

The HP-UX assembler takes labels starting in column 0 with no colon,

	L$loop  ldws,mb -4(0,%r25),%r22

Gas on hppa GNU/Linux however requires a colon,

	L$loop: ldws,mb -4(0,%r25),%r22

This is covered by using LDEF() from asm-defs.m4.  An alternative would be
to use ".label" which is accepted by both,

		.label  L$loop
		ldws,mb -4(0,%r25),%r22

but that's not as nice to look at, not if you're used to assembler code
having labels in column 0.




REFERENCES

Hewlett Packard, "HP Assembler Reference Manual", 9th edition, June 1998,
part number 92432-90012.



----------------
Local variables:
mode: text
fill-column: 76
End: