--- In gbadev@y..., tom st denis <tomstdenis@y...> wrote:
> I didn't find much in the archives for this so I wrote my own....
>
> This routine gets ~27000 multiplies per second in VBA which means on
> the real hardware its probably around ~22000 [my gba broke so I can't
> test it myself...]
>
> What the code does is a 32x32 multiply then a shift right by 16 bits so
> it emulates a multiply for a 16.16 system.
I've managed to speed up the code 5x and I will tell y'all how.
Really easy using some school math.
Ok so the basic 16.16 multiply looks like this:
x = (a*b)>>16
Where a*b is a 64-bit [or at least 48 bits] product. Problem: The
GBA only produces 32 bits of product. Solution: Only do 16x16
multiplies!
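For reference, here is a minimal sketch in C of what the basic multiply computes when a 64-bit intermediate is available. The name fmul_ref is made up for illustration and is not part of my library; it is only handy as something to check a 32-bit-only routine against on a PC.

```c
#include <stdint.h>

/* Hypothetical reference routine: 16.16 multiply via a 64-bit product.
 * Assumes >> on a negative value is an arithmetic shift, which holds on
 * the usual compilers but is implementation-defined in C. */
static int32_t fmul_ref(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b;  /* full 32x32 -> 64 bit product */
    return (int32_t)(wide >> 16);            /* drop the extra 16 fraction bits */
}
```

With 1.0 stored as 0x10000, fmul_ref(0x18000, 0x20000) is 1.5 * 2.0 and gives 0x30000.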
Let a = AB and b = CD, where A and C are the high 16-bit words and B
and D are the low 16-bit words. Then you have
    AB
  * CD
  ----
where you must compute and add up [with the right shift taken]
  (B*D >> 16) + D*A + C*B + (A*C << 16)
So in my new code I do just that. After taking care of the sign of
the input [all four A,B,C,D must be unsigned at this point] I
proceed to mask/chop off the 16-bit words then I perform the four
multiplications.
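The steps above can be sketched in C roughly as follows. This is not the Thumb code from my library: the name fmul_split is hypothetical, and the sketch uses full 16x16 products with no preshifting, so its low bits will not necessarily match the assembler routine. It also assumes the usual two's complement conversion from unsigned back to signed.

```c
#include <stdint.h>

/* Hypothetical sketch: 16.16 multiply built from four 16x16 products,
 * using only 32-bit arithmetic. */
static int32_t fmul_split(int32_t a, int32_t b)
{
    /* Take care of the sign: work on magnitudes, fix the sign at the end. */
    int negative = (a < 0) != (b < 0);
    uint32_t ua = (a < 0) ? 0u - (uint32_t)a : (uint32_t)a;
    uint32_t ub = (b < 0) ? 0u - (uint32_t)b : (uint32_t)b;

    /* Chop each operand into its high and low 16-bit words. */
    uint32_t A = ua >> 16, B = ua & 0xFFFFu;
    uint32_t C = ub >> 16, D = ub & 0xFFFFu;

    /* Each product is at most 0xFFFF * 0xFFFF, so it fits in 32 bits. */
    uint32_t r = ((B * D) >> 16) + D * A + C * B + ((A * C) << 16);

    /* Results too large for 16.16 simply wrap around. */
    return negative ? (int32_t)(0u - r) : (int32_t)r;
}
```

For example fmul_split(0x8000, 0x8000) is 0.5 * 0.5 and gives 0x4000. Because the sign is handled on the magnitudes, negative results round toward zero, which can differ by one in the lowest bit from a plain arithmetic shift of the signed product.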
In VBA I routinely get >100k multiplications per second [instead of
20k]. I dunno how VBA emulates the multiplication but I know one
thing for sure: this is going to be faster than my other method. My
other method had 7 instructions per bit for a total of around 250
instructions. This routine has ~40 instructions which is about
1/5 the size :-)
Because I preshift the values [to keep this all 32 bits] the lower
bits of the fraction are probably not exactly what they should be.
I've tested the routine with various inputs [negative and positive]
and I found it appears to work.
In my new library I've added an Inverse table so you can compute
division easily. I've also added 16.16 Cos/Sin/Tan tables to my lib
[I already had 8.8 tables...]
Should be enough to implement pretty much any 3d graphics in a
larger world space.
You can see my multiplier code in /lib/amath.s. It's called "fmul32"
and is fairly well commented, assuming you can read ARM assembler.
All of my code is in Thumb mode because the instructions execute
quicker [and I don't have an ARM reference handy...]
http://tomstdenis.home.dhs.org/mylib.zip
I'd still be interested in tweaking my multiplier code if possible
so please reply if you have any ideas.
Also [I'm going to search too but I thought I'd ask....] anyone have
any good info on implementing atan() quickly? My trig is not 100%
so I could use some help.
Tom
[mod note: I think you should take a look at the ARM MULL instructions for your
own good, this has also been discussed lots and lots previously on the list :]