Game Development Community

Altivec optimized point3F_normalize

by asmaloney (Andy) · in Torque Game Engine · 01/11/2007 (1:56 pm) · 29 replies

In my quest to make the Stronghold mission actually playable on my PowerBook G4, my Shark profile showed that m_point3F_normalize_C() accounted for 1.3% of the time. [It showed uses mainly in the precipitation code.] So here's an altivec implementation which reduced that to 0.4% - not a bad savings. [I used journal playback so I can compare profile results directly.]

In math/mMathAltiVec.cc add:
// ------------
//	The following code is from here: http://developer.apple.com/hardwaredrivers/ve/algorithms.html

//result = v^(-0.5)
inline vector float ReciprocalSquareRoot( vector float v )
{
	//Get the square root reciprocal estimate
	vector float zero = (vector float)(0);
	vector float oneHalf = (vector float)(0.5);
	vector float one = (vector float)(1.0);
	vector float estimate = vec_rsqrte( v );

	//One round of Newton-Raphson refinement
	vector float estimateSquared = vec_madd( estimate, estimate, zero );
	vector float halfEstimate = vec_madd( estimate, oneHalf, zero );
	return vec_madd( vec_nmsub( v, estimateSquared, one ), halfEstimate, estimate );
}
// ------------

static void vec_point3F_normalize(F32 *p)
{	
	vector float	zeros = (vector float)(0.0f);
	vector float	pv;

	F32	*loader = (F32*) &pv;

	loader[0] = p[0];
	loader[1] = p[1];
	loader[2] = p[2];
	loader[3] = 0.0f;

	// Square each of the components of the vector, then add them together
	//	After this, all components of the vector are the same value
	//	This is the recommended way to operate across a vector
	vector float	squared = vec_madd( pv, pv, zeros );
	squared = vec_add( squared, vec_sld( squared, squared, 8 ) );
	squared = vec_add( squared, vec_sld( squared, squared, 4 ) );
	
	if ( vec_all_eq( squared, zeros ) )
	{
		p[0] = 0;
		p[1] = 0;
		p[2] = 1;
	}
	else
	{
		// If you're happy with a faster estimated reciprocal sqrt, use vec_rsqrte() instead of ReciprocalSquareRoot()
	//	pv = vec_madd( pv, vec_rsqrte( squared ), zeros );
		pv = vec_madd( pv, ReciprocalSquareRoot( squared ), zeros );
		
		p[0] = loader[0];
		p[1] = loader[1];
		p[2] = loader[2];
	}
}

Then in mInstallLibrary_Vec(), add this:
m_point3F_normalize     = vec_point3F_normalize;

If there are any Altivec masters out there - I'm not - please let me know how this can be improved!
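
For anyone following along without a G4, here is a rough scalar sketch of the same steps in portable C++: sum the squares, special-case the zero vector, then scale by a refined reciprocal square root. It is only an illustration, not the AltiVec code above. The names coarseRsqrt, refineRsqrt and point3F_normalize_ref are made up for this example, and coarseRsqrt is just a stand-in that imitates a low-precision estimate like the one vec_rsqrte() returns.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for vec_rsqrte(): take the exact 1/sqrt(v) and
// clear the low 12 mantissa bits so the result is deliberately coarse.
// Assumes a 32-bit IEEE 754 float.
static float coarseRsqrt(float v)
{
    float exact = 1.0f / std::sqrt(v);
    std::uint32_t bits;
    std::memcpy(&bits, &exact, sizeof bits);
    bits &= 0xFFFFF000u;
    float coarse;
    std::memcpy(&coarse, &bits, sizeof coarse);
    return coarse;
}

// One round of Newton-Raphson refinement, same formula as
// ReciprocalSquareRoot(): e' = e + (1 - v*e*e) * (e/2)
static float refineRsqrt(float v, float estimate)
{
    float estimateSquared = estimate * estimate;
    float halfEstimate = 0.5f * estimate;
    return estimate + (1.0f - v * estimateSquared) * halfEstimate;
}

// Scalar reference with the same zero-vector fallback as the post: (0, 0, 1).
static void point3F_normalize_ref(float *p)
{
    float squared = p[0] * p[0] + p[1] * p[1] + p[2] * p[2];
    if (squared == 0.0f)
    {
        p[0] = 0.0f;
        p[1] = 0.0f;
        p[2] = 1.0f;
        return;
    }
    float inv = refineRsqrt(squared, coarseRsqrt(squared));
    p[0] *= inv;
    p[1] *= inv;
    p[2] *= inv;
}
```

Each refinement round roughly squares the relative error of the estimate, which is why one round is enough to get close to single precision from a 12-bit starting point.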
#21
01/23/2007 (11:50 am)
Hmm - sounds like an optimizer bug. What compiler are you using? Does the asm do something weird for debug vs. release?
#22
01/23/2007 (12:36 pm)
From my coding partner Tim:
Quote:
"S32 i = *(S32*)&x; // get bits for floating value"

creates type-punning. gcc doesn't care about this when the optimizer isn't involved, but when you run the optimizer (we pass -O2 for release builds) it does, and it produces slightly more code so that the optimizer can treat the pun "i" and "x" differently. Sadly, in this case, the debug version probably produces code that runs in fewer cycles.

Adding -fno-strict-aliasing to the build flags for release makes this behave as expected.

Interestingly, adding -Wstrict-aliasing to the build flags produces a number of warnings in places in Torque where aliasing is happening.
#23
01/23/2007 (12:48 pm)
I think we've been compiling Torque with -fno-strict-aliasing on OS X for a while now, so that might be an acceptable fix.

Or - it looks like you could work around this with a union, according to gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Optimize-Options.html#index-fstrict_002dali....
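
As a rough illustration of getting the float's bits without the pointer cast: the sketch below copies the bytes with std::memcpy, which is a different technique from the union mentioned above but is well defined in C++ and is usually compiled down to a plain register move. The helper names floatBits and bitsToFloat are hypothetical, and the code assumes a 32-bit IEEE 754 float.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::int32_t),
              "this sketch assumes a 32-bit float");

// Read the bit pattern of a float without casting the pointer,
// so the strict-aliasing rules are not violated.
static std::int32_t floatBits(float x)
{
    std::int32_t i;
    std::memcpy(&i, &x, sizeof i);
    return i;
}

// The reverse direction: rebuild a float from its bit pattern.
static float bitsToFloat(std::int32_t i)
{
    float x;
    std::memcpy(&x, &i, sizeof x);
    return x;
}
```

With helpers like these, the line quoted in #22 would become something like "S32 i = floatBits(x);" and should behave the same with or without -fno-strict-aliasing.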
#24
01/23/2007 (1:40 pm)
While we're all here... [Alex?] I've spent the better part of a day tracking down a problem - thankfully I had it show up in a new profile journal so I could keep playing it back. [Man - that journaling is GREAT!] It was showing up as an assert in DepthSortList::depthPartition(), but I eventually tracked it down to the function above...

It seems that Point3FToVector0() doesn't always do the right thing. Alex: is this the same as the one you wrote which you reference under the heading Issues with 12 bytes?

In the meantime, I'm going back to the old method:

	register vector float	pv;
	register F32		*pvp = (F32*) &pv;
	
	pvp[0] = p[0];
	pvp[1] = p[1];
	pvp[2] = p[2];
	pvp[3] = 0.0f;

Which works fine, even though it's slightly slower...
#25
01/23/2007 (2:38 pm)
I think this altivec function should not be used...

The results are not exactly what the C version produces - they're the same to several decimal places, but not exact - even after adding in extra rounds of Newton-Raphson. This showed up on my journal playback - the paths were correct, but things were just a little off.

My concern is that this might make Mac PPCs not play nicely with others.
#26
01/23/2007 (2:40 pm)
Quote:
My concern is that this might make Mac PPCs not play nicely with others.

Interestingly, this has been an issue with Torque based games in the past as well. You might be sniffing out something here that has been an issue for a very long time.
#27
01/25/2007 (12:59 pm)
If you want fast, this is much faster than even the one Orion pasted in:

float SqrtSSE(float x)
{
   __asm
   {
      movss xmm0, x
      rsqrtss xmm0, xmm0
      rcpss xmm0, xmm0
      movss x, xmm0
   }

   return x;
}

Of course, you need SSE (although it's not hard to check if it's available).

This runs almost 3x faster. I haven't done any testing to determine the accuracy, but it's probably as good as the one Orion pasted.
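
On the accuracy question, here is a hedged sketch that does the same rsqrtss + rcpss sequence with compiler intrinsics and measures its relative error against std::sqrt. Intel documents a maximum relative error of 1.5 * 2^-12 for each of those two instructions, so the combined result should be good to roughly 3 decimal digits, well short of full single precision. The names approxSqrt and relativeError are made up for this example, and on non-SSE targets the sketch simply falls back to std::sqrt.

```cpp
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#endif

// Approximate sqrt(x) as 1 / (1/sqrt(x)) using the two SSE estimate
// instructions, mirroring the inline asm in SqrtSSE().
static float approxSqrt(float x)
{
#ifdef SKETCH_HAS_SSE
    __m128 v = _mm_set_ss(x);   // movss
    v = _mm_rsqrt_ss(v);        // rsqrtss: estimate of 1/sqrt(x)
    v = _mm_rcp_ss(v);          // rcpss: estimate of 1/v
    return _mm_cvtss_f32(v);
#else
    return std::sqrt(x);        // fallback so the sketch builds without SSE
#endif
}

// Relative error of the approximation for a positive input.
static float relativeError(float x)
{
    float exact = std::sqrt(x);
    return std::fabs(approxSqrt(x) - exact) / exact;
}
```

An error of that size is fine for things like particle effects, but it may be too coarse for a general-purpose sqrt that the rest of the engine depends on.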

I just started experimenting with the "rsqrtps" instruction, but it seems having a function to do 4 inverse square roots might have limited utility in Torque.

I haven't looked at the newer Torque code bases, but it seems like there are a number of routines in "math" that could be replaced with SSE-enabled or 3DNow-enabled variants and get a substantial speed boost (maybe 1.4 or 1.5 use SSE more, I haven't looked).
#28
01/27/2007 (6:19 pm)
Mmm, that's good stuff Tim, trying it out in my code now.
I'm also using this for invSqrt, which gets hit a few times in the skinning code and wasn't getting SSE'd by VC even though I told it to in the optimization settings:
inline F32 mInvSqrt(const F32 val)
{
   __asm
   {
      movss xmm0, val
      rsqrtss xmm0, xmm0
      movss val, xmm0
   }

   return val;

}

Building now, we'll see...
#29
01/27/2007 (7:33 pm)
Well, that didn't work.
I made my invsqrt just call your SSE function instead. Making mSqrt call your SSE sqrt ends up giving me a bunch of numeric errors, and the game opens but the scene never loads. Maybe I don't have a good SSE processor here. I did a full rebuild, so I'll have to try again tomorrow.