Game Development Community

A better Interior::setupActivePolyList()

by asmaloney (Andy) · in Torque Game Engine · 01/14/2007 (3:56 pm) · 41 replies

In my continuing quest for a Stronghold which runs at a reasonable speed on my PowerBook G4, I've improved Interior::setupActivePolyList() which was near the top of my profile as a heavy function. The changes should be useful for everyone so I'm posting them here instead of in the Mac forum.

Interior::setupActivePolyList() is a fairly long function. To simplify profiling and to allow the compiler a better chance to optimize at the function level, I added a new function [doFogActive()] to separate out the cases when we have fog to deal with.

The main change though is the way activePoints is handled. This is used to mark points we've already visited so we don't have to do calculations on them again. In the original version, it was an array of U8s and was allocated as part of the frame using FrameAllocator::alloc(). The two problems with this are (1) activePoints is only used in this function, so there's no need to add it as part of the frame memory and (2) indexing into an array of U8s just for a 0-or-1 comparison is quite expensive because of the alignment of the data. So this version replaces activePoints with a bit vector and checks bits instead.
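To make the idea concrete, here is a minimal sketch of marking visited points with one bit each instead of one byte each. `VisitedBits` and `countUniquePoints` are hypothetical names made up for illustration; this is not Torque's BitVector class or the actual engine code, just the same test-then-set pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a bit vector: one bit per point, packed into 32-bit words.
class VisitedBits
{
public:
    explicit VisitedBits(std::size_t numPoints)
        : mWords((numPoints + 31) / 32, 0u), mSize(numPoints) {}

    // True if the point at 'index' has already been marked.
    bool test(std::size_t index) const
    {
        assert(index < mSize);
        return ((mWords[index >> 5] >> (index & 31)) & 1u) != 0;
    }

    // Mark the point at 'index' as visited.
    void set(std::size_t index)
    {
        assert(index < mSize);
        mWords[index >> 5] |= (1u << (index & 31));
    }

private:
    std::vector<std::uint32_t> mWords;
    std::size_t mSize;
};

// Walks a list of point indices (which may repeat) and returns how many
// distinct points were processed; repeats are skipped via the bit test.
std::size_t countUniquePoints(const std::vector<std::uint16_t>& indices,
                              std::size_t numPoints)
{
    VisitedBits visited(numPoints);
    std::size_t processed = 0;
    for (std::uint16_t index : indices)
    {
        if (!visited.test(index))
        {
            visited.set(index);
            ++processed; // the per-point fog work would go here
        }
    }
    return processed;
}
```

The bit version touches 1/8th of the memory of a byte array for the same number of points. Whether that shows up in a profile depends on the platform, as the replies below suggest.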

The initial version accounted for 8% of my run time on my profile journal. These changes reduced it to 6%.

If you try it, please let me know what kind of results you get [I'm interested in how it affects the Windows build too]. If you have suggestions for further improvement - I'm sure there are other things to do here - please post!

Aside: One of the things I see in profiling is that some classes/structs are not aligned. I think this is the cause of a lot of inefficient loads and stores, but some class/structs seem to be sensitive to their data layout, so I'm going to have to look at this more carefully.

-----

In interior/interior.h around line 708 add the function header for doFogActive():
   void traverseZone(const RectD* inRects, const U32 numInputRects, U32 currZone, Vector<U32>& zoneStack);
[b]   void doFogActive( bool environmentActive,
                     SceneState* state,
                     U32 outputCount, U16* output,
                     U16* planeSides,
                     const PlaneF &distPlane,
                     const F32 distOffset,
                     const Point3F &worldP,
                     const Point3F& osZVec,
                     const F32 worldZ );
[/b]
   void setupActivePolyList(ZoneVisDeterminer&, SceneState*,
                            const Point3F&, const Point3F& rViewVector,
                            const Point3F&,
                            const F32 worldz, const Point3F& scale);

continued...

[Edit: include no longer needed]
#21
01/16/2007 (8:32 pm)
I start profiling once the loading screens are finished and the main game screen appears. I may miss the exact time by a tiny bit, but it's definitely after the loading is done.

It would be basically impossible for me to run my game with no exterior rendering, since each room in the DIF has one or more windows or doors displaying the exterior. However, I could do a fixed view comparison.

I guess one of the points from this is that most of the performance issues I'm seeing are related to the blender, which essentially drown out any other performance bottlenecks that I might otherwise notice.
#22
01/16/2007 (8:47 pm)
I did try it again using a basically fixed view from inside the starting room of the DIF, with no exterior visible. I profiled with Shark, starting well after the loading screens were done and the game had started (using journaling).

Whereas Blender::blend_vec() previously accounted for over 13% with the spinning routine, it was now 0.7% (both with the old build and the new build), so that was taken out of the equation nicely.

That said, Interior::setupActivePolyList() went from 0.6% to 0.7%...
#23
01/17/2007 (4:17 pm)
Testing things out for the music lounge with this on WinXP, 2 GHz, 2 GB RAM, using VC 2005. Doing it piece by piece.

For one particular level, setupActivePolyList is the top hotspot, taking 10.01% of the time within our exe.

Changing just activePoints to use the BitVector instead of the frame-allocated U8s basically had no effect for me :(. It now takes up 10.20% of the time within our exe. I suspect that VC's optimizer is smart enough to do it nicely there.

the exe did take like 0.2% less overall time this run though. but that's small enough a % it could just be normal variation.

I'm leaving the bitvector in as it seems like a better thing to do, and in case it helps our mac builds :)

I'll try some of your other changes and see how they go, and post results here.
#24
01/17/2007 (5:28 pm)
Hey Clint - thanks for the info. I wondered if the BitVector would make a difference on x86. Not sure if it's as sensitive to data alignment issues or - as you say - could be a compiler issue.

Are you using fog coords or textured fog?

I'll be interested in how your stuff performs when you put in all the changes here together. At the very least, moving the planeSides() checks should be somewhat beneficial by reducing internal loops.
#25
01/17/2007 (6:02 pm)
Hi Andy, I'm using FC fog.

I tried pulling the planesides stuff out to the top like you did, then just deleting the other checks, since they shouldn't be in the list anymore right? This actually made performance slightly worse, not much but slightly, might have been slop. It's possible I missed something in trying it out too.

I just tried pulling things out into the FogCalc class. This gave no perf boost for me, I'm still running about the same as before, it just shifted things around, now setupActivePolyList takes 6.22% but CalcFC takes 4.35% leaving me in about the same place.

I'm gonna look over it again and make sure I didn't miss anything obvious.
#26
01/17/2007 (7:10 pm)
Clint: Could you try just copying in all the code above to see what you get? Maybe it's just a wash on Windows? That would surprise me.
#27
01/17/2007 (8:46 pm)
You should read over the assembly the compiler is outputting to make sure the class is actually saving you calls and loads. Optimizers are finicky - based on my experience writing a software blitter a while ago, a few simple changes can make WORLDS of difference in how the optimizer is able to speed things up.
#28
01/17/2007 (8:56 pm)
@Ben: 'You' me or 'You' Clint :-) I've verified that both the FC and textured code are better on the GCC/PPC side of things - can't do x86 though. I'm still surprised that moving the planeside stuff doesn't gain anything - that just seems strange.
#29
01/18/2007 (9:49 am)
Andy, I was avoiding using your entire functions as I know we've changed a lot and thought that we had worked on that func some, but it looks like we haven't done much there. I'll try your whole functions and let you know how it works :)
#30
01/18/2007 (10:19 am)
Sorry Andy no luck here with our setup.
I took your functions exactly.

After your changes it does split things up, and setupActivePolyList is taking a low %, but combined with doFogActive and CalcFC it totals to being a bit slower than before this change.

How are you measuring your %'s? Are you sure you are including the amount setupActivePolyList, doFogActive and your FogCalc are taking up now? The total of those should be compared to setupActivePolyList before your change. For me it totals out to being either a bit slower or a wash.
#31
01/18/2007 (10:24 am)
Clint - just for my own benefit - are you testing on OS X or Win32?

Also, have you confirmed that the VC optimizer is generating appropriately inlined code for the fog calculator helper class?
#32
01/18/2007 (10:33 am)
Hey Ben, I'm on Win32
no I haven't been checking the assembly. just changing and running perf test with change.
#33
01/18/2007 (11:01 am)
Clint,

It's good to see what changes on Win32. G4s tend to have limited memory bandwidth, though, which means that in random access scenarios like this one, they get hit way harder than recent x86 systems will - and thus see way bigger wins from the memory-access reduction optimizations that are happening here. I'd expect that if you weren't memory-bound, and you added a check to reduce memory reads/writes, then you'd see no change, or maybe a slight slowdown, which seems in line with what you're observing.

Although the fog calculator stuff _might_ be a win - provided that the optimizer is properly inlining the code. Vectorizing the fog calculations could also be a big win (i.e., use SSE or MMX).
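As a rough idea of what vectorizing could look like, here is a sketch that computes the height term (a dot product with the Z vector plus a constant) for four points at a time with SSE. It assumes the point coordinates have been repacked into separate x/y/z arrays, which is not how mPoints is laid out in the engine, and `batchHeights` is a hypothetical name. On builds without SSE it falls back to the plain loop.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_USE_SSE 1
#endif

// Computes out[i] = xs[i]*zx + ys[i]*zy + zs[i]*zz + base for every point.
// (zx, zy, zz) plays the role of osZVec and 'base' the role of a
// precomputed constant; both are assumptions for this sketch.
void batchHeights(const float* xs, const float* ys, const float* zs,
                  std::size_t count,
                  float zx, float zy, float zz, float base,
                  float* out)
{
    std::size_t i = 0;

#ifdef SKETCH_USE_SSE
    // Broadcast the per-call constants into all four lanes once.
    const __m128 vx = _mm_set1_ps(zx);
    const __m128 vy = _mm_set1_ps(zy);
    const __m128 vz = _mm_set1_ps(zz);
    const __m128 vbase = _mm_set1_ps(base);

    // Four points per iteration; unaligned loads so any array works.
    for (; i + 4 <= count; i += 4)
    {
        __m128 acc = _mm_mul_ps(_mm_loadu_ps(xs + i), vx);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(ys + i), vy));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(zs + i), vz));
        acc = _mm_add_ps(acc, vbase);
        _mm_storeu_ps(out + i, acc);
    }
#endif

    // Scalar loop handles the leftover points (or everything without SSE).
    for (; i < count; ++i)
        out[i] = xs[i] * zx + ys[i] * zy + zs[i] * zz + base;
}
```

This only covers the arithmetic that feeds the fog lookup. The lookup itself and the cost of repacking the points would have to be measured before calling it a win.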
#34
01/18/2007 (11:26 am)
How do you see the fog stuff helping?
using CalcFC the only thing it looks like it's saving is one subtraction.

that is definitely the spot that is hitting us the hardest in setupActivePolyList though.

in particular this line:
mPoints[index].fogCoord = state->getHazeAndFog(mFabs(distPlane.distToPlane(mPoints[index].point)) + distOffset, (mDot(mPoints[index].point, osZVec) + worldZ) - worldP.z);

even with the inlined function it seems to me that it'll have to do almost the same work like so:

return( sState->getHazeAndFog( mFabs( distPlane.distToPlane(point) ) + fc_distOffset, fc_newWorldZ + mDot(point, osZVec) ) );
#35
01/18/2007 (4:05 pm)
Well, if it's inlining it, then you save a few instructions per calc, which adds up when you do something for every vertex. Maybe it also saves you an instruction cache miss, but you'd have to check in with VTune or AMD's profiler to find out if that's even an issue. If you're not memory bound then it probably isn't.

Of course, for all we know the VC compiler is spilling registers and doing all kinds of stupid casts and loads - which is why the asm is important to review.

In any case, vectorizing these calcs is likely to be a much bigger win on a compute bound platform (which is what it looks like x86 is).
#36
01/18/2007 (5:01 pm)
"Well, if it's inlining it, then you save a few instructions per calc,"

but only because we've moved it into a function in the first place, right? If I leave it alone and don't add the FogCalc, it's basically the same as using FogCalc with inlining?

I feel like I'm missing something :)

thanks for the advice.
#37
01/18/2007 (5:19 pm)
The FogCalc helper class?

Quote:
if ( !activePoints.test( index ) )
{
   activePoints.set( index );

   mPoints[index].fogCoord = fogCalc.CalcFC( mPoints[index].point );

   AssertFatal(mPoints[index].fogCoord >= 0.0f, "Error, neg fog coord!");
}

Previously this stuff was calling the scenegraph to do these calcs.
#38
01/18/2007 (5:31 pm)
Right now that code's just been moved into CalcFC, which does basically the same work every time you call it.

Quote:
inline F32 CalcFC( const Point3F &point ) const
{
   return( sState->getHazeAndFog( mFabs( distPlane.distToPlane(point) ) + fc_distOffset,
                                  fc_newWorldZ + mDot(point, osZVec) ) );
}

still calling into the scenestate to get the result.

the only change I really see (for this code path at least) is that fc_newWorldZ is precalculated for you. and of course you've done good software engineering by localizing your code to a function and not having to repeat it making everything else more readable :)
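For anyone following along, here is a small sketch of the two forms being compared. It assumes fc_newWorldZ is just worldZ - worldP.z computed once, which is inferred from the two lines quoted above rather than taken from the actual class. `Vec3`, `Plane`, `fakeHazeAndFog` and `FogCalcSketch` are hypothetical stand-ins for Point3F, PlaneF, SceneState::getHazeAndFog() and the helper class, and the fog formula is invented purely so the sketch runs.

```cpp
#include <cassert>
#include <cmath>

// Minimal stand-ins for Point3F and PlaneF (assumptions for this sketch).
struct Vec3 { float x, y, z; };

inline float dot3(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

struct Plane
{
    Vec3 normal;
    float d;
    float distToPlane(const Vec3& p) const { return dot3(normal, p) + d; }
};

// Placeholder for SceneState::getHazeAndFog(); NOT the real calculation.
inline float fakeHazeAndFog(float dist, float deltaZ)
{
    return dist * 0.01f + std::fabs(deltaZ) * 0.001f;
}

// Direct form: the "+ worldZ - worldPz" part is redone for every vertex.
inline float fogCoordDirect(const Vec3& point, const Plane& distPlane,
                            float distOffset, const Vec3& osZVec,
                            float worldZ, float worldPz)
{
    return fakeHazeAndFog(std::fabs(distPlane.distToPlane(point)) + distOffset,
                          (dot3(point, osZVec) + worldZ) - worldPz);
}

// Helper form: worldZ - worldPz is folded into one member up front, so each
// call saves a single subtraction and nothing else.
class FogCalcSketch
{
public:
    FogCalcSketch(const Plane& distPlane, float distOffset,
                  const Vec3& osZVec, float worldZ, float worldPz)
        : mDistPlane(distPlane), mDistOffset(distOffset),
          mOsZVec(osZVec), mNewWorldZ(worldZ - worldPz) {}

    float calcFC(const Vec3& point) const
    {
        const float fc = fakeHazeAndFog(
            std::fabs(mDistPlane.distToPlane(point)) + mDistOffset,
            mNewWorldZ + dot3(point, mOsZVec));
        assert(fc >= 0.0f); // mirrors the "neg fog coord" check in the caller
        return fc;
    }

private:
    Plane mDistPlane;
    float mDistOffset;
    Vec3 mOsZVec;
    float mNewWorldZ;
};
```

Both forms still make the same call into the fog lookup per vertex, so if that call dominates, the helper is mostly a readability change on this path.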
#39
01/18/2007 (5:39 pm)
Hmm. Hey Andy, you shouldn't use the FC path, it's gonna be way slow. :P

CalcTextured is the route that will be fast and the one I was referring to.
#40
01/18/2007 (6:07 pm)
@Ben: lol - I was just following orders!

Adding the FC calc to the class really cleans up the doFogActive function and keeps it in sync with the textured fog calc == easier to understand and maintain.

As I mentioned, I cannot really test/optimize the FC stuff easily, but it should not be slower. Just took a look at it and if I manually inline + jigger getHazeAndFog(), I can see a way to get at least a small savings... Let me see if I can get the FC stuff working again on my Mac, then I can revisit this.