Game Development Community

Optimizing on the Mac

by Kyle Goodwin · 09/09/2004 (10:54 am) · 9 comments

When you get right down to it, Torque is badly underoptimized for the Mac. Macs are not slow, and it's a shame that Torque runs so slowly on them by default. The following optimizations will give you a leg up on Mac speed improvements, including altivec, PPC assembly, and compiler optimizations.

Compiler Optimizations and Altivec

The most important and most easily implemented optimizations for the Mac are done by gcc itself. On the G4, the defaults aren't horrible, but they can be improved. On the G5, the defaults are utterly horrible; just changing the flags will give quite a nice gain.

First, a little background on how PowerPC processors work and why the default options aren't optimal. PowerPC processors since the G4 (and IBM's special version of the G3) have included a vector engine called either altivec or the Velocity Engine, depending on whose marketing department you ask. We'll call it altivec. Altivec operates on 128-bit vectors composed of 8-bit, 16-bit, or 32-bit values (and with some hackery you can get it to crunch 128-bit single values for you).

This is important because many things in 3D graphics are done 2-4 times over, since our vectors, colors, etc. have 2-4 components (uv, xyz, xyzw, rgb, rgba, etc.). With altivec you can, for example, take one rgba (at 32 bits per component) and add it to another rgba in one step rather than the 4 steps it would take without altivec. The way to do this is to put each component in a slot of a vector and then use the altivec operations like vec_add() to operate on them. The vec_ calls generally correspond 1-to-1 with assembly instructions, so it's unnecessary to go to the trouble of actually writing most altivec code in assembly.

The big thing to realize here, though, is that it can take some work to recode things to make use of altivec. The reason I discuss it here is that it has a side effect: PowerPC in general is very alignment-sensitive, and altivec *requires* 16-byte alignment. This leads us back to our discussion of compiler flags and the best thing you can do to speed up your Mac code easily: tell gcc to properly align values/instructions in memory.
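
To make the rgba example concrete, here is a minimal sketch of adding two four-component colors. The Color4 type and color_add() are names made up for illustration, not Torque code. The scalar branch is only there so the same file still builds when altivec is not enabled. I'm assuming a GCC-style compiler and that <altivec.h> is available whenever __VEC__ is defined.

```c
#include <assert.h>
#include <stdint.h>

/* __VEC__ is defined when AltiVec support is enabled. Assumption: the
   header is present whenever that macro is; Apple's -faltivec builds the
   vec_ operations into the compiler, other GCC setups need the header. */
#if defined(__VEC__)
#include <altivec.h>
#endif

/* Hypothetical RGBA color with 32-bit float components. aligned(16)
   keeps every Color4 on a 16-byte boundary. */
typedef struct {
    float c[4];
} __attribute__((aligned(16))) Color4;

/* out = a + b, component by component. */
void color_add(Color4 *out, const Color4 *a, const Color4 *b)
{
    /* AltiVec loads and stores ignore the low four address bits, so a
       misaligned pointer would silently read the wrong data. */
    assert(((uintptr_t)a & 15) == 0);
    assert(((uintptr_t)b & 15) == 0);
    assert(((uintptr_t)out & 15) == 0);
#if defined(__VEC__)
    /* One vector add covers all four components. */
    vec_st(vec_add(vec_ld(0, a->c), vec_ld(0, b->c)), 0, out->c);
#else
    {
        int i;
        for (i = 0; i < 4; ++i)      /* four separate scalar adds */
            out->c[i] = a->c[i] + b->c[i];
    }
#endif
}
```

Both branches produce the same result; the only difference is how many instructions it takes to get there.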

Altivec Flags

In order to make use of altivec you must use the following flag. It defines __VEC__, which you can check for in order to enable or disable altivec optimizations in your code.

-faltivec
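
As a rough sketch of checking __VEC__, the macro can be folded into a single project-level switch. USE_ALTIVEC and simd_path_name() are hypothetical names, not anything Torque defines.

```c
/* Hypothetical project-level switch derived from __VEC__, so the rest of
   the code tests one name of our own choosing. */
#if defined(__VEC__)
#define USE_ALTIVEC 1
#else
#define USE_ALTIVEC 0
#endif

/* Reports which code path this binary was built with, e.g. for a
   startup log line. */
const char *simd_path_name(void)
{
#if USE_ALTIVEC
    return "altivec";
#else
    return "scalar";
#endif
}
```

Printing simd_path_name() at startup is a cheap way to confirm that a build really picked up the flag.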

Alignment Flags

Unfortunately I'm not as knowledgeable about the G4 internals as I am about the G5 internals, and they are quite different chips. The following are the flags I use for alignment on the G5. They should help on the G4 as well, but I'm not sure exactly how much.

-falign-loops=16 -falign-functions=16 -falign-labels=16 -falign-jumps=16
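
The -falign-* flags above control where gcc places code. Data that altivec touches needs 16-byte alignment as well, and that is requested in the source rather than on the command line. A small sketch using GCC's aligned attribute follows; g_vertices, vertex_at() and is_aligned16() are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero when p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical vertex buffer of 64 xyzw groups. aligned(16) asks GCC to
   start the array on a 16-byte boundary. */
float g_vertices[256] __attribute__((aligned(16)));

/* Returns the i-th xyzw group. Each group is 4 floats = 16 bytes, so if
   the array start is aligned, every group is aligned too. */
float *vertex_at(unsigned i)
{
    float *v = &g_vertices[i * 4];
    assert(i < 64 && is_aligned16(v));
    return v;
}
```

Indexing by whole groups, as vertex_at() does, is what keeps the alignment intact; a pointer to g_vertices[1] would not be usable for an aligned vector load.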

Other Optimization Flags

You should always use -O3 when compiling for release. In addition, you should produce separate G4 and G5 binaries (I don't treat the G3 here since it is largely obsolete).

G5: -mcpu=970 -mtune=970 -mpowerpc64
G4: -mcpu=G4 -mtune=G4

The G5 implements various things, like square root, in hardware. Use the following flags to enable these and similar optimizations on the G5.

-mpowerpc-gpopt -ffast-math

To further improve your G4/G5 builds you can use the following flags, which enable various optimizations that generally increase binary size but decrease run time.

-funroll-loops -finline -fobey-inline -malign-natural

Lastly, the load-multiple and store-multiple instructions are costly on the G5, so disable them with the following flag.

-mno-multiple

Altivec

In addition to being able to fold multiple calculations into one, altivec code also executes on its own part of the CPU and thus can run in parallel with normal CPU operation. This is the second big gain with altivec. To take full advantage of it, you want to touch vectors with normal CPU operations as little as possible so that the two can run in parallel. There is a very good treatment of the basic altivec C instructions on the Apple Developer Connection here: http://developer.apple.com/hardware/ve/instruction_crossref.html
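
One way to read "touch vectors as little as possible" is sketched below: the scalar is moved into a vector once, outside the loop, and everything inside the loop stays in the vector unit. scale_add() is a hypothetical helper, and as before I'm assuming <altivec.h> is available whenever __VEC__ is defined; the scalar branch keeps the file buildable elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__VEC__)
#include <altivec.h>
#endif

/* y[i] = s * x[i] + y[i] for n floats.
   Assumes x and y are 16-byte aligned and n is a multiple of 4. */
void scale_add(float *y, const float *x, float s, size_t n)
{
    size_t i;
    assert(n % 4 == 0);
    assert(((uintptr_t)x & 15) == 0 && ((uintptr_t)y & 15) == 0);
#if defined(__VEC__)
    {
        /* Copy the scalar into all four lanes once, through memory,
           before the loop starts. */
        float tmp[4] __attribute__((aligned(16)));
        vector float vs;
        tmp[0] = tmp[1] = tmp[2] = tmp[3] = s;
        vs = vec_ld(0, tmp);
        for (i = 0; i < n; i += 4) {
            vector float vx = vec_ld(0, x + i);
            vector float vy = vec_ld(0, y + i);
            vec_st(vec_madd(vs, vx, vy), 0, y + i);  /* s*x + y */
        }
    }
#else
    for (i = 0; i < n; ++i)
        y[i] = s * x[i] + y[i];
#endif
}
```

The only scalar-side work left in the vector branch is the loop counter and address arithmetic, which the integer units can handle while the vector unit does the math.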

Soon to be included in the HEAD code is my altivec rewrite of the texture blender. My profiling and research showed that this was the limiting factor on the Mac (it was written in assembly for MMX/SSE/SSE2 on the PC but only in unoptimized C for the Mac), so I rewrote it to use altivec when available. This single improvement plus the glFinish() optimization mentioned below will improve framerates by a factor of 2-5 in most cases; my average framerate on my G5 with a Radeon 9600 went from 20 to about 80-120. Feel free to email me for info on getting this into your code before it appears in HEAD if you're in a time crunch.
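
The blender rewrite itself isn't shown here, and the following is not that code. It is only a hypothetical illustration of why per-texel work suits altivec: a 50/50 blend of two runs of 8-bit texel data, 16 bytes per instruction. blend_half() is a made-up name, and I'm again assuming <altivec.h> is available whenever __VEC__ is defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__VEC__)
#include <altivec.h>
#endif

/* dst[i] = rounded average of a[i] and b[i] for n bytes of texel data.
   Assumes all pointers are 16-byte aligned and n is a multiple of 16. */
void blend_half(unsigned char *dst, const unsigned char *a,
                const unsigned char *b, size_t n)
{
    size_t i;
    assert(n % 16 == 0);
    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)a & 15) == 0 && ((uintptr_t)b & 15) == 0);
#if defined(__VEC__)
    for (i = 0; i < n; i += 16) {
        vector unsigned char va = vec_ld(0, a + i);
        vector unsigned char vb = vec_ld(0, b + i);
        /* vec_avg computes (a + b + 1) >> 1 on 16 bytes at once. */
        vec_st(vec_avg(va, vb), 0, dst + i);
    }
#else
    for (i = 0; i < n; ++i)   /* same rounding, one byte at a time */
        dst[i] = (unsigned char)((a[i] + b[i] + 1) >> 1);
#endif
}
```

With RGBA texels at 8 bits per channel, each pass through the vector loop handles four whole texels.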

A Final Word

Parallelism, generally, is the biggest key to improving performance on the Mac. We want to parallelize the GPU operations, CPU operations, and altivec operations. Keeping the CPU and altivec operations in parallel was discussed above; keeping the GPU in parallel is done by removing the call to glFinish() in the engine. The call is unnecessary, since the drivers take care of synchronization themselves, and it forces the CPU and GPU to sync, which is a bad thing. Basically, the only reason to have it is for "accurate" CPU benchmarking using FPS.
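
Here is a rough sketch of making the blocking call optional at the end of a frame. This is not the engine's code: FrameHooks, end_frame(), default_sync_policy() and the stub functions are all hypothetical, and the hooks are plain function pointers so the example compiles without OpenGL headers. In a real build, finish would wrap glFinish() and swap_buffers would wrap the platform's buffer swap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical frame hooks standing in for the real GL calls. */
typedef struct {
    void (*finish)(void);        /* blocks until the GPU is idle */
    void (*swap_buffers)(void);  /* presents the finished frame */
} FrameHooks;

/* Returns nonzero if the CPU should wait for the GPU every frame.
   __APPLE__ is predefined by Apple's compilers. */
int default_sync_policy(void)
{
#if defined(__APPLE__)
    return 0;   /* let the CPU and GPU run in parallel */
#else
    return 1;   /* keep the blocking call elsewhere */
#endif
}

/* Ends a frame, waiting for the GPU only when sync is nonzero. */
void end_frame(const FrameHooks *hooks, int sync)
{
    assert(hooks != NULL && hooks->swap_buffers != NULL);
    if (sync && hooks->finish != NULL)
        hooks->finish();
    hooks->swap_buffers();
}

/* Stand-in hooks that only count calls, so the sketch can be exercised
   without a GL context. */
int g_finish_calls = 0;
int g_swap_calls = 0;
void stub_finish(void) { ++g_finish_calls; }
void stub_swap(void)   { ++g_swap_calls; }
```

A render loop would call end_frame(&hooks, default_sync_policy()) once per frame. Keeping sync as a parameter means a benchmark run, or a platform that still needs the wait, can pass 1 explicitly.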

#1
09/09/2004 (9:16 am)
(One note about the glFinish call - on the PC version of Torque, it's necessary to keep the D3D layer in sync. Be careful if you get rid of it!)

Good work, and interesting read, Kyle. Hopefully we can apply some of this stuff to HEAD!
#2
09/09/2004 (10:19 am)
Thanks! That's really good to know. I'll amend my changes to only disable glFinish on the Mac. I've never really done any testing on Windows through D3D.
#3
09/10/2004 (9:41 am)
Great article Kyle :)
#4
11/20/2004 (10:54 pm)
@Ben - I know this is an old thread, but can you be more specific regarding the glFinish call?
#5
11/21/2004 (12:26 pm)
@Joe: I guess he means that in DirectX there's an equivalent to glFinish that HAS to be called, so removing glFinish would stop Direct3D from working. But that's just my guess based on his wording, not on personal experience.
#6
11/22/2004 (10:50 am)
It's just that when translating OpenGL to D3D you have to make sure the D3D stuff is actually getting done before you throw more at it from more OpenGL calls, so the translator can know a portion is finished when glFinish is called and sort of flush the D3D queue. I'm not sure if this is even remotely right, but it's my best guess.
#7
11/24/2004 (2:19 pm)
To note, the next version of Torque will have calls to glFinish ifdef'd for Mac (and probably linux). After profiling, a significant portion of the time spent in OpenGL is tied up in calls to glFinish(), so this is an easy, worthwhile optimization for the engine.
#8
11/29/2004 (5:09 pm)
glFinish will block until all pending GL commands are finished (i.e., the call will only return when everything has been drawn; see man glFinish). This is a bad thing to do as you lose one of the main advantages of having a GPU do the drawing (asynchronous execution between the GPU and CPU).

It sounds like it was being used to emulate the behavior of the DX device method EndScene. This is not the best approach for an OpenGL system (please forgive my complete lack of knowledge of the engine as I just got it today and have not really had a chance to look at it yet. I am just browsing the forums trying to get acclimated :).

The standard GL way to signify that you have finished drawing into a context is to call glFlush. This call will return immediately (or nearly so) and allows the driver/card to do the right thing in the best way it knows how while you are busy preparing the next frame.

With a double-buffered context for drawing to the screen (whether fullscreen or windowed), and assuming you are using AGL (from what I have been able to gather so far, I think Torque is using AGL), you would instead call aglSwapBuffers, which performs an implicit glFlush and will swap the contents of the back buffer (after all commands have finished) to the front buffer to display on screen. Other Mac GL layers (CGL or NSOpenGL) have similar mechanisms.

In the case where you are rendering into a buffer (whether a pbuffer or an offscreen context) and you want to use the results of that render as a texture for another context, you should also call glFlush before switching contexts to make sure the texture will contain the finished results of the render.

Hope this helps.
#9
12/06/2005 (2:13 pm)
Thanks for the tips, Kyle. I'm a PC user and just bought my first Mac last week. Several of your suggested optimizations are set in the XCode project for TGE 1.4, but I added the few that were missing. I have an iMac G5 2.1 GHz and at 1024*768 I get 70-120 FPS in the fps and racing demos. Nice and speedy. Thanks!