Game Development Community

dev|Pro Game Development Curriculum

Multi-processor parallel programming in TGE

by Duncan Gray · 05/15/2007 (11:58 am) · 23 comments

Download Code File

Latest Update
I have finally upgraded to a duel core AMD Athlone system and done some further testing of this resource.

The benchmark test in the resource has a minor bug where the single CPU test only does 1000 iterations instead of 10000. Set it to 10000 and you get a fair comparison between single and multi core times.

Here is my benchmark test results:
Parallel Processing Actvated on 2 CPU's
Setting single thread modet
Benchmark test took 187 milliseconds in single thread mode
Parallel Processing activated
Benchmark test took 94 milliseconds with 2 threads
Best performance [94] was with 2 threads
setting thread pool accordingly

So data parallelism does work. But trying to apply it to the two sections of TGE (skinning, vehicle wheel physics) actually causes a 3-5 % decrease in frame-rate which is bad, not what was expected.

Here is what I believe is happening. This threading model works by releasing threads into the computational loop and then halting the main game thread to wait for the computational threads to complete. It takes a bit of time and CPU cycles to turn threads on and off When the computational threads are done, this model releases the main game thread but again it takes a tiny bit of time for things to get going in the main thread.

All these delays add up to a 'cost' in time which is brought on by using threads. To overcome this 'cost' the computational loops need to consume more time that the time lost waiting while the O/S goes off to attend to other threads. So the benchmark test has 10000 iterations and the results look good because it consumes enough CPU time to outweigh the thread wait time by a large degree.

The CPU time of the computational loops tested in TGE are just to short to be useful candidates for data-parallelism. Either that or there is some bug I missed.


The original resource follows below:
--------------------------------------------------------------------------------------------------------------------------------------

Note: At the time of writing, I don't own a multi-processor computer. I have thus depended on feedback from other community members to validate this.

Warning: this can be complex, lengthy explanations to follow.

Installing
Unzip the zip file in you engine folder.
You need to add 4 new files to your project before compiling. (platformMultiprocess.h, platformMultiprocess.cc, winMultiprocess.cc, x86UNIXMultiprocess.cc, macCarbMultiprocess.cc)

The changes to main.cc was simply a one line call to instanciate a global instance of the multiprocess class
The changes to consolObject.h was just to derive it from MP_base.

I have added the necessary Mac and Linux code but have only compiled it on Windows and Linux.

Design Overview

The MultiProcess class pre-creates a thread pool, one thread per CPU. Computational loops are then circulated through this multi-processor class so that loop cycle's are shared among different CPU's which should result in the overall loop computation time being divided by the number of CPU's, resulting in a significant increase in performance.

The threads, being pre-created, don't add any time overhead. But two semaphores and a mutex are required to keep things in order and these will add a tiny bit of overhead.

For that reason, the class uses a non threaded loop if it detects a single CPU which means you can add this to the engine with zero negative impact on single CPU machines but significant performance boost on multi CPU machines.


Used within TGE

The zip file contains all the code plus examples of using the threaded loop concept in the TSSkinMesh::updateSkin method as well as the WheeledVehicle::UpdateForces method.

updateSkin() is a good choice because the standard code goes through each player mesh vertex (about 3000 for LOD 0) and does a bone matrix multiplication on each. That's 3000 computations (per frame!) which can be shared among all available CPU's to save time.

The WheeledVehicle::UpdateForces method does intensive and drawn out computations for each wheel of the vehicle. Imagine if you have 4 CPU's and each handles the computations for a wheel...

Where else can you use this in the engine? Any place which has a loop (for, while etc) containing enough 'work' to justify sharing that work load among various CPU's
TSanimate.cc contains many such loops which cycle through each bone and applies translations, rotations, scales to each for each frame. Collisions cycle through each poly of a collision hull. AI cycles through it's list of options and chooses the best solution etc. I'm sure there are plenty more.

Where can't you use it? Any loop which has to be processed in a definite order. (Think about it)

What precautions to look out for
Any loop where some part of each cycle of the loop depends on data from the previous cycle.
global and static variables
Loops must not contain a 'break' statement
int cumulative_result = 0;
int result=0;
for(int j=0;j<something;j++)
{
    [b]result[/b] = do_some_maths(data[j]);
    [b]cumulative_result += result;[/b]
}
If the above loop was handled in a data parallelism manner, it would give incorrect results because there are two or more threads running through that loop.
Below is how you fix that to make it work.
int cumulative_result = 0;
[b]_mutex = createmutex();[/b]
for(int j=0;j<something;j++)
{
    [b]int[/b] result = do_some_maths(data[j]);
    [b]_mutex.aquire();[/b]
    cumulative_result += result;
    [b]_mutex.release()[/b]
}
The declaration of the 'result' variable was moved into the threaded area so that each thread would have its own instance of that variable to prevent threads from overwriting the 'result' of another thread.
The cumulative_result variable was kept global (with respect to the threads) but it was made thread safe with a mutex set.
If you don't understand why the above was necessary then you need to study up on threaded programming.
It's not rocket science but you do need to think in parallel or you are going to trip up.

The above bit of code is a poor example of reality but it's ok for instructional purposes. In the example TGE code in the zip file, it was not necessary to use any mutex locking. Avoid such mutex locking it if you can but sometimes it's necessary.

Detailed Explanation

The multiprocess class actually consists of two classes. A per instance base class and a single instance 'working' class. The base class, called MP_base is a pure virtual class contain a pure virtual 'processThread' method which derived classes must implement. This means you need to derive the relevant TGE base class from MP_base and then implement the 'processThread' method as and where you need it.

The single instance multiprocess class then calls the 'processThread' method of the class instance which needs to run the loop.

The concept is that the above loop contents gets put into it's own private method within the class it came from, as shown below.
myclass::MP_process1(S32 counter, void* anyOtherData)
{
    [b]int[/b] result = do_some_maths(data[counter]);
    [b]_mutex.aquire();[/b]
    cumulative_result += result;
    [b]_mutex.release()[/b]
}
Then the for(int i=0;i mp.start(this, loopStartValue, loopEndValue,&anyDataOrStructureYouMayNeed);

mp is the single instance of the multiprocess class. The first parameter is always 'this' because mp needs to know who called it. Next two parameters are for loop control and the final parameter is a void pointer for passing anything you may need.

In the wheeledvehicle example you will see a call which looks as follows
// Multi-Process load method pointer and call mp.start()
   pMethod = &WheeledVehicle::MP_updateForces1;
   mp.start(this,0,wend - mWheel,&dt);
The method pointer was introduced to speed up the repetitive function call in the processing loop. Instead of repeatedly checking a int to determine which method to run, the loop just calls the execute() method which in turn calls whatever method the pMethod pointer has been loaded with.
The wheeledvehicle header file has the following additions
//--- Multi-process additions ------------------------------------------------------------------------------
  void (WheeledVehicle::*pMethod)(S32 counter, void* data ); // pointer to a method in THIS class
  virtual void execute(S32 counter, void* data){(this->*pMethod)(counter, data );} // execute the above method
  void MP_updateForces1(S32 counter, void *data);
The fist line is a method pointer definition. You will need to replace WheeledVehicle with the respective class name where you are implementing the method pointer. i.e. if you are going to add a similar method pointer to the Vehicle class then it will be void (Vehicle::*pMethod)(S32 counter, void* data );
The second line is the implementation of the base class pure virtual execute() method. Depending on your setup, you may also need to add a Parent::execute(counter, data) call to the implementation.
The third line is the method which the multiprocess class will finally call to do the work. These methods must all be of a form which takes 2 parameters (S32, void*)

The MultiProcess class constructor will get a CPU count and if more than one is found, it creates a thread pool equal to the CPU count. These threads then run smack bang into a blocking semaphore forcing them to wait for further instructions.

If the CPU count was one, then no threads are created. For performance reasons, a method-pointer is used to store the appropriate loop implementation (threaded or not).

The 'start' method is what the TGE derived class instance will call.

The 'start' method calls the method-pointer to run either a same-thread loop or a multi-threaded loop.

If the startThreaded method was called then the blocking semaphores are released while the main thread waits for completion of the released threads.

The processThreaded method is designed as for-ever loop which just calls the base class instance processThread method for the required number of loop cycles.

Simple.... :)

Other ways of getting CPU count
Here
Here

www.ultrabizz.com/images/Multi_process_resource.jpg
Page «Previous 1 2
#1
05/15/2007 (7:18 am)
Great, i will implement and test it tomorrow ;)
Thx for a nice and useful ressource!
#2
05/17/2007 (7:39 pm)
I have update the windows section, can someone please give feedback.
#3
05/18/2007 (10:13 pm)
Coded updated for TGE 1.52.
Also added 2 console methods:

MP_setThreaded();
MP_setSingle();

so that you can compare modes without swapping EXE's
#4
05/19/2007 (8:02 pm)
Updated with linux and mac changes. It actually compiles on linux SUSE 10.2 now.
#5
05/22/2007 (2:42 pm)
I will be doing another update in a day or two to squeese out a bit more performance.
#6
05/22/2007 (3:58 pm)
I'm trying to try out your code in Linux. I've just managed to get TGE to build and pretty much run (involved rebuilding an X library to get it to link!). It's not quite right, though, yet...
#7
05/22/2007 (4:12 pm)
Not many have commented on this, but good frameworks that make it easy to take selective advantage of multicore CPUs is critical for the upcoming shift in game development. Both the Saints Row team and Valve had great talks on how they approached multicore in their projects. This resource is closer to how Valve approached things from what i can tell.

Another thing to look into Duncan is lockless programming techniques to remove the semaphores and mutex overhead.
#8
05/22/2007 (9:30 pm)
Running into a little roadblock building on Linux. Don't know if it's my ignorance (most likely) or something else:
--> Compiling ts/tsMesh.cc
In file included from ./dgl/materialList.h:10,
                 from ./ts/tsMesh.h:22,
                 from ts/tsMesh.cc:6:
./dgl/gTexManager.h:428:26: warning: no newline at end of file
ts/tsMesh.cc: In member function `virtual void TSMesh::render(S32, S32, TSMaterialList*)':
ts/tsMesh.cc:317: error: 'struct TSShapeInstance::RenderData' has no member named 'lightMapMethod'
ts/tsMesh.cc:317: error: `LIGHT_MAP_MULTI' is not a member of `TSShapeInstance'
ts/tsMesh.cc:321: error: 'struct TSShapeInstance::RenderData' has no member named 'lightMapTE'
ts/tsMesh.cc: In static member function `static void TSMesh::initMaterials()':
ts/tsMesh.cc:373: error: 'struct TSShapeInstance::RenderData' has no member named 'lightMapMethod'
ts/tsMesh.cc:378: error: 'struct TSShapeInstance::RenderData' has no member named 'lightMapTE'
ts/tsMesh.cc:640: error: `LIGHT_MAP_MULTI' is not a member of `TSShapeInstance'
ts/tsMesh.cc: In static member function `static void TSMesh::resetMaterials()':
ts/tsMesh.cc:783: error: 'struct TSShapeInstance::RenderData' has no member named 'lightMapMethod'
ts/tsMesh.cc:783: error: `LIGHT_MAP_MULTI' is not a member of `TSShapeInstance'
ts/tsMesh.cc:786: error: 'struct TSShapeInstance::RenderData' has no member named 'lightMapTE'
ts/tsMesh.cc:792: error: 'struct TSShapeInstance::RenderData' has no member named 'lightMapTE'
ts/tsMesh.cc: In static member function `static void TSMesh::setMaterial(S32, TSMaterialList*)':
ts/tsMesh.cc:1029: error: 'struct TSShapeInstance::RenderData' has no member named 'lightMapMethod'
ts/tsMesh.cc:1029: error: `LIGHT_MAP_MULTI' is not a member of `TSShapeInstance'
ts/tsMesh.cc:1032: error: 'struct TSShapeInstance::RenderData' has no member named 'lightMapTE'
ts/tsMesh.cc:1034: error: 'class TSMaterialList' has no member named 'getLightMap'
ts/tsMesh.cc: In member function `void TSMesh::renderLightMap(S32, S32, TSMaterialList*)':
ts/tsMesh.cc:1455: error: 'struct TSShapeInstance::RenderData' has no member named 'lightMapTE'
ts/tsMesh.cc:1456: error: 'class TSMaterialList' has no member named 'getLightMap'
ts/tsMesh.cc:1461: error: 'struct TSShapeInstance::RenderData' has no member named 'lightMapTE'
ts/tsMesh.cc:1464: error: 'struct TSShapeInstance::RenderData' has no member named 'lightMapTE'
ts/tsMesh.cc:3436:2: warning: no newline at end of file
make[1]: *** [out.GCC4.RELEASE/ts/tsMesh.obj] Error 1
make: *** [default] Error 2
#9
05/22/2007 (9:44 pm)
I been out so sorry for lack of response.

The update I plan to do unvolves using method pointers to remove the switch/case statement requirement [switch( processNum )] which will also eliminate the need for the TSSkinMesh::processThread method as the multiprocess class will then just call the method pointer passed to it by the calling thread. It will make more sense when you see it.

@Tom, thanks, I'll look into that soon. I was thinking about splitting the thread loop into 2 or 4 etc according to how many processors there are and have each cpu handle that loop segment which will eliminate the need for the mutex around the counter. But I'll check out that link and see how they do things.

@Lee, thanks for trying a linux build. Mine compile OK. I'll get back to you in a bit.
#10
05/22/2007 (10:27 pm)
@Lee, I just re-checked and my Linux version compiles without a hitch. Are you using 1.5.2? I know 1.5.1 had many Linux compile problems.

Edit

Oh, did you add the new files to the engine\targets.torque.mk file? I'll add mine to the zip file to make it easier.
#11
05/23/2007 (3:00 pm)
I was on 1.5.0. I think I'm downloading 1.5.2 right now. Although the download says it's 1.5.0...
Actually it was my own game, a combination of 1.4/5. So I'll try it with vanilla 1.5.2.
#12
05/23/2007 (3:06 pm)
I don't think GG updated the linux download link, if it does not say its version 1.5.2 then it's probably not.

I took the windows 1.5.2 download, extracted it then copied the files to Linux and compiled without any errors, just the usual "no new line" warnings.
#13
05/23/2007 (4:37 pm)
Yep, that was it. Built stock 1.5.2 and it was fine. Added your stuff, and it builds but segfaults :-(
Built it debug, and got a whole bunch of identical: Fatal: (core/dataChunker.cc @ 27) Data chunk too large.

I did carefully check to make sure all your files were in place, and used your targets.torque.mk, for what it's worth.
#14
05/23/2007 (4:46 pm)
My Linux box has no 3d graphic card or monitor, its basically a server setup so I just test compiles on it.

I will try to setup a remote DDD debug session by running an X-windows setup on my XP box and see if I can duplicate and debug the segfault, but it's a mission and could take a while.

The Linux forums have many eager to please experts who may be able to help. Perhaps make a post there in the meantime. USA based guys can log in remotely and see whats going on.
#15
05/23/2007 (7:28 pm)
I just updated the code with the changes to use method pointers rather than a switch/case setup. It should be a tiny bit faster. I have not got round to looking at lock free programming yet.
#16
05/26/2007 (5:11 pm)
Updated to included a built in benchmark test

The test runs automatically when the multi-process class is created. Check the very top of your log file for the results (or the console window)

You can also re-run the test by typing MP_doBenchmark() in the console window.
#17
05/26/2007 (8:33 pm)
Added MP_forceThreads(count) console method to give more flexibility to test options

Try doing MP_forceThreads(32) followed by
MP_doBenchmark()
#18
05/28/2007 (10:39 am)
Duncan, very nice work! Your time and effort on making this happen is very much appreciated!
#19
07/04/2007 (2:47 pm)
Is there a better location to try and do a benchmark or such? I've put this into TGEA and doing it at the beginning of execution isn't the best, as the results are pretty bad for dualcore..

Doing it in console via dobenchmark/forcethreads gives better results
#20
07/04/2007 (6:14 pm)
Jeremiah I've stopped working on this because

(a) The results of other peoples test are rather inconclusive. It worked for some but not for others meaning if you put this in a game, it won't be useful for all CPU's

(b) I don't have a duel core so can't test/develop it further myself.

I suggest you look at www.openmp.org Its very similar in concept and is supposed to work with GCC and VC2005 Pro compilers
Page «Previous 1 2