Multi-processor parallel programming in TGE
by Duncan Gray · 05/15/2007 (11:58 am) · 23 comments
Download Code File
Latest Update
I have finally upgraded to a dual core AMD Athlon system and done some further testing of this resource.
The benchmark test in the resource has a minor bug where the single CPU test only does 1000 iterations instead of 10000. Set it to 10000 and you get a fair comparison between single and multi core times.
Here are my benchmark test results:
Parallel Processing Activated on 2 CPU's
Setting single thread mode
Benchmark test took 187 milliseconds in single thread mode
Parallel Processing activated
Benchmark test took 94 milliseconds with 2 threads
Best performance [94] was with 2 threads, setting thread pool accordingly
So data parallelism does work. But trying to apply it to the two sections of TGE (skinning, vehicle wheel physics) actually causes a 3-5% decrease in frame-rate, which is the opposite of what was expected.
Here is what I believe is happening. This threading model works by releasing threads into the computational loop and then halting the main game thread to wait for the computational threads to complete. It takes a bit of time and CPU cycles to turn threads on and off. When the computational threads are done, this model releases the main game thread, but again it takes a tiny bit of time for things to get going in the main thread.
All these delays add up to a 'cost' in time which is brought on by using threads. To overcome this 'cost' the computational loops need to consume more time than the time lost waiting while the O/S goes off to attend to other threads. So the benchmark test has 10000 iterations and the results look good because it consumes enough CPU time to outweigh the thread wait time by a large degree.
The CPU time of the computational loops tested in TGE is just too short for them to be useful candidates for data-parallelism. Either that or there is some bug I missed.
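You can reproduce this effect outside TGE. The sketch below uses modern std::thread rather than the resource's platform code, and serialWork/threadedWork are illustrative names: the same arithmetic is done serially and split across two threads, and because the thread spawn/join cost is paid on every call, small iteration counts come out slower threaded.

```cpp
#include <cmath>
#include <thread>

// The same work, computed serially...
double serialWork(int iterations) {
    double total = 0.0;
    for (int i = 0; i < iterations; ++i)
        total += std::sqrt(static_cast<double>(i));
    return total;
}

// ...and split across two threads. The spawn/join cost is paid on every
// call, so for small iteration counts this is slower than serialWork
// even though each thread only does half the arithmetic.
double threadedWork(int iterations) {
    double partial[2] = {0.0, 0.0};
    const int half = iterations / 2;
    auto worker = [](int begin, int end, double* out) {
        double t = 0.0;
        for (int i = begin; i < end; ++i)
            t += std::sqrt(static_cast<double>(i));
        *out = t;
    };
    std::thread t0(worker, 0, half, &partial[0]);
    std::thread t1(worker, half, iterations, &partial[1]);
    t0.join();
    t1.join();
    return partial[0] + partial[1];
}
```

Timing both with std::chrono for increasing iteration counts shows a break-even point much like the one described above; below it the serial version always wins.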
The original resource follows below:
--------------------------------------------------------------------------------------------------------------------------------------
Note: At the time of writing, I don't own a multi-processor computer. I have thus depended on feedback from other community members to validate this.
Warning: this can get complex; lengthy explanations follow.
Installing
Unzip the zip file in your engine folder.
You need to add five new files to your project before compiling: platformMultiprocess.h, platformMultiprocess.cc, winMultiprocess.cc, x86UNIXMultiprocess.cc and macCarbMultiprocess.cc.
The change to main.cc is simply a one-line call to instantiate a global instance of the multiprocess class.
The change to consoleObject.h is simply to derive it from MP_base.
I have added the necessary Mac and Linux code but have only compiled it on Windows and Linux.
Design Overview
The MultiProcess class pre-creates a thread pool with one thread per CPU. Computational loops are then routed through this multi-processor class so that loop iterations are shared among the different CPUs. The overall loop computation time should thus be divided by the number of CPUs, resulting in a significant increase in performance.
The threads, being pre-created, don't add any time overhead, but two semaphores and a mutex are required to keep things in order and these add a tiny bit of overhead.
For that reason, the class uses a non-threaded loop if it detects a single CPU, which means you can add this to the engine with zero negative impact on single CPU machines but a significant performance boost on multi CPU machines.
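The core idea can be sketched in a few lines of modern C++. This is illustrative only, not the resource's API: the real class uses a persistent pool woken by semaphores rather than spawning threads on every call, and parallelFor is a made-up name.

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Minimal parallel-for sketch: split [begin, end) into one contiguous
// chunk per thread, run the chunks concurrently, then join.
// Assumes numThreads >= 1.
void parallelFor(int begin, int end, int numThreads,
                 const std::function<void(int)>& body) {
    std::vector<std::thread> pool;
    const int total = end - begin;
    const int chunk = (total + numThreads - 1) / numThreads; // round up
    for (int t = 0; t < numThreads; ++t) {
        const int lo = begin + t * chunk;
        const int hi = std::min(end, lo + chunk);
        if (lo >= hi) break;                 // no work left for this thread
        pool.emplace_back([lo, hi, &body] {
            for (int i = lo; i < hi; ++i)
                body(i);                     // each index handled exactly once
        });
    }
    for (std::thread& th : pool)
        th.join();                           // caller blocks until all chunks finish
}
```

With perfectly divisible work and no overhead, each of the N threads handles total/N iterations, which is where the "computation time divided by the number of CPUs" estimate comes from.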
Used within TGE
The zip file contains all the code plus examples of using the threaded loop concept in the TSSkinMesh::updateSkin method as well as the WheeledVehicle::UpdateForces method.
updateSkin() is a good choice because the standard code goes through each player mesh vertex (about 3000 for LOD 0) and does a bone matrix multiplication on each. That's 3000 computations (per frame!) which can be shared among all available CPUs to save time.
The WheeledVehicle::UpdateForces method does intensive, drawn-out computations for each wheel of the vehicle. Imagine if you have 4 CPUs and each handles the computations for a wheel...
Where else can you use this in the engine? Any place which has a loop (for, while, etc.) containing enough 'work' to justify sharing that work load among the various CPUs.
TSanimate.cc contains many such loops which cycle through each bone and apply translations, rotations and scales for each frame. Collisions cycle through each poly of a collision hull. AI cycles through its list of options and chooses the best solution, etc. I'm sure there are plenty more.
Where can't you use it? Any loop which has to be processed in a definite order. (Think about it)
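A classic instance of such an order-dependent loop is a running total, sketched here with an illustrative function (runningTotal is not TGE code):

```cpp
#include <vector>

// A loop that MUST run in order: iteration j reads the sum produced by
// iteration j-1, so its cycles cannot be handed to different CPUs.
std::vector<int> runningTotal(const std::vector<int>& data) {
    std::vector<int> out(data.size());
    int sum = 0;
    for (int j = 0; j < (int)data.size(); ++j) {
        sum += data[j];   // depends on every earlier iteration
        out[j] = sum;
    }
    return out;
}
```

If two threads each took half of this loop, the second half would start from a sum of zero instead of the first half's total, so the answers would be wrong no matter how you lock it.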
What precautions to look out for
Any loop where some part of each cycle of the loop depends on data from the previous cycle.
global and static variables
Loops must not contain a 'break' statement
int cumulative_result = 0;
int result = 0;
for (int j = 0; j < something; j++)
{
   result = do_some_maths(data[j]);
   cumulative_result += result;
}

If the above loop was handled in a data-parallel manner, it would give incorrect results because two or more threads are running through that loop at once. Below is how you fix that to make it work.
int cumulative_result = 0;
_mutex = createmutex();
for (int j = 0; j < something; j++)
{
   int result = do_some_maths(data[j]);
   _mutex.acquire();
   cumulative_result += result;
   _mutex.release();
}

The declaration of the 'result' variable was moved into the threaded area so that each thread has its own instance of that variable, preventing threads from overwriting the 'result' of another thread. The cumulative_result variable was kept global (with respect to the threads) but it was made thread safe with a mutex.
If you don't understand why the above was necessary then you need to study up on threaded programming.
It's not rocket science but you do need to think in parallel or you are going to trip up.
The above bit of code is a contrived example but it's fine for instructional purposes. In the example TGE code in the zip file, it was not necessary to use any mutex locking. Avoid such mutex locking if you can, but sometimes it's necessary.
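One common way to avoid the mutex in an accumulation like the one above is to give each thread its own accumulator slot and combine the slots after the join. This is a sketch in modern std::thread with illustrative names, not the resource's code:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Mutex-free version of the cumulative-sum example: each thread writes
// only its own slot in 'partial', so no lock is needed anywhere.
int sumWithoutMutex(const std::vector<int>& data, int numThreads) {
    std::vector<int> partial(numThreads, 0);
    std::vector<std::thread> pool;
    const int n = static_cast<int>(data.size());
    const int chunk = (n + numThreads - 1) / numThreads;
    for (int t = 0; t < numThreads; ++t) {
        const int lo = t * chunk;
        const int hi = std::min(n, lo + chunk);
        pool.emplace_back([&data, &partial, lo, hi, t] {
            int s = 0;
            for (int j = lo; j < hi; ++j)
                s += data[j];    // private accumulator, nothing shared
            partial[t] = s;      // this thread owns slot t exclusively
        });
    }
    for (std::thread& th : pool)
        th.join();
    int total = 0;
    for (int s : partial)
        total += s;              // cheap serial combine at the end
    return total;
}
```

The serial combine at the end costs numThreads additions, which is far cheaper than acquiring a mutex once per loop iteration.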
Detailed Explanation
The multiprocess class actually consists of two classes: a per-instance base class and a single-instance 'working' class. The base class, called MP_base, is a pure virtual class containing a pure virtual 'processThread' method which derived classes must implement. This means you need to derive the relevant TGE base class from MP_base and then implement the 'processThread' method as and where you need it.
The single-instance multiprocess class then calls the 'processThread' method of the class instance which needs to run the loop.
The concept is that the loop contents above get put into their own private method within the class they came from, as shown below.
myclass::MP_process1(S32 counter, void* anyOtherData)
{
   int result = do_some_maths(data[counter]);
   _mutex.acquire();
   cumulative_result += result;
   _mutex.release();
}

The for(int j = 0; j < something; j++) loop is then replaced with a call of the form:

mp.start(this, loopStartValue, loopEndValue, &anyDataOrStructureYouMayNeed);

mp is the single instance of the multiprocess class. The first parameter is always 'this' because mp needs to know who called it. The next two parameters are for loop control and the final parameter is a void pointer for passing anything you may need.
In the wheeledvehicle example you will see a call which looks as follows:

// Multi-Process load method pointer and call mp.start()
pMethod = &WheeledVehicle::MP_updateForces1;
mp.start(this, 0, wend - mWheel, &dt);

The method pointer was introduced to speed up the repetitive function call in the processing loop. Instead of repeatedly checking an int to determine which method to run, the loop just calls the execute() method, which in turn calls whatever method the pMethod pointer has been loaded with.
The wheeledvehicle header file has the following additions:

//--- Multi-process additions ------------------------------------------------------------------------------
void (WheeledVehicle::*pMethod)(S32 counter, void* data ); // pointer to a method in THIS class
virtual void execute(S32 counter, void* data){(this->*pMethod)(counter, data );} // execute the above method
void MP_updateForces1(S32 counter, void *data);

The first line is a method pointer definition. You will need to replace WheeledVehicle with the respective class name where you are implementing the method pointer, i.e. if you are going to add a similar method pointer to the Vehicle class then it will be:

void (Vehicle::*pMethod)(S32 counter, void* data );

The second line is the implementation of the base class pure virtual execute() method. Depending on your setup, you may also need to add a Parent::execute(counter, data) call to the implementation.
The third line is the method which the multiprocess class will finally call to do the work. These methods must all be of a form which takes two parameters (S32, void*).
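Putting those three pieces together, the pattern in a hypothetical class might look like this. MyObject and MP_work1 are illustrative names only, and the typedef stands in for TGE's S32:

```cpp
typedef int S32; // stand-in for TGE's 32-bit integer type

// Sketch of the MP_base interface described above.
class MP_base {
public:
    virtual ~MP_base() {}
    virtual void execute(S32 counter, void* data) = 0;
};

// A hypothetical class wired up the same way as WheeledVehicle.
class MyObject : public MP_base {
public:
    void (MyObject::*pMethod)(S32 counter, void* data); // method pointer
    virtual void execute(S32 counter, void* data) {
        (this->*pMethod)(counter, data); // forward to the loaded method
    }
    void MP_work1(S32 counter, void* /*data*/) {
        results[counter] = counter * 2;  // stand-in per-iteration work
    }
    int results[8];
};
```

Loading pMethod = &MyObject::MP_work1; once and then calling obj.execute(i, ...) for each loop index gives the cheap dispatch described above: no per-iteration branching, just one virtual call and one member-pointer call.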
The MultiProcess class constructor gets a CPU count and, if more than one CPU is found, creates a thread pool equal to the CPU count. These threads then run smack bang into a blocking semaphore, forcing them to wait for further instructions.
If the CPU count is one, then no threads are created. For performance reasons, a method-pointer is used to store the appropriate loop implementation (threaded or not).
The 'start' method is what the TGE derived class instance will call.
The 'start' method calls the method-pointer to run either a same-thread loop or a multi-threaded loop.
If the startThreaded method was called then the blocking semaphores are released while the main thread waits for completion of the released threads.
The processThreaded method is designed as a forever loop which just calls the base class instance's processThread method for the required number of loop cycles.
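The release/wait/signal cycle can be sketched as follows. This is a one-worker illustration under stated assumptions: a hand-rolled counting semaphore (std::counting_semaphore is C++20), illustrative names throughout, and squaring as a stand-in for the real processThread work.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Tiny counting semaphore, mirroring the platform layer's semaphore.
class Semaphore {
    std::mutex m;
    std::condition_variable cv;
    int count = 0;
public:
    void release() {
        std::lock_guard<std::mutex> lock(m);
        ++count;
        cv.notify_one();
    }
    void acquire() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return count > 0; });
        --count;
    }
};

// One worker showing the forever-loop pattern: block on a start
// semaphore, run the assigned range, signal done, and block again.
struct Worker {
    Semaphore startSem, doneSem;
    bool quitting = false;
    int begin = 0, end = 0;
    std::vector<int>* out = nullptr;
    std::thread thread; // declared last so it starts after the fields above

    Worker() : thread(&Worker::foreverLoop, this) {}

    void foreverLoop() {
        for (;;) {                      // the 'forever loop'
            startSem.acquire();         // sleep here between jobs
            if (quitting) return;
            for (int i = begin; i < end; ++i)
                (*out)[i] = i * i;      // stand-in for processThread(i, ...)
            doneSem.release();          // wake the waiting main thread
        }
    }

    void run(int b, int e, std::vector<int>* o) {
        begin = b; end = e; out = o;
        startSem.release();             // release the worker thread...
        doneSem.acquire();              // ...and halt until it finishes
    }

    ~Worker() {
        quitting = true;
        startSem.release();             // let the loop observe 'quitting'
        thread.join();
    }
};
```

Because the worker is created once and merely sleeps on the semaphore between jobs, repeated run() calls avoid thread creation cost; the remaining overhead is exactly the semaphore signalling that the update at the top of this resource measures.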
Simple.... :)

#22
09/04/2008 (11:22 pm)
And just a little more data. The break-even point where two threads take as long as the no-thread version is 31 milliseconds on my computer. If the computational loop takes longer than 32 ms then it's faster to use two threads. This is way longer than the time it takes for a semaphore to change state, so I'm assuming it's an O/S limitation in deciding whether to push a thread onto the second CPU or not.
#23
07/25/2009 (5:10 pm)
Noob question: I only have 1 processor, but it's a Core 2 Duo. Since I have a dual-core, will I get better performance?
