Game Development Community

Multithreading in games- the future, the scam!

by Eric Preisz · 12/20/2006 (4:27 am) · 11 comments

I had lunch today with the CTO of a large organization. As his title implies, it is his responsibility to keep up-to-date with the trends of technology to enable his organization to react proactively, not just reactively. During our discussion, we stumbled on the multi-core processor.

If you haven't made it to the last 5 GDCs, let me save you the giga price of a giga pass. The advice boils down to two things: sort your polygons in a way that requires the fewest draw calls, and multi-thread your application. And for the past 5 years, hardly anyone has listened. I'm sure that hardly anyone went home and reorganized their engine to abide by these two suggestions. In fact, in response to the lack of compliance, Microsoft has spent time engineering DX10 so that draw calls and poorly batched render states are not as slow as they are in DX9.

In keeping with the title of this blog, let's move from Microsoft to Intel. Gordon Moore, the co-founder of chipmaking giant Intel, made the empirical observation that the number of transistors on a chip would double every 24 months. More transistors correlate directly with more processing power.

Moore's law does have a limit, and we've reached it. Two factors are currently keeping Moore's law from continuing.

The first is c. That's right, the speed of light. An electrical signal propagates along a wire at a sizable fraction of the speed of light, similar to the way a Newton's cradle clanks on someone's desk. And although the speed of light is fast, the distance from transistor to transistor is becoming long enough to cause a latency that eats into what you save by adding the transistor in the first place. For example, if adding a new transistor saves us 4 units but the length of the wire costs us 5, we end up with a 1 unit overall penalty.

The second is density, or rather, the result of the density. It's estimated that if we continue adding transistors at the same pace, a square centimeter of a chip will reach temperatures similar to that of the sun within the next 10 years. Therefore, the second factor limiting Moore's law is heat.

The solution? Multi-threading! We must divide and conquer.

So that's the future. Intel, which just released a quad-core chip, foresees numbers such as 100 cores in the not-so-distant future. What was that? Did I just hear the Cell processor's pseudo 9 cores ascend into its stomach?

Here's why I think multi-threading is a scam. (OK, it's not a scam, it's just going to be a bumpy ride for a while, but you wouldn't have read this otherwise.)

1) Multi-threading doesn't solve memory issues

Many people add multi-threading to their games and see no speed improvement. One reason may be poor implementation. Poor use of critical sections may cause your program to spend a lot of its time running serially, with threads waiting on one another. Another reason may be that your program is suffering from an issue that multi-threading doesn't solve: cache misses.

There is a book by Michael Abrash called Zen of Code Optimization. Abrash is, by all accounts, the godfather of modern optimization. In his book, written in the mid-1990s, he states that memory is so fast that it outpaces the speed at which the CPU can process operations. That was true back then, but the opposite is true today.

Today, CPUs spend many cycles idle while the memory system fetches data from L1 cache, L2 cache, or, even worse, the dreaded system memory (system memory resides across the death valley of the FSB, the front-side bus).

There are many things that cause cache misses. Mostly, they come from a program not moving through memory sequentially. Any section of code whose performance suffers from cache misses will not run faster when multi-threaded on a dual core. (Note that hyper-threading can hide some cache-miss latency, since the core can run the other thread while one is stalled.)
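Here is a minimal C++ sketch of what "not moving through memory sequentially" can look like. Both functions compute the same sum over a grid stored row by row in one contiguous array; only the traversal order differs. The function names and the flat row-major layout are assumptions for illustration, and how big the speed gap is depends entirely on your grid size and hardware, so measure before drawing conclusions.

```cpp
#include <cstddef>
#include <vector>

// The grid holds rows * cols values, stored row by row (row-major).

// Walks memory in address order: each access is next to the previous one,
// so most reads land in a cache line that has already been fetched.
long long sum_row_by_row(const std::vector<int>& grid,
                         std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += grid[r * cols + c];
    return total;
}

// Same result, but each access jumps cols * sizeof(int) bytes ahead.
// On a large grid nearly every read can touch a different cache line.
long long sum_column_by_column(const std::vector<int>& grid,
                               std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += grid[r * cols + c];
    return total;
}
```

Splitting the column-by-column version across two cores gives you two threads that both stall on memory, which is the point of this section.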

In fact, multi-threading on a multi-core processor is likely to cause more cache misses than running on a single core. Why? Two reasons. First, caches are a finite size. If you have four cores fighting for space in a shared L2 cache, you will have many cache evictions and therefore more cache misses. Second, at the other end of the spectrum, a cache line is usually 64 bytes. If one core updates the first byte in a cache line and another core updates the last byte, the hardware must keep both cores' copies of that line coherent, so the line bounces between them even though they never touch the same data. This phenomenon is known as false sharing.
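One common way to sidestep false sharing is to give each thread's data its own cache line. The sketch below is a minimal C++17 example under a few assumptions: the 64-byte line size is taken from the text above rather than queried from the hardware, and the names `PaddedCounter` and `count_in_parallel` are made up for illustration. It relies on C++17 so that `std::vector` honors the over-aligned type.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache line size; 64 bytes is typical but not guaranteed.
constexpr std::size_t kCacheLineBytes = 64;

// alignas rounds the struct up to a full cache line, so neighbouring
// counters in an array never share a line.
struct alignas(kCacheLineBytes) PaddedCounter {
    std::uint64_t value = 0;
};

// Each thread increments only its own counter. No locks are needed because
// no counter is written by more than one thread.
std::uint64_t count_in_parallel(std::size_t thread_count,
                                std::uint64_t increments_per_thread) {
    std::vector<PaddedCounter> counters(thread_count);
    std::vector<std::thread> workers;
    workers.reserve(thread_count);
    for (std::size_t i = 0; i < thread_count; ++i) {
        workers.emplace_back([&counters, i, increments_per_thread] {
            for (std::uint64_t n = 0; n < increments_per_thread; ++n)
                counters[i].value += 1;
        });
    }
    for (auto& worker : workers) worker.join();

    // Combine the per-thread results after all threads have finished.
    std::uint64_t total = 0;
    for (const auto& counter : counters) total += counter.value;
    return total;
}
```

Without the `alignas`, eight of these counters would fit in one 64-byte line and every increment could invalidate the other cores' copies. The padding trades a little memory for that.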

So if memory systems are so slow, then why the push for multiple cores? I'm not sure. Marketing, maybe? Could it be that someone is more likely to buy a processor because it has more cores rather than a blazing memory system? I think so. In fact, for years consumers have been trained to purchase a processor based on clock speed, which isn't the same as measuring instructions retired. Some believe that the P4 architecture was the result of a marketing decision, since the design allows for a deeper pipeline and a faster clock speed. To quote an ex-girlfriend, it's not the speed of the clock, it's how you use it.

2) We are API crazy

Every day I feel as though my job as a programmer is being replaced by that of a code merger. Only people who develop open source get to write code. Since we rely on so many APIs, how can we rely on multi-threading? Writing multi-threaded code is very difficult and error-prone. In fact, a program may contain race conditions that don't result in crashes and silently produce wrong results instead (that's right, there are worse multi-threading errors than random crashes).

So if we rely on third party APIs, we also rely on their implementations of multi-threading. It will take many years for these APIs to truly be multi-thread safe. Ask Microsoft.



In conclusion.


Will multi-threading change the way we program? Yes. Is it going to change the world overnight? No. I suggest that a quicker overnight solution is to move parallel work from your CPU to your GPU. The GPU behaves like a multi-core processor that specializes in math on vectors, similar to the 8 vector cores in the PlayStation 3's Cell processor.

In the meantime, re-write your engine from scratch, use algorithms that don't require critical sections, and plan to have the best multi-threaded game engine in 2013.
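As a small illustration of an algorithm that doesn't require critical sections, here is a C++ sketch that sums an array by giving each task its own slice and its own private result, then combining the results at the end. The name `parallel_sum` and the choice of `std::async` are assumptions for the sake of the example, not a recommendation for how an engine should schedule work.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Sums `values` using up to `task_count` tasks. Each task reads a disjoint
// slice and returns its own partial sum, so no shared state is written and
// no lock or critical section is needed.
long long parallel_sum(const std::vector<int>& values, std::size_t task_count) {
    if (task_count == 0) task_count = 1;
    // Slice size, rounded up so every element is covered.
    const std::size_t chunk = (values.size() + task_count - 1) / task_count;

    std::vector<std::future<long long>> partials;
    for (std::size_t begin = 0; begin < values.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, values.size());
        partials.push_back(std::async(std::launch::async, [&values, begin, end] {
            return std::accumulate(values.begin() + begin,
                                   values.begin() + end, 0LL);
        }));
    }

    // The only synchronization point: waiting for each task's result.
    long long total = 0;
    for (auto& partial : partials) total += partial.get();
    return total;
}
```

The same shape (split the data, work privately, merge once) applies to things like per-entity updates, as long as the entities don't write to each other.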

#1
12/20/2006 (4:46 am)
Great plan, though I'm still a fan of multi-threading.
#2
12/20/2006 (5:32 am)
I'm an artist, I read the whole thing and I actually understood it....that scares the hell out of me...the understanding it part that is.
#3
12/20/2006 (9:01 am)
I would second that Todd!!
Artist here, learning to program, and understood...yep scary:)
#4
12/20/2006 (2:27 pm)
I agree with much of what you said. However, I want to add my own insight to a few points.

AMD/ATI and Nvidia have both multithreaded their OpenGL and D3D drivers. I think the 10-20% gains at lower resolution with multicore on most tier 1 commercial games (i.e. graphically intense games) show that it is possible to successfully multithread APIs. Much larger gains could be had if games were to do the multithreading themselves at a higher level. It is a lot of work, and there can be some truly evil bugs. The gains to be had are significant even considering the bandwidth concerns you pointed out. With 8 or more cores I think this bandwidth problem might really start to affect things.

I do think that the future is to write specialized cores that are very deeply pipelined instead of just adding more and more standard cores. Think of how GPUs work. They run 100's of threads at a time in a pipeline that is 1000's of cycles long. When one thread needs to fetch something from memory they just pause that thread and move on to the next one. This way they can hide memory latency very well. Bandwidth can be addressed with specialized data buses. I think this sort of approach applies to a lot of areas outside of graphics. AMD has recently announced this sort of initiative under the name "Fusion".
#5
12/20/2006 (4:56 pm)
> So if memory systems are so slow then why the push for multiple cores? I'm not sure.

Compared to other computer architecture techniques for improving system performance, designing multi-core processors is comparatively easy, and gives you a great bang for your design buck. Many of the best techniques for getting single-thread performance in microprocessors have already been implemented in current designs. Going to a deeper re-order buffer or wider issue or better branch predictors can have some benefit, but that benefit is incremental, and facing diminishing returns. And power issues make increasing the clock rate hard. But thread-level parallelism offers a huge area for increasing system-level performance at relatively little design effort (comparatively speaking, I don't want to give the impression that designing or architecting a microprocessor is easy). This is why computer architects like multi-core processors. Simultaneous multi-threading in a single core has benefits that multi-core doesn't, but is MUCH, MUCH harder to implement, which is why you haven't seen too many multi-threaded cores actually on the market yet. From the computer architect's perspective, the big problem with the technique of relying on thread-level parallelism to increase system performance is that it requires buy-in from software writers...
#6
12/21/2006 (4:20 am)
@Dan,

I understand the deep pipelining benefits and execution unit parallelism. I even understand how adding cores provides a scalable design. What I don't understand is all the focus on processing as opposed to memory fetching performance.

I think that memory fetching performance is largely overlooked. Everyone is so concerned with how fast you can multiply x*y, but does that really matter if the slow part of your program is loading and storing the values for x and y?

I think the chip makers are focusing on benchmarks that aren't representative of an application as diverse as a game.
#7
12/21/2006 (6:00 am)
@Eric,

You're wrong that it's overlooked. It's just not easy to improve. Hardware prefetching algorithms and so forth are a heavy area of development. There just aren't that many things that a microprocessor architect can do about DRAM latency. System bandwidth is generally easier to improve, but tends to push up the cost of both the microprocessor package and the rest of the system, since you tend to need either more wires or better electrical performance.

You also need to consider that changing the system interconnect can be very difficult from a business point of view, since there are a lot of players who all have a lot invested (in terms of existing products and engineering effort) in whatever the current interconnect system is, which puts a lot of inertia behind the status quo. The uArch of cores from Intel and AMD seems to turn over faster than system interconnects do. Now, it's not that there's nothing that can be done. In fact, I'd be surprised if you didn't see lots of incremental improvement with each generation. It's just unlikely that there's some kind of big breakthrough coming.

However, if software writers DO buy in to heavily multi-threaded apps, then there IS a big breakthrough in system performance, which is why it is being focused on now. Basically, it's currently an underexploited vector of performance, where lots of other areas are already facing heavy diminishing returns, so the ROI is currently much better along that vector (again, with the critical assumption that software writers buy into it).

So it's not like Intel and AMD don't WANT to improve memory subsystem performance (and if the architects could deliver it, I'd be surprised if the marketing departments said "no"), it's just very, very hard to improve.

Now, not ALL of the reasons that multi-core is being pushed nowadays are "pure" computer architecture reasons, but there are some good computer architecture reasons.
#8
12/21/2006 (10:33 am)
Software writers don't really have any choice, they have to buy into it. You have some powerful but complex hardware, and some of your competitors will certainly use it to its full extent (because if they do, they can make better games, and they can get more money). What will you do then, can you really sit back and watch?
#9
12/22/2006 (2:33 am)
I've long since come to the conclusion that multi-threaded programming with complex interactions between threads won't become prevalent until there is some sort of sea change in how multi-threaded programming works. We need a new model for concurrency -- a new abstraction. One that is simpler to deal with and easier to validate. Heck, just a better way to validate ALONE might do the trick. I suspect that such an abstraction will come out of the world of functional programming (where, for example, one is more acutely aware of the existence of side effects in a function).

Even in the world of enterprise applications, threading isn't really used that much, and where it IS used it's either for weakly-interacting tasks (say, threads in an application server), simple interactions (cache invalidation, for example), or is implemented by Real Programmers once and reused by everyone else as a black box (database server).

-JF
#10
12/30/2006 (4:02 pm)
Two threads waiting half the time are still faster than one waiting half the time :)

At my day job we have used multi-threading for a long time, and we've seen huge improvements going from a dual core to a quad core. Memory is still an issue, but not because of speed, just because we have trouble fitting enough memory in a single computer to feed our 4 computing threads. But we use gargantuan amounts of memory.

I disagree that writing multithreaded programs is difficult. Like any other programming, it only requires proper design and planning. I see brilliant applications for multi-core processors in games, but not in rasterization. The first one is AI. Running several threads in parallel for AI thinking would require minimal synchronization. Another one is ray-tracing. We are still a long way from realtime raytracing, but the parallelization of computers is a necessary first step.
The number of cores will continue to increase. Game engines that cannot fully use 8 cores in two years will fall behind. Forget graphics, and use that CPU power to give proper AI and physics and things you never considered like voice analysis.

To the design table!
#11
04/11/2008 (5:58 am)
@Mathieu The Canadian Guy:

Well, now you can look here:
http://www.theinquirer.net/gb/inquirer/news/2008/04/06/ddr3-metaram-spotted

That will get you started.