Do we really need to recenter clipmaps each frame?
by Stefan Lundmark · in Torque Game Engine Advanced · 12/22/2008 (4:38 pm) · 19 replies
I've been playing around with only recentering the clipmap every 128 ms or so, and I can't notice any difference visually at all. On my laptop this gives a boost in FPS from the lower 20's to mid 40's in my test scene.
Obviously, the clipmap won't be perfectly up-to-date and if we move fast enough we'll run into lower LOD levels until it is recentered again, but other than that, is there any downside to this?
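A minimal C++ sketch of the time-based throttle described above. `RecenterThrottle` and `shouldRecenter` are hypothetical names, not part of TGEA; the caller is assumed to supply a millisecond timestamp from whatever timer the engine already uses.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not a TGEA class: decides whether enough time has
// passed since the last recenter to justify doing another one.
class RecenterThrottle {
public:
    explicit RecenterThrottle(std::uint32_t intervalMs)
        : mIntervalMs(intervalMs) {
        assert(intervalMs > 0);
    }

    // nowMs is the current time in milliseconds from any monotonic timer.
    // Returns true when the caller should recenter the clipmap this frame.
    bool shouldRecenter(std::uint32_t nowMs) {
        // Unsigned subtraction stays correct if the timer wraps around.
        if (mHasRun && nowMs - mLastMs < mIntervalMs)
            return false;  // too soon: keep rendering with the stale clipmap
        mLastMs = nowMs;
        mHasRun = true;
        return true;
    }

private:
    std::uint32_t mIntervalMs;
    std::uint32_t mLastMs = 0;
    bool mHasRun = false;
};
```

With an interval of 128 ms, a game running at 60 FPS would recenter roughly once every eight frames instead of every frame.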
#2
12/22/2008 (5:48 pm)
Quote:
Is 128ms the sweet spot?
No idea. I tested with different values, and anything above 500 ms was noticeable: you could see blurry parts turning crisp as you moved. A value of 64 ms wouldn't make my laptop crawl either, but I couldn't spot any increase in quality and the FPS did drop a little bit, so.
#3
12/22/2008 (5:49 pm)
Not sure why Jaimi decided to delete his post, it was interesting.
Putting the clipmapper in a separate thread is something I tested, but this was my first experience with threads and I ran into issues when I tried to upload the texture.
Quote:
If you are paging data from disk..
Yeah, I guess that'll mean that each time the clipmap updates there are now more changes between updates and hence more to send to the GPU at once rather than in smaller chunks?
#4
12/22/2008 (9:13 pm)
@Stefan - Sorry, I was going to do some more research and then repost, and then real life came up. Since I haven't had time, and don't know when I will, I'll repost the gist of it here.
My thought was it would be cool if we could move the recentering of the clipmap into a separate thread. Basically, having it run all the time, submit the recenter request, and then next frame check to see if it was done, etc. There's a lot of unused CPU that we could benefit from, even if the machine has only one core. (basically it could be sitting there churning while your app is waiting on the GPU to finish). You might need to do some prediction, based on where you think the recenter would need to be in a frame or two. Or you may just not care, since for most people it wouldn't matter.
A separate thread seems the easiest method (since it's just building a clipmap). You might need to take special care to be able to cancel the thread if you no longer need it - unloading the map, etc.
Or, if a thread is too problematic, perhaps just updating small parts of the clipmap each frame - doing 25% of the work each frame, and then on the fourth frame the clipmap is complete. This would amortize the cost of the rebuilding over 4 frames, but wouldn't leverage unused CPU that is going to waste while the GPU is working (or for the folks with other cores...)
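A rough C++ sketch of the amortized variant, assuming the rebuild can be split by texture rows. `AmortizedRebuild`, `step`, and the `updateRow` callback are hypothetical names for illustration and do not exist in TGEA.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: splits a rebuild of `totalRows` texture rows into
// `slices` roughly equal chunks and processes one chunk per frame.
class AmortizedRebuild {
public:
    AmortizedRebuild(std::size_t totalRows, std::size_t slices)
        : mTotalRows(totalRows), mRowsPerFrame(0) {
        assert(slices > 0);
        mRowsPerFrame = (totalRows + slices - 1) / slices;  // round up
    }

    // Rebuilds the next chunk by calling updateRow(row) for each row in it.
    // Returns true once every row has been rebuilt.
    template <typename UpdateRow>
    bool step(UpdateRow updateRow) {
        const std::size_t end = std::min(mNextRow + mRowsPerFrame, mTotalRows);
        for (; mNextRow < end; ++mNextRow)
            updateRow(mNextRow);
        return mNextRow == mTotalRows;
    }

private:
    std::size_t mTotalRows;
    std::size_t mRowsPerFrame;
    std::size_t mNextRow = 0;
};
```

With 1024 rows and 4 slices, each call to `step` handles 256 rows and the fourth call reports completion. As noted above, this spreads the cost out but still does all the work on the main thread.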
Quote:
but this was my first experience with threads and I ran into issues when I tried to upload the texture.
What I would suggest is to keep the texture upload in the main thread, build the new texture in the clipmap thread, and then swap them in the main thread once the worker has done its work. Basically, the point where it now calls recenter would do the following (in the world's worst pseudocode):
1. Check: Do we need a clipmap rebuilt or recentered?
2. No - return.
3. Is a thread already building texture(s) for us?
4. Yes - If thread is done, Swap out old textures for new textures, return. If not done, return (we're waiting).
5. No - Copy current texture(s) for rebuilding purposes to new texture.
6. Create thread with new texture(s). Rebuild like normal. Save thread handle for checking later what's going on.
7. Return.
And, for the sake of brevity, I left out all the exception handling, canceling, etc.
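The steps above can be sketched in C++ roughly as follows. This is only an illustration under several assumptions: it uses `std::async` and `std::future` from the standard library rather than any TGEA threading API, `Texels` stands in for a CPU-side copy of a clipmap level, and `ThreadedRecenter`, `update`, and `rebuild` are hypothetical names. One deliberate difference from the list: a finished rebuild is collected even on a frame where no new recenter is needed, so the result is never left dangling.

```cpp
#include <chrono>
#include <cstdint>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for the CPU-side texels of one clipmap level.
using Texels = std::vector<std::uint32_t>;

class ThreadedRecenter {
public:
    explicit ThreadedRecenter(Texels initial) : mFront(std::move(initial)) {}

    // Call once per frame from the main thread. `rebuild` receives a copy of
    // the current texels and returns the rebuilt ones. Returns true on the
    // frame where new texels were swapped in; that is the point where the
    // main thread would upload them to the GPU.
    template <typename Rebuild>
    bool update(bool needsRecenter, Rebuild rebuild) {
        if (mPending.valid()) {                        // step 3: worker active?
            if (mPending.wait_for(std::chrono::seconds(0)) !=
                std::future_status::ready)
                return false;                          // step 4: still waiting
            mFront = mPending.get();                   // step 4: swap old for new
            return true;
        }
        if (!needsRecenter)                            // steps 1-2
            return false;
        Texels copy = mFront;                          // step 5: copy for rebuild
        mPending = std::async(std::launch::async,      // step 6: start worker
                              std::move(rebuild), std::move(copy));
        return false;                                  // step 7
    }

    const Texels& front() const { return mFront; }

private:
    Texels mFront;                 // texels the renderer is using right now
    std::future<Texels> mPending;  // in-flight rebuild, if any
};
```

A future returned by `std::async` blocks in its destructor until the task finishes, so tearing down a `ThreadedRecenter` waits for the worker rather than cancelling it. Real cancellation (for example when unloading the map) would need an extra flag that the rebuild function checks.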
#5
12/23/2008 (1:26 am)
Hmm... That's an interesting idea. In the context of a blended clipmap: Right now the clipmap is (normally) doing very small updates, all on the GPU. In my tests I found it was very, very difficult to get the CPU to blend at an acceptable rate, and never anywhere near as fast as the GPU. Maybe an assembly guru could get it to go fast enough, but that person is not me. :) Maybe threading would bear a reward here, though, especially if you kept a cached area larger than the clipmap was using and then uploaded directly. Then your threads could blend a cache tile at a time, and act like the disk pager does. It might be fast enough to work. It probably depends on the resolutions you're trying to push - 8tx/meter would be pretty doable, 128tx/m might not be feasible.
In the context of a unique clipmap: I'm not sure you could thread it to go any faster. Right now all it is doing is literally a copy from system memory to VRAM from the paged data. So not sure that threading could get it any faster, unless the memcpy is CPU bound, which I hope isn't the case! :)
#6
12/23/2008 (3:43 am)
Jaimi, isn't the actual texture upload quite time-consuming in itself? I'm sure I've seen it in my profiler dumps in a few tests. It would be really cool to get that into a separate thread too (together with the rest of the clipmapper).
#7
12/23/2008 (9:36 am)
Threading the upload would be interesting. You would have to be careful that the synchronization overhead doesn't outweigh the benefit of threading it.
#8
12/23/2008 (10:23 am)
:) That's why I was wanting to do more research. I'm really not that familiar with the clipmap code.
If the upload is the problem, then perhaps we should just not have one big texture. Perhaps split the giant texture that is being recentered (1024x1024, right?) into small bits, and instead of recentering the texture, just readjust the texture projection and use smaller textures (128x128), throwing them away and creating new ones when needed?
#9
12/23/2008 (11:20 am)
It already maintains several textures, one for each clipmap stack entry. You might want to check out the SGI paper before digging into the system too much. It helps set out some theoretical limits and explains the general idea pretty clearly.
Broadly, there are two scenarios in the clipmap updates. One is for the higher levels. As you move, they only need a single row or column of new data to keep up with your detail needs. The overhead in that case is likely acquiring the lock on the texture to begin with, because the amount of data being streamed is only a few KB. Most data stays the same between updates.
The other is for the more detailed levels. If the player is moving quickly then you may be uploading a whole new texture (or nearly so) every frame. In that case the bottleneck might shift to the CPU as you're uploading megabytes of data.
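To put rough numbers on the two scenarios, here is a small C++ calculation. The 1024x1024 level size and 4 bytes per texel are assumptions for illustration (the thread only mentions 1024x1024 as a guess), and `uploadBytes` is a hypothetical helper.

```cpp
#include <cstddef>

// Assumed for illustration: 1024x1024 levels at 4 bytes per texel.
constexpr std::size_t kLevelSize = 1024;
constexpr std::size_t kBytesPerTexel = 4;

// Bytes needed to upload `rows` full rows (or columns) of one clipmap level.
constexpr std::size_t uploadBytes(std::size_t rows) {
    return rows * kLevelSize * kBytesPerTexel;
}

// A single new row is 4 KB, which matches the "few KB" case where the
// texture lock is the likely overhead.
static_assert(uploadBytes(1) == 4 * 1024, "one row is 4 KB");

// A whole level is 4 MB, which matches the "megabytes" case when moving fast.
static_assert(uploadBytes(kLevelSize) == 4 * 1024 * 1024, "full level is 4 MB");
```

Under these assumptions the two cases differ by a factor of 1024 in data volume, which is why the bottleneck can move from lock overhead to raw copy cost.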
#10
12/23/2008 (12:20 pm)
Oh yeah - also a big limiting factor is Shader Model 1/Fixed Function support, since that means you can only have a few levels of the clipmap bound at a time. If you can require SM2 or higher, you can be more flexible with the shader and do fancier things. Of course, if it can run in SM1 it can run fast in SM2...
#11
12/23/2008 (3:04 pm)
Interesting! Thanks for the information. Hopefully I'll get down to this in a few weeks.
#12
12/23/2008 (4:42 pm)
It'll be cool to see what you come up with!
#13
12/23/2008 (4:50 pm)
Hi Ben - I checked out the paper some time ago. What I was suggesting is probably not practical: instead of having a single texture for each clipmap stack entry, having multiple small textures for each entry in the stack (let's say that one entry is 1024x1024; instead we would have 64 128x128 textures). That way, you would never have to lock or update the new "virtual" texture; you would instead just discard unneeded textures in that stack entry and add new ones. Would this be better? I don't really know. My hardware experience is several years old now, and things that were important back then are just not as important now.
#14
12/23/2008 (5:07 pm)
The issue is how the hardware will sample from the textures. If you bound them all and had a shader you would hugely exceed the available texture bandwidth. Rendering one pixel of clipmapped geometry would cost the same as 64 pixels of normal geometry - actually, probably a lot worse as most hardware can't do that many samplers at once.
If you could figure out for a given chunk of geometry what tiles were needed, that might help you some, but you'd frequently be sampling at least from groups of 4 (since you'd almost be guaranteed to never be aligned to tiles) for each layer, and maybe from many more. So you'd still have to have a TON of texture samplers active. And of course on shader model 1, you're limited to only a few samplers, so you'd probably not be able to run there. You could always go to SM2 but that would cut out whatever percentage of your market can't do SM2 at all (small) or can't do SM2 quickly (larger).
That's just my sense based on your description so far. If you put together a prototype you'd automatically know a lot more than me about how well it would work. :)
#15
12/26/2008 (7:29 am)
Of course, I'm being thick in the head. I was thinking that much of the work was still being done on the CPU, but this technique has of course moved it all to the GPU in a shader. It's actually quite clever. I'll slink back under my rock.
#16
01/21/2009 (11:47 am)
@Stefan: Did you ever conclude your experiments with this? I would be fascinated to review what you have come up with.
#17
01/22/2009 (9:27 am)
@Caylo: Only minor tests on a dozen or so different computers, so nothing scientific.
For our gametype the lower recenter rate wasn't visibly different, and when I tested it in Stronghold I couldn't personally tell any difference either, so I thought the change was good. It dramatically increased framerates on my laptop. We kept the implementation.
If you want to get picky, there are also a few calculations in recenter() that can be merged and calculated only once (at least back then, they were calculated 4 times per stack level). This never showed up in the profiler, so I guess compiler optimization takes care of it or it's just so minor it won't matter.
#18
01/22/2009 (10:38 am)
Thanks Stefan. I decided to start by simply lowering mMaxTexelUploadPerRecenter by one fourth, which returned a decent (well, around 6%) performance increase, and all seems well. I'm going to leave it like this and see if any ill effects show up in the next few days. I just don't have the time right now to test this on my lower-end system, which is where I think this will show the most benefit.
EDIT: Crazy, a few more test runs and I lost what I THOUGHT I was seeing... I gotta take a nap...
#19
01/22/2009 (11:21 am)
That can backfire if you never get to upload the whole clipmap before changing it again, though. It's increased in debug builds to be able to compensate for lower code performance.
But yeah, sleeping sounds useful. Screwed me up once or twice. ;)
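A toy C++ model of how a too-small upload budget can backfire. This is not how TGEA uses mMaxTexelUploadPerRecenter internally (the thread does not show that code). It only assumes that each recenter creates a fixed amount of new texel work and may upload at most `budget` texels; `backlogAfter` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical model: every recenter adds `dirtyPerRecenter` texels that need
// uploading and may upload at most `budget` of the outstanding texels.
// Returns how many texels are still waiting after `recenters` recenters.
std::size_t backlogAfter(std::size_t recenters,
                         std::size_t dirtyPerRecenter,
                         std::size_t budget) {
    std::size_t backlog = 0;
    for (std::size_t i = 0; i < recenters; ++i) {
        backlog += dirtyPerRecenter;           // camera moved: new stale texels
        backlog -= std::min(backlog, budget);  // upload as much as allowed
    }
    return backlog;
}
```

In this model, when the budget covers the new work per recenter the backlog stays at zero. When the budget is cut by a quarter (750 against 1000 new texels), the backlog grows by 250 texels every recenter and the clipmap never catches up while the camera keeps moving.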
Associate Ben Garney
If you are paging data from disk, updating more frequently is better (as it helps the paging system avoid loading stuff that won't be used).
Is 128ms the sweet spot? What about every N frames with a min-fps cap?