Interesting problem with memory allocation and the profiler
by Cliff · in Torque Game Engine Advanced · 09/29/2007 (10:48 pm) · 7 replies
So, I've discovered a rather interesting problem... as my memory utilization approached a certain point in a dedicated server, the server would just freeze.
Using a debug version would make the problem go away for a while, but then when I'd get above another certain point, it would once again start freezing.
In digging in with a debugger, what I found was that while allocating memory it hit the profiler, which tried to construct a profile path... which near as I can tell appeared to be in the same memory block that was already having more memory allocated to it. The end result was an infinite loop, hence the freeze.
Here's the call-stack:
As you can see, the bottom was in _StringTable::resize, where the infinite loop is happening while it's trying to walk the table and add new strings to the bottom of the bucket (starting around line 184). The string table its walking self-references at the end, causing the loop. The entries are:
"bumpTex" : "root -> MainLoop -> TimeManagerProcessMain -> ProcessTimeEvent -> MemoryRealloc" : "Bip01Spine" : "" : "bumpTex" (LOOP)
Unfortunately, disabling the profiler in torqueConfig.h resulted in 284 compile errors because a huge volume of the code assumes that the profiler is enabled. Rather than try to fix that assumption (which I suspect would've resulted in a lot of cascading errors, and I wasn't up to dealing with), I went to the part of the profiler where the bug exhibited itself. I commented out the offending line of code in profiler.cpp around line 326, like so:
I don't know whether this will act as a final solution to my problem, or just fix it for now until I once again hit a certain magic number of memory utilization. However, since this problem really kept biting me repeatedly, I thought I'd share my solution here for any others that might run into it, and perhaps see if someone else has come up with a better solution than I have.
Using a debug version would make the problem go away for a while, but then when I'd get above another certain point, it would once again start freezing.
In digging in with a debugger, what I found was that while allocating memory it hit the profiler, which tried to construct a profile path... which near as I can tell appeared to be in the same memory block that was already having more memory allocated to it. The end result was an infinite loop, hence the freeze.
Here's the call-stack:
> TGEA_DEBUG.exe!_StringTable::resize(const unsigned int newSize=7339) Line 188 C++
TGEA_DEBUG.exe!_StringTable::insert(const char * val=0x0012ec6c, const bool caseSens=false) Line 125 C++
TGEA_DEBUG.exe!ProfilerInstance::constructProfilePath(ProfilerData * pd=0x09c722b8) Line 767 C++
TGEA_DEBUG.exe!ProfilerInstance::hashPush(ProfilerRoot * pr=0x01192b90) Line 326 + 0xc bytes C++
TGEA_DEBUG.exe!Profiler::hashPush(ProfilerRoot * root=0x01192b90) Line 848 C++
TGEA_DEBUG.exe!Memory::realloc(void * mem=0x01ab2bc0, unsigned int size=58720) Line 1215 C++
TGEA_DEBUG.exe!dRealloc(void * in_pResize=0x01ab2bc0, unsigned int in_size=58712) Line 1377 + 0xd bytes C++
TGEA_DEBUG.exe!_StringTable::resize(const unsigned int newSize=7339) Line 190 + 0x12 bytes C++
TGEA_DEBUG.exe!_StringTable::insert(const char * val=0x01151448, const bool caseSens=false) Line 125 C++
TGEA_DEBUG.exe!Con::setVariable(const char * name=0x01151448, const char * value=0x01190840) Line 623 + 0x15 bytes C++
TGEA_DEBUG.exe!fpsUpdate() Line 655 + 0x28 bytes C++
TGEA_DEBUG.exe!DemoGame::processTimeEvent(TimeEvent * event=0x0012fc9c) Line 766 C++
TGEA_DEBUG.exe!GameInterface::processEvent(Event * event=0x0012fc9c) Line 74 + 0x13 bytes C++
TGEA_DEBUG.exe!GameInterface::postEvent(Event & event={...}) Line 156 + 0x13 bytes C++
TGEA_DEBUG.exe!TimeManager::process() Line 1049 + 0x19 bytes C++
TGEA_DEBUG.exe!DemoGame::main(int argc=4, const char * * argv=0x016e5878) Line 548 C++
TGEA_DEBUG.exe!run(int argc=4, const char * * argv=0x016e5878) Line 922 + 0x1c bytes C++
TGEA_DEBUG.exe!main(int argc=4, const char * * argv=0x016e5878) Line 1000 + 0xd bytes C++
TGEA_DEBUG.exe!__tmainCRTStartup() Line 586 + 0x19 bytes C
TGEA_DEBUG.exe!mainCRTStartup() Line 403 C
kernel32.dll!7c816fd7()
[Frames below may be incorrect and/or missing, no symbols loaded for kernel32.dll]As you can see, the bottom was in _StringTable::resize, where the infinite loop is happening while it's trying to walk the table and add new strings to the bottom of the bucket (starting around line 184). The string table its walking self-references at the end, causing the loop. The entries are:
"bumpTex" : "root -> MainLoop -> TimeManagerProcessMain -> ProcessTimeEvent -> MemoryRealloc" : "Bip01Spine" : "" : "bumpTex" (LOOP)
Unfortunately, disabling the profiler in torqueConfig.h resulted in 284 compile errors because a huge volume of the code assumes that the profiler is enabled. Rather than try to fix that assumption (which I suspect would've resulted in a lot of cascading errors, and I wasn't up to dealing with), I went to the part of the profiler where the bug exhibited itself. I commented out the offending line of code in profiler.cpp around line 326, like so:
#ifdef TORQUE_ENABLE_PROFILE_PATH // nextProfiler->mPath = constructProfilePath(nextProfiler); #endif
I don't know whether this will act as a final solution to my problem, or just fix it for now until I once again hit a certain magic number of memory utilization. However, since this problem really kept biting me repeatedly, I thought I'd share my solution here for any others that might run into it, and perhaps see if someone else has come up with a better solution than I have.
About the author
#2
It happens during server startup, so it's my initial variable setup that's causing this.
Yeah, I'm planning on fixing that before we ship, but for right now what I needed was something that didn't get in the way development of our main features.
I'll take a look at that, thanks.
I took a deeper look into the compile errors, and what it actually looks like is that it assumes if you're running a debug compile that you're using the profiler, in gfxStructs.h around line 314. There's a specific line, though, line 315, that if I add this:
I'm going to just pop #ifdefs around that and I think that should fix it. I'm also pretty sure that it wouldn't have given me any kind of problem in a production compile, but when I couldn't get a debug to compile previously I didn't even try a release compile.
There appears to be one more place, line 64 in gfxTextureObject.cpp, that also makes that assumption, so I change it to:
That appears to have fixed it, so I've removed the commenting out of the line at 326 of profiler.cpp, I'll give it some testing but I'm pretty sure that will have fixed it.
Thanks, Clark!
09/30/2007 (10:34 am)
Quote:1) you may want to figure out why string table continues to grow as your dedicated server runs. In theory, the strings in your game should cap out and the string table won't grow. That's unless you keep adding new names all the time or something like that.
It happens during server startup, so it's my initial variable setup that's causing this.
Quote:2) you should be able to disable the profiler in a shipping build. Otherwise, you'll get a lot of time wasted pushing and popping the profile markers.
Yeah, I'm planning on fixing that before we ship, but for right now what I needed was something that didn't get in the way development of our main features.
Quote:3) Instead of commenting out hte line you did, I'd set mPath to "" just so you don't get a dangling pointer!
I'll take a look at that, thanks.
Quote:4) If you look at platform memory, the alloc code has #ifndef guards around profile pushes in the case that profile path is set. Looks like we haven't added them everywhere they are needed. You could add them around the profile code in realloc and it would probably handle your problem (but I think disabling profiler in shipping code is a better idea).
I took a deeper look into the compile errors, and what it actually looks like is that it assumes if you're running a debug compile that you're using the profiler, in gfxStructs.h around line 314. There's a specific line, though, line 315, that if I add this:
#ifdef TORQUE_ENABLE_PROFILE_PATH
mDebugCreationPath = gProfiler->getProfilePath();
#endifI'm going to just pop #ifdefs around that and I think that should fix it. I'm also pretty sure that it wouldn't have given me any kind of problem in a production compile, but when I couldn't get a debug to compile previously I didn't even try a release compile.
There appears to be one more place, line 64 in gfxTextureObject.cpp, that also makes that assumption, so I change it to:
#ifdef TORQUE_ENABLE_PROFILE_PATH mDebugCreationPath = gProfiler->getProfilePath(); #endif
That appears to have fixed it, so I've removed the commenting out of the line at 326 of profiler.cpp, I'll give it some testing but I'm pretty sure that will have fixed it.
Thanks, Clark!
#3
Like Cliff this issue only came up when using the "release" build; everything worked fine in the Debug build. I finally noticed that right after that particular function would run when loading a mission the debug build would output that the memory manager had allocated a new page of memory.
After working through the memory manager for awhile, I wasn't able to track down the final issue, but my solution ended up being to define an initial page size of 128mb to start (I got this number by calculating how much memory my mission was taking, in this case, around 90mb, and giving it some more on top of that that was a power of 2).
This solved my problem.
Just now I saw this thread (I was about to make a new one about this issue) and tried the above fix proposed by Cliff (simply adding the idfefs). I can definitely confirm that they solve this issue. Not only does the mission load correctly with a "release" build but it SIGNIFICANTLY improved the performance of my mission.
I noticed that when using the replicator for some reason my framerate was dropping from ~120fps (with no replicator) to 20fps, regardless of where the camera was pointing. Even with my "fix" of increasing the initial memory page size in order to get the mission to load I was still getting those low frame rates. Now I'm getting a pretty solid 100fps again.
I'm not entirely sure why these two calls to getProfilePath would cause the drastic problems that they do but Cliff's code definitely resolves it. Thanks!
09/30/2007 (6:21 pm)
I spent several hours today trying to track down an issue with a replicator I was using. It appeared that a function that contained a loop was dieing for some reason. I put in some code to print to the console each iteration of the loop, at the start of the function and at the end of the function. I would get the console print outs at the start, all throughout the loop, but never at the very end of the function (which ended right after the loop finished). I tried watching the value of the variables in the compiler, and even using a try catch but no exception was ever thrown.Like Cliff this issue only came up when using the "release" build; everything worked fine in the Debug build. I finally noticed that right after that particular function would run when loading a mission the debug build would output that the memory manager had allocated a new page of memory.
After working through the memory manager for awhile, I wasn't able to track down the final issue, but my solution ended up being to define an initial page size of 128mb to start (I got this number by calculating how much memory my mission was taking, in this case, around 90mb, and giving it some more on top of that that was a power of 2).
This solved my problem.
Just now I saw this thread (I was about to make a new one about this issue) and tried the above fix proposed by Cliff (simply adding the idfefs). I can definitely confirm that they solve this issue. Not only does the mission load correctly with a "release" build but it SIGNIFICANTLY improved the performance of my mission.
I noticed that when using the replicator for some reason my framerate was dropping from ~120fps (with no replicator) to 20fps, regardless of where the camera was pointing. Even with my "fix" of increasing the initial memory page size in order to get the mission to load I was still getting those low frame rates. Now I'm getting a pretty solid 100fps again.
I'm not entirely sure why these two calls to getProfilePath would cause the drastic problems that they do but Cliff's code definitely resolves it. Thanks!
#4
For my own fix, I also had to remove all the profiler defines from the torqueConfig.h.
Interestingly enough, I'm also using a replicator... I wonder if the issue is somehow related to replicators.
09/30/2007 (6:29 pm)
Matt - Glad to hear it's working for you. So far in my own testing, it seems to have improved my problem as well. I'm still not quite ready to call it licked until it's had some time to sink in, but I'm feeling hopeful.For my own fix, I also had to remove all the profiler defines from the torqueConfig.h.
Interestingly enough, I'm also using a replicator... I wonder if the issue is somehow related to replicators.
#5
10/01/2007 (9:50 am)
@Matt - the slowing might be happening due to the multithread code in the getProfilePath. I think we compile with TORQUE_MULTITHREAD defined by default.
#6
10/01/2007 (12:36 pm)
@Clark, you are correct in that TORQUE_MULTITHREAD is defined by default. I also recall seeing that handling mutex locks/unlocks were taking a lot of time although I don't remember any specifics. When I have some time I will see if I there's anything that stands out with profile path stuff with regards to multithreading being enabled. Thanks for the lead.
#7
I'm not too sawwy about this stuff, just wanted to note it still exists.
02/07/2009 (3:27 pm)
Running into this exact same problem, same callstack and symptoms. Looks like it recursively calls the same function until it runs out of frame allocator memory.I'm not too sawwy about this stuff, just wanted to note it still exists.
Torque Owner Clark Fagot
1) you may want to figure out why string table continues to grow as your dedicated server runs. In theory, the strings in your game should cap out and the string table won't grow. That's unless you keep adding new names all the time or something like that.
2) you should be able to disable the profiler in a shipping build. Otherwise, you'll get a lot of time wasted pushing and popping the profile markers.
3) Instead of commenting out hte line you did, I'd set mPath to "" just so you don't get a dangling pointer!
4) If you look at platform memory, the alloc code has #ifndef guards around profile pushes in the case that profile path is set. Looks like we haven't added them everywhere they are needed. You could add them around the profile code in realloc and it would probably handle your problem (but I think disabling profiler in shipping code is a better idea).