some notes on the robustness of dedicated TGE servers
by Orion Elenzil · 03/09/2009 (10:54 am) · 12 comments
we had a server crash in vSide over the weekend. while looking through the logs for the cause, i found some numbers which i thought some folks might find interesting.
background: vSide is a TGE 1.3.5 - based MMO providing an urban setting.
we run three main "city" servers, and about twenty other servers to host customizable user apartments. at any given time, most of the online users are in the various apartment servers, but because there are twenty of them and only three city servers, each city server sees higher concurrency than any single apartment server.
so, the interesting stats:
our most popular server (for city "New Venezia") has currently been up for about 14 days. it has hosted 64,000 client sessions. it has instantiated about 4,000,000 SimObjects (ie new SimObject().getId() would equal about 4M), and has 11,000 SimObjects currently instantiated.
city "Raijuku" sees less traffic, but has some activities in it which change the profile. Raijuku was up for about twelve days and then crashed due to, i suspect, too much nesting of SimGroups. So there's some bug there. Over that period it had hosted about 31,000 client sessions. it had instantiated about 50,700,000 SimObjects, and had about 7,000 objects instantiated when it crashed.
so, i think these numbers are pretty cool.
people occasionally ask in the forums what the maximum number of SimObjects you can create is, and here is direct evidence that it's at least five million! and 64,000 client sessions and still ticking is also nothing to sneeze at.
although raijuku finally crashed, by and large i feel like this is really pretty stable. we didn't get to this level of stability accidentally, however. here are a few things we did to help ourselves ferret out problems when they came up:
* created a torquescript function which recursively walks the entire RootGroup tabulating how many of each class of SimObject are instantiated. we call this every five minutes and dump the results to the log.
* when there is an object leak, we enable a custom SimObject allocation tracker. this feature instruments SimObject::new() and delete() so that every SimObject is associated with the script callstack which allocated it. this is incredibly powerful. suppose you know that you're accumulating lots of, say, ScriptObjects. In a large application, that knowledge is good, but you are quickly faced with the question "which of my 100 different types of ScriptObject allocations is the culprit ?" And even knowing the name of the function which allocates the objects is sometimes not enough: in our case we have a wrapper to allocate an object of a specified type and automatically add it to MissionCleanup, so almost everything is allocated by that. What you really want is the entire callstack that resulted in the allocation. Writing this infrastructure took about two man-days, but once we had it, it became a snap to zero in on the problem.
* cleaned up memory allocation in TGE. Stock TGE comes with many one-time memory leaks, ie memory which is allocated once during a session and not freed upon shutdown. this makes it difficult to use some standard profiling tools. we invested about a man-week cleaning up the C-side for this, and it paid off.
* write lots of good log statements. what's "good" ? "good" is information that helps you figure out what went wrong over the weekend by looking at the log. generally this means a callstack and all the pertinent local variables. it's sort of an art form deciding when to dump info to the log and when not to. in general our approach is to be liberal with putting info into the log, and then as the product matures we see which statements weren't really interesting after all and remove them. be sure to keep the log clean. We use four levels of log messages: "debug" is for info that's interesting during development, but not in production. We don't record these entries in production. "info" is for stuff that's generally interesting, such as "user X just got N points" or "user X logged out". "warning" is for stuff that is out of the ordinary, but not technically wrong. and finally "error" means that something is wrong and should be addressed. It takes a lot of self-discipline to keep the "error" channel noise-free, but it's valuable to do so: an "error" either needs to be addressed, or it's just a warning. here's a typical "warning"-level log line:
[3/9/2009 09:51:58][Wrn][General] userPropertiesMgr::requestProperties() - Requesting user properties when we've already got them. user "orion" userPropertiesMgr::requestProperties() <- ValidateRequest::onDone() <- CURLObject::onDonePreDelay()
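For illustration, here is a minimal C++ sketch of the four-level scheme and of the line layout shown above. It is not vSide's logging code: the names LogLevel, levelTag, formatLogLine and shouldRecord are hypothetical, and only the "Wrn" tag comes from the sample line; the other three tags are guesses.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical severity levels mirroring the four levels described in the post.
enum class LogLevel { Debug, Info, Warning, Error };

// Short tag printed in the log line. "Wrn" matches the sample line;
// "Dbg", "Inf" and "Err" are assumed for illustration.
const char* levelTag(LogLevel level) {
    switch (level) {
        case LogLevel::Debug:   return "Dbg";
        case LogLevel::Info:    return "Inf";
        case LogLevel::Warning: return "Wrn";
        case LogLevel::Error:   return "Err";
    }
    return "???";
}

// Builds one entry: [timestamp][level][category] message, followed by the
// script callstack, innermost function first, frames joined by " <- ".
std::string formatLogLine(const std::string& timestamp,
                          LogLevel level,
                          const std::string& category,
                          const std::string& message,
                          const std::vector<std::string>& callStack) {
    std::ostringstream out;
    out << '[' << timestamp << "][" << levelTag(level) << "]["
        << category << "] " << message;
    for (std::size_t i = 0; i < callStack.size(); ++i)
        out << (i == 0 ? " " : " <- ") << callStack[i];
    return out.str();
}

// "debug" entries are dropped in production; everything else is recorded.
bool shouldRecord(LogLevel level, bool production) {
    return !(production && level == LogLevel::Debug);
}
```

Calling formatLogLine with LogLevel::Warning, the category "General" and the three frames from the sample reproduces the layout of the line above. The timestamp is passed in as a string so the sketch stays independent of any clock API.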
in summary, TGE has been pretty robust !
afaik, all this stuff should apply to other Torque Engine flavours which use the same SimObject and TorqueScript structures, which i believe includes TGEA, TGB, and i expect T3D. Not sure about TX and the iPhone versions.
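The two leak-hunting tools from the list above (the per-class census and the allocation tracker) can be sketched roughly as follows. This is a standalone C++ model under stated assumptions, not TGE or vSide code: Node stands in for a SimObject or SimGroup, and census, AllocationTracker, onAllocate, onFree and liveByCallStack are hypothetical names. The real census was a TorqueScript function walking RootGroup, and the real tracker hooked the engine's SimObject allocation.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a SimObject. A SimGroup is modelled as a node
// that owns child nodes; a plain object simply has no children.
struct Node {
    std::string className;
    std::vector<std::unique_ptr<Node>> children;
};

// Census: recursively walk the tree from the given node and count how many
// live objects of each class exist. Recursion depth equals nesting depth.
void census(const Node& node, std::map<std::string, int>& counts) {
    ++counts[node.className];
    for (const auto& child : node.children)
        census(*child, counts);
}

// Allocation tracker: remembers, for every live object id, the script
// callstack (here just a string) that allocated it.
class AllocationTracker {
public:
    // Called where an object is created.
    void onAllocate(int id, const std::string& callStack) { live_[id] = callStack; }

    // Called where an object is deleted; unknown ids are ignored.
    void onFree(int id) { live_.erase(id); }

    // Live objects grouped by allocating callstack. The largest bucket is
    // the first place to look when a class keeps accumulating.
    std::map<std::string, int> liveByCallStack() const {
        std::map<std::string, int> buckets;
        for (const auto& entry : live_)
            ++buckets[entry.second];
        return buckets;
    }

private:
    std::map<int, std::string> live_;
};
```

The census tells you which class is growing between two log dumps. The tracker then tells you which callstack is responsible, which matters when nearly every allocation goes through one shared wrapper function.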
#2
03/09/2009 (1:48 pm)
wow! those are huge numbers. The max SimID I've seen in my project is about 2.5 mil, while the max connection count (for our login/chat server) is about 50k.
Very interesting read, lots of good ideas and hints, thanks for sharing Orion!
#3
03/09/2009 (1:50 pm)
@Orion - this is great food for thought for those of us working on MMOs. Thanks for the notes and suggestions!
#4
03/09/2009 (4:29 pm)
@Orion: Really good info! I've added it to the MMO sticky for posterity ;)
#6
03/09/2009 (4:45 pm)
thanks guys!
re cleaning up stock TGE memory de-allocation,
there was a forum thread in late 2008 i think where someone went through and did the same thing, detailing all the places. i'd love to link to that but haven't been able to find it. anyone know where it is ?
#7
03/10/2009 (12:30 am)
I just wanted to say, over the years i have found your(Orion) resources to be so very helpful, often saving me 100's of hours of work, and offering a fresh creative view on subjects i had already invested my own hours of work (sometimes) without success. Anyone that do not know Orion's contributions, should really check out his many community supporting efforts.
#8
03/10/2009 (1:29 am)
Hi Orion,
and thanks for all your feedback.
I think the thread you're looking for is this : www.garagegames.com/community/forums/viewthread/80267
Nicolas Buquet
www.buquet-net.com/cv/
#9
03/10/2009 (7:25 am)
"just wanted to say, over the years i have found your(Orion) resources to be so very helpful, often saving me 100's of hours of work, and offering a fresh creative view on subjects i had already invested my own hours of work (sometimes) without success. Anyone that do not know Orion's contributions, should really check out his many community supporting efforts."
I just wanted to echo this for agreement. I particularly like Orion's script array resource, it really helped me understand how a traditional array relates to how TorqueScript handles arrays.
Thanks a bunch!
#10
03/10/2009 (12:35 pm)
Thanks Orion. It's really interesting to see technical / performance stats like this on a product that's been "released into the wild". I wish more developers did this!
#11
03/10/2009 (2:45 pm)
Nicolas - indeed that's the one, thanks for finding it!
All - many thanks for the compliments! GG's community is one of the best i've ever encountered, so it's nice to be able to give back to it when i can.
#12
03/10/2009 (4:10 pm)
You are the Man, O.
Employee Michael Perry
GarageGames