Network monkey assistance needed - apply within
by Dusty Monk · in Torque Game Engine Advanced · 06/15/2009 (2:37 pm) · 7 replies
So I'm stumped. I'm putting this out here in the hopes that one of the TNL guru's might be able to point me in a better direction to go it. Here's the sitrep:
Currently working in the single player server & virtual client set up with TGEA 1.7.1. So no real clients, just the single app running both server and client in the same exe.
I'm attempting to debug a problem where I get a noticeable hitch when I spawn AI players. That is, the entire game freezes for a moment whenever a mob spawns. I started with profiling the app in dev studio, but nothing simwise appeared near the top -- everything is pretty much render related, which I would expect. Though I *am* interested in why we spend so much time in TSMesh::SkinMesh, but that's a different story. I then turned on the in-game profiling stuff, and again, all the main culprits appear to be in render related functions.
So I finally narrowed it down to the pause being when the actual AI player is created (the new AIPlayer code in script), and I noticed that there was a 200 ms delay between when the player is spawned on the server, and when the ghost creation code on the client creates it. So this has me thinking it's something with the networking. I pulled over the netgraph code from TGE, and added some max tracking to it. (Max Bits Received, Max Latency, etc). My usual latency runs around 37 ms. When the AI player is spawned, it peaks to 150+ ms. Now I also hit a max bitsReceived of around 500 bits at that time, but I don't think it's just packUpdate vs. unpackUpdate. Several times during gameplay, I often hit my peak bitsReceived of around 500, and at no time do I get the latency spikes like this.
So I could use some help on where to go to next to track down why I'm getting a 150ms spike in latency in a single player game during mob spawns. My player packUpdate has grown considerably, but like I said I don't think it's just that. Does it have to do with ghost creation? Is there an easy way for me to tell if the delay is on the server commit side, or on the client recieve side?
Any suggestions would be greatly appreciated.
Dusty Monk
Windstorm Studios
Currently working in the single player server & virtual client set up with TGEA 1.7.1. So no real clients, just the single app running both server and client in the same exe.
I'm attempting to debug a problem where I get a noticeable hitch when I spawn AI players. That is, the entire game freezes for a moment whenever a mob spawns. I started with profiling the app in dev studio, but nothing simwise appeared near the top -- everything is pretty much render related, which I would expect. Though I *am* interested in why we spend so much time in TSMesh::SkinMesh, but that's a different story. I then turned on the in-game profiling stuff, and again, all the main culprits appear to be in render related functions.
So I finally narrowed it down to the pause being when the actual AI player is created (the new AIPlayer code in script), and I noticed that there was a 200 ms delay between when the player is spawned on the server, and when the ghost creation code on the client creates it. So this has me thinking it's something with the networking. I pulled over the netgraph code from TGE, and added some max tracking to it. (Max Bits Received, Max Latency, etc). My usual latency runs around 37 ms. When the AI player is spawned, it peaks to 150+ ms. Now I also hit a max bitsReceived of around 500 bits at that time, but I don't think it's just packUpdate vs. unpackUpdate. Several times during gameplay, I often hit my peak bitsReceived of around 500, and at no time do I get the latency spikes like this.
So I could use some help on where to go to next to track down why I'm getting a 150ms spike in latency in a single player game during mob spawns. My player packUpdate has grown considerably, but like I said I don't think it's just that. Does it have to do with ghost creation? Is there an easy way for me to tell if the delay is on the server commit side, or on the client recieve side?
Any suggestions would be greatly appreciated.
Dusty Monk
Windstorm Studios
About the author
Dusty Monk is founder and president of Windstorm Studios, an independant game studio. Formerly a sr. programmer at Ensemble Studios, Dusty has worked on AAA titles such as Age of Empires II & III, and Halo Wars.
#2
I went ahead and printed out getPosition on the stream, and sure enough, my bit position flies out to 499 when I spawn a new player.
Can you point me to where the code is that makes the decision to split the packets in two? This 100ms delay between splitting packets sounds *exactly* like what's going on.
I also thought I understood that if tge/a split your packets up, then you were screwed, because there wasn't really functinality to stitch them back together on the client side. I guess I assumed because I seem to be getting everything I've sent, that it wasn't splitting the packets..
06/15/2009 (3:35 pm)
So it looks like getStreamSize just returns the size of the packet buffer, which on the server send side is always 450 -- though I'm not sure if that's bits or bytes. I went ahead and printed out getPosition on the stream, and sure enough, my bit position flies out to 499 when I spawn a new player.
Can you point me to where the code is that makes the decision to split the packets in two? This 100ms delay between splitting packets sounds *exactly* like what's going on.
I also thought I understood that if tge/a split your packets up, then you were screwed, because there wasn't really functinality to stitch them back together on the client side. I guess I assumed because I seem to be getting everything I've sent, that it wasn't splitting the packets..
#3
But I still can't track down the code where the splitting of the packets happens.
In any event, either the delay is not the splitting of the packet, it's something else, or the packet is still getting split.
Dusty
06/15/2009 (4:16 pm)
So just for shits and giggles, I updated my packet size from 450 to 700 bits. No, I don't think it's a good idea to leave it this way, but I just wanted to see if this would allow it to send the packet across in one fell swoop, and not force the packet split delay. At the time of th mob spawn, my largest packet set was 640 bits, underneath the packet size. However, I still got the giant pause and latency spike. I'm wondering if the decision to split the packet is not based on NetRate::packetSize, but on some arbitary constant -- like 500. But I still can't track down the code where the splitting of the packets happens.
In any event, either the delay is not the splitting of the packet, it's something else, or the packet is still getting split.
Dusty
#4
Since you're dealing with ghosts here, I would check the following part:
If the statement that checks BitStream::isFull () hits frequently, I would guess the ghost manager is splitting up packets, and in this case your AIPlayer creation will be scheduled for the next tick.
Is there a particular reason you need to cram so much info into these packets and are you sure you couldn't do this elsewhere that wouldn't disturb the simulation?
06/16/2009 (11:14 am)
That's definatly too much, but I'm sure you know that already.Since you're dealing with ghosts here, I would check the following part:
void NetConnection::ghostWritePacket(BitStream *bstream, PacketNotify *notify)
...
for(i = mGhostZeroUpdateIndex - 1; i >= 0 && !bstream->isFull(); i--)
{If the statement that checks BitStream::isFull () hits frequently, I would guess the ghost manager is splitting up packets, and in this case your AIPlayer creation will be scheduled for the next tick.
Is there a particular reason you need to cram so much info into these packets and are you sure you couldn't do this elsewhere that wouldn't disturb the simulation?
#5
lol yeah no it's admittedly a bit ugly right now. This is very much a case of a tale that's grown in the telling. I've added quite a bit to player, for the sake of the human player, and because AIplayer is derived, it's been burdened down with those additions. I'm definitely overdue to go back and refactor both the player packUpdates and AIPlayer packUpdates, but before I spent a bunch of time doing that, I wanted to verify that this is, indeed, where the hitches are coming from.
Thanks for taking the time to answer this, and I'll post up what I find out, for the sake of completeness and anyone else that might benefit.
Dusty
Windstorm Studios
06/16/2009 (12:15 pm)
Awesome. I'll put a breakpoint at that point and see if my packets are getting split up.. Quote:
Is there a particular reason you need to cram so much info into these packets and are you sure you couldn't do this elsewhere that wouldn't disturb the simulation?
lol yeah no it's admittedly a bit ugly right now. This is very much a case of a tale that's grown in the telling. I've added quite a bit to player, for the sake of the human player, and because AIplayer is derived, it's been burdened down with those additions. I'm definitely overdue to go back and refactor both the player packUpdates and AIPlayer packUpdates, but before I spent a bunch of time doing that, I wanted to verify that this is, indeed, where the hitches are coming from.
Thanks for taking the time to answer this, and I'll post up what I find out, for the sake of completeness and anyone else that might benefit.
Dusty
Windstorm Studios
#6
Just avoid packing strings like the plague and you should be fine, mostly.
Although the more of those objects you have, the higher the strain. I won't say it's unmanagable, but I wouldn't want to be in your shoes testing it. ;)
06/16/2009 (4:27 pm)
Hehe, no problem. It's refreshing to see some questions about the Torque networking, and it's an interesting (and vast!) subject.Just avoid packing strings like the plague and you should be fine, mostly.
Although the more of those objects you have, the higher the strain. I won't say it's unmanagable, but I wouldn't want to be in your shoes testing it. ;)
#7
In digging, and logging, I learned quite a bit about how the ghosting of objects work within Torque. I learned that it's actually quite easy (and common) for the number of ghost updates, combined with the events updates, to exceed the size of your packet on any given update. In this case, Torque fits the maximum number of ghosts it can in the update, sorted by priority, and sends what it can. What didn't get sent though, doesn't necessarily get sent later. Torque will update and keep track of how many updates any particular ghost has been skipped, but each update it looks at the entire list, resorts it, and only sends what it can. That means low priority items could be skipped indefinitely.
I also learned that in the Local Client configuration, when the server is ready to "send over the packet", it fills up the buffer, and then call's ProcessRawPacket from within the send_packet call. ProcessRawPacket, in turn, calls read_packet, which in turn reads the ghost objects, creates new objects as needed, and unpacks the updates.
This means that what you would think of as the clientside operation of reading and unpacking updates *actually* happens as a part of the processServer functionality, not as a part of the processClient functionality, as you might think. Again, this is only in the localClient configuration (ie., single player mode).
As for my problem, it was two-fold. One was really the amount of data I was replicating for the player had just grown way too large. And the vast majority of it was intended for the human player, but because AI player was a derivativ of player, it was sending and reading all this data as well. I took the hacky approach and wrapped those things in "isAIPlayer" conditionals, because I'm workign on a prototype and all of this is subject to being tossed. The right way to fix that, imho, would be to create HumanPlayer and AIPlayer components, each with their own pack and unpack functionality, and attach those things to Player as appropriate. In any event, some brutal slashing resulted in the pack and unpack for AIPlayers going from around 2200 bits to just around 600 bits.
The second thing that was wrong was more insideous. My only defense was that I wrote the offending code over 3 years ago, when I was first learning the Torque engine. And at that time, I didn't care one whit about performance. In any event, because my player's have inventory, and I want the inventory to be viewed by other players, as well as to allow players to view merchant's inventory, inventory and equipment are replicated (ie., ghosted). Now, becuase I wanted UI elements holding the inventory to be updated dynamically when you recieved or got rid of items, I, erm, put code in that refreshed the UI on the client side whenever an inventory state change was detected. Fast forward 3 years, when the number of UI elements has more than tripled, as has the amount of data being sent across the wire. In effect, whenever an AI object was spawned, the player's entire UI was being refreshed. As you can imagine, this is "not fast".
The moral of this story is kids, DO NOT update your entire UI inside of unpackUpdate!
Thanks again to Stefan Lundmark for putting me at the write spot in the code to begin my search.
Cheers!
Dusty Monk
Windstorm Studios
06/17/2009 (12:24 pm)
So I'm happy to say this bug has been found and ferreted out. The problem lay entirely within my code (doesn't it almost aways?), and as is often the case with these things, wasn't one single thing, but a number of things. In digging, and logging, I learned quite a bit about how the ghosting of objects work within Torque. I learned that it's actually quite easy (and common) for the number of ghost updates, combined with the events updates, to exceed the size of your packet on any given update. In this case, Torque fits the maximum number of ghosts it can in the update, sorted by priority, and sends what it can. What didn't get sent though, doesn't necessarily get sent later. Torque will update and keep track of how many updates any particular ghost has been skipped, but each update it looks at the entire list, resorts it, and only sends what it can. That means low priority items could be skipped indefinitely.
I also learned that in the Local Client configuration, when the server is ready to "send over the packet", it fills up the buffer, and then call's ProcessRawPacket from within the send_packet call. ProcessRawPacket, in turn, calls read_packet, which in turn reads the ghost objects, creates new objects as needed, and unpacks the updates.
This means that what you would think of as the clientside operation of reading and unpacking updates *actually* happens as a part of the processServer functionality, not as a part of the processClient functionality, as you might think. Again, this is only in the localClient configuration (ie., single player mode).
As for my problem, it was two-fold. One was really the amount of data I was replicating for the player had just grown way too large. And the vast majority of it was intended for the human player, but because AI player was a derivativ of player, it was sending and reading all this data as well. I took the hacky approach and wrapped those things in "isAIPlayer" conditionals, because I'm workign on a prototype and all of this is subject to being tossed. The right way to fix that, imho, would be to create HumanPlayer and AIPlayer components, each with their own pack and unpack functionality, and attach those things to Player as appropriate. In any event, some brutal slashing resulted in the pack and unpack for AIPlayers going from around 2200 bits to just around 600 bits.
The second thing that was wrong was more insideous. My only defense was that I wrote the offending code over 3 years ago, when I was first learning the Torque engine. And at that time, I didn't care one whit about performance. In any event, because my player's have inventory, and I want the inventory to be viewed by other players, as well as to allow players to view merchant's inventory, inventory and equipment are replicated (ie., ghosted). Now, becuase I wanted UI elements holding the inventory to be updated dynamically when you recieved or got rid of items, I, erm, put code in that refreshed the UI on the client side whenever an inventory state change was detected. Fast forward 3 years, when the number of UI elements has more than tripled, as has the amount of data being sent across the wire. In effect, whenever an AI object was spawned, the player's entire UI was being refreshed. As you can imagine, this is "not fast".
The moral of this story is kids, DO NOT update your entire UI inside of unpackUpdate!
Thanks again to Stefan Lundmark for putting me at the write spot in the code to begin my search.
Cheers!
Dusty Monk
Windstorm Studios
Torque Owner Stefan Lundmark
The skinning in TGE/TGEA is ancient. :P
Well that would mean that you're going to have to wait until the next update happens, and that could mean 100 ms delay. Are you sure that you're not stuffing too much into your packets?
Call getStreamSize () on the BitStream after you've packed everything and see if Torque is splitting packets up because you hit the maximum size.