Torque networking, applied to VOIP
by Anthony Lovell · in Torque Game Engine · 02/04/2004 (9:18 am) · 92 replies
I have made some solid progress at adding VOIP into Torque (I can sample the microphone and compress it using Speex... code should work on Mac and Windows, and I could explore Unix). I intend to eventually share my work with the community when I have it ready, but now I need to learn much more about Torque -- primarily, its networking gestalt.
For my first implementation, I would like to just use Torque's unreliable UDP platformNet to send streaming voice packets through the server to only those other clients who are near enough to hear the speaker (a simple distance limit). Then, on each of those clients, I'd like them to hear the sound emanate from the position of the speaking player.
I had been thinking that my means of coding this was to be a direct use of Net::sendto(), but another idea occurs to me:
Is the proper way to do this to add the mouth of the Player as part of the data state that the network layer is supposed to disseminate to other clients, much like his orientation in the world? In such a model, a non-talking player would have no state to transmit to the net, but when my microphone layer hands me some compressed audio, I'd hand it to my local Player instance's mouth, and trust that the neat differencing logic could then realize that it needs to go out to the net. If this is the approach, would I have to take special measures to ensure you could hear people behind you (as they are out of camera scope -- I take it this is different than setting scopeAlways, as the distance is still pertinent)?
Also, on the receive side, would the key be to add an AudioEmitter (or a subclass of one) to the mouth of the speaker? Is there any existing high-level support for audio which is provided on a frame-by-frame basis, and not from a soundfile resource?
Thanks in advance for any counsel you can provide.
tone
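The distance-limited relay described above can be sketched engine-free. Everything below (`ClientInfo`, `selectListeners()`, the hearing-radius value) is illustrative, not Torque API — just the shape of the server-side decision about who hears a speaker:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch: the server relays a voice packet only to clients
// whose avatars are within hearing range of the speaker.
struct Vec3 { float x, y, z; };

static float distance(const Vec3& a, const Vec3& b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

struct ClientInfo {
    int id;
    Vec3 avatarPos;
};

// Returns the ids of clients that should receive the speaker's audio.
std::vector<int> selectListeners(const std::vector<ClientInfo>& clients,
                                 int speakerId, Vec3 speakerPos,
                                 float hearingRadius) {
    std::vector<int> out;
    for (const ClientInfo& c : clients) {
        if (c.id == speakerId) continue;  // never echo back to the speaker
        if (distance(c.avatarPos, speakerPos) <= hearingRadius)
            out.push_back(c.id);
    }
    return out;
}
```

Note this check is independent of camera scoping: a player standing behind you passes the distance test even though he is out of view, which is exactly the behind-you case raised above.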
#2
02/04/2004 (9:42 am)
Anthony, in the case of Speex on Linux, I managed to compile it like a normal Torque lib (e.g. zlib, lpng, ljpeg), and have it compile fine on all the platforms GCC supports.
As for the microphone input, how are you getting the audio? Last time I checked, OpenAL doesn't appear to have an interface for that =/
I myself experimented with sending Speex-encoded data down Torque's networking system (via unreliable network packets), and playing the sound down the line with a StreamingSource. However, I must've gone wrong somewhere, since it was very buggy -- which is why I never got round to the microphone input :)
#3
02/04/2004 (11:08 am)
I am using PortAudio (www.portaudio.com) for input. Not a library, per se, but a code project from which each platform draws a subset of files to create a common interface for sound I/O. Naturally, I am only using the input side. It is working fine for me on OS X so far, and I am going to try to make it go on XP next. The code you write around these drop-in elements looks like this in my case, and this is a nearly complete listing:
int
myPortAudioCallback(void *inputBuffer,
                    void *outputBuffer,
                    unsigned long framesPerBuffer,
                    PaTimestamp outTime, void *userData)
{
    Vox *talk = (Vox *)userData;
    // since we never pass output through PortAudio, we should only be handling input
    if (inputBuffer == NULL) {
        Con::printf("portAudioCallBack has null input buffer");
        return 0;
    }
    // hand the encoder the raw samples (length is in bytes: mono, 16-bit)
    talk->getVoxEncoder()->queueForEncode((short *)inputBuffer, framesPerBuffer * sizeof(short));
    return 0;
}

int
Vox::initInstance()
{
    mEncoder = new VoxEncoder(myVoxEncoderCallback);
    PaError paErr = Pa_Initialize();
    if (paErr != paNoError) {
        Platform::AlertOK("Pa_Initialize failed", Pa_GetErrorText(paErr));
        return -1;
    }
    paErr = Pa_OpenDefaultStream(
        &mPAStream,
        1,        // input channels (mono)
        0,        // output channels
        paInt16,  // 16-bit samples, matching the short * cast in the callback
        8000,     // sample rate
        mEncoder->getInputFrameSizeShorts(), // frames per buffer
        1,        // number of buffers, 0 == use minimum
        myPortAudioCallback, // our callback
        this);    // userData to be handed to callback
    if (paErr != paNoError) {
        Platform::AlertOK("Pa_OpenDefaultStream failed", Pa_GetErrorText(paErr));
        return -2;
    }
    return 0;
}

int
Vox::setIsTalking(bool b)
{
    PaError paErr;
    if (b == mIsTalking) {
        return 0;
    }
    if (b) {
        paErr = Pa_StartStream(mPAStream);
        if (paErr != paNoError) {
            Platform::AlertOK("Pa_StartStream failed", Pa_GetErrorText(paErr));
            return -1;
        }
    } else {
        // write the BitStream we are currently composing to the network
        flushVoxToNet();
        paErr = Pa_StopStream(mPAStream);
        if (paErr != paNoError) {
            Platform::AlertOK("Pa_StopStream failed", Pa_GetErrorText(paErr));
            return -2;
        }
    }
    mIsTalking = b;
    return 0;
}
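Since the PortAudio callback above runs on a real-time audio thread, the hand-off to `queueForEncode()` presumably needs a thread-safe buffer between the callback and the encoder. A minimal lock-based sample queue might look like this (a sketch under that assumption, not the actual VoxEncoder internals):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Sketch of a thread-safe FIFO the audio callback could push into and an
// encoder thread could drain; the real VoxEncoder queue may differ.
class SampleQueue {
public:
    explicit SampleQueue(size_t capacity) : mCapacity(capacity) {}

    // Called from the audio callback: append samples, dropping any overflow
    // (better to drop a little voice than to block the audio thread).
    size_t push(const short* samples, size_t count) {
        std::lock_guard<std::mutex> lock(mMutex);
        size_t room = mCapacity - mData.size();
        size_t n = count < room ? count : room;
        mData.insert(mData.end(), samples, samples + n);
        return n;
    }

    // Called from the encoder thread: take up to `count` samples.
    size_t pop(short* out, size_t count) {
        std::lock_guard<std::mutex> lock(mMutex);
        size_t n = count < mData.size() ? count : mData.size();
        std::copy(mData.begin(), mData.begin() + n, out);
        mData.erase(mData.begin(), mData.begin() + n);
        return n;
    }

private:
    std::mutex mMutex;
    size_t mCapacity;
    std::vector<short> mData;
};
```

A mutex in an audio callback is not strictly real-time safe; a lock-free ring buffer is the usual refinement, but the lock-based version shows the contract.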
#4
02/04/2004 (2:11 pm)
As I recall, Tribes2 implemented voice communication by means of unguaranteed events passed to the server, thence to the other clients. The clients could set who the voice should be sent to.
#5
02/04/2004 (2:42 pm)
Peer-to-peer voice comms (which is what it sounds like you are working on) is not a good idea, for a few reasons:
1. Dialup users don't have the bandwidth.
2. Not all broadband users have the bandwidth. Some, like the Charter customers around here, have asymmetrical pipes (upload is capped way low and counts against download speed). Most can only get <8 kb upstream before the downstream degrades to a trickle.
3. Every client has to do "server"-like calculations on who to send to and what not.
4. It is hackable: a griefer could hack the client to send to others regardless of whether they want to hear the dumbshit or not.
5. It would be way more overhead to synchronize all the mute/ignore settings instead of just letting the server decide who to send audio to and when.
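For perspective on the bandwidth objection, the numbers for Speex narrowband are small: at 8 kHz mono, a 20 ms frame is 160 samples, and at a nominal 8 kbps bitrate one talker costs about 1 KB/s of payload. A quick sanity check, with the per-packet overhead figure being an assumption rather than a measured Torque number:

```cpp
#include <cassert>

// Rough upstream cost of one continuous voice stream at typical
// Speex narrowband settings; packetOverhead is an assumed figure.
struct VoiceBudget {
    int sampleRate;       // Hz
    int frameSamples;     // samples per Speex frame
    int bitsPerSecond;    // codec bitrate
    int packetOverhead;   // assumed UDP/IP + engine header bytes per packet
    int framesPerPacket;  // frames bundled into one packet
};

// Bytes per second on the wire for one continuous talker.
int bytesPerSecond(const VoiceBudget& b) {
    double framesPerSec = (double)b.sampleRate / b.frameSamples;
    double payload = b.bitsPerSecond / 8.0;            // codec bytes/sec
    double packetsPerSec = framesPerSec / b.framesPerPacket;
    return (int)(payload + packetsPerSec * b.packetOverhead);
}
```

Bundling several frames per packet amortizes the header cost, which is why the server-reflected model (one upstream copy, server fan-out) fits even thin upload pipes better than peer-to-peer.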
#6
02/04/2004 (2:59 pm)
I am proposing first client-server (server reflects audio to those who should hear it), and if I have time from there I may go for a dynamic model similar to one I designed in the past.
I may look at the events, Ben.
tone
#7
03/13/2004 (7:30 am)
I am close to getting this working (or so it appears), but I need some pointers on how to handle the arriving speech events and stream their decoded audio out through a sound source located at the position of the speaking player.
Here are some topic areas where more information would assist me:
1. Do I dispatch within an event handler based on my NetEvent subclass's getClassName() or getClassId()? If the latter, what is the meaning of the U32 parameter passed to getClassId()?
2. How can I determine which player sent the event (i.e.: who is speaking?). Is that for me to encode within the data packed into my VoxEvent? If so, what means of reference (to the speaking person) is more convenient to employ, and what mapping function allows the receiving client to get a Player * from it?
3. Most importantly: do any of the sound output tools (ideally, a 3D positional audio source) support streaming audio? Which is best to use, and do I need to have a thread for each to keep its input buffer stuffed with sound data?
Thanks in advance for any help.
tone
#8
03/13/2004 (8:48 am)
You should check the process() method. Have you read the NetEvent docs? They should explain all this. The Ogg Vorbis code should give you a good example of streaming audio.
#9
03/13/2004 (9:49 am)
I wouldn't bother with the position info... I doubt it'll make a lot of sense. Just get it streaming in as a simple stream and that'll do ya. For instance, what possible use is there in hearing someone talking off "in the distance"? Either you want to hear them, or not.
#10
03/13/2004 (11:21 am)
@Phil: It's so you can implement voice chat in a 'realistic' fashion. Take an RPG, for instance: players whose characters are close enough ingame to hear one another can converse; those whose characters are too far away cannot.
@Anthony: Cool beans, this would be a very cool thing if you could get it working.
#11
03/13/2004 (11:31 am)
Ok, let me try to digest this and see where I get.
Though Phil's advice might eliminate a small amount of complexity, Jeff has my goal exactly ... I will want to code up "radio" style chat in future, but want first to code up what no one ever seems to deliver: chat modeled on sounds passing through air, so that proximity helps you determine which are most likely directed at you (and hence demand the most attention), working toward eventually having the speaking avatar lip-sync so that, again, a natural and proper mechanism can help identify the speaker (rather than glowing neon text).
I've looked at NetEvent a fair bit, but have been working from a Mac, and frankly it is very hard to navigate between C++ and header files (which are not in the Mac projects... would it hurt if I added them?). I see an mSourceId in the code, but it is not really used. It is just hard to divine whether that is a good means of identifying the speaker.
I'll look for process()... I wonder Which::process()?
tone
#12
03/13/2004 (12:04 pm)
Ahh gotcha ... MyEvent::process()
#13
03/13/2004 (12:54 pm)
This has always been an interesting feature to me, and I'm always on the lookout for info on this on the 'Net. I think a feature like this would add a lot to CRPGs, particularly CRPGs in the Neverwinter Nights model, where the aforementioned griefer problem is less salient because NWN-style CRPGs are often run invite-only.
Quote: Though Phil's advice might eliminate a small amount of complexity, Jeff has my goal exactly ... I will want to code up "radio" style chat in future but want first to code up what no one ever seems to deliver: chat modeled on sounds passing through air so that proximity helps you determine which are most likely directed at you (and hence demand the most attention),
I know what you mean about the "no one ever seems to deliver" bit. There are a number of features like this that would add so much to CRPGs that devs just don't seem interested in at all. I think it has a lot to do with the nature of the industry, but that's a whole other discussion. :)
Quote: working toward eventually having the speaking avatar lip-sync so that again, a natural and proper mechanism can help identify the speaker (rather than glowing neon text).
Wow, to me (a non-coder) that seems like quite a tall order, but I know exactly what you mean about trying to figure ways of identifying speakers without heavy-handed methods like highlighted text or glowing avatars, etc. I think one thing that would do a lot in this direction would be a good emote system. In other words, animated gestures as clues to who is speaking - head bobs, hand gestures (a lot of people 'talk with their hands' anyways), posture, etc.
edit: then again, pseudo-lip-synch wouldn't be hard, and it would go a long way towards identifying the speaker and it would add that extra bit of immersion.
This might not be strictly on-topic, but I've always thought a great way to make 'distance-smart' voice chat even more useful would be voice disguisers. I don't know how good they sound, but I know there are telephonic devices that allow people to disguise their voices so that women can sound like men and vice-versa, children can sound like adults, etc. It might go a long way towards identifying who is speaking when that 20 year old woman playing the 67 year old dwarf male actually sounds like an old male dwarf. This would be especially helpful to Dungeonmasters/Referees in the aforementioned NWN model, who have to roleplay as characters from a wide variety of races, genders, and cultures.
Anyhoo, as I said I'm not a coder so I can't really be of help to you Anthony, other than to again wish you good luck with this.
#14
03/14/2004 (4:20 pm)
I think that the best approach will be to create an additional network layer (like the events or ghosts layers) to send the voice traffic; using events would give priority to the voice over the ghosts. In other words, the engine would prefer to send a voice packet over a ghost update, which is bad anyway. :)
I can help you to create this layer if you want.
Cya
#15
03/15/2004 (12:04 pm)
Well, I'm just trying to coerce the event relaying system for now. Not having much luck, sadly.
The network layer only seems to deliver the NetEvent to the node that sent it! The other client on the test network is not receiving it. Can someone tell me a filename where I can see where these arrive at the server (and where the choices of who to relay them to are made)?
Lipsyncing is just an idea, mind you. I've used near-realtime lipsynching avatars in the past, and they require a lot of gnarly code -- essentially speech recognition to figure out the phonemes in the speech signal. Overall, I am more than willing to punt on such niceties in favor of visual indications, but do regard this as a longterm goal for such things. TF2's prototype showed this at E3 some 4 years ago!
tone
#16
03/15/2004 (12:08 pm)
I should post my code, just in case:
void
VoxEvent::process(NetConnection *conn)
{
    NetConnection *ourClient = NetConnection::getLocalClientConnection();
    if (mSourceId == ourClient->getId()) {
        // this printf fires off on the client of the person talking! nowhere else tho
        Con::printf("process VoxEvent from ourselves? srcId=%d numBytes=%d", mSourceId, mDataLengthInBytes);
        return;
    }
    SceneObject *talker = dynamic_cast<SceneObject*>(Sim::findObject(mSourceId));
    AssertFatal(talker != NULL, "Unable to cast source of VoxEvent to a SceneObject?");
    // TODO: actually add code here to decode and play the audio
    // through the location of the talker
}

And here is the point in my code from which I send out VoxEvents:
void
Vox::flushVoxToNet() {
    Mutex::lockMutex(mSendBufferMutex);
    if (mNumFramesInSendBuffer > 0) {
        NetConnection *conn = NetConnection::getServerConnection();
        VoxEvent *event = new VoxEvent(mSpeexSendBuffer, mNumBytesInSendBuffer);
        conn->postNetEvent(event);
        Con::printf("writing %d frames vox data to net", mNumFramesInSendBuffer);
    }
    mNumFramesInSendBuffer = mNumBytesInSendBuffer = 0;
    Mutex::unlockMutex(mSendBufferMutex);
}
#17
03/15/2004 (1:55 pm)
Anthony, the call flushVoxToNet() is only sending the voice data to the server. When you receive the event on the server, you should iterate through the ClientGroup, broadcasting the message to all the clients.
ex:
SimGroup* pClientGroup = Sim::getClientGroup();
for (SimGroup::iterator itr = pClientGroup->begin(); itr != pClientGroup->end(); itr++) {
    NetConnection* nc = static_cast<NetConnection*>(*itr);
    if (nc != NULL) {
        VoxEvent *event = new VoxEvent(mSpeexSendBuffer, mNumBytesInSendBuffer);
        nc->postNetEvent(event);
    }
}

The audio decompression and playing should only occur at the clients; the server should act only as a broadcaster.
#18
03/15/2004 (2:20 pm)
Thank you, Marcelo .. do I need to run that within process() only if I am the server? Or will the client group have zero members otherwise? Many thanks!
tone
#19
03/15/2004 (2:35 pm)
You should do the following in the event's process(): if you are on the server, broadcast the incoming data to the clients; otherwise, decompress and play the data. There's a function to check whether the current object instance is on the server -- I think it is isServerObject() -- but I'm not sure if this works on Events.
#20
03/16/2004 (7:58 am)
I'm sad to report that I'm at a near-total loss here. None of the other NetEvents marked as being sendable from client or server in the code tree perform logic like this (relaying to clients from within ::process()), and my code is going badly awry.
How are these other events relayed across the network?
tone
Here is my code at present... I think I am getting an infinite loop (probably because the server is in its own list of clients)? Is there not an example in the code tree of an event that is sent by one client, relayed to some or all of the others by the server, and then processed by the various clients?
void
VoxEvent::process(NetConnection *conn)
{
    // Con::printf("process VoxEvent from %d numBytes = %d", mSourceId, mDataLengthInBytes);
    // relay the event to any attached clients
    SimGroup* pClientGroup = Sim::getClientGroup();
    if (pClientGroup != NULL) {
        for (SimGroup::iterator itr = pClientGroup->begin(); itr != pClientGroup->end(); itr++) {
            NetConnection* nc = static_cast<NetConnection*>(*itr);
            if (nc != NULL) {
                VoxEvent *event = new VoxEvent(this);
                nc->postNetEvent(event);
            }
        }
    }
    // then handle it ourselves...
    // TODO: do not play own vox
    NetConnection *ourClient = NetConnection::getLocalClientConnection();
    if (mSourceId == ourClient->getId()) {
        Con::printf("process VoxEvent from ourselves? srcId=%d numBytes=%d", mSourceId, mDataLengthInBytes);
        return;
    }
    SceneObject *talker = dynamic_cast<SceneObject*>(Sim::findObject(mSourceId));
    // this fires off at some point
    AssertFatal(talker != NULL, "Unable to cast source of VoxEvent to a SceneObject?");
    Con::printf("got VoxEvent from srcId=%d numBytes=%d, name='%s'", mSourceId, mDataLengthInBytes, talker->getIdString());
}
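For what it's worth, the suspected loop above is consistent with two missing guards: relaying should only happen when the event is processed on the server, and the server should never post the event back to the connection it arrived on. Here is a simplified, engine-free model of that relay logic (the `Connection` type and `relayTargets()` helper are stand-ins, not Torque's NetConnection API):

```cpp
#include <cassert>
#include <vector>

// Engine-free model of relaying a voice event: the server forwards to every
// client except the one the event came from; clients never relay at all.
struct Connection { int id; };

std::vector<int> relayTargets(bool isServer,
                              const std::vector<Connection>& clientGroup,
                              int arrivedFromId) {
    std::vector<int> targets;
    if (!isServer)
        return targets;  // clients only play the audio; they never relay
    for (const Connection& c : clientGroup) {
        if (c.id == arrivedFromId)
            continue;    // skip the sender's connection: no loop, no echo
        targets.push_back(c.id);
    }
    return targets;
}
```

In the real event, the `conn` parameter of process() identifies the arriving connection, and Marcelo's suggested isServerObject()-style check (or an equivalent server-side test) would supply the `isServer` condition.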
Torque Owner Anthony Lovell
voxInitialize() must be called before any others, and returns 0 on success or an error code
voxStartTalking() opens the mike and permits the local user to transmit to other avatars near him in the game world
voxStopTalking() closes the transmit.
Binding a key to start/stop talking on keyDown/keyUp should give you a nice proximity voice chat, I hope.
Future features before I'd consider this truly complete:
1. add channel-based tuning, as is most commonly used to mimic use of game/team/squad radios
2. ability to mute obnoxious louts
3. ability to assign begin-xmit and end-xmit sound effects, to mimic radios
4. ability to have 3D objects that function as the audio source of tuned reception (e.g.: a radio)
5. (far future) facial animation of speakers for lip-syncing.
Not on the list:
1. voice-activated transmission. It just creates more problems than it solves, IMO, but someone else could probably hook it in.
tone
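The keyDown/keyUp binding described above amounts to a small push-to-talk state machine. A sketch, with `keyDown()`/`keyUp()` standing in for calls to voxStartTalking()/voxStopTalking() (the counters exist only so the behavior is checkable):

```cpp
#include <cassert>

// Minimal push-to-talk model: the mike opens on the first keyDown and
// closes on keyUp; repeated keyDown events (key auto-repeat) are ignored.
class PushToTalk {
public:
    void keyDown() {
        if (!mTalking) { mTalking = true; ++mStarts; }  // open the mike once
    }
    void keyUp() {
        if (mTalking) { mTalking = false; ++mStops; }   // flush and close
    }
    bool talking() const { return mTalking; }
    int starts() const { return mStarts; }
    int stops() const { return mStops; }
private:
    bool mTalking = false;
    int mStarts = 0, mStops = 0;
};
```

Guarding against auto-repeat matters because keyDown typically fires repeatedly while the key is held, and reopening the stream each time would stutter the capture.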