Torque networking, applied to VOIP
by Anthony Lovell · in Torque Game Engine · 02/04/2004 (9:18 am) · 92 replies
I have made some solid progress at adding VOIP into Torque (I can sample the microphone and compress it using Speex... code should work on Mac and Windows, and I could explore Unix). I intend to eventually share my work with the community when I have it ready, but now I need to learn much more about Torque -- primarily, its networking gestalt.
For my first implementation, I would like to just use Torque's unreliable UDP platformNet to send streaming voice packets through the server to only those other clients who are near enough to hear the speaker (a simple distance limit). Then, on each of those clients, I'd like them to hear the sound emanate from the position of the speaking player.
I had been thinking that my means of coding this would be a direct use of Net::sendto(), but another idea occurs to me:
Is the proper way to do this to add the mouth of the Player as part of the data state that the network layer is supposed to disseminate to other clients, much like his orientation in the world? In such a model, a non-talking player would have no state to transmit to the net, but when my microphone layer hands me some compressed audio, I'd hand it to my local Player instance's mouth, and trust that the neat differencing logic could then realize that it needs to go out to the net. If this is the approach, would I have to take special measures to ensure you could hear people behind you (as they are out of camera scope -- I take it this is different than setting scopeAlways, as the distance is still pertinent)?
Also, on the receive side, would the key be to add an AudioEmitter (or a subclass of one) to the mouth of the speaker? Is there any existing high-level support for audio which is provided on a frame-by-frame basis, and not from a soundfile resource?
Thanks in advance for any counsel you can provide.
tone
#42
07/15/2005 (11:44 am)
This is looking really good. Keep it up. I could really use something like this.
#43
tone
07/29/2005 (10:09 am)
I am still working on voice, but have shelved VOIP to await the delivery of TGE 1.4 functionality that should assist in the playback of streamed audio. In the interim, I have been bringing up the basic function of speech recognition, which my game will employ side-by-side with VOIP. Alas, the speech recognizer I am using is a commercially licensed one (and POWERFUL), but I should be able to deliver my framework for feeding it.
tone
#44
07/29/2005 (1:14 pm)
Sounds interesting. So you are saying that you will have voice commands, such as saying "Attack my current target," and then your guards, so to speak, would start attacking it. Or you could say "Red 1, flank my target from the left," and off it would trot to do that?
#45
tone
07/29/2005 (2:17 pm)
Yes, in essence. Basic speech recognition may be possible with open-source recognizers, but I am hoping to use quite advanced grammar features, and this will require not only a good recognizer (so far, I'm happy with the one I'm testing) but also a fairly sound AI layer to process the recognized tokens and act accordingly. The example commands you cite illustrate fairly demanding tasks, especially the latter one (it is addressed to certain individuals only -- Red 1; expresses an action -- a flanking maneuver; and specifies a variation on that action -- from the left).
In my favor, I have a prototype application that works on many of the same expressions as I hope to employ (though I was using a typing chat and parsing the text to "recognize" the commands). However, there is always something more sophisticated to reach for. I'll try not to hang myself with the rope.
I think I'll have to concentrate on how the AI players provide feedback, and whether or not the dialog model will have transactions more complicated than one person ordering another to do something (for instance: will AI men ever ask a question of another player or AI man? Will a person react differently if one person says "reload" vs. another person asking the same thing? If you walk into a room and ask what time it is, what will prevent all the people within from blurting out the answer in an annoying chorus?)
I hope to use Torque Script for most of my AI. In my app, AI is actually a low-compute endeavor, as I don't have AI avatars who path-find or do many sophisticated tasks, and I will be coding up a very small set of primitives for them in C++.
tone
#46
08/01/2005 (7:02 am)
All I can say is :0 wow.
#47
09/01/2005 (12:11 pm)
How's it coming along?
#48
tone
09/01/2005 (1:07 pm)
Hi Steven... checking in, eh?
I am letting VOIP wait until v1.4 RC2 comes out, on the vague promise that it might have a streaming audio class (real streaming, as in it just asks you for more data when it needs more). If RC2 does not have it (or, say, 2 months go by), I might start to write such a class myself. I am slowly getting better at TGE internals.
My present priority is to make it so SceneObjects can be "children" of other SceneObjects, so they will attach to them and follow them around. This will be helpful to support having a speaker's voice come out of his avatar's "mouth".
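The attachment math itself is small: each tick, recompute the child's world position from the parent's transform plus a local mount offset. A minimal sketch (hypothetical names, yaw-only orientation about Z for brevity; TGE's MatrixF would handle the general case):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-ins, not TGE's SceneObject API.
struct Vec3 { float x, y, z; };

// World position of a child mounted at localOffset on a parent whose
// orientation is a yaw (rotation about the Z axis).
inline Vec3 childWorldPos(const Vec3& parentPos, float parentYaw, const Vec3& localOffset)
{
    const float c = std::cos(parentYaw);
    const float s = std::sin(parentYaw);
    Vec3 p;
    p.x = parentPos.x + c * localOffset.x - s * localOffset.y; // rotate offset
    p.y = parentPos.y + s * localOffset.x + c * localOffset.y; // into world space
    p.z = parentPos.z + localOffset.z;                         // then translate
    return p;
}
```

Re-positioning the emitter with this result each tick would keep the voice tracking the avatar's head.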
So .. bottom line... I'm stalled, but not giving up on this concept. If I give up on it, it will be because I have given up hope of making my game on TGE.
tone
#49
09/01/2005 (3:05 pm)
Thanks for the update.
#50
09/12/2005 (2:17 pm)
Hey, with TGE 1.4 RC2 out, have you made any progress? :) Can you tell I'm a bit eager about the project you're working on?
#51
tone
09/13/2005 (6:59 am)
I had not even noticed it was released. I will look into it, but frankly I have wavered on my commitment to TGE and am presently exploring a different engine.
If worst comes to worst, I will present what I have and leave it to others to drive home the implementation.
tone
#52
09/13/2005 (7:16 am)
I would definitely be interested in your work. I hope that you come back soon. I too wavered and went to a different engine, but I'm back now.
#53
tone
09/13/2005 (8:28 am)
Is there a changelog that details what is actually new in 1.4 RC2 as regards audio output?
The datablock-centric nature of TGE objects continues to baffle me.
I simply need an AudioEmitter that I can install a simple audio source for, in the form
AudioEmitter::setStreamingAudioSource(StreamingAudioSource *s);
and which will ask the source for more data when it requires it... with a simple callback
class StreamingAudioSource
{
    enum { // all formats MONO
        FMT_LINEAR16BIT,
        FMT_FLOAT16BIT
    };
    enum {
        RATE_8000,
        RATE_16000
    };

    // returns how many samples were supplied
    // zero == none to be had
    // -1 == unsupported format
    // -2 == unsupported rate
    int supplyAudioSamples(void *putItHerePlease, int channels, int format, int rate, int noMoreThanThisManySamples);
};
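To make the contract concrete, here is a compilable sketch of that callback with a trivial source plugged in. Everything here is hypothetical; the names mirror the proposal above, not any existing TGE class:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Format and rate tokens from the proposed interface (hypothetical).
enum { FMT_LINEAR16BIT, FMT_FLOAT16BIT };
enum { RATE_8000, RATE_16000 };

class StreamingAudioSource {
public:
    virtual ~StreamingAudioSource() {}
    // Returns how many samples were supplied; 0 = none to be had,
    // -1 = unsupported format, -2 = unsupported rate.
    virtual int supplyAudioSamples(void* dst, int channels, int format,
                                   int rate, int maxSamples) = 0;
};

// Simplest possible source: hands out 16-bit mono silence on demand.
class SilenceSource : public StreamingAudioSource {
public:
    int supplyAudioSamples(void* dst, int channels, int format,
                           int rate, int maxSamples)
    {
        if (format != FMT_LINEAR16BIT) return -1;          // only linear PCM here
        if (rate != RATE_8000 && rate != RATE_16000) return -2;
        if (channels != 1) return -1;                      // all formats are mono
        std::memset(dst, 0, maxSamples * sizeof(int16_t)); // "generate" silence
        return maxSamples;
    }
};

// Convenience wrapper used by the checks below: pull n samples into a
// scratch buffer and report the return code.
inline int pullFromSilence(int format, int rate, int n)
{
    static int16_t buf[256];
    SilenceSource s;
    return s.supplyAudioSamples(buf, 1, format, rate, n);
}
```

A VOIP source would do the same thing, except it would drain a jitter buffer of decoded Speex frames instead of zero-filling.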
tone
#54
09/13/2005 (9:21 am)
Quote:
The datablock-centric nature of TGE objects continues to baffle me.
Datablocks are an integral part of why Torque networking is so fast and optimized. They allow us to deliver very large amounts of static data only once during the entire play session, and at a time where the user is more psychologically willing to accept net delays (mission startup).
#55
09/13/2005 (10:06 am)
Hey Anthony,
If you check the theoraPlayer.cc file you'll find a working implementation of streaming arbitrary audio as a source, pretty much like what you describe (except it's a polling interface); this source can then be hooked into the sound layer more or less normally.
Ought to be super easy to do streaming audio source.
Want to write us a new audio layer? :)
#56
tone
09/13/2005 (12:09 pm)
I didn't mean to imply that datablocks don't have a valuable role, but there is a comparative paucity of objects whose design is geared toward dynamic applications. I'll check the Theora player a bit later and see if my level of know-how will permit me to extract its charms, but I wonder if I am better off trying to code from OpenAL directly by mimicking classes like AudioEmitter.
tone
#57
09/13/2005 (2:25 pm)
AudioEmitter uses the sound layer as well... But mimicking it is probably not a bad idea.
#58
09/14/2005 (7:46 am)
I've been away for a little while and just now saw that this thread was getting some attention...
Anthony, Ben, etc...
Can someone bring me up to speed on the state of the VoIP project and what it would take to finalize it?
Let's get this thing done!
:)
Sumner
#59
tone
09/14/2005 (1:45 pm)
I'd love help of the sort that could actually get this working. I'm losing the passion to learn more about the gestalt of Torque as part of the exercise... I'm results-driven in this way.
Here is my current state:
I can sample the microphone just fine and I know the format the speech samples are in (I have speech recognition working really well on the speech). I'd have to dummy out the speech recognizer, as its SDK is under a commercial license.
I have classes written for compressing and decompressing the speech for VOIP. IIRC, I compress them and send out NetEvents containing the data. Right now, the server's handling of them is to relay the speech to all clients within 200 meters.
No attempt is currently made to pass the compressed data to the (possibly unfinished) decoder handler, as I had no place to put the data at that point anyhow. It might be good for me to try to decode it and write it to disk as PCM data, to verify that I'm getting the codec working correctly and have solved any endian issues.
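That disk-verification step can be wired up ahead of time with the codec stubbed out; swapping the identity stub for real Speex encode/decode calls then exercises the actual path. A sketch, with all names hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <vector>

// Identity "codec" standing in for the Speex encode/decode pair.
// Replace the body with real encoder/decoder calls to test the codec.
inline std::vector<int16_t> stubEncodeDecode(const std::vector<int16_t>& pcm)
{
    return pcm;
}

// Run PCM through the (stub) codec, write it to disk, read it back,
// and compare -- this also flushes out endian mistakes in the file I/O.
inline bool roundTripPCM(const std::vector<int16_t>& pcm, const char* path)
{
    std::vector<int16_t> decoded = stubEncodeDecode(pcm);

    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(decoded.data()),
              decoded.size() * sizeof(int16_t));
    out.close();

    std::vector<int16_t> back(pcm.size());
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(back.data()),
            back.size() * sizeof(int16_t));
    return back == pcm;
}
```

The resulting raw file can also be loaded into a sound editor (as signed 16-bit mono) and auditioned by ear.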
My thinking was to first drive for speech playback as though it were coming from a new streaming AudioEmitter that I'd position at the mouth of the avatar of the person talking. It would be equally valid to tie it to a channel-based tuning model (team send and broadcast send) for 2D playback at the receiving clients.
Let me see if I can package this up in a zip file.
IF YOU CAN HELP FINISH THIS WORK, email me privately, please.
tone
#60
09/14/2005 (2:34 pm)
Anthony,
Don't give up. I've had similar frustrations with Torque's audio code when I was attempting to add some simplistic voice transmission.
To test your speech compression and decompression routines, I would recommend testing the routines directly: pass in some raw data, set the codec options, compress, dump to a file, then load again accordingly from the file, *then* finally decompress and dump to another raw data file and listen to your result in a sound editor :)
You may want to look at my AudioFilter resource, which implements a two-way Speex codec and acts as a Torque FilterStream with added util functions for accessing audio properties.
I also believe I implemented some testing functions that convert from speex -> raw / raw -> speex for that resource.
Good luck :)
Torque Owner Anthony Lovell
if (!mSendingEvents) {
    if (!theEvent->mRefCount)
        delete theEvent;
    return false;
}
I made two such alterations to NetEvent.cc.
For the benefit of others, here is the code template you provided me, with the syntax changes necessary to make it compile in TGE 1.4 HEAD as of 2-3 weeks ago. The playback side has not been run, and you'll note the places where functionality is incomplete, but the event traffic has been observed to be pinging around the clients as it should, with printf()s indicating your server-side proximity logic is largely on the mark.
My relayable NetEvent header:
#ifndef _VOXEVENT_H_
#define _VOXEVENT_H_

#include "vox.h"
#include <sim/netConnection.h>

class VoxEvent : public NetEvent
{
    typedef NetEvent Parent;

    char *speexData;
    int numBits;

public:
    VoxEvent(char *data = NULL, int bitCount = 0);
    ~VoxEvent();

    int getNumBits() { return numBits; }
    const char *getSpeexData() { return speexData; }

    void pack(NetConnection *conn, BitStream *bstream);
    void write(NetConnection *conn, BitStream *bstream);
    void unpack(NetConnection *conn, BitStream *bstream);
    void process(NetConnection *conn);

    DECLARE_CONOBJECT(VoxEvent);
};

#endif

and the .cc file:
#include "voxEvent.h"
#include "core/bitStream.h"
#include "sim/sceneObject.h"
#include "game/gameConnection.h"

IMPLEMENT_CO_NETEVENT_V1(VoxEvent);

VoxEvent::VoxEvent(char *data, int nBits)
{
    mGuaranteeType = Unguaranteed; // no sweat if lost

    if (data == NULL || nBits <= 0)
    {
        speexData = NULL;
        numBits = 0;
    }
    else
    {
        int nBytes = 1 + (nBits >> 3);
        speexData = new char[nBytes];
        numBits = nBits;
        dMemcpy(speexData, data, nBytes);
    }
}

VoxEvent::~VoxEvent()
{
    if (speexData != NULL)
    {
        delete [] speexData;
        speexData = NULL;
    }
}

void VoxEvent::write(NetConnection *conn, BitStream *b)
{
    pack(conn, b);
}

void VoxEvent::pack(NetConnection *conn, BitStream *b)
{
    b->writeInt(numBits, 16);
    b->writeBits(numBits, speexData);
}

void VoxEvent::unpack(NetConnection *conn, BitStream *b)
{
    numBits = b->readInt(16);
    int nBytes = 1 + (numBits / 8);
    speexData = new char[nBytes];
    b->readBits(numBits, speexData);
}

void VoxEvent::process(NetConnection *conn)
{
    GameConnection *sourceCon = (GameConnection *)conn;

    if (sourceCon->isConnectionToServer())
    {
        // we are on the client side
        // TODO: play sound
        //Con::printf("we received voxEvent as client\n");
    }
    else
    {
        // we are on the server side: relay it to nearby clients
        ShapeBase *sourceCamera = sourceCon->getCameraObject();
        if (!sourceCamera)
        {
            Con::errorf("server received voxEvent from client who has no camera\n");
            return;
        }
        Point3F sourcePos = sourceCamera->getTransform().getPosition();
        F32 maxVoxDistanceSquared = Vox::getExtremeAudibleRange() * Vox::getExtremeAudibleRange();

        for (SimSetIterator it(Sim::getClientGroup()); *it; ++it)
        {
            GameConnection *gc = dynamic_cast<GameConnection *>(*it);
            if (gc == NULL)
            {
                Con::errorf("server found a NULL GameConnection\n");
                continue;
            }
            if (gc == sourceCon)
            {
                //Con::printf("server will not relay VoxEvent back to its sender\n");
                continue;
            }
            GameBase *clientCamera = gc->getCameraObject();
            if (!clientCamera)
                continue; // added guard: a client with no camera yet cannot hear

            Point3F clientPos = clientCamera->getTransform().getPosition();
            F32 distSquared = (clientPos - sourcePos).lenSquared();
            if (distSquared <= maxVoxDistanceSquared)
            {
                //Con::printf("server relaying voxEvent over distance %f\n", mSqrt(distSquared));
                gc->postNetEvent(this); // NetEvents are reference counted, so this is sane
            }
            else
            {
                //Con::printf("not relaying a voxEvent over a distance of %f > %f\n", mSqrt(distSquared), Vox::getExtremeAudibleRange());
            }
        }
    }
}
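One detail worth a standalone sanity check is the buffer sizing in pack/unpack: nBytes = 1 + (nBits >> 3) always allocates enough bytes to hold nBits bits (it over-allocates one byte when nBits is an exact multiple of 8, which is harmless here). A quick check, with hypothetical helper names:

```cpp
#include <cassert>

// The sizing expression used in VoxEvent's constructor and unpack().
inline int voxBytesForBits(int nBits)
{
    return 1 + (nBits >> 3);
}

// Verify the allocation covers the bit count for every size up to maxBits.
inline bool voxSizingCovers(int maxBits)
{
    for (int bits = 1; bits <= maxBits; ++bits)
        if (voxBytesForBits(bits) * 8 < bits)
            return false; // would be a buffer overrun in readBits/writeBits
    return true;
}
```

Since the 16-bit length field bounds numBits at 65535, checking every value in that range covers all packets the event can carry.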