Game Development Community

Container::castRay infinite loop while sometimes...

by Clint S. Brewer · in Torque Game Engine · 03/28/2005 (3:28 pm) · 16 replies

Symptom:
On occasion my program would freeze up. I started running it under the debugger in hopes of tracking it down. It doesn't seem to crash, just freeze.
When I was able to break it under the debugger here is what I found

Under Debugger:

looks like an infinite loop in sceneobject.cc inside of
bool Container::castRay(const Point3F &start, const Point3F &end, U32 mask, RayInfo* info)

it broke at around line 1339 in my version

getBinRange(currStartX, currEndX, subMinX, subMaxX);

currStartX and currEndX are both NANs

going out on a limb...I'd bet those NANs will lead to the problem :)

subMinX looked like a very large number
subMaxX was 0
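
For anyone wondering why NANs would hang a loop rather than crash it: every ordered comparison against a NAN is false, so a loop whose exit test compares against a NAN never sees its exit condition become true. The sketch below only illustrates that effect; it is not the castRay code, and stepsUntilDone and its step cap are made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical stand-in for a loop that walks from 'start' toward 'end'.
// Returns the number of steps taken, or -1 if the safety cap was hit.
int stepsUntilDone(float start, float end, int maxSteps = 1000)
{
    float cur = start;
    int steps = 0;
    // Exit test: stop once cur has reached end. If either value is NAN,
    // (cur >= end) is always false and only the cap ends the loop.
    while (!(cur >= end))
    {
        if (++steps > maxSteps)
            return -1; // without the cap this would spin forever
        cur += 1.0f;
    }
    return steps;
}
```

With ordinary inputs the loop finishes after a few steps; with a NAN for either argument it runs until the cap, which is the "freeze, not crash" behaviour described above.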


Other info
This just started showing up recently when I've been working on some ai code which is doing a lot more raycasts than before.

It seemed to occur when I would switch into the editor. I'm not positive that it only occurs then, but the last few times it happened was when I was switching into the editor with F11

in case it matters:
my ai are using schedules to get a chance to think. each ai manages its own schedule which can vary anywhere from 500 to 20000 milliseconds based on the distance to the closest PC.


it's very possible this is my problem somewhere in either script or code, but not sure how to track it down yet..

I'm hoping this rings a bell for someone else here

thanks.

#1
03/28/2005 (4:24 pm)
I had a similar problem, and it was because my code to create projectiles was creating nonsense (NAN) values, and then the poor engine was trying to deal with it.

Quickest way to confirm that is to look at your call stack when you end up feeding the NAN to the raycast and see where the number is being fed from.

Hope that helps a little.
#2
03/28/2005 (4:52 pm)
Heeeeyy I swear it just happened again, and I just grabbed a copy of the callstack then came here and read your post :)
thanks Brian..definitely a good idea.

ProjectLazarus_Internal.exe!Container::castRay(const Point3F & start={...}, const Point3F & end={...}, unsigned int mask=16, RayInfo * info=0x0012fbdc)  Line 1339 + 0x3a	C++
 	ProjectLazarus_Internal.exe!Player::collidingWithWater(Point3F & waterHeight={...})  Line 4533 + 0x39	C++
>	ProjectLazarus_Internal.exe!Player::updateFroth(float dt=0.010000001)  Line 4471 + 0xc	C++
 	ProjectLazarus_Internal.exe!Player::advanceTime(float dt=0.010000001)  Line 1312	C++
 	ProjectLazarus_Internal.exe!ProcessList::advanceClientTime(unsigned int timeDelta=10)  Line 200 + 0xb	C++

when collidingWithWater is called the value sent in has a denormalized number in it, which might be the root of the NANs. will have to check more next time it happens
#3
03/28/2005 (5:00 pm)
Ok it's a bit nutty...
just happened again but through a different path
ProjectLazarus_Internal.exe!_ctrlfp(unsigned int newctrl=639, unsigned int _mask=65535)  + 0x16	C
 	ProjectLazarus_Internal.exe!_87except(int opcode=639, _exception * exc=0x0000ffff, unsigned short * pcw16=0x80000000)  + 0xc2	C
 	ProjectLazarus_Internal.exe!_ctrandisp2(__int64 parm1=507018896435723896, __int64 parm2=281526537261696)  + 0x189	Asm
 	ProjectLazarus_Internal.exe!Container::castRay(const Point3F & start={...}, const Point3F & end={...}, unsigned int mask=65548, RayInfo * info=0x0012f4e8)  Line 1346 + 0x5	C++
>	ProjectLazarus_Internal.exe!ccalcExplosionCoverage(SimObject * __formal=0x04f1c6d0, int argc=4, const char * * argv=0x00763e78)  Line 57 + 0x1b	C++

this time the root came from ccalcExplosionCoverage
in that function it gets the center of an object like so
sceneObject->getObjBox().getCenter(&center);
   center.convolve(sceneObject->getScale());
   sceneObject->getTransform().mulP(center);

then passes that center down to the other functions. the center was all NANs

so...I guess I must be setting some bad transforms somewhere...hmm.
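
Until the bad transform is found, one stopgap is to refuse to cast a ray when either endpoint is not a finite number. A rough sketch, assuming nothing about the engine: Vec3 is a stand-in for Point3F, and isFinitePoint and guardedCastRay are hypothetical helpers, with castFn standing in for whatever really does the raycast.

```cpp
#include <cassert>
#include <cmath>

// Minimal stand-in for Point3F; not the engine type.
struct Vec3 { float x, y, z; };

// True only if every component is finite (rejects NAN and +/-infinity).
inline bool isFinitePoint(const Vec3& p)
{
    return std::isfinite(p.x) && std::isfinite(p.y) && std::isfinite(p.z);
}

// Hypothetical wrapper: skip the raycast when an endpoint is bad.
// 'castFn' stands in for whatever actually performs the raycast.
template <typename CastFn>
bool guardedCastRay(const Vec3& start, const Vec3& end, CastFn castFn)
{
    if (!isFinitePoint(start) || !isFinitePoint(end))
        return false; // report "no hit" instead of feeding NANs onward
    return castFn(start, end);
}
```

This only hides the symptom, so it is worth logging or asserting in the rejected branch to find out who produced the bad point.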
#4
03/28/2005 (7:20 pm)
Ok..I do see something that I changed recently that could be a culprit and would explain the recent crashes.

I started using radius damage with one of my spells
and for the position I'm using

%mount_Pos = %player.getNodePosition("mount0");


that's a function I added a while ago and have been using in various places with no problem, but maybe it'll explain some other random crashes I've had in the past.

in case it rings any bells here's the code
ConsoleMethod(ShapeBase, getNodePosition, const char*, 3, 3, 
			  "point getNodePosition(string nodeName)")
{
   Point3F nodePoint;
   MatrixF shapeTransform;
   nodePoint = object->getNodePoint(argv[2]);
   shapeTransform = object->getTransform();
   shapeTransform.mulP(nodePoint);
   char* buff = Con::getReturnBuffer(100);
   dSprintf(buff, 100, "%g %g %g", nodePoint.x, nodePoint.y, nodePoint.z);
   return buff;
}

Point3F ShapeBase::getNodePoint(const char* nodeName) {
   Point3F nodePoint;
   MatrixF nodeMat;
   S32 ni = mDataBlock->shape->findNode(nodeName);
   if(ni)
   {
      nodeMat = mShapeInstance->mNodeTransforms[ni];
      nodeMat.getColumn(3, &nodePoint);
   }
   else
   {
      Con::errorf("Error: searching for unknown node %s",  nodeName);
   }
   return nodePoint;
}

You'll notice the woeful lack of error checking
bad programmer no biscuit!

I'll add some checks to that and see if anything pops up.
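
Two checks that look worth adding to getNodePoint, sketched here with stand-in types rather than the engine's. The sketch assumes findNode returns -1 for an unknown name, which is worth confirming in your codebase: if so, if(ni) treats -1 as found and index 0 as missing. It also zeroes the result up front, so the error path never returns uninitialized floats. FakeShape and this getNodePoint signature are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical stand-in for the shape's node table.
struct FakeShape
{
    std::vector<std::string> names;
    std::vector<Vec3> positions;

    // Assumed convention: index on success, -1 when the name is unknown.
    int findNode(const std::string& name) const
    {
        for (std::size_t i = 0; i < names.size(); ++i)
            if (names[i] == name)
                return static_cast<int>(i);
        return -1;
    }
};

// Returns true and fills 'out' on success; 'out' is zeroed on failure.
bool getNodePoint(const FakeShape& shape, const std::string& name, Vec3& out)
{
    out = Vec3{0.0f, 0.0f, 0.0f};  // never hand back uninitialized floats
    const int ni = shape.findNode(name);
    if (ni < 0)                    // 0 is a valid index, so test the sign
        return false;
    out = shape.positions[static_cast<std::size_t>(ni)];
    return true;
}
```

Returning a bool lets the console method report the failure instead of formatting garbage into the return buffer.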
#5
04/07/2005 (1:05 am)
For what it's worth, I did not change anything about the node getting code. nor did I change anything specifically related to this code. I did change some gui code, completely unrelated but this crash hasn't shown up in a while.

I hate bugs like this. lurking..in their foul places....waiting for just the right moment....


If it ever happens again, and I find out what's going on I'll post here.
#6
04/07/2005 (6:39 am)
Clint, have the same problem, tracked it down to the same sequence of calls : castRay, getBinRange.
Only happens in release mode, and it seems enabling the TORQUE_DEBUG define prevents it from happening in release...
Will post more info when I have it :)
#7
04/07/2005 (11:49 am)
Nicolas, glad to hear it's happening elsewhere, maybe we'll track it down.

I'm running on an INTERNAL_RELEASE build myself.

if it doesn't happen in debug there are two things I've seen in the past that usually are the problem, uninitialized memory/variables, timing.

will have to check into that.
#8
04/07/2005 (7:23 pm)
@Clint: Aha! Now, I've been having issues with this too. I am using the improved 3rd person cursor (which does raycasts per frame).

What's weird is I have never seen this lockup on my dev machine, in debug or release build. But on other systems, the game invariably locks up after a short amount of time.

I suspected it had something to do with the raycasting but I wasn't entirely sure, and since I can't re-create it on the machine that has my debugger and such, I haven't been able to trace it down.

This little bug has been nagging me for a while and I'm glad to see other people are encountering it too. Perhaps with all of us hacking on it, we can figure out what's up and stamp it out once and for all.
#9
04/08/2005 (3:38 am)
Have you tried making a journal?
#10
04/08/2005 (7:15 am)
Doh. Had forgotten about the journal stuff. That and Phil Carlisle's float validation code have so far (taking a break ;)) led me to realize that the voodoo is NOT happening in raycast, etc.
It's happening a lot earlier actually, which you can see if you enable asserts in a release build and check for float validity. (Phil's code should be in a forum post or resource, but I can't find it anymore. Going to ask him if I can post it here, it's pretty simple stuff, but oh, so useful ;) )

I'm up to being in updateMove from a duplicate of the Player/AiPlayer class hierarchy, and acc not being valid before being applied to mVelocity.

More news as the situation develops.
#11
04/08/2005 (12:15 pm)
Journalling is cool, didn't know that's what that did. too bad it doesn't quite work with my ai stuff. one of my creatures did something slightly different on playback that blocked my character, that made all the later events invalid for interacting with and navigating the world.
#12
04/08/2005 (1:07 pm)
Clint, don't know if you wrote your AI stuff yourself, but basically the problem I was getting and was rippling all the way to the castRay, getBinRange stuff was that the mOrientation member of AIPlayer of the AIPack (and my twin of it) is not initialized by the ctor of the class.
Just added mOrientation.set(0.0f, 0.0f, 0.0f) to the AIPlayer constructor, and the job was done...
Took a while to find it, but enabling asserts together with the floating point checking code probably cut that time by an order of magnitude (still waiting for Phil to ok it and I'll post the relevant code here)
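
For reference, the shape of that fix in isolation. This is a cut-down sketch, not the AIPack source: FakeAIPlayer and Vec3 are hypothetical stand-ins, and only the member that matters here is shown.

```cpp
#include <cassert>

// Minimal stand-in for Point3F, with the same style of set() call.
struct Vec3
{
    float x, y, z;
    void set(float nx, float ny, float nz) { x = nx; y = ny; z = nz; }
};

// Hypothetical cut-down AI class; only the relevant member is shown.
class FakeAIPlayer
{
public:
    FakeAIPlayer()
    {
        // Without this line mOrientation holds whatever bytes happened
        // to be in memory, which may or may not decode to a NAN.
        mOrientation.set(0.0f, 0.0f, 0.0f);
    }

    const Vec3& getOrientation() const { return mOrientation; }

private:
    Vec3 mOrientation;
};
```

Since the garbage depends on what was in memory beforehand, this kind of bug can easily show up in one build or on one machine and not another.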


Edit : TAP rocks :) (that's the Torque Application Platform if anyone doesn't know yet, get used to it, we'll be hearing lots about it ;))
#13
04/08/2005 (1:40 pm)
Good find on mOrientation! I suspect that is a root of the problem. Would definitely explain why things would work sometimes and not others, if mOrientation is filled with random bits of memory.

I think this bug comes from AIPlayer.cs in the CVS release of torque.


I did my own AI stuff, but mostly in script right now. I did e-mail someone about possibly using the ai pack in my game but never got a reply. I'm pretty happy with the way things are going now though.

thanks again for the find Nicolas. since I haven't been having the problem for a while, I'm not sure if I can reliably say this will fix it for me, but if it _does_ happen again I'll definitely post here.
#14
04/08/2005 (1:44 pm)
Having the orientation set might actually make journalling work with my ai now as well, I'll have to try again :)
#15
04/08/2005 (2:07 pm)
Okay, Phil says ok for the float validation code, so here it is :

First, a caveat : Windows only right now. If someone wants to do the Mac and Linux versions, that'd be nice, and we can throw a new resource together for it, as Phil's seems to have vanished...

In engine\platform\platform.h, after static void advanceTime(U32 delta);
// each platform should define this..
	static bool isValidFloat(F32 f);

In engine\platformWin32\winWindow.cc
in the includes :
#include <float.h> //For fpclass in isValidFloat

Then in Platform section, around line 1091, for example
bool Platform::isValidFloat(float f)
{
	return (_fpclass(f) != _FPCLASS_QNAN);
}

In engine\math\mMathFn.h, at the end of the file, before the #endif :
inline bool isvf(F32 f)
{
     return Platform::isValidFloat(f);
}

In engine\math\mPoint.h :

in the includes :
// for _fpclass
#include "float.h"

after bool equal( Point3F &compare);
bool  isValid() const;

A bug fix if it's not in your codebase already for Point2F::operator==, as it was testing _test.x for both x and y at some point....
inline bool Point2F::operator==(const Point2F& _test) const
{
   return (x == _test.x) && (y == _test.y);
}

Somewhere, for example after inline bool Point3F::isZero() const function definition
inline bool Point3F::isValid() const
{
	return ( Platform::isValidFloat(x) && Platform::isValidFloat(y) && Platform::isValidFloat(z) );
   //return ( (_fpclass(x) != _FPCLASS_QNAN) && (_fpclass(y) != _FPCLASS_QNAN) && (_fpclass(z) != _FPCLASS_QNAN) );
}

All the above is from Phil Carlisle :)

My small contribution, goes into engine/math/mMatrix.h :

At the end of MatrixF class declaration :
bool isValid() const;

then at the end of the file, right before the #endif :
inline bool MatrixF::isValid() const
{
	bool isValid = true;
	for(int i=0; i <=15; i++)
	{
		isValid &= isvf(m[i]);
	}
	return isValid;
}

Hope this helps others with finding QNANs : use with the ASSERT functionality in TGE, which is enabled in DEBUG builds, and can be enabled in any build by doing :
#define TORQUE_ENABLE_ASSERTS
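
A sketch of how the checks might be placed around a velocity update, so the first bad value is caught where it appears rather than later in castRay. Everything here is a stand-in so that it compiles on its own: SKETCH_ASSERT only records a message where the engine's assert would halt, and Vec3 and applyAcceleration are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Stand-in for the engine's assert macro so this compiles on its own.
// It only records the message; a real assert would halt right here.
static const char* gLastAssertMessage = nullptr;
#define SKETCH_ASSERT(cond, msg) \
    do { if (!(cond)) gLastAssertMessage = (msg); } while (0)

// Minimal stand-in for Point3F with an isValid() in the spirit of the above.
struct Vec3
{
    float x, y, z;
    bool isValid() const
    {
        return !std::isnan(x) && !std::isnan(y) && !std::isnan(z);
    }
};

// Hypothetical update step: validate the input and the result.
void applyAcceleration(Vec3& velocity, const Vec3& acc)
{
    SKETCH_ASSERT(acc.isValid(), "acc is NAN before being applied to velocity");
    velocity.x += acc.x;
    velocity.y += acc.y;
    velocity.z += acc.z;
    SKETCH_ASSERT(velocity.isValid(), "velocity is NAN after update");
}
```

Checking both the input and the result narrows down whether the NAN came in from the caller or was produced by the update itself.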

Happy coding !!!
#16
09/12/2007 (3:05 pm)
A cross platform way to check for NAN is to see if the number is equal to itself. NAN is not equal to anything ever, so equality tests will fail.

if (floatVal != floatVal) {
   // NAN
}
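
Two portable variants, as a sketch. The first is the self-comparison above wrapped in a function; the second uses std::isfinite from <cmath>, which also rejects infinities. Note that a test against _FPCLASS_QNAN alone would not flag signaling NANs or infinities. The function names are made up for the example. One caution: compilers can optimize both checks away when built with aggressive fast-math options, so check your build flags.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// NAN is the only float value that compares unequal to itself.
inline bool isNanSelfCompare(float f)
{
    return f != f;
}

// Stricter check: true only for finite values, so NAN and
// +/-infinity are both rejected.
inline bool isValidFloatPortable(float f)
{
    return std::isfinite(f);
}
```

The stricter check is probably the more useful one for positions and transforms, since an infinite coordinate is no more usable than a NAN.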