Getting nothing but a red screen at CroquetCollaborative.org? Here’s why.
Croquet keeps track of everything ever created, so that anyone can tell each object to do stuff. Most of the demo applications in the current SDK keep track as long as they are running. That creates a problem for our KidsFirst Application Toolkit demo, and its public space at the Collaborative for Croquet. The public space is meant to be a long-lived environment, in which you can come and create (or destroy) stuff and rearrange it, and come back later to see things as you left them (perhaps evolved by someone else).
So we resort to a very old programming technique. And if you’re a developer, we need your help!
Many modern programming languages, including Smalltalk, on which Croquet is based, can tell when no part of the environment is using an object, and the system then gets rid of it. This is called “garbage collection.” This is very difficult to do in Croquet, because Croquet allows one person on the network to create an object that the person will later use as part of another object. Everyone needs to know about the first object as it is created, so that they can be told later to put it inside the second one. This makes it hard for any one Croquet machine to automatically know whether everyone is done using an object.
In older computer languages, people used to manage objects explicitly. The programmer created them, and the programmer explicitly “freed” them. Until we have distributed memory management in Croquet, most Croquet applications never “free” anything. Eventually, that fills up computer memory, which is not acceptable in our K.A.T. Each time someone enters a world, a lot of memory gets allocated for the user’s avatar and related information. If we don’t free it, it never goes away during the life of the world. Repeated comings and goings make the definition of the world so large that someone joining the party late has to wait a long time for a bunch of junk that has really already gone. So for the time being we’re freeing objects by explicitly unregistering them from the table that keeps track of all the objects in a world. (Technically, this is the Island nameMap.)
We do this whenever an object is explicitly deleted by the user, and whenever a user’s avatar leaves one world to enter another. When an object is unregistered, we also unregister everything in it, everything it is carrying, and so forth.
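Here is a rough sketch of that cascading unregister, in Python rather than Smalltalk. It is not the actual Croquet code: `Node` and `NameMap` are hypothetical stand-ins for a Croquet object and the Island nameMap, and the only thing the sketch tries to show is that unregistering an object also unregisters everything inside it.

```python
class Node:
    """Hypothetical stand-in for an object that can contain or carry others."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])


class NameMap:
    """Hypothetical stand-in for the table of all objects in a world."""

    def __init__(self):
        self._objects = {}

    def register(self, node):
        # Register the object and, recursively, everything inside it.
        self._objects[node.name] = node
        for child in node.children:
            self.register(child)

    def unregister(self, node):
        # Drop the object itself, then everything it contains or carries.
        self._objects.pop(node.name, None)
        for child in node.children:
            self.unregister(child)

    def lookup(self, name):
        # A message addressed to an unregistered name has no receiver.
        if name not in self._objects:
            raise LookupError("No such object: " + name)
        return self._objects[name]

    def __len__(self):
        return len(self._objects)


# An avatar carrying a window that has a delete button.
button = Node("deleteButton")
window = Node("window", [button])
avatar = Node("avatar", [window])

names = NameMap()
names.register(avatar)
names.unregister(avatar)  # the window and its button go too
```

After the last line, nothing is left in the table, so a late message to the button would have nowhere to go.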
Alas, there’s at least one complication. When you put your mouse pointer over an object, the current default user interface reserves that object for exclusive use by you, until your mouse pointer moves off the object or you click on it. It releases the reservation by sending the object a message telling it that it is no longer under the mouse pointer. So what happens when an object has a “delete” button? The delete button removes the object and unregisters it, along with everything attached to the object, such as the delete button itself. After sending “pointer down”, however, the current user interface immediately sends the button “pointer leave.” But the button isn’t there any more! There’s no known object to receive the message. What to do?
For now, instead of unregistering the objects immediately when they are deleted, we schedule them to “self destruct” after five seconds. We immediately remove the objects from the scene, so that it looks like they’re deleted, but we don’t actually unregister them until a little later. The idea is to let them handle any cleanup activities such as “pointer leave.” Not very elegant, but it ought to be ok until we get real memory management and a redesigned default user interface.
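To make the idea concrete, here is a minimal sketch of the delayed “self destruct”, again in Python and again not the real code. The `World` class, its simulated clock, and the method names are all assumptions made up for illustration. The only detail taken from the description above is the five-second delay.

```python
import heapq

GRACE_SECONDS = 5.0  # the five-second delay described above


class World:
    """Hypothetical world with a scene, a name table, and a simulated clock."""

    def __init__(self):
        self.now = 0.0
        self.scene = set()    # names that are currently visible
        self.name_map = {}    # names that can still receive messages
        self._pending = []    # heap of (due_time, name) awaiting unregister

    def add(self, name):
        self.scene.add(name)
        self.name_map[name] = []  # log of selectors this object has received

    def delete(self, name):
        # Hide the object right away, but keep it registered for a while
        # so that trailing messages such as "pointer leave" still land.
        self.scene.discard(name)
        heapq.heappush(self._pending, (self.now + GRACE_SECONDS, name))

    def advance(self, seconds):
        # Move the simulated clock forward and unregister whatever is due.
        self.now += seconds
        while self._pending and self._pending[0][0] <= self.now:
            _, name = heapq.heappop(self._pending)
            self.name_map.pop(name, None)

    def send(self, name, selector):
        if name not in self.name_map:
            raise LookupError("No such object")
        self.name_map[name].append(selector)
```

In this toy model, a “pointer leave” sent right after `delete` is still delivered, while any message that shows up after `advance` has moved the clock five or more seconds past the delete raises the error.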
Only it isn’t ok. Looking at the crash logs at CroquetCollaborative.org, I see that the only error we’re getting is “No such object”, which is repeatedly crashing the connection.
But what’s causing it? Alas, the crash log doesn’t give enough detail. But you developers can just tell us what you’re doing when you get this error! (It’s supposed to be a collaborative, after all.) If you can, go down the stack in the debugger until you see the TIsland>>decode: frame, highlight ‘aTMessage selector’, and press cmd/alt-p (to print the value). Tell us what it says.
This raises a few issues that we should address:
- The #decode: method creates a message object in a local variable, but doesn’t assign the message name (the selector) until after the place where it does a check to conditionally signal an error. It ought to fill in what it can about the message first, so that crash logs that print local variables give us some meaningful data.
- IT happens. Systems should be designed to recover from inevitable failure. We need to put in a watchdog to reset the system when that happens. Right now, I’m checking on the system periodically and kicking it when needed. Ugh.
- It might just be that five seconds isn’t long enough on the network. Sometimes latency really sucks, and things get way behind. Our digital clutch is designed to drop messages that make the system behave worse, while also telling you with an annoying click to stop trying to make it worse. It’s our equivalent of having you not hit the “refresh” button over and over again. But we just came up with this, and I know it needs some refinement. Right now, you can get behind. When this happens, you’re quite likely to go click on things, and this might include the delete button. But if you’re way behind, maybe you send a delete and then a “mouse down” on some other sub-part, but the mouse down doesn’t “arrive” until more than five seconds after the delete. Maybe the business end of the self destruct should be minutes or even hours after it is received? Why not? All that we care about is that it eventually happens. Or since we’re measuring latency, maybe the timeout should be a function of the actual latency at the time of the delete?
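As a sketch of that last idea, the delay could be computed from the measured latency, with the current five seconds as a floor. This is only an illustration in Python: the function name, the multiplier of 10, and the assumption that latency is available in seconds are all made up here, and nothing like this exists in the toolkit yet.

```python
def self_destruct_delay(latency_seconds, floor_seconds=5.0, multiplier=10.0):
    """Return how long to wait before unregistering a deleted object.

    Hypothetical policy: never wait less than the current five seconds,
    and wait proportionally longer when the network is running behind.
    """
    if latency_seconds < 0:
        raise ValueError("latency cannot be negative")
    return max(floor_seconds, multiplier * latency_seconds)


# On a healthy connection the floor applies; when things are way behind,
# the delay stretches with the latency measured at the time of the delete.
quick = self_destruct_delay(0.1)
slow = self_destruct_delay(2.0)
```

Since all we care about is that the unregister eventually happens, erring on the long side costs only some memory for a while.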