Getting nothing but a red screen at CroquetCollaborative.org? Here’s why.
Croquet keeps track of everything ever created, so that anyone can tell each object to do stuff. Most of the demo applications in the current SDK keep track as long as they are running. That creates a problem for our KidsFirst Application Toolkit demo, and its public space at the Collaborative for Croquet. The public space is meant to be a long-lived environment, in which you can come and create (or destroy) stuff and rearrange it, and come back later to see things as you left them (perhaps evolved by someone else).
So we resort to a very old programming technique. And if you’re a developer, we need your help!
Many modern programming languages, including Smalltalk, on which Croquet is based, can tell when no part of the environment is using an object, and the system gets rid of it. This is called “garbage collection.” This is very difficult to do in Croquet, because Croquet allows one person on the network to create an object that the person will later use as part of another object. Everyone needs to know about the first object as it is created, so that they can be told later to put it inside the second one. This makes it hard for any one Croquet machine to automatically know whether everyone is done using an object.
In older computer languages, people used to manage objects explicitly. The programmer created them, and the programmer explicitly “freed” them. Until we have distributed memory management in Croquet, most Croquet applications never “free” anything. Eventually, that fills up computer memory, which is not acceptable in our K.A.T. Each time someone enters a world, a lot of memory gets allocated for the user’s avatar and related information. If we don’t free it, it never goes away during the life of the world. Repeated comings and goings make the definition of the world so large that someone joining the party late has to wait a long time for a bunch of junk that has really already gone. So for the time being we’re freeing objects by explicitly unregistering them from the table that keeps track of all the objects in a world. (Technically, this is the Island nameMap.)
We do this whenever an object is explicitly deleted by the user, and whenever a user’s avatar leaves one world to enter another. When an object is unregistered, we also unregister everything in it, everything it is carrying, and so forth.
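To make the cascade concrete, here is a minimal sketch in Python rather than Croquet's Smalltalk. The names `Island`, `WorldObject`, `name_map`, `register`, and `unregister` are made up for illustration and are not the actual Croquet API; the dictionary only stands in for the Island nameMap.

```python
class WorldObject:
    """Hypothetical in-world object that may contain other objects."""

    def __init__(self, name):
        self.name = name
        self.children = []      # everything in it or carried by it
        self.object_id = None


class Island:
    """Toy stand-in for a Croquet island; name_map plays the nameMap role."""

    def __init__(self):
        self.name_map = {}
        self._next_id = 0

    def register(self, obj):
        # Give the object an ID that anyone can use to address it.
        self._next_id += 1
        obj.object_id = self._next_id
        self.name_map[obj.object_id] = obj
        return obj.object_id

    def unregister(self, obj):
        # Unregister the contents first, then the object itself.
        for child in obj.children:
            self.unregister(child)
        self.name_map.pop(obj.object_id, None)


island = Island()
window = WorldObject("window")
button = WorldObject("delete button")
window.children.append(button)
island.register(window)
island.register(button)

# Deleting the window also forgets its delete button.
island.unregister(window)
```

After the last line the table is empty, which is exactly why a message sent later to the button has nowhere to go.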
Alas, there’s at least one complication. When you put your mouse pointer over an object, the current default user interface reserves that object for exclusive use by you, until your mouse pointer moves off the object or you click on it. The way it releases the object is to send it a message telling it that it is no longer under the mouse pointer. So what happens when an object has a “delete” button? The delete button removes the object and unregisters it, along with everything attached to the object, such as the delete button itself. But after sending “pointer down”, the current user interface immediately sends the button “pointer leave.” But the button isn’t there any more! There’s no known object to receive the message. What to do?
For now, instead of unregistering the objects immediately when they are deleted, we schedule them to “self destruct” after five seconds. We immediately remove the objects from the scene, so that it looks like they’re deleted, but we don’t actually unregister them until a little later. The idea is to let them handle any cleanup activities such as “pointer leave.” Not very elegant, but it ought to be ok until we get real memory management and a redesigned default user interface.
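A rough sketch of the fuse, again in Python and with hypothetical names (`DelayedIsland`, `FUSE_SECONDS`, `advance`, `send`). The clock is simulated so the example is deterministic; the real system would use the island's own notion of time.

```python
import heapq

FUSE_SECONDS = 5.0  # the five-second fuse described above


class DelayedIsland:
    """Toy island that hides deleted objects now and unregisters them later."""

    def __init__(self):
        self.name_map = {}   # object ID -> object
        self.scene = set()   # IDs that are currently visible
        self.now = 0.0       # simulated clock, in seconds
        self._pending = []   # heap of (due_time, object_id)

    def add(self, object_id, obj):
        self.name_map[object_id] = obj
        self.scene.add(object_id)

    def delete(self, object_id):
        # Looks deleted right away, but stays registered until the fuse burns.
        self.scene.discard(object_id)
        heapq.heappush(self._pending, (self.now + FUSE_SECONDS, object_id))

    def advance(self, seconds):
        # Move the clock forward and unregister everything whose fuse is up.
        self.now += seconds
        while self._pending and self._pending[0][0] <= self.now:
            _, object_id = heapq.heappop(self._pending)
            self.name_map.pop(object_id, None)

    def send(self, object_id, selector):
        if object_id not in self.name_map:
            raise LookupError("No such object")
        return f"{selector} handled by {self.name_map[object_id]}"


island = DelayedIsland()
island.add(1, "delete button")
island.delete(1)

island.advance(0.1)
early = island.send(1, "pointerLeave")   # within the fuse: still deliverable

island.advance(10)
try:
    island.send(1, "pointerDown")        # after the fuse: nobody home
    late_error = None
except LookupError as error:
    late_error = str(error)
```

The early “pointer leave” is handled normally, while anything that arrives after the fuse has burned fails the lookup.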
Only it isn’t ok. Looking at the crash logs at CroquetCollaborative.org, I see that the only error we’re getting is “No such object”, which is repeatedly crashing the connection.
But what’s causing it? Alas, the crash log doesn’t give enough detail. But you developers can just tell us what you’re doing when you get this error! (It’s supposed to be a collaborative, after all.) If you can, go down the stack in the debugger until you see the TIsland>>decode: frame, highlight ‘aTMessage selector’, and press cmd/alt-p (to print the value). Tell us what it says.
This raises a few issues:
- The #decode: method creates a message object in a local variable, but doesn’t assign the message name (the selector) until after the place where it does a check to conditionally signal an error. It ought to fill in what it can about the message first, so that a crash log that prints local variables gives us some meaningful data.
- IT happens. Systems should be designed to recover from inevitable failure. We need to put in a watchdog to reset the system when that happens. Right now, I’m checking on the system periodically and kicking it when needed. Ugh.
- It might just be that five seconds isn’t long enough on the network. Sometimes latency really sucks, and things get way behind. Our digital clutch is designed to drop messages that make the system behave worse, while also telling you with an annoying click to stop trying to make it worse. It’s our equivalent of having you not hit the “refresh” button over and over again. But we just came up with this, and I know it needs some refinement. Right now, you can get behind. When this happens, you’re quite likely to go click on things, and this might include the delete button. But if you’re way behind, maybe you send a delete and then a “mouse down” on some other sub-part, but the mouse down doesn’t “arrive” until more than five seconds after the delete. Maybe the business end of the self destruct should be minutes or even hours after it is received? Why not? All that we care about is that it eventually happens. Or since we’re measuring latency, maybe the timeout should be a function of the actual latency at the time of the delete?
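The last idea, making the fuse a function of measured latency, could look something like the Python sketch below. The function name, the five-second floor, and the multiplier of ten are all assumptions picked for illustration, not values from the running system.

```python
def fuse_seconds(measured_latency, minimum=5.0, multiplier=10.0):
    """Return how long to wait before unregistering a deleted object.

    measured_latency is the current round-trip estimate in seconds.
    The fuse never drops below `minimum`, and it grows with latency so
    that a user who is far behind still has time for stray messages
    to arrive.
    """
    return max(minimum, multiplier * measured_latency)


quiet_network = fuse_seconds(0.05)   # low latency: the floor applies
slow_network = fuse_seconds(2.0)     # high latency: the fuse stretches
```

Since all we care about is that the unregistering eventually happens, erring on the long side costs only a little memory for a little while.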
I have several possible suggestions.
It sounds like, in the long run, y’all are looking to create some kind of “meta-garbage collector.” Correct?
As far as objects needing to stay around for people to use later, would it be possible that the objects could have rules that would cause them to go into some kind of “stub/hibernate” state?
They could reduce their footprint in the system(s) to a minimal state until someone decides that they want to use it later or it is decided to actually delete or GC the object.
As for the latency issues, could the “objects” be wrapped in some kind of proxy object? When the object is deleted or GC’d, the object could tell the proxy that it is going away. The proxy could noop all the actual execution of any messages up until the proxy itself were deleted.
Or is the problem that the latency actually prevents the existing proxies from realizing that what they represent has gone away?
If so then the proxy approach could lead to rabbit holes within rabbit holes. (If you’ll excuse the expression.)
Would any of these work, or am I so behind the curve that I’m suggesting ideas that y’all have already tried? 🙂
James T. Savidge, Friday, January 19, 2007
I’m not sure what we’re going to do long-term about “meta garbage collection.”
We do have a general principle related to your stub idea. From the beginning, textures were represented outside of the definition of a Croquet world. We have been experimenting with a mechanism for keeping any immutable object outside. The data is identified by SHA, and kept in one big world-wide cache, and the same data used by any Croquet world that needs it. We want to have mesh definitions stored this way as well. All this has some very nice engineering characteristics, including having the effect of making the in-world definition of an object much smaller. This is something like your “stub/hibernated” object, except that it’s true all the time – no special hibernated state. The approach greatly alleviates the practical effect of the gc problem, but it doesn’t solve it.
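A toy illustration of the content-addressed idea, in Python. The class `ContentCache` and its methods are hypothetical, and SHA-256 is my pick for the example; I’m only saying “SHA” above, not which variant.

```python
import hashlib


class ContentCache:
    """Toy cache that stores immutable data once, keyed by its hash."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the content, so identical data
        # always lands on the same entry.
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self):
        return len(self._blobs)


cache = ContentCache()
texture = b"pretend these bytes are a texture"
key_from_world_a = cache.put(texture)
key_from_world_b = cache.put(texture)   # a second world using the same data
```

Both worlds end up holding only the short key, and the cache holds one copy of the bytes.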
I’m attracted to your stub/proxy idea, too. Alas, I don’t yet know why we’re sometimes getting messages to deleted objects, so I don’t know that a “sink” object would fix the problem. For example, the messages might not be for side-effect, but to get information out. If so, replacing a deleted object with a sink would just trade one error for another. But if this is NOT the case, then I think replacing deleted objects with sinks is a good way to “mark” an object ID for deletion, which can then be “swept” away during periodic “stop the world” maintenance/clean-up periods. Between such sweeps, the object IDs and the (single) sink(s) wouldn’t take up much space at all.
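Here is roughly what I have in mind, sketched in Python with made-up names (`Sink`, `SinkIsland`, `sweep`); nothing like this exists in the SDK as far as this post is concerned.

```python
class Sink:
    """Absorbs any message sent to it and answers None."""

    def __getattr__(self, selector):
        # Called only for names the class doesn't define, i.e. any message.
        def absorb(*args, **kwargs):
            return None
        return absorb


SINK = Sink()  # one shared sink for every deleted object


class SinkIsland:
    def __init__(self):
        self.name_map = {}

    def register(self, object_id, obj):
        self.name_map[object_id] = obj

    def delete(self, object_id):
        # "Mark": keep the ID, but point it at the shared sink.
        self.name_map[object_id] = SINK

    def send(self, object_id, selector, *args):
        return getattr(self.name_map[object_id], selector)(*args)

    def sweep(self):
        # "Sweep": during maintenance, drop every ID that points at the sink.
        dead = [oid for oid, obj in self.name_map.items() if obj is SINK]
        for oid in dead:
            del self.name_map[oid]
        return len(dead)


class Button:
    def pointer_leave(self):
        return "released"


island = SinkIsland()
island.register(7, Button())
before = island.send(7, "pointer_leave")   # the real button answers
island.delete(7)
after = island.send(7, "pointer_leave")    # the sink quietly absorbs it
swept = island.sweep()
```

Note that the sink answers None to everything, which is the catch mentioned above: a message sent to get information out would get back nothing useful.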
P.S. We did end up lengthening the fuse, to good effect. See http://wetmachine.com/i…