If I recall correctly, the machine was an ARPA-era Burroughs, and it handled crashes by taking the offending module off-line and completing an automatic restart within 15-30 seconds. (No Blue Screen of Death here, folks.)
In Croquet today, as with all software, things will go wrong. We would like there to never be errors, and we will hunt down every one we encounter. But that will take a while, and what work can ever be said to be perfect?
But Croquet has special issues. Unlike a stateless Web page, we need to maintain an open connection between machines (through the router), and Internet services do go down. No one else is affected by your connection going down, but you will need to get back in sync with everyone. Actual errors are worse. When Croquet is working properly, any error will be replicated on each participating machine! This includes any “continuity server” machines that provide worlds to newcomers – like the next guy to come along after you get frustrated and leave.
Here’s what we do now. A while back, Josh Gargus created a pluggable “Connection Strategy” that implements the connection to a world, authentication, joining, and synchronization. Any network error is delegated to the Connection Strategy. The default Strategy repeats the connection process for you. Now we have finally gotten around to making use of this for replicated errors (e.g., those that occur while processing messages for a world, as opposed to those that occur for the user interface, rendering, or other activity on your computer). The Controller for each world can be told how to handle these errors:
- Our Continuity Server handles such errors by asking the router to disconnect all — breaking the connection for everyone including itself — and letting all reconnect in the usual way.
- An ordinary participant connecting to something other than itself will, by default, just tell the user nicely about any error it gets, and allow itself to reconnect when the router breaks the connection. Of course, you won’t always see the error notification because the router might reset everyone’s connection before you even process the message that would inevitably fail.
- Otherwise (e.g., a participant connecting to itself during development), an error produces the usual Squeak debugger.
In each case, the whole Croquet application and each other individual world continues uninterrupted.
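The shape of this pluggable Connection Strategy idea can be sketched in Python. To be clear, this is a hypothetical illustration, not Croquet's actual API: the class names (`Controller`, the three policy classes) and the string results are all invented here to show how a per-world controller might delegate replicated errors to whichever strategy it was configured with.

```python
class Controller:
    """Per-world controller; delegates replicated errors to a pluggable policy.

    Hypothetical sketch -- names and structure are illustrative only.
    """

    def __init__(self, policy):
        self.policy = policy

    def on_replicated_error(self, error):
        # Replicated errors (those hit while processing world messages)
        # are handed to the strategy; local UI/rendering errors are not.
        return self.policy.handle(error)


class ContinuityServerPolicy:
    def handle(self, error):
        # Ask the router to disconnect everyone -- including ourselves --
        # so that all participants reconnect and resync in the usual way.
        return "disconnect-all"


class ParticipantPolicy:
    def handle(self, error):
        # Notify the user, then wait for the router to break the
        # connection and reconnect automatically.
        return "notify-and-reconnect"


class DeveloperPolicy:
    def handle(self, error):
        # A participant connected to itself during development:
        # drop into the debugger instead of papering over the error.
        return "open-debugger"
```

The point of the pattern is that the error-handling choice lives in one replaceable object per world, so a continuity server, an ordinary participant, and a developer's machine can all run the same controller code with different policies plugged in.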
So what does everything reset to when someone gets an error? The last saved state. Administrators can save whenever they want, but the current automatic behavior is to save a world whenever someone successfully connects to it. If the system can manage to set someone up with the current state, then it’s probably a good checkpoint to go back to, and it fits the user’s expectation of going back to the way things were when they (or anyone since) “started.” A major consequence of this is that long-lived worlds can now begin to make progress and evolve. Even with errors, a world will reset to a point in its history that is no earlier than the last time the last person started to do something. (Unless an administrator brings things down and explicitly goes to an earlier archived snapshot.)
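The checkpoint-on-join behavior can be summarized in a few lines. Again, this is an invented sketch (the `World` class and its method names are not Croquet's): it just shows the rule that a successful join doubles as a save, and a replicated error rolls everyone back to that save.

```python
class World:
    """Hypothetical sketch of save-on-join checkpointing."""

    def __init__(self, state):
        self.state = state
        self.checkpoint = dict(state)  # last saved snapshot

    def on_join(self, newcomer):
        # If we can hand the newcomer a consistent copy of the current
        # state, that same state makes a good checkpoint: it matches what
        # the latest participant saw when they "started."
        snapshot = dict(self.state)
        self.checkpoint = snapshot
        return snapshot

    def on_replicated_error(self):
        # Every participant resets to the last saved state -- no earlier
        # than the most recent successful join.
        self.state = dict(self.checkpoint)
```

So if a world has evolved since its creation and a new participant joins, a later error rewinds only to that join, not to the world's beginning.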
All this can be improved on. In addition to not getting errors in the first place, we can develop better ideas of good checkpoints. There are errors in the error handling stuff itself, and it doesn’t always reset all that reliably. And for completely separate reasons, joining (and therefore resetting) can currently take longer than the 15-30 seconds we would like. But the big idea is that things don’t suck and they can only get better from here. Mostly it means that maybe I can stop compulsively checking the logs at CroquetCollaborative.org to make sure the damn thing hasn’t crashed.
“Designed to Fail” is just an offshoot of David Reed’s End-to-end approach, or the capabilities folks’ “Don’t count on precluding what you can’t prevent.” David and Andreas have already built Croquet along a “fail early” model, which means that most broken application code will fail as soon as you try it: if it smoke-tests ok, it probably works. This happens because the basic design makes application code inherently free of complex timing and reference issues, and the infrastructure checks up on you. But this designed-to-fail approach is a nice backstop against the many bugs still in the (non-application) infrastructure code.