We’re getting some practical experience using Croquet away from the confines of the lab and out on the wide open Internet. One problem we found is that our “outgoing” bandwidth (from each machine to the others) is often limited by, e.g., consumer Internet Service Providers. At my home, if I try to send more than about 30 KBytes/sec, my ISP kicks in a sort of governor that transmits the bits more slowly to keep my upload speed constant.
When this happens, it takes longer for the bits to reach the Croquet router that timestamps and redistributes them to all the participating machines. No participating machine, including our own, acts on a message until it comes back from the router. So when messages take longer to get to the router, they get timestamped for execution farther and farther from when they were sent. If we keep getting throttled, we fall further and further behind. It doesn’t take long before you do something and it seems like nothing ever happens in response. So you really don’t want to get your upload speed clamped.
Whenever we move the mouse around, the mouse position is sent along with a bunch of other stuff. When we send voice or video, much more data goes. This is fine on a high-speed Local Area Network, but not so good in the real world. We can and should send a lot less data. But how efficient is efficient enough? With different networks, there isn’t a single target number. The limits could even vary with the time of day or other traffic.
We’ve had some good preliminary results with a rather elegant solution.
A clutch is a mechanism for dealing with changes in relative speed between two or more connected devices. In this case, the speed we’re interested in is the latency: the time lag between sending a message to the router and getting the same message back for execution. Latency can be affected by lots of things, and might reasonably range between 20 and 500 ms (half a second), even from one moment to the next. It’s useless to try to anticipate a particular target value, but fortunately we’re only interested in the change at any given moment: is the latency getting worse right now, or better? If it’s getting worse (not counting fluctuations within the normal range), that’s a sure sign that we need to stop sending so much data right now. If it’s getting better, then we can afford to send more.
This is easy enough to measure. For now, we’re just keeping track of the time we send each message. When it comes back to us, the difference from now is the latency for that message. Since we’re not interested in normal variations, we take the maximum of this measured latency and a nominal ceiling of, say, 500 ms, so that anything within the normal range registers as no change. We compare that to the last maxed latency we measured – for whichever message that happened to be. (I could imagine taking a windowed average, but so far it doesn’t seem necessary.) Our latency trend at this moment is then (this – last) / last. A positive number means we’re getting worse. Usually, the trend is 0.0.
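Here is a minimal sketch of that trend calculation in Python. The class and method names are mine, and the 500 ms ceiling is just the example value above – this isn’t the actual Croquet code:

```python
class LatencyTrend:
    """Tracks whether round-trip latency is getting worse or better."""

    def __init__(self, nominal_max_ms=500.0):
        self.nominal = nominal_max_ms  # fluctuations below this count as "normal"
        self.last = None               # last maxed latency, in ms

    def update(self, latency_ms):
        """Feed in one measured round-trip latency; return the trend.

        Positive means getting worse, negative means getting better,
        and anything within the normal range comes out as 0.0.
        """
        maxed = max(latency_ms, self.nominal)
        trend = 0.0 if self.last is None else (maxed - self.last) / self.last
        self.last = maxed
        return trend
```

Feeding in latencies that stay under the 500 ms ceiling always yields 0.0; a spike to 750 ms yields +0.5, and the recovery afterward yields a negative trend.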
Now, there are two ways to use this information.
For automatically generated repeated transmissions – such as video frames that nominally get sent, say, every 100 ms – we can vary our sampling rate. For example, if we’re waiting N ms before the next frame, we can now add to that N * latencyTrend * dampingFactor. (A damping factor of, e.g., 0.9 keeps us from bouncing our rates around. It’s like a shock absorber.) The total inter-frame time gets a minimum so that we don’t go to zero (constant frames) when the latency has been getting better for a long spurt. At my home, the video tends to stabilize quickly at about 300 ms between frames, but this can vary, and that’s the point.
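That adjustment rule fits in a few lines of Python. This is a sketch of the idea as described above; the 33 ms floor is my placeholder for the minimum inter-frame time, which isn’t specified:

```python
def next_frame_interval(current_ms, latency_trend, damping=0.9, floor_ms=33.0):
    """Adjust the inter-frame wait by current_ms * trend * damping.

    The floor keeps a long run of improving latency (negative trend)
    from driving the interval to zero, i.e., constant frames.
    """
    adjusted = current_ms + current_ms * latency_trend * damping
    return max(floor_ms, adjusted)
```

With a worsening trend of +0.5, a 100 ms interval stretches to 145 ms; with a steady trend of 0.0 it stays put; and no amount of improvement pushes it below the floor.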
Note that if I’m doing other things, then I won’t have a full 30 KB/s available for video. What really matters is whether my total end-to-end performance is getting worse or better – not what caused it or how much of that performance is going towards video. This adjustment works regardless of what else is going on, and regardless of whether other data sources are adjusting as they should.
We take a different approach for data that is sent in response to persistent action by the user. For example, if I wave my mouse around just a bit, I’m not likely to hit my ISP’s data throttle. But if I wave it around fast and long enough, I can fill up my 30 KB/s with mouse data alone. (Or less, if I’m also sending video!) In this case, we want to drop data. It’s not easy to simulate “less movement” by the user. But we don’t want to drop data without telling the user. In the case of video, they either see that the frame rate is lower, or they don’t see the difference and don’t care. But here, the user can alter their behavior if they’re told about it. So for each mouse position event that we drop, we give a small click. When the user is waving their mouse around frantically, they may get a “braaaap” sound made up of a whole series of little clicks. It sounds very much like the sound that the screwdriver clutch on a drill makes when you go too far. The user doesn’t need to understand exactly what is going on; they just intuitively stop doing what they’re doing. Of course, this brings the data rate down, and things get back to normal very quickly.
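The drop-and-click behavior might be sketched like this. The drop condition (any positive trend) and the injected `send`/`click` callbacks are my assumptions for illustration, not the actual Croquet implementation:

```python
class MouseClutch:
    """Drops user-generated events while latency is worsening,
    sounding one small click per dropped event."""

    def __init__(self, send, click):
        self.send = send    # callback that transmits an event
        self.click = click  # callback that plays the feedback click

    def on_mouse_event(self, event, latency_trend):
        if latency_trend > 0.0:
            # Round trips are getting worse: engage the clutch.
            # Rapid events produce a run of clicks -- the "braaaap".
            self.click()
            return False  # event dropped
        self.send(event)
        return True
```

A quick trace: with callbacks that just append to lists, events sent during a worsening trend are dropped and clicked, while events sent during a flat or improving trend go through.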
The effect of all this is a very nice “direct manipulation” sort of feel. You get immediate feedback when you’re making things worse, and you get to make things better through your (in-)action. It has this lovely end-to-end behavior that keeps things going even when conditions change, and without an omniscient central-planning style of load balancing. And it practices a “design to fail” mentality that keeps things from getting out of hand even though not every part of either the Internet or our Croquet application is similarly designed.
As we improve the design, we’ll encounter these limitations less often. I’ll get my video rate up even on my home network. But regardless, this mechanism will stay in place.
I’m sure we’ll find a lot of things that need tweaking about this digital clutch concept, but I’m quite happy with the preliminary results.
[N.B.: The first version of this went out with a huge typo: I had consistently said 30 MB/s instead of 30 KB/s. Thanks to Ian, below, for catching this!]