Edges, Corners, and Malformed Pickles

28 May
May 28, 2013


A recent blog post here mentioned the difficulty that edge cases present. While we sometimes use ‘edge case’ to mean ‘rare condition’; if you want to be technical (and I do) specifically it refers to a case that occurs at an extreme of an operating parameter (a stereo speaker that fails at maximum volume).

In Jason’s example, the ‘edge’ is on the minimum side; if there’s no record in one system, the process stopped even though the other system could possibly have records in it. By analogy a ‘corner case’ is when more than one parameter is near it’s ‘edge’; two edges meet, and make a corner. (Say, when that same speaker only fails at maximum volume AND when in an environment of high humidity.)

Of course, in common parlance, we’ll use both terms to just mean a ‘rare case’. But of course, as a system sees more and more use, a rare case can become more and more common. Tonight, for your entertainment, two such strange cases. Normally, at the end of these technical posts there will be some little lesson or insight to take away; hopefully a simple trick or rule to follow to avoid a pitfall or pratfall. In this case, there won’t be one, really. Sometimes things are tricky; to find out what’s wrong you need to check the logs, read and re-read the code, think about the process, check a hunch, follow your gut, try a fix, fail, and start over again. A few times.

The Tale of the Vanishing Remainder

Submitted for your approval. One Ms. Crypto; a simple backend program that, together with its partner (Mr. Crypto) works to keep files encrypted and safe. Yet, some of the files are not making it to their intended destinations on mobile devices. The device simply sits and waits… for a packet that never comes. Lost in the ether? Blocked by a firewall? Or did the packet… never exist at all?

def encryptedSize (fileSize, chunkSize):
return (fileSize / chunkSize) * (chunkSize + 16) + (fileSize % chunkSize) + 16

Here we see a bit of code to get the size of the file that the device is waiting for. The device will wait until such time as it gets the whole file. Now, this file is being chopped into chunks and encrypted. Each chunk will have a little extra in it (16 bytes) because it’s encrypted. So how much bigger will the file be? Well, as you can see, we divide the file by the size of the chunks, and multiply that by the chunk size plus 16. Then we add in the remainder, what’s ever left over. Then we add 16 more for the size of this last chunk.

This works great, almost all of the time. You might already see the issue, though. You can think of this as putting pizza slices into boxes. Each box holds 8 slices. How many boxes do you need? Divide the number of slices by 8, of course. And then you’ll have one more box for whatever is left over. So add one. That’s more or less what the code above does.

What if you have 16 slices? Well, 16/8 is 2, so you fill two boxes with pizza. Then you take what’s left and put it in another box, so that’s 3 total. Wait, there’s nothing left? Why would you send out an empty box? Listen, you! That’s a very good question.

So yes, this will fail when the remainder is zero. Now, our chunk size is much more than ’8′ so the failures are going to be very, very rare. But they’ll happen. Oh, they’ll happen. Fortunately, the fix is simple, once you know the problem: Don’t send out empty pizza boxes.

def encryptedSize (fileSize, chunkSize):
if (fileSize % chunkSize) > 0:
return (fileSize / chunkSize) * (chunkSize + 16) + (fileSize % chunkSize) + 16
return (fileSize / chunkSize) * (chunkSize + 16)

The Case of the Malformed Pickle

Another class of problems are known as ‘race conditions’. The name comes from analog wiring, and the idea of two signals ‘racing’ to the end of the circuit. If one gets to a certain component first, you’ll have a different result than if the other one does. In programming, we get something similar when two processes are accessing the same data store.

In Python, we ‘pickle’ objects, which is just an overly-clever way of saying that we change them into a format suitable for storing. In one of our systems, we would occasionally get an EOF (end of file) error while trying to access a pickle. This, however, turned out to be a bit of a red herring, since no file was being accessed. Why was there a red herring in our pickle? Well, turns out the REAL error was that the data was encrypted, and something was wrong with the decryption. Thus, we were trying to de-pickle a string of still-encrypted data. This was discovered by checking the logs and seeing that the device was sending a key that didn’t match what the server had stored. But the device gets the key from the server… what’s going on here?

Well, let’s go back to 2003; when on a sunny Summer day the lights went out… all over the world. Well, okay, it was just the northeast U.S. and Canada, but that’s still a pretty big deal. There were a variety of reasons for the blackout that we won’t go into here, but the most interesting one occurred right here in the great state of Ohio, where a bug called a ‘race’ condition caused an alarm not to go off. Now, generally, a single alarm in a power control room in Ohio shouldn’t take down the whole eastern seaboard, but that’s another story.

My understanding of what happened is basically this: let’s say that when a power line sags and hits a tree and is knocked off line (happens a lot in the summer) you want to know about it. And let’s say when three of these occur in a certain time frame, you need to take some recourse (hitting the big button that says ‘don’t let the entire east coast go dark’). So the code would look something like this: (This isn’t the actual code used… I hope.)

10 PRINT "TreeHitLine Oh Noes"
20 GET NumberOfTreeAlarms FROM box
30 SET NumberOfTreeAlarms to (NumberofTreeAlarms + 1)
40 PUT NumberOfTreeAlarms IN box
50 IF NumberofTreeAlarms >= 3 THEN WhoopWhoopWhoopBigAlarm

Looks good, huh? So what’s the problem? Well, let’s say this program is being run by multiple ‘processes’ that use the same box and run at the same time.  So imagine signals coming in from two tree alarms at the same time, one in Parma and one in Solon. They’ll both run line 20 and get the value in the tree alarm box. Let’s say it’s one. Parma reads the box and gets one. Then Solon reads the value… after Parma read it, but before Parma stored the new value in the box. These things don’t happen instantly, after all. They’ll both read one, and add another one and get two. Then each one will store that value in the box. Line 50 will not trip the big alarm, since the value isn’t three; it’s two. Alarm doesn’t go off, 55 million people have to eat all the ice cream in their freezers. That’s a race condition.

So where does that leave us with our pickle? Well, have a look at this code:

c = Crypto ()
if 'url_encryption_key' in session:
key = session['url_encryption_key']
key = c.new_key ()
session['url_encryption_key'] = key
session.save ()

So, we need a key. If we’ve got one, we use it. If not, make a new one. Sounds good. But what if two requests come in at the same time? Well, we might have two see that there’s no key, and make one. Then, both will save to the session. But of course, the second one over-writes the first, so that first key is now invalid. The server will read the key from the session, do a compare, and throw out any requests with a non-matching key… so any requests that use the first one will fail, utterly.

Now, what if the device attempts to get all of the chapters for a book at one time, using a separate request for each chapter? That is pretty much what it does, and that just might cause some problems. We’ve got a race condition.

How do you fix a race condition? It isn’t easy. Really making things ‘thread safe’ can involve crazy things like semaphores and atomic transactions. Fortunately, though, in this case, we can greatly decrease the chance that we’ll see trouble by simply having the program check the current key before it sends out the data to the device:

key = session['url_encryption_key']

Since the new key is only created when the first request comes through, we don’t have to worry about race conditions happening after that first one, which is one of the things that makes fixing them so tricky… you just introduce a new race condition in fixing the first one. But here, all we need is for there to be one key for all requests. Whatever that key ended up being isn’t important; so we just make sure to pull it when we need it instead of assuming it’s the one our process chose. Of course, there can still be problem if the key is read the second time before the save occurs in the other thread. That’s unlikely here, but it could happen.

Also not that in the above code, the session isn’t reloaded right before the first check. That’s a problem that makes this worse (in a way – we’re not shutting down 1/4th of the country) than the Blackout bug… the ‘session’ variable here isn’t the “actual” session, but a copy we have. We’re not looking in the box, we’re looking at a picture of the box, that could be from awhile ago. Each process might be using outdated information and think they need a new session key when it really doesn’t. So let’s add that in as well:

session = get_session ()
c = Crypto ()
if 'url_encryption_key' in session: ...

With both of these fixes in place, we haven’t seen a bad pickle in a few days. Is it still possible in theory? Yes. But we certainly won’t see a few a day, as was the case previously. A solution that truly prevents race conditions will be a bit more work. But this should be enough to save the ice cream.

© Copyright 2017 Findaway. All rights reserved.