I’ve been reading Cryptography Engineering by Niels Ferguson, Bruce Schneier, and Tadayoshi Kohno, on the theory that someone who writes about privacy and surveillance as much as I do ought to have a somewhat more detailed understanding of how modern cryptosystems work, even if I’m never going to be competent to work with the actual code. At one point, the authors mention a potential problem with certain kinds of ciphers. Stream ciphers work by combining a secret cryptographic key with a (supposedly) unique number—a “nonce,” or number used once—to generate a “keystream.” The keystream is then XORed with the plaintext message to produce the encrypted ciphertext.
For the non-computer-geeks: that just means that for every bit in the sequence of ones and zeroes that makes up the plaintext, if the keystream has the same value in that position, then the corresponding bit of the ciphertext will get written as a 0, and if they have different values in that position, the corresponding bit of the ciphertext gets written as a 1. (This corresponds to the logical operation “exclusive or”: It outputs a 1, meaning “true,” just in case one or the other but not both of the inputs is true.) So, for instance, the capital letter “A” is normally encoded as the binary string: 01000001. A lowercase “z” is represented as 01111010. If you XOR them together, you get: 00111011. If you XOR in the “z” again, you get “A” back out… but that assumes you know at least one of the two original pieces of the puzzle: There’s a vast number of different ways to XOR two bytes together to produce 00111011.
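To make the bit-twiddling concrete, here’s that “A”/“z” example in a few lines of Python (my own illustration, not from the book):

```python
# XOR of two ASCII bytes, and XOR as its own inverse.
a = ord("A")              # 0b01000001
z = ord("z")              # 0b01111010
combined = a ^ z
print(format(combined, "08b"))   # -> 00111011

# XOR the "z" back in and the "A" falls out again:
print(chr(combined ^ z))         # -> A
```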
In theory, there should be no way to reverse the process without knowing the keystream, which requires knowing the cryptographic key. But there’s a loophole: If the designer of the system messes up and allows that supposedly-unique “nonce” to be reused, then you end up with two messages that have been encrypted (XORed) with the same keystream. That doesn’t tell you what the keystream is. But if an attacker knows which two messages have been encrypted with the same keystream, he can just XOR those two ciphertexts together. The keystream then cancels out mathematically, giving you the same result as if you’d just XORed the two original plaintexts together. Once you’ve got this, Ferguson et al. warn, an attacker will often be able to easily reverse the process and decompose that into the two original messages—provided the original messages aren’t just random gibberish, but something that exhibits patterns, like written English. But they didn’t bother explaining exactly how this could be done, so I ended up spending 15 minutes doodling on a legal pad trying to suss out how an attack would work. Even some of my geekier friends seemed to think it wasn’t possible when I floated the question on Twitter—and for some cases, it won’t be. For instance, if the two original messages are identical—meaning they have the same value at every bit position—then the result of XORing them is always going to be a string of zeroes, which makes it obvious the two initial messages were identical, but doesn’t give you any hint at the content of the messages.
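A quick sketch of that keystream cancellation in Python (the variable names and sample messages are mine, not the book’s):

```python
import os

p1 = b"ATTACK AT DAWN!"
p2 = b"RETREAT AT DUSK"
keystream = os.urandom(len(p1))   # stands in for the cipher's keystream output

# Encrypt both messages with the SAME keystream (the nonce-reuse mistake):
c1 = bytes(p ^ k for p, k in zip(p1, keystream))
c2 = bytes(p ^ k for p, k in zip(p2, keystream))

# XOR the two ciphertexts together: the keystream cancels completely,
# leaving exactly the XOR of the two plaintexts.
combined = bytes(a ^ b for a, b in zip(c1, c2))
assert combined == bytes(a ^ b for a, b in zip(p1, p2))
```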
Special cases aside, though, there definitely are some generally viable strategies for decomposing a file generated by XORing two messages—let’s assume they’re ordinary written English in ASCII character format—back into the original pair of texts. How would you go about it? I’ll update the post with the solutions that I came up with (or found online) later this weekend.
Update: I’m pleased, though not at all surprised, to see that I have a bunch of very smart readers who came up with basically all the strategies I did, and in some cases stated them with a good deal more sophistication than I could have. You’re probably better off just reading the comments, but I’ll summarize the basic ideas below the fold.
(1) Exploit Headers, Padding, & Formatting: I didn’t suggest any particular document structure beyond written English, but realistically you’d want to look for signs of some standard header information that would be at least partly identical (and of some characteristic length) for a pair of e-mail messages, HTML pages, Word documents, etc. The strictly identical and overlapping parts will be represented as strings of zero-bytes in the XOR file, which may be of some help to the extent that you can make educated guesses about what you’re looking at from the length and pattern of the overlapping strings.
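A sketch of what that first pass might look like in Python (the helper name and the run-length threshold are my own invention): scan the XOR file for runs of zero bytes, which mark stretches where the two documents were byte-for-byte identical.

```python
import re

def zero_runs(xored: bytes, min_len: int = 4):
    # Return (position, length) for each run of zero bytes at least
    # min_len long -- stretches where the two plaintexts coincided.
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(rb"\x00{%d,}" % min_len, xored)]

print(zero_runs(b"\x01" + b"\x00" * 5 + b"\x02"))   # -> [(1, 5)]
```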
Even more useful, once you’ve got some evidence that you’re dealing with a particular kind of structured document, you can try XORing in different characteristic header strings like “Content-Type: text/html” or “Received-From:” in different positions and seeing if and where it spits out something intelligible. I’m assuming throughout that you’ve got frequency tables that allow a search algorithm to quickly assess whether a given output string is likely to appear in written English (or a given type of structured document) or just random gibberish. For strings like “Received-From” that you’d expect to be in approximately—but not exactly—the same position in the document, you might just precompute the strings you get in the region of overlap when you XOR it against itself offset by one or more characters. Then you can just quickly scan for matches with your precomputed “signatures”.
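That offset-signature idea might be sketched like this (the function names, the header string chosen, and the range of offsets are all my assumptions):

```python
# Precompute the "signature" produced when a likely header string is XORed
# against a copy of itself shifted by a few characters, then scan for it.
crib = b"Content-Type: text/html"

def self_xor_signature(s: bytes, offset: int) -> bytes:
    # XOR the string against itself shifted by `offset`, restricted
    # to the region where the two copies overlap.
    return bytes(a ^ b for a, b in zip(s[offset:], s[:-offset]))

SIGNATURES = {off: self_xor_signature(crib, off) for off in range(1, 4)}

def scan_for_crib(xored: bytes):
    # Yield (position, offset) wherever a precomputed signature appears,
    # meaning both documents likely contain the crib at that offset.
    for off, sig in SIGNATURES.items():
        pos = xored.find(sig)
        if pos != -1:
            yield pos, off
```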
You can apply the same trick in the body if you think you’re dealing with a particular type of structured document. If you think at least one of the original plaintexts might use HTML, for instance, you’d XOR in long predictable strings like “<html><head><title>” at every position until it spits back something intelligible.
On the other end, if you’ve got messages of different lengths, check to see if one of them was padded out with zeroes (or some other characteristic padding string): XOR some padding characters in at the end of the file and see if what comes back looks intelligible. If you XOR in a bunch of zeroes and get out “Sincerely, Agent Smith” you’re probably on to something.
(2) Exploit ASCII: The ASCII character set systematically distinguishes between spaces and most common punctuation marks (which all begin 001), capital letters (010), and lowercase letters (011). There are some fairly uncommon punctuation marks that share initial bits with the letters, but for practical purposes we can ignore those at first. This turns out to give us a really useful shortcut for learning something about the structure of the two texts. Two of the same type of character XORed together, of course, are always going to start 000 (i.e. no difference), and most of the time that’s going to be two lowercase letters. But you can infer that a byte starting 010 means a space or punctuation mark XORed against a lowercase letter, 011 a space or punctuation against a capital letter, and 001 means a lowercase letter overlapping a capital. Since it’s going to be the most common punctuation by far, you probably want to default to treating an isolated 010 byte as a space. That gives you likely word breaks, which are handy for making your dictionary attack more efficient. Not all of them, mind you—they’ll be concealed when they coincide—but enough to be useful.
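Here’s a toy version of that heuristic in Python (the sample sentences and the helper are mine): flag every XORed byte whose top three bits are 010 as a probable space-against-lowercase collision.

```python
# Classify each XORed byte by its top three bits to guess where spaces sit.
def probable_spaces(xored: bytes):
    breaks = []
    for i, b in enumerate(xored):
        if b >> 5 == 0b010:   # space/punctuation XOR lowercase letter
            breaks.append(i)
    return breaks

p1 = b"meet me at noon"
p2 = b"the cash is gone"
xored = bytes(a ^ b for a, b in zip(p1, p2))
print(probable_spaces(xored))   # -> [3, 4, 7, 8, 10, 11]
# Each hit is a position where one text (but not the other) has a space.
```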
You can also look for characteristic sequences. If a sentence break in one document hits the middle of a word in the other, you’d have a sequence of bytes beginning 010, 010, 001—whereas a sequence beginning with 010, 010, 000 is more likely to be a punctuated word break (comma, semicolon, colon) within a sentence falling across a word.
To see how this would be helpful in practice, imagine you’ve got:
I love cheese!!
You should too.
The first three bytes of the XORed file are going to be:
00010000 01001111 00011001
How convenient, that looks like a one-letter word (or a contraction) in one of the original plaintexts right at the start! Immediately you’d want to try (or realistically, have your analytic algorithm try) XORing a capital “A” and a capital “I” in the first two positions. The first gives you the rather unpromising “Qo” in the first two places, which can be safely discarded, but “Yo” is rather more promising. Word frequency analysis helps you continue: You’ve got two characters of the same type next, and then another 010 byte. An English sentence with two sequential one-letter words would be unusual, but the string “You” followed by a space is extremely common, and testing it gives you the highly plausible complement “I lo”—and so on. Obviously, I’m somewhat artificially describing the process as a human codebreaker would approach it—the computerized attack would replace our intuitive sense of plausibility with a probability assignment based on the frequency with which possible strings of characters appear in English in different positions in a word or sentence. (The sequence “ing” appears with relatively high frequency at the ends of words, far less often at the start.)
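The steps above, run mechanically in Python (the `drag` helper is my own shorthand for “XOR a guessed string in at a given position and look at the complement”):

```python
p1 = b"I love cheese!!"
p2 = b"You should too."
xored = bytes(a ^ b for a, b in zip(p1, p2))

print(" ".join(format(b, "08b") for b in xored[:3]))
# -> 00010000 01001111 00011001

def drag(guess: bytes, pos: int = 0) -> str:
    # XOR a guessed string in at `pos`; whatever comes back is what the
    # OTHER plaintext must contain there if the guess is right.
    return bytes(x ^ g for x, g in zip(xored[pos:], guess)).decode()

print(drag(b"A "))    # -> 'Qo' -- unpromising, discard
print(drag(b"I "))    # -> 'Yo' -- promising
print(drag(b"You "))  # -> 'I lo' -- and so on
```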
(3) Whack It With a Dictionary: Aided by reasonable guesses about the format of the plaintexts and the locations of (some) word breaks and punctuation generated in the first two steps, you fill in the gaps by XORing in test words in every position—starting with the 500 most commonly used words in English—and once again recording the ones that generate output sequences also found frequently in English. When XORing in a word generates a possible word fragment, you test the various ways to complete it in the appropriate spot. In other words, if you plug in “the” and get out “vac”, you check whether the subsequent bytes generate anything interesting for XORing in “uum”, “ant”, “ation”, “illate”, and so on.
If you want to optimize the speed of a dictionary attack, you can also adjust the word list you’re using as you get probable hits. Ideally, you’d have a frequency table of uncommon words whose occurrence tends to be correlated, so that (for instance) your algorithm knows that if it has found that XORing in words generates the sequence “litigation,” it prioritizes testing various legal terms that would otherwise be lower on the list, or at least pump the same word back in as a search input. The word “encryption” may be pretty rare in written English, but if you know it appears once in a given text, the probability of a second occurrence rises dramatically.
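A bare-bones version of the dictionary pass (the wordlist, the plausibility filter, and all the names here are stand-ins I made up; a real attack would score candidates against n-gram frequency tables rather than just checking printability):

```python
import string

WORDS = [b"the ", b"and ", b"you ", b"that "]   # stand-in for a top-500 list
PRINTABLE = set(string.printable.encode())

def looks_plausible(fragment: bytes) -> bool:
    # Crude filter: every byte of the complement must at least be
    # printable ASCII to count as possible English.
    return all(b in PRINTABLE for b in fragment)

def dictionary_pass(xored: bytes):
    # Slide each word along the file; keep positions where the
    # complement (the other plaintext's bytes there) still looks like text.
    hits = []
    for word in WORDS:
        for pos in range(len(xored) - len(word) + 1):
            complement = bytes(x ^ w for x, w in zip(xored[pos:], word))
            if looks_plausible(complement):
                hits.append((pos, word, complement))
    return hits
```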
None of this guarantees that the original texts will be retrievable: As the limit case of identical plaintexts shows, there just may not be enough information remaining in the XORed file. But for most texts—different enough to leave traces of that difference, but similar enough to exploit some common regularities—these strategies should get you quite a lot. Probably there are others I’ve missed.