What have I done

I did something no one should ever do: I invented my own key derivation function, BFKDF. It’s based on scrypt so it can’t be too bad. I tried to make sure it’s not worse than scrypt. There is also some brainfuck involved which might just make me the only person who ever unironically used the language.

Why would I do something like that? Well, it all started with a new Twitter app which scrapes the website, stores all the tweets, greps for suicidal thoughts and sends alerts to friends. Sounds like a good idea, excepts it’s really not and it raises privacy concerns. True, Twitter is a public website. Unless you have a private account every tweet you ever sent is preserved forever and publicly available. However, there is a slight difference between just having it out there and having it analyzed, flagged, and e-mailed out to people. Insurance companies could refuse contracts or ask for higher fees if they know you are a “riskier” customer.

Naturally, whenever Twitter does something stupid, the topic of writing a distributed replacement comes up. But how would we do better? We’d like to keep it public, that’s the best part of Twitter, but we don’t want to allow anyone to collect all tweets. The only difference between a normal user and a scraper is the number of followed users and, in turn, the number of tweets read per second. We could encrypt them with a mostly zero key and tune it so cracking one takes about a second on a cellphone. Not a problem for regular users but it at least makes cracking all of them a little more difficult. We can make it more difficult by using a key derivation function such as PBKDF2 or scrypt as encrypt(kdf(mostly_zeros), message). The problem is that a cracking farm can beat a billion cellphones with specialized hardware. The great thing about scrypt is that it adds a large memory requirement for parallel cracking but cellphones have some RAM anyway and they don’t have to crack in parallel. There are only a few tweets coming in every few seconds. Cellphones also have a CPU. If you ever tried to mine bitcoins you know that this is something you want to avoid in a cracking circuit. You want as many copies of a pipelined implementation as you can fit on a chip.

My idea was to force them to include a fully functional CPU by generating deterministic random Turing complete code, give it deterministic random input and include the output in the final hash. Brainfuck happens to be pretty easy to randomly generate. We can’t just generate any sequence of the 8 opcodes but almost. A brainfuck program is only valid if it contains an equal number of “[” and “]” characters and, counting from left to right, there are never more “]”s than “[”s. Let’s try to count how many n character programs there are and the way to generate them with equal probability will follow.

The number of 0 character programs is 1. It’s just the empty string.
The number of 1 character programs is 6 because it can’t be any of the brackets.
The number of n > 1 character programs can be expressed recursively.
- If the first character is not a bracket, then there are 6 * f(n - 1) possible programs.
- If the first character is “[”, then we have to choose where to put the matching “]”.
  - Let 0 < k < n be this position.
  - For each possible k, there are f(k - 1) * f(n - k - 1) possible programs.
  - Using python notation, that’s sum([ f(k-1) * f(n-k-1) for k in range(1, n) ]).
- The first character can’t be “]”.
- That’s 6 * f(n - 1) + sum([ f(k - 1) * f(n - k - 1) for k in range(1, n) ]). Cool.

It’s easy to see that this is between 6^n and 7^n but I can’t be arsed to give you an exact number. It’s not important, there are enough. By following the above description and choosing the first character randomly from the 7 possible, then choosing k from uniformly if needed, then repeat recursively, we will pick each program with the same probability.

Now, there are few things we have to pay attention to when writing cryptographic code. One of the most important things is making sure the runtime of the algorithm does not depend on secret values. If we would seed a regular random number generator directly with the input of our brand new KDF and use it to generate brainfuck and its input, then measuring the runtime would probably tell the attacker everything they need to break it. It would still be OK for the Distributed Twitter scenario but not much else. Instead, let’s scrypt the input, use that as a key and IV for AES CTR, then take the bitstream and use that.

The next problem is the quality of random numbers generated by brainfuck. It sucks. It’s true, all programs have the same probability but many of them are equivalent. Infinite loops without output are common and so are infinite loops that output a short repeating sequence. Brainfuck control flow only distinguishes zeros and non-zeros, therefore the output contains too many 00, 01 and ff bytes. We can repeat this process until the output matches some complicated randomness condition but it will still suck. So let’s scrypt the output again, that’s what it’s good at, it makes shitty passwords secure. Finally, we append the input to this and scrypt the whole thing again.