Filesystems, version control, and the blockchain
Image courtesy of Adarsh Kummur
This is the fourth, and possibly final, article in my series on the blockchain without blockchain.
I’ve always had a warm place in my heart for filesystems.
I taught myself shell scripting while automating the installation of Disksuite, Sun’s free but sadistic disk mirroring software. I barely recall the actual work, instead remembering a hallway. I undertook a literal journey to learn programming, a repeated pilgrimage to the desk of a friend who took visible pleasure in explaining to me what I was doing wrong. It’s fair to say that if filesystems were less painful in the 90s, I would not be where I am today.
When Sun started advertising ZFS as the (finally!) successor to Disksuite and the filesystem it was built around, UFS, most of its functionality seemed obviously good — make the computers manage the disks, don’t demand people know up front how big a filesystem should be, don’t fail miserably when the server crashes, little things like that. But what was this data integrity thing? I’m embarrassed to say it took me a while to realize I needed it — who really cares if your filesystem is good at storing data, amirite? — and even longer to understand how it worked.
To explain it, I’m going to have to teach you cryptography. Just a little. You’re welcome to skip ahead if you’ve already got this part covered, but I expect most could use a little, ah, refresher. Step 1 in cryptography guides is usually: “Get a masters in mathematics from MIT.” I’m hoping to do a bit better than that. Cryptography really is just a form of math, and while we can’t all understand the details (I certainly don’t) we can at least understand the “algorithms happen here” flow diagrams.
Cryptography is most famous for its privacy utility: You use it to ensure you and only you can read your files and chat messages. It gets more complex once we need to read them on all of our different devices, but most of it is pretty similar in concept. Even more useful is ensuring both you and I can read some text, but no one else can. It’s more complex, but is essentially an extension of that first use.
Privacy is not the only use case for cryptography. It’s also useful for efficient validation. That is, it can be used to see if a file you have today is the same one you had yesterday. I sent you a document, you think it looks wrong; how do we make sure it did not get changed somehow in transit?
Obviously one way to do that is to just send it again. This is not a great solution, because if you did not trust it the first time, why would you trust it the second? That might also be a bad idea if bandwidth is expensive. You generally want a verification mechanism that takes less space than the original file, and less CPU power than directly comparing the two files.
Cryptography provides just such a capability, usually called a ‘hash function’. It’s an algorithm that converts, say, a large text file into a much shorter string. If you want to ensure the file is not changed in some way, just run it again and compare the output. The short strings are easier to compare than the long documents, and you could even read them over the phone to someone so they can check the file on their end. These algorithms generally produce a string of a fixed length, regardless of input — this makes them efficient for long term storage and comparison, and safe to run on any size file. Here’s an example hash from my files:
As you can see, it’s just a long string of gibberish. It’s not only useful for comparison, not meaningful in it’s own right.
What’s critical about these algorithms is that given a unique input they always provide a unique output. If you and I each have a file that hashes to a given string, then we can be confident we have exactly the same file. Of course, this can’t literally be true: We could design a hash function that only had 256 possible outputs, and there are obviously more than 256 possible inputs. This would produce a lot of what are called collisions, when two files hash to the same output, and, ah, is not terribly useful.
All of the modern hash functions are incredibly long. It is possible in theory but not in practice that a collision would happen. You’d need to execute the function 2^128 times. That’s 3.4 with 38 zeros after it. So, mathematically possible, but you can expect the sun to swallow the earth before the most secure hash functions get compromised. I mean, you can’t. You’ll be gone by then. But your files will still be safe.
Now that you’re at least as much an expert on cryptography as most of the bitcoin hodlers, why does any of this matter?
We were talking about data integrity.
You’d be right to guess that ZFS uses these hash functions to provide it. It goes further than just validating individual files. A little bit of cryptographic genius called a Merkle tree is the key. These don’t just hash the content on disk for later validation; they build a tree of hashes, where the leaf nodes are hashed by the nodes above them in the tree, which are themselves hashed by the root node. If any part of this system is corrupted — because the disk is broken, or someone changed the content some other way — it’s easy to detect. It’s not just that the individual hash will be different; remember each parent hashes all of its children, so now the parent is wrong. And its parent is wrong, too.
If the content is changed by any mechanism that does not also also update the Merkle tree, then it is easy to detect by rehashing all of the content and comparing the results to the stored tree.
This is how ZFS validates data integrity. It can write a block to disk, then pull the block and ensure it still matches the hash. When it writes a block, it updates the parallel tree, and when you ask for the block later, it can tell you if the block is still correct. If it’s not, it throws an error instead of handing it back to you.
When I first learned of this, it seemed overkill, but over time I remembered just how many ways there are for data to get corrupted. The most obvious one is someone changes it for nefarious reasons, but far more commonly you have a failure somewhere in the writing or reading process. The old spinning disks were error-prone, and the new SSD drives degrade eventually. It’s the complexity of reading and writing that really gets you, though: There are multiple layers of caches, drivers, and connections, any of which could introduce corruption.
For the first time on a normal production system, you could at least detect any of those problems. It’s too bad no one ever used it.
I know, I know, you came to hear about how you could get all the awesomeness of blockchain without using the blockchain and instead I’m giving lessons on two things you could literally not care less about, cryptography and filesystems. Don’t worry. It gets worse from here.
Long after I learned about and promptly forgot ZFS (after all, it’s not like I was using it), I adopted Git. It’s a version control system, used for storing and managing source code. Every geek knows about it, but most of the world only recently learned of it when Microsoft bought Github for $7.5B with a ‘b’. I was an early adopter, switching Puppet to Git in 2008. Eventually I even learned how it works. I was titillated and a bit horrified that I had duplicated in Puppet one of the key features that made Git work: A system of storing files that allowed them to be looked up by their content (or rather, a hash of their content). Normally you store files by a name, but if lots of people (or, in Puppet’s case, computers) store the same file, they might not call it the same thing, so Git and Puppet instead stored them by their hash. This ensured we never backed up more than one copy of a file, saving a lot of space, and made it easy to check for changes in files.
For Puppet, we just used this to back up files we changed, in case people later wanted to revert.
Git did a lot more than that.
Like ZFS, it builds a Merkle tree of the entire file repository, with a similar goal: To understand what files have changed and how. After all, git is used to track and share changes to a collection of files. The sharing is a critical component; you can easily copy an entire git repository to another computer, or another person, and it’s important that they be able to confirm that they have a faithful copy.
Git stores the hash tree alongside all of the files. At any point, you can use the tree to validate every file in your tree. If there are changes (which is pretty much the whole point of a version control system), it can automatically store the new files and update the related tree.
Just like ZFS, one of the key features here is that the Merkle tree allows us to validate every file stored. We can walk the file tree and compare each file to its hash, and then compare the file listing to its own hash, all the way up. Any discrepancy is easily spotted.
This is my favorite kind of cleverness: It’s simple in implementation, yet makes Git more flexible and useful. It has power that other version control systems are missing, just because it relies on this basic mechanism for storage and validation.
Ok. Now we get to the point.
Again, I’m not actually interested in the blockchain. I’m interested in peeling it apart, putting the useful bits to work while avoiding the whole anarcho-capitalist aspect.
It would be easy to see the blockchain as a sudden revolution, a dramatic change in what’s possible. Viewed this way, it’s hard to separate the pieces from the whole. If all you see is the big picture, it’s easy not to notice that every individual component has its own history, its own value.
The blockchain was gradual, for both me and the industry. It was not one giant leap forward. It was part of a story, a sequence, and the most interesting aspect — Merkle trees — is decades old in math and now pushing decades old even in popular usage. Most of the interesting features touted in the blockchain come directly from them. Immutability (which isn’t) and trustless systems derive directly.
It’s worth understanding that history, to see which stages and steps apply to problems you have. The current cryptocurrency tech stack is built to solve problems I don’t think exist. Certainly they aren’t problems I have.
Unlike the blockchain as a whole, though, the individual technical components have been used for years, even decades, in production. Focusing on the current trend can blind you to the opportunity history demonstrates. I think you’re a lot more likely to find broadly applicable solutions there than in trying to replace currency.
Because I got here from the world of filesystems and version control, I see different benefits than you might if you approach thinking of currencies or exchanges. Or chat messages. That does not make me right or wrong, but it does, at least, mean we’re going to work on different problems.
I expect most of you think this is boring. That’s great. It will give me that much more time to build something.