Skip to main content

Reed-Solomon libraries

If you want to run a multi-petabyte storage system then you don't want to do it with Raid 5 or Raid 6; with modern disks' ~3% per year failure rate, that's 300 a year when you have 10000 disks and the odds start to get pretty good (relatively speaking) that you'll face permanent data loss at some point when you lose a third disk from an array while two are rebuilding. And of course monitoring and replacing disks in lots of small arrays is manpower-intensive, which to investors translates as "expensive."

You probably don't want to go with triplication, either; disks are cheap, but not so cheap that you want to triple your hardware costs unnecessarily. While storing multiple copies of frequently used data is good, all your data probably isn't "frequently used."

What is the solution? As it turns out, Raid is actually a special case of Reed-Solomon encoding, which lets you specify any degree of redundancy you want. You can be safer than triplication with a fraction of the space needed.

I was prompted to write this because Mozy open-sourced the Reed-Solomon library I used while I was there, librs, complete with Python bindings. The original librs we used at Mozy was written by Byron Clark, a formidible task. Later we switched to the version you see on sourceforge, based on Plank's original encoder. I wasn't involved with librs at all except to fix a couple reference leaks in the Python wrapper.

But if you're actually looking for an rs library to use, Alen Peacock, who is much more knowledgeable than I about the gory details involved here, tells me that if you are starting from scratch the two libraries you should evaluate are zfec, which also comes with Python bindings, and Jerasure which is an updated -- i.e., probably faster than his first -- encoder by Plank. (Jerasure has nothing to do with Java.)

Comments

Shane Hathaway said…
It's great to see these new libraries pop up. Reed Solomon accomplishes an amazing trick that more software developers should be aware of. I don't think the possibilities have been explored nearly enough.

The unintuitive truth is that a tunable error correction algorithm easily achieves much higher reliability than replication, with less hardware. For example, if I organize my data into blocks spanning 20 disks, then put 10 error correction blocks on 10 other disks, that data is statistically safer than it would be on a system that maintains 4 replicas. Even better, I only need 50% extra space instead of 300%! (In fact, 20+10 is overkill; I'd rather use around 20+7.)

I've created a petabyte-scale storage system based on RS called Bit Mountain, but my employer is not mature in the ways of open source, so I can't release it. Fortunately, several other groups are now seeing the light, including allmydata.org. I need to check out their Tahoe project.

Another way I'd like to see RS applied is in packet-level transmission, especially for VOIP. Bandwidth isn't a problem anymore and delays up to 500 ms are unimportant. What's bad is regular packet loss. RS could solve the packet loss by generating a stream of forward error correcting packets that accompany the normal packets. I wish I could just turn on some iptables filter to add RS coding to a connection.
Unknown said…
Note that jerasure now lives at http://jerasure.org

Popular posts from this blog

Why schema definition belongs in the database

Earlier, I wrote about how ORM developers shouldn't try to re-invent SQL . It doesn't need to be done, and you're not likely to end up with an actual improvement. SQL may be designed by committee, but it's also been refined from thousands if not millions of man-years of database experience. The same applies to DDL. (Data Definition Langage -- the part of the SQL standard that deals with CREATE and ALTER.) Unfortunately, a number of Python ORMs are trying to replace DDL with a homegrown Python API. This is a Bad Thing. There are at least four reasons why: Standards compliance Completeness Maintainability Beauty Standards compliance SQL DDL is a standard. That means if you want something more sophisticated than Emacs, you can choose any of half a dozen modeling tools like ERwin or ER/Studio to generate and edit your DDL. The Python data definition APIs, by contrast, aren't even compatibile with other Python tools. You can't take a table definition

Python at Mozy.com

At my day job, I write code for a company called Berkeley Data Systems. (They found me through this blog, actually. It's been a good place to work.) Our first product is free online backup at mozy.com . Our second beta release was yesterday; the obvious problems have been fixed, so I feel reasonably good about blogging about it. Our back end, which is the most algorithmically complex part -- as opposed to fighting-Microsoft-APIs complex, as we have to in our desktop client -- is 90% in python with one C extension for speed. We (well, they, since I wasn't at the company at that point) initially chose Python for speed of development, and it's definitely fulfilled that expectation. (It's also lived up to its reputation for readability, in that the Python code has had 3 different developers -- in serial -- with very quick ramp-ups in each case. Python's succinctness and and one-obvious-way-to-do-it philosophy played a big part in this.) If you try it out, pleas

A review of 6 Python IDEs

(March 2006: you may also be interested the updated review I did for PyCon -- http://spyced.blogspot.com/2006/02/pycon-python-ide-review.html .) For September's meeting, the Utah Python User Group hosted an IDE shootout. 5 presenters reviewed 6 IDEs: PyDev 0.9.8.1 Eric3 3.7.1 Boa Constructor 0.4.4 BlackAdder 1.1 Komodo 3.1 Wing IDE 2.0.3 (The windows version was tested for all but Eric3, which was tested on Linux. Eric3 is based on Qt, which basically means you can't run it on Windows unless you've shelled out $$$ for a commerical Qt license, since there is no GPL version of Qt for Windows. Yes, there's Qt Free , but that's not exactly production-ready software.) Perhaps the most notable IDEs not included are SPE and DrPython. Alas, nobody had time to review these, but if you're looking for a free IDE perhaps you should include these in your search, because PyDev was the only one of the 3 free ones that we'd consider using. And if you aren