Andrew Morton has famously called ZFS a "rampant layering violation" because it combines the functionality of a filesystem, volume manager, and RAID controller. I suppose it depends on what the meaning of the word "violate" is. While designing ZFS we observed that the standard layering of the storage stack induces a surprising amount of unnecessary complexity and duplicated logic. We found that by refactoring the problem a bit -- that is, changing where the boundaries are between layers -- we could make the whole thing much simpler.
An example from mathematics (my actual background) provides a useful prologue.
Suppose you had to compute the sum, from n=1 to infinity, of 1/n(n+1).
Expanding that out term by term, we have:
1/(1*2) + 1/(2*3) + 1/(3*4) + 1/(4*5) + ...
That is,
1/2 + 1/6 + 1/12 + 1/20 + ...
What does that infinite series add up to? It may seem like a hard problem, but that's only because we're not looking at it right. If you're clever, you might notice that there's a different way to express each term:
1/n(n+1) = 1/n - 1/(n+1)
For example,
1/(1*2) = 1/1 - 1/2
1/(2*3) = 1/2 - 1/3
1/(3*4) = 1/3 - 1/4
Thus, our sum can be expressed as:
(1/1 - 1/2) + (1/2 - 1/3) + (1/3 - 1/4) + (1/4 - 1/5) + ...
Now, notice the pattern: each term that we subtract, we add back. Only in Congress does that count as work. So if we just rearrange the parentheses -- that is, if we rampantly violate the layering of the original problem by using associativity to refactor the arithmetic across adjacent terms of the series -- we get this:
1/1 + (-1/2 + 1/2) + (-1/3 + 1/3) + (-1/4 + 1/4) + ...
or
1/1 + 0 + 0 + 0 + ...
In other words,
1.
Isn't that cool?
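A footnote for the rigor-minded: rearranging an infinite series deserves a word of justification, and the partial sums supply it. Each partial sum is finite, so the cancellation there is ordinary arithmetic, and the limit follows. Sketched in LaTeX notation:

\frac{1}{n(n+1)} = \frac{1}{n} - \frac{1}{n+1},
\qquad
S_N = \sum_{n=1}^{N} \frac{1}{n(n+1)}
    = \sum_{n=1}^{N} \left( \frac{1}{n} - \frac{1}{n+1} \right)
    = 1 - \frac{1}{N+1}
    \;\to\; 1 \quad \text{as } N \to \infty.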
Mathematicians have a term for this. When you rearrange the terms of a series so that they cancel out, it's called telescoping -- by analogy with a collapsible hand-held telescope. In a nutshell, that's what ZFS does: it telescopes the storage stack. That's what allows us to have a filesystem, volume manager, single- and double-parity RAID, compression, snapshots, clones, and a ton of other useful stuff in just 80,000 lines of code.
A storage system is more complex than this simple analogy, but at a high level the same idea really does apply. You can think of any storage stack as a series of translations from one naming scheme to another -- ultimately translating a filename to a disk LBA (logical block address). Typically it looks like this:
filesystem(upper): filename to object (inode)
filesystem(lower): object to volume LBA
volume manager: volume LBA to array LBA
RAID controller: array LBA to disk LBA
This is the stack we're about to refactor.
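To make those four translations concrete, here is a deliberately toy sketch in C. The type and function names are invented for illustration only -- they don't correspond to any real filesystem, volume manager, or RAID controller interface -- and the layer functions are left as bare declarations, since the point is just the shape of the composition:

#include <stdint.h>

/* One made-up type per naming scheme in the traditional stack. */
typedef uint64_t inode_t;      /* object number within the filesystem   */
typedef uint64_t vol_lba_t;    /* block address within a logical volume */
typedef uint64_t array_lba_t;  /* block address within a RAID array     */
typedef uint64_t disk_lba_t;   /* block address on a physical disk      */

/* One translation per layer (declarations only, for illustration). */
inode_t     fs_upper_lookup(const char *filename);    /* filename   -> object     */
vol_lba_t   fs_lower_map(inode_t obj, uint64_t off);  /* object     -> volume LBA */
array_lba_t vm_map(vol_lba_t lba);                    /* volume LBA -> array LBA  */
disk_lba_t  raid_map(array_lba_t lba);                /* array LBA  -> disk LBA   */

/* A read is the composition of all four translations. */
disk_lba_t
where_on_disk(const char *filename, uint64_t offset)
{
        return raid_map(vm_map(fs_lower_map(fs_upper_lookup(filename), offset)));
}

The vm_map() step here is the English-to-French hop described below: a translation whose output nobody but the next translator ever looks at.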
First, note that the traditional filesystem layer is too monolithic. It would be better to separate the filename-to-object part (the upper half) from the object-to-volume-LBA part (the lower half) so that we could reuse the same lower-half code to support other kinds of storage, like objects and iSCSI targets, which don't have filenames. These storage classes could then speak to the object layer directly. This is more efficient than going through something like /dev/lofi, which makes a POSIX file look like a device. But more importantly, it provides a powerful new programming model -- object storage -- without any additional code.
Second, note that the volume LBA is completely useless. Adding a layer of indirection often adds flexibility, but not in this case: in effect we're translating from English to French to German when we could just as easily translate from English to German directly. The intermediate French has no intrinsic value. It's not visible to applications, it's not visible to the RAID array, and it doesn't provide any administrative function. It's just overhead.
So ZFS telescoped that entire layer away. There are just three distinct layers in ZFS: the ZPL (ZFS POSIX Layer), which provides traditional POSIX filesystem semantics; the DMU (Data Management Unit), which provides a general-purpose transactional object store; and the SPA (Storage Pool Allocator), which provides virtual block allocation and data transformations (replication, compression, and soon encryption). The overall ZFS translation stack looks like this:
ZPL: filename to object
DMU: object to DVA (data virtual address)
SPA: DVA to disk LBA
The DMU provides both file and block access to a common pool of physical storage. File access goes through the ZPL, while block access is just a direct mapping to a single DMU object. We're also developing new data access methods that use the DMU's transactional capabilities in more interesting ways -- more about that another day.
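For symmetry with the earlier sketch, here is the same kind of toy composition for the telescoped stack. Again, the type and function names are invented for illustration -- they are not the actual ZPL, DMU, or SPA interfaces -- but they show both access paths sharing the object layer:

#include <stdint.h>

typedef uint64_t object_t;    /* DMU object number                    */
typedef uint64_t dva_t;       /* data virtual address within the pool */
typedef uint64_t disk_lba_t;  /* block address on a physical disk     */

/* One translation per layer (declarations only, for illustration). */
object_t   zpl_lookup(const char *filename);     /* ZPL: filename -> object   */
dva_t      dmu_map(object_t obj, uint64_t off);  /* DMU: object   -> DVA      */
disk_lba_t spa_map(dva_t dva);                   /* SPA: DVA      -> disk LBA */

/* File access: three translations instead of four, and no volume LBA anywhere. */
disk_lba_t
file_block(const char *filename, uint64_t offset)
{
        return spa_map(dmu_map(zpl_lookup(filename), offset));
}

/* Block access: skip the ZPL and address a single DMU object directly. */
disk_lba_t
volume_block(object_t vol_obj, uint64_t offset)
{
        return spa_map(dmu_map(vol_obj, offset));
}

Everything below zpl_lookup() is shared, which is why object and block storage fall out of the same code.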
The ZFS architecture eliminates an entire layer of translation -- and along with it, an entire class of metadata (volume LBAs). It also eliminates the need for hardware RAID controllers. At the same time, it provides a useful new interface -- object storage -- that was previously inaccessible because it was buried inside a monolithic filesystem.
I certainly don't feel violated. Do you?
Unsurprisingly, I'm aiming to migrate the system (on the same equipment) to ZFS on Solaris as soon as we can manage it.
Posted by Andrew Boyko on May 04, 2007 at 07:10 AM PDT #
Of course not. Did you ever see the movie "Toy Story"? In one scene, the piggy bank asks Mr. Potato Head about Woody:
Piggy: "What's with him?" Mr. Potato Head: "Laser envy!"
What I mean to say is, somebody has "ZFS envy." No need to point fingers at anyone in particular, is there?
Posted by UX-admin on May 04, 2007 at 08:45 AM PDT #
"The missing functionality (no online restripe, no online vdev removal) and difficulty in obtaining detail runtime information from a running system (how fragmented is *your* tank?) is evidence of a still-maturing subsystem."
ZFS sure is not as mature as, say, UFS or ext3.
Functionality like vdev removal is something that is very desirable.
It sure is going to take some time to convince some users to switch to ZFS.
But, IMHO, by sticking with UFS (or ext3, VxFS, ...) you are probably just comfortable with potential silent data corruption because it has always been there. :)
-Manoj
Posted by Manoj Joseph on May 05, 2007 at 12:30 PM PDT #
And based on Linux's track record it will be here in about 5 years, then rewritten in each and every kernel release going forward. Take two steps forward and one step back... it's the Linux way!
Posted by Bryan on May 05, 2007 at 04:19 PM PDT #
I don't feel violated, but I can honestly say after that explanation: I need a cigarette.
Keep up the good work with ZFS. Looking forward to seeing ZFS on other platforms like *BSD and OS X.
Regards,
Nix
Posted by Nix on May 05, 2007 at 06:09 PM PDT #
Then bug reports should be filed against those kernel modules.
"With the experience we've had with the Linux-based system (which, admittedly, is generally working OK)"
Hmmm, Linux successfully managing 70TB, you say??? Sounds pretty Enterprise to me...
Posted by Ron on May 05, 2007 at 08:40 PM PDT #
Thanks to software like FUSE and patent-dodging libraries like libdvdcss and LAME, Linux WILL have ZFS support. Of course, in user space it will be muuuuuuch slower, but at least you will be able to read your files. Death to software patents and closed standards!
Posted by Nick on May 06, 2007 at 03:32 AM PDT #
Posted by Richard Steven Hack on May 06, 2007 at 02:37 PM PDT #
Posted by Richard Steven Hack on May 06, 2007 at 02:40 PM PDT #
Let's try again....
Here's the bottom line for file systems of any kind:
Can you get the data back out when they fail? And when their "corruption repair" mechanisms fail -- as they inevitably will at some point?
If you can't guarantee that, don't introduce it.
I don't like Windows "dynamic disks" and I don't like LVM for precisely the reason that there are few tools to allow data recovery when they fail.
As for ZFS, I don't know it well enough to know if it's better or worse, or whether there are recovery tools. I just know I won't use it if there aren't recovery tools available for it.
From any sized business standpoint, that's a no brainer.
Geeks need to remember that "reliability - and recoverability - is Job One" for business use of a computer system. "Cool" is NOT "Job One."
And to those who say this whole issue is a non-issue if you use proper backups, I say: "You're right -- and what happens when the backups fail, as they inevitably will at some point, just like the file system?" Especially if your file system fails, it's quite possible your backups will fail too -- silently -- without you knowing it until the file system failure requires restoring a backup.
Posted by Richard Steven Hack on May 06, 2007 at 02:43 PM PDT #
Regarding Linux and FS performance... as anyone who has tried using Linux as an NFS server should know, its NFS server performance is terrible (yes, even with the kernel NFS server module). My old SS20 has better NFS server performance than a P-III/550 running SuSE 9.3 or 10.2. The NFS server performance of ZFS is superb. "What are the numbers?" you ask? Intel P-III/550 with 512 MB memory, SuSE 10.2: NFS write performance of less than 1 MB/sec. Same hardware with Solaris 10 x86 U3 (11/06) and ZFS: about 10 MB/sec (the limit of the 100 Mbps ethernet connection). The client is an iMac G5 1.8 GHz running 10.4.current.
Regarding battery backup for software RAID, Sun had that solution years ago in the PrestoServe product, except that it was for all disk I/O, not just RAID. You had to get the stack layering correct between PrestoServe and DiskSuite or you could see data loss due to a power outage (the packages installed in the correct layers, but you could apply patches in an order that would get you in trouble).
On the other hand, I am in the camp that feels ZFS needs to mature some before I commit mission-critical data to it. The exception is some data sets we have that need the extra performance we are seeing with ZFS for lots of small files on huge filesystems.
Posted by Paul Kraus on May 07, 2007 at 05:37 AM PDT #
On a side note, Joyent might be Sun's best marketing team.
Posted by Andrew Boyko on May 07, 2007 at 07:23 AM PDT #
Our original stack supported the AT&T TLS protocol, and was 'flattened' to optimize for it. But we eventually had to support sockets, and that caused enough problems to force a complete rewrite of the stack. The thing is that the redesign ended up optimizing the way we moved data up and down the stack, and the result was better performance than before.
The issues with file systems may be completely different, but IMHO, layering (within reason) provides better long-term flexibility and maintainability, and the focus of optimization should be on passing the absolute minimum amount of information needed to get data where it needs to be.
Later . . . Jim
Posted by JJS on May 07, 2007 at 05:37 PM PDT #