Andrew Morton has famously called ZFS a "rampant layering violation" because it combines the functionality of a filesystem, volume manager, and RAID controller. I suppose it depends on what the meaning of the word "violate" is. While designing ZFS we observed that the standard layering of the storage stack induces a surprising amount of unnecessary complexity and duplicated logic. We found that by refactoring the problem a bit -- that is, changing where the boundaries are between layers -- we could make the whole thing much simpler.
An example from mathematics (my actual background) provides a useful prologue.
Suppose you had to compute the sum, from n=1 to infinity, of 1/n(n+1).
Expanding that out term by term, we have:
1/(1*2) + 1/(2*3) + 1/(3*4) + 1/(4*5) + ...
That is,
1/2 + 1/6 + 1/12 + 1/20 + ...
What does that infinite series add up to? It may seem like a hard problem, but that's only because we're not looking at it right. If you're clever, you might notice that there's a different way to express each term:
1/n(n+1) = 1/n - 1/(n+1)
For example,
1/(1*2) = 1/1 - 1/2
1/(2*3) = 1/2 - 1/3
1/(3*4) = 1/3 - 1/4
Thus, our sum can be expressed as:
(1/1 - 1/2) + (1/2 - 1/3) + (1/3 - 1/4) + (1/4 - 1/5) + ...
Now, notice the pattern: each term that we subtract, we add back. Only in Congress does that count as work. So if we just rearrange the parentheses -- that is, if we rampantly violate the layering of the original problem by using associativity to refactor the arithmetic across adjacent terms of the series -- we get this:
1/1 + (-1/2 + 1/2) + (-1/3 + 1/3) + (-1/4 + 1/4) + ...
or
1/1 + 0 + 0 + 0 + ...
In other words,
1.
Isn't that cool?
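A footnote for the rigor-minded: rearranging an infinite series deserves a word of justification, and the partial sums supply it. Each partial sum is finite, so the cancellation there is ordinary arithmetic, and the limit follows. Sketched in LaTeX notation:

\frac{1}{n(n+1)} = \frac{1}{n} - \frac{1}{n+1},
\qquad
S_N = \sum_{n=1}^{N} \frac{1}{n(n+1)}
    = \sum_{n=1}^{N} \left( \frac{1}{n} - \frac{1}{n+1} \right)
    = 1 - \frac{1}{N+1}
    \;\to\; 1 \quad \text{as } N \to \infty.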
Mathematicians have a term for this. When you rearrange the terms of a series so that they cancel out, it's called telescoping -- by analogy with a collapsible hand-held telescope. In a nutshell, that's what ZFS does: it telescopes the storage stack. That's what allows us to have a filesystem, volume manager, single- and double-parity RAID, compression, snapshots, clones, and a ton of other useful stuff in just 80,000 lines of code.
A storage system is more complex than this simple analogy, but at a high level the same idea really does apply. You can think of any storage stack as a series of translations from one naming scheme to another -- ultimately translating a filename to a disk LBA (logical block address). Typically it looks like this:
filesystem(upper): filename to object (inode)
filesystem(lower): object to volume LBA
volume manager: volume LBA to array LBA
RAID controller: array LBA to disk LBA
This is the stack we're about to refactor.
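To make those four translations concrete, here is a deliberately toy sketch in C. The type and function names are invented for illustration only -- they don't correspond to any real filesystem, volume manager, or RAID controller interface -- and the layer functions are left as bare declarations, since the point is just the shape of the composition:

#include <stdint.h>

/* One made-up type per naming scheme in the traditional stack. */
typedef uint64_t inode_t;      /* object number within the filesystem   */
typedef uint64_t vol_lba_t;    /* block address within a logical volume */
typedef uint64_t array_lba_t;  /* block address within a RAID array     */
typedef uint64_t disk_lba_t;   /* block address on a physical disk      */

/* One translation per layer (declarations only, for illustration). */
inode_t     fs_upper_lookup(const char *filename);    /* filename   -> object     */
vol_lba_t   fs_lower_map(inode_t obj, uint64_t off);  /* object     -> volume LBA */
array_lba_t vm_map(vol_lba_t lba);                    /* volume LBA -> array LBA  */
disk_lba_t  raid_map(array_lba_t lba);                /* array LBA  -> disk LBA   */

/* A read is the composition of all four translations. */
disk_lba_t
where_on_disk(const char *filename, uint64_t offset)
{
        return raid_map(vm_map(fs_lower_map(fs_upper_lookup(filename), offset)));
}

The vm_map() step here is the English-to-French hop described below: a translation whose output nobody but the next translator ever looks at.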
First, note that the traditional filesystem layer is too monolithic. It would be better to separate the filename-to-object part (the upper half) from the object-to-volume-LBA part (the lower half) so that we could reuse the same lower-half code to support other kinds of storage, like objects and iSCSI targets, which don't have filenames. These storage classes could then speak to the object layer directly. This is more efficient than going through something like /dev/lofi, which makes a POSIX file look like a device. But more importantly, it provides a powerful new programming model -- object storage -- without any additional code.
Second, note that the volume LBA is completely useless. Adding a layer of indirection often adds flexibility, but not in this case: in effect we're translating from English to French to German when we could just as easily translate from English to German directly. The intermediate French has no intrinsic value. It's not visible to applications, it's not visible to the RAID array, and it doesn't provide any administrative function. It's just overhead.
So ZFS telescoped that entire layer away. There are just three distinct layers in ZFS: the ZPL (ZFS POSIX Layer), which provides traditional POSIX filesystem semantics; the DMU (Data Management Unit), which provides a general-purpose transactional object store; and the SPA (Storage Pool Allocator), which provides virtual block allocation and data transformations (replication, compression, and soon encryption). The overall ZFS translation stack looks like this:
ZPL: filename to object
DMU: object to DVA (data virtual address)
SPA: DVA to disk LBA
The DMU provides both file and block access to a common pool of physical storage. File access goes through the ZPL, while block access is just a direct mapping to a single DMU object. We're also developing new data access methods that use the DMU's transactional capabilities in more interesting ways -- more about that another day.
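For symmetry with the earlier sketch, here is the same kind of toy composition for the telescoped stack. Again, the type and function names are invented for illustration -- they are not the actual ZPL, DMU, or SPA interfaces -- but they show both access paths sharing the object layer:

#include <stdint.h>

typedef uint64_t object_t;    /* DMU object number                    */
typedef uint64_t dva_t;       /* data virtual address within the pool */
typedef uint64_t disk_lba_t;  /* block address on a physical disk     */

/* One translation per layer (declarations only, for illustration). */
object_t   zpl_lookup(const char *filename);     /* ZPL: filename -> object   */
dva_t      dmu_map(object_t obj, uint64_t off);  /* DMU: object   -> DVA      */
disk_lba_t spa_map(dva_t dva);                   /* SPA: DVA      -> disk LBA */

/* File access: three translations instead of four, and no volume LBA anywhere. */
disk_lba_t
file_block(const char *filename, uint64_t offset)
{
        return spa_map(dmu_map(zpl_lookup(filename), offset));
}

/* Block access: skip the ZPL and address a single DMU object directly. */
disk_lba_t
volume_block(object_t vol_obj, uint64_t offset)
{
        return spa_map(dmu_map(vol_obj, offset));
}

Everything below zpl_lookup() is shared, which is why object and block storage fall out of the same code.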
The ZFS architecture eliminates an entire layer of translation -- and along with it, an entire class of metadata (volume LBAs). It also eliminates the need for hardware RAID controllers. At the same time, it provides a useful new interface -- object storage -- that was previously inaccessible because it was buried inside a monolithic filesystem.
I certainly don't feel violated. Do you?
Unsurprisingly, I'm aiming to migrate the system (on the same equipment) to ZFS on Solaris as soon as we can manage it.
Posted by Andrew Boyko on May 04, 2007 at 07:10 AM PDT #
Of course not. Did you ever see the movie "Toy Story"? In one scene, the piggy bank asks Mr. Potato Head about Woody:
Piggy: "What's with him?" Mr. Potato Head: "Laser envy!"
What I mean to say is, somebody has "ZFS envy." No need to point fingers at anyone in particular, is there?
Posted by UX-admin on May 04, 2007 at 08:45 AM PDT #
"The missing functionality (no online restripe, no online vdev removal) and difficulty in obtaining detail runtime information from a running system (how fragmented is *your* tank?) is evidence of a still-maturing subsystem."
ZFS sure is not as mature as, say, UFS or ext3.
Functionality like vdev removal is something that is very desirable.
It sure is going to take some time to convince some users to switch to ZFS.
But, IMHO, by sticking with UFS (or ext3, VxFS, ...) you are probably just comfortable with potential silent data corruption because it has always been there. :)
-Manoj
Posted by Manoj Joseph on May 05, 2007 at 12:30 PM PDT #
And based on Linux's track record it will be here in about 5 years, then rewritten in each and every kernel release going forward. Take two steps forward and one step back... it's the Linux way!
Posted by Bryan on May 05, 2007 at 04:19 PM PDT #
I don't feel violated, but I can honestly say after that explanation: I need a cigarette.
Keep up the good work with ZFS. Looking forward to seeing ZFS on other platforms like *BSD and OS X.
Regards,
Nix
Posted by Nix on May 05, 2007 at 06:09 PM PDT #
Then bug reports should be filed against those kernel modules.
"With the experience we've had with the Linux-based system (which, admittedly, is generally working OK)"
Hmmm, Linux successfully managing 70TB, you say??? Sounds pretty Enterprise to me...
Posted by Ron on May 05, 2007 at 08:40 PM PDT #
Thanks to software like FUSE and patent-dodging libraries like libdvdcss and LAME, Linux WILL have ZFS support. Of course, in user space it will be muuuuuuch slower, but at least you will be able to read your files. Death to software patents and closed standards!
Posted by Nick on May 06, 2007 at 03:32 AM PDT #
Posted by Richard Steven Hack on May 06, 2007 at 02:37 PM PDT #
Posted by Richard Steven Hack on May 06, 2007 at 02:40 PM PDT #
Let's try again....
Here's the bottom line for file systems of any kind:
Can you get the data back out when they fail? And when their "corruption repair" mechanisms fail -- as they inevitably will at some point?
If you can't guarantee that, don't introduce it.
I don't like Windows "dynamic disks" and I don't like LVM for precisely the reason that there are few tools to allow data recovery when they fail.
As for ZFS, I don't know it well enough to know if it's better or worse, or whether there are recovery tools. I just know I won't use it if there aren't recovery tools available for it.
From any sized business standpoint, that's a no brainer.
Geeks need to remember that "reliability - and recoverability - is Job One" for business use of a computer system. "Cool" is NOT "Job One."
And to those who say this whole issue is a non-issue if you use proper backups, I say: "You're right -- and what happens when the backups fail, as they inevitably will at some point, just like the file system?" Especially if your file system fails, it's quite possible your backups will fail too -- silently -- without you knowing it until the file system failure requires restoring a backup.
Posted by Richard Steven Hack on May 06, 2007 at 02:43 PM PDT #
Regarding Linux and FS performance... as anyone who has tried using Linux as an NFS server should know, its NFS server performance is terrible (yes, even with the kernel NFS server module). My old SS20 has better NFS server performance than a P-III/550 running SuSE 9.3 or 10.2. The NFS server performance of ZFS is superb. "What are the numbers?" you ask? Intel P-III/550 with 512 MB memory, SuSE 10.2: NFS write performance of less than 1 MB/sec. Same hardware with Solaris 10 x86 U3 (11/06) and ZFS: about 10 MB/sec (the limit of the 100 Mbps ethernet connection). The client is an iMac G5 1.8 GHz running 10.4.current.
Regarding battery backup for software RAID, Sun had that solution years ago in the PrestoServe product, except that it was for all disk I/O, not just RAID. You had to get the stack layering correct between PrestoServe and DiskSuite or you could see data loss due to a power outage (the packages installed in the correct layers, but you could apply patches in an order that would get you in trouble).
On the other hand, I am in the camp that feels ZFS needs to mature some before I commit mission-critical data to it. The exception is some data sets we have that need the extra performance we are seeing with ZFS for lots of small files on huge filesystems.
Posted by Paul Kraus on May 07, 2007 at 05:37 AM PDT #
On a side note, Joyent might be Sun's best marketing team.
Posted by Andrew Boyko on May 07, 2007 at 07:23 AM PDT #
Our original stack supported the AT&T TLS protocol, and was 'flattened' to optimize for it. But we eventually had to support sockets, and that caused enough problems to force a complete rewrite of the stack. The thing is that the redesign ended up optimizing the way we moved data up and down the stack, and the result was better performance than before.
The issues with file systems may be completely different, but IMHO, layering (within reason) provides better long-term flexibility and maintainability, and the focus of optimization should be on passing the absolute minimum amount of information needed to get data where it needs to be.
Later . . . Jim
Posted by JJS on May 07, 2007 at 05:37 PM PDT #