Back in 2016, Ars Technica picked up this piece from my blog [1] as well as a longer piece reviewing the newly announced APFS [2] [3]. Glad it's still finding an audience!
[1]: https://arstechnica.com/gadgets/2016/06/zfs-the-other-new-ap...
[2]: https://ahl.dtrace.org/2016/06/19/apfs-part1/
[3]: https://arstechnica.com/gadgets/2016/06/a-zfs-developers-ana...
"Still another version I’ve heard calls into question the veracity of their purported friendship, and has Steve instead suggesting that Larry go f*ck himself. Normally the iconoclast, that would, if true, represent Steve’s most mainstream opinion."
LOL!!
I really hope they weren't friends; that really shatters my internal narrative (mainly because I can't actually picture either of them having actual friends).
But aren’t they very similar? Very successful, very rich, and very pathetic human beings… no?
As a desktop user, I am content with APFS. The only feature from ZFS that I would like is the corruption detection. I honestly don't know how robust the image and video formats are to bit corruption. On the one hand, potentially, "very" robust. But on the other, I would think that there are some very special bits that, if toggled, can potentially "ruin" the entire file. But I don't know.
However, I can say, every time I've tried ZFS on my iMac, it was simply a disaster.
Just trying to set it up on a single USB drive, or setting it up to mirror a pair. The net effect was that it CRUSHED the performance on my machine. It became unusable. We're talking "move the mouse, watch the pointer crawl behind" unusable. "Let's type at 300 baud" unusable. Interactive performance was shot.
After I remove it, all is right again.
> Just trying to set it up on a single USB drive
That's the fault of macOS. I also experienced 100% CPU and load off the charts, and it was kernel_task jammed up by USB. Once I used a Thunderbolt enclosure it started to be sane. This experience was the same across multiple non-Apple filesystems, as I was trying a bunch to see which one was the best at cross-OS compatibility.
Also, separately, ZFS says "don't run ZFS on USB". I didn't have problems with it, but I knew I was rolling the dice.
Yeah, they do say that, but anecdotally my Plex server has been ZFS over USB 3 since 2020 with zero problems (using Ubuntu 20.04).
Anyway, only bringing it up to reinforce that it is probably a macOS problem.
> I honestly don't know how robust the image and video formats are to bit corruption.
It depends on the format. A BMP image format would limit the damage to 1 pixel, while a JPEG could propagate the damage to potentially the entire image.
There is an example of a bitflip damaging a picture here:
https://arstechnica.com/information-technology/2014/01/bitro...
That single bit flip ruined about half of the image.
As for video, that depends on how far apart I frames are. Any damage from a bit flip would likely be isolated to the section of video from the bitflip until the next I-frame occurs. As for how bad it could be, it depends on how the encoding works.
> On the one hand, potentially, "very" robust.
Only in uncompressed files.
> But on the other, I would think that there are some very special bits that if toggled can potentially "ruin" the entire file. But I don't know.
The way that image compression works means that a single bit flip prior to decompression can affect a great many pixels, as shown at Ars Technica.
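If anyone wants to see this for themselves, here is a rough sketch that flips a single bit in a copy of an image so you can compare how the decoders cope. The filenames and the offset are purely illustrative; pick any offset inside the compressed data.

    # Flip one bit in a copy of an image and open both files side by side.
    src=photo.jpg; dst=flipped.jpg; offset=100000
    cp "$src" "$dst"
    # Read the byte at $offset as a decimal value and flip its lowest bit...
    orig=$(dd if="$dst" bs=1 skip="$offset" count=1 2>/dev/null | od -An -tu1 | tr -d ' ')
    new=$(( orig ^ 1 ))
    # ...then write it back in place without truncating the file.
    printf "$(printf '\\%03o' "$new")" | dd of="$dst" bs=1 seek="$offset" count=1 conv=notrunc 2>/dev/null
    # An uncompressed BMP will show at most one wrong pixel; a JPEG can smear
    # the damage across everything decoded after the affected block.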
> However, I can say, every time I've tried ZFS on my iMac, it was simply a disaster.
Did you file an issue? I am not sure what the current status of the macOS driver’s production readiness is, but it will be difficult to see it improve if people do not report issues that they have.
A thing that can destroy most files in terms of readability is a bit flip in the header section. These could theoretically be corrected with clever guesswork and a hex editor, but in practise it is going to be hard to know where the bit flip occurred when a file just can't be read anymore.
> The only feature from ZFS that I would like, is the corruption detection.
I run ZFS on my main server at home (Proxmox, a Debian-based Linux hypervisor, ships with ZFS) but...
No matter the FS, for "big" files that aren't supposed to change, I append a (partial) cryptographic checksum to the filename. For example:
20240238-familyTripBari.mp4 becomes 20240238-familyTripBari-b3-8d77e2419a36.mp4, where "-b3-" indicates the type of cryptographic hash ("b3" for Blake3 in my case because it's very fast) and 8d77e2419a36 is the first x hex digits of the cryptographic hash.
I play the video file (or whatever file it is) after I've added the checksum, so I know it's good.
I do that for movies, pictures, rips of my audio CDs (although these ones are matched with a "perfect rips" online database too), etc. Basically with everything that isn't supposed to change and that I want to keep.
I then have a shell script (which I run on several machines) that random-samples the files carrying such a checksum in their filename, at a percentage I pick, and verifies that each still matches. I don't verify 100% of the files all the time. Typically I'll verify, say, 3% of my files, randomly, daily.
Does it help? Well, sure, yup. For whatever reason one file was corrupt on one of my systems: it's not too clear why, for the file had the correct size but somehow a bit had flipped. During some sync, probably. And my script caught it.
The nice thing is I can copy such files onto actual backups: DVDs or BluRays or cloud or whatever. The checksum is part of the filename, so I know whether my file changed no matter the OS / backup medium / cloud or local storage / etc.
If you have "bit flip anxiety", it helps ; )
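For anyone curious, a minimal sketch of that scheme, assuming the b3sum tool is installed; the 12-digit prefix and 3% sample rate follow the description above, and the filenames are made up:

    # Tag a file with the first 12 hex digits of its BLAKE3 hash.
    tag() {
        sum=$(b3sum "$1" | cut -c1-12)
        mv "$1" "${1%.*}-b3-${sum}.${1##*.}"
    }
    # e.g.: tag 20240238-familyTripBari.mp4

    # Verify a random ~3% sample of already-tagged files.
    find . -type f -name '*-b3-*' | awk 'BEGIN { srand() } rand() < 0.03' |
    while read -r f; do
        want=$(printf '%s\n' "$f" | sed 's/.*-b3-\([0-9a-f]*\)\..*/\1/')
        have=$(b3sum "$f" | cut -c1-12)
        [ "$want" = "$have" ] || echo "CHECKSUM MISMATCH: $f"
    done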
The checksum doesn’t help you fix the flipped bit, nor does it tell you which bit flipped. You would have to re-create the file from a complete backup instead of using the efficiency of parity disks. Basically RAID 1 vs RAID 5.
If OP is backing up locally onto a ZFS server like they said they were, and then propagating this data to a cloud provider like Backblaze which uses ext4, this sort of approach makes sense.
This approach is also good when you have multiple sources to restore from. It makes it easier to determine what is the new "source of truth."
There's something to be said for backing up onto a different FS too. You don't want to be stung by an FS bug, and if you are, then it's good to know about it.
The death of ZFS in macOS was a huge shift in the industry. It has to be seen alongside Microsoft killing its hugely ambitious WinFS; in combination, the two felt like the death of desktop innovation.
Both are imho linked to "offline desktop use cases are not important anymore". Both companies saw their future gains elsewhere, in internet-related functions and what became known as "cloud". No need for a fancy, featureful and expensive filesystem when it is only to be used as a cache for remote cloud stuff.
Internet connections of the day didn't yet offer enough speed for cloud storage.
Apple was already working to integrate ZFS when Oracle bought Sun.
From TFA:
> ZFS was featured in the keynotes, it was on the developer disc handed out to attendees, and it was even mentioned on the Mac OS X Server website. Apple had been working on its port since 2006 and now it was functional enough to be put on full display.
However, once Oracle bought Sun, the deal was off.
Again from TFA:
> The Apple-ZFS deal was brought for Larry Ellison's approval, the first-born child of the conquered land brought to be blessed by the new king. "I'll tell you about doing business with my best friend Steve Jobs," he apparently said, "I don't do business with my best friend Steve Jobs."
And that was the end.
Was it not open source at that point?
It was! And Apple seemed fine with including DTrace under the CDDL. I’m not sure why Apple wanted some additional arrangement but they did.
The NetApp lawsuit. Apple wanted indemnification, and Sun/Oracle did not want to indemnify Apple.
At the time that NetApp filed its lawsuit I blogged about how ZFS was a straightforward evolution of BSD 4.4's log structured filesystem. I didn't know that to be the case historically, that is, I didn't know if Bonwick was inspired by LFS, but I showed how almost in every way ZFS was a simple evolution of LFS. I showed my blog to Jeff to see how he felt about it, and he didn't say much but he did acknowledge it. The point of that blog was to show that there was prior art and that NetApp's lawsuit was worthless. I pointed it out to Sun's general counsel, too.
While I was disappointed that NetApp sued, the ZFS team literally referenced NetApp and WAFL multiple times in their presentations IIRC. They were kind of begging to be sued.
Also, according to NetApp, "Sun started it".
https://www.networkcomputing.com/data-center-networking/neta...
No, the ZFS team did not "literally reference NetApp and WAFL" in their presentations and no, Sun did not "start it" -- NetApp initiated the litigation (though Sun absolutely countersued), and NetApp were well on their way to losing not only their case but also their WAFL patents when Oracle acquired Sun. Despite having inherited a winning case, Oracle chose to allow the suit to be dismissed[0]; terms of the settlement were undisclosed.
[0] https://www.theregister.com/2010/09/09/oracle_netapp_zfs_dis...
Seems specious. Patents don't preclude one from overtly trying to compete; they protect specific mechanisms. In this case either ZFS didn't use the same mechanisms or the mechanisms themselves were found to have prior art.
Ok? So what?
The argument advanced in the piece isn't without merit -- that ripping out DTrace, if subsequent legal developments demanded it, would be a heck of a lot easier than removing a filesystem that would by then contain massive amounts of customer data.
And given what a litigious jackass Larry Ellison / Oracle is, I can't fault Apple for being nervous.
There are always two sides to every story. And Jobs was not the kind of person to take no for an answer. I've known other narcissists and they've always seen friends as more of a resource than someone to care about.
I think the truth is somewhere in the middle.
Another rumour was that Schwartz spilling the beans pissed Jobs off, which I wouldn't really put past him. Though I don't think it would have been enough to kill this.
I think all these little things added up and the end result was just "better not then".
The business case for providing a robust desktop filesystem simply doesn’t exist anymore.
20 years ago, (regular) people stored their data on computers and those needed to be dependable. Phones existed, but not to the extent they do today.
Fast forward 20 years, and many people don’t even own a computer (in the traditional sense, many have consoles). People now have their entire life on their phones, backed up and/or stored in the cloud.
SSDs also became “large enough” that HDDs are mostly a thing of the past in consumer computers.
Instead you today have high-reliability hardware and software in the cloud, which arguably is much more resilient than anything you could reasonably cook up at home. Besides the hardware (power, internet, fire suppression, physical security, etc.), you're also typically looking at geographical redundancy across multiple data centers using Reed-Solomon erasure coding, but that's nothing the ordinary user needs to know about.
Most cloud services also offer some kind of snapshot functionality as malware protection (e.g. OneDrive offers unlimited snapshots for a rolling 30 days).
Truth is that most people are way better off just storing their data in the cloud and making a backup at home, though many people seem to ignore the latter, and Apple makes it exceptionally hard to automate.
What do you do when you discover that something you have not touched in a long time, but suddenly need, is corrupted, and all of your backups are corrupt because the corruption happened prior to your 30-day window at OneDrive?
You would have early warning with ZFS. You have data loss with your plan.
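Concretely, that early warning is just a periodic scrub plus a glance at the status output (the pool name here is illustrative):

    # Re-read every block in the pool and verify it against its checksum.
    zpool scrub tank
    # Any files with unrecoverable errors are listed by path under "errors:".
    zpool status -v tank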
I remember eagerly anticipating ZFS for desktop hard disks. I seem to remember it never took off because memory requirements were too high and payoffs were insufficient to justify the trade off.
Linux or FreeBSD developers are free to adopt ZFS as their primary file systems. But it appears that practical benefits are not really evident to most users.
Lots of ZFS users are enthusiasts who heard about that one magic thing that does it all in one tidy box. Whereas usually you would have to know all the minutiae of LVM/mdadm/cryptsetup/nbd and mkfs.whatever to get to the same point. So while ZFS is the nicer-dicer of volume management and filesystems, the latter is your whole chef's knife set. And while you can dice with both, the user groups are not the same. And enthusiasts with the right usecases are very few.
And for the thin-provisioned snapshotted subvolume usecase, btrfs is currently eating ZFS's lunch due to far better Linux integration. Think snapshots at every update, and having a/b boot to get back to a known-working config after an update. So widespread adoption through the distro route is out of the question.
Ubuntu's ZFS-on-root with zsys auto snapshots have been working excellently on my server for 5 years. It automatically takes snapshots on every update and adds entries to grub so rolling back to the last good state is just a reboot away.
> And for the thin-provisioned snapshotted subvolume usecase, btrfs is currently eating ZFS's lunch due to far better Linux integration.
Is this a technical argument? Or is this just more licensing nonsense?
> Think snapshots at every update, and having a/b boot to get back to a known-working config after an update.
I use ZFS and I have snapshots on every update? I have snapshots hourly, daily, weekly and monthly. I have triggered snapshots too, and ad hoc dynamic snapshots too. I wrote about it here: https://kimono-koans.github.io/opinionated-guide/
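For anyone who wants the non-fancy version of that, plain cron plus zfs snapshot/rollback covers the basics (dataset names are illustrative; tools like sanoid or zfs-auto-snapshot add pruning on top):

    # /etc/cron.d/zfs-snapshots -- hourly and daily snapshots of one dataset
    0 * * * *  root  zfs snapshot rpool/home@hourly-$(date +\%Y\%m\%d-\%H\%M)
    0 3 * * *  root  zfs snapshot rpool/home@daily-$(date +\%Y\%m\%d)

    # Rolling back to a known-good snapshot is a single command
    # (add -r to discard any snapshots taken after it):
    #   zfs rollback -r rpool/home@daily-20240501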
Linux’s sign-off policy makes that impossible. Linus Torvalds would need Larry Ellison’s sign-off before even considering it. Linus told me this by email around 2013 (if I recall correctly) when I emailed him to discuss user requests for upstream inclusion. He had no concerns about the license being different at the time.
That's called marketing. Give it a snazzy name, like say "TimeMachine" and users will jump on it.
Also, ZFS has a bad name within the Linux community due to some licensing stuff. I find that most BSD users don't really care about such legalese and most people I know that run FreeBSD are running ZFS on root. Which works amazingly well I might add.
Especially with something like sanoid added to it, it basically does the same as timemachine on mac, a feature that users love. Albeit stored on the same drive (but with syncoid or just manually rolled zfs send/recv scripts you can do that on another location too).
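The hand-rolled zfs send/recv variant is small enough to show here, for reference (hostnames and dataset names are made up):

    # First time: full copy of a snapshot to another machine.
    zfs snapshot tank/media@2024-05-01
    zfs send tank/media@2024-05-01 | ssh backuphost zfs recv -u backup/media

    # After that: send only the delta between the previous and the new snapshot.
    zfs snapshot tank/media@2024-06-01
    zfs send -i @2024-05-01 tank/media@2024-06-01 | ssh backuphost zfs recv backup/media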
> I find that most BSD users don't really care about such legalese and most people I know that run FreeBSD are running ZFS on root.
I don't think it's that they don't care, it's that the CDDL and BSD-ish licenses are generally believed to just not have the conflict that CDDL and GPL might. (IANAL, make your own conclusions about whether either of those are true)
Hmm yeah but that's the thing, who really cares about licenses as a user? I certainly don't. It's just some stuff that some lawyers fuss over. I don't read EULAs either nor would I even consider obeying them. The whole civil-legal world is just something I ignore.
I do have a feeling that Linux users in general care more about the GPL which is quite specific of course. Though I wonder if anyone chooses Linux for that reason.
But really personally I don't care whether companies give anything back, if anything I would love less corporate involvement in the OS I use. It was one of my main reasons for picking BSD. The others were a less fragmented ecosystem and less push to change things constantly.
True, and I understand the caution considering Oracle is involved, which is an awful company to deal with (and their takeover of Sun was a disaster).
But really, this is a concern for distros. Not for end users. Yet many of the Linux users I speak to are somehow worried about this. Most can't even describe the provisions of the GPL so I don't really know what that's about. Just something they picked up, I guess.
Licensing concerns that prevent distros from using ZFS will sooner or later also have adverse effects on end users. Actually those effects are already there: The constant need to adapt a large patchset to the current kernel, meaning updates are a hassle. The lack of packaging in distributions, meaning updates are a hassle. And the lack of integration and related tooling, meaning many features can not be used (like a/b boots from snapshots after updates) easily, and installers won't know about ZFS so you have to install manually.
None of this is a worry about being sued as an end user. But all of those are worries that your life will be harder with ZFS, and a lot harder as soon as the first lawsuits hit anyone, because all the current (small) efforts to keep it working will cease immediately.
That is due to licensing reasons, yes. It makes maintaining the codebase even more complicated because when the kernel module API changes (which it very frequently does) you cannot just adapt it to your needs, you have to work around all the new changes that are there in the new version.
You have things backward. Licensing has nothing to do with it. Changes to the kernel are unnecessary. Maintaining the code base is also simplified by supporting the various kernel versions the way that they are currently supported.
Windows has a really good basis for it though, in volume shadow copy. I also don't understand why Microsoft never built a time machine based on that. Well, they kinda did but only on samba shares. But not locally.
But these days they want you to subscribe to their cloud storage so the versioning is done there, which makes sense in their commercial point of view.
I think snapshots on ZFS are better than Time Machine though. Time Machine is a bit of a clunky mess of hard links that can really go to shit on a minor corruption, leaving you with an unrestorable backup and just some vague error messages.
I worked a lot with macs and I've had my share of bad backups when trying to fix people's problems. I've not seen ZFS fail like that. It's really solid and tends to indicate issues before they lead to bigger problems.
Shadow Protect by StorageCraft was brilliant. Pretty sure MS actually licensed Shadow Copy from them, but I could be mistaken. It's been a while since I played in that space.
I can't readily tell how much of the dumbness is from the filesystem and how much from the kernel, but the end result is that until it gets away from the 1980s version of file locking there's no prayer. Imagine having to explain to your boss that your .docx wasn't backed up because you left Word open over the weekend. A just catastrophically idiotic design.
Ah, but this is really not true. Volume Shadow Copy makes snapshots of files, and through that it can make a backup of an entire NTFS volume, including files with a lock on them, fully quiesced. It was invented for that exact purpose. Backup software on Windows leverages this functionality well. It took much longer for Linux to have something similar.
I have many criticisms of NTFS like it being really bad at handling large volumes of small files. But this is something it can do well.
The lock prevents other people from copying the file or opening it even in read only, yes. But backup software can back it up just fine.
I just mean that GPL is a bit of a religion. There are very strong opinions and principles behind it. Whereas the BSD license is more like "do whatever you want". It makes sense that the followers of the former care more deeply about it, right?
Personally I don't care about or obey any software licenses, as a user.
But this is kinda the vibe I get from other BSD users if a license discussion comes up. Maybe it's my bubble, that's possible.
There is an increase in posts with casual confidence in their own absolute correctness. I originally attributed it to the influence of LLMs, but it is becoming so common now it is hard to dismiss.
ZFS on FreeBSD is quite nice. System tools like freebsd-update integrate well. UFS continues to work as well, and may be more appropriate for some use cases where ZFS isn't a good fit, copy on write is sometimes very expensive.
Afaik, the FreeBSD position is that both ZFS and UFS are fully supported and neither is secondary to the other; the installer asks what you want from ZFS, UFS, Manual (with a menu-based tool), or Shell and you do whatever; in that order, so maybe a slight preference towards ZFS.
Besides the licensing issue, I wonder if optimizing ZFS for low latency + low RAM + low power on iPhone would have been an uphill battle or easy. My experience running ZFS years ago was poor latency and large RAM use with my NAS, but that hardware and drive configuration was optimized for low $ per GB stored and used parity stuff.
While its deduplication feature clearly demands more memory, my understanding is that the ZFS ARC is treated by the kernel as a driver with a massive, persistent memory allocation that cannot be swapped out ("wired" pages). Unlike the regular file system cache, ARC's eviction is not directly managed by the kernel. Instead, ZFS itself is responsible for deciding when and how to shrink the ARC.
This can lead to problems under sudden memory pressure. Because the ARC does not immediately release memory when the system needs it, userland pages might get swapped out instead. This behavior is more noticeable on personal computers, where memory usage patterns are highly dynamic (applications are constantly being started, used, and closed). On servers, where workloads are more static and predictable, the impact is usually less severe.
I do wonder if this is also the case on Solaris or illumos, where there is no intermediate SPL between ZFS and the kernel. If so, I don't think that a hypothetical native integration of ZFS on macOS (or even Linux) would adopt the ARC in its current form.
The ZFS driver will release memory if the kernel requests it. The only integration level issue is that the free command does not show ARC as a buffer/cache, so it misrepresents reality, but as far as I know, this is an issue with caches used by various filesystems (e.g. extent caches). It is only obvious in the case of ZFS because the ARC can be so large. That is a feature, not a bug, since unused memory is wasted memory.
I assume that the VM2 project achieved something similar to the ABD changes that were done in OpenZFS. ABD replaced the use of SLAB buffers for ARC with lists of pages. The issue with SLAB buffers is that absurd amounts of work could be done to free memory, and a single long lived SLAB object would prevent any of it from mattering. Long lived slab objects caused excessive reclaim, slowed down the process of freeing enough memory to satisfy system needs and in some cases, prevented enough memory from being freed to satisfy system needs entirely. Switching to linked lists of pages fixed that since the memory being freed from ARC upon request would immediately become free rather than be deferred to when all of the objects in the SLAB had been freed.
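For anyone on Linux who wants to see what the ARC is actually holding (since free(1) does not count it as cache), OpenZFS exposes it under /proc:

    # Current ARC size in GiB -- not included in free(1)'s buff/cache column.
    awk '/^size / { printf "ARC size: %.1f GiB\n", $3 / (1024*1024*1024) }' /proc/spl/kstat/zfs/arcstats
    # Or, for a full breakdown, if the arc_summary tool is installed:
    arc_summary | head -n 40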
This seems like an early application of the Tim Cook doctrine: Why would Apple want to surrender control of this key bit of technology for their platforms?
The rollout of APFS a decade later validated this concern. There’s just no way that flawless transition happens so rapidly without a filesystem fit to order for Apple’s needs from Day 0.
(Edit: My comment is simply about the logistics and work involved in a very well executed filesystem migration. Not about whether ZFS is good for embedded or memory constrained devices.)
What you describe hits my ear as more NIH syndrome than technical reality.
Apple’s transition to APFS was managed like you’d manage any kind of mass scale filesystem migration. I can’t imagine they’d have done anything differently if they’d have adopted ZFS.
Which isn’t to say they wouldn’t have modified ZFS.
But with proper driver support and testing it wouldn’t have made much difference whether they wrote their own file system or adopted an existing one. They have done a fantastic job of compartmentalizing and rationalizing their OS and user data partitions and structures. It’s not like every iPhone model has a production run that has different filesystem needs that they’d have to sort out.
There was an interesting talk given at WWDC a few years ago on this. The roll out of APFS came after they’d already tested the filesystem conversion for randomized groups of devices and then eventually every single device that upgraded to one of the point releases prior to iOS 10.3. The way they did this was to basically run the conversion in memory as a logic test against real data. At the end they’d have the super block for the new APFS volume, and on a successful exit they simply discarded it instead of writing it to persistent storage. If it errored it would send a trace back to Apple.
Huge amounts of testing and consistency in OS and user data partitioning and directory structures is a huge part of why that migration worked so flawlessly.
To be clear, BTRFS also supports in-place upgrade. It's not a uniquely Apple feature; any copy-on-write filesystem with flexibility as to where data is located can be made to fit inside of the free blocks of another filesystem. Once you can do that, then you can do test runs[0] of the filesystem upgrade before committing to wiping the superblock.
I don't know for certain if they could have done it with ZFS; but I can imagine it would at least been doable with some Apple extensions that would only have to exist during test / upgrade time.
[0] Part of why the APFS upgrade was so flawless was that Apple had done a test upgrade in a prior iOS update. They'd run the updater, log any errors, and then revert the upgrade and ship the error log back to Apple for analysis.
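For comparison, the btrfs in-place conversion with its escape hatch looks like this (device name is illustrative, and you'd still want a backup first):

    # Convert an unmounted ext4 filesystem to btrfs in place; the original
    # filesystem image is preserved in a subvolume until you delete it.
    btrfs-convert /dev/sdb1
    # If anything looks wrong, roll back to the untouched ext4 filesystem:
    btrfs-convert -r /dev/sdb1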
I don't see why ZFS wouldn't have gone over equally flawlessly. None of the features that make ZFS special were in HFS(+), so conversion wouldn't be too hard. The only challenge would be maintaining the legacy compression algorithms, but ZFS is configurable enough that Apple could've added their custom compression to it quite easily.
There are probably good reasons for Apple to reinvent ZFS as APFS a decade later, but none of them technical.
I also wouldn't call the rollout of APFS flawless, per se. It's still a terrible fit for (external) hard drives and their own products don't auto convert to APFS in some cases. There was also plenty of breakage when case-sensitivity flipped on people and software, but as far as I can tell Apple just never bothered to address that.
Using ZFS isn't surrendering control. Same as using parts of FreeBSD. Apple retains control because they don't have an obligation (or track record) of following the upstream.
For zfs, there's been a lot of improvements over the years, but if they had done the fork and adapt and then leave it alone, their fork would continue to work without outside control. They could pull in things from outside if they want, when they want; some parts easier than others.
If it were an issue it would hardly be an insurmountable one. I just can't imagine a scenario where Apple engineers go “Yep, we've eked out all of the performance we possibly can from this phone, the only thing left to do is change out the filesystem.”
Does it matter if it’s insurmountable? At some point, the benefits of a new FS outweigh the drawbacks. This happens earlier than you might think, because of weird factors like “this lets us retain top filesystem experts on staff”.
It’s worth remembering that the filesystem they were looking to replace was HFS+. It was introduced in the 90s as a modernization of HFS, itself introduced in the 80s.
Now, old does not necessarily mean bad, but in this case….
If I recall correctly, ZFS error recovery was still “restore from backup” at the time, and iCloud acceptance was more limited. (ZFS basically gave up if an error was encountered after the checksum showed that the data was read correctly from storage media.) That's fine for deployments where the individual system does not matter (or you have dedicated staff to recover systems if necessary), but phones aren't like that. At least not from the user perspective.
ZFS has ditto blocks that allows it to self heal in the case of corrupt metadata as long as a good copy remains (and there would be at least 2 copies by default). ZFS only ever needs you to restore from backup if the damage is so severe that there is no making sense of things.
Minor things like the indirect blocks being missing for a regular file only affect that file. Major things like all 3 copies of the MOS (the equivalent to a superblock) being gone for all uberblock entries would require recovery from backup.
If all copies of any other filesystem’s superblock were gone too, that filesystem would be equally irrecoverable and would require restoring from backup.
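Worth adding that the ditto-block machinery is user-visible too: you can ask ZFS to keep extra copies of data blocks, not just metadata, even on a single-disk pool (dataset name is illustrative):

    # Store two copies of every data block in this dataset so a bad sector
    # can be healed from the surviving copy, even without a mirror.
    zfs set copies=2 tank/important
    zfs get copies tank/important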
As far as I understand it, ditto blocks were only used if the corruption was detected due to a checksum mismatch. If the checksum was correct, but the metadata turned out to be unusable later (say because it was corrupted in memory, and the checksum was computed after the corruption happened), that was treated as a fatal error.
Kind of odd that the blog states that "The architect for ZFS at Apple had left" and links to the LinkedIn profile of someone who doesn't have any Apple work experience listed on their resume. I assume the author linked to the wrong profile?
Also can confirm Don is one of the kindest, nicest principal engineer level people I’ve worked with in my career. Always had time to mentor and assist.
Not sure how I fat-fingered Don's LinkedIn, but I'm updating that 9-year-old typo. Agreed that Don is a delight. In the years after this article I got to collaborate more with him, but left Delphix before he joined to work on ZFS.
Thanks for sharing. I was just looking for what happened to Sun. I like the second-hand quote comparing IBM and HP to "garbage trucks colliding", plus the inclusion of blog posts with links to the court filings.
Is it fair to say ZFS made the most sense on Solaris using Solaris Containers on SPARC?
ZFS was developed in Solaris, and at the time we were mostly selling SPARC systems. That changed rapidly and the biggest commercial push was in the form of the ZFS Storage Appliance that our team (known as Fishworks) built at Sun. Those systems were based on AMD servers that Sun was making at the time such as Thumper [1]. Also in 2016, Ubuntu leaned in to use of ZFS for containers [2]. There was nothing that specific about Solaris that made sense for ZFS, and even less of a connection to the SPARC architecture.
> There was nothing that specific about Solaris that made sense for ZFS, and even less of a connection to the SPARC architecture.
Although it does not change the answer to the original question, I have long been under the impression that part of the design of ZFS had been influenced by the Niagara processor. The heavily threaded ZIO pipeline had been so forward thinking that it is difficult to imagine anyone devising it unless they were thinking of the future that the Niagara processor represented.
Am I correct to think that or did knowledge of the upcoming Niagara processor not shape design decisions at all?
By the way, why did Thumper use an AMD Opteron over the UltraSPARC T1 (Niagara)? That decision seems contrary to idea of putting all of the wood behind one arrow.
Niagara did not shape design decisions at all -- remember that Niagara was really only doing on a single socket what we had already done on large SMP machines (e.g., Starfire/Starcat). What did shape design decisions -- or at least informed thinking -- was a belief that all main memory would be non-volatile within the lifespan of ZFS. (Still possible, of course!) I don't know that there are any true artifacts of that within ZFS, but I would say that it affected thinking much more than Niagara.
As for Thumper using Opteron over Niagara: that was due to many reasons, both technological (Niagara was interesting but not world-beating) and organizational (Thumper was a result of the acquisition of Kealia, which was independently developing on AMD).
I don’t recall that being the case. Bonwick had been thinking about ZFS for at least a couple of years. Matt Ahrens joined Sun (with me) in 2001. The Afara acquisition didn’t close until 2002. Niagara certainly was tantalizing but it wasn’t a primary design consideration. As I recall, AMD was head and shoulders above everything else in terms of IO capacity. Sun was never very good (during my tenure there) at coordination or holistic strategy.
Yeah, I think if it hadn’t been for the combination of Oracle and the CDDL, Red Hat would have been more interested in it for Linux. As it was, they basically went with XFS and volume management. Fedora did eventually go with btrfs, but I don't know if there are any plans for a copy-on-write FS for RHEL at any point.
It’s not like Red Hat had/has no influence over what makes it into mainline. But the options for copy on write were either relatively immature or had license issues in their view.
Their view is that if it is out of tree, they will not support it. This supersedes any discussion of license. Even out of tree GPL drivers are not supported by RedHat.
We had those things at work as fileservers, so no containers or anything fancy.
Sun salespeople tried to sell us the idea of "zfs filesystems are very cheap, you can create many of them, you don't need quota" (which ZFS didn't have at the time), which we tried out. It was abysmally slow. It was even slow with just one filesystem on it. We scrapped the whole idea, just put Linux on them and suddenly fileserver performance doubled. Which is something we weren't used to with older Solaris/Sparc/UFS or /VXFS systems.
We never tried another generation of those, and soon after Sun was bought by Oracle anyways.
I had a combination uh-oh/wow! moment back in those days when the hacked-up NFS server I built on a Dell with Linux and XFS absolutely torched the Solaris and UFS system we'd been using for development. Yeah, it wasn't apples to apples. Yes, maybe ZFS would have helped. But XFS was proven at SGI, and it was obvious that the business would save thousands overnight by moving to Linux on Dell instead of sticking with Sun E450s. That was the death knell for my time as a Solaris sysadmin, to be honest.
ZFS remains an excellent filesystem for bulk storage on rust, but were I Apple at the time, I would probably want to focus on something built for the coming era of flash and NVMe storage. There are a number of axioms built into ZFS that come out of the spinning disk era that still hold it back for flash-only filesystems.
Certainly one would build something different starting in 2025 rather than 2001, but do you have specific examples of how ZFS’s design holds it back? I think it has been adapted extremely well for the changing ecosystem.
I wonder what ZFS in the iPhone would've looked like. As far as I recall, the iPhone didn't have error correcting memory, and ZFS is notorious for corrupting itself when bit flips hit it and break the checksum on disk. ZFS' RAM-hungry nature would've also forced Apple to add more memory to their phone.
A very long ago someone named cyberjock was a prolific and opinionated proponent of ZFS, who wrote many things about ZFS during a time when the hobbyist community was tiny and not very familiar with how to use it and how it worked. Unfortunately, some of their most misguided and/or outdated thoughts still haunt modern consciousness like an egregore.
What you are probably thinking of is the proposed doomsday scenario where bad ram could theoretically kill a ZFS pool during a scrub.
I have never once heard of this happening in real life.
Hell, I’ve never even had bad ram. I have had bad sata/sas cables, and a bad disk though. ZFS faithfully informed me there was a problem, which no other file system would have done. I’ve seen other people that start getting corruption when sata/sas controllers go bad or overheat, which again is detected by ZFS.
What actually destroys pools is user error, followed very distantly by plain old fashioned ZFS bugs that someone with an unlucky edge case ran into.
You can take that as meaning “I’ve never had a noticed issue that was detected by extensive ram testing, or solved by replacing ram”.
I got into overclocking both regular and ECC DDR4 RAM for a while when AMD's 1st-gen Ryzen stuff came out, thanks to ASRock's X399 motherboard unofficially supporting ECC, allowing both its function and the reporting of errors (produced when overclocking).
Based on my own testing and issues seen from others, regular memory has quite a bit of leeway before it becomes unstable, and memory that’s generating errors tends to constantly crash the system, or do so under certain workloads.
Of course, without ECC you can't prove every single operation has been fault-free, but at some point you call it close enough.
I am of the opinion that ECC memory is the best memory to overclock, precisely because you can prove stability simply by using the system.
All that said, as things become smaller with tighter specifications to squeeze out faster performance, I do grow more leery of intermittent single errors that occur on the order of weeks or months in newer generations of hardware. I was once able to overclock my memory to the edge of what I thought was stability, as it passed all tests for days, but about every month or two a few corrected errors would show up in my logs. Typically, any sort of instability is caught by manual tests within minutes or the hour.
My friends and I spent a lot of our middle and high school days building computers from whatever parts we could find, and went through a lot of sourcing components everywhere from salvaged throwaways to local computer shops, when those were a thing. We hit our fair share of bad RAM, and by that I mean a handful of sticks at best.
To me, the most implausible thing about ZFS-without-ECC doomsaying is the presumption that the failure mode of RAM is a persistently stuck bit. That's way less common than transient errors, and way more likely to be noticed, since it will destabilize any piece of software that uses that address range. And now that all modern high-density DRAM includes on-die ECC, transient data corruption on the link between DRAM and CPU seems overwhelmingly more likely than a stuck bit.
> ZFS is notorious for corrupting itself when bit flips hit it and break the checksum on disk
ZFS does not need or benefit from ECC memory any more than any other FS. The bitflip corrupted the data, regardless of ZFS. Any other FS is just oblivious, ZFS will at least tell you your data is corrupt but happily keep operating.
> ZFS' RAM-hungry nature
ZFS is not really RAM-hungry, unless one uses deduplication (which is not enabled by default, nor generally recommended). It can often seem RAM hungry on Linux because the ARC is not counted as “cache” like the page cache is.
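And if the default ARC sizing still bothers anyone, it can simply be capped on Linux/OpenZFS (the 4 GiB value is just an example):

    # Limit the ARC to 4 GiB for the running system...
    echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
    # ...and make the limit persistent across reboots.
    echo 'options zfs zfs_arc_max=4294967296' >> /etc/modprobe.d/zfs.conf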
It's very amusing that this kind of legend has persisted! ZFS is notorious for *noticing* when bits flip, something APFS designers claimed was rare given the robustness of Apple hardware.[1][2] What would ZFS on iPhone have looked like? Hard to know, and that certainly wasn't the design center.
Neither here nor there, but DTrace was ported to iPhone--it was shown to me in hushed tones in the back of an auditorium once...
I did early ZFSOnLinux development on hardware that did not have ECC memory. I once had a situation where a bit flip happened in the ARC buffer for libpython.so and all Python software started crashing. Initially, I thought I had hit some sort of bizarre bug in ZFS, so I started debugging. At that time, opening a ZFS snapshot would fetch a duplicate from disk into a redundant ARC buffer, so while debugging, I ran cmp on libpython.so between the live copy and a snapshot copy. It showed the exact bit that had flipped. After seeing that and convincing myself the bitflip was not actually on stable storage, I did a reboot, and all was well. Soon afterward, I got a new development machine that had ECC so that I would not waste my time chasing phantom bugs caused by bit flips.
> ZFS is notorious for corrupting itself when bit flips hit it and break the checksum on disk
I don't think it is. I've never heard of that happening, or seen any evidence ZFS is more likely to break than any random filesystem. I've only seen people spreading paranoid rumors based on a couple pages saying ECC memory is important to fully get the benefits of ZFS.
So you've never seen the people saying you should steer clear of ZFS unless you're going to have an enormous ARC even when talking about personal media servers?
People, especially those on the Internet, say a lot of things.
Some of the things they say aren't credible, even if they're said often.
You don't need an enormous amount of ram to run zfs unless you have dedupe enabled. A lot of people thought they wanted dedupe enabled though. (2024's fast dedupe may help, but probably the right answer for most people is not to use dedupe)
It's the same thing with the "need" for ECC. If your ram is bad, you're going to end up with bad data in your filesystem. With ZFS, you're likely to find out your filesystem is corrupt (although, if the data is corrupted before the checksum is calculated, then the checksum doesn't help); with a non-checksumming filesystem, you may get lucky and not have meta data get corrupted and the OS keeps going, just some of your files are wrong. Having ECC would be better, but there's tradeoffs so it never made sense for me to use it at home; zfs still works and is protecting me from disk contents changing, even if what was written could be wrong.
I have seen people say such things, and none of it was based on reality. They just misinterpreted the performance cliff that data deduplication had to mean you must have absurd amounts of memory even though data deduplication is off by default. I suspect few of the people peddling this nonsense even used ZFS and the few who did, had not looked very deeply into it.
> ZFS is notorious for corrupting itself when bit flips hit it and break the checksum on disk.
If you have no mirrors and no raidz and no ditto blocks then errors cause problems, yes. Early on they would cause panics.
But this isn't ZFS "corrupting itself", rather, it's ZFS saving itself and you from corruption, and the price you pay for that is that you need to add redundancy (mirrors, raidz, or ditto blocks). It's not a bad deal. Some prefer not to know.
Sometimes data on disk and in memory are randomly corrupted. For a pretty amazing example, check out "bitsquatting"[1]--it's like domain name squatting, but instead of typos, you squat on domains that would be looked up in the case of random bit flips. These can occur due e.g. to cosmic rays. On disk, HDDs and SSDs can produce the wrong data. It's uncommon to see actual invalid data rather than have an IO fail on ECC, but it certainly can happen (e.g. due to firmware bugs).
Basically it's that memory changes out from under you. As we know, computers use Binary, so everything boils down to it being a 0 or a 1. A bit flip is changing what was say a 0 into a 1.
Usually attributed to "cosmic rays", but really can happen for any number of less exciting sounding reasons.
Basically, there is zero double checking in your computer for almost everything except stuff that goes across the network. Memory and disks are not checked for correctness, basically ever on any machine anywhere. Many servers (but certainly not all) are the rare exception when it comes to memory safety. They usually have ECC (Error Correction Code) memory, basically a checksum on the memory to ensure that if memory is corrupted, it's noticed and fixed.
Essentially every filesystem everywhere does zero data integrity checking:
MacOS APFS: Nope
Windows NTFS: Nope
Linux EXT4: Nope
BSD's UFS: Nope
Your mobile phone: Nope
ZFS is the rare exception for file systems that actually double check the data you save to it is the data you get back from it. Every other filesystem is just a big ball of unknown data. You probably get back what you put it, but there is zero promises or guarantees.
But SSDs (to my knowledge) only implement checksums for the data transfer. It's a requirement of the protocol. So you can be sure that the stuff in memory, and the checksum computed by the CPU, arrives exactly like that at the SSD drive. In the past this was a common error source with hardware RAID that was faulty.
But there is ABSOLUTELY NO checksum for the bits stored on an SSD. So bit rot at the cells of the SSD goes undetected.
That is ABSOLUTELY incorrect. SSDs have enormous amounts of error detection and correction builtin explicitly because errors on the raw medium are so common that without it you would never be able to read correct data from the device.
It has been years since I was familiar enough with the insides of SSDs to tell you exactly what they are doing now, but even ~10-15 years ago it was normal for each raw 2k block to actually be ~2176+ bytes and use at least 128 bytes for LDPC codes. Since then the block sizes have gone up (which reduces the number of bytes you need to achieve equivalent protection) and the lithography has shrunk (which increases the raw error rate).
Where exactly the error correction is implemented (individual dies, SSD controller, etc) and how it is reported can vary depending on the application, but I can say with assurance that there is no chance your OS sees uncorrected bits from your flash dies.
> I can say with assurance that there is no chance your OS sees uncorrected bits from your flash dies.
While true, there is zero promises that what you meant to save and what gets saved are the same things. All the drive mostly promises is that if the drive safely wrote XYZ to the disk and you come back later, you should expect to get XYZ back.
There are lots of weasel words there on purpose. There is generally zero guarantee in reality and drives lie all the time about data being safely written to disk, even if it wasn't actually safely written to disk yet. This means on power failure/interruption the outcome of being able to read XYZ back is 100% unknown. Drive Manufacturers make zero promises here.
On most consumer compute, there is no promises or guarantees that what you wrote on day 1 will be there on day 2+. It mostly works, and the chances are better than even that your data will be mostly safe on day 2+, but there is zero promises or guarantees. We know how to guarantee it, we just don't bother(usually).
You can buy laptops and desktops with ECC RAM and use ZFS(or other checksumming FS), but basically nobody does. I'm not aware of any mobile phones that offer either option.
> While true, there is zero promises that what you meant to save and what gets saved are the same things. All the drive mostly promises is that if the drive safely wrote XYZ to the disk and you come back later, you should expect to get XYZ back.
I'm not really sure what point you're trying to make. It's using ECC, so they should be the same bytes.
There isn't infinite reliability, but nothing has infinite reliability. File checksums don't provide infinite reliability either, because the checksum itself can be corrupted.
You keep talking about promises and guarantees, but there aren't any. All there is are statistical rates of reliability. Even ECC RAM or file checksums don't offer perfect guarantees.
For daily consumer use, the level of ECC built into disks is generally plenty sufficient. It's chosen to be so.
I would disagree that disks alone are good enough for daily consumer use. I see corruption often enough to be annoying with consumer grade hardware without ECC & ZFS. Small images are where people usually notice. They tend to be heavily compressed and small in size means minor changes can be more noticeable. In larger files, corruption tends to not get noticed as much in my experience.
We have 10k+ consumer devices at work and corruption is not exactly common, but it's not rare either. A few cases a year are usually identified at the helpdesk level. It seems to be going down over time, since hardware is getting more reliable, we have a strong replacement program and most people don't store stuff locally anymore. Our shared network drives all live on machines with ECC & ZFS.
We had a cloud provider recently move some VMs to new hardware for us; the ones with ZFS filesystems noticed corruption, while the ones with ext4/NTFS/etc. filesystems didn't notice any corruption. We made the provider move them all again, and the second time around ZFS came up clean. Without ZFS we would have never known, as none of the ext4/NTFS filesystems complained at all. Who knows if all the ext4/NTFS machines were corruption-free; it's anyone's guess.
Yes, the disk mostly promises what you write there will be read back correctly, but that's at the disk level only. The OS, Filesystem and Memory generally do no checking, so any errors at those levels will propagate. We know it happens, we just mostly choose to not do anything about it.
My point was, on most consumer compute, there is no promises or guarantees that what you see on day 1 will be there on day 2. It mostly works, and the chances are better than even that your data will be mostly safe on day 2, but there is zero promises or guarantees, even though we know how to do it. Some systems do, those with ECC memory and ZFS for example. Other filesystems also support checksumming, like BTRFS being the most common counter-example to ZFS. Even though parts of BTRFS are still completely broken(see their status page for details).
Yes, ZFS is not the only filesystem with data checksumming and guarantees, but it's one of the very rare exceptions that do.
ZFS has been in productions work loads since 2005, 20 years now. It's proven to be very safe.
BTRFS has known fundamental issues past one disk. It is however improving. I will say BTRFS is fine for a single drive. Even the developers last I checked(a few years ago) don't really recommend it past a single drive, though hopefully that's changing over time.
Apple and Sun couldn't agree on a 'support contract'. From Jeff Bonwick, one of the co-creators of ZFS:
>> Apple can currently just take the ZFS CDDL code and incorporate it (like they did with DTrace), but it may be that they wanted a "private license" from Sun (with appropriate technical support and indemnification), and the two entities couldn't come to mutually agreeable terms.
> I cannot disclose details, but that is the essence of it.
Apple took DTrace, licensed via CDDL—just like ZFS—and put it into the kernel without issue. Of course a file system is much more central to an operating system, so they wanted much more of a CYA for that.
ZFS is the king of all file systems. As someone with over a petabyte of storage across 275 drives, I have never lost a single byte due to a hard drive failure or corruption, thanks to ZFS.
ZFS sort of moved inside the NVMe controller - it also checksums and scrubs things all the time, you just don't see it. This does not, however, support multi-device redundant storage, but that is not a concern for Apple - the vast majority of their devices have only one storage device.
Back in 2016, Ars Technica picked up this piece from my blog [1] as well as a longer piece reviewing the newly announced APFS [2] [3]. Glad it's still finding an audience!
[1]: https://arstechnica.com/gadgets/2016/06/zfs-the-other-new-ap...
[2]: https://ahl.dtrace.org/2016/06/19/apfs-part1/
[3]: https://arstechnica.com/gadgets/2016/06/a-zfs-developers-ana...
"Still another version I’ve heard calls into question the veracity of their purported friendship, and has Steve instead suggesting that Larry go f*ck himself. Normally the iconoclast, that would, if true, represent Steve’s most mainstream opinion."
LOL!!
I really hope they weren't friends, that really shatters my internal narrative (mainly because I can't actually picture either of them having actual friends).
But aren’t they very similar? Very successful, very rich, and very pathetic human beings… no?
As a desktop user, I am content with APFS. The only feature from ZFS that I would like, is the corruption detection. I honestly don't know how robust the image and video formats are to bit corruption. On the one hand, potentially, "very" robust. But on the other, I would think that there are some very special bits that if toggled can potentially "ruin" the entire file. But I don't know.
However, I can say, every time I've tried ZFS on my iMac, it was simply a disaster.
Just trying to set it up on a single USB drive, or setting it up to mirror a pair. The net effect was that it CRUSHED the performance on my machine. It became unusable. We're talking "move the mouse, watch the pointer crawl behind" unusable. "Let's type at 300 baud" unusable. Interactive performance was shot.
After I remove it, all is right again.
> Just trying to set it up on a single USB drive
That's the fault of macOS, I also experienced 100% CPU and load off the charts and it was kernel_task jammed up by USB. Once I used a Thunderbolt enclosure it started to be sane. This experience was the same across multiple non-Apple filesystems as I was trying a bunch to see which one was the best at cross-os compatibility
Also, separately, ZFS says "don't run ZFS on USB". I didn't have problems with it, but I knew I was rolling the dice
Yeah they do say that but anecdotally my Plex server has been ZFS over USB 3 since 2020 with zero problems (using Ubuntu 20.04)
Anyway only bringing it up to reinforce that it is probably a macOS problem.
> I honestly don't know how robust the image and video formats are to bit corruption.
It depends on the format. A BMP image format would limit the damage to 1 pixel, while a JPEG could propagate the damage to potentially the entire image. There is an example of a bitflip damaging a picture here:
https://arstechnica.com/information-technology/2014/01/bitro...
That single bit flip ruined about half of the image.
As for video, that depends on how far apart I frames are. Any damage from a bit flip would likely be isolated to the section of video from the bitflip until the next I-frame occurs. As for how bad it could be, it depends on how the encoding works.
> On the one hand, potentially, "very" robust.
Only in uncompressed files.
> But on the other, I would think that there are some very special bits that if toggled can potentially "ruin" the entire file. But I don't know.
The way that image compression works means that a single bit flip prior to decompression can affect a great many pixels, as shown at Ars Technica.
> However, I can say, every time I've tried ZFS on my iMac, it was simply a disaster.
Did you file an issue? I am not sure what the current status of the macOS driver’s production readiness is, but it will be difficult to see it improve if people do not report issues that they have.
A thing that can destroy most files in terms of readability is a bit flip in the header section. These could be theoretically corrected with clever guesswork and a hex editor, but in practise it is going to be hard to know where the bit flip occured, when a file just can't be read anymore.
> The only feature from ZFS that I would like, is the corruption detection.
I run ZFS on my main server at home (Proxmox: a Linux hypervisor based on Debian and Proxmox ships with ZFS) but...
No matter the FS, for "big" files that aren't supposed to change, I append a (partial) cryptographic checksum to the filename. For example:
20240238-familyTripBari.mp4 becomes 20240238-familyTripBari-b3-8d77e2419a36.mp4 where "-b3-" indicates the type of cryptographic hash ("b3" for Blake3 in my case for it's very fast) and 8d77e2419a36 is the first x hexdigits of the cryptographic hash.
I play the video file (or whatever file it is) after I added the checksum: I know it's good.
I do that for movies, pictures, rips of my audio CDs (although these ones are matched with a "perfect rips" online database too), etc. Basically with everything that isn't supposed to change and that I want to keep.
I then have a shell script (which I run on several machines) that uses random sampling where I pick the percentage of files that have such a cryptographic checksum in their filenames that I want to check and that verifies that each still has its checksum matching. I don't verify 100% of the files all the time. Typically I'll verify, say, 3% of my files, randomly, daily.
Does it help? Well sure yup. For whatever reason one file was corrupt on one of my system: it's not too clear why for the file had the correct size but somehow a bit had flipped. During some sync probably. And my script caught it.
The nice thing is I can copy such files on actual backups: DVDs or BluRays or cloud or whatever. The checksum is part of the filename, so I know if my file changed or not no matter the OS / backup medium / cloud or local storage / etc.
If you have "bit flip anxiety", it helps ; )
The checksum doesn’t help you fix the flipped bit nor does it tell you which bit flipped. You would have to re-create from a complete back up instead of using the efficiency of parity discs. Basically Raid 1 vs Raid 5
If OP is backing up locally onto a ZFS server like they said they were then say propagating this data to a cloud provider like Blackblaze which uses ext4 this sort of approach makes sense.
This approach is also good when you have multiple sources to restore from. It makes it easier to determine what is the new "source of truth."
There's also something to be said for backing up onto different filesystems. You don't want to be stung by an FS bug, and if you are, it's good to know about it.
The death of ZFS in macOS was a huge shift in the industry. It has to be seen in the context of Microsoft killing its hugely ambitious WinFS; in combination, the two felt like the death of desktop innovation.
Both are imho linked to "offline desktop use cases are not important anymore". Both companies saw their future gains elsewhere, in internet-related functions and what became known as "the cloud". No need for a fancy, feature-rich and expensive filesystem when it is only going to be used as a cache for remote cloud stuff.
Internet connections of the day didn't yet offer enough speed for cloud storage.
Apple was already working to integrate ZFS when Oracle bought Sun.
From TFA:
> ZFS was featured in the keynotes, it was on the developer disc handed out to attendees, and it was even mentioned on the Mac OS X Server website. Apple had been working on its port since 2006 and now it was functional enough to be put on full display.
However, once Oracle bought Sun, the deal was off.
Again from TFA:
> The Apple-ZFS deal was brought for Larry Ellison's approval, the first-born child of the conquered land brought to be blessed by the new king. "I'll tell you about doing business with my best friend Steve Jobs," he apparently said, "I don't do business with my best friend Steve Jobs."
And that was the end.
Was it not open source at that point?
It was! And Apple seemed fine with including DTrace under the CDDL. I’m not sure why Apple wanted some additional arrangement but they did.
The NetApp lawsuit. Apple wanted indemnification, and Sun/Oracle did not want to indemnify Apple.
At the time that NetApp filed its lawsuit I blogged about how ZFS was a straightforward evolution of BSD 4.4's log structured filesystem. I didn't know that to be the case historically, that is, I didn't know if Bonwick was inspired by LFS, but I showed how almost in every way ZFS was a simple evolution of LFS. I showed my blog to Jeff to see how he felt about it, and he didn't say much but he did acknowledge it. The point of that blog was to show that there was prior art and that NetApp's lawsuit was worthless. I pointed it out to Sun's general counsel, too.
While I was disappointed that NetApp sued, the ZFS team literally referenced NetApp and WAFL multiple times in their presentations IIRC. They were kind of begging to be sued.
Also, according to NetApp, "Sun started it".
https://www.networkcomputing.com/data-center-networking/neta...
No, the ZFS team did not "literally reference NetApp and WAFL" in their presentations and no, Sun did not "start it" -- NetApp initiated the litigation (though Sun absolutely countersued), and NetApp were well on their way to losing not only their case but also their WAFL patents when Oracle acquired Sun. Despite having inherited a winning case, Oracle chose to allow the suit to be dismissed[0]; terms of the settlement were undisclosed.
[0] https://www.theregister.com/2010/09/09/oracle_netapp_zfs_dis...
Seems specious. Patents don't preclude one from overtly trying to compete; they protect specific mechanisms. In this case either ZFS didn't use the same mechanisms or the mechanisms themselves were found to have prior art.
Ok? So what?
The argument advanced in the piece isn't without merit -- that ripping out DTrace, if subsequent legal developments demanded it, would be a heck of a lot easier than removing a filesystem that would by then contain massive amounts of customer data.
And given what a litigious jackass Larry Ellison / Oracle is, I can't fault Apple for being nervous.
No, it was that Apple wanted to be indemnified in the event that Sun/Oracle lost the NetApp lawsuit.
Ironic since the post above tells the story as LE saying no to Jobs.
There are always two sides to a story. And Jobs was not the kind of person to take no for an answer. I've known other narcissists, and they've always seen friends as more of a resource than someone to care about.
I think the truth is somewhere in the middle.
Another rumour was that Schwartz spilling the beans pissed Jobs off, which I wouldn't really put past him. Though I don't think it would have been enough to kill this.
I think all these little things added up and the end result was just "better not then".
Exactly this.
The business case for providing a robust desktop filesystem simply doesn’t exist anymore.
20 years ago, (regular) people stored their data on computers and those needed to be dependable. Phones existed, but not to the extent they do today.
Fast forward 20 years, and many people don’t even own a computer (in the traditional sense, many have consoles). People now have their entire life on their phones, backed up and/or stored in the cloud.
SSDs also became “large enough” that HDDs are mostly a thing of the past in consumer computers.
Instead you today have high-reliability hardware and software in the cloud, which arguably is much more resilient than anything you could reasonably cook up at home. Besides the hardware (power, internet, fire suppression, physical security, etc.), you're also typically looking at geographic redundancy across multiple data centers using Reed-Solomon erasure coding, but that's nothing the ordinary user needs to know about.
Most cloud services also offer some kind of snapshot functionality as malware protection (e.g. OneDrive offers unlimited snapshots on a rolling 30-day window).
Truth is that most people are way better off just storing their data in the cloud and making a backup at home, though many people seem to ignore the latter, and Apple makes it exceptionally hard to automate.
You do realise that in your use cases you still need reliable filesystems both in the cloud and on devices.
Because any corruption at any point will get synced as a change, or worse can cause failure.
What do you do when you discover that some thing you have not touched in a long time, but suddenly need, is corrupted and all of your backups are corrupt because the corruption happened prior to your 30 day window at OneDrive?
You would have early warning with ZFS. You have data loss with your plan.
You should have bought OneDrive Premium Plus with Extended Backup Storage (tm). This is your fault for not shelling out, stop blaming your computer.
/s
Workstation use cases exist. Data archival is not the only application of file systems.
I remember eagerly anticipating ZFS for desktop hard disks. I seem to remember it never took off because memory requirements were too high and payoffs were insufficient to justify the trade off.
Linux or FreeBSD developers are free to adopt ZFS as their primary file systems. But it appears that practical benefits are not really evident to most users.
Lots of ZFS users are enthusiasts who heard about that one magic thing that does it all in one tidy box, whereas usually you would have to know all the minutiae of LVM/mdadm/cryptsetup/nbd and mkfs.whatever to get to the same point. So while ZFS is the nicer-dicer of volume management and filesystems, the latter is your whole chef's knife set. And while you can dice with both, the user groups are not the same. And enthusiasts with the right use cases are very few.
And for the thin-provisioned snapshotted subvolume usecase, btrfs is currently eating ZFS's lunch due to far better Linux integration. Think snapshots at every update, and having a/b boot to get back to a known-working config after an update. So widespread adoption through the distro route is out of the question.
Ubuntu's ZFS-on-root with zsys auto snapshots have been working excellently on my server for 5 years. It automatically takes snapshots on every update and adds entries to grub so rolling back to the last good state is just a reboot away.
> And for the thin-provisioned snapshotted subvolume usecase, btrfs is currently eating ZFS's lunch due to far better Linux integration.
Is this a technical argument? Or is this just more licensing nonsense?
> Think snapshots at every update, and having a/b boot to get back to a known-working config after an update.
I use ZFS and I have snapshots on every update? I have snapshots hourly, daily, weekly and monthly. I have triggered snapshots too, and ad hoc dynamic snapshots too. I wrote about it here: https://kimono-koans.github.io/opinionated-guide/
The ZFS license makes it impossible to include in upstream Linux kernel, which makes it much less usable as primary filesystem.
Linux's signed-off-by policy makes that impossible. Linus Torvalds would need Larry Ellison's sign-off before even considering it. Linus told me this by email around 2013 (if I recall correctly) when I emailed him to discuss user requests for upstream inclusion. He had no concerns about the license being different at the time.
That's called marketing. Give it a snazzy name, like say "TimeMachine" and users will jump on it.
Also, ZFS has a bad name within the Linux community due to some licensing stuff. I find that most BSD users don't really care about such legalese and most people I know that run FreeBSD are running ZFS on root. Which works amazingly well I might add.
Especially with something like sanoid added to it, it basically does the same as Time Machine on the Mac, a feature that users love. Albeit stored on the same drive (but with syncoid, or just manually rolled zfs send/recv scripts, you can do that to another location too).
> I find that most BSD users don't really care about such legalese and most people I know that run FreeBSD are running ZFS on root.
I don't think it's that they don't care, it's that the CDDL and BSD-ish licenses are generally believed to just not have the conflict that CDDL and GPL might. (IANAL, make your own conclusions about whether either of those are true)
Hmm yeah but that's the thing, who really cares about licenses as a user? I certainly don't. It's just some stuff that some lawyers fuss over. I don't read EULAs either nor would I even consider obeying them. The whole civil-legal world is just something I ignore.
I do have a feeling that Linux users in general care more about the GPL which is quite specific of course. Though I wonder if anyone chooses Linux for that reason.
But really personally I don't care whether companies give anything back, if anything I would love less corporate involvement in the OS I use. It was one of my main reasons for picking BSD. The others were a less fragmented ecosystem and less push to change things constantly.
I feel the same way. The way people talk about how ZFS is incompatible with Linux feels like debates over religious doctrine.
> ZFS has a bad name within the Linux community due to some licensing stuff
This is out of an abundance of caution. Canonical bundle ZFS in the Ubuntu kernel and no one sued them (yet).
True, and I understand the caution considering Oracle is involved, which is an awful company to deal with (and their takeover of Sun was a disaster).
But really, this is a concern for distros. Not for end users. Yet many of the Linux users I speak to are somehow worried about this. Most can't even describe the provisions of the GPL so I don't really know what that's about. Just something they picked up, I guess.
Licensing concerns that prevent distros from using ZFS will sooner or later also have adverse effects on end users. Actually those effects are already there: The constant need to adapt a large patchset to the current kernel, meaning updates are a hassle. The lack of packaging in distributions, meaning updates are a hassle. And the lack of integration and related tooling, meaning many features can not be used (like a/b boots from snapshots after updates) easily, and installers won't know about ZFS so you have to install manually.
None of this is a worry about being sued as an end user. But all of those are worries that your life will be harder with ZFS, and a lot harder as soon as the first lawsuits hit anyone, because all the current (small) efforts to keep it working will cease immediately.
Unlike other out of tree filesystems such as Reiser4, the ZFS driver does not patch the kernel sources.
That is due to licensing reasons, yes. It makes maintaining the codebase even more complicated because when the kernel module API changes (which it very frequently does) you cannot just adapt it to your needs, you have to work around all the new changes that are there in the new version.
You have things backward. Licensing has nothing to do with it. Changes to the kernel are unnecessary. Maintaining the code base is also simplified by supporting the various kernel versions the way that they are currently supported.
Time Machine was released 17 years ago, and I wish Windows had anything that good. And they're on their 3rd backup system since then.
Windows has a really good basis for it though, in volume shadow copy. I also don't understand why Microsoft never built a time machine based on that. Well, they kinda did but only on samba shares. But not locally.
But these days they want you to subscribe to their cloud storage so the versioning is done there, which makes sense in their commercial point of view.
I think snapshots on ZFS are better than time machine though. Time machine is a bit of a clunky mess of soft links that can really go to shit on a minor corruption. Leaving you with an unrestorable backup and just some vague error messages.
I worked a lot with macs and I've had my share of bad backups when trying to fix people's problems. I've not seen ZFS fail like that. It's really solid and tends to indicate issues before they lead to bigger problems.
Shadow Protect by StorageCraft was brilliant. Pretty sure MS actually licensed Shadow Copy from them, but I could be mistaken. It's been a while since I played in that space.
Hard links[0].
[0] https://en.wikipedia.org/wiki/Time_Machine_(macOS)#Operation
Hard links are only used on HFS+. APFS has snapshot support.
> I wish Windows had anything that good
I can't readily tell how much of the dumbness is from the filesystem and how much from the kernel, but the end result is that until Windows gets away from its 1980s-style file locking there's no prayer. Imagine having to explain to your boss that your .docx wasn't backed up because you left Word open over the weekend. A just catastrophically idiotic design.
Ah, but this is really not true. Volume Shadow Copy makes snapshots of files, and through that a backup can capture an entire NTFS volume, including locked files, fully quiesced. It was invented for that exact purpose. Backup software on Windows leverages this functionality well. It took much longer for Linux to have something similar.
I have many criticisms of NTFS like it being really bad at handling large volumes of small files. But this is something it can do well.
The lock prevents other people from copying the file or opening it even in read only, yes. But backup software can back it up just fine.
>I find that most BSD users don't really care about such legalese and most people I know that run FreeBSD are running ZFS on root.
What a weird take. BSD's license is compatible with ZFS, that's why. "Don't really care?" Really? Come on.
I just mean that GPL is a bit of a religion. There are very strong opinions and principles behind it. Whereas the BSD license is more like "do whatever you want". It makes sense that the followers of the former care more deeply about it, right?
Personally I don't care about or obey any software licenses, as a user.
But this is kinda the vibe I get from other BSD users if a license discussion comes up. Maybe it's my bubble, that's possible.
There has been an increase in posts with casual confidence in their own absolute correctness. I originally attributed it to the influence of LLMs, but it is becoming so common now that it is hard to dismiss.
I never claimed absolute correctness, just stating what I see with the other BSD users I know.
ZFS on FreeBSD is quite nice. System tools like freebsd-update integrate well. UFS continues to work as well, and may be more appropriate for some use cases where ZFS isn't a good fit, copy on write is sometimes very expensive.
Afaik, the FreeBSD position is that both ZFS and UFS are fully supported and neither is secondary to the other; the installer asks whether you want ZFS, UFS, Manual (with a menu-based tool), or Shell and you do whatever, in that order, so maybe a slight preference towards ZFS.
ZFS is a first class citizen in FreeBSD and has been for at least a decade(probably longer). Not at all like in most Linux distros.
OpenZFS exists and there is a port of it for Mac OS X.
The problem is that it is still owned by Oracle. And Solaris ZFS is incompatible with OpenZFS. Not that people really use Solaris anymore.
It is really unfortunate. Linux has adopted filesystems from other operating systems before. It is just that nobody trusts Oracle.
APFS? That still happened.
Besides the licensing issue, I wonder if optimizing ZFS for low latency + low RAM + low power on the iPhone would have been an uphill battle or easy. My experience running ZFS years ago was poor latency and large RAM use with my NAS, but that hardware and drive configuration was optimized for low $ per GB stored and used parity stuff.
While its deduplication feature clearly demands more memory, my understanding is that the ZFS ARC is treated by the kernel as a driver with a massive, persistent memory allocation that cannot be swapped out ("wired" pages). Unlike the regular file system cache, ARC's eviction is not directly managed by the kernel. Instead, ZFS itself is responsible for deciding when and how to shrink the ARC.
This can lead to problems under sudden memory pressure. Because the ARC does not immediately release memory when the system needs it, userland pages might get swapped out instead. This behavior is more noticeable on personal computers, where memory usage patterns are highly dynamic (applications are constantly being started, used, and closed). On servers, where workloads are more static and predictable, the impact is usually less severe.
I do wonder if this is also the case on Solaris or illumos, where there is no intermediate SPL between ZFS and the kernel. If so, I don't think that a hypothetical native integration of ZFS on macOS (or even Linux) would adopt the ARC in its current form.
The ZFS driver will release memory if the kernel requests it. The only integration level issue is that the free command does not show ARC as a buffer/cache, so it misrepresents reality, but as far as I know, this is an issue with caches used by various filesystems (e.g. extent caches). It is only obvious in the case of ZFS because the ARC can be so large. That is a feature, not a bug, since unused memory is wasted memory.
> The ZFS driver will release memory if the kernel requests it.
Not always fast enough.
Solaris achieved some kind of integration between the ARC and the VM subsystem as part of the VM2 project. I don't know any more details than that.
I assume that the VM2 project achieved something similar to the ABD changes that were done in OpenZFS. ABD replaced the use of SLAB buffers for ARC with lists of pages. The issue with SLAB buffers is that absurd amounts of work could be done to free memory, and a single long lived SLAB object would prevent any of it from mattering. Long lived slab objects caused excessive reclaim, slowed down the process of freeing enough memory to satisfy system needs and in some cases, prevented enough memory from being freed to satisfy system needs entirely. Switching to linked lists of pages fixed that since the memory being freed from ARC upon request would immediately become free rather than be deferred to when all of the objects in the SLAB had been freed.
Max ARC size is configurable, and ZFS does not need the mythical 1 GB per TB to function well.
This seems like an early application of the Tim Cook doctrine: Why would Apple want to surrender control of this key bit of technology for their platforms?
The rollout of APFS a decade later validated this concern. There’s just no way that flawless transition happens so rapidly without a filesystem fit to order for Apple’s needs from Day 0.
(Edit: My comment is simply about the logistics and work involved in a very well executed filesystem migration. Not about whether ZFS is good for embedded or memory constrained devices.)
What you describe hits my ear as more NIH syndrome than technical reality.
Apple’s transition to APFS was managed like you’d manage any kind of mass scale filesystem migration. I can’t imagine they’d have done anything differently if they’d have adopted ZFS.
Which isn’t to say they wouldn’t have modified ZFS.
But with proper driver support and testing it wouldn’t have made much difference whether they wrote their own file system or adopted an existing one. They have done a fantastic job of compartmentalizing and rationalizing their OS and user data partitions and structures. It’s not like every iPhone model has a production run that has different filesystem needs that they’d have to sort out.
There was an interesting talk given at WWDC a few years ago on this. The roll out of APFS came after they’d already tested the filesystem conversion for randomized groups of devices and then eventually every single device that upgraded to one of the point releases prior to iOS 10.3. The way they did this was to basically run the conversion in memory as a logic test against real data. At the end they’d have the super block for the new APFS volume, and on a successful exit they simply discarded it instead of writing it to persistent storage. If it errored it would send a trace back to Apple.
Huge amounts of testing and consistency in OS and user data partitioning and directory structures is a huge part of why that migration worked so flawlessly.
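The pattern, as described in that talk, is roughly this; a conceptual Python sketch only, where convert_in_memory and report_error are hypothetical stand-ins, not Apple APIs:

    def dry_run_upgrade(volume, convert_in_memory, report_error):
        """Run the HFS+ -> APFS conversion purely in memory, then throw the result away."""
        try:
            new_superblock = convert_in_memory(volume)   # build all new metadata, write nothing
        except Exception as exc:                         # any inconsistency in the real data
            report_error(volume, exc)                    # phone the failure trace home
            return False
        del new_superblock                               # success: discard instead of persisting
        return True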
To be clear, BTRFS also supports in-place upgrade. It's not a uniquely Apple feature; any copy-on-write filesystem with flexibility as to where data is located can be made to fit inside of the free blocks of another filesystem. Once you can do that, then you can do test runs[0] of the filesystem upgrade before committing to wiping the superblock.
I don't know for certain if they could have done it with ZFS, but I can imagine it would at least have been doable with some Apple extensions that would only have to exist at test/upgrade time.
[0] Part of why the APFS upgrade was so flawless was that Apple had done a test upgrade in a prior iOS update. They'd run the updater, log any errors, and then revert the upgrade and ship the error log back to Apple for analysis.
I don't see why ZFS wouldn't have gone over equally flawlessly. None of the features that make ZFS special were in HFS(+), so conversion wouldn't be too hard. The only challenge would be maintaining the legacy compression algorithms, but ZFS is configurable enough that Apple could've added their custom compression to it quite easily.
There are probably good reasons for Apple to reinvent ZFS as APFS a decade later, but none of them technical.
I also wouldn't call the rollout of APFS flawless, per se. It's still a terrible fit for (external) hard drives and their own products don't auto convert to APFS in some cases. There was also plenty of breakage when case-sensitivity flipped on people and software, but as far as I can tell Apple just never bothered to address that.
HFS compression, AFAICT, is all done in user space with metadata and extended attributes.
Using ZFS isn't surrendering control. Same as using parts of FreeBSD. Apple retains control because they don't have an obligation (or track record) of following the upstream.
For zfs, there's been a lot of improvements over the years, but if they had done the fork and adapt and then leave it alone, their fork would continue to work without outside control. They could pull in things from outside if they want, when they want; some parts easier than others.
If it were an issue it would hardly be an insurmountable one. I just can't imagine a scenario where Apple engineers go “Yep, we've eked out all of the performance we possibly can from this phone, the only thing left to do is change out the filesystem.”
Does it matter if it’s insurmountable? At some point, the benefits of a new FS outweigh the drawbacks. This happens earlier than you might think, because of weird factors like “this lets us retain top filesystem experts on staff”.
It’s worth remembering that the filesystem they were looking to replace was HFS+. It was introduced in the 90s as a modernization of HFS, itself introduced in the 80s.
Now, old does not necessarily mean bad, but in this case….
If I recall correctly, ZFS error recovery was still “restore from backup” at the time, and iCloud acceptance was more limited. (ZFS basically gave up if an error was encountered after the checksum showed that the data was read correctly from storage media.) That's fine for deployments where the individual system does not matter (or you have dedicated staff to recover systems if necessary), but phones aren't like that. At least not from the user perspective.
ZFS has ditto blocks that allows it to self heal in the case of corrupt metadata as long as a good copy remains (and there would be at least 2 copies by default). ZFS only ever needs you to restore from backup if the damage is so severe that there is no making sense of things.
Minor things like the indirect blocks being missing for a regular file only affect that file. Major things like all 3 copies of the MOS (the equivalent to a superblock) being gone for all uberblock entries would require recovery from backup.
If all copies of any other filesystem’s superblock were gone too, that filesystem would be equally irrecoverable and would require restoring from backup.
As far as I understand it, ditto blocks were only used if the corruption was detected via a checksum mismatch. If the checksum was correct, but the metadata turned out to be unusable later (say because it was corrupted in memory, and the checksum was computed after the corruption happened), that was treated as a fatal error.
So the solution is to use ._ files that most people will delete because they are a nuisance?
Apple wanted one operating system that ran on everything from a Mac Pro to an Apple Watch and there’s no way ZFS could have done that.
ZFS would be quite comfortable with the 512MB of RAM on an Apple Watch:
https://iosref.com/ram-processor
People have run operating systems using ZFS on less.
Kind of odd that the blog states that "The architect for ZFS at Apple had left" and links to the LinkedIn profile of someone who doesn't have any Apple work experience listed on their resume. I assume the author linked to the wrong profile?
Ex-Apple File System engineer here who shared an office with the other ZFS lead at the time. Can confirm they link to the wrong profile for Don Brady.
This is the correct person: https://github.com/don-brady
Also can confirm Don is one of the kindest, nicest principal engineer level people I’ve worked with in my career. Always had time to mentor and assist.
Not sure how I fat-fingered Don's LinkedIn, but I'm updating that 9-year-old typo. Agreed that Don is a delight. In the years after this article I got to collaborate more with him, but left Delphix before he joined to work on ZFS.
Given your expertise, any chance you can comment on the risk of data corruption on APFS given that it only checksums metadata?
I moved out of the kernel in 2008 and never went back, so don’t have a wise opinion here which would be current.
It was just yesterday I relistened to the contemporary Hypercritical episode on the topic: https://hypercritical.fireside.fm/56
Wow, John's voice has changed a LOT from back then
Discussed at the time:
ZFS: Apple’s New Filesystem That Wasn’t - https://news.ycombinator.com/item?id=11909606 - June 2016 (128 comments)
Thanks for sharing; I was just looking into what happened to Sun. I like the second-hand quote comparing IBM and HP to "garbage trucks colliding", plus the inclusion of blog posts with links to the court filings.
Is it fair to say ZFS made most sense on Solaris using Solaris Containers on SPARK?
ZFS was developed in Solaris, and at the time we were mostly selling SPARC systems. That changed rapidly and the biggest commercial push was in the form of the ZFS Storage Appliance that our team (known as Fishworks) built at Sun. Those systems were based on AMD servers that Sun was making at the time such as Thumper [1]. Also in 2016, Ubuntu leaned in to use of ZFS for containers [2]. There was nothing that specific about Solaris that made sense for ZFS, and even less of a connection to the SPARC architecture.
[1]: https://www.theregister.com/2005/11/16/sun_thumper/
[2]: https://ubuntu.com/blog/zfs-is-the-fs-for-containers-in-ubun...
> There was nothing that specific about Solaris that made sense for ZFS, and even less of a connection to the SPARC architecture.
Although it does not change the answer to the original question, I have long been under the impression that part of the design of ZFS had been influenced by the Niagara processor. The heavily threaded ZIO pipeline had been so forward thinking that it is difficult to imagine anyone devising it unless they were thinking of the future that the Niagara processor represented.
Am I correct to think that or did knowledge of the upcoming Niagara processor not shape design decisions at all?
By the way, why did Thumper use an AMD Opteron over the UltraSPARC T1 (Niagara)? That decision seems contrary to idea of putting all of the wood behind one arrow.
Niagara did not shape design decisions at all -- remember that Niagara was really only doing on a single socket what we had already done on large SMP machines (e.g., Starfire/Starcat). What did shape design decisions -- or at least informed thinking -- was a belief that all main memory would be non-volatile within the lifespan of ZFS. (Still possible, of course!) I don't know that there are any true artifacts of that within ZFS, but I would say that it affected thinking much more than Niagara.
As for Thumper using Opteron over Niagara: that was due to many reasons, both technological (Niagara was interesting but not world-beating) and organizational (Thumper was a result of the acquisition of Kealia, which was independently developing on AMD).
Thanks. I had been unaware of the Starfire/Starcat machines.
I don’t recall that being the case. Bonwick had been thinking about ZFS for at least a couple of years. Matt Ahrens joined Sun (with me) in 2001. The Afara acquisition didn’t close until 2002. Niagara certainly was tantalizing but it wasn’t a primary design consideration. As I recall, AMD was head and shoulders above everything else in terms of IO capacity. Sun was never very good (during my tenure there) at coordination or holistic strategy.
Yeah, I think if it hadn't been for the combination of Oracle and the CDDL, Red Hat would have been more interested in it for Linux. As it was, they basically went with XFS and volume management. Fedora did eventually go with btrfs, but I don't know if there are any plans for a copy-on-write FS in RHEL at any point.
Fedora Server uses XFS on LVM by default & you can do CoW with any modern filesystem on top of an LVM thin pool.
And there is also the Stratis project Red Hat is involved in: https://stratis-storage.github.io/
It looks like btrfs is/was the default for just Fedora Workstation. I’m less connected to Red Hat filesystem details than I used to be.
TIL Stratis is still alive. I thought it basically went on life support after the lead dev left Red Hat.
Still no checksumming though...
RedHat’s policy is no out of tree kernel modules, so it would not have made a difference.
It’s not like Red Hat had/has no influence over what makes it into mainline. But the options for copy on write were either relatively immature or had license issues in their view.
Their view is that if it is out of tree, they will not support it. This supersedes any discussion of license. Even out of tree GPL drivers are not supported by RedHat.
We had those things at work as fileservers, so no containers or anything fancy.
Sun salespeople tried to sell us the idea of "zfs filesystems are very cheap, you can create many of them, you don't need quota" (which ZFS didn't have at the time), which we tried out. It was abysmally slow. It was even slow with just one filesystem on it. We scrapped the whole idea, just put Linux on them and suddenly fileserver performance doubled. Which is something we weren't used to with older Solaris/Sparc/UFS or /VXFS systems.
We never tried another generation of those, and soon after Sun was bought by Oracle anyways.
I had a combination uh-oh/wow! moment back in those days when the hacked-up NFS server I built on a Dell with Linux and XFS absolutely torched the Solaris and UFS system we'd been using for development. Yeah, it wasn't apples to apples. Yes, maybe ZFS would have helped. But XFS was proven at SGI, and it was obvious that the business would save thousands overnight by moving to Linux on Dell instead of sticking with Sun E450s. That was the death knell for my time as a Solaris sysadmin, to be honest.
ZFS probably wouldn't have helped. One of my points is, ZFS was slower than UFS in our setup. And both were slower than Linux on the same hardware.
Thanks. Also, the Thumper looks awesome like a max-level MMORPG character that would kill the level-1 consumer Synology NAS character in one hit.
> Is it fair to say ZFS made most sense on Solaris using Solaris Containers on SPARK?
You mean SPARC. And no, ZFS stands alone. But yes, containers were a lot faster to create using ZFS.
Thanks I didn't notice until now. I miskeyed SPARC on my iPhone keyboard.
ZFS remains an excellent filesystem for bulk storage on rust, but were I Apple at the time, I would probably want to focus on something built for the coming era of flash and NVMe storage. There are a number of axioms built into ZFS that come out of the spinning disk era that still hold it back for flash-only filesystems.
Certainly one would build something different starting in 2025 rather than 2001, but do you have specific examples of how ZFS’s design holds it back? I think it has been adapted extremely well for the changing ecosystem.
This presentation from 2022 covers the topic: https://m.youtube.com/watch?v=v8sl8gj9UnA
Note: sound drops out for a couple minutes at 1:30 mark but comes back.
I wonder what ZFS in the iPhone would've looked like. As far as I recall, the iPhone didn't have error correcting memory, and ZFS is notorious for corrupting itself when bit flips hit it and break the checksum on disk. ZFS' RAM-hungry nature would've also forced Apple to add more memory to their phone.
ZFS detects corruption.
A very long ago someone named cyberjock was a prolific and opinionated proponent of ZFS, who wrote many things about ZFS during a time when the hobbyist community was tiny and not very familiar with how to use it and how it worked. Unfortunately, some of their most misguided and/or outdated thoughts still haunt modern consciousness like an egregore.
What you are probably thinking of is the proposed doomsday scenario where bad ram could theoretically kill a ZFS pool during a scrub.
This article does a good job of explaining how that might happen, and why being concerned about it is tilting at windmills: https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-y...
I have never once heard of this happening in real life.
Hell, I’ve never even had bad ram. I have had bad sata/sas cables, and a bad disk though. ZFS faithfully informed me there was a problem, which no other file system would have done. I’ve seen other people that start getting corruption when sata/sas controllers go bad or overheat, which again is detected by ZFS.
What actually destroys pools is user error, followed very distantly by plain old fashioned ZFS bugs that someone with an unlucky edge case ran into.
> Hell, I’ve never even had bad ram.
To what degree can you separate this claim from "I've never noticed RAM failures"?
You can take that as meaning “I’ve never had a noticed issue that was detected by extensive ram testing, or solved by replacing ram”.
I got into overclocking both regular and ECC DDR4 RAM for a while when AMD's 1st-gen Ryzen stuff came out, thanks to ASRock's X399 motherboard, which unofficially supported ECC, allowing both its function and the reporting of errors (produced when overclocking).
Based on my own testing and issues seen from others, regular memory has quite a bit of leeway before it becomes unstable, and memory that’s generating errors tends to constantly crash the system, or do so under certain workloads.
Of course, without ECC you can’t prove every single operation has been fault free, but as some point you call it close enough.
I am of the opinion that ECC memory is the best memory to overclock, precisely because you can prove stability simply by using the system.
All that said, as things become smaller, with tighter specifications to squeeze out faster performance, I do grow more leery of intermittent single errors that occur on the order of weeks or months in newer generations of hardware. I was once able to overclock my memory to the edge of what I thought was stability, as it passed all tests for days, but every month or two a few corrected errors would show up in my logs. Typically, any sort of instability is caught by manual tests within minutes or an hour.
My friends and I spent a lot of our middle and high school days building computers from whatever parts we could find, and went through a lot of sourcing components everywhere from salvaged throwaways to local computer shops, when those were a thing. We hit our fair share of bad RAM, and by that I mean a handful of sticks at best.
It isn't hard to run memtest on all your computers, and that will catch the kind of bad RAM that the aforementioned doomsday scenario requires.
To me, the most implausible thing about ZFS-without-ECC doomsaying is the presumption that the failure mode of RAM is a persistently stuck bit. That's way less common than transient errors, and way more likely to be noticed, since it will destabilize any piece of software that uses that address range. And now that all modern high-density DRAM includes on-die ECC, transient data corruption on the link between DRAM and CPU seems overwhelmingly more likely than a stuck bit.
> ZFS is notorious for corrupting itself when bit flips hit it and break the checksum on disk
ZFS does not need or benefit from ECC memory any more than any other FS. The bitflip corrupted the data, regardless of ZFS. Any other FS is just oblivious, ZFS will at least tell you your data is corrupt but happily keep operating.
> ZFS' RAM-hungry nature
ZFS is not really RAM-hungry, unless one uses deduplication (which is not enabled by default, nor generally recommended). It can often seem RAM hungry on Linux because the ARC is not counted as “cache” like the page cache is.
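For example, on Linux you can see what the ARC actually holds by reading the OpenZFS kstats; a small sketch, assuming the usual /proc/spl/kstat/zfs/arcstats interface is present:

    def arc_size_bytes(path: str = "/proc/spl/kstat/zfs/arcstats") -> int:
        """Return the current ARC size in bytes from the OpenZFS kstat file."""
        with open(path) as f:
            for line in f:
                fields = line.split()
                # kstat rows look like: name  type  data
                if len(fields) == 3 and fields[0] == "size":
                    return int(fields[2])
        raise RuntimeError("ARC size not found; is the ZFS module loaded?")

    print(f"ARC currently holds {arc_size_bytes() / 2**30:.1f} GiB")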
---
ZFS docs say as much as well: https://openzfs.github.io/openzfs-docs/Project%20and%20Commu...
And even dedup was finally rewritten to be significantly more memory efficient, as of the new 2.3 release of ZFS: https://github.com/openzfs/zfs/discussions/15896
It's very amusing that this kind of legend has persisted! ZFS is notorious for *noticing* when bits flip, something APFS designers claimed was rare given the robustness of Apple hardware.[1][2] What would ZFS on iPhone have looked like? Hard to know, and that certainly wasn't the design center.
Neither here nor there, but DTrace was ported to iPhone--it was shown to me in hushed tones in the back of an auditorium once...
[1]: https://arstechnica.com/gadgets/2016/06/a-zfs-developers-ana...
[2]: https://ahl.dtrace.org/2016/06/19/apfs-part5/#checksums
I did early ZFSOnLinux development on hardware that did not have ECC memory. I once had a situation where a bit flip happened in the ARC buffer for libpython.so and all Python software started crashing. Initially, I thought I had hit some sort of bizarre bug in ZFS, so I started debugging. At that time, opening a ZFS snapshot would fetch a duplicate from disk into a redundant ARC buffer, so while debugging, I ran cmp on libpython.so between the live copy and a snapshot copy. It showed the exact bit that had flipped. After seeing that and convincing myself the bit flip was not actually on stable storage, I did a reboot, and all was well. Soon afterward, I got a new development machine that had ECC so that I would not waste my time chasing phantom bugs caused by bit flips.
> ZFS is notorious for corrupting itself when bit flips
That is a notorious myth.
https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-y...
> ZFS is notorious for corrupting itself when bit flips hit it and break the checksum on disk
I don't think it is. I've never heard of that happening, or seen any evidence ZFS is more likely to break than any random filesystem. I've only seen people spreading paranoid rumors based on a couple pages saying ECC memory is important to fully get the benefits of ZFS.
They also insist that you need about 10 TB RAM per TB disk space or something like that.
There is a rule of thumb that you should have at least 1 GB of RAM per TB of disk when using deduplication. That's.... Different.
So you've never seen the people saying you should steer clear of ZFS unless you're going to have an enormous ARC even when talking about personal media servers?
People, especially those on the Internet, say a lot of things.
Some of the things they say aren't credible, even if they're said often.
You don't need an enormous amount of ram to run zfs unless you have dedupe enabled. A lot of people thought they wanted dedupe enabled though. (2024's fast dedupe may help, but probably the right answer for most people is not to use dedupe)
It's the same thing with the "need" for ECC. If your RAM is bad, you're going to end up with bad data in your filesystem. With ZFS, you're likely to find out your filesystem is corrupt (although if the data is corrupted before the checksum is calculated, the checksum doesn't help); with a non-checksumming filesystem, you may get lucky: no metadata gets corrupted, the OS keeps going, and just some of your files are wrong. Having ECC would be better, but there are tradeoffs, so it never made sense for me to use it at home; ZFS still works and protects me from the disk contents changing, even if what was written could itself be wrong.
Not that I recall? And it's worked fine for me...
I have seen people say such things, and none of it was based on reality. They just misinterpreted the performance cliff that data deduplication had to mean you must have absurd amounts of memory even though data deduplication is off by default. I suspect few of the people peddling this nonsense even used ZFS and the few who did, had not looked very deeply into it.
Even then you obviously need L2ARC as well!! /s
But on optane. Because obviously you need an all flash main array for streaming a movie.
Fortunately, this has significantly improved since dedup was rewritten as part of the new ZFS 2.3 release. Search for zfs “fast dedup”.
It’s unfortunate some folks are missing the tongue-in-cheek nature of your comment.
> ZFS is notorious for corrupting itself when bit flips hit it and break the checksum on disk.
If you have no mirrors and no raidz and no ditto blocks then errors cause problems, yes. Early on they would cause panics.
But this isn't ZFS "corrupting itself", rather, it's ZFS saving itself and you from corruption, and the price you pay for that is that you need to add redundancy (mirrors, raidz, or ditto blocks). It's not a bad deal. Some prefer not to know.
Not this myth again. ZFS does not need ECC RAM. Stop propagating this falsehood.
> ZFS is notorious for corrupting itself when bit flips hit it and break the checksum on disk.
What's a bit flip?
Sometimes data on disk and in memory is randomly corrupted. For a pretty amazing example, check out "bitsquatting"[1]: it's like domain-name squatting, but instead of typos, you squat on domains that would be looked up in the case of random bit flips. These can occur due, e.g., to cosmic rays. On disk, HDDs and SSDs can produce the wrong data. It's uncommon to see actual invalid data rather than have an IO fail on ECC, but it certainly can happen (e.g. due to firmware bugs).
[1]: https://en.wikipedia.org/wiki/Bitsquatting
Basically it's that memory changes out from under you. As we know, computers use binary, so everything boils down to a 0 or a 1. A bit flip is changing what was, say, a 0 into a 1.
Usually attributed to "cosmic rays", but really can happen for any number of less exciting sounding reasons.
Basically, there is zero double-checking in your computer for almost everything except stuff that goes across the network. Memory and disks are not checked for correctness, basically ever on any machine anywhere. Many servers (but certainly not all) are the rare exception when it comes to memory safety: they usually have ECC (Error Correction Code) memory, basically a checksum on the memory to ensure that if memory is corrupted, the corruption is noticed and fixed.
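A toy Python illustration of a single bit flip, and of the kind of per-block checksum that lets a filesystem like ZFS notice it (SHA-256 here is just a stand-in for whatever checksum the FS actually uses):

    import hashlib

    data = bytearray(b"family-photo-raw-bytes" * 1000)
    checksum = hashlib.sha256(data).hexdigest()   # stored at write time

    data[1234] ^= 0b00000100                      # one bit silently flips in RAM or on disk

    if hashlib.sha256(data).hexdigest() != checksum:
        print("corruption detected")              # a checksumming FS can now fall back to a good copy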
Essentially every filesystem everywhere does zero data integrity checking:
ZFS is the rare exception among filesystems in that it actually double-checks that the data you save to it is the data you get back from it. Every other filesystem is just a big ball of unknown data. You probably get back what you put in, but there are zero promises or guarantees.
> disks are not checked for correctness, basically ever on any machine anywhere.
I'm not sure that's really accurate -- all modern hard drives and SSD's use error-correcting codes, as far as I know.
That's different from implementing additional integrity checking at the filesystem level. But it's definitely there to begin with.
But SSDs (to my knowledge) only implement checksums for the data transfer. It's a requirement of the protocol, so you can be sure that the stuff in memory and the checksum computed by the CPU arrive exactly like that at the SSD. In the past this was a common error source with faulty hardware RAID.
But there is ABSOLUTELY NO checksum for the bits stored on an SSD. So bit rot in the cells of the SSD goes undetected.
That is ABSOLUTELY incorrect. SSDs have enormous amounts of error detection and correction builtin explicitly because errors on the raw medium are so common that without it you would never be able to read correct data from the device.
It has been years since I was familiar enough with the insides of SSDs to tell you exactly what they are doing now, but even ~10-15 years ago it was normal for each raw 2k block to actually be ~2176+ bytes and use at least 128 bytes for LDPC codes. Since then the block sizes have gone up (which reduces the number of bytes you need to achieve equivalent protection) and the lithography has shrunk (which increases the raw error rate).
Where exactly the error correction is implemented (individual dies, SSD controller, etc) and how it is reported can vary depending on the application, but I can say with assurance that there is no chance your OS sees uncorrected bits from your flash dies.
> I can say with assurance that there is no chance your OS sees uncorrected bits from your flash dies.
While true, there is zero promises that what you meant to save and what gets saved are the same things. All the drive mostly promises is that if the drive safely wrote XYZ to the disk and you come back later, you should expect to get XYZ back.
There are lots of weasel words there on purpose. There is generally zero guarantee in reality, and drives lie all the time about data being safely written to disk even when it wasn't actually written yet. This means that on power failure/interruption the outcome of being able to read XYZ back is 100% unknown. Drive manufacturers make zero promises here.
On most consumer compute, there are no promises or guarantees that what you wrote on day 1 will be there on day 2+. It mostly works, and the chances are better than even that your data will be mostly safe on day 2+, but there are zero promises or guarantees. We know how to guarantee it, we just don't bother (usually).
You can buy laptops and desktops with ECC RAM and use ZFS (or another checksumming FS), but basically nobody does. I'm not aware of any mobile phones that offer either option.
> While true, there is zero promises that what you meant to save and what gets saved are the same things. All the drive mostly promises is that if the drive safely wrote XYZ to the disk and you come back later, you should expect to get XYZ back.
I'm not really sure what point you're trying to make. It's using ECC, so they should be the same bytes.
There isn't infinite reliability, but nothing has infinite reliability. File checksums don't provide infinite reliability either, because the checksum itself can be corrupted.
You keep talking about promises and guarantees, but there aren't any. All there is are statistical rates of reliability. Even ECC RAM or file checksums don't offer perfect guarantees.
For daily consumer use, the level of ECC built into disks is generally plenty sufficient. It's chosen to be so.
I would disagree that disks alone are good enough for daily consumer use. I see corruption often enough to be annoying on consumer-grade hardware without ECC & ZFS. Small images are where people usually notice: they tend to be heavily compressed, and the small size means minor changes are more noticeable. In larger files, corruption tends not to get noticed as much, in my experience.
We have 10k+ consumer devices at work and corruption is not exactly common, but it's not rare either. A few cases a year are usually identified at the helpdesk level. It seems to be going down over time, since hardware is getting more reliable, we have a strong replacement program and most people don't store stuff locally anymore. Our shared network drives all live on machines with ECC & ZFS.
We had a cloud provider recently move some VMs to new hardware for us; the ones with ZFS filesystems noticed corruption, while the ones with ext4/NTFS/etc. filesystems didn't notice any. We made the provider move them all again, and the second time around ZFS came up clean. Without ZFS we would never have known, as none of the ext4/NTFS filesystems complained at all. Whether the ext4/NTFS machines were corruption-free is anyone's guess.
All MLC SSDs absolutely do data checksums and error recovery, otherwise they would lose your data much more often than they do.
You can see some stats using `smartctl`.
Yes, the disk mostly promises what you write there will be read back correctly, but that's at the disk level only. The OS, Filesystem and Memory generally do no checking, so any errors at those levels will propagate. We know it happens, we just mostly choose to not do anything about it.
My point was, on most consumer compute, there are no promises or guarantees that what you see on day 1 will be there on day 2. It mostly works, and the chances are better than even that your data will be mostly safe on day 2, but there are zero promises or guarantees, even though we know how to do it. Some systems do offer them, those with ECC memory and ZFS for example. Other filesystems also support checksumming, BTRFS being the most common counter-example to ZFS, even though parts of BTRFS are still completely broken (see their status page for details).
Btrfs and bcachefs both have data checksumming. I think ReFS does as well.
Yes, ZFS is not the only filesystem with data checksumming and guarantees, but it's one of the very rare exceptions that do.
ZFS has been in production workloads since 2005, 20 years now. It's proven to be very safe.
BTRFS has known fundamental issues past one disk. It is, however, improving. I will say BTRFS is fine for a single drive. Even the developers, last I checked (a few years ago), don't really recommend it past a single drive, though hopefully that's changing over time.
I'm not familiar enough with bcachefs to comment.
What's the current state of ZFS on Macos? As far as I'm aware there's a supported fork.
Apple and Sun couldn't agree on a "support contract". From Jeff Bonwick, one of the co-creators of ZFS:
>> Apple can currently just take the ZFS CDDL code and incorporate it (like they did with DTrace), but it may be that they wanted a "private license" from Sun (with appropriate technical support and indemnification), and the two entities couldn't come to mutually agreeable terms.
> I cannot disclose details, but that is the essence of it.
* https://archive.is/http://mail.opensolaris.org/pipermail/zfs...
Apple took DTrace, licensed via CDDL—just like ZFS—and put it into the kernel without issue. Of course a file system is much more central to an operating system, so they wanted much more of a CYA for that.
> indemnification
That was the sticking point. In the context of the NetApp lawsuit Apple wanted indemnification should Sun/Oracle lose the suit.
ZFS is the king of all file systems. As someone with over a petabyte of storage across 275 drives, I have never lost a single byte to a hard drive failure or corruption, thanks to ZFS.
ZFS sort of moved inside the NVMe controller - it also checksums and scrubs things all the time, you just don't see it. This does not, however, support multi-device redundant storage, but that is not a concern for Apple - the vast majority of their devices have only one storage device.