File systems unfit as distributed storage back ends: 10 years of Ceph (muratbuffalo.blogspot.com)
209 points by mpweiher on Dec 7, 2019 | hide | past | favorite | 34 comments


Something that is not mentioned, though, is that all the years patching existing FS were an education in how a FS works. They likely wouldn’t have had enough knowledge to go balls-out on a new FS at the start. So that 2-year effort was really much longer.


See also the Adrian Colyer discussion (if you have deja vu, this is why, but it's not a repeat of that post): https://news.ycombinator.com/item?id=21460759


I found the story of how Backblaze stores things "on top of", versus "in" filesystems similarly interesting. https://www.backblaze.com/blog/vault-cloud-storage-architect...

It seems like they could go one further and eliminate the ext4 underneath.


You'd love this: https://maisonbisson.com/post/object-storage-prior-art-and-l...

Talks about Facebook, Instagram, S3 and other Object Store services and how they deal with storage at scale.


thanks for the link!


Ceph is awesome; even years ago it was a great technology. At croit.io we provide free software to manage Ceph with ease.


Looks slick. You got downvoted because you dared to promote something, but this actually looks like a reasonable value-add. Ceph is one of those things that's just a bit risky for orgs without the subject-specific expertise.


How does it compare to Rook? We use Rook to manage Ceph in our on-prem Kubernetes cluster and it is excellent.


Yes, it really cannot be emphasized enough that the legacy filesystem interface, with its too-simple 1970s origins and then far, far, far too complex decades of duct tape, is a disastrous albatross.

Cf. what Linus is saying in https://news.ycombinator.com/item?id=21673372, except turn it around. When an interface has devolved into two sides hating and Postel's-law-enabling each other ad infinitum, and a statement like his is actually justifiable, it's time to close up shop and move on. Nothing good will ever come from POSIX-like storage again, and any storage system built around it is doomed to be a mess of too many layers and also too many layer violations. Utter hopelessness.


> Yes, it really cannot be emphasized enough that the legacy filesystem interface, with its too-simple 1970s origins and then far, far, far too complex decades of duct tape, is a disastrous albatross.

Except that nobody will sign on.

Look at what happened to FreeBSD in the 5.0 timeframe when they reworked their storage layers into GEOM. It was a NIGHTMARE. Most people agreed it needed to be done, but there was an excruciatingly loud segment who complained incessantly. It took some gigantic brass balls and asbestos-lined flamesuits on the part of FreeBSD heavy hitters to drive it through.

If the system in Linux is to get fixed, Linus would probably have to step in and pronounce.


Maybe the OS is not the right layer for this?


What other layer could it be in? (Legitimately curious)


In userland, for one. Or an unprivileged service in a microkernel O/S. There are a lot of concerns jammed into the current concept of filesystem.


Do you have an opinion on whether the filesystem situation in Windows is a comparable mess?


I don't, but I imagine it is no better. The VMS/NT people had good intentions, but both ecosystems are smothered in backwards compatibility and Postel's law issues.


This all makes me grateful that I use sqlite3 instead of FS for storage, even for fairly trivial projects.


Could you expand a little on how you're doing that?

I've been thinking about transitioning entirely to sqlite for all my data.


The GP may seem like sarcasm to some, but sqlite is an overlooked way to store things that can be faster (up to 35%!) than the filesystem [0].

You can use something like libsqlfs [1] for POSIX file semantics with sqlite as the backing store.

One HA option for sqlite (single-primary block replication underneath it) may be DRBD.

[0] https://www.sqlite.org/fasterthanfs.html [1] https://github.com/guardianproject/libsqlfs
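The pattern from [0] is just "one row per blob instead of one file per blob." A minimal sketch (names like `put`/`get` are my own, not from the article):

```python
import sqlite3

# Toy illustration of the "blobs in a table" pattern from the
# fasterthanfs article: one row per object instead of one file each.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE blobs (name TEXT PRIMARY KEY, data BLOB)")

def put(name, data):
    """Store (or overwrite) a blob under a name, like writing a file."""
    con.execute("INSERT OR REPLACE INTO blobs VALUES (?, ?)", (name, data))
    con.commit()

def get(name):
    """Fetch a blob by name, like reading a file; None if absent."""
    row = con.execute("SELECT data FROM blobs WHERE name = ?",
                      (name,)).fetchone()
    return row[0] if row else None

put("thumb-001.jpg", b"\xff\xd8 fake jpeg bytes")
```

The win the article measures is for many small blobs, where sqlite's single-file layout avoids per-file open/close and metadata overhead.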


There's no hard formula. It's basically forming a habit of never using file I/O and using SQL queries instead.

Bonus points for the `lsm1` extension of sqlite3, which lets you use it as a key-value store. I've used it with mixed success (if only I could remember the key names that seemed the most logical thing in the world last week, lol).

There's nothing to it, really. sqlite3 is very mature software, and save for a mechanical failure of your storage drive, the odds of it losing your data are practically zero.

For even more bonus points, encrypt your sqlite3 storage. That way you can freely distribute it on Git hosting services.
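Since `lsm1` isn't compiled into most sqlite3 builds, you can get the same key-value habit with an ordinary table; a dict-like sketch (my own wrapper, not the `lsm1` API):

```python
import sqlite3
from collections.abc import MutableMapping

# Hypothetical stand-in for the lsm1 extension: a dict-like
# key-value store backed by a plain sqlite3 table.
class KVStore(MutableMapping):
    def __init__(self, path=":memory:"):
        self.con = sqlite3.connect(path)
        self.con.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")

    def __setitem__(self, k, v):
        with self.con:  # "with" commits the transaction on success
            self.con.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (k, v))

    def __getitem__(self, k):
        row = self.con.execute("SELECT v FROM kv WHERE k = ?", (k,)).fetchone()
        if row is None:
            raise KeyError(k)
        return row[0]

    def __delitem__(self, k):
        with self.con:
            cur = self.con.execute("DELETE FROM kv WHERE k = ?", (k,))
        if cur.rowcount == 0:
            raise KeyError(k)

    def __iter__(self):
        return (k for (k,) in self.con.execute("SELECT k FROM kv"))

    def __len__(self):
        return self.con.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
```

Subclassing `MutableMapping` means you get `in`, `.get()`, `.items()` etc. for free from the five methods above.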


Brilliant, but I wonder, now that we know how filesystems work, if we could redesign Ceph to do the right thing. For instance, a lot of work went into scheduling the important writes at the right time. Perhaps they could have handled these latency issues explicitly.


Was it Ceph they were using at CERN (the ATLAS project, at least)? I thought they were using some kind of file system federation.


On one of the bigger experiments we use quite a few things:

- AFS as a federated posix file system for user home directories. My impression is that a distributed posix filesystem is... well... hard, for basically the reasons listed in the link. We're actually trying to phase it out, starting by reducing the size of the federation by cutting off access outside the CERN network.

- A few in-house developments like xrootd [1] (basically CERN's version of an object store) and EOS (a posix file system built on top), to store data. These projects have their roots in a time when CERN was at the forefront of "big data" and it made sense to develop an in-house project. These days there are a number of alternatives and my impression is that the reasons for continuing the projects are mostly historical.

- For read-only data we have cvmfs [2], a FUSE module which is synced to some other file system a few times a day. Making it read-only simplifies the metadata handling considerably: it's actually quite nice for a CERN project.

- Some people have started using Ceph for more experimental things, but in general these "industry" projects are only starting to replace the home-grown ones.

[1]: https://xrootd.slac.stanford.edu/index.html

[2]: https://cernvm.cern.ch/portal/filesystem



Googling isn't much help. You can find references to AFS, DFS, VM-FS, and EOS...all being used at CERN.


Oh I see, I was thinking of AFS, I'm sure. I thought it was built on Ceph, or vice versa.


Hi! This could have been submitted as an HTTPS link.


I assume if the owner of the site wanted to redirect all http->https traffic they would do so.


Doing that doesn't actually solve the problem though. A MITM attacker still gets to read and modify all that content.


I don't believe blogspot allows you to turn off http, just (optionally) redirect it to https.


[flagged]


Well they could replace the website with something that makes you part of a botnet. Like so https://news.ycombinator.com/item?id=21721843


But Ceph has the BlueStore backend, which doesn't go through the file system.


That was the punchline of the post... file systems add too much extra overhead, so they wrote a storage backend without the file system.
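To make that concrete: this is not BlueStore (which writes to raw block devices and keeps metadata in RocksDB), just a toy sketch of the core idea, i.e. the backend allocates extents in one flat address space itself instead of delegating layout to one file per object. All names here are mine; a single file stands in for the raw device.

```python
import os

EXTENT = 4096  # fixed-size allocation unit, like a device block

class FlatStore:
    """Toy illustration (not BlueStore): self-managed object extents."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        self.alloc = {}    # object name -> (offset, length)
        self.next_off = 0  # bump allocator; a real backend tracks free extents

    def write(self, name, data):
        n_ext = -(-len(data) // EXTENT)   # round up to whole extents
        off = self.next_off
        self.next_off += n_ext * EXTENT
        os.pwrite(self.fd, data, off)     # positional write, no seek needed
        self.alloc[name] = (off, len(data))

    def read(self, name):
        off, length = self.alloc[name]
        return os.pread(self.fd, length, off)
```

The point of the exercise: there is no per-object inode, dentry, or journal entry here; the backend decides exactly where bytes land and what metadata exists, which is the control the paper argues a filesystem takes away.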


I am solving this problem right now and my solution is working great cross-OS. The file system is not the files contained by that system. The thing that affects performance is the CPU time for compression.


MapR-FS is a great distributed file system, which solves tons of challenges out of the box, e.g. HA, POSIX, NFS, multi-tenancy, multi-temperature, co-location, and security.



