File systems unfit as distributed storage back ends: 10 years of Ceph (muratbuffalo.blogspot.com)
209 points by mpweiher on Dec 7, 2019 | hide | past | favorite | 34 comments


Something that is not mentioned, though, is that all the years patching existing FS were an education in how a FS works. They likely wouldn’t have had enough knowledge to go balls-out on a new FS at the start. So that 2-year effort was really much longer.


See also the Adrian Colyer discussion (if you have deja vu, this is why, but it's not a repeat of that post): https://news.ycombinator.com/item?id=21460759


I found the story of how Backblaze stores things "on top of", versus "in" filesystems similarly interesting. https://www.backblaze.com/blog/vault-cloud-storage-architect...

It seems like they could go one further and eliminate the ext4 underneath.


You'd love this: https://maisonbisson.com/post/object-storage-prior-art-and-l...

Talks about Facebook, Instagram, S3 and other Object Store services and how they deal with storage at scale.


thanks for the link!


Ceph is awesome; even years ago it was a great technology. At croit.io we provide free software to manage Ceph with ease.


Looks slick. You got downvoted because you dared to promote something, but this actually looks like a reasonable value-add. Ceph is one of those things that's just a bit risky for orgs without the subject-specific expertise.


How does it compare to Rook? We use Rook to manage Ceph in our on-prem Kubernetes cluster and it is excellent.


Yes, it really cannot be emphasized enough that the legacy filesystem interface, with its too-simple 1970s origins and then far, far, far too complex decades of duct tape, is a disastrous albatross.

Cf. what Linus is saying in https://news.ycombinator.com/item?id=21673372, except turn it around. When an interface has devolved into two sides hating and Postel's-law-enabling each other ad infinitum, and a statement like his is actually justifiable, it's time to close up shop and move on. Nothing good will ever come from POSIX-like storage again, and any storage system built around it is doomed to be a mess of too many layers and also too many layer violations. Utter hopelessness.


> Yes, it really cannot be emphasized enough that the legacy filesystem interface, with its too-simple 1970s origins and then far, far, far too complex decades of duct tape, is a disastrous albatross.

Except that nobody will sign on.

Look at what happened to FreeBSD in the 5.0 timeframe when they reworked their storage layers into GEOM. It was a NIGHTMARE. Most people agreed it needed to be done, but there was an excruciatingly loud segment who complained incessantly. It took some gigantic brass balls and asbestos-lined flamesuits on the part of FreeBSD heavy hitters to drive it through.

If the system in Linux is to get fixed, Linus would probably have to step in and pronounce.


Maybe the OS is not the right layer for this?


What other layer could it be in? (Legitimately curious)


In userland, for one. Or an unprivileged service in a microkernel O/S. There are a lot of concerns jammed into the current concept of filesystem.


Do you have an opinion on whether the filesystem situation in Windows is a comparable mess?


I don't, but I imagine it is no better. The VMS/NT people had good intentions, but both ecosystems are smothered in backwards compatibility and Postel's law issues.


This all makes me grateful that I use sqlite3 instead of FS for storage, even for fairly trivial projects.


Could you expand a little on how you're doing that?

I've been thinking about transitioning entirely to sqlite for all my data.


The GP may seem like sarcasm to some, but sqlite is an overlooked way to store things that can be faster (up to 35%!) than the filesystem [0].

You can use something like libsqlfs [1] for POSIX file semantics with sqlite as the backing store.

One HA option for sqlite (single-primary block replication underneath it) may be DRBD.

[0] https://www.sqlite.org/fasterthanfs.html [1] https://github.com/guardianproject/libsqlfs
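The pattern from [0] is just "one row per blob instead of one file per blob." A minimal sketch (names like `put`/`get` are my own, not from the article):

```python
import sqlite3

# Toy illustration of the "blobs in a table" pattern from the
# fasterthanfs article: one row per object instead of one file each.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE blobs (name TEXT PRIMARY KEY, data BLOB)")

def put(name, data):
    """Store (or overwrite) a blob under a name, like writing a file."""
    con.execute("INSERT OR REPLACE INTO blobs VALUES (?, ?)", (name, data))
    con.commit()

def get(name):
    """Fetch a blob by name, like reading a file; None if absent."""
    row = con.execute("SELECT data FROM blobs WHERE name = ?",
                      (name,)).fetchone()
    return row[0] if row else None

put("thumb-001.jpg", b"\xff\xd8 fake jpeg bytes")
```

The win the article measures is for many small blobs, where sqlite's single-file layout avoids per-file open/close and metadata overhead.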


There's no hard formula. It's basically forming a habit of never using file I/O and using SQL queries instead.

Bonus points for the `lsm1` extension of sqlite3, which lets you use it as a key-value store. I've used it with mixed success (if only I could remember the key names that seemed the most logical thing in the world last week, lol).

There's nothing to it, really. sqlite3 is very mature software, and save for a mechanical failure of your storage drive, the odds of it losing your data are practically zero.

For even more bonus points, encrypt your sqlite3 storage. That way you can freely distribute it on Git hosting services.
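Since `lsm1` isn't compiled into most sqlite3 builds, you can get the same key-value habit with an ordinary table; a dict-like sketch (my own wrapper, not the `lsm1` API):

```python
import sqlite3
from collections.abc import MutableMapping

# Hypothetical stand-in for the lsm1 extension: a dict-like
# key-value store backed by a plain sqlite3 table.
class KVStore(MutableMapping):
    def __init__(self, path=":memory:"):
        self.con = sqlite3.connect(path)
        self.con.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")

    def __setitem__(self, k, v):
        with self.con:  # "with" commits the transaction on success
            self.con.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (k, v))

    def __getitem__(self, k):
        row = self.con.execute("SELECT v FROM kv WHERE k = ?", (k,)).fetchone()
        if row is None:
            raise KeyError(k)
        return row[0]

    def __delitem__(self, k):
        with self.con:
            cur = self.con.execute("DELETE FROM kv WHERE k = ?", (k,))
        if cur.rowcount == 0:
            raise KeyError(k)

    def __iter__(self):
        return (k for (k,) in self.con.execute("SELECT k FROM kv"))

    def __len__(self):
        return self.con.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
```

Subclassing `MutableMapping` means you get `in`, `.get()`, `.items()` etc. for free from the five methods above.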


Brilliant, but I wonder, now that we know how filesystems work, if we could redesign Ceph to do the right thing. For instance, a lot of work went into scheduling the important writes at the right time. Perhaps they could have handled these latency issues explicitly.


Was it Ceph they were using at CERN (the ATLAS project, at least)? I thought they were using some kind of file system federation.


On one of the bigger experiments we use quite a few things:

- AFS as a federated posix file system for user home directories. My impression is that a distributed posix filesystem is... well... hard, for basically the reasons listed in the link. We're actually trying to phase it out, starting by reducing the size of the federation by cutting off access outside the CERN network.

- A few in-house developments like xrootd [1] (basically CERN's version of an object store) and EOS (a posix file system built on top), to store data. These projects have their roots in a time when CERN was at the forefront of "big data" and it made sense to develop an in-house project. These days there are a number of alternatives and my impression is that the reasons for continuing the projects are mostly historical.

- For read-only data we have cvmfs [2], a FUSE module which is synced to some other file system a few times a day. Making it read-only simplifies the metadata handling considerably: it's actually quite nice for a CERN project.

- Some people have started using Ceph for more experimental things, but in general these "industry" projects are only starting to replace the home-grown ones.

[1]: https://xrootd.slac.stanford.edu/index.html

[2]: https://cernvm.cern.ch/portal/filesystem



Googling isn't much help. You can find references to AFS, DFS, VM-FS, and EOS...all being used at CERN.


Oh I see, I was thinking of AFS, I'm sure. I thought it was built on Ceph, or vice versa.


Hi! This could have been submitted as an HTTPS link.


I assume if the owner of the site wanted to redirect all http->https traffic they would do so.


Doing that doesn't actually solve the problem though. A MITM attacker still gets to read and modify all that content.


I don't believe blogspot allows you to turn off http, just (optionally) redirect it to https.


[flagged]


Well they could replace the website with something that makes you part of a botnet. Like so https://news.ycombinator.com/item?id=21721843


But Ceph has the BlueStore backend, which doesn't go through the file system.


That was the punchline of the post... file systems add too much extra overhead, so they wrote a storage backend without the file system.
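To make that concrete: this is not BlueStore (which writes to raw block devices and keeps metadata in RocksDB), just a toy sketch of the core idea, i.e. the backend allocates extents in one flat address space itself instead of delegating layout to one file per object. All names here are mine; a single file stands in for the raw device.

```python
import os

EXTENT = 4096  # fixed-size allocation unit, like a device block

class FlatStore:
    """Toy illustration (not BlueStore): self-managed object extents."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        self.alloc = {}    # object name -> (offset, length)
        self.next_off = 0  # bump allocator; a real backend tracks free extents

    def write(self, name, data):
        n_ext = -(-len(data) // EXTENT)   # round up to whole extents
        off = self.next_off
        self.next_off += n_ext * EXTENT
        os.pwrite(self.fd, data, off)     # positional write, no seek needed
        self.alloc[name] = (off, len(data))

    def read(self, name):
        off, length = self.alloc[name]
        return os.pread(self.fd, length, off)
```

The point of the exercise: there is no per-object inode, dentry, or journal entry here; the backend decides exactly where bytes land and what metadata exists, which is the control the paper argues a filesystem takes away.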


I am solving this problem right now and my solution is working great cross-OS. The file system is not the files contained by that system. The thing that affects performance is the CPU time for compression.


MapR-FS is a great distributed file system, which solves tons of challenges out of the box, e.g. HA, POSIX, NFS, multi-tenancy, multi-temperature, co-location, and security.



