The standalone mode is boring -- it's a ramdisk, about the same as you get from tmpfs.
The standalone_swift mode is interesting -- it's a ramdisk with eventual commit to disk -- but I don't see performance figures or essential config details, like how often it flushes.
The distributed mode is surely cool for some use cases, but it doesn't seem to commit to disk and it neither replicates nor works in parallel.
1- Well, if you don't want to use backend storage, what else do you expect?
2- It flushes as soon as you call flush or close the file. The target isn't necessarily disk; it can be any configured backend (right now only Swift is supported, but it's easy to develop other backends as well).
3- Replication is done in the backend storage. Once your data is flushed, it's replicated however you configure it in Swift, Amazon, or whatever backend you are using. What do you mean by 'nor works in parallel'?
If the back end is eventually consistent, how can you claim POSIX compliance on the front end? User writes and fsyncs, BFS node goes down, somebody tries to read. What will they get? Who the heck knows? The only in-memory copy is gone, and the back end you've chosen doesn't guarantee that you'll get the most recently written data on the next read.
It's perfectly OK if you decide on a different consistency model, but then you can't say you're POSIX compliant (actually you can't anyway for legal reasons but that's another topic).
If you flush a file it will be in the backend; therefore, even if a node goes down, another node will take responsibility for it. There will obviously be a delay for the recovery. In POSIX, if your buffer is not flushed, there is no guarantee that you will get the latest version either.
Yeah, maybe I should just get rid of that "POSIX compatible" thing. I had that in mind because I expose my fs interface through the fuse library, and supposedly they cover POSIX.
> In POSIX if your buffer is not flushed there is no guarantee that you will get the latest version neither.
That's not true. If it's not flushed there's no guarantee that it will be durable (i.e. on stable storage, will survive a crash of the entire system). However, there is a guarantee that it will be consistent.
> After a write() to a regular file has successfully returned:
> Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write()
Even more specifically, and relevantly to the current point...
> If a read() of file data can be proven (by any means) to occur after a write() of the data, it must reflect that write(), even if the calls are made by different processes.
Many POSIX requirements related to what the database folks would call ACID are hard to meet in a distributed file system, but they are nonetheless requirements. This particular requirement implies that if a node fails but the system as a whole remains up, then the last write must be immediately available. It could be met, for example, by doing your own synchronous in-memory replication. However, if you rely on pinging the data through a backing store that only provides eventual consistency, then the requirement is not met.
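To make the quoted requirement concrete, here's a minimal sketch (using Python's `os` wrappers around the POSIX calls; the path is arbitrary) of the read-after-write visibility guarantee: once `write()` returns, a later `read()` through any file descriptor must see that data, flushed or not.

```python
import os
import tempfile

# Arbitrary scratch file, just for illustration.
path = os.path.join(tempfile.mkdtemp(), "f")

fd_w = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
os.write(fd_w, b"hello")           # write() returns successfully...

fd_r = os.open(path, os.O_RDONLY)  # ...so a later read() through any fd,
data = os.read(fd_r, 5)            # even in another process, must see it
assert data == b"hello"            # guaranteed even before any fsync()

os.close(fd_w)
os.close(fd_r)
```

Durability is a separate question: without `fsync()`, the data may not survive a whole-system crash, but as long as the system is up, visibility is guaranteed.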
The obvious comparison for standalone mode would be to a regular file system built on a ramdisk. The obvious comparison for network mode would have been Plain Old NFS with a ramdisk on the server. Both comparisons are missing. Also, this statement is flat-out untrue.
> IOZONE performs all of its operations on one file
Getting iozone to use multiple files and threads is trivial (-t option), and the results for that would be much more interesting than those for a single file/thread. Unfortunately, such benchmark errors are extremely common. I don't fault the author so much as his advisor, who seems to have a background in computer networking (hence the strong focus in this work on alternative network interfaces) but none in storage. My advice to the author would be to find a more appropriate advisor for this project.
1- This is just a preliminary evaluation. I just made the repository public 3 hours ago. I am planning many experiments, including the ones you mentioned.
2- As I said, more benchmarks are coming, including multiple files with iozone, and postmark. This is just to give viewers a sense.
P.S. I am happy with my supervisor :) Thanks for your recommendations though ;)
As you continue your testing, I strongly recommend that you use different numbers of threads/files and queue depths, as well as different file and I/O sizes. If you want to make scalability claims, you'll also need to test with multiple clients and servers. It's also good practice to show the exact iozone commands used, or (even better) fio command line and profiles.
More generally, I think it would be very helpful if you could be more explicit about how your system works. For example...
* What are your consistency/durability guarantees? I mean really, since they're clearly not POSIX.
* How do you detect and respond to faults?
* How would you describe the system in CAP terms?
* How do you reconstruct file system state from the backing-store object state after a node failure or cold restart?
* How do you decide which node should hold a file? How do you re-make that decision when servers are added or removed? Can they be?
These are the decisions that every distributed file system must make (and an explicit comparison to others sure wouldn't hurt). While it might seem like a lot, you did say this is for a master's thesis and it's stuff your advisor should already have required. It will certainly come up when you go to defend that thesis. If this truly advances the state of the art in some way, you should already know exactly how and be able to explain it.
Disclaimer: I work on GlusterFS. On my own, I've worked on two toy projects (CassFS and VoldFS) that explored ideas in how to build a distributed file system on top of another kind of data store. I might be able to help you, if I knew more about which issues you'd already addressed and which remain.
Yes, I plan to run experiments with different numbers of threads/files and many other factors. Right now I only have access to a cluster of 7 old machines. That's the best I can do, and for the Hadoop case I used all of the machines.
Sure thing, I will provide all of the commands that I used. I made this project public precisely to get this type of comment. I have not written up or explained anything yet, so there is A LOT to explain, like the questions you raised. I will try to answer them here as well:
Here is a rough sketch of the system: nodes use FUSE to provide a traditional file system view to applications. ZooKeeper is used for consensus, and it keeps track of which file (name) each node is holding. Swift (or any other persistent storage) is used as the backend. Nodes communicate with each other through TCP or ZERO networking for remote IO operations.
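To illustrate that sketch, here's a toy model (not BFS code; all names are made up) of the routing: a shared map, which is ZooKeeper's role here, records which node holds each file, and a request is served from the local in-memory copy or forwarded to the holder over the network.

```python
# Toy routing model: a shared file -> holder map (ZooKeeper's role),
# local in-memory files, and a forwarding hook for the TCP/ZERO hop.
file_holder = {"/a": "node1", "/b": "node2"}

def handle_read(local_node, path, local_files, forward):
    holder = file_holder[path]
    if holder == local_node:
        return local_files[path]       # served from the local copy
    return forward(holder, path)       # remote IO to the holding node

node1_files = {"/a": b"data-a"}

def fake_forward(node, path):          # stand-in for the network call
    return b"remote:" + path.encode()

assert handle_read("node1", "/a", node1_files, fake_forward) == b"data-a"
assert handle_read("node1", "/b", node1_files, fake_forward) == b"remote:/b"
```

The real system would keep `file_holder` in ZooKeeper so all nodes agree on it; the dict here just stands in for that.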
* When a file is flushed, my consistency/durability guarantees are whatever the backend storage provides. If Swift, then they're Swift's consistency/durability guarantees. While in memory (not flushed or closed), a file is consistent, because there is only one copy, at the host node, but not durable, because if that node dies the data is gone.
* If a node fails, there is a master (elected with ZooKeeper) which will assign the files that node was holding to other nodes (assuming that node had flushed its files to the backend). Failures in ZooKeeper and Swift are out of my project's scope.
* As soon as the system starts, it goes through a ZooKeeper master election, and then the master distributes the files in the backend to the live nodes.
* Right now it's just a naive algorithm (pick the most free node); however, it can/should be changed to a more advanced mechanism that considers load and probably many other factors as well.
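Here's an illustrative sketch (not BFS code; the class and method names are made up) of the consistency model described above: a single in-memory copy per file on its host node, with durability delegated to the backend only at flush/close time.

```python
class Backend:
    """Stand-in for Swift or any other configured object store."""
    def __init__(self):
        self.objects = {}

    def put(self, name, data):
        self.objects[name] = bytes(data)

class HostedFile:
    """One file, hosted on one node; its buffer is the only copy."""
    def __init__(self, name, backend):
        self.name, self.backend = name, backend
        self.buf = bytearray()
        self.durable = False

    def write(self, data):
        self.buf += data        # consistent: there is exactly one copy
        self.durable = False    # but a node crash here loses the data

    def flush(self):
        self.backend.put(self.name, self.buf)
        self.durable = True     # now exactly as durable as the backend

backend = Backend()
f = HostedFile("/a", backend)
f.write(b"hello")
assert not f.durable                       # pre-flush: memory only
f.flush()
assert backend.objects["/a"] == b"hello"   # post-flush: in the backend
```

This also shows why the POSIX visibility discussion above matters: between `write` and `flush`, reads routed to the host node see the latest data, but after the host fails, readers get only whatever the backend's own consistency model returns.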
I am very happy to answer your questions, and I really enjoyed answering them. I am sure this is the type of question I will get in my defence session as well, and that's why I posted this here. :)
Awesome, it would be great if other people could contribute to my project. There are many things left to do and tons of room for improvement. Maybe we can discuss more over email.
I wish I could :) but right now in our lab I don't have access to such equipment. Honestly, that was the plan, because using 10/40G networks would significantly reduce the delay of remote operations as well as increase their throughput.
Might I suggest a round of testing on Amazon or Rackspace? They both have 10Gb-equipped instance types with SSDs and everything, for a few bucks per cluster-hour. That shouldn't break even an academic budget. You might not be able to do your special ZERO networking in that environment, but you'd be able to explore other interesting corners of the performance space and your experiments would be reproducible by others.