There are many systems that take a native data structure in your favorite language and, using some sort of reflection, make an on-disk structure that resembles it. Python pickles and Java’s serialization system are infamous examples, and rkyv is a less alarming one.
I am quite strongly of the opinion that one should essentially never use these for anything that needs to work well at any scale. If you need an industrial strength on-disk format, start with a tool for defining on-disk formats, and map back to your language. This gives you far better safety, portability across languages, and often performance as well.
Depending on your needs, the right tool might be Parquet or Arrow or protobuf or Cap’n Proto or even JSON or XML or ASN.1. Note that there are zero programming languages in that list. The right choice is probably not C structs or pickles or some other language’s idea of pickles or even a really cool library that makes Rust do this.
(OMG I just discovered rkyv_dyn. boggle. Did someone really attempt to reproduce the security catastrophe that is Java deserialization in Rust? Hint: Java is also memory-safe, and that has not saved users of Java deserialization from all the extremely high severity security holes that have shown up over the years. You can shoot yourself in the foot just fine when you point a cannon at your foot, even if the cannon has no undefined behavior.)
Fully agreed. rkyv looks like something that is hyper optimizing for a very niche case, but doesn't actually admit that it is doing so. The use case here is transient data akin to swapping in-memory data to disk.
"However, while the former have external schemas and heavily restricted data types, rkyv allows all serialized types to be defined in code and can serialize a wide variety of types that the others cannot."
At first glance, it might sound like rkyv is better: after all, it has fewer restrictions, and external schemas are annoying. But it doesn't actually solve the schema issue by having a self-describing format like JSON or CBOR. You won't be able to use the data outside of Rust and you're probably tied to a specific Rust version.
> You won't be able to use the data outside of Rust and you're probably tied to a specific Rust version.
This seems false after reading the book, the doc, and a cursory reading of the source code.
It is definitely independent of the Rust version. The code makes use of repr(C) on structs (so field order follows the source code), and every field gets its own alignment (making it independent of the C ABI alignment). The format is indeed portable. It is also versioned.
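As a rough illustration of the repr(C) point (not rkyv's actual code), here is a minimal sketch with a hypothetical `Record` struct. With `#[repr(C)]` the fields are laid out in declaration order with padding determined by each field's alignment, so the offsets can be checked directly. The helper names `field_offsets` and `size_and_align` are made up for this example.

```rust
use std::mem::{align_of, size_of};

// Hypothetical record type for illustration; it is not taken from rkyv.
#[repr(C)]
pub struct Record {
    pub tag: u8,
    pub count: u32,
    pub id: u16,
}

/// Returns the byte offsets of (tag, count, id) inside a `Record`.
pub fn field_offsets() -> (usize, usize, usize) {
    let r = Record { tag: 0, count: 0, id: 0 };
    let base = &r as *const Record as usize;
    (
        &r.tag as *const u8 as usize - base,
        &r.count as *const u32 as usize - base,
        &r.id as *const u16 as usize - base,
    )
}

/// Returns (size, alignment) of `Record` in bytes.
pub fn size_and_align() -> (usize, usize) {
    (size_of::<Record>(), align_of::<Record>())
}
```

On mainstream targets where `u32` is 4-byte aligned, this gives offsets 0, 4 and 8, with a total size of 12 bytes. Without `#[repr(C)]` the compiler would be free to reorder the fields.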
The schema of the user structs is in Rust code. You can make this work across languages, but that's a lot of work and code to support. And this project appears to be in Rust for Rust.
On a side note, I find the code really easy to understand and follow. In my not so humble opinion, it is carefully crafted for performance while being elegant.
> Depending on your needs, the right tool might be Parquet or Arrow or protobuf or Cap’n Proto
I think parquet and arrow are great formats, but ultimately they have to solve a similar problem that rkyv solves: for any given type that they support, what does the bit pattern look like in serialized form and in deserialized form (and how do I convert between the two).
However, it is useful to point out that parquet/arrow on top of that solve many more problems needed to store data 'at scale' than rkyv (which is just a serialization framework after all): well defined data and file format, backward compatibility, bloom filters, run length encoding, compression, indexes, interoperability between languages, etc. etc.
> (OMG I just discovered rkyv_dyn. boggle. Did someone really attempt to reproduce the security catastrophe that is Java deserialization in Rust?
Trusting possibly malicious inputs is a universal problem.
Here is a simple example:
echo "rm -rf /" > cmd
sh cmd
And this problem is no different in rkyv than in rkyv_dyn or any other serialization format on the planet. The issue is trusting inputs. This is also called a man in the middle attack.
The solution is to add a cryptographic signature to detect tampering.
This is an unhelpful interpretation. With a decent memory-safe parser, it’s perfectly safe [1] to deserialize JSON or (most of) XML [0] or protobuf or Cap’n Proto or HTTP requests, etc. Or to query a database containing untrusted data. You need to be careful that you don’t introduce a vulnerability by doing something unwise with the deserialized result, but a good deserializer will safely produce a correctly typed output given any input, and the biggest risk is that the output is excessively large.
But tools like Pickle or Java deserialization or, most likely, rkyv_dyn will happily give you outputs that contain callables and that contain behavior, and the result is not safe to access. (In Python, it’s wildly unsafe to access, as merely reading a field of a Python object calls functions encoded by the class, and the class may be quite dynamic.)
[0] The world is full of infamously dangerous XML parsers. Don’t use them, especially if they’re written in C or C++ or they don’t promise that they will not access the network.
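To make the "correctly typed output given any input" idea concrete, here is a minimal sketch of a decoder for a made-up wire format (a 4-byte little-endian length followed by UTF-8 bytes). The format, the `DecodeError` type and the `MAX_LEN` cap are all assumptions for illustration. The point is only that every input yields either a plain `String` or an error, and the size cap addresses the "excessively large output" risk.

```rust
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    Truncated,
    TooLarge,
    InvalidUtf8,
}

// Hypothetical upper bound on the decoded length, chosen arbitrarily.
pub const MAX_LEN: usize = 1024;

/// Decodes a u32-LE length-prefixed UTF-8 string.
/// Returns an error instead of panicking on malformed input.
pub fn decode_string(input: &[u8]) -> Result<String, DecodeError> {
    let header = input.get(0..4).ok_or(DecodeError::Truncated)?;
    let mut len_bytes = [0u8; 4];
    len_bytes.copy_from_slice(header);
    let len = u32::from_le_bytes(len_bytes) as usize;

    // Reject oversized claims before looking at the body.
    if len > MAX_LEN {
        return Err(DecodeError::TooLarge);
    }

    let body = input.get(4..4 + len).ok_or(DecodeError::Truncated)?;
    String::from_utf8(body.to_vec()).map_err(|_| DecodeError::InvalidUtf8)
}
```

The result is pure data: nothing in the input can select which code runs when the caller later reads the string.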
> The solution is to add a cryptographic signature to detect tampering.
If you don’t have a deserializer that works on untrusted input, how do you verify signatures? Also, do you really think it’s okay to do “sh $cmd” just because you happen to have verified a signature?
> This is also called a man in the middle attack.
I suggest looking up what a man in the middle attack is.
Ah, I see the confusion. rkyv_dyn doesn't serialize code. Rust is compiled to machine code. It would be quite a feat to accomplish.
I was a bit confused when you compared it to Python pickle and assumed you were talking about general input validation somehow.
I agree that pickle and similar are profoundly surprising and error prone. I struggle to find any reasonable reason one would want that.
As for the man-in-the-middle attack, I meant that if somebody intercepts the serialized form, they can mutate it. And without a cryptographic signature, you wouldn't know.
> rkyv_dyn doesn't serialize code. Rust is compiled to machine code.
Java is compiled to bytecode, and Obj-C is compiled to machine code. Yet both Android and iOS have had repeated severe vulnerabilities related to deserializing an object that contains a subobject of an unexpected type that pulls code along with it. It seems to me that rkyv_dyn has exactly the same underlying issue.
Sure, Rust is “safe”, and if all the unsafe code is sufficiently careful, it ought to be impossible to get the type of corruption that results in direct code execution, memory writes, etc. But systems can be fully compromised by semantic errors, too.
If I’m designing a system that takes untrusted input and produces an object of type Thing, I want Thing to be pure data. Once you start allowing an open set of methods on Thing or its subobjects, you have lost control of your own control flow. So doing:
thing.a.func()
may call a function that wasn’t even written at the time you wrote that line of code or even a function that is only present in some but not all programs that execute that line of code.
Exploiting this is considerably harder than exploiting pickle, but considerably harder is not the same as impossible.
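A minimal sketch of the distinction being drawn, using made-up types (`Shape`, `Action`, `Greet`, `Wipe`) and a hand-rolled registry. This is not how rkyv_dyn is implemented; it only contrasts a closed set of pure-data variants with an open set of trait impls selected by a tag in the input.

```rust
use std::collections::HashMap;

/// Closed set: every variant is pure data the caller can enumerate.
#[derive(Debug, PartialEq)]
pub enum Shape {
    Circle { r: u32 },
    Square { side: u32 },
}

pub fn decode_closed(tag: u8, arg: u32) -> Option<Shape> {
    match tag {
        0 => Some(Shape::Circle { r: arg }),
        1 => Some(Shape::Square { side: arg }),
        _ => None,
    }
}

/// Open set: any linked crate could add another impl of this trait.
pub trait Action {
    fn func(&self) -> String;
}

struct Greet;
impl Action for Greet {
    fn func(&self) -> String {
        "hello".to_string()
    }
}

struct Wipe;
impl Action for Wipe {
    fn func(&self) -> String {
        "pretend side effect".to_string()
    }
}

type Ctor = fn() -> Box<dyn Action>;

fn make_greet() -> Box<dyn Action> {
    Box::new(Greet)
}

fn make_wipe() -> Box<dyn Action> {
    Box::new(Wipe)
}

/// Hypothetical registry mapping a tag from the input to a constructor.
pub fn registry() -> HashMap<u8, Ctor> {
    let mut m: HashMap<u8, Ctor> = HashMap::new();
    m.insert(0, make_greet as Ctor);
    m.insert(1, make_wipe as Ctor);
    m
}

/// The input's tag, not the caller, decides which `func` will run later.
pub fn decode_open(reg: &HashMap<u8, Ctor>, tag: u8) -> Option<Box<dyn Action>> {
    reg.get(&tag).map(|ctor| ctor())
}
```

With `decode_closed` the caller matches on data it already knows about. With `decode_open` a call such as `decode_open(&reg, tag).unwrap().func()` runs whichever impl the tag selected, including impls registered after the calling code was written.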
You know very well what I meant by "compile to machine code". But you decided to interpret it in a combative way. Even though you seem very knowledgeable, this makes me want to stop discussing with you.
Ultimately you should read the code of rkyv_dyn to understand what it does instead of making random claims.
It will be faster for you to read the code than for me to attempt explaining how it works. Especially since you will most likely choose the least charitable interpretation of everything I say. There is very little code, it won't take long.
> You know very well what I meant by "compile to machine code".
I really don't. I think you mean that Rust compiles to machine code and neither loads executable code at runtime nor contains a JIT, so you can't possibly open a file and deserialize it and end up with code or particularly code-like things from that file being executed in your process.
My point is that there's an open-ended global registry of objects that implement a given trait, and it's possible (I think) to deserialize and get an unexpected type out, and calling its methods may run code that was not expected by whoever wrote the calling code. And the set of impls and thus the set of actual methods may expand by the mere fact of linking something else into the project.
This probably won't blow up quite as badly as NSCoding does in ObjC because Rust is (except when unsafe is used) memory-safe, so use-after-free just from deserializing is pretty unlikely. But I would still never use a mechanism like this if there was any chance of it consuming potentially malicious input.
> even a really cool library that makes Rust do this.
The first library that comes to mind when I think of this is `serde` with `#[derive(Serialize, Deserialize)]`, but that produces output in a persistence format, which, as you describe, is preferable to the former case. I usually use it with JSON.
Maybe a little bit. But serde works with JSON (among other formats), and you can use it to read and write JSON that interoperates with other libraries and languages just fine. Kind of like how SQLAlchemy looks kind of like you’re writing normal Python code, but it interoperates with SQL.
I know "serde" is a take on "codec" but *rewrite* was right there! Also, as long as I'm whinging about naming? 'print' and 'parse' are five letter p words in a bidirectional relationship. Oh! Oh! push, peek, poke, ... pull! It even makes more sense than pop! And it's four letters!
But if you use complicated serialisation formats you can't mmap a file into memory and use it directly. Which is quite convenient if you don't want to parse the whole file and allocate it to memory because it's too large compared to the amount of memory or time you have.
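A minimal sketch of why fixed layouts make this convenient, assuming a made-up 12-byte record (a `u32` id followed by a `u64` value, both little-endian). The byte slice stands in for a memory-mapped file. Actually mapping a file needs an OS call or a crate such as memmap2, which is left out here.

```rust
/// Hypothetical fixed-size record: u32 id + u64 value, little-endian.
pub const RECORD_SIZE: usize = 12;

/// Appends one record to `buf` in the fixed layout.
pub fn write_record(buf: &mut Vec<u8>, id: u32, value: u64) {
    buf.extend_from_slice(&id.to_le_bytes());
    buf.extend_from_slice(&value.to_le_bytes());
}

/// Reads record `index` straight out of `bytes`.
/// Only the 12 bytes of that record are touched; nothing else is parsed.
pub fn read_record(bytes: &[u8], index: usize) -> Option<(u32, u64)> {
    let start = index.checked_mul(RECORD_SIZE)?;
    let end = start.checked_add(RECORD_SIZE)?;
    let rec = bytes.get(start..end)?;

    let mut id = [0u8; 4];
    id.copy_from_slice(&rec[0..4]);
    let mut value = [0u8; 8];
    value.copy_from_slice(&rec[4..12]);

    Some((u32::from_le_bytes(id), u64::from_le_bytes(value)))
}
```

Because every record sits at `index * RECORD_SIZE`, looking up one record costs the same regardless of how large the file is. Variable-length formats typically need an index or a scan to get the same effect.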
Actually, it's you who is giving that impression with an ultra vague "doesn't solve the problems described".
The only problem in the blog post is efficient coding of optional fields, and all they did was introduce a bitmap. From that perspective, JSON and XML solve the optional fields problem to perfection, since an absent field costs exactly nothing.
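For readers who have not seen the technique, here is a minimal sketch of a presence bitmap for optional fields. It is an assumption-laden toy, not the blog post's actual encoding: up to eight optional `u32` fields, one bitmap byte up front, then only the values that are present.

```rust
/// Encodes up to 8 optional u32 fields as: one bitmap byte,
/// followed by the present values in order (little-endian).
pub fn encode(fields: &[Option<u32>]) -> Vec<u8> {
    assert!(fields.len() <= 8, "this sketch uses a single bitmap byte");
    let mut bitmap = 0u8;
    let mut body = Vec::new();
    for (i, field) in fields.iter().enumerate() {
        if let Some(v) = field {
            bitmap |= 1u8 << i; // bit i set means field i is present
            body.extend_from_slice(&v.to_le_bytes());
        }
    }
    let mut out = vec![bitmap];
    out.extend_from_slice(&body);
    out
}

/// Decodes `field_count` optional fields; returns None on malformed input.
pub fn decode(bytes: &[u8], field_count: usize) -> Option<Vec<Option<u32>>> {
    if field_count > 8 {
        return None;
    }
    let (&bitmap, mut rest) = bytes.split_first()?;
    let mut out = Vec::with_capacity(field_count);
    for i in 0..field_count {
        if bitmap & (1u8 << i) != 0 {
            if rest.len() < 4 {
                return None;
            }
            let (head, tail) = rest.split_at(4);
            let mut value = [0u8; 4];
            value.copy_from_slice(head);
            out.push(Some(u32::from_le_bytes(value)));
            rest = tail;
        } else {
            out.push(None);
        }
    }
    Some(out)
}
```

An absent field costs one bit here, versus zero bytes in JSON (where the key is simply omitted) but several bytes of key text for every field that is present.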
I guess you missed the part where the size of the data stored on disk and efficient deserialization are also critically important performance characteristics that neither JSON nor XML have?
Capnproto doesn’t support transform on serialize - the optional fields still take up disk space unless you use the packed representation which has some performance drawbacks. Also the generated capnproto rust code is quite heavy on compile times which is probably some consideration that’s important for compiling queries.
Even completely ignoring the issues of language-centric vs data-format-centric serializers, your list is missing two very notable entries from my list: Arrow and Parquet. Both of them go to quite some lengths to handle optional/missing data efficiently. (I haven’t personally used either one for large data sets, but I have played with them. I think you’ll find that Arrow IPC / Feather (why can’t they just pick one name?) has excellent performance for the actual serialization and deserialization part as long as you do several rows at a time, but Parquet might win for table scans depending on the underlying storage medium.) Both of them are, quite specifically, the result of years of research into storing longish arrays of wide structures with potentially complex shapes and lots of missing data efficiently. (Logical arrays. They’re really struct-of-arrays formats, and I personally have a use case I kind of want to use Feather for except that Feather is not well tuned for emitting one row at a time.)
> Protobufs definitely doesn’t solve the problems described. Capnproto may solve it but I’m not 100% sure. JSON/XML/ASN.1 definitely don’t.
I'm not sure you are serious. What open problem do you have in mind? Support for persisting and deserializing optional fields? Mapping across data types? I mean, some JSON deserializers support deserializing sparse objects even to dictionaries. In .NET you can even deserialize random JSON objects to a dynamic type.
Can you be a little more specific about your assertion?
The space overhead and the overhead of serialization/deserialization. Rkyv is zero overhead - it’s random access without needing to deserialize and can even be memory mapped.
The whole “zero overhead” thing is IMO a red herring. I care about a few things: stability across versions and languages, space efficiency (sometimes) and performance. I do not care about “overhead” — performance trumps overhead every time.
Your deserializer is probably running on a CPU, and that CPU probably has a very fast L1 cache and might be targeted by a compiler that can do scalar replacement of aggregates and such. A non-zero-overhead deserializer can run very quickly and result in the output being streamed efficiently from its source and ending up hot in L1 in a useful format. A zero-overhead deserializer might do messy reads in a bad order without streaming hints and run much slower.
And then you get to very, very large records, as in the OP, where getting a good on-disk layout may require thought. And, frequently, the right layout isn’t even array-of-structs, which is why there are so many tools designed to query column stores like Parquet efficiently.
Serdes time can be significant. There are use cases for the zero-copy formats even though they use more space. Likewise, bit-packed ASN.1 is often slower than byte-aligned.
If you care about space, you're almost certainly going to compress your output (unless, like, you're literally storing random noise) and so you'll necessarily have overhead from that.
Unless the reason you care about space is because it's some sort of wire protocol for a slow network (like LoRaWAN or Iridium packets or a binary UART protocol), where compression probably doesn't make sense because the compression overhead is too large. But even here, just defining the data layout makes sense, I think.
This could take the form of a C struct with __attribute__((packed)) but that is fragile if you care about more platforms than one. (I generally don't, so that works for me!)
I have zero doubt that you’re on some ‘no true Scotsman’-style “you’re not doing Real Development if you are using these technologies to solve these problems” thing. Let’s just drop that. There are myriad ‘real man webscale development’ scenarios where these are more than acceptable.
Pretty sure protobuf used a header to track field presence within a message, similarly to what this article does. That does have its own overhead you could avoid if you knew all fields were present, but that's not the assumption it makes.
Sure, if your structure doesn't contain any pointers and you only ever want to support one endianness and you trust your compiler to fix the machine layout of the struct forever.