Feldera is an incremental query engine; you can think of it as a specialized database. If you have a set of questions you can express in SQL, it will ingest all your data and build many sophisticated indexes for it (these get stored on disk). Whenever new data arrives, Feldera can instantly update the answers to all your questions. This is most useful when the data is much larger than what fits in memory, because then the questions are especially expensive to answer with a regular (batch) database.
> Depending on your needs, the right tool might be Parquet or Arrow or protobuf or Cap’n Proto
I think parquet and arrow are great formats, but ultimately they have to solve a problem similar to the one rkyv solves: for any given type they support, what does the bit pattern look like in serialized form and in deserialized form (and how do you convert between the two)?
However, it is useful to point out that, on top of that, parquet/arrow solve many more problems needed to store data 'at scale' than rkyv (which is just a serialization framework, after all): well-defined data and file formats, backward compatibility, bloom filters, run-length encoding, compression, indexes, interoperability between languages, and so on.
> it sounds like helping customers with databases full of red flags is their bread and butter
Yes that captures it well. Feldera is an incremental query engine. Loosely speaking: it computes answers to any of your SQL queries by doing work proportional to the incoming changes for your data (rather than the entire state of your database tables).
If you have queries that take hours to compute in a traditional engine like Spark, PostgreSQL, or Snowflake (because of their complexity or data size), and you want to always have the most up-to-date answer, Feldera will give you that answer 'instantly' whenever your data changes (after you've back-filled your existing dataset into it).
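The "work proportional to the changes" idea can be illustrated with a toy incremental aggregate. This is a loose sketch of the general technique (incremental view maintenance), not Feldera's actual implementation; the class and the weight convention are made up for the demo.

```python
# Toy incremental view maintenance: keep the result of
#   SELECT key, SUM(val) ... GROUP BY key
# up to date from a stream of row-level deltas, touching only the
# changed keys instead of rescanning the whole table.
from collections import defaultdict

class IncrementalSum:
    def __init__(self):
        self.totals = defaultdict(int)

    def apply_delta(self, key, val, weight):
        # weight = +1 for an inserted row, -1 for a deleted row
        self.totals[key] += weight * val
        if self.totals[key] == 0:
            del self.totals[key]  # drop empty groups

view = IncrementalSum()
view.apply_delta("eu", 10, +1)   # insert (eu, 10)
view.apply_delta("eu", 5, +1)    # insert (eu, 5)
view.apply_delta("eu", 10, -1)   # delete (eu, 10)
print(dict(view.totals))         # {'eu': 5}
```

Each update costs O(1) per changed row, regardless of how many rows already sit in the group.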
I wasn't sure about writing the article in the first place because of that, but I figured it might be interesting anyway, because I was quite happy with how simple the optimization turned out to be once it was all done (when I started out, I wasn't sure whether it would be hard given how our code is structured, the libraries we use, etc.). I originally posted this in the Rust community, and it seems people enjoyed the post.
I think it's a good article and I enjoyed learning a little more about Rust, but it would have been nice to point out that this is a common technique for tuple storage in databases, for those not familiar.
It comes off as a novel solution rather than connecting it to a long tradition of DB design. I believe PG, for instance, has used a null bitmap since the beginning, 40 years ago.
As to your hard disagree, I guess it depends... While this particular user is on the higher end (in terms of columns), it's not our only user with huge column counts. We see tables with 100+ columns on a fairly regular basis, especially when dealing with larger enterprises.
Can you clarify which knowledge domains those enterprises fall under with examples of what problems they were trying to solve?
If it's not obvious, I agree with the hard disagree. Every time I see a table with that many columns, I have a hard time believing there isn't some normalization possible.
Schemas that stubbornly stick to high-level concepts and refuse to dig into the subfeatures of the data often come from inexperienced devs, or from dysfunctional/disorganized places too inflexible to care much. This isn't really negotiable: there will be issues with such a schema if it's meant to scale up, be migrated, or be maintained long term.
Normalization is possible but not practical in a lot of cases: nearly every “legacy” database I’ve seen has at least one table that just accumulates columns because that was the quickest way to ship something.
Also, normalization solves a problem that’s present in OLTP applications: OLAP/Big Data applications generally have problems that are solved by denormalization.
We have many large enterprises from wildly different domains using Feldera, and from what I can tell there is no correlation between the domain and the number of columns.
As fiddlerwoaroof says, it seems to be more a function of how mature/big the company is and how much time it has had to 'accumulate things' in its data model.
And there might be very good reasons they designed things the way they did; it's very hard to question that without being a domain expert in their field. I wouldn't dare :).
> I can tell there is no correlation between the domain and the amount of columns.
This is unbelievable. In purely architectural terms that would require your database design to be an amorphous big ball of everything, with no discernible design or modelling involved. This is completely unrealistic. Are queries done at random?
In practical terms, your assertion is irrelevant. Look at the sparse columns. Figure out which ones are empty for most rows.
Then move half of the columns to a new table and keep the other half in the original table. Congratulations, you just cut down your column count by half, and sped up your queries.
Even better: discover how your data is being used. Look at queries and check what fields are used in each case. Odds are, that's your table right there.
Let's face it. There is absolutely no technical or architectural reason to reach this point. This problem is really not about structs.
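The split described a couple of comments up can be sketched concretely. This uses sqlite3, and the table/column names are made up for the illustration; the point is just moving a sparsely-populated column into a side table keyed by the original primary key:

```python
# Move a sparse column out of a wide table into a side table that only
# stores rows where the column actually has a value.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INT, rare_note TEXT);
    INSERT INTO orders VALUES (1, 10, NULL), (2, 20, 'gift wrap'), (3, 30, NULL);

    -- side table: one row per order that has a note, nothing for the rest
    CREATE TABLE order_notes (
        order_id INTEGER PRIMARY KEY REFERENCES orders(id),
        rare_note TEXT NOT NULL
    );
    INSERT INTO order_notes
        SELECT id, rare_note FROM orders WHERE rare_note IS NOT NULL;
""")
# In a real migration you would now drop orders.rare_note
# (ALTER TABLE orders DROP COLUMN ..., supported in recent SQLite).

# The original wide shape is still recoverable with a LEFT JOIN when needed:
rows = db.execute("""
    SELECT o.id, o.amount, n.rare_note
    FROM orders o LEFT JOIN order_notes n ON n.order_id = o.id
    ORDER BY o.id
""").fetchall()
print(rows)  # [(1, 10, None), (2, 20, 'gift wrap'), (3, 30, None)]
```

Queries that don't need the sparse column never touch the side table at all.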
The Feldera folks speak from lived experience when they say 100+ column tables are common in their customer base, and from lived experience when they say there's no correlation with domain in that customer base.
Feldera provides a service. They did not design these schemas. Their customers did, and probably over such long time periods that those schemas cannot be referred to as designed anymore -- they just happened.
IIUC Feldera works in OLAP primarily, where I have no trouble believing these schemas are common. At my $JOB they are, because it works well for the type of data we process. Some OLAP DBs might not even support JOINs.
Feldera folks are simply reporting on their experience, and people are saying they're... wrong?
I remember the first time I encountered this thing called TPC-H back when I was a student. I thought "wow surely SQL can't get more complicated than that".
Turns out I was very wrong about that. So it's all about perspective.
> Normalization is possible but not practical in a lot of cases: nearly every “legacy” database I’ve seen has at least one table that just accumulates columns because that was the quickest way to ship something.
Strong disagree. I'll explain.
Your argument would support the idea of adding a few columns to a table to get to a short time to market. That's ok.
Your comment does not come close to justifying why you would keep the columns in. Not in the slightest.
Tables with many columns create all sorts of problems and inefficiencies. Over-fetching is a problem all on its own. Even the code gets brittle, where each and every tweak risks being a major regression.
Creating a new table is not hard. Add a foreign key, add the columns, do a standard parallel write migration. Done. How on earth is this not practical?
I’m not justifying the design but splitting a table with several billion rows is not a trivial task, especially when ORMs and such are involved. Additionally, it’s easier to get work scheduled to ship a feature than it is to convince the relevant players to complete the swing.
> I’m not justifying the design but splitting a table with several billion rows is not a trivial task, especially when ORMs and such are involved.
I don't agree. Let me walk you through the process.
- create the new table
- follow a basic parallel-writes strategy:
  - update your database consumers to write to the new table without reading from it
  - run a batch job to populate the new table with data from the old table
  - update your database consumers to read from the new table while writing to both old and new tables

From this point onward, just pick a convenient moment to stop writing to the old table and call the migration done. Then do the post-migration cleanup tasks.
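The phases above can be sketched end to end. A minimal demo using sqlite3; the table names and single-column schema are assumptions, and in production the phases would be rolled out gradually across consumers rather than in one script:

```python
# Parallel-writes migration, compressed into one script:
#   phase 1: dual-write (new table written, not yet read)
#   phase 2: batch backfill of pre-existing rows
#   phase 3: reads switch to the new table
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE old_t (id INTEGER PRIMARY KEY, v TEXT);
    CREATE TABLE new_t (id INTEGER PRIMARY KEY, v TEXT);
    INSERT INTO old_t VALUES (1, 'pre-existing');
""")

def write(row_id, v):
    # Phase 1 onward: every consumer writes to both tables
    db.execute("INSERT OR REPLACE INTO old_t VALUES (?, ?)", (row_id, v))
    db.execute("INSERT OR REPLACE INTO new_t VALUES (?, ?)", (row_id, v))

write(2, "written during migration")

# Phase 2: backfill whatever the dual-writes haven't covered yet
db.execute("""
    INSERT INTO new_t SELECT * FROM old_t
    WHERE id NOT IN (SELECT id FROM new_t)
""")

# Phase 3: reads now come from the new table; old_t can be retired later
rows = db.execute("SELECT id, v FROM new_t ORDER BY id").fetchall()
print(rows)  # [(1, 'pre-existing'), (2, 'written during migration')]
```

Note the ordering matters: dual-writes must start *before* the backfill runs, so no row written during the migration is missed.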
> Additionally, it’s easier to get work scheduled to ship a feature than it is to convince the relevant players to complete the swing.
The ease of piling up technical debt is not a justification for keeping broken systems and designs. It's only OK to make a mess to deliver things because you're expected to clean up after yourself afterwards.
There are sometimes reasons this is harder in practice. For example, say the business or even third parties have direct access to this DB, with hundreds of separate apps/services relying on it (also an anti-pattern, of course, but not uncommon); that makes changing the DB significantly harder.
Mistakes made early on and not corrected can snowball and lead to this kind of mess, which is very hard to back out of.
Fine, but you still need to read in those 100+ fields. So now you have to contend with 20+ joins just to pull in one record. Not more practical than a single SELECT, in my opinion.
You don't need to join what you don't actually need. You also need to be careful writing your queries, not just the schema. The most common ones should be wrapped in views or functions to avoid the problem of everyone rolling their own later.
Performance generally isn't an issue for an arbitrary number of joins as long as your indices are set up correctly.
If you really do need a bulk read like that I think you want json columns, or to just go all in with a nosql database. Even then, the above regarding indexing is still true.
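The "wrap the common queries in views" point can be made concrete. A sketch with sqlite3, with made-up table/view names; the idea is that callers query the view and never hand-roll the join, while an index on the foreign key keeps it cheap:

```python
# Wrap a common join in a view so every consumer gets the same,
# already-tuned query instead of rolling their own.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emails (user_id INTEGER REFERENCES users(id), addr TEXT);
    INSERT INTO users  VALUES (1, 'ada');
    INSERT INTO emails VALUES (1, 'ada@example.com');

    -- index on the join key, so the view stays fast as data grows
    CREATE INDEX idx_emails_user ON emails(user_id);

    -- the view hides the join from callers
    CREATE VIEW user_contacts AS
        SELECT u.id, u.name, e.addr
        FROM users u JOIN emails e ON e.user_id = u.id;
""")
print(db.execute("SELECT * FROM user_contacts").fetchall())
# [(1, 'ada', 'ada@example.com')]
```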
I think you believe the average developer, especially on enterprise software where you see this sort of shit, is far more competent or ambitious than they actually are. Many would be horrified to see the number of monkeys banging out nasty DDL in Hibernate or whatever C# uses that have no idea what "normal forms" or "relational algebra" are and are actively resistant to even attempting to learn.
Security models from SaaS companies based on handing out a bunch of random bytes/numbers with coarse-grained permissions, valid for a very long time, were already a bad idea. With agents, secrets/tokens really need to be minted with time-limited, scope-limited, OpenID/smart-contract-based trust relationships, so they will fare much better in this new world. Unfortunately, this is still a struggle for most major vendors (e.g., GitHub's gh CLI still doesn't let you use GitHub Apps out of the box).
> It's pretty clear that the security models that were designed into operating systems never truly considered networked systems
Andrew Tanenbaum developed the Amoeba operating system with those requirements in mind almost 40 years ago, and plenty of others in the systems research community proposed similar systems. It's not that we don't know how to do it; the OSes that became mainstream just didn't want to / need to / consider those requirements necessary / <insert any other potential reason I forgot>.
Yes, Tanenbaum was right. But it is a hard sell; even today, people just don't seem to get it.
Bluntly: if it isn't secure and correct, it shouldn't be used. But companies seem to prefer insecure, incorrect, but fast software, because they are in competition with other parties and the ones that want to do things right get killed in the market.
Developers will militate against anything they perceive to make their life difficult, e.g. anything that stops them blindly running `npm install` and executing arbitrary code off the internet.
Well, yeah. We had to fix some LLM that broke things at a client; we asked why they didn't sandbox it, and the devs said they had tried nsjail, couldn't get their software to work with it, gave up, and just let it rip without any constraints because the project had to go live.