Snowflake and BigQuery will bite you in the ass later on. You can do 80% of the things you'll need, which is great for newbie stuff or for sales presentations. But once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
You will have to develop some kind of data lake to store unstructured data anyway, so you'll end up with both a Snowflake data warehouse and a data lake. Why not just go with the data lake first, then?
Databricks/Spark are just good platforms for doing something with the structured data in your lake. With the recent additions to its execution engine and Delta (strange naming tbh), it will be pretty much the same as Snowflake for you.
BigQuery/Snowflake can process Parquet and multiple other formats in Object Storage. You can use them more "freely" if you keep your raw data in open formats.
Need something more complicated than what can be done in BigQuery/Snowflake (that remaining 20%, though I'd say 10%)? Export the dataset to CSV/Parquet/Avro/ORC/whatever and process it with anything, including Dataproc/HDInsight/EMR or even Databricks. That's actually a common pattern.
While both BQ and Snowflake are adopting lake features, they still only support Parquet and other file formats for loading, not for querying.
Can't do a simple

    select * from s3://file.parquet

which you can do in Spark. Having to load it into the data warehouse means you store your data twice, and it is stupidly annoying.
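For reference, the direct-path form in Spark SQL looks roughly like this (the bucket and path are made-up placeholders):

```sql
-- Spark SQL: query a Parquet file in S3 in place, with no load step.
-- "s3a://my-bucket/events/file.parquet" is a hypothetical path.
SELECT *
FROM parquet.`s3a://my-bucket/events/file.parquet`;
```

The `format.`path`` syntax works for other formats too (e.g. `json.`...``, `orc.`...``), which is what makes ad hoc exploration of a lake so cheap in Spark.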
Many times the data doesn't even resemble anything tabular before I structure it in Python scripts. Why would I load it into a warehouse only to pull it down, do some Python processing, and load it back? That makes the data travel from DWH storage to a data lake, then to my compute cluster, and then take the same cumbersome road trip back. Pretty wasteful. Spark at least lets me schedule a Python function across a cluster while copying only from the lake to my compute node and back.
Data warehouses like BQ and Snowflake are great for data scientists after a bunch of engineers slice and dice raw data into clean tables. For anyone working with not-yet-structured data, the data lake wins hands down.
Not quite -- Snowflake allows you to query data directly from S3 without having to ingest it. You can either query the S3 file directly [1], or define an external table on top of it [2]. If the data is JSON, Avro, XML, or some other "semi-structured" format, then Snowflake's "variant" data type makes this type of data very easy to work with. For raw text or log lines, you can still load to Snowflake and process using regex or UDFs. For unstructured or "complex" data like images and PDFs, then you may need to reach for a different compute environment.
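Roughly, the two options look like this (the stage, format, and table names here are made up; the exact options are in the Snowflake docs):

```sql
-- Option 1: query staged JSON files directly, no ingestion.
-- @my_stage, the prefix, and my_json_format are hypothetical names.
SELECT $1:user_id::string AS user_id, $1:payload AS payload
FROM @my_stage/events/ (FILE_FORMAT => 'my_json_format');

-- Option 2: define an external table over the same prefix.
-- Rows land in the VARIANT "value" column.
CREATE EXTERNAL TABLE events_ext
  LOCATION = @my_stage/events/
  FILE_FORMAT = (TYPE = JSON);

SELECT value:user_id::string AS user_id FROM events_ext;
```

In both cases the semi-structured content arrives as a variant, so you can drill into it with the `:` path syntax and cast fields as needed.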
Yeah, I thought Snowflake external tables would do this, but that's not the case.
An external table in Snowflake only lets you ingest data from S3 into their storage (which is also S3 behind the scenes).
Perhaps something has changed since the last time I tried, but when I tried, my conclusion was that "external tables" in Snowflake are not what you think they are.
Also, I have not seen an example of "select * from s3://file.json" in the links you provided.
> Can't do a simple select * from s3://file.parquet
That's not really true; Snowflake can query directly from S3 just fine, even from other clouds. You just need to set up the credentials (or supply them in the query, though that's usually not a good idea).
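A sketch of that credentials setup (the integration name, role ARN, and bucket are all placeholder values):

```sql
-- One-time setup: a storage integration holding the credentials,
-- plus a stage pointing at the bucket. All names are hypothetical.
CREATE STORAGE INTEGRATION my_s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-snowflake-role'
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/');

CREATE STAGE my_stage
  URL = 's3://my-bucket/data/'
  STORAGE_INTEGRATION = my_s3_int;

-- After that, files under the stage can be queried without loading:
SELECT $1 FROM @my_stage/file.parquet (FILE_FORMAT => 'my_parquet_format');
```

The integration route keeps keys out of SQL text, which is the usual reason not to inline credentials in the query itself.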
> While both BQ and Snowflake are adopting lake features, they still only support parquet and other file formats for loading, not for querying.
This is not true. Snowflake lets you create an external stage (a pointer to S3) and then query any prefix you want, as long as you provide the correct file format arguments or types.
select * from @somestage/someprefix (file_format => ....)
edit: hadn't refreshed the page and i can see others have already responded to this point.
In BQ you can query data as an external table with Hive partitioning. No need for duplication, and extremely useful for consuming "landing zones" in a data lake.
Pretty similar to defining a Hive table and then using any other engine to process it.
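A minimal sketch of that pattern, assuming the BigQuery external-table DDL (the dataset, bucket, and the `dt` partition column are made-up examples):

```sql
-- BigQuery: external table over Hive-partitioned Parquet in GCS.
-- my_dataset, the bucket, and the dt partition column are hypothetical.
CREATE EXTERNAL TABLE my_dataset.landing_events
WITH PARTITION COLUMNS
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/landing/events/*'],
  hive_partition_uri_prefix = 'gs://my-bucket/landing/events'
);

-- Partition columns inferred from the path (e.g. .../dt=2021-01-01/...)
-- can then prune which files get scanned:
SELECT * FROM my_dataset.landing_events WHERE dt = '2021-01-01';
```

Because the files stay in GCS, any other engine that understands the same Hive layout can process the same data without a copy.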
PS: BigQuery Omni (now in beta) will support object storage from other cloud providers.
>Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
Snowflake has a Spark connector too, so I don't know what the difference would be between writing a Spark job against Delta Lake vs. Snowflake.
> 80% of the things you will need - which is great for some newbie stuff or for sales presentations.
This is obviously wrong.
> You will have to develop some kind of data lake to store unstructured data anyway. You will end up with a Snowflake data warehouse and a data lake. Why not just go with data lake first then.
We store unstructured data in Snowflake. I don't understand why you'd need a data lake on top of it.
EXACTLY. You absolutely can store unstructured and semi-structured data in Snowflake. I find it baffling, and at this point a bit irritating, that there is this community of people insisting that it is not allowed for...some unspecified reason.
Why would I store, e.g., a bunch of HTML files as string columns in Snowflake, only to pull them down, process them in Python, and load them back into some other string table?
Because it actually costs the same, and if you process them in Snowflake using SQL or UDFs, you'll get your results in seconds and you won't have to manage any of the underlying infrastructure.
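For the HTML-as-strings case specifically, a lot of light processing can stay in SQL. A hedged sketch (the table, column, and the title-extraction regex are all illustrative):

```sql
-- Hypothetical layout: one row per document, raw HTML in a string column.
CREATE TABLE raw_html (path STRING, doc STRING);

-- Pull the <title> contents out with a regex, entirely inside the
-- warehouse -- no round trip through a Python script.
SELECT path,
       REGEXP_SUBSTR(doc, '<title>([^<]*)</title>', 1, 1, 'i', 1) AS title
FROM raw_html;
```

Anything that's really parsing (not pattern matching) would still push you toward a UDF or an external compute step, which is where the disagreement in this thread actually lives.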
>Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
Snowflake supports enough SQL constructs to allow for very complicated queries. If that doesn't suit your needs, there are stored procedures and custom JavaScript UDFs you can write. That covers probably 99+% of the use cases at most companies, and the rest can usually be done somewhere else on pre-aggregated data.
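For anyone who hasn't seen one, a JavaScript UDF in Snowflake looks something like this (the function name and logic are just an example):

```sql
-- A trivial JavaScript UDF. Note that Snowflake exposes SQL argument
-- names to the JS body in uppercase (s -> S).
CREATE FUNCTION word_count(s STRING)
RETURNS FLOAT
LANGUAGE JAVASCRIPT
AS $$
  return S.trim().split(/\s+/).length;
$$;

-- Usable anywhere a built-in function is:
SELECT word_count('snowflake javascript udf example');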
1. Why tf would JavaScript be picked as the UDF language when Python dominates the data world?
2. People usually load complex Python libraries for data processing. I wonder whether Snowflake UDFs would support that or just allow the standard library.
1. Why not? It's not like you're getting external libraries either way.
2. If you want complex libraries, you can use external functions (AWS Lambdas, etc.) in Snowflake.
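The external-function route looks roughly like this (the integration name, role ARN, and API Gateway URL are all placeholders; the Lambda itself lives outside Snowflake):

```sql
-- One-time setup: an API integration pointing at API Gateway.
CREATE API INTEGRATION my_api_int
  API_PROVIDER = aws_api_gateway
  API_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-api-role'
  API_ALLOWED_PREFIXES = ('https://abc123.execute-api.us-east-1.amazonaws.com/prod')
  ENABLED = TRUE;

-- The external function: each call ships rows to the Lambda behind
-- the endpoint and gets results back as a variant.
CREATE EXTERNAL FUNCTION score_text(s STRING)
  RETURNS VARIANT
  API_INTEGRATION = my_api_int
  AS 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/score';

SELECT score_text('some document text');
```

So the heavy Python-with-libraries work runs in the Lambda, while the SQL side just sees an ordinary function call, at the cost of per-batch network latency.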