Snowflake and BigQuery will bite you in the ass later on. You can do 80% of the things you'll need, which is great for newbie stuff or for sales presentations. But once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
You will have to develop some kind of data lake to store unstructured data anyway, so you'll end up with both a Snowflake data warehouse and a data lake. Why not just go with the data lake first, then?
Databricks/Spark are just good platforms for doing something with the structured data in your lake. With the recent additions to its execution engine and Delta (strange naming tbh), it will be pretty much the same as Snowflake for you.
BigQuery/Snowflake can process Parquet and multiple other formats in Object Storage. You can use them more "freely" if you keep your raw data in open formats.
Need something more complicated than what can be done in BigQuery/Snowflake (that remaining 20%, though I'd say 10%)? Export the dataset to CSV/Parquet/Avro/ORC/whatever and process it with anything, including Dataproc/HDInsight/EMR or even Databricks. That's actually a common pattern.
While both BQ and Snowflake are adopting lake features, they still only support Parquet and other file formats for loading, not for querying.
Can't do a simple

    select * from s3://file.parquet

which you can do in Spark. Having to load it into the data warehouse means you store your data twice, and it is stupidly annoying.
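For reference, the direct-path form in Spark SQL looks roughly like this (the bucket and path are made-up placeholders):

```sql
-- Spark SQL: query a Parquet file in S3 in place, with no load step.
-- "s3a://my-bucket/events/file.parquet" is a hypothetical path.
SELECT *
FROM parquet.`s3a://my-bucket/events/file.parquet`;
```

The `format.`path`` syntax works for other formats too (e.g. `json.`...``, `orc.`...``), which is what makes ad hoc exploration of a lake so cheap in Spark.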
Many times the data doesn't even resemble anything tabular before I structure it in Python scripts. Why would I load it into a warehouse only to pull it down, do some Python processing, and load it back? That makes the data travel from DWH storage to a data lake, then to my compute cluster, and then take the same cumbersome road trip back. Pretty wasteful. Spark at least lets me schedule a Python function across a cluster while copying only from the lake to my compute node and back.
Data warehouses like BQ and Snowflake are great for data scientists after a bunch of engineers slice and dice raw data into clean tables. For anyone working with not-yet-structured data, the data lake wins hands down.
Not quite -- Snowflake allows you to query data directly from S3 without having to ingest it. You can either query the S3 file directly [1], or define an external table on top of it [2]. If the data is JSON, Avro, XML, or some other "semi-structured" format, then Snowflake's "variant" data type makes this type of data very easy to work with. For raw text or log lines, you can still load to Snowflake and process using regex or UDFs. For unstructured or "complex" data like images and PDFs, then you may need to reach for a different compute environment.
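Roughly, the two options look like this (the stage, format, and table names here are made up; the exact options are in the Snowflake docs):

```sql
-- Option 1: query staged JSON files directly, no ingestion.
-- @my_stage, the prefix, and my_json_format are hypothetical names.
SELECT $1:user_id::string AS user_id, $1:payload AS payload
FROM @my_stage/events/ (FILE_FORMAT => 'my_json_format');

-- Option 2: define an external table over the same prefix.
-- Rows land in the VARIANT "value" column.
CREATE EXTERNAL TABLE events_ext
  LOCATION = @my_stage/events/
  FILE_FORMAT = (TYPE = JSON);

SELECT value:user_id::string AS user_id FROM events_ext;
```

In both cases the semi-structured content arrives as a variant, so you can drill into it with the `:` path syntax and cast fields as needed.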
Yeah, I thought Snowflake external tables would do this, but that's not the case.
An external table in Snowflake only lets you ingest data from S3 into their storage (which is also S3 behind the scenes).
Perhaps something has changed since the last time I tried, but when I tried, my conclusion was that "external tables" in Snowflake are not what you think they are.
Also, I have not seen an example of "select * from s3://file.json" in the links you provided.
> Can't do a simple select * from s3://file.parquet
That's not really true; Snowflake can query directly from S3 just fine, even from other clouds. You just need to set up the credentials (or supply them in the query, though that's usually not a good idea).
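A sketch of that credentials setup (the integration name, role ARN, and bucket are all placeholder values):

```sql
-- One-time setup: a storage integration holding the credentials,
-- plus a stage pointing at the bucket. All names are hypothetical.
CREATE STORAGE INTEGRATION my_s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-snowflake-role'
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/');

CREATE STAGE my_stage
  URL = 's3://my-bucket/data/'
  STORAGE_INTEGRATION = my_s3_int;

-- After that, files under the stage can be queried without loading:
SELECT $1 FROM @my_stage/file.parquet (FILE_FORMAT => 'my_parquet_format');
```

The integration route keeps keys out of SQL text, which is the usual reason not to inline credentials in the query itself.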
> While both BQ and Snowflake are adopting lake features, they still only support parquet and other file formats for loading, not for querying.
This is not true. Snowflake lets you create an external stage (a pointer to S3) and then query any prefix you want, as long as you provide the correct file format arguments or types.
select * from @somestage/someprefix (file_format => ....)
edit: hadn't refreshed the page and i can see others have already responded to this point.
In BQ you can query data as an external table with Hive partitioning. No need for duplication, and extremely useful for consuming "landing zones" in a data lake.
Pretty similar to defining a Hive table and then using any other engine to process it.
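A minimal sketch of that pattern, assuming the BigQuery external-table DDL (the dataset, bucket, and the `dt` partition column are made-up examples):

```sql
-- BigQuery: external table over Hive-partitioned Parquet in GCS.
-- my_dataset, the bucket, and the dt partition column are hypothetical.
CREATE EXTERNAL TABLE my_dataset.landing_events
WITH PARTITION COLUMNS
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/landing/events/*'],
  hive_partition_uri_prefix = 'gs://my-bucket/landing/events'
);

-- Partition columns inferred from the path (e.g. .../dt=2021-01-01/...)
-- can then prune which files get scanned:
SELECT * FROM my_dataset.landing_events WHERE dt = '2021-01-01';
```

Because the files stay in GCS, any other engine that understands the same Hive layout can process the same data without a copy.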
PS: BigQuery Omni (now in beta) will support object storage from other cloud providers.
>Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
Snowflake has a Spark connector too, so I don't know what the difference would be between writing a Spark job against Delta Lake vs. Snowflake.
> 80% of the things you will need - which is great for some newbie stuff or for sales presentations.
This is obviously wrong.
> You will have to develop some kind of data lake to store unstructured data anyway. You will end up with a Snowflake data warehouse and a data lake. Why not just go with data lake first then.
We store unstructured data in Snowflake. I don't understand why you'd need a data lake on top of it.
EXACTLY. You absolutely can store unstructured and semi-structured data in Snowflake. I find it baffling, and at this point a bit irritating, that there is this community of people insisting that it is not allowed for...some unspecified reason.
Why would I store, e.g., a bunch of HTML files as string columns in Snowflake, only to pull them down, process them in Python, and load them back into some other string table?
Because it actually costs the same, and if you process them in Snowflake using SQL or UDFs, you'll get your results in seconds and you won't have to manage any of the underlying infrastructure.
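For the HTML-as-strings case specifically, a lot of light processing can stay in SQL. A hedged sketch (the table, column, and the title-extraction regex are all illustrative):

```sql
-- Hypothetical layout: one row per document, raw HTML in a string column.
CREATE TABLE raw_html (path STRING, doc STRING);

-- Pull the <title> contents out with a regex, entirely inside the
-- warehouse -- no round trip through a Python script.
SELECT path,
       REGEXP_SUBSTR(doc, '<title>([^<]*)</title>', 1, 1, 'i', 1) AS title
FROM raw_html;
```

Anything that's really parsing (not pattern matching) would still push you toward a UDF or an external compute step, which is where the disagreement in this thread actually lives.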
>Once you need something complicated, you're on your own, while being stuck in a proprietary environment that you cannot extend.
Snowflake supports enough SQL constructs to allow for very complicated queries. If that doesn't suit your needs, there are stored procedures and custom JavaScript UDFs you can write. That covers probably 99+% of the use cases at most companies, and the rest can usually be done somewhere else on pre-aggregated data.
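For anyone who hasn't seen one, a JavaScript UDF in Snowflake looks something like this (the function name and logic are just an example):

```sql
-- A trivial JavaScript UDF. Note that Snowflake exposes SQL argument
-- names to the JS body in uppercase (s -> S).
CREATE FUNCTION word_count(s STRING)
RETURNS FLOAT
LANGUAGE JAVASCRIPT
AS $$
  return S.trim().split(/\s+/).length;
$$;

-- Usable anywhere a built-in function is:
SELECT word_count('snowflake javascript udf example');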
1. Why tf would JavaScript be picked as the UDF language when Python dominates the data world?
2. People usually load complex Python libraries for data processing. I wonder whether Snowflake UDFs would support that or just allow the standard library.
1. Why not? It's not like you're getting external libraries either way.
2. If you want complex libraries, you can use external functions (AWS Lambdas, etc.) in Snowflake.
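The external-function route looks roughly like this (the integration name, role ARN, and API Gateway URL are all placeholders; the Lambda itself lives outside Snowflake):

```sql
-- One-time setup: an API integration pointing at API Gateway.
CREATE API INTEGRATION my_api_int
  API_PROVIDER = aws_api_gateway
  API_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-api-role'
  API_ALLOWED_PREFIXES = ('https://abc123.execute-api.us-east-1.amazonaws.com/prod')
  ENABLED = TRUE;

-- The external function: each call ships rows to the Lambda behind
-- the endpoint and gets results back as a variant.
CREATE EXTERNAL FUNCTION score_text(s STRING)
  RETURNS VARIANT
  API_INTEGRATION = my_api_int
  AS 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/score';

SELECT score_text('some document text');
```

So the heavy Python-with-libraries work runs in the Lambda, while the SQL side just sees an ordinary function call, at the cost of per-batch network latency.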