Cloud Dataflow is also vastly more expensive than using PySpark, and Apache Beam is roughly 10x slower than Flink or Spark for most operations. I still have to try the new flexible scheduling, but when I last tried Dataflow two years ago it was a massive rip-off.
Also, DataPrep was amazing until you actually run the Dataflow pipeline it produces and cleaning a < 1TB dataset costs $25...
Have had good experiences once the data is in BigQuery, and BQ keeps costs low. But lately I've been trying to figure out a semi-managed way of doing exactly-once stream processing for cheaper than Dataflow. Possibly an architecture around Cloud Run might work.
$25 for cleaning a relatively small dataset does sound a bit high, but not outrageous, given the process abstracts away all the infra tweaking and setup of various systems. $25 an hour is probably a good estimate of what a data warehouse like Snowflake costs for a reasonably sized cluster as well, so unless you spin up your own spot EC2 instances and set up a Spark cluster yourself, the cost savings are going to be marginal at best.
Do you need to run this pipeline more than once a day? Is it sufficiently important for your business case? I feel like $25 is a very small sum if either of those is remotely true.
On a related note, how reliable is the DataPrep -> Dataflow workflow in practice? Could a reasonably smart analyst with zero programming experience set up and run these pipelines? That feels like who this workflow is built for, but I haven't had the best experiences with Trifacta demos on AWS in terms of reliability.
In my experience, $25 for 1TB of ETL is outrageously high. That's 100 cores on GCP or AWS for 4-5 hours. I have some very CPU-intensive jobs that would take about that long for 300GB of data. But if it's just parsing CSV/JSON and doing some joins, I'd think $5/TB would be more reasonable.
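To sanity-check that core-hour math, here is a minimal sketch. The $0.05 per vCPU-hour rate is an assumption on my part, roughly in the ballpark for general-purpose on-demand instances on GCP or AWS; the actual rate depends on machine type, region, and discounts.

```python
# Back-of-the-envelope ETL cost check.
# ASSUMPTION: ~$0.05 per vCPU-hour, a rough ballpark for
# general-purpose on-demand instances (actual rates vary).
PRICE_PER_CORE_HOUR = 0.05

def core_hours_for_budget(budget_usd, price=PRICE_PER_CORE_HOUR):
    """Return how many core-hours a given budget buys."""
    return budget_usd / price

core_hours = core_hours_for_budget(25)   # 500 core-hours
hours_on_100_cores = core_hours / 100    # 5 hours on a 100-core cluster
print(core_hours, hours_on_100_cores)
```

So at that assumed rate, $25 does buy on the order of 100 cores for 5 hours, which is a lot of compute for cleaning under a terabyte of data.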
I agree a warehouse could get more expensive, but isn't that where Athena or BigQuery is supposed to come in? BigQuery is rather expensive, though. It can't read Parquet natively (of course Google insists on their own format), so you have to pay their heavy tax versus just object storage and elastic compute.
Also, the classic BigQuery UI is far less buggy than the new one, and the PMs won't accept bug reports for the new UI.
“Our benchmark results show that Apache Beam has a noticeable impact on the performance of DSPSs in almost all cases. Programs developed using Apache Beam suffered from a slowdown of up to a factor of 58 in the worst case. At the same time, there is one scenario where the query developed using Apache Beam is about as fast as its counterparts using the APIs of the corresponding DSPS. However, for most scenarios we observed a slowdown of at least a factor of three.”