Cloud Dataflow is also vastly more expensive than using PySpark, and Apache Beam is roughly 10x slower than Flink or Spark for most operations. I still have to try the new flexible scheduling, but when I last tried Dataflow two years ago it was a massive rip-off.
Also, DataPrep was amazing until you actually run the Dataflow pipeline it produces and cleaning a < 1TB dataset costs $25...
Have had good experiences once the data is in BigQuery, and BQ keeps costs low. But lately I've been trying to figure out a semi-managed way of doing exactly-once stream processing for cheaper than Dataflow. Possibly an architecture around Cloud Run might work.
$25 for cleaning a relatively small dataset does sound a bit high, but not outrageous, given the process abstracts away all the infra tweaking and setup of various systems. $25 an hour is probably a good estimate of what a data warehouse like Snowflake costs for a reasonably sized cluster as well, so unless you spin up your own spot EC2 instances and set up a Spark cluster yourself, the cost savings are going to be marginal at best.
Do you need to run this pipeline more than once a day? Is it sufficiently important for your business case? I feel like $25 is a very small sum if either of those is remotely true.
On a related note, how reliable is the DataPrep -> Dataflow workflow in practice? Could a reasonably smart analyst with zero programming experience set up and run these pipelines? That feels like who this workflow is built for, but I haven't had the best experiences with Trifacta demos on AWS in terms of reliability.
In my experience, $25 for 1TB of ETL is outrageously high. That's 100 cores on GCP or AWS for 4-5 hours. I have some very CPU-intensive jobs that would take about that long for 300GB of data. But if it's just parsing CSV/JSON and doing some joins, I'd think $5/TB would be more reasonable.
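To sanity-check that core-hour math, here is a minimal sketch. The $0.05 per vCPU-hour rate is an assumption on my part, roughly in the ballpark for general-purpose on-demand instances on GCP or AWS; the actual rate depends on machine type, region, and discounts.

```python
# Back-of-the-envelope ETL cost check.
# ASSUMPTION: ~$0.05 per vCPU-hour, a rough ballpark for
# general-purpose on-demand instances (actual rates vary).
PRICE_PER_CORE_HOUR = 0.05

def core_hours_for_budget(budget_usd, price=PRICE_PER_CORE_HOUR):
    """Return how many core-hours a given budget buys."""
    return budget_usd / price

core_hours = core_hours_for_budget(25)   # 500 core-hours
hours_on_100_cores = core_hours / 100    # 5 hours on a 100-core cluster
print(core_hours, hours_on_100_cores)
```

So at that assumed rate, $25 does buy on the order of 100 cores for 5 hours, which is a lot of compute for cleaning under a terabyte of data.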
I agree a warehouse could get more expensive, but isn't that where Athena or BigQuery is supposed to come in? BigQuery is rather expensive, though. It can't read Parquet natively (of course Google insists on their own format), so you have to pay their heavy tax versus just object storage and elastic compute.
Also, the classic BigQuery UI is far less buggy than the new one, and the PMs won't accept bug reports for the new UI.
“Our benchmark results show that Apache Beam has a noticeable impact on the performance of DSPSs in almost all cases. Programs developed using Apache Beam suffered from a slowdown of up to a factor of 58 in the worst case. At the same time, there is one scenario where the query developed using Apache Beam is about as fast as its counterparts using the APIs of the corresponding DSPS. However, for most scenarios we observed a slowdown of at least a factor of three.”