Data Engineering Patterns with Apache Airflow [video]

grillorafael · on Aug 29, 2018

Apache Airflow seems like a really interesting project but I don't know anyone using that can give a real life pros/cons to it.

Anyone here dares to give some feedback in that sense?

Ps: Why do people still use Prezi? It gives me vertigo.

tpaschalis · on Aug 29, 2018

There's a bunch of different tools to do the same job, from manual cron jobs, to Luigi, Pinball, Azkaban, Oozie, Taverna, Mistral.

I've started to use it for personal projects, and slowly probing for adoption in our shop, where applicable.

The good points I have seen

- It's simple Python, and not XML like Azkaban. I've seen people with less technical expertise build useful stuff quickly, and automate their workflows.

- Very good UI, which just lets you do what you need without fuss.

- Easy to build modular and interactive flows, with interesting stuff as sensors, communications between operators, triggers etc.

- Everything is stored into a database, which I can query about anything related to the processes run and Airflow itself

- Its source is grok-able and documented, it allows you to easily add your own modules (or "operators" as they're called)

- Many add-on modules for operators already exist from the community

- Easier to force the team to version control your process flows

Some cons, from the light use I've seen

- If you scale beyond a point, you have to take care of scaling the database as well, adding DBA work

- I've encountered some issues with scheduler and backfilled jobs, and `depends_on_past`, but it might be my limited experience

- People may start to use specific external dependencies/modules, which you will then need to keep track of

- Uses its own lingo/terminology, which you'll have to learn and use

- Uses system time, so no running in different timezones

I have high hopes for the project, as it's currently incubating for the Apache Foundation, and I hope it remains minimal and keeps the present scope.

If it seems interesting to you, my suggestion is to start small, keep in mind that it handles relations between tasks and not data, and try to automate some easy bash script that you currently handle with cron.

neuromantik8086 · on Aug 29, 2018

> There's a bunch of different tools to do the same job

Yup. Almost, too many tools, in fact.

https://s.apache.org/existing-workflow-systems

neuromantik8086 · on Aug 29, 2018

As a somewhat related addendum, some of the worst pipelines I've encountered have been in scientific computing. Conceptually, DAGs are quite simple but for some reason things always end up gummed up in implementation, which is partially why scientific results are much harder to reproduce than they should be. The disconnect between how much harder pipeline creation is compared to how easy it should be in the sciences has always confused me a bit.

Maro · on Aug 29, 2018

We base our whole DS infrastructure on Airflow (and Superset):

http://bytepawn.com/fetchr-airflow.html

http://bytepawn.com/fetchr-data-science-infra.html

Airflow is somewhere between good enough and pretty cool, it's based on what we had at Facebook (called Dataswarm).

IMO in 2-3 years Airflow will be the de-facto ETL standard, like Hadoop used to be for "Big data". If you're rolling your own ETL at this point, you're wasting your time. If you're using something else, you're (probably) missing out on ETL-as-code goodness.

sweml · on Aug 29, 2018

IMO Airflow currently is the de-facto standard and in 2-3 years it will go the way of Hadoop

rhombocombus · on Aug 29, 2018

How does it compare to big iron enterprise ETL tools like IBM datastage? I have only dabbled but it looks far more appealing for a variety of reasons.

caravel · on Aug 29, 2018

For context, I used Datastage, Informatica, Ab Initio and SSIS in previous lives an went on to write the first version of Airflow. I developed a taste for pipelines-as-code while working at Facebook using an internal tool that is not open source.

I'd argue that pipelines as code, as opposed to dragndrop GUIs, is a better approach, at least for people who are comfortable writing code. Code is easy to version, test, diff, collaborate and allows for the creation of arbitrary abstractions.

philosopherlawr · on Aug 29, 2018

ETL tools just can't compete with a tool that forces code to do anything. It might seem backwards, but we've abandoned all non-code environments and force pure-code for configuration for all of our pipelines.

thisisit · on Aug 30, 2018

> like Hadoop used to be for "Big data"

So, you are saying it will be going with time, after it becomes the de-facto ETL tool?

boulos · on Aug 29, 2018

Disclosure: I work on Google Cloud.

We ended up choosing Airflow for our managed workflow service, Composer [1].

One of the main cons is that this space is generally fragmented. When you’re starting out with a simple repeatable task, the natural default is just a cron job (and if you’re “fancy”, using Jenkins or similar to kick it off). By comparison, getting started with Airflow means setting it all up, and then writing a Python script that represents your DAG. It may be better hygiene, but that’s not people’s first preference.

Beyond the obvious “this thing is a real workflow orchestration system, it handles failures and dependencies”, I think the main advantage is the pre-built Operators. Instead of being just given a blank bash or python script, this is a community-driven effort to avoid everyone needing to roll their own. It’s still a young community, but growing quickly (and we’re intent on pushing).

[1] https://cloud.google.com/composer/

tedmiston · on Aug 29, 2018

Disclosure: I work on Astronomer.io, an open source Airflow platform and SaaS [1][2], and also contribute to Airflow.

Our experience was similar to boulous' — Airflow is awesome once it's running but getting it running in an environment that scales to the point that you can deploy your first production DAG can take some effort. That's what led us to trying to do that work once for everyone.

Reusability and composability of components are some of my favorite aspects of working with Airflow.

[1]: https://www.astronomer.io/

[2]: https://github.com/astronomerio/helm.astronomer.io

tedmiston · on Aug 29, 2018

Disclosure: I work on Astronomer.io, an open source Airflow platform and SaaS, and also contribute to Airflow.

IMO the pros / cons are really relative to what you're comparing it to and what your workflow needs are.

We've written some guides on "Airflow vs ___" [1] (currently AWS Glue and Oozie). Feel free to email me (in profile) if you'd like to see a comparison to something else.

I like these two posts on Airflow vs Luigi, Pinball, Azkaban, etc [2][3].

[1]: https://www.astronomer.io/guides/

[2]: http://bytepawn.com/luigi-airflow-pinball.html

[3]: https://robinhood.engineering/why-robinhood-uses-airflow-aed...

politelemon · on Aug 29, 2018

We consider it a glorified cron replacement. The main selling point is its scheduling feature and the ability to view logs via the web UI it provides.

You write DAGs in Python to do 'stuff', schedule it to run, say, every hour. You can then get a history of its runs, failures, what went wrong. Rerun things if needed.

Those are the pros.

Cons - when new devs try to treat it as a programming paradigm, things can get difficult to work with. Some aspects aren't easily automatable - eg creating users. Needs to make its authentication options obvious and would be good to have some finer grain control over who can do what in the Airflow UI.

Overall we're quite happy with it and also using it for datascience as well as data feeds, data workflows, ETL processes.

yoquan · on Aug 29, 2018

There is an attempt to create a role based access control by the guys at WePay according to the shared slide "RBAC talk" below [1]. Don't know why their repo [2] can't be accessed now, though.

[1] https://www.meetup.com/ja-JP/Bay-Area-Apache-Airflow-Incubat...

[2] https://github.com/wepay/airflow-webserver

tedmiston · on Aug 29, 2018

The RBAC UI has since been merged into Airflow and is now released in Airflow 1.10.

https://github.com/apache/incubator-airflow/tree/master/airf...

https://github.com/apache/incubator-airflow/commit/05e1861e2...

To enable it, set `rbac = True` under the `webserver` group in your airflow.cfg, or via env var:

    export AIRFLOW__WEBSERVER__RBAC=True

yoquan · on Aug 30, 2018

That's great to know, thank you. Eager to try 1.10!

monksy · on Aug 29, 2018

If you've ever used a tool similar to HP Operations Orchestration. It's pretty much a stripped down version of that.

Cons:

- It's not very stable. (It requires a lot of configuration to get it to do more than one process at a time)

- It's very easy to get the UI to fall over.

- It's very difficult to get tasks+jobs to stop running once they started. (You can delete/stop/cancel a job.. but under the covers it keeps running and your next iteration is going to wait.. if it ever does complete before you can go through the develop, test cycle again)

- It's written in Python: Expect to have issues with your environment. The latest version of Airflow doesn't work with 3.7.xish because the async word was made a keyword. There goes that method.

- There is no sharing (xcoms is frowned upon) of data from one process to another. This means that if you're trying to pull data from S3, you're going to have to hard code it to a predictable place. The next operation acts completely indepenently and runs that.

- The connections between tasks are superficial. They're just there to order it based on how you specified it. Also, it can be a bit difficult to debug when you have multiple layers and multiple depedency declarations where something is both a upstream and downstream of the same depedency.

- No optimization. It will not split up the work per task. You have to define that work manually. (See the next complaint)

- No Dynamic tasks or Dags. You cannot generate a new dag or task after the dag is initialized. That means that if you have to perform 1 000 000 000 000 API calls, you can't just break that up into 200 api calls per task and then max out your compacity in your workers.

- That example that they had of a dag of thousands of tasks. That's a bad practice. Timeouts on dags are going to be reached by the time that completes, and it'll try to restart on a schedule.

gbrown · on Aug 29, 2018

Any opinion on NiFi or other open source alternatives? I'm evaluating products in this space, but it seems pretty hard to tell without just giving them a try how well they'd integrate into my work.

monksy · on Aug 29, 2018

At the moment I haven't found any other OSS alternatives. Maybe Chef or ansible. I'll ask my friend Warren.

monksy · on Aug 29, 2018

Another reason: Your tasks/dags aren't portable from one system to another. Your tasks+dags are your code base. You can't create something and share it with others very easily.

neuromantik8086 · on Aug 29, 2018

This is why I wish attempts at standardizing ways to express DAGs in YAML/JSON/XML like CWL [0] and WDL [1] had more steam to them. If these standards took off, you'd be able to take your workflow and execute it on another batch system scheduler if you got tired of your current workflow orchestration tool.

[0] https://www.commonwl.org/ [1] https://software.broadinstitute.org/wdl/

caravel · on Aug 29, 2018

[full disclosure, Airflow committer here] I've never heard of "HP Operation Orchestration", but that looks like a drag and drop enterprise tool from a different Windows-GUI era. Airflow is very different, it's workflows defined as code which is a totally different paradigm. The Airflow community believes that when workflows are defined as code, it's easier to collaborate, test, evolve and maintain them. Though maybe the HP tool exposes an API?

To address some of your comments:

- About stability, I'm not sure what version you've used, or which executor you were using, but if stability was a concern I don't think we'd have such a large and thriving community. Nothing is perfect, but clearly it's working well for hundreds of companies.

- About stopping jobs, if the task instance state is altered or cleared (through the web ui or CLI), the task will exit and the failure will be handled. Earlier versions (maybe 1-2 years back?) did not always do that properly.

- About Python: Python 3.7 was released late June 2018, and I think there are PRs addressing the `async` issue already. We fully support 2.7 to 3.6, and 3.7 very soon. You need to give software a chance and a bit of time to adapt to new standards. I wonder which % of pypi packages support 3.7 at this point, or how many have 3.7 in their build matrix, but my guess is that it's very low.

- XCOMs are fine in many cases, though if you're not passing data or metadata from a task to another that doesn't mean that there's no context that exist for the execution of the task. We recommend having determinisc context and data locations (meaning the same task instance would always target the same partition or data location).

- Dynamic: the talk linked to above clarifies what kind of dynamism Airflow supports. It's very common to build DAGs dynamically, though the shape of the DAG cannot shape at runtime. Conceptually an Airflow DAG is a proper directed acyclic graph, not a DAG factory or many DAGs at once. Note that you can still write dynamic DAG factories if you want to create DAGs that change based on input.

- No optimization: the contract is simple, Airflow executes the tasks you define. While you can do data-processing work in Python inside an Airflow task (data flow), we recommend to use Airflow to orchestrate more specialized systems like Spark/Hadoop/Flink/BigQuery/Presto to do the actual data processing. Airflow is first and foremost an orchestrator.

- DAG parsing timeouts are configurable. DAG creation times for large DAGs have been improved in the past versions. But clearly it's easier for humans to reason about smaller DAGs. DAGs with thousands of tasks aren't ideal but they are manageable.

One thing to keep in mind is that thriving open source software evolves quickly, and Airflow gets 100+ commits monthly, and has dozens of engineers from many organizations working on it full time. From what I can see it's clearly the most active project in this space.

monksy · on Aug 29, 2018

HP OO is an entire system that has an XML structure command structure to customize the job that you're working on. It has a GUI that is used to build out the flows, run and test. It has a backend system to audit, admin, and visualize the current process, and it has workers to scale out the work. It's a bit of a more mature setup.

The claim that the setup of your workflow has to be code isn't necessarily a good thing. Your recipe should be descriptive, not imperative.

----

To answer your response:

1. Stability: I was working with apache-airflow 1.9 (last release: Jan2018) 1.10 was just released 2 days ago. I frequently had issues where deleting more than 3 tasks would cause that mushroom cloud error message. Also, I've had cases where the task could max out on memory and take the whole system down.

2. 1.9.0: Stopping jobs: I saw this issue where that a task would be running, I would stop+delete the task and start a new one. I frequently saw the case where I had to wait a while for the triggered dag to continue running.

3. Python3.7: Yes, it was addressed on the PR. However, for things like that we (the users) need a quick turn arround/hotfix for stuff like that. It got released late (lets say 27 June, and the latest version 1.10 was 28 August [with a 7month gap]) It's just painful to have this upgrade just break something internally in Airflow.

4. From what I've seen in situations where the work for the task is huge is that there is an expectation of the task to handle the workload and splitting up the workload it's self. (Since you can't define a span out of the tasks based on the workload) That's no beuno.

Timeouts: From what I have seen there are issues where the next dag run scheduled can interfere with the last one. This is an issue given the timeouts, retries, and reoccurring schedules. (yes you can say.. that's user's choice.. however, workloads and performance can change without notice)

----

Another issue I had: There is no way to trigger a task and it's depending tasks without triggering the whole dag. This makes long-running dags with lots of tasks difficult to debug and test.

Also there is a slight difference between airflow run (task) and test. Sometimes you use one vs the other.

sejtnjir · on Aug 29, 2018

Adopting Airflow over cron was a huge leap for the analytics company I worked for. I think the biggest win was making the complex net of dependent batch operations explicit. That logic lived inside the head of the lead dev before that point. This allowed management to reason about how new functionality effected operational complexity, which would have been unimaginable before.

framebit · on Aug 29, 2018

On a Data Engineering/Data Science team we were using it as the missing Apache Spark scheduler. At first cron was good enough but then as more projects and processes and people came online we turned to Airflow to help wrangle everything and it helped!

glogla · on Aug 29, 2018

We tested it, but the performancd was bad. We needed hundred workflows with few hundred taks each, and Airflow would just topple over daily.

We ended up with proprietary tool from Teradata thats basically Airflow written in perl - but it can handle all the work.

Other than scalability, Airflow is pretty nice.

caravel · on Aug 29, 2018

[full disclosure, I'm the creator of Airflow]

Many environments run tens of thousands of concurrent tasks, and hundreds of thousands of tasks daily. The list of companies using Airflow speaks for itself https://github.com/apache/incubator-superset#who-uses-apache...

But hey, it's like anything, you have to do a bit of work to get distributed systems to run at scale. There are now hosted solutions to help with that (Google Cloud Composer and Astronomer.io)

tedmiston · on Aug 29, 2018

https://github.com/apache/incubator-airflow#who-uses-airflow

:)

tedmiston · on Aug 29, 2018

I run an Airflow instance that does millions of tasks per month across dozens of DAGs. There's some performance tuning involved in the configuration file and of course you need the underlying resources available but Airflow has scaled to this level well for us.

If you are able to reproduce and can post to the dev mailing list, we are happy to help... especially so if it gets you off of a proprietary tool written in Perl ;).

1.10 was just released and adds a ton of commits. I'd really encourage you to give Airflow another shot if you have the time.

vgy7ujm · on Aug 30, 2018

Millions per MONTH is just ridiculous. Stay with what works, revisiting this slow Python tool will be a cost and time sink. A mistake companies let engineers with millennial complex do all too often.

tpaschalis · on Aug 29, 2018

It sounds weird that you had these problems. Silly question, but might it have been a database optimization issue?

rywalker · on Aug 29, 2018

My guess is that you had an underpowered database instance backing Airflow.

vgy7ujm · on Aug 29, 2018

Perl saved the day again.

xkcd-sucks · on Aug 29, 2018

What strategies do people use to make Airflow behave like an "event-driven" scheduler versus a "time-driven" scheduler? Like, for example, processing data as it is received versus processing data at set time intervals

tedmiston · on Aug 30, 2018

One way to achieve that is to use externally triggered DAGs [1][2].

Instead of trying to make the Airflow scheduler work as event-driven though, I've more tried to adapt my approach.

I've used Airflow as a bridge between a micro batch streaming system (Sparking Streaming --> S3, Airflow: S3 --> ...) successfully, but there are some challenges around the edge cases like files landing later than expected.

One approach you can take is to have the Airflow sensor hit a file that's a proxy for all of the files being present. For example, if your process could write hundreds of S3 files, once it's finished the last write for that hour (even if that happens late for whatever reason), then it could write a top-level OK file that the sensor hits.

Can you elaborate on your specific use case?

[1]: https://stackoverflow.com/a/50592573/149428

[2]: https://airflow.readthedocs.io/en/stable/scheduler.html#exte...

yzmtf2008 · on Aug 29, 2018

For "event-driven" scheduler, you may want to have a streaming system like Flink or Spark Streaming instead. You can schedule the startup of these pipelines with Airflow (e.g., using BashOperator), and handle the actual streaming with the streaming systems directly.

We use Airflow with this pattern extensively.

tpaschalis · on Aug 29, 2018

(Newbie Airflow user here). I believe one easy way to do it is by using Airflow's 'sensors'.

Sensors are operators which poke continuously with an action until it returns True (eg. until a file exists, an API gives a specific response, a process/query has finished).

Another way to do it would be to 'XComs', small pieces of information flying between DAGs, or 'Triggers', but these require some more setup, and IMO depend more on the way you're setting up your tasks.

xkcd-sucks · on Aug 30, 2018

Yeah the issue is once a sensor fires once, it doesn't reset and keep firing on new data

My homegrown solution is a sensor at the beginning, and at the end an airflow api call to trigger a dag run of the same dag. Not DagRunOperator because then no dag would never finish due to infinite recursion

It seems kinda sketchy so I'm considering a lower level Celery implementation or even GenStage

weego · on Aug 29, 2018

I've no idea what that presentation thing is, but no. It's stuttering, grinding my mac to a halt, makes skimming through tedious.

tpfour · on Aug 29, 2018

I couldn't be bothered to even pay attention to the material. I found the presentation repugnant.

From Prezi's homepage: "Harvard researchers find Prezi more engaging, persuasive, and effective than PowerPoint.". My experience was the complete opposite. The medium completely destroyed the message.

dec0dedab0de · on Aug 29, 2018

Prezi was pretty popular at my local PUG 5 or 6 years ago. When done right, it can be an effective way of illustrating how different parts of a talk are connected. But yeah, clicking through this is a pain.

grillorafael · on Aug 29, 2018

The animations kills me

caravel · on Aug 29, 2018

Here's the actual talk: https://www.youtube.com/watch?v=23_1WlxGGM4

dang · on Aug 29, 2018

Ok, since people were complaining so much about the previous url, which was https://prezi.com/p/adxlaplcwzho/advanced-data-engineering-p..., we've switched to the video. Thanks!

yoquan · on Aug 30, 2018

Thank you for switching to the video - I was late to response due to different timezone. (I've chosen slide over video thinking people generally don't like video. Seems the slide format was more disturbing)

ckdarby · on Aug 29, 2018

This is the type of slides that would benefit HN if they had the video with them as well.

blobbers · on Aug 29, 2018

https://en.wikipedia.org/wiki/Matthew_7:7%E2%80%938

:

https://www.youtube.com/watch?v=23_1WlxGGM4