Why Airflow Is the Wrong Tool for Most Data Engineering Teams
- Stefan Vodilovski
- Feb 11
- 3 min read
You know the drill: if ETL is mentioned, Airflow is assumed to be the scheduler behind the tasks.
Airflow has become one of those tools where the tool and the process are directly associated. But do we need it all the time? Probably not.

What Airflow does well
Let’s at least start with the positives, because they’re real.
Airflow is very good at:
expressing dependencies explicitly
making execution order visible
scheduling jobs in a predictable way
giving you a clear UI to see what ran, when, and why it failed
Once everything is wired correctly, the mental model of DAGs is powerful. You can reason about workflows as graphs instead of scripts, which is genuinely useful.
When you are dealing with very complex workflows, all of that helps you visualize what actually happens.
As an orchestrator, Airflow does its job.
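To make "dependencies as a graph" concrete, here is a minimal DAG sketch. It assumes apache-airflow 2.x is installed; the DAG id, task names, and the (empty) callables are purely illustrative, not a real pipeline.

```python
# Minimal DAG sketch (assumes apache-airflow 2.x; names are illustrative).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    ...  # pull data from a source


def transform():
    ...  # reshape it


def load():
    ...  # write it to the warehouse


with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # `schedule_interval` on older 2.x versions
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies are explicit: extract -> transform -> load
    t_extract >> t_transform >> t_load
```

The `>>` operator is what makes execution order visible: the graph is declared in code, and the UI renders exactly what you wrote.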
What you don’t expect to be so hard
All of the friction comes before Airflow starts doing anything useful.
To get to a point where you could even see your DAG in the UI, you have to:
get the local setup right
configure the scheduler, webserver, and metadata database
debug why the DAG wasn’t being picked up
understand how Airflow loads and interprets files
At this stage, no data has moved; you have only set up your environment. That drives a lot of data engineers mad, especially those who are just starting out or working under tight deadlines.
Docker fixes things a bit
You don’t have to install everything on your local machine to spin up your Airflow DAG; you can use Docker. But that comes with its own setup.
It simplifies the whole process, but there are so many environment variables to configure that you can drown in the documentation.
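As a taste of what that configuration looks like: Airflow maps every config option to an environment variable of the form `AIRFLOW__{SECTION}__{KEY}`. A hedged fragment of the kind of `.env` file a Docker setup tends to accumulate (the connection string is a placeholder, not a working value):

```shell
# Airflow reads config from AIRFLOW__{SECTION}__{KEY} variables.
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__CORE__LOAD_EXAMPLES=false
# Metadata DB connection; the URI below is a placeholder.
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://user:pass@postgres/airflow
# Used by the official docker-compose file to match file ownership.
AIRFLOW_UID=50000
```

Multiply this by executors, logging, webserver, and secrets backends, and the "simple" Docker path still means a long session with the configuration reference.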
The deployment is the easy part, right?
One of the best ways to learn DevOps is by trying to deploy Airflow. You finally ran your DAG locally, and the first thing you may struggle with is dockerizing the app. If the team lacks DevOps knowledge, this is where you can waste time, instead of on the things that actually matter to data engineers.
After you have your image and want to deploy it on Kubernetes, good luck!
Now you’re no longer using Airflow; you’re operating it. You need to think about:
persistent volumes for logs and metadata
how schedulers and workers scale
secrets and credentials
upgrades and database migrations
monitoring the orchestrator itself
None of this has anything to do with extracting or transforming data, but all of it is required just to keep Airflow alive.
At this point, Airflow becomes infrastructure, and infrastructure comes with a cost.
Airflow makes you care about everything around the data
This is where the mismatch usually appears. Data engineers want to:
move data reliably
reason about transformations
validate outputs
evolve schemas safely
Airflow doesn’t help with any of that directly. It doesn’t know whether your data is correct, whether a retry corrupts downstream tables, or whether today’s successful run silently broke six months of history.
What it does know is whether a task exited with a zero or non-zero code.
And because of that, teams start adding checks, guards, validations, conventions, and workarounds on top.
Slowly, Airflow turns into the place where everything lives.
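The gap between "exit code zero" and "data is correct" is why those guards appear. Here is a hedged sketch of the kind of check teams end up writing inside tasks; the function name, row-count threshold, and outcomes are invented for illustration, not an Airflow API.

```python
# Sketch of a validation guard bolted onto a task, because the
# orchestrator only understands exit codes. Names are illustrative.

def validate_load(row_count: int, expected_min: int = 1) -> None:
    """Fail the task loudly if the load looks wrong, even though the
    job itself exited cleanly."""
    if row_count < expected_min:
        # Raising turns a "silently wrong" run into a non-zero exit,
        # the only signal the orchestrator actually sees.
        raise ValueError(
            f"Loaded {row_count} rows, expected at least {expected_min}"
        )


# A job can "succeed" (exit 0) while loading nothing; the guard catches it.
try:
    validate_load(row_count=0)
    outcome = "success"
except ValueError:
    outcome = "failed"
```

None of this logic belongs to orchestration, yet it has to live somewhere, and that somewhere is usually the DAG code.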
Is there an easy way?
If you are really keen on using Airflow and you have a budget for the infrastructure costs it brings, use AWS's managed Airflow service (MWAA).
It gives you everything you need minus the hard part of managing the infrastructure. The caveat is still local testing, but with the right Docker setup you do the hard part once and forget about it.
The real takeaway
Airflow isn’t a bad tool (even if I make it sound like one, it really isn’t!).
But it’s often used as a default solution when it should be a deliberate choice.
If you need orchestration for complex workflows, Airflow can help. If you need a data engineering platform, Airflow won’t give you one.
The mistake isn’t using Airflow. The mistake is starting with it.


