Production Starts Before Data: Why Environments, Dependencies, and Linting Matter

Stefan Vodilovski
Feb 1

Most data and AI projects start in the same place: ingestion, pipelines, transformations, models.

That’s not wrong. But it skips the part where many real projects quietly fall apart.

Before extracting a single row of data for your project, you should stop and ask a much more basic question:

Can someone else reproduce this system? If the answer is no, nothing else matters.

Why environment issues are production issues

In theory, environments are boring. In practice, they are one of the biggest sources of friction in real teams.

If you’ve worked on production systems, this will sound familiar:

"It works on my machine"
"Which Python version are you using?"
"It broke after I upgraded a dependency"
"I can’t reproduce this bug locally"
"CI is failing but I don’t know why"

None of these are data or model problems. They are environment problems, and they scale badly.

When environments drift, teams lose trust:

trust in test results
trust in pipelines
trust in each other’s changes

That’s why, for your project, the first thing you should build is the environment.
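One way to make drift visible is to reduce an environment to a single fingerprint that can be compared between a laptop and CI. The sketch below is only an illustration: the function names fingerprint and current_environment are hypothetical, and it assumes that the Python version plus the installed package versions are a good enough description of the environment, which ignores things like system libraries.

```python
import hashlib
import platform
from importlib import metadata


def fingerprint(python_version, packages):
    """Hash a Python version and a {name: version} mapping into one digest."""
    lines = [f"python=={python_version}"]
    # Sort and lowercase names so ordering and casing do not change the digest.
    lines += sorted(f"{name.lower()}=={version}" for name, version in packages.items())
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def current_environment():
    """Collect the running interpreter version and installed distributions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries whose metadata is missing a name
            packages[name] = dist.version
    return platform.python_version(), packages


if __name__ == "__main__":
    # Print the digest for this machine; compare it with the one from CI.
    print(fingerprint(*current_environment()))
```

If two machines print different digests, they are not running the same environment, and "it works on my machine" stops being a mystery.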

Dependency management is not optional

Dependencies are part of your system whether you manage them intentionally or not.

Unpinned dependencies create hidden risk:

installs change over time
behavior shifts without code changes
bugs appear that can’t be reproduced later

For a learning project, that’s annoying. For a production system, it’s dangerous.

So the baseline you should aim for is simple:

dependencies are pinned
installs are deterministic
everyone runs the same versions locally and in CI

The exact tool doesn’t matter as much as the outcome. What matters is that dependency resolution is explicit and repeatable, not implicit and fragile. If you can’t recreate an environment from scratch, you don’t control your system.
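As a rough illustration of "dependencies are pinned", here is a minimal sketch that flags entries without an exact version. It assumes pip-style requirement lines and treats only "==" as a pin. The function name find_unpinned and the example entries are hypothetical, and a lock file produced by your dependency tool is a stronger guarantee than a check like this.

```python
def find_unpinned(requirement_lines):
    """Return requirement entries that do not pin an exact version with '=='."""
    unpinned = []
    for raw in requirement_lines:
        # Drop inline comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blanks and pip options such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        # Environment markers follow ';' and are not part of the version spec.
        spec = line.split(";", 1)[0].strip()
        if "==" not in spec:
            unpinned.append(spec)
    return unpinned


# Illustrative entries only, not a recommendation of specific versions.
example = [
    "# core libraries",
    "pandas==2.2.2",
    "requests>=2.0",
    "numpy",
    "-r dev-requirements.txt",
]
print(find_unpinned(example))  # ['requests>=2.0', 'numpy']
```

Running a check like this in CI turns "we should pin things" into something that fails the build when someone forgets.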

Formatting and linting are not about style

Formatting and linting are often framed as "code quality" or "nice to have", and that undersells them.

In production systems, formatting and linting are about:

reducing ambiguity
preventing accidental complexity
catching issues early, before runtime

Consistent formatting removes entire classes of discussion from code review, and linting catches problems when they’re cheapest to fix. More importantly, both make intent clearer.

When someone new opens the repository, they shouldn’t have to guess:

how code should look
which patterns are acceptable
what will fail CI

These rules shouldn’t live in people’s heads. They should live in the tooling.

Formatting should be enforced, not suggested, and linting should fail fast, not warn quietly.
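To make "fail fast" concrete, here is a minimal sketch of a check runner for CI or a pre-commit hook. It stops at the first failing command and passes that exit code on. The run_checks helper is hypothetical, and the CHECKS list assumes Ruff is installed as the formatter and linter; any tool that exits non-zero on failure would slot in the same way.

```python
import subprocess
import sys


def run_checks(commands):
    """Run each command in order; return the first non-zero exit code, or 0."""
    for command in commands:
        print("running:", " ".join(command))
        result = subprocess.run(command)
        if result.returncode != 0:
            # Fail fast: stop at the first failing check instead of warning.
            return result.returncode
    return 0


# Assumed tooling: Ruff, invoked as a module. Swap in your own commands.
CHECKS = [
    [sys.executable, "-m", "ruff", "format", "--check", "."],
    [sys.executable, "-m", "ruff", "check", "."],
]

if __name__ == "__main__":
    sys.exit(run_checks(CHECKS))
```

The point of returning the exit code rather than printing a warning is that CI treats any non-zero exit as a failed build, so the rule lives in the tooling instead of in a reviewer’s memory.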

Reproducibility is a team feature

Even if you’re building solo, you’re not really building for yourself.

You’re building for:

your future self
a teammate who joins later
a reviewer
a CI system
or a production incident at the worst possible time

Reproducibility means:

a clean setup process
minimal hidden assumptions
no reliance on global state or local hacks

If onboarding takes hours of explanation, the system is already fragile.

A reproducible environment doesn’t make a system faster. It makes it survivable.

Why this comes before ingestion

It’s tempting to start with data, because that’s where things feel concrete. But ingestion amplifies mistakes.

Once data enters the system:

CDC (change data capture) will replicate it perfectly
streaming will move it faster
analytics will trust it implicitly

If your environment is unstable, debugging downstream issues becomes guesswork.

That’s why, in any project, environment setup should come first.
