Production Starts Before Data: Why Environments, Dependencies, and Linting Matter
- Stefan Vodilovski
- Feb 1
Most data and AI projects start at the same place: ingestion, pipelines, transformations, models.
That’s not wrong. But it skips the part where many real projects quietly fall apart.
Before extracting a single row of data for your project, you should stop and ask a much more basic question:
Can someone else reproduce this system? If the answer is no, nothing else matters.

Why environment issues are production issues
In theory, environments are boring. In practice, they are one of the biggest sources of friction in real teams.
If you’ve worked on production systems, this will sound familiar:
"It works on my machine"
"Which Python version are you using?"
"It broke after I upgraded a dependency"
"I can’t reproduce this bug locally"
"CI is failing but I don’t know why"
None of these are data or AI model problems. They are environment problems, and they scale badly.
When environments drift, teams lose trust:
trust in test results
trust in pipelines
trust in each other’s changes
That’s why, for your project, the first thing you should build is the environment.
Dependency management is not optional
Dependencies are part of your system whether you manage them intentionally or not.
Unpinned dependencies create hidden risk:
installs change over time
behavior shifts without code changes
bugs appear that can’t be reproduced later
For a learning project, that’s annoying. For a production system, it’s dangerous.
So the baseline you should aim for is simple:
dependencies are pinned
installs are deterministic
everyone runs the same versions locally and in CI
The exact tool doesn’t matter as much as the outcome. What matters is that dependency resolution is explicit and repeatable, not implicit and fragile. If you can’t recreate an environment from scratch, you don’t control your system.
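As a rough illustration of what "explicit and repeatable" can look like, here is a minimal sketch that compares a list of pinned versions against what is actually installed. It assumes pins are written as plain `name==version` lines; real requirement files can also carry extras, markers, and hashes, which this sketch ignores. The function names (`parse_pins`, `find_drift`) are made up for this example, and it does not replace a proper lock file.

```python
from importlib import metadata


def parse_pins(lines):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if "==" not in line:
            # An unpinned requirement is exactly the hidden risk we want to surface.
            raise ValueError(f"unpinned requirement: {line!r}")
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_drift(pins, lookup=installed_version):
    """Return {name: (pinned, installed)} for every package that does not match."""
    drift = {}
    for name, pinned in pins.items():
        found = lookup(name)
        if found != pinned:
            drift[name] = (pinned, found)
    return drift
```

Run locally and in CI, an empty result from `find_drift` means the environment matches the pins, and anything else is drift that can be reported before it turns into a bug nobody can reproduce.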
Formatting and linting are not about style
Formatting and linting are often framed as "code quality" or "nice to have", and that undersells them.
In production systems, formatting and linting are about:
reducing ambiguity
preventing accidental complexity
catching issues early, before runtime
Consistent formatting removes entire classes of discussions from code review, and linting catches problems when they’re cheapest to fix. More importantly, both make intent clearer.
When someone new opens the repository, they shouldn’t have to guess:
how code should look
which patterns are acceptable
what will fail CI
These rules shouldn’t live in people’s heads. They should live in the tooling.
Formatting should be enforced, not suggested, and linting should fail fast, not warn quietly.
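One way to picture "fail fast, not warn quietly" is a small gate script that runs each check in order and stops at the first failure. This is a sketch, not a recommendation for a specific tool: `run_checks` is a hypothetical helper, and the commands passed to it would be whatever formatter and linter the project has chosen.

```python
import subprocess
import sys


def run_checks(checks):
    """Run each (label, command) pair in order and stop at the first failure.

    Returns 0 if every check passed, otherwise the failing command's exit code.
    """
    for label, command in checks:
        result = subprocess.run(command)
        if result.returncode != 0:
            # Report on stderr and stop: later checks are not run.
            print(f"FAILED: {label} (exit code {result.returncode})", file=sys.stderr)
            return result.returncode
        print(f"ok: {label}")
    return 0
```

In CI, the script would pass the return value to `sys.exit`, for example with a formatter in check-only mode followed by a linter. A non-zero exit blocks the merge, which is what moves the rules out of people’s heads and into the tooling.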
Reproducibility is a team feature
Even if you’re building solo, you’re not really building for yourself.
You’re building for:
your future self
a teammate who joins later
a reviewer
a CI system
or a production incident at the worst possible time
Reproducibility means:
a clean setup process
minimal hidden assumptions
no reliance on global state or local hacks
If onboarding takes hours of explanation, the system is already fragile.
A reproducible environment doesn’t make a system faster; it makes it survivable.
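A cheap way to surface hidden assumptions is to reduce an environment to a single comparable value. The sketch below hashes the Python version together with every installed package version, so a laptop and a CI runner can print the same short ID or a visibly different one. The function names are invented for this example. It ignores anything outside Python packages (OS libraries, environment variables), so treat it as a starting point.

```python
import hashlib
import platform
from importlib import metadata


def installed_packages():
    """Return {name: version} for every distribution visible to this interpreter."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that have no usable metadata
            packages[name.lower()] = dist.version
    return packages


def fingerprint(python_version, packages):
    """Hash the interpreter version and sorted package versions into a short ID."""
    lines = [f"python=={python_version}"]
    lines += [f"{name}=={version}" for name, version in sorted(packages.items())]
    digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
    return digest[:12]


def current_fingerprint():
    """Fingerprint the environment this script is running in."""
    return fingerprint(platform.python_version(), installed_packages())
```

Printing `current_fingerprint()` at the start of a test run or pipeline job makes "which Python version are you using?" a question the logs can answer.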
Why this comes before ingestion
It’s tempting to start with data, because that’s where things feel concrete. But ingestion amplifies mistakes.
Once data enters the system:
change data capture (CDC) will replicate it perfectly
streaming will move it faster
analytics will trust it implicitly
If your environment is unstable, debugging downstream issues becomes guesswork.
That’s why, in any project, environment setup should come first.