🧩 Data Engineering Practice Problems

After solving 1,500+ problems on LeetCode and Codeforces, I realized that none of them had prepared me for broken CSVs, delayed Kafka messages, or JSONs that lie.

This repo is for engineers who’ve had enough of toy problems. It’s a collection of real-world data engineering scenarios — short, practical exercises inspired by what actually breaks in production.

Why I Built This

Most practice problems test logic. Production tests resilience.

In production, problems don’t come with test cases — they come with missing data, bad assumptions, and time pressure.

So I started collecting real scenarios I’ve seen:

Kafka topics that deliver data hours late
CSVs with 2 million rows and 6 different date formats
JSON events that gain new fields mid-release
ETL jobs that “succeed” but quietly skip records
Dashboards that stop updating without raising any error
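To make two of these concrete (mixed date formats and quietly skipped records), here is a minimal sketch, not a reference solution from this repo. The format list and the function names `parse_date` and `normalize_dates` are hypothetical; a real partner feed would need its own list, and ambiguous inputs such as 03/04/2024 still need a rule agreed with the data owner. The point is that every row ends up either parsed or rejected, so nothing disappears silently.

```python
from datetime import datetime

# Hypothetical candidate formats, tried in order. Replace with the
# formats actually observed in your feed.
CANDIDATE_FORMATS = (
    "%Y-%m-%d",   # 2024-03-05
    "%d/%m/%Y",   # 05/03/2024 (assumed day-first)
    "%m-%d-%Y",   # 03-05-2024 (assumed month-first)
    "%Y%m%d",     # 20240305
)


def parse_date(raw):
    """Return a datetime for the first matching format, or None."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def normalize_dates(values):
    """Split raw strings into ISO dates and (row_number, raw) rejects."""
    parsed, rejected = [], []
    for row_number, raw in enumerate(values, start=1):
        result = parse_date(raw)
        if result is None:
            # Keep the reject instead of dropping it, so it can be logged.
            rejected.append((row_number, raw))
        else:
            parsed.append(result.date().isoformat())
    return parsed, rejected


if __name__ == "__main__":
    rows = ["2024-03-05", "05/03/2024", "03-05-2024", "20240305", "n/a"]
    good, bad = normalize_dates(rows)
    # Reconciliation check: every input row is accounted for.
    assert len(good) + len(bad) == len(rows)
    print(good, bad)
```

Checking that parsed plus rejected equals the input count is the cheapest guard against a job that “succeeds” while skipping rows.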

What’s Inside

Category	Scenario	What You’ll Practice
Late Data	10 GB of IoT logs arriving out of order	Handle streaming delays without duplication
Schema Drift	JSON events adding new fields mid-release	Validate and evolve safely
ETL Reliability	Long-running jobs silently skipping records	Detect silent corruptions before they spread
Data Hygiene	Partner CSVs with missing headers and fake nulls	Clean data in one pass and log every fix
Rolling Analytics	Continuous sensor feeds with infinite rows	Keep rolling metrics in memory without dying
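As an illustration of the Rolling Analytics row, the sketch below keeps a rolling mean over an unbounded feed while holding only the last `window` readings in memory. It is one possible approach, not the repo's reference solution; the `RollingMean` class and `rolling_means` helper are hypothetical names.

```python
from collections import deque


class RollingMean:
    """Rolling mean over the most recent `window` values."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        # deque(maxlen=...) evicts the oldest value automatically on append.
        self._values = deque(maxlen=window)
        self._total = 0.0

    def add(self, value):
        """Record one reading and return the current rolling mean."""
        if len(self._values) == self._values.maxlen:
            # The oldest value is about to be evicted; remove it from the sum.
            self._total -= self._values[0]
        self._values.append(value)
        self._total += value
        return self._total / len(self._values)


def rolling_means(readings, window):
    """Lazily yield the rolling mean after each reading."""
    tracker = RollingMean(window)
    for reading in readings:
        yield tracker.add(reading)


if __name__ == "__main__":
    print(list(rolling_means([1, 2, 3, 4, 5], window=3)))
```

Memory stays proportional to the window size regardless of how long the feed runs. One caveat: a running float sum can drift slightly over very long streams, so a production version might periodically recompute the total from the window.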

More scenarios are on the way.

Each problem is small enough to solve in hours, but real enough to prepare you for production.
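To give a feel for the scale of these exercises, here is a minimal sketch in the spirit of the Schema Drift scenario. It assumes a hypothetical three-field event schema (`event_id`, `timestamp`, `value`); the actual problem in this repo may define something different. Required fields are checked strictly, while unknown fields are tolerated and reported rather than rejected, so a new field added mid-release does not break the consumer but also does not go unnoticed.

```python
# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {
    "event_id": str,
    "timestamp": str,
    "value": float,
}


def validate_event(event):
    """Return (errors, unknown_fields) for one decoded JSON event."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif not isinstance(event[name], expected):
            actual = type(event[name]).__name__
            errors.append(
                f"bad type for {name}: expected {expected.__name__}, got {actual}"
            )
    # Fields outside the schema are surfaced, not treated as errors.
    unknown = sorted(set(event) - set(REQUIRED_FIELDS))
    return errors, unknown


if __name__ == "__main__":
    event = {
        "event_id": "e1",
        "timestamp": "2024-03-05T10:00:00",
        "value": 21.5,
        "firmware": "2.1",  # field added mid-release
    }
    print(validate_event(event))
```

Logging the unknown-field list per batch is one simple way to notice drift early and decide whether the schema should evolve.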

Getting Started

# 1. Make sure you are on Python 3.10+
python --version

# 2. Set up your environment
python -m venv venv && source venv/bin/activate

# 3. Pick a problem
#    Each folder has a question.md and a reference solution.py

Inputs live in data/, and outputs are generated beside them for easy inspection. Data files are intentionally excluded from the repo to keep it lightweight.

How to Contribute

If you’ve debugged a broken pipeline, caught a silent bug before it spread, built a clever patch that saved a release, or found a way to clean a 5 GB CSV in one pass, your story belongs here.

Add a new scenario, or improve an existing one. See the Contribution Guide for details.

The goal isn’t to practice coding. It’s to practice judgment — the kind that keeps systems running when logic alone isn’t enough.

⭐ Star the repo if you’ve ever learned more from production than from tutorials.

Problems currently in the repo:

Problem 1: Log File Error Analysis
Problem 2: Rolling Average of Sensor Readings
Problem 3: Transform and Clean Raw Data for Analytics
Problem 4: Schema Evolution & Validation for Streaming Events

shiningflash/data-engineering-practice-problems
