Building the Pipeline

Challenges, architectural decisions, and lessons learned.

Project Overview

The EU Energy Grid Monitor is a data engineering pipeline that ingests, processes, and visualizes real-time energy data from across Europe. It is built with a decoupled microservices architecture using Python, Docker, Apache Kafka, and CockroachDB, designed to run entirely on free-tier infrastructure.

1. Motivation & Constraints

I was motivated to do this project following my exchange semester at IE University in Madrid, Spain. I took a "Big Data Technologies" course where we studied data streaming concepts and Apache Kafka theoretically, but never had the opportunity to implement them in a live environment.

I built this project to bridge that gap between theory and practice. My primary goal was to learn data engineering and streaming principles hands-on. To focus on learning and improving my skills, I imposed a strict requirement: the entire infrastructure cost must be $0. This constraint drove every major architectural decision, from the choice of database to the hosting strategy.

2. The Pipeline Architecture

While the data ingestion could technically be handled by a simple cron job, I architected a streaming solution to gain experience with event-driven systems. The core pipeline consists of three decoupled Python applications orchestrated via Docker:

Ingest App: Polls the ENTSO-E API hourly, parsing raw XML documents and producing events to specific Kafka topics (e.g., raw-generation-events).
Process App: Consumes raw events and enriches them with business logic: mapping EIC codes to readable country names, calculating carbon emissions based on generation type, and standardizing time intervals.
Storage App: A sink service that consumes enriched events and persists them to CockroachDB.
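As a rough illustration of the ingest step (not the project's actual code), the sketch below flattens one XML document into per-interval events. The sample payload, the element names, and the function parse_generation_events are simplified assumptions; real ENTSO-E documents are namespaced and carry more fields. Each resulting dict could then be serialized and handed to a Kafka producer, which is left out here to keep the example self-contained.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

# Simplified, hypothetical payload: real ENTSO-E documents are namespaced
# and contain more fields than shown here.
SAMPLE_XML = """
<Document>
  <TimeSeries>
    <domain>10YFR-RTE------C</domain>
    <psrType>B16</psrType>
    <Period>
      <start>2025-01-01T00:00Z</start>
      <Point><position>1</position><quantity>120</quantity></Point>
      <Point><position>2</position><quantity>135</quantity></Point>
    </Period>
  </TimeSeries>
</Document>
"""


def parse_generation_events(xml_text, step=timedelta(hours=1)):
    """Flatten one XML document into a list of per-interval event dicts."""
    root = ET.fromstring(xml_text.strip())
    events = []
    for series in root.iter("TimeSeries"):
        domain = series.findtext("domain")
        psr_type = series.findtext("psrType")
        for period in series.iter("Period"):
            start = datetime.strptime(
                period.findtext("start"), "%Y-%m-%dT%H:%MZ"
            ).replace(tzinfo=timezone.utc)
            for point in period.iter("Point"):
                # Positions are 1-based offsets from the period start.
                position = int(point.findtext("position"))
                events.append({
                    "domain": domain,
                    "psr_type": psr_type,
                    "timestamp": (start + (position - 1) * step).isoformat(),
                    "quantity_mw": float(point.findtext("quantity")),
                })
    return events


events = parse_generation_events(SAMPLE_XML)
```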

The applications run on a free Oracle VM (1 GB RAM), using swap space to manage memory constraints. They do not communicate directly; instead, they rely on a Kafka cluster hosted on Confluent Cloud, ensuring the system remains decoupled and resilient.

To ensure extensibility, I used an object-oriented approach. Adding a new metric (e.g., load or transmission) requires minimal boilerplate: a new class that inherits from the base ingestion logic. The business logic is backed by a pytest suite to prevent regressions during updates.
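One way such a class hierarchy could look is sketched below. Every name here (BaseIngestor, LoadIngestor, the fetch and publish callables, the raw-load-events topic, and the comma-separated payload) is hypothetical and only illustrates the pattern: the base class owns the fetch, parse, publish flow, and a new metric supplies its own topic and parser.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class BaseIngestor(ABC):
    """Shared fetch -> parse -> publish flow; subclasses define the specifics."""

    topic: str
    document_type: str

    def __init__(self, fetch: Callable, publish: Callable):
        self._fetch = fetch        # e.g. a thin wrapper around an HTTP client
        self._publish = publish    # e.g. a thin wrapper around a Kafka producer

    @abstractmethod
    def parse(self, raw: str) -> List[Dict]:
        """Turn one raw response into a list of event dicts."""

    def run(self, start, end) -> int:
        raw = self._fetch(self.document_type, start, end)
        events = self.parse(raw)
        for event in events:
            self._publish(self.topic, event)
        return len(events)


class LoadIngestor(BaseIngestor):
    """A new metric only needs a topic, a document type, and a parser."""

    topic = "raw-load-events"
    document_type = "load"

    def parse(self, raw: str) -> List[Dict]:
        # Toy parser over a comma-separated string, standing in for XML parsing.
        return [{"load_mw": float(value)} for value in raw.split(",") if value]


# Usage with in-memory stand-ins for the API client and the producer.
sent = []
ingestor = LoadIngestor(
    fetch=lambda doc_type, start, end: "1,2.5",
    publish=lambda topic, event: sent.append((topic, event)),
)
published = ingestor.run("2025-01-01T00:00Z", "2025-01-01T04:00Z")
```

Injecting fetch and publish as callables also keeps the parsing logic testable with pytest without a network connection or a broker.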

3. Data Challenges & Solutions

Domain Complexity

The energy domain is complex. Mapping EIC (Energy Identification Code) values to specific regions was difficult, as some codes represent single countries while others represent specific bidding zones or multiple countries. Furthermore, finding reliable carbon intensity factors for specific generation types required extensive research. I ensured accuracy by sourcing and citing every CO2 figure used in the mapping logic.
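To make the mapping problem concrete, here is a minimal sketch of an enrichment lookup. It is not the project's mapping: the two EIC entries are examples that should be verified against the official ENTSO-E code list, the emission factors are placeholders rather than sourced figures, and the names EIC_TO_REGION, EMISSION_FACTORS, and enrich are assumptions. Each factor carries a source field so that every figure stays traceable to a citation.

```python
# Example entries only; verify codes against the official ENTSO-E list.
# The second field records whether a code is a country or a bidding zone.
EIC_TO_REGION = {
    "10YFR-RTE------C": ("France", "country"),
    "10Y1001A1001A82H": ("Germany-Luxembourg", "bidding_zone"),
}

# PLACEHOLDER values in gCO2 per kWh, not sourced figures.
EMISSION_FACTORS = {
    "gas": {"g_per_kwh": 400.0, "source": "PLACEHOLDER: cite real source"},
    "solar": {"g_per_kwh": 0.0, "source": "PLACEHOLDER: cite real source"},
}


def estimate_emissions_kg(quantity_mw, hours, g_per_kwh):
    """MW * h -> MWh -> kWh, times g/kWh -> grams, then grams -> kg."""
    energy_kwh = quantity_mw * 1000 * hours
    return energy_kwh * g_per_kwh / 1000


def enrich(event, hours=1.0):
    """Attach a readable region and estimated emissions; None if unmapped."""
    region = EIC_TO_REGION.get(event["domain"])
    factor = EMISSION_FACTORS.get(event["psr_type"])
    if region is None or factor is None:
        return None  # unknown code or generation type: skip, do not guess
    name, kind = region
    return {
        **event,
        "region": name,
        "region_kind": kind,
        "emissions_kg": estimate_emissions_kg(
            event["quantity_mw"], hours, factor["g_per_kwh"]
        ),
    }


enriched = enrich(
    {"domain": "10YFR-RTE------C", "psr_type": "gas", "quantity_mw": 100.0}
)
unknown = enrich({"domain": "UNKNOWN", "psr_type": "gas", "quantity_mw": 1.0})
```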

Scaling & Storage

I initially built the persistence layer on Supabase (PostgreSQL). However, the backfilling process quickly exceeded its free-tier limits, so I migrated to CockroachDB, which handled the full volume. As of December 2025, the database hosts over 12 million generation events and 1.8 million price events.

Data Consistency & Idempotency

The ENTSO-E API does not guarantee that data for the most recent hour is available immediately. To handle this, the ingest app uses a two-tier backfill strategy:

Every hour, it requests the last 4 hours of data.
Every day, it performs a deep fetch of the last 3 days.
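The two lookback windows above can be sketched as a small helper. The function names and the choice to align the window to the top of the hour are assumptions; the compact yyyyMMddHHmm timestamp format is, to my understanding, what the ENTSO-E API expects for its period parameters, but it should be checked against the API guide.

```python
from datetime import datetime, timedelta, timezone

HOURLY_LOOKBACK = timedelta(hours=4)  # requested every hour
DAILY_LOOKBACK = timedelta(days=3)    # requested once a day


def backfill_window(now, lookback):
    """Return (start, end), with end aligned to the top of the current hour."""
    end = now.replace(minute=0, second=0, microsecond=0)
    return end - lookback, end


def to_period_param(ts):
    """Format a UTC timestamp as yyyyMMddHHmm."""
    return ts.strftime("%Y%m%d%H%M")


now = datetime(2025, 12, 1, 14, 37, tzinfo=timezone.utc)
hourly_start, hourly_end = backfill_window(now, HOURLY_LOOKBACK)
daily_start, daily_end = backfill_window(now, DAILY_LOOKBACK)
```

Because consecutive windows overlap, the same interval is requested several times, which is why the storage side has to tolerate duplicates.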

To support this, the storage logic was designed to be idempotent. It tolerates the same record arriving more than once, so aggressive backfilling fills in missing data without corrupting the tables.
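A common way to get this kind of idempotency is an upsert keyed on the natural identity of a record. The sketch below uses Python's built-in sqlite3 as a stand-in so that it runs anywhere; it is not the project's storage code, and the table and column names are hypothetical. CockroachDB accepts the same INSERT ... ON CONFLICT ... DO UPDATE form, so the idea should carry over.

```python
import sqlite3

# Hypothetical schema: the primary key is the natural identity of a reading.
SCHEMA = """
CREATE TABLE generation (
    region      TEXT NOT NULL,
    psr_type    TEXT NOT NULL,
    ts          TEXT NOT NULL,
    quantity_mw REAL NOT NULL,
    PRIMARY KEY (region, psr_type, ts)
)
"""

# Re-inserting the same key updates the row instead of duplicating it.
UPSERT = """
INSERT INTO generation (region, psr_type, ts, quantity_mw)
VALUES (:region, :psr_type, :ts, :quantity_mw)
ON CONFLICT (region, psr_type, ts)
DO UPDATE SET quantity_mw = excluded.quantity_mw
"""


def store(conn, events):
    """Write a batch in one transaction; safe to repeat with the same data."""
    with conn:
        conn.executemany(UPSERT, events)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

reading = {"region": "France", "psr_type": "solar",
           "ts": "2025-01-01T00:00:00+00:00", "quantity_mw": 120.0}
store(conn, [reading])
store(conn, [reading])                            # duplicate arrival
store(conn, [{**reading, "quantity_mw": 125.0}])  # later correction

row_count = conn.execute("SELECT COUNT(*) FROM generation").fetchone()[0]
stored_value = conn.execute("SELECT quantity_mw FROM generation").fetchone()[0]
```

After the three writes the table still holds one row, and it carries the corrected value.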

4. The Frontend

Visualizing the data presented an unexpected deployment challenge. I originally built comprehensive dashboards using Grafana. However, I realized that safely hosting a public Grafana instance was not feasible on a $0 budget without exposing the database to expensive, uncontrolled queries. To resolve this, I pivoted to a static dashboard architecture using GitHub Pages.

Currently, a Python script runs every 4 hours via GitHub Actions, queries CockroachDB, and saves the results as static JSON files on a dedicated GitHub branch.
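The export half of such a script might look like the sketch below. The function export_json, the file name, and the payload shape are assumptions, and the database query is replaced by a hard-coded list of rows so the example stays self-contained.

```python
import json
import tempfile
from pathlib import Path


def export_json(rows, out_path, generated_at):
    """Write query results as a compact JSON file for the static frontend."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"generated_at": generated_at, "rows": rows}
    out_path.write_text(
        json.dumps(payload, separators=(",", ":")), encoding="utf-8"
    )
    return out_path


# Stand-in for rows returned by a database query.
rows = [{"region": "France", "day": "2025-12-01", "generation_mwh": 1234.5}]

with tempfile.TemporaryDirectory() as tmp:
    written = export_json(
        rows, Path(tmp) / "data" / "generation_daily.json", "2025-12-01T12:00:00Z"
    )
    loaded = json.loads(written.read_text(encoding="utf-8"))
```

Including a generated_at field lets the frontend show how fresh the data is, since the files are only as current as the last scheduled run.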

To replace the Grafana dashboards, I built a custom frontend using Apache ECharts. I used AI tools to accelerate the development of the frontend, allowing me to focus on the backend. The result is a zero-cost dashboard that refreshes every 4 hours without any risk of expensive queries.

5. Future Improvements

The project is functional today. In the future, I'd like to add:

Cross-border flows to visualise energy imports and exports.
Energy load metrics to analyse grid stress.
Possibly a notification service for specific grid events like negative pricing.