Introduce ENTSO-E data retrieval with layered caching, robust bidding-zone and missing-data handling, and persist model-ready features with detailed architecture/developer documentation. Made-with: Cursor
Electricity Price Predictor
Standalone module for ENTSO-E ingestion and feature-store creation in quant_db.
Documentation
- High-level architecture:
docs/architecture.md - Detailed developer documentation (file-by-file + UML):
docs/developer_guide.md
What it builds
- Input columns:
day_ahead_priceload_forecastwind_forecastsolar_forecast
- Derived columns:
residual_loadlagged_price(t-1...t-24, stored as array length 24)lagged_residual_load(t-1...t-24, stored as array length 24)hour_of_day_sin,hour_of_day_cosweekday_sin,weekday_cosmonth_sin,month_cos
Column units (data dictionary)
All timestamps are hourly and stored in UTC (delivery_start).
day_ahead_price- Unit:
EUR/MWh(euros per megawatt-hour).
- Unit:
load_forecast- Unit:
MW(megawatts).
- Unit:
wind_forecast- Unit:
MW(megawatts).
- Unit:
solar_forecast- Unit:
MW(megawatts).
- Unit:
residual_load = load_forecast - wind_forecast - solar_forecast- Unit:
MW(megawatts).
- Unit:
lagged_price(array of 24 values fort-1..t-24)- Unit:
EUR/MWh.
- Unit:
lagged_residual_load(array of 24 values fort-1..t-24)- Unit:
MW.
- Unit:
hour_of_day_sin,hour_of_day_cos- Unit: dimensionless in
[-1, 1].
- Unit: dimensionless in
weekday_sin,weekday_cos- Unit: dimensionless in
[-1, 1].
- Unit: dimensionless in
month_sin,month_cos- Unit: dimensionless in
[-1, 1].
- Unit: dimensionless in
Missing-data semantics
The pipeline intentionally distinguishes missing from measured zero:
- Forecast columns (
load_forecast,wind_forecast,solar_forecast) remainNaNwhen ENTSO-E has no value. residual_loadis computed directly from source columns and remainsNaNwhen any source component is missing.- Feature engineering drops rows only for lag warmup requirements (
day_ahead_priceandlagged_price_1..24), not for every nullable forecast column. - During DB persistence to
electricity_price_features, rows with nulls in NOT NULL core columns are skipped (to satisfy schema constraints) while still being available in the returned in-memory DataFrame.
Data contracts
In-memory contract (run_feature_pipeline(...) return value)
- Index:
- Type: timezone-aware
DatetimeIndex - Granularity: hourly
- Timezone: UTC
- Uniqueness: unique timestamps expected
- Type: timezone-aware
- Columns:
- Base signals:
day_ahead_price,load_forecast,wind_forecast,solar_forecast - Derived:
residual_load - Lag vectors (expanded):
lagged_price_1..24,lagged_residual_load_1..24 - Cyclical:
hour_of_day_sin,hour_of_day_cos,weekday_sin,weekday_cos,month_sin,month_cos
- Base signals:
- Nullability:
day_ahead_price: required for returned rows.lagged_price_1..24: required for returned rows.load_forecast,wind_forecast,solar_forecast,residual_load, andlagged_residual_load_*: nullable (NaNallowed).
Persistence contract (electricity_price_features)
- Persisted row key:
- (
country_code,delivery_start,feature_version)
- (
- NOT NULL core columns in schema:
day_ahead_price,load_forecast,wind_forecast,solar_forecast,residual_loadlagged_price,lagged_residual_load- cyclical columns (
hour_of_day_*,weekday_*,month_*)
- Write behavior:
- The persistence layer filters out non-conforming rows (nulls in NOT NULL core columns) before UPSERT.
- Persisted lag arrays are fixed length 24 and map to
t-1..t-24.
Raw observations contract (electricity_market_observations)
- Indexing/key semantics:
- one row per (
country_code,delivery_start)
- one row per (
- Update semantics:
- partial refreshes do not overwrite existing non-null values with nulls (
COALESCEmerge policy)
- partial refreshes do not overwrite existing non-null values with nulls (
Country-code and API behavior notes
- Use ENTSO-E bidding-zone identifiers (for example
DE_LUrather thanDE) when querying across all endpoints. - The pipeline includes a bidding-zone resolver for common country aliases:
DE -> DE_LUIT -> IT_NORD
- Persistence uses the resolved bidding-zone code so DB keys match the queried market zone.
- Some ENTSO-E endpoints may return no matches for specific windows/countries. These are handled as empty hourly frames for that endpoint rather than hard-failing the whole fetch.
- Wind/solar responses can include duplicate semantic columns after normalization; the service coalesces duplicates by taking the first non-null value per timestamp.
Database objects
sql/001_electricity_price_schema.sql creates:
entsoe_api_cache: generic decorator cache table (pickled payloads, TTL support)electricity_market_observations: raw hourly ENTSO-E observationselectricity_price_features: model-ready feature store
Setup
cd electricity_price_predictor
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Set env vars:
export ENTSOE_API_KEY="your_entsoe_key"
export QUANT_DB_HOST="localhost"
export QUANT_DB_PORT="5432"
export QUANT_DB_NAME="quant_db"
export QUANT_DB_USER="quant_user"
export QUANT_DB_PASSWORD="strong_password"
Initialize schema
PYTHONPATH=src python scripts/init_db.py
Build feature store
PYTHONPATH=src python scripts/build_feature_store.py \
--country-code DE_LU \
--start 2026-01-01T00:00:00Z \
--end 2026-02-01T00:00:00Z \
--cache-ttl-hours 24
Decorator caching behavior
The cache_to_db decorator:
- hashes function name + arguments into deterministic
cache_key - checks
entsoe_api_cachefirst - returns cached payload if key exists and not expired
- otherwise uses
electricity_market_observationsas secondary cache at timestamp level - only calls ENTSO-E for missing hourly intervals, then upserts those rows
- stores final returned object in
entsoe_api_cache
This gives a two-layer cache:
- fast function-result cache (
entsoe_api_cache) and - canonical timestamp cache (
electricity_market_observations).