Files
pricing/electricity_price_predictor/README.md
ddoebel 73641b7e5b Add electricity price ingestion and feature pipeline.
Introduce ENTSO-E data retrieval with layered caching, robust bidding-zone and missing-data handling, and persist model-ready features with detailed architecture/developer documentation.

Made-with: Cursor
2026-04-15 11:40:14 +02:00

165 lines
5.8 KiB
Markdown

# Electricity Price Predictor
Standalone module for ENTSO-E ingestion and feature-store creation in `quant_db`.
## Documentation
- High-level architecture: `docs/architecture.md`
- Detailed developer documentation (file-by-file + UML): `docs/developer_guide.md`
## What it builds
- Input columns:
- `day_ahead_price`
- `load_forecast`
- `wind_forecast`
- `solar_forecast`
- Derived columns:
- `residual_load`
- `lagged_price` (`t-1` ... `t-24`, stored as array length 24)
- `lagged_residual_load` (`t-1` ... `t-24`, stored as array length 24)
- `hour_of_day_sin`, `hour_of_day_cos`
- `weekday_sin`, `weekday_cos`
- `month_sin`, `month_cos`
## Column units (data dictionary)
All timestamps are hourly and stored in UTC (`delivery_start`).
- `day_ahead_price`
- Unit: `EUR/MWh` (euros per megawatt-hour).
- `load_forecast`
- Unit: `MW` (megawatts).
- `wind_forecast`
- Unit: `MW` (megawatts).
- `solar_forecast`
- Unit: `MW` (megawatts).
- `residual_load = load_forecast - wind_forecast - solar_forecast`
- Unit: `MW` (megawatts).
- `lagged_price` (array of 24 values for `t-1..t-24`)
- Unit: `EUR/MWh`.
- `lagged_residual_load` (array of 24 values for `t-1..t-24`)
- Unit: `MW`.
- `hour_of_day_sin`, `hour_of_day_cos`
- Unit: dimensionless in `[-1, 1]`.
- `weekday_sin`, `weekday_cos`
- Unit: dimensionless in `[-1, 1]`.
- `month_sin`, `month_cos`
- Unit: dimensionless in `[-1, 1]`.
## Missing-data semantics
The pipeline intentionally distinguishes **missing** from **measured zero**:
- Forecast columns (`load_forecast`, `wind_forecast`, `solar_forecast`) remain `NaN` when ENTSO-E has no value.
- `residual_load` is computed directly from source columns and remains `NaN` when any source component is missing.
- Feature engineering drops rows only for lag warmup requirements (`day_ahead_price` and `lagged_price_1..24`), not for every nullable forecast column.
- During DB persistence to `electricity_price_features`, rows with nulls in NOT NULL core columns are skipped (to satisfy schema constraints) while still being available in the returned in-memory DataFrame.
## Data contracts
### In-memory contract (`run_feature_pipeline(...)` return value)
- Index:
- Type: timezone-aware `DatetimeIndex`
- Granularity: hourly
- Timezone: UTC
- Uniqueness: unique timestamps expected
- Columns:
- Base signals: `day_ahead_price`, `load_forecast`, `wind_forecast`, `solar_forecast`
- Derived: `residual_load`
- Lag vectors (expanded): `lagged_price_1..24`, `lagged_residual_load_1..24`
- Cyclical: `hour_of_day_sin`, `hour_of_day_cos`, `weekday_sin`, `weekday_cos`, `month_sin`, `month_cos`
- Nullability:
- `day_ahead_price`: required for returned rows.
- `lagged_price_1..24`: required for returned rows.
- `load_forecast`, `wind_forecast`, `solar_forecast`, `residual_load`, and `lagged_residual_load_*`: nullable (`NaN` allowed).
### Persistence contract (`electricity_price_features`)
- Persisted row key:
- (`country_code`, `delivery_start`, `feature_version`)
- NOT NULL core columns in schema:
- `day_ahead_price`, `load_forecast`, `wind_forecast`, `solar_forecast`, `residual_load`
- `lagged_price`, `lagged_residual_load`
- cyclical columns (`hour_of_day_*`, `weekday_*`, `month_*`)
- Write behavior:
- The persistence layer filters out non-conforming rows (nulls in NOT NULL core columns) before UPSERT.
- Persisted lag arrays are fixed length 24 and map to `t-1..t-24`.
### Raw observations contract (`electricity_market_observations`)
- Indexing/key semantics:
- one row per (`country_code`, `delivery_start`)
- Update semantics:
- partial refreshes do not overwrite existing non-null values with nulls (`COALESCE` merge policy)
## Country-code and API behavior notes
- Use ENTSO-E bidding-zone identifiers (for example `DE_LU` rather than `DE`) when querying across all endpoints.
- The pipeline includes a bidding-zone resolver for common country aliases:
- `DE -> DE_LU`
- `IT -> IT_NORD`
- Persistence uses the resolved bidding-zone code so DB keys match the queried market zone.
- Some ENTSO-E endpoints may return no matches for specific windows/countries. These are handled as empty hourly frames for that endpoint rather than hard-failing the whole fetch.
- Wind/solar responses can include duplicate semantic columns after normalization; the service coalesces duplicates by taking the first non-null value per timestamp.
## Database objects
`sql/001_electricity_price_schema.sql` creates:
- `entsoe_api_cache`: generic decorator cache table (pickled payloads, TTL support)
- `electricity_market_observations`: raw hourly ENTSO-E observations
- `electricity_price_features`: model-ready feature store
## Setup
```bash
cd electricity_price_predictor
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Set env vars:
```bash
export ENTSOE_API_KEY="your_entsoe_key"
export QUANT_DB_HOST="localhost"
export QUANT_DB_PORT="5432"
export QUANT_DB_NAME="quant_db"
export QUANT_DB_USER="quant_user"
export QUANT_DB_PASSWORD="strong_password"
```
## Initialize schema
```bash
PYTHONPATH=src python scripts/init_db.py
```
## Build feature store
```bash
PYTHONPATH=src python scripts/build_feature_store.py \
--country-code DE_LU \
--start 2026-01-01T00:00:00Z \
--end 2026-02-01T00:00:00Z \
--cache-ttl-hours 24
```
## Decorator caching behavior
The `cache_to_db` decorator:
- hashes function name + arguments into deterministic `cache_key`
- checks `entsoe_api_cache` first
- returns cached payload if key exists and not expired
- otherwise uses `electricity_market_observations` as secondary cache at timestamp level
- only calls ENTSO-E for missing hourly intervals, then upserts those rows
- stores final returned object in `entsoe_api_cache`
This gives a two-layer cache:
1) fast function-result cache (`entsoe_api_cache`) and
2) canonical timestamp cache (`electricity_market_observations`).