pricing/electricity_price_predictor/docs/architecture.md
ddoebel 73641b7e5b Add electricity price ingestion and feature pipeline.
Introduce ENTSO-E data retrieval with layered caching, robust bidding-zone and missing-data handling, and persist model-ready features with detailed architecture/developer documentation.

Made-with: Cursor
2026-04-15 11:40:14 +02:00

Architecture Notes

This document is a quick architecture map. For file-by-file implementation details, see docs/developer_guide.md.

End-to-end data flow

  1. scripts/build_feature_store.py parses CLI arguments and validates env vars.
  2. It calls pipeline.run_feature_pipeline(...).
  3. EntsoeDataService.fetch_inputs(...) loads:
    • day_ahead_price
    • load_forecast
    • wind_forecast
    • solar_forecast
  4. Each ENTSO-E call is wrapped by cache_to_db(...), which resolves the request in this order:
    • serves a hit from entsoe_api_cache if one exists;
    • otherwise reads already-known timestamps from electricity_market_observations;
    • calls the API only for the hourly intervals that are still missing.
  5. Rows returned by the API for the missing intervals are upserted into electricity_market_observations.
  6. The final merged result is cached in entsoe_api_cache.
  7. Raw merged series are upserted to electricity_market_observations.
  8. features.build_feature_frame(...) computes:
    • residual_load
    • lagged arrays (24 values each)
    • cyclical encodings for hour/weekday/month
    • NaN placeholders for missing forecast-derived values (no imputation)
  9. pipeline.persist_feature_frame(...) filters out rows that violate the feature table's NOT NULL constraints, then upserts the remaining model-ready rows to electricity_price_features.
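
The "API calls only for missing hourly intervals" behaviour in step 4 can be sketched as follows. This is a minimal illustration, not the project's code: the function name missing_hourly_ranges and the half-open (start, end) range convention are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

HOUR = timedelta(hours=1)


def missing_hourly_ranges(start, end, known):
    """Group hourly timestamps in [start, end) that are absent from `known`
    into contiguous half-open (range_start, range_end) gaps.

    Hypothetical helper: the real pipeline may detect gaps differently.
    """
    ranges = []
    gap_start = None
    ts = start
    while ts < end:
        if ts not in known:
            # Open a new gap if we are not already inside one.
            if gap_start is None:
                gap_start = ts
        elif gap_start is not None:
            # A known timestamp closes the current gap.
            ranges.append((gap_start, ts))
            gap_start = None
        ts += HOUR
    if gap_start is not None:
        ranges.append((gap_start, end))
    return ranges


# Six requested hours; hours 0, 1 and 4 are already in the observation table.
start = datetime(2026, 1, 1, 0, tzinfo=timezone.utc)
end = start + 6 * HOUR
known = {start, start + HOUR, start + 4 * HOUR}
gaps = missing_hourly_ranges(start, end, known)
```

Here gaps holds two ranges (hours 2 to 4 and hours 5 to 6), so only two narrow API requests would be needed instead of one covering the full window.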

Process diagram

flowchart TD
    A[build_feature_store.py CLI] --> B[run_feature_pipeline]
    B --> C[EntsoeDataService.fetch_inputs]
    C --> D{Hit in entsoe_api_cache?}
    D -->|Yes| E[Load payload from entsoe_api_cache]
    D -->|No| F[Read electricity_market_observations]
    F --> G{Missing hourly timestamps?}
    G -->|No| H[Reuse DB observation rows]
    G -->|Yes| I[Call ENTSO-E only for missing ranges]
    I --> I2{NoMatchingDataError?}
    I2 -->|Yes| I3[Use empty hourly frame for endpoint]
    I2 -->|No| I4[Normalize payload]
    I4 --> I5[Coalesce duplicate columns by first non-null]
    I3 --> J[Upsert missing rows to electricity_market_observations]
    I5 --> J
    H --> K[Build merged input DataFrame]
    J --> K
    K --> L[Store payload in entsoe_api_cache]
    E --> M[Use cached input DataFrame]
    L --> N[Upsert electricity_market_observations]
    M --> N
    N --> O[build_feature_frame]
    O --> P[Create lags + cyclical features]
    P --> P2[Preserve NaN in forecast-derived columns]
    P2 --> P3[Drop rows missing day_ahead_price or lagged_price_1..24]
    P3 --> Q[Upsert persistable subset into electricity_price_features]
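
The feature-building tail of the diagram (lags, cyclical features, NaN preservation, row filtering) can be sketched in plain Python. This is an illustration under stated assumptions, not the implementation of features.build_feature_frame: the input is assumed to be a gap-free hourly list of dicts, residual_load is assumed to be load minus wind minus solar (the source does not define it), and the output key names are hypothetical.

```python
import math
from datetime import datetime, timedelta, timezone

N_LAGS = 24  # matches the "24 values each" lag arrays
NAN = float("nan")


def cyclical(value, period):
    """Encode a periodic value as a (sin, cos) pair on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def build_feature_rows(rows):
    """rows: hourly-ordered dicts with ts, day_ahead_price, load_forecast,
    wind_forecast, solar_forecast (None marks a missing value)."""
    features = []
    for i in range(N_LAGS, len(rows)):
        row = rows[i]
        # lags[0] is the price one hour back, lags[23] is 24 hours back.
        lags = [rows[i - k]["day_ahead_price"] for k in range(1, N_LAGS + 1)]
        # Drop rows missing the price or any lagged price.
        if row["day_ahead_price"] is None or any(p is None for p in lags):
            continue
        forecasts = [row["load_forecast"], row["wind_forecast"], row["solar_forecast"]]
        if any(v is None for v in forecasts):
            residual = NAN  # keep the gap visible instead of imputing zero
        else:
            # Assumed definition: load - wind - solar.
            residual = forecasts[0] - forecasts[1] - forecasts[2]
        ts = row["ts"]
        hour_sin, hour_cos = cyclical(ts.hour, 24)
        weekday_sin, weekday_cos = cyclical(ts.weekday(), 7)
        month_sin, month_cos = cyclical(ts.month - 1, 12)
        features.append({
            "ts": ts, "day_ahead_price": row["day_ahead_price"],
            "residual_load": residual, "lagged_price": lags,
            "hour_sin": hour_sin, "hour_cos": hour_cos,
            "weekday_sin": weekday_sin, "weekday_cos": weekday_cos,
            "month_sin": month_sin, "month_cos": month_cos,
        })
    return features


# 30 hours of synthetic data with one missing wind forecast at hour 26.
start = datetime(2026, 1, 1, tzinfo=timezone.utc)
rows = [
    {"ts": start + timedelta(hours=i), "day_ahead_price": float(i),
     "load_forecast": 50.0, "wind_forecast": 10.0, "solar_forecast": 5.0}
    for i in range(30)
]
rows[26]["wind_forecast"] = None
features = build_feature_rows(rows)
```

The first 24 rows produce no output because they lack a full lag window. The row for hour 26 is kept with residual_load set to NaN, so a later persistence step can decide whether that column is nullable.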

Key design reasons

  • DB cache avoids repeated ENTSO-E calls during iterative model work.
  • Observation-table fallback avoids re-fetching timestamps already persisted once.
  • Pickled payloads preserve exact pandas object shape and index information.
  • Feature table stores fixed-size lag arrays so one row corresponds to one prediction timestamp.
  • Missing forecasts are kept as NaN in analysis outputs, avoiding misleading zero-imputation.
  • Persistence layer enforces schema compatibility by skipping rows with nulls in NOT NULL feature columns.
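
The first and third points can be illustrated with a small pickled-payload cache. This is a sketch only: it uses an in-memory SQLite database so that it runs standalone, and the two-column schema, the cache_key hashing scheme, and the cached_call and fake_fetch names are all assumptions rather than the actual cache_to_db(...) implementation.

```python
import hashlib
import json
import pickle
import sqlite3


def cache_key(endpoint, params):
    """Derive a stable key from the endpoint name and its parameters."""
    raw = json.dumps({"endpoint": endpoint, "params": params}, sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def cached_call(conn, endpoint, params, fetch):
    """Return the cached payload if present; otherwise call `fetch` and store it."""
    key = cache_key(endpoint, params)
    row = conn.execute(
        "SELECT payload FROM entsoe_api_cache WHERE cache_key = ?", (key,)
    ).fetchone()
    if row is not None:
        # Only unpickle data written by this application: pickle can execute
        # arbitrary code when loading untrusted bytes.
        return pickle.loads(row[0])
    result = fetch(**params)
    conn.execute(
        "INSERT OR REPLACE INTO entsoe_api_cache (cache_key, payload) VALUES (?, ?)",
        (key, pickle.dumps(result)),
    )
    conn.commit()
    return result


# Hypothetical schema for the example; the real table likely has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entsoe_api_cache (cache_key TEXT PRIMARY KEY, payload BLOB NOT NULL)"
)

calls = []


def fake_fetch(zone, start, end):
    """Stand-in for an ENTSO-E request; records each invocation."""
    calls.append((zone, start, end))
    return {"index": [start], "day_ahead_price": [42.0]}


params = {"zone": "DE_LU", "start": "2026-01-01T00:00Z", "end": "2026-01-02T00:00Z"}
first = cached_call(conn, "day_ahead_price", params, fake_fetch)
second = cached_call(conn, "day_ahead_price", params, fake_fetch)
```

The second call is answered from the table, so fake_fetch runs once. Pickling keeps the round trip lossless for arbitrary Python objects, which is the property the design note relies on for pandas objects. The trade-off is that payloads are tied to compatible library versions and must come from a trusted source.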

Extension points

  • Add label/target tables (t+1, t+24, etc.).
  • Add training metadata + model registry tables.
  • Add partitioning strategy for multi-year production-scale data.