Architecture Notes
This document is the quick architecture map. For full file-by-file implementation details, see docs/developer_guide.md.
End-to-end data flow
- `scripts/build_feature_store.py` parses CLI arguments and validates environment variables.
- It calls `pipeline.run_feature_pipeline(...)`.
- `EntsoeDataService.fetch_inputs(...)` loads:
  - `day_ahead_price`
  - `load_forecast`
  - `wind_forecast`
  - `solar_forecast`
- Each ENTSO-E call is wrapped by `cache_to_db(...)` and either:
  - serves a hit from `entsoe_api_cache`, or
  - falls back to `electricity_market_observations` for already-known timestamps and performs API calls only for missing hourly intervals.
- Missing intervals returned from the API are upserted into `electricity_market_observations`.
- The final merged result is cached in `entsoe_api_cache`.
- Raw merged series are upserted to `electricity_market_observations`.
- `features.build_feature_frame(...)` computes the following, preserving `NaN` for missing forecast-derived values:
  - `residual_load`
  - lagged arrays (24 values each)
  - cyclical encodings for hour/weekday/month
- `pipeline.persist_feature_frame(...)` upserts model-ready rows to `electricity_price_features`, filtering out rows that violate the feature-table NOT NULL constraints.
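The "API calls only for missing hourly intervals" step can be illustrated with a minimal sketch. It is not the project's `cache_to_db(...)` implementation: the function name `missing_hourly_ranges` and the half-open range convention are assumptions made for illustration. Given the timestamps already stored in the observation table, it groups the absent hours into contiguous ranges so that each gap would need only one request.

```python
from datetime import datetime, timedelta, timezone

HOUR = timedelta(hours=1)


def missing_hourly_ranges(start, end, known):
    """Return half-open (range_start, range_end) pairs covering every hour
    in [start, end) that is absent from `known`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    known = set(known)
    ranges = []
    run_start = None  # start of the gap currently being extended, if any
    current = start
    while current < end:
        if current not in known:
            if run_start is None:
                run_start = current
        elif run_start is not None:
            # A known hour closes the open gap.
            ranges.append((run_start, current))
            run_start = None
        current += HOUR
    if run_start is not None:
        # The gap runs up to the end of the requested window.
        ranges.append((run_start, end))
    return ranges


start = datetime(2024, 1, 1, 0, tzinfo=timezone.utc)
end = start + 6 * HOUR
# Assume hours 00:00, 01:00 and 04:00 are already persisted.
known = [start, start + HOUR, start + 4 * HOUR]

gaps = missing_hourly_ranges(start, end, known)
for gap_start, gap_end in gaps:
    print(gap_start.isoformat(), "->", gap_end.isoformat())
```

In this example only two ranges (02:00 to 04:00 and 05:00 to 06:00) would be requested from the API; the three stored hours are reused.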
Process diagram
flowchart TD
A[build_feature_store.py CLI] --> B[run_feature_pipeline]
B --> C[EntsoeDataService.fetch_inputs]
C --> D{Hit in entsoe_api_cache?}
D -->|Yes| E[Load payload from entsoe_api_cache]
D -->|No| F[Read electricity_market_observations]
F --> G{Missing hourly timestamps?}
G -->|No| H[Reuse DB observation rows]
G -->|Yes| I[Call ENTSO-E only for missing ranges]
I --> I2{NoMatchingDataError?}
I2 -->|Yes| I3[Use empty hourly frame for endpoint]
I2 -->|No| I4[Normalize payload]
I4 --> I5[Coalesce duplicate columns by first non-null]
I3 --> J[Upsert missing rows to electricity_market_observations]
I5 --> J
H --> K[Build merged input DataFrame]
J --> K
K --> L[Store payload in entsoe_api_cache]
E --> M[Use cached input DataFrame]
L --> N[Upsert electricity_market_observations]
M --> N
N --> O[build_feature_frame]
O --> P[Create lags + cyclical features]
P --> P2[Preserve NaN in forecast-derived columns]
P2 --> P3[Drop rows missing day_ahead_price or lagged_price_1..24]
P3 --> Q[Upsert persistable subset into electricity_price_features]
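The feature-building nodes of the diagram can be sketched in pandas as follows. This is an illustration under stated assumptions, not the code in `features.build_feature_frame(...)`: it assumes a gap-free hourly `DatetimeIndex`, defines residual load as the load forecast minus the wind and solar forecasts, stores lags as separate columns rather than arrays, and uses hypothetical names such as `build_features_sketch` and `hour_sin`.

```python
import numpy as np
import pandas as pd


def build_features_sketch(inputs: pd.DataFrame, n_lags: int = 24) -> pd.DataFrame:
    """Illustrative feature builder; names and definitions are assumptions."""
    out = inputs.copy()

    # Assumed definition of residual load. A NaN in any forecast propagates,
    # so missing forecasts stay NaN instead of being zero-imputed.
    out["residual_load"] = (
        out["load_forecast"] - out["wind_forecast"] - out["solar_forecast"]
    )

    # lagged_price_k is the price k rows (hours) earlier; this relies on the
    # index being hourly without gaps.
    lags = {
        f"lagged_price_{k}": out["day_ahead_price"].shift(k)
        for k in range(1, n_lags + 1)
    }
    out = pd.concat([out, pd.DataFrame(lags)], axis=1)

    # Cyclical encodings map each calendar field onto the unit circle.
    idx = out.index
    calendar_fields = [
        ("hour", idx.hour, 24),
        ("weekday", idx.dayofweek, 7),
        ("month", idx.month - 1, 12),
    ]
    for name, values, period in calendar_fields:
        angle = 2 * np.pi * np.asarray(values) / period
        out[f"{name}_sin"] = np.sin(angle)
        out[f"{name}_cos"] = np.cos(angle)

    # Keep only rows with a price and a complete lag window.
    required = ["day_ahead_price", *lags]
    return out.dropna(subset=required)


index = pd.date_range("2024-01-01", periods=48, freq="h", tz="UTC")
inputs = pd.DataFrame(
    {
        "day_ahead_price": np.arange(48, dtype=float),
        "load_forecast": 100.0,
        "wind_forecast": 20.0,
        "solar_forecast": 5.0,
    },
    index=index,
)
inputs.loc[index[30], "solar_forecast"] = np.nan  # simulate a missing forecast

features = build_features_sketch(inputs)
print(features[["day_ahead_price", "residual_load", "lagged_price_24"]].head())
```

The first 24 rows are dropped because their lag window is incomplete, while the row with the missing solar forecast is kept with `residual_load` set to `NaN`.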
Key design reasons
- DB cache avoids repeated ENTSO-E calls during iterative model work.
- Observation-table fallback avoids re-fetching timestamps already persisted once.
- Pickled payloads preserve exact pandas object shape and index information.
- Feature table stores fixed-size lag arrays so one row corresponds to one prediction timestamp.
- Missing forecasts are kept as `NaN` in analysis outputs, avoiding misleading zero-imputation.
- Persistence layer enforces schema compatibility by skipping rows with nulls in NOT NULL feature columns.
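The point about pickled payloads can be checked with a small round trip. This is a sketch using the standard-library `pickle` module on a made-up frame; how the project actually serializes payloads into `entsoe_api_cache` is not shown in this document, and the column and index names here are illustrative.

```python
import pickle

import numpy as np
import pandas as pd

index = pd.date_range(
    "2024-01-01", periods=4, freq="h", tz="UTC", name="timestamp"
)
frame = pd.DataFrame(
    {"day_ahead_price": [50.0, 48.5, np.nan, 51.2]},
    index=index,
)

blob = pickle.dumps(frame)     # bytes that could be stored in a binary column
restored = pickle.loads(blob)  # only unpickle data from a trusted source

# The restored object keeps values (including NaN), index name and timezone.
print(restored.equals(frame))
print(restored.index.name, restored.index.tz)
```

Because unpickling can execute arbitrary code, this approach fits a cache that only the pipeline itself writes to.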
Extension points
- Add label/target tables (`t+1`, `t+24`, etc.).
- Add training metadata + model registry tables.
- Add partitioning strategy for multi-year production-scale data.
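As a starting point for the label/target extension, one possible sketch derives `t+1` and `t+24` targets by shifting the price series backwards. Nothing here exists in the project yet: the function `build_targets_sketch` and the `target_t_plus_*` column names are hypothetical, and the approach assumes a gap-free hourly index so that one row equals one hour.

```python
import numpy as np
import pandas as pd


def build_targets_sketch(prices: pd.Series, horizons=(1, 24)) -> pd.DataFrame:
    """Hypothetical helper: for each horizon h, align the price observed
    h hours later with the current timestamp."""
    targets = {f"target_t_plus_{h}": prices.shift(-h) for h in horizons}
    return pd.DataFrame(targets)


index = pd.date_range("2024-01-01", periods=30, freq="h", tz="UTC")
prices = pd.Series(np.arange(30, dtype=float), index=index, name="day_ahead_price")

targets = build_targets_sketch(prices)
print(targets.head(3))
```

Rows near the end of the series have no future price yet, so their targets are `NaN` and would need to be excluded before training.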