# Developer Guide (Deep Dive)

This guide explains each file in the module, execution order, control flow, and data/state transitions so you can reason about behavior without reading source code.

## 1) Directory map and responsibilities

### Top-level

- `requirements.txt`
  - Python dependencies for ingestion and DB persistence.
- `README.md`
  - Operator-focused setup and run commands.
- `sql/001_electricity_price_schema.sql`
  - DDL for cache, raw observations, and feature store.
- `scripts/init_db.py`
  - Applies the SQL schema to `quant_db`.
- `scripts/build_feature_store.py`
  - CLI entrypoint for data fetch + feature persistence.
- `docs/architecture.md`
  - High-level architecture summary.
- `docs/developer_guide.md`
  - This detailed developer-facing explanation.

### Python package (`src/electricity_price_predictor`)

- `__init__.py`
  - Public package exports (`get_engine`, `EntsoeDataService`, `build_feature_frame`).
- `db.py`
  - Builds DB URL from env vars and creates SQLAlchemy `Engine`.
- `cache.py`
  - Implements decorator-based DB cache with deterministic keying.
- `entsoe_api.py`
  - Wraps ENTSO-E API calls, normalizes data, and writes raw observations.
- `features.py`
  - Pure feature engineering logic (residual load, lags, cyclical encoding).
- `pipeline.py`
  - Orchestration layer for end-to-end fetch -> raw persist -> feature build -> feature persist.

## 2) Runtime execution path (step-by-step)

When you run:

```bash
PYTHONPATH=src python3 scripts/build_feature_store.py --country-code ... --start ... --end ...
```

Execution sequence:

1. **Argument parsing**
   - `build_feature_store.py` reads country code/time range/TTL.
2. **Credential/connection bootstrap**
   - checks `ENTSOE_API_KEY`.
   - calls `get_engine()` from `db.py`.
3. **Pipeline orchestration**
   - `run_feature_pipeline(...)` in `pipeline.py` starts.
4. **API service creation**
   - initializes `EntsoePandasClient`.
   - creates `EntsoeDataService(client, engine, cache_ttl_hours)`.
5. **Decorator wrapping**
   - in `EntsoeDataService.__post_init__`, API methods are wrapped by `cache_to_db(...)`.
6. **Data retrieval**
   - `fetch_inputs(...)` calls:
     - `get_day_ahead_prices(...)`
     - `get_load_forecast(...)`
     - `get_wind_solar_forecast(...)`
   - country aliases are normalized to bidding zones before queries (currently `DE -> DE_LU`, `IT -> IT_NORD`).
7. **Cache check/compute loop (per call)**
   - decorator computes hash key from function + args.
   - if non-expired row exists in `entsoe_api_cache`: returns payload.
   - else: reads `electricity_market_observations` for requested timestamps.
     - if timestamps are missing there, only missing hourly ranges are requested from ENTSO-E.
     - `NoMatchingDataError` from ENTSO-E is converted to an empty hourly frame for that endpoint/range.
     - normalized responses coalesce duplicate semantic columns (for example multiple wind/solar columns) via first non-null-per-row.
     - missing rows are upserted into `electricity_market_observations`.
     - final merged dataset is stored in `entsoe_api_cache` and returned.
8. **Raw persistence**
   - merged inputs are upserted to `electricity_market_observations`.
9. **Feature engineering**
   - `build_feature_frame(...)` computes:
     - `residual_load = load - wind - solar`
     - `lagged_price_1..24`
     - `lagged_residual_load_1..24`
     - `hour_of_day_sin/cos`, `weekday_sin/cos`, `month_sin/cos`
   - preserves source missingness as `NaN` (no 0.0 imputation).
   - drops rows only when `day_ahead_price` / `lagged_price_1..24` are missing (lag warmup requirement).
10. **Feature-store persistence**
    - lags are materialized into PostgreSQL arrays (`DOUBLE PRECISION[]`, length 24).
    - rows violating NOT NULL core feature constraints are filtered out before upsert.
    - persistable rows are upserted to `electricity_price_features`.
11. **CLI completion**
    - prints persisted row count.

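
The gap-filling part of step 7 ("only missing hourly ranges are requested from ENTSO-E") can be illustrated with a small standalone sketch. This is not the project's code: `missing_hourly_ranges` is a hypothetical helper written for this guide, it uses only the standard library instead of pandas, and it assumes `start` and `end` are hour-aligned, timezone-aware UTC datetimes describing a half-open `[start, end)` window.

```python
from datetime import datetime, timedelta, timezone

HOUR = timedelta(hours=1)


def missing_hourly_ranges(start, end, existing):
    """Return contiguous [gap_start, gap_end) ranges of hours missing from `existing`.

    Hypothetical helper: `existing` is a set of hourly timestamps already
    stored in the observation table for the requested window.
    """
    gaps = []
    gap_start = None
    current = start
    while current < end:
        if current not in existing:
            # Open a new gap only if we are not already inside one.
            if gap_start is None:
                gap_start = current
        elif gap_start is not None:
            # A stored hour closes the currently open gap.
            gaps.append((gap_start, current))
            gap_start = None
        current += HOUR
    if gap_start is not None:
        # The window ended while a gap was still open.
        gaps.append((gap_start, end))
    return gaps


# Six-hour window in which hours 00, 01 and 04 are already stored.
start = datetime(2024, 1, 1, 0, tzinfo=timezone.utc)
end = start + 6 * HOUR
existing = {start, start + HOUR, start + 4 * HOUR}

gaps = missing_hourly_ranges(start, end, existing)
for gap_start, gap_end in gaps:
    print(gap_start.isoformat(), "->", gap_end.isoformat())
```

Under these assumptions, each returned range would correspond to one API request, so a window that is mostly covered by stored observations triggers only a few short fetches.
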
## 3) UML diagrams

### 3.1 Component diagram

```mermaid
flowchart LR
    CLI[scripts/build_feature_store.py] --> PIPE[pipeline.run_feature_pipeline]
    PIPE --> DBMOD[db.get_engine]
    PIPE --> SERVICE[EntsoeDataService]
    SERVICE --> CACHEDEC[cache_to_db decorator]
    SERVICE --> ENTSOE[EntsoePandasClient]
    SERVICE --> SECONDARY[electricity_market_observations secondary cache]
    PIPE --> FEAT[features.build_feature_frame NaN-preserving]
    FEAT --> PERSIST[pipeline.persist_feature_frame null-filtered]
    CACHEDEC --> DB[(quant_db.entsoe_api_cache)]
    SECONDARY --> RAW[(quant_db.electricity_market_observations)]
    PERSIST --> STORE[(quant_db.electricity_price_features)]
```

### 3.2 Class diagram (logical)

```mermaid
classDiagram
    class EntsoeDataService {
      +client: EntsoePandasClient
      +engine: Engine
      +cache_ttl_hours: Optional[int]
      +fetch_inputs(country_code, start, end) DataFrame
      +upsert_raw_data(country_code, frame) None
      -_get_day_ahead_prices_impl(country_code, start, end) Series
      -_get_load_forecast_impl(country_code, start, end) Series
      -_get_wind_solar_forecast_impl(country_code, start, end) DataFrame
    }

    class CacheDecorator {
      +cache_to_db(engine, namespace, ttl_hours) decorator
      -_build_cache_key(function_name, args, kwargs) str
    }

    class FeatureBuilder {
      +build_feature_frame(inputs, max_lag=24) DataFrame
      -_cyclical_encode(values, period, prefix) DataFrame
    }

    class Pipeline {
      +run_feature_pipeline(engine, entsoe_api_key, country_code, start, end, cache_ttl_hours) DataFrame
      +persist_feature_frame(engine, country_code, feature_frame) None
    }

    Pipeline --> EntsoeDataService : uses
    Pipeline --> FeatureBuilder : uses
    EntsoeDataService --> CacheDecorator : wraps methods
```

### 3.3 Sequence diagram (single API method with cache)

```mermaid
sequenceDiagram
    participant Caller as fetch_inputs()
    participant Decorator as cache_to_db wrapper
    participant CacheTable as entsoe_api_cache (L1)
    participant ObsTable as electricity_market_observations (L2)
    participant API as ENTSO-E API

    Caller->>Decorator: get_day_ahead_prices(country, start, end)
    Decorator->>CacheTable: SELECT by cache_key and expires_at
    alt L1 cache hit
        CacheTable-->>Decorator: payload
        Decorator-->>Caller: unpickled pandas object
    else L1 cache miss/expired
        Decorator->>ObsTable: SELECT existing timestamps
        alt L2 fully covers range
            ObsTable-->>Decorator: pandas-compatible rows
        else L2 has gaps
            Decorator->>API: query only missing ranges
            alt API returns data
                API-->>Decorator: missing rows
                Decorator->>Decorator: normalize columns + coalesce duplicates
                Decorator->>ObsTable: UPSERT missing rows
            else NoMatchingDataError
                Decorator->>Decorator: synthesize empty hourly frame
            end
        end
        Decorator->>CacheTable: INSERT/UPSERT merged payload
        Decorator-->>Caller: fresh result
    end
```

### 3.4 State diagram (cache entry lifecycle)

```mermaid
stateDiagram-v2
    [*] --> L1Missing
    L1Missing --> L2Check: cache miss/expiry
    L2Check --> Fresh: observation table fully covers range
    L2Check --> Partial: observation table has gaps
    Partial --> Fresh: fetch missing ranges, upsert L2, upsert L1
    Fresh --> Fresh: reused before expiry
    Fresh --> Expired: TTL passes for L1 entry
    Expired --> L2Check: next call
    Fresh --> Overwritten: Same key, new payload upsert
    Overwritten --> Fresh
```

### 3.5 ER diagram (database schema)

```mermaid
erDiagram
    entsoe_api_cache {
        text cache_key PK
        text namespace
        text function_name
        jsonb args_json
        bytea payload
        timestamptz created_at
        timestamptz expires_at
    }

    electricity_market_observations {
        text country_code PK
        timestamptz delivery_start PK
        float day_ahead_price
        float load_forecast
        float wind_forecast
        float solar_forecast
        timestamptz ingested_at
    }

    electricity_price_features {
        text country_code PK
        timestamptz delivery_start PK
        text feature_version PK
        float day_ahead_price
        float load_forecast
        float wind_forecast
        float solar_forecast
        float residual_load
        float[] lagged_price
        float[] lagged_residual_load
        float hour_of_day_sin
        float hour_of_day_cos
        float weekday_sin
        float weekday_cos
        float month_sin
        float month_cos
        timestamptz created_at
    }
```

## 4) How files collaborate

### 4.1 `db.py` + scripts

- Scripts never hardcode DB URI; they call `get_engine()`.
- `get_engine()` centralizes environment-driven connectivity.

### 4.2 `cache.py` + `entsoe_api.py`

- `cache_to_db()` is generic and independent of ENTSO-E specifics.
- `EntsoeDataService.__post_init__` binds that generic decorator to each API-fetch method.
- Result: all expensive API calls automatically become cache-aware without changing call sites.

### 4.3 `entsoe_api.py` + `features.py`

- `entsoe_api.py` guarantees normalized timestamp index and expected source columns.
- `features.py` assumes these columns and transforms them to model features only (no DB side effects).

### 4.4 `features.py` + `pipeline.py`

- `build_feature_frame()` returns wide DataFrame with `lagged_*_1..24`.
- `persist_feature_frame()` converts those to PostgreSQL arrays so table rows stay compact and versioned.

## 5) Important implementation details

- **Cache keys are deterministic**
  - Built from JSON of function name + args + kwargs with stable sorting.
- **Cache payload type**
  - `pickle` stored in `BYTEA` to preserve pandas objects.
- **TTL logic**
  - `expires_at IS NULL` means never expires.
  - Otherwise must be greater than current UTC time to be considered valid.
- **Two-layer cache order**
  - Layer 1: `entsoe_api_cache` (function-result cache).
  - Layer 2: `electricity_market_observations` (timestamp-level raw cache).
  - API calls happen only for Layer-2 gaps.
- **Upsert strategy**
  - Raw and feature tables use `ON CONFLICT ... DO UPDATE` for idempotent reruns.
  - Raw upsert uses `COALESCE(EXCLUDED.col, existing.col)` to avoid null-overwriting previously stored values during partial refreshes.
  - Feature upsert operates on a filtered persistable subset where core NOT NULL columns are present.
- **Missingness semantics**
  - Forecast and derived residual columns preserve `NaN` in memory.
  - No zero-imputation is performed for missing forecast values.
- **Bidding-zone normalization**
  - `resolve_bidding_zone_code(...)` maps common country aliases to ENTSO-E zone codes.
  - Pipeline persistence uses the resolved code, ensuring DB keys match actual queried zones.
- **Timezone handling**
  - API index is normalized to UTC to avoid DST ambiguity in lag features.
- **Feature warmup**
  - Rows missing `day_ahead_price` or any `lagged_price_1..24` are dropped because lag history is incomplete.

## 6) Failure modes and expected behavior

- Missing `ENTSOE_API_KEY` -> CLI raises early runtime error.
- Missing required input columns -> feature builder raises `ValueError`.
- Duplicate normalized columns from ENTSO-E payloads -> coalesced before reindexing to avoid pandas duplicate-label reindex errors.
- ENTSO-E no-data responses for an endpoint/range -> transformed to empty hourly frames and merged safely.
- Empty data frame -> raw/feature persistence functions no-op safely.
- Repeated identical request -> cache hit (no API roundtrip).
- Expired L1 cache row + full L2 coverage -> no API call required.
- Expired L1 cache row + partial L2 coverage -> API called only for missing ranges.

## 7) Data contracts

### 7.1 In-memory features contract

Producer: `run_feature_pipeline(...)` return value (`pd.DataFrame`).

- **Index contract**
  - hourly UTC `DatetimeIndex`, sorted ascending.
  - unique timestamps expected after deduplication.
- **Column contract**
  - base: `day_ahead_price`, `load_forecast`, `wind_forecast`, `solar_forecast`
  - derived: `residual_load`
  - lag columns: `lagged_price_1..24`, `lagged_residual_load_1..24`
  - cyclical: `hour_of_day_sin/cos`, `weekday_sin/cos`, `month_sin/cos`
- **Nullability contract**
  - required non-null in returned rows: `day_ahead_price`, `lagged_price_1..24`
  - nullable: `load_forecast`, `wind_forecast`, `solar_forecast`, `residual_load`, and `lagged_residual_load_*`
  - rationale: preserve upstream missingness semantics for analysis and QC.

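
To make the nullability contract concrete, here is a minimal sketch of how requiring `day_ahead_price` and all of `lagged_price_1..24` shapes the returned rows. It is an illustration only: `build_price_lag_rows` is a hypothetical stand-in written for this guide, it works on a plain list with `None` for missing values rather than on a pandas `DataFrame`, and it covers only the price lags, not `build_feature_frame` as a whole.

```python
from typing import Optional

MAX_LAG = 24  # matches the lagged_price_1..24 columns in the contract


def build_price_lag_rows(prices: list[Optional[float]], max_lag: int = MAX_LAG) -> list[dict]:
    """Build price-lag columns and keep only rows that satisfy the contract.

    `prices[i]` is the day-ahead price for hour i (None means missing).
    A row is kept only when the price and every one of its `max_lag`
    lagged prices are present.
    """
    rows = []
    for i, price in enumerate(prices):
        # Lag k refers to the price k hours earlier; before the series start it is missing.
        lags = [prices[i - k] if i - k >= 0 else None for k in range(1, max_lag + 1)]
        if price is None or any(lag is None for lag in lags):
            continue  # warmup row, missing target, or gap inside the lag window
        row = {"hour_index": i, "day_ahead_price": price}
        row.update({f"lagged_price_{k}": lags[k - 1] for k in range(1, max_lag + 1)})
        rows.append(row)
    return rows


# 48 hourly prices with a single missing observation at hour 30.
prices: list[Optional[float]] = [50.0 + i for i in range(48)]
prices[30] = None

rows = build_price_lag_rows(prices)
print(len(rows), [row["hour_index"] for row in rows])
```

In this sketch the first 24 hours are dropped as warmup, and the single missing price at hour 30 also removes every later row whose 24-hour lag window contains it. Under the same rule, gaps in `day_ahead_price` can reduce the returned row count by more than the initial warmup alone.
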
### 7.2 Feature-store persistence contract

Consumer: `electricity_price_features` table.

- **Primary key contract**
  - (`country_code`, `delivery_start`, `feature_version`)
- **Schema constraint contract**
  - core numeric columns are `NOT NULL`.
  - lag arrays are `DOUBLE PRECISION[]` and expected length 24.
- **Write-time contract**
  - `persist_feature_frame(...)` filters rows that violate NOT NULL core columns before UPSERT.
  - retained rows are idempotently upserted via `ON CONFLICT ... DO UPDATE`.

### 7.3 Raw-observation contract

Consumer: `electricity_market_observations` table.

- **Primary key contract**
  - (`country_code`, `delivery_start`)
- **Merge contract**
  - upsert uses `COALESCE(EXCLUDED.col, existing.col)` to avoid null-overwriting prior known values.
- **Coverage contract**
  - secondary cache guarantees fetched payloads are aligned to expected hourly index for the requested `[start, end)` range.

## 8) Practical debugging checklist

1. Run `scripts/init_db.py` and ensure tables exist.
2. Run one short-range fetch window (1-2 days) first.
3. Verify cache growth:
   - `SELECT namespace, function_name, COUNT(*) FROM entsoe_api_cache GROUP BY 1,2;`
4. Verify raw persistence:
   - `SELECT COUNT(*) FROM electricity_market_observations WHERE country_code = '...';`
5. Verify feature persistence:
   - check lag array sizes are 24 and row count is lower than raw by about 24.

## 9) Suggested next developer docs to add

- Data quality rules (acceptable missingness, clipping policy, anomaly handling).
- Training-set contract (target definition, split strategy, leakage constraints).
- Backfill/replay policy for reprocessing historical periods.