pricing/electricity_price_predictor/docs/developer_guide.md
ddoebel 73641b7e5b Add electricity price ingestion and feature pipeline.
Introduce ENTSO-E data retrieval with layered caching, robust bidding-zone and missing-data handling, and persist model-ready features with detailed architecture/developer documentation.

Made-with: Cursor
2026-04-15 11:40:14 +02:00

Developer Guide (Deep Dive)

This guide explains each file in the module, execution order, control flow, and data/state transitions so you can reason about behavior without reading source code.

1) Directory map and responsibilities

Top-level

  • requirements.txt
    • Python dependencies for ingestion and DB persistence.
  • README.md
    • Operator-focused setup and run commands.
  • sql/001_electricity_price_schema.sql
    • DDL for cache, raw observations, and feature store.
  • scripts/init_db.py
    • Applies the SQL schema to quant_db.
  • scripts/build_feature_store.py
    • CLI entrypoint for data fetch + feature persistence.
  • docs/architecture.md
    • High-level architecture summary.
  • docs/developer_guide.md
    • This detailed developer-facing explanation.

Python package (src/electricity_price_predictor)

  • __init__.py
    • Public package exports (get_engine, EntsoeDataService, build_feature_frame).
  • db.py
    • Builds DB URL from env vars and creates SQLAlchemy Engine.
  • cache.py
    • Implements decorator-based DB cache with deterministic keying.
  • entsoe_api.py
    • Wraps ENTSO-E API calls, normalizes data, and writes raw observations.
  • features.py
    • Pure feature engineering logic (residual load, lags, cyclical encoding).
  • pipeline.py
    • Orchestration layer for end-to-end fetch -> raw persist -> feature build -> feature persist.

2) Runtime execution path (step-by-step)

When you run:

PYTHONPATH=src python3 scripts/build_feature_store.py --country-code ... --start ... --end ...

Execution sequence:

  1. Argument parsing
    • build_feature_store.py reads country code/time range/TTL.
  2. Credential/connection bootstrap
    • checks ENTSOE_API_KEY.
    • calls get_engine() from db.py.
  3. Pipeline orchestration
    • run_feature_pipeline(...) in pipeline.py starts.
  4. API service creation
    • initializes EntsoePandasClient.
    • creates EntsoeDataService(client, engine, cache_ttl_hours).
  5. Decorator wrapping
    • in EntsoeDataService.__post_init__, API methods are wrapped by cache_to_db(...).
  6. Data retrieval
    • fetch_inputs(...) calls:
      • get_day_ahead_prices(...)
      • get_load_forecast(...)
      • get_wind_solar_forecast(...)
    • country aliases are normalized to bidding zones before queries (currently DE -> DE_LU, IT -> IT_NORD).
  7. Cache check/compute loop (per call)
    • decorator computes hash key from function + args.
    • if non-expired row exists in entsoe_api_cache: returns payload.
    • else: reads electricity_market_observations for requested timestamps.
    • if timestamps are missing there, only missing hourly ranges are requested from ENTSO-E.
    • NoMatchingDataError from ENTSO-E is converted to an empty hourly frame for that endpoint/range.
    • normalized responses coalesce duplicate semantic columns (for example multiple wind/solar columns) by taking the first non-null value in each row.
    • missing rows are upserted into electricity_market_observations.
    • final merged dataset is stored in entsoe_api_cache and returned.
  8. Raw persistence
    • merged inputs are upserted to electricity_market_observations.
  9. Feature engineering
    • build_feature_frame(...) computes:
      • residual_load = load - wind - solar
      • lagged_price_1..24
      • lagged_residual_load_1..24
      • hour_of_day_sin/cos, weekday_sin/cos, month_sin/cos
    • preserves source missingness as NaN (no 0.0 imputation).
    • drops rows only when day_ahead_price or any of lagged_price_1..24 is missing (lag warmup requirement).
  10. Feature-store persistence
    • lags are materialized into PostgreSQL arrays (DOUBLE PRECISION[], length 24).
    • rows violating NOT NULL core feature constraints are filtered out before upsert.
    • persistable rows are upserted to electricity_price_features.
  11. CLI completion
    • prints persisted row count.
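
The gap detection in step 7 (request only the hours that the observation table does not already hold) can be sketched as below. This is a minimal illustration using only the standard library, not the module's code: the helper names expected_hours and missing_ranges are hypothetical, and the stored timestamps are passed in as a plain set instead of being read from electricity_market_observations.

```python
from datetime import datetime, timedelta, timezone

HOUR = timedelta(hours=1)


def expected_hours(start, end):
    """Hourly timestamps covering the half-open range [start, end)."""
    hours = []
    current = start
    while current < end:
        hours.append(current)
        current += HOUR
    return hours


def missing_ranges(start, end, present):
    """Group hours absent from `present` into contiguous [gap_start, gap_end) ranges."""
    ranges = []
    gap_start = None
    for ts in expected_hours(start, end):
        if ts not in present:
            if gap_start is None:
                gap_start = ts  # a new gap opens at this hour
        elif gap_start is not None:
            ranges.append((gap_start, ts))  # the gap closes at the first stored hour
            gap_start = None
    if gap_start is not None:
        ranges.append((gap_start, end))  # the gap runs to the end of the request
    return ranges


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = start + 6 * HOUR
stored = {start, start + HOUR, start + 4 * HOUR}  # hours already in the raw table
gaps = missing_ranges(start, end, stored)
```

Each returned pair would correspond to one ENTSO-E request, so a fully covered range produces no requests at all.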

3) UML diagrams

3.1 Component diagram

flowchart LR
    CLI[scripts/build_feature_store.py] --> PIPE[pipeline.run_feature_pipeline]
    PIPE --> DBMOD[db.get_engine]
    PIPE --> SERVICE[EntsoeDataService]
    SERVICE --> CACHEDEC[cache_to_db decorator]
    SERVICE --> ENTSOE[EntsoePandasClient]
    SERVICE --> SECONDARY[electricity_market_observations secondary cache]
    PIPE --> FEAT[features.build_feature_frame NaN-preserving]
    FEAT --> PERSIST[pipeline.persist_feature_frame null-filtered]
    CACHEDEC --> DB[(quant_db.entsoe_api_cache)]
    SECONDARY --> RAW[(quant_db.electricity_market_observations)]
    PERSIST --> STORE[(quant_db.electricity_price_features)]

3.2 Class diagram (logical)

classDiagram
    class EntsoeDataService {
      +client: EntsoePandasClient
      +engine: Engine
      +cache_ttl_hours: Optional[int]
      +fetch_inputs(country_code, start, end) DataFrame
      +upsert_raw_data(country_code, frame) None
      -_get_day_ahead_prices_impl(country_code, start, end) Series
      -_get_load_forecast_impl(country_code, start, end) Series
      -_get_wind_solar_forecast_impl(country_code, start, end) DataFrame
    }

    class CacheDecorator {
      +cache_to_db(engine, namespace, ttl_hours) decorator
      -_build_cache_key(function_name, args, kwargs) str
    }

    class FeatureBuilder {
      +build_feature_frame(inputs, max_lag=24) DataFrame
      -_cyclical_encode(values, period, prefix) DataFrame
    }

    class Pipeline {
      +run_feature_pipeline(engine, entsoe_api_key, country_code, start, end, cache_ttl_hours) DataFrame
      +persist_feature_frame(engine, country_code, feature_frame) None
    }

    Pipeline --> EntsoeDataService : uses
    Pipeline --> FeatureBuilder : uses
    EntsoeDataService --> CacheDecorator : wraps methods

3.3 Sequence diagram (single API method with cache)

sequenceDiagram
    participant Caller as fetch_inputs()
    participant Decorator as cache_to_db wrapper
    participant CacheTable as entsoe_api_cache (L1)
    participant ObsTable as electricity_market_observations (L2)
    participant API as ENTSO-E API

    Caller->>Decorator: get_day_ahead_prices(country, start, end)
    Decorator->>CacheTable: SELECT by cache_key and expires_at
    alt L1 cache hit
        CacheTable-->>Decorator: payload
        Decorator-->>Caller: unpickled pandas object
    else L1 cache miss/expired
        Decorator->>ObsTable: SELECT existing timestamps
        alt L2 fully covers range
            ObsTable-->>Decorator: pandas-compatible rows
        else L2 has gaps
            Decorator->>API: query only missing ranges
            alt API returns data
                API-->>Decorator: missing rows
                Decorator->>Decorator: normalize columns + coalesce duplicates
                Decorator->>ObsTable: UPSERT missing rows
            else NoMatchingDataError
                Decorator->>Decorator: synthesize empty hourly frame
            end
        end
        Decorator->>CacheTable: INSERT/UPSERT merged payload
        Decorator-->>Caller: fresh result
    end

3.4 State diagram (cache entry lifecycle)

stateDiagram-v2
    [*] --> L1Missing
    L1Missing --> L2Check: cache miss/expiry
    L2Check --> Fresh: observation table fully covers range
    L2Check --> Partial: observation table has gaps
    Partial --> Fresh: fetch missing ranges, upsert L2, upsert L1
    Fresh --> Fresh: reused before expiry
    Fresh --> Expired: TTL passes for L1 entry
    Expired --> L2Check: next call
    Fresh --> Overwritten: Same key, new payload upsert
    Overwritten --> Fresh

3.5 ER diagram (database schema)

erDiagram
    entsoe_api_cache {
        text cache_key PK
        text namespace
        text function_name
        jsonb args_json
        bytea payload
        timestamptz created_at
        timestamptz expires_at
    }

    electricity_market_observations {
        text country_code PK
        timestamptz delivery_start PK
        float day_ahead_price
        float load_forecast
        float wind_forecast
        float solar_forecast
        timestamptz ingested_at
    }

    electricity_price_features {
        text country_code PK
        timestamptz delivery_start PK
        text feature_version PK
        float day_ahead_price
        float load_forecast
        float wind_forecast
        float solar_forecast
        float residual_load
        float[] lagged_price
        float[] lagged_residual_load
        float hour_of_day_sin
        float hour_of_day_cos
        float weekday_sin
        float weekday_cos
        float month_sin
        float month_cos
        timestamptz created_at
    }

4) How files collaborate

4.1 db.py + scripts

  • Scripts never hardcode the DB URI; they call get_engine().
  • get_engine() centralizes environment-driven connectivity.
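
As a rough sketch of environment-driven connectivity, the URL could be assembled as below. The environment variable names (QUANT_DB_USER and so on), the defaults, and the helper name build_db_url are all assumptions made for illustration; the guide does not list the variables that db.py actually reads.

```python
import os
from urllib.parse import quote


def build_db_url(env=None):
    """Assemble a PostgreSQL URL from environment-style settings."""
    env = os.environ if env is None else env
    user = env.get("QUANT_DB_USER", "postgres")  # hypothetical variable name
    # Percent-encode the password so characters such as "@" or spaces stay valid.
    password = quote(env.get("QUANT_DB_PASSWORD", ""), safe="")
    host = env.get("QUANT_DB_HOST", "localhost")
    port = env.get("QUANT_DB_PORT", "5432")
    name = env.get("QUANT_DB_NAME", "quant_db")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


# A plain dict stands in for os.environ so the example is reproducible.
example_url = build_db_url({"QUANT_DB_USER": "quant", "QUANT_DB_PASSWORD": "p@ss word"})
```

The resulting string is the kind of value that would be handed to sqlalchemy.create_engine(...) inside get_engine().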

4.2 cache.py + entsoe_api.py

  • cache_to_db() is generic and independent of ENTSO-E specifics.
  • EntsoeDataService.__post_init__ binds that generic decorator to each API-fetch method.
  • Result: all expensive API calls automatically become cache-aware without changing call sites.
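
The decorator pattern described above can be illustrated with a small stand-in. This sketch makes several assumptions: a dict replaces the entsoe_api_cache table, SHA-256 is used as the hash (the guide only says a hash key is computed), and the names build_cache_key and cache_in_store are hypothetical counterparts of _build_cache_key and cache_to_db.

```python
import functools
import hashlib
import json
import pickle
from datetime import datetime, timedelta, timezone


def build_cache_key(function_name, args, kwargs):
    """Deterministic key: JSON with sorted keys, then a SHA-256 digest."""
    payload = json.dumps(
        {"function": function_name, "args": args, "kwargs": kwargs},
        sort_keys=True,
        default=str,  # fall back to str() for values JSON cannot encode
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_in_store(store, ttl_hours=None):
    """Cache results in `store`, a dict standing in for the cache table."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = build_cache_key(func.__name__, list(args), kwargs)
            now = datetime.now(timezone.utc)
            entry = store.get(key)
            if entry is not None:
                payload, expires_at = entry
                # expires_at of None means the entry never expires.
                if expires_at is None or expires_at > now:
                    return pickle.loads(payload)
            result = func(*args, **kwargs)
            expires_at = None if ttl_hours is None else now + timedelta(hours=ttl_hours)
            store[key] = (pickle.dumps(result), expires_at)
            return result
        return wrapper
    return decorator


store = {}
calls = []


@cache_in_store(store, ttl_hours=6)
def fetch_prices(zone, start, end):
    calls.append(zone)  # records how often the "expensive" call really runs
    return [42.0, 43.5]


first = fetch_prices("DE_LU", "2024-01-01", "2024-01-02")
second = fetch_prices("DE_LU", "2024-01-01", "2024-01-02")
```

The second call is served from the store, so the wrapped function body runs only once while the call site stays unchanged.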

4.3 entsoe_api.py + features.py

  • entsoe_api.py guarantees normalized timestamp index and expected source columns.
  • features.py assumes these columns and transforms them to model features only (no DB side effects).

4.4 features.py + pipeline.py

  • build_feature_frame() returns a wide DataFrame with lagged_*_1..24 columns.
  • persist_feature_frame() converts those to PostgreSQL arrays so table rows stay compact and versioned.
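
The wide-to-array conversion can be pictured as follows. This is a sketch on a plain dict row, not the DataFrame code in pipeline.py; the helper name lags_to_array is hypothetical, and placing lag 1 at array position 0 is an assumption about the ordering.

```python
MAX_LAG = 24


def lags_to_array(row, prefix, max_lag=MAX_LAG):
    """Collect prefix_1..prefix_max_lag into one ordered list (lag 1 first)."""
    return [row[f"{prefix}_{lag}"] for lag in range(1, max_lag + 1)]


# One wide row with 24 lag columns, as build_feature_frame() would produce them.
row = {f"lagged_price_{lag}": 50.0 + lag for lag in range(1, MAX_LAG + 1)}
lagged_price = lags_to_array(row, "lagged_price")
```

A list built this way has a fixed length of 24, matching the expected length of the DOUBLE PRECISION[] columns.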

5) Important implementation details

  • Cache keys are deterministic
    • Built from JSON of function name + args + kwargs with stable sorting.
  • Cache payload type
    • Results are pickled and stored in a BYTEA column to preserve pandas objects.
  • TTL logic
    • expires_at IS NULL means never expires.
    • Otherwise, expires_at must be later than the current UTC time for the row to be considered valid.
  • Two-layer cache order
    • Layer 1: entsoe_api_cache (function-result cache).
    • Layer 2: electricity_market_observations (timestamp-level raw cache).
    • API calls happen only for Layer-2 gaps.
  • Upsert strategy
    • Raw and feature tables use ON CONFLICT ... DO UPDATE for idempotent reruns.
    • Raw upsert uses COALESCE(EXCLUDED.col, existing.col) to avoid null-overwriting previously stored values during partial refreshes.
    • Feature upsert operates on a filtered persistable subset where core NOT NULL columns are present.
  • Missingness semantics
    • Forecast and derived residual columns preserve NaN in memory.
    • No zero-imputation is performed for missing forecast values.
  • Bidding-zone normalization
    • resolve_bidding_zone_code(...) maps common country aliases to ENTSO-E zone codes.
    • Pipeline persistence uses the resolved code, ensuring DB keys match actual queried zones.
  • Timezone handling
    • API index is normalized to UTC to avoid DST ambiguity in lag features.
  • Feature warmup
    • Rows missing day_ahead_price or any lagged_price_1..24 are dropped because lag history is incomplete.
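
The bidding-zone normalization listed above amounts to an alias lookup. The sketch below uses the two aliases the guide names; the function name resolve_zone_sketch is hypothetical, and the whitespace stripping and upper-casing are assumptions, not confirmed behavior of resolve_bidding_zone_code(...).

```python
# Aliases taken from the guide; any further entries would be guesses.
ZONE_ALIASES = {"DE": "DE_LU", "IT": "IT_NORD"}


def resolve_zone_sketch(country_code):
    """Map a country alias to its bidding zone, leaving other codes unchanged."""
    code = country_code.strip().upper()  # assumed input clean-up
    return ZONE_ALIASES.get(code, code)


resolved = [resolve_zone_sketch(code) for code in ("DE", "it", "FR", "DE_LU")]
```

Persisting under the resolved code means that a request for DE and a request for DE_LU end up on the same database keys.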

6) Failure modes and expected behavior

  • Missing ENTSOE_API_KEY -> the CLI raises a runtime error early, before the pipeline starts.
  • Missing required input columns -> feature builder raises ValueError.
  • Duplicate normalized columns from ENTSO-E payloads -> coalesced before reindexing to avoid pandas duplicate-label reindex errors.
  • ENTSO-E no-data responses for an endpoint/range -> transformed to empty hourly frames and merged safely.
  • Empty data frame -> raw/feature persistence functions no-op safely.
  • Repeated identical request -> cache hit (no API roundtrip).
  • Expired L1 cache row + full L2 coverage -> no API call required.
  • Expired L1 cache row + partial L2 coverage -> API called only for missing ranges.
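
The duplicate-column handling mentioned above (first non-null value per row) can be shown on plain lists. This is an illustrative sketch only; the helper names are hypothetical, and the module itself works on pandas objects.

```python
import math


def is_missing(value):
    """Treat both None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def coalesce_first_non_null(*columns):
    """Row by row, keep the first value that is not missing (None if all are)."""
    merged = []
    for values in zip(*columns):
        merged.append(next((v for v in values if not is_missing(v)), None))
    return merged


# Two columns that normalize to the same semantic name, e.g. solar forecasts.
solar_a = [10.0, None, float("nan")]
solar_b = [11.0, 5.0, None]
solar = coalesce_first_non_null(solar_a, solar_b)
```

Collapsing the duplicates into a single column before reindexing is what avoids the duplicate-label error; a row where every source is missing stays missing.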

7) Data contracts

7.1 In-memory features contract

Producer: run_feature_pipeline(...) return value (pd.DataFrame).

  • Index contract
    • hourly UTC DatetimeIndex, sorted ascending.
    • unique timestamps expected after deduplication.
  • Column contract
    • base: day_ahead_price, load_forecast, wind_forecast, solar_forecast
    • derived: residual_load
    • lag columns: lagged_price_1..24, lagged_residual_load_1..24
    • cyclical: hour_of_day_sin/cos, weekday_sin/cos, month_sin/cos
  • Nullability contract
    • required non-null in returned rows: day_ahead_price, lagged_price_1..24
    • nullable: load_forecast, wind_forecast, solar_forecast, residual_load, and lagged_residual_load_*
    • rationale: preserve upstream missingness semantics for analysis and QC.
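
The column and nullability contract above can be made concrete with a small pure-Python sketch. It is not the pandas code in features.py: the helper names are hypothetical, lists stand in for Series, and max_lag is reduced to 2 so the warm-up effect is visible on four rows.

```python
import math

NAN = math.nan


def residual_load(load, wind, solar):
    """load - wind - solar; NaN in any input propagates, so missingness is kept."""
    return load - wind - solar


def lagged(values, lag):
    """Shift a series back by `lag` steps, padding the warm-up with NaN."""
    return ([NAN] * lag + list(values))[: len(values)]


def cyclical_encode(value, period):
    """Map a periodic value (e.g. hour of day, period 24) to (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


prices = [30.0, 32.0, 31.0, 35.0]
max_lag = 2  # the guide uses 24; 2 keeps the example short
lags = {k: lagged(prices, k) for k in range(1, max_lag + 1)}

# Keep only rows where the price and every price lag are present.
kept = [
    i
    for i, price in enumerate(prices)
    if not math.isnan(price) and not any(math.isnan(lags[k][i]) for k in lags)
]

hour_sin, hour_cos = cyclical_encode(6, 24)
missing_residual = residual_load(100.0, NAN, 5.0)
```

With a maximum lag of 2 the first two rows are dropped; with 24 lags the first 24 hours of any window are dropped in the same way, while a missing forecast only turns residual_load into NaN and does not remove the row.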

7.2 Feature-store persistence contract

Consumer: electricity_price_features table.

  • Primary key contract
    • (country_code, delivery_start, feature_version)
  • Schema constraint contract
    • core numeric columns are NOT NULL.
    • lag arrays are DOUBLE PRECISION[] and expected length 24.
  • Write-time contract
    • persist_feature_frame(...) filters rows that violate NOT NULL core columns before UPSERT.
    • retained rows are idempotently upserted via ON CONFLICT ... DO UPDATE.
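
The write-time filter can be sketched as below. The guide does not enumerate which columns count as core, so the CORE_COLUMNS tuple here is a placeholder assumption, and the function name persistable_rows is hypothetical.

```python
import math

# Placeholder set of NOT NULL columns; the real list lives in the SQL schema.
CORE_COLUMNS = ("day_ahead_price", "load_forecast", "residual_load")


def is_missing(value):
    """Treat both None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def persistable_rows(rows, required=CORE_COLUMNS):
    """Keep only rows where every required column holds a value."""
    return [
        row for row in rows
        if not any(is_missing(row.get(column)) for column in required)
    ]


rows = [
    {"day_ahead_price": 50.0, "load_forecast": 40000.0, "residual_load": 25000.0},
    {"day_ahead_price": 52.0, "load_forecast": math.nan, "residual_load": math.nan},
]
kept_rows = persistable_rows(rows)
```

Rows removed here remain available in the in-memory frame returned by the pipeline; they are only excluded from the UPSERT.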

7.3 Raw-observation contract

Consumer: electricity_market_observations table.

  • Primary key contract
    • (country_code, delivery_start)
  • Merge contract
    • upsert uses COALESCE(EXCLUDED.col, existing.col) to avoid null-overwriting prior known values.
  • Coverage contract
    • secondary cache guarantees fetched payloads are aligned to expected hourly index for the requested [start, end) range.
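
The merge contract can be tried out with the standard library's sqlite3 module, which accepts the same ON CONFLICT ... DO UPDATE form (SQLite 3.24 or newer). This is a sketch under assumptions: the production table lives in PostgreSQL, only two value columns are shown, and timestamps are stored as text here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE electricity_market_observations (
        country_code TEXT NOT NULL,
        delivery_start TEXT NOT NULL,
        day_ahead_price REAL,
        load_forecast REAL,
        PRIMARY KEY (country_code, delivery_start)
    )
    """
)

# COALESCE keeps the stored value whenever the incoming value is NULL.
UPSERT = """
    INSERT INTO electricity_market_observations
        (country_code, delivery_start, day_ahead_price, load_forecast)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (country_code, delivery_start) DO UPDATE SET
        day_ahead_price = COALESCE(excluded.day_ahead_price,
                                   electricity_market_observations.day_ahead_price),
        load_forecast = COALESCE(excluded.load_forecast,
                                 electricity_market_observations.load_forecast)
"""

key = ("DE_LU", "2024-01-01T00:00:00+00:00")
conn.execute(UPSERT, (*key, 55.0, None))     # first pass: price only
conn.execute(UPSERT, (*key, None, 41000.0))  # partial refresh: load only

row = conn.execute(
    "SELECT day_ahead_price, load_forecast FROM electricity_market_observations"
).fetchone()
```

After both statements the single row holds the price from the first write and the load from the second, which is the behavior the contract describes for partial refreshes.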

8) Practical debugging checklist

  1. Run scripts/init_db.py and ensure tables exist.
  2. Run one short-range fetch window (1-2 days) first.
  3. Verify cache growth:
    • SELECT namespace, function_name, COUNT(*) FROM entsoe_api_cache GROUP BY 1,2;
  4. Verify raw persistence:
    • SELECT COUNT(*) FROM electricity_market_observations WHERE country_code = '...';
  5. Verify feature persistence:
    • check that every lag array has length 24 and that the feature row count is lower than the raw row count by about 24 (the lag warmup rows).

9) Suggested next developer docs to add

  • Data quality rules (acceptable missingness, clipping policy, anomaly handling).
  • Training-set contract (target definition, split strategy, leakage constraints).
  • Backfill/replay policy for reprocessing historical periods.