Feature Module¶

Feature extraction for technical indicators and derived metrics.

v2: where features live

flow no longer constructs features: the .features() builder method was removed. Features now live either inside a forecast artefact (pinned together with its weights) or as primitive parameters on a detector. The FeaturePipeline class itself is unchanged. It remains the single computation engine and can still be used directly to turn a DataFrame into feature columns. The new FeatureSpec and ModelFeaturesPipeline add a reproducibility layer (recipe + hash) around that engine without duplicating any computation.

Base Classes¶

signalflow.feature.base.Feature `dataclass` ¶

Feature(group_col: str = 'pair', ts_col: str = 'timestamp', normalized: bool = False, norm_period: int | None = None)

Bases: KwargsTolerantMixin

Base class for all features.

Two methods to implement

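compute(df): runs over all pairs. The default groups by group_col and calls compute_pair on each group; GlobalFeature and FeaturePipeline override it.
compute_pair(df): runs over a single pair. Regular features implement this.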

Attributes:

Name	Type	Description
`requires`	`list[str]`	Input column templates, e.g. ["{price_col}"]
`outputs`	`list[str]`	Output column templates, e.g. ["rsi_{period}"]
`normalized`	`bool`	If True, apply rolling z-score normalization to output.
`norm_period`	`int \| None`	Window for normalization. Defaults to 3x feature period.

component_type `class-attribute` ¶

component_type: SfComponentType = FEATURE

group_col `class-attribute` `instance-attribute` ¶

group_col: str = 'pair'

is_recursive `class-attribute` ¶

is_recursive: bool = False

norm_period `class-attribute` `instance-attribute` ¶

norm_period: int | None = None

normalized `class-attribute` `instance-attribute` ¶

normalized: bool = False

outputs `class-attribute` ¶

outputs: list[str] = []

requires `class-attribute` ¶

requires: list[str] = []

test_params `class-attribute` ¶

test_params: list[dict] = []

ts_col `class-attribute` `instance-attribute` ¶

ts_col: str = 'timestamp'

warmup `property` ¶

warmup: int

Minimum bars needed before output is stable.

Override in subclasses with feature-specific logic. Default: 0 (no warmup required).

warmup_invariant `class-attribute` ¶

warmup_invariant: bool = True

assert_reproducible ¶

assert_reproducible() -> None

Assert this feature honours the warmup reproducibility contract.

A recursive feature that does not guarantee entry-point invariance (is_recursive and not warmup_invariant) will produce different values in live vs. backtest depending on the warmup start point, breaking parity. Such a feature raises here so the problem surfaces before it reaches production.

Raises:

Type	Description
`RuntimeError`	if the feature is recursive and not warmup-invariant.

Source code in src/signalflow/feature/base.py

def assert_reproducible(self) -> None:
    """Assert this feature honours the warmup reproducibility contract.

    A recursive feature that does not guarantee entry-point invariance
    (``is_recursive and not warmup_invariant``) will produce different
    values in live vs. backtest depending on the warmup start point,
    breaking parity. Such a feature raises here so the problem surfaces
    before it reaches production.

    Raises:
        RuntimeError: if the feature is recursive and not warmup-invariant.
    """
    if self.is_recursive and not self.warmup_invariant:
        raise RuntimeError(
            f"{self.__class__.__name__} is recursive (entry-point dependent) and does NOT "
            f"guarantee warmup invariance (warmup_invariant=False). Its value at a given bar "
            f"depends on where the warmup window starts, so live values will diverge from "
            f"backtest and break parity. Use a deterministically seeded (e.g. SMA-init) "
            f"implementation, or extend warmup until convergence is provably within tolerance."
        )
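The entry-point dependence that this check guards against can be illustrated with a plain-Python sketch. It assumes a hypothetical EMA that is seeded with the first value of whatever window it receives. The helper `ema_seeded_with_first_value` is illustrative and not part of signalflow. The same final bar then gets a different value depending on where the warmup window starts.

```python
def ema_seeded_with_first_value(values: list[float], period: int) -> float:
    """Return the EMA at the last bar, seeded with the first value of the window.

    Hypothetical helper for illustration only; not part of signalflow.
    """
    alpha = 2.0 / (period + 1)
    ema = values[0]
    for v in values[1:]:
        ema = alpha * v + (1.0 - alpha) * ema
    return ema


prices = [10.0, 11.0, 13.0, 12.0, 15.0, 14.0, 16.0, 18.0, 17.0, 19.0]

# Same final bar, two different warmup start points.
from_bar_0 = ema_seeded_with_first_value(prices, period=5)
from_bar_4 = ema_seeded_with_first_value(prices[4:], period=5)

# The two values disagree: this is the kind of live-vs-backtest divergence
# that assert_reproducible() is meant to surface early.
gap = abs(from_bar_0 - from_bar_4)
print(f"{from_bar_0:.4f} vs {from_bar_4:.4f} (gap {gap:.4f})")
```

A feature of this kind would set is_recursive = True and warmup_invariant = False, and assert_reproducible() would then raise.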

compute ¶

compute(df: DataFrame, context: dict[str, Any] | None = None) -> pl.DataFrame

Compute feature for all pairs

Source code in src/signalflow/feature/base.py

def compute(self, df: pl.DataFrame, context: dict[str, Any] | None = None) -> pl.DataFrame:
    """Compute feature for all pairs"""
    return df.group_by(self.group_col, maintain_order=True).map_groups(self.compute_pair)

compute_pair ¶

compute_pair(df: DataFrame) -> pl.DataFrame

Compute feature for single pair. Override for per-pair features.

Source code in src/signalflow/feature/base.py

def compute_pair(self, df: pl.DataFrame) -> pl.DataFrame:
    """Compute feature for single pair. Override for per-pair features."""
    raise NotImplementedError(f"{self.__class__.__name__} must implement compute_pair()")

output_cols ¶

output_cols(prefix: str = '') -> list[str]

Actual output column names with parameter substitution.

Source code in src/signalflow/feature/base.py

def output_cols(self, prefix: str = "") -> list[str]:
    """Actual output column names with parameter substitution."""
    return [f"{prefix}{tpl.format(**self.__dict__)}" for tpl in self.outputs]

required_cols ¶

required_cols() -> list[str]

Actual required column names with parameter substitution.

Source code in src/signalflow/feature/base.py

def required_cols(self) -> list[str]:
    """Actual required column names with parameter substitution."""
    return [tpl.format(**self.__dict__) if "{" in tpl else tpl for tpl in self.requires]
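The template substitution used by output_cols() and required_cols() can be tried in isolation. The sketch below uses a stand-in dataclass, `TemplateDemo`, which is hypothetical and only mirrors the two methods shown above; the field names `period` and `price_col` are assumptions chosen to match the template examples in the attribute table.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass
class TemplateDemo:
    """Stand-in for a Feature subclass; illustrative only."""

    period: int = 14
    price_col: str = "close"

    # Class-level templates, resolved against the instance fields.
    requires: ClassVar[list[str]] = ["{price_col}", "volume"]
    outputs: ClassVar[list[str]] = ["rsi_{period}"]

    def output_cols(self, prefix: str = "") -> list[str]:
        return [f"{prefix}{tpl.format(**self.__dict__)}" for tpl in self.outputs]

    def required_cols(self) -> list[str]:
        return [tpl.format(**self.__dict__) if "{" in tpl else tpl for tpl in self.requires]


demo = TemplateDemo(period=7, price_col="open")
resolved_outputs = demo.output_cols(prefix="15m_")  # ["15m_rsi_7"]
resolved_inputs = demo.required_cols()              # ["open", "volume"]
```

Because the templates are formatted with the instance `__dict__`, every placeholder must name an instance field; an unknown placeholder raises KeyError.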

signalflow.feature.feature_pipeline.FeaturePipeline `dataclass` ¶

FeaturePipeline(group_col: str = 'pair', ts_col: str = 'timestamp', normalized: bool = False, norm_period: int | None = None, features: list[Feature] = list(), raw_data_type: RawDataType | str = RawDataType.SPOT)

Bases: Feature

Orchestrates multiple features with optimized execution.

Groups consecutive per-pair features into batches so that each batch runs in a single group_by.

Parameters:

Name	Type	Description	Default
`features`	`list[Feature]`	List of features to compute.	`list()`
`raw_data_type`	`RawDataType \| str`	Type of raw data (defines available columns).	`SPOT`

Example

pipeline = FeaturePipeline(
    features=[
        RsiFeature(period=14),
        SmaFeature(period=20),
        GlobalFeature(base=RsiFeature(period=14), reference_pair="BTCUSDT"),
    ],
    raw_data_type=RawDataType.SPOT,
)
df = pipeline.run(raw_data_view)

features `class-attribute` `instance-attribute` ¶

features: list[Feature] = field(default_factory=list)

outputs `property` ¶

outputs: list[str]

Aggregated outputs from all features.

raw_data_type `class-attribute` `instance-attribute` ¶

raw_data_type: RawDataType | str = SPOT

requires `class-attribute` ¶

requires: list[str] = []

__post_init__ ¶

__post_init__() -> None

Source code in src/signalflow/feature/feature_pipeline.py

def __post_init__(self) -> None:
    if not self.features:
        raise ValueError("FeaturePipeline requires at least one feature")
    self._validate()

_group_into_batches ¶

_group_into_batches() -> list[list[Feature]]

Group features: consecutive per-pair → batch, global → separate.

Source code in src/signalflow/feature/feature_pipeline.py

def _group_into_batches(self) -> list[list[Feature]]:
    """Group features: consecutive per-pair → batch, global → separate."""
    batches = []
    current_batch: list[Feature] = []

    for f in self.features:
        is_global = isinstance(f, (GlobalFeature, FeaturePipeline)) or getattr(f, "_is_global", False)

        if is_global:
            if current_batch:
                batches.append(current_batch)
                current_batch = []
            batches.append([f])
        else:
            current_batch.append(f)

    if current_batch:
        batches.append(current_batch)

    return batches
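The batching rule is easier to see on plain data. In this sketch each feature is reduced to a hypothetical `(name, is_global)` tuple, which is an assumption made for illustration; the control flow follows the listing above.

```python
def group_into_batches(features: list[tuple[str, bool]]) -> list[list[str]]:
    """Consecutive per-pair features share a batch; each global feature stands alone."""
    batches: list[list[str]] = []
    current: list[str] = []

    for name, is_global in features:
        if is_global:
            # Flush the per-pair batch collected so far, then isolate the global feature.
            if current:
                batches.append(current)
                current = []
            batches.append([name])
        else:
            current.append(name)

    if current:
        batches.append(current)
    return batches


plan = group_into_batches(
    [("rsi", False), ("sma", False), ("market_return", True), ("atr", False)]
)
# [["rsi", "sma"], ["market_return"], ["atr"]]
```

Feature order is preserved, so a global feature placed between per-pair features splits them into two separate group_by passes.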

_is_per_pair_batch ¶

_is_per_pair_batch(batch: list[Feature]) -> bool

Check if batch contains only per-pair features.

Source code in src/signalflow/feature/feature_pipeline.py

def _is_per_pair_batch(self, batch: list[Feature]) -> bool:
    """Check if batch contains only per-pair features."""
    return not any(
        isinstance(f, (GlobalFeature, FeaturePipeline)) or getattr(f, "_is_global", False) for f in batch
    )

_validate ¶

_validate() -> None

Validate all dependencies are satisfied.

Source code in src/signalflow/feature/feature_pipeline.py

def _validate(self) -> None:
    """Validate all dependencies are satisfied."""
    available = default_registry.get_raw_data_columns(self.raw_data_type)

    for f in self.features:
        required = set(f.required_cols())
        missing = required - available

        if missing:
            raise ValueError(f"{f.__class__.__name__} requires {missing}, available: {sorted(available)}")

        available.update(f.output_cols())
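The check above is order-sensitive: a feature can only consume columns that are raw or were produced by an earlier feature. A minimal sketch of that rule, with features reduced to hypothetical `(name, requires, outputs)` tuples and the raw column set passed in directly instead of read from the registry:

```python
def validate(raw_columns: set[str], features: list[tuple[str, list[str], list[str]]]) -> None:
    """Raise ValueError if a feature needs a column that is not yet available."""
    available = set(raw_columns)  # copy, so the caller's set is not mutated
    for name, requires, outputs in features:
        missing = set(requires) - available
        if missing:
            raise ValueError(f"{name} requires {sorted(missing)}, available: {sorted(available)}")
        available.update(outputs)


raw = {"open", "high", "low", "close", "volume", "timestamp"}
rsi = ("rsi", ["close"], ["rsi_14"])
forecast = ("forecast", ["rsi_14"], ["rsi_14_forecast"])

validate(raw, [rsi, forecast])  # passes: rsi_14 exists before forecast needs it

try:
    validate(raw, [forecast, rsi])  # fails: forecast comes before its input
    order_error = None
except ValueError as exc:
    order_error = str(exc)
```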

assert_reproducible ¶

assert_reproducible() -> None

Assert every nested feature honours the warmup reproducibility contract.

Delegates to Feature.assert_reproducible for each nested feature and aggregates the names of all offending features into a single error, so a pipeline with one or more non-invariant recursive features fails loudly with the specific culprits named.

Raises:

Type	Description
`RuntimeError`	if any nested feature is recursive and not warmup-invariant.

Source code in src/signalflow/feature/feature_pipeline.py

def assert_reproducible(self) -> None:
    """Assert every nested feature honours the warmup reproducibility contract.

    Delegates to :meth:`Feature.assert_reproducible` for each nested feature
    and aggregates the names of all offending features into a single error,
    so a pipeline with one or more non-invariant recursive features fails
    loudly with the specific culprits named.

    Raises:
        RuntimeError: if any nested feature is recursive and not warmup-invariant.
    """
    offenders: list[str] = []
    for f in self.features:
        try:
            f.assert_reproducible()
        except RuntimeError:
            offenders.append(f.__class__.__name__)

    if offenders:
        raise RuntimeError(
            "FeaturePipeline contains recursive features that do not guarantee "
            "warmup invariance (entry-point dependent, will break parity): "
            f"{', '.join(offenders)}. Replace them with SMA-seeded implementations "
            "or remove them from the pipeline."
        )

compute ¶

compute(df: DataFrame, context: dict[str, Any] | None = None) -> pl.DataFrame

Compute all features with optimized batching.

Source code in src/signalflow/feature/feature_pipeline.py

def compute(self, df: pl.DataFrame, context: dict[str, Any] | None = None) -> pl.DataFrame:
    """Compute all features with optimized batching."""
    df = df.sort([self.group_col, self.ts_col])

    batches = self._group_into_batches()

    for batch in batches:
        if self._is_per_pair_batch(batch):

            def apply_batch(pair_df: pl.DataFrame, features: list[Feature] = batch) -> pl.DataFrame:
                for f in features:
                    pair_df = f.compute_pair(pair_df)
                return pair_df

            df = df.group_by(self.group_col, maintain_order=True).map_groups(apply_batch)
        else:
            for f in batch:
                df = f.compute(df, context=context)

    return df

output_cols ¶

output_cols(prefix: str = '') -> list[str]

Source code in src/signalflow/feature/feature_pipeline.py

def output_cols(self, prefix: str = "") -> list[str]:
    return [f"{prefix}{col}" for col in self.outputs]

run ¶

run(raw_data_view: RawDataView, context: dict[str, Any] | None = None) -> pl.DataFrame

Entry point: load from RawDataView and compute.

Source code in src/signalflow/feature/feature_pipeline.py

def run(self, raw_data_view: RawDataView, context: dict[str, Any] | None = None) -> pl.DataFrame:
    """Entry point: load from RawDataView and compute."""
    key = getattr(self.raw_data_type, "value", self.raw_data_type)
    df = raw_data_view.to_polars(key)
    return self.compute(df, context=context)

to_mermaid ¶

to_mermaid() -> str

Generate Mermaid diagram of feature dependencies.

Source code in src/signalflow/feature/feature_pipeline.py

def to_mermaid(self) -> str:
    """Generate Mermaid diagram of feature dependencies."""
    lines = ["graph LR"]
    lines.append("    subgraph Input")
    for col in sorted(default_registry.get_raw_data_columns(self.raw_data_type)):
        lines.append(f"        {col}[{col}]")
    lines.append("    end")

    for f in self.features:
        name = f.__class__.__name__
        if hasattr(f, "period"):
            name = f"{name}_{f.period}"

        for req in f.required_cols():
            lines.append(f"    {req} --> {name}")
        for out in f.output_cols():
            lines.append(f"    {name} --> {out}[{out}]")

    return "\n".join(lines)
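To see the shape of the generated diagram without building a pipeline, the same string assembly can be run on plain data. This is a sketch under assumptions: the raw column set and the `(name, requires, outputs)` tuples are hypothetical inputs standing in for the registry lookup and the feature objects.

```python
def to_mermaid(raw_columns: set[str], features: list[tuple[str, list[str], list[str]]]) -> str:
    """Build a Mermaid 'graph LR' description of column dependencies."""
    lines = ["graph LR", "    subgraph Input"]
    for col in sorted(raw_columns):
        lines.append(f"        {col}[{col}]")
    lines.append("    end")

    for name, requires, outputs in features:
        for req in requires:
            lines.append(f"    {req} --> {name}")
        for out in outputs:
            lines.append(f"    {name} --> {out}[{out}]")

    return "\n".join(lines)


diagram = to_mermaid({"close"}, [("RsiFeature_14", ["close"], ["rsi_14"])])
print(diagram)
```

The result can be pasted into any Mermaid renderer; node names follow the ClassName_period convention from the listing above.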

signalflow.feature.base.GlobalFeature `dataclass` ¶

GlobalFeature(group_col: str = 'pair', ts_col: str = 'timestamp', normalized: bool = False, norm_period: int | None = None, sources: list[str] | None = None)

Bases: Feature

Base class for features computed across all pairs.

Override compute() with custom aggregation logic.

For multi-source features, set sources to specify which exchanges to use. Use get_source_data() to retrieve data from RawData with proper handling.

Attributes:

Name	Type	Description
`sources`	`list[str] \| None`	List of source names to use (e.g., ["binance", "okx"]). If None, uses default source or all available sources.

Example

@dataclass
class AggregatedOI(GlobalFeature):
    sources: list[str] | None = None

    def compute_from_raw(self, raw: RawData, context=None) -> pl.DataFrame:
        # Get data from specified sources
        for source, df in self.iter_sources(raw, "perpetual"):
            ...

sources `class-attribute` `instance-attribute` ¶

sources: list[str] | None = field(default=None)

compute ¶

compute(df: DataFrame, context: dict[str, Any] | None = None) -> pl.DataFrame

Must override - compute global feature across all pairs.

Source code in src/signalflow/feature/base.py

def compute(self, df: pl.DataFrame, context: dict[str, Any] | None = None) -> pl.DataFrame:
    """Must override - compute global feature across all pairs."""
    raise NotImplementedError(f"{self.__class__.__name__} must implement compute()")

get_source_data ¶

get_source_data(raw: RawData, data_type: str, source: str | None = None) -> pl.DataFrame

Get DataFrame from RawData for a specific source.

Parameters:

Name	Type	Description	Default
`raw`	`RawData`	RawData container.	required
`data_type`	`str`	Data type key (e.g., "perpetual", "spot").	required
`source`	`str \| None`	Specific source name. If None, uses default.	`None`

Returns:

Type	Description
`DataFrame`	Data for the specified source.

Source code in src/signalflow/feature/base.py

def get_source_data(
    self,
    raw: "RawData",
    data_type: str,
    source: str | None = None,
) -> pl.DataFrame:
    """Get DataFrame from RawData for a specific source.

    Args:
        raw: RawData container.
        data_type: Data type key (e.g., "perpetual", "spot").
        source: Specific source name. If None, uses default.

    Returns:
        pl.DataFrame: Data for the specified source.
    """
    if source is not None:
        return raw.get(data_type, source=source)
    return raw.get(data_type)

iter_sources ¶

iter_sources(raw: RawData, data_type: str) -> Iterator[tuple[str, pl.DataFrame]]

Iterate over source DataFrames from RawData.

If self.sources is set, iterates only those sources. Otherwise, iterates all available sources.

Parameters:

Name	Type	Description	Default
`raw`	`RawData`	RawData container.	required
`data_type`	`str`	Data type key (e.g., "perpetual").	required

Yields:

Type	Description
`tuple[str, DataFrame]`	(source_name, DataFrame) pairs.

Example

for source, df in self.iter_sources(raw, "perpetual"):
    print(f"{source}: {df.shape}")

Source code in src/signalflow/feature/base.py

def iter_sources(
    self,
    raw: "RawData",
    data_type: str,
) -> Iterator[tuple[str, pl.DataFrame]]:
    """Iterate over source DataFrames from RawData.

    If `self.sources` is set, iterates only those sources.
    Otherwise, iterates all available sources.

    Args:
        raw: RawData container.
        data_type: Data type key (e.g., "perpetual").

    Yields:
        tuple[str, pl.DataFrame]: (source_name, DataFrame) pairs.

    Example:
        ```python
        for source, df in self.iter_sources(raw, "perpetual"):
            print(f"{source}: {df.shape}")
        ```
    """
    if data_type not in raw:
        return

    accessor = getattr(raw, data_type)

    # Determine which sources to iterate
    sources_to_use = self.sources if self.sources else accessor.sources

    for source in sources_to_use:
        if source in accessor:
            yield source, getattr(accessor, source)
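The source-selection rule in this method has two details worth noting: requested sources that are not available are skipped silently, and an empty list behaves like None because the fallback uses a truthiness test. The sketch below reproduces only that selection logic; the dict standing in for the RawData accessor is an assumption.

```python
from typing import Optional


def select_sources(requested: Optional[list[str]], available: dict[str, str]) -> list[str]:
    """Return the source names that would be iterated, in order."""
    to_use = requested if requested else list(available)
    return [name for name in to_use if name in available]


frames = {"binance": "df_binance", "okx": "df_okx"}  # placeholder values, not real frames

explicit = select_sources(["binance", "kraken"], frames)  # ["binance"]
fallback_none = select_sources(None, frames)              # ["binance", "okx"]
fallback_empty = select_sources([], frames)               # ["binance", "okx"]
```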

signalflow.feature.offset_feature.OffsetFeature `dataclass` ¶

OffsetFeature(group_col: str = 'pair', ts_col: str = 'timestamp', normalized: bool = False, norm_period: int | None = None, feature_name: str | None = None, feature_params: dict = dict(), window: int = 15, prefix: str | None = None)

Bases: Feature

Multi-timeframe feature via offset resampling.

Creates window parallel time series, one per offset from 0 to window - 1. Each series is computed as if on window-minute bars.

Supports both regular Feature (compute_pair) and GlobalFeature (compute).

Parameters:

Name	Type	Description	Default
`feature_name`	`str \| None`	Registered component name (sf_component name).	`None`
`feature_params`	`dict`	Parameters for feature class.	`dict()`
`window`	`int`	Aggregation window in minutes. Default: 15.	`15`
`prefix`	`str \| None`	Prefix for output columns. Default: "{window}m_".	`None`

Example

offset = OffsetFeature(
    feature_name="test_rsi",
    feature_params={"period": 14},
    window=15,
)
# Outputs: 15m_rsi_14, offset

# With GlobalFeature
offset = OffsetFeature(
    feature_name="global/market_log_return",
    feature_params={},
    window=15,
)

feature_name `class-attribute` `instance-attribute` ¶

feature_name: str | None = None

feature_params `class-attribute` `instance-attribute` ¶

feature_params: dict = field(default_factory=dict)

outputs `class-attribute` ¶

outputs: list[str] = ['offset']

prefix `class-attribute` `instance-attribute` ¶

prefix: str | None = None

requires `class-attribute` ¶

requires: list[str] = ['open', 'high', 'low', 'close', 'volume', 'timestamp']

window `class-attribute` `instance-attribute` ¶

window: int = 15

__post_init__ ¶

__post_init__() -> None

Source code in src/signalflow/feature/offset_feature.py

def __post_init__(self) -> None:
    if self.feature_name is None:
        raise ValueError("OffsetFeature requires 'feature_name'")

    self._feature_cls = default_registry.get(SfComponentType.FEATURE, self.feature_name)
    self._base = self._feature_cls(**self.feature_params)
    self._is_global = isinstance(self._base, GlobalFeature)

    if self.prefix is None:
        self.prefix = f"{self.window}m_"

_compute_all_pairs_global ¶

_compute_all_pairs_global(df: DataFrame) -> pl.DataFrame

Compute features for all pairs with global base feature.

Source code in src/signalflow/feature/offset_feature.py

def _compute_all_pairs_global(self, df: pl.DataFrame) -> pl.DataFrame:
    """Compute features for all pairs with global base feature."""
    df = df.sort([self.group_col, self.ts_col])
    original_len = len(df)

    df = df.with_columns(
        pl.col(self.ts_col).rank("ordinal").over(self.group_col).cast(pl.UInt32).alias("_orig_idx") - 1
    )

    df = df.with_columns((pl.col("_orig_idx") % self.window).cast(pl.UInt8).alias("offset"))

    offset_results = []
    for offset in range(self.window):
        resampled = (
            df.drop(["_orig_idx", "offset"])
            .with_columns(
                (pl.col(self.ts_col).rank("ordinal").over(self.group_col).cast(pl.Int64) - 1 - offset)
                .floordiv(self.window)
                .alias("_grp")
            )
            .group_by([self.group_col, "_grp"], maintain_order=True)
            .agg(
                [
                    pl.col(self.ts_col).last(),
                    pl.col("open").first(),
                    pl.col("high").max(),
                    pl.col("low").min(),
                    pl.col("close").last(),
                    pl.col("volume").sum(),
                ]
            )
        )

        with_feat = self._compute_base_feature(resampled)
        with_feat = with_feat.with_columns(pl.lit(offset).cast(pl.UInt8).alias("_offset"))

        for col in self._base.output_cols():
            if col in with_feat.columns:
                with_feat = with_feat.rename({col: f"{self.prefix}{col}"})

        offset_results.append(with_feat)

    all_offsets = pl.concat(offset_results)

    df = df.with_columns(
        ((pl.col("_orig_idx").cast(pl.Int64) - pl.col("offset").cast(pl.Int64)) // self.window).alias("_grp")
    )

    feature_cols = [f"{self.prefix}{col}" for col in self._base.output_cols()]

    result = df.join(
        all_offsets.select([self.group_col, "_grp", "_offset", *feature_cols]),
        left_on=[self.group_col, "_grp", "offset"],
        right_on=[self.group_col, "_grp", "_offset"],
        how="left",
    )

    result = result.drop(["_orig_idx", "_grp"])
    assert len(result) == original_len

    return result

_compute_base_feature ¶

_compute_base_feature(resampled: DataFrame) -> pl.DataFrame

Compute base feature - handles both Feature and GlobalFeature.

Source code in src/signalflow/feature/offset_feature.py

def _compute_base_feature(self, resampled: pl.DataFrame) -> pl.DataFrame:
    """Compute base feature - handles both Feature and GlobalFeature."""
    if self._is_global:
        return cast(pl.DataFrame, self._base.compute(resampled))
    else:
        return cast(pl.DataFrame, self._base.compute_pair(resampled))

_compute_single_pair ¶

_compute_single_pair(df: DataFrame) -> pl.DataFrame

Compute features for single pair (non-global base).

Source code in src/signalflow/feature/offset_feature.py

def _compute_single_pair(self, df: pl.DataFrame) -> pl.DataFrame:
    """Compute features for single pair (non-global base)."""
    df = df.sort(self.ts_col)
    original_len = len(df)
    df = df.with_row_index("_orig_idx")

    df = df.with_columns((pl.col("_orig_idx") % self.window).cast(pl.UInt8).alias("offset"))

    offset_results = []
    for offset in range(self.window):
        resampled = self._resample_ohlcv(df.drop(["_orig_idx", "offset"]), offset)

        with_feat = self._compute_base_feature(resampled)
        with_feat = with_feat.with_columns(pl.lit(offset).cast(pl.UInt8).alias("_offset"))

        for col in self._base.output_cols():
            if col in with_feat.columns:
                with_feat = with_feat.rename({col: f"{self.prefix}{col}"})

        offset_results.append(with_feat)

    all_offsets = pl.concat(offset_results)

    df = df.with_columns(
        ((pl.col("_orig_idx").cast(pl.Int64) - pl.col("offset").cast(pl.Int64)) // self.window).alias("_grp")
    )

    feature_cols = [f"{self.prefix}{col}" for col in self._base.output_cols()]

    result = df.join(
        all_offsets.select(["_grp", "_offset", *feature_cols]),
        left_on=["_grp", "offset"],
        right_on=["_grp", "_offset"],
        how="left",
    )

    result = result.drop(["_orig_idx", "_grp"])
    assert len(result) == original_len

    return result

_resample_ohlcv ¶

_resample_ohlcv(df: DataFrame, offset: int) -> pl.DataFrame

Resample 1m OHLCV to window-minute bars with given offset.

Source code in src/signalflow/feature/offset_feature.py

def _resample_ohlcv(self, df: pl.DataFrame, offset: int) -> pl.DataFrame:
    """Resample 1m OHLCV to window-minute bars with given offset."""
    df = df.with_row_index("_row_idx")

    df = df.with_columns(((pl.col("_row_idx").cast(pl.Int64) - offset) // self.window).alias("_grp"))

    agg_exprs = [
        pl.col(self.ts_col).last(),
        pl.col("open").first(),
        pl.col("high").max(),
        pl.col("low").min(),
        pl.col("close").last(),
        pl.col("volume").sum(),
    ]
    if self.group_col in df.columns:
        agg_exprs.append(pl.col(self.group_col).first())

    return df.group_by("_grp", maintain_order=True).agg(agg_exprs)
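The grouping arithmetic is the core of the offset resampling: row i lands in group (i - offset) // window, so a non-zero offset produces a short leading group with a negative index. The plain-Python sketch below mirrors that arithmetic and the OHLCV aggregation on a list of dicts. It assumes that Python's floor division matches the integer floor division used in the listing; the function names are hypothetical.

```python
def offset_groups(n_rows: int, window: int, offset: int) -> list[int]:
    """Group id per row for one offset."""
    return [(i - offset) // window for i in range(n_rows)]


def resample(bars: list[dict], window: int, offset: int) -> dict[int, dict]:
    """Aggregate consecutive bars into window-sized OHLCV bars for one offset."""
    grouped: dict[int, list[dict]] = {}
    for i, bar in enumerate(bars):
        grouped.setdefault((i - offset) // window, []).append(bar)
    return {
        grp: {
            "open": rows[0]["open"],
            "high": max(r["high"] for r in rows),
            "low": min(r["low"] for r in rows),
            "close": rows[-1]["close"],
            "volume": sum(r["volume"] for r in rows),
        }
        for grp, rows in grouped.items()
    }


bars = [
    {"open": float(i), "high": i + 1.0, "low": i - 1.0, "close": i + 0.5, "volume": 1.0}
    for i in range(7)
]

groups_offset_1 = offset_groups(7, window=3, offset=1)  # [-1, 0, 0, 0, 1, 1, 1]
resampled = resample(bars, window=3, offset=1)
```

With window=3 and offset=1, group 0 covers rows 1 to 3, so its open comes from row 1 and its close from row 3.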

compute ¶

compute(df: DataFrame, context: dict[str, Any] | None = None) -> pl.DataFrame

Compute for all pairs.

Source code in src/signalflow/feature/offset_feature.py

def compute(self, df: pl.DataFrame, context: dict[str, Any] | None = None) -> pl.DataFrame:
    """Compute for all pairs."""
    if self._is_global:
        return self._compute_all_pairs_global(df)
    else:
        return df.group_by(self.group_col, maintain_order=True).map_groups(self._compute_single_pair)

compute_pair ¶

compute_pair(df: DataFrame) -> pl.DataFrame

Compute for single pair (only for non-global base).

Source code in src/signalflow/feature/offset_feature.py

def compute_pair(self, df: pl.DataFrame) -> pl.DataFrame:
    """Compute for single pair (only for non-global base)."""
    if self._is_global:
        raise NotImplementedError("GlobalFeature base requires compute(), not compute_pair()")
    return self._compute_single_pair(df)

from_dict `classmethod` ¶

from_dict(data: dict) -> OffsetFeature

Deserialize from config.

Source code in src/signalflow/feature/offset_feature.py

@classmethod
def from_dict(cls, data: dict) -> "OffsetFeature":
    """Deserialize from config."""
    return cls(
        feature_name=data["feature_name"],
        feature_params=data["feature_params"],
        window=data["window"],
        prefix=data.get("prefix"),
    )

output_cols ¶

output_cols(prefix: str = '') -> list[str]

Source code in src/signalflow/feature/offset_feature.py

def output_cols(self, prefix: str = "") -> list[str]:
    base_cols = self._base.output_cols(prefix=f"{prefix}{self.prefix}")
    return [*base_cols, f"{prefix}offset"]

required_cols ¶

required_cols() -> list[str]

Source code in src/signalflow/feature/offset_feature.py

def required_cols(self) -> list[str]:
    return ["open", "high", "low", "close", "volume", self.ts_col]

to_dict ¶

to_dict() -> dict

Serialize for Kedro.

Source code in src/signalflow/feature/offset_feature.py

def to_dict(self) -> dict:
    """Serialize for Kedro."""
    return {
        "feature_name": self.feature_name,
        "feature_params": self.feature_params,
        "window": self.window,
        "prefix": self.prefix,
    }

signalflow.feature.lin_reg_forecast.LinRegForecastFeature `dataclass` ¶

LinRegForecastFeature(group_col: str = 'pair', ts_col: str = 'timestamp', normalized: bool = False, norm_period: int | None = None, source_col: str = 'rsi_14', n_lags: int = 10, trend_window: int = 5, mean_window: int = 20, refit_period: Literal['hour', 'day', 'week', 'month', None] = 'day', alpha: float = 1.0, forecast_horizon: int = 1, min_samples: int = 50)

Bases: Feature

Enhanced linear regression forecast with trend and mean-reversion features.

Instead of predicting raw values, predicts the change (diff) and adds:

- Trend slope (recent momentum)
- Mean reversion signal (deviation from rolling mean)
- Volatility scaling

Parameters:

Name	Type	Description	Default
`source_col`	`str`	Column to forecast.	`'rsi_14'`
`n_lags`	`int`	Number of lagged diffs. Default: 10.	`10`
`trend_window`	`int`	Window for trend calculation. Default: 5.	`5`
`mean_window`	`int`	Window for mean reversion. Default: 20.	`20`
`refit_period`	`Literal['hour', 'day', 'week', 'month', None]`	When to refit. Default: 'day'.	`'day'`
`alpha`	`float`	Ridge regularization. Default: 1.0.	`1.0`
`forecast_horizon`	`int`	Steps ahead to forecast. Default: 1.	`1`
`min_samples`	`int`	Index of the first bar at which a forecast may be produced; earlier bars stay NaN. Default: 50.	`50`

alpha `class-attribute` `instance-attribute` ¶

alpha: float = 1.0

forecast_horizon `class-attribute` `instance-attribute` ¶

forecast_horizon: int = 1

mean_window `class-attribute` `instance-attribute` ¶

mean_window: int = 20

min_samples `class-attribute` `instance-attribute` ¶

min_samples: int = 50

n_lags `class-attribute` `instance-attribute` ¶

n_lags: int = 10

outputs `class-attribute` ¶

outputs: list[str] = ['{source_col}_forecast', '{source_col}_forecast_change', '{source_col}_forecast_direction']

refit_period `class-attribute` `instance-attribute` ¶

refit_period: Literal['hour', 'day', 'week', 'month', None] = 'day'

requires `class-attribute` ¶

requires: list[str] = ['{source_col}']

source_col `class-attribute` `instance-attribute` ¶

source_col: str = 'rsi_14'

test_params `class-attribute` ¶

test_params: list[dict] = [{'source_col': 'rsi_14', 'n_lags': 10}, {'source_col': 'rsi_14', 'n_lags': 5, 'mean_window': 10}]

trend_window `class-attribute` `instance-attribute` ¶

trend_window: int = 5

warmup `property` ¶

warmup: int

__post_init__ ¶

__post_init__() -> None

Source code in src/signalflow/feature/lin_reg_forecast.py

def __post_init__(self) -> None:
    if self.n_lags < 1:
        raise ValueError("n_lags must be >= 1")

_build_features ¶

_build_features(values: ndarray) -> np.ndarray

Build enhanced feature matrix.

Source code in src/signalflow/feature/lin_reg_forecast.py

def _build_features(self, values: np.ndarray) -> np.ndarray:
    """Build enhanced feature matrix."""
    n = len(values)

    diffs = np.diff(values, prepend=values[0])

    n_features = self.n_lags + 3
    X = np.full((n, n_features), np.nan, dtype=np.float64)

    start_idx = max(self.n_lags, self.mean_window)

    for i in range(start_idx, n):
        for lag in range(self.n_lags):
            X[i, lag] = diffs[i - lag - 1]

        window = values[i - self.trend_window : i]
        if len(window) == self.trend_window:
            x_trend = np.arange(self.trend_window)
            X[i, self.n_lags] = np.polyfit(x_trend, window, 1)[0]

        mean_window = values[i - self.mean_window : i]
        if len(mean_window) == self.mean_window:
            mean_val = np.mean(mean_window)
            std_val = np.std(mean_window)
            if std_val > 1e-8:
                X[i, self.n_lags + 1] = (values[i] - mean_val) / std_val
            else:
                X[i, self.n_lags + 1] = 0

        vol_window = diffs[i - self.trend_window : i]
        if len(vol_window) == self.trend_window:
            X[i, self.n_lags + 2] = np.std(vol_window)

    return X
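Each row of the matrix has n_lags + 3 columns: the lagged diffs, then trend slope, mean-reversion z-score, and diff volatility. The sketch below rebuilds a single row with the standard library only, as a way to check the layout by hand. It is an illustration and not the library code: `feature_row` and `ols_slope` are hypothetical names, and the population standard deviation is used on the assumption that it matches NumPy's default.

```python
from statistics import fmean, pstdev


def ols_slope(ys: list[float]) -> float:
    """Least-squares slope of ys against 0, 1, 2, ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = fmean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    return sxy / sxx


def feature_row(values: list[float], i: int, n_lags: int, trend_window: int, mean_window: int) -> list[float]:
    """Feature vector for bar i: [lagged diffs..., slope, z-score, volatility]."""
    # First diff is 0.0, mirroring np.diff(values, prepend=values[0]).
    diffs = [0.0] + [b - a for a, b in zip(values, values[1:])]

    lags = [diffs[i - lag - 1] for lag in range(n_lags)]
    slope = ols_slope(values[i - trend_window : i])

    past = values[i - mean_window : i]
    mean_val, std_val = fmean(past), pstdev(past)
    zscore = (values[i] - mean_val) / std_val if std_val > 1e-8 else 0.0

    volatility = pstdev(diffs[i - trend_window : i])
    return [*lags, slope, zscore, volatility]


row = feature_row([1.0, 2.0, 4.0, 7.0, 11.0, 16.0], i=5, n_lags=3, trend_window=3, mean_window=4)
```

Note that the slope, mean, and volatility windows end at bar i - 1, while the z-score compares the current value at bar i against that past window.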

_build_targets ¶

_build_targets(values: ndarray) -> np.ndarray

Build target: forward diff (change).

Source code in src/signalflow/feature/lin_reg_forecast.py

def _build_targets(self, values: np.ndarray) -> np.ndarray:
    """Build target: forward diff (change)."""
    n = len(values)
    y = np.full(n, np.nan, dtype=np.float64)

    if self.forecast_horizon < n:
        # Forward change over the horizon: y[i] = values[i + h] - values[i].
        # The last h entries have no future value and stay NaN.
        h = self.forecast_horizon
        y[: n - h] = values[h:] - values[: n - h]

    return y
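The target for bar i is the change over the forecast horizon, values[i + h] - values[i], and the last h bars have no target. A minimal plain-Python sketch of that definition, where `build_targets` is a stand-in that works on lists instead of NumPy arrays:

```python
import math


def build_targets(values: list[float], horizon: int) -> list[float]:
    """Forward change per bar; NaN where the future value is not available."""
    n = len(values)
    y = [math.nan] * n
    if horizon < n:
        for i in range(n - horizon):
            y[i] = values[i + horizon] - values[i]
    return y


targets = build_targets([1.0, 2.0, 4.0, 7.0, 11.0], horizon=2)
# First three entries: 3.0, 5.0, 7.0; the last two stay NaN.
```

Rows with a NaN target are excluded from training but can still receive a forecast, since prediction needs only the feature row.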

_get_period_key ¶

_get_period_key(ts: datetime) -> tuple | None

Source code in src/signalflow/feature/lin_reg_forecast.py

def _get_period_key(self, ts: datetime) -> tuple | None:
    if self.refit_period == "hour":
        return (ts.year, ts.month, ts.day, ts.hour)
    elif self.refit_period == "day":
        return (ts.year, ts.month, ts.day)
    elif self.refit_period == "week":
        return (ts.year, ts.isocalendar()[1])
    elif self.refit_period == "month":
        return (ts.year, ts.month)
    return None
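The period key decides when the model is refit: a refit is attempted whenever the key of the current bar differs from the previous one. The sketch below reproduces the key function as a free function and lists the bar indices at which the key changes. `period_key` and `refit_points` are hypothetical names used for illustration.

```python
from datetime import datetime
from typing import Optional


def period_key(ts: datetime, refit_period: Optional[str]) -> Optional[tuple]:
    if refit_period == "hour":
        return (ts.year, ts.month, ts.day, ts.hour)
    if refit_period == "day":
        return (ts.year, ts.month, ts.day)
    if refit_period == "week":
        return (ts.year, ts.isocalendar()[1])
    if refit_period == "month":
        return (ts.year, ts.month)
    return None


def refit_points(timestamps: list[datetime], refit_period: Optional[str]) -> list[int]:
    """Indices at which the period key differs from the previous key."""
    points: list[int] = []
    current = None
    for i, ts in enumerate(timestamps):
        key = period_key(ts, refit_period)
        if key != current:
            current = key
            points.append(i)
    return points


stamps = [
    datetime(2024, 3, 15, 22, 0),
    datetime(2024, 3, 15, 23, 0),
    datetime(2024, 3, 16, 0, 0),
    datetime(2024, 3, 16, 1, 0),
]

daily = refit_points(stamps, "day")    # [0, 2]
hourly = refit_points(stamps, "hour")  # [0, 1, 2, 3]
never = refit_points(stamps, None)     # []
week_key = period_key(stamps[0], "week")  # (2024, 11)
```

Two edge cases follow from the key definition. With refit_period=None the key is always None, so no key change is ever detected. The week key pairs the calendar year with the ISO week number, so 30 December 2024 yields (2024, 1) although it belongs to ISO week 1 of 2025, and the key changes again on 1 January.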

compute_pair ¶

compute_pair(df: DataFrame) -> pl.DataFrame

Compute forecasts for single pair.

Source code in src/signalflow/feature/lin_reg_forecast.py

def compute_pair(self, df: pl.DataFrame) -> pl.DataFrame:
    """Compute forecasts for single pair."""
    values = df[self.source_col].to_numpy().astype(np.float64)
    timestamps = df[self.ts_col].to_list()
    n = len(values)

    X = self._build_features(values)
    y = self._build_targets(values)

    forecasts = np.full(n, np.nan, dtype=np.float64)
    forecast_changes = np.full(n, np.nan, dtype=np.float64)

    valid_mask = ~np.any(np.isnan(X), axis=1)
    target_valid = ~np.isnan(y)
    train_valid = valid_mask & target_valid

    current_period = None
    model = Ridge(alpha=self.alpha)
    fitted = False

    for i in range(self.min_samples, n):
        if not valid_mask[i]:
            continue

        period = self._get_period_key(timestamps[i])

        if period != current_period:
            current_period = period
            train_idx = np.where(train_valid[:i])[0]
            if len(train_idx) >= 20:
                model.fit(X[train_idx], y[train_idx])
                fitted = True

        if fitted:
            predicted_change = model.predict(X[i : i + 1])[0]
            forecast_changes[i] = predicted_change
            forecasts[i] = values[i] + predicted_change

    forecast_col = f"{self.source_col}_forecast"
    change_col = f"{self.source_col}_forecast_change"
    direction_col = f"{self.source_col}_forecast_direction"

    return df.with_columns(
        [
            pl.Series(name=forecast_col, values=forecasts),
            pl.Series(name=change_col, values=forecast_changes),
            pl.Series(name=direction_col, values=np.sign(forecast_changes)),
        ]
    )
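
The control flow above (refit when the calendar period changes, then predict every bar with the most recent fit) can be sketched without scikit-learn. In the sketch below the Ridge model is swapped for a deliberately trivial stand-in, the mean of past forward changes, and the refit period is fixed to one day. It also restricts training targets to those fully observed by bar i (j + horizon <= i), which is an assumption of this sketch rather than something the source states. All names are hypothetical.

```python
from datetime import datetime, timedelta


def walk_forward_mean_change(values, timestamps, horizon=1, min_train=3):
    """Walk-forward forecast, refitting once per calendar day.

    Stand-in model: mean of past forward changes (not Ridge).
    """
    n = len(values)
    forecasts = [None] * n
    current_day = None
    mean_change = None  # the "fitted model"

    for i in range(n):
        ts = timestamps[i]
        day = (ts.year, ts.month, ts.day)
        if day != current_day:
            current_day = day
            # Use only targets whose future bar is already known at bar i.
            past = [values[j + horizon] - values[j] for j in range(i - horizon + 1)]
            if len(past) >= min_train:
                mean_change = sum(past) / len(past)
        if mean_change is not None:
            forecasts[i] = values[i] + mean_change
    return forecasts


start = datetime(2024, 1, 1)
stamps = [start + timedelta(hours=6 * k) for k in range(8)]  # four bars per day
prices = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
result = walk_forward_mean_change(prices, stamps, horizon=1)
print(result)
```

The first day produces no forecasts because there is not yet enough history to fit; from the second day on, every bar reuses the fit made at that day's first bar.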

signalflow.feature.atr.ATRFeature `dataclass` ¶

ATRFeature(group_col: str = 'pair', ts_col: str = 'timestamp', normalized: bool = False, norm_period: int | None = None, period: int = 14, smoothing: Literal['sma', 'ema'] = 'ema')

Bases: Feature

Average True Range (ATR) feature.

Measures market volatility as the moving average of True Range. True Range = max(high - low, |high - prev_close|, |low - prev_close|)

Parameters:

Name	Type	Description	Default
`period`	`int`	ATR period. Default: 14.	`14`
`smoothing`	`Literal['sma', 'ema']`	Smoothing method: "sma" (simple rolling mean) or "ema" (exponential, span = period). Default: "ema".	`'ema'`

Example

atr = ATRFeature(period=14)
atr.output_cols()  # ["atr_14"]

outputs `class-attribute` ¶

outputs: list[str] = ['atr_{period}']

period `class-attribute` `instance-attribute` ¶

period: int = 14

requires `class-attribute` ¶

requires: list[str] = ['high', 'low', 'close']

smoothing `class-attribute` `instance-attribute` ¶

smoothing: Literal['sma', 'ema'] = 'ema'

test_params `class-attribute` ¶

test_params: list[dict] = [{'period': 14}, {'period': 14, 'smoothing': 'sma'}, {'period': 20}]

warmup `property` ¶

warmup: int

_get_output_name ¶

_get_output_name() -> str

Source code in src/signalflow/feature/atr.py

def _get_output_name(self) -> str:
    suffix = "_norm" if self.normalized else ""
    return f"atr_{self.period}{suffix}"

compute_pair ¶

compute_pair(df: DataFrame) -> pl.DataFrame

Compute ATR for a single pair.

Source code in src/signalflow/feature/atr.py

def compute_pair(self, df: pl.DataFrame) -> pl.DataFrame:
    """Compute ATR for a single pair."""
    col_name = self._get_output_name()

    high = pl.col("high")
    low = pl.col("low")
    prev_close = pl.col("close").shift(1)

    # True Range = max(H-L, |H-prevC|, |L-prevC|)
    tr = pl.max_horizontal(
        high - low,
        (high - prev_close).abs(),
        (low - prev_close).abs(),
    )

    # Apply smoothing
    if self.smoothing == "sma":
        atr = tr.rolling_mean(window_size=self.period)
    else:
        # EMA with span=period, i.e. alpha = 2 / (period + 1); Wilder's smoothing would use alpha = 1 / period
        atr = tr.ewm_mean(span=self.period, adjust=False)

    df = df.with_columns(atr.alias(col_name))

    # Optional z-score normalization
    if self.normalized:
        from signalflow.feature.examples import _get_norm_window, _normalize_zscore

        norm_window = self.norm_period or _get_norm_window(self.period)
        vals = df[col_name].to_numpy()
        normed = _normalize_zscore(vals, window=norm_window)
        df = df.with_columns(pl.Series(name=col_name, values=normed))

    return df
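
A worked example of the True Range formula and the two smoothing modes, as a minimal pure-Python sketch. The helper names are hypothetical. Using high - low alone for the first bar (where no previous close exists) is an assumption of this sketch. The EMA here is the plain recursive form with an explicit alpha, so the span-based alpha 2 / (period + 1) and Wilder's 1 / period can be compared directly.

```python
def true_range(high, low, close):
    """TR = max(H - L, |H - prevC|, |L - prevC|); first bar uses H - L only."""
    tr = []
    for i in range(len(high)):
        candidates = [high[i] - low[i]]
        if i > 0:
            prev_close = close[i - 1]
            candidates.append(abs(high[i] - prev_close))
            candidates.append(abs(low[i] - prev_close))
        tr.append(max(candidates))
    return tr


def sma(values, period):
    """Simple moving average; None until a full window is available."""
    out = [None] * len(values)
    for i in range(period - 1, len(values)):
        out[i] = sum(values[i - period + 1 : i + 1]) / period
    return out


def ema(values, alpha):
    """Recursive EMA seeded with the first value."""
    out = []
    level = None
    for v in values:
        level = v if level is None else alpha * v + (1 - alpha) * level
        out.append(level)
    return out


high = [10.0, 12.0, 15.0]
low = [8.0, 11.0, 13.0]
close = [9.0, 11.5, 14.0]

tr = true_range(high, low, close)
atr_sma = sma(tr, period=2)
atr_ema = ema(tr, alpha=0.5)

span_alpha = 2 / (14 + 1)   # alpha implied by span=14
wilder_alpha = 1 / 14       # alpha used by Wilder's smoothing
print(tr, atr_sma, atr_ema)
```

The gap between bars 1 and 2 shows why True Range looks at the previous close: the bar's own high - low is 1.0, but the move from the prior close is 3.0.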

Examples¶

signalflow.feature.examples.ExampleRsiFeature `dataclass` ¶

ExampleRsiFeature(group_col: str = 'pair', ts_col: str = 'timestamp', normalized: bool = False, norm_period: int | None = None, period: int = 14, price_col: str = 'close')

Bases: Feature

Relative Strength Index.

Bounded oscillator [0, 100]. In normalized mode, rescales to [-1, 1].

Parameters:

Name	Type	Description	Default
`period`	`int`	RSI period. Default: 14.	`14`
`price_col`	`str`	Price column to use. Default: "close".	`'close'`

Example

rsi = ExampleRsiFeature(period=21)
rsi.output_cols()  # ["rsi_21"]

outputs `class-attribute` ¶

outputs: list[str] = ['rsi_{period}']

period `class-attribute` `instance-attribute` ¶

period: int = 14

price_col `class-attribute` `instance-attribute` ¶

price_col: str = 'close'

requires `class-attribute` ¶

requires: list[str] = ['{price_col}']

test_params `class-attribute` ¶

test_params: list[dict] = [{'period': 14}, {'period': 14, 'normalized': True}, {'period': 21}]

warmup `property` ¶

warmup: int

_get_output_name ¶

_get_output_name() -> str

Source code in src/signalflow/feature/examples.py

def _get_output_name(self) -> str:
    suffix = "_norm" if self.normalized else ""
    return f"rsi_{self.period}{suffix}"

compute ¶

compute(df: DataFrame, context: dict[str, Any] | None = None) -> pl.DataFrame

Compute RSI for all pairs.

Source code in src/signalflow/feature/examples.py

def compute(self, df: pl.DataFrame, context: dict[str, Any] | None = None) -> pl.DataFrame:
    """Compute RSI for all pairs."""
    return df.group_by(self.group_col, maintain_order=True).map_groups(self.compute_pair)

compute_pair ¶

compute_pair(df: DataFrame) -> pl.DataFrame

Compute RSI for single pair.

Source code in src/signalflow/feature/examples.py

def compute_pair(self, df: pl.DataFrame) -> pl.DataFrame:
    """Compute RSI for single pair."""
    col_name = self._get_output_name()

    delta = pl.col(self.price_col).diff()
    gain = pl.when(delta > 0).then(delta).otherwise(0)
    loss = pl.when(delta < 0).then(-delta).otherwise(0)

    avg_gain = gain.rolling_mean(window_size=self.period)
    avg_loss = loss.rolling_mean(window_size=self.period)

    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))

    df = df.with_columns(rsi.alias(col_name))

    # Normalization: rescale bounded [0, 100] → [-1, 1]
    if self.normalized:
        df = df.with_columns(((pl.col(col_name) - 50) / 50).alias(col_name))

    return df
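
The same arithmetic on a five-bar series, as a minimal sketch in plain Python. The function names are hypothetical. This sketch emits its first value once `period` real price changes exist, which may differ from how the Polars version treats the first bar, and mapping a zero average loss to 100 (or to None when there were no gains either) is an assumption of the sketch.

```python
def rsi_simple(prices, period):
    """RSI from simple rolling means of gains and losses."""
    deltas = [prices[i] - prices[i - 1] for i in range(1, len(prices))]
    gains = [max(d, 0.0) for d in deltas]
    losses = [max(-d, 0.0) for d in deltas]

    out = [None] * len(prices)
    for i in range(period, len(prices)):
        # gains[i - period : i] are the `period` changes ending at bar i.
        avg_gain = sum(gains[i - period : i]) / period
        avg_loss = sum(losses[i - period : i]) / period
        if avg_loss == 0:
            out[i] = 100.0 if avg_gain > 0 else None
        else:
            rs = avg_gain / avg_loss
            out[i] = 100.0 - 100.0 / (1.0 + rs)
    return out


def rescale(rsi_value):
    """Map the bounded range [0, 100] onto [-1, 1]."""
    return (rsi_value - 50.0) / 50.0


rsi_values = rsi_simple([10.0, 11.0, 10.0, 12.0, 11.0], period=2)
print(rsi_values)
```

Equal average gain and loss gives RS = 1 and RSI = 50, which the normalized mode maps to 0.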

signalflow.feature.examples.ExampleSmaFeature `dataclass` ¶

ExampleSmaFeature(group_col: str = 'pair', ts_col: str = 'timestamp', normalized: bool = False, norm_period: int | None = None, period: int = 20, price_col: str = 'close')

Bases: Feature

Simple Moving Average.

outputs `class-attribute` ¶

outputs: list[str] = ['sma_{period}']

period `class-attribute` `instance-attribute` ¶

period: int = 20

price_col `class-attribute` `instance-attribute` ¶

price_col: str = 'close'

requires `class-attribute` ¶

requires: list[str] = ['{price_col}']

test_params `class-attribute` ¶

test_params: list[dict] = [{'period': 20}, {'period': 50}]

warmup `property` ¶

warmup: int

_get_output_name ¶

_get_output_name() -> str

Source code in src/signalflow/feature/examples.py

def _get_output_name(self) -> str:
    suffix = "_norm" if self.normalized else ""
    return f"sma_{self.period}{suffix}"

compute_pair ¶

compute_pair(df: DataFrame) -> pl.DataFrame

Source code in src/signalflow/feature/examples.py

def compute_pair(self, df: pl.DataFrame) -> pl.DataFrame:
    col_name = self._get_output_name()
    sma = pl.col(self.price_col).rolling_mean(window_size=self.period)
    df = df.with_columns(sma.alias(col_name))

    if self.normalized:
        norm_window = self.norm_period or _get_norm_window(self.period)
        vals = df[col_name].to_numpy()
        normed = _normalize_zscore(vals, window=norm_window)
        df = df.with_columns(pl.Series(name=col_name, values=normed))

    return df
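
The helpers _get_norm_window and _normalize_zscore are not shown on this page. As a rough illustration of what rolling z-score normalization means, the sketch below assumes a trailing window, a population standard deviation, and None where the window is incomplete or has zero variance. Those choices, and the function names, are assumptions rather than the library's behaviour. The 3x default follows the norm_period description on the Feature base class.

```python
import statistics


def default_norm_window(period):
    """Assumed default: three times the feature period."""
    return 3 * period


def rolling_zscore(values, window):
    """z[i] = (x[i] - mean(window)) / std(window) over the trailing window."""
    out = [None] * len(values)
    for i in range(window - 1, len(values)):
        chunk = values[i - window + 1 : i + 1]
        mean = statistics.fmean(chunk)
        std = statistics.pstdev(chunk)
        if std > 0:
            out[i] = (values[i] - mean) / std
    return out


z = rolling_zscore([1.0, 2.0, 3.0, 4.0], window=3)
print(z)
```

On a steadily rising series the latest value always sits the same distance above its window mean, so the z-score is constant once the window fills.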

signalflow.feature.examples.ExampleGlobalMeanRsiFeature `dataclass` ¶

ExampleGlobalMeanRsiFeature(group_col: str = 'pair', ts_col: str = 'timestamp', normalized: bool = False, norm_period: int | None = None, sources: list[str] | None = None, period: int = 14, price_col: str = 'close', add_diff: bool = False)

Bases: GlobalFeature

Mean RSI across all pairs per timestamp.

Compute RSI per pair
Mean across all pairs at time t -> global_mean_rsi
Optionally: rsi_diff = pair_rsi - global_mean_rsi

Parameters:

Name	Type	Description	Default
`period`	`int`	RSI period. Default: 14.	`14`
`add_diff`	`bool`	Add per-pair difference column. Default: False.	`False`

add_diff `class-attribute` `instance-attribute` ¶

add_diff: bool = False

outputs `class-attribute` ¶

outputs: list[str] = ['global_mean_rsi_{period}']

period `class-attribute` `instance-attribute` ¶

period: int = 14

price_col `class-attribute` `instance-attribute` ¶

price_col: str = 'close'

requires `class-attribute` ¶

requires: list[str] = ['{price_col}']

test_params `class-attribute` ¶

test_params: list[dict] = [{'period': 14}, {'period': 14, 'add_diff': True}]

warmup `property` ¶

warmup: int

compute ¶

compute(df: DataFrame, context: dict[str, Any] | None = None) -> pl.DataFrame

Source code in src/signalflow/feature/examples.py

def compute(self, df: pl.DataFrame, context: dict[str, Any] | None = None) -> pl.DataFrame:
    rsi_col = f"rsi_{self.period}"
    out_col = f"global_mean_rsi_{self.period}"

    has_rsi = rsi_col in df.columns
    if not has_rsi:
        rsi = ExampleRsiFeature(period=self.period, price_col=self.price_col)
        df = rsi.compute(df)

    mean_df = df.group_by(self.ts_col).agg(pl.col(rsi_col).mean().alias(out_col))

    df = df.join(mean_df, on=self.ts_col, how="left")

    if self.add_diff:
        df = df.with_columns((pl.col(rsi_col) - pl.col(out_col)).alias(f"rsi_{self.period}_diff"))

    if not has_rsi:
        df = df.drop(rsi_col)

    return df

output_cols ¶

output_cols(prefix: str = '') -> list[str]

Source code in src/signalflow/feature/examples.py

def output_cols(self, prefix: str = "") -> list[str]:
    cols = [f"{prefix}global_mean_rsi_{self.period}"]
    if self.add_diff:
        cols.append(f"{prefix}rsi_{self.period}_diff")
    return cols
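
The cross-sectional step (mean across pairs at each timestamp, then an optional per-pair difference) can be illustrated with plain dictionaries. This is a minimal sketch with made-up RSI values and a hypothetical helper name, not the Polars implementation.

```python
from collections import defaultdict


def add_global_mean(rows, value_col, out_col, ts_col="timestamp", diff_col=None):
    """Attach the per-timestamp mean of value_col to every row."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[ts_col]].append(row[value_col])
    means = {ts: sum(vals) / len(vals) for ts, vals in buckets.items()}

    result = []
    for row in rows:
        new_row = dict(row)
        new_row[out_col] = means[row[ts_col]]
        if diff_col is not None:
            new_row[diff_col] = row[value_col] - new_row[out_col]
        result.append(new_row)
    return result


rows = [
    {"pair": "BTCUSDT", "timestamp": 1, "rsi_14": 60.0},
    {"pair": "ETHUSDT", "timestamp": 1, "rsi_14": 40.0},
    {"pair": "BTCUSDT", "timestamp": 2, "rsi_14": 70.0},
    {"pair": "ETHUSDT", "timestamp": 2, "rsi_14": 50.0},
]
enriched = add_global_mean(rows, "rsi_14", "global_mean_rsi_14", diff_col="rsi_14_diff")
print(enriched[0])
```

Every pair at the same timestamp receives the same global value; only the diff column varies per pair.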

Warmup Reproducibility Contract¶

Every Feature declares whether it can be reproduced identically in live and backtest after warmup. Two ClassVar flags on Feature express this:

Flag	Meaning
`is_recursive` (`ClassVar[bool]`, default `False`)	The indicator is stateful / entry-point dependent: its value at bar N depends on where the series starts unless correctly initialized.
`warmup_invariant` (`ClassVar[bool]`, default `True`)	The indicator guarantees entry-point invariance after warmup (e.g. deterministic SMA-seeded initialization). For a recursive feature that does not re-seed deterministically this must be `False`.

The existing warmup property remains — the minimum number of bars before the output is stable (default 0).
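
Entry-point dependence is easiest to see by computing the same indicator from two different starting bars. In this minimal sketch (hypothetical helpers, not library code) an EMA seeded with the first bar it sees gives a different last value depending on where the history begins, while an SMA over the last three bars does not.

```python
def ema_last(values, alpha):
    """Last EMA value, seeded with the first bar of whatever history it is given."""
    level = values[0]
    for v in values[1:]:
        level = alpha * v + (1 - alpha) * level
    return level


def sma_last(values, period):
    """Last SMA value; depends only on the final `period` bars."""
    return sum(values[-period:]) / period


series = [float(x) for x in range(1, 11)]
full_history = series        # e.g. a backtest starting at bar 0
late_start = series[5:]      # e.g. a live process started at bar 5

ema_gap = abs(ema_last(full_history, 0.5) - ema_last(late_start, 0.5))
sma_gap = abs(sma_last(full_history, 3) - sma_last(late_start, 3))
print(ema_gap, sma_gap)
```

The EMA gap shrinks as more bars accumulate but is not zero after five bars here, which is the kind of live/backtest divergence the two flags are meant to surface.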

Feature.assert_reproducible() raises RuntimeError for exactly one combination: is_recursive and not warmup_invariant. Such a feature's value depends on where warmup starts, so its live and backtest outputs would diverge and break parity:

feat.assert_reproducible()   # raises if recursive and not warmup-invariant

FeaturePipeline.assert_reproducible() delegates to each nested feature and aggregates all offending feature names into a single error.
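
A minimal sketch of how such a check could look, using stand-in classes. The class names, error messages, and the free function assert_pipeline_reproducible are hypothetical; only the rule itself (raise when recursive and not warmup-invariant, aggregate offenders at pipeline level) comes from the description above.

```python
from typing import ClassVar


class SketchFeature:
    """Stand-in for Feature carrying only the two contract flags."""

    is_recursive: ClassVar[bool] = False
    warmup_invariant: ClassVar[bool] = True

    def assert_reproducible(self):
        if self.is_recursive and not self.warmup_invariant:
            raise RuntimeError(f"{type(self).__name__} is recursive and not warmup-invariant")


class SmaLike(SketchFeature):
    pass  # stateless: defaults are fine


class SeededEma(SketchFeature):
    is_recursive = True  # recursive, but re-seeds deterministically


class UnseededEma(SketchFeature):
    is_recursive = True
    warmup_invariant = False  # the dangerous combination


def assert_pipeline_reproducible(features):
    """Collect every offending feature name into one error."""
    offenders = [
        type(f).__name__
        for f in features
        if f.is_recursive and not f.warmup_invariant
    ]
    if offenders:
        raise RuntimeError("Non-reproducible features: " + ", ".join(offenders))


assert_pipeline_reproducible([SmaLike(), SeededEma()])  # passes silently
try:
    assert_pipeline_reproducible([SmaLike(), UnseededEma()])
    message = ""
except RuntimeError as exc:
    message = str(exc)
print(message)
```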

FeatureSpec¶

FeatureSpec is the serializable, hashable recipe for a FeaturePipeline — it captures how to rebuild a pipeline (ordered features + params + ta_version + raw_data_type), not the computed values.

from signalflow.feature import FeaturePipeline, ExampleRsiFeature, ExampleSmaFeature
from signalflow.feature.spec import FeatureSpec

pipeline = FeaturePipeline(
    features=[ExampleRsiFeature(period=14), ExampleSmaFeature(period=20)],
    raw_data_type="spot",
)

spec = FeatureSpec.from_pipeline(pipeline, ta_version="1.0.0")
spec.feature_hash()          # stable SHA-256 of the recipe
config = spec.to_config()    # plain dict: {"features": [...], "meta": {...}}
spec2 = FeatureSpec.from_config(config)
assert spec2.feature_hash() == spec.feature_hash()   # round-trips
rebuilt = spec.build()       # -> FeaturePipeline

Attributes: features (ordered list of {"name", "params", "scope"} dicts — order is significant), ta_version, raw_data_type (default "spot"), order_significant (default True, provenance metadata; features are never reordered).

Method	Purpose
`from_pipeline(pipeline, *, ta_version=None)`	Extract a spec from a live pipeline (feature names via registry reverse-lookup).
`from_config(data)` / `to_config()`	Round-trip a plain dict (flat or YAML `meta:` nested form).
`build()`	Reconstruct a `FeaturePipeline` in the recorded order.
`to_yaml(path)` / `from_yaml(path)`	Persist / load the recipe as YAML (survives class refactors, unlike pickle).
`feature_hash()`	Stable SHA-256 of the canonical recipe.

feature_hash¶

canonical_feature_hash(features, ta_version, raw_data_type) is a pure function (no I/O, no global state) backing FeatureSpec.feature_hash(). The hash is:

identical for two logically-equal recipes — dict key order is irrelevant, float jitter is normalized (e.g. 0.1 == 0.1000000001), and omitted defaults are resolved explicitly (rsi() == rsi(period=14));
different whenever something meaningful changes — a param value, the order of features (never sorted), or the ta_version (the same feature name across TA library versions is not the same implementation).

It is a configuration-drift detector: recompute it when loading a model artefact and refuse to continue on mismatch.

from signalflow.feature.spec import canonical_feature_hash

h = canonical_feature_hash(
    features=[{"name": "rsi", "params": {"period": 14}, "scope": "pair"}],
    ta_version="1.0.0",
    raw_data_type="spot",
)

signalflow.feature.spec.FeatureSpec `dataclass` ¶

FeatureSpec(features: list[dict] = list(), ta_version: str | None = None, raw_data_type: str = 'spot', order_significant: bool = True)

Serializable, hashable recipe for a FeaturePipeline.

Attributes:

Name	Type	Description
`features`	`list[dict]`	Ordered feature records. Each is `{"name": str, "params": dict, "scope": "pair"\|"global"}`. Order is significant and preserved everywhere.
`ta_version`	`str \| None`	Pinned TA-implementation version (part of the hash).
`raw_data_type`	`str`	Raw data type key (e.g. `"spot"`).
`order_significant`	`bool`	Declares that feature order is part of the truth. Always honored (features are never reordered); kept as serialized metadata for provenance.

features `class-attribute` `instance-attribute` ¶

features: list[dict] = field(default_factory=list)

order_significant `class-attribute` `instance-attribute` ¶

order_significant: bool = True

raw_data_type `class-attribute` `instance-attribute` ¶

raw_data_type: str = 'spot'

ta_version `class-attribute` `instance-attribute` ¶

ta_version: str | None = None

build ¶

build() -> FeaturePipeline

Reconstruct a FeaturePipeline from the recipe.

For each record, instantiate via default_registry.create(FEATURE, name, **params) and assemble in the given order (order is significant).

Source code in src/signalflow/feature/spec.py

def build(self) -> FeaturePipeline:
    """Reconstruct a :class:`FeaturePipeline` from the recipe.

    For each record, instantiate via
    ``default_registry.create(FEATURE, name, **params)`` and assemble in the
    **given order** (order is significant).
    """
    feats: list[Feature] = [
        default_registry.create(
            SfComponentType.FEATURE,
            rec["name"],
            **dict(rec.get("params", {}) or {}),
        )
        for rec in self.features
    ]
    return FeaturePipeline(features=feats, raw_data_type=self.raw_data_type)
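
The reconstruction step amounts to a name-to-class lookup followed by keyword instantiation, in recipe order. The sketch below uses a plain dict as a stand-in registry. The names REGISTRY, RsiSketch, SmaSketch and build_features are hypothetical, and the real default_registry may behave differently (for example in how it reports unknown names).

```python
from dataclasses import dataclass


@dataclass
class RsiSketch:
    period: int = 14


@dataclass
class SmaSketch:
    period: int = 20


# Stand-in for the component registry: recipe name -> class.
REGISTRY = {"rsi": RsiSketch, "sma": SmaSketch}


def build_features(records):
    """Instantiate each record in the given order; missing params fall back to class defaults."""
    return [REGISTRY[rec["name"]](**(rec.get("params") or {})) for rec in records]


recipe = [
    {"name": "rsi", "params": {"period": 21}, "scope": "pair"},
    {"name": "sma", "params": {}, "scope": "pair"},
]
built = build_features(recipe)
print(built)
```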

feature_hash ¶

feature_hash() -> str

Stable SHA-256 of the canonical recipe (drift detector).

Source code in src/signalflow/feature/spec.py

def feature_hash(self) -> str:
    """Stable SHA-256 of the canonical recipe (drift detector)."""
    return canonical_feature_hash(
        self.features,
        self.ta_version,
        self.raw_data_type,
    )

from_config `classmethod` ¶

from_config(data: dict) -> FeatureSpec

Reconstruct a spec from a plain config dict.

Accepts both the flat form (ta_version/raw_data_type at top level) and the YAML meta: nested form shown in VISION.md §5.4.

Source code in src/signalflow/feature/spec.py

@classmethod
def from_config(cls, data: dict) -> FeatureSpec:
    """Reconstruct a spec from a plain config dict.

    Accepts both the flat form (``ta_version``/``raw_data_type`` at top
    level) and the YAML ``meta:`` nested form shown in VISION.md §5.4.
    """
    meta = data.get("meta", {}) or {}
    features_in = data.get("features", []) or []

    features: list[dict] = []
    for rec in features_in:
        features.append(
            {
                "name": rec["name"],
                "params": dict(rec.get("params", {}) or {}),
                "scope": rec.get("scope", "pair"),
            }
        )

    ta_version = data.get("ta_version", meta.get("ta_version"))
    raw_data_type = data.get("raw_data_type", meta.get("raw_data_type", "spot"))
    order_significant = data.get("order_significant", meta.get("order_significant", True))

    return cls(
        features=features,
        ta_version=ta_version,
        raw_data_type=str(raw_data_type),
        order_significant=bool(order_significant),
    )
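
The two accepted shapes can be compared side by side. This minimal sketch reproduces the lookup order from the source above as a free function returning a plain dict; the name normalize_spec_config is hypothetical and the feature name "rsi" is illustrative.

```python
def normalize_spec_config(data):
    """Resolve flat or meta-nested config into one canonical dict."""
    meta = data.get("meta", {}) or {}
    features = [
        {
            "name": rec["name"],
            "params": dict(rec.get("params", {}) or {}),
            "scope": rec.get("scope", "pair"),
        }
        for rec in (data.get("features", []) or [])
    ]
    return {
        "features": features,
        # Top-level keys win; meta is the fallback.
        "ta_version": data.get("ta_version", meta.get("ta_version")),
        "raw_data_type": str(data.get("raw_data_type", meta.get("raw_data_type", "spot"))),
        "order_significant": bool(data.get("order_significant", meta.get("order_significant", True))),
    }


flat = {
    "features": [{"name": "rsi", "params": {"period": 14}}],
    "ta_version": "1.0.0",
}
nested = {
    "features": [{"name": "rsi", "params": {"period": 14}, "scope": "pair"}],
    "meta": {"ta_version": "1.0.0", "raw_data_type": "spot"},
}
print(normalize_spec_config(flat) == normalize_spec_config(nested))
```

When a key appears both at the top level and under meta, the top-level value is the one used.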

from_pipeline `classmethod` ¶

from_pipeline(pipeline: FeaturePipeline, *, ta_version: str | None = None) -> FeatureSpec

Extract a FeatureSpec from a live pipeline.

Feature names come from the registry reverse-lookup (_registry_name_for_class). If a feature class is not registered, the spec falls back to its class name, and the resulting spec may not round-trip through build() (documented limitation).

Source code in src/signalflow/feature/spec.py

@classmethod
def from_pipeline(
    cls,
    pipeline: FeaturePipeline,
    *,
    ta_version: str | None = None,
) -> FeatureSpec:
    """Extract a :class:`FeatureSpec` from a live pipeline.

    Feature names come from the registry reverse-lookup
    (:func:`_registry_name_for_class`). If a feature class is not
    registered, we fall back to its class name and the resulting spec may
    not round-trip through :meth:`build` (documented limitation).
    """
    records: list[dict] = []
    for feat in pipeline.features:
        name = _registry_name_for_class(type(feat))
        if name is None:
            # Fallback: unregistered class. build() would fail for this
            # name, but we still produce a stable, hashable record.
            name = type(feat).__name__
        records.append(
            {
                "name": name,
                "params": _resolved_params(feat),
                "scope": _scope_for_feature(feat),
            }
        )

    raw = getattr(pipeline.raw_data_type, "value", pipeline.raw_data_type)
    return cls(
        features=records,
        ta_version=ta_version,
        raw_data_type=str(raw),
    )
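
The helpers _registry_name_for_class and _resolved_params are not shown on this page. The sketch below illustrates one plausible reading, using dataclasses.asdict to capture every field including defaults, plus a class-name fallback for unregistered classes. All names here are hypothetical stand-ins.

```python
from dataclasses import asdict, dataclass


@dataclass
class RsiSketch:
    period: int = 14
    price_col: str = "close"


@dataclass
class CustomSketch:
    window: int = 5


# Stand-in reverse lookup: class -> registered name.
REGISTERED_NAMES = {RsiSketch: "rsi"}


def record_for(feature):
    """Build a recipe record with fully resolved params."""
    cls = type(feature)
    name = REGISTERED_NAMES.get(cls, cls.__name__)  # fallback: class name
    return {"name": name, "params": asdict(feature), "scope": "pair"}


implicit = record_for(RsiSketch())
explicit = record_for(RsiSketch(period=14))
fallback = record_for(CustomSketch())
print(implicit, fallback["name"])
```

Because defaults are written out explicitly, a feature built with no arguments and one built with the default values spelled out produce the same record, and therefore the same hash.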

from_yaml `classmethod` ¶

from_yaml(path: str | Path) -> FeatureSpec

Load a spec from a YAML file.

Source code in src/signalflow/feature/spec.py

@classmethod
def from_yaml(cls, path: str | Path) -> FeatureSpec:
    """Load a spec from a YAML file."""
    path = Path(path)
    with path.open("r", encoding="utf-8") as fh:
        data = yaml.safe_load(fh) or {}
    return cls.from_config(data)

to_config ¶

to_config() -> dict

Serialize to a plain dict (YAML meta: nested form).

Source code in src/signalflow/feature/spec.py

def to_config(self) -> dict:
    """Serialize to a plain dict (YAML ``meta:`` nested form)."""
    return {
        "features": [
            {
                "name": rec["name"],
                "params": dict(rec.get("params", {}) or {}),
                "scope": rec.get("scope", "pair"),
            }
            for rec in self.features
        ],
        "meta": {
            "ta_version": self.ta_version,
            "raw_data_type": self.raw_data_type,
            "order_significant": self.order_significant,
        },
    }

to_yaml ¶

to_yaml(path: str | Path) -> None

Persist the spec as YAML (survives class refactors, unlike pickle).

Source code in src/signalflow/feature/spec.py

def to_yaml(self, path: str | Path) -> None:
    """Persist the spec as YAML (survives class refactors, unlike pickle)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        yaml.safe_dump(self.to_config(), fh, sort_keys=False, allow_unicode=True)

signalflow.feature.spec.canonical_feature_hash ¶

canonical_feature_hash(features: list[dict], ta_version: str | None, raw_data_type: str) -> str

Compute the stable feature hash from a feature recipe.

Pure function (no I/O, no global state) — trivially unit-testable.

Canonicalization rules (VISION.md §5.5):

json.dumps(..., sort_keys=True) makes dict key order irrelevant;
floats are rounded to fixed precision so 0.1 == 0.1000000001;
the order of features is significant: the list is hashed as-is, never sorted;
ta_version is part of the hash (same feature name across TA library versions is not the same implementation).

Parameters:

Name	Type	Description	Default
`features`	`list[dict]`	Ordered list of `{"name", "params", "scope"}` dicts.	required
`ta_version`	`str \| None`	Pinned TA-implementation version, or `None`.	required
`raw_data_type`	`str`	Raw data type key (e.g. `"spot"`).	required

Returns:

Type	Description
`str`	Hex-encoded SHA-256 digest.

Source code in src/signalflow/feature/spec.py

def canonical_feature_hash(
    features: list[dict],
    ta_version: str | None,
    raw_data_type: str,
) -> str:
    """Compute the stable feature hash from a feature recipe.

    Pure function (no I/O, no global state) — trivially unit-testable.

    Canonicalization rules (VISION.md §5.5):
      * ``json.dumps(..., sort_keys=True)`` makes dict **key** order irrelevant;
      * floats are rounded to fixed precision so 0.1 == 0.1000000001;
      * the **order of features is significant** — the list is hashed as-is,
        never sorted;
      * ``ta_version`` is part of the hash (same feature name across TA library
        versions is not the same implementation).

    Args:
        features: Ordered list of ``{"name", "params", "scope"}`` dicts.
        ta_version: Pinned TA-implementation version, or ``None``.
        raw_data_type: Raw data type key (e.g. ``"spot"``).

    Returns:
        Hex-encoded SHA-256 digest.
    """
    payload = {
        "features": _canonicalize(features),  # order preserved
        "ta_version": ta_version,
        "raw_data_type": raw_data_type,
    }
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
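
The _canonicalize helper is not shown here. The following self-contained sketch assumes it rounds floats recursively (8 decimal places is an arbitrary choice for illustration) and then demonstrates the four properties listed above. The names canonicalize and sketch_hash are hypothetical, and the digests will not match the library's.

```python
import hashlib
import json


def canonicalize(obj, ndigits=8):
    """Round floats recursively; keep list order; leave other values untouched."""
    if isinstance(obj, float):
        return round(obj, ndigits)
    if isinstance(obj, dict):
        return {k: canonicalize(v, ndigits) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [canonicalize(v, ndigits) for v in obj]
    return obj


def sketch_hash(features, ta_version, raw_data_type):
    payload = {
        "features": canonicalize(features),  # order preserved
        "ta_version": ta_version,
        "raw_data_type": raw_data_type,
    }
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


smooth = {"name": "smooth", "params": {"alpha": 0.1, "period": 14}, "scope": "pair"}
smooth_jitter = {"scope": "pair", "params": {"period": 14, "alpha": 0.1000000001}, "name": "smooth"}
sma = {"name": "sma", "params": {"period": 20}, "scope": "pair"}

h_base = sketch_hash([smooth, sma], "1.0.0", "spot")
h_jitter = sketch_hash([smooth_jitter, sma], "1.0.0", "spot")
h_reordered = sketch_hash([sma, smooth], "1.0.0", "spot")
h_new_ta = sketch_hash([smooth, sma], "1.1.0", "spot")
print(h_base == h_jitter, h_base == h_reordered, h_base == h_new_ta)
```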

ModelFeaturesPipeline¶

ModelFeaturesPipeline is the reproducibility wrapper for train↔serve parity. It is a composition (not a subclass) over a FeaturePipeline (the one and only compute engine) plus a FeatureSpec (the serializable recipe). There is zero duplicated feature computation: compute() delegates straight into the wrapped pipeline.

from signalflow.feature import FeaturePipeline, ExampleRsiFeature
from signalflow.feature.model_features import ModelFeaturesPipeline

pipeline = FeaturePipeline(features=[ExampleRsiFeature(period=14)], raw_data_type="spot")
mfp = ModelFeaturesPipeline.from_pipeline(pipeline, ta_version="1.0.0")

# Bundle recipe + hash to store alongside the trained model:
artifact = mfp.to_artifact_dict()    # {"features_config": {...}, "feature_hash": "..."}

# At serve time, rebuild from config and guard against drift:
served = ModelFeaturesPipeline.from_config(artifact["features_config"])
served.verify_hash(artifact["feature_hash"])   # raises RuntimeError on mismatch
served.validate_reproducible()                 # raises if a recursive non-invariant feature is present
features_df = served.compute(df)               # pure delegation to FeaturePipeline.compute

Member	Purpose
`from_config(data)`	Build the recipe and instantiate the engine from it (single source of truth).
`from_pipeline(pipeline, *, ta_version=None)`	Wrap an already-built pipeline, deriving its recipe.
`to_config()`	Serialize the recipe (delegates to the spec).
`to_artifact_dict()`	Bundle `features_config` + `feature_hash` for storing with a model.
`feature_hash` (property)	Stable SHA-256 of the recipe.
`validate_reproducible()`	Assert every nested feature honours the warmup-invariance contract.
`verify_hash(expected)`	Recompute the hash and raise `RuntimeError` if it differs (drift detector).
`compute(df, context=None)`	Compute features by delegating to the wrapped engine (no math here).

signalflow.feature.model_features.ModelFeaturesPipeline `dataclass` ¶

ModelFeaturesPipeline(_pipeline: FeaturePipeline, _spec: FeatureSpec)

Reproducibility wrapper around a FeaturePipeline.

Holds both the live compute engine (_pipeline) and its serializable, hashable recipe (_spec). All feature computation is delegated to the nested pipeline — this class only adds reconstruction, hash verification and warmup-reproducibility checks.

Attributes:

Name	Type	Description
`_pipeline`	`FeaturePipeline`	The wrapped computational engine (single source of compute).
`_spec`	`FeatureSpec`	The serializable recipe used for config round-trip and hashing.

_pipeline `instance-attribute` ¶

_pipeline: FeaturePipeline

_spec `instance-attribute` ¶

_spec: FeatureSpec

feature_hash `property` ¶

feature_hash: str

Stable SHA-256 of the recipe (delegates to the spec).

compute ¶

compute(df: DataFrame, context: dict[str, Any] | None = None) -> pl.DataFrame

Compute all features by delegating to the wrapped engine.

No feature math lives here: this is a thin pass-through to FeaturePipeline.compute, which is the single computational engine for the whole system.

Source code in src/signalflow/feature/model_features.py

def compute(self, df: pl.DataFrame, context: dict[str, Any] | None = None) -> pl.DataFrame:
    """Compute all features by delegating to the wrapped engine.

    No feature math lives here — this is a thin pass-through to
    :meth:`FeaturePipeline.compute`, which is the single computational
    engine for the whole system.
    """
    return self._pipeline.compute(df, context)

from_config `classmethod` ¶

from_config(data: dict) -> Self

Reconstruct from a plain config dict.

Builds the recipe (FeatureSpec) and uses it to instantiate the compute engine. The spec is the single source of truth for both the engine and the hash, so they can never disagree.

Source code in src/signalflow/feature/model_features.py

@classmethod
def from_config(cls, data: dict) -> Self:
    """Reconstruct from a plain config dict.

    Builds the recipe (:class:`FeatureSpec`) and uses it to instantiate the
    compute engine. The spec is the single source of truth for both the
    engine and the hash, so they can never disagree.
    """
    spec = FeatureSpec.from_config(data)
    pipeline = spec.build()
    return cls(_pipeline=pipeline, _spec=spec)

from_pipeline `classmethod` ¶

from_pipeline(pipeline: FeaturePipeline, *, ta_version: str | None = None) -> Self

Wrap an already-built pipeline, deriving its recipe.

The recipe is extracted from the live pipeline via FeatureSpec.from_pipeline; ta_version pins the TA implementation in the hash when provided.

Source code in src/signalflow/feature/model_features.py

@classmethod
def from_pipeline(
    cls,
    pipeline: FeaturePipeline,
    *,
    ta_version: str | None = None,
) -> Self:
    """Wrap an already-built pipeline, deriving its recipe.

    The recipe is extracted from the live pipeline via
    :meth:`FeatureSpec.from_pipeline`; ``ta_version`` pins the TA
    implementation in the hash when provided.
    """
    spec = FeatureSpec.from_pipeline(pipeline, ta_version=ta_version)
    return cls(_pipeline=pipeline, _spec=spec)

to_artifact_dict ¶

to_artifact_dict() -> dict

Bundle the recipe with its hash for storing alongside a model.

The persisted feature_hash is what verify_hash checks at load time, turning silent recipe drift into a loud failure.

Source code in src/signalflow/feature/model_features.py

def to_artifact_dict(self) -> dict:
    """Bundle the recipe with its hash for storing alongside a model.

    The persisted ``feature_hash`` is what :meth:`verify_hash` checks at
    load time, turning silent recipe drift into a loud failure.
    """
    return {
        "features_config": self._spec.to_config(),
        "feature_hash": self.feature_hash,
    }

to_config ¶

to_config() -> dict

Serialize the recipe to a plain dict (delegates to the spec).

Source code in src/signalflow/feature/model_features.py

def to_config(self) -> dict:
    """Serialize the recipe to a plain dict (delegates to the spec)."""
    return self._spec.to_config()

validate_reproducible ¶

validate_reproducible() -> None

Assert every nested feature honours the warmup-invariance contract.

Delegates to FeaturePipeline.assert_reproducible, which raises if any nested feature is recursive and not warmup-invariant (entry-point dependent → would break live/backtest parity).

Raises:

Type	Description
`RuntimeError`	if a recursive non-invariant feature is present.

Source code in src/signalflow/feature/model_features.py

def validate_reproducible(self) -> None:
    """Assert every nested feature honours the warmup-invariance contract.

    Delegates to :meth:`FeaturePipeline.assert_reproducible`, which raises if
    any nested feature is recursive and not warmup-invariant (entry-point
    dependent → would break live/backtest parity).

    Raises:
        RuntimeError: if a recursive non-invariant feature is present.
    """
    self._pipeline.assert_reproducible()

verify_hash ¶

verify_hash(expected: str) -> None

Verify the recipe hash matches an expected value (drift detector).

Recompute feature_hash and compare it to the hash recorded when the artifact was produced. A mismatch means the feature recipe changed between train and serve; never continue silently.

Raises:

Type	Description
`RuntimeError`	if the recomputed hash differs from `expected`.

Source code in src/signalflow/feature/model_features.py

def verify_hash(self, expected: str) -> None:
    """Verify the recipe hash matches an expected value (drift detector).

    Recompute :attr:`feature_hash` and compare it to the hash recorded when
    the artifact was produced. A mismatch means the feature recipe changed
    between train and serve — never continue silently.

    Raises:
        RuntimeError: if the recomputed hash differs from ``expected``.
    """
    actual = self.feature_hash
    if actual != expected:
        raise RuntimeError(
            "Feature hash mismatch — the feature recipe drifted between "
            "train and serve. Refusing to continue.\n"
            f"  expected: {expected}\n"
            f"  actual:   {actual}"
        )
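
The train-to-serve round trip can be sketched end to end with only the standard library. As a simplification, this sketch hashes the whole config dict, whereas the library hashes features, ta_version and raw_data_type through canonical_feature_hash. The names recipe_hash, to_artifact and load_checked are hypothetical.

```python
import hashlib
import json


def recipe_hash(config):
    """Simplified stand-in for the recipe hash."""
    blob = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def to_artifact(config):
    """Bundle the recipe with its hash, as stored next to a trained model."""
    return {"features_config": config, "feature_hash": recipe_hash(config)}


def load_checked(artifact):
    """Recompute the hash at load time and refuse to continue on mismatch."""
    config = artifact["features_config"]
    actual = recipe_hash(config)
    expected = artifact["feature_hash"]
    if actual != expected:
        raise RuntimeError(f"Feature hash mismatch: expected {expected}, actual {actual}")
    return config


config = {
    "features": [{"name": "rsi", "params": {"period": 14}, "scope": "pair"}],
    "meta": {"ta_version": "1.0.0", "raw_data_type": "spot", "order_significant": True},
}

# Train side: persist; serve side: reload and verify.
stored = json.dumps(to_artifact(config))
loaded = load_checked(json.loads(stored))

# Simulated drift: someone edits the period after training.
tampered = json.loads(stored)
tampered["features_config"]["features"][0]["params"]["period"] = 21
try:
    load_checked(tampered)
    drift_detected = False
except RuntimeError:
    drift_detected = True
print(loaded == config, drift_detected)
```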

Feature Informativeness¶

Measures how informative each feature is relative to multiple targets at multiple prediction horizons. Combines mutual information (MI) magnitude with temporal stability into a composite score.

Usage¶

from signalflow.feature.informativeness import FeatureInformativenessAnalyzer
from signalflow.detector.market import MarketZScoreDetector

analyzer = FeatureInformativenessAnalyzer(
    event_detector=MarketZScoreDetector(z_threshold=3.0),
)
report = analyzer.analyze(df, feature_columns=["rsi_14", "sma_20", "volume_ratio"])

print(report.top_features(10))      # best features by composite score
print(report.score_matrix)          # NMI heatmap: feature x (horizon, target)
report.feature_detail("rsi_14")     # per-target breakdown for one feature

signalflow.feature.informativeness.FeatureInformativenessAnalyzer `dataclass` ¶

FeatureInformativenessAnalyzer(target_generator: MultiTargetGenerator = MultiTargetGenerator(), event_detector: SignalDetector | None = _default_event_detector(), rolling_mi: RollingMIConfig = RollingMIConfig(), weights: CompositeWeights = CompositeWeights(), bins: int = 20, pair_col: str = 'pair', ts_col: str = 'timestamp', aggregate_pairs: bool = True)

Orchestrator for feature informativeness analysis.

Pipeline

1. Generate multi-horizon, multi-target labels
2. Detect and mask global events
3. Compute MI between each feature and each target
4. Compute rolling MI for temporal stability
5. Compute composite weighted scores
6. Generate report

Attributes:

Name	Type	Description
`target_generator`	`MultiTargetGenerator`	Multi-target label generator.
`event_detector`	`SignalDetector \| None`	Global event detector. None to disable.
`rolling_mi`	`RollingMIConfig`	Rolling MI stability configuration.
`weights`	`CompositeWeights`	Composite scoring weights.
`bins`	`int`	Number of histogram bins for MI estimation.
`pair_col`	`str`	Pair column name.
`ts_col`	`str`	Timestamp column name.
`aggregate_pairs`	`bool`	If True, pool all pairs for MI computation.

aggregate_pairs `class-attribute` `instance-attribute` ¶

aggregate_pairs: bool = True

bins `class-attribute` `instance-attribute` ¶

bins: int = 20

event_detector `class-attribute` `instance-attribute` ¶

event_detector: SignalDetector | None = field(default_factory=_default_event_detector)

pair_col `class-attribute` `instance-attribute` ¶

pair_col: str = 'pair'

rolling_mi `class-attribute` `instance-attribute` ¶

rolling_mi: RollingMIConfig = field(default_factory=RollingMIConfig)

target_generator `class-attribute` `instance-attribute` ¶

target_generator: MultiTargetGenerator = field(default_factory=MultiTargetGenerator)

ts_col `class-attribute` `instance-attribute` ¶

ts_col: str = 'timestamp'

weights `class-attribute` `instance-attribute` ¶

weights: CompositeWeights = field(default_factory=CompositeWeights)

_build_score_matrix ¶

_build_score_matrix(raw_mi: DataFrame) -> pl.DataFrame

Build pivoted Feature x (Horizon, Target) matrix.

Source code in src/signalflow/feature/informativeness.py

def _build_score_matrix(self, raw_mi: pl.DataFrame) -> pl.DataFrame:
    """Build pivoted Feature x (Horizon, Target) matrix."""
    if raw_mi.height == 0:
        return pl.DataFrame()

    matrix = raw_mi.with_columns((pl.col("horizon") + "_" + pl.col("target_type")).alias("_col_key")).pivot(
        on="_col_key",
        index="feature",
        values="nmi",
    )

    return matrix
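
As a rough illustration of the pivot above, the same reshaping can be done in plain Python. The rows, horizon names, and NMI values below are made up for the example; only the `horizon + "_" + target_type` column key mirrors the source.

```python
# Long-format rows, one per (feature, horizon, target_type), as in raw_mi.
raw_mi = [
    {"feature": "rsi_14", "horizon": "short", "target_type": "direction", "nmi": 0.12},
    {"feature": "rsi_14", "horizon": "long", "target_type": "direction", "nmi": 0.05},
    {"feature": "sma_20", "horizon": "short", "target_type": "direction", "nmi": 0.03},
]

# Wide format: one entry per feature, one key per "<horizon>_<target_type>".
matrix: dict[str, dict[str, float]] = {}
for row in raw_mi:
    col_key = f'{row["horizon"]}_{row["target_type"]}'
    matrix.setdefault(row["feature"], {})[col_key] = row["nmi"]
```

A feature with no row for a given key simply lacks that entry here.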

_compute_all_mi ¶

_compute_all_mi(df: DataFrame, feature_columns: list[str], target_meta: list[dict[str, str]]) -> list[dict]

Compute MI for all (feature, target) pairs.

Source code in src/signalflow/feature/informativeness.py

def _compute_all_mi(
    self,
    df: pl.DataFrame,
    feature_columns: list[str],
    target_meta: list[dict[str, str]],
) -> list[dict]:
    """Compute MI for all (feature, target) pairs."""
    rows = []

    for feat_col in feature_columns:
        for tmeta in target_meta:
            target_col = tmeta["column"]
            target_kind = tmeta["kind"]

            feat_arr, target_arr = self._extract_arrays(df, feat_col, target_col)
            if feat_arr is None or target_arr is None:
                rows.append(self._nan_row(feat_col, tmeta))
                continue

            mi = self._compute_mi_pair(feat_arr, target_arr, target_kind)
            h_feat = entropy_continuous(feat_arr, bins=self.bins)
            h_target = (
                entropy_discrete(target_arr)
                if target_kind == "discrete"
                else entropy_continuous(target_arr, bins=self.bins)
            )
            nmi = normalized_mutual_information(mi, h_feat, h_target)

            stability = self._compute_stability(feat_arr, target_arr, target_kind)

            rows.append(
                {
                    "feature": feat_col,
                    "horizon": tmeta["horizon"],
                    "target_type": tmeta["target_type"],
                    "mi": mi,
                    "nmi": nmi,
                    "stability_score": stability,
                }
            )

    return rows

_compute_composite ¶

_compute_composite(raw_mi: DataFrame) -> pl.DataFrame

Compute composite scores from raw MI results.

Source code in src/signalflow/feature/informativeness.py

def _compute_composite(self, raw_mi: pl.DataFrame) -> pl.DataFrame:
    """Compute composite scores from raw MI results."""
    if raw_mi.height == 0:
        return pl.DataFrame(schema={"feature": pl.Utf8, "composite_score": pl.Float64, "rank": pl.UInt32})

    w = self.weights
    alpha = w.stability_weight

    # Build weights per (horizon, target_type)
    horizons = raw_mi.get_column("horizon").unique().to_list()
    targets = raw_mi.get_column("target_type").unique().to_list()

    h_weights = w.horizon_weights or {h: 1.0 / len(horizons) for h in horizons}
    t_weights = w.target_weights or {t: 1.0 / len(targets) for t in targets}

    # Normalize
    h_total = sum(h_weights.values())
    t_total = sum(t_weights.values())
    h_weights = {k: v / h_total for k, v in h_weights.items()}
    t_weights = {k: v / t_total for k, v in t_weights.items()}

    scored = (
        raw_mi.with_columns(
            [
                pl.col("horizon").replace_strict(h_weights, default=0.0).alias("_h_w"),
                pl.col("target_type").replace_strict(t_weights, default=0.0).alias("_t_w"),
            ]
        )
        .with_columns((pl.col("_h_w") * pl.col("_t_w")).alias("_weight"))
        .with_columns(
            ((1.0 - alpha) * pl.col("nmi").fill_null(0.0) + alpha * pl.col("stability_score").fill_null(0.0)).alias(
                "_cell_score"
            )
        )
        .with_columns((pl.col("_cell_score") * pl.col("_weight")).alias("_weighted_score"))
    )

    result = (
        scored.group_by("feature")
        .agg(pl.col("_weighted_score").sum().alias("composite_score"))
        .sort("composite_score", descending=True)
        .with_row_index("rank", offset=1)
        .select(["feature", "composite_score", "rank"])
    )

    return result
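
The scoring arithmetic above can be checked by hand. The following is a minimal plain-Python sketch for a single feature, assuming made-up horizon names and cell values. `composite_score` is a hypothetical helper, not part of the library.

```python
def composite_score(cells, h_weights, t_weights, alpha=0.3):
    """Weighted sum of per-cell scores for one feature.

    cells: list of (horizon, target_type, nmi, stability) tuples.
    """
    # Normalize each weight family so it sums to 1, as the source does.
    h_total = sum(h_weights.values())
    t_total = sum(t_weights.values())
    score = 0.0
    for horizon, target_type, nmi, stability in cells:
        # Unknown horizons or target types get weight 0.
        weight = (h_weights.get(horizon, 0.0) / h_total) * (t_weights.get(target_type, 0.0) / t_total)
        # Blend NMI magnitude with temporal stability.
        cell = (1.0 - alpha) * nmi + alpha * stability
        score += weight * cell
    return score


cells = [
    ("short", "direction", 0.10, 0.80),  # cell score 0.7*0.10 + 0.3*0.80 = 0.31
    ("long", "direction", 0.20, 0.50),   # cell score 0.7*0.20 + 0.3*0.50 = 0.29
]
score = composite_score(cells, {"short": 1.0, "long": 1.0}, {"direction": 1.0}, alpha=0.3)
```

With equal horizon weights each cell contributes half of its score, so the composite is 0.5 * 0.31 + 0.5 * 0.29 = 0.30.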

_compute_mi_pair ¶

_compute_mi_pair(feat: ndarray, target: ndarray, target_kind: str) -> float

Compute MI between one feature and one target.

Source code in src/signalflow/feature/informativeness.py

def _compute_mi_pair(
    self,
    feat: np.ndarray,
    target: np.ndarray,
    target_kind: str,
) -> float:
    """Compute MI between one feature and one target."""
    if target_kind == "discrete":
        return mutual_information_continuous_discrete(feat, target, bins=self.bins)
    return mutual_information_continuous(feat, target, bins=self.bins)

_compute_stability ¶

_compute_stability(feat: ndarray, target: ndarray, target_kind: str) -> float

Compute temporal stability via rolling MI windows.

Source code in src/signalflow/feature/informativeness.py

def _compute_stability(
    self,
    feat: np.ndarray,
    target: np.ndarray,
    target_kind: str,
) -> float:
    """Compute temporal stability via rolling MI windows."""
    cfg = self.rolling_mi
    n = len(feat)
    step = cfg.window_size
    min_fill = int(step * cfg.min_window_fill)

    mi_values = []
    for start in range(0, n - min_fill + 1, step):
        end = min(start + step, n)
        f_win = feat[start:end]
        t_win = target[start:end]

        # Check fill rate
        valid = np.isfinite(f_win).sum() if np.issubdtype(f_win.dtype, np.floating) else len(f_win)

        if valid < min_fill:
            continue

        mi = self._compute_mi_pair(f_win, t_win, target_kind)
        if not np.isnan(mi):
            mi_values.append(mi)

    if len(mi_values) < 2:
        return np.nan

    mean_mi = np.mean(mi_values)
    std_mi = np.std(mi_values)

    if mean_mi <= 0:
        return 0.0

    cv = std_mi / mean_mi
    return float(1.0 / (1.0 + cv))
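
Two details of the method above are easy to miss: the windows do not overlap (the step equals the window size), and the score is `1 / (1 + CV)`, where CV is the coefficient of variation of the per-window MI values. A small plain-Python sketch of both, using illustrative numbers (`min_window_fill=0.5` is chosen here only to keep the arithmetic exact):

```python
import statistics


def stability_from_mi(mi_values: list[float]) -> float:
    """Map per-window MI values to a stability score in (0, 1]."""
    if len(mi_values) < 2:
        return float("nan")  # not enough windows to measure dispersion
    mean_mi = statistics.fmean(mi_values)
    if mean_mi <= 0:
        return 0.0
    # pstdev is the population standard deviation, matching np.std's default.
    cv = statistics.pstdev(mi_values) / mean_mi
    return 1.0 / (1.0 + cv)


# Window start offsets for 12,000 bars with 5,000-bar windows.
n, window_size, min_window_fill = 12_000, 5_000, 0.5
min_fill = int(window_size * min_window_fill)
starts = list(range(0, n - min_fill + 1, window_size))

steady = stability_from_mi([0.10, 0.10, 0.10])   # identical windows: CV = 0
erratic = stability_from_mi([0.01, 0.30, 0.02])  # MI concentrated in one window
```

The trailing 2,000 bars are never scored in this example, because a window starting at bar 10,000 could not reach the minimum fill of 2,500.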

_extract_arrays ¶

_extract_arrays(df: DataFrame, feat_col: str, target_col: str) -> tuple[np.ndarray | None, np.ndarray | None]

Extract aligned numpy arrays, dropping rows with nulls in either.

Source code in src/signalflow/feature/informativeness.py

def _extract_arrays(
    self,
    df: pl.DataFrame,
    feat_col: str,
    target_col: str,
) -> tuple[np.ndarray | None, np.ndarray | None]:
    """Extract aligned numpy arrays, dropping rows with nulls in either."""
    if feat_col not in df.columns or target_col not in df.columns:
        return None, None

    subset = df.select([feat_col, target_col]).drop_nulls()
    if subset.height < 10:
        return None, None

    feat_arr = subset.get_column(feat_col).to_numpy().astype(np.float64)
    target_series = subset.get_column(target_col)

    if target_series.dtype == pl.Utf8:
        target_arr = target_series.to_numpy()
    else:
        target_arr = target_series.to_numpy().astype(np.float64)

    return feat_arr, target_arr

_nan_row ¶

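_nan_row(feat_col: str, tmeta: dict) -> dict

Build a placeholder result row with NaN scores for a (feature, target) pair that could not be evaluated.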

Source code in src/signalflow/feature/informativeness.py

def _nan_row(self, feat_col: str, tmeta: dict) -> dict:
    return {
        "feature": feat_col,
        "horizon": tmeta["horizon"],
        "target_type": tmeta["target_type"],
        "mi": np.nan,
        "nmi": np.nan,
        "stability_score": np.nan,
    }

_validate ¶

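_validate(df: DataFrame, feature_columns: list[str]) -> None

Validate inputs: `feature_columns` must be non-empty, and `df` must contain the pair column, the timestamp column, and every requested feature column. Raises `ValueError` otherwise.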

Source code in src/signalflow/feature/informativeness.py

def _validate(self, df: pl.DataFrame, feature_columns: list[str]) -> None:
    if not feature_columns:
        raise ValueError("feature_columns must not be empty")

    required = {self.pair_col, self.ts_col}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")

    missing_features = [c for c in feature_columns if c not in df.columns]
    if missing_features:
        raise ValueError(f"Feature columns not found in DataFrame: {missing_features}")

analyze ¶

analyze(df: DataFrame, feature_columns: list[str]) -> InformativenessReport

Run full informativeness analysis.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	OHLCV DataFrame with pre-computed feature columns.	required
`feature_columns`	`list[str]`	List of feature column names to evaluate.	required

Returns:

Type	Description
`InformativenessReport`	Report containing the raw MI table, composite scores, score matrix, global events (if detection was enabled), and run metadata.

Raises:

Type	Description
`ValueError`	If required columns are missing or feature_columns is empty.

Source code in src/signalflow/feature/informativeness.py

def analyze(
    self,
    df: pl.DataFrame,
    feature_columns: list[str],
) -> InformativenessReport:
    """Run full informativeness analysis.

    Args:
        df: OHLCV DataFrame with pre-computed feature columns.
        feature_columns: List of feature column names to evaluate.

    Returns:
        InformativenessReport with all results.

    Raises:
        ValueError: If required columns are missing or feature_columns is empty.
    """
    self._validate(df, feature_columns)

    # 1. Generate targets
    logger.info("Generating multi-horizon targets...")
    df = self.target_generator.generate(df)
    target_meta = self.target_generator.target_columns()

    # 2. Detect and mask global events
    global_events = None
    if self.event_detector is not None:
        logger.info("Detecting global events...")
        # Convert DataFrame to RawDataView for SignalDetector
        raw_view = _df_to_raw_data_view(df, self.pair_col, self.ts_col)
        signals = self.event_detector.run(raw_view)
        global_events = signals.value

        # Get all target columns
        target_columns = [meta["column"] for meta in target_meta]

        # Mask targets using the maximum horizon
        max_horizon = max(h.horizon for h in self.target_generator.horizons)

        df = mask_targets_by_signals(
            df=df,
            signals=signals,
            mask_signal_types=self.event_detector.allowed_signal_types or set(),  # type: ignore[attr-defined]
            horizon_bars=max_horizon,
            cooldown_bars=60,
            target_columns=target_columns,
            pair_col=self.pair_col,
            ts_col=self.ts_col,
        )

    # 3-4. Compute MI and rolling stability
    logger.info(f"Computing MI for {len(feature_columns)} features x {len(target_meta)} targets...")
    mi_rows = self._compute_all_mi(df, feature_columns, target_meta)
    raw_mi = pl.DataFrame(mi_rows)

    # 5. Composite scoring
    logger.info("Computing composite scores...")
    composite_scores = self._compute_composite(raw_mi)
    score_matrix = self._build_score_matrix(raw_mi)

    n_events = 0
    if global_events is not None:
        n_events = global_events.height

    metadata = {
        "n_features": len(feature_columns),
        "n_horizons": len(self.target_generator.horizons),
        "n_target_types": len(self.target_generator.target_types),
        "n_global_events": n_events,
        "bins": self.bins,
        "aggregate_pairs": self.aggregate_pairs,
    }

    logger.info("Informativeness analysis complete.")
    return InformativenessReport(
        raw_mi=raw_mi,
        composite_scores=composite_scores,
        score_matrix=score_matrix,
        global_events=global_events,
        metadata=metadata,
    )

signalflow.feature.informativeness.InformativenessReport `dataclass` ¶

InformativenessReport(raw_mi: DataFrame, composite_scores: DataFrame, score_matrix: DataFrame, global_events: DataFrame | None, metadata: dict)

Container for informativeness analysis results.

Attributes:

Name	Type	Description
`raw_mi`	`DataFrame`	Full MI results (feature x horizon x target).
`composite_scores`	`DataFrame`	Aggregated scores per feature, ranked.
`score_matrix`	`DataFrame`	Pivoted Feature x (Horizon, Target) matrix.
`global_events`	`DataFrame \| None`	Global event detection results (if enabled).
`metadata`	`dict`	Analysis configuration and statistics.

composite_scores `instance-attribute` ¶

composite_scores: DataFrame

global_events `instance-attribute` ¶

global_events: DataFrame | None

metadata `instance-attribute` ¶

metadata: dict

raw_mi `instance-attribute` ¶

raw_mi: DataFrame

score_matrix `instance-attribute` ¶

score_matrix: DataFrame

bottom_features ¶

bottom_features(n: int = 20) -> pl.DataFrame

Return bottom N features (least informative).

Source code in src/signalflow/feature/informativeness.py

def bottom_features(self, n: int = 20) -> pl.DataFrame:
    """Return bottom N features (least informative)."""
    return self.composite_scores.tail(n)

feature_detail ¶

feature_detail(feature_name: str) -> pl.DataFrame

Return detailed MI breakdown for a single feature.

Source code in src/signalflow/feature/informativeness.py

def feature_detail(self, feature_name: str) -> pl.DataFrame:
    """Return detailed MI breakdown for a single feature."""
    return self.raw_mi.filter(pl.col("feature") == feature_name)

top_features ¶

top_features(n: int = 20) -> pl.DataFrame

Return top N features by composite score.

Source code in src/signalflow/feature/informativeness.py

def top_features(self, n: int = 20) -> pl.DataFrame:
    """Return top N features by composite score."""
    return self.composite_scores.head(n)

signalflow.feature.informativeness.RollingMIConfig `dataclass` ¶

RollingMIConfig(window_size: int = 5000, min_window_fill: float = 0.7)

Configuration for rolling MI stability computation.

Attributes:

Name	Type	Description
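`window_size`	`int`	Number of bars per window. Windows do not overlap: the step between windows equals `window_size`.
`min_window_fill`	`float`	Minimum fraction of `window_size` that must hold finite feature values for a window to be scored.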

signalflow.feature.informativeness.CompositeWeights `dataclass` ¶

CompositeWeights(horizon_weights: dict[str, float] | None = None, target_weights: dict[str, float] | None = None, stability_weight: float = 0.3)

Weights for composite informativeness scoring.

Attributes:

Name	Type	Description
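`horizon_weights`	`dict[str, float] \| None`	Per-horizon weights, normalized to sum to 1. Horizons missing from the dict get weight 0. None = equal weights.
`target_weights`	`dict[str, float] \| None`	Per-target-type weights, normalized to sum to 1. Target types missing from the dict get weight 0. None = equal weights.
`stability_weight`	`float`	Fraction of each cell's score taken from stability; the remaining `1 - stability_weight` comes from NMI.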

Mutual Information Functions¶

signalflow.feature.mutual_information ¶

Mutual Information estimation for feature-target pairs.

Provides histogram-based MI estimation for continuous and discrete variables. Used by FeatureInformativenessAnalyzer to measure feature informativeness against multiple target types.

References

Cover & Thomas (2006) - Elements of Information Theory
Kraskov et al. (2004) - MI estimation

_bin_continuous ¶

_bin_continuous(x: ndarray, bins: int) -> np.ndarray

Bin continuous values into integer bin indices.

Source code in src/signalflow/feature/mutual_information.py

def _bin_continuous(x: np.ndarray, bins: int) -> np.ndarray:
    """Bin continuous values into integer bin indices."""
    _, edges = np.histogram(x, bins=bins)
    return np.clip(np.digitize(x, edges[:-1]) - 1, 0, bins - 1)
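
The binning step amounts to equal-width bins between the minimum and maximum, with the maximum value assigned to the last bin rather than a bin of its own. The sketch below shows that idea in plain Python. It is an approximation of the NumPy version, not a drop-in replacement: `bin_equal_width` is a hypothetical name, and mapping constant input to bin 0 is a simplification made here.

```python
def bin_equal_width(values: list[float], bins: int) -> list[int]:
    """Assign each value an integer bin index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # simplification for constant input
    width = (hi - lo) / bins
    # min(..., bins - 1) keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


indices = bin_equal_width([0.0, 2.5, 5.0, 7.5, 10.0], bins=4)
```

Here the bin width is 2.5, so 7.5 and 10.0 both land in bin 3.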

_isnan_any ¶

_isnan_any(arr: ndarray) -> np.ndarray

Return boolean mask for NaN-like values in any dtype.

Source code in src/signalflow/feature/mutual_information.py

def _isnan_any(arr: np.ndarray) -> np.ndarray:
    """Return boolean mask for NaN-like values in any dtype."""
    if np.issubdtype(arr.dtype, np.floating):
        result: np.ndarray = np.isnan(arr)
        return result
    if arr.dtype == object:
        return np.array([v is None or (isinstance(v, float) and np.isnan(v)) for v in arr])
    return np.zeros(len(arr), dtype=bool)

_mi_from_contingency ¶

_mi_from_contingency(x: ndarray, y: ndarray) -> float

Compute MI from two discrete arrays via contingency table.

Source code in src/signalflow/feature/mutual_information.py

def _mi_from_contingency(x: np.ndarray, y: np.ndarray) -> float:
    """Compute MI from two discrete arrays via contingency table."""
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)

    contingency = np.zeros((len(x_vals), len(y_vals)), dtype=np.float64)
    np.add.at(contingency, (x_idx, y_idx), 1)

    pxy = contingency / contingency.sum()
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)

    outer = px[:, None] * py[None, :]
    valid = (pxy > 0) & (outer > 0)
    mi = np.sum(pxy[valid] * np.log2(pxy[valid] / outer[valid]))
    return float(max(mi, 0.0))
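
For small inputs the contingency-table computation can be reproduced with counters. This is a plain-Python sketch of the same formula, assuming inputs that are already discrete and free of missing values. `mi_discrete` is a hypothetical name.

```python
import math
from collections import Counter


def mi_discrete(x: list, y: list) -> float:
    """MI in bits between two equal-length discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))  # counts of each (x, y) combination
    px = Counter(x)             # marginal counts of x
    py = Counter(y)             # marginal counts of y
    mi = 0.0
    for (xv, yv), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[xv] / n) * (py[yv] / n)))
    return max(mi, 0.0)  # clamp tiny negative rounding errors


identical = mi_discrete([0, 0, 1, 1], [0, 0, 1, 1])
independent = mi_discrete([0, 0, 1, 1], [0, 1, 0, 1])
```

Two identical balanced binary sequences share exactly 1 bit, while the second pair is independent and shares none.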

entropy_continuous ¶

entropy_continuous(x: ndarray, bins: int = 20) -> float

Shannon entropy via histogram of a continuous variable.

Parameters:

Name	Type	Description	Default
`x`	`ndarray`	1D array of continuous values.	required
`bins`	`int`	Number of histogram bins.	`20`

Returns:

Type	Description
`float`	Entropy in bits. NaN if fewer than 2 valid values.

Source code in src/signalflow/feature/mutual_information.py

def entropy_continuous(x: np.ndarray, bins: int = 20) -> float:
    """Shannon entropy via histogram of a continuous variable.

    Args:
        x: 1D array of continuous values.
        bins: Number of histogram bins.

    Returns:
        Entropy in bits. NaN if fewer than 2 valid values.
    """
    x = x[np.isfinite(x)]
    if len(x) < 2:
        return np.nan

    counts, _ = np.histogram(x, bins=bins)
    probs = counts / counts.sum()
    probs = probs[probs > 0]
    return float(-np.sum(probs * np.log2(probs)))

entropy_discrete ¶

entropy_discrete(x: ndarray) -> float

Shannon entropy of a discrete distribution.

H(X) = -sum_x p(x) * log2(p(x))

Parameters:

Name	Type	Description	Default
`x`	`ndarray`	1D array of discrete values.	required

Returns:

Type	Description
`float`	Entropy in bits. NaN if fewer than 2 values.

Source code in src/signalflow/feature/mutual_information.py

def entropy_discrete(x: np.ndarray) -> float:
    """Shannon entropy of a discrete distribution.

    H(X) = -sum_x p(x) * log2(p(x))

    Args:
        x: 1D array of discrete values.

    Returns:
        Entropy in bits. NaN if fewer than 2 values.
    """
    x = x[~_isnan_any(x)]
    if len(x) < 2:
        return np.nan

    _, counts = np.unique(x, return_counts=True)
    probs = counts / counts.sum()
    probs = probs[probs > 0]
    return float(-np.sum(probs * np.log2(probs)))
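
A short worked example of the entropy formula, written in plain Python rather than NumPy. The RISE/FALL labels follow the label names used elsewhere in this module, `entropy_bits` is a hypothetical name, and the minimum-length and NaN handling of the real function is omitted.

```python
import math
from collections import Counter


def entropy_bits(values: list) -> float:
    """Shannon entropy in bits of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


fair = entropy_bits(["RISE", "FALL", "RISE", "FALL"])    # 50/50 split
skewed = entropy_bits(["RISE", "RISE", "RISE", "FALL"])  # 75/25 split
```

A balanced two-class target carries 1 bit; the skewed one carries less, which caps how much MI any feature can have with it.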

mutual_information_continuous ¶

mutual_information_continuous(x: ndarray, y: ndarray, bins: int = 20) -> float

MI between two continuous variables.

Bins both variables and computes MI from the 2D histogram.

Parameters:

Name	Type	Description	Default
`x`	`ndarray`	1D continuous array.	required
`y`	`ndarray`	1D continuous array.	required
`bins`	`int`	Number of bins per dimension.	`20`

Returns:

Type	Description
`float`	MI in bits. NaN if insufficient data.

Source code in src/signalflow/feature/mutual_information.py

def mutual_information_continuous(
    x: np.ndarray,
    y: np.ndarray,
    bins: int = 20,
) -> float:
    """MI between two continuous variables.

    Bins both variables and computes MI from the 2D histogram.

    Args:
        x: 1D continuous array.
        y: 1D continuous array.
        bins: Number of bins per dimension.

    Returns:
        MI in bits. NaN if insufficient data.
    """
    mask = np.isfinite(x) & np.isfinite(y)
    x, y = x[mask], y[mask]
    if len(x) < 2:
        return np.nan

    hist_2d, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = hist_2d / hist_2d.sum()
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)

    outer = px[:, None] * py[None, :]
    valid = (pxy > 0) & (outer > 0)
    mi = np.sum(pxy[valid] * np.log2(pxy[valid] / outer[valid]))
    return float(max(mi, 0.0))

mutual_information_continuous_discrete ¶

mutual_information_continuous_discrete(x: ndarray, y: ndarray, bins: int = 20) -> float

MI between a continuous feature and a discrete target.

Bins the continuous variable, then computes MI from the joint contingency table of (binned_x, y).

This is the primary use case: continuous feature columns (RSI, SMA, etc.) against discrete labels (RISE/FALL/NONE).

Parameters:

Name	Type	Description	Default
`x`	`ndarray`	1D continuous feature array.	required
`y`	`ndarray`	1D discrete target array.	required
`bins`	`int`	Number of bins for the continuous variable.	`20`

Returns:

Type	Description
`float`	MI in bits. NaN if insufficient data.

Source code in src/signalflow/feature/mutual_information.py

def mutual_information_continuous_discrete(
    x: np.ndarray,
    y: np.ndarray,
    bins: int = 20,
) -> float:
    """MI between a continuous feature and a discrete target.

    Bins the continuous variable, then computes MI from the
    joint contingency table of (binned_x, y).

    This is the primary use case: continuous feature columns
    (RSI, SMA, etc.) against discrete labels (RISE/FALL/NONE).

    Args:
        x: 1D continuous feature array.
        y: 1D discrete target array.
        bins: Number of bins for the continuous variable.

    Returns:
        MI in bits. NaN if insufficient data.
    """
    mask = np.isfinite(x) & ~_isnan_any(y)
    x, y = x[mask], y[mask]
    if len(x) < 2:
        return np.nan

    x_binned = _bin_continuous(x, bins)
    return _mi_from_contingency(x_binned, y)

mutual_information_discrete ¶

mutual_information_discrete(x: ndarray, y: ndarray) -> float

MI between two discrete (categorical) arrays.

MI(X;Y) = sum_{x,y} p(x,y) * log2(p(x,y) / (p(x) * p(y)))

Parameters:

Name	Type	Description	Default
`x`	`ndarray`	1D discrete array.	required
`y`	`ndarray`	1D discrete array of same length.	required

Returns:

Type	Description
`float`	MI in bits. NaN if insufficient data.

Source code in src/signalflow/feature/mutual_information.py

def mutual_information_discrete(x: np.ndarray, y: np.ndarray) -> float:
    """MI between two discrete (categorical) arrays.

    MI(X;Y) = sum_{x,y} p(x,y) * log2(p(x,y) / (p(x) * p(y)))

    Args:
        x: 1D discrete array.
        y: 1D discrete array of same length.

    Returns:
        MI in bits. NaN if insufficient data.
    """
    mask = ~(_isnan_any(x) | _isnan_any(y))
    x, y = x[mask], y[mask]
    if len(x) < 2:
        return np.nan

    return _mi_from_contingency(x, y)

normalized_mutual_information ¶

normalized_mutual_information(mi: float, h_x: float, h_y: float) -> float

Normalize MI to [0, 1] using NMI = MI / sqrt(H(X) * H(Y)).

Parameters:

Name	Type	Description	Default
`mi`	`float`	Raw mutual information value.	required
`h_x`	`float`	Entropy of X.	required
`h_y`	`float`	Entropy of Y.	required

Returns:

Type	Description
`float`	NMI in [0, 1]. NaN if either entropy is zero or NaN.

Source code in src/signalflow/feature/mutual_information.py

def normalized_mutual_information(mi: float, h_x: float, h_y: float) -> float:
    """Normalize MI to [0, 1] using NMI = MI / sqrt(H(X) * H(Y)).

    Args:
        mi: Raw mutual information value.
        h_x: Entropy of X.
        h_y: Entropy of Y.

    Returns:
        NMI in [0, 1]. NaN if either entropy is zero or NaN.
    """
    if np.isnan(mi) or np.isnan(h_x) or np.isnan(h_y):
        return np.nan
    denom = np.sqrt(h_x * h_y)
    if denom <= 0:
        return np.nan
    return float(min(mi / denom, 1.0))
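
The normalization is simple enough to verify by hand. A standalone sketch using only the standard library, with illustrative input values and the hypothetical name `nmi`:

```python
import math


def nmi(mi: float, h_x: float, h_y: float) -> float:
    """NMI = MI / sqrt(H(X) * H(Y)), capped at 1."""
    if math.isnan(mi) or math.isnan(h_x) or math.isnan(h_y):
        return math.nan
    denom = math.sqrt(h_x * h_y)
    if denom <= 0:
        return math.nan  # a zero-entropy (constant) variable has no defined NMI
    return min(mi / denom, 1.0)


partial = nmi(0.5, 1.0, 4.0)     # 0.5 / sqrt(1.0 * 4.0)
degenerate = nmi(0.0, 0.0, 1.0)  # constant X
```

Dividing by the geometric mean of the two entropies makes scores comparable across targets with different numbers of classes.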
