Validator Module¶
Signal validators (meta-labelers) predict the quality or risk of trading signals. In De Prado's terminology, this implements the meta-labeling approach - training a secondary model to predict whether a primary signal will be successful.
Available Validators¶
| Validator | Model | Best For |
|---|---|---|
| `LightGBMValidator` | `LGBMClassifier` | Fast training, good defaults |
| `XGBoostValidator` | `XGBClassifier` | Robust, regularized |
| `RandomForestValidator` | `RandomForestClassifier` | Ensemble, interpretable |
| `LogisticRegressionValidator` | `LogisticRegression` | Fast, linear relationships |
| `SVMValidator` | `SVC` | Small datasets, non-linear |
| `AutoSelectValidator` | Auto | Automatic model selection |
Quick Start¶
```python
import polars as pl

from signalflow.validator import LightGBMValidator
from signalflow.core import Signals

# Prepare data - filter to active signals (not NONE)
train_df = train_df.filter(pl.col("signal_type") != "none")

# Create and train validator
validator = LightGBMValidator(n_estimators=200, learning_rate=0.05)
validator.fit(
    train_df.select(["pair", "timestamp"] + feature_cols),
    train_df.select("label"),
    X_val=val_df.select(["pair", "timestamp"] + feature_cols),  # Early stopping
    y_val=val_df.select("label"),
)

# Validate new signals
validated = validator.validate_signals(
    Signals(test_df.select(signal_cols)),
    test_df.select(["pair", "timestamp"] + feature_cols),
)

# Filter to high-confidence predictions
confident = validated.value.filter(pl.col("probability_rise") > 0.7)
```
Auto Model Selection¶
Use AutoSelectValidator to automatically select the best model:
```python
from signalflow.validator import AutoSelectValidator

validator = AutoSelectValidator()
validator.fit(X_train, y_train)

# Check which model was selected
print(validator.selected_validator)  # e.g., LightGBMValidator
```
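Model selection of this kind typically compares cross-validated scores and keeps the winner. A standalone sketch of that idea with scikit-learn (an illustration of the mechanism, not the library's actual implementation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data standing in for signal features and meta-labels
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Score each candidate with cross-validated ROC-AUC and keep the best
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
```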
Hyperparameter Tuning¶
Each validator supports Optuna-based hyperparameter tuning:
```python
from signalflow.validator import RandomForestValidator

validator = RandomForestValidator()
validator.tune_params = {"n_trials": 50, "cv_folds": 5, "timeout": 600}
best_params = validator.tune(X_train, y_train)

# Fit with best params
validator.fit(X_train, y_train)
```
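The `_tune_space` definitions shown later on this page use tuples like `('int', 50, 500)` and `('log_float', 0.01, 0.3)`. To illustrate the format, here is a stdlib-only random sampler over such a space (a sketch of the idea only; the validators themselves tune via Optuna):

```python
import math
import random

random.seed(0)

# A tune-space in the same shape as the validators' _tune_space dicts
tune_space = {
    "n_estimators": ("int", 50, 300),
    "learning_rate": ("log_float", 0.01, 0.3),
    "subsample": ("float", 0.6, 1.0),
}

def sample(space: dict[str, tuple]) -> dict:
    """Draw one random candidate from the tune-space."""
    params = {}
    for name, (kind, lo, hi) in space.items():
        if kind == "int":
            params[name] = random.randint(lo, hi)
        elif kind == "float":
            params[name] = random.uniform(lo, hi)
        elif kind == "log_float":
            # Sample uniformly in log space, as Optuna does for log-scaled floats
            params[name] = math.exp(random.uniform(math.log(lo), math.log(hi)))
    return params

candidate = sample(tune_space)
```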
Base Class¶
signalflow.validator.base.SignalValidator (dataclass)¶

```python
SignalValidator(model: Any | None = None, model_type: str | None = None, model_params: dict | None = None, train_params: dict | None = None, tune_enabled: bool = False, tune_params: dict | None = None, feature_columns: list[str] | None = None, pair_col: str = 'pair', ts_col: str = 'timestamp')
```
Base class for signal validators (meta-labelers).
Validates trading signals by predicting their risk/quality. In De Prado's terminology, this is a meta-labeler.
Note: Filtering to active signals (RISE/FALL only) should be done BEFORE passing data to fit. This keeps the validator simple and gives users full control over data preparation.
Attributes:

| Name | Type | Description |
|---|---|---|
| `model` | `Any \| None` | The trained model instance |
| `model_type` | `str \| None` | String identifier for model type (e.g., "lightgbm", "xgboost") |
| `model_params` | `dict \| None` | Parameters for model initialization |
| `train_params` | `dict \| None` | Parameters for training (e.g., early stopping) |
| `tune_enabled` | `bool` | Whether hyperparameter tuning is enabled |
| `tune_params` | `dict \| None` | Parameters for tuning (e.g., n_trials, cv_folds) |
| `feature_columns` | `list[str] \| None` | List of feature column names (set after fit) |
feature_columns (class-attribute, instance-attribute)¶
fit ¶
```python
fit(X_train: DataFrame, y_train: DataFrame | Series, X_val: DataFrame | None = None, y_val: DataFrame | Series | None = None) -> SignalValidator
```
Train the validator model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X_train` | `DataFrame` | Training features (Polars DataFrame) | required |
| `y_train` | `DataFrame \| Series` | Training labels | required |
| `X_val` | `DataFrame \| None` | Validation features (optional) | `None` |
| `y_val` | `DataFrame \| Series \| None` | Validation labels (optional) | `None` |

Returns:

| Type | Description |
|---|---|
| `SignalValidator` | Self for method chaining |
Source code in src/signalflow/validator/base.py
load (classmethod)¶
predict ¶
Predict class labels and return updated Signals.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `signals` | `Signals` | Input signals container | required |
| `X` | `DataFrame` | Features (Polars DataFrame) with (pair, timestamp) + feature columns | required |

Returns:

| Type | Description |
|---|---|
| `Signals` | New Signals with prediction column added |
Source code in src/signalflow/validator/base.py
predict_proba ¶
Predict class probabilities and return updated Signals.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `signals` | `Signals` | Input signals container | required |
| `X` | `DataFrame` | Features (Polars DataFrame) | required |

Returns:

| Type | Description |
|---|---|
| `Signals` | New Signals with probability columns added |
Source code in src/signalflow/validator/base.py
save ¶
tune ¶
```python
tune(X_train: DataFrame, y_train: DataFrame | Series, X_val: DataFrame | None = None, y_val: DataFrame | Series | None = None) -> dict[str, Any]
```
Tune hyperparameters.
Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Best parameters found |
Source code in src/signalflow/validator/base.py
validate_signals ¶
Add validation predictions to signals.
Convenience method - calls predict_proba internally.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `signals` | `Signals` | Input signals container | required |
| `features` | `DataFrame` | Features DataFrame with (pair, timestamp) + feature columns | required |
| `prefix` | `str` | Prefix for probability columns | `'probability_'` |

Returns:

| Type | Description |
|---|---|
| `Signals` | Signals with added validation columns |
Source code in src/signalflow/validator/base.py
Sklearn Base¶
signalflow.validator.sklearn_validator.SklearnValidatorBase (dataclass)¶

```python
SklearnValidatorBase(model: Any | None = None, model_type: str | None = None, model_params: dict | None = None, train_params: dict | None = None, tune_enabled: bool = False, tune_params: dict | None = None, feature_columns: list[str] | None = None, pair_col: str = 'pair', ts_col: str = 'timestamp', tune_metric: str = 'roc_auc', tune_cv_folds: int = 5, tune_n_trials: int = 50, tune_timeout: int = 600, early_stopping_rounds: int = 50)
```
Bases: SignalValidator
Base class for sklearn-compatible signal validators.
Provides common functionality for feature extraction, fitting, prediction, and serialization. Subclasses define model-specific configuration via class variables.
Class Variables (override in subclasses):

- `_model_class`: Import path for the model class (e.g., "lightgbm.LGBMClassifier")
- `_default_params`: Default model parameters
- `_tune_space`: Optuna tuning space definition
- `_supports_early_stopping`: Whether the model supports early stopping
__post_init__ ¶
Source code in src/signalflow/validator/sklearn_validator.py
_create_model ¶
Create model instance with given parameters.
Source code in src/signalflow/validator/sklearn_validator.py
_extract_features ¶
Extract feature matrix from DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `DataFrame` | Input DataFrame | required |
| `fit_mode` | `bool` | If True, infer and store feature columns | `False` |

Returns:

| Type | Description |
|---|---|
| `ndarray` | Feature matrix as numpy array |
Source code in src/signalflow/validator/sklearn_validator.py
_extract_labels ¶
Extract label array.
Source code in src/signalflow/validator/sklearn_validator.py
_get_class_labels ¶
Get class labels for probability columns.
Source code in src/signalflow/validator/sklearn_validator.py
_get_early_stopping_kwargs ¶
Get early stopping fit kwargs. Override in subclasses.
fit ¶
```python
fit(X_train: DataFrame, y_train: DataFrame | Series, X_val: DataFrame | None = None, y_val: DataFrame | Series | None = None) -> SklearnValidatorBase
```
Train the validator.
Note: Filter to active signals BEFORE calling this method.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X_train` | `DataFrame` | Training features (already filtered to active signals) | required |
| `y_train` | `DataFrame \| Series` | Training labels | required |
| `X_val` | `DataFrame \| None` | Validation features (optional, for early stopping) | `None` |
| `y_val` | `DataFrame \| Series \| None` | Validation labels (optional) | `None` |

Returns:

| Type | Description |
|---|---|
| `SklearnValidatorBase` | Self for method chaining |
Source code in src/signalflow/validator/sklearn_validator.py
load (classmethod)¶
Load validator from file.
Source code in src/signalflow/validator/sklearn_validator.py
predict ¶
Predict class labels and return updated Signals.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `signals` | `Signals` | Input signals container | required |
| `X` | `DataFrame` | Features DataFrame with (pair, timestamp) + feature columns | required |

Returns:

| Type | Description |
|---|---|
| `Signals` | New Signals with 'validation_pred' column added |
Source code in src/signalflow/validator/sklearn_validator.py
predict_proba ¶
Predict class probabilities and return updated Signals.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `signals` | `Signals` | Input signals container | required |
| `X` | `DataFrame` | Features DataFrame with (pair, timestamp) + feature columns | required |

Returns:

| Type | Description |
|---|---|
| `Signals` | New Signals with probability columns added |
Source code in src/signalflow/validator/sklearn_validator.py
save ¶
Save validator to file.
Source code in src/signalflow/validator/sklearn_validator.py
tune ¶
```python
tune(X_train: DataFrame, y_train: DataFrame | Series, X_val: DataFrame | None = None, y_val: DataFrame | Series | None = None) -> dict[str, Any]
```
Tune hyperparameters using Optuna.
Note: Filter to active signals BEFORE calling this method.
Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Best parameters found |
Source code in src/signalflow/validator/sklearn_validator.py
validate_signals ¶
Add validation probabilities to signals.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `signals` | `Signals` | Input Signals container | required |
| `features` | `DataFrame` | Features DataFrame with (pair, timestamp) + features | required |
| `prefix` | `str` | Prefix for probability columns | `'probability_'` |

Returns:

| Type | Description |
|---|---|
| `Signals` | New Signals with probability columns added |
Source code in src/signalflow/validator/sklearn_validator.py
LightGBM Validator¶
signalflow.validator.sklearn_validator.LightGBMValidator (dataclass)¶

```python
LightGBMValidator(model: Any | None = None, model_type: str | None = None, model_params: dict | None = None, train_params: dict | None = None, tune_enabled: bool = False, tune_params: dict | None = None, feature_columns: list[str] | None = None, pair_col: str = 'pair', ts_col: str = 'timestamp', tune_metric: str = 'roc_auc', tune_cv_folds: int = 5, tune_n_trials: int = 50, tune_timeout: int = 600, early_stopping_rounds: int = 50, n_estimators: int = 100, max_depth: int = 6, learning_rate: float = 0.1, num_leaves: int = 31)
```
Bases: SklearnValidatorBase
LightGBM-based signal validator.
Gradient boosting model optimized for speed and performance. Supports early stopping with validation data.
Example
```python
validator = LightGBMValidator(n_estimators=200)
validator.fit(X_train, y_train, X_val, y_val)
validated = validator.validate_signals(signals, features)
```
_default_params (class-attribute)¶

```python
_default_params: dict[str, Any] = {'n_estimators': 100, 'max_depth': 6, 'learning_rate': 0.1, 'num_leaves': 31, 'min_child_samples': 20, 'subsample': 0.8, 'colsample_bytree': 0.8, 'random_state': 42, 'n_jobs': -1, 'verbosity': -1}
```

_tune_space (class-attribute)¶

```python
_tune_space: dict[str, tuple] = {'n_estimators': ('int', 50, 500), 'max_depth': ('int', 3, 12), 'learning_rate': ('log_float', 0.01, 0.3), 'num_leaves': ('int', 15, 127), 'min_child_samples': ('int', 5, 100), 'subsample': ('float', 0.6, 1.0), 'colsample_bytree': ('float', 0.6, 1.0)}
```
__post_init__ ¶
Source code in src/signalflow/validator/sklearn_validator.py
_get_early_stopping_kwargs ¶
Source code in src/signalflow/validator/sklearn_validator.py
XGBoost Validator¶
signalflow.validator.sklearn_validator.XGBoostValidator (dataclass)¶

```python
XGBoostValidator(model: Any | None = None, model_type: str | None = None, model_params: dict | None = None, train_params: dict | None = None, tune_enabled: bool = False, tune_params: dict | None = None, feature_columns: list[str] | None = None, pair_col: str = 'pair', ts_col: str = 'timestamp', tune_metric: str = 'roc_auc', tune_cv_folds: int = 5, tune_n_trials: int = 50, tune_timeout: int = 600, early_stopping_rounds: int = 50, n_estimators: int = 100, max_depth: int = 6, learning_rate: float = 0.1)
```
Bases: SklearnValidatorBase
XGBoost-based signal validator.
Robust gradient boosting with regularization. Supports early stopping with validation data.
Example
```python
validator = XGBoostValidator(n_estimators=200)
validator.fit(X_train, y_train, X_val, y_val)
validated = validator.validate_signals(signals, features)
```
_default_params (class-attribute)¶

```python
_default_params: dict[str, Any] = {'n_estimators': 100, 'max_depth': 6, 'learning_rate': 0.1, 'subsample': 0.8, 'colsample_bytree': 0.8, 'random_state': 42, 'n_jobs': -1, 'verbosity': 0, 'use_label_encoder': False, 'eval_metric': 'logloss'}
```

_tune_space (class-attribute)¶

```python
_tune_space: dict[str, tuple] = {'n_estimators': ('int', 50, 500), 'max_depth': ('int', 3, 12), 'learning_rate': ('log_float', 0.01, 0.3), 'subsample': ('float', 0.6, 1.0), 'colsample_bytree': ('float', 0.6, 1.0), 'min_child_weight': ('int', 1, 10), 'gamma': ('float', 0, 0.5)}
```
__post_init__ ¶
Source code in src/signalflow/validator/sklearn_validator.py
_get_early_stopping_kwargs ¶
Source code in src/signalflow/validator/sklearn_validator.py
Random Forest Validator¶
signalflow.validator.sklearn_validator.RandomForestValidator (dataclass)¶

```python
RandomForestValidator(model: Any | None = None, model_type: str | None = None, model_params: dict | None = None, train_params: dict | None = None, tune_enabled: bool = False, tune_params: dict | None = None, feature_columns: list[str] | None = None, pair_col: str = 'pair', ts_col: str = 'timestamp', tune_metric: str = 'roc_auc', tune_cv_folds: int = 5, tune_n_trials: int = 50, tune_timeout: int = 600, early_stopping_rounds: int = 50, n_estimators: int = 100, max_depth: int = 10)
```
Bases: SklearnValidatorBase
Random Forest-based signal validator.
Ensemble of decision trees with bagging.
Example
```python
validator = RandomForestValidator(n_estimators=200)
validator.fit(X_train, y_train)
validated = validator.validate_signals(signals, features)
```
_default_params (class-attribute)¶

```python
_default_params: dict[str, Any] = {'n_estimators': 100, 'max_depth': 10, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'random_state': 42, 'n_jobs': -1}
```

_tune_space (class-attribute)¶

```python
_tune_space: dict[str, tuple] = {'n_estimators': ('int', 50, 300), 'max_depth': ('int', 5, 30), 'min_samples_split': ('int', 2, 20), 'min_samples_leaf': ('int', 1, 10)}
```
__post_init__ ¶
Logistic Regression Validator¶
signalflow.validator.sklearn_validator.LogisticRegressionValidator (dataclass)¶

```python
LogisticRegressionValidator(model: Any | None = None, model_type: str | None = None, model_params: dict | None = None, train_params: dict | None = None, tune_enabled: bool = False, tune_params: dict | None = None, feature_columns: list[str] | None = None, pair_col: str = 'pair', ts_col: str = 'timestamp', tune_metric: str = 'roc_auc', tune_cv_folds: int = 5, tune_n_trials: int = 50, tune_timeout: int = 600, early_stopping_rounds: int = 50, C: float = 1.0, max_iter: int = 1000)
```
Bases: SklearnValidatorBase
Logistic Regression-based signal validator.
Linear classifier with regularization.
Example
```python
validator = LogisticRegressionValidator(C=0.1)
validator.fit(X_train, y_train)
validated = validator.validate_signals(signals, features)
```
SVM Validator¶
signalflow.validator.sklearn_validator.SVMValidator (dataclass)¶

```python
SVMValidator(model: Any | None = None, model_type: str | None = None, model_params: dict | None = None, train_params: dict | None = None, tune_enabled: bool = False, tune_params: dict | None = None, feature_columns: list[str] | None = None, pair_col: str = 'pair', ts_col: str = 'timestamp', tune_metric: str = 'roc_auc', tune_cv_folds: int = 5, tune_n_trials: int = 50, tune_timeout: int = 600, early_stopping_rounds: int = 50, C: float = 1.0, kernel: str = 'rbf')
```
Bases: SklearnValidatorBase
Support Vector Machine-based signal validator.
SVM classifier with RBF kernel by default.
Example
```python
validator = SVMValidator(C=10.0, kernel="rbf")
validator.fit(X_train, y_train)
validated = validator.validate_signals(signals, features)
```
_default_params (class-attribute)¶

```python
_default_params: dict[str, Any] = {'C': 1.0, 'kernel': 'rbf', 'probability': True, 'random_state': 42}
```

_tune_space (class-attribute)¶

```python
_tune_space: dict[str, tuple] = {'C': ('log_float', 0.001, 100), 'kernel': ('categorical', ['rbf', 'linear', 'poly']), 'gamma': ('categorical', ['scale', 'auto'])}
```
Auto-Select Validator¶
signalflow.validator.sklearn_validator.AutoSelectValidator (dataclass)¶

```python
AutoSelectValidator(model: Any | None = None, model_type: str | None = None, model_params: dict | None = None, train_params: dict | None = None, tune_enabled: bool = False, tune_params: dict | None = None, feature_columns: list[str] | None = None, pair_col: str = 'pair', ts_col: str = 'timestamp', tune_metric: str = 'roc_auc', tune_cv_folds: int = 5, tune_n_trials: int = 50, tune_timeout: int = 600, early_stopping_rounds: int = 50, auto_select_metric: str = 'roc_auc', auto_select_cv_folds: int = 5, selected_validator: SklearnValidatorBase | None = None)
```
Bases: SklearnValidatorBase
Auto-selecting signal validator.
Automatically selects the best model via cross-validation. Tests LightGBM, XGBoost, Random Forest, and Logistic Regression.
Example
```python
validator = AutoSelectValidator()
validator.fit(X_train, y_train)  # Selects best model
print(validator.selected_validator)  # Shows which was selected
```
selected_validator (class-attribute, instance-attribute)¶
fit ¶
```python
fit(X_train: DataFrame, y_train: DataFrame | Series, X_val: DataFrame | None = None, y_val: DataFrame | Series | None = None) -> AutoSelectValidator
```
Train the validator, auto-selecting the best model.