# Signal Validators (Meta-Labeling)

This guide covers training and using signal validators, the meta-labeling approach from Marcos Lopez de Prado for filtering and scoring trading signals.

## Overview
A validator (meta-labeler) is a secondary model that predicts whether a primary signal will be successful. This two-stage approach separates:
- **Detection** (high recall): find as many potential signals as possible
- **Validation** (high precision): filter out false positives
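The two stages can be sketched in plain Python, independent of the signalflow API: the second stage keeps only primary signals whose predicted success probability clears a threshold, trading some recall for precision. The signal tuples and the 0.7 cutoff below are illustrative assumptions.

```python
# Toy primary signals: (signal_id, validator_probability, actually_profitable)
signals = [
    ("s1", 0.92, True),
    ("s2", 0.35, False),
    ("s3", 0.81, True),
    ("s4", 0.55, False),
    ("s5", 0.20, True),   # a winner the validator misses: the recall cost
]

def precision(sigs):
    """Fraction of accepted signals that were actually profitable."""
    return sum(hit for _, _, hit in sigs) / len(sigs)

# Stage 1 takes everything; stage 2 filters on the validator's probability
accepted = [s for s in signals if s[1] > 0.7]

print(f"precision before: {precision(signals):.2f}")   # 0.60
print(f"precision after:  {precision(accepted):.2f}")  # 1.00
```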
## Available Validators

| Validator | Model | Use Case |
|---|---|---|
| `LightGBMValidator` | `LGBMClassifier` | Default choice, fast training |
| `XGBoostValidator` | `XGBClassifier` | Robust, good regularization |
| `RandomForestValidator` | `RandomForestClassifier` | Interpretable, works well with default hyperparameters |
| `LogisticRegressionValidator` | `LogisticRegression` | Fast, linear relationships |
| `SVMValidator` | `SVC` | Small datasets |
| `AutoSelectValidator` | Auto | Automatic model selection |
## Quick Start

### 1. Prepare Data

```python
import polars as pl

from signalflow.detector import ExampleSmaCrossDetector
from signalflow.target import FixedHorizonLabeler
from signalflow.feature import FeaturePipeline

# Load your data
raw_data = ...  # RawData with OHLCV

# Generate signals (what we want to validate)
detector = ExampleSmaCrossDetector(fast_period=20, slow_period=50)
signals = detector.run(raw_data.to_view())

# Generate labels (ground truth for training)
labeler = FixedHorizonLabeler(horizon=60, threshold_pct=0.5)
labeled_df = labeler.compute(raw_data.to_polars("spot"))

# Compute features
pipeline = FeaturePipeline([...])
features_df = pipeline.run(raw_data.to_polars("spot"))
```
### 2. Split Data

```python
# Time-based split (no lookahead)
train_end = features_df["timestamp"].quantile(0.7)
val_end = features_df["timestamp"].quantile(0.85)

train_df = features_df.filter(pl.col("timestamp") <= train_end)
val_df = features_df.filter(
    (pl.col("timestamp") > train_end) & (pl.col("timestamp") <= val_end)
)
test_df = features_df.filter(pl.col("timestamp") > val_end)

# Filter to active signals only (not NONE/FLAT)
train_df = train_df.filter(pl.col("signal_type").is_in(["rise", "fall"]))
val_df = val_df.filter(pl.col("signal_type").is_in(["rise", "fall"]))
```
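The split invariant is worth checking explicitly: every train timestamp must precede every validation timestamp, which must precede every test timestamp. A stdlib-only sketch of the same 70/15/15 cut on synthetic ordered timestamps (index-based rather than polars quantiles, but the invariant is identical):

```python
timestamps = list(range(100))  # stand-in for ordered event timestamps

train_end = timestamps[int(len(timestamps) * 0.70) - 1]
val_end = timestamps[int(len(timestamps) * 0.85) - 1]

train = [t for t in timestamps if t <= train_end]
val = [t for t in timestamps if train_end < t <= val_end]
test = [t for t in timestamps if t > val_end]

# No overlap and strict time ordering between the splits
assert max(train) < min(val) < max(val) < min(test)
print(len(train), len(val), len(test))  # 70 15 15
```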
### 3. Train Validator

```python
from signalflow.validator import LightGBMValidator

# Define feature columns (exclude identifier and label columns)
feature_cols = [
    c for c in train_df.columns
    if c not in ["pair", "timestamp", "label", "signal_type"]
]

# Create the validator
validator = LightGBMValidator(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=6,
)

# Train with early stopping
validator.fit(
    X_train=train_df.select(["pair", "timestamp"] + feature_cols),
    y_train=train_df.select("label"),
    X_val=val_df.select(["pair", "timestamp"] + feature_cols),
    y_val=val_df.select("label"),
)
```
### 4. Validate Signals

```python
from signalflow.core import Signals

# Wrap the test signals
test_signals = Signals(test_df.select(["pair", "timestamp", "signal", "signal_type"]))

# Get probabilities
validated = validator.validate_signals(
    signals=test_signals,
    features=test_df.select(["pair", "timestamp"] + feature_cols),
)

# Access the results
df = validated.value
print(df.columns)
# ['pair', 'timestamp', 'signal', 'signal_type', 'probability_0', 'probability_1']

# Filter to high-confidence signals
confident = df.filter(
    (pl.col("signal_type") == "rise") &
    (pl.col("probability_1") > 0.7)
)
```
## Hyperparameter Tuning

Each validator supports Optuna-based hyperparameter tuning:

```python
from signalflow.validator import RandomForestValidator

validator = RandomForestValidator()

# Configure tuning
validator.tune_params = {
    "n_trials": 100,  # Number of Optuna trials
    "cv_folds": 5,    # Cross-validation folds
    "timeout": 1800,  # Max seconds
}
validator.tune_metric = "roc_auc"  # Optimization metric

# Run tuning
best_params = validator.tune(
    X_train=train_df.select(["pair", "timestamp"] + feature_cols),
    y_train=train_df.select("label"),
)
print(f"Best params: {best_params}")

# Fit with the best params (already applied to the validator)
validator.fit(
    X_train=train_df.select(["pair", "timestamp"] + feature_cols),
    y_train=train_df.select("label"),
)
```
### Tune Spaces

Each validator has a predefined tuning space:

**LightGBMValidator:**

- `n_estimators`: 50-500
- `max_depth`: 3-12
- `learning_rate`: 0.01-0.3 (log scale)
- `num_leaves`: 15-127
- `min_child_samples`: 5-100
- `subsample`: 0.6-1.0
- `colsample_bytree`: 0.6-1.0

**RandomForestValidator:**

- `n_estimators`: 50-300
- `max_depth`: 5-30
- `min_samples_split`: 2-20
- `min_samples_leaf`: 1-10
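Roughly speaking, each Optuna trial draws one candidate from such a space. A stdlib sketch of sampling one LightGBM candidate, including the log-scale draw for `learning_rate` (uniform in log space, then exponentiated); the sampler itself is an illustration, not the library's implementation:

```python
import math
import random

random.seed(0)

def sample_lgbm_params():
    """Draw one candidate from the LightGBMValidator tune space above."""
    return {
        "n_estimators": random.randint(50, 500),
        "max_depth": random.randint(3, 12),
        # Log-uniform: sample uniformly in log space, then exponentiate
        "learning_rate": math.exp(random.uniform(math.log(0.01), math.log(0.3))),
        "num_leaves": random.randint(15, 127),
        "min_child_samples": random.randint(5, 100),
        "subsample": random.uniform(0.6, 1.0),
        "colsample_bytree": random.uniform(0.6, 1.0),
    }

params = sample_lgbm_params()
assert 0.01 <= params["learning_rate"] <= 0.3
print(params)
```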
## Auto Model Selection

Use AutoSelectValidator to automatically find the best model:

```python
from signalflow.validator import AutoSelectValidator

validator = AutoSelectValidator(
    auto_select_metric="roc_auc",
    auto_select_cv_folds=5,
)
validator.fit(X_train, y_train)

# Check the selected model
print(f"Selected: {validator.selected_validator.__class__.__name__}")
# e.g., "LightGBMValidator"
```

The auto-selector cross-validates LightGBM, XGBoost, Random Forest, and Logistic Regression and selects the best-performing model.
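Conceptually the selection step is just "score every candidate the same way, keep the argmax". A stdlib sketch with stand-in mean CV scores (the numbers are illustrative placeholders, not benchmarks; real use cross-validates each fitted model):

```python
# Stand-in mean cross-validation ROC AUC per candidate (illustrative only)
cv_scores = {
    "LightGBMValidator": 0.71,
    "XGBoostValidator": 0.69,
    "RandomForestValidator": 0.66,
    "LogisticRegressionValidator": 0.63,
}

# Pick the candidate with the highest mean CV score
selected = max(cv_scores, key=cv_scores.get)
print(f"Selected: {selected}")  # Selected: LightGBMValidator
```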
## Early Stopping (Boosting Models)

LightGBM and XGBoost support early stopping to prevent overfitting:

```python
from signalflow.validator import LightGBMValidator

validator = LightGBMValidator(
    n_estimators=1000,         # Max iterations
    early_stopping_rounds=50,  # Stop if no improvement for 50 rounds
)

# Validation data is required for early stopping
validator.fit(
    X_train=train_df.select(["pair", "timestamp"] + feature_cols),
    y_train=train_df.select("label"),
    X_val=val_df.select(["pair", "timestamp"] + feature_cols),
    y_val=val_df.select("label"),
)

# Check the number of iterations actually used
print(f"Best iteration: {validator.model.best_iteration_}")
```
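The mechanism itself is simple to state. A stdlib sketch of patience-based early stopping over a sequence of validation losses (signalflow delegates this to LightGBM/XGBoost internally, so this is illustration only):

```python
def best_iteration(val_losses, patience):
    """Return the 1-based iteration with the best loss, honoring early
    stopping: training halts once `patience` rounds pass without improvement."""
    best_iter, best_loss, rounds_since_best = 0, float("inf"), 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_iter, best_loss, rounds_since_best = i, loss, 0
        else:
            rounds_since_best += 1
            if rounds_since_best >= patience:
                break  # stop training; keep the model from best_iter
    return best_iter

# Loss improves until iteration 4, then plateaus
losses = [0.70, 0.62, 0.58, 0.55, 0.56, 0.57, 0.56, 0.58]
print(best_iteration(losses, patience=3))  # 4
```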
## Save and Load

```python
# Save the trained validator
validator.save("models/my_validator.pkl")

# Load it later
from signalflow.validator import LightGBMValidator

loaded = LightGBMValidator.load("models/my_validator.pkl")
validated = loaded.validate_signals(signals, features)
```
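The round trip is a standard pickle pattern. A stdlib sketch with a stand-in state dict (the dict contents are illustrative, not the validator's actual serialized fields):

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted validator's state
state = {"model_type": "lightgbm", "n_estimators": 200, "threshold": 0.7}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_validator.pkl"
    path.write_bytes(pickle.dumps(state))     # save
    loaded = pickle.loads(path.read_bytes())  # load

assert loaded == state
print(loaded["model_type"])  # lightgbm
```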
## Integration with Strategy

Use validated signals in your trading strategy:

```python
import polars as pl

from signalflow.core import Signals
from signalflow.strategy.component.entry import SignalEntryRule


class ValidatedEntryRule(SignalEntryRule):
    """Entry rule that filters by validator probability."""

    min_probability: float = 0.6

    def check_entries(self, signals, state, context):
        # Keep only high-confidence signals
        confident = signals.value.filter(
            pl.col("probability_1") > self.min_probability
        )
        # Proceed with the filtered signals
        return super().check_entries(Signals(confident), state, context)
```

Or use SignalAggregator in META_LABELING mode:

```python
from signalflow.strategy.component.entry.aggregation import (
    SignalAggregator,
    VotingMode,
)

aggregator = SignalAggregator(
    voting_mode=VotingMode.META_LABELING,
    probability_threshold=0.6,
)

# Combines detector signals with validator probabilities
combined = aggregator.aggregate([detector_signals, validated_signals])
```
## Best Practices

### 1. Use Time-Based Splits

Always use time-based train/val/test splits to avoid lookahead bias:

```python
# Good: time-based split
train = df.filter(pl.col("timestamp") < cutoff)
test = df.filter(pl.col("timestamp") >= cutoff)

# Bad: random split (causes lookahead)
train, test = train_test_split(df, test_size=0.2)  # Don't do this!
```

### 2. Filter to Active Signals

Only train on signals that require a decision, i.e. drop NONE/FLAT rows before fitting.
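This is the same `signal_type` filter applied in step 2 of the Quick Start. A plain-Python equivalent on a toy row set (pair values are illustrative):

```python
rows = [
    {"pair": "BTC/USDT", "signal_type": "rise"},
    {"pair": "BTC/USDT", "signal_type": "none"},
    {"pair": "ETH/USDT", "signal_type": "fall"},
    {"pair": "ETH/USDT", "signal_type": "flat"},
]

# Equivalent of: df.filter(pl.col("signal_type").is_in(["rise", "fall"]))
active = [r for r in rows if r["signal_type"] in ("rise", "fall")]
print(len(active))  # 2
```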
### 3. Handle Class Imbalance

If your labels are imbalanced, consider class weighting:

```python
# LightGBM with class weights
validator = LightGBMValidator()
validator.model_params["class_weight"] = "balanced"
```
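`class_weight="balanced"` reweights each class inversely to its frequency; sklearn-style estimators use the formula `n_samples / (n_classes * class_count)`. A stdlib sketch on a toy label vector (the 20% positive rate is an illustrative assumption):

```python
from collections import Counter

labels = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]  # 20% positives, a typical imbalance
counts = Counter(labels)
n, k = len(labels), len(counts)

# "balanced" weight: n_samples / (n_classes * class_count)
weights = {cls: n / (k * cnt) for cls, cnt in counts.items()}
print(weights)  # {1: 2.5, 0: 0.625}
```

The minority class (positives) receives the larger weight, so misclassifying a rare winner costs more during training.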
### 4. Feature Engineering

Good features for meta-labeling:

- Volatility metrics (ATR, realized vol)
- Volume indicators
- Market regime features
- Signal confidence from the detector
- Time-of-day features
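As one example from the list, realized volatility over a trailing window can be computed from close prices alone. A stdlib sketch; the window length and price series are illustrative assumptions:

```python
import math
import statistics

def realized_vol(closes, window=5):
    """Sample stdev of log returns over the last `window` bars."""
    log_returns = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    return statistics.stdev(log_returns[-window:])

closes = [100.0, 101.0, 100.5, 102.0, 101.2, 103.0, 102.5]
vol = realized_vol(closes)
assert vol > 0
print(f"{vol:.4f}")
```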
### 5. Monitor Overfitting

Use early stopping and compare train vs. validation metrics:

```python
from sklearn.metrics import roc_auc_score

validator.fit(X_train, y_train, X_val, y_val)

# Build numeric arrays from the feature columns and labels
X_train_np = train_df.select(feature_cols).to_numpy()
y_train_np = train_df["label"].to_numpy()
X_val_np = val_df.select(feature_cols).to_numpy()
y_val_np = val_df["label"].to_numpy()

train_preds = validator.model.predict_proba(X_train_np)[:, 1]
val_preds = validator.model.predict_proba(X_val_np)[:, 1]

# A large train/val gap indicates overfitting
print(f"Train AUC: {roc_auc_score(y_train_np, train_preds):.3f}")
print(f"Val AUC:   {roc_auc_score(y_val_np, val_preds):.3f}")
```
## See Also
- API Reference: Detailed class documentation
- Signal Architecture: Meta-labeling theory
- Custom Detectors: Building primary models