Momentum Candle Backtester
Internals & Decision Record
Every module, every constant, every design decision. This is the file to read before touching code. Higher-level summary first, deep technical detail behind expandable sections so you can pick your depth.
A three-gate validation pipeline for altcoin breakout candles
The scanner finds momentum breakout candles across 300+ Binance USDT pairs. Each candidate signal is pushed through three sequential gates before any execution decision is made:
Three files. One brain.
| File | LOC | Responsibility |
|---|---|---|
| app.py | 8,264 | Scanner + Manual + all pipeline stages (Step 1/2/3), UI rendering, Groq AI verdict |
| pulse_intel.py | 1,682 | On-chain intel module: DefiLlama TVL, Etherscan / Solscan flow, LunarCrush social, macro |
| lookahead_audit.py | 290 | Causality audit — proves all 22 features are forward-looking-leak-free |
Dependencies
# requirements.txt
streamlit>=1.32.0
plotly>=5.19.0
pandas>=2.0.0
numpy>=1.26.0
scipy>=1.12.0
scikit-learn>=1.4.0
requests>=2.31.0
Three tabs, three concerns
One signal, three steps, one verdict
Why split into three buttons and not one pipeline run DESIGN
Originally two steps. Split into three on Apr 13 because ML training needs to know which method it's labeling against — which requires the user to see backtest results and make a judgment first. Friction here is a feature, not a bug: it forces the human to acknowledge the backtest output before the ML and AI layers get involved.
Also: Step 1 is expensive (72 methods × ratchet loop). Running it blindly for every scan would waste Binance API calls and sklearn CPU.
Binance klines, 1000 bars deep
All backtest/WFO/ML training pulls from _scanner_fetch_candles(symbol, interval, limit). Binance's /api/v3/klines hard-caps at 1000 bars — we always fetch the max.
_BINANCE_INTERVAL = {"1D":"1d", "4H":"4h", "1H":"1h", "1W":"1w"}
_DEEP_FETCH_LIMITS = {"1h":1000, "2h":1000, "4h":1000,
"6h":1000, "12h":1000, "1d":1000}
What 1000 bars means per timeframe
| Interval | Historical window | Bars |
|---|---|---|
| 1h | ~42 days | 1000 |
| 4h | ~167 days (5.5mo) | 1000 |
| 1d | ~2.7 years | 1000 |
| 1w | ~19 years (way past listing date on most alts) | 1000 |
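The windows in the table follow directly from bar-count arithmetic; a quick sanity check (the helper and interval map are illustrative, not from the codebase):

```python
# Rough historical window per timeframe at Binance's 1000-bar cap.
INTERVAL_HOURS = {"1h": 1, "4h": 4, "1d": 24, "1w": 168}

def window_days(interval: str, bars: int = 1000) -> float:
    """Days of history covered by `bars` candles of `interval`."""
    return bars * INTERVAL_HOURS[interval] / 24

print(round(window_days("1h")))           # ~42 days
print(round(window_days("4h")))           # ~167 days
print(round(window_days("1d") / 365, 1))  # ~2.7 years
```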
Fallback & alternate sources RESILIENCE
_binance_klines is the primary path. _gateio_klines exists as a fallback for delisted Binance pairs where the ticker moved to Gate.io. fetch_live handles real-time current-bar polling for the Pulse tab.
Exchange API rate limits are handled via adaptive retry with 500 ms backoff on 429 / 418 status codes. There is no persistent rate limiter — the thin wrapper assumes Streamlit's request cadence is low enough.
Every feature is strictly past-looking
Built by _clean_df(df). Audited by lookahead_audit.py — 20/20 CLEAN (synthetic test, 25 random bars, all features match causal recomputation). A deliberately-leaking test feature was correctly flagged — the audit works.
The 11 ML features (subset of 22)
Causality proofs per feature AUDIT
- EMA uses `.shift(1).ewm()` — strictly past, no current bar.
- `vol_avg_7` uses `.shift(1).rolling(7).mean()` — past 7 bars, excluding current.
- `candle_rank_20`, `vol_rank_20` use `.rolling(20).rank(pct=True)` — causal in pandas (ranks within a window ending at the current bar, which is correct: the signal uses the current candle's data to score itself).
- `body_vs_atr` and `dist_from_ema21_pct` are derived from already-causal features.
- `regime_score` composites ADX/F&G/funding/OI — each fetched at or before the bar's timestamp.
Full audit: python lookahead_audit.py BTCUSDT 1d
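A minimal sketch of the shift-then-roll pattern the audit verifies — the column names (`close`, `volume`, `body`) are assumptions for illustration, not the app's actual schema:

```python
import pandas as pd

def causal_features(df: pd.DataFrame) -> pd.DataFrame:
    """Past-only features: every rolling/ewm input is shifted by one bar,
    so the current (possibly still-forming) bar never leaks in."""
    out = pd.DataFrame(index=df.index)
    out["ema21"] = df["close"].shift(1).ewm(span=21, adjust=False).mean()
    out["vol_avg_7"] = df["volume"].shift(1).rolling(7).mean()
    # rank(pct=True) over a trailing window is causal: the window ends at
    # the current bar, which is exactly the bar being scored.
    out["candle_rank_20"] = df["body"].rolling(20).rank(pct=True)
    return out
```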
dist_from_ema21 is sign-flipped so "stretched the wrong way" is always a negative number, regardless of direction. This keeps the ML feature space directionally consistent.

A 0–100 score blending trend, sentiment, funding, OI
calculate_regime_score(df, bar_index, direction, adx_df, timeframe, ticker) composites four inputs into a single bar-level score. The score is cached per bar in _bar_regime_cache so the 72-method backtest doesn't recompute it 72× per candle.
Soft regime similarity
def _regime_similarity_weight(current, historical):
return max(0.15, 1 - abs(current - historical) / 100)
Applied to:
- Backtest `EVw`/`WRw` — analogs from similar regimes pull harder
- ML `sample_weight` — same logic, at the classifier level
NOT applied to WFO — WFO tests generalization across regimes on purpose. Regime-filtering the WFO would defeat the point.
The floor is 0.15, never 0. It prevents sample-size cliffs on illiquid coins where very few historical analogs match the current regime closely; a far-regime analog still contributes 15% signal.

Finds breakout candidates across 300+ USDT pairs
Universe built by _scanner_get_universe(min_volume_usdt) — Binance spot USDT pairs with min 24h volume. Each coin scans its last 3 closed candles (not current bar — that's still forming) via _scan_one_symbol → _scanner_score_signal.
Scoring a candidate
_scanner_score_signal(df, adx_df, bar_idx, direction,
timeframe, symbol, min_body_pct, min_vol_mult)
# → returns sig dict OR None if below threshold
Returns a sig dict containing: direction, entry/SL/TP prices, body_pct, vol_mult, ADX, DI gap, ATR ratio, EMA alignment score, regime score, bar_index, timeframe, symbol, 11-feature vector extracted for current bar.
Scanner threshold defaults CONFIG
Defaults are intentionally loose — better to surface a borderline signal and let the 3-gate pipeline reject it than miss one entirely.
- `min_body_pct` ≈ 0.60 (body takes up ≥60% of range)
- `min_vol_mult` ≈ 1.8× (volume ≥1.8× 7-bar avg)
- ADX filter soft — used in scoring, not hard-cut
Signals below threshold return None and never enter the pipeline.
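A hedged sketch of the two hard gates — the real `_scanner_score_signal` computes many more fields; the helper name and signature here are illustrative:

```python
def passes_candle_gates(open_, high, low, close, volume, vol_avg_7,
                        min_body_pct=0.60, min_vol_mult=1.8):
    """Return (body_pct, vol_mult) if the candle clears both loose gates,
    else None — mirroring the scanner's 'return None below threshold'."""
    rng = high - low
    if rng <= 0 or vol_avg_7 <= 0:
        return None
    body_pct = abs(close - open_) / rng   # body as fraction of full range
    vol_mult = volume / vol_avg_7         # volume vs trailing 7-bar average
    if body_pct >= min_body_pct and vol_mult >= min_vol_mult:
        return body_pct, vol_mult
    return None

passes_candle_gates(100, 110, 99, 109, 5000, 2000)  # strong candle → (0.818…, 2.5)
```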
72 method combos against historical analogs
When the user clicks "Run Step 1": _scanner_quick_backtest(sig) scans historical bars looking for analogs to the current signal, then simulates all 72 method combinations on those analogs. _scanner_mini_wfo(sig, bt) then runs a rolling purged walk-forward on the full df to test generalization.
Key constants
NEUTRAL_R_THRESHOLD = 0.30 # ±0.30R = NEUTRAL, excluded from ML
MAX_HOLD = 20 # max bars per trade
FIXED_SL = 1.5             # % — fixed stop width (alt: ATR SL)
Progressive relaxation when analogs are scarce
Problem: a strong 85%-body, 5×-volume signal has very few exact historical analogs. Requiring 70% of its own body/vol might find only ~4 samples on ETH 4H — too few to backtest, and far too few to train ML on.
Solution: a ratchet that relaxes the threshold until enough samples are found, with hard floors to prevent pure-noise analogs.
# Backtest ratchet (target: 50 bars)
_BT_RATCHET_RATIOS = [0.70, 0.55, 0.45, 0.35, 0.25, 0.20]
_BT_MIN_BODY_FLOOR = 0.20
_BT_MIN_VOL_FLOOR = 1.10
# ML ratchet (target: 80 samples)
_RATCHET_RATIOS = [0.70, 0.55, 0.45, 0.35, 0.25, 0.20]
Two-pass optimization PERF
The backtest does a cheap _count_passing() scan at each ratchet level first. Only when a level meets the 50-bar target does it run the full 72-method loop once. This avoids 6× redundant simulation.
Why hard floors matter GUARDRAIL
Without floors, a signal on an ultra-rare extreme candle could ratchet all the way down to matching any candle, producing garbage analogs. body_floor=0.20 and vol_floor=1.10 ensure every analog is at least a "real" candle, not dust.
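The ratchet plus the cheap first pass can be sketched as follows — `count_passing` stands in for the real internal scan, and the function shape is an assumption:

```python
_BT_RATCHET_RATIOS = [0.70, 0.55, 0.45, 0.35, 0.25, 0.20]
_BT_MIN_BODY_FLOOR = 0.20
_BT_MIN_VOL_FLOOR = 1.10

def ratchet_thresholds(sig_body, sig_vol, count_passing, target=50):
    """Relax analog-matching thresholds until `target` bars pass, applying
    hard floors. Returns (body_thr, vol_thr, ratio) or None if exhausted.
    The expensive 72-method loop runs once, outside this function, at the
    first level that hits target — never at every level."""
    for ratio in _BT_RATCHET_RATIOS:
        body_thr = max(sig_body * ratio, _BT_MIN_BODY_FLOOR)
        vol_thr = max(sig_vol * ratio, _BT_MIN_VOL_FLOOR)
        if count_passing(body_thr, vol_thr) >= target:  # cheap count-only pass
            return body_thr, vol_thr, ratio
    return None
```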
UI badge
The result flows through to a user-facing badge:
- STRICT 70% — analogs tightly matched; trust the ML probability numerically.
- RELAXED 45% — broad analog set.
- LOOSE 20% — very broad; treat the ML probability as directional only.
Every signal tested against 72 trade strategies
The cartesian product of entry × SL × mgmt × TP:
ENTRY_ZONES = ["Aggressive", "Standard", "Sniper"] # 3
SL_METHODS = ["Fixed SL", "ATR SL"] # 2
MGMT_MODES = ["Simple", "Partial", "Partial-NoBE", "Trailing"] # 4
TP_MULTS = [2.0, 2.5, 3.0] # 3
# 3 × 2 × 4 × 3 = 72 combinations
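The grid is a plain cartesian product; a quick check that it really is 72 (the `METHODS` list comprehension is illustrative — only the four constant lists come from the code above):

```python
from itertools import product

ENTRY_ZONES = ["Aggressive", "Standard", "Sniper"]
SL_METHODS = ["Fixed SL", "ATR SL"]
MGMT_MODES = ["Simple", "Partial", "Partial-NoBE", "Trailing"]
TP_MULTS = [2.0, 2.5, 3.0]

# One dict per method combo, in deterministic order.
METHODS = [
    {"zone": z, "sl": s, "mgmt": m, "tp": t}
    for z, s, m, t in product(ENTRY_ZONES, SL_METHODS, MGMT_MODES, TP_MULTS)
]
assert len(METHODS) == 72  # 3 × 2 × 4 × 3
```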
Entry zones

Zones that are invalid for a given signal are flagged in `_invalid_zones` and excluded from candidate selection. This is mechanically correct — do not "fix" it by widening the SL; widening the SL changes the strategy's semantics.

Management modes
| Mode | Behavior |
|---|---|
| Simple | Full size, hold to TP2 or original SL. No BE, no partials. |
| Partial | TP 50% at 1R, auto-move SL to BE on remaining half. |
| Partial-NoBE | TP 50% at 1R, keep original SL (real downside remains). |
| Trailing | Full size, BE at 1R, then trail 0.5×ATR from close. |
Why Partial vs Partial-NoBE distinction matters LABEL BUG
Partial trades that hit TP1 then reverse to BE produce r_mult ≈ +0.498R. Mathematically this is positive PnL but the outcome is "basically flat." Labeling these as WIN caused:
- ML seeing only wins on trending coins → single-class collapse → can't train
- Backtest looking invincible: PF=∞, WR=100% on REZ-like trending alts
Fix: _classify_outcome(r_mult) → WIN / LOSS / NEUTRAL. |r_mult| ≤ 0.30R = NEUTRAL, still counted in PF/WR but excluded from ML labels.
Partial-NoBE was added on Apr 17 because the user believed they were trading that style. Partial was moving SL to BE automatically — a different strategy entirely. Both are now available for honest comparison.
All 72 methods (visual)
72 combinations · each with n, WR, EV, EVw, PF, fill_rate, 4-bucket decay breakdown
The dual-candidate decision system
Rather than picking a single "best" method from the 72, the system surfaces two:
bt["candidate_newest"]
bt["candidate_weighted"]
Time-decay bucket scheme (adaptive)
| Sample count | Buckets | Weights (oldest → newest) |
|---|---|---|
| n ≥ 400 | 4 | 0.40, 0.60, 0.80, 1.00 |
| n ≥ 200 | 3 | 0.50, 0.75, 1.00 |
| n ≥ 80 | 2 | 0.60, 1.00 |
| n < 80 | 1 | 1.00 |
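A sketch of the adaptive bucketing: trades are split chronologically and each trade inherits its bucket's weight. The helper name and the remainder-handling (extra trades go to the earliest buckets) are assumptions, not the code's exact split:

```python
def time_decay_weights(n: int) -> list[float]:
    """Per-trade weights, oldest → newest, using the adaptive scheme."""
    if n >= 400:
        bucket_w = [0.40, 0.60, 0.80, 1.00]
    elif n >= 200:
        bucket_w = [0.50, 0.75, 1.00]
    elif n >= 80:
        bucket_w = [0.60, 1.00]
    else:
        bucket_w = [1.00]
    # Chronological split into near-equal buckets; remainder trades are
    # assigned to the earliest buckets (an assumption for illustration).
    base, extra = divmod(n, len(bucket_w))
    weights: list[float] = []
    for i, w in enumerate(bucket_w):
        weights.extend([w] * (base + (1 if i < extra else 0)))
    return weights
```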
Selection uses `ev_weighted`, not raw `ev`. A previous bug selected by raw EV, which could pick a method that crushed it in 2021 but has since gone dormant. Using EVw ensures newer trades pull harder on the selection. This is #2 on the do-not-break list.

5 rolling windows with de Prado-style purge + embargo
A single in-sample/out-of-sample cut is one point estimate. Five rolling cuts give a distribution. "5/5 windows OOS PF ≥ 1.0" is real edge; "1 good cut" could be luck.
Rolling cuts at 50/60/70/80/90%
Purge + embargo (de Prado Ch.7)
Every trade stores bar_index (entry bar) and label_end_bar (= j, resolution bar). PurgedTimeSeriesSplit drops training samples whose label period overlaps the test fold. Embargo = ceil(0.01 × n_total) after each fold.
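A minimal sketch of the purge-plus-embargo rule for a single test fold (simplified from de Prado; the real `PurgedTimeSeriesSplit` handles multiple folds):

```python
import math

def purged_train_indices(bar_index, label_end_bar, test_start, test_end,
                         n_total, embargo_pct=0.01):
    """Keep only training samples whose label period cannot overlap the
    test fold, plus drop an embargo strip just after the fold."""
    embargo = math.ceil(embargo_pct * n_total)
    keep = []
    for i, (start, end) in enumerate(zip(bar_index, label_end_bar)):
        overlaps_test = not (end < test_start or start > test_end)  # purge
        in_embargo = test_end < start <= test_end + embargo
        if not overlaps_test and not in_embargo:
            keep.append(i)
    return keep
```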
TimeSeriesSplit doesn't know labels span MAX_HOLD=20 bars. A training sample at the fold boundary has its label resolved inside the test fold → leak. On daily bars that's 20 days of leakage. Always PurgedTimeSeriesSplit.

WFO verdict thresholds
| n_oos | Verdict |
|---|---|
| ≥ 8 | PASS |
| 5–7 | BORDERLINE |
| < 5 | INSUFFICIENT |
WFO return dict shape SCHEMA
{
"ok": bool,
"verdict": "PASS" | "BORDERLINE" | "FAIL" | "INSUFFICIENT",
"is_pf", "oos_pf", "oos_wr",
"is_pf_clean", "oos_pf_clean", # honest PF excl. NEUTRAL
"oos_n": int,
"purge_diag": {n_purged, n_embargoed, embargo_bars},
"label_diag": {n_neutral, raw_pf, honest_pf, pf_inflation_pct},
"oos_pf_ci": {"lo": float, "hi": float}, # 1000-resample block bootstrap
"rolling_wfo": {"edge_hit_rate", "windows": [...]},
"regime_breakdown": {"STRONG": ..., "MID": ..., "WEAK": ...},
"tier_label": "PURGED IS/OOS split (70%/30%, embargo 1%)",
}
Bootstrap 95% CI on OOS PF UNCERTAINTY
1000-resample block bootstrap on OOS r_mult list. Reports {lo, hi}. Context: n=8 trades with PF=1.3 has CI roughly [0.7, 2.8] — wide, honestly acknowledged. This beats reporting a point estimate as if it were gospel.
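A hedged sketch of the block bootstrap on the OOS r_mult list — the block size, seed handling, and PF edge cases are assumptions, not the app's exact parameters:

```python
import math
import random

def profit_factor(rs):
    gains = sum(r for r in rs if r > 0)
    losses = -sum(r for r in rs if r < 0)
    return gains / losses if losses > 0 else math.inf

def bootstrap_pf_ci(r_mults, n_resamples=1000, block=5, seed=42):
    """95% CI on OOS profit factor via block bootstrap: resampling
    contiguous blocks preserves short-range autocorrelation that plain
    i.i.d. resampling would destroy."""
    rng = random.Random(seed)
    n = len(r_mults)
    pfs = []
    for _ in range(n_resamples):
        sample = []
        while len(sample) < n:
            start = rng.randrange(n)
            sample.extend(r_mults[start:start + block])
        pfs.append(profit_factor(sample[:n]))
    pfs.sort()
    return {"lo": pfs[int(0.025 * n_resamples)],
            "hi": pfs[int(0.975 * n_resamples)]}
```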
Regime-conditional OOS breakdown DIAGNOSTIC
Splits OOS trades by ATR-ratio proxy into STRONG/MID/WEAK regimes. Reports PF/WR per regime. An aggregate PF=1.4 can hide PF=2.8 in STRONG and PF=0.9 in WEAK — the breakdown makes this visible and actionable.
Fill-rate / survivor-bias diagnostic HONEST
Each method combo tracks:
- `n_qualifying` — signals passing the filter
- `n_filled` — trades that actually entered the zone
- `n_expired` — never retraced to fill
- `fill_rate = n_filled / n_qualifying`
Standard/Sniper zones on trending coins mostly don't fill (price never retraces). The backtest only sees the rare pullback-continuation subset → survivor bias inflates WR. fill_rate makes this visible. AI prompt warns when fill_rate < 40% on ≥20 qualifying signals.
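The diagnostic reduces to one ratio plus a warning trigger (thresholds from the text above; the helper name is illustrative):

```python
def fill_rate_warning(n_qualifying: int, n_filled: int):
    """fill_rate plus the AI-prompt survivor-bias warning condition:
    warn only when the sample is big enough to mean something."""
    fill_rate = n_filled / n_qualifying if n_qualifying else 0.0
    warn = n_qualifying >= 20 and fill_rate < 0.40
    return fill_rate, warn

fill_rate_warning(25, 6)  # → (0.24, True): backtest only saw the 24% that filled
```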
Classifier picked by sample count
Different sample sizes need different models. Too few samples and a heavy model overfits; too many and a simple model underutilizes the data.
| n samples | Model | Why |
|---|---|---|
| n < 20 | Heuristic fallback | No training. Deterministic rule-based probability. |
| n < 50 | Logistic Regression | Least likely to overfit on small n. Pipeline with StandardScaler, class_weight=balanced. |
| 50–149 | Random Forest | n_estimators=150, max_depth=5, min_samples_leaf=5. Robust to feature scale. |
| ≥ 150 | Gradient Boosting | n_estimators=150, max_depth=3, lr=0.05, subsample=0.8. Best generalization at scale. |
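A sketch of the size-based picker using standard sklearn estimators, with hyperparameters copied from the table (the function name is illustrative, and the calibration wrapper described below is omitted here):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def pick_model(n_samples: int):
    """Heavier models only when the sample count can support them;
    None signals the deterministic heuristic fallback (no training)."""
    if n_samples < 20:
        return None
    if n_samples < 50:
        return make_pipeline(StandardScaler(),
                             LogisticRegression(class_weight="balanced"))
    if n_samples < 150:
        return RandomForestClassifier(n_estimators=150, max_depth=5,
                                      min_samples_leaf=5)
    return GradientBoostingClassifier(n_estimators=150, max_depth=3,
                                      learning_rate=0.05, subsample=0.8)
```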
Calibration wrapper
if n_samples >= 60:
model = CalibratedClassifierCV(model, method="isotonic", cv=3)
CV splitter
cv = PurgedTimeSeriesSplit(
n_splits=min(5, n // 15),
embargo_pct=0.01,
)
# Never sklearn's default TimeSeriesSplit — would leak labels at fold boundaries.
Training labeling: by method outcome LABELS
_scanner_train_ml labels historical candles by the chosen method's WIN/LOSS/NEUTRAL outcome — not by a fixed Aggressive/Simple baseline.
Reason: the ML should learn what works for the specific method you'll actually trade, not a generic proxy method. If Cand A is "Sniper · ATR · Partial-NoBE · 2.5", the ML learns to predict WIN for that exact configuration on historical analogs.
When Cand A ≠ Cand B, the ML trains twice — once per candidate.
The |r_mult| ≤ 0.30R band excluded from ML
def _classify_outcome(r_mult):
if r_mult > NEUTRAL_R_THRESHOLD: return "WIN"
if r_mult < -NEUTRAL_R_THRESHOLD: return "LOSS"
return "NEUTRAL"
- NEUTRAL trades still counted in PF/WR accounting (the money is real).
- NEUTRAL trades excluded from ML labels (they'd cause single-class collapse on trending coins).
- WFO reports `honest_pf` (excluding NEUTRAL) alongside raw PF.
Every training sample gets a two-factor weight
sample_weight = time_decay_bucket_weight × regime_similarity_weight
Pipeline sample_weight error & fix BUGFIX
CalibratedClassifierCV wrapping a Pipeline raises various exception types (not just TypeError) when sample_weight is passed. Old code only caught TypeError and jumped to heuristic fallback.
Fix: broadened to except Exception on weighted fit. Falls back to unweighted fit before giving up and going to heuristic. Applied to both main fit and CV loop.
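The broadened fallback, sketched (a simplification of the fix; per the text, the real code applies the same pattern in both the main fit and the CV loop):

```python
def fit_with_weights(model, X, y, sample_weight):
    """Try a weighted fit first. Calibrated pipelines can raise assorted
    exception types (not just TypeError) when sample_weight isn't routed
    through, so catch broadly and fall back to an unweighted fit before
    ever giving up to the heuristic."""
    try:
        model.fit(X, y, sample_weight=sample_weight)
        return model, "weighted"
    except Exception:
        model.fit(X, y)
        return model, "unweighted"
```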
Groq reasoning with canonical prices
_scanner_ai_verdict(sig, ml_a, ml_b, bt, wfo, cand_a, cand_b)
# → {candidate_a, candidate_b, winner, winner_rationale}
Model selection
| Model | Use |
|---|---|
| openai/gpt-oss-120b | DEFAULT — strongest free reasoning on Groq |
| openai/gpt-oss-20b | Faster, slightly weaker |
| qwen/qwen3-32b | Alt reasoning |
| llama-3.3-70b-versatile | Fallback |
| meta-llama/llama-4-scout-17b-16e-instruct | Fast fallback |
reasoning_effort="medium" only for gpt-oss/qwen. max_tokens=2500. Timeout 60s.
Price hallucination prevention
_compute_candidate_prices(cand, sig) is the single source of truth for entry/SL/TP1/TP2 prices. Both the UI cards and the AI prompt read from it. The AI prompt contains an explicit EXECUTION PRICES block with a "copy verbatim" instruction.
Dual verdict rules
- When A == B (unanimous): single analysis mirrored to both sides.
- Both TRADE → AI picks the stronger as `winner`.
- Only one TRADE → that one wins.
- Neither → `winner = NONE`.
Dual-candidate-aware A+ / A / B / C
_scanner_setup_grade(sig, ml, bt) grades by the best of the two candidates. A previous bug read from the legacy aggregate bt["win_2r"], so "Cand A excellent + aggregate bad" returned "C — Backtest negative" (a false negative).
Rescue rules
- "B rescue" — if any candidate is tradeable AND
ml_pct ≥ 60, grade won't drop below B. - Grade color follows the badge palette: A+ A B C.
On-chain Nansen-lite, free tier
get_pulse_intel(symbol,
etherscan_api_key, lunarcrush_api_key, solscan_api_key)
# → composite_score (-15 to +15)
# → composite_label: STRONGLY BULLISH / BULLISH / NEUTRAL / BEARISH / STRONGLY BEARISH
Composite weights
Per-token composite scaled ×1.2 → ±12. Macro modifier (±3) added on top → final ±15.
| Verdict | Score band |
|---|---|
| STRONGLY BULLISH | ≥ +10 |
| BULLISH | +4 to +9 |
| NEUTRAL | −3 to +3 |
| BEARISH | −4 to −9 |
| STRONGLY BEARISH | ≤ −10 |
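The scaling and banding, sketched (the function name is illustrative; band edges come from the table, with continuous-score boundary handling an assumption since the table uses integer bands):

```python
def composite_verdict(per_token_score: float, macro_modifier: float):
    """Scale per-token composite to ±12, add the macro modifier (±3),
    then band the final ±15 score into a verdict label."""
    score = per_token_score * 1.2 + macro_modifier
    if score >= 10:
        label = "STRONGLY BULLISH"
    elif score >= 4:
        label = "BULLISH"
    elif score > -4:
        label = "NEUTRAL"
    elif score > -10:
        label = "BEARISH"
    else:
        label = "STRONGLY BEARISH"
    return score, label
```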
Cache TTLs PERF
- TVL: `3600s` (updates slowly)
- Flow: `900s`
- Social: `1800s`
- Macro: `14400s`
Things that must stay exactly as they are
- Never use sklearn `TimeSeriesSplit` — always `PurgedTimeSeriesSplit`. Label-period overlap leaks.
- `best_key` uses `ev_weighted`, not raw `ev`.
- Don't remove `label_end_bar` from `trades_raw`. Purge depends on it.
- Don't remove the ratchet. High-vol signals get 0–6 training samples without it.
- Don't remove NEUTRAL classification. Prevents single-class ML collapse on trending coins.
- Zone validity is correct. Do NOT widen SL to "fix" it.
- WFO is NOT regime-filtered by design. Keep it that way.
- `_compute_candidate_prices` is the single source of truth for AI prompt + UI.
- Dead code in `render_auto_analyzer` (`manual_sig=None`, `_manual_render_signals` block ~L4394-4430). Harmless. Do NOT activate without a dedicated session.
- `def main():` must exist. A prior str_replace edit accidentally removed it, causing a NameError on deploy. Always verify before shipping.
Intentionally NOT built
Wide-SL toggle for big-body candles
Would change strategy semantics. Revisit after 30+ days of journal data if "zone unavailable" costs meaningful edge.
Cross-coin feature pooling (master model)
Ratchet fix solved most sample starvation. Illiquid alts probably shouldn't be traded if no historical analog exists. Revisit after journal data.
Regime-conditional ML (separate models per regime)
Soft regime weighting in sample_weight already addresses this with less complexity. Needs ≥30 samples per regime to be reliable — most coins won't hit that.
ICT / S&R / trendline / volume-profile modules
Do not build additional strategies until momentum candle has 30+ days of live journal data proving edge. Building untested systems on top of an unvalidated primary is premature optimization.
Current state
- Lookahead audit 20/20 CLEAN
- Purge + embargo (PurgedTimeSeriesSplit)
- Soft regime filtering
- Rolling WFO + bootstrap CI + regime breakdown
- Fill-rate survivor-bias diagnostic
- 4 MGMT modes (incl. Partial-NoBE)
- NEUTRAL label (option A)
- Manual tab enrichment to match Scanner
- Dead-code cleanup in `render_auto_analyzer`
- Zone-summary table in Manual
- Confluence Grade breakdown in Manual
- Meta-labeling (2nd classifier)
- Pulse as ML feature
- CPCV + PBO for QUANTFLOW
- IDX BSJP port (yfinance .JK)