FORMULA-X Reading Guide
How to interpret every figure, table, and optimization summary.

Every output in FORMULA-X answers exactly one question. This page walks you through what each one says, what "good" looks like, and what the common red flags mean. Sections are grouped by where they appear in the UI.
Three core rules
- Every output answers one question. When you look at a figure or summary, ask "what would change if I tweaked the experiment?" If you can't answer that, you're not reading it right.
- Stage-1 outputs describe one trajectory; Stage-2 outputs summarise it; optimization outputs reason about the campaign as a whole. Mixing those scales is the most common interpretation mistake.
- Numbers without uncertainty are guesses. When you see a single point estimate (e.g. "best run = 0.93"), look for its companion (σ, replicate count, or n_eff) before trusting it.
MD Studio - Setup & schema tables
Setup tab
| Field | What it controls | What "good" looks like |
|---|---|---|
| system_class | Which Stage-2 scorer fires | Matches your actual MD experiment, not a default |
| equilibration_method | How Stage-1 picks the plateau cutoff t* | block_var when "matches my eye" matters; chodera when reviewers will ask |
| kinetic_fit | Only used when system_class = self_assembly | none for plateau systems; hill / avrami / exp_sat for assembly |
| composite_policy | How per-criterion scores combine (see the sketch after this table) | multiplicative is strict (any 0 → 0); weighted_mean is forgiving |
| aggregation_policy | How seed replicates collapse | trimmed_mean default; median for seed-outlier robustness |
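The practical difference between the two composite policies is easiest to see on a toy example. The sketch below is illustrative only; the `combine` helper is a stand-in, not the FORMULA-X API.

```python
import numpy as np

def combine(scores, policy="multiplicative", weights=None):
    """Toy stand-in for composite_policy (not the FORMULA-X API)."""
    scores = np.asarray(scores, dtype=float)
    if policy == "multiplicative":
        # Strict: any criterion scored 0 drives the whole composite to 0.
        return float(np.prod(scores))
    if policy == "weighted_mean":
        w = np.ones_like(scores) if weights is None else np.asarray(weights, dtype=float)
        # Forgiving: a single hard failure is diluted by the passing criteria.
        return float(np.average(scores, weights=w))
    raise ValueError(f"unknown policy: {policy!r}")

scores = [0.9, 0.8, 0.0]                      # two passes, one hard failure
print(combine(scores, "multiplicative"))      # 0.0
print(combine(scores, "weighted_mean"))       # ~0.57
```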
Response Reduction column - the most overlooked decision
The response's Reduction field controls how a time series collapses to one scalar. Same trajectory, different reduction → different scalar → different surrogate; the sketch after the table below makes this concrete.
| Reduction | When to use |
|---|---|
| post_eq_mean | Default. Average over the post-plateau window. |
| plateau_value | When the very tail matters (e.g. final binding pose RMSD). |
| last_quartile_mean | Conservative: only the steady-state tail. |
| fluctuation_std | When you want to minimise variance itself - stability as a target. |
| autocorr_time | When fast convergence is the goal. |
| slope | When the response should be flat - a non-zero slope means drift. |
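To see how much the choice matters, here is a minimal sketch applying several of the reductions above to one synthetic trajectory. The implementations are rough approximations for illustration, not the FORMULA-X internals, and the trajectory and t_eq = 50 ns are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 200.0, 2001)                                   # time in ns
y = 1.0 + 3.0 * np.exp(-t / 30.0) + rng.normal(0.0, 0.05, t.size)   # relaxes to a noisy plateau

t_eq = 50.0                                    # pretend Stage-1 put the plateau cutoff here
post_eq = y[t >= t_eq]

reductions = {
    "post_eq_mean":       post_eq.mean(),
    "plateau_value":      post_eq[-20:].mean(),                # only the very tail
    "last_quartile_mean": y[t >= 150.0].mean(),                # last 25 % of the run
    "fluctuation_std":    post_eq.std(),
    "slope":              np.polyfit(t[t >= t_eq], post_eq, 1)[0],  # drift per ns
}
for name, value in reductions.items():
    print(f"{name:>20s}  {value: .4f}")
```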
MD Studio - Runs & Stability table
A typical row:

```
#0 T=295 P=1 salt=0.10 fid=long_200ns status=completed composite=1.000 t_eq=50ns
```
| Column | Meaning |
|---|---|
| composite | In [0, 1]. Read it as "how many criteria passed, weighted by closeness to their thresholds", NOT as a probability. With the multiplicative policy: < 0.5 means at least one criterion is in trouble; < 0.1 means at least one has essentially failed. |
| t_eq (ns) | Where Stage-1 declared the plateau started. Compare to sim_length / 4: if t_eq > sim_length / 2, the post-eq window is shorter than the equilibration phase - treat the scalar with skepticism (see the sketch after this table). |
| status=crashed + composite=---- | Informative infeasibility. NOT lost data - the feasibility classifier learns from these rows. |
| Narrative | "Composite 1.00; failed: rmsd_lt_3A". Always tells you which criterion failed. |
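The t_eq rule of thumb fits in a one-liner. The helper below is hypothetical; the sim_length / 2 cutoff comes from the table above, and treating anything past sim_length / 4 as borderline is an assumption for illustration.

```python
def t_eq_flag(t_eq_ns: float, sim_length_ns: float) -> str:
    """Hypothetical helper encoding the rule of thumb above."""
    if t_eq_ns > sim_length_ns / 2:
        return "suspect: post-eq window is shorter than the equilibration phase"
    if t_eq_ns > sim_length_ns / 4:
        return "borderline: late plateau - check plateau_quality before trusting the scalar"
    return "ok"

# Run #0 above: t_eq = 50 ns on a 200 ns long_200ns trajectory.
print(t_eq_flag(50.0, 200.0))   # "ok"
```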
MD Optimization summaries
Each optimization kind answers a distinct question. Misreading one for another is the most common pitfall.
stability_max - "Which run was best?"
{ "top": [ {"run_id": ..., "factor_values": {...}, "composite_stability": 0.93}, ... ] }
Useful for sanity-checking that the best run is where you expected. Not useful for prediction - it only ranks what you already ran.
feasibility_map - "Where will my next run not crash?"
{ "feasibility_rate": 0.80,
"model_kind": "logistic_regression",
"top_safe_candidates": [ {"factor_values": {...}, "predicted_pr_success": 0.99}, ... ] }
| Field | How to read it |
|---|---|
| feasibility_rate | Empirical fraction of runs that completed |
| model_kind = constant_rate | Red flag: all runs succeeded or all failed - no real classifier was trained, so the "top candidates" are arbitrary |
| model_kind = logistic_regression | A real classifier learned a boundary - predicted_pr_success per candidate is a real probability (see the sketch after this table) |
| top_safe_candidates | Where to focus the next batch if you've been crashing |
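Conceptually, the feasibility map is a completed-vs-crashed classifier over your factor space. A minimal sketch with scikit-learn is below; the factor columns, values, and outcomes are invented for illustration and are not tied to any real campaign.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: two factors (say, temperature and salt) and whether each run completed.
X = np.array([[295, 0.10], [300, 0.15], [310, 0.50], [320, 0.80],
              [295, 0.20], [330, 0.90], [305, 0.30], [325, 0.70]], dtype=float)
completed = np.array([1, 1, 1, 0, 1, 0, 1, 0])   # 1 = completed, 0 = crashed

# If `completed` were all ones (or all zeros) there would be no boundary to learn --
# that is the constant_rate red flag above.
clf = LogisticRegression(max_iter=1000).fit(X, completed)

candidates = np.array([[300, 0.12], [325, 0.85]], dtype=float)
print(clf.predict_proba(candidates)[:, 1])       # per-candidate predicted_pr_success
```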
design_space - "What fraction of my parameter space is acceptable?"
{ "threshold": 0.6,
"empirical_fraction_meeting": 0.40,
"surrogate_pds_fraction": 0.44,
"top_design_space_points": [ {"factor_values": ..., "pr_meeting_threshold": 0.99}, ... ] }
| Field | How to read it |
|---|---|
| empirical_fraction_meeting | Fraction of actual runs whose composite ≥ threshold |
| surrogate_pds_fraction | GP-predicted fraction of the whole design space that would pass |
| Big gap between them | You sampled an unrepresentative slice |
| Small gap | Your design covers the space well |
| pr_meeting_threshold | Per-candidate probability, 1 - Φ((threshold - μ) / σ). Use it to pick the next-safest design points (see the sketch after this table). |
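The formula in the last row can be checked directly. The μ, σ, and threshold values below are made up; only the 1 - Φ((threshold - μ) / σ) form comes from the summary above.

```python
from scipy.stats import norm

def pr_meeting_threshold(mu: float, sigma: float, threshold: float) -> float:
    """P(composite >= threshold) under a Gaussian GP posterior."""
    return 1.0 - norm.cdf((threshold - mu) / sigma)

# Example: the GP predicts mu = 0.85 +/- 0.10 at a candidate, spec threshold = 0.6.
print(round(pr_meeting_threshold(0.85, 0.10, 0.6), 3))   # ~0.994
```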
multi_response - "What's the trade-off between my responses?"
{ "response_names": ["rmsd_mean", "rg_mean"],
"directions": {"rmsd_mean": "min", "rg_mean": "min"},
"pareto_indices": [3, 7],
"pareto_front": [ ... ] }
Pareto size = 1: one run dominates across all responses. Pareto size > 1: real trade-offs - to go from one Pareto point to another you must sacrifice at least one response.
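If you want to double-check a reported front, non-domination is cheap to verify by hand. The brute-force filter below is a generic sketch, not the FORMULA-X implementation, and assumes every response is minimised.

```python
import numpy as np

def pareto_mask(points: np.ndarray) -> np.ndarray:
    """True for non-dominated rows; every column is assumed to be minimised."""
    n = len(points)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(points[j] <= points[i]) and np.any(points[j] < points[i]):
                keep[i] = False          # row i is dominated by row j
                break
    return keep

# Four runs scored on (rmsd_mean, rg_mean), both "min".
scores = np.array([[2.0, 1.5], [1.2, 2.8], [2.5, 2.6], [1.8, 1.9]])
print(np.where(pareto_mask(scores))[0])  # [0 1 3] -- run 2 is dominated by run 0
```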
cost_aware_bo - "Which MD run should I launch next?"
{ "y_best": 0.93, "surrogate": "multi_fidelity", "n_training_points": 10,
"suggestions": [
{ "factor_values": {...},
"predicted_mu": 0.97, "predicted_sigma": 0.08,
"expected_improvement": 0.21, "feasibility": 0.95,
"expected_cost_gpu_hours": 2.5, "acquisition_score": 0.084 }, ...
]}
| Field | How to read it |
|---|---|
| y_best | Current best observed value of the target |
| surrogate = multi_fidelity | Two-level GP fit (delta correction). Needs ≥ 2 fidelity levels in your data |
| predicted_mu | GP mean at this candidate |
| predicted_sigma | GP uncertainty. Large σ with a decent μ = explore here. |
| expected_improvement | MC-EI against y_best. Higher is better. |
| feasibility | Pr(run completes ∧ equilibrates). Below 0.5 = risky run. |
| expected_cost_gpu_hours | From your campaign's cost_model |
| acquisition_score | EI × feasibility / cost. This is what is being maximised, not raw EI (see the sketch below). |
Read suggestions in order. If the first three sit in the same corner, BO is in exploit mode. If they're scattered, BO is in explore mode. Both are valid; the order tells you which.
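The ranking itself follows directly from that formula. The sketch below re-derives acquisition scores from the per-suggestion fields; the numbers are invented and FORMULA-X's internal cost normalisation may differ, so treat it as the shape of the calculation, not an exact reproduction.

```python
# Invented suggestions with the same field names as the JSON above.
suggestions = [
    {"expected_improvement": 0.20, "feasibility": 0.90, "expected_cost_gpu_hours": 2.0},
    {"expected_improvement": 0.30, "feasibility": 0.40, "expected_cost_gpu_hours": 1.0},
]
for s in suggestions:
    # acquisition_score = EI x feasibility / cost (per the table above)
    s["acquisition_score"] = (
        s["expected_improvement"] * s["feasibility"] / s["expected_cost_gpu_hours"]
    )

ranked = sorted(suggestions, key=lambda s: s["acquisition_score"], reverse=True)
print([round(s["acquisition_score"], 3) for s in ranked])
# [0.12, 0.09] -- the cheap but risky run (feasibility 0.40 < 0.5) still wins on raw score
```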
Diagnostic figures (per response, modeling tab)
| Figure | Question | Red flag |
|---|---|---|
| rsm_coef (bars + significance asterisks) | Which factors and interactions matter? | All bars short → no significant effects (need more data or a wider design) |
| predicted_vs_observed (scatter + 1:1 line) | Does the model explain the data? | R² < 0.5, or points systematically off the 1:1 line (see the sketch after this table) |
| residuals_vs_predicted | Is the noise model right? | Funnel shape = heteroscedastic; trend = missing factor |
| residual_qq (Q-Q plot) | Are residuals normal? | S-shape = heavy tails → ANOVA p-values are unreliable |
| residual_histogram | Quick distribution check | Skewed → transformation needed |
| residuals_vs_order | Time-drift in measurements? | Trend across run order = experimenter drift |
| permutation_importance | Model-agnostic factor ranking | Tall bars = real factor; near zero = noise |
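If you want the same diagnostics outside the UI, the panels are straightforward to reproduce from any model's predictions. The sketch below uses synthetic `observed`/`predicted` arrays as placeholders and covers four of the panels; the Q-Q view can be added with `scipy.stats.probplot`.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
observed = rng.normal(0.7, 0.10, 40)                 # placeholder for measured responses
predicted = observed + rng.normal(0.0, 0.03, 40)     # placeholder for model predictions
residuals = observed - predicted

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].scatter(predicted, observed)
axes[0, 0].axline((0.0, 0.0), slope=1.0)             # the 1:1 line
axes[0, 0].set_title("predicted_vs_observed")
axes[0, 1].scatter(predicted, residuals)
axes[0, 1].axhline(0.0)
axes[0, 1].set_title("residuals_vs_predicted")
axes[1, 0].hist(residuals, bins=15)
axes[1, 0].set_title("residual_histogram")
axes[1, 1].plot(residuals, marker="o")               # x-axis is run order
axes[1, 1].set_title("residuals_vs_order")
plt.tight_layout()
plt.show()
```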
Effect & partial-dependence figures
| Figure | What it shows | Reading rule |
|---|---|---|
| effect (1-D sweep) | Marginal effect of one factor, others held at their midpoints | Slope direction = "more or less is better" |
| partial_dependence | True PDP - marginal effect averaged over the training rows | Differs from effect when factors interact heavily (see the sketch after this table) |
| interaction (lines at low/med/high of 2nd factor) | Do the lines cross? | Crossing = interaction - the effect of A depends on B |
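The "differs from effect when factors interact heavily" row is easy to demonstrate with a generic regressor. The sketch below is not FORMULA-X code; it uses scikit-learn on synthetic data where the response is a pure two-factor interaction, so the PDP of either factor alone comes out nearly flat even though the factor clearly matters.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(300, 2))               # two factors, e.g. T and salt
y = X[:, 0] * X[:, 1] + 0.05 * rng.normal(size=300)     # pure interaction, no main effects

model = GradientBoostingRegressor(random_state=0).fit(X, y)
pdp = partial_dependence(model, X, features=[0], grid_resolution=10)
print(np.round(pdp["average"][0], 2))   # hovers near 0 across the grid -- the interaction hides
```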
2-D maps & 3-D surfaces (per factor pair, per response)
2-D maps
| Figure | Reading rule |
|---|---|
| contour_map (filled + iso-lines) | Standard Design-Expert view. Colour-gradient direction = factor with the most leverage. |
| contour_lines (B/W) | Same content; use for print figures. Iso-line spacing = local sensitivity. |
| gp_uncertainty (heatmap of GP std) | Where you don't know enough. Bright = high uncertainty = candidate for the next experiment. |
3-D surfaces
| Figure | When to use |
|---|---|
| surface (smooth) | Hero figure for slides. Beautiful, but hides residuals. |
| surface_wireframe | Honest view of the GP grid; no hidden interpolation. |
| surface_contour3d | 2-D contour lifted to 3-D; often the most readable. |
| surface_contourf3d | Filled bands in 3-D; best for asymmetric shapes. |
| surface_with_data | The gold standard. Surface + training scatter coloured by residual. |
| scatter3d | Training data only. Use it to check coverage before reading any surface. |
Optimization-run figures
| Figure | Story |
|---|---|
| pds_heatmap (Pr(all specs met), red→green) | The design space. Green = run safely. The boundary is your real operating window for ICH Q8 / QbD. |
| pareto_scatter (first 2 responses) | Each dot = a non-dominated solution. Curvature = trade-off severity. |
| pareto_parcoords (parallel coordinates) | One line per Pareto point. Crossing lines between two response axes = trade-off. |
| pareto_scatter_matrix | Pairwise scatter + diagonal histograms. Use to spot which response pair is the binding trade-off. |
| bo_trace (EI bars per suggestion) | Falling EI = converging. Flat EI = stuck or design space exhausted. |
| ei_landscape (MC-EI heatmap with picked points) | Where BO wanted to go vs where it picked. Big bright regions ignored = your constraints are biting. |
| bo_trajectory (best D vs cumulative iteration) | Monotonic up = healthy BO. Plateau = converged. |
Worked example - cross-mode comparison from formula_x_md_e2e
The same pipeline, run under three different synthetic-physics regimes, produced these numbers. Reading this table shows the interpretation rules above in action.
| Metric | generous | discriminating | bimodal | Why the values differ |
|---|---|---|---|---|
| empirical ≥ 0.6 | 0.80 | 0.40 | 0.40 | Generous's bowl is shallow → most runs land in the basin. Steep modes have most runs outside. |
| surrogate_PDS | 1.00 | 0.44 | 0.41 | GP saturates in generous (no signal); learns the boundary in the steep modes. |
| Pareto front size | 1 | 1 | 2 | Bimodal's two basins surface as two non-dominated points. |
| BO best μ | 1.03 | 0.54 | 1.03 | Discriminating's tighter scorer thresholds cap the maximum composite seen. |
| BO best EI / GPU-hour | 0.019 | 0.084 | 0.054 | Discriminating offers BO the most acquisition signal - y_best = 0.54 leaves headroom; y_best = 1.0 leaves none. |
The same code produced all three columns. The differences are real signal from the underlying physics, not artefacts. That's the proof the surrogate / BO / Pareto stack reacts to data, not to defaults.
One-page cheat sheet
| If you want to know... | Look at |
|---|---|
| "Did this single run equilibrate?" | t_eq + plateau_quality on the run's stability row |
| "Was this single run stable?" | composite_stability + narrative |
| "Where's the boundary of my safe operating region?" | design_space → pds_heatmap |
| "Will the next run crash?" | feasibility_map → top_safe_candidates |
| "Is there a trade-off between two responses?" | multi_response → Pareto size + pareto_scatter |
| "What should my next MD run be?" | cost_aware_bo → suggestions[0] |
| "Is my model trustworthy?" | predicted_vs_observed + the four residual diagnostics |
| "Which factors actually matter?" | rsm_coef + permutation_importance |
| "Where am I most uncertain?" | gp_uncertainty heatmap |
| "Did my BO loop converge?" | bo_trajectory + bo_trace |