FORMULA-X Reading Guide
How to interpret every figure, table, and optimization summary.

Every output in FORMULA-X answers exactly one question. This page walks you through what each one says, what "good" looks like, and what the common red flags mean. Sections are grouped by where they appear in the UI.
Three core rules
- Every output answers one question. When you look at a figure or summary, ask "what would change if I tweaked the experiment?" If you can't answer that, you're not reading it right.
- Stage-1 outputs describe one trajectory; Stage-2 outputs summarise it; optimization outputs reason about the campaign as a whole. Mixing those scales is the most common interpretation mistake.
- Numbers without uncertainty are guesses. When you see a single point estimate (e.g. "best run = 0.93"), look for its companion (σ, replicate count, or n_eff) before trusting it.
MD Studio - Setup & schema tables
Setup tab
| Field | What it controls | What "good" looks like |
|---|---|---|
| system_class | Which Stage-2 scorer fires | Matches your actual MD experiment, not a default |
| equilibration_method | How Stage-1 picks the plateau cutoff t* | block_var when "matches my eye" matters; chodera when reviewers will ask |
| kinetic_fit | Only used when system_class = self_assembly | none for plateau systems; hill / avrami / exp_sat for assembly |
| composite_policy | How per-criterion scores combine (see the sketch after this table) | multiplicative is strict (any 0 → 0); weighted_mean is forgiving |
| aggregation_policy | How seed replicates collapse | trimmed_mean default; median for seed-outlier robustness |
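The practical difference between the two composite policies is easiest to see on a toy example. The sketch below is illustrative only; the `combine` helper is a stand-in, not the FORMULA-X API.

```python
import numpy as np

def combine(scores, policy="multiplicative", weights=None):
    """Toy stand-in for composite_policy (not the FORMULA-X API)."""
    scores = np.asarray(scores, dtype=float)
    if policy == "multiplicative":
        # Strict: any criterion scored 0 drives the whole composite to 0.
        return float(np.prod(scores))
    if policy == "weighted_mean":
        w = np.ones_like(scores) if weights is None else np.asarray(weights, dtype=float)
        # Forgiving: a single hard failure is diluted by the passing criteria.
        return float(np.average(scores, weights=w))
    raise ValueError(f"unknown policy: {policy!r}")

scores = [0.9, 0.8, 0.0]                      # two passes, one hard failure
print(combine(scores, "multiplicative"))      # 0.0
print(combine(scores, "weighted_mean"))       # ~0.57
```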
Response Reduction column - the most overlooked decision
The response's Reduction field controls how a time series collapses to one scalar. Same trajectory, different reduction → different scalar → different surrogate; the sketch after the table below makes this concrete.
| Reduction | When to use |
|---|---|
| post_eq_mean | Default. Average over the post-plateau window. |
| plateau_value | When the very tail matters (e.g. final binding pose RMSD). |
| last_quartile_mean | Conservative: only the steady-state tail. |
| fluctuation_std | When you want to minimise variance itself - stability as a target. |
| autocorr_time | When fast convergence is the goal. |
| slope | When the response should be flat - a non-zero slope means drift. |
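To see how much the choice matters, here is a minimal sketch applying several of the reductions above to one synthetic trajectory. The implementations are rough approximations for illustration, not the FORMULA-X internals, and the trajectory and t_eq = 50 ns are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 200.0, 2001)                                   # time in ns
y = 1.0 + 3.0 * np.exp(-t / 30.0) + rng.normal(0.0, 0.05, t.size)   # relaxes to a noisy plateau

t_eq = 50.0                                    # pretend Stage-1 put the plateau cutoff here
post_eq = y[t >= t_eq]

reductions = {
    "post_eq_mean":       post_eq.mean(),
    "plateau_value":      post_eq[-20:].mean(),                # only the very tail
    "last_quartile_mean": y[t >= 150.0].mean(),                # last 25 % of the run
    "fluctuation_std":    post_eq.std(),
    "slope":              np.polyfit(t[t >= t_eq], post_eq, 1)[0],  # drift per ns
}
for name, value in reductions.items():
    print(f"{name:>20s}  {value: .4f}")
```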
MD Studio - Runs & Stability table
A typical row:

```
#0 T=295 P=1 salt=0.10 fid=long_200ns status=completed composite=1.000 t_eq=50ns
```
| Column | Meaning |
|---|---|
| composite | In [0, 1]. Read it as "how many criteria passed, weighted by closeness to their thresholds", NOT as a probability. With the multiplicative policy: < 0.5 means at least one criterion is in trouble; < 0.1 means at least one has essentially failed. |
| t_eq (ns) | Where Stage-1 declared the plateau started. Compare to sim_length / 4: if t_eq > sim_length / 2, the post-eq window is shorter than the equilibration phase - treat the scalar with skepticism (see the sketch after this table). |
| status=crashed + composite=---- | Informative infeasibility. NOT lost data - the feasibility classifier learns from these rows. |
| Narrative | "Composite 1.00; failed: rmsd_lt_3A". Always tells you which criterion failed. |
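The t_eq rule of thumb fits in a one-liner. The helper below is hypothetical; the sim_length / 2 cutoff comes from the table above, and treating anything past sim_length / 4 as borderline is an assumption for illustration.

```python
def t_eq_flag(t_eq_ns: float, sim_length_ns: float) -> str:
    """Hypothetical helper encoding the rule of thumb above."""
    if t_eq_ns > sim_length_ns / 2:
        return "suspect: post-eq window is shorter than the equilibration phase"
    if t_eq_ns > sim_length_ns / 4:
        return "borderline: late plateau - check plateau_quality before trusting the scalar"
    return "ok"

# Run #0 above: t_eq = 50 ns on a 200 ns long_200ns trajectory.
print(t_eq_flag(50.0, 200.0))   # "ok"
```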
MD Optimization summaries
Each optimization kind answers a distinct question. Misreading one for another is the most common pitfall.
stability_max - "Which run was best?"
{ "top": [ {"run_id": ..., "factor_values": {...}, "composite_stability": 0.93}, ... ] }
Useful for sanity-checking that the best run is where you expected. Not useful for prediction - it only ranks what you already ran.
feasibility_map - "Where will my next run not crash?"
{ "feasibility_rate": 0.80,
"model_kind": "logistic_regression",
"top_safe_candidates": [ {"factor_values": {...}, "predicted_pr_success": 0.99}, ... ] }
| Field | How to read it |
|---|---|
| feasibility_rate | Empirical fraction of runs that completed |
| model_kind = constant_rate | Red flag: all runs succeeded or all failed - no real classifier was trained, so the "top candidates" are arbitrary |
| model_kind = logistic_regression | A real classifier learned a boundary - predicted_pr_success per candidate is a real probability (see the sketch after this table) |
| top_safe_candidates | Where to focus the next batch if you've been crashing |
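Conceptually, the feasibility map is a completed-vs-crashed classifier over your factor space. A minimal sketch with scikit-learn is below; the factor columns, values, and outcomes are invented for illustration and are not tied to any real campaign.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: two factors (say, temperature and salt) and whether each run completed.
X = np.array([[295, 0.10], [300, 0.15], [310, 0.50], [320, 0.80],
              [295, 0.20], [330, 0.90], [305, 0.30], [325, 0.70]], dtype=float)
completed = np.array([1, 1, 1, 0, 1, 0, 1, 0])   # 1 = completed, 0 = crashed

# If `completed` were all ones (or all zeros) there would be no boundary to learn --
# that is the constant_rate red flag above.
clf = LogisticRegression(max_iter=1000).fit(X, completed)

candidates = np.array([[300, 0.12], [325, 0.85]], dtype=float)
print(clf.predict_proba(candidates)[:, 1])       # per-candidate predicted_pr_success
```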
design_space - "What fraction of my parameter space is acceptable?"
{ "threshold": 0.6,
"empirical_fraction_meeting": 0.40,
"surrogate_pds_fraction": 0.44,
"top_design_space_points": [ {"factor_values": ..., "pr_meeting_threshold": 0.99}, ... ] }
| Field | How to read it |
|---|---|
| empirical_fraction_meeting | Fraction of actual runs whose composite ≥ threshold |
| surrogate_pds_fraction | GP-predicted fraction of the whole design space that would pass |
| Big gap between them | You sampled an unrepresentative slice |
| Small gap | Your design covers the space well |
| pr_meeting_threshold | Per-candidate probability, 1 - Φ((threshold - μ) / σ). Use it to pick the next-safest design points (see the sketch after this table). |
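The formula in the last row can be checked directly. The μ, σ, and threshold values below are made up; only the 1 - Φ((threshold - μ) / σ) form comes from the summary above.

```python
from scipy.stats import norm

def pr_meeting_threshold(mu: float, sigma: float, threshold: float) -> float:
    """P(composite >= threshold) under a Gaussian GP posterior."""
    return 1.0 - norm.cdf((threshold - mu) / sigma)

# Example: the GP predicts mu = 0.85 +/- 0.10 at a candidate, spec threshold = 0.6.
print(round(pr_meeting_threshold(0.85, 0.10, 0.6), 3))   # ~0.994
```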
multi_response - "What's the trade-off between my responses?"
{ "response_names": ["rmsd_mean", "rg_mean"],
"directions": {"rmsd_mean": "min", "rg_mean": "min"},
"pareto_indices": [3, 7],
"pareto_front": [ ... ] }
Pareto size = 1: one run dominates across all responses. Pareto size > 1: real trade-offs - to go from one Pareto point to another you must sacrifice at least one response.
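If you want to double-check a reported front, non-domination is cheap to verify by hand. The brute-force filter below is a generic sketch, not the FORMULA-X implementation, and assumes every response is minimised.

```python
import numpy as np

def pareto_mask(points: np.ndarray) -> np.ndarray:
    """True for non-dominated rows; every column is assumed to be minimised."""
    n = len(points)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(points[j] <= points[i]) and np.any(points[j] < points[i]):
                keep[i] = False          # row i is dominated by row j
                break
    return keep

# Four runs scored on (rmsd_mean, rg_mean), both "min".
scores = np.array([[2.0, 1.5], [1.2, 2.8], [2.5, 2.6], [1.8, 1.9]])
print(np.where(pareto_mask(scores))[0])  # [0 1 3] -- run 2 is dominated by run 0
```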
cost_aware_bo - "Which MD run should I launch next?"
{ "y_best": 0.93, "surrogate": "multi_fidelity", "n_training_points": 10,
"suggestions": [
{ "factor_values": {...},
"predicted_mu": 0.97, "predicted_sigma": 0.08,
"expected_improvement": 0.21, "feasibility": 0.95,
"expected_cost_gpu_hours": 2.5, "acquisition_score": 0.084 }, ...
]}
| Field | How to read it |
|---|---|
| y_best | Current best observed value of the target |
| surrogate = multi_fidelity | Two-level GP fit (delta correction). Needs ≥ 2 fidelity levels in your data |
| predicted_mu | GP mean at this candidate |
| predicted_sigma | GP uncertainty. Large σ with a decent μ = explore here. |
| expected_improvement | MC-EI against y_best. Higher is better. |
| feasibility | Pr(run completes ∧ equilibrates). Below 0.5 = risky run. |
| expected_cost_gpu_hours | From your campaign's cost_model |
| acquisition_score | EI × feasibility / cost. This is what is being maximised, not raw EI (see the sketch below). |
Read suggestions in order. If the first three sit in the same corner, BO is in exploit mode. If they're scattered, BO is in explore mode. Both are valid; the order tells you which.
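The ranking itself follows directly from that formula. The sketch below re-derives acquisition scores from the per-suggestion fields; the numbers are invented and FORMULA-X's internal cost normalisation may differ, so treat it as the shape of the calculation, not an exact reproduction.

```python
# Invented suggestions with the same field names as the JSON above.
suggestions = [
    {"expected_improvement": 0.20, "feasibility": 0.90, "expected_cost_gpu_hours": 2.0},
    {"expected_improvement": 0.30, "feasibility": 0.40, "expected_cost_gpu_hours": 1.0},
]
for s in suggestions:
    # acquisition_score = EI x feasibility / cost (per the table above)
    s["acquisition_score"] = (
        s["expected_improvement"] * s["feasibility"] / s["expected_cost_gpu_hours"]
    )

ranked = sorted(suggestions, key=lambda s: s["acquisition_score"], reverse=True)
print([round(s["acquisition_score"], 3) for s in ranked])
# [0.12, 0.09] -- the cheap but risky run (feasibility 0.40 < 0.5) still wins on raw score
```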
Diagnostic figures (per response, modeling tab)
| Figure | Question | Red flag |
|---|---|---|
| rsm_coef (bars + significance asterisks) | Which factors and interactions matter? | All bars short → no significant effects (need more data or a wider design) |
| predicted_vs_observed (scatter + 1:1 line) | Does the model explain the data? | R² < 0.5, or points systematically off the 1:1 line (see the sketch after this table) |
| residuals_vs_predicted | Is the noise model right? | Funnel shape = heteroscedastic; trend = missing factor |
| residual_qq (Q-Q plot) | Are residuals normal? | S-shape = heavy tails → ANOVA p-values are unreliable |
| residual_histogram | Quick distribution check | Skewed → transformation needed |
| residuals_vs_order | Time-drift in measurements? | Trend across run order = experimenter drift |
| permutation_importance | Model-agnostic factor ranking | Tall bars = real factor; near zero = noise |
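If you want the same diagnostics outside the UI, the panels are straightforward to reproduce from any model's predictions. The sketch below uses synthetic `observed`/`predicted` arrays as placeholders and covers four of the panels; the Q-Q view can be added with `scipy.stats.probplot`.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
observed = rng.normal(0.7, 0.10, 40)                 # placeholder for measured responses
predicted = observed + rng.normal(0.0, 0.03, 40)     # placeholder for model predictions
residuals = observed - predicted

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].scatter(predicted, observed)
axes[0, 0].axline((0.0, 0.0), slope=1.0)             # the 1:1 line
axes[0, 0].set_title("predicted_vs_observed")
axes[0, 1].scatter(predicted, residuals)
axes[0, 1].axhline(0.0)
axes[0, 1].set_title("residuals_vs_predicted")
axes[1, 0].hist(residuals, bins=15)
axes[1, 0].set_title("residual_histogram")
axes[1, 1].plot(residuals, marker="o")               # x-axis is run order
axes[1, 1].set_title("residuals_vs_order")
plt.tight_layout()
plt.show()
```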
Effect & partial-dependence figures
| Figure | What it shows | Reading rule |
|---|---|---|
| effect (1-D sweep) | Marginal effect of one factor, others held at their midpoints | Slope direction = "more or less is better" |
| partial_dependence | True PDP - marginal effect averaged over the training rows | Differs from effect when factors interact heavily (see the sketch after this table) |
| interaction (lines at low/med/high of 2nd factor) | Do the lines cross? | Crossing = interaction - the effect of A depends on B |
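The "differs from effect when factors interact heavily" row is easy to demonstrate with a generic regressor. The sketch below is not FORMULA-X code; it uses scikit-learn on synthetic data where the response is a pure two-factor interaction, so the PDP of either factor alone comes out nearly flat even though the factor clearly matters.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(300, 2))               # two factors, e.g. T and salt
y = X[:, 0] * X[:, 1] + 0.05 * rng.normal(size=300)     # pure interaction, no main effects

model = GradientBoostingRegressor(random_state=0).fit(X, y)
pdp = partial_dependence(model, X, features=[0], grid_resolution=10)
print(np.round(pdp["average"][0], 2))   # hovers near 0 across the grid -- the interaction hides
```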
2-D maps & 3-D surfaces (per factor pair, per response)
2-D maps
| Figure | Reading rule |
|---|---|
| contour_map (filled + iso-lines) | Standard Design-Expert view. Colour-gradient direction = factor with the most leverage. |
| contour_lines (B/W) | Same content; use for print figures. Iso-line spacing = local sensitivity. |
| gp_uncertainty (heatmap of GP std) | Where you don't know enough. Bright = high uncertainty = candidate for the next experiment. |
3-D surfaces
| Figure | When to use |
|---|---|
| surface (smooth) | Hero figure for slides. Beautiful, but hides residuals. |
| surface_wireframe | Honest view of the GP grid; no hidden interpolation. |
| surface_contour3d | 2-D contour lifted to 3-D; often the most readable. |
| surface_contourf3d | Filled bands in 3-D; best for asymmetric shapes. |
| surface_with_data | The gold standard. Surface + training scatter coloured by residual. |
| scatter3d | Training data only. Use it to check coverage before reading any surface. |
Optimization-run figures
| Figure | Story |
|---|---|
| pds_heatmap (Pr(all specs met), red→green) | The design space. Green = run safely. The boundary is your real operating window for ICH Q8 / QbD. |
| pareto_scatter (first 2 responses) | Each dot = a non-dominated solution. Curvature = trade-off severity. |
| pareto_parcoords (parallel coordinates) | One line per Pareto point. Crossing lines between two response axes = trade-off. |
| pareto_scatter_matrix | Pairwise scatter + diagonal histograms. Use to spot which response pair is the binding trade-off. |
| bo_trace (EI bars per suggestion) | Falling EI = converging. Flat EI = stuck or design space exhausted. |
| ei_landscape (MC-EI heatmap with picked points) | Where BO wanted to go vs where it picked. Big bright regions ignored = your constraints are biting. |
| bo_trajectory (best D vs cumulative iteration) | Monotonic up = healthy BO. Plateau = converged. |
Worked example - cross-mode comparison from formula_x_md_e2e
The same pipeline, run under three different synthetic-physics regimes, produced these numbers. Reading this table shows the interpretation rules above in action.
| Metric | generous | discriminating | bimodal | Why the values differ |
|---|---|---|---|---|
| empirical ≥ 0.6 | 0.80 | 0.40 | 0.40 | Generous's bowl is shallow → most runs land in the basin. Steep modes have most runs outside. |
| surrogate_PDS | 1.00 | 0.44 | 0.41 | GP saturates in generous (no signal); learns the boundary in the steep modes. |
| Pareto front size | 1 | 1 | 2 | Bimodal's two basins surface as two non-dominated points. |
| BO best μ | 1.03 | 0.54 | 1.03 | Discriminating's tighter scorer thresholds cap the maximum composite seen. |
| BO best EI / GPU-hour | 0.019 | 0.084 | 0.054 | Discriminating offers BO the most acquisition signal - y_best = 0.54 leaves headroom; y_best = 1.0 leaves none. |
The same code produced all three columns. The differences are real signal from the underlying physics, not artefacts. That's the proof the surrogate / BO / Pareto stack reacts to data, not to defaults.
One-page cheat sheet
| If you want to know... | Look at |
|---|---|
| "Did this single run equilibrate?" | t_eq + plateau_quality on the run's stability row |
| "Was this single run stable?" | composite_stability + narrative |
| "Where's the boundary of my safe operating region?" | design_space → pds_heatmap |
| "Will the next run crash?" | feasibility_map → top_safe_candidates |
| "Is there a trade-off between two responses?" | multi_response → Pareto size + pareto_scatter |
| "What should my next MD run be?" | cost_aware_bo → suggestions[0] |
| "Is my model trustworthy?" | predicted_vs_observed + the four residual diagnostics |
| "Which factors actually matter?" | rsm_coef + permutation_importance |
| "Where am I most uncertain?" | gp_uncertainty heatmap |
| "Did my BO loop converge?" | bo_trajectory + bo_trace |