
FORMULA-X Reading Guide

How to interpret every figure, table, and optimization summary

Every output in FORMULA-X answers exactly one question. This page walks you through what each one says, what "good" looks like, and what the common red flags mean. Sections are grouped by where they appear in the UI.

Three core rules

  1. Every output answers one question. When you look at a figure or summary, ask "what would change if I tweaked the experiment?" If you can't answer that, you're not reading it right.
  2. Stage-1 outputs describe one trajectory; Stage-2 outputs summarise it; optimization outputs reason about the campaign as a whole. Mixing those scales is the most common interpretation mistake.
  3. Numbers without uncertainty are guesses. When you see a single point estimate (e.g. "best run = 0.93"), look for its companion (σ, replicate count, or n_eff) before trusting it.
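To make rule 3 concrete, here is a minimal NumPy sketch (with made-up replicate values) of how a bare "best run = 0.93" turns into an interval once its companions are attached:

import numpy as np

replicates = np.array([0.91, 0.95, 0.89, 0.97])  # composite per seed (hypothetical)
mu = replicates.mean()
sigma = replicates.std(ddof=1)                   # sample standard deviation
se = sigma / np.sqrt(len(replicates))            # standard error of the mean
print(f"best = {mu:.2f} +/- {1.96 * se:.2f} (n = {len(replicates)})")
# best = 0.93 +/- 0.04 (n = 4): the point estimate plus its uncertainty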

MD Studio - Setup & schema tables

Setup tab
Field | What it controls | What "good" looks like
system_class | Which Stage-2 scorer fires | Matches your actual MD experiment, not a default
equilibration_method | How Stage-1 picks the plateau cutoff t* | block_var when "matches my eye" matters; chodera when reviewers will ask
kinetic_fit | Only used when system_class = self_assembly | none for plateau systems; hill / avrami / exp_sat for assembly
composite_policy | How per-criterion scores combine | multiplicative is strict (any 0 → 0); weighted_mean is forgiving
aggregation_policy | How seed replicates collapse | trimmed_mean default; median for seed-outlier robustness
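To see what the last two policy fields do, here is a minimal sketch of multiplicative vs weighted_mean composites and trimmed_mean vs median seed aggregation. The helper names are hypothetical, not the FORMULA-X implementation:

import statistics

def composite(scores, policy="multiplicative", weights=None):
    """Collapse per-criterion scores in [0, 1] to one composite scalar."""
    if policy == "multiplicative":            # strict: any 0 zeroes the product
        result = 1.0
        for s in scores:
            result *= s
        return result
    if policy == "weighted_mean":             # forgiving: one failure is diluted
        weights = weights or [1.0] * len(scores)
        return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    raise ValueError(policy)

def aggregate_seeds(values, policy="trimmed_mean"):
    """Collapse seed replicates of one condition to a single value."""
    if policy == "median":                    # robust to a single bad seed
        return statistics.median(values)
    vals = sorted(values)[1:-1]               # trimmed_mean: drop the extremes
    return sum(vals) / len(vals)

scores = [0.9, 0.8, 0.0]                      # one criterion failed outright
print(composite(scores))                      # 0.0   - multiplicative is strict
print(composite(scores, "weighted_mean"))     # ~0.57 - weighted_mean forgives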
Response Reduction column - the most overlooked decision

The response's Reduction field controls how a time series collapses to one scalar. Same trajectory, different reduction → different scalar → different surrogate.

Reduction | When to use
post_eq_mean | Default. Average over the post-plateau window.
plateau_value | When the very tail matters (e.g. final binding pose RMSD).
last_quartile_mean | Conservative: only the steady-state tail.
fluctuation_std | When you want to minimise variance itself - stability as a target.
autocorr_time | When fast convergence is the goal.
slope | When the response should be flat - a non-zero slope means drift.
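As a rough illustration of how the same post-equilibration series collapses to different scalars under different reductions (a sketch with a hypothetical helper, assuming a 1-D NumPy series and a known t_eq index):

import numpy as np

def reduce_series(y, t_eq, reduction="post_eq_mean"):
    """Collapse a trajectory time series to one scalar, post-equilibration."""
    tail = y[t_eq:]                               # post-plateau window only
    if reduction == "post_eq_mean":
        return tail.mean()
    if reduction == "plateau_value":
        return tail[-1]                           # the very end of the run
    if reduction == "last_quartile_mean":
        return tail[3 * len(tail) // 4:].mean()   # steady-state tail only
    if reduction == "fluctuation_std":
        return tail.std()                         # stability itself as target
    if reduction == "slope":
        t = np.arange(len(tail))
        return np.polyfit(t, tail, 1)[0]          # non-zero slope means drift
    raise ValueError(reduction)

rng = np.random.default_rng(0)
y = 3.0 + 0.1 * rng.standard_normal(1000)         # a flat, equilibrated series
for r in ("post_eq_mean", "plateau_value", "fluctuation_std", "slope"):
    print(r, round(float(reduce_series(y, 250, r)), 4))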

MD Studio - Runs & Stability table

#0  T=295  P=1  salt=0.10  fid=long_200ns  status=completed  composite=1.000  t_eq=50ns
Column | Meaning
composite | [0,1]. Read it as "how many criteria passed, weighted by closeness to thresholds", NOT a probability. With multiplicative policy: < 0.5 means at least one criterion is in trouble; < 0.1 means at least one has essentially failed.
t_eq (ns) | Where Stage-1 declared the plateau started. Compare to sim_length / 4: if t_eq > sim_length / 2, the post-eq window is shorter than the equilibration phase - treat the scalar with skepticism.
status=crashed + composite=---- | Informative infeasibility. NOT lost data - the feasibility classifier learns from these rows.
Narrative | e.g. "Composite 1.00; failed: rmsd_lt_3A". Always tells you which criterion failed.

MD Optimization summaries

Each optimization kind answers a distinct question. Misreading one as another is the most common pitfall.

stability_max - "Which run was best?"
{ "top": [ {"run_id": ..., "factor_values": {...}, "composite_stability": 0.93}, ... ] }
Useful for sanity-checking that the best run is where you expected. Not useful for prediction - it only ranks what you already ran.
feasibility_map - "Where will my next run not crash?"
{ "feasibility_rate": 0.80,
  "model_kind": "logistic_regression",
  "top_safe_candidates": [ {"factor_values": {...}, "predicted_pr_success": 0.99}, ... ] }
Field | Meaning
feasibility_rate | Empirical fraction of runs that completed
model_kind = constant_rate | Red flag: all runs succeeded or all failed - no real classifier was trained, so the "top candidates" are arbitrary
model_kind = logistic_regression | A real classifier learned a boundary - predicted_pr_success per candidate is a real probability
top_safe_candidates | Where to focus the next batch if you've been crashing
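What "a real classifier learned a boundary" means in practice: given completed/crashed labels and factor values, a logistic regression yields a genuine Pr(success) per candidate. A sketch using scikit-learn and invented run data (the actual FORMULA-X model internals are not shown here):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Factor values per past run (e.g. [T, salt]) and whether each run completed.
X = np.array([[290, 0.05], [295, 0.10], [300, 0.15], [305, 0.20],
              [310, 0.40], [315, 0.45], [320, 0.50], [325, 0.60]])
completed = np.array([1, 1, 1, 1, 0, 0, 0, 0])

if completed.min() == completed.max():
    # All succeeded or all failed: nothing to learn -> "constant_rate" red flag.
    print("constant_rate: ranked candidates would be arbitrary")
else:
    clf = LogisticRegression().fit(X, completed)
    candidates = np.array([[298, 0.12], [318, 0.48]])
    for c, p in zip(candidates, clf.predict_proba(candidates)[:, 1]):
        print(c, "predicted_pr_success =", round(float(p), 2))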
design_space - "What fraction of my parameter space is acceptable?"
{ "threshold": 0.6,
  "empirical_fraction_meeting": 0.40,
  "surrogate_pds_fraction": 0.44,
  "top_design_space_points": [ {"factor_values": ..., "pr_meeting_threshold": 0.99}, ... ] }
Field | Meaning
empirical_fraction_meeting | Fraction of actual runs whose composite ≥ threshold
surrogate_pds_fraction | GP-predicted fraction of the whole design space that would pass
Big gap between them | You sampled an unrepresentative slice
Small gap | Your design covers the space well
pr_meeting_threshold | Per-candidate probability from 1 - Φ((threshold - μ) / σ). Use for the next-safest design points.
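That pr_meeting_threshold formula is a one-liner once the GP's μ and σ at a candidate are known. A sketch using scipy (an assumed dependency here):

from scipy.stats import norm

def pr_meeting_threshold(mu: float, sigma: float, threshold: float) -> float:
    """Pr(response >= threshold) under a Gaussian GP posterior N(mu, sigma^2)."""
    return 1.0 - norm.cdf((threshold - mu) / sigma)

# A candidate predicted well above threshold, with modest uncertainty:
print(round(pr_meeting_threshold(mu=0.75, sigma=0.06, threshold=0.6), 2))  # 0.99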
multi_response - "What's the trade-off between my responses?"
{ "response_names": ["rmsd_mean", "rg_mean"],
  "directions": {"rmsd_mean": "min", "rg_mean": "min"},
  "pareto_indices": [3, 7],
  "pareto_front": [ ... ] }
Pareto size = 1: one run dominates across all responses. Pareto size > 1: real trade-offs - to go from one Pareto point to another you must sacrifice at least one response.
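Non-domination is easy to state precisely in code. A sketch of how pareto_indices could be computed when all responses are minimised (a hypothetical helper, not the FORMULA-X source):

import numpy as np

def pareto_indices(Y):
    """Indices of non-dominated rows of Y (rows = runs, cols = responses).

    Row i is dominated if some row j is <= on every response and < on one.
    """
    front = []
    for i, yi in enumerate(Y):
        dominated = any(np.all(yj <= yi) and np.any(yj < yi)
                        for j, yj in enumerate(Y) if j != i)
        if not dominated:
            front.append(i)
    return front

# Two responses, both minimised (e.g. rmsd_mean and rg_mean):
Y = np.array([[2.0, 1.5], [1.0, 3.0], [1.5, 1.6], [2.5, 2.5]])
print(pareto_indices(Y))  # [0, 1, 2] - the last run is dominated by the first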
cost_aware_bo - "Which MD run should I launch next?"
{ "y_best": 0.93, "surrogate": "multi_fidelity", "n_training_points": 10,
  "suggestions": [
    { "factor_values": {...},
      "predicted_mu": 0.97, "predicted_sigma": 0.08,
      "expected_improvement": 0.21, "feasibility": 0.95,
      "expected_cost_gpu_hours": 2.5, "acquisition_score": 0.084 }, ...
  ]}
Field | Meaning
y_best | Current best observed value of the target
surrogate = multi_fidelity | Two-level GP fit (delta correction). Needs ≥ 2 fidelity levels in your data
predicted_mu | GP mean at this candidate
predicted_sigma | GP uncertainty. Large σ + decent μ = explore here.
expected_improvement | MC-EI vs y_best. Higher is better.
feasibility | Pr(run completes ∧ equilibrates). Below 0.5 = risky run.
expected_cost_gpu_hours | From your campaign's cost_model
acquisition_score | EI × feasibility / cost. This is what's being maximised, not raw EI.
Read suggestions in order. If the first three sit in the same corner, BO is in exploit mode. If they're scattered, BO is in explore mode. Both are valid; the order tells you which.
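A sketch of the stated scoring rule (EI × feasibility / cost) applied to a suggestion list; the field names follow the JSON above, but the numbers are invented to illustrate the ordering, not to reproduce exact FORMULA-X scores:

def acquisition_score(s):
    """Cost-aware, feasibility-weighted acquisition: EI * Pr(ok) / GPU-hours."""
    return (s["expected_improvement"] * s["feasibility"]
            / s["expected_cost_gpu_hours"])

suggestions = [
    {"expected_improvement": 0.21, "feasibility": 0.95,
     "expected_cost_gpu_hours": 2.5},  # promising, mid-cost
    {"expected_improvement": 0.30, "feasibility": 0.40,
     "expected_cost_gpu_hours": 2.5},  # highest EI, but likely to crash
    {"expected_improvement": 0.10, "feasibility": 0.99,
     "expected_cost_gpu_hours": 0.5},  # cheap, safe, modest EI
]
for s in sorted(suggestions, key=acquisition_score, reverse=True):
    print(round(acquisition_score(s), 3), s)
# The cheap safe run wins (0.198); raw EI alone would have ranked it last.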

Diagnostic figures (per response, modeling tab)

Figure | Question | Red flag
rsm_coef (bars + significance asterisks) | Which factors and interactions matter? | All bars short → no significant effects (need more data or a wider design)
predicted_vs_observed (scatter + 1:1 line) | Does the model explain the data? | R² < 0.5, or points systematically off the 1:1 line
residuals_vs_predicted | Is the noise model right? | Funnel shape = heteroscedastic; trend = missing factor
residual_qq (Q-Q plot) | Are residuals normal? | S-shape = heavy tails → ANOVA p-values are unreliable
residual_histogram | Quick distribution check | Skewed → transformation needed
residuals_vs_order | Time-drift in measurements? | Trend across run order = experimenter drift
permutation_importance | Model-agnostic factor ranking | Tall bars = real factor; near zero = noise
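The scatter-based diagnostics all derive from the same two arrays. A sketch of the quantities behind predicted_vs_observed, residuals_vs_predicted, and residual_histogram (plain NumPy, not FORMULA-X's plotting code):

import numpy as np

def diagnostics(observed, predicted):
    """The raw numbers behind the model-diagnostic figures."""
    resid = observed - predicted
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((observed - observed.mean()) ** 2))
    z = (resid - resid.mean()) / resid.std()
    return {
        "r2": 1 - ss_res / ss_tot,           # predicted_vs_observed headline
        "resid_mean": float(resid.mean()),   # should hover near zero
        "resid_skew": float(np.mean(z**3)),  # histogram red flag if far from 0
    }

rng = np.random.default_rng(1)
obs = rng.normal(0.7, 0.1, 40)
pred = obs + rng.normal(0.0, 0.03, 40)       # a model that tracks the data
print(diagnostics(obs, pred))                # expect r2 well above 0.5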

Effect & partial-dependence figures

Figure | What it shows | Reading rule
effect (1-D sweep) | Marginal effect of one factor, others at midpoint | Slope direction = "more or less is better"
partial_dependence | True PDP - marginal averaged over training rows | Differs from effect when factors interact heavily
interaction (lines at low/med/high of 2nd factor) | Do the lines cross? | Crossing = interaction - the effect of A depends on B
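The difference between effect and partial_dependence is exactly one line of code: sweep one factor with the others pinned at their midpoint, versus sweep it while averaging the model over the training rows. A sketch with a toy model built around a deliberate interaction (all names hypothetical):

import numpy as np

def effect_sweep(model, X_train, j, grid):
    """1-D effect: vary factor j, hold every other factor at its midpoint."""
    base = (X_train.min(axis=0) + X_train.max(axis=0)) / 2
    X = np.tile(base, (len(grid), 1))
    X[:, j] = grid
    return model.predict(X)

def partial_dependence(model, X_train, j, grid):
    """True PDP: vary factor j, average predictions over the training rows."""
    preds = []
    for g in grid:
        X = X_train.copy()
        X[:, j] = g                  # overwrite factor j in every row
        preds.append(model.predict(X).mean())
    return np.array(preds)

class Toy:
    def predict(self, X):
        return X[:, 0] * X[:, 1]     # pure x0-x1 interaction

X_train = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [1.0, 1.0]])
grid = np.linspace(0.0, 1.0, 3)
print(effect_sweep(Toy(), X_train, 0, grid))        # [0. 0.25 0.5]   - slice at x1 = 0.5
print(partial_dependence(Toy(), X_train, 0, grid))  # [0. 0.125 0.25] - averaged over x1
# The curves diverge because the factors interact, as the table warns.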

2-D maps & 3-D surfaces (per factor pair, per response)

2-D maps
Figure | Reading rule
contour_map (filled + iso-lines) | Standard Design-Expert view. Colour gradient direction = factor with the most leverage.
contour_lines (B/W) | Same content; use for print figures. Iso-line spacing = local sensitivity.
gp_uncertainty (heatmap of GP std) | Where you don't know enough. Bright = high uncertainty = candidate for the next experiment.
3-D surfaces
Figure | When to use
surface (smooth) | Hero figure for slides. Beautiful, but hides residuals.
surface_wireframe | Honest view of the GP grid; no hidden interpolation.
surface_contour3d | 2-D contour lifted to 3-D; often the most readable.
surface_contourf3d | Filled bands in 3-D; best for asymmetric shapes.
surface_with_data | The gold standard. Surface + training scatter coloured by residual.
scatter3d | Training data only. Use to check coverage before reading any surface.

Optimization-run figures

Figure | Story
pds_heatmap (Pr(all specs met), red→green) | The design space. Green = run safely. The boundary is your real operating window for ICH Q8 / QbD.
pareto_scatter (first 2 responses) | Each dot = a non-dominated solution. Curvature = trade-off severity.
pareto_parcoords (parallel coordinates) | One line per Pareto point. Crossing lines between two response axes = trade-off.
pareto_scatter_matrix | Pairwise scatter + diagonal histograms. Use to spot which response pair is the binding trade-off.
bo_trace (EI bars per suggestion) | Falling EI = converging. Flat EI = stuck, or the design space is exhausted.
ei_landscape (MC-EI heatmap with picked points) | Where BO wanted to go vs where it picked. Big bright regions ignored = your constraints are biting.
bo_trajectory (best observed value vs cumulative iteration) | Monotonic up = healthy BO. Plateau = converged.

Worked example - cross-mode comparison from formula_x_md_e2e

The same pipeline with three different synthetic-physics regimes produced these numbers. Working through this table puts the interpretation rules above into practice.

Metric | generous | discriminating | bimodal | Why the values differ
empirical ≥ 0.6 | 0.80 | 0.40 | 0.40 | Generous's bowl is shallow → most runs land in the basin. Steep modes have most runs outside.
surrogate_PDS | 1.00 | 0.44 | 0.41 | GP saturates in generous (no signal); learns the boundary in steep modes.
Pareto front size | 1 | 1 | 2 | Bimodal's two basins surface as two non-dominated points.
BO best μ | 1.03 | 0.54 | 1.03 | Discriminating's tighter scorer thresholds cap the maximum composite seen.
BO best EI / GPU-hour | 0.019 | 0.084 | 0.054 | Discriminating offers BO the most acquisition signal - y_best = 0.54 leaves headroom; y_best = 1.0 leaves none.
The same code produced all three columns. The differences are real signal from the underlying physics, not artefacts. That's the proof the surrogate / BO / Pareto stack reacts to data, not to defaults.

One-page cheat sheet

If you want to know... | Look at
"Did this single run equilibrate?" | t_eq + plateau_quality on the run's stability row
"Was this single run stable?" | composite_stability + narrative
"Where's the boundary of my safe operating region?" | design_space → pds_heatmap
"Will the next run crash?" | feasibility_map → top_safe_candidates
"Is there a trade-off between two responses?" | multi_response → Pareto size + pareto_scatter
"What should my next MD run be?" | cost_aware_bo → suggestions[0]
"Is my model trustworthy?" | predicted_vs_observed + the four residual diagnostics
"Which factors actually matter?" | rsm_coef + permutation_importance
"Where am I most uncertain?" | gp_uncertainty heatmap
"Did my BO loop converge?" | bo_trajectory + bo_trace
Need more depth? See the About and FAQ pages, or open the MD Studio in any project to apply these rules to your own data.