FORMULA-X
Design of Experiments + Machine Learning for formulation optimization.
Why FORMULA-X exists
Classical DoE software (Design-Expert, Minitab, JMP) gives you a quadratic response-surface model and a Derringer-Suich desirability score. That is useful, but it is also what most formulation papers have done for thirty years. FORMULA-X is built around six analyses that those tools either lack or support only partially:
1. Bayesian optimization
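Suggests the next experiment with the highest Expected Improvement on desirability, using a Gaussian Process surrogate to weigh promising regions against unexplored ones. Fewer lab runs are needed to reach the optimum; this is what is often described as "AI-guided formulation".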
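As a rough illustration of the acquisition step, the sketch below computes Expected Improvement (EI) for a few candidate formulations from a surrogate's predictive mean and standard deviation. The function name `expected_improvement`, the `xi` margin, and all candidate values are hypothetical and not FORMULA-X's API; the sketch assumes desirability is maximized and the surrogate's predictions are Gaussian.

```python
import math


def expected_improvement(mu, sigma, best, xi=0.0):
    """Closed-form EI for maximization under a Gaussian predictive distribution."""
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best - xi) * cdf + sigma * pdf


# Hypothetical surrogate predictions (mean, std) of desirability for three candidates.
candidates = {
    "A": (0.70, 0.02),  # close to the current best, well explored
    "B": (0.68, 0.15),  # slightly lower mean, much more uncertain
    "C": (0.50, 0.05),  # clearly worse
}
best_observed = 0.72  # best desirability measured so far (made-up value)

ei = {name: expected_improvement(m, s, best_observed) for name, (m, s) in candidates.items()}
next_run = max(ei, key=ei.get)
print(next_run, ei)
```

In this toy setup candidate B is selected even though its predicted mean is lower than A's, because its larger uncertainty leaves more room for improvement. That exploration-versus-exploitation trade-off is what lets the procedure skip uninformative runs.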
2. Probabilistic design space
Estimates the probability of meeting all specifications, Pr(meeting specs), at every factor combination via Monte Carlo simulation. Aligns with ICH Q8 / FDA Quality by Design (QbD).
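A minimal sketch of the idea, under stated assumptions: the fitted model `predicted_size`, its coefficients, the 160 nm upper spec, and the prediction standard deviation are all invented for illustration. The sketch samples only Gaussian prediction error; a fuller treatment would presumably also propagate uncertainty in the model parameters.

```python
import random


def predicted_size(x1, x2):
    """Hypothetical fitted model: particle size (nm) from two coded factors in [-1, 1]."""
    return 150.0 - 30.0 * x1 + 20.0 * x2 + 25.0 * x1 * x1 + 10.0 * x1 * x2


def prob_in_spec(x1, x2, rng, upper=160.0, pred_std=12.0, n_draws=5000):
    """Monte Carlo estimate of Pr(size <= upper) given Gaussian prediction error."""
    mean = predicted_size(x1, x2)
    hits = sum(rng.gauss(mean, pred_std) <= upper for _ in range(n_draws))
    return hits / n_draws


rng = random.Random(42)  # fixed seed so the sketch is reproducible
levels = [-1.0, -0.5, 0.0, 0.5, 1.0]
grid = {(x1, x2): prob_in_spec(x1, x2, rng) for x1 in levels for x2 in levels}

# Design space: factor settings where the spec is met with at least 90 % probability.
design_space = sorted(point for point, pr in grid.items() if pr >= 0.90)
print(design_space)
```

The 90 % threshold is itself an assumption; the acceptable risk level is a choice the formulator has to justify.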
3. Multi-objective Pareto
NSGA-II returns the full Pareto front across particle size, PDI, zeta potential, cost, etc. - no arbitrary weights forced on the formulator.
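The sketch below shows only the non-dominated filter that defines a Pareto front, not NSGA-II itself, which adds an evolutionary search and crowding-distance ranking on top of this test. The formulation names and objective values are hypothetical, and all three objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the names of points that no other point dominates."""
    return [
        name
        for name, obj in points.items()
        if not any(dominates(other, obj) for o_name, other in points.items() if o_name != name)
    ]


# Hypothetical formulations: (particle size in nm, PDI, cost per batch), all minimized.
formulations = {
    "F1": (120.0, 0.20, 8.0),
    "F2": (150.0, 0.15, 6.0),
    "F3": (160.0, 0.25, 9.0),   # worse than F1 on all three objectives
    "F4": (110.0, 0.30, 12.0),
    "F5": (150.0, 0.18, 6.5),   # no better than F2 on any objective
}

front = pareto_front(formulations)
print(front)
```

F1, F2, and F4 remain because each is best on at least one trade-off; choosing among them is left to the formulator rather than to a preset weight vector.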
4. RSM vs ML, honestly
Quadratic RSM, Gaussian Process, and gradient-boosted trees compared with nested k-fold cross-validation. Whichever model wins, the comparison itself is a publishable result.
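To make the comparison metric concrete, here is a simplified sketch of a k-fold cross-validated Q² (1 minus PRESS over the total sum of squares). It is plain k-fold, not nested: nested CV wraps a second, inner loop around any hyperparameter tuning. A linear and a quadratic polynomial in one factor stand in for the competing models, and the data and noise values are invented.

```python
def fit_poly(xs, ys, degree):
    """Ordinary least squares for a 1-D polynomial via the normal equations."""
    n = degree + 1
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def predict(coefs, x):
    return sum(c * x ** i for i, c in enumerate(coefs))


def q2_kfold(xs, ys, degree, k=5):
    """Q^2 = 1 - PRESS / SS_tot, with PRESS accumulated over held-out folds."""
    press = 0.0
    for fold in range(k):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        held_out = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        coefs = fit_poly([x for x, _ in train], [y for _, y in train], degree)
        press += sum((y - predict(coefs, x)) ** 2 for x, y in held_out)
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot


# Invented data: a curved response plus small fixed "noise" values.
xs = [i / 5 - 1 for i in range(11)]
noise = [0.4, -0.55, 0.15, 0.5, -0.3, -0.45, 0.6, -0.2, 0.35, -0.5, 0.1]
ys = [100 + 10 * x + 20 * x * x + e for x, e in zip(xs, noise)]

q2_lin = q2_kfold(xs, ys, degree=1)
q2_quad = q2_kfold(xs, ys, degree=2)
print(round(q2_lin, 3), round(q2_quad, 3))
```

Because every prediction is scored on runs the model never saw, a more flexible model only wins if it actually generalizes better, which is the point of selecting the winner by cross-validation rather than by in-sample R².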
5. Robust optimization
Finds optima least sensitive to small perturbations in factors - closer to manufacturing reality than a single point optimum.
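The difference between a point optimum and a robust optimum can be sketched with a one-factor toy response. The `desirability` function, the candidate grid, and the assumed drift values are hypothetical; averaging over a fixed set of perturbations is just one simple robustness criterion and may not be the one FORMULA-X uses.

```python
import math


def desirability(x):
    """Hypothetical response: a sharp peak near x = 0.2 and a broad plateau near x = 0.7."""
    sharp = 1.0 * math.exp(-(((x - 0.2) / 0.03) ** 2))
    broad = 0.9 * math.exp(-(((x - 0.7) / 0.2) ** 2))
    return sharp + broad


def robust_score(x, perturbations):
    """Mean response when the factor setting drifts by each perturbation."""
    return sum(desirability(x + d) for d in perturbations) / len(perturbations)


grid = [i / 100 for i in range(101)]       # candidate settings of one coded factor
drift = [-0.05, -0.025, 0.0, 0.025, 0.05]  # assumed manufacturing variation

nominal_x = max(grid, key=desirability)
robust_x = max(grid, key=lambda x: robust_score(x, drift))
print(nominal_x, robust_x)
```

The nominal optimum sits on the sharp peak, where a small drift in the factor costs most of the response. The robust optimum moves to the plateau, giving up a little peak performance for much lower sensitivity.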
6. Constraint-aware
Linear, non-linear, mixture-sum, and cost constraints handled natively - factorial designs encode these poorly.
How FORMULA-X compares to existing tools
Honest matrix vs the four most-used DoE tools in pharma and process development. ✓ = built-in, partial = available but limited, ✗ = not supported, manual = possible only by writing custom code outside the tool.
| Capability | FORMULA-X | Design-Expert (Stat-Ease) | Minitab | JMP / JMP Pro (SAS) | pyDOE3 / Python scripts |
|---|---|---|---|---|---|
| Design generation | |||||
| Box-Behnken, CCD, full / fractional factorial, Plackett-Burman | |||||
| D-optimal (coordinate exchange) | partial (libraries only) | ||||
| Mixture / simplex-lattice designs | ✓ (industry-leading) | partial | |||
| Latin-hypercube space-filling | partial | partial | |||
| Modeling | |||||
| Quadratic RSM (OLS) with full ANOVA + p-values | ✓ (statsmodels) | ||||
| Honest k-fold cross-validation, predictive Q² | ✓ | partial (PRESS) | partial (PRESS) | partial | manual |
| Gaussian-Process surrogate (with predictive std) | ✓ (JMP Pro) | manual (sklearn / GPyTorch) | |||
| Gradient-boosted-trees surrogate | partial (XGBoost add-on) | ✓ (JMP Pro) | manual (XGBoost / LightGBM) | ||
| RSM-vs-ML ensemble comparison with honest nested CV winner selection | partial (manual) | manual | |||
| Optimization | |||||
| Derringer-Suich desirability | manual | ||||
| Multi-objective Pareto front (NSGA-II) with crowding | ✓ | ✗ (weighted-sum desirability only) | ✗ (weighted only) | ✗ (weighted only) | manual (pymoo) |
| Probabilistic design space, ICH Q8 / FDA QbD | partial (deterministic contour) | partial | manual | ||
| Bayesian Optimization (lab-in-the-loop, Expected Improvement on desirability) | ✓ (JMP Pro 17+) | manual (BoTorch / scikit-optimize) | |||
| Robust optimization under input noise | partial (Propagation of Error) | partial | manual | ||
| Constraint-aware (linear, non-linear, sympy expressions) | partial (linear / mixture only) | partial (linear only) | manual | ||
| Diagnostics & explainability | |||||
| Replicate variance auto-flagged on upload (with pooled pure-error std) | ✓ | partial (post-fit) | partial | partial | manual |
| Duplicate-column / data-hygiene warnings on upload | manual | ||||
| Residual diagnostics: vs predicted, Q-Q, histogram, vs run order | manual (matplotlib) | ||||
| Permutation importance (model-agnostic factor ranking) | partial | manual (sklearn) | |||
| Partial-dependence plot (true PDP, not midpoint shortcut) | partial | manual | |||
| GP predictive-uncertainty heatmap | ✓ (JMP Pro) | manual | |||
| Visualisation | |||||
| 3-D surface family: smooth, wireframe, 3-D contours, filled contours, surface + data overlay, 3-D scatter | ✓ (6 styles) | partial (smooth + contour) | partial | ✓ (smooth + wireframe) | manual |
| 2-D contour map (filled + iso-lines) | manual | ||||
| Bayesian-optimization Expected-Improvement landscape | partial (JMP Pro) | manual | |||
| Pareto parallel-coordinates + scatter matrix (n_responses ≥ 3) | partial | manual | |||
| 26 server-rendered figures (Agg backend, no display required) | manual | ||||
| Platform & integration | |||||
| Web UI, multi-user, browser-based | ✗ (desktop) | ✗ (desktop) | partial (JMP Live offers a web add-on) | ||
| REST API + Celery queue for automation | partial (JSL scripting) | manual | |||
| PDF report + ZIP CSV bundle exports | partial (Word / PDF) | manual | |||
| Integrated with the wider InsilicoΣ stack (QSAR-X, ADMET-X, RNA-Σ, Clinical ML, etc.) | |||||
| Open-source / transparent codebase | |||||
| Pricing | Free (academic / member access) | Commercial (~$2-5K / seat) | Commercial (~$1.5K / yr) | Commercial (~$15K / seat for JMP Pro) | Free |
Pick FORMULA-X when you want
- An ICH Q8 / QbD probabilistic design space, not just a deterministic contour.
- Bayesian-optimization suggestions to reduce the number of lab runs.
- An honest RSM-vs-ML comparison instead of trusting the quadratic by default.
- A multi-objective Pareto front without forcing arbitrary desirability weights.
- A web-based, scriptable, automation-friendly stack alongside QSAR-X / ADMET-X.
- Open-source code you can audit, reproduce, and cite.
Pick another tool when
- You need GMP / 21 CFR Part 11 audit trails out of the box - Design-Expert and JMP have decades of regulatory validation; FORMULA-X is research-grade.
- Your work depends on a specialised mixture-design feature only Stat-Ease maintains (e.g. process-mixture combined designs).
- You're embedded in a SAS / JMP shop and switching costs outweigh feature gains.
- You only need design generation and you are happy in raw Python (pyDOE3 is enough).
What goes in
- A Box-Behnken / CCD / factorial / D-optimal / Latin-hypercube / mixture design - generated by FORMULA-X or uploaded as CSV.
- Factor and response definitions (units, bounds, optimization direction, ICH-style specs).
- Optional constraints expressed in plain symbolic form, e.g. lecithin + cosurfactant <= 320.
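As an illustration of how a constraint written in this symbolic form might be applied to candidate runs, the sketch below parses the expression with Python's `ast` module and keeps only feasible candidates. The helper names `satisfies` and `_eval` and the candidate values are hypothetical; this small evaluator is a stand-in for whatever expression parser FORMULA-X actually uses, and it supports only arithmetic and a single comparison.

```python
import ast
import operator

_BIN = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}
_CMP = {ast.LtE: operator.le, ast.GtE: operator.ge,
        ast.Lt: operator.lt, ast.Gt: operator.gt}


def _eval(node, values):
    """Evaluate a restricted expression tree: numbers, factor names, + - * /, one comparison."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name):
        return values[node.id]
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN:
        return _BIN[type(node.op)](_eval(node.left, values), _eval(node.right, values))
    if isinstance(node, ast.Compare) and len(node.ops) == 1 and type(node.ops[0]) in _CMP:
        return _CMP[type(node.ops[0])](_eval(node.left, values),
                                       _eval(node.comparators[0], values))
    raise ValueError("unsupported expression")


def satisfies(constraint, values):
    """Check one candidate (factor name -> level) against a constraint string."""
    tree = ast.parse(constraint, mode="eval")
    return bool(_eval(tree.body, values))


# Hypothetical candidate runs.
candidates = [
    {"lecithin": 200.0, "cosurfactant": 100.0},  # sum 300
    {"lecithin": 250.0, "cosurfactant": 90.0},   # sum 340
    {"lecithin": 220.0, "cosurfactant": 100.0},  # sum 320
]
feasible = [c for c in candidates if satisfies("lecithin + cosurfactant <= 320", c)]
print(feasible)
```

Walking the syntax tree and rejecting anything unexpected avoids calling `eval` on user-supplied text, which matters for a multi-user web tool.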
What comes out
- Trained surrogate model(s) per response with honest CV metrics.
- Pareto front of non-dominated formulations.
- Probabilistic design-space heatmap (ICH Q8 ready).
- Bayesian-optimization suggestions for the next experiments.
- PDF report, CSV bundle, and PNG/SVG figures for publication.
FORMULA-X is part of the InsilicoΣ platform. To request access, see the FAQ or contact the maintainer.