About POLY-X

Overview

POLY-X is a multi-tier polymer property prediction platform that predicts key thermal and physical properties from polymer repeat unit SMILES (PSMILES) notation. It integrates three complementary prediction engines with five enhanced analytical features for publication-grade polymer informatics.

Three-Tier Prediction Architecture

Tier 1 — Van Krevelen Group Contribution

Classical additive group contribution method based on Van Krevelen & te Nijenhuis (2009). The repeat unit is decomposed into functional groups using a priority-based SMARTS matching system (composite > ring > simple groups). Property values are computed additively:

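T_g = 1000 × Σ(Y_g,i × n_i) / M [K]
CED = Σ(E_coh,i × n_i) / Σ(V_w,i × n_i) [J/cm³]
δ = √CED [MPa^0.5]
ρ = 0.634 × M / V_w [g/cm³], with V_w = Σ(V_w,i × n_i)

Here n_i is the number of occurrences of group i in the repeat unit, Y_g,i its molar glass transition function (K·kg/mol), E_coh,i its cohesive energy (J/mol), V_w,i its van der Waals volume (cm³/mol), and M the molar mass of the repeat unit (g/mol). The factor 1000 converts kg to g; δ needs no unit conversion because 1 J/cm³ = 1 MPa.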

Strength: Excellent for simple polymers (absolute error vs. literature T_g: PE 2.5 K, PP 1.1 K, PS 3.4 K). Limitation: Underperforms where non-additive effects such as steric hindrance or hydrogen bonding dominate.
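As a rough illustration of the additive scheme, the sketch below applies the T_g, CED, and δ formulas to polyethylene. The group table, the function name `tier1_estimate`, and the parameter values for the CH2 group are assumptions made for this example; they are not taken from the POLY-X group database, and the SMARTS-based decomposition step is skipped by passing group counts directly.

```python
import math

# Illustrative per-group parameters (assumed for this sketch, not POLY-X's table).
# Units: Y_g in K·kg/mol, E_coh in J/mol, V_w in cm³/mol.
GROUPS = {
    "CH2": {"Y_g": 2.7, "E_coh": 4190.0, "V_w": 10.23},
}

def tier1_estimate(counts, molar_mass):
    """Apply the additive formulas to group counts {name: n_i}.

    molar_mass is the repeat-unit molar mass in g/mol.
    """
    y_g = sum(GROUPS[g]["Y_g"] * n for g, n in counts.items())
    e_coh = sum(GROUPS[g]["E_coh"] * n for g, n in counts.items())
    v_w = sum(GROUPS[g]["V_w"] * n for g, n in counts.items())

    t_g = 1000.0 * y_g / molar_mass   # K (factor 1000 converts kg to g)
    ced = e_coh / v_w                 # J/cm³
    delta = math.sqrt(ced)            # MPa^0.5, since 1 J/cm³ = 1 MPa
    return {"T_g": t_g, "CED": ced, "delta": delta}

# Polyethylene repeat unit: two CH2 groups, M ≈ 28.05 g/mol.
pe = tier1_estimate({"CH2": 2}, molar_mass=28.05)
print(pe)
```

With these assumed CH2 parameters, the T_g estimate for PE comes out near 192.5 K.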

Tier 2 — ML Ensemble (RF + GB)

Random Forest (500 trees) + Gradient Boosting (200 estimators) ensemble trained on 7,365 polymers from the PolyMetriX dataset (curated experimental T_g values) with ECFP4 Morgan fingerprints (radius=2, 2048-bit). Murcko scaffold split (80/10/10).

Split	N	R²	MAE (K)	RMSE (K)
Validation	737	0.738	34.4	45.6
Test	736	0.553	43.6	60.3
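For readers who want to reproduce the columns of this table on their own predictions, the sketch below computes R², MAE, and RMSE using their standard definitions. The function name `regression_metrics` and the three sample values are hypothetical; whether POLY-X computes its metrics with this exact code is not stated.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R², MAE, RMSE) for paired lists of true and predicted T_g in K."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse

# Hypothetical values, for illustration only.
r2, mae, rmse = regression_metrics([300.0, 350.0, 400.0], [310.0, 340.0, 400.0])
print(round(r2, 3), round(mae, 2), round(rmse, 2))
```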

Strength: Captures non-additive effects (absolute error vs. literature T_g: PMMA 27 K, Nylon-6 2 K). Uncertainty: |RF − GB| / 2, i.e. half the spread between the two model predictions.
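The uncertainty estimate can be sketched in a few lines. The equal-weight average used for the ensemble value is an assumption (the weighting is not specified here), and `ensemble_prediction` is a hypothetical helper name.

```python
def ensemble_prediction(rf_pred, gb_pred):
    """Combine two model outputs (in K) into a prediction and an uncertainty.

    The mean assumes equal weights for RF and GB; the uncertainty is
    |RF - GB| / 2, so mean ± uncertainty spans both individual predictions.
    """
    mean = 0.5 * (rf_pred + gb_pred)
    uncertainty = abs(rf_pred - gb_pred) / 2.0
    return mean, uncertainty

# Hypothetical model outputs, for illustration only.
t_g, u = ensemble_prediction(350.0, 360.0)
print(t_g, u)  # 355.0 5.0
```

Note that this spread reflects only disagreement between the two models; it can be small even when both are wrong in the same direction.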

Tier 3 — polyBERT Embeddings

Pre-trained polymer language model (kuelumbus/polyBERT, DeBERTa architecture, ~86M parameters), trained on ~100M polymer SMILES. Its 600-dimensional CLS token embeddings are passed to lightweight RF/GB prediction heads trained on the same PolyMetriX data.

Head	R²	MAE (K)	RMSE (K)
Random Forest (300 trees)	0.812	36.4	48.8
Gradient Boosting (200 est.)	0.844	32.8	44.5
Ensemble	0.837	33.8	45.4

Strength: Best generalization to diverse chemistries (R²=0.837 vs 0.553 for fingerprints). The transformer captures long-range structural patterns and polymer-specific semantics learned during pre-training.

Tier Complementarity

No single tier dominates across all polymer chemistries. The multi-tier design provides complementary predictions:

Polymer	Lit. T_g (K)	GC (K)	ML (K)	polyBERT (K)	Best
PE	195	192.5	295.2	295.2	GC
PP	253	251.9	292.3	311.0	GC
PS	373	376.4	360.2	366.4	GC
PMMA	378	301.6	351.2	330.3	ML
PET	342	335.1	355.8	371.5	GC
Nylon-6	323	409.2	324.9	279.0	ML
PVC	354	358.4	308.0	340.5	GC
PTFE	160	214.0	344.3	300.1	GC
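The "Best" column can be reproduced by picking, for each polymer, the tier with the smallest absolute deviation from the literature value. The sketch below does this with the numbers from the table; the dictionary layout and the name `best_tier` are choices made for this example.

```python
# (literature T_g in K, {tier: predicted T_g in K}), copied from the table above.
TG_TABLE = {
    "PE":      (195, {"GC": 192.5, "ML": 295.2, "polyBERT": 295.2}),
    "PP":      (253, {"GC": 251.9, "ML": 292.3, "polyBERT": 311.0}),
    "PS":      (373, {"GC": 376.4, "ML": 360.2, "polyBERT": 366.4}),
    "PMMA":    (378, {"GC": 301.6, "ML": 351.2, "polyBERT": 330.3}),
    "PET":     (342, {"GC": 335.1, "ML": 355.8, "polyBERT": 371.5}),
    "Nylon-6": (323, {"GC": 409.2, "ML": 324.9, "polyBERT": 279.0}),
    "PVC":     (354, {"GC": 358.4, "ML": 308.0, "polyBERT": 340.5}),
    "PTFE":    (160, {"GC": 214.0, "ML": 344.3, "polyBERT": 300.1}),
}

def best_tier(literature, predictions):
    """Return the tier whose prediction is closest to the literature value."""
    return min(predictions, key=lambda tier: abs(predictions[tier] - literature))

best = {name: best_tier(lit, preds) for name, (lit, preds) in TG_TABLE.items()}
for name, tier in best.items():
    print(f"{name}: {tier}")
```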

Properties Predicted

Property	Symbol	Unit	Available Tiers
Glass Transition Temperature	T_g	K	All three tiers
Melting Temperature	T_m	K	Tier 1, Tier 2
Decomposition Temperature	T_d	K	Tier 2
Density	ρ	g/cm³	Tier 1, Tier 2
Solubility Parameter	δ	MPa^0.5	Tier 1, Tier 2
Cohesive Energy Density	CED	J/cm³	Tier 1

Enhanced Analytical Features

Group Contribution Analysis: Per-group property contribution breakdown showing which functional groups drive T_g up or down, with stiffness/flexibility assessment.
Applicability Domain (AD): Tanimoto similarity to 5,892 training polymers. Classification: In-domain (>0.3), Borderline (0.15–0.3), Out-of-domain (<0.15).
Prediction Reliability Index (PRI): Composite score from AD (35%), ensemble uncertainty (35%), and model R² (30%). Categories: High (>0.7), Moderate (0.4–0.7), Low (0.2–0.4), Unreliable (<0.2).
Comparative Profiling: Tanimoto similarity comparison against 20 commercial reference polymers (PE, PP, PS, PET, PVC, PMMA, PTFE, PC, PLA, PEEK, PI, Nylon-6, Nylon-66, PU, PDMS, POM, PBT, PVDF, PPS, PSU).
Processability Assessment: Evaluates T_m–T_g processing window (>50 K ideal), T_d–T_m thermal stability margin (>30 K ideal), and chain flexibility. Score 0–100 with recommended methods (injection molding, extrusion, film casting, thermoforming, 3D printing).
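The applicability domain and PRI thresholds listed above can be sketched as follows. Several details are assumptions made for this example: fingerprints are represented as sets of on-bit indices, the AD score is taken as the maximum Tanimoto similarity to any training polymer, the three PRI components are assumed to be already scaled to [0, 1], and values falling exactly on a threshold are assigned to the middle category. All function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def applicability_domain(query_bits, training_bits):
    """Classify a query by its nearest-neighbour similarity (assumed: maximum)."""
    sim = max(tanimoto(query_bits, t) for t in training_bits)
    if sim > 0.3:
        label = "In-domain"
    elif sim >= 0.15:
        label = "Borderline"
    else:
        label = "Out-of-domain"
    return sim, label

def reliability_index(ad_score, uncertainty_score, r2_score):
    """Weighted composite: AD 35%, ensemble uncertainty 35%, model R² 30%.

    All three inputs are assumed to lie in [0, 1], higher meaning more reliable.
    """
    pri = 0.35 * ad_score + 0.35 * uncertainty_score + 0.30 * r2_score
    if pri > 0.7:
        category = "High"
    elif pri >= 0.4:
        category = "Moderate"
    elif pri >= 0.2:
        category = "Low"
    else:
        category = "Unreliable"
    return pri, category

# Toy fingerprints and scores, for illustration only.
sim, ad_label = applicability_domain({1, 2, 3, 4}, [{1, 2, 3, 9}, {7, 8}])
pri, pri_label = reliability_index(sim, 0.8, 0.837)
print(sim, ad_label, round(pri, 4), pri_label)
```

How the ensemble uncertainty in kelvin is mapped onto a [0, 1] score is not specified above, so that mapping is left to the caller here.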

PSMILES Input Format

POLY-X uses PSMILES (Polymer SMILES) notation, where [*] marks the two endpoints of the polymer repeat unit:

Polymer	PSMILES
Polyethylene (PE)	`[*]CC[*]`
Polypropylene (PP)	`[*]CC(C)[*]`
Polystyrene (PS)	`[*]CC(c1ccccc1)[*]`
PET	`[*]CCOC(=O)c1ccc(C(=O)O[*])cc1`
Nylon-6	`[*]CCCCCC(=O)N[*]`
PMMA	`[*]CC(C)(C(=O)OC)[*]`
PDMS	`[*]O[Si](C)(C)[*]`
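A quick sanity check before submitting an input is to confirm that the string contains exactly two [*] endpoint markers. The sketch below is only a syntactic check under that rule; it does not verify that the rest of the string is chemically valid SMILES, and `has_two_endpoints` is a hypothetical helper, not part of POLY-X.

```python
def has_two_endpoints(psmiles):
    """Return True if the string contains exactly two [*] endpoint markers."""
    return psmiles.count("[*]") == 2

for candidate in ["[*]CC[*]", "[*]CC(C)[*]", "CC", "[*]CC"]:
    print(candidate, has_two_endpoints(candidate))
```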

Training Data

ML models (Tier 2 & Tier 3 heads) were trained on experimental T_g values from the PolyMetriX dataset:

Source: PolyMetriX — A Standardized Framework for Polymer Informatics (2025)
DOI: 10.5281/zenodo.14980914
Size: 7,365 polymers after cleaning (NaN removal, RDKit validation, 3σ outlier removal, SMILES deduplication)
T_g range: 134.2–768.1 K (mean 417.1 ± 112.7 K)
Split: Murcko scaffold split (train 5,892 / val 737 / test 736)
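A scaffold split keeps all polymers that share a scaffold in the same subset, so the validation and test sets contain scaffolds unseen during training. The sketch below shows one common greedy variant (largest scaffold groups assigned first); it is an assumption that POLY-X uses this exact assignment order. Scaffold strings are taken as precomputed inputs, and `scaffold_split` is a hypothetical name.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_val=0.1):
    """Split sample indices so that no scaffold spans two subsets.

    scaffolds: one scaffold identifier per sample (assumed precomputed).
    Returns (train, val, test) lists of indices.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train, val, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(val) + len(group) <= frac_val * n:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test

# Toy scaffold labels, for illustration only.
labels = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train, val, test = scaffold_split(labels)
print(len(train), len(val), len(test))  # 8 1 1
```

Because whole groups are assigned at once, the realized fractions only approximate 80/10/10, which is consistent with the slightly uneven 737/736 validation and test sizes reported above.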

References

Van Krevelen, D.W. & te Nijenhuis, K. Properties of Polymers, 4th Ed., Elsevier, 2009.
Kuenneth, C. & Ramprasad, R. polyBERT: a chemical language model to enable fully machine-driven ultrafast polymer informatics. Nature Communications, 14, 4099 (2023).
Gurnani, R. et al. PolyMetriX: A standardized framework for polymer informatics. npj Computational Materials (2025).
Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model., 50, 742–754 (2010).
