About POLY-X
POLY-X is a multi-tier polymer property prediction platform that predicts key thermal and physical properties from polymer repeat unit SMILES (PSMILES) notation. It integrates three complementary prediction engines with five enhanced analytical features for publication-grade polymer informatics.
Tier 1 — Van Krevelen Group Contribution
Classical additive group contribution method based on Van Krevelen & te Nijenhuis (2009). The repeat unit is decomposed into functional groups using a priority-based SMARTS matching system (composite > ring > simple groups). Property values are computed additively:
- Tg = 1000 × Σ(Yg,i × ni) / M [K], with Yg,i in K·kg/mol and M in g/mol
- CED = Σ(Ecoh,i × ni) / Σ(Vw,i × ni) [J/cm³]
- δ = √CED [MPa^0.5]
- ρ = 0.634 × M / Vw [g/cm³] (molar volume taken as Vw / 0.634)
Strength: Excellent for simple polymers, with absolute errors against literature Tg values of 2.5 K for PE, 1.1 K for PP, and 3.4 K for PS. Limitation: Underperforms where non-additive effects (steric hindrance, H-bonding) dominate.
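The additive Tg formula can be sketched in a few lines. This is an illustration only, not the POLY-X implementation: the `GROUPS` table and the function name are hypothetical, and the single contribution value (Yg = 2.7 K·kg/mol for -CH2-) is an assumed value, chosen because it reproduces the 192.5 K that the group contribution tier reports for PE.

```python
# Illustrative contribution table; only -CH2- is filled in.
# Yg: molar glass transition function in K*kg/mol (assumed value).
# M:  molar mass of the group in g/mol.
GROUPS = {
    "CH2": {"Yg": 2.7, "M": 14.027},
}


def tg_group_contribution(counts):
    """Tg = 1000 * sum(Yg_i * n_i) / M for a repeat unit given as {group: count}."""
    yg_total = sum(GROUPS[g]["Yg"] * n for g, n in counts.items())
    molar_mass = sum(GROUPS[g]["M"] * n for g, n in counts.items())
    return 1000.0 * yg_total / molar_mass


# Polyethylene repeat unit [*]CC[*] contains two -CH2- groups.
tg_pe = tg_group_contribution({"CH2": 2})
print(f"PE Tg = {tg_pe:.1f} K")
```

A real decomposition would first map the PSMILES string to group counts through SMARTS matching. Here the counts are supplied by hand.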
Tier 2 — ML Ensemble (RF + GB)
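Random Forest (500 trees) + Gradient Boosting (200 estimators) ensemble built on 7,365 polymers from the PolyMetriX dataset (curated experimental Tg values), featurized as ECFP4 Morgan fingerprints (radius = 2, 2048 bits). The data are divided by Murcko scaffold into 80/10/10 train/validation/test sets (5,892 / 737 / 736 polymers), so the models are fitted on the 5,892-polymer training set.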
| Split | N | R² | MAE (K) | RMSE (K) |
|---|---|---|---|---|
| Validation | 737 | 0.738 | 34.4 | 45.6 |
| Test | 736 | 0.553 | 43.6 | 60.3 |
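Strength: Captures non-additive effects, with absolute errors against literature Tg values of about 27 K for PMMA and 2 K for Nylon-6. Uncertainty: reported as half the absolute difference between the two model predictions, |RF − GB| / 2.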
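The uncertainty estimate is simple enough to show directly. The sketch below assumes the ensemble prediction is the unweighted mean of the two models, which the text does not state. The function name and input values are hypothetical.

```python
def ensemble_tg(rf_pred, gb_pred):
    """Combine two Tg predictions (K) into (ensemble value, uncertainty)."""
    mean = (rf_pred + gb_pred) / 2.0            # assumption: unweighted average
    uncertainty = abs(rf_pred - gb_pred) / 2.0  # |RF - GB| / 2, as stated in the text
    return mean, uncertainty


tg, sigma = ensemble_tg(350.0, 360.0)
print(f"Tg = {tg:.1f} K, uncertainty = {sigma:.1f} K")
```

Because the estimate depends only on the disagreement between two models, it reads as zero whenever both models are wrong in the same direction.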
Tier 3 — polyBERT Embeddings
Polymer language model (kuelumbus/polyBERT, DeBERTa architecture, ~86M parameters) pre-trained on ~100M polymer SMILES. POLY-X extracts 600-dimensional CLS token embeddings from it and applies lightweight RF/GB prediction heads trained on the same PolyMetriX data.
| Head | R² | MAE (K) | RMSE (K) |
|---|---|---|---|
| Random Forest (300 trees) | 0.812 | 36.4 | 48.8 |
| Gradient Boosting (200 est.) | 0.844 | 32.8 | 44.5 |
| Ensemble | 0.837 | 33.8 | 45.4 |
Strength: Best aggregate fit across diverse chemistries (ensemble R² = 0.837, versus 0.553 for the Tier 2 fingerprint model on its test split). The transformer captures long-range structural patterns and polymer-specific semantics learned during pre-training.
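The R², MAE, and RMSE columns reported for Tier 2 and Tier 3 are assumed here to follow the standard definitions. A minimal pure-Python sketch is given below. The function name and toy values are hypothetical and are not taken from the PolyMetriX data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, and RMSE for paired true/predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Toy Tg values in K, for illustration only.
metrics = regression_metrics([300.0, 350.0, 400.0], [310.0, 350.0, 390.0])
print(metrics)
```

R² depends on the spread of the true values in the evaluated split, so figures from different splits are not directly comparable even when MAE is similar.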
No single tier dominates across all polymer chemistries. The multi-tier design provides complementary predictions:
| Polymer | Lit. Tg (K) | GC (K) | ML (K) | polyBERT (K) | Best |
|---|---|---|---|---|---|
| PE | 195 | 192.5 | 295.2 | 295.2 | GC |
| PP | 253 | 251.9 | 292.3 | 311.0 | GC |
| PS | 373 | 376.4 | 360.2 | 366.4 | GC |
| PMMA | 378 | 301.6 | 351.2 | 330.3 | ML |
| PET | 342 | 335.1 | 355.8 | 371.5 | GC |
| Nylon-6 | 323 | 409.2 | 324.9 | 279.0 | ML |
| PVC | 354 | 358.4 | 308.0 | 340.5 | GC |
| PTFE | 160 | 214.0 | 344.3 | 300.1 | GC |
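The "Best" column can be checked by taking, for each polymer, the tier with the smallest absolute deviation from the literature value. The sketch below copies the numbers from the table above. The `ROWS` structure and the function name are hypothetical.

```python
# polymer: (literature Tg in K, {tier: predicted Tg in K}), copied from the table.
ROWS = {
    "PE":      (195, {"GC": 192.5, "ML": 295.2, "polyBERT": 295.2}),
    "PP":      (253, {"GC": 251.9, "ML": 292.3, "polyBERT": 311.0}),
    "PS":      (373, {"GC": 376.4, "ML": 360.2, "polyBERT": 366.4}),
    "PMMA":    (378, {"GC": 301.6, "ML": 351.2, "polyBERT": 330.3}),
    "PET":     (342, {"GC": 335.1, "ML": 355.8, "polyBERT": 371.5}),
    "Nylon-6": (323, {"GC": 409.2, "ML": 324.9, "polyBERT": 279.0}),
    "PVC":     (354, {"GC": 358.4, "ML": 308.0, "polyBERT": 340.5}),
    "PTFE":    (160, {"GC": 214.0, "ML": 344.3, "polyBERT": 300.1}),
}


def best_tier(lit_tg, predictions):
    """Return (tier, absolute error in K) for the prediction closest to literature."""
    tier = min(predictions, key=lambda t: abs(predictions[t] - lit_tg))
    return tier, abs(predictions[tier] - lit_tg)


winners = {}
for name, (lit, preds) in ROWS.items():
    tier, err = best_tier(lit, preds)
    winners[name] = tier
    print(f"{name:8s} best={tier:9s} |error|={err:.1f} K")
```

On these eight polymers the group contribution tier is closest six times and the fingerprint ML tier twice, which matches the table.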
The table below lists the properties POLY-X predicts and the tiers that provide each one:
| Property | Symbol | Unit | Available Tiers |
|---|---|---|---|
| Glass Transition Temperature | Tg | K | All three tiers |
| Melting Temperature | Tm | K | Tier 1, Tier 2 |
| Decomposition Temperature | Td | K | Tier 2 |
| Density | ρ | g/cm³ | Tier 1, Tier 2 |
| Solubility Parameter | δ | MPa^0.5 | Tier 1, Tier 2 |
| Cohesive Energy Density | CED | J/cm³ | Tier 1 |
The five enhanced analytical features are:
- Group Contribution Analysis: Per-group property contribution breakdown showing which functional groups drive Tg up or down, with stiffness/flexibility assessment.
- Applicability Domain (AD): Tanimoto similarity to 5,892 training polymers. Classification: In-domain (>0.3), Borderline (0.15–0.3), Out-of-domain (<0.15).
- Prediction Reliability Index (PRI): Composite score from AD (35%), ensemble uncertainty (35%), and model R² (30%). Categories: High (>0.7), Moderate (0.4–0.7), Low (0.2–0.4), Unreliable (<0.2).
- Comparative Profiling: Tanimoto similarity comparison against 20 commercial reference polymers (PE, PP, PS, PET, PVC, PMMA, PTFE, PC, PLA, PEEK, PI, Nylon-6, Nylon-66, PU, PDMS, POM, PBT, PVDF, PPS, PSU).
- Processability Assessment: Evaluates the Tm − Tg processing window (>50 K ideal), the Td − Tm thermal stability margin (>30 K ideal), and chain flexibility. Reports a score from 0 to 100 together with recommended methods (injection molding, extrusion, film casting, thermoforming, 3D printing).
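The thresholds and weights quoted for the applicability domain, the PRI, and the processing windows can be written down as a small sketch. Several points are assumptions rather than documented behavior: the domain check uses the nearest-neighbor (maximum) Tanimoto similarity, the three PRI inputs are already scaled to [0, 1], and values exactly on a boundary fall into the middle category. All function names and example values are hypothetical, and the 0 to 100 processability score is not reproduced because its formula is not given.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify_ad(similarity):
    """Applicability domain class from a Tanimoto similarity."""
    if similarity > 0.3:
        return "In-domain"
    if similarity >= 0.15:
        return "Borderline"
    return "Out-of-domain"


def pri(ad_score, uncertainty_score, r2_score):
    """Weighted composite (35% / 35% / 30%); inputs assumed pre-scaled to [0, 1]."""
    return 0.35 * ad_score + 0.35 * uncertainty_score + 0.30 * r2_score


def classify_pri(value):
    """Reliability category from a PRI value."""
    if value > 0.7:
        return "High"
    if value >= 0.4:
        return "Moderate"
    if value >= 0.2:
        return "Low"
    return "Unreliable"


def processing_windows(tg, tm, td):
    """Tm - Tg and Td - Tm margins (K) with their 'ideal' threshold checks."""
    window, margin = tm - tg, td - tm
    return {
        "tm_minus_tg": window, "window_ok": window > 50.0,
        "td_minus_tm": margin, "margin_ok": margin > 30.0,
    }


# Toy fingerprints: the query against two hypothetical training polymers.
query = {1, 2, 3, 4}
training = [{3, 4, 5, 6}, {7, 8, 9}]
nearest = max(tanimoto(query, fp) for fp in training)  # assumption: max similarity
print(classify_ad(nearest), classify_pri(pri(0.5, 0.8, 0.553)))
print(processing_windows(tg=350.0, tm=420.0, td=440.0))
```

In the last call the 70 K processing window passes its threshold while the 20 K stability margin does not, which is the kind of case the processability assessment is meant to flag.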
POLY-X uses PSMILES (Polymer SMILES) notation, where [*] marks the two endpoints of the polymer repeat unit:
| Polymer | PSMILES |
|---|---|
| Polyethylene (PE) | [*]CC[*] |
| Polypropylene (PP) | [*]CC(C)[*] |
| Polystyrene (PS) | [*]CC(c1ccccc1)[*] |
| PET | [*]CCOC(=O)c1ccc(C(=O)O[*])cc1 |
| Nylon-6 | [*]CCCCCC(=O)N[*] |
| PMMA | [*]CC(C)(C(=O)OC)[*] |
| PDMS | [*]O[Si](C)(C)[*] |
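A cheap sanity check on an input string follows from the notation itself: a repeat unit should contain exactly two [*] markers. The helper below is a hypothetical sketch of that check only. It does not verify chemical validity, which would still need a cheminformatics toolkit such as RDKit.

```python
def has_two_endpoints(psmiles):
    """True if the string has exactly two [*] markers and some content between them."""
    body = psmiles.replace("[*]", "")
    return psmiles.count("[*]") == 2 and len(body) > 0


examples = {
    "PE": "[*]CC[*]",
    "PET": "[*]CCOC(=O)c1ccc(C(=O)O[*])cc1",
    "PDMS": "[*]O[Si](C)(C)[*]",
    "missing endpoint": "[*]CC",
}
for name, s in examples.items():
    print(f"{name:18s} {has_two_endpoints(s)}")
```

The PET example shows that an endpoint may sit inside a branch rather than at the end of the string, so checking only the first and last characters would reject valid input.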
ML models (Tier 2 & Tier 3 heads) were trained on experimental Tg values from the PolyMetriX dataset:
- Source: PolyMetriX — A Standardized Framework for Polymer Informatics (2025)
- DOI: 10.5281/zenodo.14980914
- Size: 7,365 polymers after cleaning (NaN removal, RDKit validation, 3σ outlier removal, SMILES deduplication)
- Tg range: 134.2–768.1 K (mean 417.1 ± 112.7 K)
- Split: Murcko scaffold split (train 5,892 / val 737 / test 736)
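A scaffold split keeps every polymer that shares a scaffold in the same subset, so validation and test sets contain scaffolds the model has not seen. The sketch below shows one common greedy assignment (largest scaffold groups first). It assumes the scaffold keys have already been computed elsewhere, for example as Murcko scaffold SMILES, and it is not necessarily the procedure used to produce the 5,892 / 737 / 736 split. The function name and toy keys are hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_val=0.1):
    """Split sample indices so that no scaffold group spans two subsets.

    scaffolds: one scaffold key per sample, computed beforehand.
    Returns three lists of indices: train, validation, test.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Largest groups first, so common scaffolds land in the training set.
    ordered = sorted(groups.values(), key=len, reverse=True)

    n = len(scaffolds)
    train_cap = frac_train * n
    val_cap = (frac_train + frac_val) * n
    train, val, test = [], [], []
    for members in ordered:
        if len(train) + len(members) <= train_cap:
            train.extend(members)
        elif len(train) + len(val) + len(members) <= val_cap:
            val.extend(members)
        else:
            test.extend(members)
    return train, val, test


# Toy scaffold keys for ten samples.
keys = ["A"] * 5 + ["B"] * 3 + ["C"] + ["D"]
train, val, test = scaffold_split(keys)
print(len(train), len(val), len(test))
```

Because whole groups are assigned at once, the realized fractions only approximate 80/10/10 when a few scaffolds are very large.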
References
- Van Krevelen, D.W. & te Nijenhuis, K. Properties of Polymers, 4th Ed., Elsevier, 2009.
- Kuenneth, C. & Ramprasad, R. polyBERT: a chemical language model to enable fully machine-driven ultrafast polymer informatics. Nature Communications, 14, 4099 (2023).
- Gurnani, R. et al. PolyMetriX: A standardized framework for polymer informatics. npj Computational Materials (2025).
- Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model., 50, 742–754 (2010).