InsilicoΣ
Drug Discovery, Cheminformatics & Bioinformatics

About POLY-X

Overview

POLY-X is a multi-tier polymer property prediction platform that predicts key thermal and physical properties from polymer repeat unit SMILES (PSMILES) notation. It integrates three complementary prediction engines with five enhanced analytical features for publication-grade polymer informatics.

Three-Tier Prediction Architecture
Tier 1 — Van Krevelen Group Contribution

Classical additive group contribution method based on Van Krevelen & te Nijenhuis (2009). The repeat unit is decomposed into functional groups using a priority-based SMARTS matching system (composite > ring > simple groups). Property values are computed additively:

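  • Tg = 1000 × Σ(Y_g,i × n_i) / M [K]
  • CED = Σ(E_coh,i × n_i) / Σ(V_w,i × n_i) [J/cm³]
  • δ = √CED [MPa^0.5]
  • ρ = M / (V_w × 0.634) [g/cm³]

Here n_i is the number of occurrences of group i in the repeat unit and M is the repeat-unit molar mass in g/mol.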

Strength: Excellent for simple polymers (absolute Tg errors of 2.5 K for PE, 1.1 K for PP, and 3.4 K for PS against literature values). Limitation: Underperforms where non-additive effects (steric hindrance, H-bonding) dominate.
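
The Tg and δ formulas above can be checked by hand for polyethylene. The sketch below is illustrative only: the group value Y_g(CH2) = 2.7 K·kg/mol is an assumption, chosen because it reproduces the 192.5 K group contribution value listed for PE in the tier comparison table, and the CED value is hypothetical. The function names are not part of POLY-X.

```python
import math

def tg_group_contribution(groups, molar_mass):
    """Tg = 1000 * sum(Yg_i * n_i) / M, with Yg in K*kg/mol and M in g/mol."""
    total = sum(yg * n for yg, n in groups)
    return 1000.0 * total / molar_mass

def solubility_parameter(ced):
    """delta = sqrt(CED); CED in J/cm^3 gives delta in MPa^0.5."""
    return math.sqrt(ced)

# Polyethylene repeat unit [*]CC[*]: two CH2 groups.
# Assumed group value: Yg(CH2) = 2.7 K*kg/mol (not taken from the platform).
pe_groups = [(2.7, 2)]
pe_molar_mass = 28.05  # g/mol for C2H4

tg_pe = tg_group_contribution(pe_groups, pe_molar_mass)
print(round(tg_pe, 1))  # 192.5

# Hypothetical CED value, used only to show the unit conversion.
delta_example = solubility_parameter(256.0)
print(delta_example)  # 16.0
```

Because 1 J/cm³ equals 1 MPa, the square root of CED needs no extra conversion factor.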


Tier 2 — ML Ensemble (RF + GB)

Random Forest (500 trees) + Gradient Boosting (200 estimators) ensemble built on the 7,365-polymer PolyMetriX dataset (curated experimental Tg values), with each polymer encoded as an ECFP4 Morgan fingerprint (radius=2, 2048-bit). The data are divided by a Murcko scaffold split (80/10/10) into training, validation, and test sets.

Split        N     R²      MAE (K)   RMSE (K)
Validation   737   0.738   34.4      45.6
Test         736   0.553   43.6      60.3

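Strength: Captures non-additive effects (absolute Tg errors of about 27 K for PMMA and 2 K for Nylon-6, versus 76 K and 86 K for Tier 1). Uncertainty: half the absolute difference between the two model predictions, |RF − GB| / 2.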
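
A minimal sketch of the uncertainty estimate follows. The formula |RF − GB| / 2 is from the text; combining the two models as an unweighted mean is an assumption, since the page does not say how the ensemble prediction is formed. The function name and input values are hypothetical.

```python
def ensemble_tg(rf_pred, gb_pred):
    """Combine two Tg predictions (K) and report their disagreement."""
    # Assumption: unweighted mean of the two models.
    mean = (rf_pred + gb_pred) / 2.0
    # From the text: uncertainty is half the absolute RF/GB difference.
    uncertainty = abs(rf_pred - gb_pred) / 2.0
    return mean, uncertainty

# Hypothetical predictions for one polymer, in kelvin.
tg_mean, tg_unc = ensemble_tg(350.0, 360.0)
print(tg_mean, tg_unc)  # 355.0 5.0
```

With only two models this quantity measures disagreement rather than a calibrated error bar, so it is best read as a relative indicator.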


Tier 3 — polyBERT Embeddings

Polymer language model (kuelumbus/polyBERT, DeBERTa architecture, ~86M parameters) pre-trained on ~100M polymer SMILES. POLY-X extracts 600-dimensional CLS token embeddings from it, then applies lightweight RF/GB prediction heads trained on the same PolyMetriX data.

Head                           R²      MAE (K)   RMSE (K)
Random Forest (300 trees)      0.812   36.4      48.8
Gradient Boosting (200 est.)   0.844   32.8      44.5
Ensemble                       0.837   33.8      45.4

Strength: Best generalization to diverse chemistries (R²=0.837 vs 0.553 for fingerprints). The transformer captures long-range structural patterns and polymer-specific semantics learned during pre-training.

Tier Complementarity

No single tier dominates across all polymer chemistries. The multi-tier design provides complementary predictions:

Polymer   Lit. Tg (K)   GC (K)   ML (K)   polyBERT (K)   Best
PE        195           192.5    295.2    295.2          GC
PP        253           251.9    292.3    311.0          GC
PS        373           376.4    360.2    366.4          GC
PMMA      378           301.6    351.2    330.3          ML
PET       342           335.1    355.8    371.5          GC
Nylon-6   323           409.2    324.9    279.0          ML
PVC       354           358.4    308.0    340.5          GC
PTFE      160           214.0    344.3    300.1          GC
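
The "Best" column can be reproduced from the other columns by taking, for each polymer, the tier with the smallest absolute deviation from the literature Tg. The sketch below does this with the values from the table above; the variable and function names are illustrative and not part of POLY-X.

```python
# Tg values in K, copied from the comparison table.
# Columns: literature, Tier 1 (GC), Tier 2 (ML), Tier 3 (polyBERT).
TG_TABLE = {
    "PE": (195, 192.5, 295.2, 295.2),
    "PP": (253, 251.9, 292.3, 311.0),
    "PS": (373, 376.4, 360.2, 366.4),
    "PMMA": (378, 301.6, 351.2, 330.3),
    "PET": (342, 335.1, 355.8, 371.5),
    "Nylon-6": (323, 409.2, 324.9, 279.0),
    "PVC": (354, 358.4, 308.0, 340.5),
    "PTFE": (160, 214.0, 344.3, 300.1),
}
TIERS = ("GC", "ML", "polyBERT")

def best_tier(row):
    """Return the tier with the smallest absolute error, plus all errors."""
    lit, *preds = row
    errors = [abs(p - lit) for p in preds]
    return TIERS[errors.index(min(errors))], errors

best = {name: best_tier(row)[0] for name, row in TG_TABLE.items()}
for name, tier in best.items():
    print(name, tier)
```

On these eight polymers the group contribution tier is closest for six and the fingerprint ML tier for two.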

Properties Predicted

Property                       Symbol   Unit      Available Tiers
Glass Transition Temperature   Tg       K         All three tiers
Melting Temperature            Tm       K         Tier 1, Tier 2
Decomposition Temperature      Td       K         Tier 2
Density                        ρ        g/cm³     Tier 1, Tier 2
Solubility Parameter           δ        MPa^0.5   Tier 1, Tier 2
Cohesive Energy Density        CED      J/cm³     Tier 1

Enhanced Analytical Features

  1. Group Contribution Analysis: Per-group property contribution breakdown showing which functional groups drive Tg up or down, with stiffness/flexibility assessment.
  2. Applicability Domain (AD): Tanimoto similarity to 5,892 training polymers. Classification: In-domain (>0.3), Borderline (0.15–0.3), Out-of-domain (<0.15).
  3. Prediction Reliability Index (PRI): Composite score from AD (35%), ensemble uncertainty (35%), and model R² (30%). Categories: High (>0.7), Moderate (0.4–0.7), Low (0.2–0.4), Unreliable (<0.2).
  4. Comparative Profiling: Tanimoto similarity comparison against 20 commercial reference polymers (PE, PP, PS, PET, PVC, PMMA, PTFE, PC, PLA, PEEK, PI, Nylon-6, Nylon-66, PU, PDMS, POM, PBT, PVDF, PPS, PSU).
  5. Processability Assessment: Evaluates Tm–Tg processing window (>50 K ideal), Td–Tm thermal stability margin (>30 K ideal), and chain flexibility. Score 0–100 with recommended methods (injection molding, extrusion, film casting, thermoforming, 3D printing).
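
The thresholds in items 2, 3, and 5 can be sketched as follows. The cut-offs and weights are taken from the list above; everything else is an assumption made for illustration: fingerprints are represented as sets of on-bit indices, the applicability domain uses the maximum similarity to any training polymer, and the three PRI inputs are assumed to be already scaled to [0, 1] with higher meaning more reliable. The page does not specify these details, and the function names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0

def classify_ad(max_similarity):
    """Applicability domain label from the stated thresholds."""
    if max_similarity > 0.3:
        return "In-domain"
    if max_similarity >= 0.15:
        return "Borderline"
    return "Out-of-domain"

def pri_score(ad_score, uncertainty_score, r2_score):
    """Weighted composite: AD 35%, ensemble uncertainty 35%, model R2 30%.

    Assumption: all three inputs are pre-scaled to [0, 1].
    """
    return 0.35 * ad_score + 0.35 * uncertainty_score + 0.30 * r2_score

def pri_category(score):
    """Reliability label from the stated PRI bands."""
    if score > 0.7:
        return "High"
    if score >= 0.4:
        return "Moderate"
    if score >= 0.2:
        return "Low"
    return "Unreliable"

def processing_windows(tg, tm, td):
    """Check the Tm-Tg (>50 K) and Td-Tm (>30 K) margins, all in kelvin."""
    return (tm - tg) > 50.0, (td - tm) > 30.0

# Hypothetical fingerprints and temperatures, for illustration only.
sim = tanimoto({1, 2, 3}, {2, 3, 4})
print(sim, classify_ad(sim))
print(pri_category(pri_score(0.5, 0.6, 0.553)))
print(processing_windows(350.0, 520.0, 640.0))
```

How the platform treats values that fall exactly on a band edge is not stated; the sketch assigns them to the middle band.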
PSMILES Input Format

POLY-X uses PSMILES (Polymer SMILES) notation, where [*] marks the two endpoints of the polymer repeat unit:

Polymer              PSMILES
Polyethylene (PE)    [*]CC[*]
Polypropylene (PP)   [*]CC(C)[*]
Polystyrene (PS)     [*]CC(c1ccccc1)[*]
PET                  [*]CCOC(=O)c1ccc(C(=O)O[*])cc1
Nylon-6              [*]CCCCCC(=O)N[*]
PMMA                 [*]CC(C)(C(=O)OC)[*]
PDMS                 [*]O[Si](C)(C)[*]
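
A quick pre-check for the input format is to confirm that a string contains exactly two [*] endpoint markers. The sketch below tests only that; it does not parse the SMILES or validate the chemistry, and the function name is hypothetical rather than part of POLY-X.

```python
def has_two_endpoints(psmiles):
    """Return True if the string contains exactly two [*] markers."""
    return psmiles.count("[*]") == 2

# Examples taken from the PSMILES table.
examples = {
    "PE": "[*]CC[*]",
    "PET": "[*]CCOC(=O)c1ccc(C(=O)O[*])cc1",
    "PDMS": "[*]O[Si](C)(C)[*]",
}
checks = {name: has_two_endpoints(s) for name, s in examples.items()}
print(checks)

# A plain monomer SMILES without endpoints fails the check.
print(has_two_endpoints("C=C"))  # False
```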
Training Data

ML models (Tier 2 & Tier 3 heads) were trained on experimental Tg values from the PolyMetriX dataset:

  • Source: PolyMetriX — A Standardized Framework for Polymer Informatics (2025)
  • DOI: 10.5281/zenodo.14980914
  • Size: 7,365 polymers after cleaning (NaN removal, RDKit validation, 3σ outlier removal, SMILES deduplication)
  • Tg range: 134.2–768.1 K (mean 417.1 ± 112.7 K)
  • Split: Murcko scaffold split (train 5,892 / val 737 / test 736)
References
  1. Van Krevelen, D.W. & te Nijenhuis, K. Properties of Polymers, 4th Ed., Elsevier, 2009.
  2. Kuenneth, C. & Ramprasad, R. polyBERT: a chemical language model to enable fully machine-driven ultrafast polymer informatics. Nature Communications, 14, 4099 (2023).
  3. Gurnani, R. et al. PolyMetriX: A standardized framework for polymer informatics. npj Computational Materials (2025).
  4. Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model., 50, 742–754 (2010).