Domain adaptation of a SMILES chemical transformer to SELFIES with limited computational resources

Bibliographic Details
Main Authors: Obaid Khaleifah Alhmoudi, Mahmoud Aboushanab, Muhammed Thameem, Ali Elkamel, Ali A. AlHammadi
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-025-05017-w
Description
Summary: Accurate molecular property prediction requires input representations that preserve substructural details and maintain syntactic consistency. SMILES (Simplified Molecular Input Line Entry System), while widely used, does not guarantee validity and allows multiple representations for the same compound. SELFIES (Self-Referencing Embedded Strings) addresses these limitations through a robust grammar that ensures structural validity. This study investigates whether a SMILES-pretrained transformer, ChemBERTa-zinc-base-v1, can be adapted to SELFIES using domain-adaptive pretraining without modifying the tokenizer or model architecture. Approximately 700,000 SELFIES-formatted molecules from PubChem were used for adaptation, which completed within 12 h on a single NVIDIA A100 GPU. Embedding-level evaluation included t-distributed stochastic neighbor embedding (t-SNE), cosine similarity, and regression on twelve QM9 properties using frozen transformer weights. The domain-adapted model outperformed the original SMILES baseline and slightly outperformed ChemBERTa-77M-MLM across most targets, despite a 100-fold difference in pretraining scale. For downstream evaluation, the model was fine-tuned end-to-end on ESOL, FreeSolv, and Lipophilicity, achieving root mean squared error (RMSE) values of 0.944, 2.511, and 0.746, respectively. These results demonstrate that SELFIES-based adaptation offers a cost-efficient alternative for molecular property prediction, without relying on molecular descriptors, 3D features, or large-scale infrastructure.
ISSN: 2045-2322
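
The summary names two standard metrics: cosine similarity for comparing embeddings, and RMSE for regression error. The following is a minimal pure-Python sketch of both definitions, not the authors' evaluation code. The function names `rmse` and `cosine_similarity` are chosen here for illustration, and the values in the usage lines are toy numbers, not data from the article.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(u) != len(v) or len(u) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy usage with made-up numbers (illustrative only).
toy_error = rmse([0.0, 0.0], [3.0, 4.0])
toy_similarity = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

In practice the vectors would be pooled transformer embeddings and the targets would be measured property values; a numerical library would normally replace these loops.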