SELFIES and the future of molecular string representations

Author(s)
Mario Krenn, Qianxiang Ai, Senja Barthel, Nessa Carson, Angelo Frei, Nathan C. Frey, Pascal Friederich, Théophile Gaudin, Alberto Alexander Gayle, Kevin Maik Jablonka, Rafael F. Lameiro, Dominik Lemm, Alston Lo, Seyed Mohamad Moosavi, José Manuel Nápoles-Duarte, Akshat Kumar Nigam, Robert Pollice, Kohulan Rajan, Ulrich Schatzschneider, Philippe Schwaller, Marta Skreta, Berend Smit, Felix Strieth-Kalthoff, Chong Sun, Gary Tom, Guido Falk von Rudorff, Andrew Wang, Andrew D. White, Adamo Young, Rose Yu, Alán Aspuru-Guzik
Abstract

Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings—most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELF-referencing embedded string (SELFIES). SELFIES has since simplified and enabled numerous new applications in chemistry. In this perspective, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete future projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
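The abstract's claim that most symbol combinations in SMILES are invalid can be illustrated with a small sketch. It is not a SMILES parser and is not taken from the paper. The toy alphabet, the function names `passes_basic_syntax` and `fraction_passing`, and the two checks (balanced branch parentheses, paired ring-closure digit) are assumptions made for illustration. Both checks are necessary but not sufficient for validity, since valence rules and other grammar are ignored, so the real fraction of valid strings would be lower still.

```python
import random

# Hypothetical toy alphabet: a few atoms, branch parentheses,
# a double bond, and a single ring-closure digit.
ALPHABET = ["C", "N", "O", "(", ")", "=", "1"]


def passes_basic_syntax(s: str) -> bool:
    """Return True if `s` passes two necessary (not sufficient) syntax checks."""
    depth = 0          # current branch nesting depth
    ring_open = False  # whether ring-closure label "1" is currently open
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # branch closed before it was opened
                return False
        elif ch == "1":
            ring_open = not ring_open  # each "1" opens or closes the ring bond
    return depth == 0 and not ring_open


def fraction_passing(n_samples: int = 2000, length: int = 8, seed: int = 0) -> float:
    """Fraction of uniformly random strings over ALPHABET that pass the checks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        candidate = "".join(rng.choice(ALPHABET) for _ in range(length))
        hits += passes_basic_syntax(candidate)
    return hits / n_samples


if __name__ == "__main__":
    print(f"fraction passing basic checks: {fraction_passing():.3f}")
```

Even under these lenient checks, only a minority of random strings pass. A robust representation in the sense of the abstract is one in which every string of symbols decodes to a valid molecule, so no such filtering step is needed.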

Organisation(s)
Computational Materials Physics
External organisation(s)
Max-Planck-Institut für die Physik des Lichts, Fordham University, Vrije Universiteit Amsterdam, Syngenta Jealott’s Hill International Research Centre, Imperial College London, Massachusetts Institute of Technology, Karlsruher Institut für Technologie, University of Toronto, IBM Research GmbH, École polytechnique fédérale de Lausanne, Instituto Oceanográfico, Freie Universität Berlin (FU), Universidad Autónoma de Chihuahua, Stanford University, Friedrich-Schiller-Universität Jena, Julius-Maximilians-Universität Würzburg, Vector Institute for Artificial Intelligence, University of Rochester, Canadian Institute for Advanced Research, University of California, San Diego, Independent researcher
Journal
Patterns
Volume
3
No. of pages
27
Publication date
10-2022
Peer reviewed
Yes
Austrian Fields of Science 2012
103006 Chemical physics, 102019 Machine learning
Keywords
ASJC Scopus subject areas
Decision Sciences (all)
Portal url
https://ucris.univie.ac.at/portal/en/publications/selfies-and-the-future-of-molecular-string-representations(89cc77a6-b151-4d9e-8683-1e1ba88a2dd1).html