PCQM4Mv2 invalid SMILES #498

@rballeba

Dear OGB team,

I have found that one SMILES string in lsc-pcqm4mv2, 'O[Si]123O[Si]3(O1)(O2)O' (position 51128 in the dataset), is rejected as invalid by recent versions of RDKit (specifically, version 2024.09.5). The error can be reproduced as follows:

from ogb.lsc import PygPCQM4Mv2Dataset
from ogb.utils import smiles2graph

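def debug_smiles2graph(smiles_string):
    # Wrap smiles2graph so the offending SMILES is printed before the error propagates.
    try:
        return smiles2graph(smiles_string)
    except Exception as e:
        print(f"Exception occurred in smiles: {smiles_string}")
        print(e)
        raise  # re-raise with the original traceback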

mol_ds = PygPCQM4Mv2Dataset(root='../data/pcqm4mv2_invariants', smiles2graph=debug_smiles2graph)

Running the snippet above produces the following output:

Processing...
Converting SMILES strings into graphs...
  1%|▏         | 50930/3746620 [00:18<21:40, 2842.28it/s][11:26:20] Explicit valence for atom # 1 Si, 5, is greater than permitted
  1%|▏         | 51128/3746620 [00:18<22:08, 2781.28it/s]
Exception occurred in smiles: O[Si]123O[Si]3(O1)(O2)O
'NoneType' object has no attribute 'GetAtoms'
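
The valence reported in the RDKit message can be checked by hand. The following is a minimal sketch, not RDKit's algorithm: `heavy_atom_degrees` is a hypothetical helper that handles only the small SMILES subset used here (single-letter atoms, bracket atoms, single-digit ring closures, parentheses) and treats every bond as a single bond.

```python
from collections import defaultdict


def heavy_atom_degrees(smiles):
    """Count bonds per atom for a very restricted SMILES subset.

    Hypothetical helper for illustration only: supports single-letter
    atoms, bracket atoms such as [Si], single-digit ring closures and
    parentheses, and assumes every bond is a single bond.
    """
    symbols = []               # atom symbols in order of appearance
    degree = defaultdict(int)  # atom index -> number of bonds
    prev = None                # atom the next atom will bond to
    branch_stack = []          # atoms to return to when a branch closes
    open_rings = {}            # ring digit -> atom index that opened it
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in open_rings:
                # Second occurrence of the digit: close the ring bond.
                other = open_rings.pop(ch)
                degree[prev] += 1
                degree[other] += 1
            else:
                open_rings[ch] = prev
        else:
            if ch == "[":
                end = smiles.index("]", i)
                symbol = smiles[i + 1:end]
                i = end
            else:
                symbol = ch
            symbols.append(symbol)
            idx = len(symbols) - 1
            if prev is not None:
                degree[prev] += 1
                degree[idx] += 1
            prev = idx
        i += 1
    return [(sym, degree[k]) for k, sym in enumerate(symbols)]


for index, (symbol, bonds) in enumerate(
    heavy_atom_degrees("O[Si]123O[Si]3(O1)(O2)O")
):
    print(index, symbol, bonds)
```

Under these assumptions, both silicon atoms (indices 1 and 3) end up with five single bonds, which is consistent with the "Explicit valence for atom # 1 Si, 5" message above.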

If we now parse this SMILES string directly with RDKit, without the ogb package:

from rdkit import Chem
problematic_smiles = 'O[Si]123O[Si]3(O1)(O2)O'
mol_generated = Chem.MolFromSmiles(problematic_smiles)
print(mol_generated is None)

we get True: RDKit returns None for this string, i.e. it treats the SMILES as invalid.
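
The same None check could be applied to a whole list of SMILES strings to see whether other entries are affected. This is only a sketch: `find_unparseable` is a hypothetical helper, and the parser is passed in as an argument so the snippet runs without RDKit. With RDKit installed, one would presumably pass `Chem.MolFromSmiles` as `parse`.

```python
def find_unparseable(smiles_list, parse):
    """Return (index, smiles) pairs for which `parse` returns None.

    `parse` is any callable that returns None on failure, which is how
    Chem.MolFromSmiles behaves in the example above.
    """
    return [(i, s) for i, s in enumerate(smiles_list) if parse(s) is None]


# Stand-in parser for demonstration only; it is NOT RDKit and simply
# rejects the one string reported in this issue.
def fake_parse(smiles):
    if smiles == "O[Si]123O[Si]3(O1)(O2)O":
        return None
    return object()


candidates = ["CCO", "O[Si]123O[Si]3(O1)(O2)O", "c1ccccc1"]
print(find_unparseable(candidates, fake_parse))
```

Replacing `fake_parse` with `Chem.MolFromSmiles` would report every entry that the installed RDKit version rejects, together with its position in the list.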
