Open
Description
Dear OGB team,
I have found that the SMILES string 'O[Si]123O[Si]3(O1)(O2)O' (position 51128) in the lsc-pcqm4mv2 dataset is invalid according to recent versions of RDKit (specifically, version 2024.09.5). The error can be reproduced as follows:
from ogb.lsc import PygPCQM4Mv2Dataset
from ogb.utils import smiles2graph

def debug_smiles2graph(smiles_string):
    # Wrap smiles2graph so the offending SMILES is printed before the error propagates.
    try:
        return smiles2graph(smiles_string)
    except Exception as e:
        print(f"Exception occurred in smiles: {smiles_string}")
        print(e)
        raise e

mol_ds = PygPCQM4Mv2Dataset(root='../data/pcqm4mv2_invariants', smiles2graph=debug_smiles2graph)
Running the snippet above produces the following output:
Processing...
Converting SMILES strings into graphs...
1%|▏ | 50930/3746620 [00:18<21:40, 2842.28it/s][11:26:20] Explicit valence for atom # 1 Si, 5, is greater than permitted
1%|▏ | 51128/3746620 [00:18<22:08, 2781.28it/s]
Exception occurred in smiles: O[Si]123O[Si]3(O1)(O2)O
'NoneType' object has no attribute 'GetAtoms'
If we now convert this SMILES string directly with RDKit, without the ogb package:
from rdkit import Chem
problematic_smiles = 'O[Si]123O[Si]3(O1)(O2)O'
mol_generated = Chem.MolFromSmiles(problematic_smiles)
print(mol_generated is None)
we get True, i.e. RDKit returns None instead of a molecule, so it rejects the string as an invalid SMILES.
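To check whether other entries are affected as well, a scan along the following lines might help. This is only a minimal sketch: `find_invalid_smiles` is a hypothetical helper, not part of ogb or RDKit, and the converter is passed in as an argument so that any callable (for example `ogb.utils.smiles2graph`) can be plugged in. It assumes a failure shows up either as a raised exception or as a `None` return value.

```python
from typing import Callable, Iterable, List, Tuple


def find_invalid_smiles(
    smiles_iter: Iterable[str],
    convert: Callable[[str], object],
) -> List[Tuple[int, str, str]]:
    """Hypothetical helper: collect (position, smiles, error message)
    for every string that `convert` rejects, instead of stopping at the first."""
    failures = []
    for idx, smiles in enumerate(smiles_iter):
        try:
            result = convert(smiles)
        except Exception as exc:
            # The converter raised, as smiles2graph does in the traceback above.
            failures.append((idx, smiles, str(exc)))
            continue
        if result is None:
            # Some converters signal failure by returning None instead of raising.
            failures.append((idx, smiles, "converter returned None"))
    return failures
```

Calling `find_invalid_smiles(smiles_list, smiles2graph)`, with `smiles_list` holding the dataset's SMILES strings, would then list every failing position in one pass rather than aborting on the first one.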