Split QM9 by SMILES

In this example notebook, we show how to use DataSAIL to compute more challenging splits of the QM9 dataset for machine learning models. First, we import all necessary tools.

[1]:

%%capture
import deepchem as dc
from rdkit import Chem
from datasail.sail import datasail

Load the Dataset

Load the dataset from deepchem. The deepchem dataset stores the QM9 targets as y1 to y12, together with weights for every task. Therefore, we rename the targets and drop the weights. Finally, we convert all RDKit molecules to SMILES strings and drop the molecules for which this conversion fails.

[2]:

from rdkit import rdBase
blocker = rdBase.BlockLogs()

def mol2smiles(mol):
    try:
        return Chem.MolToSmiles(Chem.rdmolops.RemoveHs(mol))
    except Exception:  # conversion failed; the row is removed later via dropna
        return None

dataset = dc.molnet.load_qm9(featurizer=dc.feat.DummyFeaturizer(), splitter=None)[1][0]
df = dataset.to_dataframe()
df.rename(columns=dict([(f"y{i + 1}", task) for i, task in enumerate(dataset.tasks)] + [("X", "SMILES")]), inplace=True)
df["SMILES"] = df["SMILES"].apply(mol2smiles)
df = df.dropna(subset=["SMILES"])
df["ID"] = [f"Comp{i + 1:06d}" for i in range(len(df))]
df = df[["ID", "SMILES"] + dataset.tasks.tolist()]
df

[2]:

	ID	SMILES	mu	alpha	homo	lumo	gap	r2	zpve	cv	u0	u298	h298	g298
0	Comp000001	C	-1.695514	-5.140947	-5.551545	1.965764	4.744480	-3.370877	-2.278929	-4.317699	6.607249	6.607212	6.607212	6.607387
1	Comp000002	N	-0.560317	-5.574660	-0.343349	1.281473	1.451169	-3.414138	-2.606638	-4.354192	6.229272	6.229231	6.229231	6.229431
2	Comp000003	O	-0.402845	-5.938979	-1.769924	0.997352	1.882555	-3.447759	-3.016092	-4.429086	5.761037	5.760991	5.760991	5.761238
3	Comp000004	C#C	-1.695514	-4.785881	-1.439182	0.635197	1.355305	-3.257365	-2.843707	-3.815621	5.739750	5.739706	5.739706	5.739920
4	Comp000005	C#N	0.325228	-5.166392	-4.463681	0.004929	2.244040	-3.307999	-3.166654	-4.363256	5.360441	5.360385	5.360385	5.360611
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
132475	Comp132193	C1[C@H]2[C@@H]3[C@H]2[N@H+]2[C@@H]4C[C@]12[C@H]34	-0.706407	-0.695101	-3.100863	-0.711376	0.844033	0.469551	-1.241336	-0.050394	-0.434861	-0.434856	-0.434856	-0.434869
132476	Comp132194	C1[C@H]2[C@@H]3[C@H]4[C@H]5O[C@@]13[C@@H]2[C@H]54	-0.112342	0.703189	-0.144106	-0.931470	-0.857544	0.936507	0.150709	0.962825	0.294224	0.294260	0.294260	0.294169
132477	Comp132195	C1[N@H+]2[C@@H]3[C@H]2[C@H]2[N@@H+]4C[C@]12[C@...	0.450717	0.063607	-0.984909	-1.223594	-0.727729	0.617526	-0.169841	0.431887	-0.085259	-0.085238	-0.085238	-0.085293
132478	Comp132196	C1[C@H]2[C@@H]3[C@H]2[C@H]2[N@@H+]4C[C@]12[C@H]34	0.707701	0.382820	-0.128167	-0.003074	0.061148	0.565689	-0.168043	0.597179	-0.084153	-0.084130	-0.084130	-0.084187
132479	Comp132197	C1[N@@H+]2[C@H]3[C@@H]4[C@@H]5O[C@]13[C@H]2[C@...	0.571597	-0.166550	-0.905212	-0.403245	0.053159	0.448744	-0.496793	0.128255	-0.463497	-0.463485	-0.463485	-0.463519

132197 rows × 14 columns

Run DataSAIL

DataSAIL can split the data into an identity-based single cold split (I1e) and a cluster-based single cold split (C1e); the call below requests only C1e. We define

- the techniques as a list, for example I1e and C1e. The trailing e is important because it tells DataSAIL to split the e-data.
- the splits as a list. The values will be normalized to ratios.
- the names as a list. As with the split sizes, DataSAIL needs names to label the splits.
- the number of runs. This determines how many different splits to compute per technique.
- the solving algorithm for optimizing the final problem formulation.
- the type of the dataset in the first axis.
- the data as a mapping from IDs to SMILES strings.

For an extensive description of the arguments, please refer to the corresponding pages of the documentation.
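
To illustrate what normalizing the split sizes to ratios means, here is a minimal sketch. The helper `normalize_splits` is hypothetical and not part of DataSAIL; it only assumes that each size is divided by the sum of all sizes, so that 7, 2, 1 corresponds to 70 %, 20 %, and 10 % of the data.

```python
# Hypothetical helper for illustration only; DataSAIL performs its own
# normalization internally and this is not its implementation.
def normalize_splits(sizes):
    """Turn relative split sizes into fractions that sum to 1."""
    total = sum(sizes)
    if total <= 0:
        raise ValueError("split sizes must sum to a positive number")
    return [size / total for size in sizes]


ratios = normalize_splits([7, 2, 1])
for name, ratio in zip(["train", "val", "test"], ratios):
    print(f"{name}: {ratio:.0%}")
```

Under this assumption, passing `[70, 20, 10]` or `[0.7, 0.2, 0.1]` would describe the same split as `[7, 2, 1]`.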

Given a CSV or TSV file that stores the data as described in the documentation, the corresponding command-line call to DataSAIL would be:

$ datasail -t C1e -s 7 2 1 -n train val test -r 3 --solver SCIP --e-type M --e-data <filepath>

[3]:

%%capture
e_splits, f_splits, inter_splits = datasail(
    techniques=["C1e"],
    splits=[7, 2, 1],
    names=["train", "val", "test"],
    runs=3,
    solver="SCIP",
    e_type="M",
    e_data=dict(df[["ID", "SMILES"]].values.tolist())
)
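
The `e_data` argument above is built by turning the two DataFrame columns into a list of `[ID, SMILES]` pairs and passing that list to `dict`. The following sketch shows the same construction on three hand-written rows (taken from the table above) without pandas, so the resulting structure is easy to see. The variable `rows` is a stand-in for `df[["ID", "SMILES"]].values.tolist()`.

```python
# Stand-in for df[["ID", "SMILES"]].values.tolist(): a list of two-element lists.
rows = [
    ["Comp000001", "C"],
    ["Comp000002", "N"],
    ["Comp000003", "O"],
]

# dict() accepts any iterable of (key, value) pairs, including two-element lists.
e_data = dict(rows)

# If an ID occurred twice, the later entry would silently overwrite the earlier
# one, so checking that no entries were lost is a cheap sanity test.
ids_are_unique = len(e_data) == len(rows)

print(e_data)
print("IDs unique:", ids_are_unique)
```

Because the IDs in this notebook are generated consecutively, they are unique by construction; the check matters mainly when IDs come from an external file.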

The output

Finally, we inspect the e_splits object, which holds the assignments of the datapoints to the splits for each technique and each run. First, we print the overall structure; then we look at the first five assignments of the first C1e run.

[4]:

print(type(e_splits))
for key in e_splits.keys():
    print(f"{key} - Type: {type(e_splits[key])} - Length: {len(e_splits[key])}")
    for run in range(len(e_splits[key])):
        print(f"\tRun {run + 1} - Type: {type(e_splits[key][run])} - {len(e_splits[key][run])} assignments")
# Index the technique explicitly instead of relying on the leftover loop variable.
print("\n" + "\n".join(f"ID: {idx} - Split: {split}" for idx, split in list(e_splits["C1e"][0].items())[:5]))

<class 'dict'>
C1e - Type: <class 'list'> - Length: 3
        Run 1 - Type: <class 'dict'> - 132197 assignments
        Run 2 - Type: <class 'dict'> - 132197 assignments
        Run 3 - Type: <class 'dict'> - 132197 assignments

ID: Comp000001 - Split: val
ID: Comp000002 - Split: val
ID: Comp000003 - Split: val
ID: Comp000004 - Split: train
ID: Comp000005 - Split: train
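
Since each run is a plain dictionary from ID to split name, standard Python tools are enough to summarize it or to collect the IDs of one split. The sketch below uses a hypothetical five-entry dictionary copied from the assignments printed above, not the full result.

```python
from collections import Counter, defaultdict

# Hypothetical excerpt of one run, copied from the five assignments shown above.
assignments = {
    "Comp000001": "val",
    "Comp000002": "val",
    "Comp000003": "val",
    "Comp000004": "train",
    "Comp000005": "train",
}

# Count how many datapoints ended up in each split.
split_sizes = Counter(assignments.values())

# Group the IDs by the split they were assigned to.
ids_per_split = defaultdict(list)
for idx, split in assignments.items():
    ids_per_split[split].append(idx)

print(dict(split_sizes))
print(ids_per_split["train"])
```

With the real output, `assignments` would be `e_splits["C1e"][0]`, and the counts can be compared against the requested 7:2:1 ratio.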