Split BACE by Weight

In this example notebook, we show how to use DataSAIL to compute more challenging splits of the BACE dataset for machine learning models, here based on molecular weight. We first import all necessary tools.

[1]:

%%capture
import deepchem as dc
import numpy as np
from rdkit import Chem
from rdkit.Chem.Descriptors import ExactMolWt
from datasail.sail import datasail

Load the Dataset

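Load the dataset from deepchem. As BACE is loaded here as a classification dataset, we rename the columns of the dataset during preprocessing: the label column y takes the name of the task and the X column becomes SMILES. Furthermore, we assign an ID to every molecule and remove the sample weights by keeping only the ID, SMILES, and label columns. With the DummyFeaturizer, the X column should already contain the SMILES strings, so no conversion from RDKit molecules is needed in this cell.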

[2]:

dataset = dc.molnet.load_bace_classification(featurizer=dc.feat.DummyFeaturizer(), splitter=None)[1][0]
df = dataset.to_dataframe()
df.rename(columns=dict([("y", dataset.tasks[0]), ("X", "SMILES")]), inplace=True)
df["ID"] = [f"Comp{i + 1:06d}" for i in range(len(df))]
df = df[["ID", "SMILES"] + dataset.tasks.tolist()]
df

[2]:

	ID	SMILES	Class
0	Comp000001	O1CC[C@@H](NC(=O)[C@@H](Cc2cc3cc(ccc3nc2N)-c2c...	1.0
1	Comp000002	Fc1cc(cc(F)c1)C[C@H](NC(=O)[C@@H](N1CC[C@](NC(...	1.0
2	Comp000003	S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H](...	1.0
3	Comp000004	S1(=O)(=O)C[C@@H](Cc2cc(O[C@H](COCC)C(F)(F)F)c...	1.0
4	Comp000005	S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H](...	1.0
...	...	...	...
1508	Comp001509	Clc1cc2nc(n(c2cc1)C(CC(=O)NCC1CCOCC1)CC)N	0.0
1509	Comp001510	Clc1cc2nc(n(c2cc1)C(CC(=O)NCc1ncccc1)CC)N	0.0
1510	Comp001511	Brc1cc(ccc1)C1CC1C=1N=C(N)N(C)C(=O)C=1	0.0
1511	Comp001512	O=C1N(C)C(=NC(=C1)C1CC1c1cc(ccc1)-c1ccccc1)N	0.0
1512	Comp001513	Clc1cc2nc(n(c2cc1)CCCC(=O)NCC1CC1)N	0.0

1513 rows × 3 columns

Run DataSAIL

Use DataSAIL to split the data into an identity-based single cold split (I1e) and a cluster-based single cold split (C1e). We define
- the techniques as a list: I1e and C1e. The e at the end is important, as it tells DataSAIL to split the e-data.
- the splits as a list. The values will be normalized to ratios.
- the names as a list. As with the list of split sizes, DataSAIL needs names to label the splits.
- the number of runs. This determines how many different splits are computed per technique.
- the solving algorithm for optimizing the final problem formulation.
- the type of the dataset in the first axis.
- the data as a mapping from IDs to SMILES strings.
- a distance metric based on the molecular weights. For this, we compute the matrix of pairwise weight differences between the molecules.
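
To make the normalization of the split sizes concrete, the following minimal sketch turns the values 7, 2, 1 into ratios. The helper normalize_splits is hypothetical and only illustrates the arithmetic; it is not part of the DataSAIL API, and DataSAIL's internal handling may differ.

```python
def normalize_splits(sizes):
    """Scale split sizes so that they sum to 1 (hypothetical helper, not DataSAIL API)."""
    total = sum(sizes)
    if total <= 0:
        raise ValueError("split sizes must sum to a positive value")
    return [size / total for size in sizes]


names = ["train", "val", "test"]
ratios = normalize_splits([7, 2, 1])
for name, ratio in zip(names, ratios):
    print(f"{name}: {ratio:.1f}")
```

With these values, roughly 70% of the molecules are targeted for train, 20% for val, and 10% for test.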

For an extensive description of the arguments, please refer to the corresponding pages of the documentation.

Given that files storing the data and the distances exist as described in the documentation, the corresponding call to DataSAIL on the command line would be:

$ datasail -t I1e C1e -s 7 2 1 -n train val test -r 3 --solver SCIP --e-type M --e-data <filepath> --e-dist <filepath>

[3]:

%%capture
# Compute the matrix of pairwise molecular weight differences, normalized to [0, 1]
weights = [ExactMolWt(Chem.MolFromSmiles(s)) for s in df["SMILES"].values.tolist()]
dist_mat = np.zeros((len(weights), len(weights)))
for i in range(len(weights)):
    for j in range(i + 1, len(weights)):
        dist_mat[i, j] = dist_mat[j, i] = abs(weights[i] - weights[j])
dist_mat /= np.max(dist_mat)

# Use this matrix together with a list of names (i.e. the ids of the molecules in order of the matrix) as distance metric in DataSAIL.
e_splits, f_splits, inter_splits = datasail(
    techniques=["I1e", "C1e"],
    splits=[7, 2, 1],
    names=["train", "val", "test"],
    runs=3,
    solver="SCIP",
    e_type="M",
    e_data=dict(df[["ID", "SMILES"]].values.tolist()),
    e_dist=(df["ID"].values.tolist(), dist_mat),
)
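
To see what the normalized distance matrix looks like, here is a small worked example with three made-up molecular weights instead of the BACE molecules. The values are purely illustrative, and the sketch uses plain Python lists rather than NumPy so that it runs without extra dependencies.

```python
# Made-up molecular weights (illustrative values, not taken from BACE)
toy_weights = [100.0, 150.0, 300.0]
n = len(toy_weights)

# Pairwise absolute weight differences
toy_dist = [[abs(toy_weights[i] - toy_weights[j]) for j in range(n)] for i in range(n)]

# Scale by the largest difference so that all distances lie in [0, 1]
max_dist = max(max(row) for row in toy_dist)
toy_dist = [[d / max_dist for d in row] for row in toy_dist]

for row in toy_dist:
    print(row)
# [0.0, 0.25, 1.0]
# [0.25, 0.0, 0.75]
# [1.0, 0.75, 0.0]
```

The matrix is symmetric with zeros on the diagonal, and the two molecules with the largest weight difference end up at distance 1.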

The output

Finally, we inspect the e_splits object, as it holds all the assignments of the datapoints to the splits for each run and each technique. First, we print the overall structure; then we look at the first five assignments of the first run (index 0) of the C1e technique.

[4]:

print(type(e_splits))
for key in e_splits.keys():
    print(f"{key} - Type: {type(e_splits[key])} - Length: {len(e_splits[key])}")
    for run in range(len(e_splits[key])):
        print(f"\tRun {run + 1} - Type: {type(e_splits[key][run])} - {len(e_splits[key][run])} assignments")
# Name the technique explicitly instead of relying on the last loop value of `key`
print("\n" + "\n".join(f"ID: {idx} - Split: {split}" for idx, split in list(e_splits["C1e"][0].items())[:5]))

<class 'dict'>
I1e - Type: <class 'list'> - Length: 3
        Run 1 - Type: <class 'dict'> - 1513 assignments
        Run 2 - Type: <class 'dict'> - 1513 assignments
        Run 3 - Type: <class 'dict'> - 1513 assignments
C1e - Type: <class 'list'> - Length: 3
        Run 1 - Type: <class 'dict'> - 1513 assignments
        Run 2 - Type: <class 'dict'> - 1513 assignments
        Run 3 - Type: <class 'dict'> - 1513 assignments

ID: Comp000001 - Split: train
ID: Comp000002 - Split: train
ID: Comp000003 - Split: train
ID: Comp000004 - Split: train
ID: Comp000005 - Split: train
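
Since each run is a dictionary mapping IDs to split names, it can be turned into per-split ID lists with a few lines. The sketch below uses a small made-up assignment dictionary in place of e_splits["C1e"][0]; the IDs and their assignments are illustrative only and do not reproduce the actual DataSAIL result.

```python
from collections import defaultdict

# Made-up stand-in for one run, e.g. e_splits["C1e"][0] (illustrative assignments)
assignment = {
    "Comp000001": "train",
    "Comp000002": "train",
    "Comp000003": "val",
    "Comp000004": "test",
    "Comp000005": "train",
}

# Invert the mapping: split name -> list of IDs assigned to that split
ids_per_split = defaultdict(list)
for idx, split in assignment.items():
    ids_per_split[split].append(idx)

for split in ["train", "val", "test"]:
    print(f"{split}: {len(ids_per_split[split])} molecules")
```

With the real data, such ID lists could then be used to select the matching rows of the dataframe, for example via df[df["ID"].isin(ids_per_split["train"])].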