Split Tox21 with Stratification

In this notebook, we split the Tox21 dataset using stratification. We use the e_strat keyword and split the dataset into two splits. First, we import all necessary tools.

[1]:

%%capture
import deepchem as dc
from datasail.sail import datasail

Load Tox21 Dataset

We will load the Tox21 dataset and convert it to a pandas dataframe. We will then rename the columns to match the sub-challenge names of Tox21 and reduce the dataframe to the one target we are interested in, which is the SR-ARE target.

[2]:

dataset = dc.molnet.load_tox21(featurizer=dc.feat.DummyFeaturizer(), splitter=None)[1][0]
df = dataset.to_dataframe()
# Map DeepChem's generic column names (y1, y2, ...) to the Tox21 task names and X to SMILES
name_map = dict([(f"y{i + 1}", task) for i, task in enumerate(dataset.tasks)] + [("y", dataset.tasks[0]), ("X", "SMILES")])
df.rename(columns=name_map, inplace=True)
df["ID"] = [f"Comp{i + 1:06d}" for i in range(len(df))]
df = df[["ID", "SMILES", "SR-ARE"]]
df

[2]:

	ID	SMILES	SR-ARE
0	Comp000001	CCOc1ccc2nc(S(N)(=O)=O)sc2c1	1.0
1	Comp000002	CCN1C(=O)NC(c2ccccc2)C1=O	0.0
2	Comp000003	CC[C@]1(O)CC[C@H]2[C@@H]3CCC4=CCCC[C@@H]4[C@H]...	0.0
3	Comp000004	CCCN(CC)C(CC)C(=O)Nc1c(C)cccc1C	0.0
4	Comp000005	CC(O)(P(=O)(O)O)P(=O)(O)O	0.0
...	...	...	...
7826	Comp007827	CCOc1nc2cccc(C(=O)O)c2n1Cc1ccc(-c2ccccc2-c2nnn...	0.0
7827	Comp007828	CC(=O)[C@H]1CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C@]4(...	0.0
7828	Comp007829	C[C@]12CC[C@H]3[C@@H](CCC4=CC(=O)CC[C@@]43C)[C...	1.0
7829	Comp007830	C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C...	0.0
7830	Comp007831	COc1ccc2c(c1OC)CN1CCc3cc4c(cc3C1C2)OCO4	0.0

7831 rows × 3 columns

Run DataSAIL

Use DataSAIL to split the data into a cluster-based single cold split (C1e). We define

- the techniques as a list: C1e. The trailing e is important because it tells DataSAIL to split the e-data.
- the splits as a list. The values will be normalized to ratios.
- the names as a list. As with the list of split sizes, DataSAIL needs names to label the splits.
- the number of runs. This determines how many different splits to compute per technique.
- the solving algorithm for optimizing the final problem formulation.
- the type of the dataset in the first axis.
- the data as a mapping from IDs to SMILES strings.
- a mapping of sample names to the stratification target values.
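To make the note about normalization concrete, here is a minimal sketch of the arithmetic behind turning split sizes such as 8 and 2 into ratios. The helper normalize_splits is a hypothetical name introduced for illustration; it is not part of DataSAIL and does not claim to reproduce its internal code.

```python
def normalize_splits(sizes):
    """Turn relative split sizes into fractions that sum to 1 (illustrative helper, not DataSAIL code)."""
    total = sum(sizes)
    if total <= 0:
        raise ValueError("split sizes must sum to a positive number")
    return [size / total for size in sizes]


# Pair each ratio with its split name, as in splits=[8, 2] and names=["train", "test"]
ratios = normalize_splits([8, 2])
print(dict(zip(["train", "test"], ratios)))
```

Under this reading, splits=[8, 2] and splits=[4, 1] describe the same 80/20 partition.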

For an extensive description of the arguments, please refer to the corresponding pages of the documentation.

Assuming files storing the data and the stratification values exist as described in the documentation, the corresponding command-line call to DataSAIL would be:

$ datasail -t C1e -s 8 2 -n train test -r 3 --solver SCIP --e-type M --e-data <filepath> --e-strat <filepath>

[3]:

%%capture
e_splits, _, _ = datasail(
    techniques=["C1e"],
    splits=[8, 2],
    names=["train", "test"],
    runs=3,
    solver="SCIP",
    e_type="M",
    e_data=dict(df[["ID", "SMILES"]].values.tolist()),
    e_strat=dict(df[["ID", "SR-ARE"]].values.tolist()),
)
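The two mappings passed as e_data and e_strat are plain dictionaries keyed by sample ID. The sketch below builds both from a few rows copied from the table above, using plain Python lists in place of the dataframe so that it runs without pandas; the variable rows is introduced only for illustration.

```python
# Three rows taken from the dataframe shown earlier: [ID, SMILES, SR-ARE]
rows = [
    ["Comp000001", "CCOc1ccc2nc(S(N)(=O)=O)sc2c1", 1.0],
    ["Comp000002", "CCN1C(=O)NC(c2ccccc2)C1=O", 0.0],
    ["Comp000004", "CCCN(CC)C(CC)C(=O)Nc1c(C)cccc1C", 0.0],
]

# dict() accepts any iterable of two-element sequences, which is what
# df[["ID", "SMILES"]].values.tolist() produces in the notebook cell.
e_data = dict([row[0], row[1]] for row in rows)   # ID -> SMILES
e_strat = dict([row[0], row[2]] for row in rows)  # ID -> stratification target

# Both mappings should describe the same set of samples.
assert e_data.keys() == e_strat.keys()
print(e_strat)
```

Keeping both dictionaries keyed by the same IDs lets each stratification value be matched to its molecule.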

The output

Finally, we inspect the e_splits object, which holds the assignments of the datapoints to the splits for each technique and each run. First, we print the overall structure; then we look at the first five assignments of the first C1e run.

[4]:

print(type(e_splits))
for key in e_splits.keys():
    print(f"{key} - Type: {type(e_splits[key])} - Length: {len(e_splits[key])}")
    for run in range(len(e_splits[key])):
        print(f"\tRun {run + 1} - Type: {type(e_splits[key][run])} - {len(e_splits[key][run])} assignments")
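# Show the first five assignments of the first C1e run (explicit key instead of the leftover loop variable)
print("\n" + "\n".join(f"ID: {idx} - Split: {split}" for idx, split in list(e_splits["C1e"][0].items())[:5]))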

<class 'dict'>
C1e - Type: <class 'list'> - Length: 3
        Run 1 - Type: <class 'dict'> - 7827 assignments
        Run 2 - Type: <class 'dict'> - 7827 assignments
        Run 3 - Type: <class 'dict'> - 7827 assignments

ID: Comp000001 - Split: train
ID: Comp000002 - Split: train
ID: Comp000003 - Split: test
ID: Comp000004 - Split: train
ID: Comp000005 - Split: test
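One way to check whether a run is actually stratified is to compare the fraction of positive samples in each split. The following is a minimal sketch on toy data; positive_rate_per_split and the sample IDs S01 to S10 are hypothetical and introduced only for illustration. It assumes the assignment dictionary has the ID-to-split shape printed above.

```python
from collections import defaultdict


def positive_rate_per_split(assignments, labels):
    """Return the fraction of positive labels (== 1.0) in each split.

    assignments: mapping of sample ID -> split name (shape of one run in e_splits)
    labels:      mapping of sample ID -> stratification target value
    """
    counts = defaultdict(lambda: [0, 0])  # split -> [positives, total]
    for sample_id, split in assignments.items():
        if sample_id not in labels:
            continue  # skip samples without a label
        counts[split][0] += int(labels[sample_id] == 1.0)
        counts[split][1] += 1
    return {split: pos / total for split, (pos, total) in counts.items()}


# Toy data: ten samples, the first five are positive
toy_labels = {f"S{i:02d}": (1.0 if i <= 5 else 0.0) for i in range(1, 11)}
toy_assignments = {sid: "train" for sid in toy_labels}
toy_assignments["S05"] = "test"  # one positive in test
toy_assignments["S10"] = "test"  # one negative in test

print(positive_rate_per_split(toy_assignments, toy_labels))
```

In the notebook, the same helper could be called with e_splits["C1e"][0] and the dictionary passed as e_strat. Because it iterates over the assignments, IDs that received no assignment are ignored.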
