Split PDBBind in Two Dimensions
In this example notebook, we discuss how to use DataSAIL to compute challenging splits of the PDBBind core dataset for machine learning models. Along the way, we demonstrate how to preprocess and split a more complex dataset. We first import all necessary tools.
[1]:
%%capture
import os
import shutil
import deepchem as dc
from rdkit import Chem
from datasail.sail import datasail
Load the Dataset
Load the dataset from deepchem. As usual, we remove the weights.
When looking at the resulting “Target” column of the dataframe, one can see that the interaction dataset focuses on predicting ligand-pocket binding affinities. This makes sense, as a model can focus on the specifics of the pocket instead of dealing with the entire protein, much of which may be uninteresting. Therefore, we apply DataSAIL to the pockets as well.
[2]:
dataset = dc.molnet.load_pdbbind(featurizer=dc.feat.DummyFeaturizer(), splitter=None, set_name="core")
df = dataset[1][0].to_dataframe()
df.rename(columns={"X1": "Ligand", "X2": "Target"}, inplace=True)
df = df[["ids", "Ligand", "Target", "y"]]
df
[2]:
| | ids | Ligand | Target | y |
|---|---|---|---|---|
| 0 | 2d3u | /tmp/v2013-core/2d3u/2d3u_ligand.sdf | /tmp/v2013-core/2d3u/2d3u_pocket.pdb | 0.268375 |
| 1 | 3cyx | /tmp/v2013-core/3cyx/3cyx_ligand.sdf | /tmp/v2013-core/3cyx/3cyx_pocket.pdb | 0.749538 |
| 2 | 3uo4 | /tmp/v2013-core/3uo4/3uo4_ligand.sdf | /tmp/v2013-core/3uo4/3uo4_pocket.pdb | 0.090166 |
| 3 | 1p1q | /tmp/v2013-core/1p1q/1p1q_ligand.sdf | /tmp/v2013-core/1p1q/1p1q_pocket.pdb | -0.636034 |
| 4 | 3ag9 | /tmp/v2013-core/3ag9/3ag9_ligand.sdf | /tmp/v2013-core/3ag9/3ag9_pocket.pdb | 0.771814 |
| ... | ... | ... | ... | ... |
| 188 | 2x0y | /tmp/v2013-core/2x0y/2x0y_ligand.sdf | /tmp/v2013-core/2x0y/2x0y_pocket.pdb | -0.765235 |
| 189 | 3uex | /tmp/v2013-core/3uex/3uex_ligand.sdf | /tmp/v2013-core/3uex/3uex_pocket.pdb | 0.268375 |
| 190 | 2pq9 | /tmp/v2013-core/2pq9/2pq9_ligand.sdf | /tmp/v2013-core/2pq9/2pq9_pocket.pdb | 0.798545 |
| 191 | 1u1b | /tmp/v2013-core/1u1b/1u1b_ligand.sdf | /tmp/v2013-core/1u1b/1u1b_pocket.pdb | 0.660433 |
| 192 | 4gqq | /tmp/v2013-core/4gqq/4gqq_ligand.sdf | /tmp/v2013-core/4gqq/4gqq_pocket.pdb | -1.527076 |
193 rows × 4 columns
Preparation of Ligands
This time, the ligands are given as SDF files, which need to be converted to SMILES strings. For this, we first write a simple converter function, apply it to all ligands, and remove any rows for which the conversion produced a missing value.
[3]:
from rdkit import rdBase
# silence RDKit log messages while the SDF files are parsed
blocker = rdBase.BlockLogs()
def sdf2smiles(x):
    mols = Chem.SDMolSupplier(x)
    if len(mols) != 1:
        # drop ambiguous molecules. If the target binds to none or multiple ligands, the binding affinity might be ambiguous
        return None
    for mol in mols:
        if mol is None:
            # if the read molecule is invalid, it cannot be converted either
            return None
        return Chem.MolToSmiles(mol)
df["Ligand"] = df["Ligand"].apply(sdf2smiles)
df.dropna(inplace=True)
df
[3]:
| | ids | Ligand | Target | y |
|---|---|---|---|---|
| 0 | 2d3u | Cc1ccccc1S(=O)(=O)Nc1cc(-c2ccc(C#N)cc2)sc1C(=O... | /tmp/v2013-core/2d3u/2d3u_pocket.pdb | 0.268375 |
| 1 | 3cyx | CC(C)(C)NC(=O)[C@@H]1C[C@@H]2CCCC[C@@H]2C[N@H+... | /tmp/v2013-core/3cyx/3cyx_pocket.pdb | 0.749538 |
| 2 | 3uo4 | O=C([O-])c1ccc(Nc2nccc(Nc3ccccc3-c3ccccc3)n2)cc1 | /tmp/v2013-core/3uo4/3uo4_pocket.pdb | 0.090166 |
| 3 | 1p1q | Cc1o[nH]c(=O)c1C[C@H]([NH3+])C(=O)[O-] | /tmp/v2013-core/1p1q/1p1q_pocket.pdb | -0.636034 |
| 5 | 2wtv | O=C([O-])c1ccc(Nc2ncc3c(n2)-c2ccc(Cl)cc2C(c2c(... | /tmp/v2013-core/2wtv/2wtv_pocket.pdb | 1.079223 |
| ... | ... | ... | ... | ... |
| 188 | 2x0y | Cn1c(=O)c2c(ncn2C[C@H](O)CO)n(C)c1=O | /tmp/v2013-core/2x0y/2x0y_pocket.pdb | -0.765235 |
| 189 | 3uex | CCCCCCCCCCCCCCCCCC(=O)[O-] | /tmp/v2013-core/3uex/3uex_pocket.pdb | 0.268375 |
| 190 | 2pq9 | O=C([O-])C1=C[C@@H](OP(=O)([O-])[O-])[C@@H](O)... | /tmp/v2013-core/2pq9/2pq9_pocket.pdb | 0.798545 |
| 191 | 1u1b | Cc1cn([C@H]2C[C@H](O[P@](=O)([O-])O[P@](=O)([O... | /tmp/v2013-core/1u1b/1u1b_pocket.pdb | 0.660433 |
| 192 | 4gqq | CCOC(=O)/C=C/c1ccc(O)c(O)c1 | /tmp/v2013-core/4gqq/4gqq_pocket.pdb | -1.527076 |
182 rows × 4 columns
Preparation of Targets
Here, we just copy all PDB files into one folder. This is a requirement of FoldSeek, the algorithm DataSAIL uses internally to cluster PDB data.
[4]:
os.makedirs("pdbs", exist_ok=True)
for name, filename in df[["ids", "Target"]].values.tolist():
    shutil.copyfile(filename, f"pdbs/{name}.pdb")
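The copy step can be tried out without downloading PDBBind. The following is a minimal sketch that works on temporary dummy files; the helper name `flatten_pdbs` and the placeholder file contents are assumptions made for illustration and are not part of DataSAIL or deepchem.

```python
import os
import shutil
import tempfile


def flatten_pdbs(records, out_dir):
    """Copy every (name, path) pair to out_dir/<name>.pdb and return the new paths."""
    os.makedirs(out_dir, exist_ok=True)
    copied = []
    for name, path in records:
        target = os.path.join(out_dir, f"{name}.pdb")
        shutil.copyfile(path, target)
        copied.append(target)
    return copied


# Hypothetical stand-ins for the nested layout <root>/<id>/<id>_pocket.pdb.
root = tempfile.mkdtemp()
records = []
for name in ["2d3u", "3cyx"]:
    os.makedirs(os.path.join(root, name))
    path = os.path.join(root, name, f"{name}_pocket.pdb")
    with open(path, "w") as handle:
        handle.write("END\n")  # placeholder content, not a real structure
    records.append((name, path))

flat_dir = os.path.join(root, "pdbs")
copied = flatten_pdbs(records, flat_dir)
print(sorted(os.listdir(flat_dir)))
```

Naming each copy after its `ids` entry means the file names in the folder can serve as the target identifiers, which is how the interaction pairs refer to the targets later in this notebook.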
Run DataSAIL
Use DataSAIL to split PDBBind with every technique offered. We define
- the techniques as a list: R, I1e, I1f, I2, C1e, C1f, and C2. The trailing e is important: it selects the e-data for splitting; the trailing f selects the f-data accordingly.
- the splits as a list. The values will be normalized to ratios.
- the names as a list. Just as it needs split sizes, DataSAIL needs names for the splits.
- the number of runs. This determines how many different splits to compute per technique.
- the solving algorithm for optimizing the final problem formulation.
- the type of the dataset in the first axis (ligands).
- the data as a mapping from IDs to SMILES strings (ligands).
- the type of the dataset in the second axis (targets).
- the location of the PDB folder.
For an extensive description of the arguments, please refer to the corresponding pages of the documentation.
Provided that files storing the data as described in the documentation exist, the corresponding call to DataSAIL on the command line would be:
$ datasail -t R I1e I1f I2 C1e C1f C2 -s 7 2 1 -n train val test -r 3 -i inter.tsv --solver SCIP --e-type M --e-data <filepath> --f-type P --f-data <pdb_dir>
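The split sizes `7 2 1` are relative weights rather than absolute counts. As a minimal sketch of what normalizing them to ratios means (this only illustrates the arithmetic and is not DataSAIL's internal code; `normalize_splits` is a hypothetical helper):

```python
def normalize_splits(sizes):
    """Turn relative split sizes into fractions that sum to 1."""
    total = sum(sizes)
    if total <= 0:
        raise ValueError("split sizes must sum to a positive number")
    return [size / total for size in sizes]


names = ["train", "val", "test"]
ratios = normalize_splits([7, 2, 1])
for name, ratio in zip(names, ratios):
    print(f"{name}: {ratio:.1f}")
```

Under this reading, `[7, 2, 1]` and `[70, 20, 10]` describe the same target proportions.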
[5]:
%%capture
e_splits, f_splits, inter_splits = datasail(
    techniques=["R", "I1e", "I1f", "I2", "C1e", "C1f", "C2"],
    splits=[7, 2, 1],
    names=["train", "val", "test"],
    runs=3,
    solver="SCIP",
    inter=[(x[0], x[0]) for x in df[["ids"]].values.tolist()],
    e_type="M",
    e_data=dict(df[["ids", "Ligand"]].values.tolist()),
    f_type="P",
    f_data="pdbs",
)
The Output
Finally, we inspect the returned split assignments. They hold the assignment of every datapoint to a split, for each technique and each run. Each of the following cells first prints the overall structure and then the first five assignments of run 0 of the last technique in the dictionary (C2).
[6]:
print(type(e_splits))
for key in e_splits.keys():
    print(f"{key} - Type: {type(e_splits[key])} - Length: {len(e_splits[key])}")
    for run in range(len(e_splits[key])):
        print(f"\tRun {run + 1} - Type: {type(e_splits[key][run])} - {len(e_splits[key][run])} assignments")
print("\n" + "\n".join(f"ID: {idx} - Split: {split}" for idx, split in list(e_splits[key][0].items())[:5]))
<class 'dict'>
I1e - Type: <class 'list'> - Length: 3
Run 1 - Type: <class 'dict'> - 182 assignments
Run 2 - Type: <class 'dict'> - 182 assignments
Run 3 - Type: <class 'dict'> - 182 assignments
I2 - Type: <class 'list'> - Length: 3
Run 1 - Type: <class 'dict'> - 182 assignments
Run 2 - Type: <class 'dict'> - 182 assignments
Run 3 - Type: <class 'dict'> - 182 assignments
C1e - Type: <class 'list'> - Length: 3
Run 1 - Type: <class 'dict'> - 182 assignments
Run 2 - Type: <class 'dict'> - 182 assignments
Run 3 - Type: <class 'dict'> - 182 assignments
C2 - Type: <class 'list'> - Length: 3
Run 1 - Type: <class 'dict'> - 182 assignments
Run 2 - Type: <class 'dict'> - 182 assignments
Run 3 - Type: <class 'dict'> - 182 assignments
ID: 2d3u - Split: train
ID: 3cyx - Split: train
ID: 3pww - Split: train
ID: 3uo4 - Split: train
ID: 1p1q - Split: train
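Each run is a plain dictionary from ID to split name, so standard Python tools are enough to summarize it. A minimal sketch, using a small hand-written dictionary in place of a real run (the IDs and assignments in `toy_run` are made up for illustration; with real results one would pass, for example, `e_splits["C1e"][0]`; `ids_by_split` is a hypothetical helper):

```python
from collections import Counter, defaultdict


def ids_by_split(assignments):
    """Invert an {id: split} mapping into {split: [ids]}."""
    groups = defaultdict(list)
    for idx, split in assignments.items():
        groups[split].append(idx)
    return dict(groups)


# Hypothetical run, shaped like e_splits[technique][run].
toy_run = {"a": "train", "b": "train", "c": "val", "d": "test", "e": "train"}

sizes = Counter(toy_run.values())   # number of IDs per split
groups = ids_by_split(toy_run)      # IDs belonging to each split
print(sizes)
print(groups)
```

Comparing the counts per split against the requested ratios is a quick sanity check before training on a split.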
[7]:
print(type(f_splits))
for key in f_splits.keys():
    print(f"{key} - Type: {type(f_splits[key])} - Length: {len(f_splits[key])}")
    for run in range(len(f_splits[key])):
        print(f"\tRun {run + 1} - Type: {type(f_splits[key][run])} - {len(f_splits[key][run])} assignments")
print("\n" + "\n".join(f"ID: {idx} - Split: {split}" for idx, split in list(f_splits[key][0].items())[:5]))
<class 'dict'>
I1f - Type: <class 'list'> - Length: 3
Run 1 - Type: <class 'dict'> - 182 assignments
Run 2 - Type: <class 'dict'> - 182 assignments
Run 3 - Type: <class 'dict'> - 182 assignments
I2 - Type: <class 'list'> - Length: 3
Run 1 - Type: <class 'dict'> - 182 assignments
Run 2 - Type: <class 'dict'> - 182 assignments
Run 3 - Type: <class 'dict'> - 182 assignments
C1f - Type: <class 'list'> - Length: 3
Run 1 - Type: <class 'dict'> - 182 assignments
Run 2 - Type: <class 'dict'> - 182 assignments
Run 3 - Type: <class 'dict'> - 182 assignments
C2 - Type: <class 'list'> - Length: 3
Run 1 - Type: <class 'dict'> - 182 assignments
Run 2 - Type: <class 'dict'> - 182 assignments
Run 3 - Type: <class 'dict'> - 182 assignments
ID: 2d3u - Split: val
ID: 3cyx - Split: train
ID: 3uo4 - Split: val
ID: 1p1q - Split: test
ID: 2wtv - Split: val
[8]:
print(type(inter_splits))
for key in inter_splits.keys():
    print(f"{key} - Type: {type(inter_splits[key])} - Length: {len(inter_splits[key])}")
    for run in range(len(inter_splits[key])):
        print(f"\tRun {run + 1} - Type: {type(inter_splits[key][run])} - {len(inter_splits[key][run])} assignments")
print("\n" + "\n".join(f"ID: {idx} - Split: {split}" for idx, split in list(inter_splits[key][0].items())[:5]))
<class 'dict'>
R - Type: <class 'list'> - Length: 3
Run 1 - Type: <class 'dict'> - 182 assignments
Run 2 - Type: <class 'dict'> - 182 assignments
Run 3 - Type: <class 'dict'> - 182 assignments
I1e - Type: <class 'list'> - Length: 3
Run 1 - Type: <class 'dict'> - 182 assignments
Run 2 - Type: <class 'dict'> - 182 assignments
Run 3 - Type: <class 'dict'> - 182 assignments
I1f - Type: <class 'list'> - Length: 3
Run 1 - Type: <class 'dict'> - 182 assignments
Run 2 - Type: <class 'dict'> - 182 assignments
Run 3 - Type: <class 'dict'> - 182 assignments
I2 - Type: <class 'list'> - Length: 3
Run 1 - Type: <class 'dict'> - 182 assignments
Run 2 - Type: <class 'dict'> - 182 assignments
Run 3 - Type: <class 'dict'> - 182 assignments
C1e - Type: <class 'list'> - Length: 3
Run 1 - Type: <class 'dict'> - 182 assignments
Run 2 - Type: <class 'dict'> - 182 assignments
Run 3 - Type: <class 'dict'> - 182 assignments
C1f - Type: <class 'list'> - Length: 3
Run 1 - Type: <class 'dict'> - 182 assignments
Run 2 - Type: <class 'dict'> - 182 assignments
Run 3 - Type: <class 'dict'> - 182 assignments
C2 - Type: <class 'list'> - Length: 3
Run 1 - Type: <class 'dict'> - 182 assignments
Run 2 - Type: <class 'dict'> - 182 assignments
Run 3 - Type: <class 'dict'> - 182 assignments
ID: ('2d3u', '2d3u') - Split: not selected
ID: ('3cyx', '3cyx') - Split: train
ID: ('3uo4', '3uo4') - Split: not selected
ID: ('1p1q', '1p1q') - Split: not selected
ID: ('2wtv', '2wtv') - Split: not selected
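In the C2 results printed above, the pair `('3cyx', '3cyx')` is assigned to `train`, and both its ligand and its target are in `train`. The pairs whose ligand and target fall into different splits are `not selected`. The sketch below encodes that reading as a hypothetical helper `combine_2d`; it is an interpretation of the printed output, not DataSAIL's implementation.

```python
def combine_2d(inter, e_assign, f_assign, fallback="not selected"):
    """Assign an interaction to a split only if both partners share that split."""
    result = {}
    for e_id, f_id in inter:
        e_split = e_assign.get(e_id)
        f_split = f_assign.get(f_id)
        same = e_split is not None and e_split == f_split
        result[(e_id, f_id)] = e_split if same else fallback
    return result


# Values copied from the first lines of the C2 output printed above.
e_assign = {"2d3u": "train", "3cyx": "train", "3uo4": "train", "1p1q": "train"}
f_assign = {"2d3u": "val", "3cyx": "train", "3uo4": "val", "1p1q": "test"}
inter = [(idx, idx) for idx in e_assign]

combined = combine_2d(inter, e_assign, f_assign)
for pair, split in combined.items():
    print(pair, split)
```

If this reading holds, two-dimensional splits discard part of the interactions, so it is worth counting how many remain per split before training a model on them.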