Other Initiatives

In recent years, many datasets have been published with data splits that put special focus on minimizing similarity-induced information leakage between the splits. These splits and the underlying algorithms are often very specific to the dataset.

Here, we compare the similarity-induced information leakage in these splits to DataSAIL splits. We measure the leakage of a split by the scaled \(L(\pi)\) metric as defined in the main manuscript:

\[\text{scaled L}(\pi):=\frac{\sum_{xx^\prime\in\binom{\mathcal{D}}{2}}[\pi(x)\neq\pi(x^\prime)]\cdot\text{sim}(x,x^\prime)\cdot\kappa(x)\cdot\kappa(x^\prime)}{\sum\nolimits_{x,x^\prime\in\mathcal{D}}\text{sim}(x,x^\prime)}\]

Here \(\pi:\mathcal{D}\rightarrow [k]\) is a data splitting function mapping samples \(x\) of the dataset \(\mathcal{D}\) to one of \(k\) splits. \(\text{sim}:\mathcal{D}\times\mathcal{D}\rightarrow [0,1]\) is a similarity function between samples \(x\) and \(x^\prime\) of \(\mathcal{D}\). \(\kappa:\mathcal{D}\rightarrow\mathbb{R}_{\geq 0}\) is a weighting function that can be used to put more emphasis on certain samples. This is especially useful if \(x\) represents clusters or has multiple interactions in a drug-target interaction dataset and potentially leaks information multiple times.
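As a minimal sketch of how this metric can be evaluated, the following pure-Python function computes the scaled \(L(\pi)\) from a precomputed similarity matrix. The function name `scaled_leakage` and its arguments are hypothetical and not part of the DataSAIL API. The sketch assumes that both sums run over unordered pairs of distinct samples and that \(\kappa\) defaults to 1 for every sample; the denominator is left unweighted, as in the formula above.

```python
from itertools import combinations


def scaled_leakage(sim, split, kappa=None):
    """Scaled L(pi) for a symmetric similarity matrix (hypothetical helper).

    sim   -- n x n nested list with sim[i][j] in [0, 1]
    split -- list of length n, split[i] is the split label pi(x_i)
    kappa -- optional list of non-negative sample weights (default: all 1)
    """
    n = len(split)
    if kappa is None:
        kappa = [1.0] * n

    leaked = 0.0  # weighted similarity between samples in different splits
    total = 0.0   # similarity over all pairs (assumption: unordered pairs)
    for i, j in combinations(range(n), 2):
        total += sim[i][j]
        if split[i] != split[j]:
            leaked += sim[i][j] * kappa[i] * kappa[j]

    # An all-zero similarity matrix cannot leak anything.
    return leaked / total if total > 0 else 0.0


# Toy example: samples 0 and 1 are similar and share a split.
sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
leakage = scaled_leakage(sim, split=[0, 0, 1])
```

In the toy example, only the two low-similarity pairs cross the split boundary, so the leaked similarity is small relative to the total. Placing samples 0 and 1 into different splits would raise the value.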

MoleculeNet

Wu et al. (2018)

DOI: 10.1039/C7SC02664A

This benchmark suite provides multiple datasets for molecular property prediction, each with different properties to predict. Each dataset comes with a predefined split; some of these are scaffold-based or time-based, but most are random. Here, we compare these default splits to similarity-based DataSAIL splits.

MoleculeNet Results Comparison

MoleculeNet Dataset Comparison

Scaled L(π) values for different splitting methods

Dataset	MoleculeNet Technique	MoleculeNet Split	DataSAIL Split
QM7	stratified	0.3425	0.2680
QM8	random	0.3300	0.2918
QM9	random	0.3306	0.2727
ESOL	random	0.3069	0.1808
FreeSolv	random	0.3213	0.1410
Lipophilicity	random	0.3343	0.3027
MUV	random	0.3349	0.3143
HIV	scaffold	0.3306	0.3071
BACE	scaffold	0.3309	0.3036
BBBP	scaffold	0.3366	0.2866
Tox21	random	0.3333	0.2224
ToxCast	random	0.3355	0.2220
SIDER	random	0.3513	0.2345
ClinTox	random	0.3317	0.2303

Leak Proof PDBBind (LP-PDBBind)

Li et al. (2023)

DOI: 10.48550/arXiv.2308.09639

This work improves the PDBBind dataset by defining a new data split that reduces data leakage between the train, validation, and test sets. The resulting LP-PDBBind dataset ensures that the train set has a maximum protein sequence similarity of 0.5 and a maximum ligand similarity of 0.99 to both the validation and test sets. Between the validation and test sets, the guarantees are 0.9 for protein similarity and 0.99 for ligand similarity. Protein similarity was measured as the percentage of matching residues after a Needleman-Wunsch alignment, while ligand similarity was measured as the Dice similarity between Morgan fingerprints.
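The ligand-side guarantee can be illustrated with a short sketch. It assumes that each fingerprint is given as a set of on-bit indices and uses the standard Dice coefficient \(2|A\cap B|/(|A|+|B|)\). The helper names `dice_similarity` and `max_cross_similarity` are hypothetical and are not taken from the LP-PDBBind code; in practice, the fingerprints would come from a cheminformatics toolkit.

```python
def dice_similarity(bits_a, bits_b):
    """Dice coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return 2 * len(a & b) / (len(a) + len(b))


def max_cross_similarity(items, split, part_a, part_b, sim_fn):
    """Largest similarity between any item in part_a and any item in part_b."""
    best = 0.0
    for x, split_x in zip(items, split):
        if split_x != part_a:
            continue
        for y, split_y in zip(items, split):
            if split_y == part_b:
                best = max(best, sim_fn(x, y))
    return best


# Toy fingerprints (sets of on-bit indices) and their split labels.
fingerprints = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
labels = ["train", "test", "val"]

train_test = max_cross_similarity(fingerprints, labels, "train", "test", dice_similarity)
# A split satisfies the ligand guarantee if this value is at most 0.99.
ligand_ok = train_test <= 0.99
```

The same `max_cross_similarity` helper could be applied to protein pairs with a sequence-identity function and the 0.5 threshold for train versus validation or test.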

LP-PDBBind Results Comparison

LP-PDBBind Dataset Comparison

Scaled L(π) values for different splitting methods

Split Method	Scaled L(π)
LP-PDBBind	0.4484
DataSAIL Ligand S1	0.6330
DataSAIL Protein S1	0.5446
DataSAIL S2	0.4277

Protein Ligand INteraction Dataset and Evaluation Resource (PLINDER)

Durairaj et al. (2024)

DOI: 10.1101/2024.07.17.603955

This work introduces PLINDER, a dataset for protein-ligand interaction prediction, extracted from the PDB and intensively annotated. The authors provide three different data splits. The most complex one is PLINDER-PL50, which was created by combining multiple similarity metrics: (i) sequence identity for proteins, (ii) pocket-level Jaccard similarity using pharmacophores, (iii) interaction-level similarity using PLIP features, and (iv) ligand-level similarity using Tanimoto similarity on ECFP4 fingerprints. The algorithm then identifies clusters of similar protein-ligand systems. Finally, the test set is constructed to contain systems from clusters that have no or minimal similarity to any systems in the training or validation sets. In addition, there are two simpler splits: PLINDER-TIME, which is a time-based split, and PLINDER-ECOD, which is based on ECOD topologies.

PLINDER Results Comparison

PLINDER Dataset Comparison

Scaled L(π) values for different splitting methods

Split Method	Scaled L(π)
PLINDER-PL50	0.0678
PLINDER-ECOD	0.3601
PLINDER-TIME	0.3682
DataSAIL Ligand S1	0.2307
DataSAIL Protein S1	0.4008
DataSAIL S2	0.0252

Protein INteraction Dataset and Evaluation Resource (PINDER)

Kovtun et al. (2024)

DOI: 10.1101/2024.07.17.603980

The Protein INteraction Dataset and Evaluation Resource (PINDER) contains curated and highly annotated protein-protein interactions obtained from the RCSB NextGen database. After data cleaning and preprocessing, PINDER provides a split from which data leakage has been removed. To measure the leakage between two systems (interacting protein-protein pairs), the authors employed FoldSeek and MMseqs. Here, we compare DataSAIL to version 1 of PINDER, released in November 2023.

Unlike in the LP-PDBBind dataset, both dimensions of this two-dimensional dataset contain proteins, so we can define a similarity metric between entities of the two dimensions. Therefore, we did not use DataSAIL's S2 splitting module directly but rather the S1 module on all protein sequences from both dimensions, weighting each protein by the number of interactions it participates in. From the resulting assignment, we assigned an interaction to a split if and only if both proteins are assigned to that same split.
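The weighting and the final assignment step can be sketched as follows. The function names and the toy protein identifiers are hypothetical, and the protein-level split is passed in as a plain dictionary rather than computed by DataSAIL. The sketch assumes that interactions whose two proteins land in different splits are left unassigned.

```python
from collections import Counter


def interaction_weights(interactions):
    """Weight each protein by the number of interactions it participates in."""
    counts = Counter()
    for a, b in interactions:
        # A set ensures a self-interaction (a == b) is counted only once.
        for protein in {a, b}:
            counts[protein] += 1
    return dict(counts)


def assign_interactions(interactions, protein_split):
    """Keep an interaction only if both proteins share the same split."""
    assigned = {}
    for a, b in interactions:
        if protein_split[a] == protein_split[b]:
            assigned[(a, b)] = protein_split[a]
    return assigned


# Toy data: four proteins, three interactions.
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
weights = interaction_weights(interactions)

# Hypothetical protein-level assignment, e.g. as produced by an S1 split.
protein_split = {"P1": "train", "P2": "train", "P3": "test", "P4": "test"}
split_of_interaction = assign_interactions(interactions, protein_split)
```

In this toy example, the interaction between P2 and P3 spans the train and test sets and therefore receives no split, while the other two interactions are kept.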

PINDER Results Comparison

PINDER Dataset Comparison

Scaled L(π) values for different splitting methods

Split Method	Scaled L(π)
PINDER	0.0068
DataSAIL	0.0140

Gold Standard Human Proteome Dataset for sequence-based PPI prediction

Bernett et al. (2024)

DOI: 10.1093/bib/bbae076

The authors first show that all sequence-based protein-protein interaction (PPI) predictors they evaluated perform no better than random when sequence similarity between splits is removed. They then develop a PPI dataset based on the human proteome, in which they separate the proteins into three blocks using KaHIP over SIMAP2 bitscores. A PPI is assigned to a block if and only if both interacting proteins are in that block. In a final step, CD-HIT is used to remove redundancy (at most 40% sequence similarity) within each block.

comparison coming soon