.. _splits-label: ###### Splits ###### In this section, we will discuss the different splitting techniques DataSAIL provides for splitting datasets. First, we need state what we consider to be one-dimensional data and two-dimensional data. The motivation for these names become clear in the depictions below. All splitting teechniques and how they relate to each other are visualized in the following image: .. image:: ../imgs/phylOverview_splits.png :width: 600 :alt: Splitting techniques One-Dimensional Data #################### We consider one-dimension data to be data where the task is to predict one or more features of one input system. Examples are protein classification or molecular property prediction. The input system refers to proteins or molecules in the examples. We call them "input system", as in the era of GNNs "input variables" are probably not appropriate anymore. .. image:: ../imgs/mpp.png :width: 600 :alt: Exemplary protein feature prediction dataset Two-Dimensional Data #################### This is data with two input systems. Examples are interaction datasets such as protein-protein or drug-target interaction prediction datasets. Here, the two input systems are two sets of proteins or drugs and their targets. The two-dimensionality becomes clear from the example. .. image:: ../imgs/pli.png :width: 600 Two dimensional data can be threaten as one-dimensional data when ignoring one dimension of the data. Therefore, ever technique to split one-dimensional data can be applied to two dimensional data as well (and in both dimensions). Splitting Techniques #################### We will discuss the different techniques to split a dataset based on this exemplary interaction table. The proteins are made up and chosen by their shape. Interaction with any of the ligands are random and unintentionally. Ad an example to visualize all these techniques, we will use the interaction-dataset visualized above. Furthermore, we will always split into 3 splits (green, yellow, and red). All non-colored fields are interactions that are lost from the full dataset. Random Split (R) ================ This is the most simple split and the most widely used one. Here, datapoints are randomly assigned to splits. Therefore, the amount of leaked data is the biggest here. .. image:: ../imgs/PLI_r.png :width: 600 Identity-based One-dimensional split (I1) ========================================= The easiest step in reducing information leaks is to make sure that all samples associated with one ID is one dimension end up in the same split. Therefore, a model cannot memorize this ID between training, validation, and test. In this case DataSAIL only optimizes the sizes of the splits towards the request by the user. .. image:: ../imgs/PLI_i1.png :width: 600 Identity-based Two-dimensional split (I2) ========================================= This can be enforced to both dimensions in case of a two-dimensional dataset. But, because all samples belong to one ID from wither either dimension, there are samples having their IDs in different splits. These samples cannot be assigned to any set and therefore are lost. In this scenario, DataSAIL reduces the number of lost samples while keeping the sizes of the splits close to what the user requested. .. image:: ../imgs/PLI_i2.png :width: 600 Similarity-based One-dimensional split (S1) =========================================== The next step is to cluster IDs before splitting. This is useful as samples of one dimension might be similar to each other. By making sure all datapoints of similar IDs remain in the same split, the model cannot memorize similarities of IDs and extrapolate them between splits. Here, DataSAIL optimizes for the overall information leak as well as for the size of the splits to be similar to what has been requested. .. image:: ../imgs/PLI_c1.png :width: 600 Similarity-based Two-Dimensional split (S2) =========================================== Lastly, this cluster-based splitting can be enforced on both dimensions of a two-dimensional dataset to reduce information leaks further more. Here, DataSAIL extends the objective from above with a penalty for lost samples. .. image:: ../imgs/PLI_c2.png :width: 600