DataSAIL

DataSAIL, short for Data Splitting Against Information Leakage, is a tool for splitting data in a way that minimizes information leakage. It was designed for machine learning workflows on biological datasets but can be applied to other types of data as well. DataSAIL can be used through its command line interface or as a Python package. It is open source, licensed under the MIT license, and available on GitHub. It can be installed from conda using mamba.
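To illustrate what splitting against information leakage can mean, here is a minimal sketch that assigns whole clusters of similar samples to one split each, so that no cluster is shared between training and test data. This is not DataSAIL's algorithm or API: the function `cluster_aware_split`, its parameters, and the toy data are hypothetical, and the greedy assignment is only one simple way to approximate the requested split sizes.

```python
import random
from collections import defaultdict


def cluster_aware_split(cluster_of, fractions, seed=0):
    """Assign whole clusters to splits so that no cluster spans two splits.

    cluster_of: dict mapping sample id -> cluster id (hypothetical input format)
    fractions:  dict mapping split name -> target fraction of all samples
    """
    # Group sample ids by their cluster
    clusters = defaultdict(list)
    for sample, cluster in cluster_of.items():
        clusters[cluster].append(sample)

    # Shuffle reproducibly for tie-breaking, then place the largest clusters first
    rng = random.Random(seed)
    order = sorted(clusters)
    rng.shuffle(order)
    order.sort(key=lambda c: len(clusters[c]), reverse=True)

    total = len(cluster_of)
    targets = {name: frac * total for name, frac in fractions.items()}
    splits = {name: [] for name in fractions}

    for cluster in order:
        # Greedy choice: the split that is currently furthest below its target size
        name = max(splits, key=lambda n: targets[n] - len(splits[n]))
        splits[name].extend(clusters[cluster])
    return splits


# Toy data: ten samples in five clusters of similar items (illustrative only)
toy = {"s1": "A", "s2": "A", "s3": "B", "s4": "B", "s5": "C",
       "s6": "D", "s7": "D", "s8": "E", "s9": "E", "s10": "E"}
splits = cluster_aware_split(toy, {"train": 0.7, "test": 0.3})
for name, members in splits.items():
    print(name, sorted(members))
```

A purely random split could place samples s1 and s2 (both in cluster A) on opposite sides of the train/test boundary, which is the kind of leakage this grouping avoids.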

Quick Start

DataSAIL is available for all modern versions of Python (v3.8 or newer). The installation command differs from the one shown on the conda website. To install DataSAIL in a newly created environment, run

mamba install -c kalininalab -c conda-forge -c bioconda datasail
pip install grakel

The second command is necessary to run WLK clustering, as the grakel library is not available on conda for Python 3.10 or newer. Alternatively, one can install DataSAIL-lite from conda with

mamba install -c kalininalab -c conda-forge -c bioconda datasail-lite
pip install grakel

Note

It is important to use mamba for the installation because conda might not be able to resolve the dependencies of DataSAIL successfully.

The difference between DataSAIL and DataSAIL-lite is that the latter does not include the clustering algorithms, so the user has to install them manually as needed. The reason is that the clustering algorithms are not available for all operating systems, while DataSAIL itself should be installable on all of them.
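After either installation, a short preflight check can confirm that the prerequisites named above are in place. The following is a minimal sketch using only the Python standard library; the helper name `check_environment` and the keys of its result are hypothetical and not part of DataSAIL. It only tests that the interpreter is new enough, that `grakel` can be imported, and that a `datasail` command is found on the PATH.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Report whether the prerequisites named in this section appear to be met."""
    return {
        # DataSAIL requires Python 3.8 or newer
        "python_ok": sys.version_info >= (3, 8),
        # grakel is installed separately via pip and is needed for WLK clustering
        "grakel_importable": importlib.util.find_spec("grakel") is not None,
        # the datasail command should be on PATH after installation
        "datasail_on_path": shutil.which("datasail") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

A "no" for `grakel_importable` only matters if WLK clustering is needed; the other clustering tools that DataSAIL-lite leaves to the user are not covered by this check.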

Regardless of which installation command was used, DataSAIL can be executed by running

datasail -h

in the command line to see the parameters DataSAIL takes. For a more detailed description see here. DataSAIL can also be imported directly as a regular package in your Python program using

from datasail.sail import datasail

splits = datasail(...)

The arguments for using DataSAIL as a package are explained in the method’s documentation. Because they are mostly the same as the CLI arguments, the description of the CLI provides a more detailed explanation of them.