DataSAIL

DataSAIL, short for Data Splitting Against Information Leakage, is a versatile tool designed to partition data while minimizing similarities between the partitions. Inter-sample similarities can lead to information leakage, resulting in an overestimation of the model’s performance in certain training regimes.
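To make the idea concrete, here is a minimal sketch of a similarity-aware split. It is not DataSAIL's algorithm: the names `similarity` and `cluster_split`, the `difflib`-based metric, and the threshold are all assumptions chosen for illustration. Samples whose pairwise similarity reaches the threshold are merged into one cluster, and whole clusters are assigned to a split, so that no highly similar pair ends up on both sides.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings (stand-in metric for illustration)."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_split(samples, threshold=0.8, test_fraction=0.3):
    """Split samples so that no pair with similarity >= threshold is separated."""
    # Union-find over sample indices: similar samples end up in one cluster.
    parent = list(range(len(samples)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(samples)), 2):
        if similarity(samples[i], samples[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, sample in enumerate(samples):
        clusters.setdefault(find(i), []).append(sample)

    # Assign whole clusters, smallest first, to the test split until the
    # target size would be exceeded; everything else goes to training.
    train, test = [], []
    target = test_fraction * len(samples)
    for members in sorted(clusters.values(), key=len):
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return train, test


# Hypothetical toy data: two pairs of near-duplicates and two unrelated strings.
samples = ["MKTAYIAKQR", "MKTAYIAKQK", "GGSLVPRGSH", "GGSLVPRGSA", "PLWQEDNTCF", "HHHHHHSSGL"]
train, test = cluster_split(samples)
```

A purely random split could place one member of a near-duplicate pair in training and the other in testing, which is the kind of leakage described above.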

DataSAIL was initially developed for machine learning workflows involving biological datasets, but its utility extends to any type of dataset. It can be used through a command line interface or integrated as a Python package. The tool is released under the MIT license and is freely available as open source on GitHub.

Note

DataSAIL is a work in progress, and we are continuously improving it. If you have any suggestions or find any bugs, please open an issue in our Issue Tracker on GitHub.

Note

If you want to collaborate with us on using DataSAIL on non-biochemical datasets, please reach out to us via email at roman.joeres[at]helmholtz-hips.de.

Install

DataSAIL is available for all modern versions of Python (v3.9 or newer). We ship two versions of DataSAIL:

  • DataSAIL: The full version of DataSAIL, which includes all third-party clustering algorithms and is available on conda for Linux and OSX (called datasail).

  • DataSAIL-lite: A lightweight version of DataSAIL, which does not include any third-party clustering algorithms and is available on PyPI (called datasail) and conda (called datasail-lite).

Note

There is a naming inconsistency between the conda and PyPI versions of DataSAIL. The lite version is called datasail-lite on conda but datasail on PyPI. This will be fixed in the future, but for now, please be aware of this inconsistency.

The install command depends on the operating system (Linux, OSX, OSX-ARM, or Windows) and the package manager (Conda or Pip).

Note

If you install DataSAIL from conda, it is recommended to use mamba because conda might not be able to resolve the dependencies of DataSAIL successfully.
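As a quick sanity check after installing, the following sketch verifies the two requirements stated above: a Python interpreter of version 3.9 or newer and an importable `datasail` package. The helper name `check_environment` is hypothetical and not part of DataSAIL.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # minimum Python version stated in the install notes


def check_environment():
    """Return (python_ok, datasail_installed) for the current interpreter."""
    python_ok = sys.version_info >= MIN_PYTHON
    # find_spec returns None when no top-level package of that name is installed.
    datasail_installed = importlib.util.find_spec("datasail") is not None
    return python_ok, datasail_installed


python_ok, datasail_installed = check_environment()
print(f"Python >= 3.9: {python_ok}, datasail importable: {datasail_installed}")
```

Note that both the full and the lite version are imported under the package name `datasail`, so this check does not tell them apart.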

Quick Start

Regardless of which installation command was used, DataSAIL can be executed by running

datasail -h

in the command line to see the parameters DataSAIL takes. For a more detailed description, see the documentation of the command line interface. DataSAIL can also be imported directly as a regular package in your Python program using

from datasail.sail import datasail

splits = datasail(...)

The arguments of the datasail function are explained in the function's documentation. Because they are mostly the same as the CLI arguments, the CLI description also serves as a more detailed reference for them.
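Since the CLI and the Python function share most of their arguments, it can be handy to pull up the CLI help text from within Python. The sketch below does this with the standard library only; `datasail_help` is a hypothetical helper name, and the only DataSAIL-specific call it makes is the `datasail -h` command shown above. It returns None when the `datasail` executable is not on the PATH.

```python
import shutil
import subprocess


def datasail_help(timeout=60):
    """Return the output of `datasail -h`, or None if the CLI is not installed."""
    executable = shutil.which("datasail")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "-h"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # do not raise on a non-zero exit code
    )
    # Help text normally goes to stdout; fall back to stderr if stdout is empty.
    return result.stdout or result.stderr


help_text = datasail_help()
if help_text is None:
    print("The datasail command was not found on the PATH.")
else:
    print(help_text)
```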

For frequently asked questions, please refer to the FAQ section.