Input formats

DataSAIL has been designed to split biochemical datasets but can be applied to any other type of data if the user can provide the necessary information. It therefore accepts several input formats, each suited to a different type of data.

CSV and TSV Files

CSV and TSV files are a standard way to share tabular data. In DataSAIL, these formats are used to transport, for example, data about molecules, sample weights, or stratification. From these files, DataSAIL reads only the first two columns. The first column has to contain the names of the samples and the second column the corresponding information (SMILES or FASTA string, weighting, stratification, …). The first row must contain the column names and is therefore ignored by DataSAIL. Examples are given in :code:`tests/data/pipeline/drugs.tsv` (`Link <https://github.com/kalininalab/DataSAIL/blob/main/tests/data/pipeline/drugs.tsv>`__) and tests/data/pipeline/drugs_weights.tsv (Link).
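The two-column layout described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not DataSAIL's own reader; the helper `read_two_column_file`, the file name, and the example rows are hypothetical and only mirror the rules stated here: the header row is skipped and only the first two columns are kept.

```python
import csv
import tempfile
from pathlib import Path


def read_two_column_file(path, delimiter="\t"):
    """Return {sample name: value} from the first two columns, skipping the header."""
    mapping = {}
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        next(reader, None)  # the first row holds column names and is ignored
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or incomplete lines
            mapping[row[0]] = row[1]  # any further columns are not read
    return mapping


# Hypothetical example data: sample names, SMILES strings, and an extra column.
rows = [
    ["ID", "SMILES", "comment"],
    ["aspirin", "CC(=O)OC1=CC=CC=C1C(=O)O", "ignored"],
    ["ethanol", "CCO", "ignored"],
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "drugs_example.tsv"
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)
    drugs = read_two_column_file(path)

print(drugs)
```

For a .csv file, the same helper can be called with `delimiter=","`.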

These files are also used to ship similarity and distance matrices. Examples are given in tests/data/pipeline/drug_sim.csv (Link) and tests/data/pipeline/drug_dist.csv (Link). Here, the first row and the first column contain the names of the samples, and the remaining cells hold the similarities or distances between the samples.
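As a rough illustration of the matrix layout, the following sketch parses a small, made-up similarity matrix with the standard library. The helper `read_matrix` and the sample names are hypothetical, and the content of the top-left corner cell is assumed to be irrelevant.

```python
import csv
import io

# Hypothetical 3x3 similarity matrix; the first row and first column hold sample names.
text = """\
,D1,D2,D3
D1,1.0,0.2,0.5
D2,0.2,1.0,0.7
D3,0.5,0.7,1.0
"""


def read_matrix(handle, delimiter=","):
    """Return (row names, column names, matrix as a list of float lists)."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    col_names = header[1:]  # the corner cell is skipped (assumption)
    row_names, matrix = [], []
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        row_names.append(row[0])
        matrix.append([float(value) for value in row[1:]])
    return row_names, col_names, matrix


names, col_names, sim = read_matrix(io.StringIO(text))
print(names, sim)
```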

CSV and TSV files can also be used to transport interactions. An example is given in tests/data/pipeline/inter.tsv (Link). Again, only the first two columns matter: each row specifies which sample from the e-entity interacts with which sample from the f-entity.
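A minimal sketch of reading such an interaction file into a list of pairs is shown below. It assumes that the header-row convention of the other CSV and TSV inputs also applies here; the helper `read_interactions`, the column names, and the sample names are hypothetical.

```python
import csv
import io

# Hypothetical interaction table; only the first two columns are used.
text = "Drug_ID\tTarget_ID\tY\nD1\tP1\t1\nD1\tP2\t1\nD2\tP1\t1\n"


def read_interactions(handle, delimiter="\t"):
    """Return a list of (first-column sample, second-column sample) pairs."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header row (assumed to be present)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


inter = read_interactions(io.StringIO(text))
print(inter)
```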

FASTA Files

FASTA files are widely used for various biological inputs. DataSAIL recognizes all files that end with .fa, .fna, or .fasta as FASTA files. In DataSAIL, they are used to transport information about protein sequences, nucleotide sequences (e.g., DNA or RNA), and whole genomes.

For Protein and Nucleotide Sequences

Sequence-based datasets are stored inside a single file. Each sequence must be identified by its name in a line starting with a >. All following lines are concatenated to form the sequence until there is an empty line, the end of the file, or a line that starts with > and thereby begins the next sequence. An example with protein sequences is given in tests/data/pipeline/seqs.fasta (Link).
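The parsing rules above can be sketched in a few lines of Python. This is an illustration of the stated rules, not DataSAIL's own parser; the helper `read_fasta` is hypothetical, and using the whole header line after the > as the sequence name is an assumption.

```python
import io


def read_fasta(handle):
    """Return {name: sequence} following the rules described above."""
    sequences = {}
    name = None
    for raw in handle:
        line = raw.strip()
        if line.startswith(">"):
            name = line[1:].strip()  # assumption: the full header is the name
            sequences[name] = ""
        elif not line:
            name = None  # an empty line ends the current sequence
        elif name is not None:
            sequences[name] += line  # concatenate wrapped sequence lines
    return sequences


# Hypothetical example with two short protein sequences.
text = ">P1\nMKT\nAYI\n\n>P2\nGSH\n"
seqs = read_fasta(io.StringIO(text))
print(seqs)
```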

For Whole Genomes

Genome input through FASTA files differs slightly from the format above. Here, each file contains all contigs or reads of one sample, and the dataset is represented by a folder of such files. Examples are given in tests/data/genomes (Link).
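To make the folder layout concrete, the sketch below builds a temporary folder with two made-up genome files and counts the contigs per sample. The helper `count_contigs` and the file names are hypothetical, and naming each sample after its file name without the extension is an assumption, not something stated above.

```python
import tempfile
from pathlib import Path

# File endings that are treated as FASTA files.
FASTA_SUFFIXES = {".fa", ".fna", ".fasta"}


def count_contigs(folder):
    """Return {sample name: number of contigs} for every FASTA file in a folder."""
    counts = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix not in FASTA_SUFFIXES:
            continue  # ignore files that are not FASTA files
        with open(path) as handle:
            # Each header line (starting with ">") opens one contig or read.
            counts[path.stem] = sum(1 for line in handle if line.startswith(">"))
    return counts


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "genome_a.fna").write_text(">contig1\nACGT\n>contig2\nGGCC\n")
    (Path(tmp) / "genome_b.fna").write_text(">contig1\nTTAA\n")
    (Path(tmp) / "notes.txt").write_text("not a genome\n")
    contigs_per_sample = count_contigs(tmp)

print(contigs_per_sample)
```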

Pickle Files

From version 1.0.0 on, DataSAIL can also take embeddings as input. Here, the pickle file has to contain a dictionary mapping the sample names to the embeddings. An example storing Morgan fingerprints of the molecules in tests/data/pipeline/drugs.tsv in a pickle file is given in tests/data/pipeline/drugs.pkl (Link).
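A minimal sketch of writing and reading such a dictionary with the standard `pickle` module follows. The sample names and values are made up, and plain lists of floats stand in for real embeddings here, which is an assumption made to keep the example free of third-party dependencies.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical embeddings: sample name -> vector (plain lists as stand-ins).
embeddings = {
    "D1": [0.0, 1.0, 0.0, 1.0],
    "D2": [1.0, 1.0, 0.0, 0.0],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings_example.pkl"
    # Pickle files are binary, so open them in "wb" / "rb" mode.
    with open(path, "wb") as handle:
        pickle.dump(embeddings, handle)
    with open(path, "rb") as handle:
        loaded = pickle.load(handle)

print(loaded)
```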

HDF5 Files

From version 1.0.0 on, DataSAIL also supports the .h5 format. This format is used to store large datasets in a runtime- and memory-efficient way. Similar to pickle files, the HDF5 file has to map the sample names to the embeddings. An example storing Morgan fingerprints of the molecules in tests/data/pipeline/drugs.tsv in an HDF5 file is given in tests/data/pipeline/drugs.h5 (Link). To open and convert it to a dictionary, the following code can be used:

import h5py
import numpy as np

# Read every top-level dataset into memory, keyed by sample name.
with h5py.File('tests/data/pipeline/morgan.h5', 'r') as f:
    morgan = {k: np.array(v) for k, v in f.items()}

Example code for creation and reading of Pickle and HDF5 files can be found in tests/data/pipeline/embed.py (Link).

Molecular Input Files

Molecules can be input as SMILES strings in TSV and CSV format as described above, but also using dedicated file formats. DataSAIL supports the following file formats: .mol, .mol2, .mrv, .pdb, .sdf, .tpl, and .xyz. Files may only contain a single molecule (or molecular conformation), except for .sdf files, which can contain multiple molecules. Molecules are named after their _Name property or, if that property is not set, after their filename. For .sdf files containing molecules without a _Name property, the index at which each molecule is stored in the file is appended as a suffix to distinguish between molecules in the same file.

Example files for .mol, .mrv, .pdb, and .tpl are given in tests/data/pipeline/mol_formats/<FORMAT>/ (Link).