Input formats
DataSAIL was designed to split biochemical datasets, but it can be applied to any other type of data if the user provides the necessary information. It therefore accepts several input formats, as required by the different types of data.
CSV and TSV Files
CSV and TSV files are a common way to share tabular data. In DataSAIL, these formats
are used to transport, e.g., data about molecules, weights of samples, or stratification. From these files, DataSAIL
only reads the first two columns. The first column has to contain the names of the samples and the second column the
corresponding information (SMILES or FASTA string, weighting, stratification, …). The first row must contain the column
names; DataSAIL therefore ignores it. Examples are given in :code:`tests/data/pipeline/drugs.tsv` (`Link <https://github.com/kalininalab/DataSAIL/blob/main/tests/data/pipeline/drugs.tsv>`__)
and tests/data/pipeline/drugs_weights.tsv (Link).
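The two-column layout described above can be illustrated with a short sketch. This is not DataSAIL's own parser; the function name `read_two_columns` and the sample content are hypothetical, and the sketch only mirrors the rules stated here (skip the header row, use the first two columns, ignore the rest).

```python
import csv
import io

def read_two_columns(handle, delimiter="\t"):
    """Return {sample name: value} from the first two columns, skipping the header row."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # the first row holds column names and is skipped
    data = {}
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        data[row[0]] = row[1]  # any further columns are ignored
    return data

# Hypothetical file content; names and SMILES strings are made up for illustration.
example = "ID\tSMILES\tcomment\nD1\tCCO\tethanol\nD2\tc1ccccc1\tbenzene\n"
drugs = read_two_columns(io.StringIO(example))
print(drugs)
```

For a CSV file, the same function could be called with `delimiter=","`.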
CSV and TSV files are also used to pass similarity and distance matrices. Examples are given in
tests/data/pipeline/drug_sim.csv (Link)
and tests/data/pipeline/drug_dist.csv (Link).
Here, the first row and column contain the names of the samples and the rest of the matrix the similarities or
distances between the samples.
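As a rough illustration of this matrix layout, the following sketch reads such a file into row names, column names, and a nested list of values. The function name `read_matrix`, the sample names, and the content of the top-left corner cell are assumptions made for the example; DataSAIL's actual reader may differ.

```python
import csv
import io

def read_matrix(handle, delimiter=","):
    """Read a matrix whose first row and first column hold the sample names."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    col_names = header[1:]  # the first header cell is the corner cell, not a sample name
    row_names, values = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        row_names.append(row[0])
        values.append([float(x) for x in row[1:]])
    return row_names, col_names, values

# Hypothetical similarity matrix for three samples.
example = "Drug_ID,D1,D2,D3\nD1,1.0,0.4,0.1\nD2,0.4,1.0,0.7\nD3,0.1,0.7,1.0\n"
row_names, col_names, values = read_matrix(io.StringIO(example))
print(row_names, values[1][2])
```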
CSV and TSV files can also be used to transport interactions. An example is given in
tests/data/pipeline/inter.tsv (Link).
Again, only the first two columns matter: each row specifies which sample from the e-entity interacts with which sample
from the f-entity.
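A minimal sketch of how such an interaction file could be produced is shown below. The column names, the sample names, the presence of a header row, and the order of the columns (e-entity first, f-entity second) are assumptions for illustration; they are not confirmed by the text above.

```python
import csv
import io

# Hypothetical (e-entity sample, f-entity sample) pairs; the names are made up.
pairs = [("D1", "P1"), ("D1", "P2"), ("D2", "P1")]

buffer = io.StringIO()  # stands in for open("inter.tsv", "w", newline="")
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["Drug_ID", "Target_ID"])  # header row; these column names are an assumption
writer.writerows(pairs)  # one interaction per row

text = buffer.getvalue()
print(text)
```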
FASTA Files
FASTA files are widely used for various biological inputs. DataSAIL recognizes all files that end with .fa,
.fna, or .fasta as FASTA files. In DataSAIL, they are used to transport information about protein
sequences, nucleotide sequences (e.g., DNA or RNA), and whole genomes.
For Protein and Nucleotide Sequences
Sequence-based datasets are stored inside a single file. Each sequence must be identified by its name in a line
starting with a >. All following lines are concatenated to form the sequence until there is an empty line, the
end of the file, or a line that starts with > and begins the next sequence. An example with protein sequences is
given in tests/data/pipeline/seqs.fasta (Link).
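The parsing rules above can be sketched as follows. This is an illustrative parser, not DataSAIL's implementation; the function name `parse_fasta` is hypothetical, and using the full text after > as the sample name is an assumption.

```python
import io

def parse_fasta(handle):
    """Return {name: sequence}, following the rules described above."""
    sequences = {}
    name = None
    for raw in handle:
        line = raw.strip()
        if line.startswith(">"):
            name = line[1:].strip()  # assumption: everything after '>' is the name
            sequences[name] = ""
        elif not line:
            name = None  # an empty line ends the current sequence
        elif name is not None:
            sequences[name] += line  # concatenate the lines of one sequence
    return sequences

# Hypothetical two-protein file; the sequences are made up.
example = ">P1\nMKV\nLAT\n\n>P2\nGGS\n"
proteins = parse_fasta(io.StringIO(example))
print(proteins)
```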
For Whole Genomes
Genome input through FASTA files differs slightly from the format above. Here, each file contains all contigs or reads
of one sample, and the dataset is represented by a folder. Examples are given in tests/data/genomes (Link).
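To make the folder layout concrete, the sketch below collects the FASTA files of a folder, one file per sample. The helper `list_genome_files` is hypothetical, and deriving the sample name from the file name without its extension is an assumption; the text above does not state how DataSAIL names genome samples.

```python
import tempfile
from pathlib import Path

FASTA_SUFFIXES = {".fa", ".fna", ".fasta"}  # extensions recognized as FASTA

def list_genome_files(folder):
    """Map an assumed sample name (the file stem) to each FASTA file in the folder."""
    return {
        p.stem: p
        for p in sorted(Path(folder).iterdir())
        if p.is_file() and p.suffix in FASTA_SUFFIXES
    }

# Build a throwaway folder with two hypothetical genomes and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sample_a.fna").write_text(">contig1\nACGT\n>contig2\nGGCC\n")
    (Path(tmp) / "sample_b.fasta").write_text(">contig1\nTTAA\n")
    (Path(tmp) / "notes.txt").write_text("not a genome\n")
    genomes = {name: path.name for name, path in list_genome_files(tmp).items()}

print(genomes)
```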
Pickle Files
From version 1.0.0 on, DataSAIL can also take embeddings as input. Here, the pickle file has to contain a dictionary
mapping the sample names to the embeddings. An example storing Morgan fingerprints of the molecules in
tests/data/pipeline/drugs.tsv in a pickle file is given in tests/data/pipeline/drugs.pkl (Link).
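A minimal sketch of writing and reading such a dictionary with Python's pickle module is given below. The sample names and the toy vectors are made up; plain lists are used to keep the example dependency-free, whereas real embeddings such as Morgan fingerprints would typically be NumPy arrays. The file name is also hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical embeddings: sample name -> vector (toy values, plain lists).
embeddings = {"D1": [0, 1, 1, 0], "D2": [1, 0, 0, 1]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.pkl"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(embeddings, f)  # store the whole dictionary in one file
    with open(path, "rb") as f:
        loaded = pickle.load(f)  # read it back as a dictionary

print(loaded)
```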
HDF5 Files
Also from version 1.0.0 on, DataSAIL supports the .h5 format. This format is used to store large datasets in a
runtime- and memory-efficient way. Similar to Pickle files, the HDF5 file has to contain a dictionary mapping the sample
names to the embeddings. An example storing Morgan fingerprints of the molecules in
tests/data/pipeline/drugs.tsv in an HDF5 file is given in tests/data/pipeline/drugs.h5 (Link).
To open and convert it to a dictionary, the following code can be used:
import h5py
import numpy as np

# Load every dataset in the file into a dict mapping sample names to NumPy arrays
with h5py.File('tests/data/pipeline/morgan.h5', 'r') as f:
    morgan = {k: np.array(v) for k, v in f.items()}
Example code for creation and reading of Pickle and HDF5 files can be found in tests/data/pipeline/embed.py (Link).
Molecular Input Files
Molecules can be input as SMILES strings in TSV and CSV format as described above, but also using dedicated
file formats. DataSAIL supports the following file formats: .mol, .mol2, .mrv, .pdb,
.sdf, .tpl, and .xyz. Files may only contain a single molecule (or molecular conformation),
except for .sdf files, which can contain multiple molecules. The molecules are named based on their property
_Name or, if the property is not set, their filename. In the case of .sdf files and molecules without a
_Name property, the index at which they are stored in the file is used as a suffix to distinguish between
molecules in the same file.
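The naming rule can be summarized in a small sketch. The function `molecule_name` is hypothetical and not part of DataSAIL; dropping the file extension and joining the index with an underscore are assumptions, since the exact format of the resulting names is not specified above.

```python
from pathlib import Path

def molecule_name(filename, name_prop=None, index=None):
    """Pick a name for a molecule following the rule described above."""
    if name_prop:
        return name_prop  # the _Name property takes precedence
    stem = Path(filename).stem  # assumption: the extension is dropped
    if index is not None:
        return f"{stem}_{index}"  # assumption: underscore-separated index suffix
    return stem

print(molecule_name("aspirin.mol"))                   # single-molecule file without _Name
print(molecule_name("library.sdf", index=2))          # unnamed molecule inside an .sdf file
print(molecule_name("library.sdf", "caffeine", 0))    # _Name property is set
```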
Example files for .mol, .mrv, .pdb, and .tpl are given in
tests/data/pipeline/mol_formats/<FORMAT>/ (Link).