Package

datasail

Entry point for using DataSAIL as a Python package.

Parameters

techniques – List of techniques to split the data by.
inter – Filepath to a TSV file storing interactions of the e-entities and f-entities.
output – Output directory to store the results in.
max_sec – Maximum number of seconds to spend optimizing a found solution.
verbose – Verbosity level for logging.
splits – List of split sizes. They should add up to one; otherwise they are scaled accordingly.
names – List of names of the splits.
epsilon – Fraction by which the provided split sizes may be undercut.
delta – Fraction by which the stratification may be undercut.
runs – Number of runs to perform per split. This may introduce some variance in the splits.
solver – Solving algorithm to use.
cache – Boolean flag indicating whether to store results in, and load them from, the cache.
cache_dir – Directory to store the cache in if not the default location.
linkage – Linkage method to use to compute metrics between merged clusters.
e_type – Data format of the first batch of data.
e_data – Data file of the first batch of data.
e_weights – Weighting of the datapoints from e_data.
e_strat – Stratification of the datapoints from e_data.
e_sim – Similarity measure to apply to the e-data.
e_dist – Distance measure to apply to the e-data.
e_args – Additional arguments for the tools in e_sim or e_dist.
e_clusters – Number of clusters to find in the e-data.
f_type – Data format of the second batch of data.
f_data – Data file of the second batch of data.
f_weights – Weighting of the datapoints from f_data.
f_strat – Stratification of the datapoints from f_data.
f_sim – Similarity measure to apply to the f-data.
f_dist – Distance measure to apply to the f-data.
f_args – Additional arguments for the tools in f_sim or f_dist.
f_clusters – Number of clusters to find in the f-data.
threads – Number of threads to use for one CD-HIT run.
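
The interplay of splits and epsilon can be illustrated with a small standalone sketch. It does not call DataSAIL; the helper names normalize_splits and min_split_sizes are hypothetical, and both the proportional rescaling and the relative reading of epsilon are assumptions about what "scaled accordingly" and "undercut" mean, not confirmed library behavior.

```python
from typing import List


def normalize_splits(splits: List[float]) -> List[float]:
    """Rescale split sizes proportionally so that they sum to one (assumed behavior)."""
    total = sum(splits)
    if total <= 0:
        raise ValueError("split sizes must have a positive sum")
    return [s / total for s in splits]


def min_split_sizes(splits: List[float], epsilon: float) -> List[float]:
    """Lower bound per split if each may be undercut by a relative fraction epsilon (assumption)."""
    return [s * (1 - epsilon) for s in normalize_splits(splits)]


# Sizes given as 7:2:1 are rescaled to fractions of the whole dataset.
fractions = normalize_splits([7, 2, 1])
# With epsilon = 0.05, each split may fall up to 5 % below its target fraction.
lower_bounds = min_split_sizes([7, 2, 1], epsilon=0.05)
```

Under these assumptions, passing splits that already sum to one leaves them unchanged, so the rescaling only matters for inputs such as ratios.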

Returns

Three dictionaries, each mapping techniques to an inner dictionary. Each inner dictionary maps input IDs to their assigned splits.
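
As a rough sketch of how one of the returned dictionaries might be consumed, the snippet below builds a mock result with the nesting described here and groups the input IDs by split. The technique key "I1e", the sample IDs, the split names, and the helper ids_by_split are placeholders chosen for illustration; the actual keys depend on the techniques and names arguments passed in.

```python
from collections import defaultdict
from typing import Dict, List

# Mock data shaped as described above: technique -> {input id -> split name}.
# All keys and values here are illustrative placeholders, not real DataSAIL output.
e_splits: Dict[str, Dict[str, str]] = {
    "I1e": {"mol_1": "train", "mol_2": "train", "mol_3": "test"},
}


def ids_by_split(assignment: Dict[str, str]) -> Dict[str, List[str]]:
    """Invert an id -> split mapping into split -> list of ids."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for sample_id, split_name in assignment.items():
        groups[split_name].append(sample_id)
    return dict(groups)


# Collect the IDs assigned to each split for one technique.
grouped = ids_by_split(e_splits["I1e"])
```

Inverting the mapping this way is convenient when the splits are later used to index a dataset by ID.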