Commandline Interface
Here, we discuss the arguments of the command-line interface (CLI) in more detail. As they are largely the same as for the package usage of DataSAIL, this section also serves as an extended description of the package interface.
General Arguments
This section describes the argument structure of DataSAIL. The arguments are mostly the same between the Python function and the CLI. Their functionality does not change, but some arguments are not available in the package version; this is noted where it applies. What may differ is the type of input accepted. The package version of DataSAIL usually accepts a string path to a file, a dictionary or a list (depending on the argument), or a function or generator thereof. For more details on the supported types, please check out the type annotations of the package entry point to DataSAIL.
-o / --output
CLI only! Required!
The path to the output directory to store the splits in. This folder will contain all splits, reports, and logs from the execution.
-i / --inter
The filepath to the TSV file of interactions between two entities. More details are given here.
--to-sec
The maximal time to spend optimizing the objective in seconds. This does not include preparatory work such as parsing data and clustering the input.
--threads
The number of threads to use throughout the computation. This number of threads is also forwarded to clustering programs used internally. If 0, all available CPUs will be used.
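The rule for 0 can be sketched as follows. This is only an illustration, not DataSAIL's own code: the helper name resolve_threads is hypothetical, and os.cpu_count() stands in for however DataSAIL detects the available CPUs.

```python
import os


def resolve_threads(requested: int) -> int:
    """Return the thread count to use; 0 means all available CPUs (hypothetical helper)."""
    if requested < 0:
        raise ValueError("the number of threads must not be negative")
    if requested == 0:
        # os.cpu_count() can return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    return requested
```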
--verbose
The verbosity level of the program. Choices are: [C]ritical, [F]atal, [E]rror, [W]arning, [I]nfo, [D]ebug
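For readers used to Python's logging module, the letters can be read as the standard log levels. The mapping below is an assumption made for illustration (the names VERBOSITY_LEVELS and to_log_level are hypothetical); it is not taken from DataSAIL's source.

```python
import logging

# Assumed mapping from the single-letter choices to standard logging levels.
VERBOSITY_LEVELS = {
    "C": logging.CRITICAL,
    "F": logging.FATAL,  # in the standard library, FATAL is an alias of CRITICAL
    "E": logging.ERROR,
    "W": logging.WARNING,
    "I": logging.INFO,
    "D": logging.DEBUG,
}


def to_log_level(choice: str) -> int:
    """Translate a verbosity letter into a logging level (hypothetical helper)."""
    if choice not in VERBOSITY_LEVELS:
        raise ValueError(f"unknown verbosity choice: {choice!r}")
    return VERBOSITY_LEVELS[choice]
```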
-v / --version
CLI only!
Get the number of the installed version of DataSAIL.
Splitting Arguments
The following arguments are used to specify the splitting mode and the splits to compute. The arguments are the same for the CLI and the package version of DataSAIL.
-t / --techniques
Required!
Select the mode to split the data. Choices are:
R: Random split,
I1: identity-based cold-single split,
I2: identity-based cold-double split,
C1: similarity-based cold-single split,
C2: similarity-based cold-double split
For both I1 and C1, you have to append e or f, i.e., I1e, I1f, C1e, or C1f, to make clear whether DataSAIL shall compute the cold split based on the e-entity or the f-entity.
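The naming rule can be checked before calling DataSAIL with a small helper such as the one below. It is a sketch built only from the choices listed above; the names VALID_TECHNIQUES and check_technique are hypothetical and not part of DataSAIL.

```python
# Technique names as listed above, with the e/f suffix applied to I1 and C1.
VALID_TECHNIQUES = {"R", "I1e", "I1f", "I2", "C1e", "C1f", "C2"}


def check_technique(name: str) -> str:
    """Return the name unchanged if it is a valid technique, else raise ValueError."""
    if name in ("I1", "C1"):
        raise ValueError(f"{name} needs an entity suffix: use {name}e or {name}f")
    if name not in VALID_TECHNIQUES:
        raise ValueError(f"unknown technique: {name!r}")
    return name
```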
-s / --splits
The sizes of the individual splits the program shall produce.
-n / --names
The names of the splits in order of the -s argument. If left empty, splits will be called Split1, Split2, …
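The default naming can be reproduced as follows. This is a minimal sketch; the function split_names is hypothetical, and the length check is an assumption about sensible input rather than documented behavior.

```python
def split_names(sizes, names=None):
    """Return one name per split, falling back to Split1, Split2, and so on."""
    if not names:
        # No names given: number the splits starting at 1.
        return [f"Split{i}" for i in range(1, len(sizes) + 1)]
    if len(names) != len(sizes):
        raise ValueError("expected exactly one name per split size")
    return list(names)
```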
--overflow
How to handle overflow of the splits. If ‘assign’, a cluster that overflows a split size will be assigned to one split. The remaining data is split normally into n-1 splits. If ‘break’, the cluster will be broken into smaller parts to fit into a split.
-d / --delta
A multiplicative factor by how much the limits of the stratification (as defined in the -s / --splits argument) can be exceeded.
-e / --epsilon
A multiplicative factor by how much the limits of the splits (as defined in the -s / --splits argument) can be exceeded.
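One plausible reading of "multiplicative factor", assumed here and not confirmed by this section, is that a split with target size s may grow up to s * (1 + epsilon). Under that assumption, the relaxed upper limits can be computed as below; the function upper_limits is hypothetical.

```python
def upper_limits(sizes, epsilon):
    """Relaxed upper bound per split, assuming the limit is size * (1 + epsilon)."""
    if epsilon < 0:
        raise ValueError("epsilon must not be negative")
    return [size * (1 + epsilon) for size in sizes]
```

With target sizes of 700, 200, and 100 samples and an epsilon of 0.05, this reading would allow the splits to hold up to about 735, 210, and 105 samples.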
-r / --runs
The number of different runs to perform per technique. The idea is to compute several different splits of the dataset using the same technique to investigate the variance of the model across data splits. The variance in splits is introduced by shuffling the dataset every time a new split is requested.
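The shuffling idea can be illustrated with a toy example. This is a sketch of the general principle only, not DataSAIL's implementation; the function shuffled_runs and its seed parameter are hypothetical.

```python
import random


def shuffled_runs(samples, runs, seed=0):
    """Return one independently shuffled copy of the samples per run."""
    rng = random.Random(seed)  # fixed seed so the illustration is reproducible
    orders = []
    for _ in range(runs):
        order = list(samples)  # copy, so the input is never modified
        rng.shuffle(order)
        orders.append(order)
    return orders
```

Each run sees the same samples in a different order, so a splitting procedure that depends on the input order can produce a different split per run.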
--solver
Which solver to use to solve the binary linear program. The choices are presented here.
--cache
Boolean flag indicating whether to store clustering matrices in a cache so that clusters are not recomputed multiple times.
--cache-dir
Destination of the cache folder. The default is the OS-default cache directory.
Entity Arguments
The following arguments are entity-specific and the same for e-entities and f-entities. We describe the arguments for the e-entities. The arguments for the f-entities can be derived by replacing "e-" with "f-".
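The renaming rule is mechanical, as the following sketch shows. The helper to_f_flag is hypothetical and only illustrates the string replacement described above.

```python
def to_f_flag(e_flag: str) -> str:
    """Derive the f-entity flag from an e-entity flag, e.g. --e-type to --f-type."""
    prefix = "--e-"
    if not e_flag.startswith(prefix):
        raise ValueError(f"not an e-entity flag: {e_flag!r}")
    return "--f-" + e_flag[len(prefix):]
```

Flags that are not entity-specific, such as --epsilon, are rejected rather than renamed.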
--e-type
The type of the first data batch to the program. Choices are: [P]rotein, [M]olecule, [G]enome, [O]ther
--e-data
The first input to the program. This can be either a filepath or a directory containing only data files.
--e-weights
The custom weights of the samples. The format can be a CSV/TSV file or equivalent, as described above.
--e-sim
Provide the name of a method to determine similarity between samples of the first input dataset. This can either be the name of a method based on the data type (see here for available methods) or a filepath to a file storing the pairwise similarities in TSV (see here for details).
--e-dist
Provide the name of a method to determine distance between samples of the first input dataset. This can either be the name of a method based on the data type (see here for available methods) or a filepath to a file storing the pairwise distances in TSV (see here for details).
--e-strat
A file containing the stratification of the first input dataset. The stratification is a TSV file as described here.
--e-args
Additional arguments for the clustering algorithm used in --e-dist or --e-sim.
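To show how the arguments combine, the sketch below assembles a command line from flags described in this document. Several things are assumptions: that the executable is called datasail, that list-valued arguments take space-separated values, and that the type is given by its letter. All paths, sizes, and names are placeholders, and the helper build_command is hypothetical.

```python
import shlex


def build_command(output, techniques, splits, names, e_type, e_data):
    """Assemble an argument list using only flags described in this document."""
    return [
        "datasail",  # assumption: name of the installed executable
        "-o", output,
        "-t", *techniques,  # assumption: several values are separated by spaces
        "-s", *(str(size) for size in splits),
        "-n", *names,
        "--e-type", e_type,
        "--e-data", e_data,
    ]


command = build_command(
    output="splits_out",             # placeholder output directory
    techniques=["C1e"],
    splits=[0.7, 0.2, 0.1],          # placeholder split sizes
    names=["train", "val", "test"],
    e_type="P",                      # assumption: the letter selects [P]rotein
    e_data="path/to/e_data",         # placeholder input path
)
print(shlex.join(command))
```

Building the command as a list and quoting it with shlex.join avoids problems with spaces in paths when the line is pasted into a shell.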