Commandline Interface

Here, we discuss the arguments of the command-line interface (CLI) in more detail. As they are largely the same as for the package usage of DataSAIL, this also serves as an extended description of the package arguments.

General Arguments

In this section, we discuss the argument structure of DataSAIL. The arguments are mostly the same between the Python function and the CLI. Their functionality does not change, but some arguments are not available in the package version; this is noted where it applies. What may differ is the type of input accepted. The package version of DataSAIL usually accepts a string path to a file, a dictionary or a list (depending on the argument), or a function or generator that produces such data. For more details on the supported types, please check the type annotations of the package entry point to DataSAIL.

-o / --output

CLI only! Required!

The path to the output directory to store the splits in. This folder will contain all splits, reports, and logs from the execution.

-i / --inter

The filepath to the TSV file of interactions between two entities. More details are given here.

--to-sec

The maximal time to spend optimizing the objective in seconds. This does not include preparatory work such as parsing data and clustering the input.

--threads

The number of threads to use throughout the computation. This number of threads is also forwarded to clustering programs used internally. If 0, all available CPUs will be used.
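The "0 means all CPUs" convention can be illustrated with a minimal sketch. The function name resolve_threads is hypothetical and not part of DataSAIL; how DataSAIL resolves the value internally is not documented here.

```python
import os


def resolve_threads(requested: int) -> int:
    """Return the effective thread count; 0 means 'use all available CPUs'.

    Hypothetical helper for illustration only, not DataSAIL's own code.
    """
    if requested < 0:
        raise ValueError("thread count must be non-negative")
    if requested == 0:
        # os.cpu_count() can return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    return requested
```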

--verbose

The verbosity level of the program. Choices are: [C]ritical, [F]atal, [E]rror, [W]arning, [I]nfo, [D]ebug
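As a rough illustration, the single-letter choices could be mapped onto the levels of Python's standard logging module as follows. This mapping is an assumption made for the example; the source does not state how DataSAIL translates the letters internally.

```python
import logging

# Assumed mapping from the documented letters to standard logging levels.
VERBOSITY_LEVELS = {
    "C": logging.CRITICAL,
    "F": logging.FATAL,  # alias of CRITICAL in the standard library
    "E": logging.ERROR,
    "W": logging.WARNING,
    "I": logging.INFO,
    "D": logging.DEBUG,
}


def verbosity_to_level(letter: str) -> int:
    """Translate a verbosity letter into a logging level (hypothetical helper)."""
    try:
        return VERBOSITY_LEVELS[letter.upper()]
    except KeyError:
        raise ValueError(f"unknown verbosity choice: {letter!r}") from None
```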

-v / --version

CLI only!

Get the number of the installed version of DataSAIL.

Splitting Arguments

The following arguments are used to specify the splitting mode and the splits to compute. The arguments are the same for the CLI and the package version of DataSAIL.

-t / --techniques

Required!

Select the mode to split the data. Choices are:

R: Random split,
I1: identity-based cold-single split,
I2: identity-based cold-double split,
C1: similarity-based cold-single split,
C2: similarity-based cold-double split

For both I1 and C1, you have to append e or f, i.e. I1e, I1f, C1e, or C1f, to specify whether DataSAIL shall compute the cold split based on the e-entity or the f-entity.
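The naming rule can be checked with a short sketch that accepts exactly the technique names listed above (R, I2, C2, and I1/C1 with an e or f suffix). The function parse_technique is hypothetical and only illustrates the naming scheme; it is not DataSAIL's own parser.

```python
import re

# R, I2, C2, or I1/C1 followed by the entity suffix e or f.
_TECHNIQUE = re.compile(r"^(R|I2|C2|[IC]1[ef])$")


def parse_technique(name: str):
    """Split a technique name into (mode, dimension, entity).

    Hypothetical helper: 'C1e' -> ('C', 1, 'e'), 'I2' -> ('I', 2, None),
    'R' -> ('R', None, None). Names such as 'I1' without a suffix are rejected.
    """
    if not _TECHNIQUE.match(name):
        raise ValueError(f"invalid technique: {name!r}")
    if name == "R":
        return ("R", None, None)
    entity = name[2] if len(name) == 3 else None
    return (name[0], int(name[1]), entity)
```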

-s / --splits

The sizes of the individual splits the program shall produce.

-n / --names

The names of the splits in order of the -s argument. If left empty, splits will be called Split1, Split2, …
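The pairing of -s sizes with -n names, including the default names Split1, Split2, and so on, might look like the sketch below. The helper label_splits is hypothetical, and normalizing the sizes to fractions is an assumption of this example rather than documented behavior.

```python
def label_splits(sizes, names=None):
    """Pair split sizes with names, in order (hypothetical helper).

    If no names are given, fall back to Split1, Split2, ... as described above.
    Sizes are normalized to fractions; this normalization is an assumption.
    """
    if not names:
        names = [f"Split{i}" for i in range(1, len(sizes) + 1)]
    if len(names) != len(sizes):
        raise ValueError("need exactly one name per split size")
    total = sum(sizes)
    return {name: size / total for name, size in zip(names, sizes)}
```

For example, label_splits([8, 2], ["train", "test"]) pairs "train" with the first size and "test" with the second.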

--overflow

How to handle overflow of the splits. If ‘assign’, a cluster that overflows a split size will be assigned to one split. The remaining data is split normally into n-1 splits. If ‘break’, the cluster will be broken into smaller parts to fit into a split.

-d / --delta

A multiplicative factor specifying by how much the limits of the stratification (as defined in the -s / --splits argument) can be exceeded.

-e / --epsilon

A multiplicative factor specifying by how much the limits of the splits (as defined in the -s / --splits argument) can be exceeded.
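One plausible reading of such a multiplicative factor is that a split with target fraction s of N samples may grow to at most s * N * (1 + epsilon). The sketch below uses that reading; both the formula and the function name split_upper_bound are assumptions of this example, not a statement of how DataSAIL implements the constraint.

```python
def split_upper_bound(fraction: float, total: int, epsilon: float) -> float:
    """Upper size limit of a split under the assumed (1 + epsilon) relaxation.

    Hypothetical helper: fraction is the target share of the split,
    total is the number of samples, epsilon is the allowed relative excess.
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    return fraction * total * (1 + epsilon)
```

Under this reading, a split with target fraction 0.8 of 1000 samples and epsilon 0.05 could hold up to roughly 840 samples.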

-r / --runs

The number of different runs to perform per technique. The idea is to compute several different splits of the dataset using the same technique in order to investigate the variance of the model across different data splits. The variance in splits is introduced by shuffling the dataset every time a new split is requested.
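The idea of introducing variance by reshuffling before each run can be sketched as follows. This is an illustration of the general principle only; the function shuffled_runs and its seed parameter are hypothetical and do not reflect DataSAIL's internal code.

```python
import random


def shuffled_runs(items, runs: int, seed: int = 0):
    """Return one freshly shuffled copy of the dataset order per run.

    Hypothetical helper: each run sees the same items in a different order,
    so an order-sensitive splitting procedure can yield different splits.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    orders = []
    for _ in range(runs):
        order = list(items)
        rng.shuffle(order)
        orders.append(order)
    return orders
```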

--solver

Which solver to use to solve the binary linear program. The choices are presented here.

--cache

Boolean flag indicating whether to store clustering matrices in a cache so that clusters are not recomputed multiple times.

--cache-dir

Destination of the cache folder. The default is the OS-default cache directory.

Entity Arguments

The following arguments are entity-specific and the same for e-entities and f-entities. We will describe the arguments for the e-entities. The arguments for the f-entities can be derived by replacing "e-" with "f-".
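The renaming rule can be written out as a small sketch. The list E_FLAGS contains the e-entity arguments described in this section; the helper to_f_flag is hypothetical and exists only to illustrate the rule.

```python
# The e-entity arguments documented in this section.
E_FLAGS = [
    "--e-type", "--e-data", "--e-weights",
    "--e-sim", "--e-dist", "--e-strat", "--e-args",
]


def to_f_flag(e_flag: str) -> str:
    """Derive the f-entity flag from an e-entity flag (hypothetical helper)."""
    prefix = "--e-"
    if not e_flag.startswith(prefix):
        raise ValueError(f"not an e-entity flag: {e_flag!r}")
    return "--f-" + e_flag[len(prefix):]


# One f-entity flag per e-entity flag, in the same order.
F_FLAGS = [to_f_flag(flag) for flag in E_FLAGS]
```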

--e-type

The type of the first data batch passed to the program. Choices are: [P]rotein, [M]olecule, [G]enome, [O]ther.

--e-data

The first input to the program. This can be either a filepath or a directory containing only data files.

--e-weights

The custom weights of the samples. The format can be a CSV/TSV file or equivalent, as described above.

--e-sim

Provide the name of a method to determine similarity between samples of the first input dataset. This can either be the name of a method based on the data type (see here for available methods) or a filepath to a file storing the pairwise similarities in TSV (see here for details).

--e-dist

Provide the name of a method to determine distance between samples of the first input dataset. This can either be the name of a method based on the data type (see here for available methods) or a filepath to a file storing the pairwise distances in TSV (see here for details).

--e-strat

A file containing the stratification of the first input dataset. The stratification is a TSV file as described here.

--e-args

Additional arguments for the clustering algorithm used in --e-dist or --e-sim.
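To show how the arguments above fit together, the following sketch assembles a command line as a list of strings without running it. Several things here are assumptions made for illustration: the executable name datasail, the convention that multiple split sizes and names are passed as separate space-separated values, and all file paths and values in the usage note. The helper build_command is hypothetical.

```python
import shlex


def build_command(output, technique, splits, names, e_type, e_data, inter=None):
    """Assemble (but do not run) a DataSAIL-style command line.

    Hypothetical helper; only flags documented in this section are used.
    The executable name and the value layout are assumptions.
    """
    cmd = ["datasail", "--output", output, "--techniques", technique]
    cmd += ["--splits", *[str(size) for size in splits]]
    cmd += ["--names", *names]
    cmd += ["--e-type", e_type, "--e-data", e_data]
    if inter is not None:
        # Optional TSV file of interactions between the two entities.
        cmd += ["--inter", inter]
    return cmd


def as_shell_string(cmd):
    """Render the argument list as a shell-quoted string for display."""
    return shlex.join(cmd)
```

For instance, build_command("out", "C1e", [8, 2], ["train", "test"], "P", "proteins.fasta") yields an argument list that could be passed to subprocess.run once the values have been replaced with real paths.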