Package
datasail
- datasail(techniques: Optional[Union[str, List[str], Callable[[...], List[str]], Generator[str, None, None]]] = None, inter: Optional[Union[str, Path, List[Tuple[str, str]], Callable[[...], List[str]], Generator[str, None, None]]] = None, output: Optional[Union[str, Path]] = None, max_sec: int = 100, verbose: str = 'W', splits: Optional[List[float]] = None, names: Optional[List[str]] = None, delta: float = 0.05, epsilon: float = 0.05, runs: int = 1, solver: str = 'SCIP', cache: bool = False, cache_dir: Optional[Union[str, Path]] = None, linkage: Literal['average', 'single', 'complete'] = 'average', overflow: Literal['assign', 'break'] = 'assign', e_type: Optional[str] = None, e_data: Optional[Union[str, Path, Dict[str, Union[str, ndarray]], Callable[[...], Dict[str, Union[str, ndarray]]], Generator[Tuple[str, Union[str, ndarray]], None, None]]] = None, e_weights: Optional[Union[str, Path, Dict[str, Union[str, ndarray]], Callable[[...], Dict[str, Union[str, ndarray]]], Generator[Tuple[str, Union[str, ndarray]], None, None]]] = None, e_strat: Optional[Union[str, Path, Dict[str, Union[str, ndarray]], Callable[[...], Dict[str, Union[str, ndarray]]], Generator[Tuple[str, Union[str, ndarray]], None, None]]] = None, e_sim: Optional[Union[str, Path, Tuple[List[str], ndarray], Callable[[...], Tuple[List[str], ndarray]]]] = None, e_dist: Optional[Union[str, Path, Tuple[List[str], ndarray], Callable[[...], Tuple[List[str], ndarray]]]] = None, e_args: str = '', e_clusters: int = 50, f_type: Optional[str] = None, f_data: Optional[Union[str, Path, Dict[str, Union[str, ndarray]], Callable[[...], Dict[str, Union[str, ndarray]]], Generator[Tuple[str, Union[str, ndarray]], None, None]]] = None, f_weights: Optional[Union[str, Path, Dict[str, Union[str, ndarray]], Callable[[...], Dict[str, Union[str, ndarray]]], Generator[Tuple[str, Union[str, ndarray]], None, None]]] = None, f_strat: Optional[Union[str, Path, Dict[str, Union[str, ndarray]], Callable[[...], Dict[str, Union[str, 
ndarray]]], Generator[Tuple[str, Union[str, ndarray]], None, None]]] = None, f_sim: Optional[Union[str, Path, Tuple[List[str], ndarray], Callable[[...], Tuple[List[str], ndarray]]]] = None, f_dist: Optional[Union[str, Path, Tuple[List[str], ndarray], Callable[[...], Tuple[List[str], ndarray]]]] = None, f_args: str = '', f_clusters: int = 50, threads: int = 1) -> Tuple[Dict, Dict, Dict]
Entry point for using DataSAIL as a Python package.
- Parameters
techniques – List of techniques to split by.
inter – Filepath to a TSV file storing interactions of the e-entities and f-entities.
output – Output directory to store the results in.
max_sec – Maximum number of seconds to spend optimizing a found solution.
verbose – Verbosity level for logging.
splits – List of relative split sizes. The values should add up to one; otherwise they are rescaled accordingly.
names – List of names of the splits.
epsilon – Fraction by which the provided split sizes may be undercut.
delta – Fraction by which the stratification may be undercut.
runs – Number of runs to perform per split. This may introduce some variance in the splits.
solver – Solving algorithm to use.
cache – Boolean flag indicating whether to store results in, and load them from, the cache.
cache_dir – Directory to store the cache in if not the default location.
linkage – Linkage method to use to compute metrics between merged clusters.
e_type – Data format of the first batch of data
e_data – Data file of the first batch of data
e_weights – Weighting of the datapoints from e_data
e_strat – Stratification of the datapoints from e_data
e_sim – Similarity measure to apply for the e-data
e_dist – Distance measure to apply for the e-data
e_args – Additional arguments for the tools in e_sim or e_dist
e_clusters – Number of clusters to find in the e-data
f_type – Data format of the second batch of data
f_data – Data file of the second batch of data
f_weights – Weighting of the datapoints from f_data
f_strat – Stratification of the datapoints from f_data
f_sim – Similarity measure to apply for the f-data
f_dist – Distance measure to apply for the f-data
f_args – Additional arguments for the tools in f_sim or f_dist
f_clusters – Number of clusters to find in the f-data
threads – Number of threads to use for one CD-HIT run
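The interplay of splits and epsilon described above can be illustrated with a small sketch. It is not taken from the DataSAIL source: the helper names normalize_splits and min_split_sizes are hypothetical, and the reading of epsilon as a multiplicative lower bound on each split size is an assumption based only on the parameter descriptions.

```python
from typing import List


def normalize_splits(splits: List[float]) -> List[float]:
    """Rescale relative split sizes so that they sum to one (hypothetical helper)."""
    if not splits or any(s <= 0 for s in splits):
        raise ValueError("splits must be a non-empty list of positive numbers")
    total = sum(splits)
    return [s / total for s in splits]


def min_split_sizes(splits: List[float], n_items: int, epsilon: float = 0.05) -> List[float]:
    """Smallest admissible size per split, ASSUMING epsilon is the fraction
    by which each requested split size may be undercut."""
    return [(1 - epsilon) * frac * n_items for frac in normalize_splits(splits)]


# Relative sizes that do not sum to one are rescaled to fractions first.
fractions = normalize_splits([7, 2, 1])

# With 1000 datapoints and epsilon=0.05, each split may fall short of its
# requested share by at most five percent under the assumption above.
lower_bounds = min_split_sizes([7, 2, 1], n_items=1000, epsilon=0.05)
```

Under this reading, passing splits=[7, 2, 1] would be equivalent to passing splits=[0.7, 0.2, 0.1].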
- Returns
Three dictionaries, each mapping techniques to another dictionary. Each inner dictionary maps input IDs to their assigned splits.
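A minimal sketch of how one of the returned dictionaries could be inspected, assuming exactly the shape described above (technique, then input ID, then split name). The technique key "technique_a", the IDs, and the split_sizes helper are made up for illustration; the nesting in an installed version may differ, for example when runs is greater than one, so it is worth checking the actual return value first.

```python
from collections import Counter
from typing import Dict

# Hypothetical result in the documented shape:
# technique -> {input id -> name of the split the input was assigned to}.
example_result: Dict[str, Dict[str, str]] = {
    "technique_a": {"id1": "train", "id2": "train", "id3": "test"},
}


def split_sizes(result: Dict[str, Dict[str, str]]) -> Dict[str, Counter]:
    """Count how many inputs each technique assigned to each split."""
    return {
        technique: Counter(assignment.values())
        for technique, assignment in result.items()
    }


sizes = split_sizes(example_result)
for technique, counts in sizes.items():
    for split_name, count in sorted(counts.items()):
        print(f"{technique}: {split_name} has {count} inputs")
```

Counting assignments this way gives a quick check that the realized split sizes are close to the fractions requested via splits.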