Package

datasail

datasail(techniques: Optional[Union[str, List[str], Callable[[...], List[str]], Generator[str, None, None]]] = None, inter: Optional[Union[str, Path, List[Tuple[str, str]], Callable[[...], List[str]], Generator[str, None, None]]] = None, output: Optional[Union[str, Path]] = None, max_sec: int = 100, verbose: str = 'W', splits: Optional[List[float]] = None, names: Optional[List[str]] = None, delta: float = 0.05, epsilon: float = 0.05, runs: int = 1, solver: str = 'SCIP', cache: bool = False, cache_dir: Optional[Union[str, Path]] = None, linkage: Literal['average', 'single', 'complete'] = 'average', overflow: Literal['assign', 'break'] = 'assign', e_type: Optional[str] = None, e_data: Optional[Union[str, Path, Dict[str, Union[str, ndarray]], Callable[[...], Dict[str, Union[str, ndarray]]], Generator[Tuple[str, Union[str, ndarray]], None, None]]] = None, e_weights: Optional[Union[str, Path, Dict[str, Union[str, ndarray]], Callable[[...], Dict[str, Union[str, ndarray]]], Generator[Tuple[str, Union[str, ndarray]], None, None]]] = None, e_strat: Optional[Union[str, Path, Dict[str, Union[str, ndarray]], Callable[[...], Dict[str, Union[str, ndarray]]], Generator[Tuple[str, Union[str, ndarray]], None, None]]] = None, e_sim: Optional[Union[str, Path, Tuple[List[str], ndarray], Callable[[...], Tuple[List[str], ndarray]]]] = None, e_dist: Optional[Union[str, Path, Tuple[List[str], ndarray], Callable[[...], Tuple[List[str], ndarray]]]] = None, e_args: str = '', e_clusters: int = 50, f_type: Optional[str] = None, f_data: Optional[Union[str, Path, Dict[str, Union[str, ndarray]], Callable[[...], Dict[str, Union[str, ndarray]]], Generator[Tuple[str, Union[str, ndarray]], None, None]]] = None, f_weights: Optional[Union[str, Path, Dict[str, Union[str, ndarray]], Callable[[...], Dict[str, Union[str, ndarray]]], Generator[Tuple[str, Union[str, ndarray]], None, None]]] = None, f_strat: Optional[Union[str, Path, Dict[str, Union[str, ndarray]], Callable[[...], Dict[str, Union[str, ndarray]]], Generator[Tuple[str, Union[str, ndarray]], None, None]]] = None, f_sim: Optional[Union[str, Path, Tuple[List[str], ndarray], Callable[[...], Tuple[List[str], ndarray]]]] = None, f_dist: Optional[Union[str, Path, Tuple[List[str], ndarray], Callable[[...], Tuple[List[str], ndarray]]]] = None, f_args: str = '', f_clusters: int = 50, threads: int = 1) → Tuple[Dict, Dict, Dict]

Entry point for the package usage of DataSAIL.
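A minimal sketch of how a call might be assembled from the signature above. The import path `datasail.sail`, the technique identifier `"I1e"`, the type code `"M"`, and the toy molecule names are assumptions for illustration and are not confirmed by this page. The import is deferred into `run_split` so the snippet loads even where DataSAIL is not installed.

```python
from typing import Any, Dict


def build_kwargs() -> Dict[str, Any]:
    """Collect keyword arguments for splitting a small e-dataset."""
    # Hypothetical toy input: entity names mapped to SMILES strings.
    e_data = {"mol_a": "CCO", "mol_b": "CCN", "mol_c": "c1ccccc1"}
    return {
        "techniques": ["I1e"],       # assumed technique identifier
        "splits": [0.8, 0.2],        # relative split sizes
        "names": ["train", "test"],  # one name per split
        "runs": 1,                   # default from the signature
        "solver": "SCIP",            # default from the signature
        "max_sec": 100,              # default from the signature
        "e_type": "M",               # assumed type code for the e-data
        "e_data": e_data,            # dict form allowed by the type hint
    }


def run_split(kwargs: Dict[str, Any]):
    """Call DataSAIL with the collected arguments."""
    # Assumed import path; deferred so that defining this function
    # does not require the package to be installed.
    from datasail.sail import datasail
    return datasail(**kwargs)


kwargs = build_kwargs()
```

Calling `run_split(kwargs)` would then return the tuple of three dictionaries described under Returns.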

Parameters
  • techniques – List of techniques by which to split the data.

  • inter – Filepath to a TSV file storing interactions of the e-entities and f-entities.

  • output – Output directory to store the results in.

  • max_sec – Maximum number of seconds to spend optimizing a found solution.

  • verbose – Verbosity level for logging.

  • splits – List of split sizes. The sizes should add up to one; if they do not, they are scaled accordingly.

  • names – List of names of the splits.

  • epsilon – Fraction by which the provided split sizes may be undercut.

  • delta – Fraction by which the stratification may be undercut.

  • runs – Number of runs to perform per split. This may introduce some variance in the splits.

  • solver – Solving algorithm to use.

  • cache – Boolean flag indicating to store or load results from cache.

  • cache_dir – Directory to store the cache in if not the default location.

  • linkage – Linkage method to use to compute metrics between merged clusters.

  • e_type – Data format of the first batch of data

  • e_data – Data file of the first batch of data

  • e_weights – Weighting of the datapoints from e_data

  • e_strat – Stratification of the datapoints from e_data

  • e_sim – Similarity measure to apply for the e-data

  • e_dist – Distance measure to apply for the e-data

  • e_args – Additional arguments for the tools in e_sim or e_dist

  • e_clusters – Number of clusters to find in the e-data

  • f_type – Data format of the second batch of data

  • f_data – Data file of the second batch of data

  • f_weights – Weighting of the datapoints from f_data

  • f_strat – Stratification of the datapoints from f_data

  • f_sim – Similarity measure to apply for the f-data

  • f_dist – Distance measure to apply for the f-data

  • f_args – Additional arguments for the tools in f_sim or f_dist

  • f_clusters – Number of clusters to find in the f-data

  • threads – Number of threads to use for one CD-HIT run
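The scaling of `splits` and the tolerance `epsilon` can be illustrated with a short sketch. This is one reading of the parameter descriptions above, not the library's internal code: `normalize_splits` and `min_split_sizes` are hypothetical helpers, and the lower bound assumes that "undercut" means a split may fall short of its target by at most the fraction `epsilon`.

```python
from typing import List


def normalize_splits(splits: List[float]) -> List[float]:
    """Scale split sizes so that they sum to one."""
    total = sum(splits)
    if total <= 0:
        raise ValueError("split sizes must sum to a positive value")
    return [s / total for s in splits]


def min_split_sizes(fractions: List[float], n_items: int,
                    epsilon: float = 0.05) -> List[float]:
    """Lower bound per split under the assumed reading of epsilon.

    A split with target fraction f over n_items datapoints may contain
    as few as (1 - epsilon) * f * n_items datapoints.
    """
    return [(1.0 - epsilon) * f * n_items for f in fractions]


# Sizes given as 7:2:1 are scaled to fractions of one.
fractions = normalize_splits([7, 2, 1])

# With 1000 datapoints and the default epsilon of 0.05, each split may
# fall at most 5 % short of its target size.
lower_bounds = min_split_sizes(fractions, n_items=1000)
```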

Returns

Three dictionaries, each mapping a technique to an inner dictionary. The inner dictionary maps each input ID to its assigned split.
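The nested structure described above can be regrouped by split for downstream use. The sketch below works on a hand-written dictionary of the described shape (technique, then input ID, then split name). The technique name `"I1e"`, the IDs, and the helper `group_by_split` are made up for illustration, and the actual return value may carry more structure than this page states.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_split(assignments: Dict[str, str]) -> Dict[str, List[str]]:
    """Invert an ID -> split mapping into split -> list of IDs."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for item_id, split_name in assignments.items():
        groups[split_name].append(item_id)
    return dict(groups)


# Hypothetical stand-in for one of the three returned dictionaries.
e_splits = {
    "I1e": {"mol_a": "train", "mol_b": "train", "mol_c": "test"},
}

# Regroup every technique's assignment by split name.
grouped = {tech: group_by_split(ids) for tech, ids in e_splits.items()}
```

Here `grouped["I1e"]["train"]` holds the IDs assigned to the training split under that technique.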