Evaluation of Data Leakage

From version 1.3.0 onwards (or with a dev build from GitHub), DataSAIL lets you quantify the similarity-induced data leakage of a given data split. This feature is currently only available through the Python interface and can be used as follows:

from datasail.eval import eval_split

scaled_dl, total_dl, max_dl, _ = eval_split("P", path_to_data, path_to_weights, similarity, distance, dist_conv, split_assignments)

The arguments largely mirror those of the Python interface to DataSAIL; the full documentation of this function is given below. The first three elements of the returned tuple are:

scaled_dl: The scaled data leakage, which is the total data leakage divided by the total pairwise similarity (or distance) in the dataset.
total_dl: The total, unscaled data leakage.
max_dl: The total pairwise similarity (or distance) in the dataset.
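
The relationship between these three values can be illustrated with plain NumPy. This is a minimal sketch, not DataSAIL's implementation: it assumes that leakage is the similarity summed over pairs of items assigned to different splits, and the helper name leakage_from_similarity as well as the toy data are hypothetical.

```python
import numpy as np


def leakage_from_similarity(names, sim, assignment):
    """Hypothetical helper returning (scaled_dl, total_dl, max_dl).

    Assumption: leakage is the similarity summed over unordered pairs of
    distinct items that ended up in different splits.
    """
    sim = np.asarray(sim, dtype=float)
    splits = np.array([assignment[n] for n in names])
    # Boolean mask of pairs (i, j) whose items sit in different splits
    cross = splits[:, None] != splits[None, :]
    # Upper triangle without the diagonal: count each unordered pair once
    upper = np.triu(np.ones_like(sim, dtype=bool), k=1)
    total_dl = float(sim[upper & cross].sum())
    max_dl = float(sim[upper].sum())
    scaled_dl = total_dl / max_dl if max_dl > 0 else 0.0
    return scaled_dl, total_dl, max_dl


# Toy data: "a"/"b" are similar, "c"/"d" are similar
names = ["a", "b", "c", "d"]
sim = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])
assignment = {"a": "train", "b": "train", "c": "test", "d": "test"}
scaled_dl, total_dl, max_dl = leakage_from_similarity(names, sim, assignment)
```

Under these assumptions, keeping the similar pairs together in the same split leaves only the weak cross-split similarities to count as leakage, so scaled_dl stays well below 1.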

eval_split(datatype, data: Optional[Union[dict[str, Any], str, Path]], weights: Optional[Union[dict[str, float], str, Path]], similarity: Optional[Union[str, Path, Tuple[List[str], ndarray], Callable[[...], Tuple[List[str], ndarray]]]], distance: Optional[Union[str, Path, Tuple[List[str], ndarray], Callable[[...], Tuple[List[str], ndarray]]]], dist_conv: Optional[Union[int, float, Callable]], split_assignment: Union[dict[str, Any], str, Path], return_matrix: bool = False) → tuple[float, float, float, Optional[numpy.ndarray]]

Evaluate the leakage of a single split assignment on a dataset. The inputs are mostly the same as for a normal DataSAIL run.

Either a similarity or a distance matrix must be provided. If a distance matrix is provided, dist_conv must also be given (a conversion function, a string, or a maximum distance value) so that distances can be converted to similarities. A conversion function has to match the signature func(distance_matrix: np.ndarray, len_fp: int = 1) -> np.ndarray, where len_fp is the length of the fingerprints (or 1 if not applicable). The function may ignore len_fp if it is not needed.
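
A conversion function with the required signature could look like the following. This is a sketch under stated assumptions: the name linear_dist_to_sim and the linear rescaling are illustrative choices, not a conversion that DataSAIL prescribes.

```python
import numpy as np


def linear_dist_to_sim(distance_matrix: np.ndarray, len_fp: int = 1) -> np.ndarray:
    """Hypothetical converter: map distances linearly to similarities in [0, 1]."""
    # len_fp is accepted to match the expected signature but is not used here
    max_dist = distance_matrix.max()
    if max_dist == 0:
        # All items are identical, so every pair is maximally similar
        return np.ones_like(distance_matrix, dtype=float)
    return 1.0 - distance_matrix / max_dist


# Two items at distance 2 from each other
dist = np.array([[0.0, 2.0], [2.0, 0.0]])
sim = linear_dist_to_sim(dist)
```

Such a function would then be passed as the dist_conv argument alongside the distance matrix.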

Parameters

datatype – The type of data, options are “M”, “P”, “G”, “O”
data – The dataset to evaluate, can be a dictionary, string (path), or Path object.
weights – Optional weights for the dataset, can be a dictionary, string (path), or Path object.
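similarity – Optional similarity matrix, can be a string (path), a Path object, a tuple of (list of names, ndarray), or a callable returning such a tuple.
distance – Optional distance matrix, can be a string (path), a Path object, a tuple of (list of names, ndarray), or a callable returning such a tuple.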
dist_conv – Optional distance conversion function or maximum distance value.
split_assignment – A single split assignment, can be a dictionary, string (path), or Path object.
return_matrix – Whether to return the similarity/distance matrix used for evaluation. If True, the function will return a tuple of (leakage_ratio, leakage_value, total_value, matrix).

Returns

A tuple containing

the leakage ratio (lower is better),
the absolute leakage value,
the total metric value for the split assignment (maximal leakage possible), and
the similarity/distance matrix used for evaluation if return_matrix is True, otherwise None.