Evaluation of Data Leakage

From version 1.3.0 onwards (or with a development build from GitHub), DataSAIL lets you quantify the similarity-induced data leakage of a given data split. This feature is currently available only through the Python interface and can be used as follows:

from datasail.eval import eval_splits

scaled_dl, total_dl, max_dl = eval_splits("P", path_to_data, path_to_weights, similarity, distance, split_assignments)

The arguments are the same as for the Python interface to DataSAIL. The full documentation of this function is given below. The output is a tuple containing the following elements:

scaled_dl: The scaled data leakage, which is the total data leakage divided by the total pairwise similarity (or distance) in the dataset.
total_dl: The total, unscaled data leakage.
max_dl: The total pairwise similarity (or distance) in the dataset.
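
To make the relationship between the three values concrete, here is a minimal toy sketch. It does not call DataSAIL and is not DataSAIL's implementation. It assumes that the leakage of a split is the summed similarity of all sample pairs that end up in different splits; the sample names, the similarity values, and the helper toy_leakage are hypothetical and only serve as illustration.

```python
from itertools import combinations

# Hypothetical toy data: pairwise similarities between four samples.
similarity = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.2,
    ("a", "d"): 0.1,
    ("b", "c"): 0.3,
    ("b", "d"): 0.1,
    ("c", "d"): 0.8,
}

# Hypothetical split assignment: sample name -> split name.
assignment = {"a": "train", "b": "train", "c": "test", "d": "test"}


def toy_leakage(similarity, assignment):
    """Return (scaled_dl, total_dl, max_dl) for a toy similarity table.

    Assumption: leakage is the similarity shared by pairs of samples
    that were assigned to different splits.
    """
    total_dl = 0.0  # similarity across split boundaries
    max_dl = 0.0    # total pairwise similarity in the dataset
    for x, y in combinations(sorted(assignment), 2):
        s = similarity[(x, y)]
        max_dl += s
        if assignment[x] != assignment[y]:
            total_dl += s
    # Guard against an all-zero similarity table.
    scaled_dl = total_dl / max_dl if max_dl > 0 else 0.0
    return scaled_dl, total_dl, max_dl


scaled_dl, total_dl, max_dl = toy_leakage(similarity, assignment)
print(f"scaled={scaled_dl:.3f} total={total_dl:.3f} max={max_dl:.3f}")
```

Under this assumption, a split that separates similar samples well keeps total_dl small relative to max_dl, so scaled_dl moves towards 0, while a split that scatters similar samples across splits pushes it towards 1.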

eval_splits(datatype, data: Optional[Union[dict[str, Any], str, Path]], weights: Optional[Union[dict[str, float], str, Path]], similarity, distance, split_assignments: Union[dict[str, Any], str, Path, list[Union[dict[str, Any], str, pathlib.Path]], tuple[Union[dict[str, Any], str, pathlib.Path]], dict[Any, Union[dict[str, Any], str, pathlib.Path]]])

Evaluate the leakage of a split assignment on a dataset.

Parameters

datatype – The type of the data, passed as in the Python interface to DataSAIL (e.g., “P” in the example above).
data – The dataset to evaluate; can be a dictionary, a string (path), or a Path object.
weights – Optional weights for the dataset; can be a dictionary, a string (path), or a Path object.
similarity – Optional similarity matrix; can be a string (path) or a Path object.
distance – Optional distance matrix; can be a string (path) or a Path object.
split_assignments – A single split assignment, a list or tuple of split assignments, or a dictionary mapping split names to split assignments.

Returns

A tuple containing the leakage ratio (scaled_dl), the leakage value (total_dl), and the total metric value (max_dl) for each split assignment.