How to Add a new Similarity or Distance Metric to DataSAIL
On the example of DIAMOND {links}, we demonstrate how to add a new similarity or distance metric to DataSAIL.
The DIAMOND similarity metric is a similarity metric for comparing two sets of items. It is based on the Jaccard similarity coefficient, which is a measure of similarity between two sets. The Jaccard similarity coefficient is defined as the size of the intersection divided by the size of the union of the two sets. The DIAMOND similarity metric is defined as 1 minus the Jaccard similarity coefficient. The DIAMOND similarity metric is used to compare two sets of items, such as two sets of genes, two sets of proteins, or two sets of metabolites.
0. Create a Fork of the DataSAIL Repository
When you want to add a new feature to the DataSAIL repository, you first have to create a fork of the DataSAIL repository. An explanation of how to do this can be found in the official GitHub documentation.
1. Installability
When adding a new similarity or distance metric to DataSAIL, make sure, it is installable from Conda
2. Registration
Register the new similarity or distance metric at various locations in the settings.py file.
69DIAMOND = "diamond"
and
80SIM_ALGOS = [WLK, MMSEQS, MMSEQS2, MMSEQSPP, FOLDSEEK, CDHIT, CDHIT_EST, ECFP, DIAMOND, ]
and
84# Check if the tools are installed
85INSTALLED = {
86 # Define the check per tool
87 ...
88 DIAMOND: shutil.which("diamond") is not None,
89 ...
90}
and
136# Define the names of the YAML files
137YAML_FILE_NAMES = {
138 # define the yaml file per tool
139 ...
140 DIAMOND: "args/diamond.yaml",
141 ...
142}
3. Python Implementation
Next, we create the file diamond.py in the datasail/cluster directory. This new file must contain a
run_<tool_name> function, which is used to compute the similarity or distance metric. The
run_<tool_name> function must take the following arguments:
dataset: A DataSet-object storing the data to be clustered.
threads: The number of threads to use for the computation.
log_dir: An optional path to a directory where log files can be written.
The documentation of the run_<tool_name> function must be written in the docstring of the function but can be
copied from other metric-implementations as they are very similar.
13def run_diamond(dataset: DataSet, threads: int, log_dir: Optional[Path] = None) -> None:
14 """
15 Run Diamond on a dataset in clustering mode.
16
17 Args:
18 dataset: Dataset to be clustered.
19 threads: Number of threads to be used by the clustering algorithm.
20 log_dir: Directory to store the logs.
21 """
Then, we have to check if the tool, we want to use is installed on the system. This can be done by using the
shutil.which function.
22 if not INSTALLED[DIAMOND]:
23 raise ValueError("DIAMOND is not installed.")
Next, we have to extract potential user arguments from the dataset object. This is done using the
MultiYAMLParser class. Here, the number of parameters to be extracted is 2, as the DIAMOND tool has two stages
from a FASTA file to the similarity matrix. If the tool has only one stage, only one set of parameters has to be
extracted (see foldseek.py).
25 parser = MultiYAMLParser(DIAMOND)
26 makedb_args = parser.get_user_arguments(dataset.args, [], 0)
27 blastp_args = parser.get_user_arguments(dataset.args, [], 1)
Then, we need to extract the sequences from the database into a FASTA file as only a FASTA file can be passed on to DIAMOND.
31 with open("diamond.fasta", "w") as out:
32 for name, seq in dataset.data.items():
33 out.write(f">{name}\n{seq}\n")
Now, we construct the command line call to DIAMOND. This includes creating a folder for the tmp-files created by
DIAMOND, the actual call to DIAMOND and piping of the output to a log file or /dev/null.
35 result_folder = Path("diamond_results")
36
37 cmd = lambda x: f"mkdir {result_folder} && " \
38 f"cd {result_folder} && " \
39 f"diamond makedb --in ../diamond.fasta --db seqs.dmnd {makedb_args} {x} --threads {threads} && " \
40 f"diamond blastp --db seqs.dmnd --query ../diamond.fasta --out alis.tsv --outfmt 6 qseqid sseqid pident " \
41 f"--threads {threads} {blastp_args} {x} && " \
42 f"rm ../diamond.fasta"
43
44 if log_dir is None:
45 cmd = cmd("> /dev/null 2>&1")
46 else:
47 cmd = cmd(f">> {(Path(log_dir) / f'{dataset.get_name()}_mmseqspp.log').resolve()}")
48
49 if result_folder.exists():
50 cmd = f"rm -rf {result_folder} && " + cmd
Now, we’re ready to run DIAMOND.
52 LOGGER.info("Start DIAMOND")
53 LOGGER.info(cmd)
54 os.system(cmd)
As something may break during the execution of the DIAMOND tool, we have to check if the output file exists.
56 if not (result_folder / "alis.tsv").is_file():
57 raise ValueError("Something went wront with DIAMOND alignment. The output file does not exist.")
Now, it’s time to harvest the results of the DIAMOND tool. This is done by reading the output file into a dataframe and
converting it into a table using df.pivot. Then, we need to fix two details about the DIAMOND output. First,
the TSV-file has no column names, so we have to add them. Second, the similarity score is computed as pident, which is
f_ident * 100. To correct this and scale the similarities back to [0,1], we have to divide p_ident by 100.
59 df = pd.read_csv(result_folder / "alis.tsv", sep="\t")
60 df.columns = ["query", "target", "pident"]
61 df["fident"] = df["pident"] / 100
62 rev = df.copy(deep=True)
63 rev.columns = ["target", "query", "pident", "fident"]
64 df = pd.concat([df, rev])
65 df = df.groupby(["query", "target"]).agg({"fident": "mean"}).reset_index()
66 table = df.pivot(index="query", columns="target", values="fident").fillna(0).to_numpy()
After deleting the temporary files, we can store the results in the dataset object.
68 shutil.rmtree(result_folder)
69
70 dataset.cluster_names = dataset.names
71 dataset.cluster_map = {n: n for n in dataset.names}
72 dataset.cluster_similarity = table
4. Registration – cont’d
Now, we have to add the new similarity or distance metric to the run function in the clustering.py file.
97 def similarity_clustering(dataset: DataSet, threads: int = 1, log_dir: Optional[str] = None) -> None:
98 """
99 ...
100 """
101 if dataset.similarity_algorithm == WLK:
102 ...
103 elif dataset.similarity.lower() == DIAMOND:
104 run_diamond(dataset, threads, log_dir)
105 ...
106 else:
107 raise ValueError(f"Unknown similarity algorithm: {dataset.similarity_algorithm}")
If you provide evidence in your upcoming pull request that the new metric is better than all other methods, you can add
your metric at the first place in the list in the get_default method in the settings.py file.
22 if data_type == P_TYPE:
23 if data_format == FORM_PDB:
24 return FOLDSEEK, None
25 elif data_format == FORM_FASTA:
26 order = [DIAMOND, MMSEQS2, CDHIT, MMSEQSPP]
27 for method in order:
28 if INSTALLED[method]:
29 return method, None
30 else:
31 ...
5. Tool Arguments
In step 3, we had to extract the user arguments from the dataset object. This is done using the
MultiYAMLParser and a YAML file. This YAML file must be created in the args directory. The YAML file
must contain the following structure:
1<argument name>:
2 description: <description of the argument>
3 type: <type of the argument, e.g., bool, int, or float>
4 cardinality: <how many arguments to provide, e.g., 0 (for booleans), "?" (for one-value arguments), and "+" (for multi-value arguments)>
5 default: ,<the default value of the argument>
6 calls: <a list of calls, e.g., ["-a"] or ["-a", "--all"]>
7 fos: <0 if this exclusively belongs to the first stage, 1 if this exclusively belongs to the second stage, and 2 if this belongs to both stages>
8...
The fos-argument can be discarded in case there’s only one stage. The YAML file must be named diamond.yaml (as
registered in step 2a). For tools with three or more stages, DataSAIL does no yet have a solution. Usually, not every
stage requires custom arguments (e.g., MMSEQSPP). In the example of DIAMOND, the database creation requires no user
arguments, so the YAML file only contains arguments for the blastp step. This step can significantly be
simplified using ChatGPT or similar tools.
Next, we need to write a checker-function for these arguments.
213 def check_diamond_arguments(args: str = "") -> Optional[Namespace]:
214 """ ... """
215 args = MultiYAMLParser(CDHIT).parse_args(args)
216 ...
217 return args
Next, we have to this checker to the validate_user_args function in the cluster.py file.
20 def validate_user_args(
21 dtype: str,
22 dformat: str,
23 similarity: str,
24 distance: str,
25 tool_args: str,
26 ) -> Optional[Union[Namespace, Tuple[Optional[Namespace], Optional[Namespace]]]]:
27 """
28 ...
29 """
30 sim_on, dist_on = isinstance(similarity, str), isinstance(distance, str)
31 both_none = not sim_on and not dist_on
32 if (sim_on and similarity.lower().startswith(CDHIT_EST)) or \
33 (both_none and get_default(dtype, dformat)[0] == CDHIT_EST):
34 ...
35 elif (sim_on and similarity.lower().startswith(DIAMOND)) or \
36 (both_none and get_default(dtype, dformat)[0] == DIAMOND):
37 return check_diamond_arguments(tool_args)
38 ...
39 else:
40 return None
6. Testing
First, we add a test to test_clustering.py. This test_<tool_name>_<data_type> function in
tests/test_clustering.py checks the specific (isolated) functionality of the tool
230 @pytest.mark.full
231 def test_diamond_protein():
232 data = protein_fasta_data(DIAMOND)
233 if platform.system() == "Windows":
234 pytest.skip("DIAMOND is not supported on Windows")
235 run_diamond(data, 1, Path())
236 check_clustering(data)
Second, we add two tests to test_arg_validation.py to check if invalid arguments are detected and valid arguments are
accepted by our parser above. The tests in tests/test_arg_validation.py look like this:
77 @pytest.mark.parametrize("args", [
78 "--comp-based-stats 5", "--masking unknown", "--soft-masking invalid", "--evalue -0.001",
79 ...
80 "--stop-match-score -0.5", "--window 0", "--ungapped-score -5", "--rank-ratio2 -0.8", "--rank-ratio -0.9",
81 ])
82 def test_diamond_args_checker_invalid(args):
83 with pytest.raises(ValueError):
84 check_diamond_arguments(args)
85
86
87 @pytest.mark.parametrize("args", [
88 "--comp-based-stats 2", "--masking seg", "--soft-masking tantan", "--evalue 0.001", "--motif-masking 1",
89 ...
90 "--rank-ratio2 0.8", "--rank-ratio 0.9", "--lambda 0.5", "--K 10"
91 ])
92 def test_diamond_args_checker_valid(args):
93 assert check_diamond_arguments(args) is not None
Lastly, we want to test the full procedure with the new clustering tool. Therefore, we have tests in
test/test_custom_args.py. For a new tool, we have to add one (or more if the tool has multiple stages) test
functions.
127 def test_diamond_cargs():
128 out = Path("data/pipeline/output")
129 sail([
130 "-o", str(out),
131 "-t", "C1e",
132 "-s", "0.7", "0.3",
133 "--e-type", "P",
134 "--e-data", str(Path('data') / 'pipeline' / 'seqs.fasta'),
135 "--e-sim", "diamond",
136 "--e-args", "--gapopen 10"
137 ])
138
139 assert out.is_dir()
140 assert (out / "C1e").is_dir()
141 assert (out / "logs").is_dir()
142 assert (out / "logs" / "seqs_diamond_gapopen_10.log")
143
144 shutil.rmtree(out, ignore_errors=True)
7. Pull Request
Now, we’re done have have to open a pull request on the dev branch of the DataSAIL repository. In the pull request, we
need to justify why the metric is worth including in the DataSAIL repository. This can be done by comparing results to
already existing metrics. If the new metric is better than all other metrics, we can add it to the get_default
method and justify that in the PR as well. Lastly, please provide links to the paper, the repo, and the documentation
of the new metric in the PR.