########################################################## How to Add a new Similarity or Distance Metric to DataSAIL ########################################################## On the example of DIAMOND {links}, we demonstrate how to add a new similarity or distance metric to DataSAIL. The DIAMOND similarity metric is a similarity metric for comparing two sets of items. It is based on the Jaccard similarity coefficient, which is a measure of similarity between two sets. The Jaccard similarity coefficient is defined as the size of the intersection divided by the size of the union of the two sets. The DIAMOND similarity metric is defined as 1 minus the Jaccard similarity coefficient. The DIAMOND similarity metric is used to compare two sets of items, such as two sets of genes, two sets of proteins, or two sets of metabolites. 0. Create a Fork of the DataSAIL Repository ########################################### When you want to add a new feature to the DataSAIL repository, you first have to create a fork of the DataSAIL repository. An explanation of how to do this can be found in the official GitHub documentation. 1. Installability ################# When adding a new similarity or distance metric to DataSAIL, make sure, it is installable from Conda 2. Registration ############### Register the new similarity or distance metric at various locations in the :code:`settings.py` file. .. code-block:: python :linenos: :lineno-start: 69 DIAMOND = "diamond" and .. code-block:: python :linenos: :lineno-start: 80 SIM_ALGOS = [WLK, MMSEQS, MMSEQS2, MMSEQSPP, FOLDSEEK, CDHIT, CDHIT_EST, ECFP, DIAMOND, ] and .. code-block:: python :linenos: :lineno-start: 84 # Check if the tools are installed INSTALLED = { # Define the check per tool ... DIAMOND: shutil.which("diamond") is not None, ... } and .. code-block:: python :linenos: :lineno-start: 136 # Define the names of the YAML files YAML_FILE_NAMES = { # define the yaml file per tool ... DIAMOND: "args/diamond.yaml", ... } 3. Python Implementation ======================== Next, we create the file :code:`diamond.py` in the :code:`datasail/cluster` directory. This new file must contain a :code:`run_` function, which is used to compute the similarity or distance metric. The :code:`run_` function must take the following arguments: * dataset: A DataSet-object storing the data to be clustered. * threads: The number of threads to use for the computation. * log_dir: An optional path to a directory where log files can be written. The documentation of the :code:`run_` function must be written in the docstring of the function but can be copied from other metric-implementations as they are very similar. .. code-block:: python :linenos: :lineno-start: 13 def run_diamond(dataset: DataSet, threads: int, log_dir: Optional[Path] = None) -> None: """ Run Diamond on a dataset in clustering mode. Args: dataset: Dataset to be clustered. threads: Number of threads to be used by the clustering algorithm. log_dir: Directory to store the logs. """ Then, we have to check if the tool, we want to use is installed on the system. This can be done by using the :code:`shutil.which` function. .. code-block:: python :linenos: :lineno-start: 22 if not INSTALLED[DIAMOND]: raise ValueError("DIAMOND is not installed.") Next, we have to extract potential user arguments from the :code:`dataset` object. This is done using the :code:`MultiYAMLParser` class. Here, the number of parameters to be extracted is 2, as the DIAMOND tool has two stages from a FASTA file to the similarity matrix. If the tool has only one stage, only one set of parameters has to be extracted (see :code:`foldseek.py`). .. code-block:: python :lineno-start: 25 parser = MultiYAMLParser(DIAMOND) makedb_args = parser.get_user_arguments(dataset.args, [], 0) blastp_args = parser.get_user_arguments(dataset.args, [], 1) Then, we need to extract the sequences from the database into a FASTA file as only a FASTA file can be passed on to DIAMOND. .. code-block:: python :linenos: :lineno-start: 31 with open("diamond.fasta", "w") as out: for name, seq in dataset.data.items(): out.write(f">{name}\n{seq}\n") Now, we construct the command line call to DIAMOND. This includes creating a folder for the tmp-files created by DIAMOND, the actual call to DIAMOND and piping of the output to a log file or :code:`/dev/null`. .. code-block:: python :linenos: :lineno-start: 35 result_folder = Path("diamond_results") cmd = lambda x: f"mkdir {result_folder} && " \ f"cd {result_folder} && " \ f"diamond makedb --in ../diamond.fasta --db seqs.dmnd {makedb_args} {x} --threads {threads} && " \ f"diamond blastp --db seqs.dmnd --query ../diamond.fasta --out alis.tsv --outfmt 6 qseqid sseqid pident " \ f"--threads {threads} {blastp_args} {x} && " \ f"rm ../diamond.fasta" if log_dir is None: cmd = cmd("> /dev/null 2>&1") else: cmd = cmd(f">> {(Path(log_dir) / f'{dataset.get_name()}_mmseqspp.log').resolve()}") if result_folder.exists(): cmd = f"rm -rf {result_folder} && " + cmd Now, we're ready to run DIAMOND. .. code-block:: python :linenos: :lineno-start: 52 LOGGER.info("Start DIAMOND") LOGGER.info(cmd) os.system(cmd) As something may break during the execution of the DIAMOND tool, we have to check if the output file exists. .. code-block:: python :linenos: :lineno-start: 56 if not (result_folder / "alis.tsv").is_file(): raise ValueError("Something went wront with DIAMOND alignment. The output file does not exist.") Now, it's time to harvest the results of the DIAMOND tool. This is done by reading the output file into a dataframe and converting it into a table using :code:`df.pivot`. Then, we need to fix two details about the DIAMOND output. First, the TSV-file has no column names, so we have to add them. Second, the similarity score is computed as pident, which is f_ident * 100. To correct this and scale the similarities back to [0,1], we have to divide p_ident by 100. .. code-block:: python :linenos: :lineno-start: 59 df = pd.read_csv(result_folder / "alis.tsv", sep="\t") df.columns = ["query", "target", "pident"] df["fident"] = df["pident"] / 100 rev = df.copy(deep=True) rev.columns = ["target", "query", "pident", "fident"] df = pd.concat([df, rev]) df = df.groupby(["query", "target"]).agg({"fident": "mean"}).reset_index() table = df.pivot(index="query", columns="target", values="fident").fillna(0).to_numpy() After deleting the temporary files, we can store the results in the :code:`dataset` object. .. code-block:: python :linenos: :lineno-start: 68 shutil.rmtree(result_folder) dataset.cluster_names = dataset.names dataset.cluster_map = {n: n for n in dataset.names} dataset.cluster_similarity = table 4. Registration -- cont'd ######################### Now, we have to add the new similarity or distance metric to the :code:`run` function in the :code:`clustering.py` file. .. code-block:: python :linenos: :lineno-start: 97 def similarity_clustering(dataset: DataSet, threads: int = 1, log_dir: Optional[str] = None) -> None: """ ... """ if dataset.similarity_algorithm == WLK: ... elif dataset.similarity.lower() == DIAMOND: run_diamond(dataset, threads, log_dir) ... else: raise ValueError(f"Unknown similarity algorithm: {dataset.similarity_algorithm}") If you provide evidence in your upcoming pull request that the new metric is better than all other methods, you can add your metric at the first place in the list in the :code:`get_default` method in the :code:`settings.py` file. .. code-block:: python :linenos: :lineno-start: 22 if data_type == P_TYPE: if data_format == FORM_PDB: return FOLDSEEK, None elif data_format == FORM_FASTA: order = [DIAMOND, MMSEQS2, CDHIT, MMSEQSPP] for method in order: if INSTALLED[method]: return method, None else: ... 5. Tool Arguments ################# In step 3, we had to extract the user arguments from the :code:`dataset` object. This is done using the :code:`MultiYAMLParser` and a YAML file. This YAML file must be created in the :code:`args` directory. The YAML file must contain the following structure: .. code-block:: text :linenos: : description: type: cardinality: default: , calls: fos: <0 if this exclusively belongs to the first stage, 1 if this exclusively belongs to the second stage, and 2 if this belongs to both stages> ... The fos-argument can be discarded in case there's only one stage. The YAML file must be named :code:`diamond.yaml` (as registered in step 2a). For tools with three or more stages, DataSAIL does no yet have a solution. Usually, not every stage requires custom arguments (e.g., MMSEQSPP). In the example of DIAMOND, the database creation requires no user arguments, so the YAML file only contains arguments for the :code:`blastp` step. This step can significantly be simplified using ChatGPT or similar tools. Next, we need to write a checker-function for these arguments. .. code-block:: python :linenos: :lineno-start: 213 def check_diamond_arguments(args: str = "") -> Optional[Namespace]: """ ... """ args = MultiYAMLParser(CDHIT).parse_args(args) ... return args Next, we have to this checker to the :code:`validate_user_args` function in the :code:`cluster.py` file. .. code-block:: python :linenos: :lineno-start: 20 def validate_user_args( dtype: str, dformat: str, similarity: str, distance: str, tool_args: str, ) -> Optional[Union[Namespace, Tuple[Optional[Namespace], Optional[Namespace]]]]: """ ... """ sim_on, dist_on = isinstance(similarity, str), isinstance(distance, str) both_none = not sim_on and not dist_on if (sim_on and similarity.lower().startswith(CDHIT_EST)) or \ (both_none and get_default(dtype, dformat)[0] == CDHIT_EST): ... elif (sim_on and similarity.lower().startswith(DIAMOND)) or \ (both_none and get_default(dtype, dformat)[0] == DIAMOND): return check_diamond_arguments(tool_args) ... else: return None 6. Testing ========== First, we add a test to test_clustering.py. This :code:`test__` function in :code:`tests/test_clustering.py` checks the specific (isolated) functionality of the tool .. code-block:: python :linenos: :lineno-start: 230 @pytest.mark.full def test_diamond_protein(): data = protein_fasta_data(DIAMOND) if platform.system() == "Windows": pytest.skip("DIAMOND is not supported on Windows") run_diamond(data, 1, Path()) check_clustering(data) Second, we add two tests to test_arg_validation.py to check if invalid arguments are detected and valid arguments are accepted by our parser above. The tests in :code:`tests/test_arg_validation.py` look like this: .. code-block:: python :linenos: :lineno-start: 77 @pytest.mark.parametrize("args", [ "--comp-based-stats 5", "--masking unknown", "--soft-masking invalid", "--evalue -0.001", ... "--stop-match-score -0.5", "--window 0", "--ungapped-score -5", "--rank-ratio2 -0.8", "--rank-ratio -0.9", ]) def test_diamond_args_checker_invalid(args): with pytest.raises(ValueError): check_diamond_arguments(args) @pytest.mark.parametrize("args", [ "--comp-based-stats 2", "--masking seg", "--soft-masking tantan", "--evalue 0.001", "--motif-masking 1", ... "--rank-ratio2 0.8", "--rank-ratio 0.9", "--lambda 0.5", "--K 10" ]) def test_diamond_args_checker_valid(args): assert check_diamond_arguments(args) is not None Lastly, we want to test the full procedure with the new clustering tool. Therefore, we have tests in :code:`test/test_custom_args.py`. For a new tool, we have to add one (or more if the tool has multiple stages) test functions. .. code-block:: python :linenos: :lineno-start: 127 def test_diamond_cargs(): out = Path("data/pipeline/output") sail([ "-o", str(out), "-t", "C1e", "-s", "0.7", "0.3", "--e-type", "P", "--e-data", str(Path('data') / 'pipeline' / 'seqs.fasta'), "--e-sim", "diamond", "--e-args", "--gapopen 10" ]) assert out.is_dir() assert (out / "C1e").is_dir() assert (out / "logs").is_dir() assert (out / "logs" / "seqs_diamond_gapopen_10.log") shutil.rmtree(out, ignore_errors=True) 7. Pull Request ############### Now, we're done have have to open a pull request on the dev branch of the DataSAIL repository. In the pull request, we need to justify why the metric is worth including in the DataSAIL repository. This can be done by comparing results to already existing metrics. If the new metric is better than all other metrics, we can add it to the :code:`get_default` method and justify that in the PR as well. Lastly, please provide links to the paper, the repo, and the documentation of the new metric in the PR.