.. _embeddings-label: ######################## Clustering of Embeddings ######################## DataSAIL offers different clustering algorithms implemented in SciPy and RDKit to cluster the embeddings. The clustering algorithms are: .. list-table:: Title :widths: 30 15 15 15 15 15 :header-rows: 1 * - Algorithm - Sim or Dist - Boolean - Integer - Float - RDKit or SciPy * - AllBit - Sim - X - \- - \- - RDKit * - Asymmetric - Sim - X - \- - \- - RDKit * - Braun-Blanquet - Sim - X - \- - \- - RDKit * - Canberra - Dist - X - X - X - SciPy * - Dice - Sim - X - X - \- - RDKit * - Hamming - Dist - X - X - X - SciPy * - Kulczynski - Sim - X - \- - \- - RDKit * - Jaccard - Dist - X - \- - \- - SciPy * - Matching - Dist - X - X - X - SciPy * - OnBit - Sim - X - \- - \- - RDKit * - Rogers-Tanimoto - Dist - X - \- - \- - SciPy * - Rogot-Goldberg - Sim - X - \- - \- - RDKit * - Russel - Sim - X - \- - \- - RDKit * - Sokal - Sim - X - \- - \- - RDKit * - Sokal-Michener - Dist - X - \- - \- - SciPy * - Tanimoto - Sim - X - X - \- - RDKit * - Yule - Dist - X - \- - \- - SciPy Individual Algorithms ##################### In the following, we will describe the individual algorithms in more detail and with the mathematical formula that computes the respective metric between two vectors :math:`u` and :math:`v` of length :math:`n`. Depending on the method used, :math:`u` and :math:`v` can be float-vectors but may also be restricted to be int-vectors or bit-vectors. .. note:: We will use the `Iverson bracket `__ notation :math:`[P]` to denote the indicator function that is 1 if the predicate :math:`P` is true and 0 otherwise. AllBit ====== This is the ratio of equal bits in the two bit vectors :math:`u` and :math:`v`. .. math:: \text{AllBit}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i]]}{n} Asymmetric ========== The Asymmetric similarity is the ratio of equal bits in the two bit vectors :math:`u` and :math:`v` divided by the minimum number of bits set in either of the two vectors. The implementation is given in `RDKit `__. .. math:: & u_1 = \sum_{i=1}^{n} [u[i]]\\ & v_1 = \sum_{i=1}^{n} [v[i]]\\ & \text{Asymmetric}(u, v) = \begin{cases} 0, &\text{if} \min(u_1,v_1) = 0,\\ \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\min(u_1, v_1)} &\text{otherwise} \end{cases} Braun-Blanquet ============== The Braun-Blanquet similarity is the ratio of equal bits in the two bit vectors :math:`u` and :math:`v` divided by the maximum number of bits set in either of the two vectors. The implementation is given in `RDKit `__. .. math:: & u_1 = \sum_{i=1}^{n} [u[i]]\\ & v_1 = \sum_{i=1}^{n} [v[i]]\\ & \text{Braun-Blanquet}(u, v) = \begin{cases} 0, &\text{if} \max(u_1,v_1) = 0,\\ \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\max(u_1, v_1)} &\text{otherwise} \end{cases} Canberra ======== The Canberra distance is the sum of the absolute differences of the two vectors :math:`u` and :math:`v` divided by the sum of the absolute values of the two vectors. The implementation is given in `SciPy `__. .. math:: \text{Canberra}(u, v) = \sum_{i=1}^{n} \frac{|u[i] - v[i]|}{|u[i]| + |v[i]|} Dice ==== The Dice similarity is the ratio of equal bits in the two bit vectors :math:`u` and :math:`v` divided by the sum of the number of bits set in either of the two vectors. The implementation is given in `RDKit `__. .. math:: \text{Dice}(u, v) = \frac{2 \sum_{i=1}^{n} [u[i] = v[i] = 1]}{\sum_{i=1}^{n} [u[i]] + \sum_{i=1}^{n} [v[i]]} Hamming or Matching =================== The Hamming distance (a.k.a. Matching distance) is the number of bits that are different in the two bit vectors :math:`u` and :math:`v`. The implementation is given in `SciPy `__. .. math:: \text{Hamming}(u, v) = \sum_{i=1}^{n} [u[i] \neq v[i]] Jaccard ======= The Jaccard distance is the number of bits that are different in the two bit vectors :math:`u` and :math:`v` divided by the number of equal one-bits in the two bit vectors :math:`u` and :math:`v` plus the number of bits that are different in the two bit vectors :math:`u` and :math:`v`. The implementation is given in `SciPy `__. .. math:: \text{Jaccard}(u, v) = \frac{\sum_{i=1}^{n} [u[i] \neq v[i]]}{n} Kulczynski ========== The Kulczynski similarity is the number of equal one-bits in the two bit vectors :math:`u` and :math:`v` multiplied with the sum of ones in both vectors divided by twice the sum of ones in both vectors multiplied. The implementation is given in `RDKit `__. .. math:: & u_1 = \sum_{i=1}^{n} [u[i]]\\ & v_1 = \sum_{i=1}^{n} [v[i]]\\ & \text{Kulczynski}(u, v) = \begin{cases} 0, &\text{if} u_1 \cdot v_1 = 0,\\ \frac{(\sum_{i=1}^{n} [u[i] = v[i] = 1]) \cdot (u_1 + v_1)}{2 \cdot u_1 \cdot v_1)} &\text{otherwise} \end{cases} Matching ======== see Hamming OnBit ===== The OnBit similarity is the ratio of equal one-bits in the two bit vectors :math:`u` and :math:`v` divided by the sum of the one-bits in the two bit vectors :math:`u` and :math:`v`. The similarity is 0 if the latter sum is 0. The implementation is given in `RDKit `__. .. math:: \text{OnBit}(u, v) = \begin{cases} 0, &\text{if} \sum_{i=1}^{n} [u[i] \lor v[i]] = 0,\\ \frac{(\sum_{i=1}^{n} [u[i] = v[i] = 1])}{\sum_{i=1}^{n} [u[i] \lor v[i]]} &\text{otherwise} \end{cases} Rogers-Tanimoto =============== The Rogers-Tanimoto distance is twice the number of bits that are different in the two bit vectors :math:`u` and :math:`v` divided by the sum of the number of bits that are different in the two bit vectors :math:`u` and :math:`v` plus the number of bits that are equal in the vectors. The implementation is given in `SciPy `__. .. math:: \text{Rogers-Tanimoto}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]]}{\sum_{i=1}^{n} [u[i] \neq v[i]] + \sum_{i=1}^{n} [u[i] = v[i]]} Rogot-Goldberg ============== The Rogot-Goldberg similarity is the ratio of equal one-bits in the two bit vectors :math:`u` and :math:`v` divided by the sum of the one-bits in the two bit vectors :math:`u` and :math:`v` plus the number of bits that are different in the two bit vectors :math:`u` and :math:`v`. The implementation is given in `RDKit `__. .. math:: & x = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\ & y = \sum_{i=1}^{n} [u[i]]\\ & z = \sum_{i=1}^{n} [u[i]]\\ & d = n - y - z + x\\ & \text{Rogot-Goldberg}(u, v) = \begin{cases} 1, &\text{if} x = n \lor d = n,\\ \frac{x}{x + z} + \frac{d}{2 \cdot n - y - z} &\text{otherwise} \end{cases} Russel ====== The Russel similarity is the ratio of equal one-bits in the two bit vectors :math:`u` and :math:`v` divided by the number of one-bits in the two bit vectors :math:`u` and :math:`v`. The implementation is given in `RDKit `__. .. math:: \text{Russel}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{n} Sokal ===== The Sokal similarity is the ratio of equal one-bits in the two bit vectors :math:`u` and :math:`v` divided by the sum of the one-bits in the two bit vectors :math:`u` and :math:`v` minus the number of equal one-bits in the two bit vectors :math:`u` and :math:`v`. The implementation is given in `RDKit `__. .. math:: \text{Sokal}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{2 \cdot \sum_{i=1}^{n} [u[i]] + [v[i]] - \sum_{i=1}^{n} [u[i] = v[i] = 1]} Sokal-Michener ============== The Sokal-Michener distance is twice the number of bits that are different in the two bit vectors :math:`u` and :math:`v` divided by the sum of the number of bits that are different in the two bit vectors :math:`u` and :math:`v` plus the number of bits that are equal in the vectors. The implementation is given in `SciPy `__. .. math:: \text{Sokal-Michener}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]]}{\sum_{i=1}^{n} 2 \cdot [u[i] \neq v[i]] + [u[i] = v[i]]} Tanimoto ======== The Tanimoto similarity is the ratio of equal one-bits in the two bit vectors :math:`u` and :math:`v` divided by the sum of the one-bits in the two bit vectors :math:`u` and :math:`v` minus the number of equal one-bits in the two bit vectors :math:`u` and :math:`v`. The implementation is given in `RDKit `__. .. math:: & t = \sum_{i=1}^{n} [u[i]] + [v[i]]\\ & c = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\ & \text{Tanimoto}(u, v) = \begin{cases} 1, &\text{if} t = 0,\\ \frac{c}{t - c} &\text{otherwise} \end{cases} Yule ==== The Yule distance is twice the number of bits that are different in the two bit vectors :math:`u` and :math:`v` divided by the sum of the number of bits that are different in the two bit vectors :math:`u` and :math:`v` plus the number of bits that are equal in the vectors. The implementation is given in `SciPy `__. .. math:: \text{Yule}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] = v[i] = 1]}{\sum_{i=1}^{n} [u[i] = v[i]] + \sum_{i=1}^{n} [u[i] = v[i] = 1]}