.. _embeddings-label:

########################
Clustering of Embeddings
########################

DataSAIL offers different similarity and distance metrics, implemented in SciPy and RDKit, to cluster embeddings. The
available metrics are:

.. list-table:: Overview of the embedding-based metrics readily available in DataSAIL.
    :widths: 30 15 15 15 15 15
    :header-rows: 1

    * - Algorithm
      - Sim or Dist
      - Boolean
      - Integer
      - Float
      - RDKit or SciPy
    * - :ref:`AllBit <metric-allbit>`
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - :ref:`Asymmetric <metric-asymmetric>`
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - :ref:`Braun-Blanquet <metric-braun-blanquet>`
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - :ref:`Canberra <metric-canberra>`
      - Dist
      - X
      - X
      - X
      - SciPy
    * - :ref:`Cosine <metric-cosine>`
      - Sim and Dist
      - X
      - X
      - X
      - SciPy
    * - :ref:`Dice <metric-dice>`
      - Sim
      - X
      - X
      - \-
      - RDKit
    * - :ref:`Hamming <metric-hamming>`
      - Dist
      - X
      - X
      - X
      - SciPy
    * - :ref:`Jaccard <metric-jaccard>`
      - Dist
      - X
      - \-
      - \-
      - SciPy
    * - :ref:`Kulczynski <metric-kulczynski>`
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - :ref:`Matching <metric-hamming>`
      - Dist
      - X
      - X
      - X
      - SciPy
    * - :ref:`OnBit <metric-onbit>`
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - :ref:`Rogers-Tanimoto <metric-rogers-tanimoto>`
      - Dist
      - X
      - \-
      - \-
      - SciPy
    * - :ref:`Rogot-Goldberg <metric-rogot-goldberg>`
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - :ref:`Russel <metric-russel>`
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - :ref:`Sokal <metric-sokal>`
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - :ref:`Sokal-Michener <metric-sokal-michener>`
      - Dist
      - X
      - \-
      - \-
      - SciPy
    * - :ref:`Tanimoto <metric-tanimoto>`
      - Sim
      - X
      - X
      - \-
      - RDKit
    * - :ref:`Yule <metric-yule>`
      - Dist
      - X
      - \-
      - \-
      - SciPy

Note that the cosine metric can be used both as a similarity and as a distance metric. The two are related by
:math:`\text{CosineDistance}(u, v) = 1 - \text{CosineSimilarity}(u, v)`.

Individual Algorithms
#####################

In the following, we describe the individual algorithms in more detail, together with the mathematical formula that
computes the respective metric between two vectors :math:`u` and :math:`v` of length :math:`n`. Depending on the
method, :math:`u` and :math:`v` can be float vectors, or may be restricted to integer vectors or bit vectors.

.. note::
    We use the Iverson bracket notation :math:`[P]` to denote the indicator function that is 1 if the predicate
    :math:`P` is true and 0 otherwise.

.. _metric-allbit:

AllBit
======

The AllBit similarity is the fraction of positions at which the two bit vectors :math:`u` and :math:`v` are equal.

.. math::
    \text{AllBit}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i]]}{n}

.. _metric-asymmetric:

Asymmetric
==========

The Asymmetric similarity is the number of common one-bits in the two bit vectors :math:`u` and :math:`v` divided by
the smaller of the two numbers of one-bits in the vectors. The similarity is 0 if that minimum is 0. The
implementation is given in RDKit.

.. math::
    & u_1 = \sum_{i=1}^{n} [u[i]]\\
    & v_1 = \sum_{i=1}^{n} [v[i]]\\
    & \text{Asymmetric}(u, v) = \begin{cases}
        0, &\text{if } \min(u_1, v_1) = 0,\\
        \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\min(u_1, v_1)} &\text{otherwise}
    \end{cases}

.. _metric-braun-blanquet:

Braun-Blanquet
==============

The Braun-Blanquet similarity is the number of common one-bits in the two bit vectors :math:`u` and :math:`v` divided
by the larger of the two numbers of one-bits in the vectors. The similarity is 0 if that maximum is 0. The
implementation is given in RDKit.

.. math::
    & u_1 = \sum_{i=1}^{n} [u[i]]\\
    & v_1 = \sum_{i=1}^{n} [v[i]]\\
    & \text{Braun-Blanquet}(u, v) = \begin{cases}
        0, &\text{if } \max(u_1, v_1) = 0,\\
        \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\max(u_1, v_1)} &\text{otherwise}
    \end{cases}

.. _metric-canberra:

Canberra
========

The Canberra distance is the sum, over all positions, of the absolute difference of the entries of :math:`u` and
:math:`v` divided by the sum of their absolute values. Terms in which both :math:`u[i]` and :math:`v[i]` are 0 are
counted as 0 in SciPy. The implementation is given in SciPy.

.. math::
    \text{Canberra}(u, v) = \sum_{i=1}^{n} \frac{|u[i] - v[i]|}{|u[i]| + |v[i]|}

.. _metric-cosine:

Cosine
======

The Cosine similarity is the dot product of the two vectors :math:`u` and :math:`v` divided by the product of their
Euclidean norms. The implementation is given in SciPy.

.. math::
    \text{CosineSimilarity}(u, v) = \frac{\sum_{i=1}^{n} u[i] \cdot v[i]}{\sqrt{\sum_{i=1}^{n} u[i]^2} \cdot \sqrt{\sum_{i=1}^{n} v[i]^2}}

.. _metric-dice:

Dice
====

The Dice similarity is twice the number of common one-bits in the two bit vectors :math:`u` and :math:`v` divided by
the total number of one-bits in both vectors. The implementation is given in RDKit.

.. math::
    \text{Dice}(u, v) = \frac{2 \sum_{i=1}^{n} [u[i] = v[i] = 1]}{\sum_{i=1}^{n} [u[i]] + \sum_{i=1}^{n} [v[i]]}

.. _metric-hamming:

Hamming or Matching
===================

The Hamming distance (a.k.a. Matching distance) is the number of positions at which the two vectors :math:`u` and
:math:`v` differ. The implementation is given in SciPy. Note that SciPy's ``hamming`` function reports this count
divided by :math:`n`, i.e., the fraction of differing positions.

.. math::
    \text{Hamming}(u, v) = \sum_{i=1}^{n} [u[i] \neq v[i]]

.. _metric-jaccard:

Jaccard
=======

The Jaccard distance is the number of bits that differ between the two bit vectors :math:`u` and :math:`v` divided by
the number of common one-bits plus the number of differing bits. The implementation is given in SciPy.

.. math::
    \text{Jaccard}(u, v) = \frac{\sum_{i=1}^{n} [u[i] \neq v[i]]}{\sum_{i=1}^{n} [u[i] = v[i] = 1] + \sum_{i=1}^{n} [u[i] \neq v[i]]}

.. _metric-kulczynski:

Kulczynski
==========

The Kulczynski similarity is the number of common one-bits in the two bit vectors :math:`u` and :math:`v` multiplied
by the total number of one-bits in both vectors, divided by twice the product of the numbers of one-bits in the two
vectors. The similarity is 0 if either vector contains no one-bits. The implementation is given in RDKit.

.. math::
    & u_1 = \sum_{i=1}^{n} [u[i]]\\
    & v_1 = \sum_{i=1}^{n} [v[i]]\\
    & \text{Kulczynski}(u, v) = \begin{cases}
        0, &\text{if } u_1 \cdot v_1 = 0,\\
        \frac{(\sum_{i=1}^{n} [u[i] = v[i] = 1]) \cdot (u_1 + v_1)}{2 \cdot u_1 \cdot v_1} &\text{otherwise}
    \end{cases}

Matching
========

See :ref:`Hamming <metric-hamming>`.

.. _metric-onbit:

OnBit
=====

The OnBit similarity is the number of common one-bits in the two bit vectors :math:`u` and :math:`v` divided by the
number of positions at which at least one of the two vectors has a one-bit. The similarity is 0 if the latter count
is 0. The implementation is given in RDKit.

.. math::
    \text{OnBit}(u, v) = \begin{cases}
        0, &\text{if } \sum_{i=1}^{n} [u[i] \lor v[i]] = 0,\\
        \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\sum_{i=1}^{n} [u[i] \lor v[i]]} &\text{otherwise}
    \end{cases}

.. _metric-rogers-tanimoto:

Rogers-Tanimoto
===============

The Rogers-Tanimoto distance is twice the number of bits that differ between the two bit vectors :math:`u` and
:math:`v` divided by the sum of twice the number of differing bits and the number of equal bits. The implementation
is given in SciPy.

.. math::
    \text{Rogers-Tanimoto}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]]}{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]] + \sum_{i=1}^{n} [u[i] = v[i]]}

.. _metric-rogot-goldberg:

Rogot-Goldberg
==============

The Rogot-Goldberg similarity is the sum of two ratios: the number of common one-bits divided by the total number of
one-bits in both bit vectors :math:`u` and :math:`v`, and the number of common zero-bits divided by the total number
of zero-bits in both vectors. The similarity is 1 if all positions are common one-bits or all positions are common
zero-bits. The implementation is given in RDKit.

.. math::
    & x = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\
    & y = \sum_{i=1}^{n} [u[i]]\\
    & z = \sum_{i=1}^{n} [v[i]]\\
    & d = n - y - z + x\\
    & \text{Rogot-Goldberg}(u, v) = \begin{cases}
        1, &\text{if } x = n \lor d = n,\\
        \frac{x}{y + z} + \frac{d}{2 \cdot n - y - z} &\text{otherwise}
    \end{cases}

.. _metric-russel:

Russel
======

The Russel similarity is the number of common one-bits in the two bit vectors :math:`u` and :math:`v` divided by the
length :math:`n` of the vectors. The implementation is given in RDKit.

.. math::
    \text{Russel}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{n}

.. _metric-sokal:

Sokal
=====

The Sokal similarity is the number of common one-bits in the two bit vectors :math:`u` and :math:`v` divided by twice
the total number of one-bits in both vectors minus three times the number of common one-bits. The implementation is
given in RDKit.

.. math::
    \text{Sokal}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{2 \cdot \sum_{i=1}^{n} ([u[i]] + [v[i]]) - 3 \cdot \sum_{i=1}^{n} [u[i] = v[i] = 1]}

.. _metric-sokal-michener:

Sokal-Michener
==============

The Sokal-Michener distance is twice the number of bits that differ between the two bit vectors :math:`u` and
:math:`v` divided by the sum of twice the number of differing bits and the number of equal bits. As implemented in
SciPy, it coincides with the Rogers-Tanimoto distance. The implementation is given in SciPy.

.. math::
    \text{Sokal-Michener}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]]}{\sum_{i=1}^{n} (2 \cdot [u[i] \neq v[i]] + [u[i] = v[i]])}

.. _metric-tanimoto:

Tanimoto
========

The Tanimoto similarity is the number of common one-bits in the two bit vectors :math:`u` and :math:`v` divided by
the total number of one-bits in both vectors minus the number of common one-bits. The similarity is 1 if neither
vector contains a one-bit. The implementation is given in RDKit.

.. math::
    & t = \sum_{i=1}^{n} ([u[i]] + [v[i]])\\
    & c = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\
    & \text{Tanimoto}(u, v) = \begin{cases}
        1, &\text{if } t = 0,\\
        \frac{c}{t - c} &\text{otherwise}
    \end{cases}

.. _metric-yule:

Yule
====

The Yule distance is based on the four counts of bit combinations in the two bit vectors :math:`u` and :math:`v`. It
is twice the product of the two counts of differing bits (one-bit only in :math:`u`, one-bit only in :math:`v`)
divided by the sum of that product and the product of the numbers of common one-bits and common zero-bits. The
implementation is given in SciPy.

.. math::
    & c_{11} = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\
    & c_{00} = \sum_{i=1}^{n} [u[i] = v[i] = 0]\\
    & c_{10} = \sum_{i=1}^{n} [u[i] = 1 \land v[i] = 0]\\
    & c_{01} = \sum_{i=1}^{n} [u[i] = 0 \land v[i] = 1]\\
    & \text{Yule}(u, v) = \frac{2 \cdot c_{10} \cdot c_{01}}{c_{11} \cdot c_{00} + c_{10} \cdot c_{01}}
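
Worked Example
==============

To make some of the formulas in this section concrete, the following is a minimal plain-Python sketch that evaluates
a few of the metrics on two small bit vectors. It is not the code DataSAIL, RDKit, or SciPy actually run. All function
names (``common_ones``, ``differing``, ``jaccard_distance``, ``tanimoto_similarity``, ``cosine_similarity``) are
hypothetical helpers introduced only for this illustration, and the handling of all-zero vectors in
``jaccard_distance`` is an assumption.

.. code-block:: python

    from math import sqrt


    def common_ones(u, v):
        """Number of positions where both bit vectors have a one-bit."""
        return sum(1 for a, b in zip(u, v) if a == 1 and b == 1)


    def differing(u, v):
        """Number of positions where the two vectors differ (Hamming count)."""
        return sum(1 for a, b in zip(u, v) if a != b)


    def jaccard_distance(u, v):
        """Differing bits divided by (common one-bits + differing bits)."""
        d, c = differing(u, v), common_ones(u, v)
        # Assumption: two all-zero vectors are assigned distance 0 here.
        return 0.0 if c + d == 0 else d / (c + d)


    def tanimoto_similarity(u, v):
        """Common one-bits divided by (all one-bits - common one-bits)."""
        t, c = sum(u) + sum(v), common_ones(u, v)
        return 1.0 if t == 0 else c / (t - c)


    def cosine_similarity(u, v):
        """Dot product divided by the product of the Euclidean norms.

        Raises ZeroDivisionError if either vector contains only zeros.
        """
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


    # Two example bit vectors of length n = 8.
    u = [1, 1, 0, 1, 0, 0, 1, 0]
    v = [1, 0, 0, 1, 1, 0, 1, 0]

    print("Hamming count:      ", differing(u, v))
    print("Jaccard distance:   ", jaccard_distance(u, v))
    print("Tanimoto similarity:", tanimoto_similarity(u, v))
    print("Cosine similarity:  ", cosine_similarity(u, v))
    print("Cosine distance:    ", 1 - cosine_similarity(u, v))

For these two vectors there are 3 common one-bits and 2 differing bits, so the Jaccard distance is
:math:`2 / 5 = 0.4` and the Tanimoto similarity is :math:`3 / (8 - 3) = 0.6`. On bit vectors, the two are
complementary, i.e., :math:`\text{Tanimoto}(u, v) = 1 - \text{Jaccard}(u, v)`.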