Clustering of Embeddings

DataSAIL offers different clustering algorithms implemented in SciPy and RDKit to cluster the embeddings. The clustering algorithms are:

Title

Algorithm

Sim or Dist

Boolean

Integer

Float

RDKit or SciPy

AllBit

Sim

X

-

-

RDKit

Asymmetric

Sim

X

-

-

RDKit

Braun-Blanquet

Sim

X

-

-

RDKit

Canberra

Dist

X

X

X

SciPy

Dice

Sim

X

X

-

RDKit

Hamming

Dist

X

X

X

SciPy

Kulczynski

Sim

X

-

-

RDKit

Jaccard

Dist

X

-

-

SciPy

Matching

Dist

X

X

X

SciPy

OnBit

Sim

X

-

-

RDKit

Rogers-Tanimoto

Dist

X

-

-

SciPy

Rogot-Goldberg

Sim

X

-

-

RDKit

Russel

Sim

X

-

-

RDKit

Sokal

Sim

X

-

-

RDKit

Sokal-Michener

Dist

X

-

-

SciPy

Tanimoto

Sim

X

X

-

RDKit

Yule

Dist

X

-

-

SciPy

Individual Algorithms

In the following, we will describe the individual algorithms in more detail and with the mathematical formula that computes the respective metric between two vectors \(u\) and \(v\) of length \(n\). Depending on the method used, \(u\) and \(v\) can be float-vectors but may also be restricted to be int-vectors or bit-vectors.

Note

We will use the Iverson bracket notation \([P]\) to denote the indicator function that is 1 if the predicate \(P\) is true and 0 otherwise.

AllBit

This is the ratio of equal bits in the two bit vectors \(u\) and \(v\).

\[\text{AllBit}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i]]}{n}\]

Asymmetric

The Asymmetric similarity is the ratio of equal bits in the two bit vectors \(u\) and \(v\) divided by the minimum number of bits set in either of the two vectors. The implementation is given in RDKit.

\[\begin{split}& u_1 = \sum_{i=1}^{n} [u[i]]\\ & v_1 = \sum_{i=1}^{n} [v[i]]\\ & \text{Asymmetric}(u, v) = \begin{cases} 0, &\text{if} \min(u_1,v_1) = 0,\\ \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\min(u_1, v_1)} &\text{otherwise} \end{cases}\end{split}\]

Braun-Blanquet

The Braun-Blanquet similarity is the ratio of equal bits in the two bit vectors \(u\) and \(v\) divided by the maximum number of bits set in either of the two vectors. The implementation is given in RDKit.

\[\begin{split}& u_1 = \sum_{i=1}^{n} [u[i]]\\ & v_1 = \sum_{i=1}^{n} [v[i]]\\ & \text{Braun-Blanquet}(u, v) = \begin{cases} 0, &\text{if} \max(u_1,v_1) = 0,\\ \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\max(u_1, v_1)} &\text{otherwise} \end{cases}\end{split}\]

Canberra

The Canberra distance is the sum of the absolute differences of the two vectors \(u\) and \(v\) divided by the sum of the absolute values of the two vectors. The implementation is given in SciPy.

\[\text{Canberra}(u, v) = \sum_{i=1}^{n} \frac{|u[i] - v[i]|}{|u[i]| + |v[i]|}\]

Dice

The Dice similarity is the ratio of equal bits in the two bit vectors \(u\) and \(v\) divided by the sum of the number of bits set in either of the two vectors. The implementation is given in RDKit.

\[\text{Dice}(u, v) = \frac{2 \sum_{i=1}^{n} [u[i] = v[i] = 1]}{\sum_{i=1}^{n} [u[i]] + \sum_{i=1}^{n} [v[i]]}\]

Hamming or Matching

The Hamming distance (a.k.a. Matching distance) is the number of bits that are different in the two bit vectors \(u\) and \(v\). The implementation is given in SciPy.

\[\text{Hamming}(u, v) = \sum_{i=1}^{n} [u[i] \neq v[i]]\]

Jaccard

The Jaccard distance is the number of bits that are different in the two bit vectors \(u\) and \(v\) divided by the number of equal one-bits in the two bit vectors \(u\) and \(v\) plus the number of bits that are different in the two bit vectors \(u\) and \(v\). The implementation is given in SciPy.

\[\text{Jaccard}(u, v) = \frac{\sum_{i=1}^{n} [u[i] \neq v[i]]}{n}\]

Kulczynski

The Kulczynski similarity is the number of equal one-bits in the two bit vectors \(u\) and \(v\) multiplied with the sum of ones in both vectors divided by twice the sum of ones in both vectors multiplied. The implementation is given in RDKit.

\[\begin{split}& u_1 = \sum_{i=1}^{n} [u[i]]\\ & v_1 = \sum_{i=1}^{n} [v[i]]\\ & \text{Kulczynski}(u, v) = \begin{cases} 0, &\text{if} u_1 \cdot v_1 = 0,\\ \frac{(\sum_{i=1}^{n} [u[i] = v[i] = 1]) \cdot (u_1 + v_1)}{2 \cdot u_1 \cdot v_1)} &\text{otherwise} \end{cases}\end{split}\]

Matching

see Hamming

OnBit

The OnBit similarity is the ratio of equal one-bits in the two bit vectors \(u\) and \(v\) divided by the sum of the one-bits in the two bit vectors \(u\) and \(v\). The similarity is 0 if the latter sum is 0. The implementation is given in RDKit.

\[\begin{split}\text{OnBit}(u, v) = \begin{cases} 0, &\text{if} \sum_{i=1}^{n} [u[i] \lor v[i]] = 0,\\ \frac{(\sum_{i=1}^{n} [u[i] = v[i] = 1])}{\sum_{i=1}^{n} [u[i] \lor v[i]]} &\text{otherwise} \end{cases}\end{split}\]

Rogers-Tanimoto

The Rogers-Tanimoto distance is twice the number of bits that are different in the two bit vectors \(u\) and \(v\) divided by the sum of the number of bits that are different in the two bit vectors \(u\) and \(v\) plus the number of bits that are equal in the vectors. The implementation is given in SciPy.

\[\text{Rogers-Tanimoto}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]]}{\sum_{i=1}^{n} [u[i] \neq v[i]] + \sum_{i=1}^{n} [u[i] = v[i]]}\]

Rogot-Goldberg

The Rogot-Goldberg similarity is the ratio of equal one-bits in the two bit vectors \(u\) and \(v\) divided by the sum of the one-bits in the two bit vectors \(u\) and \(v\) plus the number of bits that are different in the two bit vectors \(u\) and \(v\). The implementation is given in RDKit.

\[\begin{split}& x = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\ & y = \sum_{i=1}^{n} [u[i]]\\ & z = \sum_{i=1}^{n} [u[i]]\\ & d = n - y - z + x\\ & \text{Rogot-Goldberg}(u, v) = \begin{cases} 1, &\text{if} x = n \lor d = n,\\ \frac{x}{x + z} + \frac{d}{2 \cdot n - y - z} &\text{otherwise} \end{cases}\end{split}\]

Russel

The Russel similarity is the ratio of equal one-bits in the two bit vectors \(u\) and \(v\) divided by the number of one-bits in the two bit vectors \(u\) and \(v\). The implementation is given in RDKit.

\[\text{Russel}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{n}\]

Sokal

The Sokal similarity is the ratio of equal one-bits in the two bit vectors \(u\) and \(v\) divided by the sum of the one-bits in the two bit vectors \(u\) and \(v\) minus the number of equal one-bits in the two bit vectors \(u\) and \(v\). The implementation is given in RDKit.

\[\text{Sokal}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{2 \cdot \sum_{i=1}^{n} [u[i]] + [v[i]] - \sum_{i=1}^{n} [u[i] = v[i] = 1]}\]

Sokal-Michener

The Sokal-Michener distance is twice the number of bits that are different in the two bit vectors \(u\) and \(v\) divided by the sum of the number of bits that are different in the two bit vectors \(u\) and \(v\) plus the number of bits that are equal in the vectors. The implementation is given in SciPy.

\[\text{Sokal-Michener}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]]}{\sum_{i=1}^{n} 2 \cdot [u[i] \neq v[i]] + [u[i] = v[i]]}\]

Tanimoto

The Tanimoto similarity is the ratio of equal one-bits in the two bit vectors \(u\) and \(v\) divided by the sum of the one-bits in the two bit vectors \(u\) and \(v\) minus the number of equal one-bits in the two bit vectors \(u\) and \(v\). The implementation is given in RDKit.

\[\begin{split}& t = \sum_{i=1}^{n} [u[i]] + [v[i]]\\ & c = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\ & \text{Tanimoto}(u, v) = \begin{cases} 1, &\text{if} t = 0,\\ \frac{c}{t - c} &\text{otherwise} \end{cases}\end{split}\]

Yule

The Yule distance is twice the number of bits that are different in the two bit vectors \(u\) and \(v\) divided by the sum of the number of bits that are different in the two bit vectors \(u\) and \(v\) plus the number of bits that are equal in the vectors. The implementation is given in SciPy.

\[\text{Yule}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] = v[i] = 1]}{\sum_{i=1}^{n} [u[i] = v[i]] + \sum_{i=1}^{n} [u[i] = v[i] = 1]}\]