Clustering of Embeddings

DataSAIL offers different clustering algorithms implemented in SciPy and RDKit to cluster the embeddings. The clustering algorithms are:

Title
Algorithm	Sim or Dist	Boolean	Integer	Float	RDKit or SciPy
AllBit	Sim	X	-	-	RDKit
Asymmetric	Sim	X	-	-	RDKit
Braun-Blanquet	Sim	X	-	-	RDKit
Canberra	Dist	X	X	X	SciPy
Dice	Sim	X	X	-	RDKit
Hamming	Dist	X	X	X	SciPy
Kulczynski	Sim	X	-	-	RDKit
Jaccard	Dist	X	-	-	SciPy
Matching	Dist	X	X	X	SciPy
OnBit	Sim	X	-	-	RDKit
Rogers-Tanimoto	Dist	X	-	-	SciPy
Rogot-Goldberg	Sim	X	-	-	RDKit
Russel	Sim	X	-	-	RDKit
Sokal	Sim	X	-	-	RDKit
Sokal-Michener	Dist	X	-	-	SciPy
Tanimoto	Sim	X	X	-	RDKit
Yule	Dist	X	-	-	SciPy

Individual Algorithms

In the following, we will describe the individual algorithms in more detail and with the mathematical formula that computes the respective metric between two vectors \(u\) and \(v\) of length \(n\). Depending on the method used, \(u\) and \(v\) can be float-vectors but may also be restricted to be int-vectors or bit-vectors.

Note

We will use the Iverson bracket notation \([P]\) to denote the indicator function that is 1 if the predicate \(P\) is true and 0 otherwise.

AllBit

This is the ratio of equal bits in the two bit vectors \(u\) and \(v\).

\[\text{AllBit}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i]]}{n}\]

Asymmetric

The Asymmetric similarity is the ratio of equal bits in the two bit vectors \(u\) and \(v\) divided by the minimum number of bits set in either of the two vectors. The implementation is given in RDKit.

\[\begin{split}& u_1 = \sum_{i=1}^{n} [u[i]]\\ & v_1 = \sum_{i=1}^{n} [v[i]]\\ & \text{Asymmetric}(u, v) = \begin{cases} 0, &\text{if} \min(u_1,v_1) = 0,\\ \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\min(u_1, v_1)} &\text{otherwise} \end{cases}\end{split}\]

Braun-Blanquet

The Braun-Blanquet similarity is the ratio of equal bits in the two bit vectors \(u\) and \(v\) divided by the maximum number of bits set in either of the two vectors. The implementation is given in RDKit.

\[\begin{split}& u_1 = \sum_{i=1}^{n} [u[i]]\\ & v_1 = \sum_{i=1}^{n} [v[i]]\\ & \text{Braun-Blanquet}(u, v) = \begin{cases} 0, &\text{if} \max(u_1,v_1) = 0,\\ \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\max(u_1, v_1)} &\text{otherwise} \end{cases}\end{split}\]

Canberra

The Canberra distance is the sum of the absolute differences of the two vectors \(u\) and \(v\) divided by the sum of the absolute values of the two vectors. The implementation is given in SciPy.

\[\text{Canberra}(u, v) = \sum_{i=1}^{n} \frac{|u[i] - v[i]|}{|u[i]| + |v[i]|}\]

Dice

The Dice similarity is the ratio of equal bits in the two bit vectors \(u\) and \(v\) divided by the sum of the number of bits set in either of the two vectors. The implementation is given in RDKit.

\[\text{Dice}(u, v) = \frac{2 \sum_{i=1}^{n} [u[i] = v[i] = 1]}{\sum_{i=1}^{n} [u[i]] + \sum_{i=1}^{n} [v[i]]}\]

Hamming or Matching

The Hamming distance (a.k.a. Matching distance) is the number of bits that are different in the two bit vectors \(u\) and \(v\). The implementation is given in SciPy.

\[\text{Hamming}(u, v) = \sum_{i=1}^{n} [u[i] \neq v[i]]\]

Jaccard

The Jaccard distance is the number of bits that are different in the two bit vectors \(u\) and \(v\) divided by the number of equal one-bits in the two bit vectors \(u\) and \(v\) plus the number of bits that are different in the two bit vectors \(u\) and \(v\). The implementation is given in SciPy.

\[\text{Jaccard}(u, v) = \frac{\sum_{i=1}^{n} [u[i] \neq v[i]]}{n}\]

Kulczynski

The Kulczynski similarity is the number of equal one-bits in the two bit vectors \(u\) and \(v\) multiplied with the sum of ones in both vectors divided by twice the sum of ones in both vectors multiplied. The implementation is given in RDKit.

\[\begin{split}& u_1 = \sum_{i=1}^{n} [u[i]]\\ & v_1 = \sum_{i=1}^{n} [v[i]]\\ & \text{Kulczynski}(u, v) = \begin{cases} 0, &\text{if} u_1 \cdot v_1 = 0,\\ \frac{(\sum_{i=1}^{n} [u[i] = v[i] = 1]) \cdot (u_1 + v_1)}{2 \cdot u_1 \cdot v_1)} &\text{otherwise} \end{cases}\end{split}\]

Matching

see Hamming

OnBit

The OnBit similarity is the ratio of equal one-bits in the two bit vectors \(u\) and \(v\) divided by the sum of the one-bits in the two bit vectors \(u\) and \(v\). The similarity is 0 if the latter sum is 0. The implementation is given in RDKit.

\[\begin{split}\text{OnBit}(u, v) = \begin{cases} 0, &\text{if} \sum_{i=1}^{n} [u[i] \lor v[i]] = 0,\\ \frac{(\sum_{i=1}^{n} [u[i] = v[i] = 1])}{\sum_{i=1}^{n} [u[i] \lor v[i]]} &\text{otherwise} \end{cases}\end{split}\]

Rogers-Tanimoto

The Rogers-Tanimoto distance is twice the number of bits that are different in the two bit vectors \(u\) and \(v\) divided by the sum of the number of bits that are different in the two bit vectors \(u\) and \(v\) plus the number of bits that are equal in the vectors. The implementation is given in SciPy.

\[\text{Rogers-Tanimoto}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]]}{\sum_{i=1}^{n} [u[i] \neq v[i]] + \sum_{i=1}^{n} [u[i] = v[i]]}\]

Rogot-Goldberg

The Rogot-Goldberg similarity is the ratio of equal one-bits in the two bit vectors \(u\) and \(v\) divided by the sum of the one-bits in the two bit vectors \(u\) and \(v\) plus the number of bits that are different in the two bit vectors \(u\) and \(v\). The implementation is given in RDKit.

\[\begin{split}& x = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\ & y = \sum_{i=1}^{n} [u[i]]\\ & z = \sum_{i=1}^{n} [u[i]]\\ & d = n - y - z + x\\ & \text{Rogot-Goldberg}(u, v) = \begin{cases} 1, &\text{if} x = n \lor d = n,\\ \frac{x}{x + z} + \frac{d}{2 \cdot n - y - z} &\text{otherwise} \end{cases}\end{split}\]

Russel

The Russel similarity is the ratio of equal one-bits in the two bit vectors \(u\) and \(v\) divided by the number of one-bits in the two bit vectors \(u\) and \(v\). The implementation is given in RDKit.

\[\text{Russel}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{n}\]

Sokal

The Sokal similarity is the ratio of equal one-bits in the two bit vectors \(u\) and \(v\) divided by the sum of the one-bits in the two bit vectors \(u\) and \(v\) minus the number of equal one-bits in the two bit vectors \(u\) and \(v\). The implementation is given in RDKit.

\[\text{Sokal}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{2 \cdot \sum_{i=1}^{n} [u[i]] + [v[i]] - \sum_{i=1}^{n} [u[i] = v[i] = 1]}\]

Sokal-Michener

The Sokal-Michener distance is twice the number of bits that are different in the two bit vectors \(u\) and \(v\) divided by the sum of the number of bits that are different in the two bit vectors \(u\) and \(v\) plus the number of bits that are equal in the vectors. The implementation is given in SciPy.

\[\text{Sokal-Michener}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]]}{\sum_{i=1}^{n} 2 \cdot [u[i] \neq v[i]] + [u[i] = v[i]]}\]

Tanimoto

The Tanimoto similarity is the ratio of equal one-bits in the two bit vectors \(u\) and \(v\) divided by the sum of the one-bits in the two bit vectors \(u\) and \(v\) minus the number of equal one-bits in the two bit vectors \(u\) and \(v\). The implementation is given in RDKit.

\[\begin{split}& t = \sum_{i=1}^{n} [u[i]] + [v[i]]\\ & c = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\ & \text{Tanimoto}(u, v) = \begin{cases} 1, &\text{if} t = 0,\\ \frac{c}{t - c} &\text{otherwise} \end{cases}\end{split}\]

Yule

The Yule distance is twice the number of bits that are different in the two bit vectors \(u\) and \(v\) divided by the sum of the number of bits that are different in the two bit vectors \(u\) and \(v\) plus the number of bits that are equal in the vectors. The implementation is given in SciPy.

\[\text{Yule}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] = v[i] = 1]}{\sum_{i=1}^{n} [u[i] = v[i]] + \sum_{i=1}^{n} [u[i] = v[i] = 1]}\]