Clustering of Embeddings
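DataSAIL offers different similarity and distance measures, implemented in SciPy and RDKit, to cluster embeddings. The table below lists each measure, whether it is a similarity (Sim) or a distance (Dist), which embedding types it accepts (marked with an X), and which library provides the implementation: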
| Algorithm | Sim or Dist | Boolean | Integer | Float | RDKit or SciPy |
|---|---|---|---|---|---|
| AllBit | Sim | X | - | - | RDKit |
| Asymmetric | Sim | X | - | - | RDKit |
| Braun-Blanquet | Sim | X | - | - | RDKit |
| Canberra | Dist | X | X | X | SciPy |
| Dice | Sim | X | X | - | RDKit |
| Hamming | Dist | X | X | X | SciPy |
| Kulczynski | Sim | X | - | - | RDKit |
| Jaccard | Dist | X | - | - | SciPy |
| Matching | Dist | X | X | X | SciPy |
| OnBit | Sim | X | - | - | RDKit |
| Rogers-Tanimoto | Dist | X | - | - | SciPy |
| Rogot-Goldberg | Sim | X | - | - | RDKit |
| Russel | Sim | X | - | - | RDKit |
| Sokal | Sim | X | - | - | RDKit |
| Sokal-Michener | Dist | X | - | - | SciPy |
| Tanimoto | Sim | X | X | - | RDKit |
| Yule | Dist | X | - | - | SciPy |
Individual Algorithms
The following sections describe each measure in more detail, together with the formula that computes it for two vectors \(u\) and \(v\) of length \(n\). Depending on the measure, \(u\) and \(v\) may be float vectors or may be restricted to integer or bit vectors.
Note
We will use the Iverson bracket notation \([P]\) to denote the indicator function that is 1 if the predicate \(P\) is true and 0 otherwise.
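For bit vectors, most of the measures below can be expressed through four Iverson-bracket counts. The following sketch is only an illustration of that notation; the helper name `bit_counts` and the names of the counts are assumptions made here and are not part of DataSAIL, RDKit, or SciPy.

```python
def bit_counts(u, v):
    """Count the four bit combinations over two equal-length bit vectors.

    Each count is a sum of Iverson brackets, for example
    c_tt = sum_i [u_i = 1 and v_i = 1].
    """
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    c_tt = sum(1 for a, b in zip(u, v) if a and b)          # set in both
    c_tf = sum(1 for a, b in zip(u, v) if a and not b)      # set only in u
    c_ft = sum(1 for a, b in zip(u, v) if not a and b)      # set only in v
    c_ff = sum(1 for a, b in zip(u, v) if not a and not b)  # set in neither
    return c_tt, c_tf, c_ft, c_ff


u = [1, 1, 0, 0, 1, 0]
v = [1, 0, 0, 1, 1, 0]
counts = bit_counts(u, v)
```

The four counts always add up to \(n\), which is a quick sanity check when experimenting with the formulas.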
AllBit
The AllBit similarity is the fraction of positions at which the two bit vectors \(u\) and \(v\) hold the same bit. The implementation is given in RDKit.
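Written with the Iverson bracket, this description corresponds to the formula below. It is reconstructed from the prose and is meant as a reading aid, not as a copy of the RDKit source.

\[
\mathrm{AllBit}(u, v) = \frac{1}{n} \sum_{i=1}^{n} [u_i = v_i]
\]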
Asymmetric
The Asymmetric similarity is the number of one-bits shared by the two bit vectors \(u\) and \(v\), divided by the smaller of the two vectors' one-bit counts. The implementation is given in RDKit.
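As a formula, following the expression documented for RDKit's `AsymmetricSimilarity` (shared one-bits over the smaller one-bit count), this can be sketched as:

\[
\mathrm{Asymmetric}(u, v) = \frac{\sum_{i=1}^{n} [u_i = 1 \land v_i = 1]}{\min\left(\sum_{i=1}^{n} [u_i = 1],\ \sum_{i=1}^{n} [v_i = 1]\right)}
\]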
Braun-Blanquet
The Braun-Blanquet similarity is the number of one-bits shared by the two bit vectors \(u\) and \(v\), divided by the larger of the two vectors' one-bit counts. The implementation is given in RDKit.
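As a formula, following the expression documented for RDKit's `BraunBlanquetSimilarity` (shared one-bits over the larger one-bit count), this can be sketched as:

\[
\mathrm{BraunBlanquet}(u, v) = \frac{\sum_{i=1}^{n} [u_i = 1 \land v_i = 1]}{\max\left(\sum_{i=1}^{n} [u_i = 1],\ \sum_{i=1}^{n} [v_i = 1]\right)}
\]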
Canberra
The Canberra distance sums, over all components \(i\), the absolute difference \(|u_i - v_i|\) divided by \(|u_i| + |v_i|\). The implementation is given in SciPy.
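The per-component nature of the Canberra distance is easy to miss, so here is a minimal pure-Python sketch of it. DataSAIL relies on SciPy for the actual computation; the function name `canberra_distance` is a hypothetical helper introduced only for this illustration. Components where both entries are zero are skipped, matching the convention SciPy documents for the 0/0 case.

```python
def canberra_distance(u, v):
    """Canberra distance: sum_i |u_i - v_i| / (|u_i| + |v_i|)."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom != 0:  # a 0/0 term contributes nothing
            total += abs(a - b) / denom
    return total


# Only the first component differs: |1 - 3| / (1 + 3) = 0.5
d = canberra_distance([1.0, 2.0, 0.0], [3.0, 2.0, 0.0])
```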
Dice
The Dice similarity is twice the number of one-bits shared by the two bit vectors \(u\) and \(v\), divided by the sum of the two vectors' one-bit counts. The implementation is given in RDKit.
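For bit vectors, the expression documented for RDKit's `DiceSimilarity` (twice the shared one-bits over the sum of both one-bit counts) reads as follows in Iverson-bracket notation:

\[
\mathrm{Dice}(u, v) = \frac{2 \sum_{i=1}^{n} [u_i = 1 \land v_i = 1]}{\sum_{i=1}^{n} [u_i = 1] + \sum_{i=1}^{n} [v_i = 1]}
\]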
Hamming or Matching
The Hamming distance (a.k.a. Matching distance) is the fraction of positions at which \(u\) and \(v\) differ, that is, the number of differing components divided by \(n\). The implementation is given in SciPy.
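SciPy's `hamming` returns the proportion of disagreeing components, not the raw count. The following pure-Python sketch mirrors that behaviour for illustration; `hamming_distance` is a hypothetical name and not the function DataSAIL calls.

```python
def hamming_distance(u, v):
    """Fraction of positions at which u and v differ: (1/n) * sum_i [u_i != v_i]."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    if len(u) == 0:
        raise ValueError("vectors must not be empty")
    differing = sum(1 for a, b in zip(u, v) if a != b)
    return differing / len(u)


# Two of four positions differ
h = hamming_distance([1, 0, 1, 1], [1, 1, 0, 1])
```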
Jaccard
The Jaccard distance is the number of positions at which the two bit vectors \(u\) and \(v\) differ, divided by the number of positions at which at least one of them has a one-bit (the differing positions plus the shared one-bits). The implementation is given in SciPy.
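In Iverson-bracket notation, this description corresponds to:

\[
\mathrm{Jaccard}(u, v) = \frac{\sum_{i=1}^{n} [u_i \neq v_i]}{\sum_{i=1}^{n} [u_i = 1 \lor v_i = 1]}
\]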
Kulczynski
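The Kulczynski similarity is the number of one-bits shared by the two bit vectors \(u\) and \(v\), multiplied by the sum of the two vectors' one-bit counts and divided by twice the product of those two counts. Equivalently, it is the mean of the two fractions obtained by dividing the shared one-bits by each vector's one-bit count. The implementation is given in RDKit.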
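Because the verbal description is hard to parse, here is the expression documented for RDKit's `KulczynskiSimilarity`, written with the shorthand \(a = \sum_{i=1}^{n} [u_i = 1 \land v_i = 1]\), \(p = \sum_{i=1}^{n} [u_i = 1]\), and \(q = \sum_{i=1}^{n} [v_i = 1]\) (these symbols are introduced here for readability only):

\[
\mathrm{Kulczynski}(u, v) = \frac{a \, (p + q)}{2 \, p \, q} = \frac{1}{2} \left( \frac{a}{p} + \frac{a}{q} \right)
\]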
Matching
See Hamming.
OnBit
The OnBit similarity is the number of one-bits shared by the two bit vectors \(u\) and \(v\), divided by the number of positions at which at least one of the two vectors has a one-bit. The similarity is 0 if this denominator is 0. The implementation is given in RDKit.
Rogers-Tanimoto
The Rogers-Tanimoto distance is twice the number of positions at which the two bit vectors \(u\) and \(v\) differ, divided by the number of positions at which they agree plus twice the number of positions at which they differ. The implementation is given in SciPy.
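Following the definition documented for SciPy's `rogerstanimoto`, where differing positions are weighted twice in both numerator and denominator, the formula reads:

\[
\mathrm{RogersTanimoto}(u, v) = \frac{2 \sum_{i=1}^{n} [u_i \neq v_i]}{\sum_{i=1}^{n} [u_i = v_i] + 2 \sum_{i=1}^{n} [u_i \neq v_i]}
\]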
Rogot-Goldberg
The Rogot-Goldberg similarity is the number of one-bits shared by the two bit vectors \(u\) and \(v\), divided by the sum of the two vectors' one-bit counts plus the number of positions at which the vectors differ. The implementation is given in RDKit.
Russel
The Russel similarity is the number of one-bits shared by the two bit vectors \(u\) and \(v\), divided by the vector length \(n\). The implementation is given in RDKit.
Sokal
The Sokal similarity is the number of one-bits shared by the two bit vectors \(u\) and \(v\), divided by twice the sum of the two vectors' one-bit counts minus three times the number of shared one-bits. The implementation is given in RDKit.
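The expression documented for RDKit's `SokalSimilarity` differs from Tanimoto only in how heavily mismatches are weighted. With the shorthand \(a = \sum_{i=1}^{n} [u_i = 1 \land v_i = 1]\), \(p = \sum_{i=1}^{n} [u_i = 1]\), and \(q = \sum_{i=1}^{n} [v_i = 1]\) (symbols introduced here for readability only), it reads:

\[
\mathrm{Sokal}(u, v) = \frac{a}{2p + 2q - 3a}
\]

Since \(p + q - 2a\) is the number of differing positions, the denominator equals the shared one-bits plus twice the differing positions.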
Sokal-Michener
The Sokal-Michener distance is twice the number of positions at which the two bit vectors \(u\) and \(v\) differ, divided by the number of positions at which they agree plus twice the number of positions at which they differ. The implementation is given in SciPy.
Tanimoto
The Tanimoto similarity is the number of one-bits shared by the two bit vectors \(u\) and \(v\), divided by the sum of the two vectors' one-bit counts minus the number of shared one-bits. The implementation is given in RDKit.
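To make the Tanimoto definition concrete, here is a minimal pure-Python sketch for bit vectors given as lists of 0/1 values. It is not the RDKit implementation that DataSAIL uses; the name `tanimoto_similarity` is hypothetical, and raising an error for two all-zero vectors is a choice made for this sketch only.

```python
def tanimoto_similarity(u, v):
    """Shared one-bits divided by (one-bits in u + one-bits in v - shared one-bits)."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    common = sum(1 for a, b in zip(u, v) if a and b)
    on_u = sum(1 for a in u if a)
    on_v = sum(1 for b in v if b)
    denom = on_u + on_v - common
    if denom == 0:
        # Assumption of this sketch: leave the all-zero case undefined
        raise ValueError("Tanimoto similarity is undefined for two all-zero vectors")
    return common / denom


# 2 shared one-bits, 3 one-bits in each vector: 2 / (3 + 3 - 2)
t = tanimoto_similarity([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```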
Yule
The Yule distance is twice the product of the two mismatch counts (positions set only in \(u\) and positions set only in \(v\)), divided by the product of the two match counts (positions set in both vectors and positions set in neither) plus the product of the two mismatch counts. The implementation is given in SciPy.
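Unlike the other measures, the Yule distance multiplies counts instead of adding them. The sketch below follows the formula documented for SciPy's `yule`; it is a pure-Python illustration with the hypothetical name `yule_distance`, and returning 0.0 when the mismatch product is zero is a convention adopted here to avoid dividing by zero.

```python
def yule_distance(u, v):
    """Yule distance: 2 * c_tf * c_ft / (c_tt * c_ff + c_tf * c_ft)."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    c_tt = sum(1 for a, b in zip(u, v) if a and b)          # set in both
    c_tf = sum(1 for a, b in zip(u, v) if a and not b)      # set only in u
    c_ft = sum(1 for a, b in zip(u, v) if not a and b)      # set only in v
    c_ff = sum(1 for a, b in zip(u, v) if not a and not b)  # set in neither
    mismatch_product = c_tf * c_ft
    if mismatch_product == 0:
        return 0.0  # convention of this sketch for the degenerate case
    return 2.0 * mismatch_product / (c_tt * c_ff + mismatch_product)


# c_tt = 2, c_tf = 1, c_ft = 1, c_ff = 2, so 2 * 1 / (4 + 1)
y = yule_distance([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```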