Skip to content

Symmetric Similarity Searches

Run a similarity search to find compounds that are structurally similar to a query molecule.

Similarity Metrics

Possible metrics that can be used are (tanimoto is default):

  • tanimoto (Jaccard): Measures the ratio of intersection to union. \(T(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{c}{a + b - c}\)
  • dice (Dice-Sørensen): Emphasizes the intersection more than Tanimoto. \(D(A,B) = \frac{2|A \cap B|}{|A| + |B|} = \frac{2c}{a + b}\)
  • cosine (Otsuka–Ochiai): Also focuses on shared features but is less affected by the total number of features. \(C(A,B) = \frac{|A \cap B|}{\sqrt{|A| \cdot |B|}} = \frac{c}{\sqrt{a \cdot b}}\)

Where:

  • \(a\) is the number of bits set to 1 in fingerprint A
  • \(b\) is the number of bits set to 1 in fingerprint B
  • \(c\) is the number of bits set to 1 in both fingerprints

Use the FPSim2Engine.similarity function to run symmetric similarity searches:

from FPSim2 import FPSim2Engine

fp_filename = 'chembl_35_v0.6.0.h5'
fpe = FPSim2Engine(fp_filename)

query = 'CC(=O)Oc1ccccc1C(=O)O'
results = fpe.similarity(query, threshold=0.7, metric='tanimoto', n_workers=1)

For on-disk similarity searches (slower but doesn't require loading the entire fingerprint file into memory), use the FPSim2Engine.on_disk_similarity function. This is useful for databases larger than the available system memory:

from FPSim2 import FPSim2Engine

fp_filename = 'chembl_35_v0.6.0.h5'
fpe = FPSim2Engine(fp_filename, in_memory_fps=False)

query = 'CC(=O)Oc1ccccc1C(=O)O'
results = fpe.on_disk_similarity(query, threshold=0.7, metric='tanimoto', n_workers=1)

Parallel Processing

The n_workers parameter can be used to split a single query into multiple threads to speed up the search. This is especially useful when searching large datasets.

Top K Searches

Retrieve the top K most similar hits using the FPSim2Engine.top_k function.

from FPSim2 import FPSim2Engine

fp_filename = 'chembl_35_v0.6.0.h5'
fpe = FPSim2Engine(fp_filename)

query = 'CC(=O)Oc1ccccc1C(=O)O'
results = fpe.top_k(query, k=100, threshold=0.7, metric='tanimoto', n_workers=1)

For on-disk top K searches, use the FPSim2Engine.on_disk_top_k function.

from FPSim2 import FPSim2Engine

fp_filename = 'chembl_35_v0.6.0.h5'
fpe = FPSim2Engine(fp_filename, in_memory_fps=False)

query = 'CC(=O)Oc1ccccc1C(=O)O'
results = fpe.on_disk_top_k(query, k=100, threshold=0.7, metric='tanimoto', n_workers=1)