Symmetric Similarity Searches
Run a similarity search to find compounds that are structurally similar to a query molecule.
Similarity Metrics
The following metrics are available (tanimoto is the default):
- tanimoto (Jaccard): Measures the ratio of intersection to union. \(T(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{c}{a + b - c}\)
- dice (Dice-Sørensen): Counts the intersection twice, so its score is never lower than Tanimoto for the same pair. \(D(A,B) = \frac{2|A \cap B|}{|A| + |B|} = \frac{2c}{a + b}\)
- cosine (Otsuka–Ochiai): Normalizes the intersection by the geometric mean of the two bit counts, which makes it less sensitive to a size difference between the fingerprints. \(C(A,B) = \frac{|A \cap B|}{\sqrt{|A| \cdot |B|}} = \frac{c}{\sqrt{a \cdot b}}\)
Where:
- \(a\) is the number of bits set to 1 in fingerprint A
- \(b\) is the number of bits set to 1 in fingerprint B
- \(c\) is the number of bits set to 1 in both fingerprints
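As a worked illustration of the three formulas, the sketch below computes them for two toy fingerprints represented as Python sets of on-bit indices. The function name similarity_coefficients and the set representation are assumptions made for this example; they are not part of FPSim2, which works on packed binary fingerprints.

```python
import math


def similarity_coefficients(fp_a, fp_b):
    """Return Tanimoto, Dice, and cosine for two sets of on-bit indices."""
    a = len(fp_a)          # bits set in fingerprint A
    b = len(fp_b)          # bits set in fingerprint B
    c = len(fp_a & fp_b)   # bits set in both fingerprints
    if a == 0 or b == 0:
        # Convention chosen for this sketch: an empty fingerprint scores 0.
        return {"tanimoto": 0.0, "dice": 0.0, "cosine": 0.0}
    return {
        "tanimoto": c / (a + b - c),
        "dice": 2 * c / (a + b),
        "cosine": c / math.sqrt(a * b),
    }


# Toy fingerprints: a = 4, b = 3, c = 2
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}
coeffs = similarity_coefficients(fp_a, fp_b)
print(coeffs)
```

For this pair, Tanimoto is 2/5, Dice is 4/7, and cosine is 2/√12, which shows the general ordering Tanimoto ≤ Dice ≤ cosine. A threshold tuned for one metric is therefore not directly transferable to another.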
Use the FPSim2Engine.similarity function to run symmetric similarity searches.
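from FPSim2 import FPSim2Engine
# Load the fingerprint file into memory
fp_filename = 'chembl_35_v0.6.0.h5'
fpe = FPSim2Engine(fp_filename)
# Query molecule as a SMILES string (aspirin)
query = 'CC(=O)Oc1ccccc1C(=O)O'
# Find compounds whose Tanimoto similarity meets the 0.7 threshold
results = fpe.similarity(query, threshold=0.7, metric='tanimoto', n_workers=1)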
Use the FPSim2Engine.on_disk_similarity function to run similarity searches on disk. This method is much slower than the in-memory search, so use it only when the dataset does not fit in available RAM.
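from FPSim2 import FPSim2Engine
# Open the fingerprint file without loading the fingerprints into memory
fp_filename = 'chembl_35_v0.6.0.h5'
fpe = FPSim2Engine(fp_filename, in_memory_fps=False)
# Query molecule as a SMILES string (aspirin)
query = 'CC(=O)Oc1ccccc1C(=O)O'
# Same search as above, reading the fingerprints from disk
results = fpe.on_disk_similarity(query, threshold=0.7, metric='tanimoto')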
Parallel Processing
The n_workers parameter splits a single query across multiple threads to speed up the search. This is especially useful when searching large datasets.
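The idea of splitting one query across workers can be sketched in plain Python as follows. This is only a conceptual illustration under stated assumptions: the helpers tanimoto, search_chunk, and parallel_search and the toy database are hypothetical, and the code does not reproduce how FPSim2 partitions its data internally. Pure-Python threads also give no real speedup for CPU-bound work; the sketch only shows that partitioning the database does not change the result.

```python
from concurrent.futures import ThreadPoolExecutor


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for two sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def search_chunk(query, chunk, threshold):
    """Score one slice of the database and keep hits at or above the threshold."""
    hits = []
    for mol_id, fp in chunk:
        coeff = tanimoto(query, fp)
        if coeff >= threshold:
            hits.append((mol_id, coeff))
    return hits


def parallel_search(query, database, threshold, n_workers=1):
    """Split the database into n_workers slices and search them concurrently."""
    chunk_size = max(1, -(-len(database) // n_workers))  # ceiling division
    chunks = [database[i:i + chunk_size]
              for i in range(0, len(database), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = pool.map(
            lambda chunk: search_chunk(query, chunk, threshold), chunks)
        hits = [hit for part in partial_results for hit in part]
    # Merge the per-chunk hits and rank them by decreasing similarity
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy database of (mol_id, on-bit set) pairs
database = [
    (101, {1, 2, 3, 4}),
    (102, {1, 2, 3}),
    (103, {1, 9}),
    (104, {2, 3, 4, 5}),
    (105, {7, 8}),
]
query = {1, 2, 3, 4}
single = parallel_search(query, database, threshold=0.5, n_workers=1)
multi = parallel_search(query, database, threshold=0.5, n_workers=3)
print(single == multi)
```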
Top K Searches
Retrieve the top K most similar hits using the FPSim2Engine.top_k function.
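from FPSim2 import FPSim2Engine
# Load the fingerprint file into memory
fp_filename = 'chembl_35_v0.6.0.h5'
fpe = FPSim2Engine(fp_filename)
# Query molecule as a SMILES string (aspirin)
query = 'CC(=O)Oc1ccccc1C(=O)O'
# Keep at most the 100 best hits that meet the 0.7 threshold
results = fpe.top_k(query, k=100, threshold=0.7, metric='tanimoto', n_workers=1)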
Use the FPSim2Engine.on_disk_top_k function to run top-K searches on disk. This method is much slower than the in-memory search, so use it only when the dataset does not fit in available RAM.
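from FPSim2 import FPSim2Engine
# Open the fingerprint file without loading the fingerprints into memory
fp_filename = 'chembl_35_v0.6.0.h5'
fpe = FPSim2Engine(fp_filename, in_memory_fps=False)
# Query molecule as a SMILES string (aspirin)
query = 'CC(=O)Oc1ccccc1C(=O)O'
# Same top-K search as above, reading the fingerprints from disk
results = fpe.on_disk_top_k(query, k=100, threshold=0.7, metric='tanimoto')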
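To make the interplay of k and threshold concrete, here is a minimal sketch of a top-K selection over already scored hits. The helper top_k_hits and the scored list are hypothetical and only illustrate the selection logic as assumed here (filter by threshold, then keep the k best); they say nothing about how FPSim2 implements it.

```python
import heapq


def top_k_hits(scored, k, threshold):
    """Keep at most k hits with a coefficient at or above the threshold."""
    passing = ((mol_id, coeff) for mol_id, coeff in scored
               if coeff >= threshold)
    # nlargest returns the hits sorted by decreasing coefficient
    return heapq.nlargest(k, passing, key=lambda hit: hit[1])


# Hypothetical (mol_id, coefficient) pairs from a similarity search
scored = [(101, 1.0), (102, 0.75), (103, 0.2), (104, 0.6), (105, 0.9)]
best_two = top_k_hits(scored, k=2, threshold=0.7)
all_passing = top_k_hits(scored, k=10, threshold=0.7)
print(best_two)
print(all_passing)
```

When fewer than k compounds pass the threshold, as in the second call, the result simply contains fewer than k hits.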