How to Highlight Molecular Substructures: Celebrating Commonalities and Differences

Introduction

Today, we will dive into molecular substructure highlighting with RDKit – a powerful technique that illuminates the hidden intricacies within molecular compounds.

In this tutorial, I will be focusing on two things:

Highlighting a common substructure among molecules.
Highlighting differences among molecules, given a common substructure.

If you are interested in more Cheminformatics related tutorials, check my other blog posts here.

Section 1: Understanding the Power of Structure Highlighting

Before we jump into the technical aspects, let’s take a moment to appreciate why structure highlighting is a vital tool in the world of molecular visualization. By highlighting specific parts of a molecule, we can draw attention to key features, bond patterns, and functional groups that are important to the biological endpoint of interest. This technique not only enhances the visual aesthetics but also aids in effective communication of complex molecular structures.

Section 2: Setting Up the Environment

To get started, make sure you have Python, RDKit, and Pandas installed. The versions I am using are:

Python 3.7.13
Pandas 1.1.5
RDKit 2020.09.1

Section 3: The Code

The jupyter notebook containing the tutorial is available in github.

First, import the libraries.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import PandasTools
from rdkit.Chem.Scaffolds.MurckoScaffold import GetScaffoldForMol
import pandas as pd

Then, load a sample file containing Smiles. In this tutorial, I am using the data from one of my PhD projects (MD-QSAR with Imatinib derivatives and pKi against BCR-ABL kinase protein.)

df = pd.read_csv("smiles_pKi.csv")
df.head()

Now, let’s add RDKit molecule objects to the pandas dataframe.

PandasTools.AddMoleculeColumnToFrame(df,'Smiles')

During the tutorial, we will focus on extracting common substructures. Specifically, I will demonstrate the extraction of Murcko scaffolds from the dataset and work with the largest subset that shares the same Murcko scaffold. In your situation, you have the flexibility to manually specify either a core structure or a common substructure of interest within the molecules you are currently analyzing.

Let’s extract the Murcko skeletons.

MurckoG = []
for each_mol in df["ROMol"]:
    try:
        MurckoG.append(Chem.MolToSmiles(GetScaffoldForMol(each_mol)))
    except:
        MurckoG.append(Chem.MolToSmiles(each_mol))
df["Murcko"] = MurckoG
df["Murcko"].value_counts()

Now, let’s shift our attention to the most prevalent scaffold skeleton, which occurs 60 times. To do this, we will create a subset of our dataframe that contains only instances of this particular scaffold. Subsequently, we will visualize this common scaffold within the subset for better understanding and analysis.

cur_scaf_smiles = 'O=C(Nc1cc2ccc(-c3ccccc3)cc2cn1)C1CC1'
cur_scaf_df = df[df["Murcko"]==cur_scaf_smiles]
mol_scaffold_of_interest = Chem.MolFromSmiles(cur_scaf_smiles)

Okay, we are now prepared to visualize this scaffold within the subset. To begin, let’s generate a customized legend that incorporates the IDs alongside their corresponding pKi values. This will provide us with a comprehensive and informative visualization.

custom_legend = [' : '.join(tup) for tup in list(zip(cur_scaf_df['ID'], cur_scaf_df['pKi'].astype(str)))]

Section 3.1. How to Highlight a Common Substructure among Molecules.

To highlight the core structure among the molecule series, we proceed as follows:

For each molecule, we identify the substructure match of the current scaffold and store the resulting matches.
Next, we generate a series of molecules and highlight the atoms matching the scaffold substructure within each molecule .

highlight_scaffold = [mMol.GetSubstructMatch(Chem.MolFromSmiles(cur_scaf_smiles)) for mMol in cur_scaf_df['ROMol']]
Draw.MolsToGridImage(cur_scaf_df['ROMol'], 
                     legends = custom_legend,
                     highlightAtomLists = highlight_scaffold, 
                     molsPerRow=4,
                     subImgSize=(300,300), useSVG=False)

Series of molecules with the same Murcko scaffold highlighted

Section 3.2. How to Highlight Differences among Molecules.

To emphasize the disparities within the series of molecules, we use the following approach:

For each molecule, we identify and retain the substructure mismatches with the current scaffold. To facilitate this, I have developed a custom function called get_atoms_to_highlight, which enables the seamless visualization of atom differences between two structures.
In the next step, we generate a series of molecules and highlight the atoms corresponding to the variations within each molecule. This allows for clearer visualization and analysis of the contrasting features present in the molecule series.

The function get_atoms_to_highlight takes in two molecules: a scaffold of interest and a compound of interest. Its purpose is to identify the atoms within the compound of interest that differ from the scaffold substructure. The function achieves this by first obtaining a complete list of atoms in the compound, then grouping the atoms that match the scaffold substructure. The final step involves subtracting the matched atoms from the complete atom list, resulting in a list of atoms to be highlighted. This list is valuable for conducting further analysis and visualization of the compound.

def get_atoms_to_highlight(mol_scaffold_of_interest, mol_compound_of_interest):
    atom_list = [i for i in range(mol_compound_of_interest.GetNumAtoms())]
    grouped_common_atoms = mol_compound_of_interest.GetSubstructMatches(mol_scaffold_of_interest)
    flattened_grouped_common_atoms = list(sum(grouped_common_atoms, ()))
    atoms_to_highlight = list(set(atom_list) - set(flattened_grouped_common_atoms))
    return atoms_to_highlight

In the following code snippet, we iterate through each molecule in the cur_scaf_df['ROMol'] column, and apply the get_atoms_to_highlight function to identify the atoms that differ from our scaffold of interest. These differing atoms are then stored within each molecule’s __sssAtoms attribute, and their corresponding lists are collected in hl_atom_list.

Then, we visualize a series of molecules and highlight the atoms corresponding to the variations within each molecule.

hl_atom_list = []
for each_mol in cur_scaf_df['ROMol']:
    highlight = get_atoms_to_highlight(mol_scaffold_of_interest, each_mol)
    each_mol.__sssAtoms = highlight
    hl_atom_list.append(highlight)

Draw.MolsToGridImage(cur_scaf_df['ROMol'].values, 
                     legends=custom_legend, 
                     highlightAtomLists = hl_atom_list, 
                     molsPerRow=5,
                     subImgSize=(300,300), 
                     useSVG=False)

Series of molecules with the differences in the structures highlighted

Conclusion:

In this python tutorial about molecular substructure highlighting with RDKit, I covered two topics:

how to highlight a common substructure among molecules by identifying substructure matches and generating a series of molecules with the highlighted scaffold.
how to highlight differences among molecules by employing a custom function that identifies and highlights the varying atoms between a scaffold of interest and a compound of interest.

Thanks a lot for reading this blogpost tutorial! I hope you find this post useful. As always, I welcome constructive feedback and suggestions.