This is the last of the five-part series tutorial of the blog post, Computing Molecular Descriptors – Intro, in the context of drug discovery. The goal of this post to explain the python code on creating new descriptors such as MRC (developed in MacrolactoneDB study) and using Mordred descriptors.
What are MRC descriptors?
MRC descriptors were developed to better characterize ring chemical structures such as macrolactones and macrolides (diverse natural products with important biological activities – e.g. antibiotics, antifungal, anticancer, etc.), to provide insight into their structure-activity relationships, properties, etc. and to boost predictive modeling when used in combination with other descriptors such as Mordred.
Mrc descriptors include information on frequency of ring sizes ranging from 13 to 99, smallest and largest ring sizes (≥12-membered), frequency of sugars and occurrence of esters within the core rings. An illustrative diagram of mrc descriptor along with an example structure of CHEMBL2221290 is shown in Fig. 1.
Zin PPK, Williams GJ, Ekins S (2020) Cheminformatics Analysis and Modeling with MacrolactoneDB. Scientific Reports 10:1–12 . doi: 10.1038/s41598-020-63192-4
I encourage you to read the MacrolactoneDB research article to learn more about macrolides and the details on the features included in MRC descriptors.
Implementing Python Code on Mordred_MRC_Descriptors
Please check out the previous posts: Computing Molecular Descriptors – Part 1, MACCS Fingerprints in Python – Part 2 to see more explanation on the formatting of the code and python class.
Make sure you have the following python libraries (itertools, pandas, RDKit, Mordred) installed in the working directory. I recommend using miniconda to install them since it is a very convenient package manager.
# To install pandas
conda install pandas
# To install mordred
conda install -c rdkit -c mordred-descriptor mordred
Here in this tutorial, I will explain the code function by function. The code for Macrocycle_Descriptors class that I have developed can be found below and here in the GitHub link as well.
import itertools
import pandas as pd
from rdkit import Chem
from mordred import Calculator, descriptors
from mordred.RingCount import RingCount
class Macrocycle_Descriptors:
def __init__(self, smiles):
self.mols = [Chem.MolFromSmiles(i) for i in smiles]
self.smiles = smiles
self.mordred = None
def compute_ringsize(self, mol):
'''
check for macrolides of RS 3 to 100, return a list of ring counts.
[RS3,RS4,.....,RS100]
[0,0,0,...,1,...,0]
'''
RS_3_100 = [i+3 for i in range(97)]
RS_count = []
for j in RS_3_100:
RS = RingCount(order=j)(mol)
RS_count.append(RS)
return RS_count
def macrolide_ring_info(self):
headers = ['n'+str(i+13)+'Ring' for i in range(87)]+['SmallestRS','LargestRS']
# up to nR12 is already with mordred, start with nR13 to nR100
ring_sizes = []
for i in range(len(self.mols)):
RS = self.compute_ringsize(self.mols[i]) # nR3 to nR100
RS_12_100 = RS[9:] # start with nR12 up to nR100
ring_indices = [i for i,x in enumerate(RS_12_100) if x!=0] # get index if item isn't equal to 0
if ring_indices:
# find 1, locate the last index
# largest_RS is based on RS 3 to 100.
# Add 3 (starting ring count) to get up to the actual ring size
smallest_RS = ring_indices[0]+12
largest_RS = ring_indices[-1]+12
RS_12_100.append(smallest_RS) # Smallest RS
RS_12_100.append(largest_RS) # Largest RS
else:
RS_12_100.extend(['',''])
ring_sizes.append(RS_12_100[1:]) # up to nR12 is already with mordred, start with nR13 to nR100
df = pd.DataFrame(ring_sizes, columns=headers)
return df
def sugar_count(self):
sugar_patterns = [
'[OX2;$([r5]1@C@C@C(O)@C1),$([r6]1@C@C@C(O)@C(O)@C1)]',
'[OX2;$([r5]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C1),$([r6]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C@C1)]',
'[OX2;$([r5]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C(O)@C1),$([r6]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C(O)@C(O)@C1)]',
'[OX2;$([r5]1@C(!@[OX2H1])@C@C@C1),$([r6]1@C(!@[OX2H1])@C@C@C@C1)]',
'[OX2;$([r5]1@[C@@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C1),$([r6]1@[C@@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C@C1)]',
'[OX2;$([r5]1@[C@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C1),$([r6]1@[C@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C@C1)]',
]
sugar_mols = [Chem.MolFromSmarts(i) for i in sugar_patterns]
sugar_counts = []
for i in self.mols:
matches_total = []
for s_mol in sugar_mols:
raw_matches = i.GetSubstructMatches(s_mol)
matches = list(sum(raw_matches, ()))
if matches not in matches_total and len(matches) !=0:
matches_total.append(matches)
sugar_indices = set((list(itertools.chain(*matches_total))))
count = len(sugar_indices)
sugar_counts.append(count)
df = pd.DataFrame(sugar_counts, columns=['nSugars'])
return df
def core_ester_count(self):
'''
Returns pandas frame containing the count of esters in core rings of >=12 membered macrocycles.
'''
ester_smarts = '[CX3](=[OX1])O@[r;!r3;!r4;!r5;!r6;!r7;!r8;!r9;!r10;!r11]'
core_ester = []
ester_mol = Chem.MolFromSmarts(ester_smarts)
for i in self.mols:
ester_count = len(i.GetSubstructMatches(ester_mol))
core_ester.append(ester_count)
df = pd.DataFrame(core_ester, columns=['core_ester'])
return df
def mordred_compute(self, name):
calc = Calculator(descriptors, ignore_3D=True)
df = calc.pandas(self.mols)
self.mordred = df
df.insert(loc=0, column='smiles', value=self.smiles)
df.to_csv(name[:-4]+'_mordred.csv', index=False)
def compute_mordred_macrocycle(self, name):
if not isinstance(self.mordred, pd.DataFrame):
self.mordred = self.mordred_compute()
ring_df = self.macrolide_ring_info()
sugar_df = self.sugar_count()
ester_df = self.core_ester_count()
# self.mrc = pd.concat([ring_df,sugar_df, ester_df], axis=1)
mordred_mrc = pd.concat([self.mordred, ring_df,sugar_df, ester_df], axis=1)
mordred_mrc.to_csv(name[:-4]+'_mordred_mrc.csv', index=False)
The above code snipped is a class stored as “Macrocycle_Descriptors.py” which you can load it in other python scripts and call its modules. The object-oriented code style has a similar format as in the previous posts in the five-part series tutorial.
How to Use Macrocycle_Descriptors.py
For example, let’s say you have a file containing smiles and you want to compute two separate CSV files containing Mordred descriptors and mordred_mrc 1(both Mordred and MRC) descriptors for them. Below is the code to do that. It follows pretty much the same code format as MACCS, ECFP6, and RDKit from the previous blog tutorials.
import time, os
import pandas as pd
from molvs import standardize_smiles
from rdkit import Chem
from Macrocycle_Descriptors import *
def convert_time(second):
day = second/86400
hour = (day - int(day))*24
minute = (hour - int(hour))*60
second = round((minute - int(minute))*60,4)
return(str(int(day)) + ' DAYS: '+ str(int(hour)) + ' HOURS: '+ str(int(minute)) + ' MINUTES: ' + str(second) + ' SECONDS')
def create_directory():
while not os.path.exists('desc'):
os.mkdir('desc')
def main():
filename = 'data/macrolides_smiles.csv'
create_directory()
df = pd.read_csv(filename)
smiles = [standardize_smiles(i) for i in df['smiles'].values]
start_time = time.time()
output_filename = 'desc' + filename[4:]
## Create Macrocycle_Descriptors object.
mrc_descriptor = Macrocycle_Descriptors(smiles)
## Compute mordred descriptors and export CSV file.
mrc_descriptor.mordred_compute(output_filename)
## Compute mordred_mrc descriptors and export CSV file.
mrc_descriptor.compute_mordred_macrocycle(output_filename)
duration = convert_time(time.time()-start_time)
print(duration)
if __name__ == '__main__':
main()
Explaining the Python Code for Macrocycle_Descriptors Class
class Macrocycle_Descriptors:
def __init__(self, smiles):
self.mols = [Chem.MolFromSmiles(i) for i in smiles]
self.smiles = smiles
self.mordred = None
This is the creation of Macrocycle_Descriptors class and the init function. Whenever we create this object, we will need to provide a list of SMILES. We will keep three class attributes: self.mols (let’s convert the list of SMILES to MOLs as soon as we create this Macrocycle_Descriptors object; self.mols will be used again and again in other methods in this class.), self.smiles (we can append it as a column when we export CSV files), and self.mordred (let’s set it to None; I will explain later).
def compute_ringsize(self, mol):
'''
check for macrolides of RS 3 to 100, return a list of ring counts.
[RS3,RS4,.....,RS100]
[0,0,0,...,1,...,0]
'''
RS_3_100 = [i+3 for i in range(97)]
RS_count = []
for j in RS_3_100:
RS = RingCount(order=j)(mol)
RS_count.append(RS)
return RS_count
compute_ringsize() method will calculate the occurrence of ring structures detected in a chemical. It takes a single mol, iterates for ring sizes of 3 to 99 and counts the frequencies of each ring size using RingCount() function from Mordred. For example, if you look at the structure in Figure 1, this function will detect that it has four six-membered rings, and two thirty-six membered rings. It will return a list of numbers wherein each number indicates the occurrences of associated ring sizes and their indexes indicate the actual ring sizes (see Figure 2). Note that the list starts recording from ring sizes of three until ninety-nine.
def macrolide_ring_info(self):
headers = ['n'+str(i+13)+'Ring' for i in range(87)]+['SmallestRS','LargestRS']
# up to nR12 is already with mordred, start with nR13 to nR99
ring_sizes = []
for i in range(len(self.mols)):
RS = self.compute_ringsize(self.mols[i]) # nR3 to nR99
RS_12_99 = RS[9:] # start with nR12 up to nR99
ring_indices = [i for i,x in enumerate(RS_12_99) if x!=0] # get index if item isn't equal to 0
if ring_indices:
# find 1, locate the last index
# largest_RS is based on RS 3 to 99.
# Add 3 (starting ring count) to get up to the actual ring size
smallest_RS = ring_indices[0]+12
largest_RS = ring_indices[-1]+12
RS_12_99.append(smallest_RS) # Smallest RS
RS_12_99.append(largest_RS) # Largest RS
else:
RS_12_99.extend(['',''])
ring_sizes.append(RS_12_99[1:]) # up to nR12 is already with mordred, start with nR13 to nR99
df = pd.DataFrame(ring_sizes, columns=headers)
return df
macrolide_ring_info() is a class method that ties together the ring-size related information for self.mols. It computes ring sizes from 13 (Mordred already covers up to ring size of 12) to 99 by using the previous class method, and also detects the smallest and largest core ring sizes (≥12-membered).
As you can see in the function, it iterates for self.mols. For each mol, it applies compute_ringsize() and subsets the list so that we are looking at mainly the ring size information from 12 to 99-membered rings to scope out the maximal and minimal value of the core ring sizes (≥12-membered).
Note that we are dealing with a list containing the frequencies whose indexes (aka ring sizes) are in ascending order, and many will likely be zeros (meaning there won’t be tens of different-sized rings). So, we extract the indexes of non-zeros by the following line:
ring_indices = [i for i,x in enumerate(RS_12_99) if x!=0]
Just to give an example:
RS_12_99 = [0,1,5,0,1,0]
ring_indices = [i for i,x in enumerate(RS_12_99) if x!=0]
print(ring_indices)
ring_indices from the above will return [1,2,4], which are the indexes of non-zeros in the variable RS_12_99.
We then take the first and last indexes from ring_indices and add 12 to account for the actual ring sizes; the first will be the ring size of the smallest core ring and the last will be the ring size of the largest core ring.
There’s a bit of math involved to account for proper ring sizes, match indexes, etc. At the end of the macrolide_ring_info() function, we put together the already computed RS from 13 to 99, smallest RS, largest RS for all the mols in self.mols, convert them into a pandas dataframe, and return it. The dataframe would look something like in Table 1 below.
Macrolides | nR13 | nR14 | .. | n99Ring | SmallestRS | LargestRS |
id_1 | 1 | 0 | 0 | 0 | 13 | 13 |
id_2 | 0 | 1 | 0 | 0 | 14 | 14 |
id_3 | 0 | 0 | 0 | 0 | 15 | 15 |
id_4 | 0 | 0 | 0 | 0 | 15 | 15 |
id_5 | 0 | 0 | 0 | 0 | 15 | 15 |
id_6 | 0 | 0 | 0 | 0 | 15 | 15 |
id_7 | 0 | 0 | 1 | 0 | 15 | 15 |
def sugar_count(self):
sugar_patterns = [
'[OX2;$([r5]1@C@C@C(O)@C1),$([r6]1@C@C@C(O)@C(O)@C1)]',
'[OX2;$([r5]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C1),$([r6]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C@C1)]',
'[OX2;$([r5]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C(O)@C1),$([r6]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C(O)@C(O)@C1)]',
'[OX2;$([r5]1@C(!@[OX2H1])@C@C@C1),$([r6]1@C(!@[OX2H1])@C@C@C@C1)]',
'[OX2;$([r5]1@[C@@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C1),$([r6]1@[C@@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C@C1)]',
'[OX2;$([r5]1@[C@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C1),$([r6]1@[C@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C@C1)]',
]
sugar_mols = [Chem.MolFromSmarts(i) for i in sugar_patterns]
sugar_counts = []
for i in self.mols:
matches_total = []
for s_mol in sugar_mols:
raw_matches = i.GetSubstructMatches(s_mol)
matches = list(sum(raw_matches, ()))
if matches not in matches_total and len(matches) !=0:
matches_total.append(matches)
sugar_indices = set((list(itertools.chain(*matches_total))))
count = len(sugar_indices)
sugar_counts.append(count)
df = pd.DataFrame(sugar_counts, columns=['nSugars'])
return df
sugar_count() method counts sugars in self.mols and returns a pandas dataframe containing a single column called ‘nSugars’. In this function, I used six sugar patterns (SMARTS) from the CDK website, converted them to RDKit Mols, and performed a substructure search against self.mols. For each compound in self.mols, we apply a substructure match against six sugar patterns and record the unique atom indices. In other words, the number of unique atom indices represents the number of sugars present in the structure. We then collect sugar counts for each compound in self.mols in sugar_counts, use it to create a new pandas dataframe containing a single column called “nSugars”, and return that pandas dataframe at the end of the function.
def core_ester_count(self):
'''
Returns pandas frame containing the count of esters in core rings of >=12 membered macrocycles.
'''
ester_smarts = '[CX3](=[OX1])O@[r;!r3;!r4;!r5;!r6;!r7;!r8;!r9;!r10;!r11]'
core_ester = []
ester_mol = Chem.MolFromSmarts(ester_smarts)
for i in self.mols:
ester_count = len(i.GetSubstructMatches(ester_mol))
core_ester.append(ester_count)
df = pd.DataFrame(core_ester, columns=['core_ester'])
return df
core_ester_count() method counts ester functional groups within the core ring structures which are at least 12-membered. In order to do that, we will apply a substructure match against the compounds of interest (self.mols) using GetSubstructMatches() but first, we will need to know the SMARTS for the substructure of interest.
The SMARTs for the ester function group within ≥12-membered ring structures is “[CX3](=[OX1])O@[r;!r3;!r4;!r5;!r6;!r7;!r8;!r9;!r10;!r11]“. If you want to know how to write SMARTS for any substructure, you can read the documentation from daylight: SMARTS – A Language for Describing Molecular Patterns, SMARTS Examples, and SMARTS Tutorial.
When we perform the substructure search for core ester against all the compounds in self.mols, we then compute the total number of core esters in each compound and store them in a list ‘core_ester’. In other words, ‘core_ester’ is a list containing the occurrences of the esters within ≥12-membered ring structures for all the compounds in self.mols. We then convert it into a pandas dataframe containing a single column called ‘core_ester’ and return it at the end of core_ester_count() method.
def mordred_compute(self, name):
calc = Calculator(descriptors, ignore_3D=True)
df = calc.pandas(self.mols)
self.mordred = df
df.insert(loc=0, column='smiles', value=self.smiles)
df.to_csv(name[:-4]+'_mordred.csv', index=False)
mordred_compute() takes a parameter called name which is the name of your desired output CSV file containing Mordred 2D descriptors. Here, we save the computed Mordred descriptors as local variable df and also as a class attribute self.mordred so that we can reuse it in the compute_mordred_macrocycle() method explained below. The function mordred_compute() exports a CSV file containing the computed Mordred descriptors for self.mols. The list of Mordred descriptors can be viewed here.
def compute_mordred_macrocycle(self, name):
if not isinstance(self.mordred, pd.DataFrame):
self.mordred = self.mordred_compute()
ring_df = self.macrolide_ring_info()
sugar_df = self.sugar_count()
ester_df = self.core_ester_count()
# self.mrc = pd.concat([ring_df,sugar_df, ester_df], axis=1)
mordred_mrc = pd.concat([self.mordred, ring_df,sugar_df, ester_df], axis=1)
mordred_mrc.to_csv(name[:-4]+'_mordred_mrc.csv', index=False)
compute_mordred_macrocycle() method puts together the previous class methods to create mordred_mrc descriptors. Similar to the mordred_compute() method, it takes a parameter called name for the output CSV file containing mordred_mrc descriptors. We first check if mordred descriptors are already computed or not. If they are, we should have an instance called self.mordred. If not, we call the previous class method mordred_compute() so that we can have computed Mordred descriptors stored as self.mordred. In case we might run into any bug or issue in that part, we initially set self.mordred to None in the __init__() method.
Here, we called the class methods to compute ring information, frequency of sugars, and core esters. We then merge them all together along with self.mordred pandas dataframe and export a CSV file. In essence, the final CSV file from the compute_mordred_macrocycle() method contains mordred descriptors along with mrc descriptors.
Well, that’s all for this post and this is the last of the five-part series tutorials from Computing Molecular Descriptors – Intro series. Please let me know if you have any feedbacks, suggestions and if you find any mistakes. Thank you!