This is part-4 from the five-part series tutorial of the blog post, Computing Molecular Descriptors – Intro, in the context of drug discovery. The goal of this post to explain the python code on computing 2D RDKit descriptors and exporting them as CSV files.
First, install the required library packages using miniconda.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors
The code for RDKit_2D class that I have developed can be found below and here in the GitHub link as well.
Also, please check out the previous posts: Computing Molecular Descriptors – Part 1, and MACCS Fingerprints in Python – Part 2 to see more explanation on the formatting of the code and python class.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors
class RDKit_2D:
def __init__(self, smiles):
self.mols = [Chem.MolFromSmiles(i) for i in smiles]
self.smiles = smiles
def compute_2Drdkit(self, name):
rdkit_2d_desc = []
calc = MoleculeDescriptors.MolecularDescriptorCalculator([x[0] for x in Descriptors._descList])
header = calc.GetDescriptorNames()
for i in range(len(self.mols)):
ds = calc.CalcDescriptors(self.mols[i])
rdkit_2d_desc.append(ds)
df = pd.DataFrame(rdkit_2d_desc,columns=header)
df.insert(loc=0, column='smiles', value=self.smiles)
df.to_csv(name[:-4]+'_RDKit_2D.csv', index=False)
The above code snipped is a class stored as “RDKit_2D.py” which you can load from other python scripts and call its modules.
In RDKit_2D class, I implemented “compute_2Drdkit()” as the only and major function to handle the calculation of RDKit descriptors, prepare them and export them as a CSV file.
Below is the full list of the RDKit descriptors that we will compute. You can check it by running the following line of code.
from rdkit import Chem # make sure to import it if you haven't done so
from rdkit.Chem import Descriptors # make sure to import it if you haven't done so
descriptors_list = [x[0] for x in Descriptors._descList]
print(descriptors_list)
It will print out the following list containing the currently available descriptors.
[‘MaxEStateIndex’, ‘MinEStateIndex’, ‘MaxAbsEStateIndex’, ‘MinAbsEStateIndex’, ‘qed’, ‘MolWt’, ‘HeavyAtomMolWt’, ‘ExactMolWt’, ‘NumValenceElectrons’, ‘NumRadicalElectrons’, ‘MaxPartialCharge’, ‘MinPartialCharge’, ‘MaxAbsPartialCharge’, ‘MinAbsPartialCharge’, ‘FpDensityMorgan1’, ‘FpDensityMorgan2’, ‘FpDensityMorgan3’, ‘BalabanJ’, ‘BertzCT’, ‘Chi0’, ‘Chi0n’, ‘Chi0v’, ‘Chi1’, ‘Chi1n’, ‘Chi1v’, ‘Chi2n’, ‘Chi2v’, ‘Chi3n’, ‘Chi3v’, ‘Chi4n’, ‘Chi4v’, ‘HallKierAlpha’, ‘Ipc’, ‘Kappa1’, ‘Kappa2’, ‘Kappa3’, ‘LabuteASA’, ‘PEOE_VSA1’, ‘PEOE_VSA10’, ‘PEOE_VSA11’, ‘PEOE_VSA12’, ‘PEOE_VSA13’, ‘PEOE_VSA14’, ‘PEOE_VSA2’, ‘PEOE_VSA3’, ‘PEOE_VSA4’, ‘PEOE_VSA5’, ‘PEOE_VSA6’, ‘PEOE_VSA7’, ‘PEOE_VSA8’, ‘PEOE_VSA9’, ‘SMR_VSA1’, ‘SMR_VSA10’, ‘SMR_VSA2’, ‘SMR_VSA3’, ‘SMR_VSA4’, ‘SMR_VSA5’, ‘SMR_VSA6’, ‘SMR_VSA7’, ‘SMR_VSA8’, ‘SMR_VSA9’, ‘SlogP_VSA1’, ‘SlogP_VSA10’, ‘SlogP_VSA11’, ‘SlogP_VSA12’, ‘SlogP_VSA2’, ‘SlogP_VSA3’, ‘SlogP_VSA4’, ‘SlogP_VSA5’, ‘SlogP_VSA6’, ‘SlogP_VSA7’, ‘SlogP_VSA8’, ‘SlogP_VSA9’, ‘TPSA’, ‘EState_VSA1’, ‘EState_VSA10’, ‘EState_VSA11’, ‘EState_VSA2’, ‘EState_VSA3’, ‘EState_VSA4’, ‘EState_VSA5’, ‘EState_VSA6’, ‘EState_VSA7’, ‘EState_VSA8’, ‘EState_VSA9’, ‘VSA_EState1’, ‘VSA_EState10’, ‘VSA_EState2’, ‘VSA_EState3’, ‘VSA_EState4’, ‘VSA_EState5’, ‘VSA_EState6’, ‘VSA_EState7’, ‘VSA_EState8’, ‘VSA_EState9’, ‘FractionCSP3’, ‘HeavyAtomCount’, ‘NHOHCount’, ‘NOCount’, ‘NumAliphaticCarbocycles’, ‘NumAliphaticHeterocycles’, ‘NumAliphaticRings’, ‘NumAromaticCarbocycles’, ‘NumAromaticHeterocycles’, ‘NumAromaticRings’, ‘NumHAcceptors’, ‘NumHDonors’, ‘NumHeteroatoms’, ‘NumRotatableBonds’, ‘NumSaturatedCarbocycles’, ‘NumSaturatedHeterocycles’, ‘NumSaturatedRings’, ‘RingCount’, ‘MolLogP’, ‘MolMR’]
As you can see, there are altogether 115 RDKit descriptors as of now. We will compute all of them for each given compound.
We can directly use the aforementioned list as the header for the output molecular descriptors. Or we can also use the following line of code to achieve the same thing.
header = calc.GetDescriptorNames()
MoleculeDescriptors.MolecularDescriptorCalculator() is an abstract base class for descriptor calculators that you will need to call for whatever molecular descriptor feature that you want. Since we want to compute all the descriptors, we will call this for the entire list of available descriptors. So, below is the single line of code to prepare that.
calc = MoleculeDescriptors.MolecularDescriptorCalculator([x[0] for x in Descriptors._descList])
If you want to know the type of calc, you can check it by calling type(calc). The output is here:
<class 'rdkit.ML.Descriptors.MoleculeDescriptors.MolecularDescriptorCalculator'>
Once we have set up the calc variable, we can then compute the descriptors for a mol of interest by executing the code below:
mol = Chem.MolFromSmiles('CC1=CC(=C(C=C1NC(=O)C2=C(C(=CC(=C2)I)I)O)Cl)C(C#N)C3=CC=C(C=C3)Cl')
ds = calc.CalcDescriptors(mol)
print(ds)
Below is the output of the variable ds, which contains all the computed molecular properties in the order of the descriptor list that I have mentioned before: [x[0] for x in Descriptors._descList].
(12.745535497884882, -0.5698675869184535, 12.745535497884882, 0.06836196299558472, 0.29235676951522827, 663.0799999999998, 648.9679999999996, 661.852179048, 152, 0, 0.25900360274825185, -0.5060008092821509, 0.5060008092821509, 0.25900360274825185, 1.1333333333333333, 1.8, 2.433333333333333, 1.9402256061221343, 1175.2355472715938, 21.999271042140983, 15.510685796207323, 21.337540270596875, 14.20561635580521, 8.654060372339206, 11.567487609533984, 6.540209173088069, 9.791667981874944, 4.508130316967467, 6.532560560468579, 2.9673141998441293, 5.210790706092031, -1.5399999999999987, 3810408.214481713, 23.130028268670063, 9.726157452179187, 5.018674352279847, 211.0337664023786, 10.423315998847038, 5.749511833283905, 0.0, 0.0, 5.907179729351506, 0.0, 4.794537184071822, 0.0, 5.261891554738487, 0.0, 41.40098098584986, 99.12766827122579, 19.30283517323161, 21.120761121716058, 9.901064578912528, 79.97818703863621, 5.261891554738487, 0.0, 0.0, 12.841643245852017, 5.316788604006331, 87.97037368409433, 0.0, 11.818733146076179, 5.316788604006331, 5.687386274683562, 5.749511833283905, 68.38362103460115, 11.013707124192212, 0.0, 25.395214609352177, 32.9662491970212, 48.53093654769288, 10.045266627482652, 0.0, 0.0, 73.12, 11.8250857755129, 15.162956133651015, 0.0, 11.3129633249809, 24.866286664928605, 14.697085254459388, 0.0, 48.53093654769288, 29.51460782675868, 33.97688054386666, 23.20187978046503, 0.0, 12.431427488936194, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.4300532661852827, 56.195686505433166, 0.09090909090909091, 30, 2, 4, 0, 0, 0, 3, 0, 3, 3, 2, 8, 4, 0, 0, 0, 3, 7.124400000000003, 136.59099999999992)
Essentially, when you get to this point, you can put them together in a function to iterate for all the mols and organize them into a pandas dataframe. It requires that you know some basic python codes on how to use for loops, constructing a pandas frame, exporting a CSV file, etc.
As a result of putting these pieces together, you will get something like the following class module from RDKit_2D (note that they are indented that way because I copied pasted it directly from the python class):
def compute_2Drdkit(self, name):
rdkit_2d_desc = []
calc = MoleculeDescriptors.MolecularDescriptorCalculator([x[0] for x in Descriptors._descList])
header = calc.GetDescriptorNames()
for i in range(len(self.mols)):
ds = calc.CalcDescriptors(self.mols[i])
rdkit_2d_desc.append(ds)
df = pd.DataFrame(rdkit_2d_desc,columns=header)
df.insert(loc=0, column='smiles', value=self.smiles)
df.to_csv(name[:-4]+'_RDKit_2D.csv', index=False)
I know that some of us are more interested in actually importing and using this class so below is an example code to import the “RDKit_2D” class and use it to compute RDKit 2D descriptors. It follows pretty much the same code format as MACCS and ECFP from the previous tutorials.
import pandas as pd
from molvs import standardize_smiles
from RDKit_2D import *
def main():
filename = 'data/macrolides_smiles.csv' # path to your csv file
df = pd.read_csv(filename) # read the csv file as pandas data frame
smiles = [standardize_smiles(i) for i in df['smiles'].values]
## Compute RDKit_2D Fingerprints and export a csv file.
RDKit_descriptor = RDKit_2D(smiles) # create your RDKit_2D object and provide smiles
RDKit_descriptor.compute_2Drdkit(filename) # compute RDKit_2D and provide the name of your desired output file. you can use the same name as the input file because the RDKit_2D class will ensure to add "_RDKit_2D.csv" as part of the output file.
if __name__ == '__main__':
main()
That’s all for this post for now. Please stay tuned for the next post which is the last of this five-part series: Mordred_MRC_Descriptors in Python – Part 5.
MRC descriptors are new descriptors that were developed as part of the MacrolactoneDB project to better characterize macrolides and macrocycles. You will get a closer look at how to develop new descriptors like MRC in the next post. Cheers!