MACCS Fingerprints in Python – Part 2

This post is Part 2 of a five-part tutorial series that began with the previous post, Computing Molecular Descriptors – Intro, in the context of drug discovery. The goal of this post is to explain the Python code for computing MACCS fingerprints.

Please read this blog to familiarize yourself with MACCS. The 166 public keys (fragment definitions) of MACCS in the RDKit implementation can be found here. Essentially, it is a binary fingerprint (zeros and ones) that answers 166 fragment-related questions. If the explicitly defined fragment exists in the structure, the bit in that position is set to 1; if not, it is set to 0. In that sense, the position of each bit matters because it corresponds to a specific question or fragment.
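To make the idea of position-specific bits concrete, here is a minimal sketch that computes the MACCS keys for a single molecule with RDKit and lists which positions are switched on. The SMILES (acetaminophen, one of the example compounds used later in this post) is only an illustration, and the variable names are my own.

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys

# Acetaminophen, used here purely as an illustration
mol = Chem.MolFromSmiles('CC(=O)Nc1ccc(O)cc1')
fp = MACCSkeys.GenMACCSKeys(mol)

n_bits = fp.GetNumBits()        # total number of positions in the bit vector
on_bits = list(fp.GetOnBits())  # positions whose fragment question was answered "yes"

print(n_bits)
print(on_bits)
```

In RDKit the vector has 167 positions for 166 keys because position 0 is a placeholder that is never set, so key number k lives at index k.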

I will dive into the python tutorial part once you are familiar with MACCS. First, install the required library packages using miniconda and import them.

conda install -c rdkit rdkit
conda install pandas

The code for the MACCS class that I have developed is shown below and is also available here on GitHub.

import pandas as pd
from rdkit import Chem
from rdkit.Chem import MACCSkeys


class MACCS:
    def __init__(self, smiles):
        self.smiles = smiles
        # Convert every SMILES string into an RDKit mol object
        self.mols = [Chem.MolFromSmiles(i) for i in smiles]

    def compute_MACCS(self, name):
        MACCS_list = []
        # RDKit MACCS fingerprints have 167 bit positions (0 to 166)
        header = ['bit' + str(i) for i in range(167)]
        for i in range(len(self.mols)):
            # One fingerprint per mol, stored as a list of '0'/'1' characters
            ds = list(MACCSkeys.GenMACCSKeys(self.mols[i]).ToBitString())
            MACCS_list.append(ds)
        df = pd.DataFrame(MACCS_list,columns=header)
        df.insert(loc=0, column='smiles', value=self.smiles)
        # Drop the last four characters of name (e.g. '.csv') and add the suffix
        df.to_csv(name[:-4]+'_MACCS.csv', index=False)

The class for MACCS can be saved as an individual Python file named “MACCS.py”.

It is designed so that we can easily import it into another Python file. For example, let’s say you want to compute MACCS for a csv file containing SMILES. Below is the code you can use to do that (I have provided comments to explain the code). Note that the script also uses standardize_smiles from the molvs package, which has to be installed separately (for example, with pip install molvs). Oh, and make sure that “MACCS.py” is in the same working directory; otherwise, Python won’t be able to import it.

import pandas as pd
from molvs import standardize_smiles
from MACCS import *

def main():
    filename = 'data/macrolides_smiles.csv'  # path to your csv file
    df = pd.read_csv(filename)               # read the csv file as pandas data frame
    smiles = [standardize_smiles(i) for i in df['smiles'].values]  

    ## Compute MACCS Fingerprints and export file.
    maccs_descriptor = MACCS(smiles)         # create your MACCS object and provide smiles
    # Compute MACCS and write the output file. You can pass the input file name:
    # the MACCS class drops the last four characters (".csv") and appends "_MACCS.csv".
    maccs_descriptor.compute_MACCS(filename)

if __name__ == '__main__':
    main()

Okay, Now Let’s Break Down the Python Code!

Creating MACCS Class & __init__() Method

Let’s clarify this code section below.

class MACCS:
    def __init__(self, smiles):
        self.smiles = smiles

Here, we create a class called “MACCS” and initialize its attributes in the __init__() method, as shown in the code above. The __init__() method takes a required parameter called smiles, which is expected to be a list. In other words, whenever we create an object of this class, we need to provide a list of smiles.

We keep the list of smiles as the instance attribute self.smiles so that we can access it at any time from the other methods of the class.

For example:

from MACCS import *        # importing MACCS class

smiles = ['Nc1nc(NC2CC2)c2ncn(C3C=CC(CO)C3)c2n1',
'CC(=O)NCCCS(=O)(=O)O',
'CCCC(=O)Nc1ccc(OCC(O)CNC(C)C)c(C(C)=O)c1',
'CC(=O)Nc1ccc(O)cc1',
'CC(=O)Nc1nnc(S(N)(=O)=O)s1',
'CC(=O)NO'
]
maccs_descriptor = MACCS(smiles)        # creating the object MACCS and providing a list of smiles

Note that every method of the class takes ‘self‘ as its first parameter, which is how it accesses the attributes and other methods of the object.

We also convert all the smiles into mols by applying the ‘Chem.MolFromSmiles()‘ function from RDKit. Converting smiles to mols is the first major step before applying other cheminformatics processes and manipulations. The collection of mols is stored in a list as the attribute self.mols. We write this code as part of the __init__() method so that the conversion runs as soon as the object is created.

There we have it… the creation of MACCS class and the __init__() method.

class MACCS:
    def __init__(self, smiles):
        self.smiles = smiles
        self.mols = [Chem.MolFromSmiles(i) for i in smiles]
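One thing worth keeping in mind: Chem.MolFromSmiles() returns None rather than raising an error when a SMILES string cannot be parsed, and a None entry would later break the fingerprint calculation. The sketch below shows one possible way to filter such entries before building the class; the variable names and the deliberately invalid string are my own illustration and are not part of the MACCS class above.

```python
from rdkit import Chem

# The middle entry is deliberately invalid, for illustration only
smiles = ['CC(=O)NO', 'not_a_smiles', 'CC(=O)Nc1ccc(O)cc1']

# MolFromSmiles returns None (instead of raising) when parsing fails
mols = [Chem.MolFromSmiles(s) for s in smiles]

# Keep only the entries that parsed, so smiles and mols stay aligned
valid = [(s, m) for s, m in zip(smiles, mols) if m is not None]
valid_smiles = [s for s, _ in valid]
valid_mols = [m for _, m in valid]

print(len(smiles), len(valid_smiles))
```

Passing valid_smiles to MACCS(...) instead of the raw list would keep the smiles column and the fingerprint rows in step with each other.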

Computing MACCS and Saving an Output CSV File

Now, let’s look at the following class method.

    def compute_MACCS(self, name):
        MACCS_list = []
        header = ['bit' + str(i) for i in range(167)]
        for i in range(len(self.mols)):
            ds = list(MACCSkeys.GenMACCSKeys(self.mols[i]).ToBitString())
            MACCS_list.append(ds)
        df = pd.DataFrame(MACCS_list,columns=header)
        df.insert(loc=0, column='smiles', value=self.smiles)
        df.to_csv(name[:-4]+'_MACCS.csv', index=False)

This is a class method called ‘compute_MACCS()’ that computes the MACCS fingerprints and then writes an output CSV file.

The compute_MACCS() method takes a parameter called ‘name’, which is used to build the output file name: the last four characters (such as “.csv”) are removed and “_MACCS.csv” is appended, so that the files are systematically named after the descriptor they contain.
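Because name[:-4] simply cuts off four characters, it only behaves as intended when the file name ends in a three-letter extension such as “.csv”. As a possible alternative, the sketch below uses os.path.splitext from the standard library; maccs_output_name is a hypothetical helper of mine, not something the MACCS class defines.

```python
import os

def maccs_output_name(name):
    """Hypothetical helper: replace the file extension with '_MACCS.csv'."""
    root, _ext = os.path.splitext(name)
    return root + '_MACCS.csv'

# Same result as slicing when the extension is '.csv'
sliced = 'data/macrolides_smiles.csv'[:-4] + '_MACCS.csv'
robust = maccs_output_name('data/macrolides_smiles.csv')
print(sliced)   # data/macrolides_smiles_MACCS.csv
print(robust)   # data/macrolides_smiles_MACCS.csv

# The two approaches differ for other extensions
print('smiles.xlsx'[:-4] + '_MACCS.csv')   # smiles._MACCS.csv
print(maccs_output_name('smiles.xlsx'))    # smiles_MACCS.csv
```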

First, we make a list called “MACCS_list” that will store all the computed fingerprints and set up a variable named “header”. The RDKit MACCS fingerprint has 167 bit positions (the 166 keys plus position 0, which is never set), so we make the list “header”, which will store something like

header = ['bit0', 'bit1', 'bit2', 'bit3', 'bit4', 'bit5', ... , 'bit164', 'bit165', 'bit166']

This is also how we will format the column header in the output CSV file.

Next, we iterate over all the mols in the self.mols list.

For each mol, we compute the MACCS keys, convert them into a bit string, and split that string into a list of individual characters. This is done by this line of code:

ds = list(MACCSkeys.GenMACCSKeys(self.mols[i]).ToBitString())

The MACCS fingerprint for each mol is appended to MACCS_list. Essentially, MACCS_list will contain the fingerprints of all the mols in a nested list format.

For example, for two compounds, MACCS_list will look something like this:

MACCS_list = [
['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '1', '0', '1', '1', '0', '1', '1', '1', '1', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '1', '0', '1', '1', '0', '1', '0', '0', '1', '1', '1', '0', '0', '1', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '1', '1', '1', '0', '0', '1', '0', '0', '0', '0', '0', '1', '0', '1', '0', '1', '0', '1', '0', '1', '0', '0', '1', '0', '0', '0', '0', '1', '1', '0', '0', '1', '0', '1', '0', '1', '1', '1', '1', '0', '0', '1', '1', '1', '1', '1', '0'], 
['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '1', '0', '0', '0', '1', '0', '0', '1', '0', '1', '1', '0', '0', '0', '0', '0', '1', '0', '1', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '1', '0', '1', '1', '1', '0', '0', '0', '0', '0', '0', '0', '1', '0', '1', '0', '1', '0', '1', '0', '1', '0', '1', '1', '1', '0', '0', '0', '1', '1', '1', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '1', '1', '1', '0', '0', '0', '1', '0', '1', '1', '1', '0', '0', '0', '0', '0', '1', '1', '1', '0', '0', '1', '0', '1', '1', '1', '1', '0', '1', '1', '1', '1', '0', '0', '1', '0', '0']
]
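Notice that the entries are the characters '0' and '1', not integers, because list() splits the bit string character by character. That is fine for writing a CSV file, but if you want to do arithmetic on the bits in memory, you may want integers instead. A minimal sketch, using a short made-up bit string in place of a real 167-character one:

```python
# A short, made-up bit string standing in for the 167-character one
bit_string = '0100110'

as_chars = list(bit_string)             # what the class stores: '0'/'1' characters
as_ints = [int(c) for c in bit_string]  # numeric version of the same bits

print(as_chars)      # ['0', '1', '0', '0', '1', '1', '0']
print(as_ints)       # [0, 1, 0, 0, 1, 1, 0]
print(sum(as_ints))  # 3, the number of bits that are set
```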

MACCS_list grows linearly with the number of mols you compute. Once you have a nested list like that, you can easily convert it into a pandas data frame, which can then be manipulated and exported as a CSV file. See the example below.

df = pd.DataFrame(MACCS_list,columns=header)

# Next, we insert a column called smiles at the first column index.
df.insert(loc=0, column='smiles', value=self.smiles)       

# Then, we export it as CSV file by providing a file name. We don't want an extra column for the index so we set it to False. 
df.to_csv(name[:-4]+'_MACCS.csv', index=False)
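If you would like to see what these three steps produce without running RDKit, here is a small self-contained sketch. The 4-bit fingerprints are made up to keep the output readable, and the CSV is written to an in-memory buffer rather than to a file; both are assumptions for the sake of the example.

```python
import io
import pandas as pd

smiles = ['CC(=O)NO', 'CC(=O)Nc1ccc(O)cc1']
# Made-up 4-bit fingerprints, only to keep the example readable
fingerprints = [list('0101'), list('1100')]
header = ['bit' + str(i) for i in range(4)]

df = pd.DataFrame(fingerprints, columns=header)
df.insert(loc=0, column='smiles', value=smiles)  # smiles becomes the first column

buffer = io.StringIO()            # in-memory stand-in for a file on disk
df.to_csv(buffer, index=False)    # index=False leaves out the row numbers
csv_text = buffer.getvalue()
print(csv_text)
```

The first line of the output is the header row (smiles followed by the bit columns), and each following line holds one compound.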

There you have it… computing MACCS and generating output csv files.

I probably won’t go into as much detail for the other Python classes because it may become redundant. I will mainly provide a quick overview of what each code section is doing.

Please stay tuned for the next post in this series, on the Python code for computing ECFP6: ECFP6 Fingerprints in Python – Part 3.