Please check out the previous blog posts from this series if you haven’t done so already:
Part 1, where I explained the algorithm for k-fold Cross-Validation.
Part 2A of the Nested Cross-Validation & Cross-Validation series, where I went through a Python tutorial on implementing k-fold CV regressors using random forest (RF) from scikit-learn with a simple cheminformatics dataset of descriptors and endpoints of interest.
Here in Part 2B, I will cover the Python tutorial for the dataset containing high-dimensional matrices, where each matrix represents the features of a chemical structure (this is taken from one of my Ph.D. projects: MD-QSAR with Imatinib derivatives).
In this example, one compound is represented by a CSV file containing 456 columns and 1501 rows. Each column represents a molecular dynamics (MD) based descriptor/feature, and each row represents a specified time-frame.
Please see Figure 2 for a graphical illustration.
To provide a brief description, this data was extracted from MD simulations run for a duration of 15 ns. Each ligand was docked within a specified binding pocket of the BCR-ABL kinase protein, and MD simulations were performed. 15,000 frames of the protein-ligand complex were extracted (one every picosecond). The resulting descriptors account for binding site characteristics such as its volume, composition, and energetic characteristics, the protein-ligand interactions between each compound and the ABL kinase protein, and so on.
Overall, each CSV file contains the 4D/MD descriptor set for one compound, characterizing the ligand, the binding site, and the molecular protein-ligand interactions over the 15,000 MD frames. Please check out this MD-QSAR paper for more detail.
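To make the layout concrete, here is a minimal sketch of loading a single compound's descriptor file and inspecting its dimensions (the file name is just one of the CHEMBL CSV files from the MD_Data folder):
import pandas as pd

# Load one compound's MD descriptor matrix
one_cpd = pd.read_csv("MD_Data/CHEMBL3689722.csv")

# Expect on the order of 1501 rows (time frames) x 456 columns (MD descriptors)
print(one_cpd.shape)
print(one_cpd.columns[:5])  # peek at the first few descriptor names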
This is very heavy, high-dimensional data (456 x 1501 = 684,456 features for a single compound!!), and we have 170 compounds in the folder. I would not use simple machine learning algorithms on this dataset without a lot of feature engineering and powerful extraction and/or reduction techniques. It is likely that much of the data is noise that won't contribute to any meaningful relationship between the features and the endpoint being modeled (pKi in this case, the same as in the Part 2A tutorial).
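Just as one illustrative example of a reduction technique (not something done in this post), a variance filter from scikit-learn could drop near-constant descriptor columns before modeling; the dummy array and threshold below are assumptions for demonstration only:
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Dummy stand-in for the flattened descriptor matrix (n_compounds x n_features);
# in practice this would come from the read_data() method defined later in this post
rng = np.random.default_rng(0)
flattened_data = rng.random((10, 1000))
flattened_data[:, :100] = 0.0  # simulate constant, uninformative columns

# Drop features whose variance falls below an arbitrary cutoff
selector = VarianceThreshold(threshold=1e-4)
reduced_data = selector.fit_transform(flattened_data)
print(flattened_data.shape, "->", reduced_data.shape)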
Anyhow, I will use the random forest (RF) machine learning algorithm for the Python code demonstration.
Since the workflow is pretty much the same as in Part 2A, I won’t go into details of each function. The code and relevant datasets are provided in this GitHub repository.
Import the required libraries.
import itertools, os
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
Create a Python class to handle the data files containing MD descriptor frames.
class MDframes_DL:
    def __init__(self):
        self.MDQSAR_directory = "MD_Data/"
        self.retrieve_data_labels()
        self.retrieve_cpd_filenames()

    def retrieve_data_labels(self):
        '''Retrieve the names of compounds along with binding affinities and save them in a dictionary.'''
        with open("aff.csv", 'r') as cpd_affmanager:
            cpd_affmanager.readline()  # skip the header row
            # dictionary of compound IDs (string) matched with binding affinities (float)
            self.cpd_data_labels = {i[0]: float(i[1]) for i in (x.strip().split(',') for x in cpd_affmanager)}

    def retrieve_cpd_filenames(self):
        '''Retrieve the names of all cpd files in the folder and shuffle the list.'''
        self.cpd_filenames = os.listdir(self.MDQSAR_directory)  # names of all compound files in the folder
        # sklearn.utils.shuffle returns a shuffled copy, so reassign the result
        self.cpd_filenames = shuffle(self.cpd_filenames)
    def BatchGenerator(self, file_names):
        '''Given a list of file names, generate data and labels batch by batch.'''
        initial = 0
        batch_size = 1
        while initial < len(file_names):
            pre_data = []
            labels = []
            for i in range(batch_size):
                if initial < len(file_names):
                    if os.path.isfile(self.MDQSAR_directory + file_names[initial]):
                        each_cpd = pd.read_csv(self.MDQSAR_directory + file_names[initial]).astype(float)
                        df = each_cpd.fillna(0)  # fill missing values with zeros
                        scaler = preprocessing.MinMaxScaler()
                        each_cpd_scaled = scaler.fit_transform(df)  # scale each descriptor column to [0, 1]
                        pre_data.append(each_cpd_scaled)
                        labels.append(self.cpd_data_labels[file_names[initial][:-4]])  # strip ".csv" to get the compound ID
                    initial += 1
            pre_data = np.nan_to_num(np.array(pre_data))
            # flatten each compound's (frames x descriptors) matrix into a single feature vector
            shaped_data = pre_data.reshape((pre_data.shape[0], pre_data.shape[1] * pre_data.shape[2]))
            scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
            final_data = scaler.fit_transform(shaped_data)
            yield final_data, labels
    def read_data(self, files):
        '''
        Given a list of file names, extract the descriptors.
        Fill missing values with zeros and scale the data from 0 to 1, column by column.
        Return the processed data and the endpoints (in other words, the labels).
        '''
        pre_data = []
        labels = []
        for i in files:
            each_cpd = pd.read_csv(self.MDQSAR_directory + i, engine='python').astype(float)
            df = each_cpd.fillna(0)
            scaler = preprocessing.MinMaxScaler()
            each_cpd_scaled = scaler.fit_transform(df)
            pre_data.append(each_cpd_scaled)
            labels.append(self.cpd_data_labels[i[:-4]])  # strip ".csv" to get the compound ID
        pre_data = np.nan_to_num(np.array(pre_data))
        # flatten each compound's (frames x descriptors) matrix into a single feature vector
        shaped_data = pre_data.reshape((pre_data.shape[0], pre_data.shape[1] * pre_data.shape[2]))
        scaler = preprocessing.MinMaxScaler()
        data = scaler.fit_transform(shaped_data)
        return data, labels
    def split_partitions(self, filenames, folds):
        '''
        Split the data into k different folds, given a list of compound file names and the number of partitions.
        There are two outputs: the single held-out folds (test sets) and the corresponding k-1 folds (train sets).
        '''
        num_val_samples = len(filenames) // folds + 1
        one_fold = []
        nine_folds = []
        for i in range(folds):
            one_fold_data = filenames[i * num_val_samples: (i + 1) * num_val_samples]  # validation data: partition #i
            one_fold.append(one_fold_data)
            nine_fold_data = np.concatenate([filenames[:i * num_val_samples], filenames[(i + 1) * num_val_samples:]], axis=0).tolist()  # training data: all other partitions
            nine_folds.append(nine_fold_data)
        return one_fold, nine_folds
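BatchGenerator() is defined above but not used in the rest of this post (read_data() is used instead). Just as a rough usage sketch, and assuming the MD_Data/ folder and aff.csv are in place, it could be consumed like any Python generator:
demo = MDframes_DL()

# Iterate over the generator; each batch here holds a single compound's flattened descriptors
for batch_data, batch_labels in demo.BatchGenerator(demo.cpd_filenames[:3]):
    print(batch_data.shape, batch_labels)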
Get a list of the cpd file names (which are also ChEMBL compound IDs) in the MD_Data directory.
directory = "MD_Data/"
cpd_files = os.listdir(directory)
Shuffle the names of the cpd files.
cpd_files = shuffle(cpd_files, random_state=65)
print(cpd_files)
We will do 5-fold CV like in the previous tutorial (keeping the core workflow the same), so we will set outer_k to 5. We will instantiate the class MDframes_DL and assign it to the variable sample.
outer_k = 5
sample = MDframes_DL()
test_fold, train_fold = sample.split_partitions(cpd_files, outer_k)
split_partitions() is pretty similar to the one in Part 2A. Here, we are splitting the file names only, not the data itself. We take in a list of all the file names (aka ChEMBL ID names) and split them into 5 test folds and 5 train folds. Refer to Figure 3 for a graphical illustration.
As a result, test_fold and train_fold will each contain 5 lists of file names. To show what test_fold looks like, here it is.
[['CHEMBL3689722.csv',
'CHEMBL3689625.csv',
..
'CHEMBL3689710.csv',
'CHEMBL3689686.csv'],
['CHEMBL3689738.csv',
'CHEMBL3689627.csv',
...
'CHEMBL3689736.csv',
'CHEMBL3689668.csv'],
['CHEMBL3689699.csv',
'CHEMBL3689643.csv',
...
'CHEMBL3689745.csv',
'CHEMBL3689669.csv'],
['CHEMBL3689727.csv',
'CHEMBL3689751.csv',
...
'CHEMBL3689717.csv',
'CHEMBL3689689.csv'],
['CHEMBL3689720.csv',
'CHEMBL3689688.csv',
..
'CHEMBL3689711.csv',
'CHEMBL3689640.csv']]
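As a quick sanity check (not part of the original workflow), you could confirm that the five test folds are disjoint, that together they cover every compound file, and inspect the fold sizes:
# Fold sizes for the 5 test and train partitions
print([len(fold) for fold in test_fold])
print([len(fold) for fold in train_fold])

# The test folds should be disjoint and jointly cover all compound files
all_test_files = list(itertools.chain.from_iterable(test_fold))
assert len(all_test_files) == len(set(all_test_files)) == len(cpd_files)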
Next, we perform a k-fold CV with an RF algorithm. We use a single set of RF parameters for all CV rotations and do not tune anything for now.
In a real scenario, you would need to run an RF grid search to find the optimal parameters for your model as you perform the cross-validation process, but I will skip this for now because it is very time-consuming for such high-dimensional data. I expect horrible results from this CV procedure.
In the code below, we will fit each train partition with one set of RF parameters and make predictions on the relevant test partition. We iterate the process for all 5 splits and gather the results.
outerCV_targets = []
outerCV_predictions = []
outerCV_IDs = []
for i in range(outer_k):
    outer_train = train_fold[i]
    outer_test = test_fold[i]
    outerCV_test_data, outerCV_test_targets = sample.read_data(outer_test)
    outerCV_train_data, outerCV_train_targets = sample.read_data(outer_train)
    cv_rf = RandomForestRegressor(n_estimators=100,
                                  max_depth=5,
                                  max_features=1.0,  # equivalent to the old 'auto' setting for regressors
                                  min_samples_leaf=1,
                                  min_samples_split=5,
                                  bootstrap=True,
                                  criterion="absolute_error",  # called "mae" in older scikit-learn versions
                                  n_jobs=-1)
    # Fit the random forest model on the training partition
    cv_rf.fit(outerCV_train_data, outerCV_train_targets)
    # Predict on the held-out test partition and store the results
    outerCV_test_predictions = cv_rf.predict(outerCV_test_data).tolist()
    outerCV_predictions.append(outerCV_test_predictions)
    outerCV_targets.append(outerCV_test_targets)
    outerCV_IDs.append(outer_test)
Note that I have written a read_data method as part of the MDframes_DL class, which extracts the data based on file names and returns the processed data and corresponding endpoints.
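As mentioned earlier, hyperparameter tuning was skipped to save time. If you did want to add it, a rough (and slow) sketch using the RandomizedSearchCV already imported above might look like the following, placed inside each outer fold; the parameter distributions here are arbitrary assumptions, not tuned values:
# Hypothetical search space -- illustrative values, not optimized for this dataset
param_distributions = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
}
search = RandomizedSearchCV(RandomForestRegressor(n_jobs=-1),
                            param_distributions=param_distributions,
                            n_iter=10,  # number of random parameter combinations to try
                            cv=3,       # inner CV folds on the training partition
                            scoring="neg_mean_squared_error",
                            random_state=65)
# Inside the outer loop, this would be fit on that fold's training partition
search.fit(outerCV_train_data, outerCV_train_targets)
best_rf = search.best_estimator_  # would take the place of cv_rf above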
Let's take a look at how the results from this run actually turned out…
outerCV_targets_combined = list(itertools.chain.from_iterable(outerCV_targets))
outerCV_predictions_combined = list(itertools.chain.from_iterable(outerCV_predictions))
outerCV_IDs_combined = list(itertools.chain.from_iterable(outerCV_IDs))
rms = mean_squared_error(outerCV_targets_combined, outerCV_predictions_combined, squared=False)
print(f"rms error is: {rms:.2f}")
r2 = r2_score(outerCV_targets_combined, outerCV_predictions_combined)
print(f"r2 value is: {r2:.2f}")
rms error is: 0.88
r2 value is: 0.07
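Besides pooling all the predictions, it can be informative to compute the statistics fold by fold; a small sketch using the per-fold lists collected in the CV loop could look like this:
# Per-fold RMSE and R^2, using the lists collected during the outer CV loop
for i, (fold_targets, fold_preds) in enumerate(zip(outerCV_targets, outerCV_predictions)):
    fold_rms = mean_squared_error(fold_targets, fold_preds, squared=False)
    fold_r2 = r2_score(fold_targets, fold_preds)
    print(f"fold {i}: rms = {fold_rms:.2f}, r2 = {fold_r2:.2f}")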
import matplotlib.pyplot as plt
plt.rcParams["figure.dpi"] = 300  # set before creating the figure so the dpi actually applies
fig, ax = plt.subplots()
x = outerCV_targets_combined
ax.scatter(x, outerCV_predictions_combined, edgecolors=(0, 0, 0))
ax.plot([min(x), max(x)], [min(x), max(x)], 'k--', lw=4)  # identity line (perfect prediction)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
It appears that the model has learned junk from the dataset, which is more or less expected: the dataset is huge and noisy (it needs tremendous feature engineering), we did no parameter searching or tuning (we just used one arbitrary parameter set), and we used RF (which I normally like, but which I don't believe is as well suited as DNNs to handle this type of dataset). Anyhow, the point of this tutorial is not to make a good model, or to discuss the MD descriptor generation method for that matter. The goal is to show the Python code for a CV workflow and how to read and do basic processing on such large data files. From there, you can expand on whatever else you want to do.
I will stop Part 2B here for now. As always, I thank you for visiting my blog and welcome constructive feedback. Please let me know if you spot any mistakes as well. I will be working on the next post, Part 3, where I will explain the algorithm of NeCV and compare Cross-Validation and NeCV.