How To Curate Chemical Data for Cheminformatics

In early 2021, I gave a talk at the MIDD+ Conference held by Simulations Plus Inc. on data curation using one of the projects that I worked on — the Madin-Darby Canine Kidney (MDCK) project. In this blog post, I will be focusing on the general data curation aspects of that project.

Let me emphasize why data curation is an essential step in cheminformatics. A machine learning model can only be as good as the data it is built on. If the data is noisy, riddled with activity cliffs, and full of mistakes, you won’t get a useful model. Cheminformatics datasets are often not as large as those in other fields like image classification or text generation, which can draw on thousands and thousands of observations. With relatively small datasets, we need to ensure that the data we have is of high quality.

Inspired by the famous quote from Spider-Man, “With great power comes great responsibility,” I would rephrase it slightly.

With (small or large) data comes great responsibility to ensure the dataset is of high quality and, in fact, modelable.

It’s a nerdy quote for sure, and I am a big Marvel fan.

Okay, back to the topic.

General Steps on How to Curate Chemical Data

Below are general steps you can apply to any project for data curation efforts.

  1. Initial exploration of the dataset
  2. Identify and handle complications or noise in the dataset (e.g.,
    • case 1: different cell-line types are mixed in the MDCK dataset, and they are too different to model together;
    • case 2: in-silico predictions are present in the dataset, and they should be identified and removed)
  3. Exclusion of odd entries & unreliable data sources
  4. Analysis of intra- and inter-lab experimental variability
  5. Chemical data processing
    • Structural cleaning, salt removal & standardization
    • Treatment of tautomeric forms
    • Removal of identical duplicates
    • Handling duplicate structures with multiple measurements
    • Activity cliff verification & analysis
  6. Mistake Diagnosis & Correction

I will expand on the above steps and show you how I used them specifically for the MDCK project.

MDCK is a mammalian cell line (originally derived from the kidney of an adult female cocker spaniel) commonly used to measure apparent permeability (Papp). Permeability is an essential property of drug candidates that influences absorption & distribution. There are a lot of advantages to using MDCK cells: they are easy to use, analytically clean, and show good viability and reproducibility. They have clear apical-basolateral polarity, well-defined cell junctions, and a rapid growth rate (a three-day growth period). In the MDCK project, I was building a machine learning model to predict MDCK Papp values for a given chemical structure.

Machine Learning Pipeline

Figure 1. Simplified workflow from Untold Stories of Data Curation by Phyo Phyo (2021 MIDD+ Conference)

In a typical machine learning pipeline, you would have these essential steps (Figure 1):

  1. Data Mining
  2. Data Preprocessing
  3. Data Extraction, Analysis & Curation
  4. Descriptor Generation & Selection
  5. Machine Learning Techniques
  6. Model Training & Validation
  7. Assessment of Applicability Domain

Would it surprise you that ~80-85% of the effort and time go into the first three steps of the project timeline?

Yes, data mining, preprocessing, extraction, analysis, and curation are indispensable parts of the project. They are often the most laborious and time-consuming steps in the pipeline. The techniques used in these steps vary from project to project.

I will zoom in on the data curation part of the project at this point (Figure 2).

MDCK Data Curation Workflow

First and foremost, MDCK-related data was mined from multiple databases (such as ChEMBL and GOSTAR) and from the literature. See Figure 2 for a visual illustration of the chemical data curation workflow.

Figure 2. MDCK Data Curation workflow from Untold Stories of Data Curation by Phyo Phyo (2021 MIDD+ Conference)

During the initial exploration of the dataset, I found some major complications that I needed to tackle: cell line categorization, distinguishing experimental & predicted values, and extraction of permeability direction (all expanded below).

Biological/chemical datasets can have multiple major complications to address. Tackling them can take a lot of time and effort, so take them on one at a time.

Cell Line Categorization

In the original data, which contained about 13k observations, the various cell lines (such as MDCK, MDCK-LE, MDCK-MDR1 (multi-drug resistant), MDCK-BCRP, and MDCK-II) come with or without proper annotations (in many cases, without). They are different enough that we don’t want them mixed in our final dataset. As I recall, the process was a headache to deal with, because it falls on the curators’ shoulders to ensure that the cell lines are well categorized and accurately labeled. So, I used a web-scraping and text-mining approach combined with keyword matching to help with the process. It was very helpful and efficient, though it may not be 100% perfect. To come up with useful keywords to either match or discriminate, you need to do a lot of prior reading and research and incorporate that knowledge into the pipeline.
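Here is a minimal sketch of that keyword-matching idea, assuming the assay descriptions sit in a pandas DataFrame column (the column name, patterns, and labels here are illustrative, not the exact ones from the project):

```python
import re
import pandas as pd

# Keyword patterns per cell line; most specific first, plain "MDCK" as fallback.
CELL_LINE_PATTERNS = {
    "MDCK-MDR1": r"mdr1|multi[\s-]?drug[\s-]?resistan",
    "MDCK-BCRP": r"bcrp",
    "MDCK-LE":   r"mdck[\s-]?le\b|low[\s-]?efflux",
    "MDCK-II":   r"mdck[\s-]?ii",
    "MDCK":      r"mdck",
}

def categorize_cell_line(description):
    """Return the first cell-line label whose keywords match the description."""
    text = str(description).lower()
    for label, pattern in CELL_LINE_PATTERNS.items():
        if re.search(pattern, text):
            return label
    return "UNKNOWN"  # no keyword hit -> route to manual review

df = pd.DataFrame({"assay_description": [
    "Permeability in MDCK-MDR1 transfected cells",
    "Papp across MDCK II monolayers",
    "apparent permeability, MDCK",
]})
df["cell_line"] = df["assay_description"].apply(categorize_cell_line)
print(df[["assay_description", "cell_line"]])
```

In the actual project, this kind of matching was combined with web scraping and text mining to gather supporting text for poorly annotated entries, but the core logic follows the same pattern.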

Distinguishing Experimental & Predicted Values

Another thing I observed was that the values reported in databases were not always experimental values; some were in-silico predictions from other software tools or machine learning models. I noticed this during the curation process when I saw some endpoints beyond a reasonable range, so I went back and double-checked. Unfortunately, those entries were not tagged or labeled as “in-silico predictions”. They sit alongside their experimental peers, so I needed to track them down and eliminate them. I did so, again, with the help of Python scripts and automation to scan through multiple resources and identify possible in-silico predictions. Once I had a smaller list of entries flagged as suspicious, I manually checked the original articles, inspected them carefully, and determined whether they were in fact experimental values or computational predictions. If I remember correctly, about 150 entries (from ~23 articles) in the dataset were predictions from other tools and were thus removed. Another observation I made along the way was that computational tools such as ADMET Predictor, QikProp, and PreADMET are often used to predict MDCK values.
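A hedged sketch of that flagging step, assuming each record carries free-text assay or reference notes in a column (the column name is illustrative; the tool names are the ones mentioned above):

```python
import pandas as pd

# Phrases that suggest a computed rather than measured value.
PREDICTION_KEYWORDS = [
    "admet predictor", "qikprop", "preadmet",
    "in silico", "in-silico", "predicted", "calculated",
]

def looks_predicted(notes):
    text = str(notes).lower()
    return any(keyword in text for keyword in PREDICTION_KEYWORDS)

df = pd.DataFrame({"notes": [
    "Papp measured by LC-MS/MS across monolayer",
    "value calculated with QikProp",
]})
# Flag only; the final call is made by checking the original article.
df["suspect_prediction"] = df["notes"].apply(looks_predicted)
print(df)
```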

Extraction of Permeability Direction

In these studies, I also needed to account for the direction in which the measurements were made: apical-to-basolateral or basolateral-to-apical. The assay descriptions in the dataset usually provide that information, so it can be extracted using automation and scripts as well. Make sure to manually inspect the dataset after the process, because there may be human mistakes (e.g., spelling errors such as “baslateral” instead of “basolateral”, or “apial” instead of “apical”) that are hard to catch with automation (unless you also account for possible misspellings as part of the script). Once these are annotated, you can subset the data based on the direction you want to model. In this case, I was focusing on the apical-to-basolateral direction.
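Here is a minimal sketch of the direction extraction, written with loose regexes so the common misspellings above still match (the patterns and the example strings are illustrative):

```python
import re

# Loose patterns: "api?c?al" also matches the misspelling "apial";
# "bas[oa]?l?ateral" also matches "baslateral".
APICAL = r"api?c?al"
BASO = r"bas[oa]?l?ateral"
SEP = r"[\s-]*(?:to[\s-]*)?"  # "to", hyphens, or nothing in between

def extract_direction(description):
    text = str(description).lower()
    if re.search(APICAL + SEP + BASO, text):
        return "A-B"
    if re.search(BASO + SEP + APICAL, text):
        return "B-A"
    return "UNKNOWN"  # route to manual inspection

print(extract_direction("Papp, apical to basolateral"))    # A-B
print(extract_direction("baslateral-to-apial transport"))  # B-A
```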

Exclusion of Odd Entries & Unreliable Data Sources

There were multiple entries that just couldn’t be used: for example, entries with odd endpoint values or units, or entries with unidentified or unreported cell line information. There are also studies where P-gp (P-glycoprotein) inhibitors such as erythromycin, verapamil, and quinidine are used in combination with the MDCK cell line, and where cell lines are transfected with specific agents or modified to accomplish certain objectives. Such entries introduce complications or ambiguity into the dataset and were thus removed. Some of these cases can be found through manual inspection or different types of analysis.

During this part, I made sure to remove entries with missing units or values and entries with negative or otherwise unreasonable values, and I kept only the values reported with an “=” sign; values reported with qualifiers such as <, >, ≤, ≥, or as ranges, were discarded.
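As a rough sketch, assuming columns named qualifier, value, and unit (illustrative names), the filter could look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "qualifier": ["=", ">", "=", "="],
    "value": [12.3, 50.0, -1.0, 4.7],
    "unit": ["10^-6 cm/s", "10^-6 cm/s", "10^-6 cm/s", None],
})

mask = (
    (df["qualifier"] == "=")   # drop <, >, <=, >=, and ranges
    & (df["value"] > 0)        # drop negative/zero (nonphysical) Papp values
    & df["unit"].notna()       # drop entries with missing units
)
clean = df[mask]
print(clean)  # only the first row survives
```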

Analysis of intra- and inter-lab experimental variability

It’s a good idea to analyze the intra- and inter-lab experimental variability of the compounds. It’s additional work for sure, and not always easy to do, depending on the type of project or dataset you are working with. However, this analysis will give you a good idea of the reasonable expectations you can have for the models you will be building. When you eventually build models, the values from these analyses will be useful for comparing and assessing your models’ statistics and performance.
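A rough sketch of such an analysis, assuming columns compound_id, lab (or source), and log-transformed Papp values (all names are illustrative); the resulting spreads put a floor on the error you can reasonably expect from any model:

```python
import pandas as pd

df = pd.read_csv("mdck_curated.csv")  # hypothetical curated file

# Intra-lab: spread of replicate measurements within the same lab/source.
intra = df.groupby(["compound_id", "lab"])["log_papp"].std().dropna()

# Inter-lab: spread of per-lab means across labs for the same compound.
inter = (df.groupby(["compound_id", "lab"])["log_papp"].mean()
           .groupby(level="compound_id").std().dropna())

print(f"median intra-lab SD: {intra.median():.2f} log units")
print(f"median inter-lab SD: {inter.median():.2f} log units")
```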

Chemical Data Processing

At this point, the dataset can go through standard chemical processing, which involves salt removal, structural cleaning and standardization, tautomer standardization and treatment, removal of identical duplicates (identical in both structure and endpoint), and treatment of structures with multiple measurements. Another important step is to perform activity cliff analysis: identify compounds involved in activity cliffs and verify them.
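For an open-source flavor of these steps, here is a minimal RDKit sketch; this is one possible route, not the exact toolchain used in the project (see the tautomer caveat below):

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

uncharger = rdMolStandardize.Uncharger()
tautomerizer = rdMolStandardize.TautomerEnumerator()

def standardize(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None                              # unparseable -> manual check
    mol = rdMolStandardize.Cleanup(mol)          # sanitize & normalize groups
    mol = rdMolStandardize.FragmentParent(mol)   # strip salts, keep parent
    mol = uncharger.uncharge(mol)                # neutralize where possible
    mol = tautomerizer.Canonicalize(mol)         # one canonical tautomer
    return Chem.MolToSmiles(mol)

records = ["CC(=O)Oc1ccccc1C(=O)[O-].[Na+]", "Oc1ccccn1"]
print([standardize(s) for s in records])

# After standardization, rows with identical structure AND endpoint can be
# dropped; identical structures with differing endpoints need inspection
# (e.g., keep the median only after verifying the sources).
```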

One thing I would note is that I have often used Python-based open-source tools, and I haven’t come across one that handles tautomers very well, e.g., identifying tautomeric duplicates or standardizing tautomers. It’s a difficult problem, after all.
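Activity cliff flagging, by contrast, is quite workable with open-source tools. A rough sketch using RDKit Morgan fingerprints; the thresholds (0.8 Tanimoto similarity, 1.0 log-unit gap) are illustrative choices, not project settings:

```python
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

data = {  # hypothetical standardized SMILES -> log Papp
    "c1ccccc1O": 1.2,
    "Cc1ccccc1O": -0.4,
    "CCO": 0.3,
}
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = {smi: gen.GetFingerprint(Chem.MolFromSmiles(smi)) for smi in data}

# Flag highly similar pairs whose endpoints differ a lot; the toy data here
# may not trigger a flag, the point is the scan itself.
for (s1, v1), (s2, v2) in combinations(data.items(), 2):
    sim = DataStructs.TanimotoSimilarity(fps[s1], fps[s2])
    if sim >= 0.8 and abs(v1 - v2) >= 1.0:
        print(f"possible activity cliff: {s1} vs {s2} (sim={sim:.2f})")
```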

For chemical data processing and standardization, I would highly recommend ADMET Predictor, because I have used it consistently across projects and I am quite confident in its far-ranging capabilities, especially for tautomeric duplicate searches, tautomer standardization, activity cliff analysis, descriptor generation, modeling capabilities, etc. Please note that I currently work at Simulations Plus Inc., and I will be giving a presentation at the upcoming MIDD+ conference in February 2022.

Okay, I got side-tracked a little. You would think chemical data curation is done at this point. You wish, but no. We are just getting started. 😀

There are several mistakes (often an astounding number of them) present in chemical/biological datasets. One of our tasks is to fish them out, in whatever form they may come, and deal with them accordingly.

I started documenting the types of mistakes detected in the MDCK project while I was deep in the curation part of the pipeline. Figure 3 is by no means perfectly accurate, because I was not documenting consistently, but I kept at it as best I could.

One thing I was intrigued to learn was how many mistakes, and what types, I would detect or fish out in the process. I don’t plan to present it in papers or journals, so here’s the raw figure with raw stats. After discarding unverifiable entries, ~270 entries were corrected (which accounts for ~30% of the final dataset). Mistakes are abundant in the databases, and that’s why so much effort is focused on chemical data curation. There are several papers and talks highlighting the importance of data curation, and here’s a good video presentation on data curation by Dr. Pankaj Daga that you should check out.

Figure 3. Mistake Types Found During MDCK Data Curation Process (2021 MIDD+ Conference)

Where’s Waldo?

I like using metaphors, so here is one. Have you ever played Where’s Waldo?

Figure 4. Finding Waldo (2021 MIDD+ Conference)

Finding Waldo is like finding mistakes in a dataset, but harder … much harder, because with datasets we don’t know how many Waldos there are, or what shape they will take. Waldos may come in unrecognizable forms. You know you are chasing after odd-looking stuff, but you aren’t really sure what it will look like. Well, you get the idea.

It is always good to have reasonable expectations, and we are often limited in time and resources. So, we have to curate data efficiently within reason. It would be unrealistic to be 100% confident about dataset correctness, especially in very large datasets. But, the goal is to find as many mistakes as we can using the tools and methods we have.

Common types of Waldos in the MDCK dataset are:

  • Biological endpoints (Papp values)
  • Reported units
  • Chemical structures
  • Cell-line information
  • Direction of permeability

I will show some examples below.

Inconsistency in Reported Papp Unit Formats

There are mistakes in databases that result from unit-formatting inconsistency in the literature. In one example, the Papp unit for entries is reported in databases as both 10⁶ cm/s and 10⁻⁶ cm/s; the correct unit is 10⁻⁶ cm/s. The root cause is a formatting issue in the literature: papers use many different formats to represent 10⁻⁶ cm/s, such as ×10⁻⁶ cm/s, 10⁻⁶ MDCK Papp(A-B) (cm/s), Papp A-B (cm/s × 10⁻⁶), and Papp (cm/s) × 10⁻⁶, and the superscript minus sign is easily lost along the way. When databases take these values as they are, the values end up all over the place. Some of these mistakes can be detected using distributional analysis.
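A quick sketch of such a distributional check, assuming a numeric papp column nominally in 10⁻⁶ cm/s (names are illustrative): values sitting many orders of magnitude away from the bulk often point to a dropped exponent or a unit mix-up.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("mdck_raw.csv")  # hypothetical raw file
log_papp = np.log10(df["papp"].clip(lower=1e-12))

# Robust center/spread via median and median absolute deviation (MAD).
center = log_papp.median()
mad = (log_papp - center).abs().median()
suspect = df[(log_papp - center).abs() > 5 * mad]  # deliberately loose cutoff
print(suspect)  # candidates for manual unit checking, not automatic removal
```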

Mistakes in Papp Values

Some mistakes are in the reported endpoint values, as shown in Figure 5 below. These are often human errors made when typing in or transcribing the numbers from the literature: a decimal, digit, or unit gets messed up. Some of these mistakes can be detected by distributional analysis and by duplicate-structure assessment when merging multiple databases (a sketch follows Figure 5).

Figure 5. Mistakes in Papp Values (2021 MIDD+ Conference)
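Here is a sketch of that duplicate-structure assessment, assuming the merged data has inchikey, database, and papp columns (illustrative names); large disagreement between sources for the same structure is a strong signal to pull the original article.

```python
import pandas as pd

merged = pd.read_csv("merged_sources.csv")  # hypothetical merged file

spread = merged.groupby("inchikey")["papp"].agg(["min", "max", "count"])
suspect = spread[(spread["count"] > 1) & (spread["max"] / spread["min"] > 10)]
print(suspect)  # same structure, >10x disagreement across sources
```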

Mistakes in Structures

Structural mistakes are quite common as well. Figure 6 below shows an example where two databases report different structures from the same article; the different structures share the same endpoint value and unit. These appear to be human errors as well. Some of them can be detected using duplicate-endpoint assessment when merging multiple databases.

Figure 6. Mistakes in Structures (2021 MIDD+ Conference)

Another way to detect structural mistakes, if you have compound names, is to write scripts that retrieve structures from other databases and then compare them with what you have.
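The post doesn’t name a specific service, but PubChem’s PUG REST API is one option for this kind of cross-check. A minimal sketch (network access and a valid compound name assumed):

```python
import requests
from rdkit import Chem

def pubchem_smiles(name):
    """Look up a compound name on PubChem and return its canonical SMILES."""
    url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
           f"{name}/property/CanonicalSMILES/TXT")
    resp = requests.get(url, timeout=10)
    return resp.text.strip() if resp.ok else None

def same_structure(smiles_a, smiles_b):
    # Compare RDKit canonical SMILES so notation differences don't matter.
    ma, mb = Chem.MolFromSmiles(smiles_a), Chem.MolFromSmiles(smiles_b)
    if ma is None or mb is None:
        return False
    return Chem.MolToSmiles(ma) == Chem.MolToSmiles(mb)

local = "Cn1cnc2c1c(=O)n(C)c(=O)n2C"  # caffeine, as recorded in our dataset
reference = pubchem_smiles("caffeine")
if reference:
    print("match" if same_structure(local, reference) else "mismatch: inspect manually")
```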

You can also perform cluster analysis to see which compounds or endpoints look suspicious (a minimal sketch follows below). Also, make sure to manually inspect your modeling-ready dataset. I can assure you that during the modeling process, you will catch more mistakes here and there, based on your models’ predictions and uncertainty analysis. And you will find yourself in a loop between curation and modeling. Data curation is not a straightforward route, so you will go back and forth as needed for your project.
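For the cluster analysis, here is a minimal Butina clustering sketch with RDKit; clusters whose members carry wildly different endpoints are good candidates for inspection (the distance cutoff is an illustrative choice):

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator
from rdkit.ML.Cluster import Butina

smiles = ["c1ccccc1O", "Cc1ccccc1O", "CCO", "CCCO"]
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [gen.GetFingerprint(Chem.MolFromSmiles(s)) for s in smiles]

# Butina expects a flat lower-triangle list of pairwise distances.
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1.0 - s for s in sims)

clusters = Butina.ClusterData(dists, len(fps), distThresh=0.4, isDistData=True)
print(clusters)  # tuples of compound indices; first index is the centroid
```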

Summary of the Data Curation Steps

So, just to recap …

Before you dive into a project, perform an initial exploration of the dataset and identify potential complications. Then, inspect the dataset and exclude odd or ambiguous entries. Next, analyze the intra- and inter-lab experimental variability.

Then, perform standard chemical processing steps such as structural cleaning, salt removal, functional group standardization, treatment of tautomeric forms, and removal of identical duplicates. You may also need to address compounds with multiple measurements. Make sure to perform activity cliff analysis and verify the flagged compounds.

Last, perform mistake diagnosis and correction to fish as many mistakes as you can out of the dataset.

Final Thoughts

You will find yourself still in the data curation process even when you are building models. Some of your models may have suspicious or odd predictions for certain compounds, so you will find yourself going back and checking them. Data curation is not really done until you finalize the model and finish the entire project. So, expect to revisit that stage often.

There are a lot of creative and methodical approaches to detecting mistakes and curating data, depending on the project’s nature. I am sure many people feel differently about the data curation process, but I personally find it exciting and fun, because who wouldn’t like playing detective in their work now and then!

Pheww … this post got longer as I wrote on and on. But it’s all I want to say for now.

Hopefully, you find this post useful. Thank you so much for reading this post. As always, I welcome constructive feedback and suggestions.