Nested Cross-Validation & Cross-Validation Series – Part 3

This is part 3 of the Nested Cross-Validation & Cross-Validation series, where I will explain the algorithm of nested cross-validation (NeCV) and compare it with cross-validation.

Please read the earlier post first if you need to learn about cross-validation, so that you can dive straight into NeCV here.

I would like to first clarify that there are variations in the implementations and algorithms of NeCV. The algorithm I will be describing is a common one used in several studies, including an article I published in 2020.

Below is a diagram illustrating the NeCV algorithm (taken from https://pubs.acs.org/doi/10.1021/acs.jcim.0c00200).

Figure 1. Illustration of NeCV (Nested Cross-Validation)

For those who prefer reading the algorithm, here is the pseudocode of the NeCV algorithm (taken from https://pubs.acs.org/doi/10.1021/acs.jcim.0c00200).

Figure 2. Pseudocode of Nested Cross-Validation Algorithm

Note that the goal of this NeCV procedure is not to produce a single, finalized, robust model that is ready for making predictions, but rather to obtain an unbiased estimate of the model's generalization performance.

There are two cross-validations (CVs) in nested cross-validation (note the word “nested”): an external CV and an internal CV. The number of folds for the external and internal CVs can each be chosen according to the nature of the project and the developer’s preference. To keep things simple, I am going to use 10 folds for both the external and internal CVs.
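As a minimal sketch of this setup, assuming scikit-learn (the splitter class and the random_state values are my own illustrative choices, not part of the original article):

```python
from sklearn.model_selection import KFold

outer_cv = KFold(n_splits=10, shuffle=True, random_state=42)  # external CV (10 folds)
inner_cv = KFold(n_splits=10, shuffle=True, random_state=7)   # internal CV (10 folds)
```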

So, what’s the difference between external and internal CVs?

The external CV is treated as a series of external test sets for the “true” evaluation of the models chosen based on the internal CVs. In that sense, you will build several models from multiple internal CV procedures, and the generalization error on your dataset is estimated by making predictions with those models on the “outer test” folds.

In Figure 1, refer to the 10 folds split from the full dataset. Each fold (the red fold in Figure 1) is treated in turn as an external test set while the other nine folds are used for internal CV. In the end, we have 10 models, each optimized via internal CV, making predictions on its associated outer test fold. We gather model statistics on all the outer test folds and average them to get the generalized performance of the models on the dataset.

Meanwhile, the internal CV is used to perform hyperparameter tuning and to choose optimal models. These models are fitted on the internal folds (aka the outer train data) and then used to make predictions on the corresponding outer test folds. Essentially, we keep this internal CV process, which handles model selection and hyperparameter tuning, “separate” from the external CV process, which is used for model evaluation.

For the internal CV loop, you work with the nine remaining folds (let’s call them the outer train; see Figure 1) after excluding the outer test fold. You then apply simple CV on this outer train: split it into 10 internal folds, run a (possibly repeated) grid search, tune hyperparameters, evaluate on the inner test folds, and build an optimal model using all of the internal folds (aka the outer train data), which is then used to make predictions on the corresponding outer test fold.

In other words, you wrap the native CV procedure explained in the previous post inside the internal CV. The internal CV thus incorporates the important aspects of model building: grid search, hyperparameter tuning, and inner model evaluations, using the results to build an optimal model with the best parameters and the data from all the inner folds (aka the outer train), and then using that model to make predictions on the corresponding outer test fold. This internal CV procedure is repeated for all possible outer train folds, so that optimal models are built and predictions are made on all the outer test folds.
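To make the loop structure concrete, here is a minimal sketch of the procedure in Python. It assumes scikit-learn, a random-forest regressor, a toy dataset, and a tiny hypothetical hyperparameter grid; none of these choices come from the original article, and your estimator, grid, and scoring metric would differ.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, KFold

# Toy data standing in for a real descriptor matrix and endpoint.
X, y = make_regression(n_samples=200, n_features=20, noise=0.5, random_state=0)

outer_cv = KFold(n_splits=10, shuffle=True, random_state=42)  # external CV
inner_cv = KFold(n_splits=10, shuffle=True, random_state=7)   # internal CV
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5]}  # hypothetical grid

outer_scores = []
for train_idx, test_idx in outer_cv.split(X):
    # Outer train = the nine folds; outer test = the held-out red fold in Figure 1.
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Internal CV: grid search / hyperparameter tuning on the outer train only.
    search = GridSearchCV(RandomForestRegressor(random_state=0),
                          param_grid, cv=inner_cv, scoring="r2")
    search.fit(X_train, y_train)  # also refits the best model on all of the outer train

    # External assessment: predict on the outer test fold the model never saw.
    y_pred = search.best_estimator_.predict(X_test)
    outer_scores.append(r2_score(y_test, y_pred))

# Averaged outer-fold statistics = the generalization estimate NeCV reports.
print(f"NeCV R2: {np.mean(outer_scores):.3f} +/- {np.std(outer_scores):.3f}")
```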

This is quite computationally heavy and expensive!

Importantly, though, none of the outer test folds in this NeCV procedure is used in the internal CV for anything related to model building, such as hyperparameter tuning or optimal model selection; the outer test folds are used only for external assessment of the models built from the internal CV procedure, and are thus treated as external test sets.

So, is NeCV worth it?

My colleagues from grad school and I often debate whether the cost of the NeCV computation truly justifies the evaluation results. Are the final stats really worth it, especially if the averaged stats from NeCV don’t differ much from those of native CV? What do you think?

My opinion is that if the dataset is small enough that NeCV would not take much time, then why not? You can make a strong case for reporting “unbiased” performance statistics. If you are aiming to use those stats to benchmark against industrial-strength models intended to make predictions on actual novel compounds, then yes, you should go ahead and do it. Please note that the procedure will involve additional coding challenges if you are not very familiar with programming.

However, if your dataset is large, if you are using lots of descriptors, or if you have serious time constraints on small, curiosity-driven proof-of-concept projects, well, that’s your call. It really boils down to the kind of problem you are solving, the amount of time and energy you are willing to invest, and the strength of model validation that must go into your research study.

Cross-validation vs. nested cross-validation

Both native cross-validation (also known as k-fold cross-validation) and nested cross-validation (NeCV) are useful when working with small datasets.

The major pitfall of cross-validation is that it can give a significantly (optimistically) biased estimate of the true error, because the same CV procedure is used to tune and optimize hyperparameters, to perform model selection, and to estimate the generalization error. The pro is that it is much less computationally intensive than NeCV.
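As a concrete illustration of this pitfall, a common pattern is to tune hyperparameters with a grid search over the CV folds and then report the best cross-validated score as the generalization estimate, so the same folds do double duty. A minimal sketch, assuming scikit-learn and the same toy data and hypothetical grid as in the earlier sketch:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_regression(n_samples=200, n_features=20, noise=0.5, random_state=0)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5]}  # hypothetical grid

# The same 10 folds both select the hyperparameters and produce the reported score,
# so best_score_ tends to be an optimistically biased estimate of the true error.
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=KFold(n_splits=10, shuffle=True, random_state=42), scoring="r2")
search.fit(X, y)
print("Plain CV (potentially biased) estimate:", round(search.best_score_, 3))
```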

In NeCV, these two tasks are split: the internal CV is used for model selection and hyperparameter tuning, while the external CV is used for estimating the generalization error. So the pro is that it is designed to give an almost “unbiased” estimate of the true error. The downsides of NeCV are the high computational cost and, perhaps, the challenge of implementing the algorithm.
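For reference, this separation can also be written compactly in scikit-learn by wrapping the tuning step in an outer evaluation loop; a sketch under the same toy assumptions as above (cross_val_score refits the inner search on each outer training split and scores it only on the held-out outer fold):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=0.5, random_state=0)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5]}  # hypothetical grid

# Internal CV (model selection) lives inside GridSearchCV; the external CV only evaluates.
inner = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                     cv=KFold(n_splits=10, shuffle=True, random_state=7), scoring="r2")
outer_scores = cross_val_score(inner, X, y, scoring="r2",
                               cv=KFold(n_splits=10, shuffle=True, random_state=42))
print("NeCV estimate:", round(np.mean(outer_scores), 3))
```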

Ending notes

I would like to emphasize again that there are several variations of the cross-validation and NeCV algorithms applied in different research studies. Some of the well-known procedures are repeated cross-validation (the cross-validation is repeated on the entire dataset multiple times, resulting in different splits of the data, and the variation in the resulting prediction performance is analyzed), stratified cross-validation (where the folds are constructed so that each one preserves the distribution of the endpoint), repeated stratified NeCV, and so on. There are multiple validation algorithms out there, and the terms vary from study to study; the NeCV algorithm explained here has been referred to as double cross-validation in some studies. Anyhow, the cool thing is you get to explore and pick your own poison (weapon?) suited to your problem.
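To give a flavor of those variants, scikit-learn ships ready-made splitters for several of them; a sketch of how they would slot into the earlier code (the fold and repeat counts here are arbitrary illustrative choices):

```python
from sklearn.model_selection import (RepeatedKFold, StratifiedKFold,
                                     RepeatedStratifiedKFold)

# Repeated CV: the whole 10-fold procedure is rerun 5 times with different splits.
repeated_cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)

# Stratified CV: each fold preserves the class distribution of the endpoint
# (for classification; a continuous endpoint would first need to be binned).
stratified_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Repeated stratified CV: both ideas combined; any of these splitters can serve
# as the outer and/or inner CV in the nested procedure sketched earlier.
rep_strat_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
```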

One last note before I end this blog post … even though building machine learning models can look pretty doable and easy, it is challenging to build reliable, robust models of industrial-strength quality. You have to know your dataset well: ensure that it contains high-quality, well-curated data, apply several data analysis and visualization techniques to ascertain that the dataset is, in fact, modelable, and minimize the biases that can be introduced during the model-building process. You have to carefully select the machine learning algorithm, the descriptors, the train/test set splitting technique (especially when working with difficult distributions or skewed endpoints), the validation procedure, and so on. You also need to put more effort into descriptor selection if you are using relatively simple machine learning algorithms; more complex algorithms are better able to work out on their own which descriptors are useful for the endpoint. And oh yes, you also have to define the applicability domain for your model, because you will need to know whether the new compounds you are interested in predicting fall within the chemical scope of your model, and how certain and/or reliable your model’s predictions will be for those compounds. You cannot train your models on different types of birds and expect them to make predictions on what type of cars are in the picture.

The next blog post, part 4 of this series, will be a tutorial in which I share the Python NeCV implementation code for two chemical datasets. Phewww … honestly, I am dreading it a little, but I will try my best. Please stay tuned.

Thank you so much for reading this post. As always, I welcome constructive feedback and suggestions. Please do let me know if you see any mistakes or issues as well.