Saturday, 27 September 2025

Credibility assessment of predictive biophysical models: a well-established practice

A recent post by @BryceKillen showed that there is still much work to be done to disseminate among biomechanics researchers the mature regulatory science that is now well established for biophysical models.  What follows are some reflections I hope you will find useful; if any editor of a biomechanics journal thinks this debate is worth having, I could expand this into a longer version suitable for publication as a commentary.

A good starting point for those interested is the Open Access book “Toward Good Simulation Practice”, which we published a few years ago.  I will draw freely from it, and in particular from Section 2, “Theoretical Foundations of Good Simulation Practice”.  There, we explain that models can be used in science:

as tools used in the development and testing of new theories

as tools for problem-solving

The first case is more complex, but it is rarely the purpose of biomechanical modelling; let us just say that, in that context, there is no such thing as model credibility, only models (and thus the theories used to build them) that are falsified (or not) by experiments. When a model is used for problem-solving, we speak instead of the model’s credibility. The two cases should never be confused.


If we now restrict ourselves to problem-solving, there is a well-established body of knowledge, both theoretical and empirical, that all biomechanics researchers using predictive models should be familiar with. In addition to the book mentioned above, probably the best source is the ASME V&V 40-2018 technical standard “Assessing Credibility of Computational Modeling through Verification and Validation: Application to Medical Devices”. While the standard focuses on medical devices, ample literature and various regulatory authorities have acknowledged its general validity for any biomedical problem-solving application involving first-principles models (models based on prior mechanistic knowledge).

A first key step in assessing credibility is to define the context of use. A model is never credible in general, but only with respect to a specific context of use. An important feature of a context of use is the acceptability threshold, defined by a norm and a value: the norm reduces the Quantity of Interest (QI, what we want to predict) to a scalar, and the value is the maximum acceptable error according to that norm for that specific context of use.
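As a minimal sketch of this idea (the norm choice, threshold value, quantity, and numbers below are all mine, invented for illustration, and not taken from any standard), the acceptability threshold can be represented as a pair: a norm function that reduces the QI prediction error to a scalar, plus the maximum value that scalar may take for the chosen context of use.

```python
import numpy as np

def rmse(predicted, measured):
    """Example norm: reduce the QI prediction error to a single scalar (RMSE)."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return float(np.sqrt(np.mean((predicted - measured) ** 2)))

# Hypothetical context of use: predicting peak joint contact force (in body weights).
acceptability_threshold = 0.5   # maximum acceptable RMSE for this context of use (assumed value)

predicted_qi = [2.9, 3.4, 2.7]  # model predictions of the QI
measured_qi  = [3.1, 3.2, 2.8]  # experimental measurements of the QI

error = rmse(predicted_qi, measured_qi)
print(f"RMSE = {error:.2f}, acceptable = {error <= acceptability_threshold}")
```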

This assumes that, at least under certain conditions, I can measure the QI with an error at least one order of magnitude smaller than the acceptability threshold for the predicted QI; this makes it possible to treat the measured values as true values and to attribute all differences from the predicted values to the prediction error (validation).  When this is not possible, things get tricky (but that discussion is beyond the scope of this comment).
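Continuing the same illustrative sketch (the numbers are invented), that assumption amounts to a simple check before any measurement is accepted as ground truth for validation:

```python
# Before treating measurements as "true" values for validation, check that their
# uncertainty is at least one order of magnitude below the acceptability threshold.
measurement_uncertainty = 0.03   # e.g. instrument accuracy on the measured QI (assumed)
acceptability_threshold = 0.5    # same threshold as in the sketch above

can_validate = measurement_uncertainty * 10 <= acceptability_threshold
print(f"Measured values usable as true values for validation: {can_validate}")
```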

My model predicts the QI as a function of certain inputs.  If, as in most cases, at least one of the inputs is a continuous quantity, then even within the limited range of admissible values (the input space) there are infinitely many possible inputs.  We define a model as credible if its prediction error is smaller than the acceptability threshold for every possible valid input.  To demonstrate this, one should proceed by induction, computing the prediction error for a very large number of valid input sets covering the entire input space.  Unfortunately, models are used precisely when measurements are difficult to obtain, so in most cases there is a scarcity of experimental true values to compare against.
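The sketch below shows what such a demonstration "by induction" would require if experimental true values were available everywhere: sample the input space densely and check the error at every sampled input. Everything here is a placeholder (the two-dimensional input space, its bounds, the model, and the "experiment" are all invented for illustration).

```python
import numpy as np
from scipy.stats import qmc

# Invented 2-D input space: e.g. body mass [kg] and walking speed [m/s].
lower_bounds = [50.0, 0.8]
upper_bounds = [110.0, 1.8]

# Cover the input space with a large number of valid input sets (Latin hypercube).
sampler = qmc.LatinHypercube(d=2, seed=0)
inputs = qmc.scale(sampler.random(n=1000), lower_bounds, upper_bounds)

def model(x):
    """Placeholder predictive model of the QI (stands in for the real biophysical model)."""
    mass, speed = x
    return 0.03 * mass + 0.9 * speed

def true_value(x):
    """Placeholder 'experiment': in practice these measurements are rarely available."""
    mass, speed = x
    return 0.03 * mass + 0.9 * speed + np.random.default_rng(int(mass * 100)).normal(0, 0.05)

acceptability_threshold = 0.5
errors = np.array([abs(model(x) - true_value(x)) for x in inputs])

# Credibility "by induction": the prediction error stays below the threshold everywhere.
print(f"Max error over {len(inputs)} sampled inputs: {errors.max():.3f}")
print(f"Credible over the sampled input space: {errors.max() <= acceptability_threshold}")
```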


Enter Verification, Validation, Uncertainty Quantification and Applicability analysis (VVUQA).  This developed as an engineering practice, but has since received theoretical foundations (e.g., https://doi.org/10.1109/JBHI.2019.2949888). The idea is that even if I do not have enough validation experiments to demonstrate the credibility of a model by induction, I can still assess its credibility by decomposing its prediction error among its various sources and confirming that each error component behaves as expected. This supports the regularity assumption needed to conclude that the prediction error obtained with a finite set of validation experiments is representative enough to assess the credibility of the model.

Biophysical models are affected by approximation (numerical), aleatoric, and epistemic errors.  The VVUQA process first calculates the overall prediction error over the available validation experiments and confirms that it is smaller than the acceptability threshold. Then, various techniques can provide upper bounds for the approximation error, so as to demonstrate that this error component is negligible compared to the other two. Uncertainty quantification methods aim to demonstrate that the aleatoric component of the error is normally distributed with a mean close to zero, which allows us to take a mean norm, such as the Root Mean Square Error, as a measure of the epistemic component of the prediction error. Last, applicability analysis looks at how the prediction error varies with the input values and, from that, decides how much we can trust predictions made for input values far from those tested in the validation experiments.
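A rough sketch of those four steps follows. All data, bounds, and thresholds are invented placeholders; this is not a reference implementation of ASME V&V 40 or of VVUQA, only an outline of the checks described above.

```python
import numpy as np
from scipy import stats

acceptability_threshold = 0.5

# Step 1: overall prediction error over the available validation experiments.
predicted = np.array([2.9, 3.4, 2.7, 3.0, 3.3])
measured  = np.array([3.1, 3.2, 2.8, 3.1, 3.1])
residuals = predicted - measured
rmse = float(np.sqrt(np.mean(residuals ** 2)))
assert rmse <= acceptability_threshold, "Overall prediction error exceeds the threshold"

# Step 2 (verification): bound the numerical approximation error, e.g. from a mesh
# refinement study, and check it is negligible compared to the total error.
approximation_error_bound = 0.01   # assumed result of a convergence study
assert approximation_error_bound < 0.1 * rmse

# Step 3 (uncertainty quantification): check the aleatoric component is compatible
# with a zero-mean normal distribution, so the RMSE can be read as epistemic error.
_, p_value = stats.shapiro(residuals)
zero_mean = abs(residuals.mean()) < residuals.std(ddof=1)
print(f"Residuals compatible with normality (p = {p_value:.2f}), mean close to zero: {zero_mean}")

# Step 4 (applicability): inspect how the error varies with the inputs, e.g. the
# correlation between error magnitude and an input, to judge extrapolation risk.
validation_inputs = np.array([60.0, 70.0, 80.0, 90.0, 100.0])   # invented single input
corr = np.corrcoef(validation_inputs, np.abs(residuals))[0, 1]
print(f"Correlation between |error| and input: {corr:.2f}")
```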

With this well-established practice, biomechanics researchers can assess whether the predictive accuracy of their model is good enough for a defined problem-solving context of use.


