Saturday, 21 November 2020

On the regulatory validation of AI models


Accepted epistemology suggests that a theory cannot be confirmed, since this would require infinite tests, but only falsified. So, we can never say a theory is true, only that it has not been disproved so far.  However, falsification attempts are not made at random; they are purposely crafted to seek out every possible weakness. So the process empirically works: all theories that resisted falsification for some decades were not falsified subsequently, at most extended. For example, special relativity addressed the special case of bodies travelling close to the speed of light, but it did not truly falsify the second law of dynamics.  In fact, if you write Newton’s law as F = dp/dt, where p is the momentum, even the case where the mass varies due to relativistic effects is included.
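Written out explicitly (standard textbook notation, with m_0 the rest mass and γ the Lorentz factor):

```latex
% Newton's second law in momentum form; with the relativistic momentum
% p = gamma * m_0 * v it still holds, and for v << c it reduces to F = m_0 a.
F = \frac{dp}{dt}, \qquad
p = \gamma m_0 v, \qquad
\gamma = \frac{1}{\sqrt{1 - v^2/c^2}}, \qquad
F \xrightarrow{\; v \ll c \;} m_0 \frac{dv}{dt} = m_0 a
```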

But once a theory has resisted extensive attempts at falsification, for all practical purposes we assume it is true, and use it to make predictions.  However, our predictions will be affected by some error, not because the theory is false, but because of how we use it to make predictions.  An accepted approach suggests that the prediction error of a mechanistic model can be described as the sum of the epistemic error (due to our imperfect application of the theory to the physical reality being predicted), the aleatoric error (due to the uncertainty affecting all the measured quantities we use to inform the model), and the numerical solution error, present only when the equations that describe the model are solved numerically.  For mechanistic models, based on theories that have resisted extensive falsification, validation simply means the quantification of the prediction error, ideally in all three of its components (which gives rise to the verification, validation and uncertainty quantification (VV&UQ) process).
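As a minimal, purely illustrative sketch of this decomposition (a toy falling-body model; every number, uncertainty and “measurement” below is made up, and a real VV&UQ exercise is far more involved):

```python
import numpy as np

rng = np.random.default_rng(0)

g = 9.81        # model parameter (m/s^2)
t_meas = 2.0    # measured fall time (s) -- assumed value
t_sigma = 0.02  # measurement uncertainty on t (s) -- assumed value
d_meas = 19.9   # independently measured fall distance (m) -- assumed value

def model(t, dt=1e-3):
    """Toy mechanistic model: integrate dv/dt = g, dx/dt = v with explicit Euler."""
    x, v = 0.0, 0.0
    for _ in range(int(round(t / dt))):
        x += v * dt
        v += g * dt
    return x

# Numerical error: compare a coarse solution with a refined one.
coarse, fine = model(t_meas, dt=1e-2), model(t_meas, dt=1e-4)
numerical_error = abs(coarse - fine)

# Aleatoric component: propagate the input measurement uncertainty by Monte Carlo.
samples = np.array([model(t) for t in rng.normal(t_meas, t_sigma, 200)])
aleatoric_spread = samples.std()

# Epistemic component: what remains between the refined prediction and the measurement.
epistemic_error = abs(fine - d_meas)

print(f"numerical ~ {numerical_error:.3f} m, "
      f"aleatoric ~ {aleatoric_spread:.3f} m, "
      f"epistemic ~ {epistemic_error:.3f} m")
```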

A phenomenological model is defined as a predictive model that does not use any prior knowledge to make predictions, but only prior observations (data). Analytical AI models are a type of phenomenological model. When we talk about validation for a phenomenological model, we do not simply mean the quantification of the prediction error; since the phenomenological model contains an implicit theory, its validation is more closely related to the falsification of a theory. And while an explicitly formulated theory can be purposely attacked in our falsification attempts, the implicit nature of phenomenological models forces us to use brute-force approaches to falsification.  This brings us to the curse of induction: a phenomenological model is never validated; we can only say that, with respect to the validation sets we have used to challenge it so far, the model has resisted our falsification attempts. But in principle nothing guarantees that, at the next validation set, the model will not be proven totally wrong.

Following this line of thought, one would conclude that locked AI models cannot be trusted.  The best we can do is to formulate AI testing as a continuous process: as new validation sets are produced, the model must be retested again and again, and at most we can say it is valid “so far”.
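A minimal sketch of what such continuous re-testing could look like (the locked model, the data and the acceptance threshold are all placeholders, not a prescribed procedure):

```python
import numpy as np

rng = np.random.default_rng(1)
ACCEPTABLE_MAE = 0.5  # regulatory acceptance threshold -- assumed value

def locked_model(x):
    """Stand-in for a frozen, already-deployed AI model."""
    return 2.0 * x + 0.1

history = []
for k in range(1, 6):  # each iteration = a newly produced validation set
    x = rng.uniform(0, 1, size=100)
    y_true = 2.0 * x + rng.normal(0, 0.2, size=100)  # new observations
    mae = np.abs(locked_model(x) - y_true).mean()
    history.append(mae)
    status = "valid so far" if max(history) <= ACCEPTABLE_MAE else "FALSIFIED"
    print(f"validation set {k}: MAE = {mae:.3f} -> {status}")
```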

But the world is not black and white.  For example, while purely phenomenological models do exist, purely mechanistic models do not. A simple way to prove this is to consider that, since the space-time resolution of our instruments is finite, every mechanistic model has some limits of validity imposed by the particular space-time scale at which we model the phenomenon of interest.  The second law of dynamics is no longer strictly valid close to the speed of light, and it also breaks down at the quantum scale.  To address this problem, virtually EVERY mechanistic model must include two phenomenological portions: one that describes everything bigger than our scale as boundary conditions, and one that describes everything smaller than our scale as constitutive equations. All this is to say that there are black-box models and grey-box models, but no white-box models. At most, light-grey models.  So what?  Well, if every model includes some phenomenological portion, in theory VV&UQ cannot be applied, because of the arguments above. But VV&UQ works, and we trust our lives to airplanes and nuclear power stations because it works.

Which brings us to another issue. Above I wrote: “in principle nothing guarantees that, at the next validation set, the model will not be proven totally wrong”. Well, this is not true.  If the phenomenological model is predicting a physical phenomenon, we can postulate some properties.  One very important property, which comes from the conservation principles, is that all physical phenomena show some degree of regularity. If Y varies with X, and for X = 1, Y = 10, and for X = 1.0002, Y = 10.002, then when X = 1.0001 we can safely state that it is impossible that Y = 100,000, or 0.0003.  Y must have a value in the order of 10, because of the inherent regularity of physical processes.  Statisticians recognise this from another perspective (a purely phenomenological one) by saying that any finite sampling of a random variable might be non-normal, but if we add them, eventually the resulting distribution will be normal (central limit theorem).  This means that my estimate of an average value will converge asymptotically to the true average value, associated with an infinite sample size.
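A minimal numerical illustration of this statistical point (the exponential distribution and the sample sizes are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(42)
true_mean = 1.0  # mean of an exponential distribution with scale 1

# Means of many small samples from a decidedly non-normal (exponential)
# variable: by the central limit theorem, their distribution is
# approximately normal around the true mean.
batch_means = rng.exponential(true_mean, size=(10_000, 50)).mean(axis=1)
print(f"mean of batch means: {batch_means.mean():.4f}, std: {batch_means.std():.4f}")

# And the estimate of the mean converges to the true value as the sample
# size grows (consistency / law of large numbers).
for n in (10, 1_000, 100_000):
    est = rng.exponential(true_mean, n).mean()
    print(f"n = {n:>6}: estimated mean = {est:.4f}  (true = {true_mean})")
```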

Thus, we can say that the estimate of the average prediction error of a phenomenological model, as we increase the number of validation sets, will converge asymptotically to the true average prediction error.  This means that, if the number of validation sets is large enough, the value of the estimate will change monotonically, and its derivative will also decrease monotonically. This makes it possible to reliably estimate an upper bound of the prediction error, even with a finite number of validation sets.
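As a hedged sketch of how such an upper bound could be tracked in practice (the errors are simulated, and the normal approximation and the 95% level are assumptions of the example, not a prescribed regulatory procedure):

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated absolute prediction errors, one per validation set (assumed data).
errors = rng.gamma(shape=2.0, scale=0.5, size=500)

for n in (10, 50, 200, 500):
    e = errors[:n]
    mean = e.mean()
    # One-sided 95% upper bound via the normal approximation of the mean (CLT).
    upper = mean + 1.645 * e.std(ddof=1) / np.sqrt(n)
    print(f"after {n:>3} validation sets: "
          f"mean error = {mean:.3f}, 95% upper bound = {upper:.3f}")
```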

It is too early in the day to come to any conclusion on whether, and how, the credibility of AI-based predictors can be evaluated from a regulatory point of view.  Here I have tried to show some aspects of the debate. But personally, I am optimistic: I believe we can reliably estimate the predictive accuracy of all models of physical processes, including the purely phenomenological ones.

A final word of caution:  this is definitely not true when the AI model is trying to predict a phenomenon affected by non-physical determinants. Predictions involving psychological, behavioural, sociological, political or economic factors cannot rely on the inherent properties of physical systems, and thus, in my humble opinion, such phenomenological models can never be truly validated.  I can probably validate an AI-based model that predicts walking speed from the measured acceleration of the body's centre of mass, but not one that predicts whether a subject will go out walking today.


2 comments:

  1. “Statisticians recognise this from another perspective (a purely phenomenological one) by saying that any finite sampling of a random variable might be non-normal” (MY COMMENTS ARE IN CAPITALS FOR CLARITY)
    I THINK THAT THE ABOVE SENTENCE IS NOT SHAREABLE FROM A STATISTICAL POINT OF VIEW. "NORMAL" OR "NOT NORMAL" DOES NOT DEPEND ON WHETHER WE ARE TALKING ABOUT A POPULATION OR ABOUT A SAMPLE FROM THAT POPULATION. THE POPULATION IS NORMAL (GAUSSIAN) OR NOT NORMAL, AND THE SAMPLE IS A RANDOM SAMPLE FROM A NORMAL (GAUSSIAN) OR NOT NORMAL POPULATION.
    “but if we add them, eventually the resulting distribution will be normal (central limit theorem).”
    AGAIN, AS ABOVE: IF WE ADD SAMPLES FROM A NORMAL (OR NOT NORMAL) POPULATION, WE OBTAIN A SAMPLE WITH AN INCREASED NUMBER OF SUBJECTS FROM THE PARENT POPULATION. THE CENTRAL LIMIT THEOREM APPLIES TO THE DISTRIBUTION OF THE MEAN VALUES OF SAMPLES FROM NON-NORMAL POPULATIONS, WHICH TENDS TO A NORMAL DISTRIBUTION, BUT IT IS ALWAYS AN APPROXIMATION.
    “This means that my estimate of an average value will converge asymptotically to the true average value, associated with an infinite sample size.”
    THE ABOVE SENTENCE IS VALID FOR CONSISTENT ESTIMATORS / ESTIMATES, AS IS USUALLY THE CASE. IN ADDITION, THIS FACT DOES NOT DEPEND ON THE CENTRAL LIMIT THEOREM.

    Replies
    1. Sorry, but for some reason I was not notified of your comment, so I saw it only now. Thank you for correcting a couple of imprecisions in my language regarding the statistical aspects. I agree with both comments, which however do not change the essence of my post.
