All Models are Wrong, 7 Sources of Model Risk
The 2008 financial crisis revealed to the world, in spectacular fashion, the fragility of financial models. Since the crisis two words have come up time and time again: model risk. This article defines model risk and discusses some of the contributors to it which people overlook. These contributors include the assumptions every quant has a love-hate relationship with, namely linearity, stationarity, and normality, as well as some of the statistical biases which rear their heads when developing models or just working with historical data.
Models are used to represent some object in the real world. We build models so that we can use them to infer things about that object in the real world. For example, CAD models of buildings allow engineers to infer how those buildings would behave in the event of an earthquake. The risk in using these models is that the model is significantly wrong: if it is, the results from the model will not reconcile with the real world, and any and all inferences or decisions made using that model will be erroneous.
There are many types of financial models but they are all used to represent something from the world of finance. An interesting challenge with financial modelling is that the things being modeled are, more often than not, imaginary. The most popular type of model is a valuation model. Valuation models are used to estimate the fair value of an illiquid and / or complex security or portfolio. Valuation models are also used to calculate the sensitivities used in risk and capital management models and in the construction of hedging strategies.
Model risk in finance is defined as the risk of financial loss resulting from the use of financial models. Put simply, it is the risk of being wrong; but to be more specific it is the risk of being very wrong, such as we were in 2008. In reality all quants should remember the following famous quote by George E. P. Box:
"Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful."
Assumptions are unavoidable when building financial models, so the challenge is in making assumptions which do not render the model useless for its intended purposes. One maxim which I often advocate when discussing Machine Learning models is Ockham's Razor. Ockham's Razor argues that when choosing between two models of equivalent prediction accuracy, the one with fewer parameters and / or assumptions will generalize better. Many practitioners misinterpret this argument to mean that "simpler models are better". This is simply not true.
Simpler is only better if the models are of equivalent prediction accuracy. If they aren't, then simpler models are often inadequate and under-fit the data. Unfortunately, I believe that researchers like to use the "simpler models are better" argument to get away with using dangerous assumptions which make their models more tractable, more elegant, but also more wrong. The benefit of Computational Finance is that it allows us to create and deal with more intractable, less elegant, but also more realistic and (hopefully) more accurate models.
The remainder of this section presents three dangerous assumptions which I see being used everywhere. Oftentimes these assumptions are used without giving them a second thought. I am not saying that these assumptions make models so wrong that they are rendered useless; I am saying that most researchers don't realize they are making these assumptions, and that is dangerous. Before that, here are Ockham's Razor and an anti-razor:
Entities must not be multiplied beyond necessity - William of Ockham
Entities must not be reduced to the point of inadequacy - Karl Menger
1. Linearity
Linearity is the assumption that the relationship between any two variables can be expressed using a straight line graph. Linearity is a common assumption buried in financial models because most correlation metrics are linear measurements of the relationship between two variables. Some correlation measures cater for non-linearity.
The first problem with using correlations is that you might mistakenly conclude that there is a linear relationship between two variables, when in fact there is a non-linear relationship. In this scenario your model will work well over small ranges, where a straight line approximates the curve, but poorly over large ones. Alternatively you might conclude that there is no relationship between two variables, when in fact there is a non-linear one. In this scenario your model would not capture the intricacies of the system being modeled and would most probably suffer from poor accuracy.
In other words, if the relationships are non-linear then any linear measure will either not detect the relationship at all, or under- or overestimate its strength. You may be wondering why this is a problem.
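Both failure modes can be reproduced with a few lines of NumPy. This is a minimal sketch on made-up data (the two series and the `pearson` helper are illustrative assumptions, not taken from any real market): the first pair is perfectly related through a parabola yet has a linear correlation of roughly zero, and the second pair is perfectly related through an exponential yet its correlation is well below one.

```python
import numpy as np

# Hypothetical data: y depends entirely on x, but through a parabola.
x = np.linspace(-1.0, 1.0, 201)
y_quadratic = x ** 2

# Hypothetical data: y grows monotonically but non-linearly with x.
x_pos = np.linspace(0.0, 5.0, 201)
y_exponential = np.exp(x_pos)

def pearson(a, b):
    """Pearson (linear) correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

corr_quadratic = pearson(x, y_quadratic)          # relationship not detected
corr_exponential = pearson(x_pos, y_exponential)  # strength understated

print(f"x vs x^2:    {corr_quadratic:+.3f}")
print(f"x vs exp(x): {corr_exponential:+.3f}")
```

In both cases one variable is a deterministic function of the other, so a measure of dependence that "saw" the relationship would report a perfect link.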
Firstly, in portfolio management the diversification benefit of the portfolio is captured using the historical correlation matrix of the constituent assets' returns. If the relationship between any two assets is non-linear (such as with some derivatives) then correlation will either over- or understate the diversification benefit, and the portfolio will be more or less risky than expected. Secondly, if a company is reserving capital for solvency reasons and it assumes a linear relationship between various risk factors, then this could result in the company holding too little or too much capital. In that case the stress tests will not truly reflect the company's risk.
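To make the portfolio point concrete, here is a small sketch of how the assumed correlation feeds into portfolio volatility through the usual formula sqrt(w' Σ w). The weights, volatilities, and the two correlation values are hypothetical numbers chosen purely for illustration.

```python
import numpy as np

def portfolio_volatility(weights, vols, corr):
    """Portfolio volatility sqrt(w' Sigma w), with Sigma built from vols and corr."""
    weights = np.asarray(weights, dtype=float)
    vols = np.asarray(vols, dtype=float)
    cov = np.outer(vols, vols) * np.asarray(corr, dtype=float)
    return float(np.sqrt(weights @ cov @ weights))

def two_asset_corr(rho):
    """2x2 correlation matrix for a single pairwise correlation rho."""
    return [[1.0, rho], [rho, 1.0]]

weights = [0.5, 0.5]   # hypothetical equal-weighted portfolio
vols = [0.20, 0.20]    # hypothetical 20% volatility for each asset

vol_assumed = portfolio_volatility(weights, vols, two_asset_corr(0.2))
vol_realised = portfolio_volatility(weights, vols, two_asset_corr(0.9))

print(f"volatility if correlation is 0.2: {vol_assumed:.4f}")
print(f"volatility if correlation is 0.9: {vol_realised:.4f}")
```

If the measured correlation understates how closely the two assets really move together, the portfolio carries noticeably more risk than the first number suggests.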
Additionally, if you are performing some sort of classification and the relationship between two classes in the data is non-linear, then your classifier could incorrectly assume that there is just one class in the data. One nice workaround which allows you to classify non-linear data using a linear classifier is the kernel trick. This technique adds an additional dimension, such as similarity, which allows us to linearly separate the two different classes.
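The idea of adding a dimension can be sketched with two concentric rings of points, which no straight line in the plane can separate. Note the assumption here: the sketch computes the extra feature explicitly (squared distance from the origin), whereas the kernel trick proper obtains the same effect implicitly through a kernel function. The data, the `ring` helper, and the threshold are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n):
    """n noisy points scattered around a circle of the given radius."""
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    radii = radius + rng.normal(0.0, 0.1, n)
    return np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

inner = ring(1.0, 200)   # class 0
outer = ring(3.0, 200)   # class 1
points = np.vstack([inner, outer])
labels = np.array([0] * 200 + [1] * 200)

# Added dimension: squared distance from the origin, z = x1^2 + x2^2.
z = (points ** 2).sum(axis=1)

# In the lifted space a flat plane z = threshold separates the classes.
threshold = 4.0
predicted = (z > threshold).astype(int)
accuracy = float((predicted == labels).mean())

print(f"accuracy of the linear rule in the lifted space: {accuracy:.2f}")
```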
2. Stationarity
Stationarity is the assumption that a variable, or the distribution from which a random variable is sampled, is constant over time. In many fields stationarity is a reasonable assumption. As an example, the gravitational constant isn't suddenly going to change from one day to the next, so it can be treated as a constant. The same assumption cannot be made as lightly for financial markets, which are complex adaptive systems.
In the context of model risk, three things are often assumed to be stationary which are, more often than not, non-stationary, namely correlations, volatility, and risk factors. Each of these presents its own problems.
Assuming that correlations are stationary is dangerous because, as mentioned previously, they are used to measure the diversification benefit of portfolios. Diversification is the reduction in risk as a result of holding many securities which either move oppositely to one another (negatively correlated) or just differently to one another (low correlation). Unfortunately, correlations are not stable over time and tend to break down during market downturns. In other words, right when you need diversification the most is when you don't have it.
The above chart shows the rolling correlations for stocks included in the Financial 15 index in South Africa. There are logical reasons why these stocks should be correlated but, as can be seen from the image, there are times during which those correlations break down. In my opinion the reason for this is leverage. Stocks across different asset classes are connected by the firms which trade across them. This is discussed here in more detail.
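For readers who want to reproduce this kind of analysis, the following is a minimal sketch of a rolling correlation on simulated data. Everything in it is an assumption made for illustration: the two regimes (a calm period with correlation 0.2 followed by a stressed period with correlation 0.9), the window length, and the helper names. It does not use the Financial 15 data.

```python
import numpy as np

rng = np.random.default_rng(42)

def correlated_returns(n, rho, vol=0.01):
    """n days of returns for two assets with correlation rho."""
    z1 = rng.standard_normal(n)
    z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)
    return vol * z1, vol * z2

# Hypothetical regimes: a calm period followed by a stressed period.
calm_a, calm_b = correlated_returns(500, rho=0.2)
stress_a, stress_b = correlated_returns(500, rho=0.9)
returns_a = np.concatenate([calm_a, stress_a])
returns_b = np.concatenate([calm_b, stress_b])

def rolling_correlation(a, b, window):
    """Pearson correlation over each trailing window of the two series."""
    out = np.full(len(a), np.nan)
    for end in range(window, len(a) + 1):
        out[end - 1] = np.corrcoef(a[end - window:end], b[end - window:end])[0, 1]
    return out

rolling = rolling_correlation(returns_a, returns_b, window=100)
full_sample = float(np.corrcoef(returns_a, returns_b)[0, 1])

print(f"full-sample correlation:       {full_sample:.2f}")
print(f"rolling, end of calm regime:   {rolling[499]:.2f}")
print(f"rolling, end of stress regime: {rolling[999]:.2f}")
```

A single full-sample number blends the two regimes and describes neither of them, which is exactly the risk of treating correlation as stationary.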
"The correlations between financial quantities are notoriously unstable." - Paul Wilmott
Another variable which is often assumed to be stationary, especially when using stochastic processes to model security returns and prices, is volatility. Volatility is a measure of how much a security's returns vary over time. Generally speaking the following relationship holds for options and similar derivative securities - higher volatility equates to higher prices. Why? Well, because a more volatile underlying has a higher probability of finishing well in the money come expiry, while the holder's downside stays capped. As such, if your model underestimates volatility it will probably under-price the derivatives.
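The link between volatility and price can be checked with the standard Black-Scholes formula for a European call on a non-dividend-paying asset. This is a sketch for one vanilla option only; the spot, strike, rate, and maturity below are hypothetical inputs.

```python
from math import erf, exp, log, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(spot, strike, rate, vol, maturity):
    """Black-Scholes price of a European call on a non-dividend-paying asset."""
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (vol * sqrt(maturity))
    d2 = d1 - vol * sqrt(maturity)
    return spot * normal_cdf(d1) - strike * exp(-rate * maturity) * normal_cdf(d2)

# Hypothetical at-the-money option: spot 100, strike 100, 5% rate, 1 year.
price_low_vol = black_scholes_call(100.0, 100.0, 0.05, 0.20, 1.0)
price_high_vol = black_scholes_call(100.0, 100.0, 0.05, 0.30, 1.0)

print(f"call price at 20% volatility: {price_low_vol:.2f}")
print(f"call price at 30% volatility: {price_high_vol:.2f}")
```

A desk that prices this option with 20% volatility when the true figure is closer to 30% would be selling it well below the model's own fair value.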
The stochastic process which underpins the Black Scholes model is Geometric Brownian Motion. This model assumes a constant volatility over time. Notice the difference between the range of possible returns when using this model versus the Heston model, which uses the CIR process to model stochastic volatility.
In the first image the range of potential end values is between 500 and 2000 whereas in the second image the range of potential end values is between 500 and 2500. This is an example of the impact of volatility.
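A comparison along these lines can be sketched with a simple Euler discretisation. The code below is an illustration under stated assumptions, not the simulation behind the images: all parameter values are hypothetical, negative variances are handled with the common "full truncation" fix, and setting the volatility-of-volatility `xi` to zero (with `v0` equal to `theta`) collapses the model to Geometric Brownian Motion.

```python
import numpy as np

def simulate_terminal_log_returns(n_paths=20000, n_steps=252, horizon=1.0,
                                  mu=0.05, v0=0.04, kappa=2.0, theta=0.04,
                                  xi=0.0, rho=-0.7, seed=0):
    """Euler scheme (full truncation) for a Heston-style model.

    With xi = 0 and v0 = theta the variance never moves, which reduces the
    model to Geometric Brownian Motion with volatility sqrt(theta).
    """
    rng = np.random.default_rng(seed)
    dt = horizon / n_steps
    log_return = np.zeros(n_paths)
    variance = np.full(n_paths, v0)
    for _ in range(n_steps):
        z_price = rng.standard_normal(n_paths)
        z_var = rho * z_price + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n_paths)
        v_pos = np.maximum(variance, 0.0)  # truncate negative variance at zero
        log_return += (mu - 0.5 * v_pos) * dt + np.sqrt(v_pos * dt) * z_price
        variance += kappa * (theta - v_pos) * dt + xi * np.sqrt(v_pos * dt) * z_var
    return log_return

def excess_kurtosis(x):
    """Sample excess kurtosis (zero for a normal distribution)."""
    centred = x - x.mean()
    return float((centred ** 4).mean() / (centred ** 2).mean() ** 2 - 3.0)

gbm = simulate_terminal_log_returns(xi=0.0)     # constant volatility
heston = simulate_terminal_log_returns(xi=0.5)  # stochastic volatility

print("GBM excess kurtosis:   ", round(excess_kurtosis(gbm), 3))
print("Heston excess kurtosis:", round(excess_kurtosis(heston), 3))
```

Both runs share the same long-run variance, so any difference in the tails of the terminal distribution comes from letting volatility move.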
Last, but not least, many investors when backtesting strategies implicitly assume that risk factors are constant over time. In reality risk factors such as momentum, value, mean-reversion, and firm-size may become stronger or weaker over time as market demographics change. In the height of a bubble momentum is driving returns; in the depth of a recession the value factor is probably driving returns (if there are any). Risk factors are cyclical.
This animated graph illustrates quite nicely a dynamic distribution and how a genetic algorithm adapts to changes in the distribution over time. Such dynamic algorithms are required for risk management.
3. Normality
Normality is the assumption that a random variable follows a normal distribution. Normal distributions, also known as Gaussian distributions, are convenient for many reasons. Firstly, any linear combination of independent normally distributed random variables is itself normally distributed. Secondly, the normal distribution is easy to manipulate algebraically, meaning that academics can more easily arrive at elegant closed-form solutions to complex problems.
The problem here is that many models, including the delta-normal approach to calculating Value at Risk and Geometric Brownian Motion (which underpins the Black Scholes model), assume that market returns are normally distributed (log returns, in the case of Geometric Brownian Motion). In actual fact market returns exhibit excess kurtosis, i.e. much fatter tails than the normal distribution allows for. What this means is that companies often underestimate the amount of tail risk they are exposed to and are unprepared for market crashes.
Case in point - the 1987 Stock Market crash. On October 19, 1987 many stock markets across the world fell by more than 20%. This is still, to date, the single worst one-day drop in the S&P 500 since the 1950s. The interesting thing is that in the normal world (i.e. the world which follows the normal distribution) this event should never have happened. The 1987 Stock Market crash was an 18-sigma event, which under a normal distribution is essentially a statistical impossibility.
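To get a feel for what "statistical impossibility" means, the tail probability of a k-sigma daily loss under a normal distribution can be computed directly from the complementary error function. The figure of 252 trading days per year is an assumption used only to convert probabilities into waiting times.

```python
from math import erfc, sqrt

def normal_tail_probability(sigmas):
    """P(Z <= -sigmas) for a standard normal variable Z."""
    return 0.5 * erfc(sigmas / sqrt(2.0))

TRADING_DAYS_PER_YEAR = 252  # assumption for the illustration

for k in (3, 5, 10, 18):
    p = normal_tail_probability(k)
    years = 1.0 / (p * TRADING_DAYS_PER_YEAR)
    print(f"{k:>2}-sigma daily loss: p = {p:.3e}, expected once every {years:.3e} years")
```

For an 18-sigma move the expected waiting time dwarfs the age of the universe, so observing one is strong evidence against the normality assumption rather than a case of extraordinarily bad luck.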
Another example of how the normal distribution assumption can render models essentially useless (at least for risk management) is David X. Li's model of credit default probabilities. This model assumed that credit defaults in a portfolio were correlated according to a Gaussian copula, which is basically the dependence structure of a high-dimensional normal distribution. Then again, perhaps they were (once upon a time) but the distribution changed, i.e. it was non-stationary, because of changes in the macro-economic environment caused by the Federal Reserve Bank?
A statistic is said to be biased if it differs systematically from the population parameter it is meant to estimate. In other words, because of some problem with the way the statistic is calculated, it does not reflect the true population parameter. There are many reasons why this might happen; some of the most common are discussed below.
4. Sampling Bias
Sampling bias, which occurs when a sample is not representative of the population, is often caused by a biased sample selection strategy. Put simply, this means that the probability of any given pattern occurring in the sample is greater than or less than the probability of that pattern appearing in the population. There are many sample selection techniques, but the most popular ones are simple random sampling, systematic sampling, stratified sampling, and multistage (cluster) sampling. Three of these techniques, and their pros and cons, are discussed below.
Under simple random sampling every pattern has an equal chance of being selected as part of a sample. This strategy is easy to implement, fast, and works well when a population contains a single class of patterns. However, when the population contains multiple classes of patterns and the probabilities of a pattern belonging to each of the classes are different, simple random samples may under-represent patterns from classes which have a relatively low probability of occurring. This causes the sample to be non-representative and the statistics to be biased.
Stratified sampling can be used for labeled data where the number of patterns selected from each class is proportional to the size of that class. For example, given patterns belonging to three classes (A, B, and C) where the percentage of the patterns belonging to each is 5%, 70%, and 25% respectively, a sample of 100 patterns could contain exactly 5 patterns from class A, 70 patterns from class B, and 25 patterns from class C. The benefit is that the sample will be representative; the drawback is that the strategy can only be used for labeled data.
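The difference between the two strategies can be sketched on a labeled population with the same 5% / 70% / 25% split as in the example. The population size, helper names, and seed are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical labeled population: 5% class A, 70% class B, 25% class C.
population_labels = np.array(["A"] * 500 + ["B"] * 7000 + ["C"] * 2500)

def simple_random_sample(labels, size):
    """Indices drawn uniformly without replacement, ignoring the labels."""
    return rng.choice(len(labels), size=size, replace=False)

def stratified_sample(labels, size):
    """Indices drawn class by class, in proportion to each class's share."""
    chosen = []
    for cls in np.unique(labels):
        members = np.flatnonzero(labels == cls)
        # Rounding can make the total differ slightly from `size` in general.
        quota = int(round(size * len(members) / len(labels)))
        chosen.append(rng.choice(members, size=quota, replace=False))
    return np.concatenate(chosen)

def class_counts(labels, indices):
    """Number of sampled patterns per class."""
    values, counts = np.unique(labels[indices], return_counts=True)
    return {str(v): int(c) for v, c in zip(values, counts)}

random_counts = class_counts(population_labels, simple_random_sample(population_labels, 100))
stratified_counts = class_counts(population_labels, stratified_sample(population_labels, 100))

print("simple random:", random_counts)
print("stratified:   ", stratified_counts)
```

The stratified sample reproduces the class proportions by construction, whereas the count for the rare class A in the simple random sample varies from draw to draw and can easily land well away from 5.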
Lastly, multi-stage sampling is a sample selection strategy which does stratified sampling on unlabeled data. This strategy involves two steps; firstly, the data is clustered into classes using a clustering algorithm such as K-Means Clustering or Ant Colony Optimization; and secondly, the data is then sampled such that a proportionate amount of each class is represented in the sample. This strategy overcomes the drawbacks of simple random sampling for data-sets containing multiple classes, but hinges on the performance of the chosen clustering algorithm.
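The two steps can be sketched as follows. This is an illustration on synthetic, deliberately well-separated data, using a plain implementation of K-Means (Lloyd's algorithm) with a farthest-first initialisation chosen here for reproducibility; it is not a production clustering routine, and real data will rarely cluster this cleanly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical unlabeled data: three well-separated blobs of unequal size.
data = np.vstack([
    rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
    rng.normal([20.0, 0.0], 1.0, size=(700, 2)),
    rng.normal([0.0, 20.0], 1.0, size=(250, 2)),
])

def k_means(points, k, n_iter=20):
    """Plain Lloyd's algorithm with farthest-first initial centroids."""
    centroids = [points[0]]
    for _ in range(k - 1):
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(dist)])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dist.argmin(axis=1)
        centroids = np.array([points[assignment == j].mean(axis=0) for j in range(k)])
    return assignment

def proportional_sample(assignment, size):
    """Sample indices from each cluster in proportion to the cluster's size."""
    chosen = []
    for j in np.unique(assignment):
        members = np.flatnonzero(assignment == j)
        quota = int(round(size * len(members) / len(assignment)))
        chosen.append(rng.choice(members, size=quota, replace=False))
    return np.concatenate(chosen)

clusters = k_means(data, k=3)                      # step 1: cluster
sample = proportional_sample(clusters, size=100)   # step 2: sample per cluster
cluster_sizes = sorted(np.bincount(clusters).tolist())
sample_sizes = sorted(np.bincount(clusters[sample]).tolist())

print("cluster sizes:", cluster_sizes)
print("sample sizes: ", sample_sizes)
```

If the clustering step had merged or split the blobs, the quotas would be computed from the wrong groups, which is the dependence on the clustering algorithm mentioned above.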
In addition to the biases which may originate from the choice of sample selection strategy, there is the curse of dimensionality: the number of patterns required to produce a representative sample grows exponentially with the number of attributes in those patterns. At high enough dimensions it is almost impossible to produce a truly representative sample, so any statistics will ultimately be biased.
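A rough way to see the exponential growth is to split each attribute into a fixed number of bins and count the resulting grid cells, since a representative sample needs at least some observations in each region of the space. The choice of 10 bins per attribute is an arbitrary assumption for the illustration.

```python
def cells_to_cover(bins_per_attribute, n_attributes):
    """Number of grid cells when each attribute is split into equal bins."""
    return bins_per_attribute ** n_attributes

for d in (1, 2, 5, 10, 20):
    print(f"{d:>2} attributes -> {cells_to_cover(10, d):,} cells to cover")
```

Even with one observation per cell, twenty attributes would call for far more patterns than any historical financial data set contains.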
5. Over and under fitting
Overfitting occurs when a model describes noise (randomness) in the data set rather than the underlying statistical relationship. This often results in fantastic in-sample performance and poor out-of-sample performance. When this occurs the model is said to have low generalization ability. Generally speaking overfitting occurs when a model is overly complex (or rather, when the training strategy is oversimplified given the complexity of the model). Complexity in this regard refers to the number of parameters which may be adjusted in the model.
With four parameters I can fit an elephant, and with five I can make him wiggle his trunk. - John von Neumann
Overfitting happens very easily and, as such, there is a lot of discussion about it on other quant blogs. In my opinion, quants use overfitting as a reason to motivate their Luddite attitude towards the use of complex models such as deep neural networks for quantitative modelling and trading. Many go as far as to claim that simple linear regression models will outperform most complex models over time. Unfortunately these people tend to ignore the more subtle effects of underfitting, i.e. when the model is too simple to learn the underlying statistical relationship.
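Both failure modes show up in a small polynomial regression experiment. The sketch below uses synthetic data (a sine wave plus noise) and three arbitrary polynomial degrees; the data-generating function, sample sizes, and degrees are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

def true_relationship(x):
    """Hypothetical underlying relationship the models try to learn."""
    return np.sin(2.0 * np.pi * x)

# Hypothetical data: a smooth signal plus noise.
x_train = np.sort(rng.uniform(0.0, 1.0, 30))
y_train = true_relationship(x_train) + rng.normal(0.0, 0.2, 30)
x_test = np.sort(rng.uniform(0.0, 1.0, 200))
y_test = true_relationship(x_test) + rng.normal(0.0, 0.2, 200)

def mse(model, x, y):
    """Mean squared error of a fitted model on the given data."""
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 15):
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree:>2}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

The straight line (degree 1) is too simple and does poorly both in and out of sample. The in-sample error keeps falling as the degree rises, but that improvement is not a reliable guide to out-of-sample error, and the high-degree fit will typically do worse out of sample than the cubic.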
As mentioned previously, whether a model over- or underfits the data depends a lot on the training strategy used to develop it. To reduce the probability of overfitting most researchers use a technique called cross-validation. This technique involves dividing the data set into representative partitions, namely a training, a testing, and a validation set. The model is trained using the training set and checked for overfitting on the independent testing and validation sets. If the model shows signs of overfitting, that is, if its error on the held-out data starts to rise while its training error keeps falling, training is stopped. Oftentimes, to improve the performance of the model, multiple cross-validations are performed. The challenge with this is simply that you need quite a lot of data to produce multiple training, testing, and validation partitions.
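One common way of performing multiple cross-validations is k-fold cross-validation, in which the data is split into k disjoint folds and each fold takes a turn as the held-out set. The sketch below is an assumption about how this could look in practice, not the author's procedure: it uses synthetic cubic data, polynomial models, and hypothetical helper names.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(11)

# Hypothetical data generated from a cubic relationship plus noise.
x = rng.uniform(-2.0, 2.0, 120)
y = x ** 3 - x + rng.normal(0.0, 0.3, 120)

def k_fold_indices(n_samples, k):
    """Shuffle the indices once and split them into k disjoint folds."""
    return np.array_split(rng.permutation(n_samples), k)

def cross_validated_mse(x, y, degree, folds):
    """Average out-of-fold MSE of a polynomial fit of the given degree."""
    errors = []
    for i, held_out in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = Polynomial.fit(x[train], y[train], degree)
        errors.append(np.mean((model(x[held_out]) - y[held_out]) ** 2))
    return float(np.mean(errors))

folds = k_fold_indices(len(x), k=5)
cv_scores = {degree: cross_validated_mse(x, y, degree, folds) for degree in (1, 3, 9)}
best_degree = min(cv_scores, key=cv_scores.get)

for degree, score in cv_scores.items():
    print(f"degree {degree}: cross-validated MSE {score:.4f}")
print("selected degree:", best_degree)
```

Because every observation is used for both training and validation across the folds, k-fold makes more economical use of limited data than a single fixed split, at the cost of fitting the model k times.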
6. Survivorship bias
Survivorship bias is caused by performing statistical analysis on a data-set which only includes entities which have 'survived' to a particular point in time. The classic example of this is analyzing the returns generated by hedge funds. In the past three decades many hedge funds have blown up (e.g. LTCM) or shut down after heavy losses. If a study were conducted on the returns generated over the past three decades, but only the hedge funds that were still operating today were included, then the data-set would not capture the riskiness of hedge funds because none of the failed hedge funds would be included. This is called the survivor effect and is shown below.
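The survivor effect is easy to reproduce in a simulation. In the sketch below every fund draws its annual returns from the same distribution, so no fund is genuinely better than another; the closure rule (a fund shuts once it has lost 40% of its starting capital) and all parameter values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2008)

N_FUNDS, N_YEARS = 5000, 10
CLOSURE_LEVEL = 0.6  # hypothetical rule: a fund closes once it has lost 40%

# Hypothetical annual returns: every fund draws from the same distribution.
annual_returns = rng.normal(0.05, 0.20, size=(N_FUNDS, N_YEARS))
wealth_paths = np.cumprod(1.0 + annual_returns, axis=1)

# A fund-year is observed if the fund had not breached the level before it.
breached = (wealth_paths <= CLOSURE_LEVEL).astype(int)
observed = (np.cumsum(breached, axis=1) - breached) == 0
survived = breached.sum(axis=1) == 0

mean_all = float(annual_returns[observed].mean())        # includes dead funds
mean_survivors = float(annual_returns[survived].mean())  # survivors only
survival_rate = float(survived.mean())

print(f"share of funds surviving:        {survival_rate:.1%}")
print(f"mean return, all observed years: {mean_all:.2%}")
print(f"mean return, survivors only:     {mean_survivors:.2%}")
```

The survivors look more skilled than the average fund even though, by construction, their edge is pure luck.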
7. Missing Variable bias
Omitted-variable bias (also called missing variable bias) occurs when a model leaves out one or more important causal variables. The bias is created when the model incorrectly compensates for the missing variable by over- or underestimating the effect of one of the other variables. This is especially true when the included variables are correlated with the missing causal variable. Alternatively the missing variable may result in a larger prediction error.
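The compensation effect can be demonstrated with ordinary least squares on synthetic data. In this sketch the true model is y = 1.0 * x1 + 2.0 * x2 + noise, with x2 correlated with x1; the coefficients, the correlation, and the `ols` helper are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000

# Hypothetical data-generating process: y = 1.0 * x1 + 2.0 * x2 + noise,
# where x2 is correlated with x1.
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + 0.6 * rng.standard_normal(n)
y = 1.0 * x1 + 2.0 * x2 + rng.standard_normal(n)

def ols(columns, target):
    """Least-squares coefficients (intercept first) for the given regressors."""
    design = np.column_stack([np.ones(len(target))] + list(columns))
    coefficients, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefficients

full_model = ols([x1, x2], y)   # both causal variables included
short_model = ols([x1], y)      # x2 omitted

print("x1 coefficient, full model:", round(float(full_model[1]), 3))
print("x1 coefficient, x2 omitted:", round(float(short_model[1]), 3))
```

With x2 left out, the coefficient on x1 absorbs the part of x2's effect that moves with x1: its expected value becomes 1.0 + 2.0 * 0.8 = 2.6 rather than the true 1.0.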
The challenge of identifying independent variables which may have predictive power over the dependent variable is not simple. One approach is to identify sets of variables which explain the most variance in the dependent variable. This approach is called best-subset. Alternatively, you could identify eigenvectors (linear combinations of the available variables) which account for the most variance in the available variables themselves, and use those as inputs. This is the approach taken when performing principal component analysis. One problem with principal component analysis is that it too may overfit the data, and the eigenvectors may not generalize well over time. Lastly, you could iteratively add variables to your model. This is the approach taken by step-wise multiple linear regression and adaptive neural networks.
Ultimately it is probably impossible to develop a truly unbiased model. Either your assumptions (whether explicit or implicit) bias the model or the actual development process will bias the model ... and even if you can avoid both of these sources of bias there will still always be cognitive biases introduced by the users of the model. All of these increase model risk i.e. the risk of the model being wrong enough to cause financial loss. That having been said, the most important takeaway from this article is that even though all models are wrong, some models are still useful.
Great article. I would like to be able to post it directly to my Facebook page and it is not posted to yours.
As an aside, it seems to me there is a great mental shift for a trader-based mentality transitioning to a systems algo view of the world, where a trade becomes the construction of a vector and vectors become variables in set construction. That said, I think the "mongrel" trade view is still helpful in determining through experimentation what the rules/constraints for the trades are, given that even a minor variation in the rules gives a totally different vector and set of vectors.