Turing Finance | October 20, 2016

10 misconceptions about Neural Networks

Neural networks are one of the most popular and powerful classes of machine learning algorithms. In quantitative finance neural networks are often used for time-series forecasting, constructing proprietary indicators, algorithmic trading, securities classification and credit risk modelling. They have also been used to construct stochastic process models and price derivatives. Despite their usefulness neural networks tend to have a bad reputation because their performance is "temperamental". In my opinion this can be attributed to poor network design owing to misconceptions regarding how neural networks work. This article discusses some of those misconceptions.

  1. Neural networks are not models of the human brain
  2. Neural networks are not just a "weak form" of statistics
  3. Neural networks come in many different architectures
  4. Size matters, but bigger isn't always better
  5. Many training algorithms exist for neural networks
  6. Neural networks do not always require a lot of data
  7. Neural networks cannot be trained on any data
  8. Neural networks may need to be retrained
  9. Neural networks are not black boxes
  10. Neural networks are not hard to implement

1. Neural networks are not models of the human brain

The human brain is one of the great mysteries of our time and scientists have not reached a consensus on exactly how it works. Two theories of the brain exist, namely the grandmother cell theory and the distributed representation theory. The first theory asserts that individual neurons have high information capacity and are capable of representing complex concepts such as your grandmother or even Jennifer Aniston. The second theory asserts that neurons are much simpler and that representations of complex objects are distributed across many neurons. Artificial neural networks are loosely inspired by the second theory.

One reason why I believe current generation neural networks are not capable of sentience (a different concept to intelligence) is because I believe that biological neurons are much more complex than artificial neurons.

A single neuron in the brain is an incredibly complex machine that even today we don’t understand. A single “neuron” in a neural network is an incredibly simple mathematical function that captures a minuscule fraction of the complexity of a biological neuron. So to say neural networks mimic the brain, that is true at the level of loose inspiration, but really artificial neural networks are nothing like what the biological brain does. - Andrew Ng

Another big difference between the brain and neural networks is size and organization. Human brains contain many more neurons and synapses than neural networks, and they are self-organizing and adaptive. Neural networks, by comparison, are organized according to an architecture. Neural networks are not "self-organizing" in the same sense as the brain, which much more closely resembles a graph than an ordered network.

Brain connections

Some very interesting views of the brain as created by state-of-the-art brain imaging techniques.

So what does that mean? Think of it this way: a neural network is inspired by the brain in the same way that the Olympic stadium in Beijing is inspired by a bird's nest. That does not mean that the Olympic stadium is a bird's nest; it means that some elements of birds' nests are present in the design of the stadium. In other words, elements of the brain are present in the design of neural networks but they are a lot less similar than you might think.

In fact neural networks are more closely related to statistical methods such as curve fitting and regression analysis than to the human brain. In the context of quantitative finance I think it is important to remember that, because whilst it may sound cool to say that something is 'inspired by the brain', this statement may result in unrealistic expectations or fear. For more info see 'No! Artificial Intelligence is not an existential threat'.

Non-Linear Regression

An example of curve fitting also known as function approximation. Neural networks are quite often used to approximate complex mathematical functions.

2. Neural networks aren't a "weak form" of statistics

Neural networks consist of layers of interconnected nodes. Individual nodes are called perceptrons and resemble a multiple linear regression. The difference between a multiple linear regression and a perceptron is that a perceptron feeds the signal generated by a multiple linear regression into an activation function which may or may not be non-linear. In a multi-layer perceptron (MLP) perceptrons are arranged into layers and layers are connected with one another. In the MLP there are three types of layers, namely the input layer, hidden layer(s), and the output layer. The input layer receives input patterns and the output layer could contain a list of classifications or output signals to which those input patterns may map. During training the weightings on the connections into the hidden layers are adjusted until the error of the neural network is minimized. One interpretation of this is that the hidden layers extract salient features in the input data which have predictive power with respect to the outputs.

Mapping Inputs : Outputs

A perceptron receives a vector of inputs, \textbf{z} = (z_1,z_2,\ldots,z_n), consisting of n attributes. This vector of inputs is called an input pattern. These inputs are weighted according to the weight vector belonging to that perceptron, \textbf{v} = (v_1,v_2,\ldots,v_n). In the context of multiple linear regression these can be thought of as regression coefficients or betas. The net input signal, net, of the perceptron is usually the sum-product of the input pattern and its weights. Neurons which use the sum-product for net are called summation units.

net = \sum^n_{i=1} z_i v_i

The net input signal, minus a bias \theta, is then fed into some activation function, f(). Activation functions are usually monotonically increasing functions which are bounded between either (0,1) or (-1,1) (this is discussed further on in this article). Activation functions can be linear or non-linear.


Some popular activation functions used in neural networks are shown below,

Activation Functions

The simplest neural network is one which has just one neuron which maps inputs to an output. Given a pattern, p, the objective of this network would be to minimize the error of the output signal, o_p, relative to some known target value for some given training pattern, t_p. For example, if the neuron was supposed to map p to -1 but it mapped it to 1 then the error, as measured by sum-squared distance, of the neuron would be (-1 - 1)^2 = 4.
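As a minimal sketch of the single-neuron case described above, the snippet below computes the net input signal, subtracts the bias, applies an activation function and measures the squared error. The choice of tanh as the activation function and the example input and weight values are assumptions made purely for illustration.

```python
import math


def net_input(z, v):
    """Sum-product of the input pattern z and the weight vector v."""
    return sum(z_i * v_i for z_i, v_i in zip(z, v))


def perceptron(z, v, theta=0.0, f=math.tanh):
    """Output o_p = f(net - theta) of a single summation unit."""
    return f(net_input(z, v) - theta)


def squared_error(t_p, o_p):
    """Squared distance between the target t_p and the output o_p."""
    return (t_p - o_p) ** 2


# Hypothetical input pattern and weights (illustrative values only)
z = [0.5, -1.0, 2.0]
v = [0.1, 0.4, -0.3]

o_p = perceptron(z, v)
print("output:", o_p)
print("error against a target of -1:", squared_error(-1.0, o_p))

# The worked example from the text: target -1, output 1
print("worked example:", squared_error(-1.0, 1.0))
```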


Multilayer perceptron

As shown in the image above perceptrons are organized into layers. The first layer of perceptrons, called the input layer, receives the patterns, p, in the training set, P_T. The last layer maps to the expected outputs for those patterns. An example of this is that the patterns may be a list of quantities for different technical indicators regarding a security and the potential outputs may be the categories \{BUY, HOLD, SELL\}.

A hidden layer is one which receives as inputs the outputs from another layer; and for which the outputs form the inputs into yet another layer. So what do these hidden layers do? One interpretation is that they extract salient features in the input data which have predictive power with respect to the outputs. This is called feature extraction and in a way it performs a similar function to statistical techniques such as principal component analysis.

Deep neural networks have a large number of hidden layers and are able to extract much deeper features from the data. Recently, deep neural networks have performed particularly well for image recognition problems. An illustration of feature extraction in the context of image recognition is shown below,


I think that one of the problems facing the use of deep neural networks for trading (in addition to the obvious risk of overfitting) is that the inputs into the neural network are almost always heavily pre-processed meaning that there may be few features to actually extract because the inputs are already to some extent features.

Learning Rules

As mentioned previously the objective of the neural network is to minimize some measure of error, \epsilon. The most common measure of error is sum-squared-error although this metric is sensitive to outliers and may be less appropriate than tracking error in the context of financial markets.

Sum squared error (SSE), \epsilon = \sum^{P_T}_{p=1} \big ( t_p - o_p \big )^2

Given that the objective of the network is to minimize \epsilon we can use an optimization algorithm to adjust the weights in the neural network. The most common learning algorithm for neural networks is the gradient descent algorithm although other and potentially better optimization algorithms can be used. Gradient descent works by calculating the partial derivative of the error with respect to the weights for each layer in the neural network and then moving in the opposite direction to the gradient (because we want to minimize the error of the neural network). By minimizing the error we maximize the performance of the neural network in-sample.

Expressed mathematically the update rule for the weights in the neural network (\textbf{v}) is given by,

v_i(t) = v_i(t - 1) + \delta v_i(t) where

\delta v_i(t) = \eta(-\frac{\partial \epsilon}{\partial v_i}) where

\frac{\partial \epsilon}{\partial v_i} = -2(t_p - o_p) \frac{\partial f}{\partial net_p}z_{i,p}

where \eta is the learning rate which controls how quickly or slowly the neural network converges. It is worth noting that the calculation of the partial derivative of f with respect to the net input signal for a pattern p represents a problem for any discontinuous activation functions, which is one reason why alternative optimization algorithms may be used. The choice of learning rate has a large impact on the performance of the neural network. Small values for \eta may result in very slow convergence whereas high values for \eta could result in a lot of variance in the training.

Small Learning Rate

High Learning Rate
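The update rule above can be sketched for a single sigmoid perceptron as follows. This is only an illustration under stated assumptions: the sigmoid activation, the logical OR training set, the learning rate and the trick of feeding a constant input of -1 so that its weight plays the role of \theta are all choices made for the example, not part of the rule itself.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sum_squared_error(patterns, targets, v):
    """SSE of a single sigmoid perceptron with weights v."""
    total = 0.0
    for z, t_p in zip(patterns, targets):
        o_p = sigmoid(sum(z_i * v_i for z_i, v_i in zip(z, v)))
        total += (t_p - o_p) ** 2
    return total


def train(patterns, targets, eta=0.5, epochs=2000):
    """Apply v_i(t) = v_i(t-1) + eta * (-d eps / d v_i) pattern by pattern."""
    v = [0.0] * len(patterns[0])
    for _ in range(epochs):
        for z, t_p in zip(patterns, targets):
            net_p = sum(z_i * v_i for z_i, v_i in zip(z, v))
            o_p = sigmoid(net_p)
            df_dnet = o_p * (1.0 - o_p)  # derivative of the sigmoid at net_p
            for i, z_i in enumerate(z):
                d_eps_d_v = -2.0 * (t_p - o_p) * df_dnet * z_i
                v[i] = v[i] + eta * (-d_eps_d_v)
    return v


# Logical OR. The third input is a constant -1 (assumption), so the
# third weight acts as the bias theta.
patterns = [[0.0, 0.0, -1.0], [0.0, 1.0, -1.0],
            [1.0, 0.0, -1.0], [1.0, 1.0, -1.0]]
targets = [0.0, 1.0, 1.0, 1.0]

error_before = sum_squared_error(patterns, targets, [0.0, 0.0, 0.0])
weights = train(patterns, targets)
error_after = sum_squared_error(patterns, targets, weights)
print("SSE before:", error_before, "SSE after:", error_after)
```

Re-running the sketch with a much smaller or much larger eta is a quick way to see the slow convergence and high variance mentioned above.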


Despite what some of the statisticians I have met in my time believe, neural networks are not just a "weak form of statistics for lazy analysts" (I have actually been told this before and it was quite funny); neural networks represent an abstraction of solid statistical techniques which date back hundreds of years. For a fantastic explanation of the statistics behind neural networks I recommend reading this chapter. That having been said I do agree that some practitioners like to treat neural networks as a "black box" which can be thrown at any problem without first taking the time to understand the nature of the problem and whether or not neural networks are an appropriate choice. An example of this is the use of neural networks for trading; markets are dynamic yet neural networks assume the distribution of input patterns remains stationary over time. This is discussed in more detail here

3. Neural networks come in many architectures

Up until now we have just discussed the simplest neural network architecture, namely the multi-layer perceptron. There are many different neural network architectures (far too many to mention here) and the performance of any neural network is a function of its architecture and weights. Many modern day advances in the field of machine learning do not come from rethinking the way that perceptrons and optimization algorithms work but rather from being creative regarding how these components fit together. Below I discuss some very interesting and creative neural network architectures which have been developed over time.

Recurrent Neural Networks - some or all connections flow backwards, meaning that feedback loops exist in the network. These networks are believed to perform better on time series data. As such, they may be particularly relevant in the context of the financial markets. For more information here is a link to a fantastic article entitled, The unreasonable performance of recurrent [deep] neural networks.

Recurrent Neural Network Architectures

This diagram shows three popular recurrent Neural Network Architectures namely the Elman neural network, the Jordan neural network, and the Hopfield single-layer neural network.

A more recent interesting recurrent neural network architecture is the Neural Turing Machine. This network combines a recurrent neural network architecture with memory. It has been shown that these neural networks are Turing complete and were able to learn sorting algorithms and other computing tasks.

Boltzmann neural network - one of the first fully connected neural networks was the Boltzmann neural network, a.k.a. the Boltzmann machine. These networks were the first networks capable of learning internal representations and solving very difficult combinatoric problems. One interpretation of the Boltzmann machine is that it is a Monte Carlo version of the Hopfield recurrent neural network. That said, Boltzmann machines can be quite difficult to train, but when constrained they can prove more efficient than traditional neural networks. The most popular constraint on Boltzmann machines is to disallow direct connections between hidden neurons. This particular architecture is referred to as a Restricted Boltzmann Machine, and such machines are used in Deep Boltzmann Machines.

Boltzmann Machine

This diagram shows how different Boltzmann Machines with connections between the different nodes can significantly affect the results of the neural network (graphs to the right of the networks)

Deep neural networks - these are neural networks with multiple hidden layers. Deep neural networks have become extremely popular in more recent years due to their unparalleled success in image and voice recognition problems. The number of deep neural network architectures is growing quite quickly but some of the most popular architectures include deep belief networks, convolutional neural networks, deep restricted Boltzmann machines, stacked auto-encoders, and many more. One of the biggest problems with deep neural networks, especially in the context of financial markets which are non-stationary, is overfitting. For more info see

Deep Neural Network Cat

This diagram shows a deep neural network which consists of multiple hidden layers.

Adaptive neural networks - are neural networks which simultaneously adapt and optimize their architectures whilst learning. This is done by either growing the architecture (adding more hidden neurons) or shrinking it (pruning unnecessary hidden neurons). I believe that adaptive neural networks are most appropriate for financial markets because markets are non-stationary. I say this because the features extracted by the neural network may strengthen or weaken over time depending on market dynamics. The implication of this is that any architecture which worked optimally in the past would need to be altered to work optimally today.

Adaptive architecture neural networks

This diagram shows two different types of adaptive neural network architectures. The left image is a cascade neural network and the right image is a self-organizing map.

Radial basis networks - although not a different type of architecture in the sense of perceptrons and connections, radial basis networks make use of radial basis functions as their activation functions. These are real-valued functions whose output depends on the distance from a particular point. The most commonly used radial basis function is the Gaussian function. Because radial basis functions can take on much more complex forms, they were originally used for performing function interpolation. As such, a radial basis function neural network can have a much higher information capacity. Radial basis functions are also used in the kernel of a Support Vector Machine.

Radial basis function fitting

This diagram shows how curve fitting can be done using radial basis functions
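To make the idea of "output depends on the distance from a particular point" concrete, here is a small sketch of a Gaussian radial basis function. The function name, the width parameter and the sample points are assumptions chosen for illustration.

```python
import math


def gaussian_rbf(x, centre, width=1.0):
    """Gaussian radial basis function: depends only on the distance to centre."""
    distance_sq = sum((x_i - c_i) ** 2 for x_i, c_i in zip(x, centre))
    return math.exp(-distance_sq / (2.0 * width ** 2))


centre = [0.0, 0.0]
for point in ([0.0, 0.0], [1.0, 0.0], [0.0, -1.0], [3.0, 4.0]):
    # Points at the same distance from the centre give the same activation
    print(point, gaussian_rbf(point, centre))
```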

In summary, many hundreds of neural network architectures exist and the performance of one neural network can be significantly superior to another. As such, quantitative analysts interested in using neural networks should probably test multiple neural network architectures and consider combining their outputs together in an ensemble to maximize their investment performance. I recommend reading my article, All Your Models are Wrong, 7 Sources of Model Risk, before using Neural Networks for trading because many of the problems still apply.

4. Size matters, but bigger isn't always better

Having selected an architecture one must then decide how large or small the neural network should be. How many inputs are there? How many hidden neurons should be used? How many hidden layers should be used (if we are using a deep neural network)? And how many output neurons are required? These questions are important because if the neural network is too large (too small) it could potentially overfit (underfit) the data, meaning that the network would not generalize well out of sample.

How many and which inputs should be used?

The number of inputs depends on the problem being solved, the quantity and quality of available data, and perhaps some creativity. Inputs are simply variables which we believe have some predictive power over the dependent variable being predicted. If the inputs to a problem are unclear, you can systematically determine which variables should be included by looking at the correlations and cross-correlation between potential independent variables and the dependent variables. This approach is detailed in the article, What Drives Real GDP Growth?

There are two problems with using correlations to select input variables. Firstly, if you are using a linear correlation metric you may inadvertently exclude useful variables. Secondly, two relatively uncorrelated variables could potentially be combined to produce a strongly correlated variable. If you look at the variables in isolation you may miss this opportunity. To overcome the second problem you could use principal component analysis to extract useful eigenvectors (linear combinations of the variables) as inputs. That said, a problem with this is that the eigenvectors may not generalize well, and the approach also assumes that the distribution of input patterns is stationary.
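The first problem can be illustrated with a short sketch. It computes the Pearson (linear) correlation between a candidate input and two synthetic targets, one linear and one quadratic. The data is made up for the example; the point is only that a variable which fully determines the target can still show a linear correlation of roughly zero.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Synthetic candidate input, symmetric around zero
x = [i / 10.0 for i in range(-20, 21)]
linear_target = [2.0 * x_i + 1.0 for x_i in x]
quadratic_target = [x_i ** 2 for x_i in x]

print("linear target:   ", pearson(x, linear_target))
print("quadratic target:", pearson(x, quadratic_target))
```

A screen based purely on linear correlation would keep x for the first target and discard it for the second, even though the second target is an exact function of x.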

Another problem when selecting variables is multicollinearity. Multicollinearity is when two or more of the independent variables being fed into the model are highly correlated. In the context of regression models this may cause regression co-efficients to change erratically in response to small changes in the model or the data. Given that neural networks and regression models are similar I suspect this is also a problem for neural networks.

Last, but not least, one statistical bias which may be introduced when selecting variables is omitted-variable bias. Omitted variable bias occurs when a model is created which leaves out one or more important causal variables. The bias is created when the model incorrectly compensates for the missing variable by over or underestimating the effect of one of the other variables i.e. the weights may become too large on these variables or SSE will be large. 

How many hidden neurons should I use?

The optimal number of hidden units is problem specific. That said, as a general rule of thumb, the more hidden units used the greater the risk of overfitting becomes. Overfitting is when the neural network does not learn the underlying statistical properties of the data, but rather 'memorizes' the patterns and any noise they may contain. This results in neural networks which perform well in sample but poorly out of sample. So how can we avoid overfitting? There are two popular approaches used in industry, namely early stopping and regularization, and then there is my personal favourite approach, global search.

Early stopping involves splitting your training set into the main training set and a validation set. Then instead of training a neural network for a fixed number of iterations, you train it until the performance of the neural network on the validation set begins to deteriorate. Essentially this prevents the neural network from using all of the available parameters and limits its ability to simply memorize every pattern it sees. The image on the right shows two potential stopping points for the neural network (a and b).

Early Stopping

The image below shows the performance and over-fitting of the neural network when stopped at a or b,

Early Stopping II
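A generic early stopping loop might be sketched as below. The two callables, the patience parameter and the synthetic U-shaped validation curve are all hypothetical stand-ins: in practice train_one_epoch would update a real network and validation_error would score it on the held-out validation set, and you would also save a copy of the weights at the best epoch.

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Stop once the validation error has not improved for `patience` epochs."""
    best_error, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        error = validation_error()
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break  # validation performance has been deteriorating for a while
    return best_epoch, best_error


# Stand-in "model": the validation error falls until epoch 30, then rises
state = {"epoch": 0}


def fake_train_one_epoch():
    state["epoch"] += 1


def fake_validation_error():
    return 1.0 + (state["epoch"] - 30) ** 2 / 100.0


best_epoch, best_error = train_with_early_stopping(
    fake_train_one_epoch, fake_validation_error)
print("best epoch:", best_epoch, "validation error:", best_error)
print("training stopped at epoch:", state["epoch"])
```

The patience window is a design choice: stopping at the very first uptick would be cheaper but is more easily fooled by a noisy validation curve.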

Regularization penalizes the neural network for using complex architectures. Complexity in this approach is measured by the size of the neural network weights. Regularization is done by adding a term to the sum-squared error objective function which depends on the size of the weights. This is the equivalent of adding a prior which essentially makes the neural network believe that the function it is approximating is smooth,

\epsilon = \beta \sum^{P_T}_{p=1} \big ( t_p - o_p \big )^2 + \alpha \sum^n_{j=1} v_j^2

where n is the number of weights in the neural network. The parameters \alpha and \beta control the degree to which the neural network over or underfits the data. Good values for \alpha and \beta can be derived using Bayesian analysis and optimization. This, and the above, are explained in considerably more detail in this brilliant chapter.

Neural Network Regularization
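The regularized objective function above translates almost directly into code. The sketch below is illustrative only; the values chosen for \alpha, \beta, the targets, the outputs and the weights are arbitrary.

```python
def regularised_error(targets, outputs, weights, alpha=0.01, beta=1.0):
    """beta * SSE + alpha * (sum of squared weights)."""
    sse = sum((t_p - o_p) ** 2 for t_p, o_p in zip(targets, outputs))
    weight_penalty = sum(v_j ** 2 for v_j in weights)
    return beta * sse + alpha * weight_penalty


targets = [1.0, 0.0]
outputs = [0.5, 0.5]
small_weights = [0.1, -0.2]
large_weights = [3.0, -4.0]

# Same fit to the data, but the larger weights are penalized more heavily
print(regularised_error(targets, outputs, small_weights, alpha=0.1))
print(regularised_error(targets, outputs, large_weights, alpha=0.1))
```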

My favourite technique, which is also by far the most computationally expensive, is global search. In this approach a search algorithm is used to try different neural network architectures and arrive at a near optimal choice. This is most often done using genetic algorithms which are discussed further on in this article.

What Are the Outputs?

Neural networks can be used for either regression or classification. Under a regression model a single real-valued output is produced, meaning that only one output neuron is required. Under a classification model an output neuron is required for each potential class to which the pattern may belong. If the classes are unknown, unsupervised neural network techniques such as self organizing maps should be used.

In conclusion, the best approach is to follow Ockham's razor. Ockham's razor argues that for two models of equivalent performance, the model with fewer free parameters will generalize better. On the other hand, one should never opt for an overly simplistic model at the cost of performance. Similarly, one should not assume that just because a neural network has more hidden neurons and maybe more hidden layers it will outperform a much simpler network. Unfortunately it seems to me that too much emphasis is placed on large networks and too little emphasis is placed on making good design decisions. In the case of neural networks, bigger isn't always better.


Entities must not be multiplied beyond necessity - William of Ockham


Entities must not be reduced to the point of inadequacy - Karl Menger

5. Many training algorithms exist for neural networks

The learning algorithm of a neural network tries to optimize the neural network's weights until some stopping condition has been met. This condition is typically either when the error of the network reaches an acceptable level of accuracy on the training set, when the error of the network on the validation set begins to deteriorate, or when the specified computational budget has been exhausted. The most common learning algorithm for neural networks is the backpropagation algorithm, which uses the (stochastic) gradient descent discussed earlier on in this article. Backpropagation consists of two steps:

  1. The feedforward pass -  the training data set is passed through the network and the output from the neural network is recorded and the error of the network is calculated
  2. Backward propagation - the error signal is passed back through the network and the weights of the neural network are optimized using gradient descent.
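The two steps can be sketched for a small network with one hidden layer. This is a minimal illustration under several assumptions: sigmoid activations throughout, a single output neuron, biases implemented as an extra weight on a constant input of 1.0 (rather than a subtracted \theta), and the XOR patterns as a toy training set. The names forward, backward and train are made up for this example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(z, w_hidden, w_out):
    """Feedforward pass. The last weight in each row is a bias weight."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, z + [1.0])))
              for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden + [1.0])))
    return hidden, output


def backward(z, t_p, w_hidden, w_out):
    """Gradient of (t_p - o_p)^2 with respect to every weight."""
    hidden, o_p = forward(z, w_hidden, w_out)
    delta_out = -2.0 * (t_p - o_p) * o_p * (1.0 - o_p)
    grad_out = [delta_out * h for h in hidden + [1.0]]
    grad_hidden = []
    for j, h in enumerate(hidden):
        # Error signal passed back through the output weight of unit j
        delta_j = delta_out * w_out[j] * h * (1.0 - h)
        grad_hidden.append([delta_j * x for x in z + [1.0]])
    return grad_hidden, grad_out


def train(patterns, targets, n_hidden=3, eta=0.5, epochs=5000, seed=1):
    rng = random.Random(seed)
    n_in = len(patterns[0])
    w_hidden = [[rng.uniform(-1.0, 1.0) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for z, t_p in zip(patterns, targets):
            grad_hidden, grad_out = backward(z, t_p, w_hidden, w_out)
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_hidden[j][i] -= eta * grad_hidden[j][i]
            for j in range(n_hidden + 1):
                w_out[j] -= eta * grad_out[j]
    return w_hidden, w_out


patterns = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
targets = [0.0, 1.0, 1.0, 0.0]  # XOR
w_hidden, w_out = train(patterns, targets)
for z, t_p in zip(patterns, targets):
    print(z, "target:", t_p, "output:", forward(z, w_hidden, w_out)[1])
```

Whether this toy run actually learns XOR depends on the random starting weights, which is a small-scale example of the local minima problem.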

There are some problems with this approach. Adjusting all the weights at once can result in a significant movement of the neural network in weight space, the gradient descent algorithm is quite slow, and it is susceptible to local minima. Local minima are a problem for specific types of neural networks including all product link neural networks. The first two problems can be addressed by using variants of gradient descent including momentum gradient descent, QuickProp, Nesterov's Accelerated Gradient (NAG) descent, the Adaptive Gradient Algorithm (AdaGrad), Resilient Propagation (RProp), and Root Mean Squared Propagation (RMSProp). As can be seen from the image below, significant improvements can be made on the classical gradient descent algorithm.

Long Valley Training Algorithms
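As a rough illustration of why these variants help, the sketch below compares plain gradient descent with classical momentum on a simple "long valley" error surface. The quadratic surface, the starting point, the learning rate and the momentum coefficient are all assumptions chosen for the example; it does not reproduce the image above.

```python
def loss(w):
    """An elongated valley: shallow along x, steep along y."""
    x, y = w
    return 0.5 * (x * x + 50.0 * y * y)


def gradient(w):
    x, y = w
    return [x, 50.0 * y]


def descend(w0, eta=0.02, momentum=0.0, steps=100):
    """Gradient descent with classical momentum (momentum=0.0 gives plain GD)."""
    w = list(w0)
    velocity = [0.0, 0.0]
    for _ in range(steps):
        g = gradient(w)
        velocity = [momentum * v_i - eta * g_i
                    for v_i, g_i in zip(velocity, g)]
        w = [w_i + v_i for w_i, v_i in zip(w, velocity)]
    return w


start = [10.0, 1.0]
plain = descend(start, momentum=0.0)
with_momentum = descend(start, momentum=0.9)
print("plain gradient descent loss:", loss(plain))
print("momentum gradient descent loss:", loss(with_momentum))
```

The learning rate has to stay small to remain stable in the steep direction, which makes plain gradient descent crawl along the shallow one; the velocity term lets the momentum variant build up speed along the valley floor.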

That having been said, these algorithms cannot overcome local minima and are also less useful when trying to optimize both the architecture and weights of the neural network concurrently. In order to achieve this global optimization algorithms are needed. Two popular global optimization algorithms are the Particle Swarm Optimization (PSO) and the Genetic Algorithm (GA). Here is how they can be used to train neural networks:

Neural network vector representation - by encoding the neural network as a vector of weights, each representing the weight of a connection in the neural network, we can train neural networks using most meta-heuristic search algorithms. This technique does not work well with deep neural networks because the vectors become too large.

Vector Representation Neural Network

This diagram illustrates how a neural network can be represented in a vector notation and related to the concept of a search space or fitness landscape.
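One possible way to implement the vector representation is to flatten the weight matrices of each layer into a single list and to rebuild them when the network needs to be evaluated. The helper names and the example layer shapes below are hypothetical.

```python
def flatten(layers):
    """Concatenate every weight of every layer into one flat vector."""
    return [w for layer in layers for row in layer for w in row]


def unflatten(vector, shapes):
    """Rebuild the per-layer weight matrices from a flat vector.

    shapes is a list of (rows, cols) tuples, one per layer.
    """
    layers, pos = [], 0
    for rows, cols in shapes:
        layer = [vector[pos + r * cols: pos + (r + 1) * cols]
                 for r in range(rows)]
        layers.append(layer)
        pos += rows * cols
    return layers


# A 2-3-1 network without biases: a 3x2 hidden layer and a 1x3 output layer
layers = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8, 0.9]],
]
shapes = [(3, 2), (1, 3)]

vector = flatten(layers)
print(vector)
print(unflatten(vector, shapes) == layers)
```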

Particle Swarm Optimization - to train a neural network using a PSO we construct a population / swarm of those neural networks. Each neural network is represented as a vector of weights and is adjusted according to its position relative to the global best particle and its own personal best.

The fitness function is calculated as the sum-squared error of the reconstructed neural network after completing one feedforward pass of the training data set. The main consideration with this approach is the velocity of the weight updates. This is because if the weights are adjusted too quickly, the sum-squared error of the neural networks will stagnate and no learning will occur.

Particle Swarm Optimization Single Swarm

This diagram shows how particles are attracted to one another in a single swarm Particle Swarm Optimization algorithm.
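A bare-bones global-best PSO might look like the sketch below. To keep it short the "network" is a single linear neuron and the training set is synthetic; the inertia and acceleration coefficients, swarm size and iteration count are assumed values, not recommendations.

```python
import random


def sse(weights, patterns, targets):
    """Sum-squared error of a single linear neuron (assumed for brevity)."""
    total = 0.0
    for z, t_p in zip(patterns, targets):
        o_p = sum(w * x for w, x in zip(weights, z))
        total += (t_p - o_p) ** 2
    return total


def pso(fitness, dim, n_particles=20, iterations=200,
        inertia=0.7, c1=1.4, c2=1.4, seed=0):
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda k: pbest_fit[k])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    history = [gbest_fit]
    for _ in range(iterations):
        for k in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull towards the personal best and the global best
                vel[k][d] = (inertia * vel[k][d]
                             + c1 * r1 * (pbest[k][d] - pos[k][d])
                             + c2 * r2 * (gbest[d] - pos[k][d]))
                pos[k][d] += vel[k][d]
            fit = fitness(pos[k])
            if fit < pbest_fit[k]:
                pbest[k], pbest_fit[k] = pos[k][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[k][:], fit
        history.append(gbest_fit)
    return gbest, history


# Synthetic training set generated from the weights [2, -1]
patterns = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
targets = [2.0, -1.0, 1.0, 5.0]

best, history = pso(lambda w: sse(w, patterns, targets), dim=2)
print("best weights:", best)
print("SSE at start:", history[0], "SSE at end:", history[-1])
```

The inertia and acceleration coefficients control the velocity of the weight updates discussed above, so they are the first parameters to revisit if the swarm stagnates.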

Genetic Algorithm - to train a neural network using a genetic algorithm we first construct a population of vector represented neural networks. Then we apply the three genetic operators on that population to evolve better and better neural networks. These three operators are,

  1. Selection - Using the sum-squared error of each network calculated after one feedforward pass, we rank the population of neural networks. The top x% of the population are selected to 'survive' to the next generation and be used for crossover.
  2. Crossover - The top x% of the population's genes are allowed to cross over with one another. This process forms 'offspring'. In context, each offspring will represent a new neural network with weights from both of the 'parent' neural networks.
  3. Mutation - this operator is required to maintain genetic diversity in the population. A small percentage of the population are selected to undergo mutation. Some of the weights in these neural networks will be adjusted randomly within a particular range.

Genetic Algorithm

This diagram shows the selection, crossover, and mutation genetic operators being applied to a population of neural networks represented as vectors.
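The three operators can be sketched in a few lines. As with any toy example this makes several assumptions: the network is a single linear neuron, crossover is uniform (each weight is copied from one parent at random), and mutation is applied only to offspring so that the best network found so far is never lost. The parameter values are arbitrary.

```python
import random


def sse(weights, patterns, targets):
    """Sum-squared error of a single linear neuron (assumed for brevity)."""
    total = 0.0
    for z, t_p in zip(patterns, targets):
        o_p = sum(w * x for w, x in zip(weights, z))
        total += (t_p - o_p) ** 2
    return total


def evolve(fitness, dim, pop_size=30, generations=100, survive=0.2,
           mutation_rate=0.3, mutation_range=0.5, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                  for _ in range(pop_size)]
    n_parents = max(2, int(survive * pop_size))
    history = []
    for _ in range(generations):
        # Selection: rank by error and keep the top x%
        population.sort(key=fitness)
        history.append(fitness(population[0]))
        parents = population[:n_parents]
        # Crossover: each offspring mixes the weights of two parents
        offspring = []
        while len(parents) + len(offspring) < pop_size:
            mum, dad = rng.sample(parents, 2)
            offspring.append([rng.choice(pair) for pair in zip(mum, dad)])
        # Mutation: randomly perturb one weight of some offspring
        for child in offspring:
            if rng.random() < mutation_rate:
                i = rng.randrange(dim)
                child[i] += rng.uniform(-mutation_range, mutation_range)
        population = parents + offspring
    population.sort(key=fitness)
    history.append(fitness(population[0]))
    return population[0], history


# Synthetic training set generated from the weights [2, -1]
patterns = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
targets = [2.0, -1.0, 1.0, 5.0]

best, history = evolve(lambda w: sse(w, patterns, targets), dim=2)
print("best weights:", best)
print("SSE at start:", history[0], "SSE at end:", history[-1])
```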

In addition to these population-based metaheuristic search algorithms, other algorithms have been used to train neural networks including backpropagation with added momentum, differential evolution, Levenberg-Marquardt, simulated annealing, and many more. Personally I would recommend using a combination of local and global optimization algorithms to overcome the shortcomings of both.

6. Neural networks do not always require a lot of data

Neural networks can use one of three learning strategies, namely a supervised learning strategy, an unsupervised learning strategy, or a reinforcement learning strategy. Supervised learning requires at least two data sets: a training set, which consists of inputs with the expected output, and a testing set, whose expected outputs are withheld from the network and used only to measure its performance. Both of these data sets must consist of labelled data, i.e. data patterns for which the target is known upfront. Unsupervised learning strategies are typically used to discover hidden structures (such as hidden Markov chains) in unlabelled data. They behave in a similar way to clustering algorithms. Reinforcement learning is based on the simple premise of rewarding neural networks for good behaviours and punishing them for bad behaviours. Because unsupervised and reinforcement learning strategies do not require that data be labelled they can be applied to under-formulated problems where the correct output is not known.

Unsupervised Learning

One of the most popular unsupervised neural network architectures is the Self Organizing Map (also known as the Kohonen Map). Self Organizing Maps are essentially a multi-dimensional scaling technique which constructs an approximation of the probability density function of some underlying data set, \textbf{Z}, whilst preserving the topological structure of that data set. This is done by mapping input vectors, \textbf{z}_i, in the data set, \textbf{Z}, to weight vectors, \textbf{v}_j, (neurons) in the feature map, \textbf{V}. Preserving the topological structure simply means that if two input vectors are close together in \textbf{Z}, then the neurons to which those input vectors map in \textbf{V} will also be close together.

Dimensionality Reduction using Principal Component Analysis and Self Organizing Maps
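A very small self organizing map can be sketched as follows. The one-dimensional chain of neurons, the Gaussian neighbourhood function, the linearly decaying learning rate and neighbourhood width, and the random two-dimensional data are all assumptions made to keep the example short.

```python
import math
import random


def best_matching_unit(z, weights):
    """Index of the neuron whose weight vector is closest to z."""
    def dist_sq(v_j):
        return sum((z_d - v_d) ** 2 for z_d, v_d in zip(z, v_j))
    return min(range(len(weights)), key=lambda j: dist_sq(weights[j]))


def train_som(data, n_neurons=10, epochs=50, lr0=0.5, sigma0=3.0, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood
        for z in data:
            bmu = best_matching_unit(z, weights)
            for j, v_j in enumerate(weights):
                # Neurons near the winner on the map move more than distant ones
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                for d in range(dim):
                    v_j[d] += lr * h * (z[d] - v_j[d])
    return weights


data_rng = random.Random(1)
data = [[data_rng.random(), data_rng.random()] for _ in range(100)]
som = train_som(data)
for j, v_j in enumerate(som):
    print(j, v_j)
```

Because neighbouring neurons are updated together, neurons that sit next to each other on the map tend to end up with similar weight vectors, which is the topology-preserving property described above.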

For more information on self organizing maps and how they can be used to produce lower-dimensionality data sets click here. Another interesting application of SOMs is in colouring time series charts for stock trading. This is done to show what the market conditions are at that point in time. This website provides a detailed tutorial and code snippets for implementing the idea for improved Forex trading strategies.

Reinforcement Learning

Reinforcement learning strategies consist of three components: a policy, which specifies how the neural network will make decisions, e.g. using technical and fundamental indicators; a reward function, which distinguishes good from bad, e.g. making vs. losing money; and a value function, which specifies the long term goal. In the context of financial markets (and game playing) reinforcement learning strategies are particularly useful because the neural network learns to optimize a particular quantity such as an appropriate measure of risk adjusted return.

Reinforcement Learning

This diagram shows how a neural network can be either negatively or positively reinforced.

7. Neural networks cannot be trained on any data

One of the biggest reasons why neural networks may not work is because people do not properly pre-process the data being fed into the neural network. Data normalization, removal of redundant information, and outlier removal should all be performed to improve the probability of good neural network performance.

Data normalization - neural networks consist of various layers of perceptrons linked together by weighted connections. Each perceptron contains an activation function, and each activation function has an 'active range' (except for radial basis functions). Inputs into the neural network need to be scaled within this range so that the neural network is able to differentiate between different input patterns.

For example, consider a neural network trading system which receives indicators about a set of securities as inputs and outputs whether each security should be bought or sold. One of the inputs is the price of the security and we are using the Sigmoid activation function. However, most of the securities cost between $5 and $15 per share, and for inputs of that size the output of the Sigmoid function is already within 1% of 1.0. So the output of the Sigmoid function will be approximately 1.0 for all securities, all of the perceptrons will 'fire', and the neural network will not learn.

Neural networks trained on unprocessed data produce models where 'the lights are on but nobody's home'
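The saturation problem in the price example can be checked numerically. The sketch below feeds raw prices between $5 and $15 straight into a Sigmoid and then repeats the exercise after standardizing the prices (subtract the mean, divide by the standard deviation). The specific prices and the choice of standardization over, say, min-max scaling are assumptions for illustration.

```python
import math
import statistics


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


prices = [5.0, 8.0, 10.0, 12.0, 15.0]

# Raw prices: every output is squashed into the flat upper tail of the curve.
raw_outputs = [sigmoid(p) for p in prices]

# Standardized prices: inputs are centred on zero, inside the active range.
mean = statistics.mean(prices)
std = statistics.pstdev(prices)
scaled = [(p - mean) / std for p in prices]
scaled_outputs = [sigmoid(s) for s in scaled]

raw_spread = max(raw_outputs) - min(raw_outputs)
scaled_spread = max(scaled_outputs) - min(scaled_outputs)
```

With raw prices the outputs all sit above 0.99 and are nearly indistinguishable, whereas the standardized inputs produce outputs spread across most of the Sigmoid's range.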

Outlier removal - an outlier is a value that is much smaller or larger than most of the other values in some set of data. Outliers can cause problems with statistical techniques like regression analysis and curve fitting because when the model tries to 'accommodate' the outlier, performance of the model across all other data deteriorates.

Outlier Removal

This diagram shows the effect of removing an outlier from the training data for a linear regression. The results are comparable for neural networks. Image source:

The illustration shows that trying to accommodate an outlier in the linear regression model results in a poor fit of the data set. The effect of outliers on non-linear regression models, including neural networks, is similar. Therefore it is good practice to remove outliers from the training data set. That said, identifying outliers is a challenge in and of itself; this tutorial and paper discuss existing techniques for outlier detection and removal.
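The effect is easy to reproduce with a toy example. The sketch below fits a least-squares line to ten points, one of which is an outlier, flags the outlier by applying the interquartile range (Tukey fence) rule to the residuals, and then refits. The data, the 1.5 x IQR threshold, and the helper names are assumptions for illustration; this simple rule is only one of many detection techniques and will not catch every outlier.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def iqr_outlier_flags(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [v < q1 - k * iqr or v > q3 + k * iqr for v in values]


# The true relationship is y = 2x + 1, but the last point is corrupted.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[9] = 100.0

# Fit with the outlier included: the line is dragged towards it.
slope_all, intercept_all = fit_line(xs, ys)

# Flag points whose residual is unusually large, then refit without them.
residuals = [y - (slope_all * x + intercept_all) for x, y in zip(xs, ys)]
flags = iqr_outlier_flags(residuals)
clean_xs = [x for x, f in zip(xs, flags) if not f]
clean_ys = [y for y, f in zip(ys, flags) if not f]
slope_clean, intercept_clean = fit_line(clean_xs, clean_ys)
```

Here a single bad point more than triples the estimated slope, and removing it recovers the underlying relationship.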

Remove redundancy - when two or more of the independent variables being fed into the neural network are highly correlated (multicollinearity) this can negatively affect the neural network's learning ability. Highly correlated inputs also mean that the amount of unique information presented by each variable is small, so the less significant input can be removed. Another benefit to removing redundant variables is faster training times. Adaptive neural networks can be used to prune redundant connections and perceptrons.
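One simple way to screen for redundant inputs is to compute pairwise correlations and drop any input that is almost perfectly correlated with one already kept. The sketch below does this in plain Python. The 0.95 threshold, the toy data, and the rule of keeping whichever input comes first (rather than judging which one is "less significant") are assumptions for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length lists (neither may be constant)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def select_uncorrelated(columns, threshold=0.95):
    """Keep each input only if it is not highly correlated with a kept input."""
    keep = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in keep):
            keep.append(name)
    return keep


inputs = {
    "price": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    "sma": [10.1, 11.0, 12.2, 12.9, 14.1, 15.0],   # nearly a copy of price
    "volume": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],      # carries separate information
}
selected = select_uncorrelated(inputs)  # ["price", "volume"]
```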


8. Neural networks may need to be retrained

Even if you were able to train a neural network to trade successfully in and out of sample, this neural network may still stop working over time. This is not a poor reflection on neural networks but rather an accurate reflection of the financial markets. Financial markets are complex adaptive systems, meaning that they are constantly changing, so what worked yesterday may not work tomorrow. This characteristic is called non-stationarity, and problems of this kind are called dynamic optimization problems; neural networks are not particularly good at handling them.

Dynamic environments, such as financial markets, are extremely difficult for neural networks to model. Two approaches are either to keep retraining the neural network over time, or to use a dynamic neural network. Dynamic neural networks 'track' changes to the environment over time and adjust their architecture and weights accordingly. They are adaptive over time. For dynamic problems, multi-solution meta-heuristic optimization algorithms can be used to track changes to local optima over time. One such algorithm is the multi-swarm optimization algorithm, a derivative of the particle swarm optimization. Additionally, genetic algorithms with enhanced diversity or memory have also been shown to be robust in dynamic environments.
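The first approach, retraining over time, is often implemented as a walk-forward loop: at each step the model is refitted on only the most recent window of data before making the next prediction. The sketch below shows the loop with a deliberately trivial one-parameter model standing in for a neural network; the window length, the toy regime change, and the function names are assumptions for illustration.

```python
def walk_forward(xs, ys, window, fit, predict):
    """Refit on the trailing `window` observations before every prediction."""
    predictions = []
    for t in range(window, len(xs)):
        model = fit(xs[t - window:t], ys[t - window:t])
        predictions.append(predict(model, xs[t]))
    return predictions


def fit_slope(xs, ys):
    """Least-squares slope of a line through the origin (stand-in model)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def predict_slope(slope, x):
    return slope * x


# Toy market whose behaviour flips half way through: y = x, then y = -x.
xs = [1.0 + (i % 5) for i in range(100)]
ys = [x if i < 50 else -x for i, x in enumerate(xs)]

# A model trained once on the first regime keeps predicting y = x.
static_slope = fit_slope(xs[:50], ys[:50])
static_prediction = predict_slope(static_slope, xs[99])

# The walk-forward model has adapted to the new regime by the end.
predictions = walk_forward(xs, ys, window=20, fit=fit_slope, predict=predict_slope)
adaptive_prediction = predictions[-1]
```

The same loop works with any model that can be wrapped in `fit` and `predict` callables, at the cost of one retraining run per step.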

The illustration below demonstrates how a genetic algorithm evolves over time to find new optima in a dynamic environment. This illustration also happens to mimic trade crowding which is when market participants crowd a profitable trading strategy, thereby exhausting trading opportunities causing the trade to become less profitable.

Dynamic Environment

This animated image shows a dynamic fitness landscape (search space) change over time. Image source:


9. Neural networks are not black boxes

By itself a neural network is a black box. This presents problems for people wanting to use them. For example, fund managers wouldn't know how a neural network makes trading decisions, so it is impossible to assess the risks of the trading strategies learned by the neural network. Similarly, banks using neural networks for credit risk modelling would not be able to justify why a customer has a particular credit rating, which is a regulatory requirement. That having been said, state of the art rule-extraction algorithms have been developed to vitrify (make transparent) some neural network architectures. These algorithms extract knowledge from the neural networks as either mathematical expressions, symbolic logic, fuzzy logic, or decision trees.

Black box

This image shows a neural network as a black box and how it relates to rule extraction techniques.

Mathematical rules - algorithms have been developed which can extract multiple linear regression lines from neural networks. The problem with these techniques is that the rules are often still difficult to understand, so they do not solve the 'black-box' problem.

Propositional logic - propositional logic is a branch of mathematical logic which deals with operations done on discrete valued variables. These variables, such as A or B, are often either TRUE or FALSE, but they could occupy values within a discrete range e.g. {BUY,HOLD,SELL}.

Logical operations can then be applied to those variables such as OR, AND, and XOR. The results are called predicates which can also be quantified over sets using the exists or for-all quantifiers. This is the difference between predicate and propositional logic. If we had a simple neural network which took Price (P), Simple Moving Average (SMA), and Exponential Moving Average (EMA) as inputs and we extracted a trend following strategy from the neural network in propositional logic, we might get rules like this:

Propositional Logic Example

Fuzzy logic - fuzzy logic is where probability and propositional logic meet. The problem with propositional logic is that it deals in absolutes e.g. BUY or SELL, TRUE or FALSE, 0 or 1. Therefore for traders there is no way to determine the confidence of these results. Fuzzy logic overcomes this limitation by introducing a membership function which specifies how much a variable belongs to a particular domain. For example, a company (GOOG) might belong 0.7 to the domain {BUY} and 0.3 to the domain {SELL}. Combinations of neural networks and fuzzy logic are called Neuro-Fuzzy systems. This research survey discusses various fuzzy rule extraction techniques.
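A membership function can be as simple as a linear ramp. The sketch below maps a hypothetical trading signal in the range [-1, 1] to degrees of membership in {BUY} and {SELL}. The signal, its range, the ramp shape, and the choice to make the two memberships sum to one are all assumptions for illustration; fuzzy systems in general do not require memberships to sum to one.

```python
def buy_membership(signal, low=-1.0, high=1.0):
    """Degree to which `signal` belongs to {BUY}: 0 at `low`, 1 at `high`."""
    if signal <= low:
        return 0.0
    if signal >= high:
        return 1.0
    return (signal - low) / (high - low)


def sell_membership(signal, low=-1.0, high=1.0):
    """Degree to which `signal` belongs to {SELL} (complement of BUY here)."""
    return 1.0 - buy_membership(signal, low, high)


# A mildly bullish signal belongs mostly, but not entirely, to {BUY}.
signal = 0.4
buy = buy_membership(signal)    # approximately 0.7
sell = sell_membership(signal)  # approximately 0.3
```

Unlike a hard BUY or SELL label, the pair of memberships tells the trader how strongly the rule applies.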

Decision trees - decision trees show how decisions are made when given certain information. This article describes how to evolve security analysis decision trees using genetic programming. Decision tree induction is the term given to the process of extracting decision trees from neural networks.

Trading Strategy Decision Tree

An example of a simple trading strategy represented using a decision tree. The triangular boxes represent decision nodes; these could be to BUY, HOLD, or SELL a company. Each box represents a tuple of <indicator, inequality, value>. An example might be <SMA, >, 25> or <EMA, <=, 30>.
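A tree built from indicator, inequality, and value tuples is straightforward to represent and evaluate in code. The sketch below is one possible representation; the `Node` class, the `decide` function, and the particular tree are hypothetical and are not the tree shown in the figure.

```python
import operator
from dataclasses import dataclass
from typing import Union

# Map the inequality symbols used in the tuples to comparison functions.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


@dataclass
class Node:
    indicator: str                  # e.g. "SMA"
    inequality: str                 # one of the keys of OPS
    value: float                    # threshold to compare against
    if_true: Union["Node", str]     # sub-tree or decision when the test passes
    if_false: Union["Node", str]    # sub-tree or decision when the test fails


def decide(node, indicators):
    """Walk the tree until a leaf decision (BUY, HOLD, or SELL) is reached."""
    while isinstance(node, Node):
        passed = OPS[node.inequality](indicators[node.indicator], node.value)
        node = node.if_true if passed else node.if_false
    return node


# <SMA, >, 25> at the root, <EMA, <=, 30> beneath its "true" branch.
tree = Node("SMA", ">", 25, if_true=Node("EMA", "<=", 30, "BUY", "HOLD"), if_false="SELL")

decision = decide(tree, {"SMA": 26.0, "EMA": 29.0})  # "BUY"
```

Every path from the root to a leaf reads as a plain rule, which is what makes an extracted decision tree easier to audit than the network it came from.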


10. Neural networks are not hard to implement

This list is updated, from time to time, when I have time. Last updated: November 2015.

Speaking from experience, neural networks are quite challenging to code from scratch. Luckily there are now hundreds of open source and proprietary packages which make working with neural networks a lot easier. Below is a list of packages which quants may find useful for quantitative finance. The list is NOT exhaustive, and is ordered alphabetically. If you have any additional comments, or frameworks to add, please share via the comment section.


Caffe

"Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley." - Caffe webpage (November 2015)


Encog

"Encog is an advanced machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data. Machine learning algorithms such as Support Vector Machines, Artificial Neural Networks, Genetic Programming, Bayesian Networks, Hidden Markov Models, Genetic Programming and Genetic Algorithms are supported. Most Encog training algorithms are multi-threaded and scale well to multicore hardware. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train machine learning algorithms." - Encog webpage


H2O

H2O is not strictly a package for machine learning. Instead, they expose an API for doing fast and scalable machine learning for smarter applications which use big data. Their API supports deep learning models, generalized boosting models, generalized linear models, and more. They also host a cool conference, check out the videos :).

Google TensorFlow


"TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) that flow between them. This flexible architecture lets you deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device without rewriting code." - GitHub repository (November 2015)

Microsoft Distributed Machine Learning Toolkit


"DMTK includes the following projects: DMTK framework(Multiverso): The parameter server framework for distributed machine learning. LightLDA: Scalable, fast and lightweight system for large-scale topic modeling. Distributed word embedding: Distributed algorithm for word embedding. Distributed skipgram mixture: Distributed algorithm for multi-sense word embedding." - GitHub repository (November 2015)

Microsoft Azure Machine Learning


The machine learning / predictive analytics platform in Microsoft Azure is a fully managed cloud service that enables you to easily build, deploy, and share predictive analytics solutions. This software basically allows you to drag and drop pre-built components (including machine learning models) and custom-built components which manipulate data sets into a process. This flow-chart is then compiled into a program and can be deployed as a web-service. It is similar to the older SAS enterprise miner solution except that it is more modern, more functional, supports deep learning models, and exposes clients for Python and R.


MXNet

"MXNet is a deep learning framework designed for both efficiency and flexibility. It allows you to mix the flavours of symbolic programming and imperative programming together to maximize the efficiency and your productivity. In its core, a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly. A graph optimization layer is build on top, which makes symbolic execution fast and memory efficient. The library is portable and lightweight, and is ready scales to multiple GPUs, and multiple machines." - MXNet GitHub Repository (November 2015)


neon

"neon is Nervana's Python based Deep Learning framework and achieves the fastest performance on many common deep neural networks such as AlexNet, VGG and GoogLeNet. We have designed it with the following functionality in mind: 1) Support for commonly used models and examples: convnets, MLPs, RNNs, LSTMs, autoencoders, 2) Tight integration with nervanagpu kernels for fp16 and fp32 (benchmarks) on Maxwell GPUs, 3) Basic automatic differentiation support, 4) Framework for visualization, and 5) Swappable hardware backends ..." - neon GitHub repository (November 2015)


Theano

"Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation." - Theano GitHub repository (November 2015). Theano, like TensorFlow and Torch, is more broadly applicable than just Neural Networks. It is a framework for implementing existing or creating new machine learning models using off-the-shelf data-structures and algorithms. 


Torch

"Torch is a scientific computing framework with wide support for machine learning algorithms ... A summary of core features include an N-dimensional array, routines for indexing, slicing, transposing, an interface to C, via LuaJIT, linear algebra routines, neural network, energy-based models, numeric optimization routines, Fast and efficient GPU support, Embeddable, with ports to iOS, Android and FPGA" - Torch Webpage (November 2015). Like TensorFlow and Theano, Torch is more broadly applicable than just Neural Networks. It is a framework for implementing existing or creating new machine learning models using off-the-shelf data-structures and algorithms.

SciKit Learn


SciKit Learn is a very popular package for doing machine learning in Python. It is open source, built on NumPy, SciPy, and matplotlib, and exposes implementations of various machine learning models for classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.

As I mentioned, there are now hundreds of machine learning packages and frameworks out there. Before committing to any one solution I would recommend doing a best-fit analysis to see which open source or proprietary machine learning package or software best matches your use-cases. Generally speaking a good rule to follow in software engineering and model development for quantitative finance is to not reinvent the wheel ... that said, for any sufficiently advanced model you should expect to have to write some of your own code.



Neural networks are a class of powerful machine learning algorithms. They are based on solid statistical foundations and have been applied successfully in financial models as well as in trading strategies for many years. Despite this, they have a bad reputation due to the many unsuccessful attempts to use them in practice. In most cases, unsuccessful neural network implementations can be traced back to inappropriate neural network design decisions and general misconceptions about how they work. This article aims to articulate some of these misconceptions in the hopes that they might help individuals implementing neural networks meet with success.

For readers interested in getting more information, I have found the following books to be quite instructional when it comes to neural networks and their role in financial modelling and algorithmic trading. 

Neural network textbooks

Some instructional textbooks when it comes to implementing neural networks and other machine learning algorithms in finance. Many of the misconceptions presented in this article are discussed in more detail in Professor Andries Engelbrecht's book, 'An Introduction to Computational Intelligence'.


  1. Great effort behind this article, Stuart.

    Kindly check the email

    • Hi Michal, thank you for your email. I'm glad you enjoyed the article, please let me know if you have any suggestions for further material!

  2. Din Vadhia

    A terrific resource.

    It would be really illustrative to understand how the example applications mentioned - time-series forecasting, proprietary trading signal generation, fully automated trading (decision making), financial modelling, derivatives pricing, credit risk assessments, pattern matching, and security classification - are solved using neural networks or other machine learning methods. Is there a resource or blog that covers this?

    • Hi Dinesh, thanks for commenting. I think that online literature for the topic of Neural Networks applied to finance is fragmented. Therefore, it may be worthwhile trying to get a copy of a book called "Neural Networks in Finance" by Paul D. McNelis. The book is a bit dated, and probably won't cover all the latest developments in Neural Networks but it will definitely cover most of the applications I mentioned in my blog. Otherwise, the best resources are academic journal articles written on the topic. Journal articles are obviously a bit more technical but there is no better way to learn in my humble opinion. Good luck!

  3. Faiyaz

    Excellent blog Stuart...well-written, articulate & nuanced in its descriptions.

    • Thank you very much Faiyaz. I only hope that you and other readers are able to find good applications of the techniques discussed here 🙂

  4. Brian Maja

    Nice blog Mr Stuart, and thanks for summarizing alot of things. I was working on a neural network for my company inkunzi markets in Sandton, and just finished after 3 months(built from scratch), fuzzy neurons are not as easy to control and build indeed, but rather better when done perfectly interms of pattern recognition and market forecasting. Keep up the good work fellow Quant,
    BSc Mathematical Statistics, Physics and Electronics from Rhodes University.

    • Hi Brian, thanks for getting in touch. Thank you for the information, I have only read up on the neuro-fuzzy systems but never applied them in practice. I will check them out in more detail this year :).

  5. jack zhang

    Hi Stu, I am starting a quant invest platform development project here in Beijng based on big data intelligence from market emotion to technical trading signal using, and I am looking for international partner's join, if you have interests, maybe we can schedule a skype chat. Thank you with regards, your personal blog is awesome! Jack

    • Hi Jack, thank you for the compliments :). I will definitely be in touch, Beijing is an incredible city which I was lucky enough to visit last year for a conference.

  6. Irving Monzon


    • Thanks man. I appreciate the comment, that said this article is getting a little bit old now 🙂 so I'm busy working on a more technical follow up with implementation-level detail.

      Should come out in the next few months. Thanks again!

  7. Dr Walter H Delashmit

    Please sign me up for updates.

  8. Li

    My concern with neural networks is its ability to handle categorical data. I get the impression that in supervised learning situations, neural networks work best when all your independent variables are numeric (or at least mostly numeric). Is there any truth to this?

  9. Iwansyah Putra

    Thanks for the Article. I think this article is a must read for everyone 'new' at this field. As I call this method is a 'breadth-first' learning approach to Introduction to Neural Networks.

    Sorry for my bad english.

    • Thank you for the kind words, your English is fine 🙂

  10. john

    Looking for something like this for a while, all i can find are click-bait articles.
    Great research! Favorited!

    • Thanks John; I also really dislike all the mindless click-bait articles out there. This blog is all about content 🙂 - I really need to write more about neural networks though.

  11. Steven

    Thank you very very much!!! Your article is amazing especially for the beginner like me. From your article, I get an outline for what Neural Network is, how many kinds of NNs and how to use them properly. Plus, the external resources you provided are excellent too.

    • Thanks for the kind words Steven. I'm happy to hear that the article was helpful to you 🙂 good luck!

  12. Ankur

    Great article Stuart. Would you recommend any open source ANN tools that implement the Levenberg Marquardt learning algorithm?

    • Hi Ankur. The one package I used a few years ago which offered Levenberg Marquardt (often referred to as LMA) was Encog. I'm sure some of the others offer it as well.

  13. Tanneguy

    Great article. Neural network and article

  14. Louis Newton

    Hi Stuart, Thank you for this article - it was most illuminating!
    How do spiking neural networks fit into the overall picture of neural networks? (architecturally speaking and from the point of view of most suitable applications)

    • Hey Louis, thanks for the comment!

      That's an interesting question. Let me preface my response by stating that I have neither worked with nor explicitly studied spiking neural networks.

      That said, I have come across them before. Architecturally they are similar to any other neural network except that each individual neuron's complexity is higher like with product unit neural networks - which, by the way, I quite like. This added complexity makes spiking neural networks "more similar" to biological neural networks in the sense that neuron activation is not a continuous process, it is discontinuous. Which is actually how I came across them originally :-): I was researching applications for jump diffusion stochastic processes one of which is modelling the firing rate of neurons in spiking neural networks. But like I said, I haven't worked with them or studied them explicitly and I am not one hundred percent sure of their use cases.

      All I can say is that I am supportive of complex neural network architectures because I believe they may hold the key to more efficient and human-esque intelligence in machines.

  15. Michael Tony

    Hi Stuart,
    My name is Michael. I have read that Neural Network Regression can predict the market more than any other software or strategy. I have a question then; how can i use the neural network in trading, my main concern is the forex market. Is neural network regression a software? What is it? How can i use it in trading forex? How can it predict or forcaste the price of the eurusd for me?
    Am like totally naive on this and i need your help. If its a program software, how can i get one?

    Thank you and i anticipate your reply.


  16. Eric He

    Wow, thanks for the excellent write-up. It was incredibly well-researched and articulated. Keep it up!
