10 misconceptions about Neural Networks
Neural networks are one of the most popular and powerful classes of machine learning algorithms. In quantitative finance neural networks are often used for time series forecasting, constructing proprietary indicators, algorithmic trading, securities classification and credit risk modelling. They have also been used to construct stochastic process models and price derivatives. Despite their usefulness, neural networks tend to have a bad reputation because their performance is "temperamental". In my opinion this can be attributed to poor network design owing to misconceptions regarding how neural networks work. This article discusses some of those misconceptions.
 Neural networks are not models of the human brain
 Neural networks are not just a "weak form" of statistics
 Neural networks come in many different architectures
 Size matters, but bigger isn't always better
 Many training algorithms exist for neural networks
 Neural networks do not always require a lot of data
 Neural networks cannot be trained on any data
 Neural networks may need to be retrained
 Neural networks are not black boxes
 Neural networks are not hard to implement
1. Neural networks are not models of the human brain
The human brain is one of the great mysteries of our time and scientists have not reached a consensus on exactly how it works. Two theories of the brain exist, namely the grandmother cell theory and the distributed representation theory. The first theory asserts that individual neurons have high information capacity and are capable of representing complex concepts such as your grandmother or even Jennifer Aniston. The second theory asserts that neurons are much simpler and that representations of complex objects are distributed across many neurons. Artificial neural networks are loosely inspired by the second theory.
One reason why I believe current generation neural networks are not capable of sentience (a different concept to intelligence) is because I believe that biological neurons are much more complex than artificial neurons.
Another big difference between the brain and neural networks is size and organization. Human brains contain many more neurons and synapses than neural networks, and they are self-organizing and adaptive. Neural networks, by comparison, are organized according to an architecture. Neural networks are not "self-organizing" in the same sense as the brain, which much more closely resembles a graph than an ordered network.
So what does that mean? Think of it this way: a neural network is inspired by the brain in the same way that the Olympic stadium in Beijing is inspired by a bird's nest. That does not mean that the Olympic stadium is a bird's nest, it means that some elements of birds' nests are present in the design of the stadium. In other words, elements of the brain are present in the design of neural networks but they are a lot less similar than you might think.
In fact neural networks are more closely related to statistical methods such as curve fitting and regression analysis than to the human brain. In the context of quantitative finance I think it is important to remember that, because whilst it may sound cool to say that something is 'inspired by the brain', this statement may result in unrealistic expectations or fear. For more info see 'No! Artificial Intelligence is not an existential threat'.
2. Neural networks aren't a "weak form" of statistics
Neural networks consist of layers of interconnected nodes. Individual nodes are called perceptrons and resemble a multiple linear regression. The difference between a multiple linear regression and a perceptron is that a perceptron feeds the signal generated by a multiple linear regression into an activation function which may or may not be nonlinear. In a multilayer perceptron (MLP) perceptrons are arranged into layers and layers are connected with one another. In the MLP there are three types of layers, namely the input layer, hidden layer(s), and the output layer. The input layer receives input patterns and the output layer could contain a list of classifications or output signals to which those input patterns may map. Hidden layers adjust the weightings on those inputs until the error of the neural network is minimized. One interpretation of this is that the hidden layers extract salient features in the input data which have predictive power with respect to the outputs.
Mapping Inputs : Outputs
A perceptron receives a vector of inputs, $x = (x_1, x_2, \ldots, x_n)$, consisting of $n$ attributes. This vector of inputs is called an input pattern. These inputs are weighted according to the weight vector belonging to that perceptron, $w = (w_1, w_2, \ldots, w_n)$. In the context of multiple linear regression these can be thought of as regression coefficients or betas. The net input signal, $net$, of the perceptron is usually the sum product of the input pattern and its weights, $net = \sum_{i=1}^{n} x_i w_i$. Neurons which use the sum product for $net$ are called summation units.
The net input signal, minus a bias $\theta$, is then fed into some activation function, $f$. Activation functions are usually monotonically increasing functions which are bounded between either $(0, 1)$ or $(-1, 1)$ (this is discussed further on in this article). Activation functions can be linear or nonlinear.
The simplest neural network is one which has just one neuron which maps inputs to an output. Given a pattern, $p$, the objective of this network would be to minimize the error of the output signal, $o_p$, relative to some known target value for some given training pattern, $t_p$. For example, if the neuron was supposed to map $p$ to $-1$ but it mapped it to $1$ then the error, as measured by sum squared distance, of the neuron would be 4, $(-1 - 1)^2$.
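To make the mapping concrete, here is a minimal sketch of a single summation-unit perceptron in Python. The function names, the example pattern, the weights and the choice of tanh as the activation function are my own assumptions for illustration; they are not taken from any particular library.

```python
import math

def perceptron_output(inputs, weights, bias, activation):
    """Sum product of inputs and weights, minus the bias, fed into the activation."""
    net = sum(x * w for x, w in zip(inputs, weights))  # summation unit
    return activation(net - bias)

def squared_error(target, output):
    """Squared distance between the target value and the output signal."""
    return (target - output) ** 2

# Hypothetical input pattern and weights, chosen only for illustration.
pattern = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, 0.9]
output = perceptron_output(pattern, weights, bias=0.1, activation=math.tanh)

# A target of -1 and an output of 1 give a squared error of 4.
worked_example_error = squared_error(-1.0, 1.0)
```

Because tanh is bounded between -1 and 1, `output` always falls inside that range no matter how large the net input signal becomes.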
Layering
As shown in the image above perceptrons are organized into layers. The first layer of perceptrons, called the input layer, receives the patterns, $p$, in the training set. The last layer maps to the expected outputs for those patterns. An example of this is that the patterns may be a list of quantities for different technical indicators regarding a security and the potential outputs may be categories such as {BUY, HOLD, SELL}.
A hidden layer is one which receives as inputs the outputs from another layer; and for which the outputs form the inputs into yet another layer. So what do these hidden layers do? One interpretation is that they extract salient features in the input data which have predictive power with respect to the outputs. This is called feature extraction and in a way it performs a similar function to statistical techniques such as principal component analysis.
Deep neural networks have a large number of hidden layers and are able to extract much deeper features from the data. Recently, deep neural networks have performed particularly well for image recognition problems. An illustration of feature extraction in the context of image recognition is shown below,
I think that one of the problems facing the use of deep neural networks for trading (in addition to the obvious risk of overfitting) is that the inputs into the neural network are almost always heavily preprocessed meaning that there may be few features to actually extract because the inputs are already to some extent features.
Learning Rules
As mentioned previously the objective of the neural network is to minimize some measure of error, $E$. The most common measure of error is sum squared error, although this metric is sensitive to outliers and may be less appropriate than tracking error in the context of financial markets.
Sum squared error (SSE), $E = \sum_{p=1}^{P} (t_p - o_p)^2$, where $P$ is the number of training patterns and $t_p$ and $o_p$ are the target and output for pattern $p$.
Given that the objective of the network is to minimize $E$ we can use an optimization algorithm to adjust the weights in the neural network. The most common learning algorithm for neural networks is the gradient descent algorithm, although other and potentially better optimization algorithms can be used. Gradient descent works by calculating the partial derivative of the error with respect to the weights for each layer in the neural network and then moving in the opposite direction to the gradient (because we want to minimize the error of the neural network). By minimizing the error we maximize the performance of the neural network in sample.
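Expressed mathematically, the update rule for the weights in the neural network ($w_i$) is given by,
$$w_i \leftarrow w_i + \Delta w_i$$
where
$$\Delta w_i = \eta \left( -\frac{\partial E}{\partial w_i} \right)$$
where
$$\frac{\partial E}{\partial w_i} = -2 (t_p - o_p) \frac{\partial f}{\partial net_p} x_{i,p}$$
where $\eta$ is the learning rate which controls how quickly or slowly the neural network converges. Here $E$ is the sum squared error, $t_p$ and $o_p$ are the target and output for pattern $p$, $x_{i,p}$ is the $i$-th input of that pattern, and $f$ is the activation function. It is worth noting that the calculation of the partial derivative of $f$ with respect to the net input signal for a pattern $p$ represents a problem for any discontinuous activation functions, which is one reason why alternative optimization algorithms may be used. The choice of learning rate has a large impact on the performance of the neural network. Small values for $\eta$ may result in very slow convergence whereas high values for $\eta$ could result in a lot of variance in the training.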
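The update rule can be sketched for a single tanh neuron as follows. This is only an illustration under my own assumptions: the function name `train_neuron`, the toy data set (a logical OR with targets of -1 and 1), the learning rate and the number of epochs are all hypothetical, and the weights are updated after every pattern rather than once per pass.

```python
import math

def train_neuron(patterns, targets, learning_rate=0.1, epochs=200):
    """Gradient descent on the squared error of one tanh neuron."""
    weights = [0.0] * len(patterns[0])
    bias = 0.0
    history = []  # sum squared error accumulated during each epoch
    for _ in range(epochs):
        sse = 0.0
        for x, t in zip(patterns, targets):
            net = sum(xi * wi for xi, wi in zip(x, weights)) - bias
            o = math.tanh(net)
            sse += (t - o) ** 2
            # dE/dw_i = -2 (t - o) f'(net) x_i, where f'(net) = 1 - tanh(net)^2
            common = -2.0 * (t - o) * (1.0 - o * o)
            weights = [wi - learning_rate * common * xi for wi, xi in zip(weights, x)]
            # The bias is subtracted from the net input, so d net / d bias = -1.
            bias -= learning_rate * common * (-1.0)
        history.append(sse)
    return weights, bias, history

# Hypothetical toy problem: logical OR with targets in {-1, 1}.
patterns = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
targets = [-1.0, 1.0, 1.0, 1.0]
weights, bias, history = train_neuron(patterns, targets)
```

Plotting `history` for a few different values of `learning_rate` is a quick way to see the trade-off between slow convergence and noisy training described above.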
Summary
Despite what some of the statisticians I have met in my time believe, neural networks are not just a "weak form of statistics for lazy analysts" (I have actually been told this before and it was quite funny); neural networks represent an abstraction of solid statistical techniques which date back hundreds of years. For a fantastic explanation of the statistics behind neural networks I recommend reading this chapter. That having been said I do agree that some practitioners like to treat neural networks as a "black box" which can be thrown at any problem without first taking the time to understand the nature of the problem and whether or not neural networks are an appropriate choice. An example of this is the use of neural networks for trading; markets are dynamic yet neural networks assume the distribution of input patterns remains stationary over time. This is discussed in more detail here.
3. Neural networks come in many architectures
Up until now we have just discussed the simplest neural network architecture, namely the multilayer perceptron. There are many different neural network architectures (far too many to mention here) and the performance of any neural network is a function of its architecture and weights. Many modern day advances in the field of machine learning do not come from rethinking the way that perceptrons and optimization algorithms work but rather from being creative regarding how these components fit together. Below I discuss some very interesting and creative neural network architectures which have been developed over time:
Recurrent Neural Networks - some or all connections flow backwards, meaning that feedback loops exist in the network. These networks are believed to perform better on time series data. As such, they may be particularly relevant in the context of the financial markets. For more information here is a link to a fantastic article entitled, The unreasonable performance of recurrent [deep] neural networks.
A more recent interesting recurrent neural network architecture is the Neural Turing Machine. This network combines a recurrent neural network architecture with memory. It has been shown that these neural networks are Turing complete and were able to learn sorting algorithms and other computing tasks.
Boltzmann neural network - one of the first fully connected neural networks was the Boltzmann neural network, a.k.a. the Boltzmann machine. These networks were the first networks capable of learning internal representations and solving very difficult combinatoric problems. One interpretation of the Boltzmann machine is that it is a Monte Carlo version of the Hopfield recurrent neural network. Boltzmann machines can be quite difficult to train, but when constrained they can prove more efficient than traditional neural networks. The most popular constraint on Boltzmann machines is to disallow direct connections between hidden neurons. This particular architecture is referred to as a Restricted Boltzmann Machine, which is used in Deep Boltzmann Machines.
Deep neural networks - these are neural networks with multiple hidden layers. Deep neural networks have become extremely popular in more recent years due to their unparalleled success in image and voice recognition problems. The number of deep neural network architectures is growing quite quickly but some of the most popular architectures include deep belief networks, convolutional neural networks, deep restricted Boltzmann machines, stacked autoencoders, and many more. One of the biggest problems with deep neural networks, especially in the context of financial markets which are nonstationary, is overfitting. For more info see DeepLearning.net.
Adaptive neural networks - these are neural networks which simultaneously adapt and optimize their architectures whilst learning. This is done by either growing the architecture (adding more hidden neurons) or shrinking it (pruning unnecessary hidden neurons). I believe that adaptive neural networks are most appropriate for financial markets because markets are nonstationary. I say this because the features extracted by the neural network may strengthen or weaken over time depending on market dynamics. The implication of this is that any architecture which worked optimally in the past would need to be altered to work optimally today.
Radial basis networks - although not a different type of architecture in the sense of perceptrons and connections, radial basis networks make use of radial basis functions as their activation functions. These are real-valued functions whose output depends on the distance from a particular point. The most commonly used radial basis function is the Gaussian function. Because radial basis functions can take on much more complex forms, they were originally used for performing function interpolation. As such, a radial basis function neural network can have a much higher information capacity. Radial basis functions are also used in the kernel of a Support Vector Machine.
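As a small illustration of the last architecture, a Gaussian radial basis function can be sketched in a few lines. The function name and the `centre` and `width` parameters are my own naming; the point is only that the output depends on the distance from the centre and nothing else.

```python
import math

def gaussian_rbf(x, centre, width):
    """Gaussian radial basis function: depends only on the distance to the centre."""
    squared_distance = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return math.exp(-squared_distance / (2.0 * width ** 2))

at_centre = gaussian_rbf([1.0, 2.0], centre=[1.0, 2.0], width=1.0)
left = gaussian_rbf([0.0, 0.0], centre=[1.0, 0.0], width=1.0)
right = gaussian_rbf([2.0, 0.0], centre=[1.0, 0.0], width=1.0)
```

Two points at the same distance from the centre (`left` and `right`) produce the same activation, which is what distinguishes this kind of unit from a summation unit.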
In summary, many hundreds of neural network architectures exist and the performance of one neural network can be significantly superior to another. As such, quantitative analysts interested in using neural networks should probably test multiple neural network architectures and consider combining their outputs together in an ensemble to maximize their investment performance. I recommend reading my article, All Your Models are Wrong, 7 Sources of Model Risk, before using Neural Networks for trading because many of the problems still apply.
4. Size matters, but bigger isn't always better
Having selected an architecture one must then decide how large or small the neural network should be. How many inputs are there? How many hidden neurons should be used? How many hidden layers should be used (if we are using a deep neural network)? And how many output neurons are required? These questions are important because if the neural network is too large (too small) it could potentially overfit (underfit) the data, meaning that the network would not generalize well out of sample.
How many and which inputs should be used?
The number of inputs depends on the problem being solved, the quantity and quality of available data, and perhaps some creativity. Inputs are simply variables which we believe have some predictive power over the dependent variable being predicted. If the inputs to a problem are unclear, you can systematically determine which variables should be included by looking at the correlations and cross-correlations between potential independent variables and the dependent variables. This approach is detailed in the article, What Drives Real GDP Growth?
There are two problems with using correlations to select input variables. Firstly, if you are using a linear correlation metric you may inadvertently exclude useful variables which are related to the dependent variable in a nonlinear way. Secondly, two relatively uncorrelated variables could potentially be combined to produce a strongly correlated variable. If you look at the variables in isolation you may miss this opportunity. To overcome the second problem you could use principal component analysis to extract useful eigenvectors (linear combinations of the variables) as inputs. That said, a problem with this is that the eigenvectors may not generalize well and they also assume the distribution of input patterns is stationary.
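The first problem is easy to demonstrate. In the sketch below, which uses a hand-written Pearson correlation and a made-up data set, the dependent variable is fully determined by the input and yet the linear correlation between them is zero. The `pearson` helper and the data are assumptions for illustration only.

```python
import math

def pearson(a, b):
    """Pearson (linear) correlation coefficient between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [v * v for v in x]  # y depends entirely on x, but not linearly

linear_corr = pearson(x, y)
transformed_corr = pearson([v * v for v in x], y)
```

A selection rule based on `linear_corr` alone would discard `x`, even though a simple transformation of it explains `y` perfectly.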
Another problem when selecting variables is multicollinearity. Multicollinearity is when two or more of the independent variables being fed into the model are highly correlated. In the context of regression models this may cause regression coefficients to change erratically in response to small changes in the model or the data. Given that neural networks and regression models are similar I suspect this is also a problem for neural networks.
Last, but not least, one statistical bias which may be introduced when selecting variables is omitted-variable bias. Omitted-variable bias occurs when a model is created which leaves out one or more important causal variables. The bias is created when the model incorrectly compensates for the missing variable by over- or underestimating the effect of one of the other variables, i.e. the weights may become too large on these variables or the SSE will be large.
How many hidden neurons should I use?
The optimal number of hidden units is problem specific. That said, as a general rule of thumb, the more hidden units used the greater the risk of overfitting becomes. Overfitting is when the neural network does not learn the underlying statistical properties of the data, but rather 'memorizes' the patterns and any noise they may contain. This results in neural networks which perform well in sample but poorly out of sample. So how can we avoid overfitting? There are two popular approaches used in industry, namely early stopping and regularization, and then there is my personal favourite approach, global search:
Early stopping involves splitting your training set into the main training set and a validation set. Then instead of training a neural network for a fixed number of iterations, you train it until the performance of the neural network on the validation set begins to deteriorate. Essentially this prevents the neural network from using all of the available parameters and limits its ability to simply memorize every pattern it sees. The image on the right shows two potential stopping points for the neural network (a and b).
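A minimal sketch of an early stopping loop is given below. The function `train_with_early_stopping`, its `patience` parameter and the two callables are hypothetical names of my own; a real implementation would also save the weights from the best epoch. The validation curve is simulated with a simple U-shaped function purely so that the example runs on its own.

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=5):
    """Train until the validation error has not improved for `patience` epochs."""
    best_error = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        error = validation_error()
        if error < best_error:
            best_error, best_epoch = error, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation performance has started to deteriorate
    return best_epoch, best_error

# Simulated training run: the validation error falls until epoch 20, then rises.
state = {"epoch": 0}

def fake_train_one_epoch():
    state["epoch"] += 1

def fake_validation_error():
    return (state["epoch"] - 20) ** 2 / 100.0 + 1.0

best_epoch, best_error = train_with_early_stopping(
    fake_train_one_epoch, fake_validation_error)
```

The `patience` parameter is a common practical addition: validation error is usually noisy, so stopping at the very first uptick tends to stop too early.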
Regularization penalizes the neural network for using complex architectures. Complexity in this approach is measured by the size of the neural network weights. Regularization is done by adding a term to the sum squared error objective function which depends on the size of the weights. This is the equivalent of adding a prior which essentially makes the neural network believe that the function it is approximating is smooth,
$$E_R = \beta \sum_{p=1}^{P} (t_p - o_p)^2 + \alpha \sum_{j=1}^{W} w_j^2$$
where $W$ is the number of weights in the neural network and the first sum is the sum squared error over the $P$ training patterns. The parameters $\alpha$ and $\beta$ control the degree to which the neural network over- or underfits the data. Good values for $\alpha$ and $\beta$ can be derived using Bayesian analysis and optimization. This, and the above, are explained in considerably more detail in this brilliant chapter.
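In code, a weight-penalized objective of this kind might look like the sketch below. The function name and the parameter names `alpha` (weight on the penalty term) and `beta` (weight on the sum squared error) are my own labels, and the numbers are made up.

```python
def regularized_error(targets, outputs, weights, alpha, beta):
    """Sum squared error plus a penalty on the squared size of the weights."""
    sse = sum((t - o) ** 2 for t, o in zip(targets, outputs))
    weight_penalty = sum(w * w for w in weights)
    return beta * sse + alpha * weight_penalty

# Hypothetical values: two patterns and a network with two weights.
error = regularized_error(targets=[1.0, 0.0], outputs=[0.5, 0.5],
                          weights=[1.0, 2.0], alpha=0.1, beta=1.0)
```

Raising `alpha` relative to `beta` pushes the optimizer towards smaller weights, and therefore smoother functions, at the cost of a worse fit to the training patterns.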
My favourite technique, which is also by far the most computationally expensive, is global search. In this approach a search algorithm is used to try different neural network architectures and arrive at a near optimal choice. This is most often done using genetic algorithms which are discussed further on in this article.
What Are the Outputs?
Neural networks can be used for either regression or classification. Under a regression model a single value is outputted which may be mapped to a set of real numbers, meaning that only one output neuron is required. Under a classification model an output neuron is required for each potential class to which the pattern may belong. If the classes are unknown, unsupervised neural network techniques such as self organizing maps should be used.
In conclusion, the best approach is to follow Ockham's razor. Ockham's razor argues that for two models of equivalent performance, the model with fewer free parameters will generalize better. On the other hand, one should never opt for an overly simplistic model at the cost of performance. Similarly, one should not assume that just because a neural network has more hidden neurons and maybe more hidden layers it will outperform a much simpler network. Unfortunately it seems to me that too much emphasis is placed on large networks and too little emphasis is placed on making good design decisions. In the case of neural networks, bigger isn't always better.
5. Many training algorithms exist for neural networks
The learning algorithm of a neural network tries to optimize the neural network's weights until some stopping condition has been met. This condition is typically either when the error of the network reaches an acceptable level of accuracy on the training set, when the error of the network on the validation set begins to deteriorate, or when the specified computational budget has been exhausted. The most common learning algorithm for neural networks is the backpropagation algorithm, which uses the (stochastic) gradient descent approach discussed earlier on in this article. Backpropagation consists of two steps:
 The feedforward pass - the training data set is passed through the network, the output from the neural network is recorded, and the error of the network is calculated.
 Backward propagation - the error signal is passed back through the network and the weights of the neural network are optimized using gradient descent.
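The two steps can be sketched for a small multilayer perceptron with one hidden layer as follows. Everything here is an assumption made for illustration: the function names, the XOR data set, the sigmoid activation, the network size, the learning rate and the seed. The weights are updated after each pattern, and the bias is modelled as an extra weight on a constant input of 1, which is equivalent to subtracting a learned bias.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_weights(n_in, n_hidden, seed=0):
    rng = random.Random(seed)
    hidden_w = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    output_w = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return hidden_w, output_w

def forward(x, hidden_w, output_w):
    """Feedforward pass; the last weight in each row acts on a constant input of 1."""
    inputs = x + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs))) for row in hidden_w]
    output = sigmoid(sum(w * v for w, v in zip(output_w, hidden + [1.0])))
    return hidden, output

def sum_squared_error(data, hidden_w, output_w):
    return sum((t - forward(x, hidden_w, output_w)[1]) ** 2 for x, t in data)

def train(data, hidden_w, output_w, learning_rate=0.5, epochs=5000):
    """Backpropagation with a weight update after every pattern (in place)."""
    for _ in range(epochs):
        for x, t in data:
            hidden, o = forward(x, hidden_w, output_w)   # step 1: feedforward pass
            delta_o = -2.0 * (t - o) * o * (1.0 - o)     # step 2: output error signal
            for j, h in enumerate(hidden):
                # Error signal passed back to hidden unit j (uses the old output weight).
                delta_h = delta_o * output_w[j] * h * (1.0 - h)
                for i, v in enumerate(x + [1.0]):
                    hidden_w[j][i] -= learning_rate * delta_h * v
            for j, v in enumerate(hidden + [1.0]):
                output_w[j] -= learning_rate * delta_o * v

xor = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
hidden_w, output_w = init_weights(n_in=2, n_hidden=3)
error_before = sum_squared_error(xor, hidden_w, output_w)
train(xor, hidden_w, output_w)
error_after = sum_squared_error(xor, hidden_w, output_w)
```

Comparing `error_before` with `error_after` shows how much the in-sample error has fallen; how close it gets to zero depends on the starting weights, which is exactly the sensitivity to initial conditions that motivates the alternatives discussed next.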
There are some problems with this approach. Adjusting all the weights at once can result in a significant movement of the neural network in weight space, the gradient descent algorithm is quite slow, and it is susceptible to local minima. Local minima are a problem for specific types of neural networks including all product link neural networks. The first two problems can be addressed by using variants of gradient descent including momentum gradient descent, QuickProp, Nesterov's Accelerated Gradient (NAG) descent, the Adaptive Gradient Algorithm (AdaGrad), Resilient Propagation (RProp), and Root Mean Squared Propagation (RMSProp). As can be seen from the image below significant improvements can be made on the classical gradient descent algorithm.
That having been said, these algorithms cannot overcome local minima and are also less useful when trying to optimize both the architecture and weights of the neural network concurrently. In order to achieve this global optimization algorithms are needed. Two popular global optimization algorithms are the Particle Swarm Optimization (PSO) and the Genetic Algorithm (GA). Here is how they can be used to train neural networks:
Neural network vector representation - by encoding the neural network as a vector of weights, each representing the weight of a connection in the neural network, we can train neural networks using most metaheuristic search algorithms. This technique does not work well with deep neural networks because the vectors become too large.
Particle Swarm Optimization - to train a neural network using a PSO we construct a population / swarm of those neural networks. Each neural network is represented as a vector of weights and is adjusted according to its position relative to the global best particle and its own personal best.
The fitness function is calculated as the sum squared error of the reconstructed neural network after completing one feedforward pass of the training data set. The main consideration with this approach is the velocity of the weight updates. This is because if the weights are adjusted too quickly, the sum squared error of the neural networks will stagnate and no learning will occur.
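Here is a minimal sketch of a global-best PSO training a weight vector. To keep it short the "network" is a single linear neuron with two weights, and the data set, the inertia and acceleration coefficients, the swarm size and the seed are all assumptions of mine rather than recommended settings.

```python
import random

def pso_minimize(fitness, dimensions, n_particles=20, iterations=100,
                 inertia=0.7, cognitive=1.4, social=1.4, seed=0):
    """Global-best particle swarm optimization over real-valued weight vectors."""
    rng = random.Random(seed)
    positions = [[rng.uniform(-1, 1) for _ in range(dimensions)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dimensions for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_best_fit = [fitness(p) for p in positions]
    g = min(range(n_particles), key=lambda i: personal_best_fit[i])
    global_best, global_best_fit = personal_best[g][:], personal_best_fit[g]
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dimensions):
                r1, r2 = rng.random(), rng.random()
                # Velocity is pulled towards the personal best and the global best.
                velocities[i][d] = (inertia * velocities[i][d]
                                    + cognitive * r1 * (personal_best[i][d] - positions[i][d])
                                    + social * r2 * (global_best[d] - positions[i][d]))
                positions[i][d] += velocities[i][d]
            fit = fitness(positions[i])
            if fit < personal_best_fit[i]:
                personal_best[i], personal_best_fit[i] = positions[i][:], fit
                if fit < global_best_fit:
                    global_best, global_best_fit = positions[i][:], fit
    return global_best, global_best_fit

# Hypothetical training set generated by y = 2x + 1.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [2.0 * x + 1.0 for x in xs]

def sum_squared_error(weights):
    """Fitness: SSE of a linear neuron (slope, intercept) after one pass of the data."""
    return sum((y - (weights[0] * x + weights[1])) ** 2 for x, y in zip(xs, ys))

best_weights, best_error = pso_minimize(sum_squared_error, dimensions=2)
```

The `inertia`, `cognitive` and `social` coefficients are what control the velocity of the weight updates; setting them too high makes the particles overshoot and the error stagnate.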
Genetic Algorithm - to train a neural network using a genetic algorithm we first construct a population of vector represented neural networks. Then we apply the three genetic operators on that population to evolve better and better neural networks. These three operators are:
 Selection - using the sum squared error of each network calculated after one feedforward pass, we rank the population of neural networks. The top x% of the population are selected to 'survive' to the next generation and be used for crossover.
 Crossover - the top x% of the population's genes are allowed to cross over with one another. This process forms 'offspring'. In context, each offspring will represent a new neural network with weights from both of the 'parent' neural networks.
 Mutation - this operator is required to maintain genetic diversity in the population. A small percentage of the population are selected to undergo mutation. Some of the weights in these neural networks will be adjusted randomly within a particular range.
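The three operators above can be sketched as follows. As before, this is a toy under my own assumptions: the function name `evolve`, the survival rate, mutation rate and range, the seed, and the use of a two-weight linear neuron as the "network" are all illustrative choices, and uniform crossover is just one of several possible crossover schemes.

```python
import random

def evolve(fitness, dimensions, population_size=30, generations=50,
           survival_rate=0.3, mutation_rate=0.1, mutation_range=0.5, seed=0):
    """Genetic algorithm over weight vectors; lower fitness (error) is better."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(dimensions)]
                  for _ in range(population_size)]
    n_survivors = max(2, int(survival_rate * population_size))
    best_per_generation = []
    for _ in range(generations):
        # Selection: rank by error and keep the top x% unchanged.
        population.sort(key=fitness)
        best_per_generation.append(fitness(population[0]))
        survivors = population[:n_survivors]
        offspring = []
        while len(survivors) + len(offspring) < population_size:
            # Crossover: each weight is inherited from one of two parents.
            mother, father = rng.sample(survivors, 2)
            child = [rng.choice(genes) for genes in zip(mother, father)]
            # Mutation: a small share of offspring have some weights perturbed.
            if rng.random() < mutation_rate:
                child = [w + rng.uniform(-mutation_range, mutation_range)
                         if rng.random() < 0.5 else w for w in child]
            offspring.append(child)
        population = survivors + offspring
    return min(population, key=fitness), best_per_generation

# Hypothetical training set generated by y = 0.5x - 0.25.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0.5 * x - 0.25 for x in xs]

def sum_squared_error(weights):
    return sum((y - (weights[0] * x + weights[1])) ** 2 for x, y in zip(xs, ys))

best_weights, best_per_generation = evolve(sum_squared_error, dimensions=2)
```

Because the survivors are carried over unchanged, the best error in `best_per_generation` can never get worse from one generation to the next.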
In addition to these population-based metaheuristic search algorithms, other algorithms have been used to train neural networks including backpropagation with added momentum, differential evolution, Levenberg-Marquardt, simulated annealing, and many more. Personally I would recommend using a combination of local and global optimization algorithms to overcome the shortcomings of both.
6. Neural networks do not always require a lot of data
Neural networks can use one of three learning strategies, namely a supervised learning strategy, an unsupervised learning strategy, or a reinforcement learning strategy. Supervised learning requires at least two data sets, a training set which consists of inputs with the expected output, and a testing set which consists of inputs without the expected output. Both of these data sets must consist of labelled data, i.e. data patterns for which the target is known upfront. Unsupervised learning strategies are typically used to discover hidden structures (such as hidden Markov chains) in unlabelled data. They behave in a similar way to clustering algorithms. Reinforcement learning is based on the simple premise of rewarding neural networks for good behaviours and punishing them for bad behaviours. Because unsupervised and reinforcement learning strategies do not require that data be labelled they can be applied to underformulated problems where the correct output is not known.
Unsupervised Learning
One of the most popular unsupervised neural network architectures is the Self Organizing Map (also known as the Kohonen Map). Self Organizing Maps are essentially a multidimensional scaling technique which constructs an approximation of the probability density function of some underlying data set, $X$, whilst preserving the topological structure of that data set. This is done by mapping input vectors, $x_i$, in the data set, $X$, to weight vectors, $w_j$, (neurons) in the feature map, $W$. Preserving the topological structure simply means that if two input vectors are close together in $X$, then the neurons to which those input vectors map in $W$ will also be close together.
For more information on self organizing maps and how they can be used to produce lower-dimensionality data sets click here. Another interesting application of SOMs is in colouring time series charts for stock trading. This is done to show what the market conditions are at that point in time. This website provides a detailed tutorial and code snippets for implementing the idea for improved Forex trading strategies.
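For readers who want to see the mechanics, below is a minimal sketch of a one-dimensional self organizing map. The function names, the linear decay schedules, the Gaussian neighbourhood, the map size and the random data are all my own assumptions; real implementations usually use a two-dimensional grid.

```python
import math
import random

def best_matching_unit(neurons, x):
    """Index of the neuron whose weight vector is closest to the input vector."""
    return min(range(len(neurons)),
               key=lambda j: sum((xi - wi) ** 2 for xi, wi in zip(x, neurons[j])))

def train_som(data, n_neurons=10, epochs=50, learning_rate=0.5, radius=2.0, seed=0):
    """Train a 1-D map: the winner and its neighbours move towards each input."""
    rng = random.Random(seed)
    dim = len(data[0])
    neurons = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = learning_rate * decay
        sigma = max(radius * decay, 0.5)
        for x in data:
            winner = best_matching_unit(neurons, x)
            for j, w in enumerate(neurons):
                # Neurons near the winner on the map are pulled along with it.
                influence = math.exp(-((j - winner) ** 2) / (2.0 * sigma ** 2))
                for d in range(dim):
                    w[d] += lr * influence * (x[d] - w[d])
    return neurons

# Hypothetical data: 200 random points in the unit square.
data_rng = random.Random(1)
data = [[data_rng.random(), data_rng.random()] for _ in range(200)]
som = train_som(data)
```

The neighbourhood update is what preserves topology: because neighbours on the map are dragged towards the same inputs, neurons that sit next to each other end up representing nearby regions of the data.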
Reinforcement Learning
Reinforcement learning strategies consist of three components. A policy which specifies how the neural network will make decisions e.g. using technical and fundamental indicators. A reward function which distinguishes good from bad e.g. making vs. losing money. And a value function which specifies the long term goal. In the context of financial markets (and game playing) reinforcement learning strategies are particularly useful because the neural network learns to optimize a particular quantity such as an appropriate measure of risk adjusted return.
7. Neural networks cannot be trained on any data
One of the biggest reasons why neural networks may not work is because people do not properly preprocess the data being fed into the neural network. Data normalization, removal of redundant information, and outlier removal should all be performed to improve the probability of good neural network performance.
Data normalization - neural networks consist of various layers of perceptrons linked together by weighted connections. Each perceptron contains an activation function which has an 'active range' (except for radial basis functions). Inputs into the neural network need to be scaled within this range so that the neural network is able to differentiate between different input patterns.
For example, consider a neural network trading system which receives indicators about a set of securities as inputs and outputs whether each security should be bought or sold. One of the inputs is the price of the security and we are using the sigmoid activation function. However, most of the securities cost between $5 and $15 per share, and for inputs of that size the output of the sigmoid function approaches 1.0. So the output of the sigmoid function will be close to 1.0 for all securities, all of the perceptrons will 'fire' and the neural network will not learn.
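The saturation problem is easy to reproduce. The sketch below feeds a handful of made-up prices through a sigmoid function, first raw and then after min-max scaling to the range -1 to 1. The helper `min_max_scale` and the target range are assumptions for illustration; other scaling schemes, such as standardizing to zero mean and unit variance, would serve the same purpose.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def min_max_scale(values, low=-1.0, high=1.0):
    """Linearly rescale values so the smallest maps to `low` and the largest to `high`."""
    smallest, largest = min(values), max(values)
    return [low + (high - low) * (v - smallest) / (largest - smallest) for v in values]

prices = [5.0, 7.5, 10.0, 12.5, 15.0]  # hypothetical share prices in dollars

raw_outputs = [sigmoid(p) for p in prices]
scaled_prices = min_max_scale(prices)
scaled_outputs = [sigmoid(p) for p in scaled_prices]
```

Every value in `raw_outputs` is above 0.99, so the neuron cannot tell a $5 share from a $15 share, whereas `scaled_outputs` spreads across the responsive part of the sigmoid.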
Neural networks trained on unprocessed data produce models where 'the lights are on but nobody's home'
Outlier removal - an outlier is a value that is much smaller or larger than most of the other values in some set of data. Outliers can cause problems with statistical techniques like regression analysis and curve fitting because when the model tries to 'accommodate' the outlier, performance of the model across all other data deteriorates.
The illustration shows that trying to accommodate an outlier into the linear regression model results in a poor fit of the data set. The effect of outliers on nonlinear regression models, including neural networks, is similar. Therefore it is good practice to remove outliers from the training data set. That said, identifying outliers is a challenge in and of itself; this tutorial and paper discuss existing techniques for outlier detection and removal.
Remove redundancy - when two or more of the independent variables being fed into the neural network are highly correlated (multicollinearity) this can negatively affect the neural network's learning ability. Highly correlated inputs also mean that the amount of unique information presented by each variable is small, so the less significant input can be removed. Another benefit of removing redundant variables is faster training times. Adaptive neural networks can be used to prune redundant connections and perceptrons.
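The effect on a regression line can be reproduced with a few lines of Python. In this sketch the data set is made up, and the outlier filter uses the modified z-score (based on the median absolute deviation) with the commonly quoted threshold of 3.5, applied directly to the dependent variable. That is one simple technique among many and is my choice for illustration, not a recommendation from the tutorial or paper mentioned above.

```python
from statistics import median

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def remove_outliers(xs, ys, threshold=3.5):
    """Drop points whose y value has a modified z-score above the threshold."""
    centre = median(ys)
    mad = median(abs(y - centre) for y in ys)  # median absolute deviation
    kept = [(x, y) for x, y in zip(xs, ys)
            if abs(0.6745 * (y - centre) / mad) <= threshold]
    return [x for x, _ in kept], [y for _, y in kept]

# Hypothetical data on the line y = 2x + 1, with one corrupted observation.
xs = [float(x) for x in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[5] = 100.0

slope_with_outlier, _ = fit_line(xs, ys)
clean_xs, clean_ys = remove_outliers(xs, ys)
slope_without_outlier, _ = fit_line(clean_xs, clean_ys)
```

A single bad observation is enough to pull the fitted slope well away from 2, and removing it restores the original line.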
8. Neural networks may need to be retrained
Even if you were able to train a neural network to trade successfully in and out of sample, this neural network may still stop working over time. This is not a poor reflection on neural networks but rather an accurate reflection of the financial markets. Financial markets are complex adaptive systems, meaning that they are constantly changing, so what worked yesterday may not work tomorrow. This characteristic is called nonstationarity, and it turns trading into a dynamic optimization problem; neural networks are not particularly good at handling such problems.
Dynamic environments, such as financial markets, are extremely difficult for neural networks to model. Two approaches are either to keep retraining the neural network over time, or to use a dynamic neural network. Dynamic neural networks 'track' changes to the environment over time and adjust their architecture and weights accordingly. They are adaptive over time. For dynamic problems, multi-solution metaheuristic optimization algorithms can be used to track changes to local optima over time. One such algorithm is the multi-swarm optimization algorithm, a derivative of the particle swarm optimization. Additionally, genetic algorithms with enhanced diversity or memory have also been shown to be robust in dynamic environments.
The illustration below demonstrates how a genetic algorithm evolves over time to find new optima in a dynamic environment. This illustration also happens to mimic trade crowding which is when market participants crowd a profitable trading strategy, thereby exhausting trading opportunities causing the trade to become less profitable.
9. Neural networks are not black boxes
By itself a neural network is a black box. This presents problems for people wanting to use them. For example, fund managers wouldn't know how a neural network makes trading decisions, so it is impossible to assess the risks of the trading strategies learned by the neural network. Similarly, banks using neural networks for credit risk modelling would not be able to justify why a customer has a particular credit rating, which is a regulatory requirement. That having been said, state-of-the-art rule-extraction algorithms have been developed to vitrify (make transparent) some neural network architectures. These algorithms extract knowledge from the neural networks as either mathematical expressions, symbolic logic, fuzzy logic, or decision trees.
Mathematical rules - algorithms have been developed which can extract multiple linear regression lines from neural networks. The problem with these techniques is that the rules are often still difficult to understand, so they do not solve the 'black box' problem.
Propositional logic - propositional logic is a branch of mathematical logic which deals with operations done on discrete-valued variables. These variables, such as A or B, are often either TRUE or FALSE, but they could occupy values within a discrete range e.g. {BUY, HOLD, SELL}.
Logical operations such as OR, AND, and XOR can then be applied to those variables. In predicate logic, the resulting statements can additionally be quantified over sets using the exists or for-all quantifiers; this is the difference between predicate and propositional logic. If we had a simple neural network which took Price (P), Simple Moving Average (SMA), and Exponential Moving Average (EMA) as inputs and we extracted a trend following strategy from the neural network in propositional logic, we might get rules like this,
Fuzzy logic - fuzzy logic is where probability and propositional logic meet. The problem with propositional logic is that it deals in absolutes e.g. BUY or SELL, TRUE or FALSE, 0 or 1. Therefore traders have no way to determine the confidence of these results. Fuzzy logic overcomes this limitation by introducing a membership function which specifies how much a variable belongs to a particular domain. For example, a company (GOOG) might belong 0.7 to the domain {BUY} and 0.3 to the domain {SELL}. Combinations of neural networks and fuzzy logic are called Neuro-Fuzzy systems. This research survey discusses various fuzzy rule extraction techniques.
Decision trees - decision trees show how decisions are made when given certain information. This article describes how to evolve security analysis decision trees using genetic programming. Decision tree induction is the term given to the process of extracting decision trees from neural networks.
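To make the contrast between the propositional and fuzzy representations above more concrete, here is a small hypothetical sketch. The rules and the membership function are invented for illustration and were not extracted from any real network. The crisp rules return exactly one of {BUY, HOLD, SELL}, whereas the fuzzy version returns a degree of membership in BUY and SELL.

```python
def propositional_signal(price, sma, ema):
    """Crisp trend-following rules: every condition is strictly TRUE or FALSE."""
    if price > sma and price > ema:
        return "BUY"
    if price < sma and price < ema:
        return "SELL"
    return "HOLD"


def fuzzy_signal(price, sma, band=0.05):
    """Degrees of membership in {BUY, SELL} instead of a single crisp answer.

    Membership in BUY rises linearly from 0 to 1 as the price moves from
    `band` (as a fraction of the SMA) below the SMA to `band` above it.
    The ramp shape and the 5% band are arbitrary illustrative choices.
    """
    relative_gap = (price - sma) / sma
    buy = min(1.0, max(0.0, 0.5 + relative_gap / (2 * band)))
    return {"BUY": buy, "SELL": 1.0 - buy}


print(propositional_signal(price=105.0, sma=100.0, ema=102.0))  # BUY
print(propositional_signal(price=101.0, sma=100.0, ema=102.0))  # HOLD
print(fuzzy_signal(price=102.0, sma=100.0))  # roughly 0.7 BUY, 0.3 SELL
```

With the price 2% above the SMA, the fuzzy version gives memberships of roughly 0.7 for BUY and 0.3 for SELL, mirroring the GOOG example above. The crisp rules only ever give a single label with no indication of confidence.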
10. Neural networks are not hard to implement
This list is updated, from time to time, when I have time. Last updated: November 2015.
Speaking from experience, neural networks are quite challenging to code from scratch. Luckily there are now hundreds of open source and proprietary packages which make working with neural networks a lot easier. Below is a list of packages which quants may find useful for quantitative finance. The list is NOT exhaustive, and is ordered alphabetically. If you have any additional comments, or frameworks to add, please share via the comment section.
Caffe
Webpage - http://caffe.berkeleyvision.org/
GitHub Repository - https://github.com/BVLC/caffe
"Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley." - Caffe webpage (November 2015)
Encog
Webpage - http://www.heatonresearch.com/encog/
GitHub Repositories - https://github.com/encog
"Encog is an advanced machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data. Machine learning algorithms such as Support Vector Machines, Artificial Neural Networks, Genetic Programming, Bayesian Networks, Hidden Markov Models, Genetic Programming and Genetic Algorithms are supported. Most Encog training algorithms are multithreaded and scale well to multicore hardware. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train machine learning algorithms." - Encog webpage
H2O
Webpage - http://h2o.ai/
GitHub Repositories - https://github.com/h2oai
H2O is not strictly a package for machine learning; instead it exposes an API for doing fast and scalable machine learning for smarter applications which use big data. The API supports deep learning models, generalized boosting models, generalized linear models, and more. They also host a cool conference, check out the videos :).
Google TensorFlow
Webpage - http://www.tensorflow.org/
GitHub repository - https://github.com/tensorflow/tensorflow
"TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) that flow between them. This flexible architecture lets you deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device without rewriting code." - GitHub repository (November 2015)
Microsoft Distributed Machine Learning Toolkit
Webpage - http://www.dmtk.io/
GitHub repository - https://github.com/Microsoft/DMTK
"DMTK includes the following projects: DMTK framework(Multiverso): The parameter server framework for distributed machine learning. LightLDA: Scalable, fast and lightweight system for large-scale topic modeling. Distributed word embedding: Distributed algorithm for word embedding. Distributed skipgram mixture: Distributed algorithm for multi-sense word embedding." - GitHub repository (November 2015)
Microsoft Azure Machine Learning
Webpage - https://azure.microsoft.com/en-us/services/machine-learning
GitHub Repositories - https://github.com/Azure?utf8=%E2%9C%93&query=MachineLearning
The machine learning / predictive analytics platform in Microsoft Azure is a fully managed cloud service that enables you to easily build, deploy, and share predictive analytics solutions. This software basically allows you to drag and drop pre-built components (including machine learning models) and custom-built components which manipulate data sets into a process. This flowchart is then compiled into a program and can be deployed as a web service. It is similar to the older SAS Enterprise Miner solution except that it is more modern, more functional, supports deep learning models, and exposes clients for Python and R.
MXNet
Webpage - http://mxnet.readthedocs.org/en/latest/
GitHub Repositories - https://github.com/dmlc/mxnet
"MXNet is a deep learning framework designed for both efficiency and flexibility. It allows you to mix the flavours of symbolic programming and imperative programming together to maximize the efficiency and your productivity. In its core, a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly. A graph optimization layer is build on top, which makes symbolic execution fast and memory efficient. The library is portable and lightweight, and is ready scales to multiple GPUs, and multiple machines." - MXNet GitHub Repository (November 2015)
Neon
Webpage - http://neon.nervanasys.com/docs/latest/index.html
GitHub Repository - https://github.com/nervanasystems/neon
"neon is Nervana's Python based Deep Learning framework and achieves the fastest performance on many common deep neural networks such as AlexNet, VGG and GoogLeNet. We have designed it with the following functionality in mind: 1) Support for commonly used models and examples: convnets, MLPs, RNNs, LSTMs, autoencoders, 2) Tight integration with nervanagpu kernels for fp16 and fp32 (benchmarks) on Maxwell GPUs, 3) Basic automatic differentiation support, 4) Framework for visualization, and 5) Swappable hardware backends ..." - neon GitHub repository (November 2015)
Theano
Webpage - http://deeplearning.net/software/theano/
GitHub repository - https://github.com/Theano/Theano
"Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multidimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation." - Theano GitHub repository (November 2015). Theano, like TensorFlow and Torch, is more broadly applicable than just Neural Networks. It is a framework for implementing existing or creating new machine learning models using off-the-shelf data structures and algorithms.
Torch
Webpage - http://torch.ch/
GitHub Repository - https://github.com/torch/torch7
"Torch is a scientific computing framework with wide support for machine learning algorithms ... A summary of core features include an N-dimensional array, routines for indexing, slicing, transposing, an interface to C, via LuaJIT, linear algebra routines, neural network, energy-based models, numeric optimization routines, Fast and efficient GPU support, Embeddable, with ports to iOS, Android and FPGA" - Torch Webpage (November 2015). Like TensorFlow and Theano, Torch is more broadly applicable than just Neural Networks. It is a framework for implementing existing or creating new machine learning models using off-the-shelf data structures and algorithms.
SciKit Learn
Webpage - http://scikit-learn.org/stable/
GitHub Repository - https://github.com/scikit-learn/scikit-learn
SciKit Learn is a very popular package for doing machine learning in Python. It is open source, built on NumPy, SciPy, and matplotlib, and exposes implementations of various machine learning models for classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.
As I mentioned, there are now hundreds of machine learning packages and frameworks out there. Before committing to any one solution I would recommend doing a best-fit analysis to see which open source or proprietary machine learning package or software best matches your use cases. Generally speaking, a good rule to follow in software engineering and model development for quantitative finance is to not reinvent the wheel ... that said, for any sufficiently advanced model you should expect to have to write some of your own code.
Conclusion
Neural networks are a class of powerful machine learning algorithms. They are based on solid statistical foundations and have been applied successfully in financial models as well as in trading strategies for many years. Despite this, they have a bad reputation due to the many unsuccessful attempts to use them in practice. In most cases, unsuccessful neural network implementations can be traced back to inappropriate neural network design decisions and general misconceptions about how they work. This article aims to articulate some of these misconceptions in the hope that doing so helps individuals implementing neural networks meet with success.
For readers interested in getting more information, I have found the following books to be quite instructional when it comes to neural networks and their role in financial modelling and algorithmic trading.

Great effort behind this article, Stuart.
Kindly check the email

A terrific resource.
It would be really illustrative to understand how the example applications mentioned - timeseries forecasting, proprietary trading signal generation, fully automated trading (decision making), financial modelling, derivatives pricing, credit risk assessments, pattern matching, and security classification - are solved using neural networks or other machine learning methods. Is there a resource or blog that covers this?

Excellent blog Stuart... well-written, articulate & nuanced in its descriptions.

Nice blog Mr Stuart, and thanks for summarizing a lot of things. I was working on a neural network for my company inkunzi markets in Sandton, and just finished after 3 months (built from scratch). Fuzzy neurons are indeed not as easy to control and build, but are rather better when done perfectly in terms of pattern recognition and market forecasting. Keep up the good work fellow Quant,
BSc Mathematical Statistics, Physics and Electronics from Rhodes University. 
Hi Stu, I am starting a quant investment platform development project here in Beijing based on big data intelligence, from market emotion to technical trading signals, and I am looking for international partners to join. If you are interested, maybe we can schedule a Skype chat. Thank you with regards, your personal blog is awesome! Jack

Awesome

Please sign me up for updates.
Thanks 
My concern with neural networks is its ability to handle categorical data. I get the impression that in supervised learning situations, neural networks work best when all your independent variables are numeric (or at least mostly numeric). Is there any truth to this?

Thanks for the article. I think it is a must-read for everyone 'new' to this field. I would call this method a 'breadth-first' learning approach to an introduction to neural networks.
Sorry for my bad english.

Looking for something like this for a while, all i can find are clickbait articles.
Great research! Favorited! 
Thank you very very much!!! Your article is amazing especially for the beginner like me. From your article, I get an outline for what Neural Network is, how many kinds of NNs and how to use them properly. Plus, the external resources you provided are excellent too.

Great article Stuart. Would you recommend any open source ANN tools that implement the Levenberg Marquardt learning algorithm?

Great article. Neural network and article

Hi Stuart, Thank you for this article - it was most illuminating!
How do spiking neural networks fit into the overall picture of neural networks? (architecturally speaking and from the point of view of most suitable applications) 
Hi Stuart,
My name is Michael. I have read that Neural Network Regression can predict the market better than any other software or strategy. I have a question then: how can I use a neural network in trading? My main concern is the forex market. Is neural network regression a piece of software? What is it? How can I use it in trading forex? How can it predict or forecast the price of the EUR/USD for me?
I am totally naive on this and I need your help. If it's a software program, how can I get one? Thank you and I anticipate your reply.
Michael.

Wow, thanks for the excellent write-up. It was incredibly well-researched and articulated. Keep it up!