Lossless Compression Algorithms and Market Efficiency?
In Hacking The Random Walk Hypothesis we applied the NIST suite of cryptographic tests for randomness to binarized daily market returns. Overall the NIST suite failed on the data. This result was taken to mean that markets are not quite the "coin flipping competition" famously posited by some notable academics. That having been said, some of the tests in the suite did pass indicating some degree of noise in market returns.
One such test, the linear complexity test, caught my eye because it applies compression to test the randomness of a sequence. This idea  that randomness is related to compressibility  dates back to Claude Shannon's information theory. Which, through algorithmic information theory, is related to Alonzo Church and Alan Turing's computability theory. Being a computer scientist these topics fascinate me; almost moreso than the markets do.
In this article we will expand upon this line of thought by applying greedy lossless compression algorithms to binarized and detrended market index returns and, spoiler alert, we will arrive at some unexpected conclusions. Following this article we shall continue our nonrandom walk down wall street by considering lossy compression algorithms (including autoencoder neural networks) and then Lo's test for long term memory in stock markets.
.
As mentioned in my last post, Stock Markets Do Not Follow Random Walks, the featured images for this series of articles are simply screenshots from some of my favourite Wall Street inspired films. This featured image is taken from the classic 1987 Wall Street movie and is, of course, portraying the infamous Gordon Gekko.
Article Outline
The article is broken up into the following sub sections,
 The Efficient Market and Random Walk hypotheses.
 Algorithmic Information Theory and Randomness.
 Compression and Market Randomness; Challenges.
 Proposed Compression Test of Market Randomness.
 Overview of Lossless Compression Algorithms in R.
 Results obtained and their meaning(lessness)?
 Conclusions and further research efforts.
 Appendix A: Link to all of the source code.
The Efficient Market and Random Walk hypotheses
The strong form of the efficient market hypothesis states that markets accurately and instantaneously discount all publicly and privatelyavailable information about a security into its price. As such, the best forecast of any securities price tomorrow is just its price today because its price today is necessarily correct.
Furthermore, this hypothesis implies that no careful study of past prices  or fundamental or economic information  will result in a consistently more accurate forecast of the securities price in the future. In an informationally efficient market the change in price from today to tomorrow depends upon unknowable information from tomorrow; which is why the change is random from today's perspective and a rational reaction from tomorrow's perspective. Only timetravellers and high frequency information arbitrageurs can hope to beat this market.
Despite my inclination towards more "complex" market hypotheses such as evolutionary and behavioural finance, I cannot argue that the efficient market hypothesis is without merit. It has merit for three reasons. Firstly, markets are damn difficult to forecast. Secondly, the average active manager does not beat the market consistently. And thirdly, the efficient market hypothesis has enabled exciting financial innovations such as derivatives securities. However, having merit does not imply that the hypothesis is true; it most probably isn't true.
The strongest criticism of the efficient market hypothesis originates from behavioural finance which challenges the underlying assumption that investors are rational, or rather, that they are rational enough to accurately price securities. Given all the available information about a security it's unlikely that market agents would agree on a price. Agents do not have homogeneous expectations in reality; as such, there is no intrinsic value in reality.
My favourite criticism of the efficient market hypothesis comes from evolutionary finance. According to evolutionary finance markets are complex systems consisting of heterogeneous agents which interact with one another and are subject to evolutionary forces. Through these interactions and as a result of evolutionary forces statistical properties emerge. Emergent properties of markets include randomness, momentum, value, and reversion but none of these properties are persistent or truer than the others. Their magnitudes change as the population of market agents adapt, evolve, and succumb. Such properties are, therefore, not anomalies (as they are called under the efficient markets hypothesis), they are expected emergent properties of markets.
In response to such criticism two variations of this hypothesis were proposed. The semistrong form of the efficient market hypothesis relaxes the definition by stating that only publicly available information is discounted into a securities price and the weakform of the efficient market hypothesis states that, at the very least, all historical price and volume data is discounted into a securities price. The key takeaway is that regardless of the definition you choose, in the absence sideinformation, the evolution of asset prices follows a Markov process; a random walk.
This theory, dubbed the random walk hypothesis, lies at the heart of almost all modern day asset pricing models which essentially assume that prices evolve according to a Markov process or random walk. Whether or not this is true has obvious and profound implications for you, me, and (almost) everybody else on this planet. Which might explain why academics and practitioners have spent decades studying the relative efficiency and / or randomness of markets. Our last article introduced one such test: Lo and MacKinlay's heteroskedastic variance ratio test. In this article we will consider, amongst others, the popular LempelZiv compression test.
References for this discussion
If you enjoy the philosophy of markets as much as I do, here are a few of my favourite publications:
Random Walks
 Bachelier, Louis. "Louis Bachelier's theory of speculation: the origins of modern finance". Princeton University Press, 2011.
 Fama, Eugene F. "Random walks in stock market prices." Financial analysts journal, 1995.
 Malkiel, Burton G., and Eugene F. Fama. "Efficient capital markets: A review of theory and empirical work." The journal of Finance, 1970.
 Fama, Eugene F. "Market efficiency, longterm returns, and behavioral finance." Journal of financial economics, 1998.
Nonrandom Walks
 Simon, Herbert A. "A behavioral model of rational choice." The quarterly journal of economics, 1955.
 Barberis, Nicholas, and Richard Thaler. "A survey of behavioral finance." Handbook of the Economics of Finance 1, 2003.
 Lo, Andrew W. "The adaptive markets hypothesis: Market efficiency from an evolutionary perspective." Journal of Portfolio Management, 2004.
 LeBaron, Blake, W. Brian Arthur, and Richard Palmer. "Time series properties of an artificial stock market." Journal of Economic Dynamics, 1999.
Algorithmic Information Theory and Randomness
The study of randomness is not an activity unique to quantitative analysis and economics, it is central to many fields including cryptography, quantum mechanics, and genetics. Therefore, it is quite unsurprising that we have borrowed and repurposed ideas from other fields to argue in favour of, or against, the efficiency or randomness of markets. Nowhere is this more true than the field of algorithmic information theory, an amalgamation of computability theory and information theory (a.k.a a computer scientists dream). As Gregory Chaitin once stated,
"It [algorithmic information theory] is the result of putting Shannon's information theory and Turing's computability theory into a cocktail shaker and shaking vigorously."
As mentioned previously one of the ideas borrowed from algorithmic information theory by economists is the idea that random sequences are incompressible. Let's take some time to explain the thought process behind this ...
In the 1930's Alan Turing introduced an abstract model for computation called the automatic machine or, as it has come to be known today, the Turing Machine. I'm not going to explain the mechanism behind the Turing Machine (you can find that at Cambridge's website or Wolfram Mathworld) because it is not relevant to our discussion of randomness. All you need to know is this: "despite the model's simplicity, given any computer algorithm, a Turing machine can be constructed that is capable of simulating that algorithm's logic"  Wikipedia.
.
What is an algorithm? In it's most simple form an algorithm is a deterministic sequence of steps for transforming a set of inputs into a set of outputs. Consider the the Fibonacci sequence as an example. The sequence of numbers starts with 0, 1 and proceeds as follows: 1, 2, 3, 5, 8, 13, .... The algorithm for reproducing this sequence can be stated very simply: the next term in the sequence is just the sum of the previous two terms. Additionally it is extremely easy for a Turing Machine to emulate this logic; the resulting program  found here  is quite simple.
What is complexity? According to another great mathematician, Andrey Kolmorogov, the complexity of a sequence can be measured by the length of the shortest computer program that produces the sequence as output. If that program is shorter than the original sequence then we have compressed that sequence; we have encoded it.
What happens when we give our Turing Machine a random sequence, S? If the data is random that means that there is no logic that was used to produce it. If there is no logic there is nothing for the Turing Machine to simulate. Therefore, the only program that will produce the random sequence as an output, is the random sequence: print S.
The program, print S, is longer than the random sequence, S. Therefore a random sequence necessarily has a Kolmorogov complexity greater than the original the sequence itself. This can be interpreted as saying: a random sequence is incompressible ... but here's the [halting] problem: how do we know for a fact that the program we have found which outputs the sequence is the shortest one possible? We would have had to have to tried every single possible program, of which there are an infinity, to make such a conclusion. It's an impossible task.
Put simply the Kolmorogov complexity of a sequence is incomputable and, as such, randomness is incomputable. We cannot prove that a sequence is random but, that doesn't mean that we can't test the probability that a sequence might be random. This is where all of our statistical tests for randomness come into play ... some of which use compression algorithms to test whether or not a statistically significant amount of compression is achievable.
Some of those who we have to thank for algorithmic information theory:
Compression and Market Randomness; Challenges
There have been quite a few academics who have attempted to use compression algorithms to test the efficiency of markets. At face value this may seem like a well formulated idea, but the idea faces some unique challenges.
Challenge 1  Infinite vs. Finite Sequences
The first challenge (which isn't really a problem) is that our information theory definition of randomness applies as sequences approach infinite length. Any sufficiently finite sequence is likely to be slightly compressible on the basis of probability alone. As such when using compression to test relatively short sequences of market returns we are not interested in whether the algorithm was able to compress the returns, we are interested in whether the degree of compression is statistically significantly more than what is expected on a random sequence.
Challenge 2  Market Drift
The second problem is that there are different types of randomness. The first type, often called the Martingale property or true, algorithmic, or Kolmorogov randomness, is the one most often associated with information theory and computer science (especially cryptography). The second type of randomness, called predictability or the Markov property, is most often associated with finance, stochastic processes, and the random walk hypothesis.
What this means is that we need to be extremely careful  admittedly, more careful than I have been in the past  when selecting our statistical tests for market randomness as well as when choosing our inputs. This is because a statistical test designed to test for true randomness will have a very different confidence interval than a statistical test designed to look for breakdowns in the Markov property. A simple explanation of this is given below.
Suppose we have two models. The first is a Brownian Motion model and the second is a Geometric Brownian Motion model which a nonzero drift. In this example the first model satisfies both the Martingale and Markov properties but the second model only satisfies the Markov property. As such if we applied tests designed to test for true randomness the output from the first model it should pass but output from the second model should fail.
This is a major problem for individuals wishing to refute the random walk hypothesis using compression because the hypothesis only really concerns itself with the excess predictability of the market above drift i.e. ones ability to "beat the market". Which is why academics such as Andrew Lo and Jonathan MacKinlay spent decades designing more powerful tests of market randomness such as the Heteroskedastic Variance Ratio Test.
Challenge 3  Stochastic Volatility & Jumps
The third problem is that the nonstationarity of volatility can cause certain returns to appear more or less frequently than they theoretically should. This problem is commonly called "fat tails" and manifests itself because of either stochastic volatility or discontinuities (jumps) in market returns. However, if we assume that the noise present in market returns is centred around zero, then we can show that regardless of jumps and stochastic volatility by binarizing the market returns we remove this problem alltogether  see the next section.
So what do these challenges mean for researchers wishing to refute the random walk hypothesis using compressionbased tests of randomness? It just means that we need to think about our methodology carefully.
Proposed Compression Test for Market Randomness
Academics have applied lossless compression algorithms to market returns in order to test or rank their efficiency for many years now. At the end of this section you will find links to relevant papers which take this approach.
Most of the papers concluded that developed markets exhibit high levels of efficiency whereas some developing markets exhibit excess compressibility and, therefore, some degree of inefficiency. Such inefficiencies can only be explained by specific return patterns having higherthanexpected probabilities through time.
Our testing methodology
The following testing methodology is followed for three wellknown compression algorithms: {gzip, bzip2, and xz}, on a set of developed and developing market indices, and foreign exchange rates.
 Download the market index prices from Quandl.com.
 Compute the logarithmic returns of the index and denote them as .
 Compute the mean logarithmic return (drift component), .
 Detrend by subtracting (drift) from .
 Binarize by replacing >0 returns with 1's and <0 returns with 0's.
 Split into overlapping discrete windows, , of 7 years each (+450 data points).
 Convert each window, , into hexadecimal (4day subsequences), .
 Compress each , , and compute the compression ratio,
 Return the average compression ratio over all windows, .
 Compute the expected compression ratio using pseudorandom data, .
 If : market exhibits no excess compressibility which implies efficiency.
 If : market exhibits some excess compressibility which implies inefficiency.
The test proposed above is similar to those used in the literature  we use binarized returns and multiple lossless compression algorithms  but also dissimilar in some important ways  we detrend market returns to remove the influence of drift and we compute the average compressibility using a walkforward methodology.
Our test in R
Very little R code was required to construct this test 🙂 due to the memCompress and memDecompress functions which exist in R. Similar Python functions, encode and decode, exist for readers who favour Python.
Some relevant academic papers:
 Shmilovici, Armin, Yael AlonBrimer, and Shmuel Hauser. "Using a stochastic complexity measure to check the efficient market hypothesis." Computational Economics 22.23 (2003): 273284.
 Giglio, Ricardo, Raul Matsushita, and Sergio Da Silva. "The relative efficiency of stockmarkets." Economics Bulletin 7.6 (2008): 112.
 Risso, Wiston Adrián. "The informational efficiency: the emerging markets versus the developed markets." Applied Economics Letters 16.5 (2009): 485487.
 Shmilovici, Armin, et al. "Measuring the efficiency of the intraday forex market with a universal data compression algorithm." Computational Economics 33.2 (2009): 131154.
 Giglio, Ricardo, et al. "Algorithmic complexity theory and the relative efficiency of financial markets." EPL (Europhysics Letters) 84.4 (2008): 48005.
Overview of Lossless Compression Algorithms in R
A detailed discussion of each of the lossless compression algorithms available to R's memCompress and memDecompress functions is beyond the scope of this article. Below you will find a very brief summary of the three compression schemes used as well as references to their underlying compression algorithms.
gzip compression
gzip is based on the DEFLATE algorithm, which combines LZ1 and Huffman coding. The LZ1 algorithm, named after Abraham Lempel and Jacob Ziv, works by replacing repeated patterns in the sequence with references to a copy of that pattern from earlier in the uncompressed sequence. Each match is encoded using a lengthdistance pair which tells the decoder that: the next y characters is equal to the characters x distance characters behind it. The Huffman Coding algorithm, named after David Huffman, works by constructing an optimal prefix tree given the occurrences of patterns in the sequence. For a great visualization of this checkout this YouTube video.
bzip2 compression
bzip2 uses three mechanisms in its compression: the BurrowsWheeler Transform, the MovetoFront Transform, and then Huffman Coding. The BurrowsWheeler Transformation is a reversible method for rearranging the sequence into runs of similar characters so that it can be more easily compressed by RunLengthEncoding schemes or MovetoFront encoding. Movetofront transformations work as follows: letters in the sequence are replaced with their indices in a stack containing the most recently observed letters. As such, long sequences of identical letters are replaced by smaller integers whereas infrequent letters are replaced with larger integers. Lastly the transformed data is fed through the Huffman Coding algorithm as with the DEFLATE algorithm.
xz compression
xz implements the (very cool sounding) Lempel–Ziv–Markov chain algorithm (LZMA) which was developed in the 7z compression format. LZMA is a dictionary compression algorithm derived from the LZ1 algorithm. It's output is encoded using a model which makes a probability prediction of each bit, called an adaptive binary range coder.
All of these algorithms are available natively in R via the memCompress and memDecompress functions. The biggest challenge in using these functions (at least for our purposes) was that the input needs to be given as a raw hexadecimal sequence. Converting daily, binarized returns into hexadecimal sequences is done by the function bin2rawhex. Essentially what this does is break our daily returns sequence into discrete, nonoverlapping, fourday, subsequences which can then be converted directly into hexadecimal. It's not ideal, but it seems to work.
Results obtained and their meaning(lessness)?
For this article we ran our compression test on 51 global stock market indices from developed and developing markets and 12 liquid currency pairs. These sequences were tested on both a weekly and daily frequency.
The exact breakdown of tested assets is given below:
 Dow Jones Industrial (USA): YAHOO/INDEX_DJI
 S&P 500 Index (USA): YAHOO/INDEX_GSPC
 S&P 400 Midcap Index (USA): YAHOO/INDEX_MID
 S&P 600 Smallcap Index (USA): YAHOO/INDEX_SML
 NASDAQ Composite (USA): YAHOO/NDAQ
 NYSE Composite (USA): YAHOO/INDEX_NYA
 Russel 1000 Index (USA): YAHOO/INDEX_RUI
 Russel 2000 Index (USA): YAHOO/INDEX_RUT
 Russel 3000 Index (USA): YAHOO/INDEX_RUA
 NYSE AMEX Oil Index (USA): YAHOO/INDEX_XOI
 NYSE AMEX Composite Index (USA): YAHOO/INDEX_XAX
 Merval Index (Argentina): YAHOO/INDEX_MERV
 Bovespa Index (Brazil): YAHOO/INDEX_BVSP
 Inmex Index (Mexico): YAHOO/INDEX_MXX
 NYSE ARCA Mexico Index (Mexico): YAHOO/INDEX_MXY
 S&P TSX Index (Canada): YAHOO/INDEX_GSPTSE
 ISEQ General Total Return Index (Ireland): YAHOO/INDEX_IEQR_IR
 Athens Composite Index (Greece): YAHOO/INDEX_GD_AT
 CAC40 Index (France): YAHOO/INDEX_FCHI
 DAX Index (Germany): YAHOO/INDEX_GDAXI
 OMX 20 Copenhagen (Denmark): YAHOO/INDEX_OMXC20_CO
 SATRIX Top 40 (South Africa): GOOG/JSE_STX40
 SATRIX Industrials (South Africa): GOOG/JSE_STXIND
 SATRIX Financials (South Africa): GOOG/JSE_STXFIN
 SATRIX Resources (South Africa): GOOG/JSE_STXRES
 SATRIX SWIX (South Africa): GOOG/JSE_STXSWX
 EGX 70 Price Index (Egypt) : YAHOO/INDEX_CCSI
 Russia RTS Index (Russia): YAHOO/INDEX_RTS_RS
 Swiss Market Index (Switzerland): YAHOO/INDEX_SSMI
 Tel Aviv Top 100 Index (Israel): YAHOO/INDEX_TA100
 Eurostoxx 50 (Europe): YAHOO/INDEX_STOXX50E
 IBEX 35 (Spain): YAHOO/INDEX_IBEX
 AEX Index (Netherlands): YAHOO/INDEX_AEX
 BEL20 Index (Belgium): YAHOO/INDEX_BFX
 ATX Index (Austria): YAHOO/INDEX_ATX
 OMX30 Index (Sweden): YAHOO/INDEX_OMX
 Nikkei 225 (Japan): YAHOO/INDEX_N225
 Japan AMEX Index (Japan): YAHOO/INDEX_JPN
 Shanghai Composite Index (China): YAHOO/INDEX_SSEC
 CSI300 Index (China): YAHOO/SS_000300
 Hang Send Index (Hong Kong): YAHOO/INDEX_HSI
 KOSPI Composite Index (South Korea): YAHOO/INDEX_KS11
 BSE 30 Sensitivity Index (SENSEX) (India): YAHOO/INDEX_BSESN
 Nifty Fifty Index (India): YAHOO/INDEX_NSEI
 Colombo All Share Index (Sri Lanka): YAHOO/INDEX_CSE
 New Zealand Stock Exchange (New Zealand): YAHOO/INDEX_NZ50
 Taiwan Weighted Index (Taiwan): YAHOO/INDEX_TWII
 All Ordinaries Index (Australia): YAHOO/INDEX_AORD
 S&P/ASX 200 Index (Australia): YAHOO/INDEX_AXJO
 Jakarta Composite Index (Indonesia): YAHOO/INDEX_JKSE
 Strait Times Index (Singapore): YAHOO/INDEX_STI
 US Dollar to British Pound: CURRFX/USDGBP
 US Dollar to Euro: CURRFX/USDEUR
 British Pound to Euro: CURRFX/GBPEUR
 US Dollar to South African Rand: CURRFX/USDZAR
 US Dollar to Brazilian Real: CURRFX/USDBRL
 US Dollar to Russian Ruble: CURRFX/USDRUB
 Euro to South African Rand: CURRFX/EURZAR
 Euro to Brazilian Real: CURRFX/EURBRL
 Euro to Russian Ruble: CURRFX/EURRUB
 British Pound to South African Rand: CURRFX/GBPZAR
 British Pound to Brazilian Real: CURRFX/GBPBRL
 British Pound to Russian Ruble: CURRFX/GBPRUB
All of these assets were compressed using gzip, bzip2, and xz lossless compression algorithms within the walkforward averaging methodology explained earlier and, for once, the results are really easy to communicate ...
With the exception of the US Dollar to Russian Ruble foreign exchange rate, none of the assets demonstrated excess compressibility. In fact, none of the assets demonstrated any compressibility whatsoever.
Our results with gzip, bzip2, and xz  sophisticated compression algorithms  confirm our negative result obtained in Hacking The Random Walk Hypothesis with the linear complexity test. But they contradict our results obtained with NIST tests and our results obtained with Lo and MacKinlay's Heteroskedastic Variance Ratio Test. Why?
Hypothesis 1: Markets are efficient.
The first interpretation of this result is that markets, across the world, are relatively efficient; they appear quite random. This is the conclusion most of the academics in the papers above arrived at.
.
Hypothesis 2: Compression isn't appropriate.
The second interpretation of this result is that there is something fundamentally "wrong" with using lossless compression algorithms to test the relative efficiency or randomness of financial markets.
Since I was unsatisfied with the results I decided to dig a bit deeper and develop a test for the compression test ... and what I found was really, really interesting ;).
Testing the compression test
People who refute the random walk hypothesis, like me, are not arguing that markets are 100% deterministic; we are arguing that they are not 100% stochastic. In other words, we are arguing that markets have two faces: the signal and the noise. The signal manifests itself in the many hundreds of socalled anomalies: value, momentum, reversion, indexation, etc. and the noise manifests itself as apparent randomness and efficiency. They are two sides of the same coin; both emergent properties of a complex adaptive system of heterogeneous, imperfect agents. We then go one step further and claim that the amount of noise produced by the system does not drown out the signal and that it is, therefore, possible to beat the market by identifying such signals and exploiting them.
With that in mind I set out to design a simple test of our compression test:
Consider the following toy market: where
As we increase the size of the standard deviation of the noise component, , we see that the compressibility of the returns begins to degrade. When the standard deviation passes 0.01 the compression test almostalways fails:
Example outputs from our toy market for , , and look as follows:
According to our compression test all of these toy markets are random or efficient because they could not be compressed using the lossless compression algorithms gzip, bzip2, or xz. I must admit they look quite random too! But remember, they consist of both a signal and a noise so they are definitely not truly random.
Now, for added good measure, consider the following meanreversion betting strategy / algorithm:
 Inputs: window size, .
 For time, , in time:
 Compute the cumulative return of the past days excluding day , .
 If < 0.0: go long 100% the asset (bet the asset goes up)
 Else: go short 100% the asset (bet the asset goes down)
Below you will find two charts for each of the noise levels. The first chart will show thirty equity curves generated by our simple mean reversion betting strategy and the second chart will show the average toy market return (from above) against the average mean reversion betting strategy return on these markets.
Now obviously this is a silly / extreme example but it highlights a contradiction:
If a market (even our toy market) is incompressible it "follows" that it is efficient but if it were truly efficient then there could not exist any betting strategy which consistently beats it, such as the above one.
To resolve this we must accept that because we cannot prove randomness, the sequence is actually not random and our compression test has failed. Why? I think that the test is failing because it is more sensitive to noise than our trading strategy  my simulations indicate that in the case of our toy market the compression test in up to 10x more sensitive to noise than our trading strategy  this may be due to using greedy or lossless compression algorithms.
The code for running this test of the compression test is shown below.
Conclusions and further research efforts
In this article we introduced the concept of using compression as a test for randomness and explained how such tests have been applied to the market to conclude that they are relatively efficient. We then developed a "meta test" for market randomness tests based on greedy lossless compression algorithms which indicates that such tests are highly sensitive to any noise. This may explain why they are contradicted by other randomness tests.
From this article we might want to conclude the following:
 After detrending markets exhibit enough noise to make them incompressible using greedy, lossless compression algorithms such as gzip, bzip2, xz, and a linear feedback shift register.
 If the sensitivity of the test to noise is greater than our trading strategies robustness in the presence of noise, the negative result does not imply that markets are efficient in the sense that they cannot be beaten.
 Our toy market shows that the sensitivity of the greedy lossless compression algorithms used to test market efficiency can be as much as 10x more sensitive to noise than a simple betting strategy.
 As such, some academic literature in support of the efficient market hypothesis presented in this article may inadvertently be applying a toostrict interpretation of nonrandomness to market returns.
 Statistical tests for market randomness vary in terms of their size, power, and sensitivity. Multiple tests should be used in favour of single tests for more statistically significant and robust results.
As I mentioned in the introduction, further research efforts stemming from this article with include a study of lossy compression algorithms most notably, a stacked autoencoder neural network. We will follow much the same methodology as we did here: look for excess compressibility beyond what would be expected from some finite pseudorandom sequence and then test how sensitive the lossy compression tests are to noise.
If you have any criticisms, counterarguments, compliments, or suggestions for our upcoming article on lossy compression algorithms I would love to hear them in the comment section below :). Thanks for reading!
Appendix A: All the Source Code
All of the source code required to run this analysis is available in R on GitHub here.

First off, I tip my hat to the brilliant awesomeness of the work on this blog. Seriously amazing.
I'm no where near as educated in mathematics as you are, but I have an idea. If I understood correctly, your measure originates essentially from an arbitrary zero baseline. The measure is random from this point, but it is likely not random when measured from other points such as its daily velocity or moving averages or such. There may also be a timing component. Another possibility is that the data is not random but correlated to an external variable.
A larger, stratified sample might also result in more insight. I think you are onto something here. Many years ago I worked on some simulation logic for market analysis and discovered that what most people think is correlation is simply different ways of measuring the exact same thing from different baselines, either temporal or some exchange rate for labor, resources, etc. The trick is to find the independent correlations. Sadly, the person leading my project just didn't get that and built a marvelously complex program that told him what he already knew and he went broke quickly.
Caveat: I don't play in the market so I may be unknowingly stating the obvious.

Since lossless compression hasn't proven useful, what about lossy compression? Compress market data using something akin to jpeg or mpeg compression, and compare it to the compression of true noise (or random walk) signals.

Hmmm. I just had my morning coffee(s) so this will be a stream of consciousness...
On the right track I think with a denoising auto encoder. As your dealing with time series, it also be worth kicking the tires on a generative recurrent neural network (GRU or a mutated LSTM perhaps). which attempts to output the next sequence.
Also, what about smoothing the data with a Kalman filter and then attempting the lossless compression? For that matter what about just trying to predict the returns with a particle filter?
Great read. This provided me some novel thoughts to apply to my own general analytic work.
Looking forward to the autoencoder..

Just a quick follow up on the last comment, I intended the "your abilities" statement as a complement, in the sense that you would be able to lay out the math and code behind even the vanilla RNN, which is also quite involved (and I myself have never done).
Just wanted to make sure it wasn't misconstrued as I probably phrased it poorly (again coffee).
Slainte

Hi, have you seen this interview? I find it very enjoyable and you would probably too. :>
Comments