Wednesday, October 4, 2017

Interesting Problem for Serious Geeks: Self-correcting Random Walks

This is another off-the-beaten-path problem, one that you won't find in textbooks. You can solve it using data science methods (my approach) but the mathematician with some spare time could find an elegant solution. Share it with your colleagues to see how math-savvy they are, or with your students. I was able to make substantial progress in 1-2 hours of work using Excel alone, thought I haven't found a final solution yet (maybe you will.) My Excel spreadsheet with all computations is accessible from this article. You don't need a deep statistical background to quickly discover some fun and interesting results playing with this stuff. Computer scientists, software engineers, quants, BI and analytic professionals from beginners to veterans, will also be able to enjoy it!
The problem
We are dealing with a stochastic process barely more complicated than a random walk. Random walks are also called drunken walks, as they represent the path of a drunken guy moving left and right seemingly randomly, and getting lost over time. Here the process is a self-correcting random walk, also called controlled random walk, in the sense that the walker, less drunk than in a random walk, is able to correct any departure from a straight path, more and more over time, by either slightly over- or under-correcting at each step. One of the model parameter (the positive parameter a) represents how drunk the walker is, with a = 0 being the worst. Unless a = 0, the amplitude of the corrections decreases over time to the point that eventually (after many steps) the walker walks almost straight and arrives at his destination. This model represents many physical processes, for instance the behavior of a stock market somewhat controlled by a government to avoid bubbles and implosions, and it is defined as follows:
Let's start with X(1) = 0, and define X(k) recursively as follows, for k > 1:
and let's define U(k), Z(k), and Z as follows:
where the V(k)'s are deviates from independent uniform variables on [0, 1], obtained for instance using the function RAND in Excel. So there are two positive parameters in this problem, a and b, and U(k) is always between 0 and 1. When b = 1, the U(k)'s are just standard uniform deviates, and if b = 0, then U(k) = 1. The case a = b = 0 is degenerate and should be ignored. The case a > 0 and b = 0 is of special interest, and it is a number theory problem in itself. Also, just like in random walks or Markov chains, the X(k)'s are not independent; they are indeed highly auto-correlated.
Figure 1: Mixture-like distribution of Z (estimated) when b = 0.01 and a = 0.8
Prove that if a  > 0, then  X(k) rapidly converges to 0 as k increases. Also prove that the limiting distribution Z
  • always exists,
  • always takes values between -1 and +1, with min(Z) = -1 and max(Z) = +1,
  • is symmetric, with mean and median equal to 0
  • and does not depend on a, but only on b (if b not too close to 0)
For instance, for b =1, even a = 0 yields the same triangular distribution for Z, as any a  > 0.
If a  > 0 and b = 0, (the non-stochastic case) prove that 
  • Z can only take 3 values: -1 with probability 0.25, +1 with probability 0.25, and 0 with probability 0.50
  • If U(k) and U(k+1) have the same sign,then U(k+2) is of opposite sign  
And here is a more challenging question: In general, what is the limiting distribution of Z? Also, what happens if you replace the U(k)'s with (say) Gaussian deviates? Or with U(k) = | sin (k*k) | which has a somewhat random behavior?
Click here to read more, including partial solution.

Monday, October 2, 2017

9 Off-the-beaten-path Statistical Science Topics with Interesting Applications

You will find here nine interesting topics that you won't learn in college classes. Most have interesting applications in business and elsewhere. They are not especially difficult, and I explain them in simple English. Yet they are not part of the traditional statistical curriculum, and even many data scientists with a PhD degree have not heard about some of these concepts.
The topics discussed in this article include:
  • Random walks in one, two and three dimensions - With Video
  • Estimation of the convex hull of a set of points - Application to clustering and oil industry
  • Constrained linear regression on unusual domains - Application to food industry
  • Robust and scale-invariant variances
  • Distribution of arrival times of extreme events - Application to flood predictions
  • The Tweedie distributions - Numerous applications
  • The arithmetic-geometric mean - Fast computations of decimals of Pi
  • Weighted version of the K-NN clustering algorithm
  • Multivariate exponential distribution and storm modeling

Sunday, September 24, 2017

Interesting Application of the Greedy Algorithm for Egyptian Fractions

This article was originally posted here. Check the original article for any update and additions. 

Here we discuss a new system to represent numbers, for instance constants such as Pi, e, or log 2, using rational fractions. Each iteration doubles the precision (the number of correct decimals computed) making it converging much faster than current systems such as continued fractions, to represent any positive real number. The algorithm discussed here is equivalent to the greedy algorithm to compute Egyptian fractions, except that we use it here mostly for irrational numbers.
For any real number x, the representation is as follows:
where the integer numbers p(k) are computed iteratively with the following two-steps iteration, starting with k = 1::
where the brackets represent the floor function which always returns an integer. You start with a seed p(1) = 1 (though you can work with other seed values) and loop over k = 1, 2, and so on. As k tends to infinity, x(k) tends to x. The speed of convergence is remarkable, yielding 14 decimals after five or six iterations, and doubling the precision at each iteration, because p(k+1) is of the same order of magnitude as the square of p(k). It works well if x is between 1 and 2, 
Here are a few interesting examples: 
  • For x = p - 2, we get p(1) = 1, p(2) = 8, p(3) =  61, p(4) = 5020, p(5) = 128541459, yielding x(5) = 1.14159265358979, with 13 correct decimals. 
  • For x = 1 + log 2, we get p(1) = 1, p(2) = 2, p(3) =  6, p(4) = 38, p(5) = 6071, p(6) = 144715222, yielding x(6) = 1.69314718055995, with 14 correct decimals.
  • For x = 2, we get p(1) = 1, p(2) = 2, p(3) =  3, p(4) = 7, p(5) = 43, p(6) = 1807, yielding x(6) = 1.99999969357507, with 6 correct decimals.
The sequence p(1), p(2), and so on uniquely characterizes the number x. Note that for k as small as 6 or 7, p(k) becomes so large that you need to use special arithmetic (long numbers) to perform the computations in any programming language, as you have reached the limit in terms of computer precision. 
By contrast, continued fractions converge much more slowly, click here to see the difference. The 6-th convergent for p (with continued fractions) is 104348 / 33215 = 3.141592654. Compare this with our approximation x(6) for p-2, which has a far higher precision.  
General framework and proof of convergence
This methodology generalizes as follows: Start with a value p(1) with x > 1 / p(1), and iteratively compute p(k+1) as the smallest number from an infinite set E, such that x(k+1)   <  x, with   
In our particular case, E is the set of all positive integers, and p(1) = 1. With p(1) = 1, convergence is fastest when x gets very close (above) 1. If x = 1 and p(1) = 2, then p(k+1) = 1 + p(k) * (p(k) - 1), explaining the incredibly fast convergence: This case is identical to the case x = 2 and p(1) = 1, discussed in the above section.  The numbers p(k) produced in this case, namely 2, 3, 7, 43, 1807 and so on, correspond to Sylvester's sequence
Note that E could be any set of strictly positive numbers (for instance, prime numbers, or numbers that are not even integers), as long of the sum of the inverse of the elements of E, is infinite. 
Final comments
Unlike with other number representations, all numbers seem to be created equal here, whether they are a fraction of two integers, irrational or transcendental. Note that irrational numbers will always satisfy the fact that x(k) is different from x, guaranteeing convergence. With other number representations (continued fractions or base 10 representation), some classes of numbers result in periodicity.
The coefficients p(k) could be used to build random number generators. Finally, a similar methodology can be developed to represent numbers as infinite products, click here for details
For a simple recursion yielding all the binary digits of the square root of 2, one at a time, click here. For another interesting challenge, read the section "Potential Areas of Research" in my article How to detect if numbers are random or not. For other articles featuring difficult mathematical problems, click here. For a statistical problem with several potential applications (clustering, data reduction) click here and read the last section. More challenges can be found here. . 

Tuesday, August 15, 2017

Nice Generalization of the K-NN Clustering Algorithm -- Also Useful for Data Reduction

I describe here an interesting and intuitive clustering algorithm (that can be used for data reduction as well) offering several advantages, over traditional classifiers:
  • More robust against outliers and erroneous data
  • Executing much faster
  • Generalizing well known algorithms
You don't need to know K-NN to understand this article -- but click here if you want to learn more about it. You don't need a background in statistical science either. 
Let's describe this new algorithm and its various components, in simple English 
General Framework
We are dealing here with a supervised learning problem, and more specifically, clustering (also called supervised classification.). In particular, we want to assign a class label to a new observation that does not belong to the training set. Instead of checking out individual points (the nearest neighbors) and using a majority (voting) rule to assign the new observation to a cluster based on nearest neighbor counts, we are checking out cliques of points, and focus on the nearest cliques rather than on the nearest points.
Cliques computed based on the Smallest-Circle-Problem
Cliques and Clique Density
The cliques considered here are defined by circles (in two dimensions) or spheres (in three dimensions.) In the most basic version, we have one clique for each cluster, and the clique is defined as the smallest circle containing a pre-specified proportion p of the points from the cluster in question. If the clusters are well separated, we can even use p = 1. We define the density of a clique as the number of points per unit area. In general, we want to build cliques with high density.
Ideally, we want each cluster in the training set to be covered by a small number of (possibly slightly overlapping) cliques, each one having a high density.  Also, as a general rule, a training set point can only belong to one clique, and (ideally) to only one cluster. But the circles associated with two cliques are allowed to overlap.
Additional sections:
  • Classification Rule, Computational Complexity, Memory Requirements
  • Cliques Building and Smallest-Circle-Problem 
  • Gravitational Field Generated by Clusters
  • Building an Efficient Clique System
  • Non-Circular Cliques
  • Source Code
  • Potential Enhancements, Data Reduction, and Conclusions
To read the full article, click here.  To access data science articles from the same author, click here
DSC Resources
Popular Articles

Monday, July 10, 2017

How to Detect if Numbers are Random or Not

In this article, you will learn some modern techniques to detect whether a sequence appears as random or not, whether it satisfies the central limit theorem (CLT) or not -- and what the limiting distribution is if CLT does not apply -- as well as some tricks to detect abnormalities.  It leads to the exploration of time series with massive, large-scale (long term) auto-correlation structure, as well as model-free, data-driven statistical testing. No statistical knowledge is required: we will discuss deep results that can be expressed in simple English. Most of the testing involved here uses big data (more than a billion computations) and data science, to the point that we reached the accuracy limits of our machines.  So there is even a tiny piece of numerical analysis in this article.
Potential applications include testing randomness, Monte Carlo simulations for statistical testing, encryption, blurring, and steganography (encoding secret messages into images) using pseudo-random numbers. A number of open questions are discussed here, offering the professional post-graduate statistician new research topics both in theoretical statistics and advanced number theory. The level here is state-of-the-art, but we avoid jargon and some technicalities to allow newbies and non-statisticians to understand and enjoy most of the content.  An Excel spreadsheet, attached to this document, summarizes our computations and will help you further understand the methodology used here.
Interestingly, I started to research this topic by trying to apply the notorious central limit theorem (CLT) to non-random (static) variables -- that is, to fixed sequences of numbers that look chaotic enough to simulate randomness. Ironically, it turned out to be far more complicated than using CLT for regular random variables. So I start here by describing what the initial CLT problem was, before moving into other directions such as testing randomness, and the distribution of the largest gap in seemingly random sequences.  As we will see, these problems are connected. 

Monday, June 26, 2017

Data Science and Machine Learning Without Mathematics

There is a set of techniques covering all aspects of machine learning (the statistical engine behind data science) that does not use any mathematics or statistical theory beyond high school level. So when you hear that some serious mathematical knowledge is required to become a data scientist, this should be taken with a grain of salt.
The reason maths is a thought to be a requirement is because of the following reasons:
  • Standard tools such as logistic regression, decision trees or confidence intervals, are math-heavy
  • Most employers use standard tools
  • As a result, hiring managers are looking for candidates with a strong math background, mostly for historical reasons
  • Academic training for data scientists are math-heavy for historical reasons (using the professors that used to teach stat classes)
Because of this, you need to really be math savvy to get a "standard" job, so sticking to standard math-heavy training and standard tools work for people interested in becoming a data scientist. To make things more complicated, most of the courses advertised as "math-free" or "learn data science in three days" are selling you snake oil (it won't help you get a job, and many times the training material is laughable.) You can learn data science very quickly, even on your own if you are a self-learner with a strong background working with data and programming (maybe you have a physics background) but that is another story.
Yet there is a set of techniques, designed by a data scientist with a strong mathematical background and long list of publications in top statistical journals that does not use mathematics nor statistical modeling. These techniques work just as well and some of them have been proved to be equivalent to their math-heavy cousins, with the additional bonus of generally being more robust. They are easy to understand and lead to easy interpretations, yet it is not snake oil: it is actually based on years of experience processing large volumes of diverse data, mostly in automated mode.
If you create your own startup, develop your own data science consultancy, or work for an organization that does not care about the tools that you use -- as long as they are cheap, easy to implement, and reliable -- you might consider using these simple, scalable, math-free methods. For instance, if you develop algorithms for stock trading, you wouldn't want to use the same tools as your competitors. These math-free techniques can give you a competitive advantage.
Below, I describe several math-free techniques covering a good chunk of data science, and how they differ from their traditional math-heavy cousins. I use them pretty much every day, though most of the time, in some automated ways.

Sunday, June 25, 2017

Advanced Machine Learning with Basic Excel

In this article, I present a few modern techniques that have been used in various business contexts, comparing performance with traditional methods. The advanced techniques in question are math-free, innovative, efficiently process large amounts of unstructured data, and are robust and scalable. Implementations in Python, R, Julia and Perl are provided, but here we focus on an Excel version that does not even require any Excel macros, coding, plug-ins, or anything other than the most basic version of Excel. It is actually easily implemented in standard, basic SQL too, and we invite readers to work on an SQL version.

Who should use the spreadsheet?

First, the spreadsheet (as well as the Python, R, Perl or Julia version) are free to use and modify in any context, even commercial, and even to make a product out of it and sell it. It is part of my concept of open patent, in which I share all my intellectual property publicly and for free. 

The spreadsheet is designed as a tutorial, thought it processes the same data set as the one used for the Python version. It is aimed at people that are not professional coders, people who manage data scientists, BI experts, MBA professionals, and people from other fields, with an interest in understanding the mechanics of some state-of-the-art machine learning techniques, without having to spend months or years learning mathematics, programming, and computer science. A few hours is needed to understand the details. This spreadsheet can be the first step to help you transition to a new, more analytical career path, or to better understand the data scientists that you manage or interact with. Or to spark a career in data science. Or even to teach machine learning concepts to high school students.

The spreadsheet also features a traditional technique (linear regression) for comparison purposes.

Click here to read this article, download the spreadsheet, and start using it.

Interesting Problem for Serious Geeks: Self-correcting Random Walks

This is another off-the-beaten-path problem, one that you won't find in textbooks. You can solve it using data science methods (my appr...