Tuesday, October 17, 2017

How Mathematical Discoveries are Made

In one of my previous articles, you can learn how discoveries are made by research scientists, from exploratory analysis, testing, simulations, and data science guesswork, all the way to the discovery of a new theory and state-of-the-art statistical modeling, including new, fundamental mathematical/statistical equations.
This is a unique occasion to discover the creative path followed by inventors. Traditional research papers do not explain how a new paradigm was found; instead, they focus on showing that the author's methodology is sound, original, and leads to useful results. My article fills that gap, and even though much of the material is accessible to non-experts, it also features deep results for graduate students and professional PhD statisticians. If you work on a PhD thesis or in research (in academia, or in laboratories run by governments or by companies such as Microsoft, Google, Facebook, Intel or IBM), this article provides insights into how the brain works to come up with a discovery in analytical fields. Each researcher obviously has her own approach to creativity, so I do not claim that this is the only path to innovation. Here I only share my own thought process.
I invite you to read or re-read the article (with major updates added recently), especially section 3. It is posted here, and as in all research articles, it features a number of open questions and challenges that you might want to solve. Interestingly, it started as a little mathematical problem.

Wednesday, October 4, 2017

Interesting Problem for Serious Geeks: Self-correcting Random Walks

This is another off-the-beaten-path problem, one that you won't find in textbooks. You can solve it using data science methods (my approach), but a mathematician with some spare time could find an elegant solution. Share it with your colleagues to see how math-savvy they are, or with your students. I was able to make substantial progress in 1-2 hours of work using Excel alone, though I haven't found a final solution yet (maybe you will.) My Excel spreadsheet with all computations is accessible from this article. You don't need a deep statistical background to quickly discover some fun and interesting results playing with this stuff. Computer scientists, software engineers, quants, BI and analytics professionals, from beginners to veterans, will also be able to enjoy it!
The problem
We are dealing with a stochastic process barely more complicated than a random walk. Random walks are also called drunken walks, as they represent the path of a drunken guy moving left and right seemingly at random, getting lost over time. Here the process is a self-correcting random walk, also called a controlled random walk: the walker, less drunk than in a standard random walk, is able to correct any departure from a straight path, more and more over time, by slightly over- or under-correcting at each step. One of the model parameters (the positive parameter a) represents how drunk the walker is, with a = 0 being the worst case. Unless a = 0, the amplitude of the corrections decreases over time, to the point that eventually (after many steps) the walker walks almost straight and arrives at his destination. This model represents many physical processes, for instance the behavior of a stock market somewhat controlled by a government to avoid bubbles and implosions, and it is defined as follows:
Let's start with X(1) = 0, and define X(k) recursively as follows, for k > 1:
and let's define U(k), Z(k), and Z as follows:
where the V(k)'s are independent uniform deviates on [0, 1], obtained for instance using the function RAND in Excel. So there are two non-negative parameters in this problem, a and b, and U(k) is always between 0 and 1. When b = 1, the U(k)'s are just standard uniform deviates, and if b = 0, then U(k) = 1. The case a = b = 0 is degenerate and should be ignored. The case a > 0 and b = 0 is of special interest, and it is a number theory problem in itself. Also, just like in random walks or Markov chains, the X(k)'s are not independent; they are indeed highly auto-correlated.
Figure 1: Mixture-like distribution of Z (estimated) when b = 0.01 and a = 0.8
Prove that if a > 0, then X(k) rapidly converges to 0 as k increases. Also prove that the limiting distribution Z
  • always exists,
  • always takes values between -1 and +1, with min(Z) = -1 and max(Z) = +1,
  • is symmetric, with mean and median equal to 0,
  • and does not depend on a, but only on b (if b is not too close to 0).
For instance, for b = 1, even a = 0 yields the same triangular distribution for Z as any a > 0.
If a > 0 and b = 0 (the non-stochastic case), prove that 
  • Z can only take 3 values: -1 with probability 0.25, +1 with probability 0.25, and 0 with probability 0.50
  • If X(k) and X(k+1) have the same sign, then X(k+2) is of the opposite sign  
And here is a more challenging question: In general, what is the limiting distribution of Z? Also, what happens if you replace the U(k)'s with (say) Gaussian deviates? Or with U(k) = | sin (k*k) | which has a somewhat random behavior?
Click here to read more, including a partial solution.

Monday, October 2, 2017

9 Off-the-beaten-path Statistical Science Topics with Interesting Applications

You will find here nine interesting topics that you won't learn in college classes. Most have interesting applications in business and elsewhere. They are not especially difficult, and I explain them in simple English. Yet they are not part of the traditional statistical curriculum, and even many data scientists with a PhD degree have not heard about some of these concepts.
The topics discussed in this article include:
  • Random walks in one, two and three dimensions - With Video
  • Estimation of the convex hull of a set of points - Application to clustering and oil industry
  • Constrained linear regression on unusual domains - Application to food industry
  • Robust and scale-invariant variances
  • Distribution of arrival times of extreme events - Application to flood predictions
  • The Tweedie distributions - Numerous applications
  • The arithmetic-geometric mean - Fast computations of decimals of Pi (see the sketch after this list)
  • Weighted version of the K-NN clustering algorithm
  • Multivariate exponential distribution and storm modeling
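As a quick illustration of the arithmetic-geometric mean item above: one classic AGM-based method, the Gauss-Legendre iteration, roughly doubles the number of correct decimals of Pi at each step. Below is a minimal Python sketch (the function name is mine; plain double precision caps it at about 15 correct decimals, and arbitrary-precision arithmetic is needed to go further):

```python
import math

def gauss_legendre_pi(n_iter=3):
    """Approximate Pi with the AGM-based Gauss-Legendre iteration.
    Each iteration roughly doubles the number of correct decimals."""
    a, b = 1.0, 1.0 / math.sqrt(2.0)   # arithmetic and geometric terms
    t, p = 0.25, 1.0
    for _ in range(n_iter):
        a_next = (a + b) / 2.0          # arithmetic mean
        b = math.sqrt(a * b)            # geometric mean
        t -= p * (a - a_next) ** 2
        a = a_next
        p *= 2.0
    return (a + b) ** 2 / (4.0 * t)

print(gauss_legendre_pi())   # about 3.14159265358979, already at the double-precision limit
```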

Sunday, September 24, 2017

Interesting Application of the Greedy Algorithm for Egyptian Fractions

This article was originally posted here. Check the original article for any updates and additions. 

Here we discuss a new system to represent numbers, for instance constants such as Pi, e, or log 2, using rational fractions. Each iteration doubles the precision (the number of correct decimals computed), making it converge much faster than existing systems such as continued fractions for representing any positive real number. The algorithm discussed here is equivalent to the greedy algorithm to compute Egyptian fractions, except that we use it here mostly for irrational numbers.
For any real number x, the representation is as follows:
x = 1/p(1) + 1/p(2) + 1/p(3) + ...
where the integer numbers p(k) are computed iteratively with the following two-step iteration, starting with k = 1:
x(k) = 1/p(1) + 1/p(2) + ... + 1/p(k),
p(k+1) = 1 + [ 1 / (x - x(k)) ],
where the brackets represent the floor function, which always returns an integer. You start with a seed p(1) = 1 (though you can work with other seed values) and loop over k = 1, 2, and so on. As k tends to infinity, x(k) tends to x. The speed of convergence is remarkable, yielding 14 decimals after five or six iterations, and doubling the precision at each iteration, because p(k+1) is of the same order of magnitude as the square of p(k). It works well if x is between 1 and 2.
Examples
Here are a few interesting examples: 
  • For x = Pi - 2, we get p(1) = 1, p(2) = 8, p(3) = 61, p(4) = 5020, p(5) = 128541459, yielding x(5) = 1.14159265358979, with 13 correct decimals. 
  • For x = 1 + log 2, we get p(1) = 1, p(2) = 2, p(3) =  6, p(4) = 38, p(5) = 6071, p(6) = 144715222, yielding x(6) = 1.69314718055995, with 14 correct decimals.
  • For x = 2, we get p(1) = 1, p(2) = 2, p(3) =  3, p(4) = 7, p(5) = 43, p(6) = 1807, yielding x(6) = 1.99999969357507, with 6 correct decimals.
The sequence p(1), p(2), and so on uniquely characterizes the number x. Note that for k as small as 6 or 7, p(k) becomes so large that you need to use special arithmetic (long numbers) to perform the computations in any programming language, as you have reached the limit in terms of computer precision. 
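Here is a minimal Python sketch of this greedy iteration (the function name and the use of exact rational arithmetic via the fractions module are my choices; exact fractions sidestep the precision issues mentioned above). It reproduces the first terms of the examples above:

```python
from fractions import Fraction
from math import floor

def greedy_expansion(x, n_terms=6, p1=1):
    """Greedy expansion x ~ 1/p(1) + 1/p(2) + ... + 1/p(n_terms).
    With the seed p1 = 1, x should lie between 1 and 2."""
    p = [p1]
    xk = Fraction(1, p1)                      # partial sum x(k)
    for _ in range(n_terms - 1):
        # smallest integer q such that x(k) + 1/q stays strictly below x
        q = floor(1 / (x - xk)) + 1
        p.append(q)
        xk += Fraction(1, q)
    return p, xk

print(greedy_expansion(Fraction(2))[0])        # [1, 2, 3, 7, 43, 1807]
print(greedy_expansion(Fraction("1.141592653589793238462643383279"), n_terms=4)[0])
# Pi - 2 (to 30 decimals) -> [1, 8, 61, 5020]
```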
By contrast, continued fractions converge much more slowly; click here to see the difference. The 6-th convergent for Pi (with continued fractions) is 104348 / 33215 = 3.141592654. Compare this with our approximation x(6) for Pi - 2, which has far higher precision.  
General framework and proof of convergence
This methodology generalizes as follows: Start with a value p(1) such that x > 1 / p(1), and iteratively compute p(k+1) as the smallest number from an infinite set E such that x(k+1) < x, with x(k+1) = x(k) + 1 / p(k+1).
In our particular case, E is the set of all positive integers, and p(1) = 1. With p(1) = 1, convergence is fastest when x is very close to (but above) 1. If x = 1 and p(1) = 2, then p(k+1) = 1 + p(k) * (p(k) - 1), explaining the incredibly fast convergence: this case is identical to the case x = 2 and p(1) = 1, discussed in the above section. The numbers p(k) produced in this case, namely 2, 3, 7, 43, 1807 and so on, correspond to Sylvester's sequence.
Note that E could be any set of strictly positive numbers (for instance, prime numbers, or numbers that are not even integers), as long as the sum of the inverses of the elements of E is infinite. 
Final comments
Unlike with other number representations, all numbers seem to be created equal here, whether they are ratios of two integers, irrational, or transcendental. Note that for irrational numbers, x(k) is always different from x, guaranteeing convergence. With other number representations (continued fractions or base-10 representation), some classes of numbers result in periodicity.
The coefficients p(k) could be used to build random number generators. Finally, a similar methodology can be developed to represent numbers as infinite products; click here for details.
For a simple recursion yielding all the binary digits of the square root of 2, one at a time, click here. For another interesting challenge, read the section "Potential Areas of Research" in my article How to detect if numbers are random or not. For other articles featuring difficult mathematical problems, click here. For a statistical problem with several potential applications (clustering, data reduction), click here and read the last section. More challenges can be found here.

Tuesday, August 15, 2017

Nice Generalization of the K-NN Clustering Algorithm -- Also Useful for Data Reduction

I describe here an interesting and intuitive clustering algorithm (that can be used for data reduction as well) offering several advantages over traditional classifiers:
  • More robust against outliers and erroneous data
  • Executing much faster
  • Generalizing well-known algorithms
You don't need to know K-NN to understand this article -- but click here if you want to learn more about it. You don't need a background in statistical science either. 
Let's describe this new algorithm and its various components in simple English.
General Framework
We are dealing here with a supervised learning problem, more specifically clustering (also called supervised classification). In particular, we want to assign a class label to a new observation that does not belong to the training set. Instead of checking out individual points (the nearest neighbors) and using a majority (voting) rule to assign the new observation to a cluster based on nearest-neighbor counts, we check out cliques of points, and focus on the nearest cliques rather than on the nearest points.
Cliques computed based on the Smallest-Circle-Problem
Cliques and Clique Density
The cliques considered here are defined by circles (in two dimensions) or spheres (in three dimensions.) In the most basic version, we have one clique for each cluster, and the clique is defined as the smallest circle containing a pre-specified proportion p of the points from the cluster in question. If the clusters are well separated, we can even use p = 1. We define the density of a clique as the number of points per unit area. In general, we want to build cliques with high density.
Ideally, we want each cluster in the training set to be covered by a small number of (possibly slightly overlapping) cliques, each one having a high density.  Also, as a general rule, a training set point can only belong to one clique, and (ideally) to only one cluster. But the circles associated with two cliques are allowed to overlap.
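To make the idea concrete, here is a small Python/NumPy sketch. It is a simplification, not the exact construction from the article: each clique is a circle centered at the class centroid with radius equal to a distance quantile (rather than the true smallest enclosing circle), and a new point is assigned to the class of the nearest clique, measured by signed distance to the circle. The synthetic data, function names, and parameter choices are illustrative only.

```python
import numpy as np

def build_cliques(X, y, p=0.95):
    """One clique per class: a circle covering a proportion p of that class's points.
    Simplification: centered at the class centroid, with radius = p-quantile of the
    distances to the centroid (instead of the smallest enclosing circle)."""
    cliques = {}
    for label in np.unique(y):
        pts = X[y == label]
        center = pts.mean(axis=0)
        radius = np.quantile(np.linalg.norm(pts - center, axis=1), p)
        cliques[label] = (center, radius)
    return cliques

def classify(x_new, cliques):
    """Assign x_new to the class of the nearest clique, using the signed distance
    to each circle (negative inside, positive outside)."""
    def signed_dist(item):
        center, radius = item[1]
        return np.linalg.norm(x_new - center) - radius
    return min(cliques.items(), key=signed_dist)[0]

# Tiny synthetic example with two well-separated clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
cliques = build_cliques(X, y, p=1.0)
print(classify(np.array([5.5, 6.2]), cliques))   # -> 1
```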
Additional sections:
  • Classification Rule, Computational Complexity, Memory Requirements
  • Cliques Building and Smallest-Circle-Problem 
  • Gravitational Field Generated by Clusters
  • Building an Efficient Clique System
  • Non-Circular Cliques
  • Source Code
  • Potential Enhancements, Data Reduction, and Conclusions
To read the full article, click here. To access data science articles from the same author, click here.

Monday, July 10, 2017

How to Detect if Numbers are Random or Not

In this article, you will learn some modern techniques to detect whether a sequence appears random or not, whether it satisfies the central limit theorem (CLT) or not -- and what the limiting distribution is if the CLT does not apply -- as well as some tricks to detect abnormalities. It leads to the exploration of time series with a massive, large-scale (long-term) auto-correlation structure, as well as model-free, data-driven statistical testing. No statistical knowledge is required: we will discuss deep results that can be expressed in simple English. Most of the testing involved here uses big data (more than a billion computations) and data science, to the point that we reached the accuracy limits of our machines. So there is even a tiny piece of numerical analysis in this article.
Potential applications include testing randomness, Monte Carlo simulations for statistical testing, encryption, blurring, and steganography (encoding secret messages into images) using pseudo-random numbers. A number of open questions are discussed here, offering the professional post-graduate statistician new research topics both in theoretical statistics and advanced number theory. The level here is state-of-the-art, but we avoid jargon and some technicalities to allow newbies and non-statisticians to understand and enjoy most of the content.  An Excel spreadsheet, attached to this document, summarizes our computations and will help you further understand the methodology used here.
Interestingly, I started to research this topic by trying to apply the well-known central limit theorem (CLT) to non-random (static) variables -- that is, to fixed sequences of numbers that look chaotic enough to simulate randomness. Ironically, it turned out to be far more complicated than using the CLT for regular random variables. So I start here by describing what the initial CLT problem was, before moving in other directions such as testing randomness and the distribution of the largest gap in seemingly random sequences. As we will see, these problems are connected. 
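As a small, self-contained illustration of the largest-gap statistic mentioned above, the Python sketch below simulates the distribution of the largest gap among n uniform deviates on [0, 1]; a purported random sequence whose largest gap falls far outside this distribution is suspect. The setup and parameter values are mine, not taken from the article.

```python
import numpy as np

def largest_gap(sample):
    """Largest gap between successive values of a sample on [0, 1],
    including the two boundary gaps."""
    s = np.sort(sample)
    return np.diff(np.concatenate(([0.0], s, [1.0]))).max()

rng = np.random.default_rng(42)
n, n_sim = 1000, 10_000
max_gaps = [largest_gap(rng.uniform(size=n)) for _ in range(n_sim)]
print(np.mean(max_gaps), np.quantile(max_gaps, [0.05, 0.95]))
# For truly uniform numbers, the largest gap is about (log n) / n on average (~0.0075 here).
```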

Monday, June 26, 2017

Data Science and Machine Learning Without Mathematics

There is a set of techniques covering all aspects of machine learning (the statistical engine behind data science) that does not use any mathematics or statistical theory beyond high school level. So when you hear that some serious mathematical knowledge is required to become a data scientist, this should be taken with a grain of salt.
Math is thought to be a requirement for the following reasons:
  • Standard tools such as logistic regression, decision trees or confidence intervals, are math-heavy
  • Most employers use standard tools
  • As a result, hiring managers are looking for candidates with a strong math background, mostly for historical reasons
  • Academic training for data scientists is math-heavy for historical reasons (using the professors who used to teach stat classes)
Because of this, you do need to be math-savvy to get a "standard" job, so sticking to standard math-heavy training and standard tools works for people interested in becoming a data scientist. To make things more complicated, most of the courses advertised as "math-free" or "learn data science in three days" are selling you snake oil (they won't help you get a job, and many times the training material is laughable.) You can learn data science very quickly, even on your own if you are a self-learner with a strong background working with data and programming (maybe you have a physics background), but that is another story.
Yet there is a set of techniques, designed by a data scientist with a strong mathematical background and a long list of publications in top statistical journals, that uses neither mathematics nor statistical modeling. These techniques work just as well, and some of them have been proved to be equivalent to their math-heavy cousins, with the additional bonus of generally being more robust. They are easy to understand and lead to easy interpretations, yet they are not snake oil: they are actually based on years of experience processing large volumes of diverse data, mostly in automated mode.
If you create your own startup, develop your own data science consultancy, or work for an organization that does not care about the tools that you use -- as long as they are cheap, easy to implement, and reliable -- you might consider using these simple, scalable, math-free methods. For instance, if you develop algorithms for stock trading, you wouldn't want to use the same tools as your competitors. These math-free techniques can give you a competitive advantage.
Below, I describe several math-free techniques covering a good chunk of data science, and how they differ from their traditional math-heavy cousins. I use them pretty much every day, though most of the time, in some automated ways.
