Sunday, November 19, 2017

High Precision Computing: Benchmark, Examples, and Tutorial

In some applications, using the standard precision in your programming language of choice, may not be enough, and can lead to disastrous errors. In some cases, you work with a library that is supposed to provide very high precision, when in fact the library in question does not work as advertised. In some cases, lack of precision results in obvious problems that are easy to spot, and in some cases, everything seems to be working fine and you are not aware that your simulations are completely wrong after as little as 30 iterations. We explore this case in this article, using a simple example that can be used to test the precision of your tool and of your results. 
Such problems arise frequently with algorithms that do not converge to a fixed solution, but instead generate numbers that oscillate continuously in some interval, converging in distribution rather than in value, unlike traditional algorithms that aim to optimize some function. The examples abound in chaotic theory, and the simplest case is the recursion X(k + 1) = 4 X(k) (1- X(k)), starting with a seed s = X(0) in [0, 1]. We will use this example - known as the logistic map - to benchmark various computing systems.  
Read full article for explanations about this picture
Examples of algorithms that can be severely impacted by aggregated loss of precision, besides ill-conditioned problems, include:
  • Markov Chain Monte Carlo (MCMC) simulations, a modern statistical method of estimation for complex problems and nested or hierarchical models, including Bayesian networks. 
  • Reflective stochastic processes, see here. This includes some some types or Brownian or Wiener processes.
  • Chaotic processes, see here (especially section 2.) These include fractals. 
  • Continuous random number generators, see here.
The conclusions based on the faulty sequences generated are not necessarily invalid, as long as the focus is on the distribution being studied, rather than on the exact values from specific sequences.

Tuesday, November 7, 2017

Fascinating Chaotic Sequences with Cool Applications

Here we describe well-known chaotic sequences, including new generalizations, with application to random number generation, highly non-linear auto-regressive models for times series, simulation, random permutations, and the use of big numbers (libraries available in programming languages to work with numbers with hundreds of decimals) as standard computer precision almost always produces completely erroneous results after a few iterations  -- a fact rarely if ever mentioned in the scientific literature, but illustrated here, together with a solution. It is possible that all scientists who published on chaotic processes, used faulty numbers because of this issue.
This article is accessible to non-experts, even though we solve a special stochastic equation for the first time, providing an unexpected exact solution, for a new chaotic process that generalizes the logistic map. We also describe a general framework for continuous random number generators, and investigate the interesting auto-correlation structure associated with some of these sequences. References are provided, as well as fast source code to process big numbers accurately, and even an elegant mathematical proof in the last section.
This article is also a useful read for participants in our upcoming competition (to be announced soon) as it addresses a similar stochastic integral equation problem, also with exact solution, in the related context of self-correcting random walks - another kind of memory-less process. 
The approach used here starts with traditional data science and simulations for exploratory analysis, with empirical results confirmed later by mathematical arguments in the last section. 

Monday, October 30, 2017

Logistic Map, Chaos, Randomness and Quantum Algorithms

The logistic map is the most basic recurrence formula exhibiting various levels of chaos depending on its parameter. It has been used in population demographics to model chaotic behavior. Here we explore this model in the context of randomness simulation, and revisit a bizarre non-periodic random number generator discovered 70 years ago, based on the logistic map equation. We then discuss flaws and strengths in widely used random number generators, as well as how to reverse-engineer such algorithms. Finally, we discuss quantum algorithms, as they are appropriate in our context.
Highlights
  • Java, Perl and Excel random number generators compared
  • Historical considerations
  • Backdoor planted by the NSA in some of these systems
  • Original material on complex random number generators
  • Image encryption
  • Periodicity detection, disctinctness quantum algorithm (big data)  
  • Practical solutions
  • Need for new programming language for quantum computing
  • Cool animated gif
  • Post-quantum cryptography
  • Generators based on irrational numbers
The article is not too long, as most of the technical details are provided in the numerous references. It covers many topics ranging from computer science, algorithms, big data, to probability theory and mathematics. The level is simple enough to be read by non-experts, yet of great value for the experts as well. Click here to read this new article.

Wednesday, October 25, 2017

Graph Theory: Six Degrees of Separation Problem

This famous statement -- the six degrees of separation -- claims that there is at most 6 degrees of separation between you and anyone else on Earth. Here we feature a simple algorithm that simulates how we are connected, and indeed confirms the claim. We also explain how it applies to web crawlers: Any web page is connected to any other web page by a path of 6 links at most.
The algorithm below is rudimentary and can be used for simulation purposes by any programmer: It does not even use tree or graph structures.  Applied to a population of 2,000,000 people, each having 20 friends, we show that there is a path involving 6 levels or intermediaries between you and anyone else. Note that the shortest path typically involves fewer levels, as some people have far more than 20 connections. 
Starting with you, at level one, you have twenty friends or connections. These connections in turn have 20 friends, so at level two, you are connected to 400 people. At level three, you are connected to 7,985 people, which is a little less than 20 x 400, since some level-3 connections were already level-2 or level-1. And so on. 
To read the full article with simulator, application to web crawling, some maths, and source code, click here
DSC Resources
Popular Articles

Tuesday, October 17, 2017

How Mathematical Discoveries are Made

In one of my previous articles, you can learn the process about how discoveries are made by research scientists, from exploratory analysis, testing, simulations, data science guesswork, all the way to the discovery of a new theory and state-of-the-art statistical modeling,including new, fundamental mathematical/statistical equations.
This is a unique occasion to discover the creative path followed by inventors. Traditional research papers do not explain how a new paradigm was found, instead they focus on showing that the author's methodology is sound, original, and leads to useful results. My article fills the gap, and even though much of the material is accessible to non-experts, it also features deep results accessible to graduate students and professional PhD statisticians. If you work on a PhD thesis or in research (Academia or research laboratories from government and companies such as Microsoft, Google, Facebook, Intel or IBM) this article provides insights on how the brain works to come up with a discovery in analytical fields.  Each researcher obviously has her own approach to creativity, so I do not claim that this is the only path to innovation. Here I only share my own thought process.
I invite you to read or re-read the article (with major updates added recently), especially section 3. It is posted here, and as in all research articles, it features a number of open questions and challenges, that you might want to solve. Interestingly, it started as a little mathematical problem.

Wednesday, October 4, 2017

Interesting Problem for Serious Geeks: Self-correcting Random Walks

Section 3 was added on October 11. Section 4 was added on October 19.  An award is offered to solve any of the open questions, click here for details

This is another off-the-beaten-path problem, one that you won't find in textbooks. You can solve it using data science methods (my approach) but the mathematician with some spare time could find an elegant solution. Share it with your colleagues to see how math-savvy they are, or with your students. I was able to make substantial progress in 1-2 hours of work using Excel alone, thought I haven't found a final solution yet (maybe you will.) My Excel spreadsheet with all computations is accessible from this article. You don't need a deep statistical background to quickly discover some fun and interesting results playing with this stuff. Computer scientists, software engineers, quants, BI and analytic professionals from beginners to veterans, will also be able to enjoy it!
The problem
We are dealing with a stochastic process barely more complicated than a random walk. Random walks are also called drunken walks, as they represent the path of a drunken guy moving left and right seemingly randomly, and getting lost over time. Here the process is a self-correcting random walk, also called controlled random walk, in the sense that the walker, less drunk than in a random walk, is able to correct any departure from a straight path, more and more over time, by either slightly over- or under-correcting at each step. One of the model parameter (the positive parameter a) represents how drunk the walker is, with a = 0 being the worst. Unless a = 0, the amplitude of the corrections decreases over time to the point that eventually (after many steps) the walker walks almost straight and arrives at his destination. This model represents many physical processes, for instance the behavior of a stock market somewhat controlled by a government to avoid bubbles and implosions, and it is defined as follows:
Let's start with X(1) = 0, and define X(k) recursively as follows, for k > 1:
and let's define U(k), Z(k), and Z as follows:
where the V(k)'s are deviates from independent uniform variables on [0, 1], obtained for instance using the function RAND in Excel. So there are two positive parameters in this problem, a and b, and U(k) is always between 0 and 1. When b = 1, the U(k)'s are just standard uniform deviates, and if b = 0, then U(k) = 1. The case a = b = 0 is degenerate and should be ignored. The case a > 0 and b = 0 is of special interest, and it is a number theory problem in itself. Also, just like in random walks or Markov chains, the X(k)'s are not independent; they are indeed highly auto-correlated.
Figure 1: Mixture-like distribution of Z (estimated) when b = 0.01 and a = 0.8
Prove that if a  > 0, then  X(k) rapidly converges to 0 as k increases. Also prove that the limiting distribution Z
  • always exists,
  • always takes values between -1 and +1, with min(Z) = -1 and max(Z) = +1,
  • is symmetric, with mean and median equal to 0
  • and does not depend on a, but only on b (if b not too close to 0)
For instance, for b =1, even a = 0 yields the same triangular distribution for Z, as any a  > 0.
If a  > 0 and b = 0, (the non-stochastic case) prove that 
  • Z can only take 3 values: -1 with probability 0.25, +1 with probability 0.25, and 0 with probability 0.50
  • If U(k) and U(k+1) have the same sign,then U(k+2) is of opposite sign  
And here is a more challenging question: In general, what is the limiting distribution of Z? Also, what happens if you replace the U(k)'s with (say) Gaussian deviates? Or with U(k) = | sin (k*k) | which has a somewhat random behavior?
Click here to read more, including partial solution.

Monday, October 2, 2017

9 Off-the-beaten-path Statistical Science Topics with Interesting Applications

You will find here nine interesting topics that you won't learn in college classes. Most have interesting applications in business and elsewhere. They are not especially difficult, and I explain them in simple English. Yet they are not part of the traditional statistical curriculum, and even many data scientists with a PhD degree have not heard about some of these concepts.
The topics discussed in this article include:
  • Random walks in one, two and three dimensions - With Video
  • Estimation of the convex hull of a set of points - Application to clustering and oil industry
  • Constrained linear regression on unusual domains - Application to food industry
  • Robust and scale-invariant variances
  • Distribution of arrival times of extreme events - Application to flood predictions
  • The Tweedie distributions - Numerous applications
  • The arithmetic-geometric mean - Fast computations of decimals of Pi
  • Weighted version of the K-NN clustering algorithm
  • Multivariate exponential distribution and storm modeling

High Precision Computing: Benchmark, Examples, and Tutorial

In some applications, using the standard precision in your programming language of choice, may not be enough, and can lead to disastrous er...