Thursday, March 21, 2019

Fascinating Developments in the Theory of Randomness

I present here some innovative results from my most recent research on stochastic processes, chaos modeling, and dynamical systems, with applications to Fintech, cryptography, number theory, and random number generators. While covering advanced topics, this article is accessible to professionals with limited knowledge of statistical or mathematical theory. It introduces new material not covered in my recent book (available here) on applied stochastic processes. You don't need to read my book to understand this article, but the book is a nice complement and introduction to the concepts discussed here.
None of the material presented here is covered in standard textbooks on stochastic processes or dynamical systems. In particular, it has nothing to do with the classical logistic map or Brownian motions, though the systems investigated here exhibit very similar behaviors and are related to the classical models. This cross-disciplinary article is targeted to professionals with interests in statistics, probability, mathematics, machine learning, simulations, signal processing, operations research, computer science, pattern recognition, and physics. Because of its tutorial style, it should also appeal to beginners learning about Markov processes, time series, and data science techniques in general, offering fresh, off-the-beaten-path content not found anywhere else, contrasting with the material covered again and again in countless, identical books, websites, and classes catering to students and researchers alike. 
Some problems discussed here could be used by college professors in the classroom, or as original exam questions, while others are extremely challenging questions that could be the subject of a PhD thesis or even well beyond that level. This article constitutes (along with my book) a stepping stone in my endeavor to solve one of the biggest mysteries in the universe: are the digits of mathematical constants such as Pi evenly distributed? To this day, no one knows if these digits even have a distribution to start with, let alone whether that distribution is uniform or not. Part of the discussion is about statistical properties of numeration systems in a non-integer base (such as the golden ratio base) and their applications. All systems investigated here, whether deterministic or not, are treated as stochastic processes, including the digits in question. They all exhibit strong chaos, albeit easily manageable due to their ergodicity.
Interesting connections with the golden ratio, special polynomials, and other special mathematical constants are discussed in section 2. Finally, all the analyses performed during this work were done in Excel. I share my spreadsheets in this article, as well as many illustrations, and all the results are replicable.
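To make the idea of a numeration system in a non-integer base more concrete, here is a minimal Python sketch of the standard greedy expansion of a number x in [0, 1) in a base b > 1, with b set to the golden ratio. This is a textbook construction and not necessarily the exact process studied in the article; the names base_b_digits and digits_to_value are hypothetical, and because it relies on floating-point arithmetic, only the first few dozen digits should be trusted.

```python
import math

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2  # about 1.618


def base_b_digits(x, b, n_digits):
    """Greedy digits of x in base b (assumes 0 <= x < 1 and b > 1).

    At each step: multiply by b, take the integer part as the next digit,
    and keep the fractional part for the following step.
    """
    if not 0 <= x < 1:
        raise ValueError("x must be in [0, 1)")
    digits = []
    for _ in range(n_digits):
        x *= b
        d = math.floor(x)
        digits.append(d)
        x -= d
    return digits


def digits_to_value(digits, b):
    """Rebuild the (truncated) value d1/b + d2/b^2 + ... from its digits."""
    return sum(d * b ** -(k + 1) for k, d in enumerate(digits))


# 1/2 has a periodic expansion in the golden ratio base.
print(base_b_digits(0.5, GOLDEN_RATIO, 12))
# → [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
```

In the golden ratio base, the greedy digits are 0 or 1 and a 1 is always followed by a 0, which is one reason the digit sequence is naturally studied as a process with dependent terms rather than as independent draws.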
Content of this article
1. General framework, notations and terminology
  • Finding the equilibrium distribution
  • Auto-correlation and spectral analysis
  • Ergodicity, convergence, and attractors
  • Space state, time state, and Markov chain approximations
  • Examples
2. Case study
  • First fundamental theorem
  • Second fundamental theorem
  • Convergence to equilibrium: illustration
3. Applications
  • Potential application domains
  • Example: the golden ratio process
  • Finding other useful b-processes
4. Additional research topics
  • Perfect stochastic processes
  • Characterization of equilibrium distributions (the attractors)
  • Probabilistic calculus and number theory, special integrals
5. Appendix
  • Computing the auto-correlation at equilibrium
  • Proof of the first fundamental theorem
  • How to find the exact equilibrium distribution
6. Additional Resources

Wednesday, March 13, 2019

How to Automatically Determine the Number of Clusters in your Data - and more

Determining the number of clusters when performing unsupervised clustering is a tricky problem. Many data sets don't exhibit well separated clusters, and two human beings asked to visually tell the number of clusters by looking at a chart are likely to provide two different answers. Sometimes clusters overlap with each other, and large clusters contain sub-clusters, making the decision difficult.
For instance, how many clusters do you see in the picture below? What is the optimum number of clusters? No one can tell with certainty, not AI, not a human being, not an algorithm. 
How many clusters here? (source: see here)
In the above picture, the underlying data suggests that there are three main clusters. But an answer such as 6 or 7 seems equally valid.
A number of empirical approaches have been used to determine the number of clusters in a data set. They usually fit into two categories:
  • Model fitting techniques: an example is fitting a mixture model to your data and determining the optimum number of components, or using density estimation techniques and testing for the number of modes (see here). Sometimes, the fit is compared with that of a model where observations are uniformly distributed on the entire support domain, thus with no cluster; you may have to estimate the support domain in question, and assume that it is not made of disjoint sub-domains; in many cases, the convex hull of your data set, as an estimate of the support domain, is good enough.
  • Visual techniques: for instance, the silhouette or elbow rule (very popular).
In both cases, you need a criterion to determine the optimum number of clusters. In the case of the elbow rule, one typically uses the percentage of unexplained variance. This number is 100% with zero clusters, and it decreases (initially sharply, then more modestly) as you increase the number of clusters in your model. When each point constitutes a cluster, this number drops to 0. Somewhere in between, the curve that displays your criterion exhibits an elbow (see picture below), and that elbow determines the number of clusters. For instance, in the chart below, the optimum number of clusters is 4.
The elbow rule tells you that here, your data set has 4 clusters (elbow strength in red)
Good references on the topic are available. Some R functions are available too, for instance fviz_nbclust. However, I could not find in the literature how the elbow point is explicitly computed. Most references mention that it is mostly hand-picked by visual inspection, or based on some predetermined but arbitrary threshold. In the next section, we solve this problem.
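As a rough illustration of the criterion described above, the following Python sketch computes the percentage of unexplained variance for an increasing number of clusters, using a plain k-means. It only produces the curve on which one would look for an elbow; it does not compute the elbow point itself. The function names, the deterministic farthest-first seeding, and the synthetic three-blob data set are all assumptions made for this example.

```python
import numpy as np


def farthest_first_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the
    first point, then repeatedly add the point farthest from all chosen ones."""
    chosen = [0]
    min_dist = ((points - points[0]) ** 2).sum(axis=1)
    while len(chosen) < k:
        nxt = int(min_dist.argmax())
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, ((points - points[nxt]) ** 2).sum(axis=1))
    return points[chosen].astype(float)


def assign(points, centers):
    """Index of the nearest center for every point."""
    dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)


def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations; returns (centers, labels)."""
    centers = farthest_first_init(points, k)
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(points, centers)


def unexplained_variance_pct(points, k):
    """100 * (within-cluster sum of squares) / (total sum of squares)."""
    points = np.asarray(points, dtype=float)
    if k == 0:
        return 100.0  # no cluster: nothing is explained
    total = ((points - points.mean(axis=0)) ** 2).sum()
    centers, labels = kmeans(points, k)
    within = ((points - centers[labels]) ** 2).sum()
    return 100.0 * within / total


if __name__ == "__main__":
    # Synthetic data: three Gaussian blobs (made up for the illustration).
    rng = np.random.default_rng(42)
    blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    data = np.vstack([c + rng.normal(size=(50, 2)) for c in blob_centers])
    for k in range(8):
        print(k, round(unexplained_variance_pct(data, k), 2))
```

Plotting the printed values against k gives the kind of curve discussed above: 100% with zero clusters, 0% when every point is its own cluster, and a bend somewhere in between. Since k-means only finds a local optimum, the curve is not guaranteed to decrease strictly at every step.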

Thursday, March 7, 2019

Deep Analytical Thinking and Data Science Wizardry

Many times, complex models are not enough (or too heavy), or not necessary, to get great, robust, sustainable insights out of data. Deep analytical thinking may prove more useful, and can be done by people not necessarily trained in data science, even by people with limited coding experience. Here we explore what we mean by deep analytical thinking, using a case study, and how it works: combining craftsmanship, business acumen, the use and creation of tricks and rules of thumb, to provide sound answers to business problems. These skills are usually acquired by experience more than by training, and data science generalists (see here how to become one) usually possess them.
This article is targeted to data science managers and decision makers, as well as to junior professionals who want to become one at some point in their career. Deep thinking, unlike deep learning, is also more difficult to automate, so it provides better job security. Those automating deep learning are actually the new data science wizards, who can think outside the box. Much of what is described in this article is also data science wizardry, and is taught neither in standard textbooks nor in the classroom. By reading this tutorial, you will learn and be able to use these data science secrets, and possibly change your perspective on data science. Data science is like an iceberg: everyone knows and can see the tip of the iceberg (regression models, neural nets, cross-validation, clustering, Python, and so on, as presented in textbooks). Here I focus on the unseen bottom, using a statistical level almost accessible to the layman, avoiding jargon and complicated math formulas, yet discussing a few advanced concepts.
1. Case Study: The Problem
2. Deep Analytical Thinking
  • Answering hidden questions
  • Business questions
  • Data questions
  • Metrics questions
3. Data Science Wizardry
  • Generic algorithm
  • Illustration with three different models
  • Results
4. A few data science hacks

Confidence Intervals Without Pain

We propose a simple model-free solution to compute any confidence interval and to extrapolate these intervals beyond the observations avai...