Saturday, February 23, 2019

New Perspectives on Statistical Distributions and Mixture Models - with Broad Spectrum of Applications

In this data science article, emphasis is placed on science, not just on data. State-of-the art material is presented in simple English, from multiple perspectives: applications, theoretical research asking more questions than it answers, scientific computing, machine learning, and algorithms. I attempt here to lay the foundations of a new statistical technology, hoping that it will plant the seeds for further research on a topic with a broad range of potential applications. Mixtures have been studied and used in applications for a long time, including by myself when working on my Ph.D. 25 years ago, and it is still a subject of active research. Yet you will find here plenty of new material.
Introduction and Context
In a previous article (see here) I attempted to approximate a random variable representing real data, by a weighted sum of simple kernels such as uniformly and independently, identically distributed random variables. The purpose was to build Taylor-like series approximations to more complex models (each term in the series being a random variable), to
  • avoid over-fitting,
  • approximate any empirical distribution (the inverse of the percentiles function) attached to real data,
  • easily compute data-driven confidence intervals regardless of the underlying distribution,
  • derive simple tests of hypothesis,
  • perform model reduction, 
  • optimize data binning to facilitate feature selection, and to improve visualizations of histograms
  • create perfect histograms,
  • build simple density estimators,
  • perform interpolations, extrapolations, or predictive analytics
  • perform clustering and detect the number of clusters.
Why I've found very interesting properties about stable distributions during this research project, I could not come up with a solution to solve all these problems. The fact is that these weighed sums would usually converge (in distribution) to a normal distribution if the weights did not decay too fast -- a consequence of the central limit theorem. And even if using uniform kernels (as opposed to Gaussian ones) with fast-decaying weights, it would converge to an almost symmetrical, Gaussian-like distribution. In short, very few real-life data sets could be approximated by this type of model.
I also tried with independently but NOT identically distributed kernels, and again, failed to make any progress. By "not identically distributed kernels", I mean basic random variables from a same family, say with a uniform or Gaussian distribution, but with parameters (mean and variance) that are different for each term in the weighted sum. The reason being that sums of Gaussian's, even with different parameters, are still Gaussian, and sums of Uniform's end up being Gaussian too unless the weights decay fast enough. Details about why this is happening are provided in the last section. 
Now, in this article, starting in the next section, I offer a full solution, using mixtures rather than sums. The possibilities are endless. 
Content of this article
1. Introduction and Context
2. Approximations Using Mixture Models
  • The error term
  • Kernels and model parameters
  • Algorithms to find the optimum parameters
  • Convergence and uniqueness of solution
  • Find near-optimum with fast, black-box step-wise algorithm
3. Example
  • Data and source code
  • Results
4. Applications
  • Optimal binning
  • Predictive analytics
  • Test of hypothesis and confidence intervals
  • Clustering
5. Interesting problems
  • Gaussian mixtures uniquely characterize a broad class of distributions
  • Weighted sums fail to achieve what mixture models do
  • Stable mixtures
  • Correlations
Read full article here

Wednesday, February 13, 2019

A Plethora of Original, Non-standard Statistical Tests

Many of the following statistical tests are rarely discussed in textbooks or in college classes, much less in data camps. Yet they help answer a lot of different and interesting questions. I used most of them without even computing the underlying distribution under the null hypothesis, but instead, using simulations to check whether my assumptions were plausible or not. In short, my approach to statistical testing is is model-free, data-driven. Some are easy to implement even in Excel. Some of them are illustrated here, with examples that do not require statistical knowledge for understanding or implementation.
This material should appeal to managers, executives, industrial engineers, software engineers, operations research professionals, economists, and to anyone dealing with data, such as biometricians, analytical chemists, astronomers, epidemiologists, journalists, or physicians. Statisticians with a different perspective are invited to discuss my methodology and the tests described here, in the comment section at the bottom of this article. In my case, I used these tests mostly in the context of experimental mathematics, which is a branch of data science that few people talk about. In that context, the theoretical answer to a statistical test is sometimes known, making it a great benchmarking tool to assess the power of these tests, and determine the minimum sample size to make them valid.
I provide here a general overview, as well as my simple approach to statistical testing, accessible to professionals with little or no formal statistical training. Detailed applications of these tests are found in my recent book and in this article. Precise references to these documents are provided as needed, in this article.
Examples of traditional tests
1. General Methodology
Despite my strong background in statistical science, over the years, I moved away from relying too much on traditional statistical tests and statistical inference. I am not the only one: these tests have been abused and misused, see for instance this article on p-hacking. Instead, I favored a methodology of my own, mostly empirical, based on simulations, data- rather than model-driven. It is essentially a non-parametric approach. It has the advantage of being far easier to use, implement, understand, and  interpret, especially to the non-initiated. It was initially designed to be integrated in black-box, automated decision systems. Here I share some of these tests, and many can be implemented easily in Excel. 

Six Degrees of Separation Between Any Two Data Sets

This is an interesting data science conjecture, inspired by the well known   six degrees of separation problem , stating that there is a li...