## Saturday, February 23, 2019

### New Perspectives on Statistical Distributions and Mixture Models - with Broad Spectrum of Applications

In this data science article, emphasis is placed on science, not just on data. State-of-the-art material is presented in simple English, from multiple perspectives: applications, theoretical research asking more questions than it answers, scientific computing, machine learning, and algorithms. I attempt here to lay the foundations of a new statistical technology, hoping that it will plant the seeds for further research on a topic with a broad range of potential applications. Mixtures have been studied and used in applications for a long time, including by me when working on my Ph.D. 25 years ago, and they are still a subject of active research. Yet you will find here plenty of new material.
Introduction and Context
In a previous article (see here) I attempted to approximate a random variable representing real data by a weighted sum of simple kernels, such as independent, identically distributed uniform random variables. The purpose was to build Taylor-like series approximations to more complex models (each term in the series being a random variable), to
• avoid over-fitting,
• approximate any empirical distribution (the inverse of the percentiles function) attached to real data,
• easily compute data-driven confidence intervals regardless of the underlying distribution,
• derive simple tests of hypothesis,
• perform model reduction,
• optimize data binning to facilitate feature selection, and to improve visualizations of histograms,
• create perfect histograms,
• build simple density estimators,
• perform interpolations, extrapolations, or predictive analytics,
• perform clustering and detect the number of clusters.
While I found very interesting properties of stable distributions during this research project, I could not come up with a solution to all these problems. The fact is that these weighted sums would usually converge (in distribution) to a normal distribution if the weights did not decay too fast, a consequence of the central limit theorem. And even when using uniform kernels (as opposed to Gaussian ones) with fast-decaying weights, the sum would converge to an almost symmetrical, Gaussian-like distribution. In short, very few real-life data sets could be approximated by this type of model.
I also tried independently but NOT identically distributed kernels and, again, failed to make any progress. By "not identically distributed kernels", I mean basic random variables from the same family, say with a uniform or Gaussian distribution, but with parameters (mean and variance) that are different for each term in the weighted sum. The reason is that sums of independent Gaussians, even with different parameters, are still Gaussian, and sums of uniforms end up being Gaussian too unless the weights decay fast enough. Details about why this is happening are provided in the last section.
Now, in this article, starting in the next section, I offer a full solution, using mixtures rather than sums. The possibilities are endless.
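The contrast between sums and mixtures can be illustrated with a small simulation. The sketch below is not taken from the article's source code: the weights, the number of terms, and the mixture parameters are arbitrary choices made for illustration. It compares the skewness and excess kurtosis (both zero for a Gaussian) of a weighted sum of uniform kernels with those of a two-component Gaussian mixture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000   # number of simulated deviates (illustrative choice)
k = 30        # number of terms in the weighted sum (illustrative choice)

def skew_kurt(x):
    """Return the sample skewness and excess kurtosis of x."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**3)), float(np.mean(z**4) - 3.0)

# Weighted sum: every deviate adds up ALL k uniform kernels,
# with slowly decaying weights 1/sqrt(i).
weights = 1.0 / np.sqrt(np.arange(1, k + 1))
terms = rng.uniform(-1.0, 1.0, size=(n, k))
weighted_sum = terms @ weights

# Mixture: every deviate is drawn from exactly ONE kernel,
# picked at random with probabilities 0.8 and 0.2.
from_first = rng.random(n) < 0.8
mixture = np.where(from_first,
                   rng.normal(0.0, 1.0, n),
                   rng.normal(5.0, 1.0, n))

sum_skew, sum_kurt = skew_kurt(weighted_sum)
mix_skew, mix_kurt = skew_kurt(mixture)
print(f"weighted sum: skewness={sum_skew:.3f}, excess kurtosis={sum_kurt:.3f}")
print(f"mixture:      skewness={mix_skew:.3f}, excess kurtosis={mix_kurt:.3f}")
```

Under these assumptions, the weighted sum stays close to Gaussian (both statistics near zero), while the mixture is strongly skewed, a shape that no amount of re-weighting the sum would produce.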
1. Introduction and Context
2. Approximations Using Mixture Models
• The error term
• Kernels and model parameters
• Algorithms to find the optimum parameters
• Convergence and uniqueness of solution
• Find near-optimum with fast, black-box step-wise algorithm
3. Example
• Data and source code
• Results
4. Applications
• Optimal binning
• Predictive analytics
• Test of hypothesis and confidence intervals
• Clustering
5. Interesting problems
• Gaussian mixtures uniquely characterize a broad class of distributions
• Weighted sums fail to achieve what mixture models do
• Stable mixtures
• Correlations

## Wednesday, February 13, 2019

### A Plethora of Original, Non-standard Statistical Tests

Many of the following statistical tests are rarely discussed in textbooks or in college classes, much less in data camps. Yet they help answer a lot of different and interesting questions. I used most of them without even computing the underlying distribution under the null hypothesis, instead using simulations to check whether my assumptions were plausible or not. In short, my approach to statistical testing is model-free and data-driven. Some tests are easy to implement even in Excel. Some of them are illustrated here, with examples that do not require statistical knowledge for understanding or implementation.
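As a rough illustration of what a simulation-based test can look like, here is a minimal sketch. It is not one of the tests from the article: the test statistic (lag-1 autocorrelation), the null model (independent uniform deviates), and the helper names `lag1_autocorr`, `simulated_p_value`, and `null_iid_uniform` are all assumptions chosen for the example. The distribution of the statistic under the null hypothesis is never derived; it is approximated by simulation.

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 sample autocorrelation, used here as the test statistic."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float(np.sum(d[:-1] * d[1:]) / np.sum(d * d))

def null_iid_uniform(rng, n):
    """Simulate one data set of size n under the (assumed) null hypothesis."""
    return rng.uniform(0.0, 1.0, n)

def simulated_p_value(data, statistic, simulate_null, n_sim=2000, seed=0):
    """Two-sided Monte Carlo p-value: share of simulated statistics
    at least as extreme (in absolute value) as the observed one."""
    rng = np.random.default_rng(seed)
    observed = statistic(data)
    sims = np.array([statistic(simulate_null(rng, len(data)))
                     for _ in range(n_sim)])
    # The add-one correction keeps the p-value strictly positive.
    return (1 + np.sum(np.abs(sims) >= abs(observed))) / (n_sim + 1)

data_rng = np.random.default_rng(42)
independent = data_rng.uniform(0.0, 1.0, 500)
# Dependent series: averaging neighbors creates lag-1 correlation near 0.5.
noise = data_rng.uniform(0.0, 1.0, 501)
dependent = 0.5 * (noise[:-1] + noise[1:])

p_independent = simulated_p_value(independent, lag1_autocorr, null_iid_uniform)
p_dependent = simulated_p_value(dependent, lag1_autocorr, null_iid_uniform)
print(f"p-value, independent series: {p_independent:.4f}")
print(f"p-value, dependent series:   {p_dependent:.4f}")
```

The same pattern works with any statistic and any null model that can be simulated, which is what makes the approach usable without formal statistical training.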
This material should appeal to managers, executives, industrial engineers, software engineers, operations research professionals, economists, and to anyone dealing with data, such as biometricians, analytical chemists, astronomers, epidemiologists, journalists, or physicians. Statisticians with a different perspective are invited to discuss my methodology and the tests described here, in the comment section at the bottom of this article. In my case, I used these tests mostly in the context of experimental mathematics, which is a branch of data science that few people talk about. In that context, the theoretical answer to a statistical test is sometimes known, making it a great benchmarking tool to assess the power of these tests, and determine the minimum sample size to make them valid.
I provide here a general overview, as well as my simple approach to statistical testing, accessible to professionals with little or no formal statistical training. Detailed applications of these tests are found in my recent book and in this article. Precise references to these documents are provided as needed throughout this article.