Thursday, March 21, 2019

I present here some innovative results in my most recent research on stochastic processes, chaos modeling, and dynamical systems, with applications to Fintech, cryptography, number theory, and random number generators. While covering advanced topics, this article is accessible to professionals with limited knowledge in statistical or mathematical theory. It introduces new material not covered in my recent book (available here) on applied stochastic processes. You don't need to read my book to understand this article, but the book is a nice complement and introduction to the concepts discussed here.
None of the material presented here is covered in standard textbooks on stochastic processes or dynamical systems. In particular, it has nothing to do with the classical logistic map or Brownian motions, though the systems investigated here exhibit very similar behaviors and are related to the classical models. This cross-disciplinary article is targeted to professionals with interests in statistics, probability, mathematics, machine learning, simulations, signal processing, operations research, computer science, pattern recognition, and physics. Because of its tutorial style, it should also appeal to beginners learning about Markov processes, time series, and data science techniques in general, offering fresh, off-the-beaten-path content not found anywhere else, contrasting with the material covered again and again in countless, identical books, websites, and classes catering to students and researchers alike. 
Some problems discussed here could be used by college professors in the classroom, or as original exam questions, while others are extremely challenging questions that could be the subject of a PhD thesis or even well beyond that level. This article constitutes (along with my book) a stepping stone in my endeavor to solve one of the biggest mysteries in the universe: are the digits of mathematical constants such as Pi evenly distributed? To this day, no one knows if these digits even have a distribution to start with, let alone whether that distribution is uniform or not. Part of the discussion is about statistical properties of numeration systems in a non-integer base (such as the golden ratio base) and their applications. All systems investigated here, whether deterministic or not, are treated as stochastic processes, including the digits in question. They all exhibit strong chaos, albeit easily manageable due to their ergodicity.
Interesting connections with the golden ratio, special polynomials, and other special mathematical constants are discussed in section 2. Finally, all the analyses performed during this work were done in Excel. I share my spreadsheets in this article, as well as many illustrations, and all the results are replicable.
Content of this article
1. General framework, notations and terminology
  • Finding the equilibrium distribution
  • Auto-correlation and spectral analysis
  • Ergodicity, convergence, and attractors
  • Space state, time state, and Markov chain approximations
  • Examples
2. Case study
  • First fundamental theorem
  • Second fundamental theorem
  • Convergence to equilibrium: illustration
3. Applications
  • Potential application domains
  • Example: the golden ratio process
  • Finding other useful b-processes
4. Additional research topics
  • Perfect stochastic processes
  • Characterization of equilibrium distributions (the attractors)
  • Probabilistic calculus and number theory, special integrals
5. Appendix
  • Computing the auto-correlation at equilibrium
  • Proof of the first fundamental theorem
  • How to find the exact equilibrium distribution
6. Additional Resources

Wednesday, March 13, 2019

How to Automatically Determine the Number of Clusters in your Data - and more

Determining the number of clusters when performing unsupervised clustering is a tricky problem. Many data sets don't exhibit well separated clusters, and two human beings asked to visually tell the number of clusters by looking at a chart are likely to provide two different answers. Sometimes clusters overlap with each other, and large clusters contain sub-clusters, which makes the decision difficult.
For instance, how many clusters do you see in the picture below? What is the optimum number of clusters? No one can tell with certainty, not AI, not a human being, not an algorithm. 
How many clusters here? (source: see here)
In the above picture, the underlying data suggests that there are three main clusters. But an answer such as 6 or 7 seems equally valid.
A number of empirical approaches have been used to determine the number of clusters in a data set. They usually fit into two categories:
  • Model fitting techniques: an example is using a mixture model to fit your data, and determining the optimum number of components; or using density estimation techniques, and testing for the number of modes (see here.) Sometimes, the fit is compared with that of a model where observations are uniformly distributed on the entire support domain, thus with no cluster; you may have to estimate the support domain in question, and assume that it is not made of disjoint sub-domains; in many cases, the convex hull of your data set, as an estimate of the support domain, is good enough.
  • Visual techniques: for instance, the silhouette or elbow rule (very popular.)
In both cases, you need a criterion to determine the optimum number of clusters. In the case of the elbow rule, one typically uses the percentage of unexplained variance. This number is 100% with zero cluster, and it decreases (initially sharply, then more modestly) as you increase the number of clusters in your model. When each point constitutes a cluster, this number drops to 0. Somewhere in between, the curve that displays your criterion exhibits an elbow (see picture below), and that elbow determines the number of clusters. For instance, in the chart below, the optimum number of clusters is 4.
The elbow rule tells you that here, your data set has 4 clusters (elbow strength in red)
Good references on the topic are available. Some R functions are available too, for instance fviz_nbclust. However, I could not find in the literature how the elbow point is explicitly computed. Most references mention that it is mostly hand-picked by visual inspection, or based on some predetermined but arbitrary threshold. In the next section, we solve this problem.
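To make the criterion concrete, here is a minimal Python sketch of the curve used by the elbow rule: the percentage of unexplained variance as a function of the number of clusters. It is only an illustration under stated assumptions, not the elbow computation announced above: the data (three synthetic Gaussian blobs), the plain k-means routine, and the function names kmeans and unexplained_variance_pct are all hypothetical choices made for this example. The one-cluster model is used as the 100% baseline.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm (illustrative only); returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points picked at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    # On return, each center is the mean of the points carrying its label.
    return centers, labels

def unexplained_variance_pct(points, k, n_restarts=10):
    """Within-cluster sum of squares as a percentage of the total sum of squares."""
    total = ((points - points.mean(axis=0)) ** 2).sum()
    best = np.inf
    for seed in range(n_restarts):  # keep the best of several random starts
        centers, labels = kmeans(points, k, seed=seed)
        within = ((points - centers[labels]) ** 2).sum()
        best = min(best, within)
    return 100.0 * best / total

# Hypothetical data: three well-separated Gaussian blobs in the plane.
rng = np.random.default_rng(123)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(size=(50, 2)) for c in blob_centers])

curve = {k: unexplained_variance_pct(points, k) for k in range(1, 9)}
for k, pct in curve.items():
    print(f"{k} clusters: {pct:6.2f}% unexplained variance")
```

On data like this, the curve drops sharply up to three clusters and flattens afterwards; with one cluster per point it reaches 0, as described above. Locating the elbow on that curve is a separate step that this sketch does not attempt.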

Thursday, March 7, 2019

Deep Analytical Thinking and Data Science Wizardry

Many times, complex models are not enough (or too heavy), or not necessary, to get great, robust, sustainable insights out of data. Deep analytical thinking may prove more useful, and can be done by people not necessarily trained in data science, even by people with limited coding experience. Here we explore what we mean by deep analytical thinking, using a case study, and how it works: combining craftsmanship, business acumen, the use and creation of tricks and rules of thumb, to provide sound answers to business problems. These skills are usually acquired by experience more than by training, and data science generalists (see here how to become one) usually possess them.
This article is targeted to data science managers and decision makers, as well as to junior professionals who want to become one at some point in their career. Deep thinking, unlike deep learning, is also more difficult to automate, so it provides better job security. Those automating deep learning are actually the new data science wizards, who can think outside the box. Much of what is described in this article is also data science wizardry, and not taught in standard textbooks nor in the classroom. By reading this tutorial, you will learn and be able to use these data science secrets, and possibly change your perspective on data science. Data science is like an iceberg: everyone knows and can see the tip of the iceberg (regression models, neural nets, cross-validation, clustering, Python, and so on, as presented in textbooks.) Here I focus on the unseen bottom, using a statistical level almost accessible to the layman, avoiding jargon and complicated math formulas, yet discussing a few advanced concepts.
1. Case Study: The Problem
2. Deep Analytical Thinking
  • Answering hidden questions
  • Business questions
  • Data questions
  • Metrics questions
3. Data Science Wizardry
  • Generic algorithm
  • Illustration with three different models
  • Results
4. A few data science hacks

Saturday, February 23, 2019

New Perspectives on Statistical Distributions and Mixture Models - with Broad Spectrum of Applications

In this data science article, emphasis is placed on science, not just on data. State-of-the-art material is presented in simple English, from multiple perspectives: applications, theoretical research asking more questions than it answers, scientific computing, machine learning, and algorithms. I attempt here to lay the foundations of a new statistical technology, hoping that it will plant the seeds for further research on a topic with a broad range of potential applications. Mixtures have been studied and used in applications for a long time, including by myself when working on my Ph.D. 25 years ago, and they are still a subject of active research. Yet you will find here plenty of new material.
Introduction and Context
In a previous article (see here) I attempted to approximate a random variable representing real data, by a weighted sum of simple kernels such as independent, identically distributed uniform random variables. The purpose was to build Taylor-like series approximations to more complex models (each term in the series being a random variable), to
  • avoid over-fitting,
  • approximate any empirical distribution (the inverse of the percentiles function) attached to real data,
  • easily compute data-driven confidence intervals regardless of the underlying distribution,
  • derive simple tests of hypothesis,
  • perform model reduction, 
  • optimize data binning to facilitate feature selection, and to improve visualizations of histograms
  • create perfect histograms,
  • build simple density estimators,
  • perform interpolations, extrapolations, or predictive analytics
  • perform clustering and detect the number of clusters.
While I've found very interesting properties about stable distributions during this research project, I could not come up with a solution to solve all these problems. The fact is that these weighted sums would usually converge (in distribution) to a normal distribution if the weights did not decay too fast -- a consequence of the central limit theorem. And even if using uniform kernels (as opposed to Gaussian ones) with fast-decaying weights, the sum would converge to an almost symmetrical, Gaussian-like distribution. In short, very few real-life data sets could be approximated by this type of model.
I also tried with independently but NOT identically distributed kernels, and again, failed to make any progress. By "not identically distributed kernels", I mean basic random variables from the same family, say with a uniform or Gaussian distribution, but with parameters (mean and variance) that are different for each term in the weighted sum. The reason is that sums of Gaussians, even with different parameters, are still Gaussian, and sums of uniforms end up being Gaussian too unless the weights decay fast enough. Details about why this is happening are provided in the last section.
Now, in this article, starting in the next section, I offer a full solution, using mixtures rather than sums. The possibilities are endless. 
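The contrast between weighted sums and mixtures can be checked with a small simulation. The Python sketch below is only an illustration under assumptions of my choosing (weights 1/sqrt(k), 50 uniform kernels on [-1, 1], and a two-component Gaussian mixture centered at -3 and +3); none of these settings come from the article itself. It uses excess kurtosis, which is 0 for a Gaussian distribution, as a rough indicator of how Gaussian-like each sample is.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_terms = 50_000, 50

def excess_kurtosis(x):
    """Sample excess kurtosis; close to 0 for Gaussian data."""
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3.0

# Weighted sum of i.i.d. uniform kernels on [-1, 1].
# Assumed weights: 1/sqrt(k), which decay slowly.
weights = 1.0 / np.sqrt(np.arange(1, n_terms + 1))
kernels = rng.uniform(-1.0, 1.0, size=(n_samples, n_terms))
weighted_sum = kernels @ weights

# Mixture of two Gaussian kernels: each observation is drawn from
# N(-3, 1) or N(+3, 1) with probability 1/2 (assumed parameters).
pick_left = rng.random(n_samples) < 0.5
mixture = np.where(pick_left,
                   rng.normal(-3.0, 1.0, n_samples),
                   rng.normal(3.0, 1.0, n_samples))

kurt_sum = excess_kurtosis(weighted_sum)
kurt_mix = excess_kurtosis(mixture)
print(f"weighted sum: excess kurtosis = {kurt_sum:+.3f}")
print(f"mixture:      excess kurtosis = {kurt_mix:+.3f}")
```

With these settings the weighted sum stays close to the Gaussian value, while the mixture is clearly bimodal and far from it. This is the kind of flexibility that sums of kernels lack and mixtures provide.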
Content of this article
1. Introduction and Context
2. Approximations Using Mixture Models
  • The error term
  • Kernels and model parameters
  • Algorithms to find the optimum parameters
  • Convergence and uniqueness of solution
  • Find near-optimum with fast, black-box step-wise algorithm
3. Example
  • Data and source code
  • Results
4. Applications
  • Optimal binning
  • Predictive analytics
  • Test of hypothesis and confidence intervals
  • Clustering
5. Interesting problems
  • Gaussian mixtures uniquely characterize a broad class of distributions
  • Weighted sums fail to achieve what mixture models do
  • Stable mixtures
  • Correlations
Read full article here

Wednesday, February 13, 2019

A Plethora of Original, Non-standard Statistical Tests

Many of the following statistical tests are rarely discussed in textbooks or in college classes, much less in data camps. Yet they help answer a lot of different and interesting questions. I used most of them without even computing the underlying distribution under the null hypothesis, but instead, using simulations to check whether my assumptions were plausible or not. In short, my approach to statistical testing is model-free and data-driven. Some are easy to implement even in Excel. Some of them are illustrated here, with examples that do not require statistical knowledge for understanding or implementation.
This material should appeal to managers, executives, industrial engineers, software engineers, operations research professionals, economists, and to anyone dealing with data, such as biometricians, analytical chemists, astronomers, epidemiologists, journalists, or physicians. Statisticians with a different perspective are invited to discuss my methodology and the tests described here, in the comment section at the bottom of this article. In my case, I used these tests mostly in the context of experimental mathematics, which is a branch of data science that few people talk about. In that context, the theoretical answer to a statistical test is sometimes known, making it a great benchmarking tool to assess the power of these tests, and determine the minimum sample size to make them valid.
I provide here a general overview, as well as my simple approach to statistical testing, accessible to professionals with little or no formal statistical training. Detailed applications of these tests are found in my recent book and in this article. Precise references to these documents are provided as needed, in this article.
Examples of traditional tests
1. General Methodology
Despite my strong background in statistical science, over the years, I moved away from relying too much on traditional statistical tests and statistical inference. I am not the only one: these tests have been abused and misused, see for instance this article on p-hacking. Instead, I favored a methodology of my own, mostly empirical, based on simulations, data- rather than model-driven. It is essentially a non-parametric approach. It has the advantage of being far easier to use, implement, understand, and interpret, especially for the non-initiated. It was initially designed to be integrated in black-box, automated decision systems. Here I share some of these tests, and many can be implemented easily in Excel.
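As a generic illustration of this simulation-based approach (not one of the specific tests from the article or the book), the Python sketch below checks whether a sequence of decimal digits looks uniformly distributed. The statistic (largest gap between an observed digit frequency and 1/10), the function names, the sample sizes, and the two example sequences are all assumptions made for this example. No theoretical null distribution is derived: the reference distribution of the statistic is obtained by simulating uniform digits.

```python
import numpy as np
from math import isqrt

def max_freq_deviation(digits, base=10):
    """Largest absolute gap between an observed digit frequency and 1/base."""
    counts = np.bincount(digits, minlength=base)
    return np.abs(counts / len(digits) - 1.0 / base).max()

def simulated_p_value(observed, n_sim=2000, base=10, seed=0):
    """Share of simulated uniform samples at least as extreme as the data."""
    rng = np.random.default_rng(seed)
    stat_obs = max_freq_deviation(observed, base)
    sims = np.array([
        max_freq_deviation(rng.integers(0, base, size=len(observed)), base)
        for _ in range(n_sim)
    ])
    # The +1 terms keep the estimate strictly positive.
    return (1 + np.sum(sims >= stat_obs)) / (n_sim + 1)

n = 2000

# Example 1: the first n decimals of sqrt(2), from exact integer arithmetic.
sqrt2_digits = np.array([int(c) for c in str(isqrt(2 * 10 ** (2 * n)))][1:])

# Example 2: a hypothetical biased source where digit 0 is twice as likely.
probs = np.array([2.0] + [1.0] * 9)
probs /= probs.sum()
biased_digits = np.random.default_rng(1).choice(10, size=n, p=probs)

p_sqrt2 = simulated_p_value(sqrt2_digits)
p_biased = simulated_p_value(biased_digits)
print(f"sqrt(2) digits: simulated p-value = {p_sqrt2:.4f}")
print(f"biased digits:  simulated p-value = {p_biased:.4f}")
```

A small simulated p-value, as expected for the biased source, means that uniform digits rarely produce a deviation that large. The same recipe works with any other statistic: only max_freq_deviation needs to change.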

Monday, December 31, 2018

Announcement: Winner of the Data Science Central Competition

Back in 2017, we posted a problem related to stochastic processes and controlled random walks, offering a $2,000 award for a sound solution, see here for full details. The problem, which had a FinTech flavor, was only solved recently (December 2018) by Victor Zurkowski.
About the problem:
Let's start with X(1) = 0, and define X(k) recursively as follows, for k > 1:
and let's define U(k), Z(k), and Z as follows:
where the V(k)'s are deviates from independent uniform variables on [0, 1].
So there are two positive parameters in this problem, a and b, and U(k) is always between 0 and 1. When b = 1, the U(k)'s are just standard uniform deviates, and if b = 0, then U(k) = 1. The case a = b = 0 is degenerate and should be ignored. The case a > 0 and b = 0 is of special interest, and it is a number theory problem in itself, related to this problem when a = 1. Also, just like in random walks or Markov chains, the X(k)'s are not independent; they are indeed highly auto-correlated.
Prove that if a < 1, then X(k) converges to 0 as k increases. Under the same condition, prove that the limiting distribution Z
  • always exists, (Note: if a > 1, X(k) may not converge to zero, causing a drift and asymmetry)
  • always takes values between -1 and +1, with min(Z) = -1 and max(Z) = +1,
  • is symmetric, with mean and median equal to 0
  • and does not depend on a, but only on b.
For instance, for b = 1, even a = 0 yields the same triangular distribution for Z, as any a > 0.
Main question: In general, what is the limiting distribution of Z? I guessed, using empirical data science techniques such as model fitting, simulations, and goodness-of-fit tests, that the solution (which implied solving a stochastic integral equation) was, with z in [-1, 1]:
About the author and the solution:
Victor not only confirmed that the above density function is a solution to this problem, but also that the solution is unique, focusing on convergence issues, in a 27-page long paper. One detail still needs to be worked out: whether or not scaled Z visits the neighborhood of every point in [-1,1] infinitely often. Victor believes that the answer is positive. You can read his solution here, and we hope it will result in a publication in a scientific journal.
Victor Zurkowski, PhD, is a predictive modeling, machine learning, and optimization expert with 20+ years of experience, with deep expertise developing pricing models and optimization engines across industries, including retail and financial services. He published various academic papers in Mathematics and Statistics across numerous topics, and is currently VP of Data Science at Polymatiks. Victor holds a Ph.D. in Mathematics from the University of Minnesota and an M.Sc. in Statistics from the University of Toronto.

Thursday, December 27, 2018

Why You Should be a Data Science Generalist - and How to Become One

The new advice today for data scientists is not to become a generalist. You can read recent articles on this topic, for instance here.  In this blog, I explain why I believe it should be the opposite. I wrote about this here not long ago, and provide additional arguments in this article, as to why it helps to be a generalist.  
Of course, it is difficult, and probably impossible, to become a data science generalist just after graduating. It takes years to acquire all the skills, yet you don't need to master all of them. It might be easier for a physicist, engineer, or biostatistician currently learning data science, after years of corporate experience, than it is for a data scientist with no business experience. Possibly the easiest way to become one is to work for startups or small companies, wearing many hats as you will probably be the only data scientist in your company, and will have to change jobs more frequently than if you work for a big company. By contrast, in a big company, you are expected to work in a very specialized area, though it does not hurt to be a generalist, as I will illustrate shortly. Being a specialized data scientist could put you on a very predictable path that limits your career growth and flexibility, especially if you want to create your company down the line. Let's start with explaining what a data science generalist is.
The data science generalist
The generalist has experience working in different roles and different environments, for instance, over a period of 15 years, having worked as a
  • Business analyst or BI professional, communicating insights to decision makers, mastering tools such as Tableau, SQL and Excel; or maybe being the decision maker herself
  • Statistician / data analyst with expertise in predictive modeling
  • Expert in algorithm design and optimization
  • Researcher in an academic-like setting, or experience in testing / prototyping new data science systems and proofs of concept (POC)
  • Builder / architect: designing APIs, dashboards, databases, and deploying/maintaining yourself some modest systems in production mode
  • Programmer (statistical or scientific programmer with exposure to high performance computing and parallel architectures - you might even have designed your own software)
  • Consultant, directly working with clients, or adviser
  • Manager or director role rather than individual contributor
  • Professional with roles in various industries (IT, media, Internet, finance, health care, smart cities) in both big and small companies, in various domains ranging from fraud detection, to optimizing sales or marketing, with proven, measurable accomplishments
In short, the generalist has been involved, at one time or another, in all phases of the data science project lifecycle.
The generalist might not command a higher salary, but has more flexibility career-wise. Even in a big company, when downsizing occurs, it is easier for the generalist to make a lateral move (get transferred to a different department), than it is for the "one-trick pony". 
Timing is important too. If you become a generalist at age 50 (as opposed to age 45) it might not help as getting hired becomes more difficult as you get past 45. Still, even if 50 or more, it opens up some possibilities, for instance starting your own business. And if you can prove that you have been consistently broadening your skills throughout your career cycle, as generalists do by definition, it will be easier to land a job, especially if your salary expectations are reasonable, and your health is not an issue for your future employer.  
To read the full article, click here
