Saturday, October 26, 2019

More Weird Statistical Distributions

Some original and very interesting material is presented here, with possible applications in Fintech. No need for a PhD in math to understand this article: I tried to make the presentation as simple as possible, focusing on high-level results rather than technicalities. Yet, professional statisticians and mathematicians, even academic researchers, will find some deep and fascinating results worth further exploring.
Can you identify patterns in this chart? (see section 2.2. in the article for an answer)
Let's start with 
Here the X(k)'s are independent and identically distributed random variables, each with the same distribution as a generic variable X. We are trying to find the distribution of Z.
Contents
1. Using a Simple Discrete Distribution for X
2. Towards a Better Model
  • Approximate Solution
  • The Fractal, Brownian-like Error Term
3. Finding X and Z Using Characteristic Functions
  • Test with Log-normal Distribution for X
  • Playing with the Characteristic Functions
  • Generalization to Continued Fractions and Nested Cubic Roots
4. Exercises
Read this article here

Wednesday, October 2, 2019

Surprising Uses of Synthetic Random Data Sets

I have used synthetic data sets many times for simulation purposes, most recently in my articles Six Degrees of Separation Between Any Two Data Sets and How to Lie with p-values. Many applications (including the data sets themselves) can be found in my books Applied Stochastic Processes and New Foundations of Statistical Science. For instance, these data sets can be used to benchmark statistical tests of hypotheses (with the null hypothesis known in advance to be true or false) and to assess the power of such tests or the accuracy of confidence intervals. In other cases, they are used to simulate clusters and to test cluster detection and pattern detection algorithms, see here. I also used such data sets to discover two new deep conjectures in number theory (see here), to design new Fintech models such as bounded Brownian motions, and to find new families of statistical distributions (see here).
Goldbach's comet 
In this article, I focus on peculiar random data sets to prove -- heuristically -- two of the most famous math conjectures in number theory, related to prime numbers: the Twin Prime conjecture, and the Goldbach conjecture. The methodology is at the intersection of probability theory, experimental math, and probabilistic number theory. It involves working with infinite data sets, dwarfing any data set found in any business context.
Read full article here.

Monday, September 9, 2019

Six Degrees of Separation Between Any Two Data Sets

This is an interesting data science conjecture, inspired by the well-known six degrees of separation problem, which states that there is a link involving no more than 6 connections between any two people on Earth, for instance between you and anyone living in North Korea.
Here the link is between any two univariate data sets of the same size, say Data A and Data B. The claim is that there is a chain involving no more than 6 intermediary data sets, each highly correlated to the previous one (with a correlation above 0.8), between Data A and Data B. The concept is illustrated in the example below, where only 4 intermediary data sets (labeled Degree 1, Degree 2, Degree 3, and Degree 4) are actually needed. 
Correlation table for the 6 data sets
To view the (random) data sets, understand how the chain of intermediary data sets was built, and access the spreadsheets to reproduce the results or test on different data, follow this link. It makes for an interesting theoretical data science research project, for people with too much free time on their hands.
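The chain idea can be illustrated with a minimal sketch. It does not reproduce the construction used in the linked spreadsheets: it assumes, purely for illustration, that the intermediary data sets are obtained by linear interpolation between Data A and Data B. The function names `pearson` and `build_chain`, the seed, and the sample size are hypothetical choices.

```python
import random
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def build_chain(data_a, data_b, n_intermediate):
    """Return [A, Degree 1, ..., Degree k, B].

    Assumption (not the article's method): each intermediary set is a
    linear blend (1 - t) * A + t * B for evenly spaced values of t.
    """
    steps = n_intermediate + 1
    chain = []
    for i in range(steps + 1):
        t = i / steps
        chain.append([(1 - t) * a + t * b for a, b in zip(data_a, data_b)])
    return chain

# Two independent synthetic data sets of the same size
rng = random.Random(2019)
data_a = [rng.gauss(0, 1) for _ in range(1000)]
data_b = [rng.gauss(0, 1) for _ in range(1000)]

# 4 intermediary data sets, as in the example above
chain = build_chain(data_a, data_b, 4)

# Correlation between each data set and the next one in the chain
link_correlations = [pearson(chain[i], chain[i + 1])
                     for i in range(len(chain) - 1)]
print([round(c, 3) for c in link_correlations])
```

With two nearly uncorrelated data sets, every link in this blended chain stays above the 0.8 threshold even though the end points are almost uncorrelated. Strongly negatively correlated data sets would need a different construction, since the blend then passes close to a constant.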

Sunday, September 8, 2019

Two New Deep Conjectures in Probabilistic Number Theory

The material discussed here is also of interest to machine learning, AI, big data, and data science practitioners, as much of the work is based on heavy data processing, algorithms, efficient coding, testing, and experimentation. Also, it's not just two new conjectures, but paths and suggestions to solve these problems. The last section contains a few new, original exercises, some with solutions, and may be useful to students, researchers, and instructors offering math and statistics classes at the college level: they range from easy to very difficult. Some great probability theorems are also discussed, in layman's terms: see section 1.2. 
The two deep conjectures highlighted in this article (conjectures B and C) are related to the digit distribution of well-known math constants such as Pi or log 2, with an emphasis on the binary digits of SQRT(2). This is an old problem, one of the most famous in mathematics, and still unsolved today.
Content of this article
A Strange Recursive Formula
  • Conjecture A
  • A deeper result
  • Conjecture B
  • Connection to the Berry-Esseen theorem
  • Potential path to solving this problem
Potential Solution Based on Special Rational Number Sequences
  • Interesting statistical result
  • Conjecture C
  • Another curious statistical result
Exercises
Read the full article here

Friday, August 30, 2019

A Strange Family of Statistical Distributions

I introduce here a family of very peculiar statistical distributions governed by two parameters: p, a real number in [0, 1], and b, an integer > 1. 
Potential applications are found in cryptography, Fintech (stock market modeling), Bitcoin, number theory, random number generation, benchmarking statistical tests (see here) and even gaming (see here). However, the most interesting application is probably to gain insights into what non-normal numbers look like, especially their chaotic nature. It is a fundamental tool to help solve one of the most intriguing mathematical conjectures of all time (yet unsolved): are the digits of standard constants such as Pi or SQRT(2) uniformly distributed or not? For instance, when b = 2, any departure from p = 0.5 (a normal seed) results in a strong discontinuity for f(x) at x = 0.5. If you look at the above chart, f(0) = f(1/2) = f(1) regardless of p, but discontinuities are masking this fact.

Extreme Events Modeling Using Continued Fractions

Continued fractions are usually considered a beautiful, curious mathematical topic, but with applications that are mostly theoretical and limited to math and number theory. Here we show how they can be used in applied business and economics contexts, leveraging the mathematical theory developed for continued fractions, to model and explain natural phenomena.
The interest in this project started when analyzing sequences such as x(n) = { nq } = nq - INT(nq) where n = 1, 2, and so on, and q is an irrational number in [0, 1] called the seed. The braces denote the fractional part function. The values x(n) are also in [0, 1] and get arbitrarily close to 0 and 1 infinitely often, and indeed arbitrarily close to any number in [0, 1] infinitely often. I became interested in what happens when x(n) gets very close to 1, and more precisely, in the distribution of the arrival times t(n) of successive records. I was curious to compare these arrival times with those from truly random numbers, or from real-life time series such as temperature, stock market or gaming/sports data. Such arrival times are known to have an infinite expectation under stable conditions, though their medians always exist: after all, any record could be the final one, never to be surpassed again in the future. This always happens at some point with the sequence x(n) if q is a rational number, hence our focus on irrational seeds: they yield successive records that keep growing over and over, without end, although the gaps between successive records eventually grow very large, in a chaotic, unpredictable way, just like records in traditional time series.
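The record arrival times described above can be computed with a short sketch. The function name `record_arrival_times` and the two seeds are illustrative assumptions, not taken from the article. Note also that a floating-point value such as sqrt(2) - 1 is only an approximation of an irrational seed, which is adequate for moderate values of n.

```python
from fractions import Fraction
from math import sqrt

def record_arrival_times(q, n_max):
    """Return the indices n <= n_max at which x(n) = {n*q} sets a new record,
    that is, exceeds all previous values x(1), ..., x(n-1)."""
    times = []
    best = -1  # below any possible value of x(n), so n = 1 is always a record
    for n in range(1, n_max + 1):
        x = (n * q) % 1  # fractional part of n*q
        if x > best:
            best = x
            times.append(n)
    return times

# Seed approximating an irrational number: records keep occurring,
# with gaps that grow larger and larger
irrational_times = record_arrival_times(sqrt(2) - 1, 100_000)

# Rational seed (exact arithmetic): the sequence is periodic,
# so the last record is reached quickly
rational_times = record_arrival_times(Fraction(1, 4), 100_000)

print(irrational_times)
print(rational_times)
```

For the rational seed 1/4, the values cycle through 1/4, 1/2, 3/4, 0, so the final record occurs at n = 3. The same function can be applied to a list of pseudo-random numbers, with minor changes, to compare arrival times against those of a truly random sequence.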
Content:
  • Theoretical background (simplified)
  • Generalization and potential applications to real life problems
  • Original applications in music and probabilistic number theory

Sunday, June 23, 2019

Free Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes

This book is intended for busy professionals working with data of any kind: engineers, BI analysts, statisticians, operations research, AI and machine learning professionals, economists, data scientists, biologists, and quants, ranging from beginners to executives. In about 300 pages and 28 chapters it covers many new topics, offering a fresh perspective on the subject, including rules of thumb and recipes that are easy to automate or integrate in black-box systems, as well as new model-free, data-driven foundations to statistical science and predictive analytics. The approach focuses on robust techniques; it is bottom-up (from applications to theory), in contrast to the traditional top-down approach. The material is accessible to practitioners with a one-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications with numerous illustrations, is aimed at practitioners, researchers, and executives in various quantitative fields.
New ideas, advanced topics and state-of-the-art research are discussed in simple English, without using jargon or arcane theory. It unifies topics that are usually part of different fields (machine learning, statistics, computer science, operations research, dynamical systems, number theory), broadening the knowledge and interest of the reader in ways that are not found in any other book. This book contains a large amount of condensed material that would typically be covered in 1,000 pages in traditional publications, including data sets, source code, business applications, and Excel spreadsheets. Thanks to cross-references and redundancy, the chapters can be read independently, in random order.
Chapters are organized and grouped by themes: natural language processing (NLP), re-sampling, time series, central limit theorem, statistical tests, boosted models (ensemble methods), tricks and special topics, appendices, and so on. The text in blue consists of clickable links to provide the reader with additional references. Source code and Excel spreadsheets summarizing computations are also accessible as hyperlinks for easy copy-and-paste or replication purposes. The most recent version is available from this link, to DSC members only.
About the author
Vincent Granville is a start-up entrepreneur, patent owner, author, investor, pioneering data scientist with 30 years of corporate experience in companies small and large (eBay, Microsoft, NBC, Wells Fargo, Visa, CNET) and a former VC-funded executive, with a strong academic and research background including Cambridge University.
Download the book (members only) 
Click here to get the book. For Data Science Central members only. If you have any issues accessing the book, please contact us at info@datasciencecentral.com. To become a member, click here.
Content
Part 1 - Machine Learning Fundamentals and NLP
We introduce a simple ensemble technique (or boosted algorithm) known as Hidden Decision Trees, combining robust regression with unusual decision trees, useful in the context of transaction scoring. We then describe other original and related machine learning techniques for clustering large data sets, structuring unstructured data via indexation (a natural language processing or NLP technique), and performing feature selection, with Python code and even an Excel implementation.
  • Multi-use, Robust, Pseudo Linear Regression -- page 12
  • A Simple Ensemble Method, with Case Study (NLP) -- page 15
  • Excel Implementation -- page 24
  • Fast Feature Selection -- page 31
  • Fast Unsupervised Clustering for Big Data (NLP) -- page 36
  • Structuring Unstructured Data -- page 40
Part 2 - Applied Probability and Statistical Science
We discuss traditional statistical tests to detect departure from randomness (the null hypothesis) with applications to sequences (the observations) that behave like stochastic processes. The central limit theorem (CLT) is revisited and generalized with applications to time series (both univariate and multivariate) and Brownian motions. We discuss how weighted sums of random variables and stable distributions are related to the CLT, and then explore mixture models -- a better framework to represent a rich class of phenomena. Applications are numerous, including optimum binning. The last chapter summarizes many of the statistical tests used earlier.
  • Testing for Randomness -- page 42
  • The Central Limit Theorem Revisited -- page 48
  • More Tests of Randomness -- page 55
  • Random Weighted Sums and Stable Distributions -- page 63
  • Mixture Models, Optimum Binning and Deep Learning -- page 73
  • Long Range Correlations in Time Series -- page 87
  • Stochastic Number Theory and Multivariate Time Series -- page 95
  • Statistical Tests: Summary -- page 101
Part 3 - New Foundations of Statistical Science
We set the foundations for a new type of statistical methodology fit for modern machine learning problems, based on generalized resampling. Applications are numerous, ranging from optimizing cross-validation to computing confidence intervals, without using classic statistical theory, p-values, or probability distributions. Yet we introduce a few new fundamental theorems, including one regarding the asymptotic properties of generic, model-free confidence intervals.
  • Modern Resampling Techniques for Machine Learning -- page 107
  • Model-free, Assumption-free Confidence Intervals -- page 121
  • The Distribution of the Range: A Beautiful Probability Theorem -- page 133
Part 4 - Case Studies, Business Applications
These chapters deal with real-life business applications. Chapter 18 is peculiar in the sense that it features a very original business application (in gaming) described in detail with all its components, based on the material from the previous three chapters. Then we move to more traditional machine learning use cases. Emphasis is on providing sound business advice to data science managers and executives, by showing how data science can be successfully leveraged to solve problems. The presentation style is compact, focusing on strategy rather than technicalities.
  • Gaming Platform Rooted in Machine Learning and Deep Math -- page 136
  • Digital Media: Decay-adjusted Rankings -- page 148
  • Building a Website Taxonomy -- page 153
  • Predicting Home Values -- page 158
  • Growth Hacking -- page 161
  • Time Series and Growth Modeling -- page 169
  • Improving Facebook and Google Algorithms -- page 179
Part 5 - Additional Topics
Here we cover a large number of topics, including sample size problems, automated exploratory data analysis, extreme events, outliers, detecting the number of clusters, p-values, random walks, scale-invariant methods, feature selection, growth models, visualizations, density estimation, Markov chains, A/B testing, polynomial regression, strong correlation and causation, stochastic geometry, K nearest neighbors, and even the exact value of an intriguing integral computed using statistical science, just to name a few.
  • Solving Common Machine Learning Challenges -- page 187
  • Outlier-resistant Techniques, Cluster Simulation, Contour Plots -- page 214
  • Strong Correlation Metric -- page 225
  • Special Topics -- page 229
Appendix
  • Linear Algebra Revisited -- page 266
  • Stochastic Processes and Organized Chaos -- page 272
  • Machine Learning and Data Science Cheat Sheet  -- page 297
