Sunday, March 1, 2020

State-of-the-Art Statistical Science to Tackle Famous Number Theory Conjectures

The methodology described here has broad applications, leading to new statistical tests, a new type of ANOVA (analysis of variance), improved design of experiments, interesting fractional factorial designs, a better understanding of irrational numbers with applications in cryptography, gaming and Fintech, and high-quality random number generators (for when you really need them). It also features exact arithmetic, high-performance computing and distributed algorithms to compute millions of binary digits for an infinite family of real numbers, including detection of auto- and cross-correlations (or lack thereof) in the digit distributions.
The data processed in my experiment, consisting of raw irrational numbers (described by a new class of elementary recurrences), led to the discovery of unexpected apparent patterns in their digit distribution: in particular, the fact that a few of these numbers, contrary to popular belief, do not have 50% of their binary digits equal to 1. It turned out that perfectly random digits simulated in large numbers, with a good enough pseudo-random generator, also exhibit the same strange behavior, pointing to the fact that pure randomness may not be as random as we imagine it to be. Ironically, failure to exhibit these patterns would be an indicator that there really is a departure from pure randomness in the digits in question.
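To make the digit counts concrete, here is a minimal sketch of how the proportion of binary digits equal to 1 can be measured with exact integer arithmetic. It does not use the recurrences from the article; it assumes the number is sqrt(m) and relies on Python's `math.isqrt`. The function name `sqrt_binary_digits` and the choice m = 2 are illustrative only.

```python
from math import isqrt

def sqrt_binary_digits(m: int, n_digits: int) -> str:
    """First n_digits binary digits of the fractional part of sqrt(m)."""
    # floor(sqrt(m) * 2**n_digits), computed exactly as isqrt(m * 4**n_digits)
    scaled = isqrt(m << (2 * n_digits))
    # Subtract the integer part so that only the fractional digits remain
    frac = scaled - (isqrt(m) << n_digits)
    # Pad on the left: leading zeros are dropped by format()
    return format(frac, "b").zfill(n_digits)

digits = sqrt_binary_digits(2, 10_000)
proportion_of_ones = digits.count("1") / len(digits)
print(digits[:16], proportion_of_ones)
```

With 10,000 digits, a proportion that differs from 50% by about half a percentage point is within ordinary sampling noise, which is why apparent departures need to be assessed carefully.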
In addition to new statistical and mathematical methods, discoveries and interesting applications, you will learn in my article how to avoid the statistical traps that lead to erroneous conclusions when performing a large number of statistical tests, and how not to be misled by false appearances. I call them statistical hallucinations and false outliers.
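The trap can be reproduced with purely random bits. The sketch below is an illustration under my own assumptions (1,000 sequences of 1,000 bits, Python's built-in generator, an arbitrary seed), not the experiment from the article: it tests each sequence for a 50% proportion of ones and counts how many look "abnormal" at the usual 5% level.

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility only
n_bits, n_tests = 1_000, 1_000

z_scores = []
for _ in range(n_tests):
    # Number of ones in a block of n_bits pseudo-random bits
    ones = bin(random.getrandbits(n_bits)).count("1")
    # z-score under the null hypothesis P(bit = 1) = 1/2 (variance n/4)
    z_scores.append((ones - n_bits / 2) / (n_bits / 4) ** 0.5)

flagged = sum(abs(z) > 1.96 for z in z_scores)
print(flagged, max(abs(z) for z in z_scores))
```

Roughly 5% of the sequences are flagged even though every one of them is random by construction. Looking only at the most extreme z-score among many tests is how a false outlier gets mistaken for a real pattern.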
This article has two main sections: section 1, with deep research in number theory, and section 2, with deep research in statistics, with applications. You may skip one of the two sections depending on your interests and how much time you have. Both sections, despite being state-of-the-art in their respective fields, are written in simple English. It is my wish that with this article, I can get data scientists interested in math, and the other way around: the topics in both cases have been chosen to be exciting and modern. I also hope that this article will give you powerful new tools to add to your arsenal of tricks and techniques. Both topics are related, the statistical analysis being based on the numbers discussed in the math section.
One of the interesting new topics discussed here for the first time is the cross-correlation between the digits of two irrational numbers. These digit sequences are treated as multivariate time series. I believe this is the first time that this subject is not only investigated in detail, but also comes with a deep, spectacular probabilistic number theory result about the distributions in question, with important implications for security and cryptography systems. Another related topic discussed here is a generalized version of the Collatz conjecture, with some insights on how to potentially solve it.
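As a rough illustration of what treating digit sequences as time series means, the sketch below computes the empirical cross-correlation between the binary digits of two numbers at several lags. The numbers sqrt(2) and sqrt(3), the helper names and the sample size are my own assumptions. It does not reproduce the result announced above.

```python
from math import isqrt

def sqrt_bits(m: int, n: int) -> list:
    """First n binary digits of the fractional part of sqrt(m), as 0/1 ints."""
    frac = isqrt(m << (2 * n)) - (isqrt(m) << n)
    return [int(c) for c in format(frac, "b").zfill(n)]

def cross_correlation(x, y, lag=0):
    """Pearson correlation between x[t] and y[t + lag], for lag >= 0."""
    a, b = x[: len(x) - lag], y[lag:]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

x, y = sqrt_bits(2, 20_000), sqrt_bits(3, 20_000)
correlations = [cross_correlation(x, y, lag) for lag in range(6)]
print(correlations)
```

For two sequences of 20,000 independent random bits, values of the order of 1/sqrt(20000), about 0.007, are expected from noise alone, so that is the scale against which any observed cross-correlation should be judged.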
Content
1. On the Digits Distribution of Quadratic Irrational Numbers
  • Properties of the recursion
  • Reverse recursion
  • Properties of the reverse recursion
  • Connection to Collatz conjecture
  • Source code
  • New deep probabilistic number theory results
  • Spectacular new result about cross-correlations
  • Applications
2. New Statistical Techniques Used in Our Analysis
  • Data, features, and preliminary analysis
  • Doing it the right way
  • Are the patterns found a statistical illusion, or caused by errors, or real?
  • Pattern #1: Non-Gaussian behavior
  • Pattern #2: Illusionary outliers
  • Pattern #3: Weird distribution for block counts
  • Related articles and books
Appendix

Thursday, January 30, 2020

New Perspective on Fermat's Last Theorem

Fermat's last conjecture puzzled mathematicians for more than 300 years, and was finally proved only in the 1990s. In this note, I propose a generalization that could actually lead to a much simpler proof and a more powerful result with broader applications, including solving numerous similar equations. As usual, my research involves a significant amount of computations and experimental math, as an exploratory step before stating new conjectures, and eventually trying to prove them. The methodology is very similar to that used in data science, involving the following steps:
  1. Identify and process the data. Here the data set consists of all real numbers; it is infinite, which brings its own challenges. On the plus side, the data is public and accessible to everyone, though very powerful computation techniques are required, usually involving a distributed architecture. 
  2. Data cleaning: in this case, inaccuracies are caused by not using enough precision; the solution consists of finding better / faster algorithms for your computations, and sometimes having to work with exact arithmetic, using Bignum libraries.
  3. Sample data and perform exploratory analysis to identify patterns. Formulate hypotheses. Perform statistical tests to validate (or not) these hypotheses. Then formulate conjectures based on this analysis. 
  4. Build models (about how your numbers seem to behave) and focus on models offering the best fit. Perform simulations based on your model, see if your numbers agree with your simulations, by testing on a much larger set of numbers. Discard conjectures that do not pass these tests.
  5. Formally prove or disprove retained conjectures, when possible. Then write a conclusion if possible: in this case, a new, major mathematical theorem, showing potential applications. This last step is similar to data scientists presenting the main insights of their analysis to a lay audience.
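The exploratory steps above can be sketched on the classical equation x^n + y^n = z^n. This is only an illustration: the generalization proposed in the full article is not shown here, and the function name `count_solutions` and the search limits are hypothetical. Python integers are exact, so no precision issue arises.

```python
def count_solutions(n: int, limit: int) -> int:
    """Count pairs 1 <= x <= y <= limit with x**n + y**n == z**n, z <= limit."""
    # Lookup table of all n-th powers up to limit**n (exact integers)
    powers = {z ** n for z in range(1, limit + 1)}
    count = 0
    for x in range(1, limit + 1):
        for y in range(x, limit + 1):
            if x ** n + y ** n in powers:
                count += 1
    return count

for n in (2, 3, 4):
    print(n, count_solutions(n, 100))
```

For n = 2 the search finds the Pythagorean triples with hypotenuse at most 100 (there are 52 of them), while for n = 3 and n = 4 it finds none, in agreement with Fermat's Last Theorem. A table of such counts is the kind of raw data from which conjectures can be formulated.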
See full article for explanations about this table (representing the number of solutions)
The motivation in this article is two-fold:
  • Presenting a new path that can lead to new interesting results and theoretical research in mathematics (yet my writing style and content are accessible to the layman).
  • Offering data scientists and machine learning / AI practitioners (including newbies) an interesting framework to test their programming, discovery and analysis skills, using a huge (infinite) data set that has been available to everyone since the beginning of time, and applied to a fascinating problem.
Read full article here. For more math-oriented articles, visit this page (check the math section), or download my books, available here.

Friday, November 29, 2019

Variance, Attractors and Behavior of Chaotic Statistical Systems

We study the properties of a typical chaotic system to derive general insights that apply to a large class of unusual statistical distributions. The purpose is to create a unified theory of these systems. These systems can be deterministic or random, yet due to their gentle chaotic nature, they exhibit the same behavior in both cases. They lead to new models with numerous applications in Fintech, cryptography, simulation and benchmarking tests of statistical hypotheses. They are also related to numeration systems. One of the highlights in this article is the discovery of a simple variance formula for an infinite sum of highly correlated random variables. We also try to find and characterize attractor distributions: these are the limiting distributions for the systems in question, just like the Gaussian attractor is the universal attractor with finite variance in the central limit theorem framework. Each of these systems is governed by a specific functional equation, typically a stochastic integral equation whose solutions are the attractors. This equation helps establish many of their properties. The material discussed here is state-of-the-art and original, yet presented in a format accessible to professionals with limited exposure to statistical science. Physicists, statisticians, data scientists and people interested in signal processing, chaos modeling, or dynamical systems will find this article particularly interesting. Connection to other similar chaotic systems is also discussed.
Read the full article here.
Content of this article:
1. The Geometric System: Definition and Properties
  • A test for independence
  • Connection to the Fixed-Point Theorem
2. Geometric and Uniform Attractors
  • General formula
  • The geometric attractor
  • Not any distribution can be an attractor
  • The uniform attractor
3. Discrete X Resulting in a Gaussian-looking Attractor
  • Towards a numerical solution
4. Special Cases with Continuous Distribution for X
  • An almost perfect equality
  • Is the log-normal distribution an attractor?
5. Connection to Binary Digits and Singular Distributions
  • Numbers made up of random digits
  • Singular distributions
  • Connection to Infinite Random Products
6. A General Classification of Chaotic Statistical Distributions

Thursday, November 28, 2019

New Family of Generalized Gaussian or Cauchy Distributions

In this article, we explore a new type of generalized univariate normal distribution that satisfies useful statistical properties, with interesting applications. This new class of distributions is defined by its characteristic function, and applications are discussed in the last section. These distributions are semi-stable (we define what this means below). In short, it is a much wider class than the stable distributions (the only stable distribution with a finite variance being the Gaussian one) and it encompasses all stable distributions as a subset. It is a sub-class of the divisible distributions.
Content of this article:
  • New two-parameter distribution G(a, b): introduction, properties
  • Generalized central limit theorem
  • Characteristic function
  • Density: special cases, moments, mathematical conjecture
  • Simulations
  • Weakly semi-stable distributions
  • Counter-example
  • Applications and conclusions
Read the full article here

Saturday, October 26, 2019

More Weird Statistical Distributions

Some original and very interesting material is presented here, with possible applications in Fintech. No need for a PhD in math to understand this article: I tried to make the presentation as simple as possible, focusing on high-level results rather than technicalities. Yet, professional statisticians and mathematicians, even academic researchers, will find some deep and fascinating results worth further exploring.
Can you identify patterns in this chart? (see section 2.2. in the article for an answer)
Let's start with 
Here the X(k)'s are independent and identically distributed random variables, with their common distribution referred to as that of X. We are trying to find the distribution of Z.
Contents
1. Using a Simple Discrete Distribution for X
2. Towards a Better Model
  • Approximate Solution
  • The Fractal, Brownian-like Error Term
3. Finding X and Z Using Characteristic Functions
  • Test with Log-normal Distribution for X
  • Playing with the Characteristic Functions
  • Generalization to Continued Fractions and Nested Cubic Roots
4. Exercises
Read this article here

Wednesday, October 2, 2019

Surprising Uses of Synthetic Random Data Sets

I have used synthetic data sets many times for simulation purposes, most recently in my articles Six Degrees of Separation Between Any Two Data Sets and How to Lie with p-values. Many applications (including the data sets themselves) can be found in my books Applied Stochastic Processes and New Foundations of Statistical Science. For instance, these data sets can be used to benchmark some statistical tests of hypothesis (the null hypothesis being known to be true or false in advance) and to assess the power of such tests or confidence intervals. In other cases, they are used to simulate clusters and test cluster detection / pattern detection algorithms, see here. I also used such data sets to discover two new deep conjectures in number theory (see here), to design new Fintech models such as bounded Brownian motions, and to find new families of statistical distributions (see here).
Goldbach's comet 
In this article, I focus on peculiar random data sets to prove -- heuristically -- two of the most famous math conjectures in number theory, related to prime numbers: the Twin Prime conjecture, and the Goldbach conjecture. The methodology is at the intersection of probability theory, experimental math, and probabilistic number theory. It involves working with infinite data sets, dwarfing any data set found in any business context.
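For readers unfamiliar with Goldbach's comet, the sketch below computes the quantity it plots: the number of ways an even integer can be written as a sum of two primes. It is a plain deterministic computation, not the random data set approach of the full article, and the function names and the range of integers are my own choices.

```python
def prime_sieve(n: int) -> bytearray:
    """sieve[k] == 1 if k is prime, for 0 <= k <= n (sieve of Eratosthenes)."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            # Cross out the multiples of p, starting at p*p
            sieve[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    return sieve

def goldbach_count(n: int, is_prime: bytearray) -> int:
    """Number of unordered pairs of primes (p, q), p <= q, with p + q == n."""
    return sum(1 for p in range(2, n // 2 + 1) if is_prime[p] and is_prime[n - p])

is_prime = prime_sieve(2_000)
comet = {n: goldbach_count(n, is_prime) for n in range(4, 2_001, 2)}
print(comet[10], comet[100])
```

Plotting `comet[n]` against n produces the comet-shaped scatter. The Goldbach conjecture states that this count is never zero for an even n of at least 4.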
Read full article here.

Monday, September 9, 2019

Six Degrees of Separation Between Any Two Data Sets

This is an interesting data science conjecture, inspired by the well-known six degrees of separation problem, stating that there is a link involving no more than 6 connections between any two people on Earth, for instance between you and anyone living in North Korea.
Here the link is between any two univariate data sets of the same size, say Data A and Data B. The claim is that there is a chain involving no more than 6 intermediary data sets, each highly correlated to the previous one (with a correlation above 0.8), between Data A and Data B. The concept is illustrated in the example below, where only 4 intermediary data sets (labeled Degree 1, Degree 2, Degree 3, and Degree 4) are actually needed. 
Correlation table for the 6 data sets
To view the (random) data sets, understand how the chain of intermediary data sets was built, and access the spreadsheets to reproduce the results or test on different data, follow this link. It makes for an interesting theoretical data science research project, for people with too much free time on their hands.
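One simple way to experiment with the claim, without the spreadsheets, is sketched below. The construction (blending Data A and Data B with increasing weights) is my own assumption and not necessarily the one used in the linked article, and independent Gaussian data is a convenient case rather than a worst case.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

random.seed(1)  # arbitrary seed, for reproducibility only
n = 1_000
data_a = [random.gauss(0, 1) for _ in range(n)]
data_b = [random.gauss(0, 1) for _ in range(n)]

# Weights 0 and 1 give Data A and Data B; the four values in between
# give four intermediary data sets (Degree 1 to Degree 4).
weights = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
chain = [[(1 - w) * a + w * b for a, b in zip(data_a, data_b)] for w in weights]

# Correlation between each data set and the next one in the chain
links = [pearson(chain[i], chain[i + 1]) for i in range(len(chain) - 1)]
print(pearson(data_a, data_b), links)
```

Here Data A and Data B are nearly uncorrelated, yet each consecutive pair in the chain has a correlation above 0.8. Strongly negatively correlated data sets are harder, since a straight blend passes through nearly constant data, and would need a different construction.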
