Tuesday, May 7, 2019

Confidence Intervals Without Pain

We propose a simple model-free solution to compute any confidence interval and to extrapolate these intervals beyond the observations available in your data set. In addition we propose a mechanism  to sharpen the confidence intervals, to reduce their width by an order of magnitude. The methodology works with any estimator (mean, median, variance, quantile, correlation and so on) even when the data set violates the classical requirements necessary to make traditional statistical techniques work. In particular, our method also applies to observations that are auto-correlated, non identically distributed, non-normal, and even non-stationary. 
No statistical knowledge is required to understand, implement, and test our algorithm, nor to interpret the results. Its robustness makes it suitable for black-box, automated machine learning technology. It will appeal to anyone dealing with data on a regular basis, such as data scientists, statisticians, software engineers, economists, quants, physicists, biologists, psychologists, system and business analysts, and industrial engineers. 
In particular, we provide a confidence interval (CI) for the width of confidence intervals without using Bayesian statistics. The width is modeled as L = A / n^B and we compute, using Excel alone, a 95% CI for B in the classic case where B = 1/2. We also exhibit an artificial data set where L = 1 / (log n)^Pi. Here n is the sample size.

Despite the apparent simplicity of our approach, we are dealing here with martingales. But you don't need to know what a martingale is to understand the concepts and use our methodology. 

Saturday, May 4, 2019

Re-sampling: Amazing Results and Applications

This crash course features a new fundamental statistics theorem -- even more important than the central limit theorem -- and a new set of statistical rules and recipes. We discuss concepts related to determining the optimum sample size, the optimum k in k-fold cross-validation, bootstrapping, new re-sampling techniques, simulations, tests of hypotheses, confidence intervals, and statistical inference using a unified, robust, simple approach with easy formulas, efficient algorithms and illustration on complex data.
Little statistical knowledge is required to understand and apply the methodology described here, yet it is more advanced, more general, and more applied than standard literature on the subject. The intended audience is beginners as well as professionals in any field faced with data challenges on a daily basis. This article presents statistical science in a different light, hopefully in a style more accessible, intuitive, and exciting than standard textbooks, and in a compact format yet covering a large chunk of the traditional statistical curriculum and beyond.
In particular, the concept of p-value is not explicitly included in this tutorial. Instead, following the new trend after the recent p-value debacle (addressed by the president of the American Statistical Association), it is replaced with a range of values computed on multiple sub-samples. 
Our algorithms are suitable for inclusion in black-box systems, batch processing, and automated data science. Our technology is data-driven and model-free. Finally, our approach to this problem shows the contrast between the data science unified, bottom-up, and computationally-driven perspective, and the traditional top-down statistical analysis consisting of a collection of disparate results that emphasizes the theory. 
1. Re-sampling and Statistical Inference
  • Main Result
  • Sampling with or without Replacement
  • Illustration
  • Optimum Sample Size 
  • Optimum K in K-fold Cross-Validation
  • Confidence Intervals, Tests of Hypotheses
2. Generic, All-purposes Algorithm
  • Re-sampling Algorithm with Source Code
  • Alternative Algorithm
  • Using a Good Random Number Generator
3. Applications
  • A Challenging Data Set
  • Results and Excel Spreadsheet
  • A New Fundamental Statistics Theorem
  • Some Statistical Magic
  • How does this work?
  • Does this contradict entropy principles?
4. Conclusions

Thursday, April 25, 2019

Some Fun with Gentle Chaos, the Golden Ratio, and Stochastic Number Theory

So many fascinating and deep results have been written about the number (1 + SQRT(5)) / 2 and its related sequence - the Fibonacci numbers - that it would take years to read all of them. This number has been studied both for its applications (population growth, architecture) and its mathematical properties, for over 2,000 years. It is still a topic of active research.
Lag-1 auto-correlation in digit distribution of good seeds, for b-processes
I show here how I used the golden ratio for a new number guessing game (to generate chaos and randomness in ergodic time series) as well as new intriguing results, in particular:
  • Proof that the rabbit constant it is not normal in any base; this might be the first instance of a non-artificial mathematical constant for which the normalcy status is formally established.
  • Beatty sequences, pseudo-periodicity, and infinite-range auto-correlations for the digits of irrational numbers in the numeration system derived from perfect stochastic processes
  • Properties of multivariate b-processes, including integer or non-integer bases.
  • Weird behavior of auto-correlations for the digits of normal numbers (good seeds) in the numeration system derived from stochastic b-processes
  • A strange recursion that generates all the digits of the rabbit constant
Content of this article
1. Some Definitions
2. Digits Distribution in b-processes
3. Strange Facts and Conjectures about the Rabbit Constant
4. Gaming Application
  • De-correlating Using Mapping and Thinning Techniques
  • Dissolving the Auto-correlation Structure Using Multivariate b-processes
5. Related Articles
Read full articles, here

Monday, April 15, 2019

New Stock Trading and Lottery Game Rooted in Deep Math

I describe here the ultimate number guessing game, played with real money. It is a new trading and gaming system, based on state-of-the-art mathematical engineering, robust architecture, and patent-pending technology. It offers an alternative to the stock market and traditional gaming. This system is also far more transparent than the stock market, and can not be manipulated, as formulas to win the biggest returns (with real money) are made public. Also, it simulates a neutral, efficient stock market. In short, there is nothing random, everything is deterministic and fixed in advance, and known to all users. Yet it behaves in a way that looks perfectly random, and public algorithms offered to win the biggest gains require so much computing power, that for all purposes, they are useless -- except to comply with gaming laws and to establish trustworthiness.
We use private algorithms to determine the winning numbers, and while they produce the exact same results as the public algorithms (we tested this extensively), they are incredibly more efficient, by many orders of magnitude. Also, it can be mathematically proved that the public and private algorithms are equivalent, and we actually proved it. We go through this verification process for any new algorithm introduced in our system. 
In the last section, we offer a competition: can you use the public algorithm to identify the winning numbers computed with the private (secret) algorithm? If yes, the system is breakable, and a more sophisticated approach is needed, to make it work. I don't think anyone can find the winning numbers (you are welcome to prove me wrong), so the award will be offered to the contestant providing the best insights on how to improve the robustness of this system. And if by chance you manage to identify those winning numbers, great, you'll get a bonus! But it is not a requirement to win the award.
1. Description, Main Features and Advantages
2. How it Works: the Secret Sauce
  • Public Algorithm
  • The Winning Numbers
  • Using Seeds to Find the Winning Numbers
  • ROI Tables
3. Business Model and Applications
  • Managing the Money Flow
4. Challenge and Statistical Results
  • Data Science / Math Competition
  • Controlling the Variance of the Portfolio Value
  • Probability of Cracking the System

Thursday, April 4, 2019

Most Popular Content on DSC

We have been in existence for over 10 years now, with content in many different places, lists, categories, and databases. This is an attempt to put everything together in one place, and help our readers (re-)discover some great articles and resources that were lost on the Internet over the years, but still sit on our web servers. We are making them come back to life. We are in the process of organizing it in a way that is user-friendly.  Some of the resources below are very recent, and some are pretty old, but we only kept what is still relevant and useful today. To not miss this type of content in the future, subscribe to our newsletter.
Non Technical
Articles from top bloggers
Other popular resources
Archives: 2008-2014 | 2015-2016 | 2017-2019 | Book 1 | Book 2 | More
Follow us: Twitter | Facebook

Monday, April 1, 2019

Long-range Correlations in Time Series: Modeling, Testing, Case Study

We investigate a large class of auto-correlated, stationary time series, proposing a new statistical test to measure departure from the base model, known as Brownian motion. We also discuss a methodology to deconstruct these time series, in order to identify the root mechanism that generates the observations. The time series studied here can be discrete or continuous in time, they  can have various degrees of smoothness (typically measured using the Hurst exponent) as well as long-range or short-range correlations between successive values. Applications are numerous, and we focus here on a case study arising from some interesting number theory problem. In particular, we show that one of the times series investigated in my article on randomness theory [see here, read section 4.1.(c)] is not Brownian despite the appearance. It has important implications regarding the problem in question. Applied to finance or economics, it makes the difference between an efficient market, and one that can be gamed.
This article it accessible to a large audience, thanks to its tutorial style, illustrations, and easily replicable simulations. Nevertheless, we discuss modern, advanced, and state-of-the-art concepts. This is an area of active research. 
1. Introduction and time series deconstruction
  • Example
  • Deconstructing time series
  • Correlations, Fractional Brownian motions
2. Smoothness, Hurst exponent, and Brownian test
  • Our Brownian tests of hypothesis
  • Data
3. Results and conclusions
  • Charts and interpretation
  • Conclusions
Read the full article, here

Thursday, March 21, 2019

Fascinating Developments in the Theory of Randomness

I present here some innovative results in my most recent research on stochastic processes. chaos modeling, and dynamical systems, with applications to Fintech, cryptography, number theory, and random number generators. While covering advanced topics, this article is accessible to professionals with limited knowledge in statistical or mathematical theory. It introduces new material not covered in my recent book (available here) on applied stochastic processes. You don't need to read my book to understand this article, but the book is a nice complement and introduction to the concepts discussed here.
None of the material presented here is covered in standard textbooks on stochastic processes or dynamical systems. In particular, it has nothing to do with the classical logistic map or Brownian motions, though the systems investigated here exhibit very similar behaviors and are related to the classical models. This cross-disciplinary article is targeted to professionals with interests in statistics, probability, mathematics, machine learning, simulations, signal processing, operations research, computer science, pattern recognition, and physics. Because of its tutorial style, it should also appeal to beginners learning about Markov processes, time series, and data science techniques in general, offering fresh, off-the-beaten-path content not found anywhere else, contrasting with the material covered again and again in countless, identical books, websites, and classes catering to students and researchers alike. 
Some problems discussed here could be used by college professors in the classroom, or as original exam questions, while others are extremely challenging questions that could be the subject of a PhD thesis or even well beyond that level. This article constitutes (along with my book) a stepping stone in my endeavor to solve one of the biggest mysteries in the universe: are the digits of mathematical constants such as Pi, evenly distributed? To this day, no one knows if these digits even have a distribution to start with, let alone whether that distribution is uniform or not. Part of the discussion is about statistical properties of numeration systems in a non-integer base (such as the golden ratio base) and its applications. All systems investigated here, whether deterministic or not, are treated as stochastic processes, including the digits in question. They all exhibit strong chaos, albeit easily manageable due to their ergodicity.  .
Interesting connections with the golden ratio, special polynomials, and other special mathematical constants, are discussed in section 2. Finally, all the analyses performed during this work were done in Excel. I share my spreadsheets in this article, as well as many illustration, and all the results are replicable.
Content of this article
1. General framework, notations and terminology
  • Finding the equilibrium distribution
  • Auto-correlation and spectral analysis
  • Ergodicity, convergence, and attractors
  • Space state, time state, and Markov chain approximations
  • Examples
2. Case study
  • First fundamental theorem
  • Second fundamental theorem
  • Convergence to equilibrium: illustration
3. Applications
  • Potential application domains
  • Example: the golden ratio process
  • Finding other useful b-processes
4. Additional research topics
  • Perfect stochastic processes
  • Characterization of equilibrium distributions (the attractors)
  • Probabilistic calculus and number theory, special integrals
5. Appendix
  • Computing the auto-correlation at equilibrium
  • Proof of the first fundamental theorem
  • How to find the exact equilibrium distribution
6. Additional Resources

Confidence Intervals Without Pain

We propose a simple model-free solution to compute any confidence interval and to extrapolate these intervals beyond the observations avai...