Sunday, June 23, 2019

Free Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes

This book is intended for busy professionals working with data of any kind: engineers, BI analysts, statisticians, operations research analysts, AI and machine learning professionals, economists, data scientists, biologists, and quants, ranging from beginners to executives. In about 300 pages and 28 chapters, it covers many new topics, offering a fresh perspective on the subject, including rules of thumb and recipes that are easy to automate or integrate in black-box systems, as well as new model-free, data-driven foundations for statistical science and predictive analytics. The approach focuses on robust techniques; it is bottom-up (from applications to theory), in contrast to the traditional top-down approach. The material is accessible to practitioners with a one-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications with numerous illustrations, is aimed at practitioners, researchers, and executives in various quantitative fields.
New ideas, advanced topics, and state-of-the-art research are discussed in simple English, without jargon or arcane theory. The book unifies topics that are usually part of different fields (machine learning, statistics, computer science, operations research, dynamical systems, number theory), broadening the knowledge and interest of the reader in ways not found in any other book. It contains a large amount of condensed material that would typically be covered in 1,000 pages in traditional publications, including data sets, source code, business applications, and Excel spreadsheets. Thanks to cross-references and some redundancy, the chapters can be read independently, in any order.
Chapters are organized and grouped by themes: natural language processing (NLP), re-sampling, time series, central limit theorem, statistical tests, boosted models (ensemble methods), tricks and special topics, appendices, and so on. The text in blue consists of clickable links that provide the reader with additional references. Source code and Excel spreadsheets summarizing computations are also accessible as hyperlinks, for easy copy-and-paste or replication purposes. The most recent version of the book is available from this link, for DSC members only.
About the author
Vincent Granville is a start-up entrepreneur, patent owner, author, investor, and pioneering data scientist with 30 years of corporate experience in companies small and large (eBay, Microsoft, NBC, Wells Fargo, Visa, CNET). He is a former VC-funded executive with a strong academic and research background, including Cambridge University.
Download the book (members only) 
Click here to get the book (for Data Science Central members only). If you have any issues accessing the book, please contact us at info@datasciencecentral.com. To become a member, click here.
Content
Part 1 - Machine Learning Fundamentals and NLP
We introduce a simple ensemble technique (or boosted algorithm) known as Hidden Decision Trees, combining robust regression with unusual decision trees, useful in the context of transaction scoring. We then describe other original and related machine learning techniques for clustering large data sets, structuring unstructured data via indexation (a natural language processing or NLP technique), and performing feature selection, with Python code and even an Excel implementation. A toy illustration of the indexation idea follows the chapter list below.
  • Multi-use, Robust, Pseudo Linear Regression -- page 12
  • A Simple Ensemble Method, with Case Study (NLP) -- page 15
  • Excel Implementation -- page 24
  • Fast Feature Selection -- page 31
  • Fast Unsupervised Clustering for Big Data (NLP) -- page 36
  • Structuring Unstructured Data -- page 40
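To give a flavor of the indexation technique mentioned above, here is a minimal sketch; the data, the two-shared-keywords threshold, and the greedy grouping are illustrative, not the book's exact algorithm.

```python
# Minimal sketch: keyword indexation used to cluster short text items.
from collections import defaultdict

docs = [
    "cheap car insurance quotes",
    "car insurance online quotes",
    "best deep learning tutorial",
    "deep learning with python tutorial",
]

# Step 1: index each keyword to the documents containing it.
index = defaultdict(set)
for doc_id, text in enumerate(docs):
    for word in set(text.split()):
        index[word].add(doc_id)

# Step 2: use the index to find candidate neighbors, then group
# documents sharing at least 2 keywords (threshold is illustrative).
clusters, assigned = [], set()
for doc_id, text in enumerate(docs):
    if doc_id in assigned:
        continue
    words = set(text.split())
    candidates = set().union(*(index[w] for w in words)) - {doc_id}
    cluster = {doc_id}
    for other in sorted(candidates):
        if other not in assigned and len(words & set(docs[other].split())) >= 2:
            cluster.add(other)
    assigned |= cluster
    clusters.append(cluster)

print(clusters)  # [{0, 1}, {2, 3}]
```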
Part 2 - Applied Probability and Statistical Science
We discuss traditional statistical tests to detect departure from randomness (the null hypothesis), with applications to sequences (the observations) that behave like stochastic processes. The central limit theorem (CLT) is revisited and generalized, with applications to time series (both univariate and multivariate) and Brownian motions. We discuss how weighted sums of random variables and stable distributions relate to the CLT, and then explore mixture models -- a better framework to represent a rich class of phenomena. Applications are numerous, including optimum binning. The last chapter summarizes many of the statistical tests used earlier. A minimal sketch of one classic randomness test follows the chapter list below.
  • Testing for Randomness -- page 42
  • The Central Limit Theorem Revisited -- page 48
  • More Tests of Randomness -- page 55
  • Random Weighted Sums and Stable Distributions -- page 63
  • Mixture Models, Optimum Binning and Deep Learning -- page 73
  • Long Range Correlations in Time Series -- page 87
  • Stochastic Number Theory and Multivariate Time Series -- page 95
  • Statistical Tests: Summary -- page 101
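For flavor, here is a minimal sketch of one classic test of randomness, the runs test on a binary sequence; the tests covered in these chapters go well beyond it.

```python
# Runs test: under the null hypothesis of randomness, the z-score below is
# approximately N(0, 1). Assumes both symbols occur in the sequence.
import math

def runs_test_z(bits):
    n1, n2 = bits.count(1), bits.count(0)
    n = n1 + n2
    runs = 1 + sum(bits[i] != bits[i - 1] for i in range(1, n))
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return (runs - mean) / math.sqrt(var)

print(runs_test_z([0, 1] * 50))       # strongly alternating: large |z|
print(runs_test_z([0] * 50 + [1] * 50))  # strongly clustered: large |z|
```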
Part 3 - New Foundations of Statistical Science
We set the foundations for a new type of statistical methodology fit for modern machine learning problems, based on generalized resampling. Applications are numerous, ranging from optimizing cross-validation to computing confidence intervals, without using classic statistical theory, p-values, or probability distributions. Yet we introduce a few new fundamental theorems, including one regarding the asymptotic properties of generic, model-free confidence intervals.
  • Modern Resampling Techniques for Machine Learning -- page 107
  • Model-free, Assumption-free Confidence Intervals -- page 121
  • The Distribution of the Range: A Beautiful Probability Theorem -- page 133
Part 4 - Case Studies, Business Applications
These chapters deal with real-life business applications. Chapter 18 is peculiar in the sense that it features a very original business application (in gaming), described in detail with all its components, based on the material from the previous three chapters. We then move to more traditional machine learning use cases. Emphasis is on providing sound business advice to data science managers and executives, by showing how data science can be successfully leveraged to solve problems. The presentation style is compact, focusing on strategy rather than technicalities.
  • Gaming Platform Rooted in Machine Learning and Deep Math -- page 136
  • Digital Media: Decay-adjusted Rankings -- page 148
  • Building a Website Taxonomy -- page 153
  • Predicting Home Values -- page 158
  • Growth Hacking -- page 161
  • Time Series and Growth Modeling -- page 169
  • Improving Facebook and Google Algorithms -- page 179
Part 5 - Additional Topics
Here we cover a large number of topics, including sample size problems, automated exploratory data analysis, extreme events, outliers, detecting the number of clusters, p-values, random walks, scale-invariant methods, feature selection, growth models, visualizations, density estimation, Markov chains, A/B testing, polynomial regression, strong correlation and causation, stochastic geometry, K nearest neighbors, and even the exact value of an intriguing integral computed using statistical science, just to name a few.
  • Solving Common Machine Learning Challenges -- page 187
  • Outlier-resistant Techniques, Cluster Simulation, Contour Plots -- page 214
  • Strong Correlation Metric -- page 225
  • Special Topics -- page 229
Appendix
  • Linear Algebra Revisited -- page 266
  • Stochastic Processes and Organized Chaos -- page 272
  • Machine Learning and Data Science Cheat Sheet  -- page 297

Thursday, June 6, 2019

Machine Learning and Data Science Cheat Sheet

Originally published in 2014 and viewed more than 200,000 times, this is the oldest data science cheat sheet - the mother of all the numerous cheat sheets that are so popular nowadays. I decided to update it in June 2019. While the first half, dealing with installing components on your laptop and learning UNIX, regular expressions, and file management, hasn't changed much, the second half, dealing with machine learning, was rewritten entirely from scratch. It is amazing how much things have changed in just five years!
Written for people who have never seen a computer in their life, it starts at the very beginning: buying a laptop! You can skip the first half and jump to sections 5 and 6 if you are already familiar with UNIX. This new cheat sheet will be included in my upcoming book Machine Learning: Foundations, Toolbox, and Recipes, to be published in September 2019 and available (for free) exclusively to Data Science Central members. This cheat sheet is 14 pages long.
Content
1. Hardware
2. Linux environment on Windows laptop
3. Basic UNIX commands
4. Scripting languages
5. Python, R, Hadoop, SQL, DataViz
6. Machine Learning
  • Algorithms
  • Getting started
  • Applications
  • Data sets and sample projects
This new cheat sheet is available here.

Tuesday, June 4, 2019

7 Simple Tricks to Handle Complex Machine Learning Issues

We propose simple solutions to important problems that all data scientists face almost every day. In short, a toolbox for the handyman, useful to busy professionals in any field.
1. Eliminating sample size effects. Many statistics, such as correlations or R-squared, depend on the sample size, making it difficult to compare values computed on two data sets of different sizes. Based on re-sampling techniques, this easy trick lets you compare apples with other apples, not with oranges. Read more here.
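A minimal sketch of the resampling idea behind this trick; the subsample size, repetition count, and use of the median are illustrative, and the exact recipe is in the linked article.

```python
# Re-sample both data sets down to a common size so that the two
# correlations become comparable.
import numpy as np

rng = np.random.default_rng(0)

def resampled_corr(x, y, size, n_rep=1000):
    """Median correlation over n_rep subsamples of the given size."""
    corrs = []
    for _ in range(n_rep):
        idx = rng.choice(len(x), size=size, replace=False)
        corrs.append(np.corrcoef(x[idx], y[idx])[0, 1])
    return np.median(corrs)

# Two data sets of very different sizes:
x1 = rng.normal(size=5000); y1 = x1 + rng.normal(size=5000)
x2 = rng.normal(size=200);  y2 = x2 + rng.normal(size=200)

common = min(len(x1), len(x2))
print(resampled_corr(x1, y1, common), resampled_corr(x2, y2, common))
```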
2. Sample size determination, and simple, model-free confidence intervals. We propose a generic methodology, also based on re-sampling techniques, to compute any confidence interval and to test hypotheses, without using any statistical theory. It is also easy to implement, even in Excel. Read more here.
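A minimal sketch of such a model-free confidence interval, with the median as the estimator; the subsample size and number of subsamples are illustrative.

```python
# Compute the estimator on many random subsamples, then read off
# percentiles of the resulting empirical distribution.
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(size=2000)   # skewed, non-normal data

estimates = [np.median(rng.choice(data, size=200, replace=False))
             for _ in range(2000)]
lo, hi = np.percentile(estimates, [2.5, 97.5])
print(f"95% CI for the median: [{lo:.3f}, {hi:.3f}]")
```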
3. Determining the number of clusters in unsupervised clustering. This modern version of the elbow rule also tells you how strong the global optimum is, and can help you identify local optima. It can be automated as well. Read more here.
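A minimal sketch of an elbow-style computation; the second-difference criterion shown here is a generic version, while the article's variant also measures the strength of the optimum.

```python
# Pick k where the drop in within-cluster variance slows down most
# (largest second difference of the inertia curve).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
centers = [(0, 0), (5, 0), (2.5, 4.33)]   # three well-separated blobs
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in centers])

inertia = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in range(1, 9)]
elbow = int(np.argmax(np.diff(inertia, 2))) + 2
print("estimated number of clusters:", elbow)   # 3
```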
4. Fixing issues in regression models when the assumptions are violated. If your data has serial correlation, unequal variances, or other similar problems, this simple trick removes the issue and allows you to perform more meaningful regressions, or to detect flaws in your data set. Read more here.
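The trick itself is behind the link; as a generic companion diagnostic (not the article's method), here is how one can detect the serial correlation it addresses, via the lag-1 autocorrelation of regression residuals.

```python
# Fit a simple regression, then check whether the residuals are
# serially correlated (a violation of classical assumptions).
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = np.linspace(0, 10, n)
noise = np.cumsum(rng.normal(size=n)) * 0.1   # serially correlated noise
y = 2.0 * x + 1.0 + noise

slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print(f"lag-1 residual autocorrelation: {lag1:.2f}")  # near 1 here, ~0 if OK
```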
5. Performing joins on poor-quality data. This 40-year-old trick allows you to perform a join when your data is infested with typos, multiple names representing the same entity, and other similar issues. In short, it performs a fuzzy join. Read more here.
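A toy fuzzy join, assuming the matching boils down to approximate string comparison; difflib and the 0.7 cutoff are illustrative choices, not necessarily the article's.

```python
# Join two tables on names despite typos, via approximate string matching.
import difflib

left  = {"Wal-Mart": 100, "Targett": 25, "Costco": 60}
right = {"Walmart": "retail", "Target": "retail", "Costco": "wholesale"}

for name, value in left.items():
    match = difflib.get_close_matches(name, list(right), n=1, cutoff=0.7)
    if match:
        print(name, "->", match[0], value, right[match[0]])
```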
6. Scale-invariant techniques. Sometimes, transforming your data, even changing the scale of one feature (say from meters to feet), has a dramatic impact on the results. Often, you want your conclusions to be scale-independent. This trick solves the problem. Read more here.
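One standard route to scale invariance, shown purely for illustration (the article's trick may differ): z-score each feature so that results no longer depend on units.

```python
# After standardization, the same feature expressed in meters or in feet
# becomes identical, so downstream results cannot depend on the unit.
import numpy as np

rng = np.random.default_rng(4)
height_m = rng.normal(1.7, 0.1, size=100)
height_ft = height_m * 3.281           # same feature, different unit

def zscore(v):
    return (v - v.mean()) / v.std()

print(np.allclose(zscore(height_m), zscore(height_ft)))  # True
```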
7. Blending data sets with incompatible data, adding consistency to your metrics. We are all too familiar with metrics that change over time, resulting in inconsistencies when comparing the past to the present, or when comparing different segments with incompatible measurements. This trick allows you to design systems where, once again, apples are compared to other apples, not to oranges. Read more here.
To not miss this type of content in the future, subscribe to our newsletter. For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on LinkedIn, or visit my old web page here.

Wednesday, May 29, 2019

Gentle Approach to Linear Algebra, with Machine Learning Applications

This simple introduction to matrix theory offers a refreshing perspective on the subject. Using a basic concept that leads to a simple formula for the power of a matrix, we see how it can be used to solve time series, Markov chain, linear regression, data reduction, principal component analysis (PCA), and other machine learning problems. These problems are usually solved with more advanced matrix calculus, including eigenvalues, diagonalization, generalized inverse matrices, and other types of matrix normalization. Our approach is more intuitive, and thus appealing to professionals who do not have a strong mathematical background, or who have forgotten what they learned in math textbooks. It will also appeal to physicists and engineers. Finally, it leads to simple algorithms, for instance for matrix inversion. The classical statistician or data scientist will find our approach somewhat intriguing.
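As a minimal illustration of the matrix-power idea (the transition matrix below is made up): high powers of a Markov chain's transition matrix converge to its stationary distribution, with no eigenvalue machinery needed.

```python
# Every row of a high power of a row-stochastic transition matrix
# converges to the stationary distribution of the Markov chain.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])           # row-stochastic transition matrix

print(np.linalg.matrix_power(P, 100))  # each row ~ (0.833, 0.167)

# Fast computation by repeated squaring:
Q = P
for _ in range(7):                   # Q = P^(2^7) = P^128
    Q = Q @ Q
print(Q[0])
```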
Content
1. Power of a matrix
2. Examples, Generalization, and Matrix Inversion
  • Example with a non-invertible matrix
  • Fast computations
3. Application to Machine Learning Problems
  • Markov chains
  • Time series
  • Linear regression

Tuesday, May 7, 2019

Confidence Intervals Without Pain

We propose a simple, model-free solution to compute any confidence interval, and to extrapolate these intervals beyond the observations available in your data set. In addition, we propose a mechanism to sharpen the confidence intervals, reducing their width by an order of magnitude. The methodology works with any estimator (mean, median, variance, quantile, correlation, and so on), even when the data set violates the classical requirements necessary to make traditional statistical techniques work. In particular, our method also applies to observations that are auto-correlated, non-identically distributed, non-normal, and even non-stationary.
No statistical knowledge is required to understand, implement, and test our algorithm, nor to interpret the results. Its robustness makes it suitable for black-box, automated machine learning technology. It will appeal to anyone dealing with data on a regular basis, such as data scientists, statisticians, software engineers, economists, quants, physicists, biologists, psychologists, system and business analysts, and industrial engineers. 
In particular, we provide a confidence interval (CI) for the width of confidence intervals, without using Bayesian statistics. The width is modeled as L = A / n^B, and we compute, using Excel alone, a 95% CI for B in the classic case where B = 1/2. We also exhibit an artificial data set where L = 1 / (log n)^Pi. Here n is the sample size.
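A quick numerical check of the width model, on simulated data (the article works in Excel): on a log-log scale, L = A / n^B is a straight line, so B can be estimated by simple regression.

```python
# Estimate B in L = A / n^B by regressing log(width) on log(n).
import numpy as np

rng = np.random.default_rng(5)
ns = np.array([100, 200, 400, 800, 1600, 3200])
widths = []
for n in ns:
    # Width of the central 95% range of the sample mean, at sample size n.
    means = [rng.normal(size=n).mean() for _ in range(500)]
    widths.append(np.percentile(means, 97.5) - np.percentile(means, 2.5))

slope, intercept = np.polyfit(np.log(ns), np.log(widths), 1)
print(f"estimated B: {-slope:.2f} (theory: 0.50)")
```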

Despite the apparent simplicity of our approach, we are dealing here with martingales. But you don't need to know what a martingale is to understand the concepts and use our methodology. 

Saturday, May 4, 2019

Re-sampling: Amazing Results and Applications

This crash course features a new fundamental statistics theorem -- even more important than the central limit theorem -- and a new set of statistical rules and recipes. We discuss concepts related to determining the optimum sample size, the optimum k in k-fold cross-validation, bootstrapping, new re-sampling techniques, simulations, tests of hypotheses, confidence intervals, and statistical inference using a unified, robust, simple approach with easy formulas, efficient algorithms and illustration on complex data.
Little statistical knowledge is required to understand and apply the methodology described here, yet it is more advanced, more general, and more applied than standard literature on the subject. The intended audience is beginners as well as professionals in any field faced with data challenges on a daily basis. This article presents statistical science in a different light, hopefully in a style more accessible, intuitive, and exciting than standard textbooks, and in a compact format yet covering a large chunk of the traditional statistical curriculum and beyond.
In particular, the concept of p-value is not explicitly included in this tutorial. Instead, following the new trend after the recent p-value debacle (addressed by the president of the American Statistical Association), it is replaced with a range of values computed on multiple sub-samples. 
Our algorithms are suitable for inclusion in black-box systems, batch processing, and automated data science. Our technology is data-driven and model-free. Finally, our approach to this problem highlights the contrast between the unified, bottom-up, computationally-driven perspective of data science, and traditional top-down statistical analysis, which consists of a collection of disparate results emphasizing theory.
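As a small taste of the cross-validation experiments, here is an illustrative comparison of how the choice of k affects k-fold cross-validation error estimates; the model, data, and range of k are made up.

```python
# k-fold cross-validation of a simple linear fit, for several values of k.
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 10, n)
y = 3 * x + rng.normal(scale=2, size=n)

def kfold_mse(k):
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    errors = []
    for f in folds:
        train = np.setdiff1d(idx, f)          # all points not in this fold
        slope, intercept = np.polyfit(x[train], y[train], 1)
        errors.append(np.mean((y[f] - (slope * x[f] + intercept)) ** 2))
    return np.mean(errors), np.std(errors)

for k in (2, 5, 10, 20):
    mean_mse, sd = kfold_mse(k)
    print(f"k={k:>2}: mean MSE {mean_mse:.2f}, spread across folds {sd:.2f}")
```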
Contents
1. Re-sampling and Statistical Inference
  • Main Result
  • Sampling with or without Replacement
  • Illustration
  • Optimum Sample Size 
  • Optimum K in K-fold Cross-Validation
  • Confidence Intervals, Tests of Hypotheses
2. Generic, All-purpose Algorithm
  • Re-sampling Algorithm with Source Code
  • Alternative Algorithm
  • Using a Good Random Number Generator
3. Applications
  • A Challenging Data Set
  • Results and Excel Spreadsheet
  • A New Fundamental Statistics Theorem
  • Some Statistical Magic
  • How does this work?
  • Does this contradict entropy principles?
4. Conclusions

Thursday, April 25, 2019

Some Fun with Gentle Chaos, the Golden Ratio, and Stochastic Number Theory

So much fascinating and deep material has been written about the number (1 + SQRT(5)) / 2 -- the golden ratio -- and its related sequence, the Fibonacci numbers, that it would take years to read all of it. This number has been studied for over 2,000 years, both for its applications (population growth, architecture) and for its mathematical properties. It is still a topic of active research.
Figure: Lag-1 auto-correlation in the digit distribution of good seeds, for b-processes.
I show here how I used the golden ratio for a new number guessing game (to generate chaos and randomness in ergodic time series), and I present new intriguing results, in particular (a small numerical sketch of b-process digits follows this list):
  • Proof that the rabbit constant is not normal in any base; this might be the first instance of a non-artificial mathematical constant for which the normality status is formally established.
  • Beatty sequences, pseudo-periodicity, and infinite-range auto-correlations for the digits of irrational numbers in the numeration system derived from perfect stochastic processes.
  • Properties of multivariate b-processes, including integer and non-integer bases.
  • Weird behavior of auto-correlations for the digits of normal numbers (good seeds) in the numeration system derived from stochastic b-processes.
  • A strange recursion that generates all the digits of the rabbit constant.
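Below is a small numerical sketch, assuming the b-process is the base-b shift map x -> b*x mod 1 with digit floor(b*x); it uses the golden ratio as a non-integer base and Pi as the seed.

```python
# Digits of a seed in the numeration system of base b (possibly
# non-integer), generated by the shift map x -> b*x mod 1.
import math

def digits(seed, b, n):
    """First n digits of the fractional part of seed, in base b."""
    x, out = seed % 1, []
    for _ in range(n):
        x *= b
        out.append(math.floor(x))
        x -= math.floor(x)
    return out

phi = (1 + math.sqrt(5)) / 2
print(digits(math.pi, 2, 20))    # binary digits of the fractional part of Pi
print(digits(math.pi, phi, 20))  # base-phi digits: never two consecutive 1s
```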
Content of this article
1. Some Definitions
2. Digit Distribution in b-processes
3. Strange Facts and Conjectures about the Rabbit Constant
4. Gaming Application
  • De-correlating Using Mapping and Thinning Techniques
  • Dissolving the Auto-correlation Structure Using Multivariate b-processes
5. Related Articles
Read the full article here.
