Wednesday, June 20, 2018

Simple Solution to Feature Selection Problems

We discuss a new approach for selecting features from a large set of features, in an unsupervised machine learning framework. In supervised learning such as linear regression or supervised clustering, it is possible to test the predictive power of a set of features (also called independent variables by statisticians, or predictors) using metrics such as goodness of fit with the response (the dependent variable), for instance the R-squared coefficient. This makes the process of feature selection rather easy.
Here this is not feasible. The context could be pure clustering, with no training sets available, for instance in a fraud detection problem. We are also dealing with discrete and continuous variables, possibly including dummy variables that represent categories, such as gender. We assume that no simple statistical model explains the data, so the framework here is model-free, data-driven. In this context, traditional methods are based on information theory metrics to determine which subset of features brings the largest amount of information.
A classic approach consists of identifying the most information-rich feature, then growing the set of selected features by adding new ones that maximize some criterion. There are many variants of this approach, for instance adding more than one feature at a time, or removing some features during the iterative selection. The search for an optimal solution to this combinatorial problem is not computationally feasible when the number of features is large, so an approximate solution (a local optimum) is usually acceptable, and accurate enough for business purposes.
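To make the idea concrete, here is a minimal sketch (not the exact algorithm from the article) of greedy forward selection: at each step, add the feature that maximizes the empirical joint entropy of the selected subset. The toy data and function names are hypothetical.

```python
import math
from collections import Counter

def joint_entropy(rows, feature_idx):
    """Empirical joint entropy (in bits) of the selected feature columns."""
    counts = Counter(tuple(row[i] for i in feature_idx) for row in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def greedy_select(rows, n_features, k):
    """Add one feature at a time, maximizing the joint entropy gain."""
    selected = []
    while len(selected) < k:
        best = max(
            (i for i in range(n_features) if i not in selected),
            key=lambda i: joint_entropy(rows, selected + [i]),
        )
        selected.append(best)
    return selected

# Toy data: feature 2 duplicates feature 0, so it adds no information
# and the greedy step prefers feature 1 instead.
rows = [(0, 1, 0), (1, 0, 1), (1, 1, 1), (0, 0, 0), (1, 0, 1)]
print(greedy_select(rows, n_features=3, k=2))  # → [0, 1]
```

The same skeleton accommodates other criteria (mutual information with a target, redundancy penalties) by swapping out the key function.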
Content of this article:
  • Review of popular methods
  • New, simple idea for feature selection
  • Testing on a dataset with known theoretical entropy (and conclusions)
Read the full article here.

Tuesday, June 12, 2018

Powerful, Hybrid Machine Learning Algorithm with Excel Implementation

In this article, we discuss a general machine learning technique to make predictions or score transactional data, applicable to very big, streaming data. This hybrid technique combines different algorithms to boost accuracy, outperforming each algorithm taken separately, yet it is simple enough to be reliably automated. It is illustrated in the context of predicting the performance of articles published in media outlets or blogs, and has been used by the author to build an AI (artificial intelligence) system to detect articles worth curating, as well as to automatically schedule tweets and other postings in social networks for maximum impact, with the goal of eventually fully automating digital publishing. This application is broad enough that the methodology can be applied to most NLP (natural language processing) contexts with large amounts of unstructured data. The results obtained in our particular case study are also very interesting.
The algorithmic framework described here applies to any data set, text or not, with quantitative variables, non-quantitative variables (gender, race), or a mix of both. It consists of several components; we discuss in detail those that are new and original. The other, non-original components are briefly mentioned, with references provided for further reading. No deep technical expertise and no mathematical knowledge are required to understand the concepts and methodology described here. The methodology, though state-of-the-art, is simple enough that it can even be implemented in Excel for small data sets (up to one million observations).
The technique presented here blends non-standard, robust versions of decision trees and regression. It has been successfully used in black-box ML implementations.
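As an illustration only (the article's actual blend is more elaborate and more robust), here is a toy hybrid in the same spirit: a single tree-like split on the predictor, followed by a separate linear regression fitted in each resulting node. The split threshold and data below are made up for the example.

```python
def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx

def hybrid_fit(points, threshold):
    """Tree step: split on x; regression step: fit a line in each leaf."""
    left = [p for p in points if p[0] < threshold]
    right = [p for p in points if p[0] >= threshold]
    return fit_line(left), fit_line(right)

def hybrid_predict(model, threshold, x):
    slope, intercept = model[0] if x < threshold else model[1]
    return slope * x + intercept

data = [(1, 2), (2, 4), (3, 6), (10, 5), (11, 4), (12, 3)]
model = hybrid_fit(data, threshold=5)
print(round(hybrid_predict(model, 5, 2), 2))  # → 4.0 (left leaf fits y = 2x)
```

A single global regression would fit this data poorly; letting the tree step isolate regimes first is the point of the hybrid.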
Read the full article here.
For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on LinkedIn.

Saturday, June 9, 2018

The impact of a change of scale, for instance using years instead of days as the unit of measurement for one variable in a clustering problem, can be dramatic. It can result in a totally different cluster structure. Frequently, this is not a desirable property, yet it is rarely mentioned in textbooks. I think all clustering software should state in its user guide that the algorithm is sensitive to scale.
We illustrate the problem here, and propose a scale-invariant methodology for clustering. It applies to all clustering algorithms, as it consists of normalizing the observations before classifying the data points. It is not a magic solution, and it has its own drawbacks, as we will see. Linear regression, by contrast, has no such problem; this is one of the few strengths of that technique.
Scale-invariant clustering
The problem may not be noticeable at first glance, especially in Excel, as charts are by default always re-scaled in spreadsheets (or when using charts in R or Python, for that matter). For simplicity, we consider here two clusters, see figure below.
Original data (left), X-axis re-scaled (middle), scale-invariant clustering (right)
The middle chart is obtained after re-scaling the X-axis, and as a result, the two-cluster structure is lost. Or maybe it is the chart on the left-hand side that is wrong. Or both. Astute journalists and even researchers actually exploit this issue to present misleading, usually politically motivated, analyses. Students working on a clustering problem might not even be aware of the issue.
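A minimal sketch of the normalization step that makes clustering scale-invariant: standardize each coordinate (z-scores) before measuring distances. Changing the unit of one axis (days to years, say) then leaves the normalized coordinates, and hence any distance-based clustering, unchanged. The data below is hypothetical.

```python
import statistics

def normalize(points):
    """Rescale each coordinate to zero mean and unit variance, making
    any subsequent distance-based clustering invariant to units."""
    cols = list(zip(*points))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(p, means, sds))
            for p in points]

points = [(1.0, 10.0), (2.0, 11.0), (50.0, 12.0), (51.0, 13.0)]
rescaled = [(x / 365.0, y) for x, y in points]  # days -> years on the X axis

# Same normalized coordinates either way, so clustering the normalized
# data yields the same clusters regardless of the original units.
a, b = normalize(points), normalize(rescaled)
print(all(abs(u - v) < 1e-9 for p, q in zip(a, b) for u, v in zip(p, q)))
# → True
```

The drawback alluded to above: normalization can also wash out genuinely meaningful differences in spread between variables.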
Read the full article here.

Saturday, June 2, 2018

Free Book: Applied Stochastic Processes

Full title: Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems. Published June 2, 2018. Author: Vincent Granville, PhD. (104 pages, 16 chapters.)
This book is intended for professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. In 100 pages, it covers many new topics, offering a fresh perspective on the subject. It is accessible to practitioners with a two-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications (Blockchain, quantum algorithms, HPC, random number generation, cryptography, Fintech, web crawling, statistical testing) with numerous illustrations, is aimed at practitioners, researchers and executives in various quantitative fields.
New ideas, advanced topics, and state-of-the-art research are discussed in simple English, without using jargon or arcane theory. It unifies topics that are usually part of different fields (data science, operations research, dynamical systems, computer science, number theory, probability) broadening the knowledge and interest of the reader in ways that are not found in any other book. This short book contains a large amount of condensed material that would typically be covered in 500 pages in traditional publications. Thanks to cross-references and redundancy, the chapters can be read independently, in random order.
This book is available for Data Science Central members exclusively. The text in blue consists of clickable links to provide the reader with additional references. Source code and Excel spreadsheets summarizing computations are also accessible as hyperlinks, for easy copy-and-paste or replication purposes. The most recent version of this book is available from this link, accessible to DSC members only.
About the author
Vincent Granville is a start-up entrepreneur, patent owner, author, investor, pioneering data scientist with 30 years of corporate experience in companies small and large (eBay, Microsoft, NBC, Wells Fargo, Visa, CNET) and a former VC-funded executive, with a strong academic and research background including Cambridge University.
Click here to get the book. For Data Science Central members only.
Content
The book covers the following topics: 
1. Introduction to Stochastic Processes
We introduce these processes, used routinely by Wall Street quants, with a simple approach consisting of re-scaling random walks to make them time-continuous, with a finite variance, based on the central limit theorem.
  • Construction of Time-Continuous Stochastic Processes
  • From Random Walks to Brownian Motion
  • Stationarity, Ergodicity, Fractal Behavior
  • Memory-less or Markov Property
  • Non-Brownian Process
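The construction in this chapter can be sketched in a few lines: rescale a ±1 random walk by 1/sqrt(n) so that, by the central limit theorem, the end point has variance 1 no matter how many steps are taken. This simulation is my own illustration, not the book's code.

```python
import random

def rescaled_end(n, rng):
    """End point of a +/-1 random walk of n steps, rescaled by 1/sqrt(n)
    so that its variance is exactly 1 for every n (central limit theorem)."""
    return sum(rng.choice((-1, 1)) for _ in range(n)) / n ** 0.5

rng = random.Random(0)
ends = [rescaled_end(400, rng) for _ in range(2000)]

# Empirical variance of the rescaled end point: close to 1, as predicted.
var = sum(e * e for e in ends) / len(ends)
print(round(var, 2))
```

Letting the number of steps grow while rescaling time and space this way is exactly how the time-continuous limit (Brownian motion) is obtained.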
2. Integration, Differentiation, Moving Averages
We introduce more advanced concepts about stochastic processes. Yet we make these concepts easy to understand even to the non-expert. This is a follow-up to Chapter 1.
  • Integrated, Moving Average and Differential Process
  • Proper Re-scaling and Variance Computation
  • Application to Number Theory Problem
3. Self-Correcting Random Walks
We investigate here a breed of stochastic processes that are different from the Brownian motion, yet are better models in many contexts, including Fintech. 
  • Controlled or Constrained Random Walks
  • Link to Mixture Distributions and Clustering
  • First Glimpse of Stochastic Integral Equations
  • Link to Wiener Processes, Application to Fintech
  • Potential Areas for Research
  • Non-stochastic Case
4. Stochastic Processes and Tests of Randomness
In this transition chapter, we introduce a different type of stochastic process, with number theory and cryptography applications, analyzing statistical properties of numeration systems along the way -- a recurrent theme in the next chapters, offering many research opportunities and applications. While we are dealing with deterministic sequences here, they behave very much like stochastic processes, and are treated as such. Statistical testing is central to this chapter, introducing tests that will be also used in the last chapters.
  • Gap Distribution in Pseudo-Random Digits
  • Statistical Testing and Geometric Distribution
  • Algorithm to Compute Gaps
  • Another Application to Number Theory Problem
  • Counter-Example: Failing the Gap Test
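A quick illustration of the gap test listed above, run on a pseudo-random sequence rather than an actual numeration system: for truly random decimal digits, the gap between successive occurrences of a given digit follows a geometric distribution with parameter p = 1/10, hence a mean gap of 10.

```python
import random

def gaps(digits, d):
    """Gaps between successive occurrences of digit d in the sequence."""
    positions = [i for i, x in enumerate(digits) if x == d]
    return [b - a for a, b in zip(positions, positions[1:])]

rng = random.Random(42)
digits = [rng.randrange(10) for _ in range(200_000)]
g = gaps(digits, 7)

# For random digits, P(gap > k) = (9/10)^k: geometric with p = 1/10,
# so the mean gap should be 1/p = 10.
mean_gap = sum(g) / len(g)
print(round(mean_gap))  # → 10
```

Comparing the full empirical gap distribution (not just the mean) against the geometric law is the stronger test used in the chapter; sequences that fail it are covered in the counter-example section.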
5. Hierarchical Processes
We start discussing random number generation, and numerical and computational issues in simulations, applied to an original type of stochastic process. This will become a recurring theme in the next chapters, as it applies to many other processes.
  • Graph Theory and Network Processes
  • The Six Degrees of Separation Problem
  • Programming Languages Failing to Produce Randomness in Simulations
  • How to Identify and Fix the Previous Issue
  • Application to Web Crawling
6. Introduction to Chaotic Systems
While typically studied in the context of dynamical systems, the logistic map can be viewed as a stochastic process, with an equilibrium distribution and probabilistic properties, just like numeration systems (next chapters) and processes introduced in the first four chapters.
  • Logistic Map and Fractals
  • Simulation: Flaws in Popular Random Number Generators
  • Quantum Algorithms
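The equilibrium distribution mentioned above can be checked empirically. For the logistic map x → 4x(1 − x), the invariant density is the arcsine law, with CDF F(x) = (2/π) arcsin(√x), so about one third of the iterates should fall below 0.25. The sketch below is my own; note that double-precision round-off (a theme of chapter 8) makes long orbits numerically inexact, though the statistics typically still match in practice.

```python
def logistic_orbit(x0, n):
    """Iterate the logistic map x -> 4x(1 - x), returning n iterates."""
    xs, x = [], x0
    for _ in range(n):
        x = 4.0 * x * (1.0 - x)
        xs.append(x)
    return xs

# Arcsine equilibrium law: F(0.25) = (2/pi) * arcsin(0.5) = 1/3,
# so roughly a third of the orbit should lie below 0.25.
orbit = logistic_orbit(0.2, 100_000)
frac = sum(1 for x in orbit if x < 0.25) / len(orbit)
print(round(frac, 2))  # close to 0.33
```

The U-shaped arcsine density means iterates pile up near 0 and 1, which is why so much of the orbit sits in the narrow interval [0, 0.25].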
7. Chaos, Logistic Map and Related Processes
We study processes related to the logistic map, including a special logistic map discussed here for the first time, with a simple equilibrium distribution. This chapter offers a transition between chapter 6 and the next chapters on numeration systems (the logistic map being one of them).
  • General Framework
  • Equilibrium Distribution and Stochastic Integral Equation
  • Examples of Chaotic Sequences
  • Discrete, Continuous Sequences and Generalizations
  • Special Logistic Map
  • Auto-regressive Time Series
  • Literature
  • Source Code with Big Number Library
  • Solving the Stochastic Integral Equation: Example
8. Numerical and Computational Issues
These issues have been mentioned in chapter 7, and also appear in chapters 9, 10 and 11. Here we take a deeper dive and offer solutions, using high precision computing with BigNumber libraries. 
  • Precision Issues when Simulating, Modeling, and Analyzing Chaotic Processes
  • When Precision Matters, and when it does not
  • High Precision Computing (HPC)
  • Benchmarking HPC Solutions
  • How to Assess the Accuracy of your Simulation Tool
9. Digits of Pi, Randomness, and Stochastic Processes
Deep mathematical and data science research (including a result about the randomness of π, which is just a particular case) is presented here, without using arcane terminology or complicated equations. Numeration systems discussed here are a particular case of deterministic sequences behaving just like the stochastic processes investigated earlier, such as the logistic map.
  • Application: Random Number Generation
  • Chaotic Sequences Representing Numbers
  • Data Science and Mathematical Engineering
  • Numbers in Base 2, 10, 3/2 or π
  • Nested Square Roots and Logistic Map
  • About the Randomness of the Digits of π
  • The Digits of π are Randomly Distributed in the Logistic Map System
  • Paths to Proving Randomness in the Decimal System
  • Connection with Brownian Motions
  • Randomness and the Bad Seeds Paradox
  • Application to Cryptography, Financial Markets, Blockchain, and HPC
  • Digits of π in Base π
10. Numeration Systems in One Picture
Here you will find a summary of much of the material previously covered on chaotic systems, in the context of numeration systems (in particular, chapters 7 and 9).
  • Summary Table: Equilibrium Distribution, Properties
  • Reverse-engineering Number Representation Systems
  • Application to Cryptography
11. Numeration Systems: More Statistical Tests and Applications
In addition to featuring new research results and building on the previous chapters, the topics discussed here offer a great sandbox for data scientists and mathematicians. 
  • Components of Number Representation Systems
  • General Properties of these Systems
  • Examples of Number Representation Systems
  • Examples of Patterns in Digits Distribution
  • Defects found in the Logistic Map System
  • Test of Uniformity
  • New Numeration System with no Bad Seed
  • Holes, Autocorrelations, and Entropy (Information Theory)
  • Towards a more General, Better, Hybrid System
  • Faulty Digits, Ergodicity, and High Precision Computing
  • Finding the Equilibrium Distribution with the Percentile Test
  • Central Limit Theorem, Random Walks, Brownian Motions, Stock Market Modeling
  • Data Set and Excel Computations
12. The Central Limit Theorem Revisited
The central limit theorem explains the convergence of discrete stochastic processes to Brownian motions, and has been cited a few times in this book. Here we also explore a version that applies to deterministic sequences. Such sequences are treated as stochastic processes in this book.
  • A Special Case of the Central Limit Theorem
  • Simulations, Testing, and Conclusions
  • Generalizations
  • Source Code
13. How to Detect if Numbers are Random or Not
We explore here some deterministic sequences of numbers, behaving like stochastic processes or chaotic systems, together with another interesting application of the central limit theorem.
  • Central Limit Theorem for Non-Random Variables
  • Testing Randomness: Max Gap, Auto-Correlations and More
  • Potential Research Areas
  • Generalization to Higher Dimensions
14. Arrival Time of Extreme Events in Time Series
Time series, as discussed in the first chapters, are also stochastic processes. Here we discuss a topic rarely investigated in the literature: the arrival times, as opposed to the extreme values (a classic topic), associated with extreme events in time series.
  • Simulations
  • Theoretical Distribution of Records over Time
15. Miscellaneous Topics
We investigate topics related to time series as well as other popular stochastic processes such as spatial processes.
  • How and Why: Decorrelate Time Series
  • A Weird Stochastic-Like, Chaotic Sequence
  • Stochastic Geometry, Spatial Processes, Random Circles: Coverage Problem
  • Additional Reading (Including Twin Points in Point Processes)
16. Exercises

Friday, May 25, 2018

Mathematical Olympiads for Undergrad Students

Mathematical Olympiads are popular among high school students. However, there is nothing similar for college students, except maybe the IMC. Even the IMC is not popular: it focuses mostly on the same kind of problems as high school Olympiads, and you cannot participate if you are over 23 years old. In addition, it is organized by country, as opposed to globally, thus favoring countries with large populations. Topics such as probability are never considered.
This is an opportunity to create Mathematical Olympiads for college students, with no age or country restrictions. It could be organized online, offering interesting, varied, and challenging problems, allowing participants to read literature about the problems, and have a few weeks to submit a solution. In short, something like Kaggle competitions, except that Kaggle focuses exclusively on machine learning, coding, and data processing. Not sure where the funding could come from, but if I decided to organize this kind of competition, I would be able to fund it myself. 
Below are examples of problems that I would propose. They do not require knowledge beyond advanced undergraduate level in math, statistics, or probability. They are more difficult, and more original, than typical exam questions. Participants are encouraged to use tools such as WolframAlpha to automatically compute integrals or solve systems of equations involved in these problems.
Is anyone interested in this new initiative? I could see this helping students not enrolled in a top university, though the majority of winners would probably come from a top school.
To read the suggested problems with solutions, visit this webpage.

The First Things you Should Learn as a Data Scientist - Not what you Think

The list below is a (non-comprehensive) selection of what I believe should be taught first in data science classes, based on 30 years of business experience. This is a follow-up to my article Why logistic regression should be taught last.
I am not sure whether the topics below are even discussed in data camps or college classes. One of the issues is the way teachers are recruited. The recruitment process favors individuals famous for their academic achievements, or for their "star" status, and they tend to teach the same thing over and over, for decades. Successful professionals have little interest in becoming teachers (as the saying goes: if you can't do it, you write about it; if you can't write about it, you teach it).
It does not have to be that way. Plenty of qualified professionals, even if they are not stars, would be perfect teachers and are not necessarily motivated by money. They come with tremendous experience gained in the trenches, and could be fantastic teachers, helping students deal with real data. And they do not need to be data scientists; many engineers are entirely capable (and qualified) of providing great data science training.
This article has three parts:
  • Topics that should be taught very early on in a data science curriculum
  • Topics taught in a traditional curriculum
  • Topics that should also be included in a data science curriculum
Read the full article here

Sunday, May 13, 2018

Selection of Great Data Science Articles still Worth Reading

These articles are between 3 and 5 years old, but are still valuable today. The methodology used in these articles is modern, and still state-of-the-art today. Some discuss immense data sets still available to the public, which led to the design of new machine learning techniques to handle them.
I am in the process of organizing these articles (written by myself) to eventually self-publish data science tutorials, in a few separate booklets, that are easy to understand for the layman with one year of data camp or college education in data science. The material will eventually be accessible to Data Science Central members, but not published in a traditional book. 
My writing style has evolved over time: I moved away from writing academic papers long ago, and most recently I share advanced knowledge, sometimes even ground-breaking material such as this one, in a way that is accessible to beginners. Most of what I write today is not taught in data camps or college textbooks. It provides an off-the-beaten-path introduction and expert advice in data science, in simple English, and even features advanced topics such as stochastic integral equations (Wall Street's holy grail) or spatial random processes, yet remains accessible to professionals familiar with data sets but with little mathematical training. In short, this is a great next step after attending a standard statistics, machine learning, or data science curriculum.
Typically, the applications discussed are exciting, and the writing style is designed to make the reader willing to read more, as opposed to the dry writing style that plagues our profession. These articles cover topics such as quantum algorithms, high precision computing, Fintech, number theory, fake news / fake profile / fake reviews detection, cryptography, designing a better search engine, attribution modeling, cataloguing / taxonomy algorithms (NLP), clustering massive data sets, outliers handling, how to differentiate between correlation and causation, how to set up a business to sell data, and much more. 
To access this selection of 150 older articles, click here
