Wednesday, February 13, 2019

A Plethora of Original, Non-standard Statistical Tests

Many of the following statistical tests are rarely discussed in textbooks or in college classes, much less in data camps. Yet they help answer a lot of different and interesting questions. I used most of them without even computing the underlying distribution under the null hypothesis, instead using simulations to check whether my assumptions were plausible. In short, my approach to statistical testing is model-free and data-driven. Some of these tests are easy to implement, even in Excel. Some are illustrated here, with examples that do not require statistical knowledge to understand or implement.
This material should appeal to managers, executives, industrial engineers, software engineers, operations research professionals, economists, and to anyone dealing with data, such as biometricians, analytical chemists, astronomers, epidemiologists, journalists, or physicians. Statisticians with a different perspective are invited to discuss my methodology and the tests described here, in the comment section at the bottom of this article. In my case, I used these tests mostly in the context of experimental mathematics, a branch of data science that few people talk about. In that context, the theoretical answer to a statistical test is sometimes known, providing a great benchmark to assess the power of these tests and to determine the minimum sample size needed for them to be valid.
I provide here a general overview, as well as my simple approach to statistical testing, accessible to professionals with little or no formal statistical training. Detailed applications of these tests can be found in my recent book and in this article; precise references to both documents are provided as needed.
Figure: examples of traditional tests
1. General Methodology
Despite my strong background in statistical science, over the years I moved away from relying too much on traditional statistical tests and statistical inference. I am not the only one: these tests have been abused and misused; see for instance this article on p-hacking. Instead, I favored a methodology of my own, mostly empirical, based on simulations, data-driven rather than model-driven. It is essentially a non-parametric approach. It has the advantage of being far easier to use, implement, understand, and interpret, especially for the non-initiated. It was initially designed to be integrated in black-box, automated decision systems. Here I share some of these tests; many can be implemented easily in Excel.
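To illustrate the idea, here is a minimal sketch in Python of such a simulation-based test. The statistic (the largest gap between sorted values, under a uniformity null hypothesis) and the sample sizes are illustrative choices of mine, not one of the specific tests discussed in this article; the point is that the null distribution of the statistic is obtained by simulation, with no theoretical derivation required.

import numpy as np

def test_statistic(x):
    # Illustrative statistic: the largest gap between sorted values.
    # Under uniformity on [0, 1], unusually large gaps are suspicious.
    s = np.sort(x)
    gaps = np.diff(np.concatenate(([0.0], s, [1.0])))
    return gaps.max()

def simulated_p_value(observed, n_sim=10_000, seed=None):
    # Estimate the p-value by simulating the statistic under the null
    # hypothesis (i.i.d. uniform on [0, 1]) instead of deriving its
    # exact distribution.
    rng = np.random.default_rng(seed)
    stat_obs = test_statistic(observed)
    n = len(observed)
    null_stats = np.array([test_statistic(rng.random(n)) for _ in range(n_sim)])
    # One-sided p-value: fraction of simulated statistics at least as extreme.
    return (null_stats >= stat_obs).mean()

# Example: data with a deliberate hole in (0.4, 0.6) should be flagged.
rng = np.random.default_rng(42)
data = rng.random(200)
data = data[(data < 0.4) | (data > 0.6)]
print(simulated_p_value(data, seed=123))

The same logic applies to any statistic: replace test_statistic with the quantity of interest and simulate under whatever null model is being tested. Repeating the procedure on data simulated under an alternative is also how the power and minimum sample size of such a test can be assessed.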

Monday, December 31, 2018

Announcement: Winner of the Data Science Central Competition

Back in 2017, we posted a problem related to stochastic processes and controlled random walks, offering a $2,000 award for a sound solution; see here for full details. The problem, which had a FinTech flavor, was only solved recently (December 2018) by Victor Zurkowski.
About the problem:
Let's start with X(1) = 0, and define X(k) recursively as follows, for k > 1:
and let's define U(k), Z(k), and Z as follows:
where the V(k)'s are independent uniform deviates on [0, 1].
So there are two non-negative parameters in this problem, a and b, and U(k) always lies between 0 and 1. When b = 1, the U(k)'s are just standard uniform deviates, and if b = 0, then U(k) = 1. The case a = b = 0 is degenerate and should be ignored. The case a > 0 and b = 0 is of special interest: it is a number theory problem in itself, related to this problem when a = 1. Also, just like in random walks or Markov chains, the X(k)'s are not independent; indeed, they are highly autocorrelated.
Prove that if a < 1, then X(k) converges to 0 as k increases. Under the same condition, prove that the limiting distribution Z
  • always exists (note: if a > 1, X(k) may not converge to zero, causing a drift and asymmetry),
  • always takes values between -1 and +1, with min(Z) = -1 and max(Z) = +1,
  • is symmetric, with mean and median equal to 0,
  • and does not depend on a, but only on b.
For instance, for b = 1, a = 0 yields the same triangular distribution for Z as any a > 0.
Main question: In general, what is the limiting distribution of Z? Using empirical data science techniques such as model fitting, simulations, and goodness-of-fit tests, I guessed that the solution (which involved solving a stochastic integral equation) was, with z in [-1, 1]:
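Independently of the exact form of the conjectured density, the goodness-of-fit step of this empirical approach can be sketched in a few lines of Python. As a stand-in for simulated values of Z, the example below uses the difference of two independent uniform deviates, which follows the symmetric triangular law on [-1, 1] (the distribution mentioned above for Z when b = 1), and tests it against that candidate distribution with a Kolmogorov-Smirnov test. The sample and candidate are illustrative; to probe other values of b, one would replace the sample with values of Z simulated from the actual recursion.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-in sample for simulated values of Z: the difference of two
# independent uniform deviates on [0, 1] follows the symmetric
# triangular law on [-1, 1], the distribution reported for Z when b = 1.
z_sample = rng.random(100_000) - rng.random(100_000)

# Candidate limiting distribution: triangular on [-1, 1], peak at 0.
candidate = stats.triang(c=0.5, loc=-1.0, scale=2.0)

# Kolmogorov-Smirnov goodness-of-fit test against the candidate CDF.
ks_stat, p_value = stats.kstest(z_sample, candidate.cdf)
print(f"KS statistic = {ks_stat:.4f}, p-value = {p_value:.3f}")

# A large p-value means the sample is consistent with the candidate.
# To test the conjecture for other values of b, replace z_sample with
# values of Z simulated from the recursion defining X(k).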
About the author and the solution:
In a 27-page paper focusing on convergence issues, Victor not only confirmed that the above density function is a solution to this problem, but also showed that the solution is unique. One detail still needs to be worked out: whether or not the scaled Z visits the neighborhood of every point in [-1, 1] infinitely often. Victor believes that the answer is positive. You can read his solution here, and we hope it will result in a publication in a scientific journal.
Victor Zurkowski, PhD, is a predictive modeling, machine learning, and optimization expert with 20+ years of experience, including deep expertise in developing pricing models and optimization engines across industries such as retail and financial services. He has published academic papers on numerous topics in mathematics and statistics, and is currently VP of Data Science at Polymatiks. Victor holds a Ph.D. in Mathematics from the University of Minnesota and an M.Sc. in Statistics from the University of Toronto.

Thursday, December 27, 2018

Why You Should be a Data Science Generalist - and How to Become One

The advice commonly given to data scientists today is not to become a generalist; you can read recent articles on this topic, for instance here. In this blog, I explain why I believe it should be the opposite. I wrote about this here not long ago, and I provide additional arguments in this article as to why it helps to be a generalist.
Of course, it is difficult, and probably impossible, to become a data science generalist just after graduating. It takes years to acquire all the skills, though you don't need to master all of them. It might be easier for a physicist, engineer, or biostatistician learning data science after years of corporate experience than for a data scientist with no business experience. Possibly the easiest way to become one is to work for start-ups or small companies, wearing many hats, as you will probably be the only data scientist in your company, and you will have to change jobs more frequently than if you worked for a big company. By contrast, in a big company you are expected to work in a very specialized area, though it does not hurt to be a generalist, as I will illustrate shortly. Being a specialized data scientist could put you on a very predictable path that limits your career growth and flexibility, especially if you want to create your own company down the line. Let's start by explaining what a data science generalist is.
The data science generalist
The generalist has experience working in different roles and different environments, for instance, over a period of 15 years, having worked as a
  • Business analyst or BI professional, communicating insights to decision makers, mastering tools such as Tableau, SQL and Excel; or maybe being the decision maker herself
  • Statistician / data analyst with expertise in predictive modeling
  • Expert in algorithm design and optimization
  • Researcher in an academic-like setting, or experience in testing / prototyping new data science systems and proofs of concept (POC)
  • Builder / architect: designing APIs, dashboards, and databases, and deploying and maintaining some modest production systems yourself
  • Programmer (statistical or scientific programmer with exposure to high performance computing and parallel architectures - you might even have designed your own software)
  • Consultant or adviser, working directly with clients
  • Manager or director, rather than individual contributor
  • Professional with roles in various industries (IT, media, Internet, finance, health care, smart cities) in both big and small companies, in various domains ranging from fraud detection to optimizing sales or marketing, with proven, measurable accomplishments
In short, the generalist has been involved, at one time or another, in all phases of the data science project lifecycle.
The generalist might not command a higher salary, but has more flexibility career-wise. Even in a big company, when downsizing occurs, it is easier for the generalist to make a lateral move (get transferred to a different department), than it is for the "one-trick pony". 
Timing is important too. Becoming a generalist at age 50 (as opposed to age 45) might not help, as getting hired becomes more difficult once you are past 45. Still, even at 50 or older, it opens up some possibilities, for instance starting your own business. And if you can prove that you have consistently broadened your skills throughout your career, as generalists do by definition, it will be easier to land a job, especially if your salary expectations are reasonable and your health is not an issue for your future employer.
To read the full article, click here

Sunday, November 18, 2018

New Books and Resources for DSC Members

We are in the process of writing and adding new material (compact eBooks) exclusively available to our members, written in simple English by world-leading experts in AI, data science, and machine learning. In the upcoming months, the following will be added:
  • The Machine Learning Coding Book
  • Off-the-beaten-path Statistics and Machine Learning Techniques 
  • Encyclopedia of Statistical Science
  • Original Math, Stat and Probability Problems - with Solutions
  • Number Theory for Data Scientists
  • Randomness, Pattern Recognition, Signal Processing
We invite you to sign up here so you don't miss these free books. Previous material (also for members only) can be found here.
Currently, the following content is available:
1. Book: Enterprise AI - An Application Perspective 
Enterprise AI: An Applications Perspective takes a use-case-driven approach to understanding the deployment of AI in the enterprise. Designed for strategists and developers, the book provides a practical and straightforward roadmap based on application use cases for AI in enterprises. The authors (Ajit Jaokar and Cheuk Ting Ho) are data scientists and AI researchers who have deployed AI applications in enterprise domains. The book is used as a reference for Ajit and Cheuk's new course on Implementing Enterprise AI.
The table of contents is available here.
2. Book: Applied Stochastic Processes
Full title: Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems. Published June 2, 2018. Author: Vincent Granville, PhD. (104 pages, 16 chapters.)
This book is intended for professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. In 100 pages, it covers many new topics, offering a fresh perspective on the subject. It is accessible to practitioners with a two-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications (Blockchain, quantum algorithms, HPC, random number generation, cryptography, Fintech, web crawling, statistical testing) with numerous illustrations, is aimed at practitioners, researchers, and executives in various quantitative fields.
New ideas, advanced topics, and state-of-the-art research are discussed in simple English, without using jargon or arcane theory. It unifies topics that are usually part of different fields (data science, operations research, dynamical systems, computer science, number theory, probability) broadening the knowledge and interest of the reader in ways that are not found in any other book. This short book contains a large amount of condensed material that would typically be covered in 500 pages in traditional publications. Thanks to cross-references and redundancy, the chapters can be read independently, in random order.
The table of contents is available here.

Wednesday, October 31, 2018

New Book: Enterprise AI - An Applications Perspective

Now published. Enterprise AI: An Applications Perspective takes a use-case-driven approach to understanding the deployment of AI in the enterprise. Designed for strategists and developers, the book provides a practical and straightforward roadmap based on application use cases for AI in enterprises. The authors (Ajit Jaokar and Cheuk Ting Ho) are data scientists and AI researchers who have deployed AI applications in enterprise domains. The book is used as a reference for Ajit and Cheuk's new course on Implementing Enterprise AI. Download this eBook (PDF).

Contents
Introduction 
  • Machine Learning, Deep Learning and AI 
  • The Data Science Process 
  • Categories of Machine Learning algorithms 
  • How to learn rules from Data? 
  • An introduction to Deep Learning 
  • What problem does Deep Learning address? 
  • How Deep Learning Drives Artificial Intelligence 
Deep Learning and neural networks
  • Perceptrons – an artificial neuron 
  • MLP - How do Neural networks make a prediction? 
  • Spatial considerations - Convolutional Neural Networks 
  • Temporal considerations - RNN/LSTM 
  • The significance of Deep Learning 
  • Deep Learning provides better representation for understanding the world 
  • Deep Learning as a Universal Function Approximator
What functionality does AI enable for the Enterprise? 
  • Technical capabilities of AI
  • Functionality enabled by AI in the Enterprise value chain 
Enterprise AI applications 
  • Creating a business case for Enterprise AI 
  • Four Quadrants of the Enterprise AI business case 
  • Types of AI problems 
Enterprise AI – Deployment considerations 
  • A methodology to deploy AI applications in the Enterprise 
  • DevOps and the CI/CD philosophy
Conclusions
This book is available for free to DSC members exclusively, here.

Wednesday, October 24, 2018

29 Statistical Concepts Explained in Simple English

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, decision trees, ensembles, correlation, Python, R, Tensorflow, SVM, data reduction, feature selection, experimental design, cross-validation, model fitting, and many more. To keep receiving these articles, sign up on DSC.
29 Statistical Concepts Explained in Simple English
