Sunday, November 18, 2018

New Books and Resources for DSC Members

We are in the process of writing and adding new material (compact eBooks) exclusively available to our members, written in simple English by world-leading experts in AI, data science, and machine learning. In the upcoming months, the following titles will be added:
  • The Machine Learning Coding Book
  • Off-the-beaten-path Statistics and Machine Learning Techniques 
  • Encyclopedia of Statistical Science
  • Original Math, Stat and Probability Problems - with Solutions
  • Number Theory for Data Scientists
  • Randomness, Pattern Recognition, Signal Processing
We invite you to sign up here so you don't miss these free books. Previous material (also for members only) can be found here.
Currently, the following content is available:
1. Book: Enterprise AI - An Applications Perspective
Enterprise AI: An Applications Perspective takes a use-case-driven approach to understanding the deployment of AI in the Enterprise. Designed for strategists and developers, the book provides a practical and straightforward roadmap, based on application use cases, for AI in Enterprises. The authors (Ajit Jaokar and Cheuk Ting Ho) are data scientists and AI researchers who have deployed AI applications in Enterprise domains. The book is used as a reference for Ajit and Cheuk's new course on Implementing Enterprise AI.
The table of contents is available here.
2. Book: Applied Stochastic Processes
Full title: Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems. Published June 2, 2018. Author: Vincent Granville, PhD. (104 pages, 16 chapters).
This book is intended for professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. In 100 pages, it covers many new topics, offering a fresh perspective on the subject. It is accessible to practitioners with a two-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications (blockchain, quantum algorithms, HPC, random number generation, cryptography, Fintech, web crawling, statistical testing) with numerous illustrations, is aimed at practitioners, researchers, and executives in various quantitative fields.
New ideas, advanced topics, and state-of-the-art research are discussed in simple English, without jargon or arcane theory. The book unifies topics that are usually part of different fields (data science, operations research, dynamical systems, computer science, number theory, probability), broadening the knowledge and interest of the reader in ways not found in any other book. This short book contains a large amount of condensed material that would typically be covered in 500 pages in traditional publications. Thanks to cross-references and some redundancy, the chapters can be read independently, in random order.
The table of contents is available here.
DSC Resources

Wednesday, October 31, 2018

New Book: Enterprise AI - An Applications Perspective

Now published. Enterprise AI: An Applications Perspective takes a use-case-driven approach to understanding the deployment of AI in the Enterprise. Designed for strategists and developers, the book provides a practical and straightforward roadmap, based on application use cases, for AI in Enterprises. The authors (Ajit Jaokar and Cheuk Ting Ho) are data scientists and AI researchers who have deployed AI applications in Enterprise domains. The book is used as a reference for Ajit and Cheuk's new course on Implementing Enterprise AI. Download this eBook (PDF).

Contents
Introduction 
  • Machine Learning, Deep Learning and AI 
  • The Data Science Process 
  • Categories of Machine Learning algorithms 
  • How to learn rules from Data? 
  • An introduction to Deep Learning 
  • What problem does Deep Learning address? 
  • How Deep Learning Drives Artificial Intelligence 
Deep Learning and neural networks
  • Perceptrons – an artificial neuron 
  • MLP - How do Neural networks make a prediction? 
  • Spatial considerations - Convolutional Neural Networks 
  • Temporal considerations - RNN/LSTM 
  • The significance of Deep Learning 
  • Deep Learning provides better representation for understanding the world 
  • Deep Learning as a Universal Function Approximator 
What functionality does AI enable for the Enterprise? 
  • Technical capabilities of AI
  • Functionality enabled by AI in the Enterprise value chain 
Enterprise AI applications 
  • Creating a business case for Enterprise AI 
  • Four Quadrants of the Enterprise AI business case 
  • Types of AI problems 
Enterprise AI – Deployment considerations 
  • A methodology to deploy AI applications in the Enterprise 
  • DevOps and the CI/CD philosophy
Conclusions
This book is available for free, exclusively to DSC members, here.
DSC Resources

Wednesday, October 24, 2018

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, decision trees, ensembles, correlation, Python, R, Tensorflow, SVM, data reduction, feature selection, experimental design, cross-validation, model fitting, and many more. To keep receiving these articles, sign up on DSC.
29 Statistical Concepts Explained in Simple English

Monday, September 10, 2018

New Perspective on the Central Limit Theorem and Statistical Testing

You won't learn this in textbooks, college classes, or data camps. Some of the material in this article is very advanced yet presented in simple English, with an Excel implementation for various statistical tests, and no arcane theory, jargon, or obscure theorems. It has a number of applications, in finance in particular. This article covers several topics under a unified approach, so it was not easy to find a title. In particular, we discuss:
  • When the central limit theorem fails: what to do, and case study
  • Various original statistical tests, some unpublished, for instance to test if an empirical statistical distribution (based on observations) is symmetric or not, or whether two distributions are identical
  • The power and mysteries of stable (also called divisible) statistical distributions
  • Dealing with weighted sums of random variables, especially with decaying weights
  • Fun number theory problems and algorithms associated with these statistical problems
  • Decomposing a (theoretical or empirical / observed) statistical distribution into elementary components, just like decomposing a complex molecule into atoms
The focus is on principles, methodology, and techniques applicable to, and useful in, many applications. For those willing to take a deeper dive into these topics, many references are provided. This article, written as a tutorial, is accessible to professionals with elementary statistical knowledge, such as a stats 101 course. It is also written in a compact style, so that you can grasp all the material in hours rather than days. This simple article covers topics that you would otherwise learn at MIT, Stanford, Berkeley, Princeton, or Harvard in classes aimed at PhD students. Some of it is state-of-the-art research, published here for the first time and made accessible to the data science or data engineering novice. I think mathematicians (being one myself) will also enjoy it. Yet the emphasis is on applications rather than theory. 
Finally, we focus here on sums of random variables. The next article will focus on mixtures rather than sums, providing more flexibility for modeling purposes, or for decomposing a complex distribution into elementary components. In both cases, my approach is mostly non-parametric and based on robust statistical techniques, capable of handling outliers without problems and not subject to over-fitting.
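To make the first bullet point above concrete, here is a minimal Python sketch (my own illustration; the article itself works in Excel) contrasting an equal-weight sum of random variables, where the classical central limit theorem applies, with a geometrically decaying weighted sum, where it fails. The 2^-k weights and the uniform variables are illustrative assumptions, not the article's exact setup.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_terms, n_samples = 50, 20_000
x = rng.uniform(-1, 1, size=(n_samples, n_terms))

# Equal weights: the classical CLT applies, the normalized sum is near-Gaussian.
clt_sum = x.sum(axis=1) / np.sqrt(n_terms)

# Geometrically decaying weights 2^-k (an illustrative choice): the first few
# terms dominate, the CLT no longer applies, and the limit is not Gaussian.
weights = 2.0 ** -np.arange(1, n_terms + 1)
decayed_sum = x @ weights

for label, s in [("equal weights", clt_sum), ("decaying weights", decayed_sum)]:
    stat, p = stats.normaltest(s)  # D'Agostino-Pearson normality test
    print(f"{label:16s} excess kurtosis = {stats.kurtosis(s):+.3f},"
          f" normality p-value = {p:.3g}")
```

With equal weights, the test typically fails to reject normality; with decaying weights, the excess kurtosis remains clearly negative no matter how many terms are added, and normality is rejected.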
Content
1. Central Limit Theorem: New Approach
2. Stable and Attractor Distributions
  • Using decaying weights
  • More about stable distributions and their applications
3. Non CLT-compliant Weighted Sums, and their Attractors
  • Testing for normality
  • Testing for symmetry and dependence on kernel
  • Testing for semi-stability
  • Conclusions
Read full article here

Wednesday, June 20, 2018

Simple Solution to Feature Selection Problems

We discuss a new approach for selecting features from a large set of features, in an unsupervised machine learning framework. In supervised learning, such as linear regression or supervised clustering, it is possible to test the predictive power of a set of features (also called independent variables by statisticians, or predictors) using metrics such as goodness of fit with the response (the dependent variable), for instance the R-squared coefficient, as sketched below. This makes the process of feature selection rather easy.
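As a toy illustration of that supervised baseline (a hypothetical example of mine, not taken from the article), scikit-learn's LinearRegression.score returns exactly the R-squared of a fit, so scoring a candidate feature subset is a one-liner:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                          # 5 candidate features
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=500)   # only features 0 and 2 matter

def r_squared(cols):
    """Goodness of fit (R-squared) of a linear model restricted to the chosen columns."""
    return LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)

print(r_squared([0, 2]))   # close to 1: these features predict the response well
print(r_squared([1, 3]))   # close to 0: irrelevant features
```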
In the unsupervised setting considered here, this is not feasible. The context could be pure clustering, with no training set available, for instance in a fraud detection problem. We are also dealing with discrete and continuous variables, possibly including dummy variables that represent categories, such as gender. We assume that no simple statistical model explains the data, so the framework here is model-free and data-driven. In this context, traditional methods rely on information theory metrics to determine which subset of features carries the largest amount of information.
A classic approach consists of identifying the most information-rich feature, then growing the set of selected features by adding new ones that maximize some criterion; a minimal version is sketched below. There are many variants of this approach, for instance adding more than one feature at a time, or removing some features during the iterative selection algorithm. An exhaustive search for the optimal solution to this combinatorial problem is not computationally feasible when the number of features is large, so an approximate solution (a local optimum) is usually acceptable, and accurate enough for business purposes.
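Here is one minimal version of that greedy scheme (my own toy sketch; the article's new criterion is different), using the empirical joint entropy of discretized features as the information metric:

```python
import numpy as np

def joint_entropy(data):
    """Empirical joint entropy (in bits) of the discretized columns of data."""
    _, counts = np.unique(data, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def greedy_select(X, k):
    """Greedily pick k columns of X, each time adding the one that
    maximizes the joint entropy of the selected subset."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = max(remaining, key=lambda j: joint_entropy(X[:, selected + [j]]))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(1000, 8))   # 8 discrete features with 4 levels each
X[:, 3] = X[:, 0]                        # feature 3 is a redundant copy of feature 0
print(greedy_select(X, 3))
```

Because feature 3 duplicates feature 0, adding it contributes no extra joint entropy once feature 0 is in, so the greedy pass skips it; any reasonable information-based criterion should exhibit this behavior.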
Content of this article:
  • Review of popular methods
  • New, simple idea for feature selection
  • Testing on a dataset with known theoretical entropy (and conclusions)
Read the full article here.

Tuesday, June 12, 2018

Powerful, Hybrid Machine Learning Algorithm with Excel Implementation

In this article, we discuss a general machine learning technique for making predictions, or scoring transactional data, applicable to very big, streaming data. This hybrid technique combines different algorithms to boost accuracy, outperforming each algorithm taken separately, yet it is simple enough to be reliably automated. It is illustrated in the context of predicting the performance of articles published in media outlets or blogs, and has been used by the author to build an AI (artificial intelligence) system to detect articles worth curating, as well as to automatically schedule tweets and other postings on social networks for maximum impact, with the goal of eventually fully automating digital publishing. The application is broad enough that the methodology can be applied to most NLP (natural language processing) contexts with large amounts of unstructured data. The results obtained in our particular case study are also very interesting. 
The algorithmic framework described here applies to any data set, text or not, with quantitative variables, non-quantitative variables (gender, race), or a mix of both. It consists of several components; we discuss in detail those that are new and original. The other, non-original components are briefly mentioned, with references provided for further reading. No deep technical expertise and no advanced mathematical knowledge are required to understand the concepts and methodology described here. The methodology, though state-of-the-art, is simple enough that it can even be implemented in Excel for small data sets (up to about one million observations).
The technique presented here blends non-standard, robust versions of decision trees and regression; a toy version of this general blend is sketched below. It has been successfully used in black-box ML implementations.
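The article's exact blend is not spelled out in this summary, so the following is only a rough sketch under my own assumptions: a shallow decision tree partitions the observations, and a robust (Huber) regression is then fit within each leaf, so the pair can capture piecewise-linear structure that neither component handles well on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import HuberRegressor

class TreeThenRobustRegression:
    """A shallow tree partitions the data; a robust regression is fit in each leaf."""
    def __init__(self, max_depth=3, min_samples_leaf=50):
        self.tree = DecisionTreeRegressor(max_depth=max_depth,
                                          min_samples_leaf=min_samples_leaf)
        self.leaf_models = {}

    def fit(self, X, y):
        self.tree.fit(X, y)
        leaves = self.tree.apply(X)          # leaf id of each training point
        for leaf in np.unique(leaves):
            mask = leaves == leaf
            self.leaf_models[leaf] = HuberRegressor().fit(X[mask], y[mask])
        return self

    def predict(self, X):
        leaves = self.tree.apply(X)
        out = np.empty(len(X))
        for leaf, model in self.leaf_models.items():
            mask = leaves == leaf
            if mask.any():
                out[mask] = model.predict(X[mask])
        return out

# Toy data: the slope of the response changes at zero, so a single linear model
# underfits while a plain tree produces a staircase; the hybrid isolates the
# two regimes and recovers each local slope.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(2000, 2))       # second feature is irrelevant noise
y = np.where(X[:, 0] > 0, 3.0, 0.5) * X[:, 0] + rng.normal(0, 0.3, size=2000)
model = TreeThenRobustRegression().fit(X[:1500], y[:1500])
rmse = np.sqrt(np.mean((model.predict(X[1500:]) - y[1500:]) ** 2))
print(f"test RMSE: {rmse:.3f}")
```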
Read full article here
For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on LinkedIn.

DSC Resources

Saturday, June 9, 2018

Scale-Invariant Clustering and Regression

The impact of a change of scale, for instance using years instead of days as the unit of measurement for one variable in a clustering problem, can be dramatic: it can result in a totally different cluster structure. Frequently, this is not a desirable property, yet it is rarely mentioned in textbooks. I think all clustering software should state in its user guide that the algorithm is sensitive to scale.
We illustrate the problem here, and propose a scale-invariant methodology for clustering. It applies to all clustering algorithms, as it consists of normalizing the observations before classifying the data points. It is not a magic solution, and it has its own drawbacks, as we will see. In the case of linear regression, there is indeed no such problem, and this is one of the few strengths of that technique.
Scale-invariant clustering
The problem may not be noticeable at first glance, especially in Excel, as charts are by default always re-scaled in spreadsheets (or when using charts in R or Python, for that matter). For simplicity, we consider here two clusters; see the figure below.
Figure: original data (left), X-axis re-scaled (middle), scale-invariant clustering (right)
The middle chart is obtained after re-scaling the X-axis, and as a result the two-cluster structure is lost. Or maybe it is the chart on the left-hand side that is wrong. Or both. Astute journalists, and even researchers, actually exploit this issue to present misleading, usually politically motivated, analyses. Students working on a clustering problem might not even be aware of the issue.
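A minimal sketch of the scale-sensitivity issue (a hypothetical example of mine; the article's own normalization method and its drawbacks are discussed in the full text), using plain standardization as the normalization step:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Two well-separated clusters along x; the y coordinate is pure noise.
a = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
b = rng.normal(loc=[8, 0], scale=1.0, size=(100, 2))
X_years = np.vstack([a, b])           # y expressed in years
X_days = X_years * [1, 365]           # same data, y now expressed in days

for name, data in [("y in years", X_years),
                   ("y in days", X_days),
                   ("standardized", StandardScaler().fit_transform(X_days))]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    # How cleanly the first 100 points (one true cluster) land in a single label:
    frac = np.mean(labels[:100])
    print(f"{name:13s} purity of true cluster = {max(frac, 1 - frac):.2f}")
```

With y in days, the noise in that variable dominates the Euclidean distances and the true clusters are destroyed; after standardization, the result matches the original scale again.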
Read the full article here.
