Sunday, May 13, 2018

Selection of Great Data Science Articles still Worth Reading

These articles are between 3 and 5 year old, but are still valuable today. The methodology used in these articles is modern, and still state-of-the-art today. Some discuss immense data sets still available to the public, and that resulted in designing new machine learning techniques to handle them. 
I am in the process of organizing these articles (written by myself) to eventually self-publish data science tutorials, in a few separate booklets, that are easy to understand for the layman with one year of data camp or college education in data science. The material will eventually be accessible to Data Science Central members, but not published in a traditional book. 
My writing style has evolved over time: I have moved away from writing academic papers long ago, to most recently share advanced knowledge in a way that is accessible to beginners, sometimes even ground-breaking material, such as this one. Most of what I write today is not taught in data camps or college textbooks. It provides an off-the-beaten-path introduction and expert advise in data science, in simple English, and even features advanced topics such as stochastic integral equations (the Wall Street's holy grail) or spatial random processes, yet accessible to professionals familiar with data sets but with little mathematical training. In short, this is a great next step after attending a standard statistics, machine learning, or data science curriculum.
My book
Typically, the applications discussed are exciting, and the writing style is designed to make the reader willing to read more, as opposed to the dry writing style that plagues our profession. These articles cover topics such as quantum algorithms, high precision computing, Fintech, number theory, fake news / fake profile / fake reviews detection, cryptography, designing a better search engine, attribution modeling, cataloguing / taxonomy algorithms (NLP), clustering massive data sets, outliers handling, how to differentiate between correlation and causation, how to set up a business to sell data, and much more. 
Currently, these articles are spread as follows:
To access this selection of 150 older articles, click here

Thursday, May 10, 2018

Deep Dive into Polynomial Regression and Overfitting

In this article, we show that the issue with polynomial regression is not over-fitting, but numerical precision. Even if done right, numerical precision still remains an insurmountable challenge. We focus here on step-wise polynomial regression, which is supposed to be more stable than the traditional model. In step-wise regression, we estimate one coefficient at a time, using the classic least square technique. 
Even if the function to be estimated is very smooth, due to machine precision, only the first three or four coefficients can be accurately computed. With infinite precision, all coefficients would be correctly computed without over-fitting. We first explore this problem from a mathematical point of view in the next section, then provide recommendations for practical model implementations in the last section. 
This is also a good read for professionals with a math background interested in learning more about data science, as we start with some simple math, then discuss how it relates to data science. Also, this is an original article, not something you will learn in college classes or data camps, and it even features the solution to a linear regression involving an infinite number of variables.
Content of this article:
1. Polynomial regression for Taylor series
  • Stepwise polynomial regression: algorithm
  • Convergence theorem
2.Application to Real Life Regression Models
  • Recommendations for practical model implementation
Read the full article, here.

Selection of Great Data Science Articles still Worth Reading

These articles are between 3 and 5 year old, but are still valuable today. The methodology used in these articles is modern, and still stat...