Friday, May 25, 2018

Mathematical Olympiads for Undergrad Students

Mathematical Olympiads are popular among high school students. However, there is nothing similar for college students, except maybe IMC. Even IMC is not popular. It focuses mostly on the same kind of problems as high school Olympiads, and you can not participate if you are over 23 years old. In addition, it is organized by country, as opposed to globally, thus favoring countries with a large population. Topics such as probability are never considered.
This is an opportunity to create Mathematical Olympiads for college students, with no age or country restrictions. It could be organized online, offering interesting, varied, and challenging problems, allowing participants to read literature about the problems, and have a few weeks to submit a solution. In short, something like Kaggle competitions, except that Kaggle focuses exclusively on machine learning, coding, and data processing. Not sure where the funding could come from, but if I decided to organize this kind of competition, I would be able to fund it myself. 
Below are examples of problems that I would propose. They do not require knowledge beyond advanced undergrad level in math, statistics, or probabilities. They are are more difficult, and more original, than typical exam questions. Participants are encouraged to use tools such as WolframAlpha to automatically compute integrals or solve systems of equations involved in these problems.
Is anyone interested in this new initiative? I could see this helping students not enrolled in a top university, though the majority of winners would probably come from a top school.
To read suggested problems with solution, visit this webpage

The First Things you Should Learn as a Data Scientist - Not what you Think

The list below is a (non-comprehensive) selection of what I believe should be taught first, in data science classes, based on 30 years of business experience. This is a follow up to my article Why logistic regression should be taught last.
I am not sure whether these topics below are even discussed in data camps or college classes. One of the issue is the way teachers are recruited. The recruitment process favors individuals famous for their academic achievements, or for their "star" status, and they tend to teach the same thing over and over, for decades. Successful professionals have little interest in becoming a teacher (as the saying goes: if you can't do it, you write about it, if you can't write about it, you teach it.)
It does not have to be that way. Plenty of qualified professionals, even though not being a star, would be perfect teachers and are not necessarily motivated by money. They come with tremendous experience gained in the trenches, and could be fantastic teachers, helping students deal with real data. And they do not need to be a data scientist, many engineers are entirely capable (and qualified) to provide great data science training.
This article has three parts:
  • Topics that should be taught very early on in a data science curriculum
  • Topics taught in a traditional curriculum
  • Topics that should also be included in a data science curriculum
Read the full article here

Sunday, May 13, 2018

Selection of Great Data Science Articles still Worth Reading

These articles are between 3 and 5 year old, but are still valuable today. The methodology used in these articles is modern, and still state-of-the-art today. Some discuss immense data sets still available to the public, and that resulted in designing new machine learning techniques to handle them. 
I am in the process of organizing these articles (written by myself) to eventually self-publish data science tutorials, in a few separate booklets, that are easy to understand for the layman with one year of data camp or college education in data science. The material will eventually be accessible to Data Science Central members, but not published in a traditional book. 
My writing style has evolved over time: I have moved away from writing academic papers long ago, to most recently share advanced knowledge in a way that is accessible to beginners, sometimes even ground-breaking material, such as this one. Most of what I write today is not taught in data camps or college textbooks. It provides an off-the-beaten-path introduction and expert advise in data science, in simple English, and even features advanced topics such as stochastic integral equations (the Wall Street's holy grail) or spatial random processes, yet accessible to professionals familiar with data sets but with little mathematical training. In short, this is a great next step after attending a standard statistics, machine learning, or data science curriculum.
My book
Typically, the applications discussed are exciting, and the writing style is designed to make the reader willing to read more, as opposed to the dry writing style that plagues our profession. These articles cover topics such as quantum algorithms, high precision computing, Fintech, number theory, fake news / fake profile / fake reviews detection, cryptography, designing a better search engine, attribution modeling, cataloguing / taxonomy algorithms (NLP), clustering massive data sets, outliers handling, how to differentiate between correlation and causation, how to set up a business to sell data, and much more. 
Currently, these articles are spread as follows:
To access this selection of 150 older articles, click here

Thursday, May 10, 2018

Deep Dive into Polynomial Regression and Overfitting

In this article, we show that the issue with polynomial regression is not over-fitting, but numerical precision. Even if done right, numerical precision still remains an insurmountable challenge. We focus here on step-wise polynomial regression, which is supposed to be more stable than the traditional model. In step-wise regression, we estimate one coefficient at a time, using the classic least square technique. 
Even if the function to be estimated is very smooth, due to machine precision, only the first three or four coefficients can be accurately computed. With infinite precision, all coefficients would be correctly computed without over-fitting. We first explore this problem from a mathematical point of view in the next section, then provide recommendations for practical model implementations in the last section. 
This is also a good read for professionals with a math background interested in learning more about data science, as we start with some simple math, then discuss how it relates to data science. Also, this is an original article, not something you will learn in college classes or data camps, and it even features the solution to a linear regression involving an infinite number of variables.
Content of this article:
1. Polynomial regression for Taylor series
  • Stepwise polynomial regression: algorithm
  • Convergence theorem
2.Application to Real Life Regression Models
  • Recommendations for practical model implementation
Read the full article, here.

Simple Solution to Feature Selection Problems

We discuss a new approach for selecting features from a large set of features, in an unsupervised machine learning framework. In supervised...