About this course
If you want to break into competitive data science, then this course is for you. Participating in predictive modeling competitions can help you gain practical experience and improve your data modeling skills in domains such as credit, insurance, marketing, natural language processing, sales forecasting, and computer vision, to name a few. You will do so in a competitive context, against thousands of participants who are each trying to build the most predictive algorithm. Participants pushing each other to the limit can result in better performance and smaller prediction errors. Achieving high ranks consistently can help you accelerate your career in data science.
Data Science Competition: What does it take to become a top Kaggler?
- You need to understand the problem well, the metric you are being tested on, and the dynamics of your training and test data. Is your test data in the future? Is it a time series? Does your test data contain new entities (e.g. new customers, products)? The answers to these questions define how you need to validate your solutions internally in order to get reliable and accurate results.
- Be disciplined. Once you have defined a reliable internal validation environment, exploit it within reason. Never try something that you cannot actually replicate for the test data. Treat your validation data like test data; this helps to avoid leakage.
- Try problem-specific things. For instance, in image classification you need CNNs and for text data, you might need tf-idf, stemming, spell checking, etc. You need to know what works best for each problem.
- More generally, you need to know the tools, programming languages, libraries, and techniques. You also need to update your arsenal constantly with new tools.
- Have good hardware so that you can try many things. Image classification competitions need GPUs, while for tabular datasets CPUs with multiple cores will also do.
- Collaborate with other people and form teams. This works well for various reasons. People tend to approach the problem from different angles, ultimately uncovering more information about the target variable. At the same time, you can divide tasks among team members to cover more ground.
- Use ensembling: combine many different (ideally diverse) approaches in order to get a better result.
What skills will you gain from this course?
- You will learn to solve predictive modeling competitions efficiently and learn which of the skills obtained are applicable to real-world tasks.
- You will understand how to preprocess the data and generate new features from various sources such as text and images.
- You will be taught advanced feature engineering techniques like generating mean-encodings, using aggregated statistical measures or finding nearest neighbors as a means to improve your predictions.
- Form reliable cross-validation methodologies that help you benchmark your solutions and avoid overfitting or underfitting when tested on unseen (test) data.
- Gain experience in analyzing and interpreting the data. You will become aware of inconsistencies, high noise levels, errors, and other data-related issues such as leakages and you will learn how to overcome them.
- Acquire knowledge of different algorithms and learn how to efficiently tune their hyperparameters and achieve top performance.
- Master the art of combining different machine learning models and learn how to ensemble.
Syllabus for How to Win a Data Science Competition
(Overall content rating 94%)
Data Science Competition: Introduction
This week you will learn about competitive data science. You will learn about the mechanics of competitions, the difference between competitions and real-life data science, and the hardware and software that people usually use in competitions. You will also briefly recap the major ML models frequently used in competitions.
- Introduction, Course overview, Competition Mechanics, and Kaggle Overview.
- Real-World Application vs Competitions and Recap of main ML algorithms.
- Software/Hardware Requirements.
Feature Preprocessing and Generation with Respect to Models
In this module, you will summarize approaches to working with features: preprocessing, generation, and extraction. You will see that the choice of machine learning model affects both the preprocessing you apply to the features and your approach to generating new ones. You will also discuss feature extraction from text with Bag of Words and Word2vec, and feature extraction from images with Convolutional Neural Networks.
- Numeric features, Categorical and ordinal features.
- Date, time and coordinates, Handling missing values, Bag of words and Word2vec, CNN.
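As a rough illustration of the Bag of Words idea mentioned above, the following minimal sketch turns each document into a vector of word counts over a shared vocabulary. The function name `bag_of_words` and the whitespace tokenization are assumptions made for this example; they are not taken from the course materials, and real pipelines usually rely on a library vectorizer with better tokenization.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count_matrix) for a list of text documents.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}

    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, count in Counter(toks).items():
            row[index[tok]] = count
        matrix.append(row)
    return vocab, matrix


vocab, X = bag_of_words(["good product", "bad product bad service"])
for row in X:
    print(dict(zip(vocab, row)))
```

Each row of `X` can then be fed to a model as a set of numeric features, one column per word.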
Data Science Competition: Overview of Final Project
This is a reminder that it is best to start the final project for this course soon. The final project is, in fact, a competition; you can find information about it in this module.
Exploratory Data Analysis
You will begin this week with Exploratory Data Analysis (EDA). It is a very broad and exciting topic and an essential component of the solving process. Besides regular videos, you will find a walkthrough of the EDA process for the Springleaf competition data and an example of a productive EDA for the Numerai competition with extraordinary findings.
- Exploratory Data Analysis.
- Building intuition about the data, Exploring anonymized data, and Visualizations.
- Springleaf competition EDA 1 and 2, and Numerai competition EDA.
Validation
In this module, you will discuss various validation strategies. You will see that the strategy you choose depends on the competition setup and that a correct validation scheme is one of the building blocks of any winning solution.
- Validation strategies.
- Data splitting strategies.
- Problems occurring during validation.
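To make the dependence on competition setup concrete, here is a minimal sketch of one possible data splitting strategy: a time-based split, which is a reasonable choice when the test data lies in the future relative to the training data. The function name `time_based_split`, the dictionary row format, and the `month` field are assumptions made for this example only.

```python
def time_based_split(rows, time_key, n_valid_periods=1):
    """Hold out the most recent time periods as validation data.

    Hypothetical helper: `rows` is a list of dicts, `time_key` names the
    field holding a sortable time period (e.g. a month number).
    """
    periods = sorted({row[time_key] for row in rows})
    if n_valid_periods >= len(periods):
        raise ValueError("not enough distinct periods to leave any training data")

    valid_periods = set(periods[-n_valid_periods:])
    train = [row for row in rows if row[time_key] not in valid_periods]
    valid = [row for row in rows if row[time_key] in valid_periods]
    return train, valid


rows = [
    {"month": 1, "sales": 10},
    {"month": 2, "sales": 12},
    {"month": 3, "sales": 9},
    {"month": 3, "sales": 11},
]
train, valid = time_based_split(rows, "month")
print(len(train), "training rows,", len(valid), "validation rows")
```

The point of the design is that validation rows are always later than every training row, mimicking how the model will be scored on future data. A random split would be the more natural choice when rows are independent of time.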
Data Leakages
Finally, in this module, you will cover something unique to data science competitions: you will see examples of how it is sometimes possible to get a top position in a competition with very little machine learning, just by exploiting a data leak.
- Basics of Leaks.
- Leaderboard probing and examples of rare data leaks.
- Expedia challenge.
Metrics Optimization
This week you will first study another component of competitions: the evaluation metrics. You will recap the most prominent ones and then see how you can efficiently optimize the metric given in a competition.
- Regression metrics review 1 and 2, and Classification metrics review.
- General approaches for metrics optimization and Regression metrics optimization.
- Classification metrics optimization 1 and 2.
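One well-known fact that shows why the metric matters: the best constant prediction depends on the metric being optimized. For mean squared error it is the mean of the targets, while for mean absolute error it is the median. The sketch below checks this on a small made-up target vector; the function names and the data are illustrative assumptions, not course material.

```python
from statistics import mean, median


def mse(y_true, constant):
    """Mean squared error of predicting the same constant for every target."""
    return mean((y - constant) ** 2 for y in y_true)


def mae(y_true, constant):
    """Mean absolute error of predicting the same constant for every target."""
    return mean(abs(y - constant) for y in y_true)


# Made-up targets with one large outlier.
y = [1.0, 2.0, 3.0, 4.0, 100.0]

for name, constant in [("mean", mean(y)), ("median", median(y))]:
    print(f"{name}: MSE={mse(y, constant):.1f}, MAE={mae(y, constant):.1f}")
```

The mean is pulled toward the outlier and scores better on MSE, while the median ignores it and scores better on MAE, so a model tuned for one metric can look poor under the other.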
Data Science Competition: Advanced Feature Engineering 1
In this module, you will study a very powerful technique for feature generation. It has a lot of names, but here we call it “mean encodings”. You will see the intuition behind them and how to construct, regularize, and extend them.
- Regularization, extensions, and generalizations.
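A mean encoding replaces each category with the average target value observed for that category. Computing it on the same rows it is applied to leaks the target, so one common form of regularization is to compute the encoding out of fold. The sketch below is a minimal illustration under that assumption; the function name `oof_mean_encode`, the round-robin fold assignment, and the fallback to the training-fold mean for unseen categories are choices made for this example and are not necessarily the scheme taught in the course.

```python
from collections import defaultdict


def oof_mean_encode(categories, targets, n_folds=2):
    """Out-of-fold mean encoding.

    Each row is encoded using target statistics from the other folds only.
    Hypothetical helper: row i is assigned to fold i % n_folds.
    """
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        train_total, train_count = 0.0, 0
        # Accumulate target statistics from every row outside this fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                train_total += targets[i]
                train_count += 1
        fallback = train_total / train_count if train_count else 0.0
        # Encode the rows inside this fold with those statistics.
        for i in range(n):
            if i % n_folds == fold:
                cat = categories[i]
                if counts.get(cat, 0) > 0:
                    encoded[i] = sums[cat] / counts[cat]
                else:
                    encoded[i] = fallback  # category unseen in the other folds
    return encoded


cats = ["a", "a", "b", "b", "a", "b"]
y = [1, 0, 1, 1, 0, 0]
print(oof_mean_encode(cats, y, n_folds=2))
```

Because no row ever sees its own target, the encoded feature behaves on training data more like it will on test data.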
Hyperparameter Optimization
In this module, you will learn about the hyperparameter optimization process. You will also find a special video with practical tips and tricks, recorded by four instructors.
- Hyperparameter tuning 1, 2, and 3.
- Practical guide.
- KazAnova’s competition pipeline, part 1, and part 2.
Data Science Competition: Advanced Feature Engineering 2
In this module, you will learn about a few more advanced feature engineering techniques.
- Statistics and distance-based features.
- Matrix factorizations and feature interactions.
Ensembling
Nowadays it is hard to find a competition won by a single model; almost every winning solution incorporates an ensemble of models. In this module you will learn about the main ensembling techniques in general and, of course, how best to ensemble models in practice.
- Introduction to ensembling.
- Bagging, Boosting, Stacking, and StackNet.
- Ensembling Tips and Tricks.
- CatBoost 1 and 2.
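The simplest form of ensembling is averaging the predictions of several models, optionally with weights. The sketch below illustrates only that basic idea, not bagging, boosting, stacking, or StackNet; the function name `average_predictions` and the prediction values are assumptions made for this example.

```python
def average_predictions(prediction_lists, weights=None):
    """Blend per-model predictions with a (weighted) average.

    Hypothetical helper: `prediction_lists` holds one list of predictions per
    model, all of the same length; `weights` defaults to equal weighting.
    """
    n_models = len(prediction_lists)
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("need exactly one weight per model")

    total_weight = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row_preds)) / total_weight
        for row_preds in zip(*prediction_lists)
    ]


# Made-up predicted probabilities from two models for three rows.
model_a = [0.9, 0.2, 0.6]
model_b = [0.7, 0.4, 0.2]

print(average_predictions([model_a, model_b]))
print(average_predictions([model_a, model_b], weights=[3, 1]))
```

In practice the weights would be chosen on validation data rather than by hand, and averaging tends to help most when the models make different kinds of errors.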
Data Science Competition: Walk-throughs
For the fifth week, the instructors have prepared several “walk-through” videos. In these videos, they discuss solutions to competitions in which they won prizes. The video content is quite short this week to let you spend more time on the final project. Good luck!
- Crowd Flower Competition and Springleaf Marketing Response.
- Microsoft Malware Classification Challenge and Walmart: Trip Type Classification.
- Acquire Valued Shoppers Challenge, part 1, and part 2.
This course, “How to Win a Data Science Competition”, is part of the Advanced Machine Learning Specialization. Here is a brief overview of that specialization.
About Advanced Machine Learning Specialization
This specialization gives an introduction to deep learning, reinforcement learning, natural language understanding, computer vision, and Bayesian methods. Top Kaggle machine learning practitioners and CERN scientists will share their experience of solving real-world problems and help you fill the gaps between theory and practice. Upon completion of the 7 courses, you will be able to apply modern machine learning methods in enterprise settings and understand the caveats of real-world data and settings.
- NRU Higher School of Economics
- Online Course
- 1-4 Weeks
- Paid Course (Paid certificate)
- Machine Learning Basics, Proficiency in Python
- Data Analysis, Data Science, Feature Engineering, Predictive Modelling