In this article, I am going to cover the differences between Kaggle competitions and real-world projects.
And it’s not just about Kaggle but any data science competition or hackathon for that matter…
Kaggle is a really good platform to get started with data science, and even to keep yourself updated with state-of-the-art algorithms and frameworks in the data science space.
But is it really enough to make you a real-world data scientist? Let’s find out…
First of all, let’s have a look at today’s agenda: we will see what the Kaggle platform looks like, how Kagglers compete there, and what a typical real-world project looks like.
Then comes a head-to-head comparison between Kaggle competitions and real-world data science projects, followed by some general differences and concluding thoughts towards the end.
The Kaggle platform has five main sections. You can take part in competitions, and there are some sample competitions for beginners too.
You can take part in discussion forums, and you can look at kernels or notebooks from other Kagglers to review their approach.
There is a datasets section as well, where you can hone your data skills by munging and exploring the datasets.
And if you are a complete beginner, you can find introductory to intermediate level courses as well.
In a Kaggle competition, you get a well-defined problem statement, along with the evaluation criteria and the competition timelines.
You build a baseline and review others’ approaches, if they have made their notebooks public.
You can engage in discussion forums to understand the competition better.
Based on these findings, you iterate on your data science pipeline until the deadline.
You also get a partially exposed test dataset, which gives you an idea of how good your solution is.
There are public and private leaderboards. The public leaderboard is updated in real time and is based on the exposed part of the test dataset. The private leaderboard is based on the hidden part of the test dataset and is made public only after the competition ends.
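To make the public/private leaderboard idea concrete, here is a minimal sketch. It is not how Kaggle implements scoring; the 30% public share, the helper names `split_leaderboard` and `accuracy`, and the toy labels are all assumptions made for illustration.

```python
import random


def split_leaderboard(n_rows, public_share=0.3, seed=0):
    """Randomly assign test-row indices to a public and a private subset.

    The 30% public share is an illustrative assumption, not a Kaggle rule.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * public_share)
    return sorted(indices[:cut]), sorted(indices[cut:])


def accuracy(y_true, y_pred, indices):
    """Fraction of correct predictions on the given subset of rows."""
    correct = sum(1 for i in indices if y_true[i] == y_pred[i])
    return correct / len(indices)


# Toy hidden labels and one submission (hypothetical data).
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

public_idx, private_idx = split_leaderboard(len(y_true))
public_score = accuracy(y_true, y_pred, public_idx)    # visible during the competition
private_score = accuracy(y_true, y_pred, private_idx)  # revealed after the deadline
print(f"public: {public_score:.2f}, private: {private_score:.2f}")
```

Since the two scores are computed on different rows, a submission that looks strong on the public part can still land somewhere else once the private score is revealed.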
In a typical real-world project, you identify and evaluate the opportunity, and you develop a business understanding of the problem you are looking to solve.
You fetch, qualify and analyse the available data, then build a prototype or POC to get buy-in from the project sponsors to convert it into a full-fledged solution or product.
You follow the CRISP-DM methodology to build the model, and then deploy, host and monitor it in production.
In a head-to-head comparison, you can see that in Kaggle competitions the problem statement is well-defined, while in real-world projects you may first need to identify an opportunity and then formulate the problem statement.
In Kaggle, datasets are available, while in the real world you need to identify and fetch the relevant data.
In Kaggle, the Train-Test-Real datasets are already segregated, while in a real-world project you need to analyse what kind of split best suits your project.
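As a small illustration of why the split is a decision in itself, the sketch below compares a random split with a time-based one. The field names `day` and `sales`, the 20% test share, and both helper functions are hypothetical and only meant to show the idea.

```python
import random


def random_split(rows, test_share=0.2, seed=0):
    """Shuffle the rows and hold out a random share for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * test_share)
    return shuffled[:cut], shuffled[cut:]


def time_based_split(rows, test_share=0.2):
    """Train on the oldest rows and test on the most recent ones."""
    ordered = sorted(rows, key=lambda row: row["day"])
    cut = len(ordered) - int(len(ordered) * test_share)
    return ordered[:cut], ordered[cut:]


# Hypothetical daily records; "day" and "sales" are made-up field names.
rows = [{"day": d, "sales": 100 + d} for d in range(1, 11)]

train_r, test_r = random_split(rows)
train_t, test_t = time_based_split(rows)
print("random test days:    ", sorted(row["day"] for row in test_r))
print("time-based test days:", [row["day"] for row in test_t])
```

If the model will be used to predict future values, the time-based split is usually closer to how it will be used, because the random split lets the model train on days that come after some of its test days.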
In Kaggle, there are restrictions on using outside data, while in the real world you can keep looking for additional relevant data.
In Kaggle, the evaluation criteria are already defined for you, while in the real world you need to explore which evaluation criteria best suit your use case.
There are also differences between evaluation criteria and business KPIs, so you need to be aware of them and take care of that as well. I will talk about it in detail later.
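Here is a minimal sketch of how an evaluation metric and a business KPI can disagree. The fraud-style labels, the two models' predictions, and the per-error costs (500 for a missed case, 20 for a false alarm) are all invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def business_cost(y_true, y_pred, cost_false_negative=500, cost_false_positive=20):
    """Total cost of errors under hypothetical per-error costs."""
    cost = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            cost += cost_false_negative   # a missed positive case
        elif t == 0 and p == 1:
            cost += cost_false_positive   # a false alarm
    return cost


# Hypothetical labels (1 = fraud) and predictions from two models.
y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
model_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # misses one fraud case
model_b = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # three false alarms, catches all fraud

for name, preds in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(y_true, preds), business_cost(y_true, preds))
```

Under these assumed costs, model A wins on accuracy while model B is far cheaper for the business, which is exactly the kind of gap a leaderboard metric does not show.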
In Kaggle, you need to submit your results in a specific format, while in the real world you generally deploy and host the model for business stakeholders.
In Kaggle, you get a deadline for the competition, while in the real world you can continue as long as the project has funds or the project sponsor wants you to.
Now let’s have a look at the general differences.
In Kaggle competitions, you have a leaderboard to know where you stand compared to other participants, while in real-world projects you are the best as long as you are not challenged.
In Kaggle, the expectation is to move higher on the leaderboard, while in the real world you need to manage the expectations of the project stakeholders.
In Kaggle, you can use all the resources available to you, while in the real world you weigh every decision against its business value.
In Kaggle, competition timelines are important, while in the real world time to market is an important aspect.
In Kaggle, you can collaborate with other participants to form a team, while in the real world you and the other team members need to develop a T-shaped skill set.
Here the horizontal line represents the data literacy shared across the team, and the vertical line represents depth in a particular area.
For example, a cloud engineer’s horizontal skill is data literacy and an end-to-end understanding of the work being done in the project, while the vertical skill is cloud computing.
In Kaggle, models can be as complex as needed to improve accuracy, while in the real world practical deployment aspects are also considered.
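One way to picture this trade-off is a selection step that respects a deployment constraint. The sketch below assumes a latency budget as the only constraint; the candidate names, accuracy figures, latencies, and the `pick_deployable` helper are all hypothetical.

```python
def pick_deployable(candidates, max_latency_ms):
    """Return the most accurate candidate within the latency budget, or None."""
    feasible = [c for c in candidates if c["latency_ms"] <= max_latency_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c["accuracy"])


# Hypothetical candidates; the numbers are invented for illustration.
candidates = [
    {"name": "large_ensemble", "accuracy": 0.93, "latency_ms": 850},
    {"name": "gradient_boosting", "accuracy": 0.91, "latency_ms": 40},
    {"name": "logistic_regression", "accuracy": 0.87, "latency_ms": 5},
]

leaderboard_pick = max(candidates, key=lambda c: c["accuracy"])    # accuracy only
production_pick = pick_deployable(candidates, max_latency_ms=100)  # accuracy within budget
print(leaderboard_pick["name"], "vs", production_pick["name"])
```

In a real project the constraint could just as well be memory, cost per prediction, or explainability; latency is used here only because it is easy to show.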
So what is the takeaway of this session?
In my view, Kaggle is a useful platform, especially since it offers other components apart from competitions.
Real-world projects are different in many aspects as we discussed earlier.
The effectiveness of the Kaggle platform as a learning tool is really subjective.
A data science beginner may find it very useful and can also use the platform to showcase their skills.
An experienced data scientist, on the other hand, may not have enough time to take part in live competitions.
Still, they can visit the platform from time to time to look for a competition related to the real-world problem they are solving.
So this is it for now. I hope you found this episode useful. Let me know your views in the comments section.
If you liked this article, please subscribe to my channel to get an update whenever I upload new content.
Stay tuned, bye for now.
Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.