What do Kaggle surveys say about beginners?

Khurazov Ruslan
7 min read · May 6, 2022
Photo by Firmbee.com on Unsplash

In recent years the field of data science and machine learning has exploded, and it seems like everyone is somehow involved in it today. Naturally, it didn’t pass me by: it’s been a little over two years since I first heard of it and started exploring. Being able to predict certain events or reactions just by putting numbers together feels almost like a superpower to me. So, when I found out that kaggle.com hosts a survey for data science and machine learning enthusiasts and professionals from all over the world, I thought it would be very interesting to see how the industry has changed since I first heard of it in 2019. I decided to focus on DS and ML newcomers because I am one of them, and I wanted to see how I, a 29-year-old male with only a bachelor’s degree and less than three years of experience in the field, fit into that world.

Let’s dig into those surveys. Photo by Scott Graham on Unsplash

Data

The data for this analysis is taken from two Kaggle-hosted surveys (2019 Kaggle Machine Learning & Data Science Survey and 2021 Kaggle Machine Learning & Data Science Survey) and DOES NOT represent the data science and machine learning community as a whole, only the part of it that uses the website and participates in its surveys. That said, I imagine that the majority of data scientists and machine learning specialists use Kaggle, so it could serve as an industry indicator to some extent.

For this project I created separate dataframes for 2019 and 2021, keeping only DS and ML newcomers. Here a newcomer is someone with less than 3 years of experience in both programming and machine learning.
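To make the newcomer definition concrete, here is a minimal sketch of such a filter in pandas. The column names and the answer wording are hypothetical stand-ins, not the real Kaggle question IDs or options, and the toy rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the survey responses. The column names are hypothetical,
# not the real Kaggle question IDs, and the rows are made up.
responses = pd.DataFrame({
    "coding_experience": ["< 1 years", "1-2 years", "3-5 years", "1-2 years", "10-20 years"],
    "ml_experience": ["Under 1 year", "1-2 years", "1-2 years", "2-3 years", "5-10 years"],
})

# Answer options treated as "less than 3 years" (assumed wording).
NEW_TO_CODING = {"< 1 years", "1-2 years"}
NEW_TO_ML = {"Under 1 year", "1-2 years", "2-3 years"}


def select_newcomers(df):
    """Keep respondents with under 3 years of both coding and ML experience."""
    mask = df["coding_experience"].isin(NEW_TO_CODING) & df["ml_experience"].isin(NEW_TO_ML)
    return df[mask].reset_index(drop=True)


newcomers = select_newcomers(responses)
print(newcomers)
```

Both conditions have to hold at once, so an experienced programmer who only recently picked up machine learning (the third toy row) is not counted as a newcomer.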

The surveys consisted of 34 questions in 2019 and 38 questions in 2021, ranging from the age and gender of a user to technologies used and duties at work. I selected similar questions in both surveys and split them into three big categories: demographics, education and work, and technologies. I chose donut-shaped charts for the project. For single-answer questions, the charts show the top 2–5 user choices; some also have an “other” value, which is the sum of all answers outside the top 2–5. For multiple-answer questions, the charts show the top 3 options that users selected. You can check out the data preprocessing (there was a lot) and code in my Github repository.
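The chart inputs described above can be sketched roughly as follows. This is an illustration under assumptions rather than the code from the repository: the helper names `top_shares` and `top_multi` are made up, the data is a toy sample, and multiple-answer questions are assumed to be stored as one column per option with an empty cell when the option was not ticked.

```python
import pandas as pd


def top_shares(answers, top_n=3):
    """Percent share of the top_n answers; everything else is summed into 'Other'."""
    shares = answers.value_counts(normalize=True) * 100
    top = shares.head(top_n)
    rest = shares.iloc[top_n:].sum()
    if rest > 0:
        top = pd.concat([top, pd.Series({"Other": rest})])
    return top.round(1)


def top_multi(option_columns, top_n=3):
    """Percent of respondents who ticked each option, keeping the top_n options."""
    ticked = option_columns.notna().mean() * 100
    return ticked.sort_values(ascending=False).head(top_n).round(1)


# Toy data, invented for illustration (not real survey numbers).
age = pd.Series(["18-21"] * 4 + ["22-24"] * 3 + ["25-29"] * 2 + ["30-34"])
languages = pd.DataFrame({
    "Python": ["Python", "Python", "Python", None],
    "SQL": ["SQL", None, "SQL", None],
    "R": [None, None, "R", None],
    "C++": [None, None, None, None],
})

age_shares = top_shares(age, top_n=2)
language_shares = top_multi(languages, top_n=3)
print(age_shares)
print(language_shares)
```

Either result can then be passed to matplotlib’s `Axes.pie`; setting `wedgeprops={"width": 0.4}` is one common way to turn the pie into a donut. Note that in the multiple-answer case the shares do not sum to 100%, because one respondent can tick several options.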

Demographics

Let’s start with basic demographics: age, gender and country of residence of the respondents.

Pic. 1 Age

The plot clearly shows that new data science enthusiasts became even younger over the two years. The 18–21 age group was third in 2019 with 19.4%, but in 2021 it accounted for 28.2% of survey participants. I’m in the 25–29 age group at the moment, as were about 21% of respondents in 2021.

Pic. 2 Gender

As we can see, the majority of machine learning and data science beginners are male, although the share of females increased from 16.6% in 2019 to 21.4% in 2021.

Pic. 3 Country of residence

This one was a surprise for me because I expected the US to be on top here. But new data scientists from India confidently hold the top spot and even increased their share from 31.2% to 35.5%. Let’s go, India!

Education and work

Pic. 4 Education

We can also see that the educational barrier to entering the industry seems to be going down: almost half of newbies in 2021 had only a bachelor’s degree, and doctorate degree holders didn’t even make it into the top 3. I see that as good news for myself, as I only hold a bachelor’s degree at the moment.

Pic. 5 Online courses

The most popular online education platforms didn’t change much over the 2-year period: Coursera is first, followed by Kaggle courses and Udemy. I have personally used all three at different times and also finished the Data Scientist nanodegree on udacity.com.

Pic. 6 Employment

As expected, the majority of respondents were students, and that number increased a bit over two years. In 2019 a little over 15% of industry newcomers were transitioning from software engineering, but by 2021 they had lost the second spot and didn’t make it into the top 3. I’m in the “Currently not employed” group, as were about 10% of respondents in 2021.

Pic. 7 Employer size

Of the respondents who were employed at the time of the survey, the largest share worked for small companies of up to 50 employees. That share grew from 31.2% in 2019 to almost 39% in 2021. I guess I should be looking for a job in smaller companies as well.

Pic. 8 Salary

Most respondents didn’t make much money, which makes sense considering that the majority of them were still students.

Technologies

Pic. 9 Programming languages

Python is by far the most popular programming language among the newcomers. It’s followed by SQL with a little less than 40%. Both languages are widely used in industry, so no surprises here. But R, which a quarter of respondents checked as their programming language in 2019, dropped out of the top 3 and was replaced by C++. With its rich library support and very fast run time, it’s understandable why C++ is gaining popularity. I personally only use Python for my study projects, but I have also tried to learn SQL, as it is a very important tool when working with lots of data.

Pic. 10 Algorithms
Pic. 11 Data visualization libraries

Regressions and tree-based models are the most popular algorithms among the newcomers, taking the first and second spots respectively. CNNs, however, lost their third spot to boosting in 2021. As for data visualization libraries, matplotlib is the most used one (this project was mainly done using it as well), followed by seaborn, another very well-known Python library. Ggplot was replaced by plotly in 2021, though.

Pic. 12 IDEs
Pic. 13 Cloud environment

Jupyter is the absolute leader among IDEs for newcomers. I believe everyone who has ever started a data science or machine learning project has used a Jupyter notebook at least once. However, according to the surveys, fewer Kaggle users used it in 2021 than in 2019: the share went down from almost 77% to 67%. The second place, VSCode, was a surprise for me, as I had never heard of people using it for DS or ML projects. I think I’ll check it out to see for myself why it’s becoming more popular. I actually found an interesting article about using VSCode for data science. PyCharm kept third place over the 2-year period with a little growth. Since this report is based on Kaggle surveys, it’s no surprise that Kaggle notebooks were at the top of the cloud environments list, with about 40% of respondents using them. Google’s Colab, though, gained some more followers over the 2-year period. Interestingly, almost 30% of respondents in both surveys didn’t use any cloud environment. As for me, I mainly use Jupyter notebooks for projects, but sometimes I also use both Kaggle notebooks and Google’s Colab.

Pic. 14 Social media

Finally, I wanted to see what platforms DS and ML newcomers use to share and exchange information. Again, in Kaggle-hosted surveys kaggle.com was the first choice in both 2019 and 2021. However, it dropped from a little less than 70% to about 44%. In general, all top 3 platforms lost some followers among DS and ML beginners. I mainly watch YouTube videos and read articles on Medium for my projects.

According to the surveys, data science and machine learning beginners on kaggle.com became younger and more diverse over the 2-year period. Most of them are from India, and they use popular online education platforms such as Coursera, Kaggle courses and Udemy. Many are students with a bachelor’s degree. The majority of them use Python as a programming language and Jupyter for their DS projects. I think I fit well into this group of data science and machine learning beginners and, hopefully, sometime in the future I’ll participate in the survey as a seasoned and experienced professional.

Thanks to

for showing how to plot very cool donut charts and for inspiration!

Check out my Github to see my other projects or talk to me on LinkedIn or Instagram. Any comments are deeply appreciated!
