I have always enjoyed watching the Olympic Games. Watching the best people in each sport compete with each other is inspiring, and it makes me wonder what my life would look like if I had pursued volleyball (my current favorite sport) instead of software engineering.
Since that’s not the case, and I stuck with software engineering, we are here today to ask some questions about the Olympic Games using a dataset covering 120 years of competitions.
Note that I will be using Python, mostly with the following packages: pandas, NumPy, and plotly.
So if you want to mess with the code I’ll share, make sure you’ve installed these packages (the distribution plots further down use plotly’s figure factory, which also needs SciPy).
You can download the dataset I’ll be using from Kaggle.
Before we start asking questions about our data, we have to understand what kind of data we have.
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
olympic_df = pd.read_csv('athlete_events.csv')
The first two commands I always use on a dataframe to better understand it are
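The commands themselves are missing from the text here. `head()` and `describe()` are a common pair for a first look, and the averages, minimum, and maximum quoted next are the kind of thing `describe()` reports, so this sketch assumes those two. The tiny dataframe is a made-up stand-in so the snippet runs without the CSV; with the real data you would call the same methods on `olympic_df`.

```python
import pandas as pd

# Made-up stand-in rows that reuse some column names from athlete_events.csv,
# so the snippet runs without the file. With the real data, use olympic_df.
sample_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["M", "F", "F", "M"],
    "Age": [24.0, 27.0, None, 31.0],
    "Height": [180.0, 168.0, 172.0, None],
    "Weight": [80.0, 60.0, 64.0, 90.0],
    "Sport": ["Basketball", "Volleyball", "Volleyball", "Football"],
})

first_rows = sample_df.head()    # first five rows (here: all four)
summary = sample_df.describe()   # count, mean, std, min, quartiles, max per numeric column

print(first_rows)
print(summary)
```

Missing values are skipped by `describe()`, which is why the `count` row can differ between columns.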
From this alone, we can see that, on average, Olympic participants are around 25 years old, with an average height of 175 cm and an average weight of 70 kg.
We can also see some interesting facts, such as that the youngest Olympic participant was 10 years old and the oldest was 97.
While working on this, I wanted to see the min, max, and mean for each sport separately, rather than for all of them combined, to get a more detailed look at the data.
sports_df_grouped = olympic_df.groupby("Sport")
# String names avoid the FutureWarning that newer pandas versions raise for np.max / np.min / np.mean
sports_df_agg = sports_df_grouped[["Age", "Height", "Weight"]].agg(["max", "min", "mean"]).reset_index()
Looking at this table, we can see that Aeronautics is not a very popular sport — there was only one participant from Switzerland (I actually googled to verify).
Something that surprised me is the max weight of archers compared to the max weight of tug of war participants, but we can see that it does make sense on average.
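To compare sports on a single statistic, such as max weight versus mean weight, it can help to flatten the two-level column index that `agg` produces and then sort. This is a minimal sketch on made-up rows; the sport names and numbers are placeholders, not values from the dataset, and the flattened column names like `Weight_mean` are my own naming choice.

```python
import pandas as pd

# Made-up stand-in rows; with the real data, start from olympic_df instead.
sample_df = pd.DataFrame({
    "Sport": ["Archery", "Archery", "Tug-Of-War", "Tug-Of-War"],
    "Age": [22.0, 30.0, 28.0, 34.0],
    "Height": [170.0, 180.0, 182.0, 186.0],
    "Weight": [65.0, 95.0, 90.0, 100.0],
})

stats = sample_df.groupby("Sport")[["Age", "Height", "Weight"]].agg(["max", "min", "mean"])

# Flatten the two-level column index: ("Weight", "max") becomes "Weight_max"
stats.columns = ["_".join(col) for col in stats.columns]
stats = stats.reset_index()

# Sort so the heaviest sports on average come first
by_mean_weight = stats.sort_values("Weight_mean", ascending=False)
print(by_mean_weight[["Sport", "Weight_max", "Weight_mean"]])
```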
After briefly introducing ourselves to the data, let’s try to come up with some interesting questions to answer.
How are height, weight, and age distributed across different sports?
To answer this question, we can try a few approaches.
The first one would be plotting a scatterplot for all sports together:
fig = px.scatter(olympic_df, x="Weight", y="Height", color="Sport")
fig.update_layout(title="Distribution of weight and height in each sport")
In the above code, we could have added size="Age" to the scatter function arguments to see how age is distributed as well, but I thought the plot was already too crowded with information.
In the second approach, we can hand-pick a single or a few sports and investigate their weight, height, or age distribution.
I decided to consider 4 sports and look at their age distribution, but feel free to tweak the selection as you wish.
For the next plot, we will need to filter the 4 sports I selected into 4 different dataframes:
basketball_df = olympic_df[olympic_df["Sport"] == "Basketball"]
football_df = olympic_df[olympic_df["Sport"] == "Football"]
volleyball_df = olympic_df[olympic_df["Sport"] == "Volleyball"]
beach_volleyball_df = olympic_df[olympic_df["Sport"] == "Beach Volleyball"]
Now, since we want to show the distribution of age, let’s get a list of ages for each sport:
# Drop missing ages so the density estimate is not fed NaN values
basketball_age = basketball_df["Age"].dropna().tolist()
football_age = football_df["Age"].dropna().tolist()
volleyball_age = volleyball_df["Age"].dropna().tolist()
beach_volleyball_age = beach_volleyball_df["Age"].dropna().tolist()
Lastly, let’s plot the distribution:
hist_age_data = [basketball_age, football_age, volleyball_age, beach_volleyball_age]
labels = ['Basketball', 'Football', 'Volleyball', 'Beach Volleyball']
colors = ['#393E46', '#2BCDC1', '#F66095', 'orange']
fig = ff.create_distplot(hist_age_data, labels, colors=colors, bin_size=1.2)
fig.show()
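If you plan to swap sports in and out, the four near-identical filters can be collapsed into one dictionary comprehension. This is just a sketch of that idea on made-up rows; `selected_sports` and `ages_by_sport` are names I introduced for illustration.

```python
import pandas as pd

# Made-up stand-in rows; with the real data, start from olympic_df instead.
sample_df = pd.DataFrame({
    "Sport": ["Basketball", "Basketball", "Football", "Volleyball", "Beach Volleyball", "Judo"],
    "Age": [24.0, None, 22.0, 27.0, 30.0, 25.0],
})

selected_sports = ["Basketball", "Football", "Volleyball", "Beach Volleyball"]

# One list of non-missing ages per selected sport
ages_by_sport = {
    sport: sample_df.loc[sample_df["Sport"] == sport, "Age"].dropna().tolist()
    for sport in selected_sports
}

# These two lists have the shape ff.create_distplot expects as its first two arguments
hist_age_data = list(ages_by_sport.values())
labels = list(ages_by_sport.keys())
print(ages_by_sport)
```

To change the comparison, edit `selected_sports` and leave the rest alone. The density curve needs more than a single age per sport, so the one-row groups above are only good for showing the structure.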
How likely is it to win a medal at 70?
To answer that, we will first filter for the participants who won any medal.
medals = olympic_df.dropna(subset=["Medal"])
Then we can create a simple histogram:
fig = px.histogram(medals, x="Age")
fig.update_layout(title="Distribution of medals per age", yaxis_title="Medals Count")
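The histogram counts medals, which is not quite the same as likelihood: an age with few participants will have few medals no matter how well those participants did. One way to get closer to "how likely" is to divide medal-winning rows by all rows at each age. A minimal sketch on made-up rows, where `won_medal` and `medal_share` are names I introduced:

```python
import pandas as pd

# Made-up stand-in rows; with the real data, start from olympic_df instead.
sample_df = pd.DataFrame({
    "Age": [25.0, 25.0, 25.0, 25.0, 70.0, 70.0, None],
    "Medal": ["Gold", None, None, "Bronze", None, "Silver", "Gold"],
})

# Keep rows with a known age and flag the ones that ended with a medal
with_age = sample_df.dropna(subset=["Age"]).copy()
with_age["won_medal"] = with_age["Medal"].notna()

# Per age: number of rows, and the fraction of those rows that won a medal
medal_rate = with_age.groupby("Age")["won_medal"].agg(entries="count", medal_share="mean")
print(medal_rate)
```

Keep an eye on the `entries` column when reading the result: a share computed from a handful of rows says very little.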
Looking at the histogram, I was surprised that a few athletes over 70 actually won a medal, and I was curious to know in which sports. Assuming you are too, here’s what I found out:
medals_above_70 = medals[medals["Age"] > 70]
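To see the sports behind that filter, you can print the relevant columns or count rows per sport. The sketch below runs on made-up rows with placeholder sport names, so it shows the mechanics only and not the actual result.

```python
import pandas as pd

# Made-up stand-in for the medals dataframe; names and sports are placeholders.
sample_medals = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [72.0, 73.0, 25.0, 71.0],
    "Sport": ["Sport A", "Sport A", "Sport B", "Sport C"],
    "Medal": ["Silver", "Bronze", "Gold", "Gold"],
})

medals_above_70 = sample_medals[sample_medals["Age"] > 70]

# Medal rows per sport among the over-70 medalists
sports_above_70 = medals_above_70["Sport"].value_counts()

print(medals_above_70[["Name", "Age", "Sport", "Medal"]])
print(sports_above_70)
```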
Let’s try to be more specific and plot the distribution of medals for each type of medal.
Just as we did before, we will get the ages of bronze, silver, and gold medalists and plot the distribution.
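# Drop missing ages so the density estimate is not fed NaN values
bronze_medals = medals[medals["Medal"] == 'Bronze']["Age"].dropna().tolist()
silver_medals = medals[medals["Medal"] == 'Silver']["Age"].dropna().tolist()
gold_medals = medals[medals["Medal"] == 'Gold']["Age"].dropna().tolist()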
fig = ff.create_distplot([bronze_medals, silver_medals, gold_medals],
                         ['Bronze', 'Silver', 'Gold'],
                         colors=['#CD7F32', '#C0C0C0', '#FFD700'])
fig.show()
As we can see, most medalists are around 25 regardless of the medal.
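A quick way to check the "around 25" reading without relying on the plot is to summarize age per medal type. A minimal sketch on made-up rows; the numbers are placeholders chosen for easy arithmetic, not dataset values.

```python
import pandas as pd

# Made-up stand-in for the medals dataframe.
sample_medals = pd.DataFrame({
    "Medal": ["Bronze", "Bronze", "Silver", "Silver", "Gold", "Gold", "Gold"],
    "Age": [24.0, 26.0, 25.0, None, 23.0, 27.0, 25.0],
})

# Count of known ages, median, and mean age for each medal type
age_by_medal = (
    sample_medals.groupby("Medal")["Age"]
    .agg(["count", "median", "mean"])
    .reindex(["Bronze", "Silver", "Gold"])  # podium order instead of alphabetical
)
print(age_by_medal)
```

If the three medians and means sit close together on the real data, that supports the conclusion drawn from the plot.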
How has the participation of women and men changed over the years?
To answer that, let’s first create a dataframe where each row contains the year, the number of female participants, and the number of male participants.
accumulate_M_F_df = olympic_df.groupby(['Sex', 'Year']).size().reset_index().pivot(columns='Sex', index='Year', values=0).reset_index()
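The same table can also be built with `unstack`, which some people find easier to read than the `reset_index().pivot(...)` chain. This sketch uses made-up rows and fills years with no entries for one sex with 0 instead of leaving them missing, which is a choice of mine and not something the original line does.

```python
import pandas as pd

# Made-up stand-in rows; with the real data, start from olympic_df instead.
sample_df = pd.DataFrame({
    "Year": [1896, 1896, 1900, 1900, 1900],
    "Sex": ["M", "M", "M", "F", "F"],
})

counts = (
    sample_df.groupby(["Year", "Sex"])
    .size()                  # rows per (Year, Sex) pair
    .unstack(fill_value=0)   # move Sex into columns: one column per sex
    .reset_index()
)
counts.columns.name = None   # drop the leftover "Sex" label on the column index
print(counts)
```

Note that `size()` counts rows, and each row in this dataset is one athlete in one event, so an athlete entered in several events is counted several times. Counting unique athletes would need something like `nunique` on the `ID` column instead.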
Now let’s plot this data:
from plotly.subplots import make_subplots

years = list(accumulate_M_F_df["Year"])
female_participation = list(accumulate_M_F_df["F"])
male_participation = list(accumulate_M_F_df["M"])

fig = make_subplots()
fig.add_trace(go.Scatter(x=years, y=female_participation, mode="lines+markers", name="Female"))
fig.add_trace(go.Scatter(x=years, y=male_participation, mode="lines+markers", name="Male"))
fig.show()
The reason for the spikes is that fewer athletes participate in the Winter Olympics, which have been held in different years from the Summer Olympics since 1994.
To see a clearer trend, we can filter out the Winter Olympics.
We will first have to create a dataframe that contains only the Summer Olympics:
summer_olympic_df = olympic_df[olympic_df['Season'] == 'Summer']
And then redefine accumulate_M_F_df using the summer-only dataframe:
accumulate_M_F_df = summer_olympic_df.groupby(['Sex', 'Year']).size().reset_index().pivot(columns='Sex', index='Year', values=0).reset_index()
Then repeat the same lines of code to plot the graph.
Although there are many more questions to answer, I will leave you to explore the data and ask your own questions from here.
If you like this format, let me know. I would love to explore more datasets together.