
Analyzing The Olympic Games


I have always enjoyed watching the Olympic Games. Seeing the best athletes in each sport compete with each other is inspiring, and it makes me wonder what my life would look like if I had pursued volleyball (my current favorite sport) instead of software engineering.

Since I stuck with software engineering, we are here today to ask some questions about the Olympic Games, using a dataset that covers 120 years of competitions.


Note that I will be using Python, mostly with the following packages: pandas, NumPy, and plotly.
If you want to experiment with the code I share, make sure you have installed these packages.

You can download the dataset I’ll be using from Kaggle.


Data Exploration

Before we start asking questions about our data, we have to understand what kind of data we have.

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
olympic_df = pd.read_csv('athlete_events.csv')

The first two commands I always use on a dataframe to better understand it are head and describe:

olympic_df.head()
olympic_df.describe()

From this alone, we can see that, on average, Olympic participants are around 25 years old, about 175 cm tall, and weigh about 70 kg.

We can also see some interesting facts: the youngest Olympic participant was 10 years old, and the oldest was 97.
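One caveat when reading these numbers: describe computes each statistic over the non-missing values only. The sketch below uses a small made-up dataframe (toy_df and its values are hypothetical, not taken from the dataset) to show one way of checking how much is missing before trusting the averages.

```python
import numpy as np
import pandas as pd

# Toy stand-in for olympic_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Age": [24.0, np.nan, 31.0, 19.0],
    "Height": [180.0, 175.0, np.nan, np.nan],
    "Weight": [80.0, 70.0, np.nan, 60.0],
})

# Number and share of missing values per column.
missing_counts = toy_df.isna().sum()
missing_share = toy_df.isna().mean()

print(missing_counts)
print(missing_share.round(2))

# The "count" row of describe() only counts the non-missing values.
print(toy_df.describe().loc["count"])
```

Running the same two isna lines on olympic_df should show how many ages, heights, and weights the averages above are actually based on.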


As I was working on this, I wanted to see the min, max, and mean statistics for each sport separately, rather than for all sports combined, to get a more detailed look at the data.

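sports_df_grouped = olympic_df.groupby("Sport")
# Aggregations are passed as strings; newer pandas versions warn when np.max, np.min, or np.mean are passed here
sports_df_agg = sports_df_grouped[["Age", "Height", "Weight"]].agg(["max", "min", "mean"]).reset_index()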

Looking at this table, we can see that Aeronautics is not a very popular sport: it had only one participant, from Switzerland (I actually googled it to verify).

Something that surprised me is how the max weight of archers compares to the max weight of tug-of-war participants, although the averages for the two sports do make sense.
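A sport with a single participant is easy to miss in a table of max, min, and mean values, because all three collapse to the same number. A minimal sketch, assuming we also want the number of rows behind each statistic; the toy data and the column names rows, max_weight, and mean_weight are my own choices for illustration:

```python
import pandas as pd

# Toy stand-in for olympic_df; the weights are made up for illustration.
toy_df = pd.DataFrame({
    "Sport": ["Archery", "Archery", "Archery", "Aeronautics"],
    "Weight": [60.0, 95.0, 70.0, 80.0],
})

# Named aggregation: each keyword becomes an output column.
# "count" counts the non-missing weights in each sport.
sport_stats = (
    toy_df.groupby("Sport")["Weight"]
    .agg(rows="count", max_weight="max", mean_weight="mean")
    .reset_index()
)

print(sport_stats)
```

Sorting the result by rows makes tiny groups stand out, and comparing max_weight with mean_weight side by side shows when a maximum is just a single outlier.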


Asking Questions

After briefly introducing ourselves to the data, let’s try to come up with some interesting questions to answer.

How are height, weight, and age distributed across different sports?

To answer this question, we can try a few approaches.

The first is to draw a scatter plot for all sports together:

fig = px.scatter(olympic_df, x="Weight", y="Height", color="Sport")
fig.update_layout(title="Distribution of weight and height in each sport")
fig.show()

In the above code, we could have added size="Age" to the scatter function arguments to see how age is distributed as well, but I thought the plot was already too crowded.

In the second approach, we can hand-pick a single or a few sports and investigate their weight, height, or age distribution.

I decided to consider 4 sports and look at their age distribution but feel free to tweak it as you wish.


For the next plot, we need to filter the 4 sports I selected into 4 separate dataframes:

basketball_df = olympic_df[olympic_df["Sport"] == "Basketball"]
football_df = olympic_df[olympic_df["Sport"] == "Football"]
volleyball_df = olympic_df[olympic_df["Sport"] == "Volleyball"]
beach_volleyball_df = olympic_df[olympic_df["Sport"] == "Beach Volleyball"]

Since we want to show the distribution of age, let’s get a list of ages for each sport:

basketball_age = list(basketball_df["Age"])
football_age = list(football_df["Age"])
volleyball_age = list(volleyball_df["Age"])
beach_volleyball_age = list(beach_volleyball_df["Age"])
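If the Age column contains missing values, they end up in these lists too, and they may interfere with the density curves drawn in the next step. A minimal sketch of building the same lists in a loop while dropping missing ages first; the toy dataframe is made up, and dropping the missing values is my assumption about what is wanted here:

```python
import numpy as np
import pandas as pd

# Toy stand-in for olympic_df; the ages are made up for illustration.
toy_df = pd.DataFrame({
    "Sport": ["Basketball", "Basketball", "Football",
              "Volleyball", "Beach Volleyball", "Football"],
    "Age": [25.0, np.nan, 22.0, 27.0, 30.0, np.nan],
})

labels = ["Basketball", "Football", "Volleyball", "Beach Volleyball"]

# One list of known ages per sport, in the same order as `labels`.
hist_age_data = [
    toy_df.loc[toy_df["Sport"] == sport, "Age"].dropna().tolist()
    for sport in labels
]

print(hist_age_data)
```

The loop also makes it easy to add or swap sports: only the labels list needs to change.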

Lastly, let’s plot the distribution:

hist_age_data = [basketball_age, football_age, volleyball_age, beach_volleyball_age]

labels = ['Basketball', 'Football', 'Volleyball', 'Beach Volleyball']
colors = ['#393E46', '#2BCDC1', '#F66095', 'orange']

fig = ff.create_distplot(hist_age_data, labels, colors=colors, bin_size=1.2)
fig.show()

How likely is it to win a medal at 70?

To answer that, we will first filter for the participants who won any medal.

medals = olympic_df.dropna(subset=["Medal"])

Then we can create a simple histogram:

fig = px.histogram(medals, x="Age")
fig.update_layout(title="Distribution of medals per age", yaxis_title="Medals Count")
fig.show()

Looking at the histogram, I was surprised that a few 70-year-olds actually won a medal, and I was curious to know in which sports. Assuming you are too, here is how I found out:

medals_above_70 = medals[medals["Age"] > 70]
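From this filtered dataframe, two follow-up questions are easy to answer: in which sports those medals were won, and what share of entries by athletes over 70 ended with a medal. The sketch below runs on made-up data (the placeholder sport names and all values are hypothetical) and only illustrates the pandas calls, not the real result.

```python
import numpy as np
import pandas as pd

# Toy stand-in for olympic_df; sports and values are made up for illustration.
toy_df = pd.DataFrame({
    "Age": [72.0, 24.0, 73.0, 71.0, 75.0, 25.0, 80.0, 76.0],
    "Sport": ["Sport A", "Sport B", "Sport A", "Sport C",
              "Sport A", "Sport B", "Sport C", "Sport A"],
    "Medal": ["Silver", "Gold", "Bronze", "Gold",
              np.nan, np.nan, np.nan, np.nan],
})

# Same filtering steps as in the post, applied to the toy data.
toy_medals = toy_df.dropna(subset=["Medal"])
toy_medals_above_70 = toy_medals[toy_medals["Age"] > 70]

# In which sports were those medals won?
medals_by_sport = toy_medals_above_70["Sport"].value_counts()

# What share of all rows with Age > 70 ended with a medal?
rows_above_70 = toy_df[toy_df["Age"] > 70]
medal_rate_above_70 = rows_above_70["Medal"].notna().mean()

print(medals_by_sport)
print(medal_rate_above_70)
```

The second number is closer to the "how likely" in the question than the histogram is, since the histogram only counts medals and does not account for how many people of each age competed.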

Let’s try to be more specific and plot the distribution of medals for each type of medal.
Just as we did before, we will get the ages of bronze, silver, and gold medalists and plot the distribution.

bronze_medals = list(medals[medals["Medal"] == 'Bronze']["Age"])
silver_medals = list(medals[medals["Medal"] == 'Silver']["Age"])
gold_medals = list(medals[medals["Medal"] == 'Gold']["Age"])
fig = ff.create_distplot([bronze_medals, silver_medals, gold_medals],
['Bronze', 'Silver', 'Gold'],
colors=['#CD7F32', '#C0C0C0', '#FFD700'],
bin_size=5)
fig.show()

As we can see, most medalists are around 25 regardless of the medal.
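To put a number on "around 25", one option is to summarize age per medal type directly. A minimal sketch on made-up data (the ages below are hypothetical and chosen only to make the output easy to follow):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the medals dataframe; the ages are made up.
toy_medals = pd.DataFrame({
    "Medal": ["Gold", "Gold", "Silver", "Silver", "Bronze", "Bronze", "Gold"],
    "Age": [24.0, 26.0, 23.0, 27.0, 25.0, np.nan, 28.0],
})

# Median, mean, and number of known ages per medal type.
age_by_medal = toy_medals.groupby("Medal")["Age"].agg(["median", "mean", "count"])

print(age_by_medal)
```

The median is less sensitive than the mean to the handful of very old medalists we saw earlier, which is why both are shown.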


How has the participation of women and men changed over the years?

To answer that, let’s first create a dataframe where each row contains the year, the number of female participants, and the number of male participants.

accumulate_M_F_df = olympic_df.groupby(['Sex', 'Year']).size().reset_index().pivot(columns='Sex', index='Year', values=0).reset_index()
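Note that size counts rows. As far as I can tell, the dataset has one row per athlete per event and an ID column identifying the athlete, so someone who competes in three events is counted three times. Under that assumption, a sketch of counting unique athletes instead, on made-up data:

```python
import pandas as pd

# Toy stand-in for olympic_df: athletes 1 and 3 each appear in two events.
toy_df = pd.DataFrame({
    "ID": [1, 1, 2, 3, 3, 4],
    "Sex": ["F", "F", "M", "M", "M", "F"],
    "Year": [2000, 2000, 2000, 2004, 2004, 2004],
})

# Rows (athlete-event entries) per year and sex, as in the post.
rows_per_year = toy_df.groupby(["Year", "Sex"]).size().unstack(fill_value=0)

# Unique athletes per year and sex.
athletes_per_year = (
    toy_df.groupby(["Year", "Sex"])["ID"].nunique().unstack(fill_value=0)
)

print(rows_per_year)
print(athletes_per_year)
```

Both results have one row per year and one column per sex, so either can be plotted the same way after reset_index. Which one is the right notion of "participation" depends on the question being asked.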

Now let’s plot this data

years = list(accumulate_M_F_df["Year"])
female_participation = list(accumulate_M_F_df["F"])
male_participation = list(accumulate_M_F_df["M"])

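# make_subplots lives in plotly.subplots and was not imported earlier
from plotly.subplots import make_subplots
fig = make_subplots()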

fig.add_trace(go.Scatter(x=years, y=female_participation, mode="lines+markers", name='Female'))

fig.add_trace(go.Scatter(x=years, y=male_participation, mode="lines+markers", name="Male"))

fig.show()

The reason for the spikes is that fewer athletes participate in the Winter Olympics, which since 1994 have been held in different years from the Summer Games.
To see a clearer trend, we can filter out the Winter Olympics.

We will first have to create a dataframe that contains only the Summer Olympics:

summer_olympic_df = olympic_df[olympic_df['Season'] == 'Summer']

And then redefine accumulate_M_F_df:

accumulate_M_F_df = summer_olympic_df.groupby(['Sex', 'Year']).size().reset_index().pivot(columns='Sex', index='Year', values=0).reset_index()

Then repeat the same plotting code from above to draw the graph.
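Raw counts grow for both sexes as the Games get bigger, so the share of women per year can be a clearer view of the trend. A minimal sketch, assuming a dataframe shaped like accumulate_M_F_df with Year, F, and M columns; the counts are made up, and the female_share column name is my own:

```python
import numpy as np
import pandas as pd

# Toy stand-in for accumulate_M_F_df; the counts are made up.
# A year with no rows for one sex shows up as NaN after the pivot.
counts = pd.DataFrame({
    "Year": [1896, 1960, 2016],
    "F": [np.nan, 2, 9],
    "M": [10, 8, 11],
})

# Treat missing counts as zero before computing the share.
counts[["F", "M"]] = counts[["F", "M"]].fillna(0)
counts["female_share"] = counts["F"] / (counts["F"] + counts["M"])

print(counts)
```

Plotting female_share against Year with a single go.Scatter trace gives one line to read instead of two.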


Although there are many more questions to answer, I will leave you to explore them on your own from here.

If you like this format, let me know. I would love to explore more datasets together.



