# Analyzing The Olympic Games

I have always enjoyed watching the Olympic Games. Watching the best athletes in each sport compete is inspiring, and it makes me wonder what my life would look like if I had pursued volleyball (my current favorite sport) instead of software engineering.

Since that’s not the case, and I stuck with software engineering, we are here today to ask some questions about the Olympic Games using a dataset that covers 120 years of competitions.

Note that I will be using Python, mostly with the following packages: pandas, NumPy, and plotly.
So if you want to mess with the code I’ll share, make sure you’ve installed these packages.

### Data Exploration

Before we start asking questions about our data, we have to understand what kind of data we have.

`import pandas as pd`
`import numpy as np`
`import plotly.express as px`
`import plotly.graph_objects as go`
`import plotly.figure_factory as ff`
`olympic_df = pd.read_csv('athlete_events.csv')`

The first two commands I always run on a dataframe to better understand it are `head` and `describe`.

`olympic_df.head()`
`olympic_df.describe()`

From this alone, we can understand that on average, Olympic participants are around 25 years old, with an average height of 175cm and weight of 70kg.

We can also see some interesting facts, such as that the youngest Olympic participant was 10 years old and the oldest was 97.
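One thing worth checking at this stage is how many values are missing, because `describe` computes its statistics from the non-missing values only. Here is a minimal sketch on a small made-up dataframe (the rows are invented for illustration; only the column names follow the dataset):

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic a few columns of athlete_events.csv
toy_df = pd.DataFrame({
    "Age": [24.0, np.nan, 31.0, 19.0],
    "Height": [180.0, 175.0, np.nan, np.nan],
    "Medal": ["Gold", None, None, "Bronze"],
})

# Number of missing values per column
missing = toy_df.isna().sum()
print(missing)

# describe() reports how many non-missing values it actually used
age_count = toy_df["Age"].describe()["count"]
print(age_count)
```

Running `olympic_df.isna().sum()` on the real data should show which columns need care before plotting.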

While working on this, I wanted to see the min, max, and mean for each sport separately, rather than for all sports combined, to get a more detailed look at the data.

`sports_df_grouped = olympic_df.groupby("Sport")`
`sports_df_agg = sports_df_grouped[["Age", "Height", "Weight"]].agg(["max", "min", "mean"]).reset_index()`

Looking at this table, we can see that Aeronautics is not a very popular sport — there was only one participant from Switzerland (I actually googled to verify).

Something that surprised me is the max weight of archers compared to the max weight of tug of war participants, but we can see that it does make sense on average.
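To compare sports on a single measurement, it can help to aggregate one column and sort by the mean. This is only a sketch on invented numbers (the weights below are not taken from the dataset):

```python
import pandas as pd

# Invented weights, only to show the shape of the result
toy_df = pd.DataFrame({
    "Sport": ["Archery", "Archery", "Tug-Of-War", "Tug-Of-War"],
    "Weight": [60.0, 90.0, 85.0, 105.0],
})

# One row per sport, heaviest average first
weight_stats = (
    toy_df.groupby("Sport")["Weight"]
    .agg(["min", "max", "mean"])
    .sort_values("mean", ascending=False)
)
print(weight_stats)
```

The same chain should work on `olympic_df` by swapping in the real dataframe.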

After briefly introducing ourselves to the data, let’s try to come up with some interesting questions to answer.

#### How are height, weight, and age distributed across different sports?

To answer this question, we can try a few approaches.

The first one would be plotting a scatterplot for all sports together:

`fig = px.scatter(olympic_df, x="Weight", y="Height", color="Sport")`
`fig.update_layout(title="Distribution of weight and height in each sport")`
`fig.show()`

In the code above, we could have added `size="Age"` to the `px.scatter` arguments to see how age is distributed as well, but I thought the plot was already too crowded.

In the second approach, we can hand-pick a single or a few sports and investigate their weight, height, or age distribution.

I decided to consider 4 sports and look at their age distribution, but feel free to tweak it as you wish.

For the next plot, we will need to filter the 4 sports I selected into 4 different dataframes:

`basketball_df = olympic_df[olympic_df["Sport"] == "Basketball"]`
`football_df = olympic_df[olympic_df["Sport"] == "Football"]`
`volleyball_df = olympic_df[olympic_df["Sport"] == "Volleyball"]`
`beach_volleyball_df = olympic_df[olympic_df["Sport"] == "Beach Volleyball"]`

Now, since we want to show the distribution of age, let’s get a list of ages for each sport, dropping missing ages so they don’t get in the way of the density estimate:

`basketball_age = basketball_df["Age"].dropna().tolist()`
`football_age = football_df["Age"].dropna().tolist()`
`volleyball_age = volleyball_df["Age"].dropna().tolist()`
`beach_volleyball_age = beach_volleyball_df["Age"].dropna().tolist()`
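If you plan to swap sports in and out, the filtering and list-building steps can be folded into one loop. This is just one possible sketch; `selected_sports` and `ages_by_sport` are names I made up, and the rows are invented:

```python
import numpy as np
import pandas as pd

# Invented rows, including one missing age and one sport we did not select
toy_df = pd.DataFrame({
    "Sport": ["Basketball", "Football", "Volleyball",
              "Beach Volleyball", "Basketball", "Judo"],
    "Age": [25.0, 22.0, np.nan, 30.0, 28.0, 24.0],
})

selected_sports = ["Basketball", "Football", "Volleyball", "Beach Volleyball"]

# Map each selected sport to its list of known ages
ages_by_sport = {
    sport: toy_df.loc[toy_df["Sport"] == sport, "Age"].dropna().tolist()
    for sport in selected_sports
}
print(ages_by_sport)
```

With the real data, `list(ages_by_sport.values())` and `list(ages_by_sport.keys())` could then serve as the data and the labels for the plot.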

Lastly, let’s plot the distribution:

```python
# One list of ages per sport, in the same order as the labels
hist_age_data = [basketball_age, football_age, volleyball_age, beach_volleyball_age]

labels = ['Basketball', 'Football', 'Volleyball', 'Beach Volleyball']
colors = ['#393E46', '#2BCDC1', '#F66095', 'orange']

# Histogram plus density curve for each sport
fig = ff.create_distplot(hist_age_data, labels, colors=colors, bin_size=1.2)
fig.show()
```

#### How likely is it to win a medal at 70?

To answer that, we will first filter for the participants who won any medal.

`medals = olympic_df.dropna(subset=["Medal"])`

Then we can create a simple histogram:

`fig = px.histogram(medals, x="Age")`
`fig.update_layout(title="Distribution of medals per age", yaxis_title="Medals Count")`
`fig.show()`

Looking at the histogram, I was surprised that a few 70-year-olds actually won a medal, and I was curious to know in which sports. Assuming that you are too, here’s what I found out:

`medals_above_70 = medals[medals["Age"] > 70]`
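To see which sports those medals belong to, one option is to count the rows per sport. A small sketch on invented rows (the sport names and ages are placeholders, not results from the dataset):

```python
import pandas as pd

# Placeholder rows; real sports and ages will differ
toy_medals = pd.DataFrame({
    "Sport": ["Sport A", "Sport A", "Sport B", "Sport C"],
    "Age": [73.0, 72.0, 71.0, 34.0],
    "Medal": ["Silver", "Bronze", "Gold", "Gold"],
})

# Same filter as above, then count medals per sport
toy_above_70 = toy_medals[toy_medals["Age"] > 70]
sports_above_70 = toy_above_70["Sport"].value_counts()
print(sports_above_70)
```

On the real data, `medals_above_70["Sport"].value_counts()` should give the same kind of table.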

Let’s try to be more specific and plot the distribution of medals for each type of medal.
Just as we did before, we will get the ages of bronze, silver, and gold medalists and plot the distribution.

`bronze_medals = medals[medals["Medal"] == 'Bronze']["Age"].dropna().tolist()`
`silver_medals = medals[medals["Medal"] == 'Silver']["Age"].dropna().tolist()`
`gold_medals = medals[medals["Medal"] == 'Gold']["Age"].dropna().tolist()`
`fig = ff.create_distplot([bronze_medals, silver_medals, gold_medals], ['Bronze', 'Silver', 'Gold'], colors=['#CD7F32', '#C0C0C0', '#FFD700'], bin_size=5)`
`fig.show()`

As we can see, most medalists are around 25 regardless of the medal.
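Coming back to the question in the heading: medal counts alone don’t say how likely a medal is at a given age, because far more 25-year-olds than 70-year-olds compete. A rough way to estimate it is to divide medal-winning rows by all rows at each age. Here is a minimal sketch on invented rows (the `won_medal` column is a helper I added):

```python
import pandas as pd

# Invented rows with the Age and Medal columns of the dataset
toy_df = pd.DataFrame({
    "Age": [25, 25, 25, 25, 70, 70, 70, 70, 70],
    "Medal": ["Gold", None, "Bronze", None, None, None, None, None, "Silver"],
})

# True for rows that won any medal
toy_df["won_medal"] = toy_df["Medal"].notna()

# Share of rows at each age that ended with a medal
medal_rate = toy_df.groupby("Age")["won_medal"].mean()
print(medal_rate)
```

With very few participants at an age, the rate is noisy, so treat it as a rough indication only.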

#### How has the participation rate of women and men changed over the years?

To answer that, let’s first create a dataframe where each row contains the year, the number of female participants, and the number of male participants.

`accumulate_M_F_df = olympic_df.groupby(['Sex', 'Year']).size().reset_index().pivot(columns='Sex', index='Year', values=0).reset_index()`

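Now let’s plot this data:

```python
from plotly.subplots import make_subplots

years = list(accumulate_M_F_df["Year"])
female_participation = list(accumulate_M_F_df["F"])
male_participation = list(accumulate_M_F_df["M"])

fig = make_subplots()
# One line per sex so the two trends can be compared directly
fig.add_trace(go.Scatter(x=years, y=female_participation, name="Female"))
fig.add_trace(go.Scatter(x=years, y=male_participation, name="Male"))
fig.update_layout(title="Participation of women and men over the years")
fig.show()
```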

The reason for the spikes is that fewer athletes participate in the Winter Olympics, and since 1994 the Winter and Summer Games have been held in different years.
To see a clearer trend, we can filter out the Winter Olympics.

We will first have to create a dataframe that contains only the Summer Olympics:

`summer_olympic_df = olympic_df[olympic_df['Season'] == 'Summer']`

And then redefine `accumulate_M_F_df`:

`accumulate_M_F_df = summer_olympic_df.groupby(['Sex', 'Year']).size().reset_index().pivot(columns='Sex', index='Year', values=0).reset_index()`

Then repeat the same lines of code to plot the graph.
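Since the same counting step runs twice, it could be wrapped in a small function. This is only a sketch: `participants_by_sex` is a name I made up, and it uses `unstack` instead of the `pivot` call above, which should give the same table with zeros instead of missing values. The toy rows are invented:

```python
import pandas as pd

def participants_by_sex(df):
    """Return one row per year with the number of F and M rows."""
    counts = df.groupby(["Year", "Sex"]).size().unstack("Sex", fill_value=0)
    return counts.reset_index()

# Invented rows covering one summer and one winter edition
toy_df = pd.DataFrame({
    "Year": [1992, 1992, 1992, 1994, 1994],
    "Sex": ["M", "M", "F", "F", "M"],
    "Season": ["Summer", "Summer", "Summer", "Winter", "Winter"],
})

all_games = participants_by_sex(toy_df)
summer_only = participants_by_sex(toy_df[toy_df["Season"] == "Summer"])
print(all_games)
print(summer_only)
```

Note that this counts rows, and I believe the dataset has one row per athlete per event, so an athlete who entered several events is counted several times.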

Although there are many more questions to answer, I will leave those for you to explore on your own.

If you like this format, let me know. I would love to explore more datasets together.

