
Iterating a DataFrame in Python Pandas

Which way is the best in terms of performance?


Pandas is one of the most popular Python packages for handling data.
If you are a data analyst, a data scientist, or in any other role that requires you to manipulate data, you probably use Pandas daily.

Assuming that GitHub stars and forks model somewhat accurately how popular a package is, let’s check how pandas is doing compared to some of its alternatives:

  • Pandas — 30.7k Stars, 12.9k Forks
  • Spark — 30.6k Stars, 24.3k Forks
  • Dask — 8.7k Stars, 1.3k Forks
  • Polars — 2.2k Stars, 120 Forks
  • Vaex — 6.5k Stars, 505 Forks

So yeah, pandas is pretty popular, we can move on now.


Creating a DataFrame is quite easy; the following lines are enough:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

Now let’s say we actually don’t need these two columns; we just care about their product.

The most naive solution, the one that comes to mind for anyone not yet trained in DataFrames, is to iterate over each row and multiply the column values.

For this solution, let’s get acquainted with iterrows().
From the docs: “Iterate over DataFrame rows as (index, Series) pairs.”
So at each iteration, we get both the index of the row and the row itself.

def df_iterrows(df):
    result = []
    for index, row in df.iterrows():
        result.append(row['col1'] * row['col2'])
    return result
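To make the (index, Series) pairs concrete, here is a minimal sketch that inspects what iterrows() yields on the small DataFrame from above. The variable names (pairs, first_index, first_row) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# Each iteration yields a tuple: (index label, row as a pandas Series).
pairs = list(df.iterrows())
first_index, first_row = pairs[0]

print(first_index)       # index label of the first row
print(type(first_row))   # every row is materialized as its own Series
print(first_row['col1'] * first_row['col2'])
```

Building a new Series object for every row is a plausible reason for the overhead of this approach, although the exact cost depends on the pandas version and the data.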

That’s not the only way, of course, and in terms of performance it is actually the worst one.


DataFrame.apply

A second approach would be to apply a function over the rows of the dataframe.
In this case, we won’t have to iterate over the rows ourselves, which is already nicer.

def df_apply(df):
    func_to_apply = lambda row: row['col1'] * row['col2']
    return df.apply(func_to_apply, axis=1)

(Yes, you could’ve done it in one line but I find this version more readable)
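As a quick illustration of what apply() hands back, the sketch below runs the same row-wise function on the small example DataFrame. The name products is just a placeholder.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# axis=1 passes each row (as a Series) to the function;
# the default axis=0 would pass each column instead.
products = df.apply(lambda row: row['col1'] * row['col2'], axis=1)

print(type(products).__name__)  # Series
print(products.tolist())        # [4, 10, 18]
```

Note that the result is a Series that keeps the DataFrame's index, whereas the iterrows() version returns a plain Python list.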


List Comprehension

I would argue that list comprehension is one of the best features in Python, and it helps us in our case as well:

def list_comp(df):
    return [x * y for x, y in zip(df['col1'], df['col2'])]
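A short usage sketch on the example DataFrame follows. Wrapping the list back into a Series is an optional extra step that is not part of the approach itself, and products_series is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# zip pairs up the values of the two columns; no row Series is built per iteration.
products = [x * y for x, y in zip(df['col1'], df['col2'])]
print(products)  # [4, 10, 18]

# Optional: wrap the list in a Series that shares the DataFrame's index.
products_series = pd.Series(products, index=df.index)
```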

Vectorization

Vectorization is the term for converting a scalar program to a vector program. A vectorized program can apply a single instruction to multiple data elements at once, whereas a scalar program operates on one pair of operands at a time.

All the approaches we discussed so far are scalar: each multiplication handles just two operands.

Instead of multiplying two elements, one from each column, row by row, we are going to treat all the values of ‘col1’ and ‘col2’ as vectors and perform the operation on them at once.

def vectorize(df):
    return df['col1'] * df['col2']

The next and last approach I am going to introduce is vectorization with a small change: we will convert each Series object to a NumPy array first.

def vectorize_numpy(df):
    return df['col1'].to_numpy() * df['col2'].to_numpy()
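The two vectorized variants compute the same values but return different types. The sketch below compares them on the example DataFrame; as_series and as_array are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

as_series = df['col1'] * df['col2']                        # pandas Series, keeps the index
as_array = df['col1'].to_numpy() * df['col2'].to_numpy()   # plain NumPy array, no index

print(type(as_series).__name__)  # Series
print(type(as_array).__name__)   # ndarray
print(np.array_equal(as_series.to_numpy(), as_array))  # True
```

If the index matters later, for example when assigning the result back as a new column, the Series version carries it along; the NumPy version skips that bookkeeping.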

Performance Test

All that remains now is to check how each of these approaches performs.

The Pandas documentation states:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided

So we won’t be very surprised to see the iteration approach as the bottom performer.

[Figure: performance comparison of the approaches; both axes in log scale]
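For readers who want to reproduce a comparison like this, here is a minimal timing sketch built on the standard-library timeit module. The row counts, the random data, and the best-of-three scheme are assumptions chosen to keep the run short; they are not necessarily the setup behind the figure above.

```python
import timeit

import numpy as np
import pandas as pd


def df_iterrows(df):
    result = []
    for index, row in df.iterrows():
        result.append(row['col1'] * row['col2'])
    return result


def df_apply(df):
    return df.apply(lambda row: row['col1'] * row['col2'], axis=1)


def list_comp(df):
    return [x * y for x, y in zip(df['col1'], df['col2'])]


def vectorize(df):
    return df['col1'] * df['col2']


def vectorize_numpy(df):
    return df['col1'].to_numpy() * df['col2'].to_numpy()


approaches = [df_iterrows, df_apply, list_comp, vectorize, vectorize_numpy]

# Assumed sizes and repeat count, chosen only to keep the run short.
rng = np.random.default_rng(0)
timings = {}
for n_rows in (100, 1_000, 10_000):
    df = pd.DataFrame({
        'col1': rng.integers(0, 100, size=n_rows),
        'col2': rng.integers(0, 100, size=n_rows),
    })
    for func in approaches:
        # Best of 3 single runs, in seconds.
        best = min(timeit.repeat(lambda: func(df), number=1, repeat=3))
        timings[(func.__name__, n_rows)] = best
        print(f'{func.__name__:>16} {n_rows:>6} rows: {best:.6f} s')
```

Absolute numbers will vary with hardware and library versions, so it is the relative ordering across approaches that is worth comparing.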

Conclusion

As suggested in the Pandas documentation, you should try and avoid iterating over a dataframe.
Always try to look for a vectorized solution, and in case you have a function that can’t work on the full dataframe, use apply() instead of iterating.
