December 12, 2018 · Python

What's the future of the pandas library?

pandas is a powerful, open source Python library for data analysis, manipulation, and visualization. I've been teaching data scientists to use pandas since 2014, and in the years since, it has grown in popularity to an estimated 5 to 10 million users and become a "must-use" tool in the Python data science toolkit.

I started using pandas around version 0.14.0, and I've followed the library as it has significantly matured to its current version, 0.23.4. But over the years, numerous data scientists have asked me variations of the same questions: Why is pandas still "pre-1.0"? Is it stable enough to rely on? When will version 1.0 be released?

Version numbers can be used to signal the maturity of a product, and so I understand why someone might be hesitant to rely on "pre-1.0" software. But in the world of open source, version numbers don't necessarily tell you anything about the maturity or reliability of a library. (Yes, pandas is both mature and reliable!) Rather, version numbers communicate the stability of the API.

In particular, version 1.0 signals to the user: "We've figured out what the API should look like, and so API-breaking changes will only occur with major releases (2.0, 3.0, etc.)" In other words, version 1.0 marks the point at which your code should never break just by upgrading to the next minor release.
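To make that contract concrete, here's a hypothetical requirements file sketch (the specifiers below are illustrative only, not pinning advice):

# Pre-1.0: a minor release (0.24, 0.25, ...) is allowed to break the API,
# so cautious users pin an exact version
pandas==0.23.4

# Post-1.0: under this versioning scheme, any 1.x release should be API-compatible
pandas>=1.0,<2.0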

So the question remains: What's coming in pandas 1.0, and when is it coming?

Towards pandas 1.0

I recently watched a talk from PyData London called Towards pandas 1.0, given by pandas core developer Marc Garcia. It was an enlightening talk about the future of pandas, and so I wanted to highlight and comment on a few of the items that were mentioned:

Method chaining πŸ‘

The pandas core team now encourages the use of "method chaining". This is a style of programming in which you chain together multiple method calls into a single statement, passing intermediate results from one method to the next rather than storing them in intermediate variables.

Here's the example Marc used that does not use method chaining:

import pandas
df = pandas.read_csv('data/titanic.csv.gz')
# Remove outliers: keep passengers below the 99th percentile of Age
df = df[df.Age < df.Age.quantile(.99)]
# Fill missing ages with the median age
df['Age'].fillna(df.Age.median(), inplace=True)
# Bin Age into three labeled categories
df['Age'] = pandas.cut(df['Age'],
                       bins=[df.Age.min(), 18, 40, df.Age.max()],
                       labels=['Underage', 'Young', 'Experienced'])
# Encode Sex numerically: female=1, male=0
df['Sex'] = df['Sex'].replace({'female': 1, 'male': 0})
# Share of female passengers in each Age/Pclass group
df = df.pivot_table(values='Sex', columns='Pclass', index='Age', aggfunc='mean')
df = df.rename_axis('', axis='columns')
df = df.rename('Class {}'.format, axis='columns')
df.style.format('{:.2%}')

Here is the equivalent code that uses method chaining:

import pandas
(pandas.read_csv('data/titanic.csv.gz')
       # Remove outliers: keep passengers below the 99th percentile of Age
       .query('Age < Age.quantile(.99)')
       # Encode Sex numerically, and fill then bin Age, in a single step
       .assign(Sex=lambda df: df['Sex'].replace({'female': 1, 'male': 0}),
               Age=lambda df: pandas.cut(df['Age'].fillna(df.Age.median()),
                                         bins=[df.Age.min(), 18, 40, df.Age.max()],
                                         labels=['Underage', 'Young', 'Experienced']))
       # Share of female passengers in each Age/Pclass group
       .pivot_table(values='Sex', columns='Pclass', index='Age', aggfunc='mean')
       .rename_axis('', axis='columns')
       .rename('Class {}'.format, axis='columns')
       .style.format('{:.2%}'))

In the talk, Marc walked through the core team's primary reasons for preferring method chains, and I can see the appeal of the style, though it does come with tradeoffs.

Tom Augspurger, another pandas core developer, also noted:

"One drawback to excessively long chains is that debugging can be harder. If something looks wrong at the end, you don't have intermediate values to inspect."

To be clear, method chaining has always been available in pandas, but support for chaining has increased through the addition of new "chain-able" methods. For example, the query() method (used in the chain above) was previously tagged as "experimental" in the documentation, which is why I haven't been using it or teaching it. That tag was removed in pandas 0.23, which may indicate that the core team is now encouraging the use of query().

I don't think you will ever be required to use method chains, but I suspect the documentation will eventually migrate to that style.

For a longer discussion of this topic, see Tom Augspurger's Method Chaining post, which was part 2 of his Modern pandas series.

inplace πŸ‘Ž

The pandas core team discourages the use of the inplace parameter, and eventually it will be deprecated (which means "scheduled for removal from the library"). Here's why: despite its name, inplace rarely avoids making a copy of the underlying data, so it usually doesn't deliver the memory savings you might expect, and because inplace methods return None, it also rules out method chaining.

Personally, I'm a fan of inplace and I happen to prefer writing df.reset_index(inplace=True) instead of df = df.reset_index(), for example. That being said, lots of beginners do get confused by inplace, and it's nice to have one clear way to do things in pandas, so ultimately I'd be fine with deprecation.
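To make the two styles concrete, here's a minimal sketch of the same operation written both ways (using a throwaway DataFrame):

import pandas

df = pandas.DataFrame({'A': [1, 2, 3]})

# Style 1: inplace=True modifies df and returns None (so it can't be chained)
df.reset_index(inplace=True)

# Style 2 (what the core team recommends): return a new DataFrame and rebind the name
# df = df.reset_index()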

If you'd like to learn more about how memory is managed in pandas, I recommend watching this 5-minute section of Marc's talk.

Apache Arrow πŸ‘

Apache Arrow is a "work in progress" that is intended to become the pandas back-end. Arrow was started in 2015 by Wes McKinney, the creator of pandas, to resolve many of the underlying limitations of the pandas DataFrame (as well as similar data structures in other languages).

The goal of Arrow is to create an open standard for representing tabular data that natively supports complex data formats and is highly optimized for performance. Although Arrow was inspired by pandas, it's designed to be a shared computational infrastructure for data science work across multiple languages.

Because Arrow is an infrastructure layer, its eventual use as the pandas back-end (likely coming after pandas 1.0) will ideally be transparent to pandas end users. However, it should result in much better performance as well as support for working with "larger-than-RAM" datasets in pandas.
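You can already experiment with Arrow from Python today via the pyarrow package. Here's a minimal sketch of a round trip between a pandas DataFrame and an Arrow Table (assuming pyarrow is installed):

import pandas
import pyarrow

df = pandas.DataFrame({'Age': [22, 38, 26],
                       'Sex': ['male', 'female', 'female']})

# Convert the DataFrame into Arrow's columnar Table format
table = pyarrow.Table.from_pandas(df)

# Convert back to pandas
df2 = table.to_pandas()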

For more details about Arrow, I recommend reading Wes McKinney's 2017 blog post, Apache Arrow and the "10 Things I Hate About pandas", as well as watching his talk (with slides) from SciPy 2018. For details about how Arrow will be integrated into pandas, I recommend watching Jeff Reback's talk (with slides) from PyData NYC 2017.

Extension Arrays πŸ‘

Extension Arrays allow you to create custom data types for use with pandas. The documentation provides a nice summary:

Pandas now supports storing array-like objects that aren’t necessarily 1-D NumPy arrays as columns in a DataFrame or values in a Series. This allows third-party libraries to implement extensions to NumPy’s types, similar to how pandas implemented categoricals, datetimes with timezones, periods, and intervals.

In other words, previously the pandas team had to write a lot of custom code to implement data types that were not natively supported by NumPy (such as categoricals). With the release of Extension Arrays, there is now a generalized interface for creating custom types that anyone can use.

The pandas team has already used this interface to write an integer data type that supports missing values, also known as "NA" or "NaN" values. Previously, integer columns would be converted to floats if you marked any values as missing. The development documentation indicates that the "Integer NA" type will be available in the next release (version 0.24).
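Based on the development documentation, here's a minimal sketch of the difference (this assumes pandas 0.24 or later for the "Int64" dtype):

import pandas

# Today: introducing a missing value forces the integers to become floats
s = pandas.Series([1, 2, None])
print(s.dtype)  # float64

# With the new nullable integer type, the values remain integers
s = pandas.Series([1, 2, None], dtype='Int64')
print(s.dtype)  # Int64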

Another compelling use for this interface would be a native string type, since strings in pandas are currently represented using NumPy's "object" data type. The fletcher library has already used the interface to enable a native string type in pandas, though the pandas team may eventually build its own string type directly into pandas.
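You can verify the current behavior yourself: pandas stores strings using the generic "object" dtype, not a dedicated string type:

import pandas

s = pandas.Series(['female', 'male'])
print(s.dtype)  # object (boxed Python strings, not a native string type)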

For a deeper look into this topic, the "Extending pandas" section of the pandas documentation is a good place to start.

Other deprecations πŸ‘Ž

A few other planned deprecations were also discussed in the talk.

Roadmap

The talk also laid out a rough roadmap to pandas 1.0.

More details about the roadmap are available in the pandas sprint notes from July 2018, though all of these plans are subject to change.

Learning pandas?

If you're new to pandas, I recommend watching my video tutorial series, Easier data analysis in Python with pandas.

If you're an intermediate pandas user, I recommend watching my tutorial from PyCon 2019, Data science best practices with pandas.

Let me know your thoughts or questions in the comments section below! There is also a discussion of this post on Reddit.
