December 12, 2018 · Python

What's the future of the pandas library?

pandas is a powerful, open source Python library for data analysis, manipulation, and visualization. I've been teaching data scientists to use pandas since 2014, and in the years since, it has grown in popularity to an estimated 5 to 10 million users and become a "must-use" tool in the Python data science toolkit.

I started using pandas around version 0.14.0, and I've followed the library as it has significantly matured to its current version, 0.23.4. But over the years, numerous data scientists have asked me variations of the same questions: Why is pandas still "pre-1.0"? Is it stable enough to rely on? When will version 1.0 be released?

Version numbers can be used to signal the maturity of a product, and so I understand why someone might be hesitant to rely on "pre-1.0" software. But in the world of open source, version numbers don't necessarily tell you anything about the maturity or reliability of a library. (Yes, pandas is both mature and reliable!) Rather, version numbers communicate the stability of the API.

In particular, version 1.0 signals to the user: "We've figured out what the API should look like, and so API-breaking changes will only occur with major releases (2.0, 3.0, etc.)" In other words, version 1.0 marks the point at which your code should never break just by upgrading to the next minor release.
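To make that contract concrete, here's a hypothetical requirements file sketch (the specifiers below are illustrative only, not pinning advice):

# Pre-1.0: a minor release (0.24, 0.25, ...) is allowed to break the API,
# so cautious users pin an exact version
pandas==0.23.4

# Post-1.0: under this versioning scheme, any 1.x release should be API-compatible
pandas>=1.0,<2.0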

So the question remains: What's coming in pandas 1.0, and when is it coming?

Towards pandas 1.0

I recently watched a talk from PyData London called Towards pandas 1.0, given by pandas core developer Marc Garcia. It was an enlightening talk about the future of pandas, and so I wanted to highlight and comment on a few of the items that were mentioned:

Method chaining πŸ‘

The pandas core team now encourages the use of "method chaining". This is a style of programming in which you chain together multiple method calls into a single statement, passing intermediate results from one method to the next rather than storing them in intermediate variables.

Here's the example Marc used that does not use method chaining:

import pandas
df = pandas.read_csv('data/titanic.csv.gz')
# Remove outliers: keep passengers below the 99th percentile of Age
df = df[df.Age < df.Age.quantile(.99)]
# Fill missing ages with the median age
df['Age'].fillna(df.Age.median(), inplace=True)
# Bin Age into three labeled categories
df['Age'] = pandas.cut(df['Age'],
                       bins=[df.Age.min(), 18, 40, df.Age.max()],
                       labels=['Underage', 'Young', 'Experienced'])
# Encode Sex numerically: female=1, male=0
df['Sex'] = df['Sex'].replace({'female': 1, 'male': 0})
# Share of female passengers in each Age/Pclass group
df = df.pivot_table(values='Sex', columns='Pclass', index='Age', aggfunc='mean')
df = df.rename_axis('', axis='columns')
df = df.rename('Class {}'.format, axis='columns')
df.style.format('{:.2%}')

Here is the equivalent code that uses method chaining:

import pandas
(pandas.read_csv('data/titanic.csv.gz')
       # Remove outliers: keep passengers below the 99th percentile of Age
       .query('Age < Age.quantile(.99)')
       # Encode Sex numerically, and fill then bin Age, in a single step
       .assign(Sex=lambda df: df['Sex'].replace({'female': 1, 'male': 0}),
               Age=lambda df: pandas.cut(df['Age'].fillna(df.Age.median()),
                                         bins=[df.Age.min(), 18, 40, df.Age.max()],
                                         labels=['Underage', 'Young', 'Experienced']))
       # Share of female passengers in each Age/Pclass group
       .pivot_table(values='Sex', columns='Pclass', index='Age', aggfunc='mean')
       .rename_axis('', axis='columns')
       .rename('Class {}'.format, axis='columns')
       .style.format('{:.2%}'))

In the talk, Marc walked through the core team's primary reasons for preferring method chains, and I can see the appeal of the style, though it does come with tradeoffs.

Tom Augspurger, another pandas core developer, also noted:

"One drawback to excessively long chains is that debugging can be harder. If something looks wrong at the end, you don't have intermediate values to inspect."

To be clear, method chaining has always been available in pandas, but support for chaining has increased through the addition of new "chain-able" methods. For example, the query() method (used in the chain above) was previously tagged as "experimental" in the documentation, which is why I haven't been using it or teaching it. That tag was removed in pandas 0.23, which may indicate that the core team is now encouraging the use of query().

I don't think you will ever be required to use method chains, but I suspect the documentation will eventually migrate to that style.

For a longer discussion of this topic, see Tom Augspurger's Method Chaining post, which was part 2 of his Modern pandas series.

inplace πŸ‘Ž

The pandas core team discourages the use of the inplace parameter, and eventually it will be deprecated (which means "scheduled for removal from the library"). Here's why: despite its name, inplace rarely avoids making a copy of the underlying data, so it usually doesn't deliver the memory savings you might expect, and because inplace methods return None, it also rules out method chaining.

Personally, I'm a fan of inplace and I happen to prefer writing df.reset_index(inplace=True) instead of df = df.reset_index(), for example. That being said, lots of beginners do get confused by inplace, and it's nice to have one clear way to do things in pandas, so ultimately I'd be fine with deprecation.
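To make the two styles concrete, here's a minimal sketch of the same operation written both ways (using a throwaway DataFrame):

import pandas

df = pandas.DataFrame({'A': [1, 2, 3]})

# Style 1: inplace=True modifies df and returns None (so it can't be chained)
df.reset_index(inplace=True)

# Style 2 (what the core team recommends): return a new DataFrame and rebind the name
# df = df.reset_index()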

If you'd like to learn more about how memory is managed in pandas, I recommend watching this 5-minute section of Marc's talk.

Apache Arrow πŸ‘

Apache Arrow is a "work in progress" that is intended to become the pandas back-end. Arrow was started in 2015 by Wes McKinney, the creator of pandas, to resolve many of the underlying limitations of the pandas DataFrame (as well as similar data structures in other languages).

The goal of Arrow is to create an open standard for representing tabular data that natively supports complex data formats and is highly optimized for performance. Although Arrow was inspired by pandas, it's designed to be a shared computational infrastructure for data science work across multiple languages.

Because Arrow is an infrastructure layer, its eventual use as the pandas back-end (likely coming after pandas 1.0) will ideally be transparent to pandas end users. However, it should result in much better performance as well as support for working with "larger-than-RAM" datasets in pandas.
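You can already experiment with Arrow from Python today via the pyarrow package. Here's a minimal sketch of a round trip between a pandas DataFrame and an Arrow Table (assuming pyarrow is installed):

import pandas
import pyarrow

df = pandas.DataFrame({'Age': [22, 38, 26],
                       'Sex': ['male', 'female', 'female']})

# Convert the DataFrame into Arrow's columnar Table format
table = pyarrow.Table.from_pandas(df)

# Convert back to pandas
df2 = table.to_pandas()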

For more details about Arrow, I recommend reading Wes McKinney's 2017 blog post, Apache Arrow and the "10 Things I Hate About pandas", as well as watching his talk (with slides) from SciPy 2018. For details about how Arrow will be integrated into pandas, I recommend watching Jeff Reback's talk (with slides) from PyData NYC 2017.

Extension Arrays πŸ‘

Extension Arrays allow you to create custom data types for use with pandas. The documentation provides a nice summary:

Pandas now supports storing array-like objects that aren’t necessarily 1-D NumPy arrays as columns in a DataFrame or values in a Series. This allows third-party libraries to implement extensions to NumPy’s types, similar to how pandas implemented categoricals, datetimes with timezones, periods, and intervals.

In other words, previously the pandas team had to write a lot of custom code to implement data types that were not natively supported by NumPy (such as categoricals). With the release of Extension Arrays, there is now a generalized interface for creating custom types that anyone can use.

The pandas team has already used this interface to write an integer data type that supports missing values, also known as "NA" or "NaN" values. Previously, integer columns would be converted to floats if you marked any values as missing. The development documentation indicates that the "Integer NA" type will be available in the next release (version 0.24).
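Based on the development documentation, here's a minimal sketch of the difference (this assumes pandas 0.24 or later for the "Int64" dtype):

import pandas

# Today: introducing a missing value forces the integers to become floats
s = pandas.Series([1, 2, None])
print(s.dtype)  # float64

# With the new nullable integer type, the values remain integers
s = pandas.Series([1, 2, None], dtype='Int64')
print(s.dtype)  # Int64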

Another compelling use for this interface would be a native string type, since strings in pandas are currently represented using NumPy's "object" data type. The fletcher library has already used the interface to enable a native string type in pandas, though the pandas team may eventually build its own string type directly into pandas.
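You can verify the current behavior yourself: pandas stores strings using the generic "object" dtype, not a dedicated string type:

import pandas

s = pandas.Series(['female', 'male'])
print(s.dtype)  # object (boxed Python strings, not a native string type)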

For a deeper look into this topic, the "Extending pandas" section of the pandas documentation is a good place to start.

Other deprecations πŸ‘Ž

A few other planned deprecations were also discussed in the talk.

Roadmap

The talk also laid out a rough roadmap to pandas 1.0.

More details about the roadmap are available in the pandas sprint notes from July 2018, though all of these plans are subject to change.

Learning pandas?

If you're new to pandas, I recommend watching my video tutorial series, Easier data analysis in Python with pandas.

If you're an intermediate pandas user, I recommend watching my tutorial from PyCon 2019, Data science best practices with pandas.

Let me know your thoughts or questions in the comments section below! There is also a discussion of this post on Reddit.
