Should you use "dot notation" or "bracket notation" with pandas?
If you've ever used the pandas library in Python, you probably know that there are two ways to select a Series (meaning a column) from a DataFrame:
# dot notation
df.col_name
# bracket notation
df['col_name']
Which method should you use? I'll make the case for each, and then you can decide...
Why use bracket notation?
The case for bracket notation is simple: It always works.
Here are the specific cases in which you must use bracket notation, because dot notation would fail:
# column name includes a space
df['col name']
# column name matches a DataFrame method
df['count']
# column name matches a Python keyword
df['class']
# column name is stored in a variable
var = 'col_name'
df[var]
# column name is an integer
df[0]
# new column is created through assignment
df['new'] = 0
In other words, bracket notation always works, whereas dot notation only works under certain circumstances. That's a pretty compelling case for bracket notation!
As stated in the Zen of Python:
There should be one-- and preferably only one --obvious way to do it.
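The failure cases listed above can be checked on a small frame. This is only a sketch: the DataFrame and its values are made up for illustration, and the column names are chosen to trigger each case.

```python
import keyword
import pandas as pd

# Hypothetical data; each column name triggers one of the failure cases.
df = pd.DataFrame({
    'col name': [1, 2, 3],
    'count': [4, 5, 6],
    'class': [7, 8, 9],
    0: [10, 11, 12],
})

# Space in the name: "df.col name" is a syntax error, so only brackets work.
with_space = df['col name']

# Name matches a DataFrame method: the dot returns the method, not the column.
dot_count_is_method = callable(df.count)
count_column = df['count']

# Python keyword: "df.class" is a syntax error.
class_is_keyword = keyword.iskeyword('class')
class_column = df['class']

# Name stored in a variable: "df.var" would not use the variable's value
# (it resolves to the DataFrame.var method instead).
var = 'count'
from_variable = df[var]

# Integer name: "df.0" is a syntax error.
int_column = df[0]

# New column: bracket assignment adds it to the DataFrame.
df['new'] = 0
```

Every bracket lookup here returns the intended column, while the dot-notation equivalents noted in the comments either fail to parse or return something else.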
Why use dot notation?
If you've watched any of my pandas videos, you may have noticed that I use dot notation. Here are four reasons why:
Reason 1: Dot notation is easier to type
Dot notation takes three fewer characters to type than bracket notation: one period instead of two brackets and two quotation marks. And in terms of finger movement, typing a single period is much more convenient than typing brackets and quotes.
This might sound like a trivial reason, but if you're selecting columns dozens (or hundreds) of times a day, it makes a real difference!
Reason 2: Dot notation is easier to read
Most of my pandas code is made up of chains of selections and methods. By using dot notation, my code is mostly adorned with periods and parentheses (plus an occasional quotation mark):
# dot notation
df.col_one.sum()
df.col_one.isna().sum()
df.groupby('col_two').col_one.sum()
If you instead use bracket notation, your code is adorned with periods and parentheses plus lots of brackets and quotation marks:
# bracket notation
df['col_one'].sum()
df['col_one'].isna().sum()
df.groupby('col_two')['col_one'].sum()
I find the dot notation code easier to read, as well as more aesthetically pleasing.
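The difference between the two styles is purely visual. As a quick check, here is a sketch with a made-up DataFrame (the values are assumptions for illustration) showing that both sets of chains return the same results:

```python
import pandas as pd

# Hypothetical data with one missing value in col_one.
df = pd.DataFrame({
    'col_one': [1.0, None, 3.0, 4.0],
    'col_two': ['a', 'a', 'b', 'b'],
})

# Same three chains, written both ways.
dot_total = df.col_one.sum()
bracket_total = df['col_one'].sum()

dot_missing = df.col_one.isna().sum()
bracket_missing = df['col_one'].isna().sum()

dot_grouped = df.groupby('col_two').col_one.sum()
bracket_grouped = df.groupby('col_two')['col_one'].sum()

# True when every pair of results matches.
same_results = bool(
    dot_total == bracket_total
    and dot_missing == bracket_missing
    and dot_grouped.equals(bracket_grouped)
)
```

So the choice here comes down to readability, not behavior.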
Reason 3: Dot notation is easier to remember
With dot notation, every component in a chain is separated by a period on both sides. For example, this line of code has 4 components, and thus there are 3 periods separating the individual components:
# dot notation
df.groupby('col_two').col_one.sum()
If you instead use bracket notation, some of your components are separated by periods, and some are not:
# bracket notation
df.groupby('col_two')['col_one'].sum()
With bracket notation, I often forget whether there's supposed to be a period before ['col_one'], after ['col_one'], or both before and after ['col_one'].
With dot notation, it's easier for me to remember the correct syntax.
Reason 4: Dot notation limits the usage of brackets
Brackets can be used for many purposes in pandas:
df[['col_one', 'col_two']]
df.iloc[4, 2]
df.loc['row_label', 'col_one':'col_three']
df.col_one['row_label']
df[(df.col_one > 5) & (df.col_two == 'value')]
If you also use bracket notation for Series selection, you end up with even more brackets in your code:
df['col_one']['row_label']
df[(df['col_one'] > 5) & (df['col_two'] == 'value')]
As you use more brackets, each bracket becomes slightly more ambiguous as to its purpose, imposing a higher mental burden on the person reading the code. By using dot notation for Series selection, you reduce bracket usage to only the essential cases.
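To make the different roles of brackets concrete, here is a sketch that runs each kind of selection on a small, made-up DataFrame. The data, the row labels other than 'row_label', and the positions passed to iloc are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame(
    {
        'col_one': [3, 6, 9],
        'col_two': ['value', 'value', 'other'],
        'col_three': [0.1, 0.2, 0.3],
    },
    index=['row_label', 'row_b', 'row_c'],
)

# A list inside brackets selects several columns and returns a DataFrame.
two_columns = df[['col_one', 'col_two']]

# iloc selects by integer position: second row, first column.
by_position = df.iloc[1, 0]

# loc selects by label; label slices include both endpoints.
by_label = df.loc['row_label', 'col_one':'col_three']

# Brackets on a Series select by row label.
single_value = df.col_one['row_label']

# A boolean mask inside brackets filters rows.
filtered_dot = df[(df.col_one > 5) & (df.col_two == 'value')]
filtered_bracket = df[(df['col_one'] > 5) & (df['col_two'] == 'value')]
```

The two filtered results are identical; the dot version simply leaves the outer brackets as the only brackets on the line.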
Conclusion
If you prefer bracket notation, then you can use it all of the time! However, you still have to be familiar with dot notation in order to read other people's code.
If you prefer dot notation, then you can use it most of the time, as long as you are diligent about renaming columns when they contain spaces or collide with DataFrame methods. However, you still have to use bracket notation when creating new columns.
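Here is one way that renaming habit might look in practice. It is a sketch, not a recommendation: the data and the replacement name 'n_items' are hypothetical.

```python
import pandas as pd

# Hypothetical data: one name has a space, the other matches DataFrame.count.
df = pd.DataFrame({'col name': [1, 2], 'count': [3, 4]})

# Replace spaces with underscores, then rename the column that collides with a method.
df.columns = df.columns.str.replace(' ', '_', regex=False)
df = df.rename(columns={'count': 'n_items'})  # 'n_items' is a made-up name

# Both columns are now reachable with dot notation.
renamed_sum = df.col_name.sum() + df.n_items.sum()

# Dot assignment to a new name sets a plain Python attribute; no column is created.
df.new = 0
created_by_dot = 'new' in df.columns

# Bracket assignment creates the column.
df['new'] = 0
created_by_bracket = 'new' in df.columns
```

After this runs, created_by_dot is False and created_by_bracket is True, which is why bracket notation is still needed for new columns.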
Which do you prefer? Let me know in the comments below!
When selecting a Series (meaning a column) from a #pandas DataFrame, do you generally use "dot notation" or "bracket notation"?
➡️ dot notation: df.col_name
➡️ bracket notation: df['col_name']
#Python #DataScience
— Kevin Markham (@justmarkham) September 13, 2019
Addendum
There were some thoughtful comments about this issue on Twitter, mostly in favor of bracket notation:
Dot notation is a strict subset of the brackets. The brackets are also the canonical way to "select subsets of data" from all objects in python. strings, tuples, lists, dictionaries, numpy arrays all use brackets to select subsets of data. https://t.co/AUMwSl0Wmn
— Ted Petrou (@TedPetrou) September 13, 2019
Bracket notation for the clarity spaces allow, for the ability to use f-strings in column references and for the syntax highlighting.
I've never seen any point in dot notation.
— SupineCabbage (@SublimeKarnage) September 13, 2019
I like the dot notation because tab-completion is usually available and I'm lazy, but in certain cases using it is not wise or not possible and I end up with inconsistent notation, so I switched to using brackets everywhere.
— Naïve Bayesian (@naivebayesian) September 13, 2019