100 pandas tricks to save you time and energy
Below you'll find 100 tricks that will save you time and energy every time you use pandas! These are the best tricks I've learned from 5 years of teaching the pandas library.
"Soooo many nifty little tips that will make my life so much easier!" - C.K.
"Kevin, these tips are so practical. I can say without hesitation that you provide the best resources for pandas I have ever used." - N.W.
P.S. You can also watch a video of my top 25 tricks! 🐼🤹
Categories
- Reading files
- Reading from the web
- Creating example DataFrames
- Creating columns
- Renaming columns
- Selecting rows and columns
- Filtering rows by condition
- Manipulating strings
- Working with data types
- Encoding data
- Extracting data from lists
- Working with time series data
- Handling missing values
- Using aggregation functions
- Using cumulative functions
- Random sampling
- Merging DataFrames
- Styling DataFrames
- Exploring a dataset
- Handling warnings
- Other
Reading files
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 19, 2019
5 useful "read_csv" parameters that are often overlooked:
➡️ names: specify column names
➡️ usecols: which columns to keep
➡️ dtype: specify data types
➡️ nrows: # of rows to read
➡️ na_values: strings to recognize as NaN
#Python #DataScience #pandastricks
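Here is a minimal sketch of those five parameters used together. The CSV text, the column names, and the "?" missing-value marker are all made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text with no header row; "?" marks a missing value
csv_text = "1,apple,3.5,x\n2,banana,?,y\n3,cherry,1.25,z\n4,date,8.0,w\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    names=["id", "fruit", "price", "code"],  # supply the column names
    usecols=["id", "fruit", "price"],        # keep only these columns
    dtype={"id": "int32"},                   # set a data type up front
    nrows=3,                                 # read only the first 3 rows
    na_values=["?"],                         # treat "?" as NaN
)
print(df)
```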
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) September 3, 2019
⚠️ Got bad data (or empty rows) at the top of your CSV file? Use these read_csv parameters:
➡️ header = row number of header (start counting at 0)
➡️ skiprows = list of row numbers to skip
See example 👇 pic.twitter.com/t1M6XkkPYG
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) June 21, 2019
Two easy ways to reduce DataFrame memory usage:
1. Only read in columns you need
2. Use 'category' data type with categorical data.
Example:
df = pd.read_csv('file.csv', usecols=['A', 'C', 'D'], dtype={'D':'category'})
#Python #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 4, 2019
You can read directly from a compressed file:
df = pd.read_csv('https://t.co/3JAwA8h7FJ')
Or write to a compressed file:
df.to_csv('https://t.co/3JAwA8h7FJ')
Also supported: .gz, .bz2, .xz
#Python #pandas #pandastricks
🐼🤹‍♂️ pandas trick #99:
- Kevin Markham (@justmarkham) December 18, 2019
Do you sometimes end up with an "Unnamed: 0" column in your DataFrame?
Solution: Set the first column as the index (when reading)
Alternative: Don't save the index to the file (when writing)
See example 👇 pic.twitter.com/WuUJb7fMPZ
#Python #DataScience #pandas #pandastricks
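In case the screenshot is not available, here is a small sketch of both fixes. It round-trips a made-up DataFrame through an in-memory CSV string instead of a real file.

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Writing with the default index=True stores the index as an unnamed first column
csv_with_index = df.to_csv()

# Reading it back naively produces the "Unnamed: 0" column
naive = pd.read_csv(io.StringIO(csv_with_index))

# Fix when reading: treat the first column as the index
fixed_on_read = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Fix when writing: leave the index out of the file
fixed_on_write = pd.read_csv(io.StringIO(df.to_csv(index=False)))
```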
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) June 20, 2019
Are your dataset rows spread across multiple files, but you need a single DataFrame?
Solution:
1. Use glob() to list your files
2. Use a generator expression to read files and concat() to combine them
3. 🥳
See example 👇 pic.twitter.com/qtKpzEoSC3
#Python #DataScience #pandastricks
🐼🤹‍♂️ pandas trick #78:
- Kevin Markham (@justmarkham) October 10, 2019
Do you need to build a DataFrame from multiple files, but also keep track of which row came from which file?
1. List files w/ glob()
2. Read files w/ gen expression, create new column w/ assign(), combine w/ concat()
See example 👇 pic.twitter.com/kXgXw69pSW
#Python #pandastricks
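The two multi-file tricks above can be sketched roughly like this. The file names (part_1.csv, part_2.csv) and the "source" column are hypothetical, and the files are written to a temporary directory only so that the example runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small CSV files to stand in for a dataset split across files
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"x": [1, 2]}).to_csv(os.path.join(tmp_dir, "part_1.csv"), index=False)
pd.DataFrame({"x": [3]}).to_csv(os.path.join(tmp_dir, "part_2.csv"), index=False)

# 1. List the files (sorted, because glob does not guarantee an order)
files = sorted(glob.glob(os.path.join(tmp_dir, "part_*.csv")))

# 2. Read each file, tag its rows with the file name, and combine
combined = pd.concat(
    (pd.read_csv(f).assign(source=os.path.basename(f)) for f in files),
    ignore_index=True,
)
print(combined)
```

Drop the assign() call if you don't need to know which file each row came from.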
🐼🤹‍♂️ pandas trick #100!
- Kevin Markham (@justmarkham) December 19, 2019
Want to read a HUGE dataset into pandas but don't have enough memory?
Randomly sample the dataset *during file reading* by passing a function to "skiprows"
See example 👇 pic.twitter.com/FOPxURbNgc
Thanks to @TedPetrou for this trick!
#Python #DataScience #pandastricks
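A rough sketch of the idea, assuming a made-up 1,000-row CSV held in memory and a 10% sampling rate. The function passed to skiprows receives a row number and returns True for rows to skip, so row 0 (the header) is always kept.

```python
import io
import random

import pandas as pd

# Hypothetical "huge" file: one header line plus 1,000 data rows
csv_text = "value\n" + "\n".join(str(i) for i in range(1000)) + "\n"

rng = random.Random(0)  # seeded so that the sample is reproducible

# Keep the header, and keep each data row with probability 0.1
sample = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda i: i > 0 and rng.random() > 0.1,
)
print(len(sample))
```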
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 15, 2019
Need to quickly get data from Excel or Google Sheets into pandas?
1. Copy data to clipboard
2. df = pd.read_clipboard()
3. 🥳
See example 👇 pic.twitter.com/M2Yw0NAXRe
Learn 25 more tips & tricks: https://t.co/6akbxXG6SI
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick #71:
- Kevin Markham (@justmarkham) September 30, 2019
Want to extract tables from a PDF into a DataFrame? Try tabula-py!
from tabula import read_pdf
df = read_pdf('test.pdf', pages='all')
Documentation: https://t.co/geQh9u4AEr
Thanks for the trick @Netchose!
#Python #DataScience #pandas #pandastricks
Reading from the web
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) September 9, 2019
Want to read a JSON file from the web? Use read_json() to read it directly from a URL into a DataFrame!
See example 👇 pic.twitter.com/gei6eeudiq
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick #68:
- Kevin Markham (@justmarkham) September 18, 2019
Want to scrape a web page? Try read_html()!
Definitely worth trying before bringing out a more complex tool (Beautiful Soup, Selenium, etc.)
See example 👇 pic.twitter.com/sPKrea9wk1
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick #74:
- Kevin Markham (@justmarkham) October 3, 2019
Are you scraping a webpage using read_html(), but it returns too many tables?
Use the 'match' parameter to find tables that contain a particular string! 🧶
See example 👇 pic.twitter.com/4Ocbv6H3r7
Thanks to @JrMontana08 for the trick!
#Python #DataScience #pandastricks
Creating example DataFrames
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) June 28, 2019
Need to create an example DataFrame? Here are 3 easy options:
pd.DataFrame({'col_one':[10, 20], 'col_two':[30, 40]})
pd.DataFrame(np.random.rand(2, 3), columns=list('abc'))
pd.util.testing.makeMixedDataFrame()
See output 👇 pic.twitter.com/SSlZsd6OEj
#Python #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 10, 2019
Need to create a DataFrame for testing?
pd.util.testing.makeDataFrame() ➡️ contains random values
.makeMissingDataFrame() ➡️ some values missing
.makeTimeDataFrame() ➡️ has DatetimeIndex
.makeMixedDataFrame() ➡️ mixed data types
#Python #pandas #pandastricks
🐼🤹‍♂️ pandas trick #91:
- Kevin Markham (@justmarkham) November 22, 2019
Need to create a time series dataset for testing? Use pd.util.testing.makeTimeDataFrame()
Need more control over the columns & data? Generate data with np.random & overwrite index with makeDateIndex()
See example 👇 pic.twitter.com/fLrNWf1tsa
#Python #DataScience #pandastricks
Creating columns
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) September 17, 2019
Want to create new columns (or overwrite existing columns) within a method chain? Use "assign"!
See example 👇 pic.twitter.com/y0wEfbz0VA
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) September 16, 2019
Need to create a bunch of new columns based on existing columns? Use this pattern:
for col in df.columns:
    df[f'{col}_new'] = df[col].apply(my_function)
See example 👇 pic.twitter.com/7qvKn9UypE
Thanks to @pmbaumgartner for this trick!
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick #73:
- Kevin Markham (@justmarkham) October 2, 2019
Need to remove a column from a DataFrame and store it as a separate Series? Use "pop"! 🍾
See example 👇 pic.twitter.com/R45OEMbWVm
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick #90:
- Kevin Markham (@justmarkham) November 21, 2019
Want to insert a new column into a DataFrame at a specific location? Use the "insert" method:
df.insert(location, name, value)
See example 👇 pic.twitter.com/zmPvdLq7jG
P.S. You can find the other 89 tricks here: https://t.co/TflgUtl6zD
#Python #DataScience #pandas #pandastricks
Renaming columns
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 16, 2019
3 ways to rename columns:
1. Most flexible option:
df = df.rename({'A':'a', 'B':'b'}, axis='columns')
2. Overwrite all column names:
df.columns = ['a', 'b']
3. Apply string method:
df.columns = df.columns.str.lower()
#Python #DataScience #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) June 11, 2019
Add a prefix to all of your column names:
df.add_prefix('X_')
Add a suffix to all of your column names:
df.add_suffix('_Y')
#Python #DataScience
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) June 25, 2019
Need to rename all of your columns in the same way? Use a string method:
Replace spaces with _:
df.columns = df.columns.str.replace(' ', '_')
Make lowercase & remove trailing whitespace:
df.columns = df.columns.str.lower().str.rstrip()
#Python #pandastricks
Selecting rows and columns
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) September 13, 2019
You can use f-strings (Python 3.6+) when selecting a Series from a DataFrame!
See example 👇 pic.twitter.com/8qHEXiGBaB
#Python #DataScience #pandas #pandastricks @python_tip
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 3, 2019
Need to select multiple rows/columns? "loc" is usually the solution:
select a slice (inclusive):
df.loc[0:4, 'col_A':'col_D']
select a list:
df.loc[[0, 3], ['col_A', 'col_C']]
select by condition:
df.loc[df.col_A=='val', 'col_D']
#Python #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 1, 2019
"loc" selects by label, and "iloc" selects by position.
But what if you need to select by label *and* position? You can still use loc or iloc!
See example 👇 pic.twitter.com/SpFkjWYEE0
P.S. Don't use "ix", it has been deprecated since 2017.
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick #82:
- Kevin Markham (@justmarkham) November 7, 2019
Want to select from a DataFrame by label *and* position?
Most readable approach is to chain "loc" (selection by label) and "iloc" (selection by position).
See example 👇 pic.twitter.com/FCbkmaG6uD
Thanks to @Dean_La for this trick!
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) June 12, 2019
Reverse column order in a DataFrame:
df.loc[:, ::-1]
Reverse row order:
df.loc[::-1]
Reverse row order and reset the index:
df.loc[::-1].reset_index(drop=True)
Want more #pandastricks? Working on a video right now, stay tuned... 🎥
#Python #DataScience
🐼🤹‍♂️ pandas trick #80:
- Kevin Markham (@justmarkham) November 5, 2019
Want to select multiple slices of columns from a DataFrame?
1. Use df.loc to select & pd.concat to combine
2. Slice df.columns & select using brackets
3. Use np.r_ to combine slices & df.iloc to select
See example 👇 pic.twitter.com/IhbYbgpLKk
#Python #DataScience #pandastricks
Filtering rows by condition
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) June 13, 2019
Filter DataFrame by multiple OR conditions:
df[(df.color == 'red') | (df.color == 'green') | (df.color == 'blue')]
Shorter way:
df[df.color.isin(['red', 'green', 'blue'])]
Invert the filter:
df[~df.color.isin(['red', 'green', 'blue'])]
#Python #pandastricks
🐼🤹‍♂️ pandas tricks is back!
- Kevin Markham (@justmarkham) November 4, 2019
Want to know the *count* of rows that match a condition?
(condition).sum()
Want to know the *percentage* of rows that match a condition?
(condition).mean()
See example 👇 pic.twitter.com/COqZy4EB2S
#Python #DataScience #pandas #pandastricks
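A tiny sketch with a made-up "color" column. It works because True counts as 1 and False as 0, so mean() returns the matching share as a fraction between 0 and 1.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

is_red = df["color"] == "red"   # boolean Series, one value per row

count_red = is_red.sum()        # number of rows that match
share_red = is_red.mean()       # fraction of rows that match
print(count_red, share_red)
```

Multiply the result of mean() by 100 if you want an actual percentage.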
🐼🤹‍♂️ pandas trick #76:
- Kevin Markham (@justmarkham) October 7, 2019
Want to filter a DataFrame to only include the largest categories?
1. Save the value_counts() output
2. Get the index of its head()
3. Use that index with isin() to filter the DataFrame
See example 👇 pic.twitter.com/plzO4qesDH
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick #77:
- Kevin Markham (@justmarkham) October 9, 2019
Want to combine the smaller categories in a Series into a single category called "Other"?
1. Save the index of the largest values of value_counts()
2. Use where() to replace all other values with "Other"
See example 👇 pic.twitter.com/FPxtuzwll4
#Python #DataScience #pandastricks
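Here is one way those two steps might look, using a made-up Series of genres and keeping the two largest categories. The same "top" index also works with isin() for the filtering trick above.

```python
import pandas as pd

genres = pd.Series(
    ["drama", "drama", "drama", "comedy", "comedy", "horror", "western"]
)

# 1. Index of the two most frequent values (value_counts sorts descending)
top = genres.value_counts().head(2).index

# 2. Keep values that are in "top", replace everything else with "Other"
combined = genres.where(genres.isin(top), other="Other")
print(combined.value_counts())
```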
🐼🤹‍♂️ pandas trick #93:
- Kevin Markham (@justmarkham) December 10, 2019
Want to combine the small categories in a Series (<10% frequency) into a single category?
1. Save the normalized value counts
2. Filter by frequency & save the index
3. Replace small categories with "Other"
See example 👇 pic.twitter.com/z6w1x8s6qg
#Python #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 28, 2019
Are you trying to filter a DataFrame using lots of criteria? It can be hard to write and to read!
Instead, save the criteria as objects and use them to filter. Or, use reduce() to combine the criteria!
See example 👇 pic.twitter.com/U9NV27RIjQ
#Python #DataScience #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 25, 2019
Want to filter a DataFrame that doesn't have a name?
Use the query() method to avoid creating an intermediate variable!
See example 👇 pic.twitter.com/NyUOOSr7Sc
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 13, 2019
Need to refer to a local variable within a query() string? Just prefix it with the @ symbol!
See example 👇 pic.twitter.com/PfXcASWDdC
#Python #DataScience #pandas #pandastricks
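A short sketch of query() with a local variable. The "name" and "score" columns and the threshold value are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "score": [50, 75, 90]})

threshold = 70  # ordinary local variable

# Column names are used directly; local variables get an @ prefix
passed = df.query("score > @threshold")
print(passed)
```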
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 30, 2019
If you want to use query() on a column name containing a space, just surround it with backticks! (New in pandas 0.25)
See example 👇 pic.twitter.com/M5ZSRVr3no
#Python #DataScience #pandas #pandastricks
Manipulating strings
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 22, 2019
Want to concatenate two string columns?
Option 1: Use a string method 🧶
Option 2: Use plus signs ➕
See example 👇 pic.twitter.com/SsjBAMqkxB
Which option do you prefer, and why?
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 9, 2019
Need to split a string into multiple columns? Use the str.split() method, set expand=True to return a DataFrame, and assign the result to the original DataFrame.
See example 👇 pic.twitter.com/wZ4okQZ9Dy
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick #89:
- Kevin Markham (@justmarkham) November 20, 2019
Need to split names of variable length into first_name & last_name?
1. Use str.split(n=1) to split only once (returns a Series of lists)
2. Chain str[0] and str[1] on the end to select the list elements
See example 👇 pic.twitter.com/fkikdaLkus
#Python #DataScience #pandastricks
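A minimal sketch of those two steps with made-up names. Because the split happens only once, everything after the first space ends up in last_name.

```python
import pandas as pd

df = pd.DataFrame({"name": ["John Arthur Doe", "Jane Ann Smith"]})

# 1. Split on whitespace only once: each row becomes a two-item list
parts = df["name"].str.split(n=1)

# 2. Select the list elements with str[0] and str[1]
df["first_name"] = parts.str[0]
df["last_name"] = parts.str[1]
print(df)
```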
🐼🤹‍♂️ pandas trick #75:
- Kevin Markham (@justmarkham) October 4, 2019
Need to count the number of words in a Series? Just use a string method to count the spaces and add 1!
See example 👇 pic.twitter.com/U6quTmrvNT
#Python #DataScience #pandas #pandastricks
Working with data types
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) June 17, 2019
Numbers stored as strings? Try astype():
df.astype({'col1':'int', 'col2':'float'})
But it will fail if you have any invalid input. Better way:
df.apply(pd.to_numeric, errors='coerce')
Converts invalid input to NaN
#Python #pandastricks
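To make the difference concrete, here is a small sketch in which a made-up "-" placeholder stands in for the invalid input. astype() would raise an error on it, while to_numeric with errors='coerce' turns it into NaN.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["1", "2", "3"], "col2": ["4.5", "-", "6.1"]})

# Extra keyword arguments to apply() are passed along to pd.to_numeric
converted = df.apply(pd.to_numeric, errors="coerce")
print(converted.dtypes)
print(converted)
```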
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) June 14, 2019
Select columns by data type:
df.select_dtypes(include='number')
df.select_dtypes(include=['number', 'category', 'object'])
df.select_dtypes(exclude=['datetime', 'timedelta'])
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick #94:
- Kevin Markham (@justmarkham) December 11, 2019
Want to save a *massive* amount of memory? Fix your data types:
➡️ 'int8' for small integers
➡️ 'category' for strings with few unique values
➡️ 'Sparse' if most values are 0 or NaN
More info: https://t.co/yEJnaWnGfj by @itamarst pic.twitter.com/jiBrkldFCt
#Python #pandastricks
🐼🤹‍♂️ pandas trick #81:
- Kevin Markham (@justmarkham) November 6, 2019
Does your object column contain mixed data types? Use df.col.apply(type).value_counts() to check!
See example 👇 pic.twitter.com/56gD5lqB4J
Thanks to @chris1610 for inspiring this trick! Read more: https://t.co/N2vcNWFJ8t
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick #92:
- Kevin Markham (@justmarkham) December 9, 2019
Need to clean an object column with mixed data types? Use "replace" (not str.replace) and regex!
See example 👇 pic.twitter.com/qMV17MNvr3
P.S. Not sure when to use "replace" versus "str.replace"? Read this: https://t.co/GF9l1IRzzi
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 8, 2019
Two useful properties of ordered categories:
1️⃣ You can sort the values in logical (not alphabetical) order
2️⃣ Comparison operators also work logically
See example 👇 pic.twitter.com/HeYZ3P3gPP
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick #83:
- Kevin Markham (@justmarkham) November 8, 2019
Problem: Your dataset has many columns and you want to ensure the correct data types
Solution:
1. Create CSV of column names & dtypes
2. Read it into a DF
3. Convert it to dict
4. Use dict to specify dtypes of dataset
Example 👇 pic.twitter.com/10DeKtc6wj
#Python #pandastricks
Encoding data
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 2, 2019
Need to convert a column from continuous to categorical? Use cut():
df['age_groups'] = pd.cut(df.age, bins=[0, 18, 65, 99], labels=['child', 'adult', 'elderly'])
0 to 18 ➡️ 'child'
18 to 65 ➡️ 'adult'
65 to 99 ➡️ 'elderly'
#Python #pandas #pandastricks
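The bin edges can be confusing, so here is a quick sketch with made-up ages. By default cut() uses bins that include the right edge, so 18 lands in 'child' and 65 lands in 'adult'.

```python
import pandas as pd

df = pd.DataFrame({"age": [5, 18, 30, 65, 80]})

# Default bins are (0, 18], (18, 65], (65, 99]: the right edge is included
df["age_groups"] = pd.cut(
    df["age"],
    bins=[0, 18, 65, 99],
    labels=["child", "adult", "elderly"],
)
print(df)
```

Pass right=False if you would rather have the left edge included instead.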
🐼🤹‍♂️ pandas trick #72:
- Kevin Markham (@justmarkham) October 1, 2019
Need to convert a column from continuous to categorical?
➡️ Use cut() to specify bin edges
➡️ Use qcut() to specify number of bins (creates bins of approx. equal size)
➡️ Both allow you to label the bins
See example 👇 pic.twitter.com/2UhsNEIwDX
#Python #DataScience #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 5, 2019
Want to dummy encode (or "one hot encode") your DataFrame? Use pd.get_dummies(df) to encode all object & category columns.
Want to drop the first level since it provides redundant info? Set drop_first=True.
See example & read thread 👇 pic.twitter.com/g0XjJ44eg2
#Python #pandastricks
🐼🤹‍♂️ pandas trick #85:
- Kevin Markham (@justmarkham) November 13, 2019
Three useful ways to convert one set of values to another:
1. map() using a dictionary
2. factorize() to encode each value as an integer
3. comparison statement to return boolean values
See example 👇 pic.twitter.com/9G5vcXW7ci
#Python #DataScience #pandastricks @python_tip
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 30, 2019
Need to apply the same mapping to multiple columns at once? Use "applymap" (DataFrame method) with "get" (dictionary method).
See example 👇 pic.twitter.com/WU4AmeHP4O
#Python #DataScience #pandas #pandastricks
Extracting data from lists
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) June 27, 2019
Has your data ever been TRAPPED in a Series of Python lists?
Expand the Series into a DataFrame by using apply() and passing it the Series constructor
See example 👇 pic.twitter.com/ZvysqaRz6S
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 12, 2019
Do you have a Series containing lists of items? Create one row for each item using the "explode" method 💥
New in pandas 0.25! See example 👇 🤯 pic.twitter.com/ix5d8CLg57
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 14, 2019
Does your Series contain comma-separated items? Create one row for each item:
✂️ "str.split" creates a list of strings
⬅️ "assign" overwrites the existing column
💥 "explode" creates the rows (new in pandas 0.25)
See example 👇 pic.twitter.com/OqZNWdarP0
#Python #pandas #pandastricks
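The split, assign, explode chain could be sketched like this. The sandwich data is made up for illustration, and explode() needs pandas 0.25 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "sandwich": ["PB&J", "BLT"],
    "ingredients": ["peanut butter,jelly", "bacon,lettuce,tomato"],
})

long_df = (
    df.assign(ingredients=df["ingredients"].str.split(","))  # strings -> lists
      .explode("ingredients")                                # one row per item
)
print(long_df)
```

The original index values are repeated for rows that came from the same list. Chain reset_index(drop=True) if you need a fresh index.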
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 16, 2019
💥 "explode" takes a list of items and creates one row for each item (new in pandas 0.25)
You can also do the reverse! See example 👇 pic.twitter.com/4UBxbzHS51
Thanks to @EForEndeavour for this tip
#Python #DataScience #pandas #pandastricks
Working with time series data
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 8, 2019
If you need to create a single datetime column from multiple columns, you can use to_datetime()
See example 👇 pic.twitter.com/0bip6SRDdF
You must include: month, day, year
You can also include: hour, minute, second
#Python #DataScience #pandas #pandastricks
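A minimal sketch, assuming the columns are named exactly month, day, year and hour (the data itself is made up). to_datetime() accepts a DataFrame and assembles the parts based on those column names.

```python
import pandas as pd

df = pd.DataFrame({
    "month": [12, 1],
    "day": [25, 1],
    "year": [2019, 2020],
    "hour": [8, 14],
})

# Pass the relevant columns as a DataFrame; order does not matter
df["timestamp"] = pd.to_datetime(df[["month", "day", "year", "hour"]])
print(df)
```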
🐼🤹‍♂️ pandas trick #97:
- Kevin Markham (@justmarkham) December 16, 2019
Want to convert "year" and "day of year" into a single datetime column?
1. Combine them into one number
2. Convert to datetime and specify its format
See example 👇 pic.twitter.com/S7KlTo7rLE
List of all format codes: https://t.co/SSd0dAWxM7
#Python #DataScience #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 2, 2019
One reason to use the datetime data type is that you can access many useful attributes via "dt", like:
df.column.dt.hour
Other attributes include: year, month, day, dayofyear, week, weekday, quarter, days_in_month...
See full list 👇 pic.twitter.com/z405STKqKY
#Python #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 18, 2019
Need to perform an aggregation (sum, mean, etc) with a given frequency (monthly, yearly, etc)?
Use resample! It's like a "groupby" for time series data. See example 👇 pic.twitter.com/nweqbHXEtd
"Y" means yearly. See list of frequencies: https://t.co/oPDx85yqFT
#Python #pandastricks
🐼🤹‍♂️ pandas trick #87:
- Kevin Markham (@justmarkham) November 15, 2019
Problem: You have time series data that you want to aggregate by day, but you're only interested in weekends.
Solution:
1. resample by day ('D')
2. filter by day of week (5=Saturday, 6=Sunday)
See example 👇 pic.twitter.com/5yCPLpE6kr
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 27, 2019
Want to calculate the difference between each row and the previous row? Use df.col_name.diff()
Want to calculate the percentage change instead? Use df.col_name.pct_change()
See example 👇 pic.twitter.com/5EGYqpNPC3
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 31, 2019
Need to convert a datetime Series from UTC to another time zone?
1. Set current time zone ➡️ tz_localize('UTC')
2. Convert ➡️ tz_convert('America/Chicago')
Automatically handles Daylight Saving Time!
See example 👇 pic.twitter.com/ztzMXcgkFY
#Python #DataScience #pandastricks
Handling missing values
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) June 19, 2019
Calculate % of missing values in each column:
df.isna().mean()
Drop columns with any missing values:
df.dropna(axis='columns')
Drop columns in which more than 10% of values are missing:
df.dropna(thresh=len(df)*0.9, axis='columns')
#Python #pandastricks
🐼🤹‍♂️ pandas trick #95:
- Kevin Markham (@justmarkham) December 12, 2019
Want to know the *count* of missing values in a DataFrame?
➡️ df.isna().sum().sum()
Just want to know if there are *any* missing values?
➡️ df.isna().any().any()
➡️ df.isna().any(axis=None)
See example 👇 pic.twitter.com/BmmYJfk4xo
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 12, 2019
Need to fill missing values in your time series data? Use df.interpolate()
Defaults to linear interpolation, but many other methods are supported!
Want more pandas tricks? Watch this:
https://t.co/6akbxXXHKg pic.twitter.com/JjH08dvjMK
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 15, 2019
Do you need to store missing values ("NaN") in an integer Series? Use the "Int64" data type!
See example 👇 pic.twitter.com/mN7Ud53Rls
(New in v0.24, API is experimental/subject to change)
#Python #DataScience #pandas #pandastricks
Using aggregation functions
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 19, 2019
Instead of aggregating by a single function (such as 'mean'), you can aggregate by multiple functions by using 'agg' (and passing it a list of functions) or by using 'describe' (for summary statistics)
See example 👇 pic.twitter.com/Emg3zLAocB
#Python #DataScience #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 9, 2019
Did you know that "last" is an aggregation function, just like "sum" and "mean"?
Can be used with a groupby to extract the last value in each group. See example 👇 pic.twitter.com/WKJtNIUxwz
P.S. You can also use "first" and "nth" functions!
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick #86:
- Kevin Markham (@justmarkham) November 14, 2019
Are you applying multiple aggregations after a groupby? Try "named aggregation":
✅ Allows you to name the output columns
✅ Avoids a column MultiIndex
New in pandas 0.25! See example 👇 pic.twitter.com/WIVQVcn4re
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 21, 2019
Are you applying multiple aggregations after a groupby? Try "named aggregation":
✅ Allows you to name the output columns
✅ Avoids a column MultiIndex
New in pandas 0.25! See example 👇 pic.twitter.com/VXJz6ShZbc
#Python #DataScience #pandas #pandastricks
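Named aggregation might look like the sketch below. The "team" and "points" columns and the output names "total" and "best" are invented. Each keyword argument becomes an output column, and its value is a (column, function) pair.

```python
import pandas as pd

df = pd.DataFrame({"team": ["a", "a", "b"], "points": [10, 20, 5]})

summary = df.groupby("team").agg(
    total=("points", "sum"),  # output column "total" = sum of points
    best=("points", "max"),   # output column "best" = max of points
)
print(summary)
```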
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) September 4, 2019
Want to combine the output of an aggregation with the original DataFrame?
Instead of: df.groupby('col1').col2.func()
Use: df.groupby('col1').col2.transform(func)
"transform" changes the output shape
See example 👇 pic.twitter.com/9dkcAGpTYK
#Python #DataScience #pandas #pandastricks
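A small sketch with made-up order data. Because transform() returns one value per original row (instead of one per group), the result can be stored directly as a new column.

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 1, 2], "price": [5.0, 3.0, 10.0]})

# One total per row, aligned with the original index
df["order_total"] = df.groupby("order_id")["price"].transform("sum")
print(df)
```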
Using cumulative functions
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) September 6, 2019
Need to calculate a running total (or "cumulative sum")? Use the cumsum() function! Also works with groupby()
See example 👇 pic.twitter.com/H4whqlV2ky
Other cumulative functions: cummax(), cummin(), cumprod()
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) September 11, 2019
Need to calculate a running count within groups? Do this:
df.groupby('col').cumcount() + 1
See example 👇 pic.twitter.com/jSz231QmmS
Thanks to @kjbird15 and @EForEndeavour for this trick!
#Python #DataScience #pandas #pandastricks @python_tip
Random sampling
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 20, 2019
Randomly sample rows from a DataFrame:
df.sample(n=10)
df.sample(frac=0.25)
Useful parameters:
➡️ random_state: use any integer for reproducibility
➡️ replace: sample with replacement
➡️ weights: weight based on values in a column 👇 pic.twitter.com/j2AyoTLRKb
#Python #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 26, 2019
Want to shuffle your DataFrame rows?
df.sample(frac=1, random_state=0)
Want to reset the index after shuffling?
df.sample(frac=1, random_state=0).reset_index(drop=True)
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) June 18, 2019
Split a DataFrame into two random subsets:
df_1 = df.sample(frac=0.75, random_state=42)
df_2 = df.drop(df_1.index)
(Only works if df's index values are unique)
P.S. Working on a video of my 25 best #pandastricks, stay tuned! 📺
#Python #pandas #DataScience
Merging DataFrames
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 23, 2019
When you are merging DataFrames, you can identify the source of each row (left/right/both) by setting indicator=True.
See example 👇 pic.twitter.com/tkb2LiV4eh
P.S. Learn 25 more #pandastricks in 25 minutes: https://t.co/6akbxXG6SI
#Python #DataScience #pandas
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) June 26, 2019
Merging datasets? Check that merge keys are unique in BOTH datasets:
pd.merge(left, right, validate='one_to_one')
✅ Use 'one_to_many' to only check uniqueness in LEFT
✅ Use 'many_to_one' to only check uniqueness in RIGHT
#Python #DataScience #pandastricks
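Both merging tricks can be combined in one call. The sketch below uses made-up "left" and "right" DataFrames with an outer join so that all three indicator values appear, then shows what happens when the validate check fails.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "l": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "r": ["x", "y", "z"]})

# indicator=True adds a "_merge" column: left_only, right_only, or both
merged = pd.merge(
    left, right, on="key", how="outer",
    indicator=True, validate="one_to_one",
)
print(merged)

# With duplicate keys on the right, the one_to_one check raises MergeError
dup_right = pd.DataFrame({"key": [2, 2], "r": ["x", "y"]})
try:
    pd.merge(left, dup_right, on="key", validate="one_to_one")
    check_failed = False
except pd.errors.MergeError:
    check_failed = True
print(check_failed)
```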
Styling DataFrames
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 6, 2019
Two simple ways to style a DataFrame:
1️⃣ df.style.hide_index()
2️⃣ df.style.set_caption('My caption')
See example 👇 pic.twitter.com/8yzyQYz9vr
For more style options, watch trick #25: https://t.co/6akbxXG6SI 📺
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 17, 2019
Want to add formatting to your DataFrame? For example:
- hide the index
- add a caption
- format numbers & dates
- highlight min & max values
Watch 👇 to learn how! pic.twitter.com/AKQr7zVR7S
Code: https://t.co/HKroWYVIEs
25 more tricks: https://t.co/6akbxXG6SI
#Python #pandastricks
Exploring a dataset
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 29, 2019
Want to explore a new dataset without too much work?
1. Pick one:
➡️ pip install pandas-profiling
➡️ conda install -c conda-forge pandas-profiling
2. import pandas_profiling
3. df.profile_report()
4. 🥳
See example 👇 pic.twitter.com/srq5rptEUj
#Python #DataScience #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) June 24, 2019
Need to check if two Series contain the same elements?
❌ Don't do this:
df.A == df.B
✅ Do this:
df.A.equals(df.B)
✅ Also works for DataFrames:
df.equals(df2)
equals() properly handles NaNs, whereas == does not
#Python #DataScience #pandas #pandastricks
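To see the NaN behavior for yourself, here is a short sketch with two made-up columns that hold identical values, including a NaN in the same position.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan], "B": [1.0, np.nan]})

# Element-wise comparison: NaN is never equal to NaN
elementwise = df["A"] == df["B"]

# equals() treats NaNs in the same position as equal
same = df["A"].equals(df["B"])
print(elementwise.tolist(), same)
```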
🐼🤹‍♂️ pandas trick #69:
- Kevin Markham (@justmarkham) September 19, 2019
Need to check if two Series are "similar"? Use this:
pd.testing.assert_series_equal(df.A, df.B, ...)
Useful arguments include:
➡️ check_names=False
➡️ check_dtype=False
➡️ check_exact=False
See example 👇 pic.twitter.com/bdJBkiFxne
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick #84:
- Kevin Markham (@justmarkham) November 11, 2019
My favorite feature in pandas 0.25: If DataFrame has more than 60 rows, only show 10 rows (saves your screen space!)
You can modify this: pd.set_option('min_rows', 4)
See example 👇 pic.twitter.com/K7NXJXzIgY
More info: https://t.co/8vwkHWxnPH
#Python #DataScience #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 24, 2019
Want to examine the "head" of a wide DataFrame, but can't see all of the columns?
Solution #1: Change display options to show all columns
Solution #2: Transpose the head (swaps rows and columns)
See example 👇 pic.twitter.com/9sw7O7cPeh
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) August 23, 2019
Want to plot a DataFrame? It's as easy as:
df.plot(kind='...')
You can use:
line 📈
bar 📊
barh
hist
box 📦
kde
area
scatter
hexbin
pie 🥧
Other plot types are available via pd.plotting!
Examples: https://t.co/fXYtPeVpZX pic.twitter.com/kp82wA15S4
#Python #dataviz #pandastricks
🐼🤹‍♂️ pandas trick #96:
- Kevin Markham (@justmarkham) December 13, 2019
Want to create interactive plots using pandas 0.25?
1. Pick one:
➡️ pip install hvplot
➡️ conda install -c conda-forge hvplot
2. pd.options.plotting.backend = 'hvplot'
3. df.plot(...)
4. 🥳
See example 👇 pic.twitter.com/HjH9hTQGqD
#Python #DataScience #pandas #pandastricks
Handling warnings
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) September 10, 2019
Did you encounter the dreaded SettingWithCopyWarning? 👻
The usual solution is to rewrite your assignment using "loc":
❌ df[df.col == val1].col = val2
✅ df.loc[df.col == val1, 'col'] = val2
See example 👇 pic.twitter.com/6L6IukTpBO
#Python #DataScience #pandastricks @python_tip
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) September 12, 2019
Did you get a "SettingWithCopyWarning" when creating a new column? You are probably assigning to a DataFrame that was created from another DataFrame.
Solution: Use the "copy" method when copying a DataFrame!
See example 👇 pic.twitter.com/LrRNFyN6Qn
#Python #DataScience #pandastricks
Other
🐼🤹‍♂️ pandas trick #88:
- Kevin Markham (@justmarkham) November 19, 2019
Goal: Rearrange the columns in your DataFrame
Options:
1. Specify all column names in desired order
2. Specify columns to move, followed by remaining columns
3. Specify column positions in desired order
See example 👇 pic.twitter.com/r739QtBims
#Python #pandastricks @python_tip
🐼🤹‍♂️ pandas trick #98:
- Kevin Markham (@justmarkham) December 17, 2019
Problem: Your DataFrame is in "wide format" (lots of columns), but you need it in "long format" (lots of rows)
Solution: Use melt()!
See example 👇 pic.twitter.com/4mmoiuFUGD
Long format is better for analysis, transformation, merges...
#Python #DataScience #pandastricks
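A minimal melt() sketch, assuming a made-up wide table with one column per year. The names passed to var_name and value_name are arbitrary choices.

```python
import pandas as pd

wide = pd.DataFrame({"city": ["A", "B"], "2018": [1, 2], "2019": [3, 4]})

# Keep "city" as an identifier; every other column becomes (year, value) rows
long_df = wide.melt(id_vars="city", var_name="year", value_name="value")
print(long_df)
```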
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) September 2, 2019
If you've created a groupby object, you can access any of the groups (as a DataFrame) using the get_group() method.
See example 👇 pic.twitter.com/6Ya0kxMpgk
#Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 1, 2019
Do you have a Series with a MultiIndex?
Reshape it into a DataFrame using the unstack() method. It's easier to read, plus you can interact with it using DataFrame methods!
See example 👇 pic.twitter.com/DKHwN03A7J
P.S. Want a video with my top 25 #pandastricks? 📺
#Python #pandas
🐼🤹 pandas trick:
- Kevin Markham (@justmarkham) July 26, 2019
There are many display options you can change:
max_rows
max_columns
max_colwidth
precision
date_dayfirst
date_yearfirst
How to use:
pd.set_option('display.max_rows', 80)
pd.reset_option('display.max_rows')
See all:
pd.describe_option()
#Python #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 5, 2019
Show total memory usage of a DataFrame:
df.info(memory_usage='deep')
Show memory used by each column:
df.memory_usage(deep=True)
Need to reduce? Drop unused columns, or convert object columns to 'category' type.
#Python #pandas #pandastricks
🐼🤹‍♂️ pandas trick #70:
- Kevin Markham (@justmarkham) September 20, 2019
Need to know which version of pandas you're using?
➡️ pd.__version__
Need to know the versions of its dependencies (numpy, matplotlib, etc)?
➡️ pd.show_versions()
Helpful when reading the documentation!
#Python #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
- Kevin Markham (@justmarkham) July 22, 2019
Want to use NumPy without importing it? You can access ALL of its functionality from within pandas! See example 👇 pic.twitter.com/pZbXwuj6Kz
This is probably *not* a good idea since it breaks with a long-standing convention. But it's a neat trick
#Python #pandas #pandastricks