August 23, 2021 · Python tutorial

How to write a great Stack Overflow question

In my 7 years of teaching data science, I've answered thousands of questions online. I know what makes a great question, because those are the questions that get my attention!

When you have a code question, Stack Overflow is an excellent place to ask. There are tons of skilled developers hungry for questions to answer so that they can earn points and build their reputation.

If you write a great question, it may be answered quickly! If you don't, it will probably be ignored.

So how do you write a great Stack Overflow question? Jake VanderPlas summarizes my thinking nicely:

Below, I'll demonstrate my step-by-step process for writing a great question, which you can use any time you need to ask for coding help online!

Here's the initial question:

As an example, I'll use this pandas question, which I received from one of my students:

I'm trying to tackle a dataset where each row represents a home transaction with a unique listing id. I know fillna is a column-based function that applies to all nans, but can we apply per row instead?

This dataset has town and architectural style columns, and I want to replace a nan value in architectural style with a groupby town of that row's town, and then apply the most common (highest value count) architectural style. I was going to use df.groupby('town')['architectural style'].value_counts().

I have a feeling I can do this with transform and a lambda function on the whole dataset, but how do I interact with the row, to get its town within the fillna function?

Conceptually, this is an excellent question, but it needs some work in order to be successful on Stack Overflow!

Here's my process for rewriting that question:

  1. Write a brief introduction
  2. Provide a self-contained code example
  3. Detail the expected results and why I expect those results
  4. Add any important notes
  5. Link to any relevant questions
  6. Write a title that summarizes the question

Below I'll rewrite my student's pandas question, and then I'll explain 9 lessons you can take away from this example!

Here's my rewritten question:

Title: How to fill missing values in a DataFrame with the most frequent value of each group?

I have a pandas DataFrame with two columns: toy and color. The color column includes missing values.

How do I fill the missing color values with the most frequent color for that particular toy?

Here's the code to create a sample dataset:

import pandas as pd
import numpy as np

# Sample data: four toy types, with some colors missing (np.nan)
df = pd.DataFrame({
    'toy': ['car'] * 4 + ['train'] * 5 + ['ball'] * 3 + ['truck'],
    'color': ['red', 'blue', 'blue', np.nan, 'green', np.nan,
              'red', 'red', np.nan, 'blue', 'red', np.nan, 'green']
})

Here's the sample dataset:

      toy  color
0     car    red
1     car   blue
2     car   blue
3     car    NaN
4   train  green
5   train    NaN
6   train    red
7   train    red
8   train    NaN
9    ball   blue
10   ball    red
11   ball    NaN
12  truck  green

Here's the desired result:

Notes about the real dataset:

This question is related, but it doesn't answer my question of how to use the most frequent value to fill in missing values.

Here's what makes this a good question:

  1. I summarized the entire question at the top. If someone might know the answer, I want to give them the question right away in order to motivate them to keep reading. If the question were instead buried at the bottom, the reader might just give up and move on to another question.

  2. I wrote code that can be copied, pasted, and run. Coding questions can be hard to solve in your head, and so I want to make it as easy as possible for the reader to work on this problem on their own machine. I even included the imports so that they can run the code from a new environment.

  3. I printed out the dataset. Even though the reader may end up running my code on their machine, I don't want to require them to do so in order to start thinking through a solution. This is also important because they will need something to look at when I explain the expected results. (In case you're wondering, I generated this output by running the sample code in IPython.)

  4. I used short and simple object names. This makes it easier for the reader to read the code, interact with it on their own machine, and write out an answer.

  5. I added proper formatting. This makes it easier for the reader to understand your question quickly.

  6. I provided the expected results and why I expect those results. Detailing what you expect is important because it's unlikely that your question alone will be crystal clear to every reader. Detailing why you expect it is equally important because you don't want to receive solutions that arrive at the right result but for the wrong reason.

  7. I covered many different cases with the example dataset. My dataset included a toy with one missing value, a toy with two missing values, a toy for which there was a "tie" in terms of most frequent color, and a toy with no missing values. This is important for ensuring that the solution you receive will work for all of the cases present in your real dataset.

  8. I added any special notes that wouldn't be obvious from the example dataset. In this case, I wanted to ensure that the solution "scaled" to more than four toy types, and I didn't want the reader to worry about handling an edge case that doesn't exist in the real dataset. Thus, I'm trying to increase the likelihood that I get a useful answer as well as eliminate any questions that may have come up in the reader's mind.

  9. I linked to a related question. This demonstrates that you have already searched for an answer, and it gives you a chance to make the case for why your question is not a duplicate (a question judged to be a duplicate may be closed by a moderator). It also gives the reader a "head start" by linking to a resource that might help them to come up with an answer to your question.
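
Before posting, it can help to confirm that a sample dataset really covers the cases described in lesson 7. Here's a minimal sketch that counts the missing colors for each toy and checks how many colors are tied for most frequent. The names missing, tied, and coverage are my own choices for illustration, not part of the original question:

```python
import numpy as np
import pandas as pd

# Same sample data as in the rewritten question
df = pd.DataFrame({
    'toy': ['car'] * 4 + ['train'] * 5 + ['ball'] * 3 + ['truck'],
    'color': ['red', 'blue', 'blue', np.nan, 'green', np.nan,
              'red', 'red', np.nan, 'blue', 'red', np.nan, 'green'],
})

# Number of missing colors for each toy
missing = df['color'].isna().groupby(df['toy']).sum()

# Number of colors tied for most frequent within each toy
# (Series.mode returns every tied value and skips NaN by default)
tied = df.groupby('toy')['color'].agg(lambda s: len(s.mode()))

# One row per toy: a quick summary of which cases the sample covers
coverage = pd.DataFrame({'missing': missing, 'tied_modes': tied})
print(coverage)
```

A toy with more than one tied mode is a "tie" case, and a toy with zero missing values is the "nothing to fill" case. If a case from your real dataset doesn't show up in a summary like this, consider adding a few rows to the sample.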

Here are some additional things you might want to include:

My guiding principle is to make it easy for the reader by telling them everything they need to know and nothing else. Here are a few things that I didn't include in my question, but may be worth including in some cases:

Note: Including code that didn't work is often recommended as a way to "prove" that you have put enough effort into solving your own problem, but I don't universally recommend this since it makes your question longer without necessarily adding useful information.

Here are a few more tips:

Related resources:

Good luck!

If you used this guide to help you write a great Stack Overflow question, feel free to share a link in the comments below and I'll take a look! 👇

P.S. I posted my pandas question on Stack Overflow and received a helpful answer within 5 minutes! 🙌 This is by no means guaranteed, but by writing a high-quality question, you will greatly increase the likelihood of a useful response!
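
If you'd like to experiment with the sample dataset yourself, here's a minimal sketch of one possible approach. It is not necessarily the answer I received on Stack Overflow. The helper name fill_with_mode is hypothetical, and breaking ties alphabetically is an assumption made only for this sketch:

```python
import numpy as np
import pandas as pd

# Same sample data as in the rewritten question
df = pd.DataFrame({
    'toy': ['car'] * 4 + ['train'] * 5 + ['ball'] * 3 + ['truck'],
    'color': ['red', 'blue', 'blue', np.nan, 'green', np.nan,
              'red', 'red', np.nan, 'blue', 'red', np.nan, 'green'],
})

def fill_with_mode(s):
    """Fill missing values in one group with that group's most frequent value."""
    # Series.mode skips NaN and returns tied values in sorted order,
    # so a tie is broken alphabetically here (an assumption of this sketch)
    modes = s.mode()
    # If a group has no non-missing values, leave it unchanged
    return s.fillna(modes.iloc[0]) if not modes.empty else s

# Apply the helper within each toy group, keeping the original row order
filled = df.assign(color=df.groupby('toy')['color'].transform(fill_with_mode))
print(filled)
```

Because transform returns a result aligned to the original index, the filled column can be assigned straight back without any merging or reordering.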
