In your data project, you need to write some interesting questions and then analyze the data to try to answer them.
Let’s say we’re looking at the U.S. Cities data. Here are some simple questions:
Simple (Boring) Question Examples:
How many cities are there?
Which city has the biggest population?
What is the total population of the cities?
We should ask questions that have more interesting and complicated answers.
Examples:
How is population related to crime?
Which cities have grown the most over time?
What are some differences between cities in Iowa and cities in other states, such as California or Connecticut?
Do we see a difference in education level between larger and smaller cities?
What types of cities have high or low unemployment?
What causes some cities to grow while others shrink?
General Patterns for Asking Interesting Questions
Instead of asking “what’s the biggest number,” here are some tips for asking better questions.
1) Think about the relationships between two different variables. How are housing costs related to crime rates?
This can involve looking for correlations between variables. Stated simply, we check whether one variable tends to go up when the other goes up (a positive correlation), or whether one tends to go down when the other goes up (a negative correlation).
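One way to put a number on this kind of relationship is the Pearson correlation coefficient, which ranges from -1 (one variable falls as the other rises) to +1 (both rise together). The sketch below is only an illustration: the function name pearson_r and the housing and crime values are made up for this example and do not come from the U.S. Cities data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two variables move together around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values for five hypothetical cities (not real data).
housing_cost = [100, 150, 200, 250, 300]
crime_rate = [50, 45, 38, 30, 22]

r = pearson_r(housing_cost, crime_rate)
print(round(r, 2))  # a value near -1 means crime falls as housing cost rises
```

A value close to 0 would suggest the two variables have little linear relationship, which can be an interesting finding in its own right.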
2) Consider how something has changed over time. Line graphs help us with this.
3) Compare different subcategories of data. Do cities on the coasts have different characteristics than cities in the Midwest? Sometimes bar graphs are useful here, but we can also plot different subcategories on the same axes. For example, a scatter plot or line graph could have multiple categories plotted in different colors.
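Before drawing a bar graph, it can help to compute one summary number per subcategory. The following is a minimal sketch, assuming each city is stored as a dictionary; the helper mean_by_group, the field names, and the four cities are all hypothetical.

```python
from collections import defaultdict

def mean_by_group(rows, group_key, value_key):
    """Average the value_key field separately for each value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}

# Made-up rows for four hypothetical cities (not real data).
cities = [
    {"name": "City A", "region": "coast", "unemployment": 4.0},
    {"name": "City B", "region": "coast", "unemployment": 6.0},
    {"name": "City C", "region": "midwest", "unemployment": 3.0},
    {"name": "City D", "region": "midwest", "unemployment": 5.0},
]

averages = mean_by_group(cities, "region", "unemployment")
print(averages)  # one average per region, ready to plot as bars
```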
4) Look for interesting cases or outliers, and then dig into them when you find them.
An outlier is a data point that is very different from most of the other data. It could be a spike in a line graph over time, or a data point all by itself in a scatter plot.
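Outliers are often easy to spot by eye, but a simple rule can flag candidates for you. The sketch below uses the common "1.5 times the interquartile range" convention, which is an assumption added here and not something this guide requires; the function name iqr_outliers and the sample values are made up.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the middle half of the data."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up measurements for seven hypothetical cities (not real data).
values = [10, 12, 11, 13, 12, 11, 95]

flagged = iqr_outliers(values)
print(flagged)  # candidates worth digging into, not automatic errors
```

A flagged value is only a starting point. The interesting question is why that city is so different from the rest.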
5) Look for clusters of data.
In a scatter plot, maybe the dots form a few distinct groups. When that happens, you should ask what might cause that, and try to figure out the difference between the groups.
6) Look at the distribution of the data, which is similar to looking for clusters. A histogram can help you do this. The distribution tells you how many of the data points fall in certain ranges, and which ranges are the most frequent.
We often see a normal distribution (bell curve shape), but not always. Sometimes we get other types, like a bimodal distribution (two peaks in the data). In that case you would want to dig into why this is happening. Other times you might see a long tail, which means there are a number of less frequent data values that are much larger or smaller than the average.
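To see what a histogram is doing behind the scenes, you can count how many values fall into each range yourself. This is a rough sketch with made-up numbers; the function name bin_counts and the bin width of 10 are choices made for this example only.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count how many values fall into each range of size bin_width."""
    # Each value is assigned to the start of its range, e.g. 27 -> 20.
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up populations in thousands for twelve hypothetical cities (not real data).
populations = [12, 18, 25, 27, 31, 33, 35, 38, 41, 44, 52, 95]
bin_width = 10

counts = bin_counts(populations, bin_width)

# Print a text histogram, including empty ranges so gaps stay visible.
for start in range(min(counts), max(counts) + bin_width, bin_width):
    end = start + bin_width - 1
    print(f"{start:>3}-{end:<3} {'#' * counts.get(start, 0)}")
```

In this made-up data, most values sit in the middle ranges and a single value sits far to the right, which is the kind of long tail described above.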