SAT Math Problem Solving

## Statistics

Before we explain the core concepts tested on the Math section, it's important to understand the graphs and charts often used in statistics-based questions.

## 1. Histograms

Like a line plot, a histogram shows the frequency of data. Instead of marking the data items with Xs, however, a histogram shows them as bars.

Example: This histogram shows the heights of trees on each street in my town. In this case, the frequency is the number of trees, and the characteristic is the height of the trees.

#### From the graph, we see:

1. There are 3 trees whose heights are between 30 and 35 feet
2. There are 3 trees whose heights are between 36 and 40 feet
3. There are 8 trees whose heights are between 41 and 45 feet etc.

## 2. Scatterplot

A scatter plot is a type of graph that shows the relationship between two sets of data. Scatter plots graph data as ORDERED PAIRS (this is simply a pair of numbers but the order in which they appear together matters).

#### To show Tammy's data, mark the point whose horizontal value is 4.5 and whose vertical value is 90.

By graphing the data on a scatter plot, we can see if there is a relationship between the number of hours studied and test scores. The scores generally go up as the hours of studying go up, so this shows that there IS a relationship between test scores and studying. We can draw a line on the graph that roughly describes the relationship between the number of hours studied and test scores. This line is known as the LINE OF BEST FIT.

As you can see, none of the points lie on the line of best fit, but that's okay! The line of best fit is the line that best describes the relationship among ALL the points on the graph.

Scatterplots show three types of relationships, called CORRELATIONS:

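#### POSITIVE CORRELATION: As one value increases, the other tends to increase as well. Example: Test scores generally go up as the number of hours studied goes up.

#### NEGATIVE CORRELATION: As one value increases, the other tends to decrease.

#### NO CORRELATION: The values have no relationship. Example: A person's IQ is not related to his/her shoe size, so there is no correlation.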

Apart from these graphs, other graphs such as pie charts, box plots, and cumulative frequency graphs can be used. They are rare, so we won't describe each one of them in detail. If they do show up on the test, they are usually straightforward to answer. A couple of examples are below.

#### Solved Example 1

What is the central angle of the sector of the pie chart that represents those who chose hamburgers as their favorite food?
A. 97.2°
B. 97.4°
C. 98.4°
D. 98.8°

Solution: We know that 27% of those in the survey chose hamburgers as their favorite food. This corresponds to 27% of the area of the circle, which means that the central angle of the circle accounts for 27% of the measure of the sum of all central angles in a circle (360°).

In other words, we are looking for the angle that makes up 0.27 of the entire circle:
0.27 x 360° = 97.2°

The correct answer is A.

#### Solved Example 2

At what pH is the enzyme activity at its maximum?
A. 1
B. 7
C. 10
D. 14

Solution: Enzyme activity is at its maximum in the earliest stages of the experiment at about time = 0. The pH at this time is about 1.5. The closest correct answer is A.

## Statistical Measures

Statistics has long been a feared SAT topic for many test takers. Much of that reputation comes from not understanding the operations used to calculate statistical measures, which can seem vague and confusing at first.

The most used statistical measures are described below:

### 1. The Mean (also called the average)

The mean is a calculated central value of a set of numbers. To calculate the mean, add all of the numbers, then divide the sum by the number of items.

Solved Example 3
If x is the average (arithmetic mean) of m and 9, y is the average of 2m and 15, and z is the average of 3m and 18, what is the average of x, y, and z in terms of m?
A. m + 6
B. m + 7
C. 2m + 14
D. 3m + 21

Solution: There are a lot of variables in this equation, but don't let them confuse you. We already know that the average of two numbers is the sum of those two numbers divided by 2.

That means that:
$$x=\frac{m+9}{2}$$
$$y=\frac{2m+15}{2}$$
$$z=\frac{3m+18}{2}$$

Now we need to find the average of x, y, and z. Substituting the previous expressions for x, y, and z gives us:

$$\frac{x+y+z}{3}=\frac{m+9+2m+15+3m+18}{3\times 2}=\frac{6m+42}{6}=m+7$$

The correct answer is B.

#### Solved Example 4

The mean score of 8 players in a basketball game was 14.5 points. If the highest individual score is removed, the mean score of the remaining 7 players becomes 12 points. What was the highest score?
A. 20
B. 24
C. 32
D. 36

Solution: If the mean score of 8 players is 14.5, then the total of those 8 scores is 14.5 x 8 = 116. If the mean of 7 scores is 12, then the total of those 7 scores is 12 x 7 = 84.

Since the set of 7 scores was created by removing the highest score from the set of 8 scores, the difference between the total of all 8 scores and the set of 7 scores is equal to the removed score.

116 - 84 = 32

The correct answer is C.
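As a quick check of the arithmetic, here is a minimal Python sketch of the same reasoning. The variable names are our own and are used only for illustration.

```python
# Total = mean x count, so the removed score is the difference of the two totals.
total_all = 14.5 * 8    # total points scored by all 8 players
total_rest = 12 * 7     # total points scored by the remaining 7 players
highest = total_all - total_rest
print(highest)  # 32.0
```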

### 2. Weighted Averages

The basic formula for averages applies only to sets of data consisting of individual values, all of which are equally weighted (i.e., none of the values “counts” toward the average any more than any other value does). When you consider sets in which some data are more heavily weighted than other data - whether weighted by percent, frequencies, ratios, or fractions - you need to use special techniques for WEIGHTED AVERAGES.

A weighted average of only two values will fall closer to whichever value is weighted more heavily. For instance, if a drink is made by mixing 2 shots of a liquor containing 15% alcohol with 3 shots of a liquor containing 20% alcohol, then the alcohol content of the mixed drink will be closer to 20% than to 15%.

Example: A mixture of "lean" ground beef (10% fat) and "super-lean" ground beef (3% fat) has a total fat content of 8%. What is the ratio of "lean" ground beef to "super-lean" ground beef?

Fortunately, you do not need to complete any complicated formulas here. Instead, you need to look at the difference between the fat contents of “lean” and “super lean” ground beef and the fat content of the final mixture.

“Lean” ground beef has a fat content that is 2% higher than the fat content of the final mixture. You can say that “lean” ground beef has a +2 differential.

Similarly, “super lean” ground beef has a fat content that is 5% lower than the fat content of the final mixture. You can say that “super lean” ground beef has a -5 differential. You need to make these differentials cancel out, so you should multiply both differentials by different numbers so that the positive will cancel out with the negative. If you were to set up an equation, it would look something like this:

x(+2) + y(-5) = 0

Now all you have to do is pick values for x and y. If x = 5 and y = 2, the equation will be true (10 + (-10) = 0). That means that for every 5 parts “lean” ground beef, you have 2 parts “super-lean” ground beef. The ratio is 5:2.

This relationship holds whenever two groups are averaged together. Suppose that A and B are averaged together. If they are in a ratio of a:b, then you can multiply the differential of A by a, and it will cancel out with the differential of B times b.

Example: A group consists of men and women in a ratio of 2:3. If the men have an average age of 50, and the average age of the group is 56, you can easily figure out the average age of the women in the group. Men have a -6 differential, and there are 2 of them for every 3 women. If the differential of the women is w, then:

2 x (-6) + 3 x (w) = 0
-12 + 3w = 0
w = 4

Women have a +4 differential. The average age of the women in the group is 56 + 4 = 60 years old.
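The differential method can be sketched in Python as follows. The helper names `weighted_mean` and `mix_ratio` are hypothetical, chosen for this illustration; the numbers are the ones from the two examples above.

```python
def weighted_mean(values, weights):
    """Average of `values`, where values[i] counts weights[i] times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def mix_ratio(value_a, value_b, combined):
    """Return (a, b) so that a parts of A and b parts of B average to `combined`.

    The differentials must cancel:
    a * (value_a - combined) + b * (value_b - combined) = 0
    """
    diff_a = value_a - combined   # e.g. +2 for "lean" beef
    diff_b = value_b - combined   # e.g. -5 for "super-lean" beef
    return -diff_b, diff_a        # each group is weighted by the other's differential


print(mix_ratio(10, 3, 8))              # (5, 2)
print(weighted_mean([10, 3], [5, 2]))   # 8.0
print(weighted_mean([50, 60], [2, 3]))  # 56.0
```

The last two lines confirm both results: 5 parts at 10% and 2 parts at 3% average to 8%, and men averaging 50 with women averaging 60 in a 2:3 ratio average to 56.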

### 3. Median

A median is the middle number of a data set when all of the items are written in order, from least to greatest (or greatest to least).

#### The following are the steps to calculate the median of a list:

1. Arrange the list in ascending or descending order
2. If there are an odd number of items on the list, the middle item equals the median
$$Median(odd\: number\: of\: items)=\left(\frac{n+1}{2}\right)^{th}\: term$$
3. If there are an even number of items on the list, then the median is the average of the two middle numbers
$$Median(even\: number\: of\: items)=\frac{\left(\frac{n}{2}\right)^{th}\: term+\left(\frac{n}{2}+1\right)^{th}\: term}{2}$$

#### For Example:

• In the 7-element set { 3, 5, 7, 9, 13, 15, 17 }, the median is 9
• In the 8-element set { 3, 5, 7, 9, 13, 15, 17, 17 }, the median is 11 (the average of the fourth and fifth entries, 9 and 13).

#### Note:

• When the number of items on the list is even, the median can equal a number not on the list.
• An absurdly large number, far away from the rest of the set, such as 312 in the set {1, 3, 3, 3, 3, 3, 74, 89, 312}, pulls the mean up sharply but leaves the median (3) where it is
• As long as the middle numbers stay in the middle, changes to the values of the outer numbers have no effect on the median; by contrast, changing any number in the set changes the mean.
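These observations can be checked with Python's standard `statistics` module. The sketch below uses the two example sets above; the list with 312 in it is a made-up variation added only to show the effect of an outlier.

```python
from statistics import mean, median

odd_set = [3, 5, 7, 9, 13, 15, 17]
even_set = [3, 5, 7, 9, 13, 15, 17, 17]
print(median(odd_set))    # 9
print(median(even_set))   # 11.0

# Hypothetical change: swap the largest value for an extreme outlier.
with_outlier = [3, 5, 7, 9, 13, 15, 312]
print(median(with_outlier))                # 9, unchanged
print(mean(odd_set) < mean(with_outlier))  # True, the mean jumps from about 9.9 to 52
```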

Solved Example 4
A sociologist chose 300 students at random from each of two schools and asked each student how many siblings he or she has. The results are shown in the table below:

There are a total of 2,400 students at Lincoln School and 3,300 students at Washington School. What is the median number of siblings for all the students surveyed?
A. 0
B. 1
C. 2
D. 3

Solution: There were a total of 600 data points collected (300 from each school), which means the median will be the average of the 300th and 301st values.

Fortunately, there's a way to solve the problem without having to write out 600 numbers! You can put the numbers into groups based on the information you're given in the chart.

#### For each "number of siblings" value, add the number of respondents from each of the two schools together. For example, 120 students from Lincoln School and 140 students from Washington School said they had no siblings, and 120 + 140 = 260. So a total of 260 students have 0 siblings. Do this for each of the sibling values.

1. 260 students have 0 siblings
2. 190 students have 1 sibling
3. 90 students have 2 siblings
4. 40 students have 3 siblings
5. 20 students have 4 siblings.

Both the 300th and the 301st values will come in the second category (190 students have 1 sibling). The correct answer is B.
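The same grouping idea can be written as a short Python sketch. The function name `median_from_counts` is our own, and the dictionary holds the combined totals listed above; this is one possible way to do it, not a required method.

```python
def median_from_counts(counts):
    """Median of data given as {value: frequency}, without expanding the list."""
    n = sum(counts.values())
    # 1-based positions of the middle value(s); both are the same when n is odd.
    positions = [(n + 1) // 2, n // 2 + 1]
    middle_values = []
    for position in positions:
        running_total = 0
        for value in sorted(counts):
            running_total += counts[value]
            if running_total >= position:
                middle_values.append(value)
                break
    return sum(middle_values) / 2


siblings = {0: 260, 1: 190, 2: 90, 3: 40, 4: 20}
print(median_from_counts(siblings))  # 1.0
```

The running total passes 300 and 301 inside the "1 sibling" group (positions 261 through 450), which is why both middle values equal 1.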

Solved Example 5
A survey was taken regarding the value of homes in a county, and it was found that the mean home value was \$165,000, and the median home value was \$125,000. Which of the following could explain the difference between the mean and median home values in the county?
A. The homes have values that are close to each other
B. There are a few homes that are valued much less than the rest
C. There are a few homes that are valued much more than the rest
D. Many of the homes have values between \$125,000 and \$165,000

Solution: The mean and median of a set of data are equal when the data has a perfectly symmetrical distribution (such as a normal distribution). If the mean and median are far apart, the data isn't symmetrical, and the usual cause is a few outliers on one side.

When there are outliers in the data, the mean will be pulled in the outliers’ direction (either smaller or larger) while the median barely moves. In this problem, the mean is larger than the median. That means the outliers are most likely a few homes that are significantly more expensive than the rest, since these outliers push the mean higher without affecting the median much.

Therefore, the correct answer is C, There are a few homes that are valued much more than the rest.

Median of sets containing unknown values: Unlike the arithmetic mean, the median of a set depends only on the one or two values in the middle of the ordered set. Therefore, you may be able to determine a specific value for the median of a set even if one or more unknowns are present.

For instance, consider the unordered set {x, 2, 5, 11, 11, 12, 33}. No matter whether x is less than 11, equal to 11, or greater than 11, the median of the resulting set will be 11. (Try substituting different values of x to see why the median does not change.)

By contrast, the median of the unordered set {x, 2, 5, 11, 12, 12, 33} depends on x. If x is 11 or less, the median is 11. If x is between 11 and 12, the median is x. Finally, if x is 12 or more, the median is 12.
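If you want to try substituting values of x yourself, a small Python sketch like the one below does it for both sets. The particular x values are arbitrary test values chosen for illustration.

```python
from statistics import median

# First set: the median stays at 11 whatever x is.
fixed = [2, 5, 11, 11, 12, 33]
print([median(fixed + [x]) for x in (0, 11, 50)])        # [11, 11, 11]

# Second set: the median depends on where x falls.
other = [2, 5, 11, 12, 12, 33]
print([median(other + [x]) for x in (4, 11.5, 20)])      # [11, 11.5, 12]
```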

Example: Sarah spends 2 hours on Tuesday, Thursday, and Saturday practicing her violin. On Monday she practices for 90 minutes, and on Friday she practices for 1 hour. What is the median length of her practice sessions in a given week?

STEP 1: Convert all the given time values into the same unit of measure.
Monday = 90 mins (1.5 hours)
Tuesday = 2 hours
Thursday = 2 hours
Friday = 1 hour
Saturday = 2 hours

STEP 2: Organize the times in ascending order: 1, 1.5, 2, 2, 2
Since there is an odd number of data entries, 5, in the set, the median will be the 3rd value in the list. Therefore, the median is 2 hours.

Solved Example 6
In Billy’s 5th grade class, the test scores for the final math exam were 87, 54, 77, 92, 95, 91, x, and 90. What is the value of x if the median of the test scores is 89?
A. 54
B. 77
C. 88
D. 92

Solution:
STEP 1: Reorder the data set in ascending order, placing x at the high end for now: 54, 77, 87, 90, 91, 92, 95, x
STEP 2: Since the set has an even number of data entries, the median will be the average of the two middle numbers.

Therefore, the median will be the average of the 4th and 5th terms. $$The\: median\: is\: \frac{90+91}{2}=90.5\neq 89$$

STEP 3: Now, let’s reorder our data set with the x on the other end of the set: x, 54, 77, 87, 90, 91, 92, 95.
The median will still be the average of the 4th and 5th terms, but those terms will be different now: $$The\: median\: is\: \frac{87+90}{2}=88.5\neq 89$$

STEP 4: Now we know that the x-value will need to be one of the middle numbers that the average is being taken from; therefore, we can rewrite the set as follows: 54, 77, 87, x, 90, 91, 92, 95

$$The\: median\: is\: \frac{x+90}{2}=89$$
x + 90 = 178
x = 88

Check: 88 lies between 87 and 90, so x really is the 4th term and the median is (88 + 90)/2 = 89. The correct answer is C.

### 4. Mode

The MODE represents the value or values in a data set that is/are repeated the most. It is possible for a data set to have one, multiple, or no modes.

#### For Example:

• There is no mode in the data set {1, 2, 3, 4, 5} since each number appears an equal number of times
• In the data set {2, 4, 5, 2, 3, 1, 2, 2}, 2 is the mode, as it is seen four times
• In the data set {1, 1, 2, 2, 3, 4}, there are two modes, 1 and 2, since each is seen twice in the set
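The three cases above can be reproduced with `statistics.multimode` from Python's standard library. Note one assumption in this sketch: `multimode` returns every value when all values tie, so the hypothetical helper `modes` below returns an empty list in that case to match the "no mode" convention used here.

```python
from statistics import multimode


def modes(data):
    """Most frequent value(s), or [] when every value appears equally often."""
    most_common = multimode(data)
    if len(most_common) == len(set(data)):
        return []          # every distinct value ties, so there is no mode
    return most_common


print(modes([1, 2, 3, 4, 5]))           # []
print(modes([2, 4, 5, 2, 3, 1, 2, 2]))  # [2]
print(modes([1, 1, 2, 2, 3, 4]))        # [1, 2]
```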

Example: Given the set {72, 75, 85, 90, 90, x}, what is the value of x if a mode of the set is 90 and its median is 85?

#### There are three possible places x can be located:

1. {72, 75, 85, 90, 90, x}
2. {x ,72, 75, 85, 90, 90}
3. {72, 75, x, 85, 90, 90}

#### The median of each scenario is:

1. $$\frac{85+90}{2}=87.5\neq 85$$
2. $$\frac{75+85}{2}=80\neq 85$$
3. $$\frac{x+85}{2}=85$$, Therefore, x = 85

Now, write down the set with x as 85: {72, 75, 85, 85, 90, 90}. In this set, the modes are both 85 and 90. Therefore the correct answer for x is 85.
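A brute-force check is another way to confirm the result. The sketch below tries every whole number from 0 to 100 (a range we assume for illustration) and keeps the values of x that satisfy both conditions.

```python
from statistics import median, multimode

known = [72, 75, 85, 90, 90]

# Keep each candidate x for which the median is 85 and 90 is one of the modes.
matches = [
    x for x in range(0, 101)
    if median(known + [x]) == 85 and 90 in multimode(known + [x])
]
print(matches)  # [85]
```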

### 5. Range

#### The SAT loves this statistical measure because it’s so simple. The RANGE is the difference between the maximum value and the minimum value.

• In the set {3, 5, 7, 9, 13, 15, 17}, the range = 17 - 3 = 14
• In the set {1, 3, 3, 3, 3, 3, 74, 89, 312}, the range = 312 - 1 = 311

For the range, the only thing that matters is the top and bottom values in the set.

The smaller the range of a set, the closer its data entries are to one another. If the range is large, then either there is more space between the data entries or there are outliers in the data.

Example: What is the value of x if the range of the set {2, 3, x, 5, 12} is 12?

Find the range of the set, assuming x lies in the middle of the set.
Range = 12 − 2 = 10 ≠ 12

This tells us that x must be either greater than 12 or less than 2.

Possibility 1 : {x, 2, 3, 5, 12}
Range = 12 − x = 12
x = 0

Possibility 2: {2, 3, 5, 12, x}
Range = x − 2 = 12
x = 14

In order for the range of this particular data set to be 12, x must be 0 or 14.
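The two possibilities can be sketched in Python as follows. The function name `values_for_range` is hypothetical, and the sketch assumes the known values span less than the target range, so x has to become the new minimum or the new maximum.

```python
def values_for_range(known, target):
    """Values of x that give known + [x] a range equal to `target`."""
    low, high = min(known), max(known)
    if high - low >= target:
        raise ValueError("known values already span the target range or more")
    # x is either the new minimum (high - target) or the new maximum (low + target).
    return [high - target, low + target]


known = [2, 3, 5, 12]
candidates = values_for_range(known, 12)
print(candidates)  # [0, 14]

# Confirm that each candidate really produces a range of 12.
for x in candidates:
    full_set = known + [x]
    print(x, max(full_set) - min(full_set))
```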

### 6. Standard Deviation

The mean and median both give “average” or “representative” values for a set, but they do not tell us the whole story. It is possible for two sets to have the same average but to differ widely in how spread out their values are. To describe the spread, or variation, of the data in a set, you use a different measure: STANDARD DEVIATION.

#### STANDARD DEVIATION (SD) indicates how far from the average (mean) the data points typically fall.

• A small SD indicates that a set is clustered closely around the average (mean) value
• A large SD indicates that the set is spread out widely, with some points appearing far from the mean.

#### Example: Consider the sets {5, 5, 5, 5}, {2, 4, 6, 8}, and {0, 0, 10, 10}. These sets all have the same mean value of 5. You can see at a glance, though, that the sets are very different, and the differences are reflected in their SDs. The first set has an SD of zero (no spread at all), the second set has a moderate SD (√5, or about 2.24), and the third set has a large SD (5).

You might be asking where the √5 comes from in the technical definition of SD for the second set.

#### The good news is that you do not need to know - it is very unlikely that an SAT problem will ask you to calculate an exact SD. If you just pay attention to what the average spread is doing, you’ll be able to answer all SAT standard deviation problems, which involve either

1. Changes in the SD when a set is transformed, or
2. Comparisons of the SDs of two or more sets.

Just remember that the more spread out the numbers, the larger the SD.

If you come across a problem on the test that focuses on changes in the SD, ask yourself whether the changes move the data closer to the mean, farther from the mean, or neither. If you see a problem requiring comparisons, ask yourself which set is more spread out from its mean.

#### Below are some sample problems to help illustrate standard deviation:

1. Which set has the greater standard deviation: {1, 2, 3, 4, 5} or {440, 442, 443, 444, 445}?
2. If each data point in a set is increased by 7, does the set's standard deviation increase, decrease, or remain constant?
3. If each data point in a set is increased by a factor of 7, does the set's standard deviation increase, decrease, or remain constant? (Assume that the set consists of different numbers)

#### Solution:

1. The second set has the greater SD. One way to understand this is to observe that the gaps between its numbers are, on average, slightly bigger than the gaps in the first set. Only the spread matters. The numbers in the second set are much more “consistent” in a sense - they are all within about 1% of each other, while the largest numbers in the first set are several times the smallest ones. However, this “percent variation” idea is irrelevant to the SD.
2. The SD will not change. “Increased by 7” means that the number 7 is added to each data point in the set. This transformation will not affect any of the gaps between the data points, and thus it will not affect how far the data points are from the mean.
3. The SD will increase. “Increased by a factor of 7” means that each data point is multiplied by 7. This transformation will make all the gaps between points 7 times as big as they originally were. Thus, each point will fall 7 times as far from the mean. The SD will increase by a factor of 7.
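All three answers can be verified numerically. The sketch below uses `statistics.pstdev` (the population standard deviation); that choice is an assumption on our part, and the sample version `stdev` leads to the same comparisons.

```python
from math import isclose
from statistics import pstdev

first = [1, 2, 3, 4, 5]
second = [440, 442, 443, 444, 445]
print(pstdev(first) < pstdev(second))              # True: the second set is more spread out

shifted = [value + 7 for value in first]           # "increased by 7"
scaled = [value * 7 for value in first]            # "increased by a factor of 7"
print(isclose(pstdev(shifted), pstdev(first)))     # True: adding 7 leaves the SD alone
print(isclose(pstdev(scaled), 7 * pstdev(first)))  # True: multiplying by 7 scales the SD by 7
```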

Solved Example 7
Set S has a mean of 10 and a standard deviation of 1.5. We are going to add two additional numbers to Set S. Which pair of numbers would decrease the standard deviation the most?
A. {2, 10}
B. {10, 18}
C. {7, 13}
D. {9, 11}

Solution:
This is a very tricky problem. The starting list has a mean of 10 and a standard deviation of 1.5.
A. These two numbers don’t have a mean of 10, so adding them will change the mean. What’s more, one number is “far away,” which will pull the mean down, increasing the deviations from the mean for almost every number on the list, and therefore increasing the standard deviation.
B. These two numbers don’t have a mean of 10, so adding them will change the mean. One number is “far away,” which will pull the mean up, increasing the deviations from the mean for almost every number on the list, and therefore increasing the standard deviation.
C. These options are centered on 10, so adding them will not change the mean. Both of these are a distance of 3 units from the mean, and this is larger than the standard deviation, so the size of the typical deviation from the mean will increase.
D. This is the correct answer. These are centered on 10, so adding them will not change the mean. Both are a distance of 1 unit from the mean, and this is less than the standard deviation, so the size of the typical deviation from the mean will decrease.

Solved Example 8
Set Q consists of the following five numbers: Q = {5, 8, 13, 21, 34}. Which of the following sets has the same standard deviation as Set Q?
I. {35, 38, 43, 51, 64}
II. {10, 16, 26, 42, 68}
III. {46, 59, 67, 72, 75}

A. I only
B. I & II
C. I & III
D. I, II, & III

Solution: Notice that Set I is just every number in Q plus 30. When you add the same number to every number in a set, you simply shift it up without changing the spacing, so this doesn’t change the standard deviation at all. Set I has the same standard deviation as Q.

Notice that Set II is just every number in Q multiplied by 2. Multiplying by a number does change the spacing, so this does change the standard deviation. Set II does not have the same standard deviation as Q.

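Set III is very tricky and probably is at the outer limit of what the SAT could ever ask you to consider. The spacing between the numbers in Set III, from right to left, is the same as the spacing between the numbers in Q from left to right. In other words, Set III is a mirror image of Q, and flipping a set around does not change how far its values fall from the mean. Set III has the same standard deviation as Q.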

The correct combination is I and III, so the answer is C.
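For readers who want to confirm this numerically, here is a short Python sketch. It uses the population standard deviation (`pstdev`), which is our assumption; the conclusion is the same with the sample version.

```python
from math import isclose
from statistics import pstdev

q = [5, 8, 13, 21, 34]
candidates = {
    "I": [35, 38, 43, 51, 64],
    "II": [10, 16, 26, 42, 68],
    "III": [46, 59, 67, 72, 75],
}

same_sd = [name for name, values in candidates.items()
           if isclose(pstdev(values), pstdev(q))]
print(same_sd)  # ['I', 'III']

# Set III is a reflection of Q: each of its elements equals 80 minus an element of Q.
print(sorted(80 - value for value in q) == candidates["III"])  # True
```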

Solved Example 9
Consider the following sets:
L = {3, 4, 5, 5, 6, 7}
M = {2, 2, 2, 8, 8, 8}
N = {15, 15, 15, 15, 15, 15}

Rank those three sets from least standard deviation to greatest standard deviation.
A. L, M, N
B. M, L, N
C. M, N, L
D. N, L, M

Solution: OK, first of all, set N has six numbers that are all the same. When all the members of a set are identical, the standard deviation is zero, which is the smallest possible standard deviation. So, automatically, N must have the lowest. Right away, we can eliminate options A, B and C. Only D remains. The correct answer is D.

## Sampling and Modeling

One of the important jobs of a statistician is to make predictions. For example, the Indian Government may want to find out the average age of a person in India. It is impossible for the government to survey each and every individual in the country. Therefore, it hires statisticians to take random samples of residents and make predictions based on the data they collect. The SAT will often test you on the relationship between such sample data and the predictions you can or cannot make about the entire population.

#### Solved Example 10

Manchester United Football Club chose 1,000 of their fans at random and asked each fan how many jerseys he or she has. The results are shown in the table below.

There are a total of 16 million Manchester United fans. Based on the survey data, what is the expected total number of fans who own 3 jerseys?
A. 2 million
B. 4 million
C. 6 million
D. 8 million

Solution: In the sample, 250 of the 1,000 fans own 3 jerseys. Applying the same proportion to all 16 million fans gives the estimate:
16 million x 250/1000 = 4 million

The correct answer is B.
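The scaling step is simple enough to express as a one-line function. The name `estimate_population_count` is hypothetical and used only for this sketch.

```python
def estimate_population_count(sample_count, sample_size, population_size):
    """Scale a count observed in a random sample up to the whole population."""
    return population_size * sample_count / sample_size


print(estimate_population_count(250, 1000, 16_000_000))  # 4000000.0
```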

### 1. Line of Best Fit

A statistician is trying to figure out the relationship between the number of fans football clubs have, and the number of tournaments football clubs have won. Therefore, he collects information regarding the number of fans, and the number of tournaments won for 100 football clubs in Europe. He then plots those points on a scatterplot graph.

The line of best fit refers to a line through that scatter plot that best expresses the relationship between those points: number of fans of a football club vs. the number of tournaments won. The line of best fit can be used to figure out if there is a relationship between those two variables.
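The text does not say how the line is chosen; one common choice is the least-squares line, sketched below in Python. The function name `best_fit` and the five data points are made up for illustration and are not taken from any real survey.

```python
def best_fit(xs, ys):
    """Least-squares slope and intercept for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: fans (in millions) and tournaments won for five clubs.
fans = [1, 2, 3, 4, 5]
tournaments = [2, 4, 5, 4, 5]
slope, intercept = best_fit(fans, tournaments)
print(round(slope, 2), round(intercept, 2))  # 0.6 2.2
```

With this toy data, the slope says the predicted number of tournaments rises by about 0.6 for each additional million fans, and the intercept is the predicted value when the number of fans is 0.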

#### Solved Example 11 & 12

The scatterplot shows the number of pollinating flowers for 20 Southern Magnolia plants of different ages. The line of best fit is also shown.

11. Which of the following is the best interpretation of the slope of the line of best fit in the context of this problem?
A. The predicted increase in the age of the Southern Magnolia, in years, for every increase of a pollinating flower.
B. The predicted increase in the number of pollinating flowers for every year increase in the age of the Southern Magnolia.
C. The Southern Magnolia predicted age in years when it has 0 pollinating flowers.
D. The Southern Magnolia predicted number of pollinating flowers when it was just born (Age of 0).

12. Which of the following is the best interpretation of the y-intercept of the line of best fit in the context of this problem?
A. The predicted increase in the age of the Southern Magnolia, in years, for every increase of a pollinating flower.
B. The predicted increase in number of pollinating flowers for every year increase in the age of the Southern Magnolia.
C. The Southern Magnolia predicted age in years when it has 0 pollinating flowers.
D. The Southern Magnolia predicted number of pollinating flowers when it was just born (Age of 0).

Solutions:
11. As we learned in the coordinate geometry chapter, slope is the increase in y (age of the Southern Magnolia) for each increase in x (number of pollinating flowers). The only difference now is that it’s a predicted increase. The correct answer is A.
12. The y-intercept is the value of y (age of the Southern Magnolia) when x (number of pollinating flowers) is 0. Therefore, the correct answer is C.

### 2. Margin of Error

The margin of error refers to the room for error we give to an estimate. For example, we could say that England has 6 million Manchester United fans with a margin of error of 1 million. This means that the number of Manchester United fans living in the country of England is between 5 million and 7 million.

#### The margin of error primarily depends on the following two factors:

1. Sample Size - This is common sense. The more people in England we survey, the more accurate our predictions are going to be (lower margin of error).
2. Variability of the Data - We should only select people from England to check if they are Manchester United fans or not. If we select random people from Europe, our predictions will not be very accurate, and our margin of error increases.

Solved Example 13
A real estate agent randomly surveyed 100 apartments for sale in San Francisco, California and found that the average price of each apartment was \$800,000. Another real estate agent intends to replicate the survey and will attempt to get a smaller margin of error. Which of the following samples will most likely result in a smaller margin of error for the mean price of an apartment in San Francisco, California?
A. 50 randomly selected apartments in San Francisco
B. 50 randomly selected apartments in all of California
C. 100 randomly selected apartments in San Francisco
D. 100 randomly selected apartments in all of California

Solution: As discussed above, the larger the sample size, the lower the margin of error. Therefore, we can eliminate A and B because their sample size is smaller. The second factor discussed above is the variability of the data. Since we want to figure out the average selling price of an apartment in San Francisco, it is better to get the sample data from San Francisco only and not from other cities in California. Therefore, the correct answer is C.

### 3. Confidence Interval

Now, this is where most AP Guru students go bonkers. It’s fine if you do not know what confidence intervals are - most students do not. You’ll never need to calculate one, and the SAT questions that refer to confidence intervals are very easy. All you need to understand is what a confidence interval is.

A confidence interval is a range of values that is likely to contain a statistical measure (like the mean or standard deviation) of a population, estimated from a sample of that population. For example, based on a sample you might be 95% confident that the average age of a French citizen is between 34 and 38 years. The higher the confidence level, the more sure we are that the true average age lies within the interval.

A 95% confidence level means that if the same experiment were repeated again and again, each with 100 random individuals from France, about 95% of the confidence intervals computed from those experiments would contain the true average age.
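A small simulation can make the repeated-sampling idea concrete. In the Python sketch below, everything about the population is made up (a true mean age of 36 with a spread of 10), and the interval uses the common normal approximation of mean ± 1.96 standard errors. It counts how often the computed interval contains the true mean.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)            # fixed seed so the run is repeatable
TRUE_MEAN, SPREAD = 36, 10        # made-up population parameters
SAMPLE_SIZE, TRIALS = 100, 1000

hits = 0
for _ in range(TRIALS):
    sample = [rng.gauss(TRUE_MEAN, SPREAD) for _ in range(SAMPLE_SIZE)]
    center = mean(sample)
    half_width = 1.96 * stdev(sample) / sqrt(SAMPLE_SIZE)  # approximate 95% interval
    if center - half_width <= TRUE_MEAN <= center + half_width:
        hits += 1

coverage = hits / TRIALS
print(coverage)  # expected to land near 0.95
```

Each trial produces a slightly different interval, and roughly 95% of them capture the true mean. You will never have to do this on the SAT; the point is only to show what the "95%" refers to.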

Solved Example 14
Environmentalists are testing pH levels in a forest that is being harmed by acid rain. They analyzed water samples from 40 rainfalls in the past year and found that the mean pH of the water samples has a 95% confidence interval of 3.2 to 3.8. Which of the following conclusions is the most appropriate based on the confidence interval?
A. 95% of all the forest rainfalls in the past year have pH between 3.2 and 3.8
B. 95% of all the forest rainfalls in the past decade have a pH between 3.2 and 3.8
C. It is plausible that the true mean pH of all the forest rainfalls in the past year is between 3.2 and 3.8
D. It is plausible that the true mean pH of all the forest rainfalls in the past decade is between 3.2 and 3.8

Solution: A confidence interval does NOT say anything about the rainfalls themselves. You cannot say that any one rainfall has a 95% chance of having a pH between 3.2 and 3.8, and you cannot say that 95% of all the forest rainfalls in the past year had a pH between 3.2 and 3.8.

So, in the example above, we can be quite confident that the true mean pH of all the forest rainfalls in the past year is between 3.2 and 3.8. The correct answer is C. The answer is not D because we cannot draw conclusions about the past decade when the samples were gathered from the past year.

### 4. Causation vs Correlation

The difference between correlation and causation is where the SAT stumps a lot of students. For example, just because students who complete more mock tests get better scores on the SAT doesn’t mean that completing more mock tests causes an improvement in SAT scores. Rather, completing mock tests is associated with an improvement in SAT scores.

Perhaps students who attempt more mock tests have more discipline, or they have more demanding parents who make them study harder. But due to the way the experiment was designed, we can’t tell what the underlying factor is.

Solved Example 15
Researchers conducted an experiment to determine whether eating junk food increased body weight. They randomly selected 500 people who eat processed packaged food at least once a week, and 500 people who do not eat processed packaged food at all. After tracking the people’s weight for a year, the researchers found that the people who eat processed packaged food at least once a week had experienced weight gain significantly higher than the people who do not eat processed packaged food at all. Based on the design and results of the study, which of the following is an appropriate conclusion?
A. Eating processed packaged food at least once a week is likely to increase body weight.
B. Eating processed packaged food three times a week increases body weight more than eating processed packaged food just once a week.
C. Any person who starts eating processed packaged food at least once a week will increase his or her body weight.
D. There is a positive association between eating processed packaged food and increased weight gain.

Solution: This question deals with a classic case of correlation vs. causation. Just because people who eat processed packaged food gained more weight doesn’t mean that eating processed packaged food causes an increase in body weight.

Therefore, answer A is wrong because it implies causation. Answer B is also wrong because it not only implies causation but also that the frequency of eating processed packaged food matters, something that wasn’t tracked in the experiment. Answer C is wrong because it suggests a completely certain outcome. Even if eating processed packaged food DID increase body weight, not every single person who starts eating processed packaged food will increase his body weight. Any conclusion drawn from sample data is a generalization and should not be regarded as truth for every individual.

The correct answer is D. There is a positive association between eating processed packaged food and an increase in body weight.

### 5. Random Sample

One last important thing to note in this chapter is that the sample has to be random. If a sample is not random, we will not be able to make a reliable prediction about the population.

For example, let’s say we are trying to determine an association between bargoers and football fans. If researchers picked 100 individuals who live close to a football stadium, then the sample is not random. Those people could be football fans because they live in close proximity to a stadium.

What should the researchers have done differently? The answer is random sampling. They should instead randomly select 100 people from all walks of life and from different geographic locations. The more diverse the group, the more accurate the data will be. Of course, conducting this type of study can be extremely difficult, which is why proving causation can be such a monumental task.