# Statistics

Before we explain the core statistics concepts tested on the SAT Math section, it's important to understand the graphs and charts that often appear in statistics-based questions.

August 29, 2020

## 1. Histograms

Like a line plot, a histogram shows the frequency of data. Instead of marking the data items with Xs, however, a histogram shows them as bars.

EXAMPLE: This histogram shows the heights of trees on each street in my town. In this case, the frequency is the number of trees, and the characteristic is the height of the trees.

From the graph, we see:

- There are 3 trees whose heights are between 30 and 35 feet
- There are 3 trees whose heights are between 36 and 40 feet
- There are 8 trees whose heights are between 41 and 45 feet etc.

## 2. Scatterplot

A scatter plot is a type of graph that shows the relationship between two sets of data. Scatter plots graph data as **ORDERED PAIRS** (this is simply a pair of numbers but the order in which they appear together matters).

After a test, Ms. Phinney asked her students how many hours they studied. She recorded their answers, along with their test scores. Create a scatter plot of hours studied and test scores.

To show Tammy's data, mark the point whose horizontal value is 4.5 and whose vertical value is 90.

By graphing the data on a scatter plot, we can see if there is a relationship between the number of hours studied and test scores. The scores generally go up as the hours of studying go up, so this shows that there IS a relationship between test scores and studying. We can draw a line on the graph that roughly describes the relationship between the number of hours studied and test scores. This line is known as the **LINE OF BEST FIT.** As you can see, none of the points lie on the line of best fit, but that's okay! The line of best fit is the single line that best describes the relationship of **ALL** the points on the graph, so it does not have to pass through any particular point.
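One common way to compute a line of best fit is the least-squares method. The SAT will not ask you to do this by hand, but a minimal sketch may make the idea concrete. Only Tammy's point (4.5, 90) comes from the example above; the other data points and all variable names are invented for illustration.

```python
# Hypothetical (hours studied, test score) pairs.
# Only (4.5, 90) is Tammy's point from the text; the rest are invented.
hours = [1.0, 2.0, 3.0, 4.5, 5.5]
scores = [65, 68, 80, 90, 92]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Least-squares slope and intercept of the line y = slope * x + intercept
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
sxx = sum((x - mean_x) ** 2 for x in hours)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"score is roughly {slope:.2f} * hours + {intercept:.2f}")
print("upward trend" if slope > 0 else "no upward trend")
```

A positive slope means the scores tend to rise as the hours rise, which is what the scatter plot suggests visually.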

Scatter plots show three types of relationships, called **CORRELATIONS**:

**POSITIVE CORRELATION:** As one set of values increases, the other set increases as well (but not necessarily every value).

EXAMPLE: As the population increases, so does the number of primary schools.

**NEGATIVE CORRELATION:** As one set of values increases, the other set decreases (but not necessarily every value).

EXAMPLE: As the price of peaches goes down, the number of peaches sold goes up.

**NO CORRELATION:** The values have no relationship.

EXAMPLE: A person's IQ is not related to their shoe size, so there is no correlation.

**Apart from these graphs, other graphs such as pie charts, box plots, and cumulative frequency graphs can appear.** They are rare, so we won't describe each of them in detail. If they do show up on the test, they are straightforward to answer. A couple of examples are below.

##### Example 1

What is the central angle of the sector of the pie chart that represents those who chose hamburgers as their favorite food?

**A.** 97.2°

**B.** 97.4°

**C.** 98.4°

**D.** 98.8°

**Solution**: We know that 27% of those in the survey chose hamburgers as their favorite food. This corresponds to 27% of the area of the circle, which means that the central angle of that sector accounts for 27% of the sum of all central angles in a circle (360°).

In other words, we are looking for the angle that makes up 0.27 of the entire circle.

0.27 × 360° = 97.2°

The correct answer is A.

##### Example 2

At what pH is the enzyme activity at its maximum?

**A.** 1

**B.** 7

**C.** 10

**D.** 14

**Solution**: Enzyme activity is at its maximum in the earliest stages of the experiment at about time = 0. The pH at this time is about 1.5. The closest correct answer is A.

# STATISTICAL MEASURES

Statistics has long been a feared SAT topic for many test takers. Most of that fear comes from not understanding the calculations behind statistical measures, which can seem vague and confusing at first.

The most commonly used **statistical** measures are described below:

## 1. The Mean (also called the average)

The mean is a calculated central value of a set of numbers. To calculate the mean, add all of the numbers, then divide the sum by the number of items.

##### Example 3

If x is the average (arithmetic mean) of m and 9, y is the average of 2m and 15, and z is the average of 3m and 18, what is the average of x, y, and z in terms of m?

**A.** m + 6

**B.** m + 7

**C.** 2m + 14

**D.** 3m + 21

**Solution**: There are a lot of variables in this problem, but don't let them confuse you. We already know that the average of two numbers is the sum of those two numbers divided by 2. That means that:

\(x=\frac{m+9}{2}\)

\(y=\frac{2m+15}{2}\)

\(z=\frac{3m+18}{2}\)

Now we need to find the average of x, y, and z. Substituting the previous expressions for x, y, and z gives us:

\(\frac{x+y+z}{3}=\frac{m+9+2m+15+3m+18}{3\times 2}\)

\(=\frac{6m+42}{6}\)

\(=m+7\)

The correct answer is B.

##### Example 4

The mean score of 8 players in a basketball game was 14.5 points. If the highest individual score is removed, the mean score of the remaining 7 players becomes 12 points. What was the highest score?

**A.** 20

**B.** 24

**C.** 32

**D.** 36

**Solution**: If the mean score of 8 players is 14.5, then the total of those 8 scores is 14.5 × 8 = 116. If the mean of 7 scores is 12, then the total of those 7 scores is 12 × 7 = 84.

Since the set of 7 scores was created by removing the highest score from the set of 8 scores, the difference between the total of all 8 scores and the set of 7 scores is equal to the removed score.

116 - 84 = 32

The correct answer is C.
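The "total = mean × count" idea used in this solution can be checked in a few lines. This is only a sketch of the arithmetic above; the variable names are illustrative.

```python
# Totals implied by the two means in the basketball example
total_all_8 = 14.5 * 8      # 116 points
total_remaining_7 = 12 * 7  # 84 points

# The removed (highest) score is the difference between the two totals
highest_score = total_all_8 - total_remaining_7
print(highest_score)  # 32.0
```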

## 2. Weighted Averages

The basic formula for averages applies only to sets of data consisting of individual values, all of which are equally weighted (i.e., none of the values “counts” toward the average any more than any other value does). When you consider sets in which some data are more heavily weighted than other data - whether weighted by percent, frequencies, ratios, or fractions - you need to use special techniques for **WEIGHTED AVERAGES.**

A weighted average of only two values will fall closer to whichever value is weighted more heavily. For instance, if a drink is made by mixing 2 shots of a liquor containing 15% alcohol with 3 shots of a liquor containing 20% alcohol, then the alcohol content of the mixed drink will be closer to 20% than to 15%.

EXAMPLE: A mixture of "lean" ground beef (10% fat) and "super-lean" ground beef (3% fat) has a total fat content of 8%. What is the ratio of "lean" ground beef to "super-lean" ground beef?

Fortunately, you do not need any complicated formulas here. Instead, look at the difference between the fat content of each type of ground beef and the fat content of the final mixture.

"Lean" ground beef has a fat content that is 2 percentage points higher than the fat content of the final mixture. You can say that "lean" ground beef has a +2 differential.

Similarly, "super-lean" ground beef has a fat content that is 5 percentage points lower than the fat content of the final mixture. You can say that "super-lean" ground beef has a -5 differential. You need to make these differentials cancel out, so multiply each differential by a number chosen so that the positive cancels the negative. If you were to set up an equation, it would look something like this:

x(+2) + y(-5) = 0

Now all you have to do is pick values for x and y. If x = 5 and y = 2, the equation will be true (10 + (-10) = 0). That means that for every 5 parts “lean” ground beef, you have 2 parts “super-lean” ground beef. The ratio is 5:2.

This relationship holds whenever two groups are averaged together. Suppose that A and B are averaged together. If they are in a ratio of a:b, then you can multiply the differential of A by a, and it will cancel out with the differential of B times b.

EXAMPLE: A group consists of men and women in a ratio of 2:3. If the men have an average age of 50, and the average age of the group is 56, you can easily figure out the average age of the women in the group. Men have a -6 differential, and there are 2 of them for every 3 women. If the differential of the women is w, then:

2 x (-6) + 3 x (w) = 0

-12 + 3w = 0

w = 4

Women have a +4 differential. The average age of the women in the group is 56 + 4 = 60 years old.
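The differential method from both examples can be sketched in code. The function names `mix_ratio` and `other_group_average` are made up for this illustration; they simply solve the balancing equation a × (differential of A) + b × (differential of B) = 0.

```python
from fractions import Fraction

def mix_ratio(value_a, value_b, target):
    """Return (parts of A) : (parts of B) as a Fraction.

    Solves a * (value_a - target) + b * (value_b - target) = 0 for a / b.
    """
    return Fraction(target - value_b, value_a - target)

def other_group_average(avg_a, parts_a, parts_b, overall_avg):
    """Average of group B, given group A's average and the ratio parts_a : parts_b."""
    differential_a = avg_a - overall_avg
    differential_b = -parts_a * differential_a / parts_b
    return overall_avg + differential_b

beef_ratio = mix_ratio(10, 3, 8)               # lean : super-lean
women_avg = other_group_average(50, 2, 3, 56)  # men : women = 2 : 3

print(beef_ratio)  # 5/2
print(women_avg)   # 60.0
```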

## 3. Median

A median is the middle number of a data set when all of the items are written in order, from least to greatest (or greatest to least).

The following are the steps to calculate the median of a list:

- Arrange the list in ascending or descending order
- If there is an odd number of items on the list, the median is the middle item:

\(Median\:(odd\: number\: of\: items)=\left(\frac{n+1}{2}\right)^{th}\: term\)

- If there is an even number of items on the list, the median is the average of the two middle items:

\(Median\:(even\: number\: of\: items)=\frac{\left(\frac{n}{2}\right)^{th}\: term+\left(\frac{n}{2}+1\right)^{th}\: term}{2}\)

For EXAMPLE:

- In the 7-element set { 3, 5, 7, 9, 13, 15, 17 }, the median is 9
- In the 8-element set { 3, 5, 7, 9, 13, 15, 17, 17 }, the median is 11 (the average of the fourth and fifth entries, 9 and 13).

**Note:**

- When the number of items on the list is even, the median can equal a number not on the list.
- An absurdly large number far from the rest of a set, such as 312 in the set {1, 3, 3, 3, 3, 3, 74, 89, 312}, has essentially no effect on the median, although it has a big effect on the mean.
- As long as the middle numbers stay in the middle, changes to the values of the outer numbers have no effect on the median; by contrast, changing any number in the set changes the mean.
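The two example sets and the note about outliers can be checked with Python's standard `statistics` module. The list `with_outlier` is a hypothetical variation of the 7-element set, added only to illustrate the point.

```python
from statistics import mean, median

odd_set = [3, 5, 7, 9, 13, 15, 17]
even_set = [3, 5, 7, 9, 13, 15, 17, 17]

print(median(odd_set))   # 9
print(median(even_set))  # 11.0

# Hypothetical variation: replace the largest value with an extreme one.
# The median stays put, while the mean jumps.
with_outlier = [3, 5, 7, 9, 13, 15, 312]
print(median(with_outlier))                # 9
print(mean(odd_set) < mean(with_outlier))  # True
```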

##### Example 4

A sociologist chose 300 students at random from each of two schools and asked each student how many siblings he or she has. The results are shown in the table below:

There are a total of 2,400 students at Lincoln School and 3,300 students at Washington School. What is the median number of siblings for all the students surveyed?

**A.** 0

**B.** 1

**C.** 2

**D.** 3

**Solution**: There were a total of 600 data points collected (300 from each school), which means the median will be the average of the 300th and 301st values in the ordered list.

Fortunately, there's a way to solve the problem without having to write out 600 numbers! You can put the numbers into groups based on the information you're given in the chart.

For each "number of siblings" value, add the number of respondents from each of the two schools together. For example, 120 students from Lincoln School and 140 students from Washington School said they had no siblings, and 120 + 140 = 260. So a total of 260 students have 0 siblings. Do this for each of the sibling values.

- 260 students have 0 siblings
- 190 students have 1 sibling
- 90 students have 2 siblings
- 40 students have 3 siblings
- 20 students have 4 siblings.

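The first 260 values in the ordered list are 0, and the 261st through 450th values are 1. Both the 300th and the 301st values therefore fall in the second category (190 students have 1 sibling), so the median is 1. The correct answer is B.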
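The grouping approach can be sketched as a running total over the combined counts listed above. The helper name `value_at` is an assumption made for this illustration.

```python
# Combined counts from both schools: {number of siblings: number of students}
counts = {0: 260, 1: 190, 2: 90, 3: 40, 4: 20}

def value_at(position, counts):
    """Return the value at a 1-based position in the sorted data."""
    running_total = 0
    for value in sorted(counts):
        running_total += counts[value]
        if position <= running_total:
            return value
    raise ValueError("position is beyond the end of the data")

total = sum(counts.values())  # 600 students

# Even count: average the two middle values (300th and 301st)
lower_middle = value_at(total // 2, counts)
upper_middle = value_at(total // 2 + 1, counts)
median_siblings = (lower_middle + upper_middle) / 2
print(median_siblings)  # 1.0
```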

##### Example 5

A survey was taken regarding the value of homes in a county, and it was found that the mean home value was $165,000 and the median home value was $125,000. Which of the following could explain the difference between the mean and median home values in the county?

**A.** The homes have values that are close to each other

**B.** There are a few homes that are valued much less than the rest

**C.** There are a few homes that are valued much more than the rest

**D.** Many of the homes have values between $125,000 and $165,000

**Solution**: The mean and median of a set of data are equal when the data has a perfectly symmetrical distribution (such as a normal distribution). If the mean and median differ noticeably, the data is not symmetrical, and outliers are a common cause.

When there are outliers in the data, the mean is pulled in the outliers' direction (either smaller or larger) while the median stays nearly the same. In this problem, the mean is larger than the median. That means the outliers are several homes that are significantly more expensive than the rest, since these outliers push the mean to be larger with little effect on the median.

Therefore, the correct answer is C, There are a few homes that are valued much more than the rest.

*Median of sets containing unknown values*

Unlike the arithmetic mean, the median of a set depends only on the one or two values in the middle of the ordered set. Therefore, you may be able to determine a specific value for the median of a set even if one or more unknowns are present.

For instance, consider the unordered set {x, 2, 5, 11, 11, 12, 33}. No matter whether x is less than 11, equal to 11, or greater than 11, the median of the resulting set will be 11. (Try substituting different values of x to see why the median does not change.)

By contrast, the median of the unordered set {x, 2, 5, 11, 12, 12, 33} depends on x. If x is 11 or less, the median is 11. If x is between 11 and 12, the median is x. Finally, if x is 12 or more, the median is 12.
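Trying a few values of x, as suggested above, is easy to do in code. The particular values of x below are arbitrary choices for illustration.

```python
from statistics import median

# First set: the median is 11 no matter where x falls
for x in (-100, 11, 50):
    print(x, median([x, 2, 5, 11, 11, 12, 33]))

# Second set: the median depends on x
for x in (3, 11.5, 40):
    print(x, median([x, 2, 5, 11, 12, 12, 33]))  # 11, then 11.5, then 12
```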

EXAMPLE: Sarah spends 2 hours on Tuesday, Thursday, and Saturday practicing her violin. On Monday she practices for 90 minutes, and on Friday she practices for 1 hour. What is the median time she spends practicing her violin in a given week?

*STEP 1:* Convert all the given time values into the same unit of measure.

Monday = 90 mins (1.5 hours)

Tuesday = 2 hours

Thursday = 2 hours

Friday = 1 hour

Saturday = 2 hours

*STEP 2:* Organize the times in ascending order: 1, 1.5, 2, 2, 2

Since there is an odd number of data entries, 5, in the set, the median will be the 3rd value in the list. Therefore, the median is 2 hours.

##### Example 6

In Billy’s 5th grade class, the test scores for the final math exam were 87, 54, 77, 92, 95, 91, x, and 90. What is the value of x if the median of the test scores is 82?

**A.** 54

**B.** 74

**C.** 77

**D.** 92

**Solution**:

*STEP 1:* Reorder the data set in ascending order: 54, 77, 87, 90, 91, 92, 95, x

*STEP 2:* Since the set has an even number of data entries, the median will be the average of the two middle numbers.

Therefore, the median will be the average of the 4th and 5th terms.

\(The\: median\: is\: \frac{90+91}{2}=90.5\neq 82\)

*STEP 3:* Now, let’s reorder our data set with the x on the other end of the set: x, 54, 77, 87, 90, 91, 92, 95

The median will still be the average of the 4th and 5th terms, but those terms will be different now:

\(The\: median\: is\: \frac{87+90}{2}=88.5\neq 82\)

Now we know that the x-value will need to be one of the middle numbers that the average is being taken from; therefore, we can rewrite the set as follows: 54, 77, 87, x, 90, 91, 92, 95

\(The\: median\: is\: \frac{x+90}{2}=82\)

x + 90 = 164

x = 74

The correct answer is B.

## 4. Mode

The MODE represents the value or values in a data set that is/are repeated the most. It is possible for a data set to have one, multiple, or no modes.

For EXAMPLE,

- There is no mode in the data set {1, 2, 3, 4, 5} since each number appears an equal number of times
- In the data set {2, 4, 5, 2, 3, 1, 2, 2}, 2 is the mode, as it is seen four times
- In the data set {1, 1, 2, 2, 3, 4}, there are two modes, 1 and 2, since each is seen twice in the set
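The three cases above can be sketched with a small helper. The function name `modes` is made up for this illustration, and it assumes the convention used here that a set in which no value repeats has no mode.

```python
from collections import Counter

def modes(data):
    """Return the most frequent value(s), or [] when no value repeats."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []  # convention used in the text: nothing repeats, so no mode
    return sorted(value for value, count in counts.items() if count == highest)

print(modes([1, 2, 3, 4, 5]))           # []
print(modes([2, 4, 5, 2, 3, 1, 2, 2]))  # [2]
print(modes([1, 1, 2, 2, 3, 4]))        # [1, 2]
```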

EXAMPLE: Given the set {72, 75, 85, 90, 90, x}, what is the value of x if a mode of the set is 90 and its median is 85?

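**Solution**: There are three possible places x can be located:

- x is the largest value: {72, 75, 85, 90, 90, x}
- x is the smallest value: {x, 72, 75, 85, 90, 90}
- x is one of the two middle values: {72, 75, x, 85, 90, 90}

The median of each scenario is the average of the 3rd and 4th terms:

- \(\frac{85+90}{2}=87.5\neq 85\)
- \(\frac{75+85}{2}=80\neq 85\)
- \(\frac{x+85}{2}=85\). Therefore, x = 85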

Now, write down the set with x as 85: {72, 75, 85, 85, 90, 90}. In this set, the modes are both 85 and 90. Therefore the correct answer for x is 85.

## 5. Range

The SAT loves this statistical measure because it’s so simple. The **RANGE** is the difference between the maximum value and the minimum value.

- In the set {3, 5, 7, 9, 13, 15, 17}, the range = 17 - 3 = 14
- In the set {1, 3, 3, 3, 3, 3, 74, 89, 312}, the range = 312 - 1 = 311

For the range, the only thing that matters is the top and bottom values in the set.

The smaller the range of a set, the closer its data entries are to one another. If the range is large, then either there is more space between the data entries or there are outliers in the data.

EXAMPLE: What is the value of x if the range of the set {2, 3, x, 5, 12} is 12?

Solution: Find the range of the set, assuming x lies in the middle of the set.

Range = 12 − 2 = 10 ≠ 12

This tells us that x must be either greater than 12 or less than 2.

**POSSIBILITY 1:** {x, 2, 3, 5, 12}

Range = 12 − x = 12

x = 0

**POSSIBILITY 2:** {2, 3, 5, 12, x}

Range = x − 2 = 12

x = 14

In order for the range of this particular data set to be 12, x must be 0 or 14.
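The same reasoning can be sketched in code: x has to become either the new minimum or the new maximum, and each candidate is then checked against the target range. The variable names are illustrative.

```python
known = [2, 3, 5, 12]
target_range = 12

# If x sat between 2 and 12, the range would stay 12 - 2 = 10,
# so x must become the new minimum or the new maximum.
candidates = [max(known) - target_range, min(known) + target_range]

# Confirm that each candidate really produces the target range
checks = [max(known + [x]) - min(known + [x]) == target_range for x in candidates]

print(candidates)  # [0, 14]
print(checks)      # [True, True]
```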

## 6. Standard Deviation

The mean and median both give “average” or “representative” values for a set, but they do not tell us the whole story. It is possible for two sets to have the same average but to differ widely in how spread out their values are. To describe the spread, or variation, of the data in a set, you use a different measure: **STANDARD DEVIATION.**

**STANDARD DEVIATION (SD)** indicates how far from the average (mean) the data points typically fall.

- A small SD indicates that a set is clustered closely around the average (mean) value
- A large SD indicates that the set is spread out widely, with some points appearing far from the mean.

EXAMPLE: Consider the sets {5, 5, 5, 5}, {2, 4, 6, 8}, and {0, 0, 10, 10}. These sets all have the same mean value of 5. You can see at a glance, though, that the sets are very different, and the differences are reflected in their SDs. The **first** set has an SD of zero (no spread at all), the **second** set has a moderate SD (√5 ≈ 2.24), and the **third** set has a large SD (5).

You might be asking where the √5 comes from in the technical definition of SD for the second set.

The good news is that you do not need to know - it is very unlikely that an SAT problem will ask you to calculate an exact SD. If you just pay attention to what the average spread is doing, you’ll be able to answer all SAT standard deviation problems, which involve either

- Changes in the SD when a set is transformed, or
- Comparisons of the SDs of two or more sets.

Just remember that the more spread out the numbers, the larger the SD.
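Although the SAT will not ask for exact values, the three sets {5, 5, 5, 5}, {2, 4, 6, 8}, and {0, 0, 10, 10} make a handy check of this rule. The sketch below assumes the population standard deviation (`statistics.pstdev`), which is the version that gives √5 for the second set.

```python
from math import isclose, sqrt
from statistics import pstdev  # population standard deviation

sets = [[5, 5, 5, 5], [2, 4, 6, 8], [0, 0, 10, 10]]

for data in sets:
    # Same mean (5) in every case, but a growing spread
    print(data, round(pstdev(data), 2))  # 0.0, then 2.24, then 5.0

print(isclose(pstdev(sets[1]), sqrt(5)))  # True
```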

If you come across a problem on the test that focuses on changes in the SD, ask yourself whether the changes move the data closer to the mean, farther from the mean, or neither. If you see a problem requiring comparisons, ask yourself which set is more spread out from its mean.

Below are some sample problems to help illustrate standard deviation:

- Which set has the greater standard deviation: {1, 2, 3, 4, 5} or {440, 442, 443, 444, 445}?
- If each data point in a set is increased by 7, does the set's standard deviation increase, decrease, or remain constant?
- If each data point in a set is increased by a factor of 7, does the set's standard deviation increase, decrease, or remain constant? (Assume that the set consists of different numbers)

**Solution:**

- The second set has the greater SD. One way to understand this is to observe that the gaps between its numbers are, on average, slightly bigger than the gaps in the first set. Only the spread matters. The numbers in the second set are much more “consistent” in a sense - they are all within about 1% of each other, while the largest numbers in the first set are several times the smallest ones. However, this “percent variation” idea is irrelevant to the SD.
- The SD will not change. “Increased by 7” means that the number 7 is added to each data point in the set. This transformation will not affect any of the gaps between the data points, and thus it will not affect how far the data points are from the mean.
- The SD will increase. “Increased by a factor of 7” means that each data point is multiplied by 7. This transformation will make all the gaps between points 7 times as big as they originally were. Thus, each point will fall 7 times as far from the mean. The SD will increase by a factor of 7.
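These three answers can be confirmed numerically. This is a sketch using the population standard deviation from Python's `statistics` module; the variable names are illustrative.

```python
from math import isclose
from statistics import pstdev

first = [1, 2, 3, 4, 5]
second = [440, 442, 443, 444, 445]

# Problem 1: only the spread matters, not the size of the numbers
print(pstdev(second) > pstdev(first))  # True

# Problem 2: adding 7 to every value leaves the SD unchanged
shifted = [v + 7 for v in first]
print(isclose(pstdev(shifted), pstdev(first)))  # True

# Problem 3: multiplying every value by 7 multiplies the SD by 7
scaled = [v * 7 for v in first]
print(isclose(pstdev(scaled), 7 * pstdev(first)))  # True
```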

##### Example 7

Set S has a mean of 10 and a standard deviation of 1.5. We are going to add two additional numbers to Set S. Which pair of numbers would decrease the standard deviation the most?

**A.** {2, 10}

**B.** {10, 18}

**C.** {7, 13}

**D.** {9, 11}

**Solution**: This is a very tricky problem. The starting list has a mean of 10 and a standard deviation of 1.5.

**A.** These two numbers don’t have a mean of 10, so adding them will change the mean; what’s more, one number is “far away,” which will wildly decrease the mean, increasing the deviations from the mean for almost every number on the list, and therefore increasing the standard deviation.

**B.** These choices don’t have a mean of 10, so adding them will change the mean. One number is “far away,” which will wildly increase the mean, increasing the deviations from the mean for almost every number on the list, and therefore increasing the standard deviation.

**C.** These options are centered on 10, so adding them will not change the mean. Both of these are a distance of 3 units from the mean, and this is larger than the standard deviation, so the size of the typical deviation from the mean will increase.

**D.** This is the correct answer. These are centered on 10, so adding them will not change the mean. Both are a distance of 1 unit from the mean, and this is less than the standard deviation, so the size of the typical deviation from the mean will decrease.

##### Example 8

Set Q consists of the following five numbers: Q = {5, 8, 13, 21, 34}. Which of the following sets has the same standard deviation as Set Q?

I. {35, 38, 43, 51, 64}

II. {10, 16, 26, 42, 68}

III. {46, 59, 67, 72, 75}

**A.** I only

**B.** I & II

**C.** I & III

**D.** I, II, & III

**Solution**: Notice that *Set I* is just every number in Q plus 30. When you add the same number to every number in a set, you simply shift it up without changing the spacing, so this doesn’t change the standard deviation at all. *Set I* has the same standard deviation as Q.

Notice that *Set II* is just every number in Q multiplied by 2. Multiplying by a number does change the spacing, so this **does** change the standard deviation. *Set II* does **not** have the same standard deviation as Q.

*Set III* is very tricky and probably is at the outer limit of what the SAT could ever ask you to consider. The spacing between the numbers in *Set III*, from right to left, is the same as the spacing between the numbers in Q from left to right (gaps of 3, 5, 8, and 13). *Set III* is a mirror image of Q, so its numbers are spread around their mean in exactly the same way, and *Set III* has the same standard deviation as Q.

The correct combination is *I and III*, so the answer is C.
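A quick numerical check of the three sets, again as a sketch using the population standard deviation; the dictionary keys simply mirror the Roman numerals in the question.

```python
from math import isclose
from statistics import pstdev

Q = [5, 8, 13, 21, 34]
candidates = {
    "I": [35, 38, 43, 51, 64],    # Q shifted up by 30
    "II": [10, 16, 26, 42, 68],   # Q multiplied by 2
    "III": [46, 59, 67, 72, 75],  # same gaps as Q, in reverse order
}

same_sd = {name: isclose(pstdev(values), pstdev(Q)) for name, values in candidates.items()}
print(same_sd)  # {'I': True, 'II': False, 'III': True}
```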

##### Example 9

Consider the following sets:

L = {3, 4, 5, 5, 6, 7}

M = {2, 2, 2, 8, 8, 8}

N = {15, 15, 15, 15, 15, 15}

Rank those three sets from least standard deviation to greatest standard deviation.

**A.** L, M, N

**B.** M, L, N

**C.** M, N, L

**D.** N, L, M

**Solution**: First of all, set N has six numbers that are all the same. When all the members of a set are identical, the standard deviation is zero, which is the smallest possible standard deviation. So, automatically, N must have the lowest. Right away, we can eliminate options A, B, and C. Only D remains. The correct answer is D.

## 7. Error

**Error** is a way to describe the accuracy of a data set. When dealing with error in statistics, pay special attention to wording that differentiates between the ‘**EXPECTED VALUE**’ and the ‘**ACTUAL VALUE**’.

\(Error\:\%=\frac{Actual\: value-Expected\: value}{Expected\: value}\times 100\)

EXAMPLE: A researcher studies a certain species of fish. He finds that the size of the fish population is limited by the size of the lakes in which they live, and derives an equation to model the expected population size, P, based on surface area, A, of the lake in square feet:

P = 5 + 0.83A

If the researcher finds that a particular lake has a surface area of \(342ft^{2}\) and a population of 310 fish, what is the percent error from the predicted value?

Solution: The actual value is 310 fish, given in the question. The expected value can be calculated using the researcher’s equation to predict the expected fish population in a lake with surface area \(342ft^{2}\).

P = 5 + 0.83(342)

P = 288.86

The expected value is 288.86 fish and the actual value is 310 fish. Therefore,

\(Error\:\%=\frac{Actual\: value-Expected\: value}{Expected\: value}\times 100=\frac{310-288.86}{288.86}\times 100=7.32\%\)
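The percent error calculation for the fish example can be sketched as a small function. The name `percent_error` is an assumption made for this illustration; the model P = 5 + 0.83A and the numbers come from the example.

```python
def percent_error(actual, expected):
    """Percent error of an actual value relative to the expected value."""
    return (actual - expected) / expected * 100

# Expected population from the researcher's model P = 5 + 0.83A, with A = 342
expected_population = 5 + 0.83 * 342
error = percent_error(310, expected_population)

print(round(expected_population, 2))  # 288.86
print(round(error, 2))                # 7.32
```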

# SAMPLING AND MODELING

One of the important jobs of a statistician is to make predictions. For example, the Indian Government may want to find out the average age of a person in India. It is impossible for the government to survey each and every individual in the country. Therefore, it hires statisticians to take random samples of residents and make predictions based on the data they collect. The SAT will often test you on the relationship between such sample data and the predictions you can or cannot make about the entire population.

##### Example 10

Manchester United Football Club chose 1,000 of their fans at random and asked each fan how many jerseys he or she has. The results are shown in the table below.

There are a total of 16 million Manchester United fans. Based on the survey data, what is the expected total number of fans who own 3 jerseys?

**A.** 2 million

**B.** 4 million

**C.** 6 million

**D.** 8 million

**Solution**: Using the sample data, we can estimate the total number of fans who own 3 jerseys:

16 million x 250/1000 = 4 million

The correct answer is B.
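Scaling a sample proportion up to the whole population is a one-line calculation. In this sketch, the count of 250 fans is the figure used in the solution above; the variable names are illustrative.

```python
sample_size = 1_000
sample_with_3_jerseys = 250  # figure used in the solution above
population = 16_000_000

# Apply the sample proportion (250 out of 1,000) to the whole population
estimate = population * sample_with_3_jerseys // sample_size
print(estimate)  # 4000000
```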

The SAT usually tests the following key concepts related to making accurate predictions:

## 1. Line of Best Fit

A statistician is trying to figure out the relationship between the number of fans football clubs have, and the number of tournaments football clubs have won. Therefore, he collects information regarding the number of fans, and the number of tournaments won for 100 football clubs in Europe. He then plots those points on a scatterplot graph.

The line of best fit refers to a line through that scatter plot that best expresses the relationship between those points: number of fans of a football club vs. the number of tournaments won. **The line of best fit can be used to figure out if there is a relationship between those two variables.**

##### Example 11 & 12

The scatterplot shows the number of pollinating flowers for 20 Southern Magnolia plants of different ages. The line of best fit is also shown.

11. Which of the following is the best interpretation of the slope of the line of best fit in the context of this problem? The predicted increase in the age of the Southern Ma

A.