Noise
“Why isn’t having a small sample size a source of bias?”
Read in the data customerData.csv
customerData <- read.csv("customerData.csv")
Let’s treat this full data set of 1000 observations as the population – the entire set of people we are interested in.
In the population, what is the mean of income?
mean(customerData$income)
Noise
Imagine we did not know this number for the population
- We could only estimate it by surveying a sample of people.
- Take a random subset of 10 observations by running this code:
sample.size <- 10
# Draw sample.size row indices at random, without replacement
customerDataSmall <- customerData[
  sample(x = 1:nrow(customerData),
         size = sample.size,
         replace = FALSE),
]
Noise
If you did not know the population average of income, how would you estimate it with your survey sample?
- You would see what the mean in the sample is!
mean(customerDataSmall$income)
That sample estimate is going to be noisy
- It’s going to vary from sample to sample around the population average
If everyone in the class had their own sample (which you do), what might your different estimates look like?
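The slides that follow use `sample.results` and `nsims` without showing how they were built. One way they could be produced is sketched below, under stated assumptions: the simulated `customerData` is a hypothetical stand-in for customerData.csv so the code runs on its own, and `nsims <- 1000` is an assumed value, not one taken from the original.

```r
# Hypothetical stand-in for customerData.csv (assumption, not the real data).
# With the real file, use: customerData <- read.csv("customerData.csv")
set.seed(1)
customerData <- data.frame(income = rnorm(1000, mean = 50000, sd = 10000))

nsims <- 1000        # assumed number of simulated samples ("students")
sample.size <- 10

# Draw nsims independent samples of 10 and record each sample mean
sample.results <- data.frame(
  sim  = 1:nsims,
  mean = replicate(
    nsims,
    mean(sample(customerData$income, size = sample.size, replace = FALSE))
  )
)

# The estimates scatter around the population mean
summary(sample.results$mean)
mean(customerData$income)
```

Each row of `sample.results` plays the role of one student's estimate, which is what the above/below comparison on the next slide needs.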
Noise
Are these estimates biased?
- Are they more likely to be above or below the true mean?
% above:
sum(sample.results$mean > mean(customerData$income)) /  # number of sample means above the population mean
  nsims                                                  # divided by the number of simulated samples
% below:
sum(sample.results$mean < mean(customerData$income)) /  # number of sample means below the population mean
  nsims
Noise
With a small sample, our results are not more likely to fall on one side of the true mean than the other.
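- As long as our data don’t have extreme outliers! (Strong skew can tilt the above/below split even though the sample mean is still right on average.)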
What is the benefit of larger samples, then?
- Precision!
- How far off were our estimates, on average?
mean(abs(sample.results$mean - mean(customerData$income))) |>
round(2)
Noise
Let’s try with samples of 20:
sample.size <- 20
customerDataSmall <- customerData[
  sample(x = 1:nrow(customerData),
         size = sample.size,
         replace = FALSE),
]
mean(customerDataSmall$income)
Noise
Let’s try with samples of 20:
- How far off was our estimate, on average?
mean(abs(sample.results[sample.results$sample.size==20,'mean'] - mean(customerData$income))) |>
round(2)
Noise
Let’s try with samples of 100:
sample.size <- 100
customerDataSmall <- customerData[
  sample(x = 1:nrow(customerData),
         size = sample.size,
         replace = FALSE),
]
mean(customerDataSmall$income)
Noise
Let’s try with samples of 100:
- How far off was our estimate, on average?
mean(abs(sample.results[sample.results$sample.size==100,'mean'] - mean(customerData$income))) |>
round(2)
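To see the three sample sizes side by side, the sketch below repeats the sampling for 10, 20, and 100 observations and compares the average absolute error. It assumes a simulated stand-in for customerData.csv and an assumed `nsims` of 1000; the `sample.size` column mirrors the one the slides filter on, but how the original `sample.results` was built is not shown, so this is only one plausible construction.

```r
# Hypothetical stand-in for customerData.csv (assumption, not the real data)
set.seed(2)
customerData <- data.frame(income = rnorm(1000, mean = 50000, sd = 10000))

nsims <- 1000                          # assumed number of samples per size
pop.mean <- mean(customerData$income)

# One row per simulated sample, tagged with its sample size
sample.results <- do.call(rbind, lapply(c(10, 20, 100), function(n) {
  data.frame(
    sample.size = n,
    mean = replicate(
      nsims,
      mean(sample(customerData$income, size = n, replace = FALSE))
    )
  )
}))

# Mean absolute error of the estimate, by sample size
abs.error <- tapply(abs(sample.results$mean - pop.mean),
                    sample.results$sample.size,
                    mean)
round(abs.error, 2)
```

The average error shrinks as the sample grows, which is the precision gain the slides describe.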
Noise
Conclusion:
- Noise is random error
- It causes our estimates to bounce around the true/population mean
- Makes our estimates imprecise
- But it doesn’t push them in one direction or another
Bias on the other hand…
Bias
- Systematic errors that consistently push measurements or estimates away from the true value
- It is predictable, and often stems from flaws in the measurement process or the sampling method.
- Bias tends to push measurements or estimates in a particular direction, either overestimating or underestimating the true value.
- It is not reduced by increasing the sample size
Bias
Let’s imagine that instead of estimating income with a random sample of 10/20/100 people, we sent out a survey, and got 10 responses.
- But young people were more likely to respond
sample.size <- 10
# Extreme case of the response pattern: only customers under 30 respond
under30 <- customerData[customerData$age <= 29, ]
customerDataSmall <- under30[
  sample(x = 1:nrow(under30),
         size = sample.size,
         replace = FALSE),
]
Bias
What is our estimate of income from these samples?
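A minimal sketch of how the under-30 estimates could be simulated is given below. The population is a hypothetical stand-in for customerData.csv in which income rises with age; that relationship, its direction, and `nsims <- 1000` are all assumptions made for illustration, not facts about the real data.

```r
# Hypothetical stand-in for customerData.csv (assumption, not the real data):
# income is ASSUMED to increase with age so that age-based response matters.
set.seed(3)
age <- sample(18:70, size = 1000, replace = TRUE)
customerData <- data.frame(
  age    = age,
  income = 20000 + 800 * age + rnorm(1000, sd = 8000)
)

under30 <- customerData[customerData$age <= 29, ]
nsims <- 1000        # assumed number of simulated surveys
sample.size <- 10

# Each simulated survey gets 10 responses, all from customers under 30
sample.results <- data.frame(
  sim  = 1:nsims,
  mean = replicate(
    nsims,
    mean(sample(under30$income, size = sample.size, replace = FALSE))
  )
)

mean(sample.results$mean)    # average estimate from under-30 respondents
mean(customerData$income)    # population mean
```

Under these assumptions the estimates cluster below the population mean instead of scattering around it; with the real data the direction depends on how income actually relates to age.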
Bias
Are these estimates biased?
- Are they more likely to be above or below the true mean?
% above:
sum(sample.results$mean > mean(customerData$income)) /  # number of sample means above the population mean
  nrow(sample.results)                                   # divided by the number of simulated samples
% below:
sum(sample.results$mean < mean(customerData$income)) /  # number of sample means below the population mean
  nrow(sample.results)
Bias
Let’s try with samples of 20:
sample.size <- 20
customerDataSmall <- under30[
  sample(x = 1:nrow(under30),
         size = sample.size,
         replace = FALSE),
]
mean(customerDataSmall$income)
Bias
Are these estimates biased?
- Are they more likely to be above or below the true mean?
% above:
sum(sample.results$mean > mean(customerData$income)) /  # number of sample means above the population mean
  nrow(sample.results)                                   # divided by the number of simulated samples
% below:
sum(sample.results$mean < mean(customerData$income)) /  # number of sample means below the population mean
  nrow(sample.results)
Bias
Let’s try with samples of 100:
sample.size <- 100
customerDataSmall <- under30[
  sample(x = 1:nrow(under30),
         size = sample.size,
         replace = FALSE),
]
mean(customerDataSmall$income)
Bias
Are these estimates biased?
- Are they more likely to be above or below the true mean?
% above:
sum(sample.results$mean > mean(customerData$income)) /  # number of sample means above the population mean
  nrow(sample.results)                                   # divided by the number of simulated samples
% below:
sum(sample.results$mean < mean(customerData$income)) /  # number of sample means below the population mean
  nrow(sample.results)
Bias
Conclusion:
- Bias is not random error
- It causes our estimates to be consistently higher or consistently lower than the true/population mean
- Makes our estimates predictably wrong
- It does not get better with sample size
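The contrast between the two conclusions can be checked in one table. The sketch below draws under-30 samples of 10, 20, and 100 and reports, for each size, the average gap from the population mean (bias) and the spread of the estimates (noise). It again uses a hypothetical population in which income is assumed to rise with age, and an assumed `nsims` of 1000; the helper name `bias.by.size` is introduced here for illustration only.

```r
# Hypothetical stand-in for customerData.csv (assumption, not the real data):
# income is ASSUMED to increase with age.
set.seed(4)
age <- sample(18:70, size = 1000, replace = TRUE)
customerData <- data.frame(
  age    = age,
  income = 20000 + 800 * age + rnorm(1000, sd = 8000)
)

under30  <- customerData[customerData$age <= 29, ]
pop.mean <- mean(customerData$income)
nsims    <- 1000     # assumed number of simulated surveys per size

# For each sample size: average error (bias) and spread (noise) of the estimates
bias.by.size <- sapply(c(10, 20, 100), function(n) {
  means <- replicate(
    nsims,
    mean(sample(under30$income, size = n, replace = FALSE))
  )
  c(sample.size = n, bias = mean(means) - pop.mean, spread = sd(means))
})

round(t(bias.by.size), 2)
```

In this setup the spread column shrinks as the sample grows while the bias column stays roughly constant, which is the distinction the two conclusions draw.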