Bias vs. Noise

University of San Francisco

Matt Meister

Bias vs Noise

In statistics, the terms bias and noise refer to very specific things.

Bias:

Systematic error that consistently pushes measurements or estimates away from the true value.
It is predictable, and often stems from flaws in the measurement process or the sampling method.
Bias tends to push measurements or estimates in a particular direction, either overestimating or underestimating the true value.

Noise:

Randomness or fluctuations in data that can’t be attributed to a systematic cause.
It is unpredictable, and can result from various sources, including measurement error and chance.
Noise doesn’t consistently push data in one direction but adds random fluctuations around the true value.

Bias vs Noise

How we handle each also differs:

Bias:

- Handling it usually involves identifying and eliminating or reducing sources of systematic error.
- Techniques such as calibration, randomization, and careful study design can help mitigate bias.

Noise:

- It is inherently random and cannot be eliminated entirely.
- Techniques like averaging, statistical tests, and increasing sample sizes reduce the impact of noise.

Noise

“Why isn’t having a small sample size a source of bias?”

Read in the data from customerData.csv:

customerData <- read.csv("customerData.csv")

Let’s treat this full data set of 1000 observations as the population – the entire set of people we are interested in.

In the population, what is the mean of income?

mean(customerData$income)

[1] 65476.08

Noise

Imagine we did not know this number for the population

We could only estimate it by surveying a sample of people.

Take a random subset of 10 observations by running this code:

sample.size <- 10

# Draw sample.size distinct row numbers at random, then keep those rows
customerDataSmall <- customerData[
  sample(x = seq_len(nrow(customerData)),
         size = sample.size,
         replace = FALSE),
  ]

Noise

If you did not know the population average of income, how would you estimate it with your survey sample?

You would see what the mean in the sample is!

mean(customerDataSmall$income)

[1] 54949.2

That sample estimate is going to be noisy.

It’s going to vary from sample to sample around the population average (the “true mean”).

If everyone in the class had their own sample (which you do), what might your different estimates look like?
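
The slides that follow use `sample.results` and `nsims`, which are not defined in the code shown here. A minimal sketch of how they might be built is below. It assumes 1000 repeated samples, and it uses a synthetic `population` data frame as a hypothetical stand-in for customerData.csv so that the block runs on its own; the numbers it produces will not match the slide output.

```r
set.seed(42)  # for reproducibility

# Hypothetical stand-in for customerData (assumption: not the real file)
population <- data.frame(income = rnorm(1000, mean = 65000, sd = 15000))

nsims <- 1000      # number of repeated samples (assumed value)
sample.size <- 10

# Draw nsims random samples of size sample.size and store each sample's mean
sample.results <- data.frame(
  sim = seq_len(nsims),
  mean = replicate(
    nsims,
    mean(sample(population$income, size = sample.size, replace = FALSE))
  )
)

# Spread of the sample means around the population mean
summary(sample.results$mean)
mean(population$income)
```

Calling `hist(sample.results$mean)` afterwards would show how widely the individual estimates scatter.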

Noise

Are these estimates biased?

Are they more likely to be above or below the true mean?

Proportion above:

sum( # Count the TRUE values of the logical comparison
  sample.results$mean > mean(customerData$income))/ # Is the sample mean above the population mean?
  nsims

[1] 0.51

Proportion below:

sum( # Count the TRUE values of the logical comparison
  sample.results$mean < mean(customerData$income))/ # Is the sample mean below the population mean?
  nsims

[1] 0.49

[1] 0.49

Noise

With a small random sample, our estimates are no more likely to fall on one side of the true mean than the other, as long as our data don’t have extreme outliers.
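
To see why outliers matter here, the sketch below repeats the experiment on a synthetic population with a few extreme incomes. Everything in it is an assumption made for illustration: the stand-in `income` vector, the five made-up values of 5,000,000, and the choice of 1000 repeated samples. The average of the sample means still sits close to the population mean, but most individual samples miss the outliers and land below it.

```r
set.seed(42)  # for reproducibility

# Hypothetical stand-in population (assumption: not the real customerData)
income <- rnorm(1000, mean = 65000, sd = 15000)
income[1:5] <- 5000000  # five made-up extreme outliers

true.mean <- mean(income)
nsims <- 1000  # assumed number of repeated samples

# Means of nsims random samples of 10 people each
sample.means <- replicate(
  nsims,
  mean(sample(income, size = 10, replace = FALSE))
)

# Proportion of sample means above and below the population mean
prop.above <- mean(sample.means > true.mean)
prop.below <- mean(sample.means < true.mean)
c(above = prop.above, below = prop.below)
```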

What is the benefit of larger samples, then?

Precision!
How far off were our estimates, on average?

mean(abs(sample.results$mean - mean(customerData$income))) |>
  round(2)

[1] 5357.99

Noise

Let’s try with samples of 20:

sample.size <- 20

customerDataSmall <- customerData[
  sample(x = seq_len(nrow(customerData)),
         size = sample.size,
         replace = FALSE),
  ]

mean(customerDataSmall$income)

[1] 70132.8

Noise

Let’s try with samples of 20:

How far off were our estimates, on average?

mean(abs(sample.results[sample.results$sample.size==20,'mean'] - mean(customerData$income))) |>
  round(2)

[1] 3638.14

Noise

Let’s try with samples of 100:

sample.size <- 100

customerDataSmall <- customerData[
  sample(x = seq_len(nrow(customerData)),
         size = sample.size,
         replace = FALSE),
  ]

mean(customerDataSmall$income)

[1] 66418.28

Noise

Let’s try with samples of 100:

How far off were our estimates, on average?

mean(abs(sample.results[sample.results$sample.size==100,'mean'] - mean(customerData$income))) |>
  round(2)

[1] 1570.13
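
The comparisons above filter `sample.results` by a `sample.size` column. One way such a table could be built, and the average error compared across sizes in a single step, is sketched below. The synthetic `population`, the value of `nsims`, and the helper name `true.mean` are assumptions for illustration, so the output will differ from the figures on the slides.

```r
set.seed(42)  # for reproducibility

# Hypothetical stand-in for customerData (assumption: not the real file)
population <- data.frame(income = rnorm(1000, mean = 65000, sd = 15000))
true.mean <- mean(population$income)
nsims <- 1000  # assumed number of repeated samples per sample size

# One row per simulated sample, for each sample size
sample.results <- expand.grid(sim = seq_len(nsims), sample.size = c(10, 20, 100))
sample.results$mean <- sapply(
  sample.results$sample.size,
  function(n) mean(sample(population$income, size = n, replace = FALSE))
)

# Mean absolute error of the estimates, by sample size
mae <- tapply(
  abs(sample.results$mean - true.mean),
  sample.results$sample.size,
  mean
)
round(mae, 2)
```

The average error should shrink as the sample size grows, which is the precision gain the slides describe.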

Noise

Conclusion:

Noise is random error
It causes our estimates to bounce around the true/population mean
Makes our estimates imprecise
But it doesn’t push them in one direction or another

Bias on the other hand…

Bias

Is bad…er

Bias

Systematic error that consistently pushes measurements or estimates away from the true value.
It is predictable, and often stems from flaws in the measurement process or the sampling method.
Bias tends to push measurements or estimates in a particular direction, either overestimating or underestimating the true value.
It is not reduced by increasing the sample size.

Bias

Let’s imagine that instead of estimating income with a random sample of 10/20/100 people, we sent out a survey, and got 10 responses.

But young people were more likely to respond. To keep things simple, the code below models the extreme case where only people under 30 respond.

sample.size <- 10

under30 <- customerData[customerData$age <= 29,]

customerDataSmall <- under30[
  sample(x = seq_len(nrow(under30)),
         size = sample.size,
         replace = FALSE),
  ]

Bias

What is our estimate of income from these samples?
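
The slides do not show how `sample.results` is rebuilt for the biased survey. A minimal sketch, under stated assumptions, is below: the synthetic `population` (where income rises with age) is a hypothetical stand-in for customerData, and `nsims` is an assumed value. Only the `under30` filter and the sampling call mirror the code above.

```r
set.seed(42)  # for reproducibility

# Hypothetical stand-in population (assumption): income rises with age
population <- data.frame(age = sample(18:70, size = 1000, replace = TRUE))
population$income <- 40000 + 600 * population$age +
  rnorm(1000, mean = 0, sd = 15000)

# Only people under 30 respond to the survey
under30 <- population[population$age <= 29, ]

nsims <- 1000      # assumed number of repeated samples
sample.size <- 10

# Repeatedly sample from the respondents only and store each sample's mean
sample.results <- data.frame(
  sim = seq_len(nsims),
  mean = replicate(
    nsims,
    mean(sample(under30$income, size = sample.size, replace = FALSE))
  )
)

# Compare the biased estimates with the population mean
summary(sample.results$mean)
mean(population$income)
```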

Bias

Are these estimates biased?

Are they more likely to be above or below the true mean?

Proportion above:

sum( # Count the TRUE values of the logical comparison
  sample.results$mean > mean(customerData$income))/ # Is the sample mean above the population mean?
  nrow(sample.results)

[1] 0.071

Proportion below:

sum( # Count the TRUE values of the logical comparison
  sample.results$mean < mean(customerData$income))/ # Is the sample mean below the population mean?
  nrow(sample.results)

[1] 0.929

[1] 0.929

Bias

Let’s try with samples of 20:

sample.size <- 20

customerDataSmall <- under30[
  sample(x = seq_len(nrow(under30)),
         size = sample.size,
         replace = FALSE),
  ]

mean(customerDataSmall$income)

[1] 57041.2

Bias

Are these estimates biased?

Are they more likely to be above or below the true mean?

Proportion above:

sum( # Count the TRUE values of the logical comparison
  sample.results$mean > mean(customerData$income))/ # Is the sample mean above the population mean?
  nrow(sample.results)

[1] 0.042

Proportion below:

sum( # Count the TRUE values of the logical comparison
  sample.results$mean < mean(customerData$income))/ # Is the sample mean below the population mean?
  nrow(sample.results)

[1] 0.958

[1] 0.958

Bias

Let’s try with samples of 100:

sample.size <- 100

customerDataSmall <- under30[
  sample(x = seq_len(nrow(under30)),
         size = sample.size,
         replace = FALSE),
  ]

mean(customerDataSmall$income)

[1] 56206.99

Bias

Are these estimates biased?

Are they more likely to be above or below the true mean?

Proportion above:

sum( # Count the TRUE values of the logical comparison
  sample.results$mean > mean(customerData$income))/ # Is the sample mean above the population mean?
  nrow(sample.results)

[1] 0.028

Proportion below:

sum( # Count the TRUE values of the logical comparison
  sample.results$mean < mean(customerData$income))/ # Is the sample mean below the population mean?
  nrow(sample.results)

[1] 0.972

[1] 0.972

Bias

Conclusion:

Bias is not random error
It causes our estimates to be systematically higher or lower than the true/population mean
Makes our estimates predictably wrong
It does not get better with sample size
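
The two conclusions can be put side by side in one simulation. The sketch below is illustrative only: the synthetic `population`, the value of `nsims`, and the helper `summarise_errors()` are all hypothetical and not part of the original slides. For each sample size it reports the average signed error (bias) and the standard deviation of the estimates (noise), once for random sampling and once for sampling only people under 30.

```r
set.seed(42)  # for reproducibility

# Hypothetical stand-in population (assumption): income rises with age
population <- data.frame(age = sample(18:70, size = 1000, replace = TRUE))
population$income <- 40000 + 600 * population$age +
  rnorm(1000, mean = 0, sd = 15000)

under30 <- population[population$age <= 29, ]
true.mean <- mean(population$income)
nsims <- 1000  # assumed number of repeated samples

# Hypothetical helper: bias and noise of the sample mean for samples of size n
summarise_errors <- function(data, n) {
  means <- replicate(nsims, mean(sample(data$income, size = n, replace = FALSE)))
  c(bias = mean(means) - true.mean, noise = sd(means))
}

sizes <- c(10, 20, 100)
random.sampling <- sapply(sizes, function(n) summarise_errors(population, n))
young.sampling <- sapply(sizes, function(n) summarise_errors(under30, n))
colnames(random.sampling) <- sizes
colnames(young.sampling) <- sizes

round(random.sampling)  # bias near zero, noise shrinks as n grows
round(young.sampling)   # noise shrinks as n grows, bias stays large
```

In this setup the noise row falls in both tables as the sample size increases, while the bias row for the under-30 samples stays far from zero at every size.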