Bias vs. Noise

University of San Francisco

Matt Meister

Bias vs Noise

In statistics, the terms bias and noise refer to very specific things.

Bias:

  • Systematic error that consistently pushes measurements or estimates away from the true value.
  • Predictable. It often stems from flaws in the measurement process or the sampling method.
  • Bias pushes measurements or estimates in a particular direction, either overestimating or underestimating the true value.

Noise:

  • Random fluctuations in data that can’t be attributed to a systematic cause.
  • Unpredictable. It can result from various sources, including measurement error and chance.
  • Noise doesn’t consistently push data in one direction; it adds random fluctuations around the true value.

Bias vs Noise

How we handle each also differs:

  • Bias:
    • Usually involves identifying and eliminating or reducing sources of systematic error.
    • Techniques such as calibration, randomization, and careful study design can help mitigate bias.
  • Noise:
    • Inherently random and cannot be eliminated entirely.
    • Techniques like averaging, statistical tests, and increasing sample sizes reduce the impact of noise.
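
A compact way to state the difference, added here as a sketch rather than taken from the slides: write an estimate as the true value plus a fixed bias plus zero-mean noise. The symbols (theta for the true value, b for the bias, epsilon for the noise, sigma squared for the variance of a single observation, n for the sample size) are illustrative notation, and the last line assumes the estimate is a mean of n independent draws.

```latex
\hat{\theta} = \theta + b + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0

\mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = b^2 + \operatorname{Var}(\varepsilon)

\operatorname{Var}(\varepsilon) = \frac{\sigma^2}{n} \quad \text{(mean of } n \text{ independent draws)}
```

Averaging and larger samples shrink the variance term. They leave b untouched, which is why bias has to be handled through calibration and study design instead.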

Noise

“Why isn’t having a small sample size a source of bias?”

Read in the data from customerData.csv:

customerData <- read.csv("customerData.csv")

Let’s treat this full data set of 1000 observations as the population – the entire set of people we are interested in.

In the population, what is the mean of income?

mean(customerData$income)
[1] 65476.08

Noise

Imagine we did not know this number for the population

  • We could only estimate it by surveying a sample of people.
  • Take a random subset of 10 observations by running this code:
sample.size <- 10

# Pick sample.size distinct row numbers at random and keep only those rows
customerDataSmall <- customerData[
  sample(x = seq_len(nrow(customerData)),
         size = sample.size,
         replace = FALSE),
  ]

Noise

If you did not know the population average of income, how would you estimate it with your survey sample?

  • You would see what the mean in the sample is!
mean(customerDataSmall$income)
[1] 54949.2

That sample estimate is going to be noisy

  • It’s going to vary from sample to sample around the population average (the “true mean”).

  • If everyone in the class had their own sample (which you do), what might your different estimates look like?
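
One way to see what a whole class of estimates might look like is to repeat the sampling step many times. The following is a minimal sketch, not the code behind the original slides: the population is simulated so the block runs without customerData.csv, and `pop`, the value of `nsims`, and the income parameters are all assumptions made for illustration.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for customerData: 1000 simulated incomes
pop <- data.frame(income = rnorm(1000, mean = 65000, sd = 20000))

nsims <- 1000        # number of repeated samples (assumed value)
sample.size <- 10

# Draw nsims random samples and record the mean income of each one
sample.results <- data.frame(
  sample.size = sample.size,
  mean = replicate(nsims, mean(sample(pop$income, size = sample.size)))
)

# The sample means scatter around the population mean
summary(sample.results$mean)
mean(pop$income)
```

Each row of `sample.results` plays the role of one student's estimate. A histogram of `sample.results$mean` would show the spread around the population mean.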

Noise

Are these estimates biased?

  • Are they more likely to be above or below the true mean?

Proportion above:

sum( # Count the samples where the condition is TRUE
  sample.results$mean > mean(customerData$income))/ # Is the sample mean greater than the population mean?
  nsims
[1] 0.51

Proportion below:

sum( # Count the samples where the condition is TRUE
  sample.results$mean < mean(customerData$income))/ # Is the sample mean less than the population mean?
  nsims
[1] 0.49

Noise

With a small sample, our results are not more likely to fall on one side of the true mean than the other.

  • As long as our data aren’t heavily skewed by extreme outliers!

What is the benefit of larger samples, then?

  • Precision!
  • How far off were our estimates, on average?
mean(abs(sample.results$mean - mean(customerData$income))) |>
  round(2)
[1] 5357.99

Noise

Let’s try with samples of 20:

sample.size <- 20

customerDataSmall <- customerData[
  sample(x = seq_len(nrow(customerData)),
         size = sample.size,
         replace = FALSE),
  ]

mean(customerDataSmall$income)
[1] 70132.8

Noise

Let’s try with samples of 20:

  • How far off were our estimates, on average?
mean(abs(sample.results[sample.results$sample.size == 20, 'mean'] - mean(customerData$income))) |>
  round(2)
[1] 3638.14

Noise

Let’s try with samples of 100:

sample.size <- 100

customerDataSmall <- customerData[
  sample(x = seq_len(nrow(customerData)),
         size = sample.size,
         replace = FALSE),
  ]

mean(customerDataSmall$income)
[1] 66418.28

Noise

Let’s try with samples of 100:

  • How far off were our estimates, on average?
mean(abs(sample.results[sample.results$sample.size == 100, 'mean'] - mean(customerData$income))) |>
  round(2)
[1] 1570.13
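
The three sample sizes can be compared in a single loop. This is a sketch under stated assumptions: the population is simulated (a hypothetical stand-in for customerData), and `mean.abs.error`, `pop`, and `nsims` are names introduced here for illustration, not taken from the original code.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for customerData: 1000 simulated incomes
pop <- data.frame(income = rnorm(1000, mean = 65000, sd = 20000))
true.mean <- mean(pop$income)
nsims <- 1000  # assumed number of repeated samples per size

# Mean absolute error of the sample mean for one sample size
mean.abs.error <- function(n) {
  estimates <- replicate(nsims, mean(sample(pop$income, size = n)))
  mean(abs(estimates - true.mean))
}

sizes <- c(10, 20, 100)
errors <- sapply(sizes, mean.abs.error)
data.frame(sample.size = sizes, mean.abs.error = round(errors, 2))
```

For simple random sampling, the typical error of a sample mean shrinks roughly in proportion to one over the square root of the sample size, so a tenfold larger sample cuts the error to roughly a third. The exact numbers will differ from the slides because the data here are simulated.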

Noise

Conclusion:

  • Noise is random error
  • It causes our estimates to bounce around the true/population mean
  • Makes our estimates imprecise
  • But it doesn’t push them in one direction or another

Bias on the other hand…

Bias

Is bad…er

Bias

  • Systematic error that consistently pushes measurements or estimates away from the true value.
  • Predictable. It often stems from flaws in the measurement process or the sampling method.
  • Bias pushes measurements or estimates in a particular direction, either overestimating or underestimating the true value.
  • Bias is not reduced by increasing the sample size.

Bias

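Let’s imagine that instead of estimating income with a random sample of 10/20/100 people, we sent out a survey and got 10 responses.

  • But young people were more likely to respond
  • To keep things simple, the code below treats the extreme case: only people under 30 respond
sample.size <- 10

# Keep only customers aged 29 or younger
under30 <- customerData[customerData$age <= 29, ]

# Draw the survey respondents at random from the under-30 group only
customerDataSmall <- under30[
  sample(x = seq_len(nrow(under30)),
         size = sample.size,
         replace = FALSE),
  ]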

Bias

What is our estimate of income from these samples?
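
A sketch of how many such survey estimates could be generated follows. It is not the original simulation: the population is simulated with income rising with age (an assumption chosen so that younger respondents earn less), and `pop`, `nsims`, and the coefficients are hypothetical.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for customerData: income rises with age (assumed)
n.pop <- 1000
age <- sample(18:70, size = n.pop, replace = TRUE)
income <- 40000 + 600 * age + rnorm(n.pop, mean = 0, sd = 15000)
pop <- data.frame(age = age, income = income)

under30 <- pop[pop$age <= 29, ]
nsims <- 1000       # assumed number of repeated surveys
sample.size <- 10

# Each simulated survey reaches only respondents under 30
sample.results <- data.frame(
  sample.size = sample.size,
  mean = replicate(nsims, mean(sample(under30$income, size = sample.size)))
)

# Share of survey estimates that fall below the population mean
mean(sample.results$mean < mean(pop$income))
```

Because every survey draws from the same lower-income group, nearly all estimates land on the same side of the population mean.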

Bias

Are these estimates biased?

  • Are they more likely to be above or below the true mean?

Proportion above:

sum( # Count the samples where the condition is TRUE
  sample.results$mean > mean(customerData$income))/ # Is the sample mean greater than the population mean?
  nrow(sample.results)
[1] 0.071

Proportion below:

sum( # Count the samples where the condition is TRUE
  sample.results$mean < mean(customerData$income))/ # Is the sample mean less than the population mean?
  nrow(sample.results)
[1] 0.929

Bias

Let’s try with samples of 20:

sample.size <- 20

customerDataSmall <- under30[
  sample(x = seq_len(nrow(under30)),
         size = sample.size,
         replace = FALSE),
  ]

mean(customerDataSmall$income)
[1] 57041.2

Bias

Are these estimates biased?

  • Are they more likely to be above or below the true mean?

Proportion above:

sum( # Count the samples where the condition is TRUE
  sample.results$mean > mean(customerData$income))/ # Is the sample mean greater than the population mean?
  nrow(sample.results)
[1] 0.042

Proportion below:

sum( # Count the samples where the condition is TRUE
  sample.results$mean < mean(customerData$income))/ # Is the sample mean less than the population mean?
  nrow(sample.results)
[1] 0.958

Bias

Let’s try with samples of 100:

sample.size <- 100

customerDataSmall <- under30[
  sample(x = seq_len(nrow(under30)),
         size = sample.size,
         replace = FALSE),
  ]

mean(customerDataSmall$income)
[1] 56206.99

Bias

Are these estimates biased?

  • Are they more likely to be above or below the true mean?

Proportion above:

sum( # Count the samples where the condition is TRUE
  sample.results$mean > mean(customerData$income))/ # Is the sample mean greater than the population mean?
  nrow(sample.results)
[1] 0.028

Proportion below:

sum( # Count the samples where the condition is TRUE
  sample.results$mean < mean(customerData$income))/ # Is the sample mean less than the population mean?
  nrow(sample.results)
[1] 0.972

Bias

Conclusion:

  • Bias is not random error
  • It causes our estimates to be consistently higher or lower than the true/population mean
  • Makes our estimates predictably wrong
  • It does not get better with a larger sample size
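
The last point can be checked by comparing the average error of random samples with that of under-30-only samples at each sample size. This is a sketch under stated assumptions: the population is simulated with income rising with age, and `avg.estimate`, `pop`, `nsims`, and the coefficients are hypothetical names and values, not part of the original analysis.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in population: income rises with age (assumed)
n.pop <- 1000
age <- sample(18:70, size = n.pop, replace = TRUE)
pop <- data.frame(age = age,
                  income = 40000 + 600 * age + rnorm(n.pop, sd = 15000))
under30 <- pop[pop$age <= 29, ]
true.mean <- mean(pop$income)
nsims <- 1000  # assumed number of repeated samples per size

# Average estimate across nsims samples of size n drawn from `frame`
avg.estimate <- function(frame, n) {
  mean(replicate(nsims, mean(sample(frame$income, size = n))))
}

sizes <- c(10, 20, 100)
comparison <- data.frame(
  sample.size  = sizes,
  random.bias  = sapply(sizes, function(n) avg.estimate(pop, n)) - true.mean,
  under30.bias = sapply(sizes, function(n) avg.estimate(under30, n)) - true.mean
)
round(comparison)
```

In this sketch the `random.bias` column stays close to zero at every sample size, while `under30.bias` stays large and negative no matter how many respondents are added.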