
Statistical Significance vs. Statistical Power: What P-Values Really Tell You (And What They Don't)

In 2015, the Open Science Collaboration attempted to replicate 100 psychology studies published in top journals. Only 36% of the replications yielded significant results, even though 97% of the original studies reported statistically significant findings (p < 0.05). This wasn't a fluke—it exposed a fundamental misunderstanding of what p-values actually measure. A p-value tells you the probability of seeing your data if the null hypothesis is true, but it says nothing about whether your hypothesis is true, whether your effect is meaningful, or whether your study was powerful enough to detect real effects. Confusing statistical significance with practical importance has contributed to a replication crisis across multiple scientific fields.

Quick Reference: What Statistics Actually Measure

Concept | What It Measures | What It Doesn't Measure | Typical Threshold
P-value | Probability of data if null hypothesis is true | Probability that null hypothesis is true | p < 0.05
Statistical Power | Probability of detecting an effect if it exists | Whether the effect actually exists | 80% (0.80)
Effect Size | Magnitude of the difference/relationship | Statistical significance | Varies by metric
Confidence Interval | Range of plausible effect sizes | Precision of individual prediction | 95% typical
Alpha Level | False positive rate you'll tolerate | Whether any specific result is false | 0.05 typical

The P-Value Problem: What Researchers Get Wrong

What P-Values Actually Measure

A p-value of 0.03 means: "If there's truly no effect, you'd see data this extreme or more extreme 3% of the time by random chance."

What it doesn't mean (despite widespread misinterpretation):

  • ❌ "There's a 97% chance my hypothesis is true"
  • ❌ "There's only a 3% chance this is due to chance"
  • ❌ "This effect is important or large"
  • ❌ "This result will replicate"

Real example: A drug study reports p = 0.04 for a blood pressure reduction. This tells you the result is unlikely if the drug does nothing. It doesn't tell you:

  • Whether the reduction is clinically meaningful (2 mmHg vs. 20 mmHg)
  • Whether the study had enough participants to detect small effects
  • Whether side effects outweigh benefits
  • The probability the drug actually works
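
A minimal simulation sketch of what that p = 0.04 is estimating, using purely illustrative numbers (50 patients per arm, a blood pressure SD of 12 mmHg, and an observed 5 mmHg difference): it counts how often a world in which the drug does nothing produces a difference at least as extreme as the one observed.

```python
# Simulate many "null world" studies where the drug does nothing, and count
# how often they produce a group difference at least as extreme as observed.
# All numbers below are illustrative assumptions, not taken from a real trial.
import numpy as np

rng = np.random.default_rng(0)
n_per_arm, sd, observed_diff = 50, 12.0, 5.0

n_sims = 100_000
drug = rng.normal(120, sd, size=(n_sims, n_per_arm))      # drug arm, no true effect
placebo = rng.normal(120, sd, size=(n_sims, n_per_arm))   # placebo arm, same mean
null_diffs = drug.mean(axis=1) - placebo.mean(axis=1)

# Two-sided: fraction of null-world studies at least as extreme as the observed
# difference. This fraction is what the p-value estimates.
p_sim = np.mean(np.abs(null_diffs) >= observed_diff)
print(f"Simulated p-value: {p_sim:.3f}")   # roughly 0.04 with these assumptions
```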

The Replication Crisis Context

Multiple fields are experiencing replication failures:

  • Psychology: The Reproducibility Project (2015) found only a 36% replication rate
  • Economics: A 2016 study found a 61% replication rate for experimental economics papers
  • Cancer biology: A 2021 analysis found only 11% replication for preclinical cancer studies
  • Social sciences: The "Many Labs" project found significant variation in replication success

Why p-values contributed to the crisis:

  1. Publication bias: Journals favor p < 0.05, creating incentive to find significance
  2. P-hacking: Researchers try multiple analyses until finding p < 0.05
  3. Low power: Small studies can achieve p < 0.05 by chance, especially with publication bias
  4. Misinterpretation: Treating p < 0.05 as proof rather than weak evidence

Statistical Power: The Other Side of the Coin

What Power Actually Measures

Statistical power is the probability of detecting an effect if that effect actually exists.

Power = 0.80 (80%) means: "If there's a real effect of the expected size, my study has an 80% chance of detecting it (p < 0.05)."

What this implies:

  • Even with a real effect, 20% of well-designed studies will fail to find significance
  • Underpowered studies (power < 50%) are essentially coin flips
  • Low-power studies that do find significance often overestimate effect size
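
A minimal simulation sketch of that 80% figure, assuming a true effect of d = 0.5, 64 participants per group, and an ordinary two-sample t-test (all illustrative choices): roughly four out of five simulated studies reach p < 0.05, so about one in five still misses a real effect.

```python
# Simulate many studies of a real effect (Cohen's d = 0.5, 64 per group)
# and count how often the t-test reaches p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group, d, n_sims = 64, 0.5, 10_000

hits = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(d, 1.0, n_per_group)   # true effect of d standard deviations
    _, p = stats.ttest_ind(treated, control)
    if p < 0.05:
        hits += 1

print(f"Empirical power: {hits / n_sims:.2f}")   # ~0.80, so ~20% of studies miss
```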

The Power Crisis in Published Research

A 2017 meta-analysis of psychology studies found median statistical power of only 35% for detecting small effects. This means most studies were more likely to miss real effects than detect them.

Real consequences:

Medical research: A 2005 analysis found that many clinical trials were too small to detect realistic treatment effects. Underpowered studies waste resources and potentially harm patients enrolled in ineffective treatment arms.

Education research: A 2018 review found education intervention studies routinely have power below 0.50, meaning most true interventions go undetected while random noise occasionally achieves significance.

Business A/B testing: Companies running underpowered tests risk both missing valuable improvements and implementing changes that don't actually work (false positives).

Effect Size: What Actually Matters

Beyond Statistical Significance

You can achieve p < 0.05 with a trivial effect if your sample is large enough. Conversely, a large, important effect might not reach significance in a small study.

Example from real research:

Study A: 10,000 participants, finds a 0.5 mmHg blood pressure reduction, p = 0.001
Study B: 50 participants, finds an 8 mmHg reduction, p = 0.08

Study A is "statistically significant" but clinically meaningless. Study B might represent a real, important effect but doesn't reach conventional significance due to small sample size.
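
A short sketch of the same contrast from summary statistics; the standard deviations (8 mmHg for Study A, 16 mmHg for Study B) are assumed purely for illustration and roughly reproduce the reported p-values.

```python
# Two-sample t-tests from summary statistics, with Cohen's d alongside each
# p-value. The SDs are assumed for illustration, not taken from real studies.
from scipy import stats

def summarize(label, reduction_mmhg, sd_mmhg, n_per_arm):
    t, p = stats.ttest_ind_from_stats(
        mean1=140 - reduction_mmhg, std1=sd_mmhg, nobs1=n_per_arm,
        mean2=140, std2=sd_mmhg, nobs2=n_per_arm)
    d = reduction_mmhg / sd_mmhg   # Cohen's d with a common SD
    print(f"{label}: p = {p:.3f}, Cohen's d = {d:.2f}")

summarize("Study A (5,000 per arm, 0.5 mmHg)", 0.5, 8.0, 5_000)
summarize("Study B (25 per arm, 8 mmHg)", 8.0, 16.0, 25)
# Study A comes out "significant" with a trivial d; Study B shows a medium d
# but stays above p = 0.05.
```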

Common Effect Size Metrics

Metric | Used For | Interpretation
Cohen's d | Mean differences | 0.2 = small, 0.5 = medium, 0.8 = large
Pearson's r | Correlations | 0.1 = small, 0.3 = medium, 0.5 = large
Odds ratio | Binary outcomes | 1.5 = small, 3 = medium, 6 = large
R² | Variance explained | 0.02 = small, 0.13 = medium, 0.26 = large

Critical point: Effect size and statistical significance are independent. You need both to draw meaningful conclusions.
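
As a rough sketch, the helpers below compute Cohen's d from two samples and apply the standard approximate conversion r = d / sqrt(d² + 4), which assumes equal group sizes, to express the same effect as variance explained.

```python
# Cohen's d from raw samples, plus the approximate d -> r conversion
# (equal group sizes assumed) to express an effect as variance explained.
import numpy as np

def cohens_d(x, y):
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

def d_to_r(d):
    return d / np.sqrt(d**2 + 4)

d = 0.5                                    # a "medium" effect by Cohen's benchmarks
r = d_to_r(d)
print(f"d = {d:.2f} -> r = {r:.2f}, variance explained = {r**2:.1%}")
# d = 0.50 -> r = 0.24, variance explained of about 5.9%
```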

Using Statistical Calculators to Avoid Common Mistakes

When designing studies or interpreting published research, statistical power and sample size calculators help you:

Before your study:

  • Calculate required sample size for adequate power (typically 80%)
  • Determine detectable effect sizes given your resources
  • Assess whether your design can meaningfully test your hypothesis

After seeing results:

  • Calculate achieved power to understand replication probability
  • Determine confidence intervals around effect estimates
  • Assess whether "non-significant" results might reflect low power rather than no effect

Example: You want to detect a medium effect (d = 0.5) with 80% power at α = 0.05. A power calculator shows you need ~64 participants per group. With only 20 per group, your power drops to about a third, worse than a coin flip.
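
One way to run that calculation in Python is with statsmodels' power tools; the sketch below assumes an independent-samples t-test and reproduces the numbers above.

```python
# Power calculations for an independent-samples t-test with statsmodels.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Required sample size per group for d = 0.5, alpha = 0.05, power = 0.80
n_needed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Participants needed per group: {n_needed:.0f}")   # ~64

# Achieved power with only 20 participants per group
power_20 = analysis.power(effect_size=0.5, nobs1=20, alpha=0.05)
print(f"Power with 20 per group: {power_20:.2f}")          # ~0.34, about a third
```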

This matters because underpowered studies aren't just uninformative—they're misleading. When they do find significance, they typically overestimate effect sizes (winner's curse).

Common Misconceptions About Statistical Testing

Misconception 1: "P < 0.05 means my hypothesis is probably true"

Reality: P-values assume the null hypothesis is true. They can't tell you the probability your hypothesis is correct. That requires Bayesian analysis considering prior probability.

What you can say: "If there's no real effect, results this extreme would occur less than 5% of the time."
What you can't say: "There's a 95% chance my hypothesis is correct."

Misconception 2: "Non-significant results mean no effect"

Reality: Absence of evidence isn't evidence of absence, especially with low power.

Real example: Many early COVID-19 studies found "no significant difference" in outcomes between treatments. Later meta-analyses with higher combined power found significant effects. The early studies were underpowered, not necessarily wrong about effect direction.

Misconception 3: "Bigger samples always give more accurate results"

Reality: Large samples reduce random error but don't fix systematic bias. A biased sample of 1 million is worse than a representative sample of 1,000.

Example: Online surveys often have huge samples but severe selection bias, producing "significant" results that don't generalize.

Misconception 4: "P = 0.049 is meaningful, p = 0.051 is not"

Reality: The 0.05 threshold is arbitrary. These p-values provide nearly identical evidence.

Better approach: Report exact p-values and confidence intervals rather than just "significant" or "not significant." A result of p = 0.051 with a confidence interval showing a likely meaningful effect is often more important than p = 0.001 with a trivial effect size.
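
A minimal sketch of this style of reporting on toy data, using a pooled-variance t-test: it prints the exact p-value together with a 95% confidence interval for the mean difference instead of a bare "significant"/"not significant" verdict.

```python
# Report the exact p-value plus a 95% CI for the mean difference (toy data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(100, 15, 40)
treated = rng.normal(106, 15, 40)

t, p = stats.ttest_ind(treated, control)   # pooled-variance t-test (scipy default)

n1, n2 = len(treated), len(control)
diff = treated.mean() - control.mean()
pooled_var = ((n1 - 1) * treated.var(ddof=1) +
              (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, n1 + n2 - 2)

print(f"p = {p:.3f}, difference = {diff:.1f}, "
      f"95% CI [{diff - t_crit * se:.1f}, {diff + t_crit * se:.1f}]")
```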

Misconception 5: "Statistically significant = practically important"

Reality: With large samples, tiny, meaningless differences become statistically significant.

Real example: An educational app study with 50,000 users found a statistically significant improvement in test scores (p < 0.001). The actual improvement: 0.8 points on a 100-point test. Statistically robust, practically worthless.

What Researchers Should Do Instead

The New Statistical Mindset

  1. Pre-register studies: Specify hypotheses, methods, and analyses before data collection
  2. Calculate required power: Ensure adequate sample size for meaningful effects
  3. Report effect sizes and confidence intervals: Not just p-values
  4. Consider practical significance: Would the effect matter in the real world?
  5. Embrace replication: Single studies rarely provide definitive answers

Moving Beyond P-Values

Several fields are adopting alternative approaches:

  • Estimation over testing: Focus on effect size confidence intervals rather than binary significance
  • Bayesian methods: Incorporate prior knowledge and provide probability statements about hypotheses
  • Equivalence testing: Actively test whether effects are small enough to be meaningless
  • Multi-lab collaborations: Pool resources for adequately powered studies

Example of better practice: Instead of reporting "p = 0.03," report "Cohen's d = 0.35, 95% CI [0.05, 0.65], indicating a small to medium effect that explains roughly 3% of the variance in outcomes."
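
A sketch of how such a report might be produced from raw data, using toy samples and a simple percentile bootstrap for the confidence interval (the numbers it prints are illustrative and will not match the example above exactly).

```python
# Cohen's d with a percentile-bootstrap 95% CI and approximate variance
# explained, reported together instead of a bare p-value (toy data).
import numpy as np

rng = np.random.default_rng(4)
control = rng.normal(0.0, 1.0, 80)
treated = rng.normal(0.35, 1.0, 80)   # assumed true effect of 0.35 SD

def cohens_d(x, y):
    pooled_sd = np.sqrt((np.var(x, ddof=1) + np.var(y, ddof=1)) / 2)
    return (np.mean(x) - np.mean(y)) / pooled_sd

boot = [cohens_d(rng.choice(treated, len(treated), replace=True),
                 rng.choice(control, len(control), replace=True))
        for _ in range(5_000)]

d = cohens_d(treated, control)
lo, hi = np.percentile(boot, [2.5, 97.5])
var_explained = (d / np.sqrt(d**2 + 4)) ** 2   # approximate conversion via r
print(f"Cohen's d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}], "
      f"~{var_explained:.0%} of variance explained")
```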

Key Takeaways

Statistical significance and statistical power measure different things—you need both to draw valid conclusions. The replication crisis exposed what happens when researchers confuse p < 0.05 with proof, run underpowered studies that can't detect real effects, and prioritize statistical significance over practical importance.

Understanding what p-values actually measure—and what they don't—is essential for:

  • Designing studies that can meaningfully test hypotheses
  • Interpreting published research critically
  • Avoiding overconfidence in "significant" findings
  • Recognizing when "non-significant" results might reflect low power

The next time you see "statistically significant" in a headline, ask:

  • What was the effect size?
  • Was the study adequately powered?
  • What's the confidence interval?
  • Is the effect practically meaningful?

P < 0.05 isn't a finish line—it's a single, often misunderstood piece of evidence. Good research requires understanding the full statistical picture, including what your study can and can't detect.