
Statistical significance explained

What it means, how to interpret p-values, and how to use it in survey results.

Key Takeaways

  1. Statistical significance is about chance: A result is "statistically significant" when your data would be unlikely if the null hypothesis were true (often judged by p < 0.05).
  2. A p-value is not "the probability your result is wrong": It is the probability of seeing results at least this extreme assuming the null hypothesis is true.
  3. Sample size drives significance: Large samples can make tiny differences look significant; small samples can miss meaningful effects. Plan sample size before fielding.
  4. In surveys, significance is only one checkpoint: It does not fix response bias, poor sampling, or unclear measurement.
  5. Report more than p-values: Include the size of the difference (in points/units), uncertainty (confidence interval if available), and what the difference means for decisions.

What statistical significance means (plain English)

"Statistical significance" is a decision rule used in hypothesis testing. It answers a narrow question: if there were really no difference (or no effect), how surprising would your results be?

If that "surprise" is large enough (technically: if the p-value is small enough), analysts say the result is statistically significant. Government reference glossaries describe this in the same spirit: statistical significance indicates that an observed difference is unlikely to be due to chance alone under a specified model and threshold (see CDC definition and NIH NCATS glossary).

If you want the deeper background and terminology, start with statistical significance (including how it shows up in research reporting).

What significance is (and is not)

Is: evidence against a specific "no difference" claim (the null hypothesis), given assumptions.

Is not: proof your hypothesis is true, proof a result is important, or proof the finding will replicate.

Hypothesis testing in 60 seconds

Most significance testing follows the same structure:

  • Null hypothesis (H0): Usually "no difference" (e.g., Group A satisfaction rate equals Group B satisfaction rate).
  • Alternative hypothesis (H1): A difference exists (two-sided) or a specific direction exists (one-sided).
  • Test statistic: A number summarizing how far your data is from what H0 predicts (z, t, chi-square, etc.).
  • p-value: How likely you would see results at least as extreme as yours if H0 were true.
  • Alpha (alpha level): Your chosen cutoff for calling something "significant" (often 0.05).

The common rule is: if p < alpha, call the result statistically significant; otherwise, do not.

How to interpret p-values (correctly)

A p-value is widely misunderstood. Use this interpretation:

[Figure: Normal curve with both tails beyond z ≈ 2.01 shaded, showing a two-sided p-value of about 0.044.]
More extreme results under H0 make the p-value smaller.

p = 0.04 means: if the null hypothesis were true, results like yours (or more extreme) would happen about 4% of the time due to random sampling variability.

It does not mean there is a 4% chance the null hypothesis is true, and it does not mean your result will replicate with 96% probability. Those are different questions.

Two-sided vs one-sided p-values

Two-sided tests ask: "Is there any difference?" (either direction). One-sided tests ask: "Is A higher than B?"

In survey reporting, two-sided is usually safer unless you preregistered a directional hypothesis and would truly ignore a difference in the opposite direction.
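
For a z-based test, the one-sided p-value is exactly half the two-sided one. A minimal standard-library sketch (the z-statistic of 2.01 is just an illustrative value):

```python
from math import erf, sqrt

def norm_sf(x):
    """Upper-tail probability of the standard normal distribution."""
    return 1 - 0.5 * (1 + erf(x / sqrt(2)))

z = 2.01  # illustrative z-statistic

p_one_sided = norm_sf(z)             # "Is A higher than B?"
p_two_sided = 2 * norm_sf(abs(z))    # "Is there any difference?"
print(round(p_one_sided, 3), round(p_two_sided, 3))  # 0.022 0.044
```

Because the one-sided p-value is smaller, switching to a one-sided test after seeing the data inflates false positives, which is why two-sided is the safer default.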

Alpha, errors, and why sample size changes everything

Alpha is the false-positive tolerance you set before looking at results. If you set alpha = 0.05 and H0 is true, then (in the long run) about 5% of tests will look significant anyway.

[Figure: Two stacked bars showing that alpha = 0.05 yields about 5 false positives per 100 tests, and alpha = 0.01 about 1 per 100, when the null is true.]
Lower alpha reduces false positives but raises the bar for significance.
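
You can watch the long-run 5% rate emerge in a quick simulation (a hypothetical setup where H0 is true by construction: both "groups" are drawn from the same population):

```python
import random
from math import erf, sqrt

random.seed(42)  # fixed seed so the run is reproducible

def two_sided_p(z):
    """Two-sided p-value from a standard normal z-statistic."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

alpha, n, sims = 0.05, 500, 2000
hits = 0
for _ in range(sims):
    # Both groups sampled from the SAME 50% population, so H0 is true
    a = sum(random.random() < 0.5 for _ in range(n)) / n
    b = sum(random.random() < 0.5 for _ in range(n)) / n
    pooled = (a + b) / 2
    se = sqrt(pooled * (1 - pooled) * (2 / n))
    if se > 0 and two_sided_p((a - b) / se) < alpha:
        hits += 1

rate = hits / sims
print(rate)  # close to alpha = 0.05
```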

This connects to two classic error types:

  • Type I error: Calling a difference "real" when it is just noise (false positive). Alpha controls this rate.
  • Type II error: Missing a real difference (false negative). This is tied to power (1 - Type II error rate).

Sample size affects both. With more responses, the random noise shrinks, and even small differences can produce small p-values. With fewer responses, noise is larger and meaningful differences can look "not significant." Government guidance for large-scale assessments makes this explicit: statistical significance depends on both the observed difference and sample size (see NCES on statistical significance and sample size).
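
To see this numerically, here is a sketch that applies a pooled two-proportion z-test to an identical 7-point gap (52% vs 45%, hypothetical rates) at three different sample sizes:

```python
from math import erf, sqrt

def two_prop_p(p_a, p_b, n_a, n_b):
    """Two-sided p-value for a two-proportion z-test with pooled SE."""
    pooled = (p_a * n_a + p_b * n_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Identical 7-point gap, three sample sizes per group:
p_values = {n: two_prop_p(0.52, 0.45, n, n) for n in (100, 400, 1600)}
for n, p in p_values.items():
    print(n, round(p, 4))  # same gap, shrinking p-value as n grows
```

With n = 100 per group the gap is "not significant"; with n = 1600 it is significant far beyond the 0.05 threshold. The gap itself never changed.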

If you are planning a survey or comparing subgroups, treat sample size as a design choice, not an afterthought. See choosing a sample size for practical planning.

Survey walkthrough: is a 7-point difference significant?

Scenario: You ran the same customer survey in two segments and want to compare the share who answered "Satisfied" (top-box). You got:

[Figure: Bars comparing satisfied rates: Group A 52% (n=400) vs Group B 45% (n=420), a 7-point gap with p ≈ 0.044.]
A 7-point satisfaction gap can be significant with n ≈ 400 per group.
Example survey comparison (two proportions)
Group | Sample size (n) | Satisfied count | Satisfied rate
A     | 400             | 208             | 52%
B     | 420             | 189             | 45%

Observed difference = 52% - 45% = 7 percentage points.

One common test here is a two-proportion z-test (for a quick check) or an equivalent chi-square test (same conclusion in this two-group case). Here is the logic without heavy math.

  1. Step 1: State the hypotheses

    H0: the true satisfaction rates are equal in A and B. H1: they are different.

  2. Step 2: Compute the best single estimate under H0

    Under "no difference," the best estimate is the pooled rate: (208 + 189) / (400 + 420) = 397 / 820 = 0.484.

  3. Step 3: Convert the 7-point gap into a standardized distance

    Using the pooled rate, the estimated standard error of the difference is about 0.0349. The z-statistic is (0.52 - 0.45) / 0.0349 ≈ 2.01.

  4. Step 4: Turn z into a p-value

    A z of about 2.01 corresponds to a two-sided p-value around 0.044.

  5. Step 5: Compare to your alpha

    If alpha = 0.05, then 0.044 < 0.05, so you would call the difference statistically significant. If you required alpha = 0.01, you would not.
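
The five steps above can be reproduced with a few lines of standard-library Python (a sketch of the calculation, not production code):

```python
from math import erf, sqrt

# Step 1 data: Group A 208/400 satisfied, Group B 189/420
x_a, n_a = 208, 400
x_b, n_b = 189, 420
p_a, p_b = x_a / n_a, x_b / n_b           # 0.52 and 0.45

# Step 2: pooled rate, the best single estimate under H0
pooled = (x_a + x_b) / (n_a + n_b)        # 397 / 820 ≈ 0.484

# Step 3: standardize the 7-point gap
se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se                      # ≈ 2.0

# Step 4: two-sided p-value from the standard normal
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Step 5: compare to alpha
print(round(z, 2), round(p_value, 3), p_value < 0.05)
```

Small rounding choices (pooled vs unpooled standard error) move the p-value between about 0.044 and 0.045; either way it clears alpha = 0.05 and misses alpha = 0.01.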

How to say this in a survey report:

  • Good: "Group A reported 52% satisfied vs 45% in Group B (7-point difference; p = 0.044)."
  • Avoid: "The segments are different" (too absolute) or "We proved A is better" (too strong).

Quick practical check

Before you test anything, verify you are comparing like with like: same question wording, same scale labels, same field period constraints, and comparable respondent eligibility rules. Otherwise you are testing a mix of real differences and measurement differences.

Statistical vs practical significance: what matters after p < 0.05

Statistical significance is about detectability, not importance. Practical significance is about whether the effect is large enough to matter for decisions.

In survey work, practical significance often looks like one of these:

  • Absolute difference: "+7 points satisfied" (easy to communicate).
  • Relative difference: "A is about 16% higher than B" (0.52/0.45 - 1 ≈ 0.156).
  • Downstream impact: "If this holds, it represents ~X more satisfied customers per 10,000."

A simple way to operationalize this is to set a minimum meaningful difference before you analyze (for example, "We only act if the gap is at least 3 points"). Then you use significance testing to check whether the data supports that kind of difference, not just any nonzero gap.
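
Writing the rule down as code makes it auditable (the 3-point threshold here is just the example from above, set before analysis):

```python
p_a, p_b = 0.52, 0.45

abs_diff = p_a - p_b          # 0.07 -> "7 points"
rel_diff = p_a / p_b - 1      # ≈ 0.156 -> "about 16% higher"

MIN_MEANINGFUL = 0.03         # decided BEFORE analysis: act only on gaps >= 3 points
actionable = abs_diff >= MIN_MEANINGFUL
print(actionable)  # True
```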

  • Report the estimate: show the two percentages (or means), not just "significant."
  • Report uncertainty: include p-value and, if you can, a confidence interval around the difference.
  • Connect to decisions: explain what a 1-point, 3-point, or 7-point shift means operationally.
  • Document choices: alpha, one- vs two-sided, which groups were compared, and whether comparisons were planned in advance. (See research methods for reporting habits that travel well.)

Common pitfalls when using significance in surveys

Significance testing is easy to misuse in survey dashboards because you can slice results many ways. These are the issues that most often break interpretation.

1) Multiple comparisons (the "too many cuts" problem)

If you compare 20 segments at alpha = 0.05, you should expect about 1 "significant" result just by chance even when nothing is going on. That does not mean you should never segment; it means you should plan comparisons, limit them, and treat exploratory findings as leads to confirm.
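
The arithmetic behind "about 1 by chance," plus the simplest correction (Bonferroni), as a short sketch:

```python
alpha, n_tests = 0.05, 20

# Expected number of false positives if ALL 20 nulls are true
expected_fp = alpha * n_tests                  # 1.0

# Chance of at least one false positive across independent tests
p_any_fp = 1 - (1 - alpha) ** n_tests          # ≈ 0.64

# Bonferroni correction: test each comparison at alpha / n_tests
alpha_bonferroni = alpha / n_tests             # 0.0025
print(expected_fp, round(p_any_fp, 2), alpha_bonferroni)
```

A 64% chance of at least one spurious "significant" segment is why unplanned slicing needs either a corrected threshold or a confirmation step.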

2) p-hacking (trying analyses until something is significant)

Common examples in survey work include: testing many different question variants, trying different recodes of a scale, excluding different subsets, or switching between one- and two-sided tests after seeing the data. A cleaner approach is to define rules up front and keep a short audit trail.

3) Biased data: significance does not fix it

A tiny p-value can coexist with a biased sample. If the wrong people respond, or response differs systematically by group, you may get "precise" but misleading estimates. Review response bias and how it interacts with your analysis.

Also distinguish "statistically significant" from "generalizable." Generalizability depends on how respondents were selected and who was reachable, which is a sampling methods issue.

4) Confounding: significance does not imply causality

A significant difference between groups does not tell you why the difference exists. If Group A differs from Group B in other ways (tenure, region, product mix), the observed gap may be due to those factors.

This is where correlational research cautions apply: statistical significance shows an association under a model, not a cause. If you need to adjust for other variables, use regression rather than relying only on simple group comparisons.

5) Measurement noise: poor questions create weak signals

Unclear wording, double-barreled items, or inconsistent scales increase noise and make it harder to detect real differences (or they create artifacts that look like differences). Improving question design is often a higher-leverage fix than changing alpha. See write better survey questions.

6) Dirty data and duplicate respondents

Basic checks matter: remove obvious speeders, verify eligibility, handle duplicates, and confirm coding. Otherwise, your p-values will be answering a different question than you think. Start with working with survey data before you test differences.

A practical workflow for using significance on real survey projects

If you want significance testing to help decisions (not derail them), use a simple workflow your team can repeat.

  1. Define the decision first

    What action changes if the metric is higher/lower? Write down the minimum difference that would trigger action.

  2. Plan comparisons and sample size

    Decide which segments and which metrics you will compare. Ensure each subgroup has enough n for the differences you care about (see sample size).

  3. Field with sampling in mind

    Use appropriate random sampling or recruitment controls. Track response rates by segment so you can spot nonresponse patterns early.

  4. Clean and validate data

    Run consistent exclusion rules and coding. Keep a short log of what was removed and why (see data quality checks).

  5. Run the right test for the metric

    Percentages: two-proportion test / chi-square. Means (e.g., 1-5 ratings): t-test for two groups, ANOVA for 3+ groups. If you need adjustment, use regression.

  6. Report: estimate, uncertainty, meaning

    Write one sentence that includes the two values, the difference, and the p-value (or confidence interval), plus one sentence on practical meaning. Use research best practices to keep reporting consistent.

Which significance test is typically used for common survey comparisons?
Survey output        | Example                  | Common comparison         | Typical test
Proportion (percent) | % "Satisfied"            | Two groups                | Two-proportion z-test or chi-square
Mean rating          | Average 1-5 satisfaction | Two groups                | Two-sample t-test
Mean rating          | Average 1-5 satisfaction | 3+ groups                 | ANOVA (then follow-up tests)
Paired results       | Same people pre/post     | Before vs after           | Paired t-test or McNemar test
Association          | Role vs attrition intent | Two categorical variables | Chi-square test of independence
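
For the two-group proportion case in the first row, the chi-square test and the z-test agree exactly: the chi-square statistic for a 2x2 table is the square of the z-statistic. A standard-library check using the walkthrough's counts:

```python
# 2x2 table from the walkthrough: rows = groups A/B, cols = satisfied / not
table = [[208, 400 - 208], [189, 420 - 189]]

row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]
grand = sum(row_totals)

chi2 = 0.0
for i in range(2):
    for j in range(2):
        # Expected count under independence (H0: rates equal across groups)
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (table[i][j] - expected) ** 2 / expected

print(round(chi2, 2))  # ≈ 4.02, the z-statistic (≈ 2.0) squared
```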

If you are applying these ideas to an upcoming study, you can create a survey and plan your comparisons up front. If you want a second set of eyes on design or analysis choices, use survey help.

Frequently Asked Questions

Does p < 0.05 mean there is a 95% chance the result is true?

No. A p-value is computed assuming the null hypothesis is true. p < 0.05 means your observed data (or more extreme data) would be uncommon under the null model, not that your conclusion has a 95% probability of being true.

If a result is not significant, does that mean there is no difference?

Not necessarily. "Not significant" usually means you do not have enough evidence (given your sample size and variability) to rule out chance as a plausible explanation. The true difference could be small, or your sample could be too small to detect the difference you care about.

Should I always use alpha = 0.05?

No. 0.05 is a convention, not a law. Use a stricter threshold (like 0.01) when false positives are costly or when you are running many comparisons. Use context: what decision is being made, and what is the cost of a wrong call?

Why did a tiny difference become significant after I got more responses?

Because bigger samples reduce random error. With enough n, even small effects can produce small p-values. That is why you should pair p-values with practical significance (effect size) and a minimum meaningful difference.

Can I use significance tests on Likert-scale survey questions?

Often, yes. Teams commonly compare mean scores with a t-test (two groups) or ANOVA (3+ groups), especially with 5- or 7-point scales and moderate sample sizes. If you want to be more conservative, you can also analyze distributions (for example, top-box percentages) and compare proportions.