Key Takeaways
- Statistical significance is about chance: A result is "statistically significant" when your data would be unlikely if the null hypothesis were true (often judged by p < 0.05).
- A p-value is not "the probability your result is wrong": It is the probability of seeing results at least this extreme assuming the null hypothesis is true.
- Sample size drives significance: Large samples can make tiny differences look significant; small samples can miss meaningful effects. Plan sample size before fielding.
- In surveys, significance is only one checkpoint: It does not fix response bias, poor sampling, or unclear measurement.
- Report more than p-values: Include the size of the difference (in points/units), uncertainty (confidence interval if available), and what the difference means for decisions.
What statistical significance means (plain English)
"Statistical significance" is a decision rule used in hypothesis testing. It answers a narrow question: if there were really no difference (or no effect), how surprising would your results be?
If that "surprise" is large enough (technically: if the p-value is small enough), analysts say the result is statistically significant. Government reference glossaries describe this in the same spirit: statistical significance indicates that an observed difference is unlikely to be due to chance alone under a specified model and threshold (see CDC definition and NIH NCATS glossary).
If you want the deeper background and terminology, start with statistical significance (including how it shows up in research reporting).
Statistical significance:
- Is: evidence against a specific "no difference" claim (the null hypothesis), given assumptions.
- Is not: proof your hypothesis is true, proof a result is important, or proof the finding will replicate.
Hypothesis testing in 60 seconds
Most significance testing follows the same structure:
- Null hypothesis (H0): Usually "no difference" (e.g., Group A satisfaction rate equals Group B satisfaction rate).
- Alternative hypothesis (H1): A difference exists (two-sided) or a specific direction exists (one-sided).
- Test statistic: A number summarizing how far your data is from what H0 predicts (z, t, chi-square, etc.).
- p-value: How likely you would see results at least as extreme as yours if H0 were true.
- Alpha (significance level): Your chosen cutoff for calling something "significant" (often 0.05).
The common rule is: if p < alpha, call the result statistically significant; otherwise, do not.
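The decision rule itself is one line of code. A minimal sketch (the alpha value is the analyst's choice, not a universal constant):

```python
ALPHA = 0.05  # chosen before looking at the data

def is_significant(p_value: float, alpha: float = ALPHA) -> bool:
    """Apply the common decision rule: significant iff p < alpha."""
    return p_value < alpha

print(is_significant(0.044))             # True at alpha = 0.05
print(is_significant(0.044, alpha=0.01)) # False at a stricter alpha
```

Note the rule is strict: p exactly equal to alpha does not clear the bar.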
How to interpret p-values (correctly)
A p-value is widely misunderstood. Use this interpretation:
p = 0.04 means: if the null hypothesis were true, results like yours (or more extreme) would happen about 4% of the time due to random sampling variability.
It does not mean there is a 4% chance the null hypothesis is true, and it does not mean your result will replicate with 96% probability. Those are different questions.
Two-sided tests ask: "Is there any difference?" (either direction). One-sided tests ask: "Is A higher than B?"
In survey reporting, two-sided is usually safer unless you preregistered a directional hypothesis and would truly ignore a difference in the opposite direction.
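The one- vs two-sided distinction is just a matter of counting one tail or both. A small sketch using Python's standard library (`statistics.NormalDist` is the standard normal; the z value 2.01 is borrowed from the walkthrough below):

```python
from statistics import NormalDist

def p_values(z):
    """Return (one_sided, two_sided) p-values for a z statistic.

    one_sided assumes the alternative points in the direction of the
    observed z; two_sided counts both tails.
    """
    tail = 1 - NormalDist().cdf(abs(z))
    return tail, 2 * tail

one_sided, two_sided = p_values(2.01)
print(round(one_sided, 3), round(two_sided, 3))  # 0.022 0.044
```

The two-sided p is exactly double the one-sided p, which is why switching sides after seeing the data is a form of p-hacking.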
Alpha, errors, and why sample size changes everything
Alpha is the false-positive tolerance you set before looking at results. If you set alpha = 0.05 and H0 is true, then (in the long run) about 5% of tests will look significant anyway.
This connects to two classic error types:
- Type I error: Calling a difference "real" when it is just noise (false positive). Alpha controls this rate.
- Type II error: Missing a real difference (false negative). This is tied to power (1 - Type II error rate).
Sample size affects both. With more responses, the random noise shrinks, and even small differences can produce small p-values. With fewer responses, noise is larger and meaningful differences can look "not significant." Government guidance for large-scale assessments makes this explicit: statistical significance depends on both the observed difference and sample size (see NCES on statistical significance and sample size).
If you are planning a survey or comparing subgroups, treat sample size as a design choice, not an afterthought. See choosing a sample size for practical planning.
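The sample-size effect is easy to demonstrate. In this sketch (standard library only; the 52% and 45% rates come from the walkthrough below, the sample sizes are made up), the same 7-point gap is not significant at n = 100 per group but is clearly significant at n = 1,000 per group:

```python
from math import sqrt
from statistics import NormalDist

def two_prop_p(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same 7-point gap (52% vs 45%), different sample sizes:
print(two_prop_p(52, 100, 45, 100))      # ≈ 0.32: not significant
print(two_prop_p(520, 1000, 450, 1000))  # < 0.01: clearly significant
```

Nothing about the underlying difference changed between the two calls; only the amount of random noise did.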
Survey walkthrough: is a 7-point difference significant?
Scenario: You ran the same customer survey in two segments and want to compare the share who answered "Satisfied" (top-box). You got:
| Group | Sample size (n) | Satisfied count | Satisfied rate |
|---|---|---|---|
| A | 400 | 208 | 52% |
| B | 420 | 189 | 45% |
Observed difference = 52% - 45% = 7 percentage points.
One common test here is a two-proportion z-test (for a quick check) or an equivalent chi-square test (same conclusion in this two-group case). Here is the logic without heavy math.
Step 1: State the hypotheses
H0: the true satisfaction rates are equal in A and B. H1: they are different.
Step 2: Compute the best single estimate under H0
Under "no difference," the best estimate is the pooled rate: (208 + 189) / (400 + 420) = 397 / 820 = 0.484.
Step 3: Convert the 7-point gap into a standardized distance
Using the pooled rate, the estimated standard error of the difference is about 0.0349. The z-statistic is (0.52 - 0.45) / 0.0349 ≈ 2.01.
Step 4: Turn z into a p-value
A z of about 2.01 corresponds to a two-sided p-value around 0.044.
Step 5: Compare to your alpha
If alpha = 0.05, then 0.044 < 0.05, so you would call the difference statistically significant. If you required alpha = 0.01, you would not.
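The five numbered steps can be reproduced in a few lines of Python (standard library only). Exact arithmetic gives z ≈ 2.00 and p ≈ 0.045; the hand calculation above rounds z up to 2.01 first, which is why it lands at 0.044:

```python
from math import sqrt
from statistics import NormalDist

n_a, x_a = 400, 208  # Group A: sample size, "Satisfied" count
n_b, x_b = 420, 189  # Group B

p_a, p_b = x_a / n_a, x_b / n_b                         # 0.52 and 0.45
pooled = (x_a + x_b) / (n_a + n_b)                      # Step 2: 397/820 ≈ 0.484
se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # Step 3: ≈ 0.035
z = (p_a - p_b) / se
p = 2 * (1 - NormalDist().cdf(abs(z)))                  # Step 4: two-sided p

print(f"z = {z:.2f}, p = {p:.3f}")  # z = 2.00, p = 0.045
```

Either way the conclusion at alpha = 0.05 is the same; small discrepancies like this are normal when intermediate values are rounded by hand.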
How to say this in a survey report:
- Good: "Group A reported 52% satisfied vs 45% in Group B (7-point difference; p = 0.044)."
- Avoid: "The segments are different" (too absolute) or "We proved A is better" (too strong).
Before you test anything, verify you are comparing like with like: same question wording, same scale labels, same field period constraints, and comparable respondent eligibility rules. Otherwise you are testing a mix of real differences and measurement differences.
Statistical vs practical significance: what matters after p < 0.05
Statistical significance is about detectability, not importance. Practical significance is about whether the effect is large enough to matter for decisions.
In survey work, practical significance often looks like one of these:
- Absolute difference: "+7 points satisfied" (easy to communicate).
- Relative difference: "A is 16% higher than B" (0.52/0.45 - 1).
- Downstream impact: "If this holds, it represents ~X more satisfied customers per 10,000."
A simple way to operationalize this is to set a minimum meaningful difference before you analyze (for example, "We only act if the gap is at least 3 points"). Then use significance testing to check whether the data supports that kind of difference, not just any nonzero gap. When you write up results, cover four things:
- Report the estimate: show the two percentages (or means), not just "significant."
- Report uncertainty: include p-value and, if you can, a confidence interval around the difference.
- Connect to decisions: explain what a 1-point, 3-point, or 7-point shift means operationally.
- Document choices: alpha, one- vs two-sided, which groups were compared, and whether comparisons were planned in advance. (See research methods for reporting habits that travel well.)
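A confidence interval makes the minimum-meaningful-difference check concrete. This sketch reuses the walkthrough's rates and sample sizes with a hypothetical 3-point action threshold (the unpooled standard error is the usual choice for a CI around a difference in proportions):

```python
from math import sqrt
from statistics import NormalDist

MIN_MEANINGFUL = 0.03  # hypothetical: decided before analysis

p_a, n_a = 0.52, 400
p_b, n_b = 0.45, 420

diff = p_a - p_b
# Unpooled standard error, appropriate for a CI around the difference
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z_crit = NormalDist().inv_cdf(0.975)  # ≈ 1.96 for a 95% CI
lo, hi = diff - z_crit * se, diff + z_crit * se

print(f"95% CI for the gap: [{lo:.3f}, {hi:.3f}]")          # ≈ [0.002, 0.138]
print("CI excludes zero:", lo > 0)                          # True
print("CI confirms a >= 3-point gap:", lo >= MIN_MEANINGFUL)  # False
```

This is the nuance a bare p-value hides: the gap is statistically distinguishable from zero, but the data cannot yet rule out a true gap smaller than the 3-point action threshold.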
Common pitfalls when using significance in surveys
Significance testing is easy to misuse in survey dashboards because you can slice results many ways. These are the issues that most often break interpretation.
1) Multiple comparisons (the "too many cuts" problem)
If you compare 20 segments at alpha = 0.05, you should expect about 1 "significant" result just by chance even when nothing is going on. That does not mean you should never segment; it means you should plan comparisons, limit them, and treat exploratory findings as leads to confirm.
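The arithmetic behind "about 1 by chance" is straightforward, and a Bonferroni-style correction is one common (if conservative) fix:

```python
# Probability of at least one false positive across k independent tests
# when every null hypothesis is true:
alpha = 0.05
for k in (1, 5, 20):
    family_wise = 1 - (1 - alpha) ** k
    print(f"{k:>2} tests: P(at least 1 false positive) = {family_wise:.2f}")
# Prints roughly 0.05, 0.23, 0.64.

# A Bonferroni-style correction keeps the family-wise rate near alpha
# by testing each comparison against alpha / k:
k = 20
print(f"Bonferroni per-test threshold: {alpha / k:.4f}")  # 0.0025
```

With 20 uncorrected tests, the chance of at least one spurious "significant" result is closer to a coin flip than to 5%.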
2) p-hacking (trying analyses until something is significant)
Common examples in survey work include: testing many different question variants, trying different recodes of a scale, excluding different subsets, or switching between one- and two-sided tests after seeing the data. A cleaner approach is to define rules up front and keep a short audit trail.
3) Biased data: significance does not fix it
A tiny p-value can coexist with a biased sample. If the wrong people respond, or response differs systematically by group, you may get "precise" but misleading estimates. Review response bias and how it interacts with your analysis.
Also distinguish "statistically significant" from "generalizable." Generalizability depends on how respondents were selected and who was reachable, which is a sampling methods issue.
4) Confounding: significance does not imply causality
A significant difference between groups does not tell you why the difference exists. If Group A differs from Group B in other ways (tenure, region, product mix), the observed gap may be due to those factors.
This is where correlational research cautions apply: statistical significance shows an association under a model, not a cause. If you need to adjust for other variables, use regression rather than relying only on simple group comparisons.
5) Measurement noise: poor questions create weak signals
Unclear wording, double-barreled items, or inconsistent scales increase noise and make it harder to detect real differences (or they create artifacts that look like differences). Improving question design is often a higher-leverage fix than changing alpha. See write better survey questions.
6) Dirty data and duplicate respondents
Basic checks matter: remove obvious speeders, verify eligibility, handle duplicates, and confirm coding. Otherwise, your p-values will be answering a different question than you think. Start with working with survey data before you test differences.
A practical workflow for using significance on real survey projects
If you want significance testing to help decisions (not derail them), use a simple workflow your team can repeat.
Define the decision first
What action changes if the metric is higher/lower? Write down the minimum difference that would trigger action.
Plan comparisons and sample size
Decide which segments and which metrics you will compare. Ensure each subgroup has enough n for the differences you care about (see sample size).
Field with sampling in mind
Use appropriate random sampling or recruitment controls. Track response rates by segment so you can spot nonresponse patterns early.
Clean and validate data
Run consistent exclusion rules and coding. Keep a short log of what was removed and why (see data quality checks).
Run the right test for the metric
- Percentages: two-proportion z-test or chi-square.
- Means (e.g., 1-5 ratings): t-test for two groups, ANOVA for 3+ groups.
- Adjustment for other variables: regression.
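For percentages, the chi-square statistic on a 2x2 table is the square of the pooled z-statistic, so the two tests always agree in the two-group case. A sketch using the walkthrough's counts (standard 2x2 shortcut formula, no continuity correction):

```python
from math import sqrt, erfc

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value for a 2x2 table
    [[a, b], [c, d]] (df = 1, no continuity correction)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = erfc(sqrt(stat / 2))  # survival function of chi-square with 1 df
    return stat, p

# Walkthrough counts: satisfied / not-satisfied in Group A, then Group B
stat, p = chi2_2x2(208, 192, 189, 231)
print(round(stat, 2), round(p, 3))  # 4.02 0.045 -- same conclusion as the z-test
```

Note that 4.02 ≈ 2.00², confirming the equivalence mentioned in the walkthrough.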
Report: estimate, uncertainty, meaning
Write one sentence that includes the two values, the difference, and the p-value (or confidence interval), plus one sentence on practical meaning. Use research best practices to keep reporting consistent.
| Survey output | Example | Common comparison | Typical test |
|---|---|---|---|
| Proportion (percent) | % "Satisfied" | Two groups | Two-proportion z-test or chi-square |
| Mean rating | Average 1-5 satisfaction | Two groups | Two-sample t-test |
| Mean rating | Average 1-5 satisfaction | 3+ groups | ANOVA (then follow-up tests) |
| Paired results | Same people pre/post | Before vs after | Paired t-test or McNemar test |
| Association | Role vs attrition intent | Two categorical variables | Chi-square test of independence |
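For the paired case in the last row, the McNemar test needs only the two "switcher" counts. A minimal sketch with hypothetical pre/post numbers (the 30 and 50 are invented for illustration):

```python
from math import sqrt, erfc

def mcnemar_p(b, c):
    """McNemar test p-value (no continuity correction) from the two
    discordant counts: b = yes->no switchers, c = no->yes switchers."""
    stat = (b - c) ** 2 / (b + c)
    return erfc(sqrt(stat / 2))  # chi-square survival function, df = 1

# Hypothetical pre/post survey: 30 people switched to "No", 50 to "Yes"
print(round(mcnemar_p(30, 50), 3))  # 0.025: the shift is unlikely to be chance
```

Respondents who gave the same answer both times carry no information about the direction of change, which is why they drop out of the formula.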
If you are applying these ideas to an upcoming study, you can create a survey and plan your comparisons up front. If you want a second set of eyes on design or analysis choices, use survey help.
References
- Centers for Disease Control and Prevention, National Center for Health Statistics. (n.d.). Statistical significance. In Health, United States: Sources and Definitions. Retrieved March 2, 2026.
- National Institutes of Health, National Center for Advancing Translational Sciences. (n.d.). Statistical significance. NCATS Toolkit (Glossary). Retrieved March 2, 2026.
- U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics. (n.d.). Statistical significance and sample size. National Assessment of Educational Progress (NAEP). Retrieved March 2, 2026.
- Cox, D. R. (2020). Statistical significance. Annual Review of Statistics and Its Application, 7, 1-10.
Frequently Asked Questions
Does p < 0.05 mean there is a 95% chance the result is true?
No. A p-value is computed assuming the null hypothesis is true. p < 0.05 means your observed data (or more extreme data) would be uncommon under the null model, not that your conclusion has a 95% probability of being true.
If a result is not significant, does that mean there is no difference?
Not necessarily. "Not significant" usually means you do not have enough evidence (given your sample size and variability) to rule out chance as a plausible explanation. The true difference could be small, or your sample could be too small to detect the difference you care about.
Should I always use alpha = 0.05?
No. 0.05 is a convention, not a law. Use a stricter threshold (like 0.01) when false positives are costly or when you are running many comparisons. Use context: what decision is being made, and what is the cost of a wrong call?
Why did a tiny difference become significant after I got more responses?
Because bigger samples reduce random error. With enough n, even small effects can produce small p-values. That is why you should pair p-values with practical significance (effect size) and a minimum meaningful difference.
Can I use significance tests on Likert-scale survey questions?
Often, yes. Teams commonly compare mean scores with a t-test (two groups) or ANOVA (3+ groups), especially with 5- or 7-point scales and moderate sample sizes. If you want to be more conservative, you can also analyze distributions (for example, top-box percentages) and compare proportions.