Key Takeaways
- Regression answers "what predicts what": It estimates how an outcome changes when a predictor changes, holding other predictors constant.
- Match the model to your outcome: Use linear regression for numeric outcomes, logistic for binary outcomes, and ordinal models for ordered categories.
- Data prep is most of the work: Clean your data, code categories, handle missing values, and document every decision before modeling.
- Interpret effects in units people understand: Convert coefficients into real-world units ("each additional $1,000 in ad spend is associated with $4,400 more revenue") rather than only reporting p-values.
- Validate, do not just fit: Check assumptions, look for multicollinearity, and sanity-check predictions on a holdout sample to reduce overfitting.
What regression analysis is (and when to use it)
Regression analysis models the relationship between an outcome you care about (the dependent variable, Y) and one or more predictors (independent variables, Xs). The output tells you how Y is expected to change when a predictor changes, while holding other predictors constant.
This guide walks through the concepts, shows a worked example, gives you a calculator to try your own data, and covers interpretation and common mistakes.
Regression is a good fit when you want to:
- Estimate effects: How much does revenue change per additional dollar of advertising spend?
- Adjust for confounding: Separate the relationship between a predictor and an outcome from other variables that might muddy the picture.
- Predict: Forecast an outcome (revenue, test score, customer spend) from multiple input variables.
Regression is not a magic predictor. It is a structured way to compare how much each input variable matters, on a common footing, and to quantify the uncertainty around those estimates.
Example 1: predicting revenue from advertising spend
This example walks through a simple linear regression on a small dataset, start to finish. The numbers are realistic enough to show every piece of the output.
The question
A retail chain wants to know: "For every additional $1,000 we spend on digital ads per store per month, how much extra revenue should we expect?"
The data
| Store | Ad spend ($000s) | Monthly revenue ($000s) |
|---|---|---|
| A | 2.0 | 52 |
| B | 3.5 | 50 |
| C | 5.0 | 65 |
| D | 4.0 | 55 |
| E | 6.5 | 70 |
| F | 1.5 | 48 |
| G | 7.0 | 72 |
| H | 3.0 | 58 |
| I | 5.5 | 61 |
| J | 8.0 | 78 |
Step 1: Check correlation first
Compute the Pearson correlation between ad spend and revenue before fitting a model. Here, r = 0.95 (a strong positive correlation), indicating a clear linear pattern worth modeling.
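As a check on the number above, here is a minimal pure-Python sketch (no libraries assumed) that computes Pearson r for the ten stores:

```python
import math

# Ad spend ($000s) and monthly revenue ($000s) for stores A-J
x = [2.0, 3.5, 5.0, 4.0, 6.5, 1.5, 7.0, 3.0, 5.5, 8.0]
y = [52, 50, 65, 55, 70, 48, 72, 58, 61, 78]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(x, y)
print(round(r, 2))  # 0.95, matching the value reported above
```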
Step 2: Fit the regression
A simple linear regression of revenue (Y) on ad spend (X) produces:
| Output element | Value | What it means |
|---|---|---|
| Intercept | 40.6 | Expected revenue ($40,600) when ad spend is zero. Treat as a mathematical baseline, not a real scenario. |
| Slope (Ad spend) | 4.4 | Each additional $1,000 in ad spend is associated with roughly $4,400 in additional monthly revenue. |
| R-squared | 0.89 | Ad spend explains about 89% of the store-to-store variation in revenue. |
| p-value (slope) | < 0.001 | Strong statistical evidence that the slope is not zero. |
Step 3: Interpret carefully
- Practical meaning: Revenue = 40.6 + 4.4 x Ad Spend. A store spending $5,000/month on ads should bring in roughly $62,600 in revenue.
- Causation caveat: These 10 stores differ in location, foot traffic, and staffing. The coefficient captures an association, not proof that raising ad budgets will deliver exactly $4,400 per $1,000.
- Extrapolation warning: The equation only holds within the observed spend range ($1,500 to $8,000). Plugging in $20,000 would be guesswork.
- Small sample: Ten data points illustrate the method, but they're too few for confident decisions. More stores would tighten the confidence interval around the slope.
The dataset is deliberately small so you can replicate it in any statistics tool or spreadsheet. Try entering these numbers into the calculator below to see the same results.
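If you would rather replicate the fit in code than in a spreadsheet, a minimal least-squares sketch in plain Python reproduces the output table:

```python
# Same ten stores as the data table: ad spend ($000s), revenue ($000s)
x = [2.0, 3.5, 5.0, 4.0, 6.5, 1.5, 7.0, 3.0, 5.5, 8.0]
y = [52, 50, 65, 55, 70, 48, 72, 58, 61, 78]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

slope = sxy / sxx                   # expected change in Y per 1-unit change in X
intercept = my - slope * mx         # expected Y when X = 0
r_squared = sxy ** 2 / (sxx * syy)  # share of variance in Y explained

print(round(intercept, 1), round(slope, 1), round(r_squared, 2))
# 40.6 4.4 0.89 -- matching the output table above
```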
Correlation and regression calculator
Paste your own data or try one of the example datasets. The calculator computes Pearson r, Spearman rho, and fits a simple linear regression with a scatter plot. All computation runs in your browser; no data leaves your device.
Correlation vs regression: measuring the relationship before modeling it
Check whether a relationship exists before you model it. Correlation quantifies how strongly two variables move together in a straight line. If the correlation is weak, regression won't find much to work with.
Two measures dominate survey and business research:
| Measure | What it captures | Best for | Assumes |
|---|---|---|---|
| Pearson r | Strength of linear relationship | Continuous or near-continuous data (e.g., 0-10 ratings, revenue figures) | Both variables roughly normally distributed; relationship is linear |
| Spearman rho | Strength of monotonic (rank-order) relationship | Ordinal data, skewed distributions, or when outliers are a concern | No distributional assumptions; works on ranks |
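To make the Pearson/Spearman contrast concrete, here is a small sketch computing Spearman rho for the store data via the classic rank-difference formula (valid here because there are no tied values):

```python
def ranks(values):
    # Rank from 1 (smallest) to n (largest); assumes no ties
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

x = [2.0, 3.5, 5.0, 4.0, 6.5, 1.5, 7.0, 3.0, 5.5, 8.0]  # ad spend
y = [52, 50, 65, 55, 70, 48, 72, 58, 61, 78]             # revenue

rx, ry = ranks(x), ranks(y)
n = len(x)
d_sq = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho = 1 - 6 * d_sq / (n * (n ** 2 - 1))
print(round(rho, 2))  # 0.93 -- close to Pearson r because the pattern is nearly linear
```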
How correlation leads to regression
Correlation tells you whether two variables track together and how tightly. Regression answers a different question: how much does the outcome shift per unit change in the predictor, after accounting for everything else in the model? Think of correlation as the screening step. Regression is the modeling step.
| Dimension | Correlation | Regression |
|---|---|---|
| Question answered | How strongly are X and Y associated? | How much does Y change per unit of X, holding other variables constant? |
| Direction | Symmetric: r(X,Y) = r(Y,X) | Asymmetric: Y is the outcome, X is the predictor |
| Number of variables | Typically two at a time | One outcome and one or more predictors |
| Output | Correlation coefficient (r or rho) | Equation with intercept and slope coefficients, R-squared, p-values |
| Use case | Exploring associations, screening variables, simple reporting | Driver analysis, prediction, adjusting for confounders |
Types of regression analysis
The type of outcome you are modeling determines which regression family to use.
| Outcome you are modeling (Y) | Typical example | Often-used model | What the coefficient means (plain English) |
|---|---|---|---|
| Numeric (roughly continuous) | 0-10 satisfaction rating; average of multiple items | Linear regression (OLS) | Change in points on the rating scale per 1-unit change in X |
| Binary (0/1) | Churn intention: Yes/No; Would recommend: Yes/No | Logistic regression | Change in log-odds (often reported as odds ratio) per 1-unit change in X |
| Ordered categories | Strongly disagree ... strongly agree (single item) | Ordinal logistic (proportional odds) or treat as numeric with care | Shift toward higher categories, conditional on assumptions |
| Counts | Number of support contacts last month | Poisson / negative binomial | Multiplicative change in expected count |
This guide focuses on linear regression (OLS) because it is the most common starting point. If your outcome is binary or ordinal and you force a linear model, the results may still be directionally helpful, but interpretation and fit can suffer.
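As a quick illustration of the logistic row in the table, converting a log-odds coefficient to an odds ratio is a one-line exponentiation (the coefficient value here is made up for illustration):

```python
import math

b = 0.7  # hypothetical logistic coefficient: change in log-odds per 1-unit change in X
odds_ratio = math.exp(b)
print(round(odds_ratio, 2))  # 2.01: each 1-unit increase in X roughly doubles the odds
```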
How to run a regression analysis (step by step)
Follow a repeatable workflow so your results are auditable and reproducible.
Step 1: Define the question and variables
Write down the specific question regression will answer. Choose one primary outcome (Y) and a short list of predictors (Xs). Add control variables if they matter (e.g., region, time period). Keep the model focused.
Step 2: Prepare and clean data
Handle missing values, code categorical variables as dummies, check for outliers, and document every transformation. Most regression errors come from data prep, not the math. See data preparation basics for a general checklist.
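One prep step worth showing: dummy-coding a categorical predictor. A minimal sketch, using a hypothetical `region` variable with "North" as the reference category:

```python
# Hypothetical categorical predictor for five respondents
region = ["North", "South", "West", "South", "North"]

levels = ["South", "West"]  # "North" is the reference category (coded as all zeros)
dummies = {lvl: [1 if r == lvl else 0 for r in region] for lvl in levels}

print(dummies["South"])  # [0, 1, 0, 1, 0]
print(dummies["West"])   # [0, 0, 1, 0, 0]
```

Each dummy's coefficient is then interpreted relative to the reference group (here, the expected difference from "North").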
Step 3: Fit a baseline model
Start simple. Include your key predictors and obvious controls. Avoid throwing in every variable "to see what sticks." Check that your sample size is large enough relative to the number of predictors.
Step 4: Diagnose fit and assumptions
Look at residuals, leverage, multicollinearity (VIF), and whether the model is capturing real patterns or artifacts. Fix issues before adding complexity.
Step 5: Interpret and validate
Translate coefficients into meaningful units. Add confidence intervals. If prediction is a goal, validate on a holdout sample or with cross-validation. State limitations clearly, especially around causality.
How to read regression output
Most software outputs the same core pieces. The details differ, but interpretation rules do not. For a deeper walk-through of standard outputs (including SPSS), UCLA's Statistical Consulting Group materials are a useful reference: Regression with SPSS (UCLA).
| Output element | What it is | How to explain it | Common mistake |
|---|---|---|---|
| Intercept | Expected Y when all Xs are zero (and reference groups) | Often not meaningful unless X=0 is meaningful; treat as a baseline | Interpreting it as an "average" without context |
| Coefficient (B) | Expected change in Y per 1-unit change in X, holding others constant | "Each additional $1,000 in ad spend is associated with $4,400 more in monthly revenue" | Calling it causal when your data are observational |
| Standard error | Uncertainty in the coefficient estimate | Use it to form confidence intervals and gauge precision | Ignoring it and focusing only on sign |
| p-value | Evidence against the null (often B=0) under model assumptions | Use with effect size and context; see p-values and significance | "p<0.05 means important" (importance is practical, not statistical) |
| R-squared | Share of variance in Y explained by the model | "The model explains about X% of the variation" (not "predicts X%") | Thinking higher R-squared always means a better driver story |
| Adjusted R-squared | R-squared adjusted for number of predictors | Prefer for comparing models with different predictor counts | Using it as a stamp of causality |
| F-test / model p-value | Tests whether the model improves on an intercept-only model | Good as a general check, but not the main story | Using it to justify including weak predictors |
For careful reading of coefficients and uncertainty, see the clinician-oriented overview in Bzovsky et al. (2022) (the concepts apply beyond clinical settings).
Standardized vs unstandardized coefficients
Unstandardized coefficients are easiest to explain because they are in the original units (points on a 0-10 scale). Standardized coefficients are useful for comparing predictors measured on different scales, but they can confuse stakeholders. If you use them, provide both and explain the difference.
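The link between the two is simple: the standardized coefficient is b multiplied by sd(X)/sd(Y), which in a one-predictor model reduces to Pearson r. A sketch using the ad-spend example:

```python
import math

x = [2.0, 3.5, 5.0, 4.0, 6.5, 1.5, 7.0, 3.0, 5.5, 8.0]
y = [52, 50, 65, 55, 70, 48, 72, 58, 61, 78]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

b_unstd = sxy / sxx                       # 4.4: revenue ($000s) per $1,000 of spend
sd_x, sd_y = math.sqrt(sxx / (n - 1)), math.sqrt(syy / (n - 1))
beta_std = b_unstd * sd_x / sd_y          # effect in standard-deviation units

print(round(beta_std, 2))  # 0.95 -- equals Pearson r in a one-predictor model
```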
Check assumptions and diagnose problems
Regression is a model, not a fact. You do not need perfection, but you should check whether the model is badly misspecified. Introductory regression texts cover these diagnostics in detail (for example, Schroeder et al. (2016) and Gordon (2015)).
- Linearity: Plot Y vs each key X. If the relationship bends, consider transformations or splines (or a different model).
- Homoscedasticity: Residual spread should not explode at certain prediction levels. If it does, consider robust standard errors.
- Independence: Watch clustered data (multiple responses per person, team, store). Cluster-robust approaches may be needed.
- Multicollinearity: If predictors overlap heavily, coefficients can become unstable. Check VIF and consider composites.
- Influential points: A few extreme respondents can drive results. Check leverage and Cook's distance, and investigate flagged points rather than deleting them automatically.
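With exactly two predictors, the VIF check above reduces to 1/(1 - r²), where r is the correlation between the predictors. A sketch with made-up, deliberately collinear data:

```python
import math

# Hypothetical predictors: x2 is nearly a multiple of x1 (made-up numbers)
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.5, 3.5, 6.5, 7.5, 10.0]

n = len(x1)
m1, m2 = sum(x1) / n, sum(x2) / n
sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
sxx = sum((a - m1) ** 2 for a in x1)
syy = sum((b - m2) ** 2 for b in x2)
r = sxy / math.sqrt(sxx * syy)

vif = 1 / (1 - r ** 2)
print(round(r, 2), round(vif, 1))  # r = 0.99; VIF far above the common cutoff of 10
```

With more predictors, compute each VIF from the R² of regressing that predictor on all the others.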
Common misinterpretations and how to avoid them
Many "bad regression" problems are really communication problems. These are the ones that most often break decision-making.
1) Correlation vs causation
A regression coefficient is an association in your data under your model. Unless you have a research design that supports causal inference, avoid causal language. If your stakeholders need the explanation, use our guide on correlation vs causation.
2) "Significant" does not mean "important"
With enough responses, tiny effects can be statistically significant. With small samples, practically large effects can look uncertain. Use statistical significance as one input, alongside effect size and business relevance.
3) Overfitting and p-hacking
If you test many predictors, some will appear "significant" by chance. Keep a pre-specified core model, validate on a holdout set, and be cautious about stepwise selection.
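A minimal holdout sketch using the ten-store example: fit on seven stores, then score R² on the three held-out stores (a real project would use more data, a randomized split, and possibly cross-validation):

```python
x = [2.0, 3.5, 5.0, 4.0, 6.5, 1.5, 7.0, 3.0, 5.5, 8.0]
y = [52, 50, 65, 55, 70, 48, 72, 58, 61, 78]

train_x, test_x = x[:7], x[7:]
train_y, test_y = y[:7], y[7:]

# Fit least squares on the training stores only
n = len(train_x)
mx, my = sum(train_x) / n, sum(train_y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(train_x, train_y))
sxx = sum((a - mx) ** 2 for a in train_x)
slope = sxy / sxx
intercept = my - slope * mx

# Score on the held-out stores
pred = [intercept + slope * a for a in test_x]
mty = sum(test_y) / len(test_y)
ss_res = sum((b - p) ** 2 for b, p in zip(test_y, pred))
ss_tot = sum((b - mty) ** 2 for b in test_y)
r2_holdout = 1 - ss_res / ss_tot
print(round(r2_holdout, 2))  # a large gap between training and holdout R-squared suggests overfitting
```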
References
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
- Schroeder, L. D., Sjoquist, D. L., & Stephan, P. E. (2016). Understanding Regression Analysis: An Introductory Guide (2nd ed.). SAGE Publications.
- Gordon, R. A. (2015). Regression Analysis for the Social Sciences (2nd ed.). Routledge.
- Chen, X., Ender, P., Mitchell, M., & Wells, C. (2003). Regression with SPSS (SPSS Web Books: Regression). UCLA Statistical Consulting Group.
- Bzovsky, S., Phillips, M. R., Guymer, R. H., Wykoff, C. C., Thabane, L., Bhandari, M., & Chaudhary, V. (2022). The clinician's guide to interpreting a regression analysis. Eye, 36(9), 1715-1717.
Frequently Asked Questions
What does R-squared mean in plain English?
R-squared is the share of variation in your outcome that the model explains. It does not mean the model is "X% accurate," and a higher R-squared does not guarantee the model is correct or causal. Use it alongside diagnostics and effect sizes.
How many data points do I need for regression?
It depends on the number of predictors, missing data, and how precise you need estimates to be. As a practical rule, avoid many predictors with a small sample because coefficients become unstable. Use sample size guidelines to plan, and keep a focused model when data is limited.
What is the difference between correlation and regression?
Correlation produces a single number (r, from -1 to +1) that captures how strongly and in what direction two variables relate. Regression goes further: it fits an equation that estimates how much an outcome changes per unit change in one or more predictors. Correlation is symmetric (r of X,Y equals r of Y,X). Regression is directional (Y depends on X). Use correlation for screening and quick reporting. Use regression when you need to size the effects or control for other variables. For more on the distinction, see correlation vs causation in research.
When should you use regression analysis?
Regression fits when you want to estimate how much an outcome shifts as one or more predictors change, holding other factors constant. Common cases: figuring out which inputs most affect a business outcome, forecasting revenue or churn probability from multiple variables, and separating the effect of one variable from confounders like region or time period. If you only need to know whether a relationship exists (not how large the effect is), a simple correlation may be enough.
What is simple linear regression?
Simple linear regression fits a straight line between one predictor (X) and one outcome (Y): Y = intercept + slope x X. The intercept is the expected Y when X equals zero. The slope is the expected change in Y for each one-unit increase in X. It is the most basic form of regression, and a natural starting point before you add more predictors (multiple regression).
How do you interpret a regression coefficient?
A regression coefficient (B, sometimes called beta) tells you how much the outcome is expected to change for a one-unit increase in that predictor, holding everything else constant. For example, if the coefficient for advertising spend is 4.4, each additional $1,000 in spend is associated with $4,400 more revenue. Report coefficients in the original units of your data, pair them with confidence intervals (not just the point estimate), and remember that "associated with" is not the same as "caused by" unless your study design supports causal claims.