
Regression analysis guide

A practical walkthrough of correlation, regression basics, setup, and interpretation, with worked examples.

Key Takeaways

  1. Regression answers "what predicts what": It estimates how an outcome changes when a predictor changes, holding other predictors constant.
  2. Match the model to your outcome: Use linear regression for numeric outcomes, logistic for binary outcomes, and ordinal models for ordered categories.
  3. Data prep is most of the work: Clean your data, code categories, handle missing values, and document every decision before modeling.
  4. Interpret effects in units people understand: Convert coefficients into real-world units ("each additional $1,000 in ad spend is associated with $4,400 more revenue") rather than only reporting p-values.
  5. Validate, do not just fit: Check assumptions, look for multicollinearity, and sanity-check predictions on a holdout sample to reduce overfitting.

What regression analysis is (and when to use it)

Regression analysis models the relationship between an outcome you care about (the dependent variable, Y) and one or more predictors (independent variables, Xs). The output tells you how Y is expected to change when a predictor changes, while holding other predictors constant.

This guide walks through the concepts, shows a worked example, gives you a calculator to try your own data, and covers interpretation and common mistakes.

Regression is a good fit when you want to:

  • Estimate effects: How much does revenue change per additional dollar of advertising spend?
  • Adjust for confounding: Separate the relationship between a predictor and an outcome from other variables that might muddy the picture.
  • Predict: Forecast an outcome (revenue, test score, customer spend) from multiple input variables.

A practical mental model

Regression is not a magic predictor. It is a structured way to compare how much each input variable matters, on a common footing, and to quantify the uncertainty around those estimates.

Example 1: predicting revenue from advertising spend

This example walks through a simple linear regression on a small dataset, start to finish. The numbers are realistic enough to show every piece of the output.

The question

A retail chain wants to know: "For every additional $1,000 we spend on digital ads per store per month, how much extra revenue should we expect?"

The data

Monthly advertising spend and revenue across 10 stores
Store | Ad spend ($000s) | Monthly revenue ($000s)
A     | 2.0 | 52
B     | 3.5 | 50
C     | 5.0 | 65
D     | 4.0 | 55
E     | 6.5 | 70
F     | 1.5 | 48
G     | 7.0 | 72
H     | 3.0 | 58
I     | 5.5 | 61
J     | 8.0 | 78

Step 1: Check correlation first

Compute the Pearson correlation between ad spend and revenue before fitting a model. Here, r = 0.95 (strong positive), which confirms a clear linear pattern worth modeling.
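The r = 0.95 figure can be verified with a few lines of standard-library Python, using the ten stores from the table above:

```python
import math

# Ad spend ($000s) and monthly revenue ($000s) for the 10 stores
x = [2.0, 3.5, 5.0, 4.0, 6.5, 1.5, 7.0, 3.0, 5.5, 8.0]
y = [52, 50, 65, 55, 70, 48, 72, 58, 61, 78]

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

print(round(pearson_r(x, y), 2))  # 0.95
```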

Step 2: Fit the regression

A simple linear regression of revenue (Y) on ad spend (X) produces:

Regression output summary
Output element | Value | What it means
Intercept | 40.6 | Expected revenue ($40,600) when ad spend is zero. Treat as a mathematical baseline, not a real scenario.
Slope (Ad spend) | 4.4 | Each additional $1,000 in ad spend is associated with roughly $4,400 in additional monthly revenue.
R-squared | 0.89 | Ad spend explains about 89% of the store-to-store variation in revenue.
p-value (slope) | < 0.001 | Strong statistical evidence that the slope is not zero.
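The intercept, slope, and R-squared above can be reproduced with the closed-form least-squares equations, in standard-library Python with the data from the ten-store table:

```python
# Simple OLS fit by the closed-form least-squares equations (stdlib only).
x = [2.0, 3.5, 5.0, 4.0, 6.5, 1.5, 7.0, 3.0, 5.5, 8.0]  # ad spend ($000s)
y = [52, 50, 65, 55, 70, 48, 72, 58, 61, 78]             # revenue ($000s)

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

slope = sxy / sxx                # change in revenue per $1,000 of ad spend
intercept = my - slope * mx      # expected revenue at zero spend

ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
ss_tot = sum((b - my) ** 2 for b in y)
r_squared = 1 - ss_res / ss_tot  # share of variance explained

print(round(slope, 1), round(intercept, 1), round(r_squared, 2))
# 4.4 40.6 0.89
```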

Step 3: Interpret carefully

  • Practical meaning: Revenue = 40.6 + 4.4 × Ad Spend. A store spending $5,000/month on ads should bring in roughly $62,600 in revenue.
  • Causation caveat: These 10 stores differ in location, foot traffic, and staffing. The coefficient captures an association, not proof that raising ad budgets will deliver exactly $4,400 per $1,000.
  • Extrapolation warning: The equation only holds within the observed spend range ($1,500 to $8,000). Plugging in $20,000 would be guesswork.
  • Small sample: Ten data points illustrate the method, but they're too few for confident decisions. More stores would tighten the confidence interval around the slope.
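One way to keep the extrapolation warning honest in practice is to wrap the fitted equation in a helper that refuses inputs outside the observed spend range. A sketch with the coefficients and range hard-coded from this example (`predict_revenue` is a hypothetical helper name):

```python
INTERCEPT, SLOPE = 40.6, 4.4   # fitted coefficients from the example
SPEND_RANGE = (1.5, 8.0)       # observed ad spend in $000s

def predict_revenue(ad_spend_k):
    """Predicted monthly revenue ($000s); refuses to extrapolate."""
    lo, hi = SPEND_RANGE
    if not lo <= ad_spend_k <= hi:
        raise ValueError(
            f"ad spend {ad_spend_k} is outside the observed range {SPEND_RANGE}"
        )
    return INTERCEPT + SLOPE * ad_spend_k

print(round(predict_revenue(5.0), 1))  # 62.6
```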

The dataset is deliberately small so you can replicate it in any statistics tool or spreadsheet. Try entering these numbers into the calculator in the next section to see the same results.

Correlation and regression calculator

Paste your own data or try one of the example datasets. The calculator computes Pearson r, Spearman rho, and fits a simple linear regression with a scatter plot. All computation runs in your browser; no data leaves your device.


Correlation vs regression: measuring the relationship before modeling it

Check whether a relationship exists before you model it. Correlation quantifies how strongly two variables move together in a straight line. If the correlation is weak, regression won't find much to work with.

Two measures dominate survey and business research:

Pearson r vs Spearman rho: when to use each
Measure | What it captures | Best for | Assumes
Pearson r | Strength of linear relationship | Continuous or near-continuous data (e.g., 0-10 ratings, revenue figures) | Both variables roughly normally distributed; relationship is linear
Spearman rho | Strength of monotonic (rank-order) relationship | Ordinal data, skewed distributions, or when outliers are a concern | No distributional assumptions; works on ranks
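The distinction in the table shows up clearly in code. A stdlib sketch that computes Spearman rho as Pearson r on ranks (this version assumes no tied values; ties would need average ranks), applied to a monotonic but curved relationship:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman rho = Pearson r computed on ranks (no-ties sketch)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    return pearson_r(ranks(x), ranks(y))

# Monotonic but curved: Spearman sees a perfect rank-order association,
# while Pearson is pulled below 1 by the curvature.
x = [1, 2, 3, 4, 5, 6]
y = [1, 8, 27, 64, 125, 216]  # y = x^3
print(round(spearman_rho(x, y), 2), round(pearson_r(x, y), 2))  # 1.0 0.94
```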

How correlation leads to regression

Correlation tells you whether two variables track together and how tightly. Regression answers a different question: how much does the outcome shift per unit change in the predictor, after accounting for everything else in the model? Think of correlation as the screening step. Regression is the modeling step.

Correlation vs regression at a glance
Dimension | Correlation | Regression
Question answered | How strongly are X and Y associated? | How much does Y change per unit of X, holding other variables constant?
Direction | Symmetric: r(X,Y) = r(Y,X) | Asymmetric: Y is the outcome, X is the predictor
Number of variables | Typically two at a time | One outcome and one or more predictors
Output | Correlation coefficient (r or rho) | Equation with intercept and slope coefficients, R-squared, p-values
Use case | Exploring associations, screening variables, simple reporting | Driver analysis, prediction, adjusting for confounders

Types of regression analysis

The type of outcome you are modeling determines which regression family to use.

Common outcome types and regression choices
Outcome you are modeling (Y) | Typical example | Often-used model | What the coefficient means (plain English)
Numeric (roughly continuous) | 0-10 satisfaction rating; average of multiple items | Linear regression (OLS) | Change in points on the rating scale per 1-unit change in X
Binary (0/1) | Churn intention: Yes/No; Would recommend: Yes/No | Logistic regression | Change in log-odds (often reported as odds ratio) per 1-unit change in X
Ordered categories | Strongly disagree ... strongly agree (single item) | Ordinal logistic (proportional odds) or treat as numeric with care | Shift toward higher categories, conditional on assumptions
Counts | Number of support contacts last month | Poisson / negative binomial | Multiplicative change in expected count
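For the logistic row, coefficients arrive in log-odds, which few stakeholders think in. Exponentiating converts them to odds ratios. A one-line sketch (the 0.69 coefficient is hypothetical, chosen because exp(0.69) is close to 2):

```python
import math

def to_odds_ratio(log_odds_coef):
    """Convert a logistic-regression coefficient (log-odds) to an odds ratio."""
    return math.exp(log_odds_coef)

# A hypothetical coefficient of 0.69 on a 0/1 predictor: the odds of the
# outcome roughly double when the predictor goes from 0 to 1.
print(round(to_odds_ratio(0.69), 2))  # 1.99
```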

This guide focuses on linear regression (OLS) because it is the most common starting point. If your outcome is binary or ordinal and you force a linear model, the results may still be directionally helpful, but interpretation and fit can suffer.

How to run a regression analysis (step by step)

Follow a repeatable workflow so your results are auditable and reproducible.

  1. Step 1: Define the question and variables

    Write down the specific question regression will answer. Choose one primary outcome (Y) and a short list of predictors (Xs). Add control variables if they matter (e.g., region, time period). Keep the model focused.

  2. Step 2: Prepare and clean data

    Handle missing values, code categorical variables as dummies, check for outliers, and document every transformation. Most regression errors come from data prep, not the math. See data preparation basics for a general checklist.

  3. Step 3: Fit a baseline model

    Start simple. Include your key predictors and obvious controls. Avoid throwing in every variable "to see what sticks." Check that your sample size is large enough relative to the number of predictors.

  4. Step 4: Diagnose fit and assumptions

    Look at residuals, leverage, multicollinearity (VIF), and whether the model is capturing real patterns or artifacts. Fix issues before adding complexity.

  5. Step 5: Interpret and validate

    Translate coefficients into meaningful units. Add confidence intervals. If prediction is a goal, validate on a holdout sample or with cross-validation. State limitations clearly, especially around causality.
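Step 2's dummy coding can be sketched in a few lines of stdlib Python (`dummy_code` and the region values are hypothetical; real projects typically use a statistics package for this). Dropping one reference category avoids the dummy-variable trap, where the dummies are perfectly collinear with the intercept:

```python
def dummy_code(values, reference):
    """One 0/1 column per category, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    return {f"is_{lvl}": [1 if v == lvl else 0 for v in values]
            for lvl in levels}

regions = ["North", "South", "West", "South", "North"]
dummies = dummy_code(regions, reference="North")
print(dummies)
# {'is_South': [0, 1, 0, 1, 0], 'is_West': [0, 0, 1, 0, 0]}
```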

How to read regression output

Most software outputs the same core pieces. The details differ, but interpretation rules do not. For a deeper walk-through of standard outputs (including SPSS), UCLA's Statistical Consulting Group materials are a useful reference: Regression with SPSS (UCLA).

Regression output terms and what they mean
Output element | What it is | How to explain it | Common mistake
Intercept | Expected Y when all Xs are zero (and reference groups) | Often not meaningful unless X=0 is meaningful; treat as a baseline | Interpreting it as an "average" without context
Coefficient (B) | Expected change in Y per 1-unit change in X, holding others constant | "Each additional $1,000 in ad spend is associated with $4,400 more in monthly revenue" | Calling it causal when your data are observational
Standard error | Uncertainty in the coefficient estimate | Use it to form confidence intervals and gauge precision | Ignoring it and focusing only on sign
p-value | Evidence against the null (often B=0) under model assumptions | Use with effect size and context; see p-values and significance | "p<0.05 means important" (importance is practical, not statistical)
R-squared | Share of variance in Y explained by the model | "The model explains about X% of the variation" (not "predicts X%") | Thinking higher R-squared always means a better driver story
Adjusted R-squared | R-squared adjusted for number of predictors | Prefer for comparing models with different predictor counts | Using it as a stamp of causality
F-test / model p-value | Tests whether the model improves on an intercept-only model | Good as a general check, but not the main story | Using it to justify including weak predictors

For careful reading of coefficients and uncertainty, see the clinician-oriented overview in Bzovsky et al. (2022) (the concepts apply beyond clinical settings).

Standardized vs unstandardized coefficients

Unstandardized coefficients are easiest to explain because they are in the original units (points on a 0-10 scale). Standardized coefficients are useful for comparing predictors measured on different scales, but they can confuse stakeholders. If you use them, provide both and explain the difference.
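If you report both, the conversion is simple: a standardized coefficient is the unstandardized one rescaled by the two standard deviations. A sketch using the ad-spend example (in simple regression this roughly equals Pearson r, up to rounding of B):

```python
import statistics

def standardize_coef(b_unstd, x_values, y_values):
    """Standardized beta = B * sd(X) / sd(Y):
    SD units of Y per one SD change in X."""
    return b_unstd * statistics.stdev(x_values) / statistics.stdev(y_values)

# The ad-spend example: B = 4.4 in original units
x = [2.0, 3.5, 5.0, 4.0, 6.5, 1.5, 7.0, 3.0, 5.5, 8.0]
y = [52, 50, 65, 55, 70, 48, 72, 58, 61, 78]
print(round(standardize_coef(4.4, x, y), 2))  # 0.94
```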

Check assumptions and diagnose problems

Regression is a model, not a fact. You do not need perfection, but you should check whether the model is badly misspecified. Introductory regression texts cover these diagnostics in detail (for example, Schroeder et al. (2016) and Gordon (2015)).

  • Linearity: Plot Y vs each key X. If the relationship bends, consider transformations or splines (or a different model).
  • Homoscedasticity: Residual spread should not explode at certain prediction levels. If it does, consider robust standard errors.
  • Independence: Watch clustered data (multiple responses per person, team, store). Cluster-robust approaches may be needed.
  • Multicollinearity: If predictors overlap heavily, coefficients can become unstable. Check VIF and consider composites.
  • Influential points: A few extreme respondents can drive results. Check leverage/Cook's distance and investigate rather than auto-delete.
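The VIF check is easy to illustrate in the two-predictor case, where the R-squared from regressing one predictor on the other is just their squared correlation, so VIF = 1 / (1 − r²). A stdlib sketch with hypothetical, heavily overlapping predictors:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors, R^2 is just r^2."""
    r2 = pearson_r(x1, x2) ** 2
    return 1 / (1 - r2)

# Hypothetical predictors where x2 is nearly a copy of x1:
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1.1, 2.0, 3.2, 3.9, 5.1, 6.0, 7.2, 7.9]
print(round(vif_two_predictors(x1, x2), 1))  # far above the common 5-10 red-flag range
```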

Common misinterpretations and how to avoid them

Many "bad regression" problems are really communication problems. These are the ones that most often break decision-making.

1) Correlation vs causation

A regression coefficient is an association in your data under your model. Unless you have a research design that supports causal inference, avoid causal language. If your stakeholders need the explanation, use our guide on correlation vs causation.

2) "Significant" does not mean "important"

With enough responses, tiny effects can be statistically significant. With small samples, practically large effects can look uncertain. Use statistical significance as one input, alongside effect size and business relevance.

3) Overfitting and p-hacking

If you test many predictors, some will appear "significant" by chance. Keep a pre-specified core model, validate on a holdout set, and be cautious about stepwise selection.
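The holdout idea can be sketched end to end with toy data (hypothetical values generated purely for illustration): fit on one part of the sample, then check error on the part the model never saw.

```python
import random

# Toy data (hypothetical): y is roughly 3 + 2x plus noise.
random.seed(0)
data = [(x, 3 + 2 * x + random.gauss(0, 1)) for x in range(30)]

random.shuffle(data)
train, holdout = data[:24], data[24:]   # simple 80/20 split

def ols_fit(pairs):
    """Closed-form simple OLS: returns (intercept, slope)."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    slope = sxy / sxx
    return my - slope * mx, slope

intercept, slope = ols_fit(train)

# Root-mean-squared error on data the model never saw
rmse = (sum((yv - (intercept + slope * xv)) ** 2 for xv, yv in holdout)
        / len(holdout)) ** 0.5
print(round(slope, 2), round(rmse, 2))
```

A model that looks great on the training data but predicts the holdout poorly is a classic sign of overfitting.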

Frequently Asked Questions

What does R-squared mean in plain English?

R-squared is the share of variation in your outcome that the model explains. It does not mean the model is "X% accurate," and a higher R-squared does not guarantee the model is correct or causal. Use it alongside diagnostics and effect sizes.

How many data points do I need for regression?

It depends on the number of predictors, missing data, and how precise you need estimates to be. As a practical rule, avoid many predictors with a small sample because coefficients become unstable. Use sample size guidelines to plan, and keep a focused model when data is limited.

What is the difference between correlation and regression?

Correlation produces a single number (r, from -1 to +1) that captures how strongly and in what direction two variables relate. Regression goes further: it fits an equation that estimates how much an outcome changes per unit change in one or more predictors. Correlation is symmetric (r of X,Y equals r of Y,X). Regression is directional (Y depends on X). Use correlation for screening and quick reporting. Use regression when you need to size the effects or control for other variables. For more on the distinction, see correlation vs causation in research.

When should you use regression analysis?

Regression fits when you want to estimate how much an outcome shifts as one or more predictors change, holding other factors constant. Common cases: figuring out which inputs most affect a business outcome, forecasting revenue or churn probability from multiple variables, and separating the effect of one variable from confounders like region or time period. If you only need to know whether a relationship exists (not how large the effect is), a simple correlation may be enough.

What is simple linear regression?

Simple linear regression fits a straight line between one predictor (X) and one outcome (Y): Y = intercept + slope × X. The intercept is the expected Y when X equals zero. The slope is the expected change in Y for each one-unit increase in X. It is the most basic form of regression, and a natural starting point before you add more predictors (multiple regression).

How do you interpret a regression coefficient?

A regression coefficient (B, sometimes called beta) tells you how much the outcome is expected to change for a one-unit increase in that predictor, holding everything else constant. For example, if the coefficient for advertising spend is 4.4, each additional $1,000 in spend is associated with $4,400 more revenue. Report coefficients in the original units of your data, pair them with confidence intervals (not just the point estimate), and remember that "associated with" is not the same as "caused by" unless your study design supports causal claims.