Basic Correlation and Regression Analysis Using R
June 28, 2025 | by Bloom Code Studio
Recall in Inferential Statistics and Regression Analysis the discussion on correlation and regression analysis. A first step in correlation analysis is to calculate the correlation coefficient (r) for (x, y) data. R provides a built-in function cor() to calculate the correlation coefficient for bivariate data.
As an example, consider the dataset in Table B5 that tracks the return on the S&P 500 versus return on Coca-Cola stock for a seven-month time period.
| Month | S&P 500 Monthly Return (%) | Coca-Cola Monthly Return (%) |
|---|---|---|
| Jan | 8 | 6 |
| Feb | 1 | 0 |
| Mar | 0 | -2 |
| Apr | 2 | 1 |
| May | -3 | -1 |
| Jun | 7 | 8 |
| Jul | 4 | 2 |
Table B5 Monthly Returns of Coca-Cola Stock versus Monthly Returns for the S&P 500
To calculate the correlation coefficient for this dataset, first create two vectors in R, one vector for the S&P 500 returns and a second vector for Coca-Cola returns:
> SP500 <- c(8,1,0,2,-3,7,4)
> CocaCola <- c(6,0,-2,1,-1,8,2)
The R command cor() returns the correlation coefficient for the x-data vector and y-data vector:
> cor(SP500, CocaCola)
[1] 0.9123872
Thus the correlation coefficient for this dataset is approximately 0.912.
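As a sanity check, r can also be computed directly from its definition, the covariance divided by the product of the standard deviations, using the built-in cov() and sd() functions (a verification sketch, not part of the original example):

```r
# r = cov(x, y) / (sd(x) * sd(y))
SP500 <- c(8, 1, 0, 2, -3, 7, 4)
CocaCola <- c(6, 0, -2, 1, -1, 8, 2)
r_manual <- cov(SP500, CocaCola) / (sd(SP500) * sd(CocaCola))
r_manual  # same value as cor(SP500, CocaCola)
```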
Linear Regression Models Using R
To create a linear model in R, assuming the correlation is significant, the command lm() (for linear model) will provide the slope and y-intercept for the linear regression equation.
The format of the R command is
lm(dependent_variable_vector ~ independent_variable_vector)
Notice the use of the tilde symbol (~) as the separator between the dependent variable vector and the independent variable vector.
We use the returns on Coca-Cola stock as the dependent variable and the returns on the S&P 500 as the independent variable, and thus the R command would be
> lm(CocaCola ~ SP500)
Call:
lm(formula = CocaCola ~ SP500)
Coefficients:
(Intercept) SP500
-0.3453 0.8641
The R output provides the value of the y-intercept as -0.3453 and the value of the slope as 0.8641. Based on this, the linear model would be
ŷ = a + bx = −0.3453 + 0.8641x

where x represents the S&P 500 return and ŷ represents the predicted Coca-Cola stock return.
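The intercept and slope can also be extracted from the fitted object with coef(), which is convenient when the values are needed in later calculations (a small sketch; the object name fit is arbitrary):

```r
SP500 <- c(8, 1, 0, 2, -3, 7, 4)
CocaCola <- c(6, 0, -2, 1, -1, 8, 2)
fit <- lm(CocaCola ~ SP500)
b <- coef(fit)      # named vector: (Intercept), SP500
b["(Intercept)"]    # approximately -0.3453
b["SP500"]          # approximately 0.8641
```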
The results can also be saved as an object, here called “model”. To obtain more detailed results for the linear regression, use the summary() command, as follows:
> model <- lm(CocaCola ~ SP500)
> summary(model)
Call:
lm(formula = CocaCola ~ SP500)
Residuals:
1 2 3 4 5 6 7
-0.5672 -0.5188 -1.6547 -0.3828 1.9375 2.2969 -1.1109
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.3453 0.7836 -0.441 0.67783
SP500 0.8641 0.1734 4.984 0.00416 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.658 on 5 degrees of freedom
Multiple R-squared: 0.8325, Adjusted R-squared: 0.7989
F-statistic: 24.84 on 1 and 5 DF, p-value: 0.004161
In this output, the y-intercept and slope are given, as well as the residuals for each x-value. The output also includes additional statistical details regarding the regression analysis.
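Individual pieces of this output can also be accessed programmatically rather than read off the printed summary; for example, R-squared and the residuals (a quick sketch):

```r
SP500 <- c(8, 1, 0, 2, -3, 7, 4)
CocaCola <- c(6, 0, -2, 1, -1, 8, 2)
model <- lm(CocaCola ~ SP500)
s <- summary(model)
s$r.squared        # multiple R-squared, approximately 0.8325
residuals(model)   # one residual per observation
```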
Predicted values and prediction intervals can also be generated within R. First, we create a structure in R called a data frame to hold the values of the independent variable for which we want to generate a prediction. For example, suppose we would like to predict the return for Coca-Cola stock when the return for the S&P 500 is 6%.
We use the R command called predict().
To generate a prediction for the linear regression equation called model, using the dataframe where the value of the S&P 500 is 6, the R commands will be
> a <- data.frame(SP500=6)
> predict(model, a)
1
4.839062
The output from the predict command indicates that the predicted return for Coca-Cola stock will be 4.8% when the return for the S&P 500 is 6%.
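The predicted value can be verified by plugging x = 6 into the fitted equation using the stored coefficients (a verification sketch):

```r
SP500 <- c(8, 1, 0, 2, -3, 7, 4)
CocaCola <- c(6, 0, -2, 1, -1, 8, 2)
model <- lm(CocaCola ~ SP500)
y_hat <- coef(model)[1] + coef(model)[2] * 6   # a + b * x with x = 6
unname(y_hat)                                  # approximately 4.839
```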
We can extend this analysis to generate a 95% prediction interval for this result by using the following R command, which adds an option to the predict command to generate a prediction interval:
> predict(model, a, interval="predict")
fit lwr upr
1 4.839062 0.05417466 9.62395
Thus the 95% prediction interval for Coca-Cola return is (0.05%, 9.62%) when the return for the S&P 500 is 6%.
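A related option is interval="confidence", which gives a confidence interval for the mean response rather than for a single new observation; it is always narrower than the corresponding prediction interval:

```r
SP500 <- c(8, 1, 0, 2, -3, 7, 4)
CocaCola <- c(6, 0, -2, 1, -1, 8, 2)
model <- lm(CocaCola ~ SP500)
a <- data.frame(SP500 = 6)
predict(model, a, interval = "confidence")  # narrower than interval = "prediction"
```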
Multiple Regression Models Using R
R also includes many tools to allow the data scientist to conduct multiple regression, where a dependent variable is predicted based on more than one independent variable. For example, we might arrive at a better prediction model for the monthly return of Coca-Cola stock if we consider not only the S&P 500 monthly return but also the monthly sales of Coca-Cola products.
Here are several examples where a multiple regression model might provide an improved prediction model as compared to a regression model with only one independent variable:
- Employee salaries can be predicted based on years of experience and education level.
- Housing prices can be predicted based on square footage of a home, number of bedrooms, and number of bathrooms.
The general form of the multiple regression model is:
ŷ = a + b1x1 + b2x2 + b3x3 + ⋯ + bnxn
where:
x1,x2,x3,…,xn are the independent variables,
b1,b2,b3,…,bn are the coefficients where each coefficient is the amount of change in y when the independent variable xi is changed by one unit and all other independent variables are held constant,
a is the y-intercept, which is the value of y when all xi=0.
Recall from an earlier example that the format for linear regression analysis in R when there is only one independent variable looked like the following:
> model <- lm(y ~ x)
where y is the dependent variable and x is the independent variable.
To “add in” additional independent variables for the multiple regression approach, we use a format as follows:
> model <- lm(y ~ x1 + x2 + x3)
where x1,x2,x3 are the independent variables.
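In practice the variables often live in a data frame rather than in separate vectors; the same formula then takes a data argument. A small hypothetical sketch (the data values here are made up for illustration):

```r
# Hypothetical data: response y with two predictors x1 and x2
df <- data.frame(y  = c(2, 4, 5, 7, 9),
                 x1 = c(1, 2, 3, 4, 5),
                 x2 = c(0, 1, 0, 1, 0))
model <- lm(y ~ x1 + x2, data = df)
coef(model)  # intercept plus one coefficient per predictor
```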
Example:
Use R to create a multiple regression model to predict the price of a home based on the independent variables of square footage and number of bedrooms based on the following dataset. Then use the multiple regression model to predict the price of a home with 3,000 square feet and 3 bedrooms (see Table B6).
| Price of Home (y) | Square Footage x1 | Number of Bedrooms x2 |
|---|---|---|
| 466000 | 4668 | 6 |
| 355000 | 3196 | 5 |
| 405900 | 3998 | 5 |
| 415000 | 4022 | 5 |
| 206000 | 1834 | 2 |
| 462000 | 4668 | 6 |
| 290000 | 2650 | 3 |
Table B6 Home Prices Based on Square Footage and Number of Bedrooms
First, create three vectors in R, one vector each for home price, square footage and number of bedrooms:
> price <- c(466000, 355000, 405900, 415000, 206000, 462000, 290000)
> square_footage <- c(4668, 3196, 3998, 4022, 1834, 4668, 2650)
> bedrooms <- c(6, 5, 5, 5, 2, 6, 3)
Next, run the multiple regression model using the lm command:
> model <- lm(price ~ square_footage + bedrooms)
> summary(model)
Call:
lm(formula = price ~ square_footage + bedrooms)
Residuals:
1 2 3 4 5 6 7
-1704.6 1438.4 -606.9 6908.7 -6747.4 -5704.6 6416.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 57739.887 9529.693 6.059 0.00375 **
square_footage 66.017 8.924 7.398 0.00178 **
bedrooms 16966.536 6283.183 2.700 0.05408 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6563 on 4 degrees of freedom
Multiple R-squared: 0.9968, Adjusted R-squared: 0.9953
F-statistic: 630.6 on 2 and 4 DF, p-value: 9.996e-06
In the R output, note that the column called “Estimate” provides the estimates for the coefficients and the y-intercept.
The y-intercept is given as 57740 (rounding to nearest whole number).
The coefficient for the “square footage” variable is given as 66.
The coefficient for the “bedrooms” variable is given as 16967.
Based on these values, the multiple regression model is:
ŷ = 57740 + 66x1 + 16967x2
We can now use the multiple regression model to predict the price of a home with 3,000 square feet and 3 bedrooms by setting x1=3000 and x2=3, as follows:
ŷ = 57740 + 66(3000) + 16967(3) = 306641
Thus, the predicted price of a home with 3,000 square feet and 3 bedrooms is $306,641.
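The same prediction can be obtained directly with predict(), which uses the unrounded coefficients and therefore differs slightly from the hand calculation with rounded values:

```r
price <- c(466000, 355000, 405900, 415000, 206000, 462000, 290000)
square_footage <- c(4668, 3196, 3998, 4022, 1834, 4668, 2650)
bedrooms <- c(6, 5, 5, 5, 2, 6, 3)
model <- lm(price ~ square_footage + bedrooms)
new_home <- data.frame(square_footage = 3000, bedrooms = 3)
predict(model, new_home)  # close to the hand-computed 306641
```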