
Basic Correlation and Regression Analysis Using R

June 28, 2025 | by Bloom Code Studio

Recall the discussion of correlation and regression analysis in Inferential Statistics and Regression Analysis. A first step in correlation analysis is to calculate the correlation coefficient (r) for (x, y) data. R provides a built-in function, cor(), to calculate the correlation coefficient for bivariate data.

As an example, consider the dataset in Table B5 that tracks the return on the S&P 500 versus return on Coca-Cola stock for a seven-month time period.

Month   S&P 500 Monthly Return (%)   Coca-Cola Monthly Return (%)
Jan      8                            6
Feb      1                            0
Mar      0                           -2
Apr      2                            1
May     -3                           -1
Jun      7                            8
Jul      4                            2

Table B5 Monthly Returns of Coca-Cola Stock versus Monthly Returns for the S&P 500

To calculate the correlation coefficient for this dataset, first create two vectors in R, one vector for the S&P 500 returns and a second vector for Coca-Cola returns:

    > SP500 <- c(8,1,0,2,-3,7,4)
    
    > CocaCola <- c(6,0,-2,1,-1,8,2)

The R function cor() returns the correlation coefficient for the x-data vector and y-data vector:

    > cor(SP500, CocaCola)
    
    [1] 0.9123872

Thus the correlation coefficient for this dataset is approximately 0.912.
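Before fitting a regression line, it is worth checking whether this correlation is statistically significant. One way to do this in R is with the built-in cor.test() function, which tests the null hypothesis that the true correlation is zero. A minimal sketch using the vectors above:

```r
# Recreate the two return vectors
SP500 <- c(8, 1, 0, 2, -3, 7, 4)
CocaCola <- c(6, 0, -2, 1, -1, 8, 2)

# Test H0: the true correlation is zero
result <- cor.test(SP500, CocaCola)

result$estimate   # the correlation coefficient, about 0.912
result$p.value    # about 0.004, below 0.05, so the correlation is significant
```

Since the p-value is well below the usual 0.05 threshold, proceeding to a linear regression model is justified.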

Linear Regression Models Using R

To create a linear model in R, assuming the correlation is significant, the command lm() (for linear model) will provide the slope and y-intercept for the linear regression equation.

The format of the R command is

lm(dependent_variable_vector ~ independent_variable_vector)

Notice the use of the tilde symbol as the separator between the dependent variable vector and the independent variable vector.

We use the returns on Coca-Cola stock as the dependent variable and the returns on the S&P 500 as the independent variable, and thus the R command would be

    > lm(CocaCola ~ SP500)
    
    Call:
    
    lm(formula = CocaCola ~ SP500)
    
    Coefficients:
    
    (Intercept)    SP500
    
       -0.3453   0.8641

The R output provides the value of the y-intercept as -0.3453 and the value of the slope as 0.8641. Based on this, the linear model would be

ŷ = a + bx = −0.3453 + 0.8641x

where x represents the value of S&P 500 return and y represents the value of Coca-Cola stock return.
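A quick way to sanity-check the fitted line is to plot it over the data using base R graphics. A minimal sketch with plot() and abline():

```r
# Recreate the data and refit the model
SP500 <- c(8, 1, 0, 2, -3, 7, 4)
CocaCola <- c(6, 0, -2, 1, -1, 8, 2)
model <- lm(CocaCola ~ SP500)

# Scatter plot of the returns with the regression line overlaid
plot(SP500, CocaCola,
     xlab = "S&P 500 monthly return (%)",
     ylab = "Coca-Cola monthly return (%)")
abline(model)   # draws the fitted line y = -0.3453 + 0.8641x
```

The points should fall close to the line, consistent with the high correlation coefficient found above.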

The results can also be saved as a model object, here named “model,” using the following R command. To obtain more detailed results for the linear regression, use the summary() command, as follows:

    > model <- lm(CocaCola ~ SP500)
    
    > summary(model)
    
    Call:
    
    lm(formula = CocaCola ~ SP500)
    
    Residuals:
    
          1    2    3    4    5    6    7
    
    -0.5672 -0.5188 -1.6547 -0.3828 1.9375 2.2969 -1.1109
    
    Coefficients:
    
            Estimate Std. Error t value Pr(>|t|)
    
    (Intercept) -0.3453   0.7836 -0.441 0.67783
    
    SP500     0.8641   0.1734  4.984 0.00416 **
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 1.658 on 5 degrees of freedom
    
    Multiple R-squared: 0.8325,  Adjusted R-squared: 0.7989
    
    F-statistic: 24.84 on 1 and 5 DF, p-value: 0.004161

In this output, the y-intercept and slope are given, as well as the residuals for each x-value. The output also includes additional statistical details regarding the regression analysis, such as the residual standard error, R-squared, and the F-statistic.
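These pieces of the output can also be extracted programmatically rather than read off the printed summary. A short sketch using standard accessor functions:

```r
# Recreate the data and refit the model
SP500 <- c(8, 1, 0, 2, -3, 7, 4)
CocaCola <- c(6, 0, -2, 1, -1, 8, 2)
model <- lm(CocaCola ~ SP500)

coef(model)                  # y-intercept and slope as a named vector
residuals(model)             # one residual per observation
summary(model)$r.squared     # multiple R-squared, about 0.8325
```

Extracting values this way is useful when the regression results feed into further calculations.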

Predicted values and prediction intervals can also be generated within R. First, we create a structure in R called a data frame to hold the values of the independent variable for which we want to generate a prediction. For example, suppose we would like to generate the predicted return for Coca-Cola stock, given that the return for the S&P 500 is 6%.

We use the R command called predict().

To generate a prediction for the linear regression equation called model, using the dataframe where the value of the S&P 500 is 6, the R commands will be

    > a <- data.frame(SP500=6)
    
    > predict(model, a)
    
         1
    
    4.839062

The output from the predict command indicates that the predicted return for Coca-Cola stock will be 4.8% when the return for the S&P 500 is 6%.

We can extend this analysis to generate a 95% prediction interval for this result by using the following R command, which adds an option to the predict command to generate a prediction interval:

    > predict(model, a, interval="predict")
    
      fit    lwr   upr
    
    1 4.839062 0.05417466 9.62395

Thus the 95% prediction interval for Coca-Cola return is (0.05%, 9.62%) when the return for the S&P 500 is 6%.
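The predict() function can also return a confidence interval for the mean response, via interval="confidence", which is narrower than the prediction interval for an individual observation. A sketch comparing the two:

```r
# Recreate the data, model, and data frame of new values
SP500 <- c(8, 1, 0, 2, -3, 7, 4)
CocaCola <- c(6, 0, -2, 1, -1, 8, 2)
model <- lm(CocaCola ~ SP500)
a <- data.frame(SP500 = 6)

pred <- predict(model, a, interval = "prediction")   # interval for one new observation
conf <- predict(model, a, interval = "confidence")   # interval for the mean response

# Both intervals are centered on the same fitted value (about 4.84),
# but the confidence interval is narrower
pred
conf
```

The prediction interval is wider because it must account for the scatter of individual observations around the regression line, not just uncertainty in the line itself.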

Multiple Regression Models Using R

R also includes many tools to allow the data scientist to conduct multiple regression, where a dependent variable is predicted based on more than one independent variable. For example, we might arrive at a better prediction model for the monthly return of Coca-Cola stock if we consider not only the S&P 500 monthly return but also the monthly sales of Coca-Cola products.

Here are several examples where a multiple regression model might provide an improved prediction model as compared to a regression model with only one independent variable:

  1. Employee salaries can be predicted based on years of experience and education level.
  2. Housing prices can be predicted based on square footage of a home, number of bedrooms, and number of bathrooms.

The general form of the multiple regression model is:

ŷ = a + b1x1 + b2x2 + b3x3 + … + bnxn

where:

x1,x2,x3,…,xn are the independent variables,
b1,b2,b3,…,bn are the coefficients where each coefficient is the amount of change in y when the independent variable xi is changed by one unit and all other independent variables are held constant,
a is the y-intercept, which is the value of y when all xi=0.

Recall from an earlier example that the format for linear regression analysis in R when there is only one independent variable looked like the following:

> model <- lm(y ~ x)

where y is the dependent variable and x is the independent variable.

To “add in” additional independent variables for the multiple regression approach, we use a format as follows:

> model <- lm(y ~ x1 + x2 + x3)

where x1,x2,x3 are the independent variables.

Example:

Use R to create a multiple regression model to predict the price of a home based on the independent variables of square footage and number of bedrooms based on the following dataset. Then use the multiple regression model to predict the price of a home with 3,000 square feet and 3 bedrooms (see Table B6).

Price of Home (y)   Square Footage (x1)   Number of Bedrooms (x2)
466000              4668                  6
355000              3196                  5
405900              3998                  5
415000              4022                  5
206000              1834                  2
462000              4668                  6
290000              2650                  3

Table B6 Home Prices Based on Square Footage and Number of Bedrooms

First, create three vectors in R, one vector each for home price, square footage and number of bedrooms:

    > price <- c(466000, 355000, 405900, 415000, 206000, 462000, 290000)
    > square_footage <- c(4668, 3196, 3998, 4022, 1834, 4668, 2650)
    > bedrooms <- c(6, 5, 5, 5, 2, 6, 3)

Next, run the multiple regression model using the lm command:

    > model <- lm(price ~ square_footage + bedrooms)
    > summary(model)
    
    Call:
    lm(formula = price ~ square_footage + bedrooms)
    
    Residuals:
           1       2       3       4       5       6       7
    -1704.6  1438.4  -606.9  6908.7 -6747.4 -5704.6  6416.5
    
    Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
    (Intercept)    57739.887   9529.693   6.059  0.00375 **
    square_footage    66.017      8.924   7.398  0.00178 **
    bedrooms       16966.536   6283.183   2.700  0.05408 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 6563 on 4 degrees of freedom
    Multiple R-squared:  0.9968,    Adjusted R-squared:  0.9953
    F-statistic: 630.6 on 2 and 4 DF,  p-value: 9.996e-06

In the R output, note that the column called “Estimate” provides the estimates for the coefficients and the y-intercept.

The y-intercept is given as 57740 (rounding to nearest whole number).

The coefficient for the “square footage” variable is given as 66.

The coefficient for the “bedrooms” variable is given as 16967.

Based on these values, the multiple regression model is:

ŷ = 57740 + 66x1 + 16967x2

We can now use the multiple regression model to predict the price of a home with 3,000 square feet and 3 bedrooms by setting x1=3000 and x2=3, as follows:

ŷ = 57740 + 66(3000) + 16967(3) = 306641

Thus, the predicted price of a home with 3,000 square feet and 3 bedrooms is $306,641.
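The same prediction can be generated directly with predict(), as in the simple regression case. Note that predict() uses the unrounded coefficients, so its answer differs slightly from the hand calculation with rounded values above. A sketch:

```r
# Recreate the data and refit the multiple regression model
price <- c(466000, 355000, 405900, 415000, 206000, 462000, 290000)
square_footage <- c(4668, 3196, 3998, 4022, 1834, 4668, 2650)
bedrooms <- c(6, 5, 5, 5, 2, 6, 3)
model <- lm(price ~ square_footage + bedrooms)

# The data frame of new values must use the same variable names as the model
new_home <- data.frame(square_footage = 3000, bedrooms = 3)
predict(model, new_home)   # about 306690, close to the rounded hand calculation
```

For models with several independent variables, predict() is less error-prone than substituting values by hand.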
