Basic Data Analysis Using R
June 28, 2025 | by Bloom Code Studio
R is a free, open-source language and environment for statistical analysis that is widely used in the data science field. It provides an integrated suite of functions for data exploration, visualization, graphing, and statistical programming. R is increasingly popular among data scientists in part because it is open source, so the user community is constantly adding features. R runs on many different computing platforms and can be downloaded at the R Project website.
Once you have installed and started R on your computer, at the bottom of the R console, you should see the symbol >, which indicates that R is ready to accept commands.
For a user new to R, typing help.start() at the R prompt provides a menu of Manuals and Reference materials as shown in Figure B1.
Figure B1 R Help Menu Based on help.start()
R provides many built-in help resources. When a user types help() at the R prompt, a listing of help resources is provided. For a specific example, typing help(median) will show the documentation for R's built-in median function.
In addition, if a user types demo() at the R prompt, various demonstration options are shown. For a specific example, typing demo(graphics) will provide some examples of various graphics plots.
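A few other built-in lookup commands complement help() and demo(). The short sketch below uses median only as a sample topic; any function name could be substituted.

```r
# Two equivalent ways to open the documentation page for median
help(median)
?median

# Keyword search across installed documentation, useful when the
# exact function name is not known
help.search("median")

# Run the worked examples listed at the bottom of the median help page
example(median)
```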
R is a command-driven language, meaning that the user enters commands at the prompt, which R then executes one at a time. R can also execute a program (script) containing multiple commands. A graphical user interface (GUI) can also be added to R; RStudio is one example of a GUI tool for R.
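As a minimal sketch of running several commands as one program, the example below writes a three-line script to a temporary file and executes it with source(). The script contents and the variable names are illustrative assumptions, not part of the original text.

```r
# Create a temporary script file (the three commands are hypothetical)
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- 20",
             "y <- 30",
             "result <- x * y"),
           script_path)

# source() reads the file and executes its commands in order
source(script_path)

result   # [1] 600
```

In practice the script would be saved under a permanent name and run the same way with source(), or from a terminal with Rscript.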
The R command line can be used to perform any numeric calculation, similar to a handheld calculator. For example, to evaluate the expression 10+3⋅7, enter the following expression at the command line prompt and press return. The numeric result of 31 is then shown:
> 10+3*7
[1] 31
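R follows the usual order of operations, which is why the multiplication above is carried out before the addition. The following sketch shows a few more arithmetic operators; the specific numbers are arbitrary illustrations.

```r
10 + 3 * 7      # multiplication happens first: 31
(10 + 3) * 7    # parentheses change the order: 91
2^3             # exponentiation: 8
17 %/% 5        # integer division: 3
17 %% 5         # remainder (modulo): 2
sqrt(16)        # built-in square root function: 4
```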
Most calculations in R are handled via functions. For data science and statistical analysis, there are many pre-established functions in R to calculate mean, median, standard deviation, quartiles, and so on. Variables can be named and assigned values using the assignment operator <-. For example, the following R commands assign the value of 20 to the variable named x and assign the value of 30 to the variable named y:
> x <- 20
> y <- 30
These variable names can be used in any calculation, such as multiplying x by y to produce the result 600:
> x*y
[1] 600
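The result of a calculation can itself be stored in a variable and reused later. The sketch below repeats the assignments of x and y so that it runs on its own; the name product is an illustrative choice.

```r
x <- 20
y <- 30

# Store the result of a calculation in a new variable
product <- x * y
product          # [1] 600

# Variables can be combined with any arithmetic operator
x + y            # [1] 50
y - x            # [1] 10

# Reassigning a variable replaces its previous value
x <- 25
x * y            # [1] 750
```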
The typical method for using functions in statistical applications is to first create a vector of data values. There are several ways to create vectors in R. For example, the c function is often used to combine values into a vector. The following R command will generate a vector called salaries that contains the data values 40000, 50000, 75000, and 92000:
> salaries <- c(40000, 50000, 75000, 92000)
This vector salaries can then be used in statistical functions such as mean, median, min, max, and so on, as shown:
> mean(salaries)
[1] 64250
> median(salaries)
[1] 62500
> min(salaries)
[1] 40000
> max(salaries)
[1] 92000
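To see what these functions compute, the same results can be reproduced from more basic building blocks. This is only an illustrative check, not how R implements the functions internally.

```r
salaries <- c(40000, 50000, 75000, 92000)

n <- length(salaries)     # number of data values: 4
total <- sum(salaries)    # 257000
total / n                 # 64250, matches mean(salaries)

# With an even number of values, the median is the average
# of the two middle values of the sorted data
sorted <- sort(salaries)
(sorted[n / 2] + sorted[n / 2 + 1]) / 2   # 62500, matches median(salaries)

# range() returns the minimum and maximum together
range(salaries)           # [1] 40000 92000
```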
Another option for generating a vector in R is to use the seq function, which will automatically generate a sequence of numbers. For example, we can generate a sequence of numbers from 1 to 5, incremented by 0.5, and call this vector example1, as follows:
> example1 <- seq(1, 5, by=0.5)
If we then type the name of the vector and press enter, R will provide a listing of numeric values for that vector name.
> salaries
[1] 40000 50000 75000 92000
> example1
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
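A few related ways of generating vectors are sketched below. The vector names other than example1 are illustrative assumptions.

```r
# The colon operator generates consecutive integers
one_to_five <- 1:5
one_to_five                       # [1] 1 2 3 4 5

# seq() can take the desired number of values instead of the step size
example1 <- seq(1, 5, by = 0.5)
example2 <- seq(1, 5, length.out = 9)
all.equal(example1, example2)     # TRUE: both hold 1.0, 1.5, ..., 5.0

# rep() repeats a value a given number of times
zeros <- rep(0, times = 3)
zeros                             # [1] 0 0 0

# length() reports how many values a vector holds
length(example1)                  # [1] 9
```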
Oftentimes, a data scientist is interested in generating a quick statistical summary of a dataset in the form of its mean, median, quartiles, min, and max. The R command called summary provides these results.
> summary(salaries)
Min. 1st Qu. Median Mean 3rd Qu. Max.
40000 47500 62500 64250 79250 92000
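The quartiles reported by summary can also be requested individually. The sketch below uses the quantile and IQR functions with their default settings.

```r
salaries <- c(40000, 50000, 75000, 92000)

# quantile() returns the minimum, quartiles, and maximum by default:
# 0% = 40000, 25% = 47500, 50% = 62500, 75% = 79250, 100% = 92000
quantile(salaries)

# A single percentile can be requested with the probs argument
q1 <- quantile(salaries, probs = 0.25)
q3 <- quantile(salaries, probs = 0.75)

# The interquartile range is the third quartile minus the first
IQR(salaries)    # [1] 31750
```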
For measures of spread, R includes a command for standard deviation, called sd(), and a command for variance, called var(). Both are calculated with the assumption that the dataset was collected from a sample, so the sum of squared deviations is divided by n − 1 rather than n.
> sd(salaries)
[1] 23641.42
> var(salaries)
[1] 558916667
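The sample formulas can be checked by hand. The sketch below computes the sample variance and standard deviation step by step, and also shows the population versions (dividing by n), for which base R has no dedicated function. The variable names are illustrative.

```r
salaries <- c(40000, 50000, 75000, 92000)
n <- length(salaries)

# Deviation of each value from the mean
deviations <- salaries - mean(salaries)

# Sample variance and standard deviation: divide by n - 1
sample_var <- sum(deviations^2) / (n - 1)
sample_sd <- sqrt(sample_var)

all.equal(sample_var, var(salaries))   # TRUE
all.equal(sample_sd, sd(salaries))     # TRUE

# Population versions: divide by n instead
pop_var <- sum(deviations^2) / n
pop_sd <- sqrt(pop_var)
```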
To calculate a weighted mean in R, create two vectors, one of which contains the data values and the other of which contains the associated weights. Then enter the R command weighted.mean(values, weights).
The following is an example of a weighted mean calculation in R:
EXAMPLE B.1
Problem
Assume a financial portfolio contains 1,000 shares of XYZ Corporation, purchased on three different dates, as shown in Table B1. Use R to calculate the weighted mean of the purchase price, where the weights are based on the number of shares in the portfolio.
| Date Purchased | Purchase Price ($) | Number of Shares Purchased |
|---|---|---|
| January 17 | 78 | 200 |
| February 10 | 122 | 300 |
| March 23 | 131 | 500 |
| Total | | 1000 |
Table B1 Portfolio of XYZ Shares
Solution
Here is how you would create two vectors in R: the price vector will contain the purchase price, and the shares vector will contain the number of shares. Then execute the R command weighted.mean(price, shares), as follows:
> price <- c(78, 122, 131)
> shares <- c(200, 300, 500)
> weighted.mean(price, shares)
[1] 117.7
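The weighted mean is the sum of each price multiplied by its weight, divided by the sum of the weights. As a quick check, the following sketch reproduces the result by hand and compares it with the ordinary (unweighted) mean of the three prices.

```r
price <- c(78, 122, 131)
shares <- c(200, 300, 500)

# Weighted mean by hand: (78*200 + 122*300 + 131*500) / 1000
manual <- sum(price * shares) / sum(shares)
manual                          # [1] 117.7

# The unweighted mean ignores how many shares were bought at each price
mean(price)                     # [1] 110.3333
```

The weighted mean is higher than the unweighted mean because the largest purchase (500 shares) was made at the highest price.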
A list of common R statistical commands appears in Table B2.
| R Command | Result |
|---|---|
| mean() | Calculates the arithmetic mean |
| median() | Calculates the median |
| min() | Calculates the minimum value |
| max() | Calculates the maximum value |
| weighted.mean() | Calculates the weighted mean |
| sum() | Calculates the sum of values |
| summary() | Calculates the mean, median, quartiles, min, and max |
| sd() | Calculates the sample standard deviation |
| var() | Calculates the sample variance |
| IQR() | Calculates the interquartile range |
| barplot() | Plots a bar chart of non-numeric data |
| boxplot() | Plots a boxplot of numeric data |
| hist() | Plots a histogram of numeric data |
| plot() | Plots various graphs, including a scatter plot |
| freq() | Creates a frequency distribution table (requires an add-on package; base R provides table()) |
Table B2 List of Common R Statistical Commands
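In base R, frequency counts for non-numeric data can be produced with the table function. The sketch below uses a small made-up vector of course names; the data is hypothetical and serves only to illustrate the call.

```r
# Hypothetical data: the course chosen by each of six students
courses <- c("Statistics", "History", "Statistics",
             "Physics", "History", "Statistics")

# table() counts how often each distinct value occurs
counts <- table(courses)
counts    # History: 2, Physics: 1, Statistics: 3

# prop.table() converts the counts to relative frequencies
proportions <- prop.table(counts)
proportions
```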
Basic Visualization and Graphing Using R
R provides many built-in functions for data visualization and graphing and allows the data scientist significant flexibility and customization options for graphs and other data visualizations.
There are many statistical applications in R, and many graphical representations are possible, such as bar graphs, histograms, time series plots, scatter plots, and others.
As a simple example of a bar graph, assume a college instructor wants to create a bar graph to show enrollment in various courses such as statistics, history, physics, and chemistry courses.
Table B3 shows the enrollment data:
| College Course | Student Enrollment |
|---|---|
| Statistics | 375 |
| History | 302 |
| Physics | 294 |
| Chemistry | 193 |
Table B3 Enrollment Data
The basic command to create a bar graph in R is the command barplot().
First, create a dataframe called enrollment to hold the data (a dataframe can be considered a table or matrix to store the dataset).
enrollment <- data.frame(course=c("Statistics", "History", "Physics", "Chemistry"),
enrolled=c(375, 302, 294, 193))
Next, use the barplot function to create the bar graph and add labels for the x-axis, the y-axis, and an overall title.
barplot(enrollment$enrolled, names.arg=enrollment$course,
main="Student Enrollment at the College", xlab="Course Name",
ylab = "Enrollment")
The resulting output is shown below in Figure B2:
Figure B2 Bar Graph of Student Enrollment Data
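As an example of the customization options, the sketch below redraws the same data with the bars sorted by enrollment, drawn horizontally, and filled with a color. The choice of ordering, color, and the name ordered are illustrative assumptions rather than part of the original example.

```r
enrollment <- data.frame(
  course = c("Statistics", "History", "Physics", "Chemistry"),
  enrolled = c(375, 302, 294, 193)
)

# Sort the rows from lowest to highest enrollment
ordered <- enrollment[order(enrollment$enrolled), ]

# horiz = TRUE draws horizontal bars; las = 1 keeps the labels horizontal
barplot(ordered$enrolled, names.arg = ordered$course,
        horiz = TRUE, las = 1, col = "steelblue",
        main = "Student Enrollment at the College",
        xlab = "Enrollment")
```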
The basic command to create a scatter plot in R is the plot command, plot(x, y), where x is a vector containing the x-values of the dataset and y is a vector containing the y-values of the dataset.
The general format of the command is as follows:
> plot(x, y, main="text for title of graph",
       xlab="text for x-axis label", ylab="text for y-axis label")
For example, we are interested in creating a scatter plot to examine the correlation between the value of the S&P 500 and Nike stock prices. Assume we have the data shown in Table B4, collected over a one-year time period.
| Date | S&P 500 | Nike Stock Price ($) |
|---|---|---|
| 4/1/2020 | 2912.43 | 87.18 |
| 5/1/2020 | 3044.31 | 98.58 |
| 6/1/2020 | 3100.29 | 98.05 |
| 7/1/2020 | 3271.12 | 97.61 |
| 8/1/2020 | 3500.31 | 111.89 |
| 9/1/2020 | 3363.00 | 125.54 |
| 10/1/2020 | 3269.96 | 120.08 |
| 11/1/2020 | 3621.63 | 134.70 |
| 12/1/2020 | 3756.07 | 141.47 |
| 1/1/2021 | 3714.24 | 133.59 |
| 2/1/2021 | 3811.15 | 134.78 |
| 3/1/2021 | 3943.34 | 140.45 |
| 3/12/2021 | 3943.34 | 140.45 |
| (source: https://finance.yahoo.com/) | | |
Table B4 Data for S&P 500 and Nike Stock Price over a 12-Month Period
Note that data can be read into R from a text file or Excel file or from the clipboard by using various R commands. Assume the values of the S&P 500 have been loaded into the vector SP500 and the values of Nike stock prices have been loaded into the vector Nike. Then, to generate the scatter plot, we can use the following R command:
> plot(SP500, Nike, main="Scatter Plot of Nike Stock Price vs. S&P 500",
       xlab="S&P 500", ylab="Nike Stock Price")
As a result of these commands, R provides the scatter plot shown in Figure B3.
Figure B3 Scatter Plot Generated by R for Nike Stock Price vs. S&P 500
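For readers who want to reproduce the plot, one simple way to load the data is to type the values into vectors with the c function. The sketch below transcribes the rows of Table B4 in order and also computes the correlation coefficient with the cor function, which the original text does not show. The commented-out read.csv line uses a hypothetical file name.

```r
# Values transcribed from Table B4, in row order
SP500 <- c(2912.43, 3044.31, 3100.29, 3271.12, 3500.31, 3363.00, 3269.96,
           3621.63, 3756.07, 3714.24, 3811.15, 3943.34, 3943.34)
Nike <- c(87.18, 98.58, 98.05, 97.61, 111.89, 125.54, 120.08,
          134.70, 141.47, 133.59, 134.78, 140.45, 140.45)

# Alternative: read the data from a file instead
# stocks <- read.csv("stocks.csv")   # hypothetical file name

plot(SP500, Nike, main = "Scatter Plot of Nike Stock Price vs. S&P 500",
     xlab = "S&P 500", ylab = "Nike Stock Price")

# Correlation coefficient between the two series (between -1 and 1)
r <- cor(SP500, Nike)
r
```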