Learning

open
close

Basic Statistical Analysis Using Excel

June 28, 2025 | by Bloom Code Studio

Excel provides many statistical functions and statistical distributions that are used in data science applications.

Excel Analysis for Measures of Center and Dispersion

Excel allows you to type a formula into a cell. Using a formula, you can refer to a value from another cell or produce an output based on a value in a cell (or values from multiple cells). A formula always starts with an equal sign (=). Following is an example formula, calculating an average of three numbers in cells A1, A2, and A3.

= AVERAGE(A1, A2, A3)

Suppose you have numbers in cells A1, A2, and A3 and want to show the average of the three numbers in cell B1. Figure A22 shows how to use AVERAGE in this scenario.

A screenshot of an Excel spreadsheet showing columns A, B, and C and rows 1, 2 and 3. Cell A1 is 10, cell A2 is 20 , and cell A3 is 30. Cell B2 has an AVERAGE formula that says =AVERAGE(A1, A2, A3).

Figure A22 Use of AVERAGE in MS Excel (Used with permission from Microsoft)

Once you are done typing the formula in B1, press enter. The formula in B1 will change to the result of the formula – 20 (see Figure A23).

A screenshot of an Excel spreadsheet showing columns A, B, and C and rows 1, 2 and 3. Cell A1 is 10, cell A2 is 20 , and cell A3 is 30. Cell B2 is 20, which shows the result of using the formula =AVERAGE(A1, A2, A3).

Figure A23 The Result of Using AVERAGE in MS Excel (Used with permission from Microsoft)

Excel does NOT save the formula when the file is saved as CSV

You can use Excel formulas with no problem on a CSV file, but Excel does not save those when the file is saved in the CSV format. If you want to keep the formula you have used, make sure to save your file as XLSX (File > Save As… > Excel Workbook (XLSX)). XLSX stores more information beyond text, such as formulas and drawings.

EXAMPLE A.6

Problem

On ch1-SOCR-small.csv, calculate the average height of the MLB players in the dataset using AVERAGE.

Solution

On the worksheet where the MLB Players dataset is open, type this formula in any empty cell and press the enter key (Figure A24).

=AVERAGE(D2:D13)

This will fill the average across cells D2, D3, D4, …, through D13. The resulting average height is 73.3333.

A screenshot of an MS Excel worksheet. It contains a dataset with 13 rows and 6 columns. The columns are labeled name, team, position, height (inches), weight (pounds), and age. The data in the Height column is highlighted and in row 14 the formula =AVERAGE(D2:D13) is being used to calculate the average height.

Figure A24 Calculate the Average Height Using AVERAGE in MS Excel (Used with permission from Microsoft)

Other Useful Formulas for Summarizing Data

There are a lot of other ways to use simple Excel formulas to analyze data values. You can type each of the example formulas listed in Table A1 and see how it works.

FormulaDescription
=MEDIAN(D2:D13)Compute median of D2, D3, …, and D13. Returns 72.5 as a result.
=STDEV(D2:D13)Compute standard deviation of D2, D3, …, and D13. Returns 3.28 as a result.
=MIN(D2:D13)Find the minimum value among D2, D3, …, and D13. Returns 70 as a result.
=MAX(D2:D13)Find the maximum value among D2, D3, …, and D13. Returns 82 as a result.
=SUM(D2:D13)Sum the values of D2, D3, …, through D13. Returns 880 as a result.
=COUNTIF(D2:D13, 72)Count the number of occurrences of “72” within the range of D2, D3, …, through D13. Returns 3 as a result (i.e., there are three 72s in D2:D13).

Table A1 Example MS Excel Formulas

EXAMPLE A.7

Problem

On ch1-SOCR-small.csv, write a formula that calculates the median weight of the players.

Solution

The formula is

=MEDIAN(E2:E13)

This will find the median across cell E2, E3, E4, …, through E13, which is 200. If you do not know what median is, don’t worry! You will learn more about it in Measures of Position.

Similar to the built-in function for the mean (=AVERAGE), Excel provides a function to calculate the sample standard deviation of a dataset =STDEV.S (for the sample standard deviation). To calculate these statistical results in Excel, enter the data values in a column. Let’s assume the data values are placed in cells A2 through A11. In any cell, type the Excel command =AVERAGE(A2:A11) and press enter. Excel will calculate the arithmetic mean in this cell. Then, in any other cell, type the Excel command =STDEV.S(A2:A11) and press enter. Excel will calculate the sample standard deviation in this cell. Figure A25 shows the mean and standard deviation for the 10 ages.

A screenshot of an Excel worksheet with column A labeled Ages. There are 10 numbers in the column. Overlaid text says: Excel command to calculate sample mean: =AVERAGE(A2:A11). Result: 46. Excel command to calculate sample standard deviation: =STDEV.S(A2:A11). Result: 6.50

Figure A25 Mean and Standard Deviation in MS Excel (Used with permission from Microsoft)

Here is a more comprehensive listing of various statistical functions available within Excel. The reader is encouraged to go to the Formulas menu in Excel and select any of the submenus such as “Financial,” “Math & Trig,” Lookup & Reference,” etc. to see various Excel formulas. Under the option for “More Functions,” select STATISTICAL to see a menu of various statistical functions available within Excel. Table A2 shows a sample of some statistical functions available within Excel:

FunctionPurpose
=AVERAGE(A1:A10)Find the mean of a set of numbers.
=MEDIAN(A1:A10)Find the median of a set of numbers.
=STDEV.S(A1:A10)Find the standard deviation for a set of numbers representing a sample.
=STDEV.P(A1:A10)Find the standard deviation for a set of numbers representing a population.
=VAR.S(A1:A10)Find the variance for a set of numbers representing a sample.
=VAR.P(A1:A10)Find the variance for a set of numbers representing a population.
=MIN(A1:A10)Find the minimum of a set of numbers.
=MAX(A1:A10)Find the maximum of a set of numbers.
=MAX(A1:A10) – MIN(A1:A10)Find the range of a set of numbers.
=CORREL(A1:A10, B1:B10)Find the correlation coefficient “r” for (x, y) data.
x-data is in cells A1 to A10; y-data is in cells B1 to B10.
=QUARTILE (A1:A10, 1)Find the first quartile for a set of numbers.
=QUARTILE (A1:A10, 2)Find the second quartile for a set of numbers (note that the second quartile is the same as the median).
=QUARTILE (A1:A10, 3)Find the third quartile for a set of numbers.

Table A2 Statistical Functions Available within MS Excel

Excel Analysis for Probability Distributions

As discussed in Measures of Position, data scientists are often interested in various probability distributions such as the normal distribution, binomial distribution, and Poisson distribution. Excel provides built-in functions to analyze many probability distributions.

Excel uses the command =NORM.DIST to find the area under the normal curve to the left of a specified value:

=NORM.DIST(XVALUE, MEAN, STANDARD_DEV, TRUE)

For example, suppose that at a financial consulting company, the mean employee salary is $60,000 with a standard deviation of $7,500. A normal curve can be drawn to represent this scenario, in which the mean of $60,000 would be plotted on the horizontal axis, corresponding to the peak of the curve. Then, to find the probability that an employee earns more than $75,000, you would calculate the area under the normal curve to the right of the data value $75,000.

For example, at the financial consulting company mentioned previously, the mean employee salary is $60,000 with a standard deviation of $7,500. To find the probability that a random employee’s salary is less than $55,000 using Excel, this is the command you would use:

=NORM.DIST(55000, 60000, 7500, TRUE)

Result: 0.25249

Thus, there is a probability of about 25% that a random employee has a salary less than $55,000.

Excel also provides built-in functions for binomial and Poisson distributions as shown in Table A3:

Probability DistributionExcel CommandUsage
Binomial=BINOMIAL.DIST=BINOMIAL.DIST(Number_s, Trials, Probability_S, Cumulative)
Poisson=POISSON=POSSON(X, Mean, Cumulative)

Table A3 Built-In Functions for Binomial and Poisson Distributions in MS Excel

Excel Analysis for Correlation and Regression

In Inferential Statistics and Regression Analysis, the concepts of correlation and regression were discussed.

In correlation analysis, the data scientist is often interested in calculating the correlation coefficient, which is a numerical measure of the direction and strength of the correlation between two numeric quantities.

The Excel command to calculate the correlation coefficient uses the following format:

=CORREL(A1:A10, B1:B10)

In regression analysis, the data scientist is interested in calculating the equation of the best-fit linear model for two numeric quantities (assuming the correlation is significant).

Recall the equation of the best fit line is

yˆ=a+bx

where m is the slope of the line and b is the y-intercept of the line.

To calculate the slope and y-intercept of the linear model using Excel, start by entering the (x, y) data in two columns in Excel. Then the Excel commands =SLOPE and =INTERCEPT can be used to calculate the slope and intercept, respectively.

The dataset in Table A4 will be used as an example: the monthly amount spent on advertising and the monthly revenue for a Fortune 500 company for 12 months.

MonthAmount Spent on AdvertisingRevenue
Jan4912210
Feb14517590
Mar5713215
Apr15319200
May9214600
Jun8314100
Jul11717100
Aug14218400
Sep6914100
Oct10615500
Nov10916300
Dec12117020

Table A4 Revenue versus Advertising for a Fortune 500 Company ($000s) Monthly Amount Spent on Advertising and the Monthly Revenue for a Fortune 500 Company

To calculate the slope of the regression model, use the Excel command

=SLOPE(y-data range, x-data range)

It’s important to note that this Excel command expects that the y-data range is entered first and the x-data range is entered second. Since revenue depends on amount spent on advertising, revenue is considered the y-variable and amount spent on advertising is considered the x-variable. Notice the y-data is contained in cells C2 through C13 and the x-data is contained in cells B2 through B13. Thus the Excel command for slope would be entered as

=SLOPE(C2:C13, B2:B13)

In the same way, the Excel command to calculate the y-intercept of the regression model is

=INTERCEPT(y-data range, x-data range)

For the dataset shown in the above table, the Excel command would be

=INTERCEPT(C2:C13, B2:B13)

The results are as follows (see Figure A26):

slope bintercept a==61.89,376.7

A screenshot of an MS Excel worksheet. A table showing monthly advertising expenditure and revenue for a year. The table also shows the calculation of slope and intercept. It contains a dataset with 13 rows and 3 columns. The columns are labelled Month, Advertising Expenditure, and Revenue. In cell B15 is the formula =SLOPE(C2:C13, B2:B13). Cell B16 is 61.8. In cell C15 is the formula =INTERCEPT(C2:C13, B2:B13). Cell C16 is 9,376.7.

Figure A26 Revenue versus Advertising for a Fortune 500 Company ($000s) Showing Slope and y-Intercept Calculation in MS Excel (Used with permission from Microsoft)

Based on this, the regression equation can be written as

yˆyˆ==a+bx9,376.7+61.8x

where x represents the amount spent on advertising (in thousands of dollars) and y represents the amount of revenue (in thousands of dollars).

RELATED POSTS

View all

view all