This appendix summarizes the Python functions used in this textbook. It is intended as a cross-reference for students: each entry gives a description of the function, its general syntax, and a reference to the section where the function is first used in the text.
Note that these descriptions are very high level. Many of the functions require specific libraries to be installed. For more details on Python functions, syntax, and usage, refer to the Python documentation posted online.
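Because many of the functions in this appendix come from third-party libraries, it can help to check which libraries are available before running an example. The following is a minimal sketch that uses only the standard library. The list of import names is an assumption inferred from the prefixes and function names used in this appendix (such as pd, np, plt, sns, and tf), and the helper name check_libraries is hypothetical.

```python
from importlib.util import find_spec

# Assumed import names for the libraries behind the functions in this appendix
# (an illustrative list, not one taken from the textbook).
LIBRARIES = [
    "pandas",       # pd.read_csv(), DataFrame methods
    "numpy",        # np.array()
    "matplotlib",   # plt.scatter(), plt.hist(), ...
    "scipy",        # binom, poisson, norm, t, ttest_1samp(), ...
    "statsmodels",  # proportion_confint(), plot_acf(), STL(), adfuller(), ARIMA()
    "sklearn",      # LogisticRegression(), KMeans(), train_test_split(), ...
    "seaborn",      # sns.heatmap()
    "tensorflow",   # tf.keras.Sequential()
]


def check_libraries(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Print one status line per library without actually importing it
    for name, available in check_libraries(LIBRARIES).items():
        print(f"{name:12s} {'installed' if available else 'MISSING'}")
```

Any library reported as missing would need to be installed before the corresponding functions can be used.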
| Python Function | Description | Syntax | First Reference |
|---|---|---|---|
| What Are Data and Data Science? | | | |
| print() | Prints a specified message or specified values to the screen or other output device | print("text") or print(x, y) | Python Basics for Data Science |
| pd.read_csv() | Loads data from a CSV (comma-separated values) file and stores it in a DataFrame | pd.read_csv(path_to_csv_datafile) | Python Basics for Data Science |
| DataFrame.describe() | Returns a table with basic statistics for a dataset including min, max, mean, count, and quartiles | DataFrame.describe() Where: DataFrame is the name of the DataFrame. | Python Basics for Data Science |
| DataFrame.iloc[] | Allows access to data in a DataFrame using row/column integer-based indexes. | DataFrame.iloc[row, column] Where: DataFrame is the name of the DataFrame. | Python Basics for Data Science |
| DataFrame.loc[] | Used to access a group of rows and columns by labels or a Boolean array | DataFrame.loc[criteria] Where: DataFrame is the name of the DataFrame. | Python Basics for Data Science |
| plt.scatter() | Generates a scatterplot for (x, y) data | plt.scatter(x_data, y_data) | Python Basics for Data Science |
| plt.title() | Specifies a title for a chart | plt.title("Title") | Python Basics for Data Science |
| plt.xlabel() | Specifies a label for the x-axis | plt.xlabel("x-axis label") | Python Basics for Data Science |
| plt.ylabel() | Specifies a label for the y-axis | plt.ylabel("y-axis label") | Python Basics for Data Science |
| plt.xlim() | Specifies limits to use for x-axis numbering | plt.xlim(lower, upper) | Python Basics for Data Science |
| plt.ylim() | Specifies limits to use for y-axis numbering | plt.ylim(lower, upper) | Python Basics for Data Science |
| Collecting and Preparing Data | | | |
| pd.read_html() | Reads the HTML tables on a web page and returns them as a list of DataFrames | pd.read_html(URL) | Web Scraping and Social Media Data Collection |
| pd.to_numeric() | Converts strings or other data types to numeric values | pd.to_numeric(column_name) | Web Scraping and Social Media Data Collection |
| len() | Returns the length of an object | len(object) | Web Scraping and Social Media Data Collection |
| re.findall() | Returns all non-overlapping matches of a specified pattern in a string | re.findall(pattern, string) | Web Scraping and Social Media Data Collection |
| re.search() | Searches a string for the first location where a specified pattern matches; returns a match object, or None if the pattern does not appear | re.search(pattern, string) | Web Scraping and Social Media Data Collection |
| Descriptive Statistics: Statistical Measurements and Probability Distributions | | | |
| binom.pmf() | Calculates the probability mass function (PMF) for a binomial distribution. It gives the probability of having exactly x successes in n trials with success probability p. | binom.pmf(x, n, p) Where: x is the number of successes in the experiment, n is the number of trials in the experiment, p is the probability of success. | Discrete and Continuous Probability Distributions |
| round() | Rounds a numeric result to a specified level of precision | round(number, digits) | Discrete and Continuous Probability Distributions |
| poisson.pmf() | Calculates probabilities associated with the Poisson distribution | poisson.pmf(x, mu) Where: x is the number of events of interest, mu is the mean of the Poisson distribution. | Discrete and Continuous Probability Distributions |
| norm.cdf() | Calculates probabilities associated with the normal distribution (returns the area under the normal probability density function to the left of a specified measurement) | norm.cdf(x, mu, std) Where: x is the measurement of interest, mu is the mean of the normal distribution, std is the standard deviation of the normal distribution. | Discrete and Continuous Probability Distributions |
| Inferential Statistics and Regression Analysis | | | |
| t.ppf() | Generates the value of the t-distribution corresponding to a specified area under the t-distribution curve and specified degrees of freedom | t.ppf(area_to_left, degrees_of_freedom) | Statistical Inference and Confidence Intervals |
| bootstrap() | Performs the bootstrap process to generate a confidence interval | bootstrap(data, statistic, confidence_level=level, n_resamples=n) | Statistical Inference and Confidence Intervals |
| norm.interval() | Calculates a confidence interval for the mean when the population standard deviation is known, given the sample mean, population standard deviation, and sample size (uses the normal distribution). Note: The standard error is the population standard deviation divided by the square root of the sample size. | norm.interval(conf_level, sample_mean, standard_error) | Statistical Inference and Confidence Intervals |
| t.interval() | Calculates a confidence interval for the mean when the population standard deviation is unknown, given the sample mean, sample standard deviation, and sample size (uses the t-distribution). Note: The standard error is the sample standard deviation divided by the square root of the sample size. | t.interval(conf_level, degrees_freedom, sample_mean, standard_error) | Statistical Inference and Confidence Intervals |
| proportion_confint() | Calculates a confidence interval for a proportion (uses the normal distribution) | proportion_confint(success, sample_size, alpha) | Statistical Inference and Confidence Intervals |
| ttest_1samp() | Returns the value of the test statistic and the two-tailed p-value for a one-sample hypothesis test using the t-distribution | ttest_1samp(data_array, null_hypothesis_mean) | Hypothesis Testing |
| ttest_ind_from_stats() | Returns the value of the test statistic and the two-tailed p-value for a two-sample hypothesis test using the t-distribution | ttest_ind_from_stats(sample_mean1, sample_standard_deviation1, sample_size1, sample_mean2, sample_standard_deviation2, sample_size2) | Hypothesis Testing |
| np.array() | Creates a numerical array from a list-like object | np.array(object) | Correlation and Linear Regression Analysis |
| pearsonr() | Calculates the value of the Pearson correlation coefficient r | pearsonr(x_data, y_data) | Correlation and Linear Regression Analysis |
| linregress() | Generates a linear regression model and provides slope, y-intercept, and other regression-related output | linregress(x_data, y_data) | Correlation and Linear Regression Analysis |
| f_oneway() | Returns both the F test statistic and the p-value for the one-way ANOVA hypothesis test | f_oneway(Array1, Array2, Array3, …) | Analysis of Variance (ANOVA) |
| Time Series and Forecasting | | | |
| plot() | Generates a time series plot | plot(dataframe) | Introduction to Time Series Analysis |
| rolling() | Provides rolling window calculations | rolling(window=window) | Time Series Forecasting Methods |
| mean() | Computes the average of a dataset | mean(dataset) | Time Series Forecasting Methods |
| diff() | Computes the first-order difference of data in a window | diff(dataframe) | Time Series Forecasting Methods |
| plot_acf() | Plots the ACF (autocorrelation function) for a time series, up to lag L | plot_acf(time_series_data, lags=L) | Time Series Forecasting Methods |
| STL() | Decomposes a time series with known period P into its components | STL(time_series_data, period=P) | Time Series Forecasting Methods |
| ewm() | Performs exponential moving average (EMA) smoothing | ewm(dataframe) | Time Series Forecasting Methods |
| adfuller() | Performs the Augmented Dickey-Fuller (ADF) test, which is a statistical test for checking the stationarity of a time series | adfuller(time_series_data) | Time Series Forecasting Methods |
| ARIMA() | Fits an ARIMA(p, d, q) (AutoRegressive Integrated Moving Average) model to time series data | ARIMA(time_series_data, order=(p, d, q)) | Time Series Forecasting Methods |
| Decision-Making Using Machine Learning Basics | | | |
| LogisticRegression() | Creates a logistic regression model | LogisticRegression() | Classification Using Machine Learning |
| model.fit() | Trains a machine learning model on a given dataset | model.fit(feature_matrix, target_vector) | Classification Using Machine Learning |
| KMeans() | Sets up a k-means clustering model (Use model.fit() to fit the model to a dataset.) | KMeans(n_clusters=k) | Classification Using Machine Learning |
| DBSCAN() | Sets up a DBSCAN (Density-Based Spatial Clustering of Applications with Noise) model (Use model.fit() to fit the model to a dataset.) | DBSCAN(options) | Classification Using Machine Learning |
| confusion_matrix() | Computes a confusion matrix, used to evaluate the performance of a classification model by comparing actual and predicted values | confusion_matrix(target_values, predicted_values) | Classification Using Machine Learning |
| LinearRegression() | Fits a linear regression model to data | LinearRegression().fit(feature_matrix, target_vector) | Machine Learning in Regression Analysis |
| predict() | Used on trained machine learning models to generate predictions for new data points | predict(feature_matrix) | Machine Learning in Regression Analysis |
| DecisionTreeClassifier() | Sets up a decision tree model (Use model.fit() to fit the model to a dataset.) | DecisionTreeClassifier(options) | Decision Trees |
| ens.RandomForestRegressor() | Sets up a random forest model (Use model.fit() to fit the model to a dataset.) | ens.RandomForestRegressor(options) | Other Machine Learning Techniques |
| GaussianNB() | Sets up a Naïve Bayes classification model (Use model.fit() to fit the model to a dataset.) | GaussianNB() | Other Machine Learning Techniques |
| Deep Learning and Artificial Intelligence (AI) Basics | | | |
| Perceptron() | Sets up a perceptron model (Use model.fit() to fit the model to a dataset.) | Perceptron() | Introduction to Neural Networks |
| train_test_split() | Splits a dataset randomly into train and test subsets, using a proportion P of the data for the test set | train_test_split(input_data_arrays, target_data, test_size=P) | Introduction to Neural Networks |
| StandardScaler() | Used to standardize features by removing the mean and scaling to unit variance | StandardScaler() | Introduction to Neural Networks |
| accuracy_score() | Calculates the accuracy of a classification model as the ratio of the number of correct predictions to the total number of predictions | accuracy_score(y_true, y_predicted) | Introduction to Neural Networks |
| scaler.fit_transform() | Fits a scaler to the data and then transforms the data according to the fitted scaler | scaler.fit_transform(array) | Introduction to Neural Networks |
| scaler.transform() | Applies a previously fitted scaler to new data | scaler.transform(array) | Introduction to Neural Networks |
| tf.keras.Sequential() | Creates a linear stack of layers for building a neural network model | tf.keras.Sequential(layers, additional options) | Backpropagation |
| model.compile() | Used to configure the learning process of a neural network model before training | model.compile(optimizer, loss, metrics) | Backpropagation |
| Visualizing Data | | | |
| boxplot() | Creates a box-and-whisker plot | plt.boxplot(array) | Encoding Univariate Data |
| hist() | Creates a histogram | plt.hist(array) | Encoding Univariate Data |
| plot() | Creates 2D line plots such as a time series graph | plt.plot(x_data, y_data) | Graphing Probability Distributions |
| bar() | Creates a bar chart | plt.bar(x_array, heights) | Graphing Probability Distributions |
| imshow() | Displays an image on a 2D regular raster, such as a heatmap | plt.imshow(array) | Geospatial and Heatmap Data Visualization Using Python |
| heatmap() | Creates a heatmap visualization | sns.heatmap(array) | Geospatial and Heatmap Data Visualization Using Python |
| colorbar() | Adds a colorbar (a legend showing the color scale) to a figure | plt.colorbar() | Multivariate and Network Data Visualization Using Python |
| corr() | Calculates the pairwise correlations of columns in a DataFrame | dataframe.corr() | Multivariate and Network Data Visualization Using Python |
| add_subplot() | Adds a subplot to a figure stored in fig | fig.add_subplot(position) | Multivariate and Network Data Visualization Using Python |
| ax.scatter() | Creates a scatterplot | ax.scatter(x_data, y_data) | Multivariate and Network Data Visualization Using Python |
| Reporting Results | | | |
| plot_tree() | Creates a visualization of a decision tree | plot_tree(estimator, feature_names) | Validating Your Model |
| DataFrame.info() | Provides a concise summary of a DataFrame’s structure and content | DataFrame.info() | Validating Your Model |
| DataFrame.drop() | Removes rows or columns from a DataFrame | DataFrame.drop(labels, axis=rows_columns) | Validating Your Model |
| score() | Evaluates the performance of a trained model on a given dataset | model.score(feature_matrix, true_labels) | Validating Your Model |
| dt.get_depth() | Retrieves the depth of the decision tree, dt | dt.get_depth() | Validating Your Model |
| cross_val_score() | Evaluates a model’s performance using cross-validation | cross_val_score(estimator, feature_matrix, target_variable) | Validating Your Model |
| GridSearchCV() | Searches for the best parameters for a specified estimator, using k-fold cross-validation | GridSearchCV(estimator, parameters, cv=k) | Validating Your Model |
Table D1
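A few of the entries in Table D1 are part of Python itself and need no extra installation: print(), len(), round(), and the functions in the re module. As a brief illustration, the following sketch applies them to a made-up string. The sample text and variable names are assumptions chosen for demonstration and do not come from the textbook.

```python
import re

# Hypothetical sample text, made up for this illustration
text = "Station A: 12 readings, Station B: 7 readings, Station C: 9 readings"

# re.findall() returns all non-overlapping matches as a list of strings
counts = re.findall(r"\d+", text)

# re.search() returns a match object for the first match, or None if absent
match = re.search(r"Station B", text)

# len() returns the number of items in the list of matches
n_counts = len(counts)

# round() rounds the mean of the extracted values to two decimal places
average = round(sum(int(c) for c in counts) / n_counts, 2)

print(counts)             # ['12', '7', '9']
print(match is not None)  # True
print(n_counts)           # 3
print(average)            # 9.33
```

Note that re.findall() returns strings, so the values are converted with int() before any arithmetic.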