10/15/24, 9:08 AM  Final Exam Cheat Sheet
Cheat Sheet Final Exam:
Stepwise Regression: (Exam) Suited for: Variable selection and/or prediction from feature data. Definition: Variable selection
process that can combine forward selection and backward elimination.
Lasso Regression: (Exam) Suited for: Variable selection and/or prediction from feature data. Definition: Method for limiting the
number of variables in a model by limiting the sum of all coefficients’ absolute values. Can be very helpful when the number of
data points is less than the number of factors.
Fractional Factorial Designs: (Exam) Suited for: Experimental design. Definition: Test of a subset of all possible combinations of
factor values over multiple factors. If chosen well, the desired effects of factors and factor interaction effects can be obtained.
Support Vector Machines: (Exam) Suited for: Classification and/or prediction from feature data. Definition: Classification
algorithm that uses a boundary to separate the data into two or more categories (“classes”). (Exam) Used to: Using feature
data to predict whether or not something will happen two time periods in the future.
K-means algorithm: (Exam) Suited for: Clustering. Definition: Clustering algorithm that defines k clusters of data points, each
corresponding to one of k cluster centers selected by the algorithm.
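A minimal sketch of the k-means idea, assuming 2-D points and hand-picked starting centers (the function name, sample points, and starting centers are illustrative assumptions, not from the exam). Each pass assigns every point to its closest center, then moves each center to the mean of its assigned points.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points, starting from the given centers."""
    for _ in range(iterations):
        # Assignment step: attach each point to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its cluster
        # (a center with no points stays where it is).
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical data: two visibly separated groups, so k = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)
print(clusters)
```

In practice the starting centers are usually chosen at random and the algorithm is rerun several times, because different starts can give different clusterings.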
Exponential Smoothing: (Exam) Suited for: Prediction from time-series data. Definition: Data smoothing technique in which
older observations are assigned exponentially decreasing weights, so more emphasis is given to recent observations. (Exam)
Analysis: Using time-series data to predict the amount of something two time periods in the future.
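As a rough illustration of the definition above, simple exponential smoothing can be sketched as S_t = alpha * x_t + (1 - alpha) * S_(t-1). The function name, the sample series, and alpha = 0.5 are assumptions chosen for the example. This basic form has no trend or seasonality term, so its forecast for any future period is the last smoothed value.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed series; alpha near 1 trusts recent observations more."""
    smoothed = [series[0]]  # initialize with the first observation
    for x in series[1:]:
        # Blend the new observation with the previous smoothed value.
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


sales = [10, 20, 30]  # hypothetical time series
smoothed = exponential_smoothing(sales, alpha=0.5)
forecast = smoothed[-1]  # flat forecast for one, two, or more periods ahead
print(smoothed, forecast)
```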
GARCH: (Exam) Suited for: Prediction from time-series data. Definition: Autoregressive method used to model variance in
time-series data. (Exam) Model used to: Using time-series data to predict the variance of something two time periods in the future.
ARIMA: (Exam) Suited for: Prediction from time-series data. Definition: Time series model that uses differences between
observations when data is nonstationary. Also called Box-Jenkins. (Exam) Model used to: Using time series to predict the
amount of something two time periods in the future.
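The "differences between observations" in the ARIMA definition can be illustrated with first-order differencing, shown here as a small sketch (the function name and sample series are assumptions). A series with a steady upward trend is nonstationary, but its differences are constant.

```python
def difference(series):
    """First-order differences: each value minus the one before it."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


trending = [3, 5, 7, 9]          # hypothetical series with a linear trend
differenced = difference(trending)
print(differenced)
```

A full ARIMA model also fits autoregressive and moving-average terms to the differenced series; this sketch covers only the differencing step.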
K-nearest-Neighbor Regression: (Exam): Suited for: Using feature data to predict the amount and/or probability of something
two time periods in the future. Definition: Regression model where a data point’s response is estimated based on the responses
of the k nearest data points with known response.
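A minimal sketch of k-nearest-neighbor regression, assuming squared Euclidean distance and an unweighted average of the neighbors' responses (both are common choices rather than the only ones; the names and data are hypothetical).

```python
def knn_regress(train, query, k):
    """Estimate the response at `query` from the k closest training points.

    train: list of (feature_tuple, response) pairs with known responses.
    """
    def squared_distance(row):
        features, _ = row
        return sum((a - b) ** 2 for a, b in zip(features, query))

    nearest = sorted(train, key=squared_distance)[:k]
    # Average the responses of the k nearest points.
    return sum(response for _, response in nearest) / k


train = [((1,), 10), ((2,), 20), ((3,), 30), ((10,), 100)]  # hypothetical data
estimate = knn_regress(train, query=(2.2,), k=3)
print(estimate)
```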
Logistic regression tree: (Exam) Suited for: Using feature data to predict the probability of something happening and/or
whether or not something will happen two time periods in the future. Definition Logistic Regression: Regression model that uses
an exponential function of variables to estimate a response that is either between 0 and 1, or must be equal to 0 or 1.
(Examples of Logistic Regression): Exam (Q43): Estimate the probability that a patient survives heart
transplant surgery. Another example: Estimate the likelihood that a flight from Atlanta to Detroit will take more than two hours.
Definition Regression Tree: Tree-based method for regression. After branching to split the data, each subset is analyzed with its
own regression model. Tree: Iterative split (branching) of a data set into more-specific subsets that each are modeled
separately. Often used for classification, regression, and decision-making. Also, can be used to solve optimization problems.
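The logistic regression part of the entry above can be sketched as follows. The "exponential function of variables" is the logistic function 1 / (1 + e^(-z)), where z is a linear combination of the predictors. The intercept, coefficients, and 0.5 threshold below are made-up values for illustration, not fitted to any data.

```python
import math


def logistic_probability(intercept, coefficients, features):
    """Map a linear combination of features to a probability between 0 and 1."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1 / (1 + math.exp(-z))


def classify(probability, threshold=0.5):
    """Turn the probability into a 0/1 answer (the threshold is an assumption)."""
    return 1 if probability >= threshold else 0


# Hypothetical model with two predictors.
p = logistic_probability(intercept=-1.0, coefficients=[0.8, 0.3], features=[2.0, 1.0])
print(p, classify(p))
```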
Random Support Vector Machine Forest: (Exam) Model to: Using feature data to predict whether or not something will happen
two time periods in the future. Definition Forest: A set of multiple trees. Just like in real life.
Linear regression tree: (Exam) Suited for: Using feature data to predict the amount of something two time periods in the
future. Definition Linear Regression: Regression model where the relationships between attributes and a response are modeled
as linear functions. (Examples of Linear Regression): Exam (Q43): Forecast the number of hotdogs that will be
sold at a baseball game. Another example: Estimate the amount of time it will take to process a certain loan application.
For each type of data specify if it is or it is not time series:
Definition Time Series: Data that records the same attribute/response at multiple points in time (often at equal time intervals).
• Characteristics of a day (day of week, season, temperature, amount of rainfall) that might affect the number of
burgers sold: (EXAM) NOT TIME SERIES
• Fraction of burgers sold that had cheese, on each of the past 2000 days: (EXAM) TIME SERIES
• Number of burgers a restaurant sold on each of the past 2000 days: (EXAM) TIME SERIES
• Number of toppings on each burger sold in the past 2000 days: (EXAM) NOT TIME SERIES
Data that is scaled before point outliers are removed:
Definition Point outlier: A data point that is (uncommonly) far from other data points – for example, an outdoor temperature
reading of 200 degrees Fahrenheit. Scaling: Shrinking or expanding, and moving, the range of data to fit exactly into a specific
interval (for example, between 0 and 1, or between 100 and 800).
• If data is scaled first, the range of data after outliers are removed will be (EXAM) NARROWER than intended
• Point outliers (EXAM) WOULD NOT appear to be valid data if not removed before scaling
• Valid data (EXAM) WOULD NOT appear to be outliers if data is scaled first.
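The three answers above can be checked with a small sketch, assuming min-max scaling to [0, 1] and a made-up list of temperature readings in which 200 is the point outlier (the function and variable names are illustrative).

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly map values so the smallest becomes `low` and the largest `high`."""
    smallest, largest = min(values), max(values)
    return [low + (v - smallest) * (high - low) / (largest - smallest) for v in values]


temps = [60, 65, 70, 75, 80, 200]  # hypothetical readings; 200 is the outlier

# Scale first, then drop the outlier: valid data is squeezed into a narrow band.
scaled_first = min_max_scale(temps)
kept_after_scaling = scaled_first[:-1]

# Drop the outlier first, then scale: valid data fills the whole interval.
removed_first = min_max_scale(temps[:-1])

print(max(kept_after_scaling))  # well below 1
print(removed_first)            # spans 0 to 1
```

With the outlier still present during scaling, the valid readings end up in roughly the bottom 15% of the interval, while the outlier alone sits at 1 and still looks like an outlier.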
Specify whether using a variable selection approach like lasso or stepwise regression would be important:
Definition Variable Selection: Process of selecting the best subset of predictors to explain variance in data; involves eliminating
unnecessary or redundant or less-important variables from a potential set of predictors.
• Time-series data is being used: (Exam) No, don’t use variable selection
• There are fewer data points than variables: (Exam) Yes, use variable selection
• There are too few data points to avoid overfitting if all variables are included: (Exam) Yes, use variable selection
• It is too costly to create a model with a large number of variables: (Exam) Yes, use variable selection
(Exam) Tasks each software package is best suited for:
R: Linear regression
PuLP: Linear programming (optimization)
SimPy: Discrete-event simulation
Arena: Discrete-event simulation
Homework R functions:
Function predict: Make predictions from models
Function scale: Scale data
Function glm: Generalized linear models (used for logistic regression)
Function cv: Cross-Validation
Function FrF2: Creating and analyzing fractional factorial designs (None of the above)
Function train: Train various models
Function kmeans: k-means
Function ggplot: Graphing
Function ksvm: Support vector machine
Function prcomp: PCA
Function HoltWinters: Holt-Winters
Function randomForest: Random forest
Function kknn: k-nearest-neighbor
Function lm: Linear regression
The following process was followed to predict sales of a product each month for the next three years:
1. Split past sales data randomly into three sets: training, validation, and test.
Definition: Training set: Portion of the data used to build/fit a model. Normally, most of the data is used for
training. Validation set: Portion of the data used to validate a model and to compare between models.
Test set: Portion of the data used to assess the effectiveness of a model once built.
2. Build 20 different models using the training data.
Definition: Model: A mathematical description of a system. Because real-life systems are complex,
mathematical models of them are only approximate. In analytics, the term “model” is used in at least
three different ways: (1) A general type of mathematical approach, like “regression”; (2) A general type of
mathematical approach with specific parameters, like “regression using credit score and income as
predictors”; (3) A general type of mathematical approach with specific parameters and values for the
parameters, like “regression, with the prediction equal to 100,000, plus 100 times credit score, plus 3 times
income”.
3. Evaluate all 20 models on the validation data.
4. Select the model that performed best on the validation data.
5. Evaluate the selected model on the test data.
6. Use the selected model to predict monthly sales for the next three years based on real-time data and
observe its true performance.
• (EXAM) It is unclear how the selected model's expected performance on test data compares to its observed
performance on real-time data, because the training data and the test data were taken from the same population,
but the real-time data might be different.
• (EXAM) The selected model's expected performance on test data will be worse than its expected performance
on the validation data, because there is selection bias: the selected model is more likely to have better than
average performance on random patterns in the validation data.
• (EXAM) Every model's expected performance on training data will be better than its expected performance on
the validation data, because each model fits partly to random patterns in the training data.
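The selection-bias answer can be illustrated with a small simulation. It is only a sketch under strong simplifying assumptions: all 20 models have the same true quality (0), and every measured score is that quality plus independent standard normal noise. The function name, trial count, and seed are arbitrary choices.

```python
import random


def simulate_selection_bias(trials=2000, n_models=20, seed=0):
    """Average validation and test scores of the model picked on validation data."""
    rng = random.Random(seed)
    validation_total = 0.0
    test_total = 0.0
    for _ in range(trials):
        # Every model is equally good; validation scores differ only by noise.
        validation_scores = [rng.gauss(0, 1) for _ in range(n_models)]
        best_validation = max(validation_scores)  # score of the selected model
        # On the test set the selected model gets fresh, independent noise.
        test_score = rng.gauss(0, 1)
        validation_total += best_validation
        test_total += test_score
    return validation_total / trials, test_total / trials


mean_validation, mean_test = simulate_selection_bias()
print(mean_validation, mean_test)
```

The winner's average validation score comes out well above its average test score even though no model is actually better than the others, which is why step 5 uses a separate test set to estimate performance.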
A positive correlation has been observed between health and wealth among older Americans (healthier people are
wealthier on average, and wealthier people are healthier on average). Based on that observed correlation, select all of the
following statements about the direction of causality between health and wealth that are true.
Definition: Correlation: Relationship in which two things are likely to happen together, regardless of whether
one causes the other. (There is also a quantitative statistical definition measuring the amount of correlation.)
Causation: Relationship in which one thing makes another happen (i.e., one thing causes another).