DSCI 100 Final Exam Complete Questions And Answers For All Blocks.
Underfitting in classification - correct answer An increased number of neighbours
means that more observations influence each classification. This smooths out the boundaries
between classes, so the model is not influenced enough by the training data.
Underfitting in KNN regression - correct answer Increasing K means that more
observations influence each prediction, which can flatten the regression line. Since the
line no longer follows the training observations, accuracy in predicting the training data points
decreases (the error on the training data goes up).
Overfitting in classification - correct answer A decreased number of neighbours
means that fewer observations influence each classification. This makes the boundaries between
classes more jagged and complex, so the model is influenced too much by the training data. The
classifier just matches new observations to the closest neighbour in the training data set, resulting in
high accuracy on the training data but less reliable predictions on new data.
Overfitting in KNN regression - correct answer Decreasing K means that fewer
observations influence each prediction, which can cause the regression line to simply follow the training
data points. Since the line follows the training observations almost perfectly, accuracy in predicting
the training data points increases, but predictions for new observations become less reliable.
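The two KNN regression cards above (underfitting and overfitting) can be checked with a small base R sketch. Everything here is an assumption made for illustration: the data are simulated, and `knn_predict` and `rmse` are hypothetical helper names, not functions from the course or from any package.

```r
# Hand-rolled KNN regression on one predictor (illustration only).
knn_predict <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[seq_len(k)]  # indices of the k closest points
    mean(y_train[nearest])                            # average their responses
  })
}

# Root mean squared error between true and predicted values.
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

set.seed(1)
x <- seq(0, 10, length.out = 50)     # made-up predictor values (all distinct)
y <- sin(x) + rnorm(50, sd = 0.3)    # made-up noisy response

pred_k1  <- knn_predict(x, y, x, k = 1)   # K = 1: each training point is its own neighbour
pred_k50 <- knn_predict(x, y, x, k = 50)  # K = n: every prediction is the overall mean of y

train_rmse_k1  <- rmse(y, pred_k1)   # zero error on the training data (overfit)
train_rmse_k50 <- rmse(y, pred_k50)  # larger error, flat line (underfit)
```

With K = 1 the training error is zero because the line passes through every training point; with K equal to the sample size the prediction is one flat line at the mean. Neither extreme says anything good about new data, which is why K is chosen by cross-validation rather than by training error.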
Bootstrapping (concept) - correct answer Given a single sample from a population,
you draw one observation from the sample, record it, then return it to the sample. This is sampling
with replacement. Repeat this until the bootstrap sample is the same size as the original
sample, then calculate the mean or proportion of the bootstrap sample. Repeat these
steps many times to form a bootstrap distribution of means or proportions. The bootstrap distribution
is centred near the original sample's estimate rather than the true population parameter, but its spread
should resemble the spread of the sampling distribution, so it shows how uncertain the estimate is.
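A minimal base R sketch of the bootstrapping steps described above. The sample is simulated and the number of repetitions (1000) is an arbitrary choice, so treat both as assumptions. The sketch uses `sample()` and `replicate()` from base R and may differ from the functions used in the course.

```r
set.seed(100)
original_sample <- rnorm(40, mean = 50, sd = 10)  # made-up sample of size 40

n_reps <- 1000  # number of bootstrap samples (arbitrary choice)
boot_means <- replicate(n_reps, {
  # Draw with replacement until the bootstrap sample matches the original size
  boot_sample <- sample(original_sample,
                        size = length(original_sample),
                        replace = TRUE)
  mean(boot_sample)  # statistic recorded for this bootstrap sample
})

sample_mean <- mean(original_sample)  # centre of the bootstrap distribution
boot_spread <- sd(boot_means)         # approximates the spread of the sampling distribution
```

Plotting `boot_means` as a histogram gives the bootstrap distribution; its middle 95% is one common way to report a plausible range for the population mean.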
KNN regression - correct answer - Used to predict a quantitative measurement of a
new observation based on existing observations
- Relatively simple and intuitive, doesn't require much info about the relationship in advance (can be
used on non-linear relationships), works with one predictor or several
- Takes a long time for large data sets, performs poorly with many different predictors, and does not
predict well beyond the range of the training data
- Splits data into training and testing data
- Training data can be split further to use cross-validation, which allows you to determine the best K
value to use (the one with the lowest RMSPE)
- Regression algorithm used to predict output on the testing data and evaluate the predictions (RMSPE)
- If predictions are closer to the true values, RMSPE will be smaller, and vice versa.
- recipe, nearest_neighbor with neighbors = tune() (set_engine("kknn"), set_mode("regression")), vfold_cv,
tibble of neighbors values to try, workflow with tune_grid(), collect_metrics to find the smallest rmse
- run workflow again with a new nearest_neighbor spec using the determined K value, then metrics(truth =
response variable, estimate = .pred)
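The code outline in the last two bullets could look roughly like the sketch below. It assumes the tidymodels and kknn packages are installed. The data set `toy`, the column names `x` and `y`, the 75/25 split, the 5 folds, the grid of K values, the standardization steps and `weight_func = "rectangular"` are all assumptions added for illustration, not details from the notes.

```r
library(tidymodels)  # the "kknn" engine also needs the kknn package installed

set.seed(2024)
# Made-up data: x is the predictor, y the response (hypothetical names)
toy <- tibble(x = runif(200, 0, 10),
              y = sin(x) + rnorm(200, sd = 0.3))

# Split into training and testing data
toy_split <- initial_split(toy, prop = 0.75, strata = y)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Recipe: standardize the predictor (fit on training data only)
toy_recipe <- recipe(y ~ x, data = toy_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

# Model spec with K left open for tuning
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular",
                                  neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

# Cross-validation folds and the K values to try
toy_vfold <- vfold_cv(toy_train, v = 5, strata = y)
k_grid <- tibble(neighbors = seq(1, 41, by = 4))

tuning_results <- workflow() %>%
  add_recipe(toy_recipe) %>%
  add_model(knn_tune_spec) %>%
  tune_grid(resamples = toy_vfold, grid = k_grid) %>%
  collect_metrics() %>%
  filter(.metric == "rmse")

# K with the smallest cross-validated rmse
best_k <- tuning_results %>%
  slice_min(mean, n = 1, with_ties = FALSE) %>%
  pull(neighbors)

# Refit with the chosen K, then evaluate on the testing data
knn_best_spec <- nearest_neighbor(weight_func = "rectangular",
                                  neighbors = best_k) %>%
  set_engine("kknn") %>%
  set_mode("regression")

knn_fit <- workflow() %>%
  add_recipe(toy_recipe) %>%
  add_model(knn_best_spec) %>%
  fit(data = toy_train)

test_metrics <- knn_fit %>%
  predict(toy_test) %>%
  bind_cols(toy_test) %>%
  metrics(truth = y, estimate = .pred)
```

The `rmse` row of `test_metrics` is the RMSPE, since it is computed on testing data the model did not see during training or tuning.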
Linear regression - correct answer - Used to predict a quantitative measurement of a
new observation based on existing observations
- Better for predicting outside the range of the existing training data, more interpretable and provides an
equation to describe a relationship
- Cannot be used with non-linear relationships, more complex, can be influenced by outliers and
multicollinearity
- Splits data into training and testing data
- Regression algorithm used to predict output on the testing data and evaluate the predictions (RMSPE)
- If predictions are closer to the true values, RMSPE will be smaller, and vice versa.
- recipe, linear_reg (set_engine("lm"), set_mode("regression")), workflow with fit() to output coefficients
of the relationship
- can extract coefficients with extract_fit_parsnip() and tidy()
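The linear regression outline above could be sketched as follows, assuming the tidymodels package is installed. The simulated data (true intercept 3, true slope 2), the column names `x` and `y`, and the 75/25 split are assumptions made for illustration.

```r
library(tidymodels)

set.seed(2024)
# Made-up data with a known linear relationship: y = 3 + 2x + noise
toy <- tibble(x = runif(200, 0, 10),
              y = 3 + 2 * x + rnorm(200, sd = 1))

# Split into training and testing data
toy_split <- initial_split(toy, prop = 0.75, strata = y)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_recipe <- recipe(y ~ x, data = toy_train)

# Fit on the training data
lm_fit <- workflow() %>%
  add_recipe(lm_recipe) %>%
  add_model(lm_spec) %>%
  fit(data = toy_train)

# Intercept and slope of the fitted line as a tidy table
coefs <- lm_fit %>%
  extract_fit_parsnip() %>%
  tidy()

# Evaluate on the testing data; the rmse row is the RMSPE
lm_test_metrics <- lm_fit %>%
  predict(toy_test) %>%
  bind_cols(toy_test) %>%
  metrics(truth = y, estimate = .pred)
```

Unlike KNN regression, there is no K to tune here, so no cross-validation step is needed before fitting.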
KNN classification - correct answer - Used to predict the category (class)
of a new observation based on existing observations