For the Machine Learning exam you are allowed to bring a cheat sheet of one A4 page. This one is specially made so that all subjects of the course are described in detail, such as Decision Trees, Perceptron, Gradient Descent, Feature Engineering, Logistic Regression and Neural Networks. You are allowed to bring ...
Evaluation (how well the algorithm is learning)
• MAE: average absolute difference between true value and predicted value.
• MSE: average squared difference between true value and predicted value. Both can be used for regression (numerical output), with a preference for MSE: it exaggerates the outliers (the magnitude of big errors) and MAE doesn't.
• Accuracy: proportion of correct predictions in the dataset, (TP+TN)/(P+N) = 1 - error rate.
• Error: proportion of mistakes, (FP+FN)/(P+N). Disadvantage: does not take into account whether a FN is worse than a FP. Predicting gender could use accuracy or error. However, for flagging spam, error is preferred. If accuracy is 99%, the error rate is better to display.
• Precision: measure of relevancy, correctly predicted positive observations to total predicted positive: P = TP/(TP+FP).
• Recall: measure of truly relevant returned results, correctly predicted positive observations to all observations in the actual class: R = TP/(TP+FN). Precision and recall measure the success of prediction when the classes are very imbalanced.
• Fscore: harmonic mean of P and R. This score takes both FP and FN into account.
• Fbeta: beta tells how much more we care about recall than about precision. F0.5 means that we care half as much about recall as about precision. beta < 1 lends more weight to precision, beta > 1 favors recall.
• Macro-av (multi-class): compute the Fscore per class and take their unweighted mean. Doesn't take label imbalance into account: rare classes have the same impact as frequent ones. Good or bad, depends on what you want.
• Micro-av (multi-class): count the total number of times each class is correctly and incorrectly predicted (case-by-case basis). For single-label classification, averaged over all classes (including the 0/default class), micro-averaged precision = recall = F1 = accuracy.
• Alternative macro-av F1: compute the macro-av of precision and the macro-av of recall, then calculate the macro-av F1 score from those. This is preferred because F1 is expected to lie between R and P.

Decision tree/forest
• The outcome is in one of the nodes; each node performs a test.
• Because the number of possible trees grows exponentially with the number of attributes/features, you can't check them all to see which works best. That's why the tree needs to be built incrementally, with some idea of what makes the best tree.
• The first question that is asked helps classify.
• Recursion is the step-by-step process by which a DT is built, by splitting or not splitting each node of the tree into two daughter nodes. It's recursive because each sub-population may be split an indefinite number of times, until the splitting process terminates when a particular stopping criterion is reached.
• Base case: no recursion, returns itself (if n==1: return 1) -> leaf node.
• Recursive case: applies itself (else: ...) -> branch node.
• Speed depends on the no. of questions needed to get to a leaf node, which depends on the depth of the tree (set as a parameter or not). The more questions, the more time. The more unbalanced, the more time. In a balanced binary tree each question halves the no. of remaining examples, so the depth of a balanced tree is how many halvings of N it takes to get to 1 (= how many doublings of 1 to get to N), i.e. log2(N). Log is just reverse exponentiation.
• By reducing impurity we can grow very large trees, but then there is a chance of overfitting (performs well on the train set but worse in terms of generalization error, i.e. accuracy of prediction on unseen data). Solved by pruning after a first (growing) phase, using validation data to test for overfitting, or by testing for overfitting in the building phase and stopping when performance on the validation data gets worse.
• Measure impurity to find the best split condition (quality of question) and stop when no improvement is possible -> how: minimize the impurity G of the split, weighted by the relative size of the left/right branch. Measures how well the two classes are separated (we'd like to separate all 0s & 1s).
G = weighted sum of the impurity of the nodes
F = relative size of the data in your node (e.g. 1/3 is in one node, 2/3 in the other node)
• Impurity criteria used to choose which question to ask first:
1. Misclassification (not commonly used): proportion of misclassified examples; only cares about the majority class and not about the distribution of the other classes (1 - majority-class proportion).
2. Entropy, C4.5 (little impact on overall performance): measure of uncertainty; uniform distributions have more uncertainty. Very uniform is almost maximum entropy: the data is not divided/split enough. More skewed is preferred, because it discriminates better. Negative sum over the classes i of p_i · log(p_i).
3. Gini (little impact on overall performance): how often a random element would be labeled incorrectly if labels were assigned at random from the given distribution.

Feature engineering
Is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. It gets the most out of your data. Thus: designing what x, the input, should be, by extracting features (extract something by measuring something), transforming features, and selecting features (for some models feature selection does not really add much to the performance, because the model does that itself/automatically).
• Learning algorithms are domain agnostic (generic) and don't care about the domain; feature engineering is domain specific and requires knowing which characteristics are plausible for learning a solution to your problem.
• Transformations:
Standardization (z-score normalization) rescales the features so that they have the properties of a standard normal distribution, centered on 0 with a st dev of 1. Important when comparing measurements that have different units. Feature scaling doesn't make a difference with tree-based algorithms.
Log-transform: strong transformation with a major effect on distribution shape (for reducing right skewness); can be valuable for making patterns in the data more interpretable/visible.
Polynomial features/feature interactions (useful for linear models but not for trees): adding polynomial features increases variance. For two features a, b we can suspect that there is a polynomial relation a^2 + ab + b^2; each term is a feature, and ab in the middle is the interaction.
Other: minimum, maximum, standard deviation, skewness, kurtosis.
• Ablation analysis: remove features and see how that affects the performance. 1. Remove one feature at a time 2. Measure the drop in accuracy 3. This quantifies the contribution of the feature, given all other features. -> Relative accuracy drop: the higher, the more relevant.
• Neural networks can extract features from 'raw' inputs while learning.
• Expressiveness: when engineering features, consider how the model treats the relationship between the variables. E.g., for a logistic regression model you need to consider relationships between 2 variables (e.g. correlation) and engineer the features accordingly, so they are a proper fit for the model. But correlation between 2 variables doesn't impact the DT/RT. So, when running different models on the same data, consider how the model works, and engineer accordingly.

Perceptron
Linear classifier: finds simple boundaries (separating +1 and -1) in space.
• Binary classifier: makes predictions based on a linear predictor function combining a set of weights with the feature vector.
• In short: 1. Cycles through the train data by processing training examples one at a time 2. Starts by initializing (w, b) (e.g. w = [0, …,0]; b = 0) 3. Iterative mistake-driven algorithm for w, b -> don't update if w correctly predicts the label of the current train example -> update w when it mispredicts the label of the current train example (+1 label true, -1 label false). 4. Repeat until the model agrees with the labels on the points/predictions.
• Discriminant: f(x) = w · x + b.
• Bias: decides which class the node should be pushed to. Does not depend on the input value. Bias is modified during the process: 1. When w · x = 0, the bias decides which class to predict 2. Makes the default decision 3. Biases the classifier towards the positive or negative class. At the start, all neurons have random weights and biases. After each iteration, weights and biases are gradually shifted so that the next result is closer to the desired output. W and b decide where the line is.
• Decision boundary depends on the threshold of 0: w1x1 + w2x2 + b = 0 is the formula of a line. We want the hyperplane to have the maximum margin between the datapoints. Large margins lead to good generalization on test data. Finding a good boundary: 1. Go through the examples one by one 2. Try classifying the current example with the current (w,b) 3. If correct, keep going 4. If not, adjust the parameters (w,b). Learning from trial/error/mistakes.
• How to adjust (w,b) for example = (x, +1): 1. With the current (w,b), the score f(x) = w · x + b is less than 0. 2. How to change b to make it higher? Change it in the direction of the outcome we want 3. How to change w to make f(x) higher? If a particular value of x is positive we want to increase the corresponding weight, and the other way around if it is negative. -> We make the prediction and ask: in which direction did we make the mistake, + or -?
• Online: sees an example, updates w and throws the example away. Every time we see an example, we try the weights and look at the output; if we don't like it, we adjust the weights and throw away the previous example. So, it won't get slow. The examples are simply forgotten.
• Evaluation: predict the current example, record if correct or not, update the model and go to the next example; this gives the error at each point in time. Early stopping: look at the validation data to know when to stop training. When the error rate is not fluctuating anymore, stop training. You stop the model in time to prevent overfitting. Now the weights are more generalizable and not perfect. With multiple iterations you can't use the per-example error rate, because you would evaluate on previously seen examples (use a separate development set). No. of iterations is a hyper-parameter: learned based on validation data or arbitrarily chosen by the scientist.
• Weight averaging: remember the weights from each iteration and average them; such weights generalize better in practice.
• Sparsity: the concept of a matrix (multidimensional array) where most values are zeros. Use a sparse representation that omits the zero values.

Logistic regression (classifier – probabilistic classification)
Different models come with different error functions: SSE for linear regression, loss for classification.
• Loss function: quantifies the mistake on a single example. Squared loss corresponds to SSE and is the loss function for linear regression: (z - y)^2, where z = w · x + b is the score of the model and y is the target. We compare the loss function to the score z, not to the predicted outcome.
• Loss function for classification, zero-one loss: 1 if we made a mistake, 0 otherwise. This is what we care about: whether we made a mistake or not. Zero-one loss and gradient descent -> no gradient can be found: the function is piecewise constant, so its gradient is 0.
• Graph: SSE and zero-one behave differently; for classification we don't have that kind of smoothness (output is 1 or 0). Zero-one is what we care about in classification, but there is a problem: there is no single value of z that minimizes it, so we can't minimize the function easily. We can't use GD to minimize it, because we can't find a gradient. Therefore, we use cross-entropy as the loss function for LR.
• SSE for classification: no. It gives a bad decision boundary, and the model cares too much about predicting exactly 1 for an example with high #good. So, we need a better loss function. Therefore, cross-entropy.
• Regression for classification: regression cares about the error on the datapoints; for classification that is not important (-> logistic). In regression we predict a number, in classification we predict labels.
• Logistic regression for probabilistic classification (when y is binary) regresses on the probabilities of the labels. We do not directly predict labels, but the probability of the labels. This model predicts logit(p) using a linear model. The logit function is the natural log of the odds that Y equals one of the categories, which are coded as 0 or 1. p is defined as the probability that Y = 1 (true/yes).
• Logistic: 1. Let p = probability that the label is positive (number between 0 and 1). 2. The logit function, logit(p) = ln(p/(1 - p)), maps p to (-∞, ∞) (minus inf to inf) 3. We can map the logit back to a probability using the inverse logit, logit^-1(z) = 1/(1 + e^-z) 4. Putting the pieces together: p = logit^-1(w · x + b).
• Logistic vs linear: 1. Both use the score of the linear model, z = w · x + b 2. Linear uses it directly: y_pred = z 3. Logistic uses it via the inverse logit.
• Measure of error/cross-entropy/log loss: quantifies mistakes for logistic regression (difference between 2 probabilities) -> minimize the log loss, i.e. find the model which gives maximum probability to the training targets (see alternative notation).
• SGD for linear: w_new = w_old - η · 2(y_pred - y)x -> for logistic: w_new = w_old + η · (y - p_pred)x
• Overfitting: penalize weight variance to control overfitting -> L2 regularization. If the weights are low, z is relatively low (otherwise the scores go towards infinity). Easier to see in logistic regression, but it also holds for linear regression.
• L2 regularization: the L2 penalty is the sum of the squared weights (aka parameters).
What we care about is alpha: if alpha is 0, there is no L2 regularization. The larger alpha is, the smaller the weights become, which discourages the model from overfitting; but alpha should not be too large, because then the model will underfit. How do you know the ideal value? Try different values on validation data, and cross-validate. Regularization improves generalization, i.e. the performance on new unseen data.
• Summary: different loss functions give rise to different linear models.
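The logistic SGD update and the effect of alpha can be illustrated with a minimal sketch in plain Python. The function names (`inverse_logit`, `sgd_logistic`), the toy data and the default values of `eta`, `alpha` and `epochs` are assumptions made for this illustration, not part of the course material. The penalty is taken here as (alpha/2) times the sum of squared weights, so each step shrinks every weight by `eta * alpha * w`; the bias is left unregularized.

```python
import math

def inverse_logit(z):
    """Map a score z in (-inf, inf) back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(examples, eta=0.1, alpha=0.0, epochs=100):
    """Train logistic regression with SGD.

    examples: list of (x, y), x a list of floats, y in {0, 1}.
    eta: learning rate; alpha: L2 strength (0 means no regularization).
    """
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear score
            p = inverse_logit(z)                          # predicted P(y = 1)
            # w_new = w_old + eta * (y - p_pred) * x, plus L2 shrinkage of w
            w = [wi + eta * ((y - p) * xi - alpha * wi) for wi, xi in zip(w, x)]
            b += eta * (y - p)
    return w, b

# Hypothetical toy data: one feature, positive class for larger x.
data = [([0.0], 0), ([1.0], 0), ([2.0], 1), ([3.0], 1)]
w_plain, b_plain = sgd_logistic(data, alpha=0.0)
w_reg, b_reg = sgd_logistic(data, alpha=0.1)
print("no regularization:", w_plain, b_plain)
print("alpha = 0.1:      ", w_reg, b_reg)
```

On separable data like this, the unregularized weight keeps growing with more epochs, while the regularized weight settles at a smaller value. In practice alpha would be picked on validation data.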
Squared loss for classification, what happens: 1. Convert z to a label. 2. If z = 2.0, it would be the positive class (1). 3. The squared loss in this case, with target y = 1, is (2 - 1)^2 = 1. 4. If z = 10, it would also be the positive class (1), but the loss would be (10 - 1)^2 = 81. Squared loss penalizes confident correct predictions.
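The arithmetic above can be checked with a short sketch that compares three losses on a positive example (y = 1). The function names are chosen for this illustration only, and the zero-one loss assumes a threshold of 0 on the score z.

```python
import math

def squared_loss(z, y):
    """Squared loss on the raw score z."""
    return (z - y) ** 2

def zero_one_loss(z, y):
    """1 if the thresholded score gives the wrong label, else 0."""
    predicted = 1 if z > 0 else 0
    return 0 if predicted == y else 1

def log_loss(z, y):
    """Cross-entropy of the inverse-logit probability against label y."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

for z in (1.0, 2.0, 10.0):
    print(z, squared_loss(z, 1), zero_one_loss(z, 1), log_loss(z, 1))
```

For z = 2 and z = 10 the zero-one loss is 0 in both cases (both are correct), the squared loss rises from 1 to 81, and the log loss shrinks towards 0. This is why cross-entropy is the better fit for classification.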
Notes sold on Stuvia by seller ambervdmeijs.