This document contains notes and summaries covering the content of the course Pattern Recognition within the Artificial Intelligence Master at Utrecht University. It covers the following topics:
- LMs for regression
- LMs for classification 1&2
- intro to DL and NN
- CNNs
- RNNs
– what’s statistical pattern recognition --> the field of pattern recognition / machine learning concerned with the automatic discovery of regularities in data through the use of computer algorithms, and with the use of these regularities to take actions such as classifying data into categories
– ML approach to do this --> use a set of training data of N labeled examples and fit a model to this data; the model can subsequently be used to predict the class of a new input vector (a new data point); the ability to categorise new data correctly is called generalization
– supervised: numeric target (regression), discrete unordered target
(classification), discrete ordered target (ordinal classification, ranking)
– unsupervised learning: clustering, density estimation
– always in data analysis --> DATA = STRUCTURE + NOISE --> we want
to capture the structure but not the noise
– coefficients (weights) of a linear or nonlinear function are learned or estimated from the data
– maximum likelihood estimation --> e.g. the probability of observing x events (successes) in y trials; take N independent observations from the same probability distribution and find the particular parameter value under which the observed data are most likely to have come from that distribution; the fitted distribution then tells us how likely future observations similar to the observed ones are (see the sketch below)
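– a minimal sketch (own toy example, not from the slides) of maximum likelihood estimation for a Bernoulli sample; all data and names are made up, and the grid search agrees with the analytical estimate (the sample mean):
```python
import numpy as np

# Hypothetical example: estimate the success probability p of a coin
# from N independent Bernoulli observations by maximum likelihood.
rng = np.random.default_rng(0)
data = rng.binomial(n=1, p=0.7, size=100)   # N = 100 observations, true p = 0.7

def neg_log_likelihood(p, x):
    # Bernoulli log-likelihood: sum_n [ x_n log p + (1 - x_n) log(1 - p) ]
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Evaluate the (negative) log-likelihood on a grid of candidate values of p
grid = np.linspace(0.01, 0.99, 99)
nll = [neg_log_likelihood(p, data) for p in grid]
p_hat_grid = grid[np.argmin(nll)]

# The analytical MLE for a Bernoulli sample is simply the sample mean
p_hat_exact = data.mean()
print(p_hat_grid, p_hat_exact)   # both close to the true value 0.7
```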
– score = quality of fit + complexity (penalty)
– decision theory --> when we want to make a decision in a situation involving uncertainty, there are two steps:
– 1) inference: learn p(x,t) from data (main subject of this course)
– 2) decision: given the estimate of p(x,t), determine optimal
decision (not focus on this in this course)
– suppose we know the joint probability distribution p(x,t):
– the task is --> given value for x, predict class labels t
– Lkj is the loss of predicting class j when the true class is k, K is the
number of classes
– then, to minimize the expected loss, predict the class j that minimizes the expected loss Σk Lkj · p(class k | x) (the function on page 37 of the lecture 1 slides); a small sketch of this decision step follows below
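– a small sketch of the decision step with a made-up loss matrix and made-up class posteriors (not from the slides); it also shows that the minimum-expected-loss class need not be the most probable class:
```python
import numpy as np

# Hypothetical 3-class example. L[k, j] is the loss of predicting class j
# when the true class is k (0 on the diagonal: correct predictions cost nothing).
L = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 1.0],
              [10.0, 1.0, 0.0]])

# Assumed posterior probabilities p(class k | x) for one input x
posterior = np.array([0.2, 0.3, 0.5])

# Expected loss of predicting class j: sum_k L[k, j] * p(class k | x)
expected_loss = posterior @ L          # one value per candidate class j
best_class = np.argmin(expected_loss)
print(expected_loss, best_class)       # class 1 minimizes the expected loss here
```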
– useful properties of expectation and variance:
– E[c] = c for constant c
– E[cx] = cE[x]
– E[x ± y] = E[x] ± E[y]
– var[c] = 0 for constant c
– var[cx] = c²var[x]
– var[x ± y] = var[x] + var[y] if x and y are independent
– to minimize the expected squared error we should predict the mean of t at the point x0
– in general y(x) = E[t | x] --> the population regression function
– simple approach to regression:
– predict the mean of the target values of all training observations with x = x0
– problem: because x is numerical, in practice there will typically be no (or hardly any) training observations with exactly x = x0
– possible solution --> look at the data points that are close to where we want to make the prediction, then take the mean of the target values of those training points
– this is the idea of KNN for regression, e.g. with input variable x and output variable t:
– 1) for each x define a neighbourhood Nk(x) containing the indices n of the k nearest points (xn, tn)
– with a small k (e.g. 2) the fitted line is very wiggly, but with a larger k (e.g. 20) the line is much smoother, because we average over more nearest neighbours
– k is a complexity hyperparameter for KNN
– for a regression problem, the best predictor of the output variable is its conditional mean given the input x, but this cannot be computed exactly from a finite data sample, so:
– the NN function approximates this mean by averaging over training data
– the NN function relaxes the conditioning on one specific input to conditioning on a neighbourhood around it (see the sketch below)
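– a minimal KNN-regression sketch on made-up 1D data (own example): the prediction at x0 is the mean target of the k nearest training points, and a larger k gives smoother predictions:
```python
import numpy as np

def knn_regress(x0, x_train, t_train, k):
    """Predict t at x0 as the mean target of the k nearest training points."""
    distances = np.abs(x_train - x0)           # 1D input, so distance = |x_n - x0|
    neighbourhood = np.argsort(distances)[:k]  # indices of the k nearest neighbours
    return t_train[neighbourhood].mean()

# Made-up training data: noisy sine curve
rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0, 2 * np.pi, 50))
t_train = np.sin(x_train) + rng.normal(0, 0.3, size=50)

# Small k -> wiggly predictions; large k -> smoother predictions
for k in (2, 20):
    preds = [knn_regress(x0, x_train, t_train, k) for x0 in np.linspace(0, 2 * np.pi, 5)]
    print(k, np.round(preds, 2))
```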
PR CLASS 2 - LINEAR MODELS FOR REGRESSION
– assumption in linear regression model is that the expected value of t is
a linear function of x
– have to estimate slope and intercept from data
– the observed values of t are equal to the linear function of x plus some random error
– given a set of train points (N), we have N pairs of input variables x and
target variable t
– find the values of w0 and w1 that minimize the sum of squared errors (a worked sketch follows below)
– the average (sample mean) of a variable is denoted with a bar above the variable symbol, e.g. x̄
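– a worked sketch (own toy data) of the closed-form least-squares solution for simple regression, using w1 = Σ(xn − x̄)(tn − t̄) / Σ(xn − x̄)² and w0 = t̄ − w1·x̄:
```python
import numpy as np

# Made-up data from the line t = 2 + 3x plus noise
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
t = 2.0 + 3.0 * x + rng.normal(0, 1.0, 100)

x_bar, t_bar = x.mean(), t.mean()   # sample means (the "bar" notation)

# Closed-form least-squares estimates of slope and intercept
w1 = np.sum((x - x_bar) * (t - t_bar)) / np.sum((x - x_bar) ** 2)
w0 = t_bar - w1 * x_bar
print(w0, w1)   # close to the true values 2 and 3
```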
– r² (coefficient of determination):
– metrics for regression prediction evaluation
– define three terms:
– 1) total sum of squares —> is total variation of t in the sample
data
– 2) explained sum of squares —> is the part explained by the
regression, the variation in t explained by regression function
– 3) error sum of squares —> the part NOT explained by the regression; the sum over all observations of the squared difference between the actual value of t and the prediction
– therefore we have SST = SSR (part explained by regression) + SSE
– Rsquared = proportion of variation in t explained by x:
– r2 = SSR/SST = 1 - SSE / SST
– the output is a number between 0 and 1
– if r² = 1, all data points lie exactly on the regression line
– if r² = 0, the regression explains none of the variation in t (predicting the mean of t does just as well); a small computational sketch follows below
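– a small computational sketch (made-up data and a least-squares fit) of SST, SSR, SSE and r²:
```python
import numpy as np

# Assumed toy data, roughly on a line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
t = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

w1, w0 = np.polyfit(x, t, 1)            # least-squares slope and intercept
t_hat = w0 + w1 * x                     # fitted values

sst = np.sum((t - t.mean()) ** 2)       # total sum of squares
ssr = np.sum((t_hat - t.mean()) ** 2)   # explained (regression) sum of squares
sse = np.sum((t - t_hat) ** 2)          # error (residual) sum of squares

r2 = 1 - sse / sst
print(r2, ssr / sst)                    # the two expressions agree, and are close to 1 here
```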
– the above is the calculus-based solution
– but let's look at the geometrical approach:
– suppose the population regression line goes through the origin, meaning that the intercept w0 = 0 (the line passes through the origin on the y axis)
– then follow the same approach; the result is the same, only the regression error is different
– the error vector should be perpendicular to the vector x, meaning that the dot product of x and the error vector should be 0
– in the least squares solution:
– ŷ is a linear combination of the columns of X
– ŷ = Xw
– we want the error vector t − ŷ to be as small as possible
– the observed target values are generally not a linear combination of the predictor variables --> if that were the case, we would get a perfect regression fit (all points exactly on the line), which essentially never happens with real data (see the matrix-form sketch below)
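– a sketch (own toy data) of the least-squares solution in matrix form: making the error vector orthogonal to the columns of X gives the normal equations Xᵀ(t − Xw) = 0, hence w = (XᵀX)⁻¹Xᵀt:
```python
import numpy as np

# Toy design matrix with a column of ones (intercept) and two predictors
rng = np.random.default_rng(3)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])
true_w = np.array([1.0, 2.0, -0.5])
t = X @ true_w + rng.normal(0, 0.1, N)   # targets = linear combination + noise

# Normal equations: w = (X^T X)^(-1) X^T t  (use solve rather than an explicit inverse)
w = np.linalg.solve(X.T @ X, X.T @ t)

# The error vector is orthogonal to every column of X
residual = t - X @ w
print(w)               # close to true_w
print(X.T @ residual)  # approximately zero
```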
– general linear model:
– the term linear means that the model has to be a linear function of the coefficients/weights, but not necessarily of the input variables
– it is linear in the features, but the features may be nonlinear functions of the input variables that generate them
– an interaction between two features modifies the effect of these two variables on the target variable (e.g. age and income on pizza expenditure); see the sketch below
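– a small sketch (made-up data and variable names) of a general linear model that is linear in the weights but uses nonlinear features, including an age × income interaction:
```python
import numpy as np

rng = np.random.default_rng(4)
N = 200
age = rng.uniform(20, 70, N)
income = rng.uniform(10, 100, N)

# Feature matrix: intercept, age, income, age^2, and the age*income interaction.
# The model stays linear in the weights w even though the features are nonlinear in the inputs.
Phi = np.column_stack([np.ones(N), age, income, age ** 2, age * income])

# Made-up target: expenditure depends on the interaction of age and income
t = 5 + 0.1 * age + 0.3 * income + 0.02 * age * income + rng.normal(0, 1, N)

# Ordinary least squares on the expanded feature matrix
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(np.round(w, 3))
```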
– regularized least squares:
– add a regularization term to control overfitting, especially when the number of predictor variables is very large
– here we use ridge regression (called weight decay in neural networks)
– it penalizes the size of the weights (see the sketch below)
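– a sketch of ridge regression on assumed data, using the closed-form solution w = (λI + XᵀX)⁻¹Xᵀt; the sketch has no intercept term for simplicity, and larger λ shrinks the weights:
```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 50, 20                                # few observations, many predictors
X = rng.normal(size=(N, D))
true_w = np.zeros(D)
true_w[:3] = [3.0, -2.0, 1.0]                # only a few predictors actually matter
t = X @ true_w + rng.normal(0, 0.5, N)

def ridge(X, t, lam):
    """Closed-form ridge solution w = (lambda*I + X^T X)^(-1) X^T t."""
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ t)

for lam in (0.0, 1.0, 100.0):
    w = ridge(X, t, lam)
    print(lam, np.round(np.linalg.norm(w), 3))   # weight norm shrinks as lambda grows
```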