Summary Cheat Sheet for 'Introduction to Deep Learning' CSAI

Cheat sheet for the course 'Introduction to Deep Learning' at Tilburg University, including some practice exercises. It is a 3rd-year course for CSAI; I got a 7.5 using this!

1. Perceptron update rule
y (expected output) - o (output) = error.
Step activation function (threshold on the weighted sum).
Update: w ← w + α · error · x = w + α(y - o)x, with α the learning rate.
Stacking perceptrons: multiple perceptrons with the same input.

2. Activation function in NN
z = weighted sum of the outputs o of the previous layer.
a^L = output of each hidden layer (after its activation function).
MLP: multi-layer perceptron (stacked layers of perceptrons).

3. Back-Propagation
Updating parameters (J: loss function, α: learning rate):
Weights: w ← w - α · ∂J/∂w
Bias: b ← b - α · ∂J/∂b
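A minimal numpy sketch of the perceptron update rule from section 1; the AND-gate data, the 0/1 targets, and the learning rate are illustrative assumptions, not taken from the cheat sheet:

import numpy as np

# Perceptron sketch: step activation, error = y - o, update w by alpha * error * x.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # placeholder AND-gate inputs
y = np.array([0, 0, 0, 1], dtype=float)                        # placeholder targets

alpha = 0.1        # learning rate (illustrative)
w = np.zeros(2)
b = 0.0

def step(z):
    return (z >= 0).astype(float)    # step activation function

for epoch in range(10):
    for xi, yi in zip(X, y):
        o = step(xi @ w + b)         # output of the perceptron
        error = yi - o               # y (expected output) - o (output) = error
        w += alpha * error * xi      # perceptron update rule
        b += alpha * error

print(w, b, step(X @ w + b))         # learned weights, bias, and predictions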
4. Activation Functions
Linear: o(z) = z. Regression (output layer). Range (-inf, inf).
Sigmoid: o(z) = 1 / (1 + e^(-z)). Binary classification (output). Range (0, 1).
Tanh: o(z) = (e^z - e^(-z)) / (e^z + e^(-z)). Binary classification (output). Range (-1, 1).
Softmax: o(z_i) = e^(z_i) / Σ_j e^(z_j). Multiclass classification (output). Range (0, 1).
ReLU: o(z) = max(0, z). Hidden layers. Range [0, inf).
Dying ReLU: the gradient on the negative side is zero, causing some neurons to remain inactive and never update → dead neurons that do not contribute to learning.
Leaky ReLU: o(z) = max(0.1z, z). Range (-inf, inf).

5. Loss
Mean Squared Error (MSE/L2): regression. L = (1/n) Σ (y - p)^2
Mean Absolute Error (MAE): regression. L = (1/n) Σ |y - p|
Binary Cross-Entropy: binary classification. L = -[y log(p) + (1 - y) log(1 - p)]. Its derivative becomes linear. p: output of the network, y: target output.
Hinge loss: binary classification. L = max(0, 1 - y·z); here, margin = 1. y: target output (-1 or 1), z: output of the network without any activation function.
Categorical Cross-Entropy: multiclass classification. L = -Σ y_i log(p_i); compares the network output to the one-hot-encoded (OHE) target.
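The activations and losses in sections 4 and 5 can be written out directly; a small numpy sketch of those standard formulas (function names are my own):

import numpy as np

# Activation functions from section 4.
def linear(z):      return z
def sigmoid(z):     return 1.0 / (1.0 + np.exp(-z))
def tanh(z):        return np.tanh(z)
def relu(z):        return np.maximum(0.0, z)
def leaky_relu(z):  return np.maximum(0.1 * z, z)
def softmax(z):
    e = np.exp(z - np.max(z))          # subtract max for numerical stability
    return e / e.sum()

# Loss functions from section 5 (p: network output, y: target).
def mse(y, p):   return np.mean((y - p) ** 2)
def mae(y, p):   return np.mean(np.abs(y - p))
def binary_cross_entropy(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
def hinge(y, z): return np.mean(np.maximum(0.0, 1.0 - y * z))   # y in {-1, +1}, margin = 1
def categorical_cross_entropy(y_onehot, p):
    return -np.sum(y_onehot * np.log(p))

print(softmax(np.array([2.0, 1.0, 0.1])))   # probabilities summing to 1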
6. Weights Initialization
Variance given by:
Xavier: tanh, sigmoid → 1/n
He: ReLU → 2/n
n: number of incoming neurons.

7. Gradient Descent
Update rule: θ ← θ - α · ∇θ J(θ)
θ: parameter to update (e.g. weights), α: learning rate, ∇θ J: gradient of J, J: loss function.
1. Batch/Vanilla GD: updates θ by calculating the gradients using the whole dataset.
2. Mini-Batch GD: updates θ by calculating the gradient using a randomly selected batch of examples.
3. Stochastic GD (SGD): updates θ by calculating the gradient using a single example at a time.
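A sketch combining sections 6 and 7: Xavier/He-style initialization and a mini-batch gradient-descent loop on a toy linear-regression problem; the data, batch size, and learning rate are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

# Weights initialization (section 6): variance 1/n (Xavier) or 2/n (He), n = incoming neurons.
def xavier_init(n_in, n_out):
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def he_init(n_in, n_out):
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

# Toy linear-regression data (illustrative).
X = rng.normal(size=(64, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=64)

# Gradient of the MSE loss for the linear model X @ theta.
def grad(theta, Xb, yb):
    return 2.0 / len(yb) * Xb.T @ (Xb @ theta - yb)

theta = xavier_init(3, 1).ravel()
alpha = 0.1

for epoch in range(50):
    # Mini-batch GD (section 7): gradient from a randomly selected batch of examples.
    idx = rng.choice(len(y), size=16, replace=False)
    theta -= alpha * grad(theta, X[idx], y[idx])
    # Batch GD would use the whole dataset; SGD a single example at a time.

print(theta)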
8. Optimization
Momentum-based optimizers: smooth the gradients with an exponentially weighted average (EWA).
Update rule: v_t = β · v_(t-1) + (1 - β) · x_t; the parameters are then updated with the filtered gradient v.
v: filtered version of the gradients x, β: filtering parameter, x: gradients at time t.
Larger β → longer history of data, smoother curve; smaller β → more weight on recent data, effectively uses less data.
α: learning rate, β: usually 0.9. With SGD: to keep the gradient step equivalent to the one in plain SGD, the learning rate is scaled by 1/(1-β).
Nesterov Accelerated Momentum: the gradient term is computed from θ + uv. The gradient always points in the right direction, the momentum term may not; if not, the gradient can still 'go back'.
Adaptive learning-rate optimizers: adapt the learning rate to individual parameters.
Adagrad: uses the sum of squared gradients. Problem: the learning rates keep decreasing because the square root of the accumulated squared gradients sits in the denominator, leading to slow or impossible learning.
RMSprop: overcomes the Adagrad problem by using a moving average; only recent squared gradients matter, gradients from long ago are forgotten. Difference with Adagrad: g_t is measured by an exponentially decaying average, not the sum of gradients. Default: β (EWA factor) = 0.9, α = 0.001, v = weighted average. Problem: lacks momentum, initial values can be biased.
Adam: update rule combines momentum and RMSprop (with bias correction). Default: β1 = 0.9, β2 = 0.999, ε = 1e-8.
Adadelta: extends Adagrad.
Adamax: extends Adam.
Nadam: combines Adam and Nesterov.

9. Hyperparameters
Learning rate
Network size
Regularization parameters
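A rough numpy sketch of the optimizer updates described in section 8 (EWA/momentum, Adagrad, RMSprop) on a placeholder quadratic loss; the constants follow the defaults quoted above, everything else is an assumption:

import numpy as np

def grad(theta):
    # Placeholder gradient of a simple quadratic loss J(theta) = theta^2.
    return 2.0 * theta

theta_m, v = 1.0, 0.0            # momentum / EWA state
theta_a, g_sum = 1.0, 0.0        # Adagrad state: sum of squared gradients
theta_r, g_avg = 1.0, 0.0        # RMSprop state: EWA of squared gradients

alpha, beta, eps = 0.001, 0.9, 1e-8

for t in range(100):
    # Momentum: v is an exponentially weighted average (EWA) of the gradients.
    g = grad(theta_m)
    v = beta * v + (1.0 - beta) * g
    theta_m -= alpha * v

    # Adagrad: accumulated sum of squared gradients in the denominator
    # (the effective learning rate keeps shrinking).
    g = grad(theta_a)
    g_sum += g ** 2
    theta_a -= alpha * g / (np.sqrt(g_sum) + eps)

    # RMSprop: replace the sum with a moving (exponentially decaying) average.
    g = grad(theta_r)
    g_avg = beta * g_avg + (1.0 - beta) * g ** 2
    theta_r -= alpha * g / (np.sqrt(g_avg) + eps)

print(theta_m, theta_a, theta_r)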
10. Regularization
Adding penalties against complexity: constraints on the weights, additional terms in the objective (loss) function.
J: objective function (loss), θ: parameters = weights, Ω: penalty.
L1: absolute value of the magnitude, Ω(w) = Σ |w|; can lead to sparsity in the weights → useful for feature selection.
L2: weight decay, squared magnitude, Ω(w) = Σ w^2 → weights get close to but never reach 0.
Dropout: randomly remove neurons with a certain probability to prevent co-adaptation.
Early stopping: keep track of the validation error and stop training when it increases (here the bias-variance trade-off is reflected).
Data augmentation: increase the diversity and size of the training set (translation, crop, scaling, rotation, flipping, adding noise).

11. Batch Normalization
Stabilizes learning by normalizing layer inputs, reducing the covariate shift problem (the distribution of a layer's input changes during training → slows training down). Solution: normalize per mini-batch.
1. Calculate the mean and variance of the inputs per mini-batch.
2. Normalize the inputs by subtracting the mean and dividing by the square root of the variance.
3. Scale and shift the normalized inputs using learnable parameters (gamma and beta).
4. Apply the activation function to the normalized and transformed inputs.
μ: mean, σ^2: variance, γ & β: learnable parameters.
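A sketch of the batch-normalization forward pass from section 11, assuming a made-up mini-batch and gamma = 1, beta = 0:

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch per feature, then scale and shift with learnable gamma/beta."""
    mu = x.mean(axis=0)                    # 1. mean per mini-batch
    var = x.var(axis=0)                    #    and variance per mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # 2. subtract mean, divide by sqrt(variance)
    return gamma * x_hat + beta            # 3. scale and shift (learnable parameters)

# Illustrative mini-batch of 8 examples with 4 features.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(8, 4))
gamma = np.ones(4)
beta = np.zeros(4)

out = batch_norm_forward(x, gamma, beta)
# 4. An activation function (e.g. ReLU) would then be applied to `out`.
print(out.mean(axis=0).round(6), out.std(axis=0).round(6))   # per-feature mean ~0, std ~1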
12. CNN: calculate output size (activation map dimensions)
Output width = (W - F + 2P)/S + 1 (analogously for the height with H).
W: width, H: height, F: width/height of the filter, P: padding, S: stride.
CNN properties:
Sparse connectivity: the convolution kernel is much smaller than the input → improves computational efficiency and allows for localized feature detection.
Parameter sharing: kernel coefficients are identical for each input location → reduces the number of parameters and enables translation invariance.
Equivariant representation: the convolution value covaries with the input value → provides robustness to transformations and improves data efficiency.
CNN elements:
Downsampling: 1. Stride: defines the amount of movement over the input. 2. Pooling: replaces the output of the NN at a certain location with a summary statistic of the nearby outputs. 2.1 Max pooling → outputs the maximum value from the input window. 2.2 Average pooling → outputs the average value from the input window. 2.3 L2 norm → reduces spatial dimensions while retaining important features.
Padding: the amount of pixels added around the input (e.g. an image) → maintains spatial dimensions, prevents the output size from shrinking.
Dilation: expands the receptive field by inserting spaces between filter elements → captures larger context without increasing the filter size.
CNN Building Blocks:
1. Convolution layer → a kernel/filter is passed over the image.
2. Activation layer → introduces non-linearity (to allow backpropagation).
3. Downsampling → reducing the input size.
4. Fully connected layer → traditional MLP structure.
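A small sketch of the activation-map size formula from section 12, (W - F + 2P)/S + 1; the example layer settings are illustrative:

def conv_output_size(w, f, p, s):
    """Output width/height of a conv layer: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

# Illustrative example: 32x32 input, 5x5 filter, padding 2, stride 1 -> 32x32 output.
print(conv_output_size(32, 5, 2, 1))
# Downsampling with stride 2 and no padding: 32x32 input, 3x3 filter -> 15x15 output.
print(conv_output_size(32, 3, 0, 2))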
13. Recurrent Networks (RNN)
Process sequential data by maintaining hidden states across time steps to remember previous information (loops in the network).
Training: uses Backpropagation Through Time (BPTT), which extends backpropagation to sequences.
Limitations: vanishing/exploding gradients, difficulty in capturing long-term dependencies, slow training.
Solution: gradient clipping (force gradients to a specific min/max) or LSTM/GRU.

14. Bidirectional RNNs
Process data in both forward and backward directions for more context. 2 RNNs: one processes the sequence from start to end (forward), and the other from end to start (backward).
Limitation: computational cost, longer training time, higher memory consumption.

15. RNN Architectures
Many-to-Many: map a sequence input x to a corresponding sequence output o → video classification where each frame is an input and the output is a label for each frame.
Many-to-One: map a sequence input to a single output → sentiment analysis where a sequence of words (a sentence) is classified into a sentiment label.
One-to-Many: a single input to a sequence output → image captioning where a single image is described with a sequence of words.
Seq2Seq: encoder-decoder to transform an input sequence into an output sequence → machine translation where a sentence in one language is translated into another language.

15.1 Seq2Seq: encoder-decoder
Transforms an input sequence into a vector, and that vector into an output sequence.
Encoder: 1. Generates a vector per time step (ht). 2. The last vector can be assumed to summarize the sequence.
Decoder: 1. The first hidden state of the decoder is set to the last hidden state of the encoder (the first input is normally a special character). 2. At each time step, generates a hidden vector (ht) and creates an output.

16. Attention
Allows the model to focus on relevant parts of the input sequence when predicting each part of the output sequence → the encoder passes all hidden states to the decoder (as a matrix).
1. Create a representation using the encoder states. 2. Attend to the states in the encoder that are most similar to the state in the decoder. 3. Create a final representation (context vector) using all relevant information.
Limitations in RNNs: 1. The relevance of information might depend on future inputs. 2. Recurrence prevents parallelization of computations.

17. Gated units
Improve RNNs by managing long-term dependencies & mitigating vanishing gradient problems.
Closed gate: multiply the data by 0 → erase content.
Open gate: multiply the data by 1 → preserve content.
Examples: LSTM, GRU.
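A minimal sketch of the recurrent step behind section 13: a hidden state carried across time steps; the sizes, tanh cell, and random weights are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

# Vanilla RNN cell: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b).
input_size, hidden_size = 4, 8
W_xh = rng.normal(0, 0.1, size=(hidden_size, input_size))
W_hh = rng.normal(0, 0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# Process a sequence of 5 time steps, maintaining the hidden state (the "loop" in the network).
xs = rng.normal(size=(5, input_size))
h = np.zeros(hidden_size)
for x_t in xs:
    h = rnn_step(x_t, h)

print(h.shape)   # final hidden state, carrying information about the whole sequence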
18. LSTM
Cell state: long-term memory, not modified directly. Prevents gradient issues.
Hidden state: short-term memory, modified by weights.
1. Forget gate: determines the relevant parts of the cell state (% to remember). Input: the previous hidden state and the new input data.
2. Input gate: decides what new information should be stored in the cell. Input: current input and previous hidden state. Output: a value between 0 & 1 (= how much of the input to let into the cell state), which is merged with the 'old' memory from the forget gate, creating a new cell state (this is the cell-state update).
3. Output gate: controls whether to output information from the current cell state. Input: current input and previous hidden state. Output: a value between 0 & 1 (= how much of the current cell state to output).
LSTM: preferred for tasks with very long sequences and complex dependencies.

19. GRU
Reset gate (r): how much of the previous hidden state to forget in the current state. Input: current input & previous hidden state. Output: between 0 and 1 (= how much of the previous hidden state to forget).
Update gate (z): how much information to pass through from previous time steps. Input: current input and previous hidden state. Output: between 0 and 1 (= how much of the previous hidden state to keep & how much of the new input to let through) → combination of the forget gate and input gate in the LSTM.
GRU: preferred for tasks requiring faster training and efficiency.

20. NLP
Embeddings: words as vectors, used as input to the RNN.
One-hot embeddings: sparse, high-dimensional, hard-coded.
Word embeddings: dense vectors capturing semantic meaning, lower-dimensional, learned from data.
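A sketch of one GRU step following the gate descriptions in section 19 (reset gate r, update gate z); the weight shapes, random initialization, and the particular blending convention are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One weight matrix per gate, acting on [h_prev, x_t] (illustrative initialization).
W_r = rng.normal(0, 0.1, size=(hidden_size, hidden_size + input_size))  # reset gate
W_z = rng.normal(0, 0.1, size=(hidden_size, hidden_size + input_size))  # update gate
W_h = rng.normal(0, 0.1, size=(hidden_size, hidden_size + input_size))  # candidate state

def gru_step(x_t, h_prev):
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(W_r @ hx)                   # 0-1: how much of h_prev to forget
    z = sigmoid(W_z @ hx)                   # 0-1: how much of the new candidate to let through
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))   # candidate hidden state
    return (1.0 - z) * h_prev + z * h_tilde # blend old memory (1-z) with new candidate (z)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h = gru_step(x_t, h)
print(h.shape)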
21. Self-Attention
Self-attention focuses on a single sequence: it allows the model to let a sequence learn information about itself → transformers.
Calculating self-attention: query vector (e.g. the input) * key vector, followed by a softmax function to obtain the attention weights.

22. Transformers
Model that uses attention to boost the training speed → handles sequential data without recurrence, allowing for parallel processing.
Encoder: Self-Attention Layer + Feed-Forward Layer.
Decoder: Self-Attention Layer + Encoder-Decoder Attention Layer + Feed-Forward Layer.
Input Embedding: words mapped to continuous vectors to represent meaning.
Positional Encoding: adds positional information to the embeddings.
Encoder Layer: maps the input sequence (embeddings with positional encodings) into an abstract representation (layers repeated N times).
Multi-Headed Attention: associates words in the input with each other.
Queries (Q): target tokens for which attention weights are computed.
Keys (K): tokens used to compute attention scores for the queries.
Values (V): content associated with each token in the input sequence.
Computation: multiply queries and keys, divide by the square root of the dimension, apply softmax, then multiply by the values: Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V
Residual Connection: adds the output of attention to the input.
Layer Normalization: stabilizes and normalizes the output.
Pointwise Feed-Forward Network: adds non-linearity and processing.
Decoder Layer: generates text sequences using previous outputs and the encoder inputs.
Embedding Layer: converts words to vectors.
Positional Encoding Layer: adds positional information.
Multi-Headed Attention 1: considers only past tokens (masking).
Multi-Headed Attention 2: matches the encoder input with the decoder input.
Residual Connection: adds the output of attention to the input.
Layer Normalization: stabilizes and normalizes the output.
Feed-Forward Network: adds non-linearity and processing.
Output Layer:
Linear Classifier: projects the decoder output to the vocabulary size.
Softmax Layer: produces output probabilities.
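A sketch of the attention computation from sections 21-22: multiply queries and keys, divide by the square root of the dimension, apply softmax, then multiply by the values; the random Q, K, V below stand in for projected token embeddings:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax -> attention weights
    return weights @ V, weights

# Placeholder projections for a 4-token sequence with dimension 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # queries: target tokens attention is computed for
K = rng.normal(size=(4, 8))   # keys: tokens used to score against the queries
V = rng.normal(size=(4, 8))   # values: content associated with each token
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (4, 8) and (4, 4); each row of attn sums to 1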
23. Vision Transformers (ViT)
Apply the transformer architecture to image data, leveraging self-attention mechanisms to capture global image features.
1. Image Division: divide the image into fixed-size patches (e.g., 16x16).
2. Patch Embedding: flatten and project each patch to a vector, add positional encoding.
3. Transformer Encoder: Multi-Headed Self-Attention captures patch relationships; Residual Connections and Layer Normalization stabilize the output; a Feed-Forward Network adds non-linearity.
4. Classification: use a special CLS token for the final classification with an MLP head.
Pros: captures global context, flexible, state-of-the-art performance with sufficient data.
Applications: image classification, object detection, segmentation.

Linear algebra → Exercises
To do (the input matrix and kernel are given in the original figure):
1. Multiply the kernel element-wise with each input window (and sum), starting at the upper-left corner and moving with stride steps (you get a 3x3 matrix).
2. Max pooling layer (2,2) → take the maximum value of each 2x2 window of that matrix: Max(3, 0, 0, 2) = 3, Max(0, 1, 2, 0) = 2, Max(0, 2, 1, 0) = 2, Max(2, 0, 0, 3) = 3.
So: [[3, 2], [2, 3]]
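A numpy sketch of the convolution plus 2x2 max-pooling procedure in the exercise above; since the original input matrix and kernel are only shown in the figure, the 4x4 input and 2x2 kernel here are placeholders, so the numbers differ from the worked answer:

import numpy as np

def conv2d(x, kernel, stride=1):
    """Valid cross-correlation: multiply the kernel with each input window and sum."""
    kh, kw = kernel.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)
    return out

def max_pool2d(x, size=2, stride=1):
    """Take the maximum of each (size x size) window."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * stride:i * stride + size, j * stride:j * stride + size].max()
    return out

x = np.array([[1., 0., 2., 1.],      # placeholder input (the real one is in the figure)
              [0., 1., 0., 0.],
              [2., 0., 1., 2.],
              [1., 1., 0., 1.]])
k = np.array([[1., 0.],              # placeholder 2x2 kernel
              [0., 1.]])

feature_map = conv2d(x, k, stride=1)      # step 1: 3x3 activation map
pooled = max_pool2d(feature_map, size=2)  # step 2: 2x2 pooled result (values differ because the input is a placeholder)
print(feature_map)
print(pooled)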
