This document consists of course notes from the theory lessons, supplemented with explanatory figures and additional information. It therefore contains all the theory that should be studied for the exam, except the practicals.
Chapter 1: Introduction
1.1: Introduction
o Before we start
§ A few practical things
® Background
o A bit of context
§ Big data
® Definition of big data
® Big data is characterized by:
® Large-scale data and AI brought a new data-intensive research paradigm
§ What is data? Some definitions of what we are dealing with and how we can represent it
® Data can be given by objects and attributes
a) Data object
b) Attribute
® Dataset types
a) Record:
b) Graph:
c) Ordered:
§ Data mining
® What is data mining?
® Examples: Is it data mining?
® Data mining challenges
® Major tasks of data mining (after preprocessing)
1) Supervised data mining
2) Unsupervised data mining
® Data mining is business
® Value of data
® Evolution
Chapter 2: Processing principles
2.1: Processing principles
o Introduction
§ What you usually have vs. what you want and need
® In reality you usually have ‘dirty data’
® Data that you actually want/need is:
o Pre-processing and transformation → to get more minable data that can be further used
§ Role of pre-processing and transformation
® Unstructured data
® Common data processing steps that each make data more ready for data mining
a) Feature extraction:
b) Attribute transformation = feature transformation
c) Discretization
d) Aggregation
e) Noise removal
f) Identifying outliers → outlier removal
g) Sampling
h) Handling duplicated data
i) Handling missing values
j) Dimensionality reduction
® Processing steps for specific data types: what types of features are we dealing with?
a) Image data:
b) Survey data
c) Sequence data
d) Text data
e) Omics data
f) Temporal
Chapter 3: Unsupervised clustering
3.1: Unsupervised clustering
o Introduction
§ Unsupervised vs. supervised
® Quick overview of the differences between supervised and unsupervised
§ Clustering
® What is clustering?
® Exists in different domains under different names, but does something quite similar
® Natural grouping
§ Similarity
® What is similarity?
® Defining distance measures
® How do we measure similarity?
§ Dendrograms
® What is it?
® Example
® Use of dendrograms
§ Algorithms
o 2 types of clustering
§ Hierarchical clustering
® Principle:
® Heuristic search (= a more practical, feasible way to come up with the best dendrogram, without forgetting that there are multiple options out there)
→ Since we cannot test all possible trees, we will have to do a heuristic search over all possible trees. We could do this bottom-up or top-down.
→ Use a heuristic search → we cannot guarantee that we get the optimal solution, but it is much faster than testing every option
® How to measure the distance between 2 clusters based on the distance function?
§ Partitional clustering
® What is it?
® How many clusters? → how to specify k?
® K-means steps (simple & efficient algorithm)
® Importance of choosing initial centroids
® Weakness of k-means
Chapter 4: Principal component analysis (PCA)
4.1: Principal component analysis (PCA)
o PCA as the backbone of modern data analysis
§ What is principal component analysis and why is it necessary?
® PCA is the first thing you do when you get a new dataset
® Reasons to do PCA:
® Multivariate data
§ Important concepts
® Basic variable statistics
a) Mean
b) Median
c) Range
d) Variance
e) Standard deviation
® Data transformation
2) Comparing variables
o How does PCA work?
§ Data projection
® Too many variables
® What’s data projection?
® Why use projections?
® Data visualization and simplification → data projection should capture as much of the information as possible
® Geometric interpretation of PCA
® PCA output: IMPORTANT for the exam to interpret output!
® PCA usage: scores and loadings
® PCA examples
§ t-SNE
® = alternative method for data projection
® How?
® Comparison PCA and t-SNE
® Perplexity
® Example: t-SNE for single cell RNAseq
Chapter 5: Supervised learning
5.1: Supervised learning
o Introduction
§ Classification problem = a problem we have a lot of experience with
® Use the features of an object to assign a (hopefully correct) label to it
® Pigeon problems: training pigeons to classify paintings
® Grasshopper problem: given a collection of annotated data (in this case 5 katydids and 5 grasshoppers, 2 similar but not identical animals), decide what type of insect the unlabeled example is
o Regression vs. classification
§ General
® Differences
§ Classification
a) Simple linear classifier
® General: what is a simple linear classifier?
® Support vector machines (SVM)
® Decision value
® Predictive accuracy
® Confusion matrix = matrix that tabulates all samples by classified label vs. true label
® Thresholds and accuracy
® ROC and PR curves
b) Nearest neighbor classifier
® What is this type of classifier?
Chapter 6: Regression
6.1: Regression
o Regression = a supervised machine learning (ML) model that can be used to analyze multivariate data (in data science you often need to deal with regression problems, BUT this is different from ‘normal’ statistics)
§ The regression problem
® Given a collection of annotated data (in this case a number of insects with their ages), you need to try to predict a variable about the data
§ Regression vs. classification
® Classification
® Regression
§ Types of regression
® Simple linear regression
® Multiple linear regression
® Non-linear regression
® Logistic regression
® Cox regression
® Regularized regression
§ Considerations that need to be made with regression
® Overfitting
- Intuitively we would say 9
a) K-fold cross-validation
b) Leave-one-out cross-validation (CV) = special case of K-fold cross-validation where K = number of samples
® Speed and scalability
® Interpretability → model interpretability is really important and leads to model transparency
® Robustness
Chapter 7: Machine learning methods
7.1: Machine learning methods
o Supervised machine learning methods
§ Recap
® Supervised vs. unsupervised
§ Classification
® Classification
® Classification algorithms
a) Support vector machines
b) Decision trees
c) Random forest
d) Neural networks (NN) and deep learning
e) K-nearest neighbors
Chapter 1: Introduction
1.1: Introduction
• Introduction
o Before we start
§ A few practical things
® Background
¨ Background on bioinformatics, statistics, omics data analysis (NGS,
microarrays, …), data mining and machine learning
o A bit of context
§ Big data
® What is big data?
¨ In the last 5 decades there has been an evolution in how the human system
is viewed: from seeing the human body from multi-disciplinary perspectives to
seeing the human system as a complex interplay between genes, proteins, small
molecules, … that interact with each other in a very complex way and