–Data mining: finds implicit, previously unknown, and potentially useful information in data
–A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation; KDD: the overall process of non-trivial extraction of implicit,
previously unknown, and potentially useful knowledge from large amounts of data
–Data Mining Requirements: lots of data, and lots of computing power; Data Mining: a core step of KDD
–Application of specific algorithms for extracting patterns from data
–Why Not Use Classical Data Analysis? Classical data analysis alone is no longer enough.
•Why we use Data Mining: tremendous amount of data (big data); high dimensionality of data; high complexity of data; new and sophisticated applications
•What Kind of Data to Be Mined? Flat Files, Relational Databases, Data Warehouse, Transactional Database, Advanced Databases
•Data Mining Tasks and Data Mining Taxonomy
–Descriptive:
▪ Clustering (Given a set of data points, each having a set of attributes, and some proximity measure, the goal of clustering is to find clusters such that
o Data points in the same cluster are more similar to one another –intracluster: more similar
o Data points in different clusters are less similar to one another–intercluster: less similar
o Unlike classification and regression which analyze labeled data objects, clustering analyzes data without class labels)
▪ Association Rule Mining (Given a set of records, each containing some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on occurrences of other
items.)
▪ Sequence Pattern Mining
–Predictive:
▪ Classification (Examples: predicting whether tumor cells are benign or malignant; classifying credit card transactions as legitimate or fraudulent; classifying secondary structures of proteins as alpha-helix, beta-
sheet, or random coil in biology; categorizing news stories as finance, weather, entertainment, sports, etc.)
o Given a collection of records (training set), each record contains a set of attributes, one of the attributes is the class.
o Find a model for class attribute as a function of the values of other attributes.
o A test set is used to determine the accuracy of the model.
o Goal: previously unseen records should be assigned a class as accurately as possible (see the classification sketch after this task list).
▪ Regression (Goal: predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency; regression predicts a value rather
than a label/category)
▪ Outlier Detection
o Outlier: some data point does not comply with the general behavior of the data
o Goal: To detect significant deviations (outliers) from the normal behavior
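A minimal sketch of the classification workflow described above (train a model on labeled records, then measure accuracy on a held-out test set), using scikit-learn; the iris dataset and the decision tree are stand-ins for any records and any classifier:
# Classification: class attribute as a function of the other attributes
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # records (attributes) and class labels
# Hold out part of the data as the test set used to estimate accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
y_pred = model.predict(X_test)                    # assign classes to previously unseen records
print("test accuracy:", accuracy_score(y_test, y_pred))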
What is (not) Data Mining? Example of data mining: Group together similar documents returned by search engine according to their content/context (e.g. Amazon rainforest, Amazon.com, etc)
What is data: a collection of data objects and their attributes; attribute (columns): a property or characteristic of an object; object (rows): a collection of attributes; also called record, point, case, sample, entity, instance
Types of attributes: nominal (qualitative): e.g., ID numbers, eye color, zip codes; only distinctness | ordinal (qualitative): e.g., rankings (e.g., taste of chips on a scale from 1-10), grades, height in {tall, medium, short},
t-shirt size; order has meaning, but the differences between values are not known | interval (quantitative): e.g., dates, temperatures in C or F; order and differences have meaning | ratio (quantitative): e.g., length,
time, counts, temperature in K; distinctness, order, addition/subtraction, and multiplication/division are all meaningful
• Continuous (has real numbers as attribute values; e.g., temperature, height, weight) or Discrete (finite or countably infinite set of values; often represented as integer variables; binary; e.g., t-shirt
size, zip codes, counts, set of words in doc)
–Types of Data Sets: Record: data matrix (points), document data (term vector), transaction data | Graph: world wide web, social networks, molecular structure | Ordered data (sequences of transactions): spatial,
temporal, sequential, genetic sequence data (e.g., the word order in "I love teaching")
–Data Quality Problems:
• Noise (modification of original values) and outliers (characteristics are considerably different from most other data objects in the dataset)
• Missing values: see handling under Data Cleaning below (eliminate, fill in manually, or fill in automatically)
• Duplicate data: data de-duplication-process of dealing with duplicate data
–Characteristics of Structured Data: Dimensionality, Sparsity, Resolution
•Why Data Exploration and Data Pre-processing? Garbage in, Garbage out; most models assume numerical columns
–Why Data Exploration and Data Pre-processing: solve data quality problems for data mining/machine learning
–What is Data Exploration: Goal: to understand the data; EDA: a preliminary exploration of the data to better understand its characteristics; helps select the right tool for pre-processing and data mining; lets people recognize patterns not captured
by automated data analysis
•Basic Techniques for Data Exploration
–Summary Statistics: summarize a set of observations
• location (central tendency): mean (average; sensitive to outliers), median (middle value), mode (most frequent value), midrange (average of smallest and largest; affected by extreme values) | frequency: % of times the
value occurs in the dataset; frequency and mode are used with categorical data | midhinge ((Q1 + Q3)/2; not affected by extreme values)
• spread (statistical dispersion/variability): variance (square of the standard deviation): population σ² = ∑(xᵢ − μ)²/N, sample s² = ∑(xᵢ − x̄)²/(n − 1) (dividing by n − 1 makes the sample variance and standard
deviation slightly larger); standard deviation (measures distance from the mean); range (largest − smallest). Spread is important because it helps with dimensionality reduction in data mining: we prefer to keep
attributes/components with as large a spread (variance) as possible while keeping the error as small as possible
• shape of distribution: Quartiles (not a measure of central tendency): split the ordered data into 4 quarters; position of the i-th quartile: P = i(n + 1)/4, where n is the number of data points; Percentiles
• statistical dependence: correlation (measure of the statistical relationship between 2 variables; not sensitive to the scale of the data; values near −1 or +1: strong relationship, values near 0: weak) | covariance (joint variability of 2 random variables; the sign of the
covariance shows the tendency in the linear relationship between the 2 variables; less useful on its own because it is sensitive to the scale of the data) | Correlation: ρ(X, Y) = Cov(X, Y)/(σX σY)
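A minimal NumPy sketch of the summary statistics above (location, spread with N vs. n−1, quartiles, covariance/correlation); the small arrays are made-up data:
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 6.0, 8.0])

# Location
print("mean:", x.mean(), "median:", np.median(x), "midrange:", (x.min() + x.max()) / 2)
vals, counts = np.unique(x, return_counts=True)
print("mode:", vals[np.argmax(counts)])

# Spread: population variance (divide by N) vs. sample variance (divide by n-1)
print("population var:", x.var(ddof=0), "sample var:", x.var(ddof=1))
print("std dev:", x.std(ddof=1), "range:", x.max() - x.min())

# Quartiles and midhinge
q1, q2, q3 = np.percentile(x, [25, 50, 75])
print("Q1, Q2, Q3:", q1, q2, q3, "midhinge:", (q1 + q3) / 2)

# Statistical dependence: covariance is scale-sensitive, correlation is not
print("cov(x, y):", np.cov(x, y, ddof=1)[0, 1])
print("corr(x, y):", np.corrcoef(x, y)[0, 1])
print("corr(10*x, y):", np.corrcoef(10 * x, y)[0, 1])   # unchanged by rescaling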
–Visualization Techniques: detect general patterns, outliers, and unusual patterns; 1st step: map information to a visual format; may involve elimination/de-emphasis of certain objects and attributes, or choosing a subset of
attributes or objects
• histogram: distribution of a single variable; differences between a histogram and a bar chart:
-Single variable? a histogram shows a single variable; a bar chart can show several variables
-Binned quantitative data? a histogram shows binned quantitative data
-Reordered or not? histogram bins must stay in their natural order; bar chart bars can be in any order
• box plot, scatter plot, contour plot
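A minimal matplotlib sketch of the plots above (histogram, box plot, scatter plot) on synthetic data; the figure layout and the generated values are illustrative only:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
heights = rng.normal(170, 10, 500)                     # one continuous variable
weights = 0.9 * heights - 90 + rng.normal(0, 5, 500)   # a related second variable

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(heights, bins=20)          # histogram: binned quantitative data, fixed bin order
axes[0].set_title("histogram")
axes[1].boxplot(heights)                # box plot: median, quartiles, potential outliers
axes[1].set_title("box plot")
axes[2].scatter(heights, weights, s=5)  # scatter plot: relationship between two attributes
axes[2].set_title("scatter plot")
plt.tight_layout()
plt.show()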
•Data Pre-processing
–Data Cleaning: number 1 problem in data warehousing; handle missing values (eliminate, fill in manually, or fill in automatically with a global constant, the mean (if there are no outliers), or the most probable value (mode)); resolve
redundancy; fix bad data
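A minimal pandas sketch of the missing-value strategies listed above plus de-duplication; the tiny DataFrame and its column names are made up:
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35, np.nan],
                   "city": ["NY", "LA", None, "NY", "NY"]})

dropped = df.dropna()                                              # eliminate records with missing values
const   = df.fillna({"age": -1, "city": "unknown"})                # fill with a global constant
mean_f  = df.assign(age=df["age"].fillna(df["age"].mean()))        # fill with the mean (if no outliers)
mode_f  = df.assign(city=df["city"].fillna(df["city"].mode()[0]))  # fill with the most probable value (mode)
deduped = df.drop_duplicates()                                     # data de-duplication
print(dropped, const, mean_f, mode_f, deduped, sep="\n\n")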
–Aggregation: combining 2 or more attributes or objects into single attribute or object; purposes: data reduction, change of scale, more stable data
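A minimal pandas sketch of aggregation as described above: many daily sales records combined into one monthly record per store (data reduction and a change of scale); the DataFrame and column names are illustrative:
import pandas as pd

daily = pd.DataFrame({"store": ["A", "A", "A", "B", "B", "B"],
                      "month": ["Jan", "Jan", "Feb", "Jan", "Feb", "Feb"],
                      "sales": [100, 120, 90, 200, 180, 210]})

# Aggregate objects: many daily rows -> one row per (store, month); aggregated data is also more stable
monthly = daily.groupby(["store", "month"], as_index=False)["sales"].sum()
print(monthly)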
–Sampling: main technique for data selection; used because obtaining the entire dataset is too expensive/time-consuming; key principle: the sample must be representative—it has approximately the same properties as the original set of data |
simple random sampling (equal probability of selecting any item); sampling without replacement (once selected, an item is removed from the population); sampling with replacement (objects are not removed from the population and can be selected again);
stratified sampling (split the data into partitions, then draw random samples from each partition)
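A minimal pandas sketch of the sampling schemes above: simple random sampling with and without replacement, and stratified sampling; the DataFrame, the "label" column, and the sample sizes are illustrative:
import pandas as pd

df = pd.DataFrame({"value": range(100),
                   "label": ["a"] * 70 + ["b"] * 30})

without_repl = df.sample(n=10, replace=False, random_state=0)   # item removed once selected
with_repl    = df.sample(n=10, replace=True,  random_state=0)   # items can be selected again

# Stratified sampling: partition by label, then draw a random sample from each partition
stratified = df.groupby("label", group_keys=False).sample(n=5, random_state=0)
print(without_repl, with_repl, stratified, sep="\n\n")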
–Dimensionality Reduction (avoid the curse of dimensionality: data becomes increasingly sparse as dimensionality increases; the volume grows exponentially, and in a high-dimensional space almost every point lies near the edge of the
space (coordinate ≈ 1), far from the center (≈ 0))
• feature selection: select the features that are most likely to be useful for the prediction task; filter out redundant and irrelevant features
• feature extraction: combine features to create new ones, with the goal of reducing dimensionality (e.g., height & weight → BMI); see the sketch below
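A minimal pandas sketch of the two ideas above: feature extraction (combine height and weight into BMI) and a simple filter-style feature selection that drops a near-constant, uninformative column; the data and the variance threshold are illustrative:
import pandas as pd

df = pd.DataFrame({"height_m":  [1.60, 1.75, 1.82, 1.68],
                   "weight_kg": [55.0, 72.0, 90.0, 60.0],
                   "constant":  [1.0, 1.0, 1.0, 1.0]})

# Feature extraction: create a new, lower-dimensional feature from existing ones
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Feature selection (filter): keep only columns with enough spread (variance)
variances = df.var(numeric_only=True)
selected = df[variances[variances > 1e-6].index]
print(selected)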
–Feature Subset Selection:
• Principal Component Analysis (PCA): converts a set of possibly correlated variables into a set of values of uncorrelated variables; each principal component is chosen to describe as much of
the still-available variance as possible, the principal components are orthogonal to each other, and the first principal component has the maximum variance; dimensions must be correlated for the data to be reducible
• Singular Value Decomposition (SVD): an extension of PCA; used on matrices of word occurrences or frequencies in documents; latent semantic analysis
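A minimal scikit-learn sketch of PCA and truncated SVD on made-up, correlated data; keeping 2 components is an arbitrary choice for illustration:
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])   # 5 correlated dimensions

pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)
# The first component captures the most variance; components are orthogonal
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

# TruncatedSVD works directly on (sparse) term-frequency matrices, e.g. for latent semantic analysis
svd = TruncatedSVD(n_components=2).fit(X)
print("SVD explained variance ratio:", svd.explained_variance_ratio_)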
–Discretization (transform continuous attribute into categorical attribute) and Binarization (transform continuous or categorical attribute into binary)
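A minimal pandas sketch of discretization (continuous attribute into ordered categories) and binarization (categories into 0/1 attributes); the bins and labels are illustrative choices:
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67, 80])

# Discretization: cut the continuous attribute into categorical bins
age_group = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

# Binarization: one binary (0/1) attribute per category
binary = pd.get_dummies(age_group)
print(pd.concat([ages, age_group, binary], axis=1))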
–Attribute Transformation: maps the entire set of values to a new set of replacement values; normalization/standardization: min-max normalization; Z-score normalization: v′ = (v − μ)/σ, where μ: mean, σ: standard deviation
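A minimal NumPy sketch of the two normalizations above on a made-up column: min-max scaling to [0, 1] and Z-score standardization v′ = (v − μ)/σ:
import numpy as np

v = np.array([10.0, 20.0, 30.0, 50.0, 90.0])

min_max = (v - v.min()) / (v.max() - v.min())   # min-max normalization to [0, 1]
z_score = (v - v.mean()) / v.std()              # Z-score: mean 0, standard deviation 1

print("min-max:", min_max)
print("z-score:", z_score)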
–Measure of Similarity (how alike) & Dissimilarity (how different): proximity is used to refer to either similarity or dissimilarity
• Euclidean distance (L2): d(x, y) = √( ∑k (xk − yk)² )
• City block distance (L1): d(x, y) = ∑k |xk − yk|
• Hamming distance (binary vectors) – special case
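A minimal NumPy sketch of the distance measures above (Euclidean L2, city block L1, and Hamming for binary vectors) on made-up vectors:
import numpy as np

x = np.array([1.0, 3.0, 5.0])
y = np.array([2.0, 1.0, 5.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # L2
city_block = np.sum(np.abs(x - y))          # L1 (Manhattan)

p = np.array([1, 0, 1, 1, 0])
q = np.array([1, 1, 0, 1, 0])
hamming = np.sum(p != q)                    # number of differing bits (special case of L1)

print(euclidean, city_block, hamming)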
• Similarity between binary vectors: for two objects p and q, M01 = number of attributes where p = 0 and q = 1 (similarly M10, M11, M00)
o Simple Matching Coefficient (SMC) = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)
o Jaccard Coefficient: frequently used for objects consisting of asymmetric binary attributes; J = number of matching presences / number of attributes not involved in 0-0 matches =
M11 / (M01 + M10 + M11); basically Jaccard omits M00
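A minimal NumPy sketch of SMC and Jaccard for two binary vectors p and q, using the M11/M10/M01/M00 counts defined above; the vectors are made up:
import numpy as np

p = np.array([1, 0, 0, 1, 0, 0, 1])
q = np.array([1, 1, 0, 0, 0, 0, 1])

m11 = np.sum((p == 1) & (q == 1))
m00 = np.sum((p == 0) & (q == 0))
m10 = np.sum((p == 1) & (q == 0))
m01 = np.sum((p == 0) & (q == 1))

smc = (m11 + m00) / (m11 + m00 + m10 + m01)   # all matches / all attributes
jaccard = m11 / (m11 + m10 + m01)             # ignores the 0-0 matches

print("SMC:", smc, "Jaccard:", jaccard)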
o Cosine similarity: Cosine similarity really is a measure of the (cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the angle between x and y is 0◦, and x and y are the
same except for magnitude (length). If the cosine similarity is 0, then the angle between x and y is 90◦, and they do not share any terms (words). Cosine similarity does not take the
magnitude of the two data objects into account when computing similarity. (Euclidean distance might be a better choice when magnitude is important.) For vectors with a length of 1, the
cosine measure can be calculated by taking a simple dot product. cos(d1, d2) = (d1 · d2) / (‖d1‖ ‖d2‖) = ∑k d1k d2k / (√∑k d1k² · √∑k d2k²)
o Example: d1 = (3 2 0 5 0 0 0 2 0 0), d2 = (1 0 0 0 0 0 0 1 0 2)
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
‖d1‖ = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.4807
‖d2‖ = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 = 2.4495
cos(d1, d2) = 5 / (6.4807 * 2.4495) = 0.3150
• Correlation (see statistical dependence above)
•Example and Exercise for AdaBoost
–For a binary classification problem, we have 5 classifiers: C1(x), C2(x), C3(x), C4(x) and C5(x).
o Example: For the ground truth data (x1, 1), the outputs of the 5 classifiers are C1(x1)=1, C2(x1)=0, C3(x1)=0, C4(x1)=0, C5(x1)=0, and their known error rates are e1 = 0.1, e2 = 0.6, e3 = 0.6, e4 = 0.6, e5 = 0.6.
o A new test data point is again x1 (so we know the correct output is y=1). Based on the above, calculate the final classification result according to equation (4): C*(x) = arg max_y ∑i αi δ(Ci(x) = y)
o Solution:
• Compute the importance of each classifier Ci according to equation (2): αi = ½ ln((1 − εi)/εi), which gives αi ≈ {1, −0.2, −0.2, −0.2, −0.2}
• The AdaBoost classifier takes the arg max over y of the sum of the alpha-weighted votes, where a classifier Ci votes for y when Ci(x) = y (equation (4), analogous to the arg max in the NB classifier):
C*(x1, y=1) = (1*1) + (−0.2*0) + (−0.2*0) + (−0.2*0) + (−0.2*0) = 1
C*(x1, y=0) = (1*0) + (−0.2*1) + (−0.2*1) + (−0.2*1) + (−0.2*1) = −0.8
1 > −0.8 → C*(x1, y=1) > C*(x1, y=0) → x1 is correctly classified as y = 1.
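A minimal NumPy sketch that reproduces the AdaBoost vote above: alphas from the error rates via αi = ½ ln((1 − εi)/εi), then an arg max over the alpha-weighted votes for y = 1 and y = 0 on x1 (the exact formula gives α1 ≈ 1.1 rather than the rounded value 1 used in the worked example):
import numpy as np

errors = np.array([0.1, 0.6, 0.6, 0.6, 0.6])     # known error rates e1..e5
outputs = np.array([1, 0, 0, 0, 0])              # C1(x1)..C5(x1)

alphas = 0.5 * np.log((1 - errors) / errors)     # importance of each classifier (equation (2))
print("alphas:", np.round(alphas, 2))            # approximately [1.1, -0.2, -0.2, -0.2, -0.2]

# Equation (4): sum the alphas of the classifiers that vote for each class
score = {y: np.sum(alphas[outputs == y]) for y in (0, 1)}
print("scores:", score)                          # the score for y=1 exceeds the score for y=0
print("C*(x1) =", max(score, key=score.get))     # x1 is correctly classified as y = 1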