Data Mining for Business &
Governance
MSc Data Science & Society
Tilburg University
Lecture 1. Introduction to Data Mining
Pattern Classification
In this problem, we have three numerical variables (features) used to predict the outcome (decision class). The problem is multi-class since there are three possible outcomes. The goal in pattern classification is to build a model that generalizes well beyond the historical training data: given a new, unseen instance, the model should be able to predict its decision class. The fact that the Y-variable is nominal (rather than numerical) indicates that the problem is a classification problem.
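As a minimal illustration (not from the lectures), a decision tree fitted on an invented dataset with three numerical features and three classes:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy training data: three numerical features per instance (values invented).
X_train = np.array([[5.1, 3.5, 1.4],
                    [6.2, 2.9, 4.3],
                    [7.3, 2.8, 6.5],
                    [4.9, 3.0, 1.5],
                    [5.9, 3.2, 4.8],
                    [6.8, 3.0, 5.9]])
y_train = np.array(["A", "B", "C", "A", "B", "C"])   # nominal Y -> classification

model = DecisionTreeClassifier().fit(X_train, y_train)

# The fitted model should generalize to unseen instances.
print(model.predict([[6.0, 3.1, 4.5]]))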
Missing Values
Sometimes, we have instances with missing values for some features. It is of paramount importance to deal with this situation before building any machine learning or data mining model, because most traditional algorithms cannot handle missing values. In data mining and machine learning, missing data, or missing values, occur when no data value is stored for a variable in an observation. A missing value can signify a number of different things. Perhaps the field was not applicable, the event did not happen, or the data was not available. It could also be that the person who entered the data did not know the right value, or did not care whether a field was filled in. There are several strategies to deal with missing values:
1. Remove the feature (variable) containing missing values (the simplest option): this strategy is recommended when the majority of the instances (observations) have missing values for that feature. It is not advisable when we have only a few features or when the feature we want to remove is deemed relevant.
2. Remove the instances having missing values: if we have scattered missing values and few features, we might want to remove the instances having missing values. This is not advisable when we have a limited number of instances.
3. Imputation: replace the missing values with some value inferred from the data, e.g., the mean or median of a numerical feature, or the mode of a categorical one (a code sketch follows this list). Be aware that this introduces noise, because the filled-in value may not be correct. Noise is a random error or variance in a measured variable.
4. Estimation: replace the missing values with some value learned from the data, using machine learning models trained on the non-missing information.
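A minimal sketch of strategy 3 using scikit-learn's SimpleImputer; the column names and values are invented for illustration:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age":    [23, 31, np.nan, 45],        # one scattered missing value
    "income": [2800, np.nan, 3100, 5200],
})

# Replace each missing entry with the median of its own column.
imputer = SimpleImputer(strategy="median")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])
print(df)   # age 31.0 and income 3100.0 fill the gaps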
Autoencoders to impute missing values
Autoencoders can be used for pattern-completion problems; recommender systems, for instance, can be viewed as a kind of missing-value problem. Autoencoders are deep neural networks composed of two neural blocks named the encoder and the decoder. The encoder reduces the problem dimensionality while the decoder completes the pattern. The weights connecting the neurons are adjusted through unsupervised learning.
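A minimal PyTorch sketch of this idea (my own illustration, not the lecture's code): the network learns to reconstruct its input, the loss is computed on the observed entries only, and the missing entries are read off the reconstruction. The layer sizes and toy data are assumptions:

import torch
import torch.nn as nn

# Toy data: rows are instances, columns are features; NaN marks missing values.
x = torch.tensor([[0.2, 0.7, float("nan")],
                  [0.9, float("nan"), 0.4],
                  [0.1, 0.6, 0.3]])
mask = ~torch.isnan(x)            # True where a value is observed
x_filled = torch.nan_to_num(x)    # zero placeholders so the network can run

autoencoder = nn.Sequential(
    nn.Linear(3, 2), nn.ReLU(),   # encoder: reduces dimensionality (3 -> 2)
    nn.Linear(2, 3),              # decoder: completes the pattern (2 -> 3)
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=0.01)

for _ in range(500):
    optimizer.zero_grad()
    reconstruction = autoencoder(x_filled)
    # Unsupervised objective: reconstruct the observed entries only.
    loss = ((reconstruction - x_filled)[mask] ** 2).mean()
    loss.backward()
    optimizer.step()

# Keep observed values; take the missing ones from the reconstruction.
imputed = torch.where(mask, x, autoencoder(x_filled).detach())
print(imputed)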
Feature Scaling
1. Normalization (min-max)
Different features might encode different measurements and scales (e.g., the age and height of a person). Normalization encodes all numeric features on the [0,1] scale, which attempts to give all attributes an equal weight. This is particularly useful for classification algorithms involving neural networks, or distance measures such as nearest-neighbor classification and clustering. Among the many methods for data normalization is min-max normalization: we subtract the feature's minimum from the value to be transformed and divide the result by the feature range,

x' = (x − min) / (max − min).
2. Standardization
Standardization is similar to normalization, but the transformed values might not lie in the [0,1] interval. Under the normal distribution assumption, the vast majority of the standardized values will typically lie in the range [-3,3]. We subtract the mean from the value to be transformed and divide the result by the standard deviation,

z = (x − μ) / σ.
Normalization versus standardization
Although the proportions are maintained, normalization and standardization might lead to
different scaling results. Standardization is recommended in case of outliers. The (min-max)
normalization approach is not effective when the maximum and minimum values are extreme
value outliers because of some mistake in data collection. For example, consider the age attribute
where a mistake in data collection caused an additional zero to be appended to an age, resulting
in an age value of 800 years instead of 80. In this case, most of the scaled data along the age
attribute will be in the range [0, 0.1], as a result of which this attribute may be de-emphasized.
Standardization is more robust to such scenarios.
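A small numpy sketch of both transforms on the age example above, including the erroneous value of 800 (the other ages are invented):

import numpy as np

age = np.array([25.0, 32.0, 41.0, 58.0, 800.0])   # 800 is the data-entry error

minmax = (age - age.min()) / (age.max() - age.min())
zscore = (age - age.mean()) / age.std()

print(minmax)   # the genuine ages are squeezed into roughly [0, 0.05]
print(zscore)   # the genuine ages keep their relative spread; the outlier stands out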
Feature Interaction
Sometimes we need to measure how the features describing a certain problem domain relate to each other. For example, what is the relationship between gender and income in Sweden? Correlation measures the relationship between numerical variables; association measures the relationship between non-numerical (categorical) variables.
Correlation between two numerical variables
Correlation measures the extent to which the
data can be approximated with a linear
regression model.
Pearson Correlation
Pearson correlation is used when we want to determine the correlation between two numerical variables given k observations. It is intended for numerical variables only and its value lies in [-1,1]. The order of the variables does not matter since the coefficient is symmetric:

r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / ( √(Σᵢ (xᵢ − x̄)²) · √(Σᵢ (yᵢ − ȳ)²) ).
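A quick numpy check (the values are invented):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]   # symmetric: [1, 0] gives the same value
print(r)                      # close to +1: strong positive linear relationship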
Association between two categorical variables
Sometimes, we need to measure the association degree between two
categorical (ordinal or nominal) variables. For example, what is the
association between gender and eye color?
The χ² association measure
It is used when we want to measure the association between two categorical variables given k observations. We should compare the frequencies of values appearing together with their individual frequencies. The data tuples can be shown as a contingency table.
Note that for a set of k binary random variables (items), denoted by X, there are 2^k possible states representing the presence or absence of the different items of X. For example, for k = 2 items {Bread, Butter}, the 2² = 4 states are {Bread, Butter}, {Bread, ¬Butter}, {¬Bread, Butter}, and {¬Bread, ¬Butter}. The expected fractional presence of each of these combinations can be quantified as the product of the supports of the states (presence or absence) of the individual items. For a given data set, the observed value of the support of a state may vary significantly from the expected value of the support. Let Oᵢ and Eᵢ be the observed and expected values of the absolute support of state i; the statistic sums the normalized squared deviations over all states,

χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ.

For example, the expected support Eᵢ of {Bread, ¬Butter} is given by the total number of transactions multiplied by each of the fractional supports of Bread and ¬Butter, respectively.
For example, when X = {Bread, Butter}, one would need to perform the summation in the equation above over the 2² = 4 states corresponding to {Bread, Butter}, {Bread, ¬Butter}, {¬Bread, Butter}, and {¬Bread, ¬Butter}. A value that is close to 0 indicates statistical independence among the items. Larger values of this quantity indicate greater dependence between the variables. However, large χ² values do not reveal whether the dependence between items is positive or negative. This is because the χ² test measures dependence between variables, rather than the nature of the correlation between the specific states of these variables.
Example
Below is the contingency table for two categorical variables such that the first one has n = 2 categories and the second has m = 3 categories:

           Blue   Green   Brown   Total
  Male       6      8       12      26
  Female     9      5       10      24
  Total     15     13       22      50
How to proceed?
We have 26 males, of whom 6 have blue eyes, 8 have green eyes and 12 have brown eyes. We have 24 females, of whom 9 have blue eyes, 5 have green eyes and 10 have brown eyes. The total number of people with blue, green and brown eyes is 15, 13 and 22, respectively, and the total number of people studied is 50.
The χ² of the provided contingency table is obtained by summing the contributions of the two rows: χ² = χ²(male) + χ²(female), where each term accumulates (Oᵢ − Eᵢ)²/Eᵢ over the three eye colors of that row.
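Working this example out in numpy (the expected count of each cell is its row total × column total / grand total):

import numpy as np

observed = np.array([[6, 8, 12],    # male:   blue, green, brown
                     [9, 5, 10]])   # female: blue, green, brown

row_totals = observed.sum(axis=1, keepdims=True)   # [[26], [24]]
col_totals = observed.sum(axis=0, keepdims=True)   # [[15, 13, 22]]
expected = row_totals * col_totals / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
print(chi2)   # ~1.40, close to 0: little evidence of association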
Encoding Strategies
Some machine learning and data mining algorithms or platforms cannot operate on categorical features. Therefore, we need to encode these features as numerical quantities. The first strategy is referred to as label encoding and consists of assigning integer numbers to each category. It only makes sense if there is an ordinal relationship among the categories, such as weekdays, months, star-based hotel ratings, or income categories.
One-hot encoding (Dummy encoding)
- It is used to encode nominal features that lack an ordinal relationship.
- Each category of the categorical feature is transformed into a binary feature in which a one marks that category.
- This strategy often increases the problem dimensionality notably, since each feature is encoded as a binary vector. Main disadvantage: the model can easily explode/become very large. (Both encodings are sketched below.)
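Both strategies in pandas (the categories are invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "rating": ["1-star", "3-star", "2-star"],   # ordinal -> label encoding
    "color":  ["red", "green", "blue"],         # nominal -> one-hot encoding
})

# Label encoding: an explicit mapping preserves the ordinal relationship.
df["rating"] = df["rating"].map({"1-star": 1, "2-star": 2, "3-star": 3})

# One-hot encoding: one binary column per category (dimensionality grows).
df = pd.get_dummies(df, columns=["color"])
print(df)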