Complete summary of:
- Book: Data Science for Business (Provost & Fawcett)
- Case studies summary and answers (P. Snoeren)
All exam materials needed next to the lecture slides!
Cases
1. Capital One
2. Gaming industry
3. Easyjet + Fifa
   Easyjet
   Fifa
4. Google Healthcare
5. Twitter and stock returns
6. Privacy
Sophie van Sonsbeek - 12799955
Chapter 1. Introduction: Data-Analytic Thinking
Introduction
Data collection is done in every aspect of business:
- Operations, manufacturing, supply-chain, customer behavior,
marketing campaign performance, workflow procedure, and
so on.
Data science = methods for extracting knowledge and information from
data. The growing availability of data has increased interest in these
methods.
The ubiquity of data opportunities
Data mining techniques are used in:
- Marketing: targeted marketing, online advertising,
recommendations for cross-selling
- Finance: credit scoring, trading, fraud detection
- Retail: Amazon & Walmart apply them throughout the entire business
Data-analytic thinking enables you to evaluate proposals for data
mining projects.
This book
Goals of this book:
Translate business problems into data problems.
Provide data mining/data science techniques.
Example used in the book: Predicting customer churn.
Customers switching from one company to another is called churn,
and it is expensive all around: one company must spend on incentives
to attract a customer while another company loses revenue when the
customer departs.
Data science principles
Data science, engineering, and data-driven decision making
Data-driven decision making (DDD) refers to the practice of basing
decisions on the analysis of data, rather than purely on intuition.
Two types of decisions are the focus of this book:
1. Decisions for which “discoveries” need to be made within data.
2. Decisions that are repeated, especially at massive scale, so that
even a small increase in decision-making accuracy can have a
big impact.
Example:
Target wanted to get a jump on its competition: Amazon. It was
interested in whether it could predict that people are expecting a
baby. If it could, it would gain an advantage by making offers
before its competitors.
→ Expecting mothers often change their diets, wardrobes, vitamin
regimens, etc.
Big data
Data processing and “Big Data”
Difference between data science and data processing:
• Data science needs data and benefits from data engineering,
which is facilitated by data processing technologies. But
these technologies are not only for data science.
o Data processing technologies are important for data-
oriented business tasks that do not involve extracting
knowledge or data-driven decision making.
o E.g. online advertising campaign management,
modern web system processing
• Big data technologies:
o Big data = datasets that are too large for traditional
data processing systems and therefore require new processing
technologies.
o Big data technologies are used for implementing data
mining techniques → they support the data processing behind
data mining techniques.
Strategic asset
Data and data science capability as a strategic asset
Fundamental principle of data science: data, and the capability to
extract useful knowledge from data, should be regarded as key
strategic assets.
Chapter 2. Business Problems and Data Science Solutions
Summary
Fundamental concepts: a set of canonical data mining tasks; the data
mining process; supervised versus unsupervised data mining.
Understanding the whole data mining process helps to structure data
mining projects into systematic analyses.
Data mining techniques
From business problems to data mining tasks
Data scientists decompose a business problem into subtasks. The
data mining subtasks can then be composed to solve the overall
problem.
Canonical data mining tasks:
1. Classification and class probability estimation attempt to
predict, for each individual in a population, which of a (small)
set of classes this individual belongs to.
Classification and scoring are very closely related; as we shall
see, a model that can do one can usually be modified to do
the other.
2. Regression (“value estimation”) attempts to estimate or
predict, for each individual, the numerical value of some
variable for that individual.
“How much will a given customer use the service?”
3. Similarity matching attempts to identify similar individuals
based on data known about them.
4. Clustering attempts to group individuals in a population
together by their similarity.
“Do our customers form natural groups or segments?”
5. Co-occurrence grouping attempts to find associations
between entities based on transactions involving them.
“What items are commonly purchased together?”
6. Profiling attempts to characterize the typical behavior of an
individual, group or population.
“What is the typical cell phone usage of this customer
segment?”
7. Link prediction attempts to predict connections between data
items.
“Since you and Karen share 10 friends, maybe you’d like to be
Karen’s friend?”
8. Data reduction attempts to take a large set of data and
replace it with a smaller set of data that contains much of the
important information in the larger set.
“GPA instead of list of grades per student”
9. Causal modeling attempts to help us understand what events
or actions actually influence others.
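As a small illustration of one task from the list, co-occurrence grouping (task 5) can be sketched by counting how often pairs of items appear in the same transaction. This is only a minimal sketch: the function name `pair_counts` and the example baskets are hypothetical and do not come from the book.

```python
from collections import Counter
from itertools import combinations

def pair_counts(transactions):
    """Count how often each unordered pair of items occurs in the same transaction."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) removes duplicates and gives every pair a fixed order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical shopping baskets, for illustration only
baskets = [
    ["beer", "chips"],
    ["beer", "chips", "salsa"],
    ["chips", "salsa"],
]
counts = pair_counts(baskets)
```

Pairs with high counts are candidates for the question “What items are commonly purchased together?”. Real association mining would also consider how frequent each item is on its own.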
Supervised versus unsupervised methods
Supervised learning = the training data has a dependent variable or target
variable.
- Purpose: predicting the target
- Problem: “will a customer leave when her contract expires?”
- Data mining techniques:
o Classification
- Categorical (often binary) target
- “Which service package will a customer likely
purchase if given incentive I?”
o Regression
- Numeric target
- “How much will this customer use the service?”
o Causal modeling
The data mining process
1. Business understanding
a. Recasting the problem & designing a solution is an
iterative process of discovery.
2. Data understanding
a. It’s important to understand the strengths & limitations of
the data because there is rarely an exact match with
the problem.
3. Data preparation
a. The phase in which data are manipulated and
converted into forms that yield better results.
4. Modeling
a. Output of modeling: some sort of model or pattern
capturing regularities in the data.
5. Evaluation
a. Are the data mining results valid & reliable?
6. Deployment
a. Getting return on investment by implementing the
results.
Chapter 3. Introduction to Predictive Modeling: From Correlation to
Supervised Segmentation
Summary
Fundamental concepts: identifying informative attributes;
segmenting data by progressive attribute selection.
Exemplary techniques: finding correlations; attribute/variable
selection; tree induction
Predictive modeling as supervised segmentation: how can we segment
the population into groups that differ from each other with respect to
some quantity of interest?
Models, induction and prediction
Predictive model = a formula for estimating the target.
- Classification
- Regression
Descriptive model = gain insight into the underlying phenomenon or
process.
Supervised learning = model describes a relationship between
independent variables and target variable.
Deductive vs inductive
Induction = generalizing from specific cases to general rules.
Inductive models:
- Classification and regression
Input data used for inducing the model → training data
Training data are called labeled data because the value of the
target variable is known.
Supervised segmentation
Selecting informative attributes
Classification
The groups need to be pure → homogeneous with respect to the
target variable.
The most common splitting criterion is called information gain, and it
is based on a purity measure called entropy.
Entropy = a measure of disorder (how mixed the segment is with
respect to the target variable).
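entropy = -p1 log2(p1) - p2 log2(p2) - ...
p = the probability (relative share) of a property within the set
(p = 1: all members of the set have property x; p = 0: no members
of the set have property x)
Entropy is a measure of group impurity:
0 = pure
1 = maximum impurity (for a two-class target)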
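The entropy measure can be sketched in a few lines of Python. This is a minimal sketch assuming the standard definition with base-2 logarithms; the function name `entropy` and the example labels are illustrative choices, not taken from the book.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels: minus the sum of p * log2(p)."""
    total = len(labels)
    result = 0.0
    for count in Counter(labels).values():
        p = count / total          # share of this class within the segment
        result -= p * math.log2(p)
    return result

pure_segment = ["churn", "churn", "churn", "churn"]    # only one class
mixed_segment = ["churn", "churn", "stay", "stay"]     # 50/50 mix of two classes
print(entropy(pure_segment), entropy(mixed_segment))
```

A segment with only one class scores 0, and an even mix of two classes scores 1, which matches the scale given above.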
Information gain = the improvement in purity created by
segmentation. It combines segment size and segment purity.
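One way to make this concrete: information gain is the entropy of the parent segment minus the size-weighted average entropy of the child segments. The sketch below assumes that standard definition. The names `information_gain`, `parent` and `children`, and the example data, are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    result = 0.0
    for count in Counter(labels).values():
        p = count / total
        result -= p * math.log2(p)
    return result

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child segments."""
    total = len(parent)
    # Each child's entropy is weighted by its share of the parent (segment size)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["churn"] * 4 + ["stay"] * 4
perfect_split = [["churn"] * 4, ["stay"] * 4]                    # both children pure
useless_split = [["churn", "churn", "stay", "stay"]] * 2         # children as mixed as parent
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

The weighting is what lets the measure combine segment size with segment purity: a tiny pure child contributes little, a large pure child contributes a lot.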
Numeric variables
Numeric variables can be ‘discretized’ by choosing a split point (or
many split points) and then treating the result as a categorical
attribute.
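A common way to pick the split point, assumed here for illustration, is to try the midpoints between neighbouring distinct values and keep the one with the highest information gain. The function `best_split_point` and the example numbers are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    result = 0.0
    for count in Counter(labels).values():
        p = count / total
        result -= p * math.log2(p)
    return result

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child segments."""
    total = len(parent)
    return entropy(parent) - sum(len(c) / total * entropy(c) for c in children)

def best_split_point(values, labels):
    """Return (threshold, gain) for the best binary split 'value <= threshold'."""
    distinct = sorted(set(values))
    best_threshold, best_gain = None, 0.0
    # Candidate thresholds: midpoints between neighbouring distinct values
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        gain = information_gain(labels, [left, right])
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

# Hypothetical data: leftover minutes per customer and whether they churned
minutes = [10, 20, 30, 40]
outcome = ["stay", "stay", "churn", "churn"]
threshold, gain = best_split_point(minutes, outcome)
```

After the split, the numeric variable behaves like a two-valued categorical attribute ("at most the threshold" versus "above the threshold").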
Visualizing segmentations
Classification tree
Decision lines and hyperplanes
The lines separating the regions are known as decision lines.
Hyperplane is used in data mining literature to refer to the general
separating surface, whatever it may be.
Laplace correction
Overfitting
The Laplace correction moderates the influence of leaves with only a few
instances.
p(c) = (n + 1) / (n + m + 2)
n = the number of examples in the leaf belonging to class c
m = the number of examples in the leaf not belonging to class c
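The effect of the correction is easiest to see next to the raw frequency estimate. The sketch below assumes the binary-class form (n + 1) / (n + m + 2), with n examples of class c and m examples of other classes in the leaf. The function names and the example counts are illustrative.

```python
def raw_frequency(n, m):
    """Uncorrected estimate: share of class-c examples in the leaf."""
    return n / (n + m)

def laplace_probability(n, m):
    """Laplace-corrected estimate for a two-class problem: (n + 1) / (n + m + 2)."""
    return (n + 1) / (n + m + 2)

# A leaf with 2 examples, both of class c: the raw estimate is an overconfident 1.0
small_leaf = (raw_frequency(2, 0), laplace_probability(2, 0))
# A leaf with 20 examples, all of class c: the correction barely changes the estimate
large_leaf = (raw_frequency(20, 0), laplace_probability(20, 0))
print(small_leaf, large_leaf)
```

With few instances the corrected estimate is pulled toward 0.5, and as the leaf grows it approaches the raw frequency.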
Trees and sets of rules
Before starting to build a classification tree with variables, it is worth
asking: how good is each of these variables individually?
For this we measure the information gain of each attribute, as
discussed earlier.
As can be seen, the first three variables – the house value, the
number of leftover minutes, and the number of long calls per month
– have a higher information gain than the rest.