Summary

Summary Data Science Research Methods (JBM025)


Course
Data Science Research Methods (JBM025)

Institution
Technische Universiteit Eindhoven (TUE)

Summary on the course Data Science Research Methods (JBM025) from the major Data Science in Eindhoven and Tilburg. This course has two parts. The first part focusses on the scientific method and design of experiments (DOE). The second part focusses on econometrics and builds upon what is discussed ...


Uploaded on June 26, 2022
Number of pages 26
Written in 2021/2022
Type Summary

tiu
dsrm
data science
research methods
matching
doe
design of experiments
econometrics
eindhoven
tilburg
tue
data science research methods


NienkeUr


DATA SCIENCE RESEARCH METHODS
CONTENTS

The scientific method
    Six Sigma

Sample size determination
    Minimal sample sizes
    Normal distribution
    Binomial distribution
    When 𝝈 or 𝒑 is not known
    Power analysis
    Normal distribution
    Binomial distribution

Analysis of variance (ANOVA)
    ANOVA table

ANOVA – power and multiple comparisons
    ANOVA power
    Multiple comparisons
    Fisher Least Significance Difference (LSD)
    Tukey’s Honest Significant Difference (HSD)

Two-factor designs and blocking

Full factorial designs
    DOE: how to determine whether an individual factor is of importance
    Blocking with 2 factors

Fractional factorial designs
    Fractional experiments
    Fractional factorials

Response Surface Optimisation
    Improvement Efficiently: finding near-optimal factor settings
    Box/Simplex method
    Steepest ascent/descent method
    Quadratic models
    Response surface designs
    Central Composite Design (CCD)
    Box-Behnken Design
    Deriving optimal settings
    Optimums
    Optimisation scheme

Econometrics for data scientists
    Random variables
    Regressions
    Bivariate and multivariate regressions
    Ordinary least squares (OLS)
    Instrumental variable estimation

Causality and selection
    Causality
    Selection and selection bias
    Regression and randomized experiments
    Potential problems with experiments

Selection on observables and matching
    Matching
    3 methods of matching
    Exact matching
    Matching based on closeness of observables
    Propensity score matching
    OLS estimator as matching estimator
    Flexible OLS as matching estimator

Differences-in-differences estimation
    Some important details
    Generalization

Regression Discontinuity design (RDD)
    Sharp regression discontinuity design
    Main idea and interpretation
    Estimation of the treatment effect in Sharp RDD
    Approach 2
    Approach 1
    Fuzzy regression discontinuity design
    Estimation of the fuzzy RD
    Alternative to this estimation
    Specification testing

THE SCIENTIFIC METHOD

Key concepts
• Scientific method
• Experiment
• Factor
• Independent variable
• Six Sigma

What should you be able to do?
• Link elements of Six Sigma to the scientific method
• Translate a case study in terms of independent variables (factors) and dependent variables
• Be able to distinguish, in a specific data science context, which of the three basic goals is relevant

Key insights
• It is important to identify which of the three data science goals is relevant in a given context
• The scientific method is an iterative process
• If you do not plan an experiment well in advance, no statistical analysis may yield the hoped-for results
• Experiments may involve several factors, each of which may have more than 2 levels
• The scientific method is also very useful in industry
• The Six Sigma approach in industry has incorporated several aspects of the scientific method

Data science has three goals:
1. Description
2. Prediction
3. Explanation

Business has similar distinctions regarding analytics:
1. Descriptive analytics provide insight into the past
2. Predictive analytics provide understanding of the future
3. Prescriptive analytics advise on the possible outcomes

Basic elements of the (iterative) scientific method
1. Formulate a question
2. Perform background research
3. Formulate the hypothesis (answer)
4. Determine the logical consequences of the hypothesis
5. Collect observations (experiment)
6. Test the truth of the hypothesis by analysing observations (statistics)
7. Report the results
8. If the hypothesis is not confirmed, go back to 2

Steps in experimentation
1. Plan the experiment
2. Design the experiment
3. Perform the experiment
4. Analyse the resulting data
5. Confirm the results
6. Evaluate the conclusion

There are a number of valid reasons for the iterative approach:
1. New insights were obtained after analysing the experiment
2. New questions arose from the experiment
3. The hypotheses were built upon wrong assumptions
The iterative nature means that, if a hypothesis is refuted by the experiment, you should start over: form a new
hypothesis and test it with a new experiment. This cycle is repeated until it is no longer necessary.

SIX SIGMA

Six Sigma: A disciplined, data-driven methodology for process improvement.
It is a combination of quality management tools and statistical methods.
DMAIC: The circular problem-solving approach of Six Sigma.
Its steps correspond to the steps in experimentation of the scientific method:
Define (1, 2) – Measure (3) – Analyse (4) – Improve ( ) – Control ( )

DMAIC also uses the principles of the scientific method:
1. DMAIC cycle uses the same iterative discovery cycle
2. It puts emphasis on doing well-defined experiments to discover new insights
3. It’s data driven and puts emphasis on quantification
4. It looks for causal relationships
5. It puts emphasis on proper verification and validation of results

SAMPLE SIZE DETERMINATION
How much data do I need to collect?

Key concepts
• p-value
• hypothesis tests
• width of the confidence interval
• power
• minimal sample size

What should you be able to do?
• Compute the minimal sample size in terms of CI width when you are given the formula (normal, binomial)
• Compute the minimal sample size in terms of power when you are given the formula (normal, binomial)
• Compute minimal sample sizes when given a simple confidence or power formula for a distribution

Key insights
• The absolute error parameter is the half-width of the CI in the case of symmetric CIs
• The CI width in binomial and normal distributions leads to the minimal sample size
• Minimal sample size determination in binomial cases requires extra information on the success probability $p$

There are three basic ways of hypothesis testing:
1. Is the test statistic in the critical region (yes/no)? This does not provide a lot of information.
2. P-values: these allow people to choose their own $\alpha$ value.
3. Confidence intervals: these give insight into how uncertain we are about the estimate.
   $(\hat{\theta} - c,\ \hat{\theta} + c)$ is a $100(1 - \alpha)\%$ CI when $P(\hat{\theta} - c < \theta < \hat{\theta} + c) = 1 - \alpha$

Type I error (false positives)
$\alpha$: the probability of rejecting $H_0$ when $H_0$ is true. $1 - \alpha$ is the true-negative probability (not rejecting $H_0$ when it is true).
Type II error (false negatives)
$\beta$: the probability of not rejecting $H_0$ when $H_0$ is false.
Power (true positives)
$1 - \beta$: the probability of rejecting $H_0$ when $H_0$ is false.
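To make these three quantities concrete, here is a minimal sketch that computes $\beta$ and the power of a two-sided z-test for one assumed true mean. The function name `power_two_sided_z` and the numbers in the usage lines are hypothetical, chosen only for illustration, and $\sigma$ is assumed to be known.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sided_z(mu0, mu_true, sigma, n, alpha=0.05):
    """Power (1 - beta) of a two-sided z-test of H0: mu = mu0.

    Hypothetical helper: assumes normal data with known sigma.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # If the true mean is mu_true, the standardised statistic
    # T = (x_bar - mu0) / (sigma / sqrt(n)) follows N(shift, 1).
    shift = (mu_true - mu0) / (sigma / sqrt(n))
    # Probability that T lands in the rejection region |T| > z_crit.
    return std_normal.cdf(-z_crit - shift) + 1 - std_normal.cdf(z_crit - shift)


# Illustrative numbers only.
power = power_two_sided_z(mu0=100, mu_true=105, sigma=15, n=25)
beta = 1 - power  # probability of a Type II error for this true mean
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

When `mu_true` equals `mu0`, the returned rejection probability is exactly $\alpha$, which matches the definition of the Type I error.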

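Z-tests (Normal distribution)
Assumptions: $X_i \sim N(\mu, \sigma^2)$ + independence
$H_0: \mu = \mu_0$
$H_a: \mu \neq \mu_0$
Significance level $\alpha$
Test statistic: $T = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$
Decision rule: reject if $|T| > z_{\alpha/2}$. Under $H_0$, $\bar{X} - \mu_0 \sim N(0, \frac{\sigma^2}{n})$, so $T \sim N(0, 1)$.
$100(1 - \alpha)\%$ CI for $\mu$: $\left( \bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right)$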
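The z-test and its confidence interval can be sketched in a few lines of Python. This is an illustration under the stated assumptions (independent normal observations, known $\sigma$), not the course's own code. The function name `z_test` and the sample values are made up.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: mu = mu0 with known sigma (hypothetical helper)."""
    n = len(sample)
    x_bar = fmean(sample)
    std_error = sigma / sqrt(n)
    t = (x_bar - mu0) / std_error            # test statistic T
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    ci = (x_bar - z_crit * std_error, x_bar + z_crit * std_error)
    return {"T": t, "reject": abs(t) > z_crit, "p_value": p_value, "ci": ci}


# Made-up measurements, for illustration only.
sample = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.3, 10.0]
result = z_test(sample, mu0=10.0, sigma=0.3)
print(result)
```

The three ways of testing agree with each other: $H_0$ is rejected exactly when $|T| > z_{\alpha/2}$, when the p-value is below $\alpha$, and when $\mu_0$ falls outside the CI.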

MINIMAL SAMPLE SIZES

The formula to calculate the minimal sample size can be derived from the confidence interval.
The formula for the half-width gives the error ($E$), which can then be rewritten to calculate $n$.
The formulas to calculate the sample size have a similar form: $n \geq \left\lceil \left( \frac{z_{\alpha/2}}{E} \right)^2 \sigma^2 \right\rceil$

If the deviation is not absolute but relative to the expected value $\sigma$ (e.g. $p$ of the response time), then $E = p \times \sigma$
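As a rough sketch of how the sample size formula is used in practice, the function below evaluates $\lceil (z_{\alpha/2}/E)^2 \sigma^2 \rceil$ directly. The name `min_sample_size` and the example numbers are assumptions for illustration, and $\sigma$ is taken as known.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size(sigma, error, alpha=0.05):
    """Smallest n whose CI half-width is at most `error` (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    return ceil((z / error) ** 2 * sigma ** 2)


# Absolute error: sigma = 15, half-width at most 2 units.
n_absolute = min_sample_size(sigma=15, error=2)

# Relative error, using E = p * sigma with p = 0.1 as in the text.
n_relative = min_sample_size(sigma=15, error=0.1 * 15)

print(n_absolute, n_relative)
```

Rounding up with `ceil` matters: rounding down would give a half-width slightly larger than the requested error.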

NORMAL DISTRIBUTION

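One-sample (if $\sigma$ is known)
CI: $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
Error: $E \geq z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$
Sample size: $n \geq \left( \frac{z_{\alpha/2}}{E} \right)^2 \sigma^2$

Two-sample (if the $\sigma$s are known, and $n_1 = n_2 = n$)
CI: $\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{n}}$
Sample size: $E \geq z_{\alpha/2} \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{n}} \Rightarrow n \geq \left( \frac{z_{\alpha/2}}{E} \right)^2 (\sigma_1^2 + \sigma_2^2)$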
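The two-sample formula can be checked numerically in the same way. The sketch below assumes both standard deviations are known and both groups have the same size $n$. The function name `min_sample_size_two_sample` and the inputs are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def min_sample_size_two_sample(sigma1, sigma2, error, alpha=0.05):
    """Per-group n so that the CI for mu1 - mu2 has half-width <= error."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    return ceil((z / error) ** 2 * (sigma1 ** 2 + sigma2 ** 2))


def half_width(sigma1, sigma2, n, alpha=0.05):
    """Half-width of the two-sample CI for a given per-group n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sqrt((sigma1 ** 2 + sigma2 ** 2) / n)


# Illustrative numbers only.
n_per_group = min_sample_size_two_sample(sigma1=15, sigma2=15, error=2)
print(n_per_group, half_width(15, 15, n_per_group))
```

With equal standard deviations the required size per group is about twice the one-sample size, because the variance of the difference is the sum of the two variances.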
