A short summary discussing the Statistical Computing JBM050 course for the bachelor Data Science in Tilburg and Eindhoven. This summary is based on the lectures and the reading materials.
Sampling distribution
Statistics The use of data in the context of uncertainty, a branch of mathematics using probability theory.
Bernoulli trial $X \sim \mathrm{Bern}(\pi)$
A random experiment with exactly 2 outcomes (binary variables): “success” [$P(X=1)=\pi$], and “failure” [$P(X=0)=1-\pi$].
Binomial trial $X \sim \mathrm{Bin}(n, \pi)$
A repetition of the Bernoulli trial. The probability of k successes in n repetitions:
$P(X = k) = \frac{n!}{k!(n-k)!} \pi^k (1-\pi)^{n-k} = \binom{n}{k} \pi^k (1-\pi)^{n-k}$
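As a quick numerical check of the binomial formula, the sketch below evaluates it in Python. The function name `binomial_pmf` and the example numbers are chosen here for illustration and are not part of the course material.

```python
from math import comb


def binomial_pmf(k: int, n: int, pi: float) -> float:
    """Return P(X = k) for X ~ Bin(n, pi)."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


# Exactly 3 successes in 10 trials with success probability 0.5.
p_three = binomial_pmf(3, 10, 0.5)  # 120 / 1024

# With n = 1 the binomial reduces to a single Bernoulli trial.
p_success = binomial_pmf(1, 1, 0.3)  # equals pi = 0.3
```

Summing `binomial_pmf(k, n, pi)` over k = 0, ..., n should give 1, which is a simple sanity check on any implementation.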
Hypergeometric trial $X \sim \mathrm{Hypergeometric}(N, K, n)$
The probability of drawing exactly k items with a certain feature when n items are drawn without replacement from a population of N items, K of which have the feature:
$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$
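The hypergeometric probability can be evaluated in the same way. This is a minimal sketch; the function name `hypergeom_pmf` and the card example are assumptions made for illustration.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """Return P(X = k): k featured items in n draws without replacement
    from N items of which K have the feature."""
    if k < 0 or k > K or n - k < 0 or n - k > N - K:
        return 0.0  # impossible outcome
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


# Illustrative example: exactly 1 ace in a 5-card hand from a 52-card deck.
p_one_ace = hypergeom_pmf(1, N=52, K=4, n=5)  # about 0.299
```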
Estimator An approximation of a population parameter that uses observed data (statistics).
$\mu \to \hat{\mu}$ or $\bar{x}$; $\sigma^2 \to \hat{\sigma}^2$ or $s^2$
The population parameter is often denoted using 𝜃, the sample estimate is denoted using 𝜃̂
Normal distribution $X \sim N(\mu, \sigma^2)$
A distribution with the shape of a bell curve. Usually the model parameters need to be estimated, as the population model is unknown.
Sampling distribution
Probability distribution of the sample statistic. The statistic is the random variable in the distribution. If $X \sim N(\mu, \sigma^2)$, then $\bar{x} \sim N(\mu, \frac{\sigma^2}{n})$.
The standard deviation of a sample statistic is called the standard error. It expresses the uncertainty about the statistic.
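The claim that the sample mean of normal data has standard error $\sigma/\sqrt{n}$ can be checked by simulation. The sketch below uses only the Python standard library; the parameter values (mean 10, standard deviation 2, n = 25, 5000 repetitions) are arbitrary choices for illustration.

```python
import random
import statistics


def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` samples of size n from N(mu, sigma^2); return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(reps)
    ]


means = simulate_sample_means(mu=10.0, sigma=2.0, n=25, reps=5000)
empirical_se = statistics.stdev(means)   # spread of the simulated means
theoretical_se = 2.0 / 25 ** 0.5         # sigma / sqrt(n) = 0.4
```

With enough repetitions, `empirical_se` should be close to `theoretical_se`, and the average of `means` close to 10.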
Bias, Variance, MSE
A statistic is unbiased if the mean of the sampling distribution coincides with the population parameter.
A statistic has low variance if the deviation from the mean is very small.
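$\text{Bias}^2 = (\theta - E(\hat{\theta}))^2$
$\text{Variance} = E((E(\hat{\theta}) - \hat{\theta})^2)$
$\text{Standard error} = \sqrt{\text{Variance}} = \sqrt{E((E(\hat{\theta}) - \hat{\theta})^2)}$
$\text{Mean squared error of estimation} = MSE(\hat{\theta}) = E((\theta - \hat{\theta})^2) = \text{Bias}^2 + \text{Variance}$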
Efficient estimators have a low MSE. Minimizing the MSE is difficult, because it involves a bias-variance tradeoff: a lower variance is better, but the bias is ideally equal to 0. When comparing statistics, the MSE is the criterion used most often.
The variance measures the spread of the estimates around their own mean $E(\hat{\theta})$; the MSE measures their spread around the population parameter $\theta$.
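The decomposition of the MSE into squared bias and variance follows by adding and subtracting $E(\hat{\theta})$ inside the square. A short derivation, using the same symbols as the definitions in this section:

```latex
\begin{aligned}
MSE(\hat{\theta})
 &= E\big((\theta - \hat{\theta})^2\big) \\
 &= E\big((\theta - E(\hat{\theta}) + E(\hat{\theta}) - \hat{\theta})^2\big) \\
 &= (\theta - E(\hat{\theta}))^2
    + 2\,(\theta - E(\hat{\theta}))\,E\big(E(\hat{\theta}) - \hat{\theta}\big)
    + E\big((E(\hat{\theta}) - \hat{\theta})^2\big) \\
 &= \text{Bias}^2 + 0 + \text{Variance}
\end{aligned}
```

The cross term vanishes because $\theta - E(\hat{\theta})$ is a constant and $E\big(E(\hat{\theta}) - \hat{\theta}\big) = 0$.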
Student’s t distribution $t \sim t_{n-1}$
t is the t-statistic $t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$.
This is the sampling distribution of the t-statistic when the population distribution is Normal and the variance is unknown.
$n - 1$ is the degrees of freedom; the higher the degrees of freedom, the closer the t distribution gets to the standard normal distribution $N(0, 1)$.
When the estimators are compared using the MSE, the Maximum Likelihood estimator is the more efficient one in smaller samples.
If you have 2 samples from normally distributed data with equal variance, $X_1 \sim N(\mu_1, \sigma^2)$ and $X_2 \sim N(\mu_2, \sigma^2)$:
The sampling distribution of the difference in sample means is a t distribution with $n_1 + n_2 - 2$ degrees of freedom, $t \sim t_{n_1+n_2-2}$, centered at $\mu_2 - \mu_1$.
The standard error is calculated using the pooled standard deviation $s_p$:
$SE(\bar{x}_2 - \bar{x}_1) = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$, $s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}$
$t = \frac{\bar{x}_2 - \bar{x}_1}{SE(\bar{x}_2 - \bar{x}_1)}$
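The pooled standard error and the two-sample t-statistic translate directly into code. This is a minimal sketch under the equal-variance assumption; the function name `pooled_t_statistic` and the data are made up for illustration.

```python
import math
import statistics


def pooled_t_statistic(x1, x2):
    """Return (t, SE, df) for the difference in means x2 - x1,
    assuming both samples share the same population variance."""
    n1, n2 = len(x1), len(x2)
    s1_sq = statistics.variance(x1)  # sample variance, divisor n1 - 1
    s2_sq = statistics.variance(x2)
    df = n1 + n2 - 2
    sp = math.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df)
    se = sp * math.sqrt(1 / n1 + 1 / n2)
    t = (statistics.fmean(x2) - statistics.fmean(x1)) / se
    return t, se, df


# Made-up data: both samples have variance 2.5, so s_p^2 = 2.5 and SE = 1.
t, se, df = pooled_t_statistic([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
# t is 2.0 (up to floating-point rounding) with df = 8
```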
Central Limit Theorem
If n is large enough, the sample mean of X coming from an arbitrary (possibly non-normal) distribution with mean $\mu$ and variance $\sigma^2$ approximately follows the normal distribution $\bar{x} \sim N(\mu, \frac{\sigma^2}{n})$.
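The Central Limit Theorem can be illustrated with a clearly non-normal population. The sketch below draws from an exponential distribution with rate 1 (mean 1, variance 1); the choice of distribution, n = 50 and 4000 repetitions are assumptions made for this example.

```python
import random
import statistics


def exponential_sample_means(n, reps, seed=1):
    """Means of `reps` samples of size n from Exp(1) (mean 1, variance 1)."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]


n = 50
means = exponential_sample_means(n=n, reps=4000)
se = 1.0 / n ** 0.5  # sigma / sqrt(n)

# If the normal approximation holds, about 95% of the sample means
# should fall within 1.96 standard errors of mu = 1.
share_within = sum(abs(m - 1.0) <= 1.96 * se for m in means) / len(means)
```

Repeating this with a smaller n (for example 5) should show a poorer match, since the approximation improves as n grows.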
Monte Carlo Simulation
Computer simulation A numerical technique for conducting experiments on the computer. A tool to virtually
investigate the behavior of the system
Monte Carlo Simulation Computer experiment involving random sampling from probability distributions.
Used for estimators and for hypothesis testing (in absence of analytical results)
MC simulations for estimators
An estimator or test statistic has a true sampling distribution under a particular set of conditions. We want to know
this distribution. The derivation is however not always tractable. The MC simulation can be used to approximate the
distribution.
Step 1: Create approximate sampling distribution
Generate S independent data sets of given sample size n under the conditions of interest
Compute the numerical value of the estimator/test statistic 𝜃̂ for each dataset.
Step 2: Derive bias, var, MSE, relative efficiency
If S is large enough, the summary statistics should be a good approximation to the true sampling properties
The sample median is most efficient for distributions with thick tails.
If the distribution is more similar to a normal distribution the mean is more useful.
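The two steps above can be sketched as follows. The example compares the sample mean and the sample median as estimators of the center of a distribution, once for normal data and once for Laplace data, which is used here as one example of a thicker-tailed distribution. All names, the sample size n = 30 and S = 2000 are illustrative assumptions, and the bias is computed with the sign convention $E(\hat{\theta}) - \theta$.

```python
import random
import statistics


def mc_summary(draw, estimator, theta, n, S, rng):
    """Step 1: simulate S estimates. Step 2: summarise bias, variance, MSE."""
    estimates = [estimator([draw(rng) for _ in range(n)]) for _ in range(S)]
    center = statistics.fmean(estimates)
    bias = center - theta                      # sign convention: E(est) - theta
    var = statistics.pvariance(estimates, mu=center)
    mse = statistics.fmean([(e - theta) ** 2 for e in estimates])
    return {"bias": bias, "var": var, "mse": mse}


def draw_normal(rng):
    return rng.gauss(0.0, 1.0)


def draw_laplace(rng):
    # Difference of two Exp(1) draws is Laplace(0, 1): thicker tails than normal.
    return rng.expovariate(1.0) - rng.expovariate(1.0)


rng = random.Random(42)
results = {}
for name, draw in [("normal", draw_normal), ("laplace", draw_laplace)]:
    mean_res = mc_summary(draw, statistics.fmean, 0.0, n=30, S=2000, rng=rng)
    median_res = mc_summary(draw, statistics.median, 0.0, n=30, S=2000, rng=rng)
    # Relative efficiency > 1 means the median has the lower MSE.
    results[name] = mean_res["mse"] / median_res["mse"]
```

In this setup the ratio in `results` is expected to be below 1 for the normal case and above 1 for the Laplace case, in line with the two statements above.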
MC simulations for hypothesis testing
There are two types of hypothesis testing situations:
1) Randomness ($H_0$) vs. Non-randomness ($H_1$) of data
2) No effect ($H_0$) vs. Effect ($H_1$)
t-statistic: $t_{obs} = \frac{\bar{x}_2 - \bar{x}_1}{SE(\bar{x}_2 - \bar{x}_1)}$
𝐻0 is rejected if the observed data/statistics are very unlikely under the assumption of randomness and no effect.
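One way to judge how unlikely an observed t-statistic is under $H_0$ is to simulate the statistic many times with no effect present. The sketch below is an illustration, not the course's own procedure: it assumes that under $H_0$ both groups come from the same normal population, uses the pooled t-statistic, and takes 0.05 as an example significance level.

```python
import math
import random
import statistics


def t_stat(x1, x2):
    """Pooled two-sample t-statistic for the difference in means x2 - x1."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * statistics.variance(x1)
                  + (n2 - 1) * statistics.variance(x2)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.fmean(x2) - statistics.fmean(x1)) / se


def mc_p_value(t_obs, n1, n2, S=4000, seed=7):
    """Two-sided Monte Carlo p-value: share of simulated |t| >= |t_obs|
    when both groups are drawn from the same N(0, 1) population (H0)."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(S):
        x1 = [rng.gauss(0.0, 1.0) for _ in range(n1)]
        x2 = [rng.gauss(0.0, 1.0) for _ in range(n2)]
        if abs(t_stat(x1, x2)) >= abs(t_obs):
            extreme += 1
    return extreme / S


p_value = mc_p_value(t_obs=2.0, n1=5, n2=5)
reject_h0 = p_value < 0.05  # 0.05 is an illustrative significance level
```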
Confidence intervals
Confidence intervals A confidence interval expresses sampling uncertainty. It is often reported instead of the point estimate.
It holds the true population parameter $\theta$ with a probability of C.
Two-sided t-confidence interval:
$[(\bar{x}_2 - \bar{x}_1) - t_{C; n_1+n_2-2}\, SE(\bar{x}_2 - \bar{x}_1);\ (\bar{x}_2 - \bar{x}_1) + t_{C; n_1+n_2-2}\, SE(\bar{x}_2 - \bar{x}_1)]$
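The coverage probability C can itself be checked with a Monte Carlo simulation. The sketch below builds the two-sided interval for the difference in means and counts how often it contains the true difference. It assumes C = 0.95 and n = 10 per group, so the critical value is the 0.975 quantile of the t distribution with 18 degrees of freedom, taken from a standard t table (2.101). The function names and parameter values are illustrative.

```python
import math
import random
import statistics

T_CRIT_18 = 2.101  # table value: 0.975 quantile of the t distribution, 18 df


def pooled_t_interval(x1, x2, t_crit):
    """Two-sided interval for mu2 - mu1 using the pooled standard error."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * statistics.variance(x1)
                  + (n2 - 1) * statistics.variance(x2)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = statistics.fmean(x2) - statistics.fmean(x1)
    return diff - t_crit * se, diff + t_crit * se


def coverage(mu1, mu2, sigma, n, S=2000, seed=3):
    """Share of simulated intervals that contain the true mu2 - mu1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(S):
        x1 = [rng.gauss(mu1, sigma) for _ in range(n)]
        x2 = [rng.gauss(mu2, sigma) for _ in range(n)]
        lower, upper = pooled_t_interval(x1, x2, T_CRIT_18)
        hits += lower <= (mu2 - mu1) <= upper
    return hits / S


cov = coverage(mu1=0.0, mu2=1.0, sigma=1.0, n=10)  # should be close to 0.95
```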
A two-sample Student’s t-test does rely on some assumptions: the samples must come
from a normal distribution, and the variances are equal. If these are violated it can impact the quality of the
hypothesis test. If the variances are not equal, Welch’s test applies.
Power of a test complement of Type II error
$P(\text{test correctly rejects } H_0 \mid H_1 \text{ true}) = 1 - \beta$
The probability of correctly rejecting $H_0$.
Generate data under $H_1: \mu \neq \mu_0$ and calculate the proportion of rejections.
Significance level Type I error
$P(\text{test rejects } H_0 \mid H_0 \text{ true}) = \alpha$
Generate data under $H_0: \mu = \mu_0$ and calculate how often $H_0$ is rejected; this approximates $\alpha$.
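Both quantities can be approximated with the same simulation loop, changing only the mean used to generate the data. The sketch below uses a two-sided one-sample t-test with n = 20 at the 5% level, with the critical value 2.093 taken from a standard t table (0.975 quantile, 19 degrees of freedom). The alternative $\mu = 0.5$ is one arbitrary point under $H_1$; the power depends on which alternative is chosen. All names and settings are illustrative assumptions.

```python
import math
import random
import statistics

T_CRIT_19 = 2.093  # table value: 0.975 quantile of the t distribution, 19 df


def rejection_rate(mu_true, mu0=0.0, sigma=1.0, n=20, S=4000, seed=11):
    """Share of simulated samples in which the two-sided one-sample
    t-test of H0: mu = mu0 rejects at the 5% level."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(S):
        x = [rng.gauss(mu_true, sigma) for _ in range(n)]
        t = (statistics.fmean(x) - mu0) / (statistics.stdev(x) / math.sqrt(n))
        rejections += abs(t) > T_CRIT_19
    return rejections / S


alpha_hat = rejection_rate(mu_true=0.0)   # data generated under H0
power_hat = rejection_rate(mu_true=0.5)   # data generated under one H1 value
```

`alpha_hat` should be close to the nominal 0.05, and `power_hat` should increase when the true mean moves further from $\mu_0$ or when n grows.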
1 Compare two estimators, e.g. $\hat{\theta}^{(1)}$ is the mean and $\hat{\theta}^{(2)}$ is the median