This document contains lecture notes from the Statistics II: Applied Quantitative Analysis course, which is mandatory for all International Relations and Organizations students.
I. COMPARING TWO MEANS: Steps of statistical inference
1. Hypothesis
a. Null hypothesis: ∆= 0
b. Alternative hypothesis: ∆≠ 0
2. Test statistic
a. T-test: $t = \hat{\Delta} / se(\hat{\Delta})$; in this example $\hat{t} = 3.45$
3. Sampling distribution of the test statistic
a. T-distribution with 11202 ($n_{treatment} + n_{control} - 2$ groups) degrees of freedom
4. Look up/calculate the p-value for $\hat{t} = 3.45$; $df = 11202$
a. p = 0.0006
5. Conclusion
a. Reject the null hypothesis at the 5% significance level (because p < 0.05)
b. Earnings of those who followed the training program differ from the earnings of those who did not
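The p-value in step 4 can be approximated without a table. The sketch below is a minimal illustration, assuming that with 11202 degrees of freedom the t-distribution is close enough to the standard normal for a normal approximation to be acceptable. The function name is illustrative and not part of the course material.

```python
import math

def two_sided_p_normal(t_stat: float) -> float:
    """Two-sided p-value under a standard normal approximation.

    Assumption: df is large (here 11202), so the t-distribution is
    treated as approximately standard normal.
    """
    # P(|Z| > |t|) = erfc(|t| / sqrt(2)) for a standard normal Z
    return math.erfc(abs(t_stat) / math.sqrt(2))

p_value = two_sided_p_normal(3.45)
reject_null = p_value < 0.05  # 5% significance level
print(f"p = {p_value:.4f}, reject H0: {reject_null}")
```

For small samples this approximation would not be appropriate, and the exact t-distribution should be used instead.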
II. ANOVA: Comparing more than two means
• If we want to compare more than two means, we cannot use a simple t-test
• ANOVA considers the differences between groups and the differences within groups
EXAMPLE: Is there a statistically significant difference between number of TV appearances for MPs of different parties?
Figure 1. Number of TV show entries
Figure 2. Total sum of squares ($SS_T$) | $SS_T = SS_M + SS_R$
$SS_T = \sum_{i=1}^{N} (x_i - \bar{x}_{grand})^2$
$SS_M$ answers the question: Which part of the total sum of squares can we explain by using the group means?
$SS_R$ answers the question: Which part of the total sum of squares cannot be explained by using the group means?
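The split of the total sum of squares into a model part and a residual part can be checked numerically. The sketch below is a minimal illustration. The counts are invented (they are not the lecture's data) and were chosen so that the resulting sums of squares match the values quoted in these notes; the party names and variable names are hypothetical.

```python
# Hypothetical data: invented counts for 3 parties with 3 MPs each.
# These are NOT the lecture's data; they were chosen so that the sums of
# squares match the values quoted in the notes (22.89 and 8.67).
groups = {
    "party_a": [0, 1, 3],
    "party_b": [1, 2, 3],
    "party_c": [4, 5, 6],
}

def mean(values):
    return sum(values) / len(values)

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

# Total: each observation against the grand mean
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# Model: each group mean against the grand mean, weighted by group size
ss_model = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())

# Residual: each observation against its own group mean
ss_resid = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

print(f"SS_T = {ss_total:.2f}, SS_M = {ss_model:.2f}, SS_R = {ss_resid:.2f}")
```

The model and residual parts add up to the total, which is the identity stated in the caption of Figure 2.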
Mean squares
• The model sum of squares ($SS_M$) is based on the difference between the 3 group means and the grand mean.
o The degrees of freedom is the number of groups minus 1 (for the grand mean)
$MS_M = SS_M / df_M = 22.89 / 2 = 11.44$
$df_M = 3 - 1 = 2$
• The residual sum of squares ($SS_R$) is based on the difference between each value and its group mean
o The degrees of freedom is the number of observations minus the number of groups
$MS_R = SS_R / df_R = 8.67 / 6 = 1.44$
$df_R = 9 - 3 = 6$
F statistic
• The ratio between the variance explained by the model ($MS_M$) and the variance NOT explained by the model ($MS_R$)
• If $F > 1$, the model can explain more than what it leaves unexplained
$F = MS_M / MS_R = 11.44 / 1.44 = 7.92$ (7.92 follows from the unrounded mean squares, 11.445 / 1.445)
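The arithmetic from sums of squares to the F statistic can be sketched as follows. This is a minimal illustration using only the numbers reported in these notes; the variable names are illustrative.

```python
# Sums of squares and design as reported in the notes
ss_model, ss_resid = 22.89, 8.67
n_groups, n_obs = 3, 9

df_model = n_groups - 1        # groups minus 1
df_resid = n_obs - n_groups    # observations minus groups

ms_model = ss_model / df_model
ms_resid = ss_resid / df_resid
f_stat = ms_model / ms_resid

print(f"MS_M = {ms_model:.3f}, MS_R = {ms_resid:.3f}, F = {f_stat:.2f}")
```

Dividing the unrounded mean squares is what yields 7.92; dividing the two-decimal values 11.44 and 1.44 would give a slightly different result.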
Inference: conclusion about population
Null hypothesis: the mean of all groups is the same
We compare this score for the F-test to the F-distribution.
This distribution has two sets of degrees of freedom: $df_M$ and $df_R$. Here: 2 and 6.
The critical value for a significance level (α-level) of 0.05 with 2 and 6 degrees of freedom is 5.14.
$F_{observed}$ compared to $F_{critical}$
• The observed value of F ($F_{observed} = 7.92$) is greater than the corresponding critical value ($F_{critical} = 5.14$)
• Therefore, we reject the null hypothesis (null hypothesis: the mean of all groups is the same)
Reporting: There was a statistically significant difference (at the 5% level) between parties in terms of the average number of TV show entries by their politicians, F(2, 6) = 7.92, p = 0.021.
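The reported p-value and critical value can be reproduced without statistical software in this particular case. The sketch below is a minimal illustration, relying on the fact that an F-distribution with exactly 2 numerator degrees of freedom has a closed-form upper tail; it does not generalise to other numerator degrees of freedom. The function names are illustrative.

```python
def f_sf_df1_2(f_value: float, df2: int) -> float:
    """P(F > f_value) for an F-distribution with (2, df2) degrees of freedom.

    Closed form valid only when the numerator df equals 2.
    """
    return (1 + 2 * f_value / df2) ** (-df2 / 2)

def f_critical_df1_2(alpha: float, df2: int) -> float:
    """Critical value with upper-tail area alpha, again only for df1 = 2."""
    return (df2 / 2) * (alpha ** (-2 / df2) - 1)

p_value = f_sf_df1_2(7.92, 6)
critical = f_critical_df1_2(0.05, 6)
print(f"p = {p_value:.3f}, critical value = {critical:.2f}")
```

For designs with more than three groups, the tail probability has no such simple form and is normally taken from a table or from statistical software.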
REGRESSION ANALYSIS
Why do we use regression for statistical inference?
• To express uncertainty about our conclusions regarding the relation between 2 concepts
• To assess the strength of a relation
• To understand the population (based on a sample)
Why regression?
• What if we are not just interested in the difference between two means, but in how the mean values of a variable change as another variable changes?
• Example: Have available incomes increased in rich and poor countries, or have poor countries remained poor?
• How can we describe the strength of this association? Correlation? r = 0.961
Regression is related to correlation
• But regression can assess the impact of several independent variables on one specific dependent variable
o Not just strength of the association, but size of the effect: the expected change in Y as a result of a 1-unit change in X
• By assuming a linear association exists
• Regression can assess the null hypothesis: incomes are unrelated to incomes in the past
EXAMPLE: What is the relationship between the number of seats a party has in parliament and the number of motions it tables?
‘Line of best fit’
• Minimizing the distances between points and the line; your best guess given the data available
REGRESSION EQUATION: $y = a + bx$
• Intercept (constant): a; if the number of seats is 0, how many motions can we expect (according to the model)?
• Slope: b; if the number of seats increases by 1, what is the expected change in the number of motions (according to the model)?
[Figure panels: Intercept | Slope]
• If a party has 30 seats, how many motions can we expect?
o $Motions = a + b \cdot seats$
o $Motions = 38.11 + 7.17 \cdot seats$
o $\widehat{Motions} = 38.11 + 7.17 \cdot 30 \approx 253.3$
• We often use $b_0$ and $b_1$ instead of $a$ and $b$
o $y_i = b_0 + b_1 x_i$
o The subscript $i$ stands for the number of the observation,
$y_1$ is the value of the response variable $y$ for the first observation in the dataset,
$y_i$ is the value of the response variable $y$ for any observation $i$ in the dataset.
ERROR: Some observations are not on the regression line, so there is error! All models are wrong
Including error in the equation
• $y_i = b_0 + b_1 x_i + \varepsilon_i$ | All models are wrong, but we make assumptions about error (e.g. it is random for all cases)
• $E[y_i | x_i] = b_0 + b_1 x_i$ | That's why we work with the expected value of $y_i$ given a value of $x_i$
HOW DO WE DRAW THE REGRESSION LINE?
• Ordinary Least Squares: Minimizes the residual sum of squares; a residual is the difference between a data point and the regression line
• Squaring these residuals gives us squared residuals, or squares; the sum of the squared residuals is $SS_R = 24680.2$
• The regression line is chosen in such a way that the residual sum of squares is as small as possible (hence 'least squares')
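The idea that the least-squares line has the smallest residual sum of squares can be sketched as follows. This is a minimal illustration with invented data (not the lecture's dataset). The closed-form slope formula used here is the standard one for a single explanatory variable and is not spelled out in these notes; all function and variable names are hypothetical.

```python
# Hypothetical data (not from the lecture): seats and motions for 5 parties
seats = [5, 10, 20, 30, 40]
motions = [80, 100, 190, 250, 330]

def residual_sum_of_squares(b0, b1, xs, ys):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def ols_fit(xs, ys):
    """Closed-form least-squares estimates for one explanatory variable."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = ols_fit(seats, motions)
ss_ols = residual_sum_of_squares(b0, b1, seats, motions)

# Any other line should leave a larger residual sum of squares
ss_steeper = residual_sum_of_squares(b0, b1 + 0.5, seats, motions)
ss_shifted = residual_sum_of_squares(b0 + 10, b1, seats, motions)
print(ss_ols, ss_steeper, ss_shifted)
```

Changing either the slope or the intercept away from the fitted values increases the residual sum of squares, which is what "least squares" refers to.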
Calculating the regression line
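• $SS_R = \sum (y_i - \hat{y}_i)^2$
• $SS_R = \sum (y_i - b_0 - b_1 x_i)^2$
• $\hat{y}_i = b_0 + b_1 x_i$; $\hat{y}_i$ refers to the predicted value of y according to the regression model
$\hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}$
$\hat{b}_0 = 199.5 - 7.17 \cdot 22.5 = 38.17$
(The seats example earlier in these notes uses 38.11; the small difference is consistent with the slope being rounded to 7.17 here.)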
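The intercept calculation, and the prediction it leads to, can be sketched as follows. This is a minimal illustration using the means and the rounded slope quoted in these notes; reading 199.5 as the mean number of motions and 22.5 as the mean number of seats is an assumption based on the seats and motions example, and the function name is hypothetical.

```python
# Values quoted in the notes (interpretation of the means is an assumption)
y_bar = 199.5   # mean of the response variable (motions)
x_bar = 22.5    # mean of the explanatory variable (seats)
b1 = 7.17       # estimated slope (rounded)

# Intercept: the least-squares line passes through (x_bar, y_bar)
b0 = y_bar - b1 * x_bar

def predict_motions(seats: float) -> float:
    """Expected number of motions for a party with the given seats."""
    return b0 + b1 * seats

print(f"b0 = {b0:.3f}, prediction for 30 seats = {predict_motions(30):.1f}")
```

Because the slope is rounded, results computed this way can differ in the last digit from figures obtained with the unrounded estimates.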
Multiple explanatory variables: If you have more than one explanatory variable in your model, you can still calculate the 'least squares' solution; this is what SPSS is for!
Regression: Key assumptions
1. It makes sense to treat the relationship between $E[y_i | x_i]$ and the x variable as linear and additive
2. $E[\varepsilon_i | x_i] = 0$: error exists but is assumed to be random, so it is not relevant for estimating point values
$y_i = b_0 + b_1 x_i + \varepsilon_i$
$E[y_i | x_i] = b_0 + b_1 x_i$
What variables are suitable for regression?
• Dependent variable: Interval-ratio scale response variables
o Must have the same substantive meaning anywhere on the scale, e.g. profit, GDP
• Otherwise, modification is needed:
o Nominal/Ordinal scale: Logistic regression (e.g. blue/brown; agree/strongly agree)
o Count scale (non-negative integers, e.g. war casualties): Poisson and negative binomial regression models; NOT in this course
• Explanatory variables can be of any type (with modification)
• Variable values must vary (variance cannot be zero)
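Why a variable with zero variance cannot be used can be sketched as follows. This is a minimal illustration, assuming the standard closed-form slope for a single explanatory variable (which is not spelled out in these notes); the data and the function name are hypothetical.

```python
def ols_slope(xs, ys):
    """Least-squares slope for one explanatory variable.

    Raises ValueError when x has zero variance, because the slope
    formula would divide by zero.
    """
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("explanatory variable does not vary")
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / s_xx

# Hypothetical values: every party has the same number of seats
try:
    ols_slope([10, 10, 10], [50, 80, 65])
    slope_defined = True
except ValueError:
    slope_defined = False
print(slope_defined)
```

If every case has the same x value, there is no change in x from which a change in y could be estimated, so no slope can be computed.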