QUANTITATIVE RESEARCH METHODS (STATISTICS)
MULTIPLE LINEAR REGRESSION ANALYSIS
A. model specification
B. model fit and inference
C. goodness of fit
D. assumptions
TIME SERIES
1. cautions
2. autocorrelation
3. stationarity
4. dynamic models (missing part)
MULTIPLE LINEAR REGRESSION
A. MODEL SPECIFICATION
MULTIPLE VERSUS SIMPLE REGRESSION
- Simple linear regression: 1 independent variable x
- Multiple linear regression: k x-variables, k>1
Example Hamburger Chain
Research question: to assess the effect of different price structures and different levels of advertising expenditure on
the sales, the management sets different prices, and spends varying amounts on advertising, in different cities. Does
an increase in advertising expenditure lead to an increase in sales? If so, is the increase in sales sufficient to justify
the increased expenditure?
Random experiment: pick a random store of a chain in a random city
Y = sales: monthly sales (in 1000$)
x1 = price: ‘average’ price for products (in $)
x2 = advert: monthly advertising expenditure (in 1000$)
MULTIPLE LINEAR REGRESSION
- As in simple linear regression, the model consists of:
- A systematic part that provides us with information on how a combination of x-outcomes results in an
average value for Y: μY|x
- A random error term ε to account for the fact that Y|x is a random variable
- Graphically:
- Multiple linear regression is not represented by a line any more. It can be visualized using a (hyper)plane.
Example Hamburger Chain
Y = Sales
x1 = Price
x2 = Advert
CLASSICAL MULTIPLE LINEAR REGRESSION
- The assumptions that were introduced for simple linear regression remain. In addition, assumption A4 now makes two assumptions about the explanatory variables.
- Classical assumptions for multiple linear regression (A4 is where it differs from SLR):
A1: μY|x = β0 + β1 x1 + … + βk xk (ε has mean zero for all x)
A2: ε has constant standard deviation σ (homoskedasticity)
A3: cov(εi, εj) = cov(Yi, Yj) = 0 for i ≠ j
A4: Variables xi are non-random (which can be relaxed to the assumption that x is not correlated with the error term) and are not exact linear functions of the other explanatory variables (meaning that e.g. x1 and x2 should be different enough)
A5: (optional) ε is normally distributed
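Outside SPSS, a minimal sketch of fitting such a model in Python with statsmodels; the data are simulated here (the 'true' coefficients and value ranges are assumptions for the demo, not the course data set):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 75
price = rng.uniform(4.5, 6.5, n)            # average product price in $ (hypothetical range)
advert = rng.uniform(0.5, 3.0, n)           # monthly advertising in $1000 (hypothetical range)
eps = rng.normal(0, 4.0, n)                 # error: mean zero (A1), constant sd (A2)
sales = 110 - 8 * price + 2 * advert + eps  # assumed 'true' model for the simulation

X = sm.add_constant(np.column_stack([price, advert]))  # always include an intercept
fit = sm.OLS(sales, X).fit()
print(fit.params)                           # estimates of β0, β1, β2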
INTERPRETATION OF THE PARAMETERS
- Intercept β0: the average value for Y if all x = 0, which is often not relevant. However, except in very special cases, we always include an intercept in the model, even if it has no direct economic interpretation. Omitting it can lead to a model that fits the data poorly and does not predict well.
- Coefficients βi: a slope in the xi direction; βi measures the effect of a change in the variable xi on the expected value of Y, ceteris paribus = all other variables held constant.
As such it is linked to the partial derivative: βi = ∂μY|x / ∂xi
Example Hamburger Chain
Y = sales
x1 = price
x2 = advert
β0 : interpretation for price = 0, and advert = 0 is not realistic
β1 : the change in monthly sales (1000$) when the price index Price is increased by one unit (1$) and advertising
expenditure Advert is held constant.
MODEL SPECIFICATION
- It is important to carefully think about the regression model specification:
- What functional form? μY|x = f(x)
- Linear function versus non-linear functions
- How to account for qualitative x-variables ?
- How to account for interaction effects between x-variables ?
- Choice of explanatory variables ?
—————————————————————————————————————————————————————
NON-LINEAR MODELS
NON-LINEAR RELATIONSHIPS
- As in simple linear regression, non-linear relationships can be modeled using a multiple ‘linear’ regression
model through the use of appropriate transformations.
- You should be led by economic theory and experts, taking into account e.g. slope properties.
- Does the model provide a good fit for the data?
Example Hamburger Chain
We initially hypothesized that sales revenue is linearly related to price and advertising expenditure:
SALES = β0+β1 PRICE + β2 ADVERT
But is this a good choice? Remember that earlier we suggested that adding ADVERT² or using the logarithm of ADVERT might be a good idea.
TRANSFORMATIONS
- The logarithmic transformation is a common transformation in economic applications.
- Polynomial functions: when we studied these models with the simple regression model, we were constrained by the need to have only one right-hand-side variable, such as Y = β0 + β1 x². Now, within the framework of the multiple regression model, we can consider unconstrained polynomials with all their terms included. It is sometimes true that having a variable and its square or cube in the same model causes collinearity problems (see later)
- exam: you will be told whether a transformation is needed and which transformation to apply
- project: you will have to think about it yourself, but you can always ask the teacher if you are lost
LOG TRANSFORMATIONS
Example Hamburger Chain
Consider model with ln(Advert)
Sales = β0 + β1 Price + β2 ln(Advert) + ε
Instead of linking sales linearly to the advertising expenditure, you link sales to the ln of the advertising expenditure.
When advertising expenditure increases by 1%, sales increase on average by approximately 0.03456 (in 1000$), i.e. 34.56$, ceteris paribus.
What about log-log model?
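A worked step behind this number (assuming the estimate for β2 is about 3.456, which is what the 34.56$ figure implies):
ΔSales ≈ β2 × Δln(Advert) = β2 × ln(1.01) ≈ β2/100 ≈ 3.456/100 = 0.03456 (in 1000$) = 34.56$
In a log-log model, ln(Sales) = β0 + β1 Price + β2 ln(Advert) + ε, β2 would instead be an elasticity: a 1% increase in Advert changes Sales by approximately β2 percent.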
POLYNOMIAL MODEL
Example Hamburger Chain
Consider the quadratic model
Sales = β0 + β1 Price + β2 Advert + β3 Advert² + ε
What sign do you expect for β2, β3?
- Instead of the ln of Advert, the model takes the square of Advert.
- Take the derivative of Sales with respect to Advert.
- Depending on the level of Advert, sales increase differently.
Parabola:
- β2 (Advert) has a positive sign: an increase in advertising leads to an increase in sales.
- β3 (Advert²) has a negative sign: the effect becomes smaller as Advert increases.
When advertising is increased by 1 unit ($1000), this does not always have the same effect on the Sales. In
this case the effect is positive, but the effect becomes smaller as Advert increases
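A worked step (taking the derivative of the model above with respect to Advert):
∂E(Sales)/∂Advert = β2 + 2 β3 Advert
With β2 > 0 and β3 < 0, this marginal effect is positive for small Advert, decreases as Advert grows, and reaches zero at Advert = -β2/(2β3).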
—————————————————————————————————————————————————————
DUMMY VARIABLES
- Variables with only 2 outcomes are called indicator variables = dummy variables
- Usually the 2 outcomes are coded by 1 or 0, to indicate the presence or absence of a characteristic or to
indicate whether a condition is true or false.
- The value D = 0 defines the reference group of elements for which the characteristic is not present:
D = 1 if characteristic is present
D = 0 if characteristic is not present
Example Price House
Reference group: houses not in the desirable neighbourhood.
D = 1 if property is in desirable neighbourhood
D = 0 if property is not in desirable neighbourhood
QUALITATIVE VARIABLES
- Dummy variables are used to account for qualitative factors in econometric models.
- Even if numbers are used to code the outcomes of qualitative factors, do NOT use these codes as such in
the regression model. Introduce dummy variables!
QUALITATIVE VARIABLES WITH 2 OUTCOMES
Example Price Houses
Y = price
x1 = SQFT = area measured in square feet
x2 = variable indicating whether house in desirable neighbourhood => create dummy variable D
PRICE = β0 + β1 SQFT + β2 D + ε
INTERPRET COEFFICIENT LINKED TO DUMMY
- Write down separate regression models for the outcomes of the qualitative variable.
Example Price Houses
PRICE = β0 + β1 SQFT + β2 D + ε
D = 1 (desirable neighbourhood):
predicted PRICE = 20.543 + 50.058 + 0.123 SQFT
D = 0 (not desirable neighbourhood):
predicted PRICE = 20.543 + 0.123 SQFT
A house in the desirable neighbourhood will have a price that is on average 50.058 units higher than a house
with the same SQFT which is not in the desirable neighbourhood (reference group: D = 0), ceteris paribus.
- In general: Y = β0 + β1 x1 + … + βi D + … + βK xK + ε
- Interpretation βi ? Write down separate regression models for the outcomes of the qualitative variable
- In general: if D=1, then the value of Y is on average βi units larger compared to the reference group, ceteris
paribus.
- Graphically:
- Adding the dummy variable D to a simple regression model causes a parallel shift in the relationship
by the amount β2
Example Price Houses
This is the conclusion we draw at this point, but it will change later on.
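A minimal Python sketch of this dummy-variable model; the data are simulated, reusing the course estimates (20.543, 0.123, 50.058) as the assumed true coefficients:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
sqft = rng.uniform(800, 3500, n)   # area in square feet (hypothetical range)
d = rng.integers(0, 2, n)          # D = 1 desirable neighbourhood, D = 0 reference group
price = 20.543 + 0.123 * sqft + 50.058 * d + rng.normal(0, 15, n)

X = sm.add_constant(np.column_stack([sqft, d]))
fit = sm.OLS(price, X).fit()
print(fit.params)                  # roughly [20.5, 0.123, 50.1]; the D coefficient is the parallel shift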
QUALITATIVE RANDOM VARIABLE WITH SEVERAL CATEGORIES
- If a qualitative variable has M>2 outcomes, one has to introduce M-1 dummy variables
Example Test Results
Y = test score
x1 = study time in hours
x2 = highest diploma (1=master, 2=bachelor, 3=high school)
DB = 1 if bachelor, 0 else
DM = 1 if master, 0 else
High school = reference group
SPSS: new variable DM: Transform - Recode into Different Variables; then do the same for DB
Coding of the dummies:
- Master degree: DM = 1, DB = 0
- Bachelor degree: DM = 0, DB = 1
- High school degree: DM = 0, DB = 0 (reference group = double 0)
x2 and the two dummy columns give the same information; we will use the dummy columns in our model.
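Outside SPSS, the same recoding can be sketched in Python with pandas (the column and category names are hypothetical):

import pandas as pd

df = pd.DataFrame({"diploma": ["master", "bachelor", "high school",
                               "bachelor", "master", "high school"]})
# M = 3 categories -> M - 1 = 2 dummy variables
dummies = pd.get_dummies(df["diploma"], prefix="D")
dummies = dummies.drop(columns=["D_high school"])  # drop the reference group (high school)
print(dummies)                                     # columns D_bachelor (DB) and D_master (DM)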
INTERPRET COEFFICIENTS OF DUMMIES
- Write down separate regression models for the outcomes of the qualitative variable.
Example Test Results
- Interpretation: If Di = 1, then Y increases on average by βi units in group i compared to reference group,
ceteris paribus.
- Graphically:
- Adding a qualitative effect to a simple regression model results in parallel lines
Example Test Results
Maybe not realistic. If the test is about basic material, someone with a Master's degree has already learned that material more or less, so when he/she studies, his/her test score will not go up much any more, whereas for someone with a high school degree studying the same material has a larger effect. This is an example of an additional hour of study not having the same effect for all diplomas.
INTERACTION VARIABLES
INTERACTION
Example Price House
Y = price
x1 = SQFT = area (square feet)
x2 = D = dummy variable: 1 if desirable neighbourhood, 0 otherwise
The effect of the area of the house is not the same in the different neighbourhoods.
SPSS: Graphs - Chart Builder - Scatter/Dot - Grouped Scatter - x-axis: SQFT - y-axis: price - set color: D - double click on image - Elements - lines for subgroups
[Grouped scatter plot: price versus SQFT, with separate fitted lines for the desirable and undesirable neighbourhoods]
- What do we see? The slope for houses in a desirable neighbourhood is larger than the one for houses in an undesirable neighbourhood. If we increase the square footage by a certain amount, it thus has a larger effect on the price in a desirable neighbourhood than in an undesirable neighbourhood.
- When the total influence of 2 explanatory variables on Y is not just the sum of the 2 separate effects, but
the effect of one variable is affected by another, there is said to be an interaction effect.
- It can occur with any 2 explanatory variables, but it rarely occurs when 2 quantitative variables are
involved. A quantitative and a qualitative variable OR two qualitative variables will interact more often.
Example Price House
- Suppose the effect of the area of the house is not the same in the different neighbourhoods. That is, there is an interaction effect between SQFT and D.
INTERACTION EFFECT IN REGRESSION MODEL
- This can be accomplished by adding another explanatory variable in the model that consists of the
product of the 2 interacting variables
SPSS: Transform - compute variable
Example Price Houses
- The interaction effect between SQFT and D can be taken up in the model by adding a variable which is
the product of both variables.
PRICE = β0 + β1 SQFT + β2 D + β3 (SQFT x D) + ε
Y = Price
SQFT = square feet
D = 1 if in desirable neighbourhood
D = 0 if not in desirable neighbourhood (reference group)
INTERACTION BETWEEN QUANTITATIVE AND QUALITATIVE
- Writing down the model for the different dummy outcomes also helps to interpret the parameters.
Example Price Houses: see the worked equations below.
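Worked equations for the two regimes of PRICE = β0 + β1 SQFT + β2 D + β3 (SQFT x D) + ε:
D = 1 (desirable): E(PRICE) = (β0 + β2) + (β1 + β3) SQFT
D = 0 (reference): E(PRICE) = β0 + β1 SQFT
So β2 is the intercept difference and β3 is the slope difference between the two neighbourhoods.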
- In this case the interaction variable is also called a slope-indicator variable or a slope dummy variable.
- Examining the regression function for the different dummy outcomes illustrates best the effect of the
slope dummy graphically.
Example Price Houses
Example Test Results
SPSS: Transform - Compute Variable (twice, once for each interaction variable)
INTERACTION BETWEEN 2 QUALITATIVE VARIABLES
- Examining the regression function for the different dummy outcomes illustrates best the effect of the
variables. It helps to interpret the parameters.
Example Wages
- Holding the effect of education constant, we estimate that black males earn $4.17 per hour less than white males, white females earn $4.78 less than white males, and black females earn $5.11 less than white males.
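These numbers are consistent with the usual specification with an interaction between the two dummies (the model form and the interaction coefficient below are implied by the reported differences, not stated in the notes):
WAGE = β0 + β1 EDUC + δ1 BLACK + δ2 FEMALE + γ (BLACK x FEMALE) + ε
Black males: δ1 ≈ -4.17; white females: δ2 ≈ -4.78; black females: δ1 + δ2 + γ ≈ -4.17 - 4.78 + 3.84 = -5.11, so the implied interaction coefficient is γ ≈ +3.84.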
INTERACTION BETWEEN 2 QUANTITATIVE VARIABLES
Example Pizza
Pizza = annual expenditure on pizza ($)
Age = age in years
Income = income (in 1000$)
For a linear model:
predicted PIZZA = 342.88 - 7.576 AGE + 1.832 INCOME
Marginal propensity to spend on pizza = 1.832, that is, for a given age, the expected expenditure increases by 1.832$ with an additional income of 1000$.
BUT: Is it reasonable to expect that this marginal propensity does not depend on age?
It seems more reasonable to assume that as a person ages, less of each extra dollar is expected to be
spent.
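One way to capture this, in line with the interaction models above (a suggested specification, not estimated here):
PIZZA = β0 + β1 AGE + β2 INCOME + β3 (AGE x INCOME) + ε
∂E(PIZZA)/∂INCOME = β2 + β3 AGE
With β3 < 0, the marginal propensity to spend on pizza declines as AGE increases.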
—————————————————————————————————————————————————————
CHOICE OF VARIABLES
WHICH INDEPENDENT VARIABLES?
- You should be led by economic theory and experts. But your choice will also depend on prior choices, e.g. demand depends on prices of complements and substitutes: which ones do you include?
- You can make two errors:
- Taking up an irrelevant variable in the model.
- Not taking up a relevant x-variable in the model, which is called an omitted variable.
ILLUSTRATION OMITTED VARIABLE
Example Female Labor Force Participation
Y = yearly family income
x1 = husband’s years of education (HE)
x2 = wife’s years of education (WE)
Literature: wife’s years of education is relevant in explaining the household’s income.
- Model 1: FAMINC = β0 + β1 HE + ε (WE is an omitted variable)
- Model 2: FAMINC = β0 + β1 HE + β2 WE + ε
- The magnitude of the coefficient of HE changes: B1 is biased in model 1 (the model with the omitted variable).
- Remember:
- if model specification correct and assumptions hold, then estimator B1 is unbiased: E(B1)=β1
(on average you are estimating the correct thing)
- now the model specification of model 1 is not good: the bias E(B1) - β1 ≠ 0 if WE is omitted
(the difference, i.e. the bias, is not zero)
- The sign of the bias is positive (E(B1) - β1 > 0) if WE is omitted: the effect of HE is overestimated if WE is omitted (we systematically overestimate).
- Part of the effect of WE is taken over by HE. Since there is a positive correlation between HE and WE
and the effect of WE is positive, the bias of B1 is positive.
Another Illustration Omitted Variables
Y = yearly family income
x1 = husband’s years of education (HE)
x2 = wife’s years of education (WE)
x3 = nr of children < 6 years old (KL6)
Literature: number of children is relevant in explaining the income
- Model 1: FAMINC= β0 + β1 HE + β2 WE + ε ; KL6 omitted
- Model 2: FAMINC = β0 + β1 HE + β2 WE + β3 KL6 + ε
- Now the coefficients of HE and WE do not change a lot, if the significant variable KL6 is omitted. This is
so because KL6 is not highly correlated with the education variables.
CONSEQUENCES
- True model: Y = β0 + β1 x1 + β2 x2 + ε, but the variable x2 is omitted, so the term β2 x2 ends up in the error term.
- That is, the omitted variable is taken up in the error term.
- But the other variables are usually not independent from the omitted variable. So assumption A4 (no
correlation between x and the error term) is violated. As such the estimators are not BLUE. (U = unbiased)
- The estimators are not BLUE: more specifically:
- Bias (model is suspect), since part of the effect of the omitted variable on Y, is assigned to the
variables that are in the model.
- Bias gets worse if the omitted variable xom is more strongly correlated with the variables in the model
- Sign of the bias of Bin (the estimator of the coefficient βin linked to a variable xin still in the model) can be determined by:
sign(bias(Bin)) = sign(βom) x sign(corr(xin, xom))
where βom is the coefficient of the omitted variable and corr(xin, xom) is the correlation between the variable still in the model and the omitted variable.
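A small simulation sketch of this sign rule (all numbers are assumptions for the demo): here xom is positively correlated with xin and βom > 0, so Bin should be biased upward.

import numpy as np

rng = np.random.default_rng(7)
n, reps = 500, 2000
beta_in, beta_om = 1.0, 2.0                    # assumed true coefficients
estimates = []
for _ in range(reps):
    x_in = rng.normal(0, 1, n)
    x_om = 0.6 * x_in + rng.normal(0, 1, n)    # positively correlated with x_in
    y = beta_in * x_in + beta_om * x_om + rng.normal(0, 1, n)
    # regress y on x_in only (x_om omitted): OLS slope = cov(x_in, y) / var(x_in)
    estimates.append(np.cov(x_in, y)[0, 1] / np.var(x_in, ddof=1))
print(np.mean(estimates))                      # about 2.2 > 1.0: positive bias, as predicted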
DETECTION
- Hard:
- especially if small bias
- moreover: which variable to add?
- It might be worth considering an omitted variable if the resulting model does not show the expected behaviour (e.g. wrong sign)