QBUS2810: Statistical Modelling for Business
Assignment Task #3
Submission Due Date: Sunday, 22nd November, 2020 (Week 12) before 11:59 pm (Sydney time)
You are required to type up your entire assignment, including any equations. Copy and paste relevant
outputs into your text. If you are using Word, you should use the equation editor for any maths notation.
You should attach relevant analysis outputs (graphs, tables, etc.) while discussing your answer in the text.
Please answer all questions in the given order; i.e., 1a, 1b, etc. You do not need to re-write the assignment
questions again. Keep your answers clear, brief, and concise.
There is no requirement for font size and line spacing, but it must be legible and correctly oriented.
Please convert and submit your assignment in pdf, which must be uploaded to the Turnitin assignment
box on Canvas.
For hypothesis test questions, use the p-value approach. Your answer should include the hypotheses
(H0 and H1), decision, and conclusion.
Data used in this assignment are in the spreadsheet A3Dataset.xlsx.
You are encouraged to discuss the assignment with your classmates, tutors, and lecturer. However, you
MUST write up solutions on your own. Students caught cheating will automatically receive a mark of 0
and are subject to disciplinary action.
The capital asset pricing model (CAPM) is used in finance to determine a theoretically appropriate
required rate of return of an asset, where that asset is to be added to an already well-diversified
portfolio, given that asset’s non-diversifiable risk. Traditionally, applications of the CAPM use only
one variable to describe the returns of a portfolio or stock with the returns of the market as a whole:
$$r_{stock} - r_f = \alpha_{stock} + \beta_{stock}(r_m - r_f) + u_t$$
In contrast, the Fama-French model uses three variables:
$$r_{stock} - r_f = \alpha_{stock} + \beta_{stock}(r_m - r_f) + \beta_2 SMB + \beta_3 HML + \varepsilon_t$$
rstock is the stock’s rate of return, rf is the risk-free return rate, and rm is the return of the
whole stock market. The parameter αstock is the stock’s "alpha". It measures how much the stock
outperforms its "theoretical" predicted returns under the CAPM. The parameter βstock is the stock’s "beta",
which measures the stock’s exposure to the overall market. Different stocks will have different alphas and betas.
The Fama-French model contains two additional factors to explain stock returns. Small market capitalization Minus Big (SMB) measures the historic excess returns of small cap stocks over big caps.
High book-to-market ratio (BtM) Minus Low book-to-market ratio (HML) measures the historic
excess returns of value stocks (high BtM ratio) over growth stocks (low BtM ratio). These factors
are calculated with combinations of portfolios composed of ranked stocks (BtM ranking, Capitalisation ranking) and available historical market data. Historical values are available on Kenneth
French’s web page for American stocks.
The variables used in this exercise are as follows:
rBHP = Monthly return on BHP stock as observed on the ASX.
rm = Monthly return on market index, here the All Ordinaries Index (AOI).
SMB = Small market capitalization Minus Big market capitalization factor.
HML = High book-to-market ratio Minus Low book-to-market ratio factor.
You are to assume a risk-free rate of rf = 0.005 per month. Your task is to estimate the Fama-French
three-factor model using the given data and determine whether it is any better at explaining the
BHP stock returns compared to the market excess returns given by only the All Ordinaries Index.
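The two models can be sketched in code as follows. This is a minimal illustration, not a required method: the returns are synthetic stand-ins generated in the script (the real values are in A3Dataset.xlsx), and the variable names, the helper `ols`, and the coefficients used to simulate the data are all assumptions. Only rf = 0.005 comes from the task.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120        # hypothetical number of monthly observations
rf = 0.005     # risk-free rate per month, as stated in the task

# Synthetic stand-ins for the spreadsheet columns (NOT the A3Dataset.xlsx values).
rm = rng.normal(0.008, 0.04, n)
smb = rng.normal(0.0, 0.02, n)
hml = rng.normal(0.0, 0.02, n)
r_bhp = rf + 0.002 + 1.2 * (rm - rf) + 0.3 * smb - 0.1 * hml + rng.normal(0, 0.03, n)

y = r_bhp - rf       # excess return on the stock
x_mkt = rm - rf      # excess return on the market


def ols(X, y):
    """Return the OLS coefficients and the residual sum of squares."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return b, float(resid @ resid)


X_capm = np.column_stack([np.ones(n), x_mkt])            # one-factor CAPM design
X_ff3 = np.column_stack([np.ones(n), x_mkt, smb, hml])   # three-factor design

b_capm, rss_capm = ols(X_capm, y)
b_ff3, rss_ff3 = ols(X_ff3, y)

print("CAPM coefficients:", b_capm, "RSS:", rss_capm)
print("FF3 coefficients: ", b_ff3, "RSS:", rss_ff3)
```

Because the one-factor model is nested in the three-factor model, the three-factor RSS can never be larger. Whether the reduction is statistically meaningful is what a formal test has to decide.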
(a) Write down the five-number summary plus the mean, standard deviation, skewness, and kurtosis
coefficients of rBHP.
(b) Plot and comment on the rBHP series over time.
(c) Generate two new variables rBHP − rf and rm − rf and estimate the one-factor CAPM model:
$$r_{stock} - r_f = \beta_0 + \beta_1(r_m - r_f) + u_t$$
Copy and paste the regression output into your answer sheet. Write down the fitted regression equation.
(d) Comment on the sign of the estimated coefficient β1 and state whether this is what you expect.
(e) Test whether or not the excess market returns explain the excess returns of BHP shares at the
α = 0.05 level.
(f) Test whether or not BHP’s "beta" is greater than one at the α = 0.05 level.
(g) Estimate the Fama-French 3-Factor CAPM model:
$$r_{stock} - r_f = \beta_0 + \beta_1(r_m - r_f) + \beta_2 SMB + \beta_3 HML + \varepsilon_t$$
Copy and paste the regression output into your answer sheet. Write down the fitted regression equation.
(h) Set up the general linear hypothesis for testing whether or not the Fama-French 3-Factor CAPM
model explains the stock returns better than the one-factor CAPM model; i.e., determine L,
β, and c for H0: Lβ = c.
(i) Conduct a hypothesis test for part (h).
(j) A financial analyst believes that the effect of book-to-market values (HML) on stock returns is
twice as great as the effect of market capitalization (SMB). Formulate an appropriate hypothesis
test and use re-parametrisation to convert it to a simple t-test to test the assertion. Perform
the required regression and state your conclusion at the α = 0.05 level.
(k) Obtain the variance-covariance matrix of the parameter estimators in the regression model in
part (g). Use the regression result in part (g) and the variance-covariance matrix to repeat
the hypothesis test in part (j) by means of a simple t-test.
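The mechanics behind a general linear hypothesis H0: Lβ = c, and behind the two routes to a single-restriction t-test (re-parametrisation versus the variance-covariance matrix), can be checked numerically. The sketch below uses synthetic data, and the helper `fit`, the simulated coefficients, and the particular L, c, and restriction vector `a` are illustrative assumptions rather than a prescribed answer. It shows only that the different computational routes agree.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 120
# Synthetic regressors standing in for (rm - rf), SMB and HML; not the assignment data.
x = rng.normal(0.0, 0.04, n)
smb = rng.normal(0.0, 0.02, n)
hml = rng.normal(0.0, 0.02, n)
y = 0.001 + 1.1 * x + 0.2 * smb + 0.4 * hml + rng.normal(0, 0.03, n)


def fit(X, y):
    """OLS fit returning coefficients, RSS, and the estimated var-cov matrix of b."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    rss = float(resid @ resid)
    s2 = rss / (X.shape[0] - X.shape[1])
    return b, rss, s2 * XtX_inv


X = np.column_stack([np.ones(n), x, smb, hml])
b, rss_u, V = fit(X, y)
p = X.shape[1]

# Joint restriction beta2 = beta3 = 0 written as L beta = c (illustrative choice).
L = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
c = np.zeros(2)
q = L.shape[0]
d = L @ b - c
F_wald = float(d @ np.linalg.solve(L @ V @ L.T, d)) / q

# The same F statistic from restricted (one-factor) versus unrestricted RSS.
_, rss_r, _ = fit(X[:, :2], y)
F_rss = ((rss_r - rss_u) / q) / (rss_u / (n - p))

# Single restriction beta3 - 2*beta2 = 0 via the var-cov matrix.
a = np.array([0.0, 0.0, -2.0, 1.0])
t_vcov = float(a @ b) / float(np.sqrt(a @ V @ a))

# Re-parametrisation: theta = beta3 - 2*beta2, so the SMB regressor becomes smb + 2*hml
# and theta is the coefficient on hml.
X_rep = np.column_stack([np.ones(n), x, smb + 2.0 * hml, hml])
b_rep, _, V_rep = fit(X_rep, y)
t_rep = float(b_rep[3] / np.sqrt(V_rep[3, 3]))

print(F_wald, F_rss)
print(t_vcov, t_rep)
```

The p-values are left out deliberately: they require the F and t distribution functions from a statistics package, and the point here is only the equivalence of the statistics.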
The marketing manager of a company producing a new cereal aimed at children wants to examine
the effect of the colour and shape of the box’s logo on the approval rating of the cereal. He combined 4 colours
and 2 shapes to produce a total of 8 designs. Each logo was presented to 2 different groups (a total
of 16 groups of children) and the approval rating for each was recorded and is shown below.
Shape     Red       Green     Blue      Yellow
Circle    52, 44    67, 61    36, 44    45, 41
Square    34, 36    56, 58    36, 31    21, 25
(a) How many factors does this experiment have? Identify the factors and state how many levels
each factor has.
(b) If all combinations are compared, how many different treatments (cells) are there in the experiment? What is the response variable?
(c) Consider the following regression model:
$$Y = \beta_0 + \beta_1 C + \beta_2 R + \beta_3 G + \beta_4 B + \beta_5 CR + \beta_6 CG + \beta_7 CB + \varepsilon$$
where C = 1 if shape = circle; 0 otherwise. R = 1 if color = red; 0 otherwise. G = 1 if color
= green; 0 otherwise. B = 1 if color = blue; 0 otherwise.
Use the regression parameters to recover the cell means µij and fill in the following table:
Shape Red Green Blue Yellow
Circle µ11 = β0 + β1 + β2 + β5
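(d) The factor effects model is $Y_{ijk} = \mu_{..} + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$, where $\mu_{..}$
is a constant; $\alpha_i$ are constants subject to the restriction $\sum_i \alpha_i = 0$; $\beta_j$ are constants subject to the
restriction $\sum_j \beta_j = 0$; $(\alpha\beta)_{ij}$ are constants subject to the restrictions $\sum_i (\alpha\beta)_{ij} = 0$ and
$\sum_j (\alpha\beta)_{ij} = 0$; and $\varepsilon_{ijk}$ are independent $N(0, \sigma^2)$, $i = 1, 2, \ldots, a$; $j = 1, 2, \ldots, b$; $k = 1, 2, \ldots, n$.
Why are the constraints $\sum_i \alpha_i = \sum_j \beta_j = \sum (\alpha\beta)_{ij} = 0$ required? What is the advantage of this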
(e) Refer to Part (d). Modify the factor effects model to apply to this study with a = 2 and b = 4.
(f) Set up the Y, X, and β matrices for the factor effects regression model.
(g) Refer to part (e). Obtain the fitted regression function.
(h) Plot the residuals against the fitted values and the QQ-plot of the residuals. Use these two
residual plots to check if the assumptions of two-way ANOVA are justifiable. Briefly explain.
(i) Plot an interaction plot. What does this plot suggest?
(j) Fill in the blanks in the following ANOVA table.
Source of Variation SS df MS
(k) Test if the two factors interact.
(l) Is it meaningful here to test for main factor effects? If so, test if the main effects for color and
shape are present.
(m) All pairwise comparisons among the color group level means via Tukey procedure with a 95
percent family confidence coefficient are constructed below:
Treatment         Difference   Lower 95% limit   Upper 95% limit
Red - Green       -19.00       -35.8696          -2.1304
Red - Blue          4.75       -12.1196          21.6196
Red - Yellow        8.50        -8.3696          25.3696
Green - Blue       23.75         6.8804          40.6196
Green - Yellow     27.50        10.6304          44.3696
Blue - Yellow       3.75       -13.1196          20.6196
Determine which means differ using Tukey’s multiple comparison test.
(n) Based on the above analysis, what combination of color and shape should be used for the logo?
(o) Suppose that in the shape population, 60 percent are circle, and 40 percent are square. Construct a 95 percent confidence interval for the mean overall rating in the shape population.
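The link between the dummy-variable regression in part (c) and the cell means can be checked directly on the approval ratings given in the table. The sketch below is one possible way to do so, assuming plain least squares via NumPy. The dictionary layout and variable names are illustrative choices, while the ratings themselves are copied from the question.

```python
import numpy as np

# Approval ratings from the table in the question, keyed by (shape, colour).
ratings = {
    ("Circle", "Red"): [52, 44], ("Circle", "Green"): [67, 61],
    ("Circle", "Blue"): [36, 44], ("Circle", "Yellow"): [45, 41],
    ("Square", "Red"): [34, 36], ("Square", "Green"): [56, 58],
    ("Square", "Blue"): [36, 31], ("Square", "Yellow"): [21, 25],
}

rows, y = [], []
for (shape, colour), values in ratings.items():
    C = 1.0 if shape == "Circle" else 0.0
    R = 1.0 if colour == "Red" else 0.0
    G = 1.0 if colour == "Green" else 0.0
    B = 1.0 if colour == "Blue" else 0.0
    for v in values:
        # Columns: intercept, C, R, G, B, CR, CG, CB (same order as beta0..beta7).
        rows.append([1.0, C, R, G, B, C * R, C * G, C * B])
        y.append(float(v))

X = np.array(rows)
y = np.array(y)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sample cell means and colour-level means computed directly from the data.
cell_means = {cell: sum(v) / len(v) for cell, v in ratings.items()}
colour_means = {}
for colour in ["Red", "Green", "Blue", "Yellow"]:
    vals = ratings[("Circle", colour)] + ratings[("Square", colour)]
    colour_means[colour] = sum(vals) / len(vals)

# With 8 parameters for 8 cells, the fitted value in each cell equals its sample mean,
# e.g. Circle/Red = b0 + b1 + b2 + b5.
mu_circle_red = beta[0] + beta[1] + beta[2] + beta[5]
print(mu_circle_red, cell_means[("Circle", "Red")])
print(colour_means["Red"] - colour_means["Green"])
```

The last line reproduces the Red versus Green difference reported in the Tukey table of part (m), which is a useful consistency check on the data entry.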
A person’s muscle mass is expected to decrease with age. To explore this relationship in women,
a nutritionist randomly selected 4 women from each 10-year age group, beginning with age 40 and
ending with age 79. X is age, and Y is a measure of muscle mass.
(a) Below is a scatter plot of the data with muscle mass on the y axis and age on the x axis.
Based on the plot, does it seem reasonable that there are two different (but connected) regression functions – one when age ≤ 60 and one when age > 60?
(b) The nutritionist conjectures that the regression of muscle mass on age follows a two-piece linear
relation, with the slope changing at age 60 without discontinuity. State the regression model
that applies if the nutritionist’s conjecture is correct.
(c) Refer to part (b). What are respective response functions when age is 60 or less and when age
is over 60?
(d) Explain whether or not the model specified in part (b) violates the principle of marginality.
Also, discuss and show whether or not this model is continuous at X = 60. Is continuity or
marginality more important here and why?
(e) Estimate the regression model specified in part (b). Copy and paste the regression output into
your answer sheet. Write down the fitted regression equation.
(f) Test whether a two-piece linear regression function is needed at α = 0.05.
(g) Refer to part (e). What is the estimated regression function for muscle mass when age ≤ 60?
When age > 60?
(h) Based on your estimated regression function, what is the predicted muscle mass when age =
50? When age = 70?
(i) Do you get the same prediction for age = 60 regardless of which estimated regression function
in part (e) you use?
(j) Modify the regression model in part (b) so that the slope changes at age 60 and the regression function is allowed to be discontinuous there.
(k) Specify the regression model for the case where the slope changes at age 40 and again at age
60 with no discontinuities.
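One common way to encode a two-piece linear relation that stays connected at the change point is a "hinge" regressor equal to max(X − 60, 0). The sketch below illustrates that encoding on synthetic data. The simulated coefficients, the sample, and the helper names `fitted_low` and `fitted_high` are assumptions for illustration and are not the assignment data or a prescribed solution.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic ages (40 to 79) and muscle-mass values; illustrative only.
age = rng.integers(40, 80, size=16).astype(float)
hinge = np.maximum(age - 60.0, 0.0)     # equals (X - 60) when X > 60, else 0
muscle = 150.0 - 1.0 * age - 0.8 * hinge + rng.normal(0, 3, 16)

# Design matrix: intercept, age, hinge term.
X = np.column_stack([np.ones_like(age), age, hinge])
b, *_ = np.linalg.lstsq(X, muscle, rcond=None)


def fitted_low(a):
    """Fitted line that applies for age <= 60."""
    return b[0] + b[1] * a


def fitted_high(a):
    """Fitted line that applies for age > 60: intercept b0 - 60*b2, slope b1 + b2."""
    return (b[0] - 60.0 * b[2]) + (b[1] + b[2]) * a


print(fitted_low(50.0), fitted_high(70.0))
print(fitted_low(60.0), fitted_high(60.0))   # the two pieces meet at age 60
```

The hinge term vanishes at X = 60, so the two pieces agree there by construction. Allowing a jump would need an additional indicator regressor for age > 60.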
Consider the general linear regression model Y = Xβ + ε where Y is n x 1, X is n x p and of rank
p, β is p x 1, ε is n x 1, and ε is $N(0, \sigma^2 I)$.
(a) The hat matrix H is given by $H = X(X'X)^{-1}X'$. Show that (I – H) is idempotent, where I is
the n x n identity matrix.
(b) Using the least squares method, we minimize RSS = e'e = (Y – Xb)'(Y – Xb) to obtain
$b = (X'X)^{-1}X'Y$. Show that RSS can also be written as Y'(I – H)Y.
(c) Obtain an expression for the variance-covariance matrix of the fitted values $\hat{Y}_i$, i = 1, 2, …, n,
in terms of the hat matrix H.
(d) $e = Y - \hat{Y}$ is the vector of residuals. Are the residuals statistically independent? Justify your
answer with an explanation.
(e) Show that e = ε – Hε. Suppose we denote by $h_{ij}$ the (i, j) element of the hat matrix H.
Thus, $e_i$ can be written as $e_i = \varepsilon_i - \sum_{j=1}^{n} h_{ij}\varepsilon_j$. What does this equation for $e_i$ show?
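The algebraic identities in parts (a), (b), and (e) can be sanity-checked numerically before attempting the proofs. The following sketch assumes a small synthetic full-rank design; the dimensions, the coefficient vector, and the variable names are arbitrary illustrative choices. A numerical check is not a substitute for the derivations asked for.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 12, 3
# Synthetic full-rank design with an intercept column (illustrative values only).
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])
eps = rng.normal(size=n)
Y = X @ beta + eps

H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
M = np.eye(n) - H                        # I - H

b = np.linalg.solve(X.T @ X, X.T @ Y)    # least squares estimate
e = Y - X @ b                            # residual vector

rss_direct = float(e @ e)                # e'e
rss_quadratic = float(Y @ M @ Y)         # Y'(I - H)Y

print(np.allclose(M @ M, M))             # idempotency of I - H
print(rss_direct, rss_quadratic)         # two expressions for RSS
print(np.allclose(e, M @ eps))           # e = (I - H) eps
```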
Suppose that we partition X and β as
$X = [X_1 \;\; X_2]$, $\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}$,
where X1 is n x p1, X2 is n x p2, and p1 + p2 = p. β1 is p1 x 1, and β2 is p2 x 1.
(f) If the true model is Y = Xβ + ε, and we fit the model Y = X1β1 + u, have we underspecified
or overspecified the model?
(g) For the case in part (f), $b_1 = (X_1'X_1)^{-1}X_1'Y$. If the true model is Y = Xβ + ε, compute
(h) As a result of model misspecification in part (f), we could obtain an estimator of $\sigma^2$ which is
larger than it should be. Does this affect inferences made about the model? Explain.
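A quantity often examined when a fitted model leaves out the X2 block is the expectation of b1, which in general differs from β1 by the term $(X_1'X_1)^{-1}X_1'X_2\beta_2$. The sketch below checks that relationship numerically on a synthetic design. The dimensions, the correlation between the kept and omitted regressors, and all parameter values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
# Kept regressors X1 (intercept plus one variable) and an omitted regressor X2
# that is deliberately correlated with X1; all values are synthetic.
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = (0.6 * X1[:, 1] + rng.normal(size=n)).reshape(-1, 1)
beta1 = np.array([1.0, 2.0])
beta2 = np.array([1.5])

mean_Y = X1 @ beta1 + X2 @ beta2         # E(Y) under the true model

# b1 is linear in Y, so E(b1) = (X1'X1)^{-1} X1' E(Y).
A = np.linalg.solve(X1.T @ X1, X1.T)
expected_b1 = A @ mean_Y
bias = expected_b1 - beta1
alias_term = A @ X2 @ beta2              # (X1'X1)^{-1} X1'X2 beta2

print(bias)
print(alias_term)
```

The bias vanishes only when β2 = 0 or when X1'X2 = 0, that is, when the omitted regressors are orthogonal to the retained ones.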
Criminologists are interested in the effect of demographic characteristics and police expenditure on
crime rates. This has been studied using aggregate data on 47 states of the USA for 2016. The data
set contains the columns as described below:
M percentage of males aged 14–24 in total state population
So indicator variable for a southern state
Ed mean years of schooling of the population aged 25 years or over
Po1 per capita expenditure on police protection in 2016
Po2 per capita expenditure on police protection in 2015
LF labour force participation rate of civilian urban males in the age-group 14–24
M.F number of males per 100 females
Pop state population in 2016 in hundred thousands
NW percentage of nonwhites in the population
U1 unemployment rate of urban males 14–24
U2 unemployment rate of urban males 35–39
Wealth wealth: median value of transferable assets or family income
Ineq income inequality: percentage of families earning below half the median income
Prob probability of imprisonment: ratio of number of commitments to number of offenses
Time average time in months served by offenders in state prisons before their first release
Crime crime rate: number of offenses per 100,000 population in 2016
(a) Calculate and interpret the sample correlation between crime rate and police expenditure in
2015. Is the sign of the correlation what you expect? Explain.
(b) In the previous question, we saw that the sample correlation between crime rate in 2016 and
police expenditure in 2015 was positive. However, the model fitted below suggests that an
increase in police expenditure in 2015 decreases the crime rate in 2016. Is there a contradiction?
$$\widehat{\text{Crime}} = 158.2646 + 256.1526\,\text{Po1} - 178.2880\,\text{Po2}$$
(c) Find the best (parsimonious) regression model for the given data. Do not forget to perform an
initial data analysis before applying the automatic search procedures such as forward selection,
backward elimination, and stepwise regression.
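A marginal correlation and a partial regression coefficient measure different things, and they can have opposite signs when two predictors are strongly correlated. The sketch below reproduces that pattern on synthetic data. The variable names echo the data dictionary, but every number here is simulated, and the generating coefficients are arbitrary assumptions, not estimates from the crime data set.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 47
# Synthetic spending series: po1 is po2 plus a small change, so they are nearly collinear.
po2 = rng.normal(80, 30, n)
po1 = po2 + rng.normal(5, 3, n)
# Synthetic response with a positive partial effect of po1 and a negative one of po2.
crime = 150 + 2.5 * po1 - 1.8 * po2 + rng.normal(0, 5, n)

corr_crime_po2 = np.corrcoef(crime, po2)[0, 1]   # marginal association
corr_po1_po2 = np.corrcoef(po1, po2)[0, 1]       # collinearity between predictors

X = np.column_stack([np.ones(n), po1, po2])
b, *_ = np.linalg.lstsq(X, crime, rcond=None)    # b[2] is the partial effect of po2

print(corr_crime_po2, corr_po1_po2)
print(b)
```

In this simulation the correlation between the response and po2 is positive, while the fitted coefficient on po2, holding po1 fixed, is negative. The coefficient answers a "holding the other predictor constant" question that the simple correlation does not.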
Use the best model found in part (c) to answer the following questions:
(d) What characteristics does a high leverage point have in general?
(e) Are there any high leverage points in the data used to fit the best model?
(f) What is the sum of all leverage values for the data used to fit the best model?
(g) Are there any outliers in the data for the best model?
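Leverage values are the diagonal elements of the hat matrix, so they can be computed from the design matrix alone. The sketch below uses a synthetic design with one deliberately extreme row. The cut-off 2p/n is a widely used rule of thumb and is an assumption here, as are the dimensions and names; the course may specify a different threshold.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 47, 4                           # p counts the intercept; synthetic design
Z = rng.normal(size=(n, p - 1))
Z[0] = [6.0, -6.0, 6.0]                # one observation far from the centre of the X-space
X = np.column_stack([np.ones(n), Z])

H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)                  # h_ii for each observation

threshold = 2 * p / n                  # rule-of-thumb cut-off (an assumption)
high_leverage = np.where(leverage > threshold)[0]

print(leverage.sum())                  # equals the number of parameters p
print(high_leverage)
```

Leverage depends only on the predictor values, not on the response, so a high-leverage point is not necessarily an outlier in Y.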