QBUS2810: Statistical Modelling for Business

Assignment Task #3

Submission Due Date: Sunday, 22nd November, 2020 (Week 12) before 11:59 pm (Sydney time)

Instructions:

You are required to type up your entire assignment, including any equations. Copy and paste relevant

outputs into your text. If you are using Word, you should use the equation editor for any maths notation.

You should attach relevant analysis outputs (graphs, tables, etc.) while discussing your answer in the text.

Please answer all questions in the given order; i.e., 1a, 1b, etc. You do not need to re-write the assignment

questions again. Keep your answers clear, brief, and concise.

There is no requirement for font size and line spacing, but it must be legible and correctly oriented.

Please convert your assignment to PDF and upload it to the Turnitin assignment

box on Canvas.

For hypothesis test questions, use the p-value approach. Your answer should include the hypotheses

(H0 and H1), decision, and conclusion.

Data used in this assignment are in the spreadsheet A3Dataset.xlsx.

You are encouraged to discuss the assignment with your classmates, tutors, and lecturer. However, you

MUST write up solutions on your own. Students caught cheating will automatically receive a mark of 0

and are subject to disciplinary action.

The capital asset pricing model (CAPM) is used in finance to determine a theoretically appropriate

required rate of return of an asset, where that asset is to be added to an already well-diversified

portfolio, given that asset’s non-diversifiable risk. Traditionally, applications of the CAPM use only

one variable, the return of the market as a whole, to describe the returns of a portfolio or stock:

$r_{stock} - r_f = \alpha_{stock} + \beta_{stock}(r_m - r_f) + u_t$

In contrast, the Fama-French model uses three variables:

$r_{stock} - r_f = \alpha_{stock} + \beta_{stock}(r_m - r_f) + \beta_2\,\mathrm{SMB} + \beta_3\,\mathrm{HML} + \varepsilon_t$

rstock is the stock’s rate of return, rf is the risk-free return rate, and rm is the return of the

whole stock market. The parameter αstock is the stock’s "alpha". It measures how much the stock

outperforms its "theoretical" predicted returns under the CAPM, and βstock is the stock’s "beta",

which measures the stock’s exposure to the overall market. Different stocks will have different

parameters.

The Fama-French model contains two additional factors to explain stock returns. Small market capitalization Minus Big (SMB) measures the historic excess returns of small cap stocks over big caps.

High book-to-market ratio (BtM) Minus Low book-to-market ratio (HML) measures the historic

excess returns of value stocks (high BtM ratio) over growth stocks (low BtM ratio). These factors

are calculated with combinations of portfolios composed of ranked stocks (BtM ranking, capitalisation ranking) and available historical market data. Historical values are available on Kenneth

French’s web page for American stocks.

The variables used in this exercise are as follows:

rBHP = Monthly return on BHP stock as observed on the ASX.

rm = Monthly return on market index, here the All Ordinaries Index (AOI).

SMB = Small market capitalization Minus Big market capitalization factor.

HML = High book-to-market ratio Minus Low book-to-market ratio factor.

You are to assume a risk-free rate of rf = 0.005 per month. Your task is to estimate the Fama-French

three-factor model using the given data and determine whether it explains the BHP stock returns

better than the one-factor model that uses only the market excess returns of the All Ordinaries Index.
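
As a rough illustration of the two specifications above, the following sketch fits both models by least squares. It runs on synthetic data generated inside the script, not on A3Dataset.xlsx; the coefficients in `true_beta`, the sample size, and the noise levels are made-up assumptions, and only rf = 0.005 comes from the task description.

```python
import numpy as np

# Synthetic monthly data; NOT the values in A3Dataset.xlsx.
rng = np.random.default_rng(0)
n = 240
rf = 0.005                                      # risk-free rate stated in the task
r_m = rf + rng.normal(0.005, 0.04, n)           # hypothetical market returns
smb = rng.normal(0.0, 0.03, n)                  # hypothetical SMB factor
hml = rng.normal(0.0, 0.03, n)                  # hypothetical HML factor
true_beta = np.array([0.002, 1.2, 0.5, -0.3])   # made-up coefficients

x_mkt = r_m - rf                                # market excess return
X3 = np.column_stack([np.ones(n), x_mkt, smb, hml])
y = X3 @ true_beta + rng.normal(0.0, 0.01, n)   # stock excess return


def ols(X, y):
    """Return least-squares coefficients and the residual sum of squares."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return b, float(resid @ resid)


X1 = X3[:, :2]                                  # intercept + market factor only
b1, rss1 = ols(X1, y)
b3, rss3 = ols(X3, y)

print("one-factor coefficients:  ", np.round(b1, 3))
print("three-factor coefficients:", np.round(b3, 3))
print("RSS one-factor:", rss1, " RSS three-factor:", rss3)
```

Because the one-factor model is nested in the three-factor model, the three-factor RSS can never be larger; whether the reduction is statistically meaningful is a separate question for a formal test.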

(a) Write down the five-number summaries plus mean, standard deviation, skewness, and kurtosis

coefficients of rBHP .

(b) Plot and comment on the rBHP series over time.

(c) Generate two new variables rBHP − rf and rm − rf and estimate the one-factor CAPM model:

$r_{stock} - r_f = \beta_0 + \beta_1(r_m - r_f) + u_t$

Copy and paste the regression output into your answer sheet. Write down the fitted regression

equation.

(d) Comment on the sign of the estimated coefficient β1 and state whether this is what you expect.

(e) Test whether or not the excess market returns explain the excess returns of BHP shares at the

α = 0.05 level.

(f) Test whether or not BHP’s "beta" is greater than one at the α = 0.05 level.

(g) Estimate the Fama-French 3-Factor CAPM model:

$r_{stock} - r_f = \beta_0 + \beta_1(r_m - r_f) + \beta_2\,\mathrm{SMB} + \beta_3\,\mathrm{HML} + \varepsilon_t$

Copy and paste the regression output into your answer sheet. Write down the fitted regression

equation.

(h) Set up the general linear hypothesis for testing whether or not the Fama-French 3-Factor CAPM

model explains the stock returns better than the one-factor CAPM model; i.e., determine L,

β, and c for H0: Lβ = c.

(i) Conduct a hypothesis test for part (h).

(j) A financial analyst believes that the effect of book-to-market values (HML) on stock returns is

twice as great as the effect of market capitalization (SMB). Formulate an appropriate hypothesis

test and use re-parametrisation to convert it to a simple t-test to test the assertion. Perform

the required regression and state your conclusion at the α = 0.05 level.

(k) Obtain the variance-covariance matrix of the parameter estimators for the regression model in

part (g). Use the regression result in part (g) and this variance-covariance matrix to repeat

the hypothesis test in part (j) by means of a simple t-test.
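
The mechanics behind a t-test on a linear combination of coefficients can be sketched as follows. The function name `linear_combination_t` and all numbers are hypothetical and are not estimates from the assignment data; the sketch assumes the hypothesis is written as a weighted sum of coefficients equal to a target value, and it uses the fact that the variance of w'b is w'Vw, where V is the variance-covariance matrix of the estimators.

```python
import math


def linear_combination_t(estimates, cov, weights, target=0.0):
    """t statistic for H0: sum(w_k * beta_k) = target.

    estimates: coefficient estimates b_k
    cov:       their estimated variance-covariance matrix (list of lists)
    weights:   the w_k defining the linear combination
    """
    k = len(estimates)
    value = sum(w * b for w, b in zip(weights, estimates))
    # Var(w'b) = w' V w, the quadratic form in the covariance matrix
    variance = sum(weights[i] * weights[j] * cov[i][j]
                   for i in range(k) for j in range(k))
    return (value - target) / math.sqrt(variance)


# Hypothetical numbers for (b2, b3); NOT estimates from the assignment data.
b = [0.4, 1.0]
V = [[0.04, 0.01],
     [0.01, 0.09]]
# If the claim is written as beta3 - 2*beta2 = 0, the weights on
# (beta2, beta3) are (-2, 1).
t_stat = linear_combination_t(b, V, weights=[-2.0, 1.0])
print(round(t_stat, 4))
```

The resulting statistic would be compared with a t distribution whose degrees of freedom equal the residual degrees of freedom of the fitted regression.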

The marketing manager of a company producing a new cereal aimed at children wants to examine

the effect of the colour and shape of the box’s logo on the approval rating of the cereal. He combined 4 colours

and 2 shapes to produce a total of 8 designs. Each logo was presented to 2 different groups (a total

of 16 groups of children) and the approval rating for each was recorded and is shown below.

                      Color
Shape     Red       Green     Blue      Yellow
Circle    52, 44    67, 61    36, 44    45, 41
Square    34, 36    56, 58    36, 31    21, 25
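
As a quick descriptive summary of the table, the sketch below computes the cell, shape, and colour means from the 16 recorded ratings. The variable names are illustrative only; the numbers are copied from the table.

```python
# Approval ratings from the table above: (shape, colour) -> two group ratings
ratings = {
    ("Circle", "Red"): [52, 44], ("Circle", "Green"): [67, 61],
    ("Circle", "Blue"): [36, 44], ("Circle", "Yellow"): [45, 41],
    ("Square", "Red"): [34, 36], ("Square", "Green"): [56, 58],
    ("Square", "Blue"): [36, 31], ("Square", "Yellow"): [21, 25],
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


shapes = ["Circle", "Square"]
colours = ["Red", "Green", "Blue", "Yellow"]

cell_means = {cell: mean(obs) for cell, obs in ratings.items()}
# The design is balanced, so marginal means are plain averages of cell means.
shape_means = {s: mean(cell_means[(s, c)] for c in colours) for s in shapes}
colour_means = {c: mean(cell_means[(s, c)] for s in shapes) for c in colours}
grand_mean = mean(cell_means.values())

for s in shapes:
    print(s, [cell_means[(s, c)] for c in colours])
print(shape_means)
print(colour_means)
print(grand_mean)
```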

(a) How many factors does this experiment have? Identify the factors and state how many levels

each factor has.

(b) If all combinations are compared, how many different treatments (cells) are there in the experiment? What is the response variable?

(c) Consider the following regression model:

$Y = \beta_0 + \beta_1 C + \beta_2 R + \beta_3 G + \beta_4 B + \beta_5 CR + \beta_6 CG + \beta_7 CB + \varepsilon$

where C = 1 if shape = circle; 0 otherwise. R = 1 if color = red; 0 otherwise. G = 1 if color

= green; 0 otherwise. B = 1 if color = blue; 0 otherwise.

Use the regression parameters to recover the cell means µij and fill in the following table:

                      Colour
Shape     Red                            Green     Blue      Yellow
Circle    µ11 = β0 + β1 + β2 + β5
Square

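(d) The factor effects model is $Y_{ijk} = \mu_{..} + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$, where $\mu_{..}$ is a constant. $\alpha_i$ are constants subject to the restriction $\sum_i \alpha_i = 0$. $\beta_j$ are constants subject to the restriction $\sum_j \beta_j = 0$. $(\alpha\beta)_{ij}$ are constants subject to the restrictions $\sum_i \sum_j (\alpha\beta)_{ij} = 0$. $\varepsilon_{ijk}$ are independent $N(0, \sigma^2)$, $i = 1, 2, \ldots, a$; $j = 1, 2, \ldots, b$; $k = 1, 2, \ldots, n$.

Why are the constraints $\sum_i \alpha_i = \sum_j \beta_j = \sum_i \sum_j (\alpha\beta)_{ij} = 0$ required? What is the advantage of this model?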

(e) Refer to Part (d). Modify the factor effects model to apply to this study with a = 2 and b = 4.

(f) Set up the Y, X, and β matrices for the factor effects regression model.

(g) Refer to part (e). Obtain the fitted regression function.

(h) Plot the residuals against the fitted values and the QQ-plot of the residuals. Use these two

residual plots to check if the assumptions of two-way ANOVA are justifiable. Briefly explain.

(i) Plot an interaction plot. What does this plot suggest?

(j) Fill in the blanks in the following ANOVA table.

Source of Variation    SS    df    MS
Between treatments
  Factor A
  Factor B
  AB Interactions
Error
Total

(k) Test if the two factors interact.

(l) Is it meaningful here to test for main factor effects? If so, test if the main effects for color and

shape are present.

(m) All pairwise comparisons among the colour factor level means, constructed via the Tukey procedure

with a 95 percent family confidence coefficient, are given below:

Treatments          Difference    Lower 95% limit    Upper 95% limit
Red vs Green          -19.00         -35.8696            -2.1304
Red vs Blue             4.75         -12.1196            21.6196
Red vs Yellow           8.50          -8.3696            25.3696
Green vs Blue          23.75           6.8804            40.6196
Green vs Yellow        27.50          10.6304            44.3696
Blue vs Yellow          3.75         -13.1196            20.6196

Determine which means differ using Tukey’s multiple comparison test.

(n) Based on the above analysis, what combination of color and shape should be used for the logo

design?

(o) Suppose that in the shape population, 60 percent are circle, and 40 percent are square. Construct a 95 percent confidence interval for the mean overall rating in the shape population.
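
The Tukey table in part (m) can be cross-checked against the raw data. The sketch below confirms that each tabulated difference equals the difference between the corresponding colour means, reports the common half-width of the intervals, and applies the usual reading that an interval excluding zero indicates a difference at the family confidence level. The variable names are illustrative only.

```python
# Ratings pooled over shape, taken from the data table in this question
by_colour = {
    "Red": [52, 44, 34, 36], "Green": [67, 61, 56, 58],
    "Blue": [36, 44, 36, 31], "Yellow": [45, 41, 21, 25],
}
colour_mean = {c: sum(v) / len(v) for c, v in by_colour.items()}

# (first, second, difference, lower limit, upper limit) copied from part (m)
tukey = [
    ("Red", "Green", -19.00, -35.8696, -2.1304),
    ("Red", "Blue", 4.75, -12.1196, 21.6196),
    ("Red", "Yellow", 8.50, -8.3696, 25.3696),
    ("Green", "Blue", 23.75, 6.8804, 40.6196),
    ("Green", "Yellow", 27.50, 10.6304, 44.3696),
    ("Blue", "Yellow", 3.75, -13.1196, 20.6196),
]

for a, b, diff, lo, hi in tukey:
    # The tabulated difference should equal mean(a) - mean(b)
    assert abs((colour_mean[a] - colour_mean[b]) - diff) < 1e-9
    excludes_zero = lo > 0 or hi < 0
    print(f"{a:>6} vs {b:<6} diff={diff:7.2f}  excludes zero: {excludes_zero}")

# Pairs whose interval excludes zero, and the interval half-widths
flagged = [(a, b) for a, b, _, lo, hi in tukey if lo > 0 or hi < 0]
half_widths = {round((hi - lo) / 2, 4) for *_, lo, hi in tukey}
print(flagged)
print(half_widths)
```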

A person’s muscle mass is expected to decrease with age. To explore this relationship in women,

a nutritionist randomly selected 4 women from each 10-year age group, beginning with age 40 and

ending with age 79. X is age, and Y is a measure of muscle mass.

(a) Below is a scatter plot of the data with muscle mass on the y axis and age on the x axis.

Based on the plot, does it seem reasonable that there are two different (but connected) regression functions – one when age ≤ 60 and one when age > 60?

(b) The nutritionist conjectures that the regression of muscle mass on age follows a two-piece linear

relation, with the slope changing at age 60 without discontinuity. State the regression model

that applies if the nutritionist’s conjecture is correct.

(c) Refer to part (b). What are respective response functions when age is 60 or less and when age

is over 60?

(d) Explain whether or not the model specified in part (b) violates the principle of marginality.

Also, discuss and show whether or not this model is continuous at X = 60. Is continuity or

marginality more important here and why?

(e) Estimate the regression model specified in part (b). Copy and paste the regression output into

your answer sheet. Write down the fitted regression equation.

(f) Test whether a two-piece linear regression function is needed at α = 0.05.

(g) Refer to part (e). What is the estimated regression function for muscle mass when age ≤ 60?

When age > 60?

(h) Based on your estimated regression function, what is the predicted muscle mass when age =

50? When age = 70?

(i) Do you get the same prediction for age = 60 regardless of which estimated regression function

in part (e) you use?

(j) Modify the regression model in part (b) so that the slope changes at age 60 and the regression function is allowed to be discontinuous at that age.

(k) Specify the regression model for the case where the slope changes at age 40 and again at age

60 with no discontinuities.
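
One common way to encode a continuous two-piece linear relation is a hinge term max(0, X − knot). The sketch below uses that parameterisation with hypothetical coefficients; it is an assumption about how such a model may be written, not the fitted model for the muscle mass data, and the function names are illustrative.

```python
def two_piece_mean(age, b0, b1, b2, knot=60.0):
    """Mean response b0 + b1*age + b2*max(0, age - knot)."""
    return b0 + b1 * age + b2 * max(0.0, age - knot)


# Hypothetical coefficients chosen only for illustration
b0, b1, b2 = 110.0, -0.5, -1.0


def left_line(age):
    """Implied line for age <= knot."""
    return b0 + b1 * age


def right_line(age, knot=60.0):
    """Implied line for age > knot: the hinge term is expanded."""
    return (b0 - b2 * knot) + (b1 + b2) * age


for age in (50, 60, 70):
    print(age, two_piece_mean(age, b0, b1, b2))

# Both implied lines give the same value at the knot, so there is no jump.
print(left_line(60.0), right_line(60.0))
```

Under this parameterisation the coefficient on the hinge term is the change in slope at the knot, so a test of whether two pieces are needed reduces to a test on that single coefficient.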

Consider the general linear regression model $Y = X\beta + \varepsilon$, where $Y$ is $n \times 1$, $X$ is $n \times p$ and of rank $p$, $\beta$ is $p \times 1$, $\varepsilon$ is $n \times 1$, and $\varepsilon$ is $N(0, \sigma^2 I)$.

(a) The hat matrix $H$ is given by $H = X(X'X)^{-1}X'$. Show that $(I - H)$ is idempotent, where $I$ is the $n \times n$ identity matrix.

(b) Using the least squares method, we minimize $RSS = e'e = (Y - Xb)'(Y - Xb)$ to obtain $b = (X'X)^{-1}X'Y$. Show that RSS can also be written as $Y'(I - H)Y$.

(c) Obtain an expression for the variance-covariance matrix of the fitted values $\hat{Y}_i$, $i = 1, 2, \ldots, n$, in terms of the hat matrix $H$.

(d) $e = Y - \hat{Y}$ is the vector of residuals. Are the residuals statistically independent? Justify your

answer with an explanation.

(e) Show that $e = \varepsilon - H\varepsilon$. Suppose we denote by $h_{ij}$ the $(i, j)$ element of the hat matrix $H$. Thus, $e_i$ can be written as $e_i = \varepsilon_i - \sum_{j=1}^{n} h_{ij}\varepsilon_j$. What does this equation for $e_i$ show?

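Suppose that we partition $X$ and $\beta$ as

$X = [X_1 \;\vdots\; X_2], \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}$

where $X_1$ is $n \times p_1$, $X_2$ is $n \times p_2$, and $p_1 + p_2 = p$. $\beta_1$ is $p_1 \times 1$, and $\beta_2$ is $p_2 \times 1$.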

(f) If the true model is $Y = X\beta + \varepsilon$, and we fit the model $Y = X_1\beta_1 + u$, have we underspecified

or overspecified the model?

(g) For the case in part (f), $b_1 = (X_1'X_1)^{-1}X_1'Y$. If the true model is $Y = X\beta + \varepsilon$, compute $E(b_1)$.

(h) As a result of model misspecification in part (f), we could obtain an estimator of $\sigma^2$ which is

larger than it should be. Does this affect inferences made about the model? Explain.
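
The matrix identities in parts (a), (b), and (e) can be checked numerically before attempting the algebra. The sketch below does this on a small simulated design; the dimensions, coefficients, and error scale are made-up assumptions, and a numerical check illustrates the identities but is not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 12, 3
# Design matrix with an intercept column; random columns give full rank
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -1.0])        # made-up coefficients
eps = rng.normal(scale=0.5, size=n)      # simulated errors
Y = X @ beta + eps

H = X @ np.linalg.solve(X.T @ X, X.T)    # hat matrix X (X'X)^{-1} X'
M = np.eye(n) - H                        # I - H

b = np.linalg.solve(X.T @ X, X.T @ Y)    # least squares estimate
e = Y - X @ b                            # residual vector

print(np.allclose(M @ M, M))             # is (I - H) idempotent?
print(np.allclose(e @ e, Y @ M @ Y))     # does RSS equal Y'(I - H)Y?
print(np.allclose(e, eps - H @ eps))     # does e equal eps - H eps?
```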

Criminologists are interested in the effect of demographic characteristics and police expenditure on

crime rates. This has been studied using aggregate data on 47 states of the USA for 2016. The data

set contains the columns as described below:

Variable   Description
M          percentage of males aged 14–24 in total state population
So         indicator variable for a southern state
Ed         mean years of schooling of the population aged 25 years or over
Po1        per capita expenditure on police protection in 2016
Po2        per capita expenditure on police protection in 2015
LF         labour force participation rate of civilian urban males in the age-group 14–24
M.F        number of males per 100 females
Pop        state population in 2016 in hundred thousands
NW         percentage of nonwhites in the population
U1         unemployment rate of urban males 14–24
U2         unemployment rate of urban males 35–39
Wealth     wealth: median value of transferable assets or family income
Ineq       income inequality: percentage of families earning below half the median income
Prob       probability of imprisonment: ratio of number of commitments to number of offenses
Time       average time in months served by offenders in state prisons before their first release
Crime      crime rate: number of offenses per 100,000 population in 2016

(a) Calculate and interpret the sample correlation between crime rate and police expenditure in

2015. Is the sign of the correlation what you expect? Explain.

(b) In the previous question, we saw that the sample correlation between crime rate in 2016 and

police expenditure in 2015 was positive. However, the model fitted below suggests that an

increase in police expenditure in 2015 decreases the crime rate in 2016. Is there a contradiction?

Explain.

$\widehat{\mathrm{Crime}} = 158.2646 + 256.1526\,\mathrm{Po1} - 178.2880\,\mathrm{Po2}$

(c) Find the best (parsimonious) regression model for the given data. Do not forget to perform an

initial data analysis before applying the automatic search procedures such as forward selection,

backward elimination, and stepwise regression.

Use the best model found in part (c) to answer the following questions:

(d) What characteristics does a high leverage point have in general?

(e) Are there any high leverage points in the data used to fit the best model?

(f) What is the sum of all leverage values for the data used to fit the best model?

(g) Are there any outliers in the data for the best model?
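
Leverage values are the diagonal elements of the hat matrix, and the sketch below shows how they might be computed and screened. It uses simulated predictors with one deliberately extreme row, not the crime data; the 2p/n cutoff is a common rule of thumb adopted here as an assumption, and other cutoffs are also in use.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
Z = rng.normal(size=(n, 2))               # two made-up predictors
Z[0] = [8.0, -8.0]                        # plant one point far from the rest
X = np.column_stack([np.ones(n), Z])      # add the intercept column
p = X.shape[1]                            # number of regression parameters

H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix
leverage = np.diag(H)                     # h_ii values

cutoff = 2 * p / n                        # rule-of-thumb threshold (assumption)
flagged = np.flatnonzero(leverage > cutoff)

print("sum of leverages:", leverage.sum(), " p =", p)
print("flagged rows:", flagged)
```

The planted row sits far from the centre of the predictor values, which is what drives its leverage up regardless of its response value.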