KTH Mathematics


Mathematical Statistics

SF2930 Regression Analysis - Course log and updates

Spring, 2018


This page presents the latest information about what is addressed in lectures and schedule changes. During the lectures the basic theory will be presented according to the plan. Observe that not all topics will be covered during the lectures and additional reading is required. Reading instructions will be provided below after each lecture.



Lecture 15: Guest lecture 2 on Generalized linear models by Filip Allard, Data Analyst at If P&C Insurance. See slides GLM_2.

Lecture 14: Guest lecture 1 on Generalized linear models by Filip Allard, Data Analyst at If P&C Insurance. See slides GLM_1.

Lecture 13: Guest lecture on logistic regression by Ekaterina Kruglov, Data Analyst at Intrum Justitia AB. See slides Logistic regression.

Lecture 12: Guest lecture on tree-based regression methods by Alexandre Chotard. See slides CART.

Lecture 11: Note! The material of this lecture will be presented during the reserve lecture, 26/02.

Lecture 10: summary and reading instructions.

We further recall the properties of ridge regression, which shrinks the regression coefficients by imposing a penalty on their size, and derive an equivalent (constrained) form of the ridge problem. It is important to understand that there is a one-to-one correspondence between the shrinkage and constraint parameters in the two formulations (see sections 6.2, 6.8 in the book Intro to statistical learning, and compare (6.5) with (6.9)). It is also important to understand that when there are many correlated variables in the linear model (i.e., the multicollinearity problem), their coefficients can become poorly determined and exhibit high variance. This problem can be alleviated by imposing a size constraint on the coefficients, i.e., by performing ridge regression.
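For reference, the two ways of writing the ridge problem can be sketched as follows. The notation (n observations, p regressors, penalty parameter \lambda, constraint parameter s) is an assumption of this note and may differ slightly from the book.

```latex
% Penalized (Lagrangian) form
\hat{\beta}^{\,ridge}_{\lambda}
  = \arg\min_{\beta}\;
    \sum_{i=1}^{n}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2}
    + \lambda\sum_{j=1}^{p}\beta_j^{2},
  \qquad \lambda \ge 0 .

% Constrained form
\hat{\beta}^{\,ridge}_{s}
  = \arg\min_{\beta}\;
    \sum_{i=1}^{n}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2}
  \quad\text{subject to}\quad \sum_{j=1}^{p}\beta_j^{2}\le s .
```

For every \lambda there is a value of s that gives the same solution, and vice versa; a larger penalty \lambda corresponds to a smaller bound s.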

We then discuss Lasso regression, which is also a shrinkage method like ridge, but with a subtle and important difference. Due to the structure of its penalty term, the Lasso performs a kind of continuous variable selection, unlike ridge, which only shrinks. Computing the Lasso solution is a quadratic programming problem, and efficient algorithms are available for obtaining the entire path of solutions at the same computational cost as for ridge regression. The optimal value of the penalty parameter can be selected by cross-validation. Go through section 6.2.2 in Intro to statistical learning, with the focus on the variable selection properties of the Lasso. Think about the constrained Lasso formulation given by (6.9) and its connection to the theoretical statement of the variable selection problem in (6.10). Read on your own about the elastic-net penalty, a compromise between ridge and Lasso, see section 3.4.3 of Elements of statistical learning.
A more detailed presentation of the relationship between subset selection, ridge and Lasso regression, along with the Bayesian problem formulation, is also provided in section 3.4.3 of Elements of statistical learning. Further, sections 3.5 and 3.6 are recommended for those who will work with Scenario II of project 1.
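The difference between "only shrinking" and "shrinking plus selecting" is easiest to see in the special case of an orthonormal design (X'X = I), which is an assumption of the sketch below. With the ridge criterion written as ||y - Xb||² + lam·||b||² and the Lasso criterion as ½||y - Xb||² + lam·||b||₁ (these scaling conventions and all names are choices made for this illustration), ridge rescales every LS coefficient, while the Lasso soft-thresholds it. The LS coefficients are made up.

```python
def ridge_shrink(b_ls, lam):
    """Ridge under an orthonormal design: every LS coefficient is scaled by 1/(1+lam)."""
    return [bj / (1 + lam) for bj in b_ls]


def soft_threshold(bj, lam):
    """Move bj towards zero by lam and set it exactly to zero if |bj| <= lam."""
    if bj > lam:
        return bj - lam
    if bj < -lam:
        return bj + lam
    return 0.0


def lasso_shrink(b_ls, lam):
    """Lasso under an orthonormal design: soft-thresholding of the LS coefficients."""
    return [soft_threshold(bj, lam) for bj in b_ls]


b_ls = [3.0, -1.5, 0.4, -0.2]   # hypothetical LS estimates
lam = 0.5

b_ridge = ridge_shrink(b_ls, lam)
b_lasso = lasso_shrink(b_ls, lam)

print("ridge:", b_ridge)   # all coefficients shrunk, none exactly zero
print("lasso:", b_lasso)   # small coefficients set exactly to zero
```

In this special case the two small coefficients are removed by the Lasso but merely reduced by ridge, which is the variable selection property discussed in the lecture.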


For all of the standard regression analyses performed so far, it was assumed that all the regressor variables are relevant, i.e., should be included in the model. This is usually not the case in practical applications; more often there is a large set of candidate regressors from which the most appropriate ones must be identified for inclusion in the final regression model. We start by considering theoretically the consequences of model misspecification (e.g., the effect of deleting a set of variables on the bias and variance of the coefficients of the retained ones). Check in detail the whole summary 1.-5. in section 10.1.2 of MPV and the motivations for variable selection. Two natural strategies for variable selection, stepwise (backward and forward) regression and best subsets regression, have been presented.

The best subsets regression approach was discussed in detail (it is also called all possible subsets regression; the number of candidate models unfortunately quickly becomes huge, be sure that you understand why). Objective criteria for selecting the "best" model have been discussed. It is important to understand that different criteria can lead to different "best" models. Go through section 10.1.3 where the R² value, its adjusted version, MSE and Mallows' Cp-statistic are presented and the relationship between them is discussed. Be sure that you understand why these measures are suitable for selecting the optimal model when using the all possible subsets strategy. Check example 10.1 and the related tables and graphs, and be sure that you understand how to choose an optimal model based on the above-mentioned criteria. During the next lecture, I will present more details on some measures of the "best" models.
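As a small numerical companion to the criteria above, the following Python sketch counts the candidate models and evaluates the adjusted R² and Mallows' Cp. Here p denotes the number of parameters including the intercept, and sigma2_hat is taken to be MS_Res of the full model; all sums of squares are hypothetical numbers chosen for illustration, not values from example 10.1.

```python
def adjusted_r2(ss_res, ss_t, n, p):
    """Adjusted R^2 for a model with p parameters (intercept included)."""
    return 1 - (ss_res / (n - p)) / (ss_t / (n - 1))


def mallows_cp(ss_res_p, sigma2_hat, n, p):
    """Mallows' Cp for a p-parameter subset model, given an estimate of sigma^2."""
    return ss_res_p / sigma2_hat - n + 2 * p


# With K candidate regressors (intercept always kept) there are 2^K possible models
n_models = {K: 2 ** K for K in (4, 10, 20, 30)}
print(n_models)

# Hypothetical sums of squares
n, ss_t = 30, 500.0
ss_res_full, p_full = 60.0, 6      # full model
ss_res_sub, p_sub = 66.0, 4        # one subset model

sigma2_hat = ss_res_full / (n - p_full)   # MS_Res of the full model

cp_full = mallows_cp(ss_res_full, sigma2_hat, n, p_full)
cp_sub = mallows_cp(ss_res_sub, sigma2_hat, n, p_sub)

print("adjusted R^2:", adjusted_r2(ss_res_full, ss_t, n, p_full),
      adjusted_r2(ss_res_sub, ss_t, n, p_sub))
print("Cp:", cp_full, cp_sub)
```

Note that with this choice of sigma2_hat the full model always has Cp = p, so Cp is only informative for the subset models.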

Read section 10.2.2 of MPV on your own to get the general idea behind stepwise regression. Be sure that you understand how to conduct stepwise regression using the partial F-statistic, and check examples 10.3 and 10.4 to see the strategy of adding or removing a regressor. Think also about the limitations of best subsets and stepwise variable selection in regression models (see the general comments on stepwise-type approaches on p. 349). Go through sections 10.3-10.4, which present the main steps of a good model building strategy along with a case study (unfortunately only SAS output is presented, but similar graphs and tables can be obtained with R).

Lecture 9: summary and reading instructions.

After repetition of the common methods used for detecting multicollinearity (with special focus on the eigensystem analysis presented in section 9.4.3), we turn to the methods for overcoming multicollinearity. We have discussed two strategies: 1) ridge regression, which shrinks the LS regression coefficients by imposing a penalty on their size (see section 9.5.3) and 2) principal component regression, PCR, where the principal components are first obtained by transforming the original predictors, and then these components are used as new derived predictors in the regression model (see section 9.5.4).

It is important to understand how the ridge estimators are constructed (check the bias-variance trade-off), what the role of the biasing parameter is (it is sometimes also called the penalty or tuning parameter) and how this parameter can be selected. Observe that instead of a single solution, as we had in LS, ridge regression generates a path/trace of solutions which is a function of the biasing parameter. Check carefully example 9.2, where the choice of the parameter by inspection of the ridge trace is discussed. Another approach to optimizing the biasing parameter is presented in the book by Izenman (see section 5.7, the algorithm in Table 5.7, p. 138; see the course home page for the e-book). This approach considers a cross-validatory (CV) choice of the ridge parameter and is more suitable if the model is supposed to be used for prediction.
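To see what a ridge trace looks like, consider the simplest possible case: two regressors in correlation form, so that X'X is the 2x2 correlation matrix and X'y contains the correlations with the response. The sketch below is only an illustration under these assumptions; the three correlations are hypothetical and the function name is not taken from any textbook or package.

```python
import math


def ridge_two(r12, r1y, r2y, k):
    """Ridge estimate (X'X + kI)^{-1} X'y for two regressors in correlation form."""
    det = (1 + k) ** 2 - r12 ** 2
    b1 = ((1 + k) * r1y - r12 * r2y) / det
    b2 = ((1 + k) * r2y - r12 * r1y) / det
    return b1, b2


r12, r1y, r2y = 0.98, 0.80, 0.75        # hypothetical, strongly collinear regressors
ks = [0.0, 0.01, 0.05, 0.1, 0.5, 1.0]   # grid of biasing parameters; k = 0 is LS

trace = [ridge_two(r12, r1y, r2y, k) for k in ks]
lengths = [math.hypot(b1, b2) for b1, b2 in trace]

for k, (b1, b2), length in zip(ks, trace, lengths):
    print(f"k={k:<5} b1={b1:7.3f} b2={b2:7.3f} length={length:6.3f}")
```

With these numbers the LS solution (k = 0) has a negative second coefficient although both regressors are positively correlated with the response; already for small k the coefficients stabilise and the length of the coefficient vector decreases.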

The key idea of overcoming multicollinearity using PCR is to exclude those principal components which correspond to the smallest eigenvalues (think about why just these components must be dropped). Be sure that you understand how the principal components are constructed from the original data matrix X. Check example 9.3 and observe that the final PCR estimators of the original beta-coefficients can be obtained by back-transforming.

Important! Both methods assume that the original data are centered and scaled to unit length (see p. 114 for the scaling step), so that Y and each of the p columns of X have zero empirical mean and unit length.

Lecture 8: summary and reading instructions.

We further discuss techniques for identifying influential data points. It is important to understand that Cook's distance measure summarizes how much all of the fitted values change when the i'th data point is deleted. A data point with a large value of Cook's distance has a strong influence on the fitted values. We further introduced DFFITS (difference in fits), which quantifies the number of standard deviations by which the fitted value changes when the i'th data point is removed. Finally, I recommended some strategies for dealing with problematic data points in practical applications.
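Both measures can be computed without refitting the model, using only the residuals and the leverages. The sketch below does this for a simple linear regression (p = 2 parameters) on made-up data in which the last point is both remote in x and off the trend, and then checks Cook's distance against its definition by actually deleting each point. The data and all function names are assumptions of this illustration.

```python
import math


def fit(x, y):
    """Simple LS fit; returns (intercept, slope)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    return yb - b1 * xb, b1


# Made-up data: the last point is far out in x and does not follow the others
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0]
y = [2.1, 3.9, 6.1, 8.0, 9.8, 12.1, 14.0, 20.0]
n, p = len(x), 2

b0, b1 = fit(x, y)
xb = sum(x) / n
sxx = sum((xi - xb) ** 2 for xi in x)
h = [1 / n + (xi - xb) ** 2 / sxx for xi in x]       # leverages h_ii
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]    # ordinary residuals
ms_res = sum(ei ** 2 for ei in e) / (n - p)

cooks, dffits = [], []
for ei, hi in zip(e, h):
    r = ei / math.sqrt(ms_res * (1 - hi))            # (internally) studentized residual
    cooks.append(r ** 2 / p * hi / (1 - hi))
    s2_del = ((n - p) * ms_res - ei ** 2 / (1 - hi)) / (n - p - 1)
    t = ei / math.sqrt(s2_del * (1 - hi))            # R-student (externally studentized)
    dffits.append(t * math.sqrt(hi / (1 - hi)))


def cooks_by_deletion(i):
    """Cook's D from its definition: refit without point i and compare all fitted values."""
    a0, a1 = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    shift = sum(((b0 + b1 * xj) - (a0 + a1 * xj)) ** 2 for xj in x)
    return shift / (p * ms_res)


most_influential = max(range(n), key=lambda i: cooks[i])
print("Cook's D:", [round(d, 3) for d in cooks])
print("DFFITS:  ", [round(d, 3) for d in dffits])
print("most influential index:", most_influential)
```

Observe how both measures combine the size of the residual with the leverage of the point through the factor h_ii/(1 - h_ii).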

We then turn to the problem of multicollinearity discussed in Chapter 9. Multicollinearity is present when two or more of the predictor variables in the model are moderately or highly correlated (linearly dependent). It is important to understand the impact of multicollinearity on various aspects of regression analysis. The main focus of the present lecture was on the effects of multicollinearity on the variance of the estimated regression coefficients, the length of the estimated vector of coefficients and the prediction accuracy. Go through section 9.3 and example 9.1. Specifically, this example demonstrates that multicollinearity among the regressors does not prevent good accuracy when predicting the response within the scope of the model (interpolation), but seriously harms the prediction accuracy when extrapolating.

Go through the whole section 9.4, and focus especially on the example with simulated data on p. 294, which demonstrates the need for measures of multiple correlation (not only pairwise ones, such as examination of the matrix X'X in its correlation form) for detecting multicollinearity. We have also discussed some more general methods of multicollinearity diagnostics, such as VIF and the eigensystem analysis of X'X, to explain the nature of the linear dependence. Check the example by Webster et al. (simulated data presented on p. 294) to see how to use the elements of the eigenvectors to catch the linear relationship between the predictors.
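The point that pairwise correlations can miss a multiple linear dependence can be illustrated with three made-up regressors where the third is nearly the sum of the first two. The sketch below computes the VIFs as the diagonal elements of the inverse of the correlation matrix; the data, the small Gauss-Jordan helper and all names are assumptions of this illustration and are not the simulated data from p. 294.

```python
import math


def corr(u, v):
    """Sample correlation coefficient of two equally long sequences."""
    n = len(u)
    ub, vb = sum(u) / n, sum(v) / n
    suv = sum((a - ub) * (b - vb) for a, b in zip(u, v))
    suu = sum((a - ub) ** 2 for a in u)
    svv = sum((b - vb) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def invert(a):
    """Inverse of a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        d = m[c][c]
        m[c] = [v / d for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


# Made-up regressors: x1 and x2 are uncorrelated, x3 is almost x1 + x2
x1 = [1, -1, 1, -1, 1, -1, 1, -1]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
x3 = [a + b + 0.1 * a * b for a, b in zip(x1, x2)]
cols = [x1, x2, x3]

R = [[corr(u, v) for v in cols] for u in cols]     # X'X in correlation form
R_inv = invert(R)
vif = [R_inv[j][j] for j in range(3)]              # VIF_j = 1 / (1 - R_j^2)

max_pairwise = max(abs(R[i][j]) for i in range(3) for j in range(3) if i != j)
print("largest pairwise correlation:", round(max_pairwise, 3))
print("VIFs:", [round(v, 1) for v in vif])
```

No pairwise correlation looks alarming here, yet the VIFs are far above the usual rules of thumb, because the dependence involves all three regressors at once.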

Lecture 7: summary and reading instructions.

One special case of GLS, weighted LS, was discussed along with one empirical method of estimating the weights using "near neighbors" clustering of the values of x (this method is suitable for the case when the variance of the errors is proportional to one of the regressors). Check example 5.5 and the effect of this type of weighting on the residuals for the model fitted to the transformed data (Figure 5.11). Another approach for obtaining weighted LS estimators in the linear model with unequal error variances (also called heteroscedastic) is presented below.

We then turn to Chapter 6, where the methods for detecting influential observations are presented. It is important to learn the distinction between an outlier, a data point whose response y does not follow the general trend of the data, and a data point which has high leverage, i.e., a point which has an unusual combination of predictor values. Both outliers and high leverage data points can be influential, i.e., they can dramatically change the results of the regression analysis, such as the predicted responses, the coefficient of determination, the estimated coefficients and the results of the tests of significance. During this lecture, we discuss various measures used for determining whether a point is an outlier, a high leverage point or both. Once such data points are identified, we then investigate whether they are influential. We first considered a measure of leverage (see section 6.2), and then discussed two measures of influence, Cook's distance and DFBETAS (difference in fits of beta). It is important to understand the general idea behind these measures; both are based on deletion diagnostics, i.e., they measure the influence of the i'th observation if it is removed from the data. It is also important to see that both these measures combine the residual magnitude with the location of the point of interest in x-space.



Lecture 6: summary and reading instructions.

During Lecture 5 we considered methods for detecting problems with a linear regression model. Once the problems with the model have been identified, we have a number of solutions, which are discussed during the current lecture. Section 5.2 presents variance stabilizing transforms and section 5.3 summarizes a number of transforms to linearize the model. Go through these sections, check examples 5.1 and 5.2, and Figure 5.4. You are expected to understand when (and which) transform of the response variable might help, and likewise for transforming the predictor variables. Observe that sometimes both need to be transformed to meet the three conditions of the linear regression model. Check carefully that you understand how to fit the regression model to the transformed data and how to check the model adequacy.

Observe that the methods of variable transformation above involve subjective decisions. This means that the model you select as the good one can be different from the one selected by your colleague, and both models can be appropriate! An analytic strategy for selecting the "best" power transform of the response variable is presented in Section 5.4.1; this is the Box-Cox transform. Go through this section (and the notes from the lecture) and check example 5.3. It is important to understand when the Box-Cox transform is suitable and how to choose the optimal value of the power parameter. Check the different strategies for maximizing the likelihood function and making inference about the power parameter. For an overview of the Box-Cox method with a number of examples, see the link Box-Cox transformations: An Overview, and for the R implementation of Box-Cox transformations for different purposes, graphical assessment of the success of the transforms and inference on the transformation parameter, see the link Box-Cox power transformations: Package "AID".
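The grid-search version of the Box-Cox procedure can be sketched in a few lines of Python: for each candidate power, transform the response with the normalized power transform (using the geometric mean of y), fit the model and record the residual sum of squares; the preferred power is the one with the smallest value. The data below are generated from a log-linear relation with small made-up perturbations, and the grid and function names are assumptions of this illustration.

```python
import math


def ss_res_line(x, y):
    """Residual sum of squares of the simple LS fit of y on x."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    syy = sum((yi - yb) ** 2 for yi in y)
    return syy - sxy ** 2 / sxx


def box_cox(y, lam):
    """Normalized Box-Cox transform; gm is the geometric mean of y (all y must be > 0)."""
    gm = math.exp(sum(math.log(v) for v in y) / len(y))
    if lam == 0:
        return [gm * math.log(v) for v in y]
    return [(v ** lam - 1) / (lam * gm ** (lam - 1)) for v in y]


# Made-up data: log(y) is linear in x up to small perturbations
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noise = [0.02, -0.01, 0.015, -0.02, 0.01, -0.015, 0.02, -0.01]
y = [math.exp(0.3 + 0.4 * xi + ei) for xi, ei in zip(x, noise)]

grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
ss = {lam: ss_res_line(x, box_cox(y, lam)) for lam in grid}
best_lam = min(ss, key=ss.get)

for lam in grid:
    print(f"lambda={lam:5.1f}  SS_Res={ss[lam]:10.4f}")
print("selected lambda:", best_lam)
```

The normalization by the geometric mean is what makes the residual sums of squares comparable across different powers; without it the scale of the transformed response would change with the power.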

The common problem of non-constant error variance can also be solved by using generalized LS, GLS, and its special version, weighted LS. We quickly discuss the GLS strategy presented in Section 5.3; read further sections 5.5.1-5.5.3, think about why these methods are suitable for fitting a linear regression model with unequal error variances, go through example 5.5 and think about the practical issues with GLS and weighted LS. I will return to the problem of weight estimation during the next lecture.
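For the simple linear model, weighted LS has a closed form that mirrors ordinary LS with weighted means and weighted sums of squares. The sketch below assumes, purely for illustration, that the error variance is proportional to x, so that the weights are taken as 1/x; the data and the function name are made up.

```python
def wls_fit(x, y, w):
    """Weighted LS fit of y = b0 + b1*x minimizing sum w_i * (y_i - b0 - b1*x_i)^2."""
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw          # weighted mean of x
    yw = sum(wi * yi for wi, yi in zip(w, y)) / sw          # weighted mean of y
    sxx = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return yw - b1 * xw, b1


# Made-up data whose scatter grows with x
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 2.3, 2.7, 4.6, 4.4, 7.1]
w = [1 / xi for xi in x]        # assumption: Var(error_i) proportional to x_i

b0_w, b1_w = wls_fit(x, y, w)
e_w = [yi - (b0_w + b1_w * xi) for xi, yi in zip(x, y)]

# The weighted normal equations: both weighted sums of residuals are zero
check_1 = sum(wi * ei for wi, ei in zip(w, e_w))
check_2 = sum(wi * xi * ei for wi, xi, ei in zip(w, x, e_w))
print(b0_w, b1_w, check_1, check_2)
```

Observations with a large error variance get a small weight and therefore pull less on the fitted line; with equal weights the formulas reduce to ordinary LS.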

Lecture 5: summary and reading instructions.

After revisiting the confidence sets for the regression coefficients, the F-test for comparing various subtypes of models and the problem of hidden extrapolation in multiple regression, we turn to the model evaluation strategies. The main question is whether the assumptions underlying the linear regression model seem reasonable when applied to the dataset in question. Since these assumptions are stated about the population (true) regression errors, we perform the model adequacy checking through the analysis of the sample-based (estimated) errors, the residuals.

The main ideas of residual analysis are presented in sections 4.2.1-4.2.3, and in section 4.3, where the PRESS residuals are used to compute an R²-like statistic for evaluating the predictive capability of the model. Go through sections 4.2.1-4.2.3, check the difference between internal and external scaling of residuals and go through numerical examples 4.1 and 4.2 (and the related graphs and tables). Specifically, be sure that you understand why we need to check the assumptions of the model and how we can detect various problems with the model by using residual analysis. Think about which of the formulas and methods we have used are at risk of being incorrect when specific model assumptions are violated.
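A useful fact behind the PRESS statistic is that the deletion residuals do not require n separate fits: the i'th PRESS residual equals e_i/(1 - h_ii). The following sketch verifies this for a simple linear regression by brute-force refitting and then computes the R²-like prediction statistic. The data are made up, and the leverage formula used is the one valid for simple regression.

```python
def fit(x, y):
    """Simple LS fit; returns (intercept, slope)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    return yb - b1 * xb, b1


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # made-up data
y = [1.2, 2.9, 4.1, 6.3, 6.8, 9.4]
n = len(x)

b0, b1 = fit(x, y)
xb, yb = sum(x) / n, sum(y) / n
sxx = sum((xi - xb) ** 2 for xi in x)
h = [1 / n + (xi - xb) ** 2 / sxx for xi in x]        # diagonal of the hat matrix
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]     # ordinary residuals

press_resid = [ei / (1 - hi) for ei, hi in zip(e, h)]

# Brute force: leave point i out, refit, and predict the left-out response
loo_resid = []
for i in range(n):
    a0, a1 = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    loo_resid.append(y[i] - (a0 + a1 * x[i]))

ss_t = sum((yi - yb) ** 2 for yi in y)
ss_res = sum(ei ** 2 for ei in e)
press = sum(r ** 2 for r in press_resid)

r2 = 1 - ss_res / ss_t
r2_pred = 1 - press / ss_t
print("R^2:", r2, " R^2 for prediction:", r2_pred)
```

Since every PRESS residual is at least as large in absolute value as the corresponding ordinary residual, the prediction-based statistic can never exceed the ordinary R².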

We consider various plots of residuals that are standard for model diagnostics. Go through section 4.2.3 and be sure that you understand how to "read" these plots, e.g., how to detect specific problems in practice. For example, how does non-constant error variance show up on a residuals vs. fits plot? How can a residuals vs. predictor plot be used to identify omitted predictors that could improve the model?

Lecture 4: summary and reading instructions.

We continue the discussion of the test procedures in multiple linear regression. We start by repeating the global test of model adequacy and turn to the test procedures for individual regression coefficients, tests on a subset of coefficients and tests of the general linear hypothesis, see sections 3.3.1-3.3.4. It is important to understand why the partial F-test, presented in section 3.3.2, measures the contribution of a subset of regressors to the model given that the other regressors are included in the model. Check Appendix C.3.3-C.3.4 for details and go through example 3.5, where the partial F-test is illustrated. Go through examples 3.6 and 3.7 of section 3.3.4, which demonstrate the unified approach for testing linear hypotheses about regression coefficients.
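The partial F-statistic only needs the residual sums of squares of the full and the reduced model. A minimal sketch follows, with p the number of parameters in the full model, r the number of coefficients being tested, and purely hypothetical numbers; the function name is an assumption of this note.

```python
def partial_f(ss_res_reduced, ss_res_full, r, n, p):
    """Partial F statistic for testing r coefficients of a full model with p parameters.

    The extra sum of squares SS_Res(reduced) - SS_Res(full) measures the contribution
    of the r tested regressors given that the others are already in the model.
    """
    extra_ss = ss_res_reduced - ss_res_full
    ms_res_full = ss_res_full / (n - p)
    return (extra_ss / r) / ms_res_full


# Hypothetical values: n = 25 observations, full model with p = 5 parameters,
# reduced model obtained by dropping r = 2 regressors
f0 = partial_f(ss_res_reduced=30.0, ss_res_full=20.0, r=2, n=25, p=5)
print("partial F:", f0)
```

Under the null hypothesis the statistic is compared with the F distribution with r and n - p degrees of freedom.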

We further discuss the confidence intervals for the coefficients and the mean response. Read sections 3.4.1-3.5 on your own. It is important to understand the difference between a one-at-a-time confidence interval (marginal inference) for a single regression coefficient and a simultaneous (or joint) confidence set for the whole vector of coefficients. Go through example 3.11 and think about the advantages and disadvantages of the two methods which have been presented: the joint confidence set given by (3.50) (confidence ellipse, see Fig. 3.8) and the Bonferroni-type correction strategy.

Standardization (centering and scaling) of the regression coefficients is presented in section 3.9. Check on your own the two approaches to standardization and the interpretation of the standardized regression coefficients. One application of the standardization step is presented further in section 3.10, where the problem of multicollinearity is introduced. Check why and how the standardization is applied here; we will discuss the problem of multicollinearity in detail during lectures 8 and 9.

The phenomenon of hidden extrapolation when predicting a new observation using the fitted model will be discussed in detail at the beginning of Lecture 5. Go through section 3.8; it is important to understand the structure of the regressor variable hull (RVH) and the role of the hat matrix H in specifying the location of the new data point in the x-space. Go through example 3.13 and inspect the related figures.


Lecture 3: summary and reading instructions.

Tests and confidence regions for various parameters in the simple linear model were discussed, with specific focus on the confidence region for the mean response and the prediction interval. Be sure that you understand the difference between these types of intervals.

The multiple linear regression model was introduced, starting with matrix notation and then turning to the LS normal equations, their solutions and the geometrical interpretation of the LS estimators. It is important to remember that, in general, any regression model that is linear in the coefficients (the betas) is a linear regression model, regardless of the shape of the surface it generates. Go through section 3.2.1 and be sure that you understand the structure of the matrix X'X and the structure and role of the hat matrix H. Go through example 3.1 and the graphical data presentation in section 3.2.1.

Go through sections 3.2.3-3.2.6 and check the properties of the parameter estimators obtained by both the LS and ML approaches. Check also Appendix C.4, where the optimality of the LS estimators is stated in the Gauss-Markov theorem.

We briefly discuss the global test in multiple linear regression. Go through section 3.3.1, check the assumptions for constructing the tests of significance and the computational formulas for the ANOVA representation, and read about checking the model adequacy using the adjusted coefficient of determination. Think about why this adjustment is needed. I will present the details during the next lecture.

The exercises selected for the second exercise session on Monday 24th of January are on the home page, see link Exercises.


Lecture 2: summary and reading instructions.

Tests of significance and confidence intervals for the slope, intercept and the variance of the error term were discussed for the simple linear regression model. Go through the numerical examples and check the graphs in Sections 2.3.1-2.3.2 of MPV. The fundamental analysis-of-variance (ANOVA) identity was presented along with the test of significance of regression. It is very important to understand how the partition of the total variability in the response variable is obtained and how the ANOVA-based F-test is derived; this strategy will be used throughout the whole course, specifically in the multiple linear regression models which will be presented during the next two lectures. Go through Section 2.3.3 and check why the F-test is equivalent to the t-test when testing significance of regression in the simple regression model.
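Both the ANOVA identity and the equivalence of the F-test and the t-test in simple regression can be checked numerically. The sketch below uses a small made-up data set (not one of the MPV examples); the variable names are assumptions of this illustration.

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]       # made-up data
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Partition of the total variability: SS_T = SS_R + SS_Res
ss_t = sum((yi - y_bar) ** 2 for yi in y)
ss_r = b1 * s_xy
ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ms_res = ss_res / (n - 2)

f0 = (ss_r / 1) / ms_res                  # ANOVA F statistic with 1 and n-2 d.f.
t0 = b1 / math.sqrt(ms_res / s_xx)        # t statistic for H0: slope = 0

print("SS_T:", ss_t, " SS_R + SS_Res:", ss_r + ss_res)
print("F0:", f0, " t0^2:", t0 ** 2)
```

The two statistics agree because the square of a t variable with n - 2 degrees of freedom is an F variable with 1 and n - 2 degrees of freedom.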

The concepts of the confidence interval for the mean response and the prediction interval for a future observation were presented. Go through Section 2.4.2 and check numerical examples 2.6 and 2.7. It is important to understand the principal difference between these two types of intervals and how they are supposed to be used in regression analysis.
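The difference between the two intervals is visible already in their standard errors: the prediction interval carries the extra variance of a single new observation. A minimal sketch on made-up data follows (the half-widths would be obtained by multiplying these standard errors by the appropriate t quantile with n - 2 degrees of freedom, which is left out here); the function names are assumptions of this illustration.

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]       # made-up data
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar
ms_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)


def se_mean_response(x0):
    """Standard error of the estimated mean response at x0."""
    return math.sqrt(ms_res * (1 / n + (x0 - x_bar) ** 2 / s_xx))


def se_prediction(x0):
    """Standard error for predicting a single future observation at x0."""
    return math.sqrt(ms_res * (1 + 1 / n + (x0 - x_bar) ** 2 / s_xx))


for x0 in (x_bar, 6.0):
    print(x0, se_mean_response(x0), se_prediction(x0))
```

Both standard errors are smallest at the mean of x and grow as x0 moves away from it, but the one for prediction never drops below the residual standard deviation.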

Read on your own Section 2.9, where some abuses of regression modeling are discussed, and Section 2.10, where the no-intercept regression model is presented as a special type of modeling (the idea is to force the intercept to be zero). Check the numerical examples of Section 2.10 and think about the differences from the previously presented model (which includes an intercept term); focus specifically on the properties of the coefficient of determination.

Go through Section 2.11 and convince yourself that the ML estimators of the slope and intercept are identical to those obtained by the LS approach. This does not hold for the variance estimator; check why.
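As a hint for the last point, the two variance estimators in the simple linear model can be written side by side (the tilde and hat notation is a choice made for this note):

```latex
\tilde{\sigma}^{2}
  = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i-\hat{\beta}_0-\hat{\beta}_1 x_i\bigr)^{2}
  = \frac{SS_{Res}}{n},
\qquad
\hat{\sigma}^{2} = MS_{Res} = \frac{SS_{Res}}{n-2},
\qquad
\tilde{\sigma}^{2} = \frac{n-2}{n}\,\hat{\sigma}^{2}.
```

The ML estimator divides by n and is therefore biased downwards, while the LS-based estimator divides by the residual degrees of freedom n - 2 and is unbiased; the difference vanishes as n grows.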

A short discussion of the case of a random regressor is presented in Section 2.12. Check it; I will briefly discuss it during the next lecture.

Observe that the exercises selected for the first exercise session on Friday 19th of January are on the home page, see link Exercises.


Lecture 1: summary and reading instructions.

An introduction to regression analysis was presented. See slides below: The simple linear regression model was discussed in detail, including the basic assumptions of equal variance of the error term, linearity and independence. The LS fitting strategy was discussed along with the properties of the obtained estimators of the regression coefficients. Go through these properties once again, read Sections 2.2.2-2.2.3 of MPV, check the normal equations given by (2.5), p. 14 and their solutions, show that both LS estimators of the slope and intercept are unbiased and find their variances. There are three sources

Go through Ex 2.1 and Ex 2.2 to see the numerical calculations for the LS fit, read about the residual properties and check the general properties 1.-5. of the LS fit presented on p. 20 of MPV.
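The calculations for the LS fit are easy to reproduce by hand or in a few lines of code. The following Python sketch uses a small made-up data set (not the data of Ex 2.1) to compute the solutions of the normal equations and to check some of the general properties of the LS fit numerically; the variable names are assumptions of this illustration.

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]       # made-up data
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx             # slope: solution of the normal equations
b0 = y_bar - b1 * x_bar      # intercept

fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

# Properties of the LS fit (they hold whenever the model contains an intercept)
sum_e = sum(resid)                                    # residuals sum to zero
sum_xe = sum(xi * ei for xi, ei in zip(x, resid))     # residuals are orthogonal to x
sum_fe = sum(fi * ei for fi, ei in zip(fitted, resid))  # ... and to the fitted values

print("b0 =", b0, " b1 =", b1)
print(sum_e, sum_xe, sum_fe)
```

The three sums are zero up to rounding error, and the fitted line passes through the point (x_bar, y_bar).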

Go through Sections 2.3.1-2.3.2 and check which additional assumptions are needed to perform the tests of significance on the slope and intercept. I will discuss this in detail during the next lecture.

Published by: Tatjana Pavlenko.
Updated: 07/12-2017