KTH Mathematics

Mathematical Statistics

SF2930 Regression Analysis - Course log and updates

Spring, 2017

This page presents the latest information about what is addressed in lectures and schedule changes. During the lectures the basic theory will be presented according to the plan. Observe that not all topics will be covered during the lectures and additional reading is required. Reading instructions will be provided below after each lecture.

Recent info: 29/03.

Exam is corrected. Students with grade Fx may ask for an oral exam to enhance their grade to E. Please contact Tatjana Pavlenko by email. Deadline: April 10.

Project 1 is evaluated and can be fetched from the student office, Teknikringen, 8. Students with grade Fx for the project 1 may improve the project presentation to enhance their grade to G (passed the project 1 part). The improved project report must be sent by e-mail to Felix Rios along with the Fx commented version. Deadline: April 10.

Project 2 is corrected. Students with grade Fx for the project 2 will be informed by e-mail about the completion needed to enhance their grade to G (passed the project 2 part). Deadline: April 10.

The written re-exam is scheduled for Friday June 8, 8.00-13.00. Please register for the re-exam of June 8 before May 15.

Requests for review of grade have to be filed to the SCI student affairs office. Please do not contact the examiner directly.

  • Questions for the exam pdf.

  • Obs! Correct time for Exercise 7 (If): 10.00-12.00 on 1st of March.

    Lectures 14 and 15 given by Henrik Bosaeus and Marianne Fjelberg: reading instructions, pages from MPV.

    Introduction to GLM: p. 421. Likelihood ratio test, Deviance: p. 430, p. 433--434. Wald test and confidence intervals: p. 436--437. Poisson regression, multiplicative form: p. 444--445. GLM model, Exponential family and link functions: p. 450--451.

    Lecture 13 given by Timo Koski: Slides and lecture notes.

    Lecture 12 given by Alexandre Chotard: Slides and summary.

    Binary decision trees were first defined as functions that partition the input space through axis-parallel splits and associate with each region of such a partition the mean of the responses of the data points in that region (or, for classification, the most common class in that region). We then discussed how decision trees are grown from the data by minimizing a prediction error measure, with the minimization following a top-down greedy approach. Limits of the top-down greedy approach were introduced, followed by a discussion of the power and limits of decision trees: a tree can fully represent any data set and must therefore be prevented from overfitting the data. Cost-complexity tree pruning was then introduced as a method to prevent overfitting. Advantages and limits of decision trees were then discussed, underlining the difficulty of representing simple problems such as problems with correlated variables, XOR or pairing problems.
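
    The greedy choice of a single axis-parallel split can be sketched as follows. This is only a minimal illustration for one predictor and a squared-error criterion; the function name `best_split` and the data are made up for this note and are not taken from the lecture slides.

```python
import numpy as np

def best_split(x, y):
    """Return (sse, threshold) of the best single axis-parallel split on x."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_sse, best_threshold = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold separates equal x values
        left, right = ys[:i], ys[i:]
        # Each region predicts the mean of its responses.
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_sse, best_threshold = sse, 0.5 * (xs[i - 1] + xs[i])
    return best_sse, best_threshold

# Made-up data with an obvious jump between x = 3 and x = 4.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.8])
sse, threshold = best_split(x, y)
```

    A full tree would apply the same search recursively to each of the two resulting regions and over all predictors.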

    Bagging was then introduced as a way to improve the predictive power of trees: different data sets are bootstrapped from the same training data set and used to create many bootstrapped trees. These may then be aggregated by taking the mean of the predictions (or the most voted class in classification), thus constructing a much more robust and reliable predictor from the same data. We discussed that the error made by the aggregated trees converges when the number of bootstrapped trees goes to infinity, and that this error can be upper bounded by a term depending on the correlation of the errors of the trees. We then introduced random forests, which, when constructing the bootstrapped trees, randomize at each split which predictors are considered, thereby diminishing the correlation between trees and further decreasing the generalization error of the forest.
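
    The role of the correlation between trees can be illustrated with a standard variance identity, which is related to but not the same as the bound from the lecture. Assume (for this illustration only) that the B tree predictions T_b(x) are identically distributed with variance sigma^2 and pairwise correlation rho:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

    The second term vanishes as B grows, so the remaining variance is governed by rho. This is why reducing the correlation between trees, as random forests do, can help.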

    Lecture 11: summary and reading instructions.

    We start with a discussion of variable selection using cross-validation. The main idea is that all aspects of the model fitting, including the variable selection, should be performed within the cross-validation loop. See details in section 6.5.3 of the book Intro to statistical learning (the e-book is available on the home page).
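
    A minimal sketch of keeping the selection step inside the cross-validation loop is given below. The selection rule used here (keep the `n_keep` predictors most correlated with the response) is an assumption chosen for brevity, not the method from the book, and the function name and data are hypothetical.

```python
import numpy as np

def cv_error_with_selection(X, y, n_keep, n_folds=5, seed=0):
    """K-fold CV estimate of the squared prediction error, selection done per fold."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    folds = np.array_split(rng.permutation(n), n_folds)
    sq_errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        X_tr, y_tr = X[train_idx], y[train_idx]
        # The selection step sees ONLY the training part of the current fold.
        corr = np.abs([np.corrcoef(X_tr[:, j], y_tr)[0, 1] for j in range(k)])
        keep = np.argsort(corr)[-n_keep:]
        A_tr = np.column_stack([np.ones(len(train_idx)), X_tr[:, keep]])
        beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
        A_te = np.column_stack([np.ones(len(test_idx)), X[test_idx][:, keep]])
        sq_errors.extend((y[test_idx] - A_te @ beta) ** 2)
    return float(np.mean(sq_errors))

# Pure-noise data: no predictor is truly related to the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 200))
y = rng.normal(size=50)
cv_err = cv_error_with_selection(X, y, n_keep=5)
```

    If the predictors were instead selected once on the full data set before the loop, the held-out observations would already have influenced the model and the error estimate would tend to be too optimistic.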

    We further recall the properties of ridge regression, which shrinks the regression coefficients by imposing a penalty on their size, and derive an equivalent (constrained) form of writing the ridge problem. It is important to understand that there is a one-to-one correspondence between the shrinkage and constraint parameters in the two formulations (see sections 6.2, 6.8 in the book Intro to statistical learning, and compare (6.5) with (6.9)). It is also important to understand that when there are many correlated variables in the linear model (i.e. the multicollinearity problem), their coefficients can become poorly determined and demonstrate high variance. This problem can be alleviated by imposing a size constraint on the coefficients, i.e. by performing ridge regression.

    We then discuss Lasso regression, which is also a shrinkage method like the ridge, but with a subtle and important difference. Due to the structure of its penalty term, the Lasso does a kind of continuous variable selection, unlike the ridge, which only shrinks. Computing the Lasso solution is a quadratic programming problem, and efficient algorithms are available for obtaining the entire path of solutions at the same computational cost as for ridge regression. The optimal value of the penalty parameter can be selected by cross-validation. Go through section 6.2.2 in Intro to statistical learning, with the focus on the Lasso variable selection properties. Think about the constrained Lasso formulation given by (6.9) and its connection to the theoretical statement of the variable selection problem stated in (6.10).
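
    The variable selection effect of the Lasso penalty can be seen in a small sketch. Note that the solver below is cyclic coordinate descent with soft-thresholding, which is a different technique from the path algorithms mentioned above and is used here only because it is short. The function names and the simulated data are assumptions made for this note; predictors and response are centered so that no intercept is needed.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z towards zero by lam, setting it exactly to zero if |z| <= lam."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of predictor j added back.
            partial_resid = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ partial_resid, lam) / col_ss[j]
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)
y = X @ np.array([3.0, 0.0, -2.0]) + rng.normal(scale=0.5, size=50)
y = y - y.mean()

lam_max = np.max(np.abs(X.T @ y))      # smallest penalty giving the all-zero solution
b_ls = lasso_cd(X, y, lam=0.0)         # no penalty: ordinary LS
b_mid = lasso_cd(X, y, lam=0.3 * lam_max)
b_zero = lasso_cd(X, y, lam=1.01 * lam_max)
```

    Comparing `b_ls`, `b_mid` and `b_zero` shows coefficients being shrunk and, for a large enough penalty, set exactly to zero, which ridge regression does not do.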

    Lecture 10: summary and reading instructions.

    For all of the regression analyses performed so far, it was assumed that all the regressor variables are relevant, i.e. should be included in the model. This is usually not the case in practical applications; more often there is a large set of candidate regressors from which a set of the most appropriate ones must be identified for inclusion in the final regression model. We start by considering theoretically the consequences of model misspecification (e.g. the effect of deleting a set of variables on the bias and variance of the coefficients of the retained ones). Check in detail the whole summary 1.-5. in section 10.1.2 and the motivations for variable selection. Two natural strategies for variable selection, stepwise (backward and forward) regression and best subsets regression, have been presented.

    The best subsets regression approach was discussed in detail (it is also called all possible subsets regression; the number of candidate models unfortunately quickly becomes huge, be sure that you understand why). Objective criteria for selecting the "best" model have been discussed. It is important to understand that different criteria can lead to different "best" models. Go through section 10.1.3 where the R²-value, its adjusted version, MSE and Mallows' Cp-statistic are presented and the relationship between these is discussed. Be sure that you understand why these measures are suitable for selecting the optimal model when using the all possible subsets strategy. Check example 10.1 and the related tables and graphs and be sure that you understand how to choose an optimal model based on the above-mentioned criteria.
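
    The all possible subsets strategy with Mallows' Cp can be sketched as follows. The error variance is estimated from the full model, p counts the intercept, and the function names and simulated data are assumptions made for this illustration only. With k candidate regressors the loop visits 2^k models, which is why the approach quickly becomes expensive.

```python
import itertools
import numpy as np

def ss_res(A, y):
    """Residual sum of squares of the LS fit of y on the columns of A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def all_subsets_cp(X, y):
    """Return a list of (subset, p, Cp) over all subsets of the columns of X."""
    n, k = X.shape
    ones = np.ones((n, 1))
    sigma2_full = ss_res(np.hstack([ones, X]), y) / (n - k - 1)
    results = []
    for size in range(k + 1):
        for subset in itertools.combinations(range(k), size):
            A = np.hstack([ones, X[:, np.array(subset, dtype=int)]])
            p = size + 1  # number of parameters, including the intercept
            cp = ss_res(A, y) / sigma2_full - n + 2 * p
            results.append((subset, p, cp))
    return results

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=40)
results = all_subsets_cp(X, y)
best_subset, best_p, best_cp = min(results, key=lambda item: item[2])
```

    Taking the minimum Cp is only one possible rule; as noted above, other criteria may single out a different subset.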

    Read section 10.2.2 yourself about the general idea behind stepwise regression, be sure that you understand how to conduct stepwise regression using the partial F-statistic, and check examples 10.3 and 10.4 to see the strategy of adding or removing a regressor. Think also about the limitations of best subsets and stepwise variable selection in regression models (see the general comments on stepwise-type approaches on p. 349). Go through sections 10.3-10.4, which present the main steps of a good model building strategy along with a case study (unfortunately only SAS output is presented, but similar graphs and tables can be obtained with R).

    Lecture 9: summary and reading instructions.

    After repetition of the common methods used for detecting multicollinearity (with special focus on the eigensystem analysis presented in section 9.4.3), we turn to the methods for overcoming multicollinearity. We have discussed two strategies: 1) ridge regression, which shrinks the LS regression coefficients by imposing a penalty on their size (see section 9.5.3) and 2) principal component regression, PCR, where the principal components are first obtained by transforming the original predictors, and then these components are used as new derived predictors in the regression model (see section 9.5.4).

    It is important to understand how the ridge estimators are constructed (check the bias-variance trade-off), what the role of the biasing parameter is (sometimes also called the penalty or tuning parameter) and how this parameter can be selected. Observe that instead of the single solution we had in LS, ridge regression generates a path/trace of solutions which is a function of the biasing parameter. Check carefully example 9.2 where the choice of the parameter by inspection of the ridge trace is discussed. Another approach to optimizing the biasing parameter is presented in the book by Izenman (see section 5.7, the algorithm in table 5.7, p. 138; see the course home page for the e-book). This approach considers the cross-validatory (CV) choice of the ridge parameter and is more suitable if the model is supposed to be used for prediction.
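
    A ridge trace can be sketched as follows: the predictors are centered and scaled to unit length, and the ridge estimator (W'W + kI)^{-1} W'y is computed over a grid of values of the biasing parameter k. As a simplification the response is only centered here. The function names, the grid and the simulated data are assumptions made for this note.

```python
import numpy as np

def unit_length_scale(X):
    """Center each column and scale it to unit length, so W'W is a correlation matrix."""
    Xc = X - X.mean(axis=0)
    return Xc / np.sqrt(np.sum(Xc ** 2, axis=0))

def ridge_trace(W, y0, ks):
    """Rows are the ridge estimates for the successive values of k in ks."""
    p = W.shape[1]
    WtW, Wty = W.T @ W, W.T @ y0
    return np.array([np.linalg.solve(WtW + k * np.eye(p), Wty) for k in ks])

rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + x1 + x2 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

W = unit_length_scale(X)
y0 = y - y.mean()
ks = [0.0, 0.01, 0.1, 1.0]
trace = ridge_trace(W, y0, ks)
norms = np.linalg.norm(trace, axis=1)  # length of the coefficient vector for each k
```

    Plotting the columns of `trace` against `ks` gives the ridge trace; the first row (k = 0) is the ordinary LS solution in the scaled variables.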

    The key idea of overcoming multicollinearity using PCR is to exclude those principal components which correspond to the smallest eigenvalues (think about why exactly these components must be dropped). Be sure that you understand how the principal components are constructed from the original data matrix X. Check example 9.3 and observe that the final PCR estimators of the original beta-coefficients can be obtained by back-transforming.
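
    The construction and the back-transformation can be sketched as follows, assuming W holds the centered, unit-length-scaled predictors and y0 the centered response. The function name `pcr_coefficients` and the simulated data are hypothetical.

```python
import numpy as np

def pcr_coefficients(W, y0, n_components):
    """PCR estimate in the scaled predictors, keeping the largest components."""
    eigvals, T = np.linalg.eigh(W.T @ W)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals, T = eigvals[order], T[:, order]
    T_keep = T[:, :n_components]
    Z = W @ T_keep                             # principal components (scores)
    alpha = (Z.T @ y0) / eigvals[:n_components]  # LS on Z, since Z'Z is diagonal
    return T_keep @ alpha                      # back-transform to the scaled predictors

rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)            # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + x1 + x2 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

Xc = X - X.mean(axis=0)
W = Xc / np.sqrt(np.sum(Xc ** 2, axis=0))
y0 = y - y.mean()
b_all = pcr_coefficients(W, y0, 3)   # all components kept: same as LS
b_pcr = pcr_coefficients(W, y0, 2)   # smallest-eigenvalue component dropped
```

    Keeping all components reproduces the LS estimate, so any change comes entirely from the dropped component.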

    Important! Both methods assume that the original data are scaled to unit length (see p. 114 for the scaling step), so that Y and each of the p columns of X have zero empirical mean.

    Lecture 8: summary and reading instructions.

    We further discuss techniques for identifying influential data points. It is important to understand that Cook's distance measure summarizes how much all of the fitted values change when the i'th data point is deleted. A data point having large value of Cook's distance measure has strong influence on the fitted values. We further introduced DFFITS, difference in fits, which quantifies the number of standard deviations that the fitted value changes when the i'th data point is removed. Finally I recommended some strategies for dealing with problematic data points in practical applications.
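
    DFFITS can be computed without refitting the model n times, using the leverages and the externally studentized residuals. The following is a minimal sketch; the function name `dffits` and the data (a straight-line trend with one planted unusual point) are made up for this note.

```python
import numpy as np

def dffits(X, y):
    """DFFITS for every observation, computed from a single fit."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)
    h = np.diag(H)                                   # leverages
    e = y - H @ y                                    # ordinary residuals
    ss_res = e @ e
    # Error variance estimate with observation i deleted.
    s2_deleted = (ss_res - e ** 2 / (1.0 - h)) / (n - p - 1)
    r_student = e / np.sqrt(s2_deleted * (1.0 - h))  # externally studentized
    return r_student * np.sqrt(h / (1.0 - h))

rng = np.random.default_rng(0)
x = np.append(np.arange(10.0), 25.0)                 # last point is far out in x
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)
y[-1] -= 8.0                                         # and off the general trend
X = np.column_stack([np.ones_like(x), x])
values = dffits(X, y)
```

    Each entry of `values` is the change in the i'th fitted value when the i'th point is deleted, measured in units of its estimated standard deviation.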

    We then turn to the problem of multicollinearity discussed in Chapter 9. Multicollinearity is present when two or more of the predictor variables in the model are moderately or highly correlated (linearly dependent). It is important to understand the impact of multicollinearity on various aspects of regression analyses. The main focus of the present lecture was on the effects of multicollinearity on the variance of the estimated regression coefficients, the length of the estimated vector of coefficients and the prediction accuracy. Go through section 9.3 and example 9.1. Specifically, this example demonstrates that multicollinearity among the regressors does not prevent good accuracy of predictions of the response within the scope of the model (interpolation) but seriously harms the prediction accuracy when performing extrapolation.

    Go through the whole section 9.4, focusing especially on the example with simulated data on p. 294, which demonstrates the need for measures of multiple correlation (not only pairwise ones, such as examination of the matrix X'X in its correlation form) for detecting multicollinearity. We also discuss some more general methods of multicollinearity diagnostics such as VIF and, in short, I have mentioned the eigensystem analysis of X'X, which will be presented in more detail during the next lecture.
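
    The point that pairwise correlations may miss a multicollinearity involving several regressors can be illustrated with a small sketch. The VIFs are taken as the diagonal of the inverse of the correlation matrix of the regressors; the function name `vif` and the simulated data are assumptions made for this note and are not the data from p. 294.

```python
import numpy as np

def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.1 * rng.normal(size=n)   # near-linear dependence on x1 AND x2
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.max(np.abs(corr - np.eye(3)))   # largest off-diagonal correlation
vifs = vif(X)
```

    Here no single pairwise correlation is extreme, yet all three VIFs are large, because x3 is almost an exact linear combination of x1 and x2.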

    Obs! Questions and discussion of project 1 in sf2930. You will have time to ask questions about your project work on 16th of February, 2017, 14:00--16:00 in hall 3721.

    Lecture 7: summary and reading instructions.

    One special case of GLS, weighted LS, was discussed along with one empirical method of estimating the weights using "near neighbors" clustering of the values of x (this method is suitable when the variance of the errors is proportional to one of the regressors). Check example 5.5 and the effect of this type of weighting on the residuals for the model fitted to the transformed data (Figure 5.11). Another approach for obtaining weighted LS estimators in the linear model with unequal error variances (also called heteroscedastic) is presented below.

    We then turn to Chapter 6 where the methods for detecting influential observations are presented. It is important to learn the distinction between an outlier, a data point whose response y does not follow the general trend of the data, and a data point with high leverage, i.e. a point which has an unusual combination of predictor values. Both outliers and high leverage data points can be influential, i.e. can dramatically change the results of the regression analysis such as the predicted responses, the coefficient of determination, the estimated coefficients and the results of the tests of significance. During this lecture, we discuss various measures used for determining whether a point is an outlier, has high leverage, or both. Once such data points are identified, we then investigate whether they are influential. We first considered a measure of leverage (see section 6.2), and then discussed two measures of influence, Cook's distance and DFBETAS (difference in betas). It is important to understand the general idea behind these measures; both are based on deletion diagnostics, i.e. they measure the influence of the i'th observation if it is removed from the data. It is also important to see that both these measures combine the residual magnitude with the location of the point of interest in x-space.

    Go through the sections 6.1-6.7, check examples 6.1- 6.4 and think about treatment of influential observations in practice. I will be back to the strategies of dealing with problematic data points during the next lecture.
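
    Leverage and Cook's distance can be computed from a single fit, as in the following minimal sketch. It shows how Cook's distance combines the (internally studentized) residual with the leverage. The function name `leverage_and_cooks` and the data, with one planted unusual point, are made up for this note.

```python
import numpy as np

def leverage_and_cooks(X, y):
    """Return the leverages h_ii and Cook's distances D_i."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # diagonal of the hat matrix
    e = y - X @ (XtX_inv @ (X.T @ y))             # ordinary residuals
    ms_res = e @ e / (n - p)
    r = e / np.sqrt(ms_res * (1.0 - h))           # internally studentized residuals
    cooks_d = (r ** 2 / p) * h / (1.0 - h)        # residual size times location in x-space
    return h, cooks_d

rng = np.random.default_rng(0)
x = np.append(np.arange(10.0), 25.0)              # last point has unusual x
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)
y[-1] -= 8.0                                      # and does not follow the trend
X = np.column_stack([np.ones_like(x), x])
h, cooks_d = leverage_and_cooks(X, y)
```

    The leverages depend only on X, while Cook's distance also uses the response, so a high leverage point is flagged as influential only if its residual is also non-negligible.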

    Lecture 6: summary and reading instructions.

    During Lecture 5 we considered methods for detecting problems with a linear regression model. Once the problems with the model are identified, we have a number of solutions, which are discussed during the current lecture. Section 5.2 presents variance stabilizing transforms and section 5.3 summarizes a number of transforms to linearize the model. Go through these sections, check examples 5.1 and 5.2, and Figure 5.4. You are expected to understand when (and which) transform of the response variable might help, and the same for transforming the predictor variables. Observe that sometimes it is necessary to transform both to meet the three conditions of the linear regression model. Check carefully that you understand how to fit the regression model to the transformed data and how to check the model adequacy.

    Observe that the methods of variable transforms above involve subjective decisions. This means that the model you select as the good one can be different from the one selected by your colleague, and both models can be appropriate! An analytic strategy for selecting the "best" power transform of the response variable is presented in Section 5.4.1: this is the Box-Cox transform. Go through this section (notes from the lecture) and check example 5.3. It is important to understand when the Box-Cox transform is suitable and how to choose the optimal value of the power parameter. Check the different strategies for maximizing the likelihood function and making inference about the power parameter. For an overview of the Box-Cox method with a number of examples, see the link.

    The common problem of non-constant error variance can also be solved by using generalized LS, GLS, and its special version, weighted LS. We quickly discuss the GLS strategy presented in Section 5.3. Read further sections 5.5.1-5.5.3, think about why these methods are suitable for fitting a linear regression model with unequal error variance, go through example 5.5 and think about practical issues with GLS and weighted LS. I will be back to the problem of weight estimation during the next lecture.
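
    For the Box-Cox procedure of Section 5.4.1, the choice of the power parameter over a grid can be sketched as follows. The transform is scaled by the geometric mean of the responses so that the residual sums of squares are comparable across values of lambda, and the value with the smallest residual sum of squares is taken. The function name, the grid and the simulated data (generated so that a log transform is appropriate) are assumptions made for this note.

```python
import numpy as np

def box_cox_ss_res(X, y, lam):
    """Residual sum of squares after the scaled Box-Cox transform with power lam."""
    y_dot = np.exp(np.mean(np.log(y)))            # geometric mean of the responses
    if abs(lam) < 1e-12:
        z = y_dot * np.log(y)                     # limiting case lam = 0
    else:
        z = (y ** lam - 1.0) / (lam * y_dot ** (lam - 1.0))
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    r = z - X @ beta
    return float(r @ r)

rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 30)
y = np.exp(0.5 + 0.3 * x + rng.normal(scale=0.05, size=x.size))  # log(y) is linear in x
X = np.column_stack([np.ones_like(x), x])

lambdas = np.arange(-2.0, 2.01, 0.25)
ss = np.array([box_cox_ss_res(X, y, lam) for lam in lambdas])
best_lambda = float(lambdas[np.argmin(ss)])
```

    In practice one would usually round the selected value to a convenient power and also look at a confidence interval for lambda rather than at the grid minimum alone.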

    Obs! Extra Exercise session is booked 7th of February, 17.00-19.00 in hall 3418, Lindstedtsvägen 25, floor 4.

    Lecture 5: summary and reading instructions.

    After revisiting the joint confidence sets for the regression coefficients and the problem of hidden extrapolation in multiple regression, we turn to the model evaluation strategies. The main question is whether the assumptions underlying the linear regression model seem reasonable when applied to the dataset in question. Since these assumptions are stated about the population (true) regression errors, we perform the model adequacy checking through the analysis of the sample-based (estimated) errors, the residuals.

    The main ideas of residual analysis are presented in sections 4.2.1-4.2.3, and in section 4.3 where the PRESS residuals are used to compute an R²-like statistic for evaluating the predictive capability of the model. Go through sections 4.2.1-4.2.3, check the difference between internal and external scaling of residuals and go through numerical examples 4.1 and 4.2 (and the related graphs and tables). Specifically, be sure that you understand why we need to check the assumptions of the model and how we can detect various problems with the model by using residual analysis. Think about which of the formulas and methods we have used are at risk of being incorrect when specific model assumptions are violated.
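
    The different scalings of the residuals and the PRESS-based statistic can be sketched as follows. The function name `residual_diagnostics`, the dictionary keys and the simulated data are assumptions made for this note.

```python
import numpy as np

def residual_diagnostics(X, y):
    """Scaled residuals, PRESS and the R^2-like prediction statistic."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)
    h = np.diag(H)
    e = y - H @ y
    ss_res = float(e @ e)
    ms_res = ss_res / (n - p)
    # Internal scaling: the variance estimate uses all n observations.
    studentized = e / np.sqrt(ms_res * (1.0 - h))
    # External scaling: the variance estimate leaves out observation i.
    s2_deleted = (ss_res - e ** 2 / (1.0 - h)) / (n - p - 1)
    r_student = e / np.sqrt(s2_deleted * (1.0 - h))
    # PRESS residuals are the leave-one-out prediction errors.
    press_residuals = e / (1.0 - h)
    press = float(np.sum(press_residuals ** 2))
    ss_total = float(np.sum((y - y.mean()) ** 2))
    return {
        "studentized": studentized,
        "r_student": r_student,
        "press_residuals": press_residuals,
        "press": press,
        "ss_res": ss_res,
        "r2_prediction": 1.0 - press / ss_total,
    }

rng = np.random.default_rng(0)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
diag = residual_diagnostics(X, y)
```

    The PRESS residuals are obtained here without refitting the model n times, using only the ordinary residuals and the diagonal of the hat matrix.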

    We consider various plots of residuals that are standard for the model diagnostics, go through the section 4.2.3 and be sure that you understand how to "read" these plots, e.g. how to detect specific problems in practice. For example, how does non-constant error variance show up on a residual vs. fits plot? How to use residuals vs predictor plot to identify omitted predictors that can improve the model?

    During the next lecture we will discuss some remedies for the cases when the model assumptions for the linear regression fail.

    Lecture 4: summary and reading instructions.

    We continue the discussion of the test procedures in multiple linear regression. We start by repeating the global test of model adequacy and turn to the test procedures for individual regression coefficients, tests of a subset of coefficients and tests of the general linear hypothesis, see sections 3.3.1-3.3.4. It is important to understand why the partial F-test, presented in section 3.3.2, measures the contribution of a subset of regressors to the model given that the other regressors are included in the model. Check Appendix C.3.3-C.3.4 for details and go through example 3.5 where the partial F-test is illustrated. Go through examples 3.6 and 3.7 of section 3.3.4, which demonstrate the unified approach for testing linear hypotheses about regression coefficients.
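
    The partial F-statistic can be sketched as the extra sum of squares due to the tested regressors, divided by its degrees of freedom and by the residual mean square of the full model. The function names and the simulated data are assumptions made for this note; the reduced model here simply drops the last column of the full model.

```python
import numpy as np

def residual_ss(X, y):
    """Residual sum of squares of the LS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def partial_f(X_full, X_reduced, y):
    """Partial F-statistic for the regressors in X_full that are not in X_reduced."""
    n, p = X_full.shape
    r = p - X_reduced.shape[1]                    # number of tested coefficients
    ss_full = residual_ss(X_full, y)
    ss_reduced = residual_ss(X_reduced, y)
    ms_res = ss_full / (n - p)                    # from the full model
    return ((ss_reduced - ss_full) / r) / ms_res

rng = np.random.default_rng(1)
n = 30
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X_full = np.column_stack([np.ones(n), x1, x2])
X_reduced = X_full[:, :2]                         # model without x2
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)
f_stat = partial_f(X_full, X_reduced, y)
```

    The statistic would be compared with an F distribution with r and n - p degrees of freedom. When a single coefficient is tested, it equals the square of the corresponding t-statistic.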

    We further briefly discuss the confidence intervals for the coefficients and the mean response. Read sections 3.4.1-3.5 yourself. It is important to understand the difference between a one-at-a-time confidence interval (marginal inference) for a single regression coefficient and a simultaneous (or joint) confidence set for the whole vector of coefficients. Go through example 3.11 and think about the advantages and disadvantages of the two methods: the joint confidence set given by (3.50) (confidence ellipse) and the Bonferroni method. I will discuss the Bonferroni-type methods during the next lecture.

    The phenomenon of hidden extrapolation in prediction and estimation using the fitted model was discussed. Go through section 3.8; it is important to understand the structure of the regressor variable hull (RVH) and the role of the hat matrix in specifying the location of a new data point in the x-space. Go through example 3.13 and inspect the related figures.
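
    The check for hidden extrapolation can be sketched by comparing h00 = x0'(X'X)^{-1}x0 for the new point with the largest diagonal element of the hat matrix. The function name and the data are made up for this note: the two regressors are strongly correlated, and the first new point lies within the range of each regressor separately but far from the region covered by the data.

```python
import numpy as np

def hidden_extrapolation_check(X, x0):
    """Return (flag, h00, h_max); flag is True if x0 lies outside the h_max ellipsoid."""
    XtX_inv = np.linalg.inv(X.T @ X)
    h_diag = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # diagonal of the hat matrix
    h_max = float(h_diag.max())
    h00 = float(x0 @ XtX_inv @ x0)
    return h00 > h_max, h00, h_max

x1 = np.linspace(0.0, 10.0, 11)
offsets = np.where(np.arange(11) % 2 == 0, 0.5, -0.5)
x2 = x1 + offsets                                      # x2 stays close to x1
X = np.column_stack([np.ones(11), x1, x2])

# x1 = 9 and x2 = 1 are each inside the observed ranges, but not jointly.
outside, h00_far, h_max = hidden_extrapolation_check(X, np.array([1.0, 9.0, 1.0]))
inside, h00_mid, _ = hidden_extrapolation_check(X, np.array([1.0, 5.0, 5.0]))
```

    Looking at the ranges of the individual regressors would not reveal the problem with the first point, which is why the comparison is made through the hat matrix.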

    Standardization (centering and scaling) of the regression coefficients is presented in section 3.9. Check yourself the two approaches for standardization and the interpretation of the standardized regression coefficients. One application of the standardization step is presented further in section 3.10 where the problem of multicollinearity is presented. Check why and how the standardization is applied here, we will discuss the problem of multicollinearity in detail during lectures 8 and 9.

    Lecture 3: summary and reading instructions.

    Short discussion was given on the case when both response and explanatory variables in the simple regression are random. Check details of the test procedure for the correlation coefficient and numerical example 2.9, section 2.12.2.

    The multiple linear regression model was presented, starting with matrix notation and then turning to the LS normal equations, their solutions and the geometrical interpretation of the LS estimators. Go through section 3.2.1 and be sure that you understand the structure of the matrix X'X and the structure and the role of the hat matrix. Go through example 3.1 and the graphical data presentation in section 3.2.1.
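
    The normal equations and the hat matrix can be explored numerically with a few lines. This is a minimal sketch on simulated data; the variable names and the data are assumptions made for this note.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # first column: intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)     # solves the normal equations X'X b = X'y
H = X @ np.linalg.solve(XtX, X.T)            # hat matrix X (X'X)^{-1} X'
y_hat = H @ y                                # fitted values: projection of y
residuals = y - y_hat
```

    The hat matrix is symmetric and idempotent, its trace equals the number of parameters, and the residuals are orthogonal to the columns of X, which is the geometrical interpretation of the LS fit.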

    Go through sections 3.2.3-3.2.6 and check the properties of the parameter estimators obtained by both the LS and ML approaches. Check also Appendix C.4 where the optimality of the LS estimators is stated in the Gauss-Markov theorem.

    We start with the global test in multiple regression. Go through section 3.3.1, check the assumptions for constructing the tests of significance and the computational formulas for the ANOVA representation, and read about checking the model adequacy using the adjusted coefficient of determination. Think about why this adjustment is needed.
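
    The effect of the adjustment can be seen in a small numerical sketch, where p denotes the number of parameters including the intercept. The function names and the sums of squares below are made-up numbers for illustration only.

```python
def r_squared(ss_res, ss_total):
    """Ordinary coefficient of determination."""
    return 1.0 - ss_res / ss_total

def adjusted_r_squared(ss_res, ss_total, n, p):
    """Adjusted R^2: sums of squares are divided by their degrees of freedom."""
    return 1.0 - (ss_res / (n - p)) / (ss_total / (n - 1))

n, ss_total = 25, 100.0
# Model A: p = 5 parameters. Model B adds one nearly useless regressor.
r2_a = r_squared(20.0, ss_total)
r2_b = r_squared(19.9, ss_total)
adj_a = adjusted_r_squared(20.0, ss_total, n, p=5)
adj_b = adjusted_r_squared(19.9, ss_total, n, p=6)
```

    In this made-up case R² increases slightly when the extra regressor is added, while the adjusted version decreases, because the small reduction in the residual sum of squares does not compensate for the lost degree of freedom.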

    The exercises selected for the second exercise session on Monday 30th of January are already on the home page, see link Exercises.

    Lecture 2: summary and reading instructions.

    Tests of significance and confidence intervals for the slope, intercept and the variance of the error term were discussed for the simple linear regression model. Go through the numerical examples and check the graphs in Sections 2.3.1-2.3.2 of MPV. The fundamental analysis-of-variance (ANOVA) identity was presented along with the test of significance of regression. It is very important to understand how the partition of the total variability in the response variable is obtained and how the ANOVA-based F-test is derived. This strategy will be used throughout the whole course, specifically in the multiple linear regression models which will be presented during the next two lectures. Go through Section 2.3.3 and check why the F-test is equivalent to the t-test when testing significance of regression in the simple regression model.
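
    The ANOVA identity and the equivalence of the F-test and the t-test in simple regression can be verified numerically. This is a minimal sketch; the data are made up for this note and are not taken from MPV.

```python
import numpy as np

# Made-up illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

x_bar, y_bar = x.mean(), y.mean()
s_xx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / s_xx    # LS slope
b0 = y_bar - b1 * x_bar                          # LS intercept
fitted = b0 + b1 * x

ss_total = np.sum((y - y_bar) ** 2)              # total variability
ss_reg = np.sum((fitted - y_bar) ** 2)           # explained by the regression
ss_res = np.sum((y - fitted) ** 2)               # residual variability
ms_res = ss_res / (n - 2)

f_stat = ss_reg / ms_res                         # regression has 1 degree of freedom
t_stat = b1 / np.sqrt(ms_res / s_xx)             # t-statistic for H0: slope = 0
```

    Here `ss_total` equals `ss_reg + ss_res`, and `f_stat` equals `t_stat ** 2`, up to rounding error.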

    The concepts of a confidence interval for the mean response and a prediction interval for a future observation were presented. Go through Section 2.4.2 and check numerical examples 2.6 and 2.7. It is important to understand what the principal difference between these two types of intervals is and how they are supposed to be used in regression analysis.
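
    For reference, the two intervals at a point x_0 in the simple linear regression model can be written side by side (the notation S_xx and MS_Res is assumed to follow MPV):

```latex
% Confidence interval for the mean response at x_0
\hat{y}_0 \pm t_{\alpha/2,\,n-2}
  \sqrt{MS_{Res}\left(\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}\right)}

% Prediction interval for a future observation at x_0
\hat{y}_0 \pm t_{\alpha/2,\,n-2}
  \sqrt{MS_{Res}\left(1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}\right)}
```

    The extra 1 under the square root accounts for the variance of the new observation itself, so the prediction interval is always wider and does not shrink to zero width as n grows.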

    Read Section 2.9 yourself, where some abuses of regression modeling are discussed, and Section 2.10, where the no-intercept regression model is presented as a special type of modeling (the idea is to force the intercept to be zero). Check the numerical examples of Section 2.10 and think about the differences from the previously presented model (which includes an intercept term), focusing specifically on the properties of the coefficient of determination.

    Go through the Section 2.11 and convince yourself that the ML estimators of the slope and intercept are identical to those obtained by LS approach, this does not hold for the variance estimator, check why.

    A short discussion of the case of a random regressor is presented in Section 2.12. Check it; I will briefly discuss it during the next lecture.

    Observe that the exercises selected for the first exercise session on Monday 23rd of January are already on the home page, see link Exercises.

    Lecture 1: summary and reading instructions.

    Introduction to the regression analysis was presented. The simple linear regression model was discussed in detail, including basic assumptions on equal variance of the error term, linearity and independence. LS fitting strategy was discussed along with the properties of the obtained estimators of the regression coefficients. Go through these properties once again, read Sections 2.2.2-2.2.3 of MPV, check normal equations given by (2.5), p. 14 and their solutions, show that both LS estimators of the slope and intercept are unbiased and find their variances.
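
    The closed-form solution of the normal equations can be sketched as follows. This is only a minimal illustration: the function name `simple_ls_fit` and the data are made up for this note and are not taken from MPV.

```python
import numpy as np

def simple_ls_fit(x, y):
    """Return (intercept, slope) of the LS line y = b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    s_xx = np.sum((x - x_bar) ** 2)
    s_xy = np.sum((x - x_bar) * (y - y_bar))
    b1 = s_xy / s_xx              # slope
    b0 = y_bar - b1 * x_bar       # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Made-up illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b0, b1 = simple_ls_fit(x, y)
residuals = y - (b0 + b1 * x)
```

    The residuals sum to zero and are orthogonal to x, which are two of the general properties of the LS fit.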

    Go through the Ex 2.1 and Ex 2.2 to see the numerical calculations for the LS fit, study residual properties and check the general properties of the LS fit presented on p. 20 of MPV.

    Go through the Section 2.3.1-2.3.2 and check which additional assumptions are needed to perform the tests of significance on the slope and intercept.

  • Published by: Tatjana Pavlenko.
    Updated: 15/01-2017