*Time:* **20 October 2017, 10:15-11:15.**
**Seminar room 3418**, Department of
Mathematics, KTH, Lindstedtsvägen 22.
*Speaker:*
**Kristoffer Brodin**
**Title:**
Statistical Machine Learning from Classification
Perspective:
Prediction of Household Ties for Economical Decision Making
**Abstract**
In modern society, many companies hold large data records on their individual customers, containing information about attributes such as name, gender,
marital status, address, etc. These attributes can be used to link customers
together, depending on whether they share some sort of relationship with each
other or not. In this thesis the goal is to investigate and compare methods to
predict relationships between individuals in terms of what we define as a
household relationship, i.e. we wish to identify which individuals are sharing
living expenses with one another. The objective is to explore the ability of
three supervised statistical machine learning methods, namely, logistic
regression (LR), artificial neural networks (ANN) and the support vector machine
(SVM), to predict these household relationships and evaluate their predictive
performance for different settings of their corresponding tuning parameters.
Data on a limited population of individuals, containing information about
household affiliation and attributes, were available for this task. In order to
apply these methods, the problem had to be formulated in a form enabling
supervised learning, i.e. a target and input predictors,
based on the set of attributes associated with each individual, had to be
derived. We have presented a technique which forms pairs of individuals under
the hypothesis H_{0} that they share a household relationship, and then
constructs a test of significance. This technique transforms the problem into
a standard binary classification problem. A sample of observations could be
generated by randomly pairing individuals and using the available data on each
individual to code the corresponding values of Y and X for each random
pair. For evaluation and tuning of the three supervised learning methods, the
sample was split into a training set, a validation set and a test set.
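
The pairing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy records, the two match indicators used as X, and the 60/20/20 split are hypothetical choices, not the attribute set or proportions used in the thesis.

```python
import random

# Hypothetical toy records: (person_id, household_id, surname, address).
# The fields are illustrative assumptions, not the thesis's attributes.
PEOPLE = [
    (1, "A", "Berg", "Storgatan 1"),
    (2, "A", "Berg", "Storgatan 1"),
    (3, "B", "Lind", "Parkvägen 5"),
    (4, "B", "Holm", "Parkvägen 5"),
    (5, "C", "Ek", "Sjövägen 9"),
    (6, "C", "Ek", "Sjövägen 9"),
]


def make_pair(p, q):
    """Code one pair: X = attribute-match indicators, Y = 1 if same household."""
    x = (int(p[2] == q[2]), int(p[3] == q[3]))  # surname match, address match
    y = int(p[1] == q[1])                       # H_0 true for this pair?
    return x, y


def sample_pairs(people, n_pairs, rng):
    """Draw n_pairs random pairs, each consisting of two distinct individuals."""
    return [make_pair(*rng.sample(people, 2)) for _ in range(n_pairs)]


def split(sample, fractions=(0.6, 0.2)):
    """Split into training, validation and test sets (fractions are assumed)."""
    n_train = round(fractions[0] * len(sample))
    n_val = round(fractions[1] * len(sample))
    return (sample[:n_train],
            sample[n_train:n_train + n_val],
            sample[n_train + n_val:])


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = sample_pairs(PEOPLE, 100, rng)
train, val, test = split(sample)
```

Each element of `sample` is an `(x, y)` tuple, so any binary classifier that accepts a feature vector and a 0/1 target could be fitted on `train`, tuned on `val` and assessed on `test`.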
We have seen that the prediction error, in terms of misclassification rate, is
very small for all three methods, since the two classes, H_{0} true and H_{0}
false, lie far apart and are well separable. The data have shown
pronounced linear separability, generally resulting in minor differences in
misclassification rate as the tuning parameters are modified. However, some
variations in the prediction results due to tuning have been observed, and
when computational time and requirements on computational power are also
considered, optimal settings of the tuning parameters could be determined for
each method. Comparing LR, ANN and SVM, using optimal tuning settings,
the results from testing have shown that there is no significant difference
between the performance of the three methods, and they all predict well. Nevertheless,
due to differences in complexity between the methods, we have concluded that
SVM is the least suitable method to use, whereas LR is the most suitable. However,
the ANN handles complex and non-linear data better than LR; therefore, for
future application of the model, where data might not have such a pronounced
linear separability, we find it suitable to consider ANN as well.
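
As a rough illustration of the evaluation criterion, the sketch below computes the misclassification rate and picks a tuning setting by lowest validation error, breaking ties by lowest computational cost. The tie-breaking rule, the setting names, and all labels, predictions and costs are made-up assumptions for illustration; the abstract does not specify how error and cost were weighed against each other.

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the true one."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    errors = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return errors / len(y_true)


def pick_setting(y_val, candidates):
    """Pick the setting with the lowest validation error, then the lowest cost.

    candidates maps a setting name to (predictions, cost), where cost is any
    comparable measure of computational effort, e.g. fitting time in seconds.
    """
    def key(name):
        preds, cost = candidates[name]
        return (misclassification_rate(y_val, preds), cost)

    return min(candidates, key=key)


# Made-up validation labels and predictions, for illustration only.
y_val = [1, 0, 0, 1, 0, 0, 0, 1]
candidates = {
    "setting_a": ([1, 0, 0, 1, 0, 0, 0, 1], 12.0),  # no errors, expensive
    "setting_b": ([1, 0, 0, 1, 0, 0, 0, 1], 3.0),   # no errors, cheap
    "setting_c": ([1, 0, 1, 1, 0, 0, 0, 0], 1.0),   # two errors, cheapest
}
best = pick_setting(y_val, candidates)
```

When the classes are well separated, several settings can tie on validation error, as `setting_a` and `setting_b` do here, and the cost then decides between them.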
This thesis has been written at Svenska Handelsbanken, one of the major
banks in Sweden, with offices all around the world. Their headquarters are
situated in Kungsträdgården, Stockholm. Computations have been performed
using SAS software, and data have been processed in an SQL relational database
management system.
The full report (pdf)