Seminars in Mathematical Statistics

KTH Mathematics

Mathematical Statistics

Time: 1 June 2018, 9:50-10:25.

Location: Seminar room F11, KTH, Lindstedtsvägen 22.

Speaker: Henrik Sjökvist (Master's thesis)

Title: Text feature mining using pre-trained word embeddings

Abstract: This thesis explores a machine learning task in which the data contains not only numerical features but also free-text features. To employ a supervised classifier and make predictions, the free-text features must be converted into numerical features. In this thesis, an algorithm is developed to perform that conversion. The algorithm uses a pre-trained word embedding model which maps each word to a vector. The vectors of the words belonging to the same sentence are then combined to form a single sentence embedding. The sentence embeddings for the whole dataset are clustered to identify distinct groups of free-text strings. The cluster labels are output as the numerical features. The algorithm is applied to a specific case concerning operational risk control in banking. The data consists of modifications made to trades in financial instruments. Each such modification comes with a trader comment, a short text string which documents the modification. Converting these strings to numerical trader comment features is the objective of the case study. A classifier is trained and used as an evaluation tool for the trader comment features. The performance of the classifier is measured with and without the trader comment feature. Multiple models for generating the features are evaluated. All models lead to an improvement in classification rate over not using a trader comment feature. The best performance is achieved with a model where the sentence embeddings are generated using the SIF weighting scheme and then clustered using the DBSCAN algorithm.
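The pipeline described in the abstract (word vectors, then a sentence embedding, then clustering, then cluster labels as features) can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation. It rests on several assumptions: the two-dimensional vectors in `EMBEDDINGS` are made-up stand-ins for a real pre-trained model, the example comments are invented, SIF is taken to mean smooth inverse frequency weighting with weights a / (a + p(w)), word frequencies are estimated from the toy comments themselves, and the common-component removal step of full SIF is left out. The names `sif_embed`, `normalise` and `dbscan` are hypothetical helpers written in plain Python so that the sketch is self-contained.

```python
import math
from collections import Counter

# Made-up 2-d vectors standing in for a real pre-trained embedding model.
EMBEDDINGS = {
    "price": (1.0, 0.1), "amended": (0.9, 0.2), "updated": (0.8, 0.0),
    "counterparty": (0.0, 1.0), "changed": (0.1, 0.9),
    "booking": (-1.0, -1.0), "error": (-0.9, -1.1),
}

def normalise(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def sif_embed(sentences, embeddings, a=1e-3):
    """SIF-style weighted average of word vectors, weight = a / (a + p(w)).
    Assumption: p(w) is estimated from the given sentences, and the
    common-component removal of full SIF is omitted."""
    tokens = [s.lower().split() for s in sentences]
    counts = Counter(w for t in tokens for w in t)
    total = sum(counts.values())
    dim = len(next(iter(embeddings.values())))
    result = []
    for t in tokens:
        known = [w for w in t if w in embeddings]
        vec = [0.0] * dim
        for w in known:
            weight = a / (a + counts[w] / total)
            for i, x in enumerate(embeddings[w]):
                vec[i] += weight * x
        n = max(len(known), 1)
        result.append(normalise([v / n for v in vec]))
    return result

def dbscan(points, eps, min_pts):
    """Plain DBSCAN with Euclidean distance; label -1 marks noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return labels

# Invented trader comments, for illustration only.
comments = ["price amended", "price updated", "amended price",
            "counterparty changed", "changed counterparty", "booking error"]
vectors = sif_embed(comments, EMBEDDINGS)
labels = dbscan(vectors, eps=0.3, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

In this toy run the three price-related comments share one label, the two counterparty comments share another, and the isolated comment is marked as noise. The resulting label list could then be passed to a classifier as a categorical feature, which is the role the abstract assigns to the trader comment feature.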

The full report (pdf)

To the list of seminars

Page maintainer: Filip Lindskog
Updated: 25/02-2009