Linguistic and Political Differences in Large Online Social Media Discourse

Jamie Neubert Pedersen, Valentin Ibanez

Abstract: Unsupervised, data-driven language understanding methods make it possible to quantify large unstructured natural language corpora. We collect and construct a dataset of 1848 Danish political actors along with their tweet histories, totalling 584,156 tweets. We find linguistic differences across political groups and demonstrate the diachronic nature of language by showing how the meanings of words shift over time. Our dataset is used to construct word embeddings of political, semantic and syntactic similarities, which we use in machine learning algorithms to classify the political partisanship of text. We demonstrate that combining context-specific semantic and syntactic information with class-representative word weights results in better accuracy when classifying the political affiliation of tweets from political actors. Through these analyses of the constructed dataset, we show the viability of using written text to find political patterns and linguistic differences. We demonstrate how these can be used for feature engineering in supervised machine learning algorithms and the possibility of classifying the political partisanship of text.
Type: Master's thesis [Academic thesis]
Year: 2018
Publisher: Technical University of Denmark, Department of Applied Mathematics and Computer Science
Address: Richard Petersens Plads, Building 324, DK-2800 Kgs. Lyngby, Denmark, compute@compute.dtu.dk
Series: DTU Compute M.Sc.-2018
Note: Supervised by Sune Lehmann, sljo@dtu.dk (Associate Professor, DTU, DTU Compute).
Publication link: http://www.compute.dtu.dk/English.aspx
IMM Group(s): Intelligent Signal Processing