Linguistic and Political Differences in Large Online Social Media Discourse

Jamie Neubert Pedersen, Valentin Ibanez
Abstract	Unsupervised, data-driven language understanding methods make it possible to quantify large unstructured natural language corpora. We collect and construct a dataset of 1848 Danish political actors along with their tweet histories, totalling 584,156 tweets. We find linguistic differences across political groups and demonstrate the diachronic nature of language by showing how the meaning of words shifts over time. We use the dataset to construct word embeddings that capture political, semantic and syntactic similarities, and we use these embeddings in machine learning algorithms to classify the political partisanship of text. We demonstrate that combining context-specific semantic and syntactic information with class-representative word weights yields better accuracy when classifying the political affiliation of tweets from political actors. Through these analyses of the constructed dataset, we show the viability of using written text to find political patterns and linguistic differences. We also demonstrate how these patterns can be used for feature engineering in supervised machine learning algorithms, and that the political partisanship of text can be classified.
Type	Master's thesis [Academic thesis]
Year	2018
Publisher	Technical University of Denmark, Department of Applied Mathematics and Computer Science
Address	Richard Petersens Plads, Building 324, DK-2800 Kgs. Lyngby, Denmark, compute@compute.dtu.dk
Series	DTU Compute M.Sc.-2018
Note	Supervised by Sune Lehmann, sljo@dtu.dk (Associate Professor, DTU, DTU Compute).
Electronic version(s)	[pdf]
Publication link	http://www.compute.dtu.dk/English.aspx
BibTeX data	[bibtex]
IMM Group(s)	Intelligent Signal Processing