Linguistic and Political Differences in Large Online Social Media Discourse | Jamie Neubert Pedersen, Valentin Ibanez
| Abstract | Unsupervised and data-driven language understanding methods make it possible to quantify large amounts of unstructured natural language corpora. We collect and construct a dataset of 1848 Danish political actors along with their tweet histories, totalling 584,156 tweets. We find linguistic differences across political groups and also demonstrate the diachronic nature of language by showing how the meaning of words shifts over time. Our dataset is used to construct word embeddings of political, semantic and syntactic similarities, which we use in machine learning algorithms to classify the political partisanship of text. We demonstrate that combining context-specific semantic and syntactic information with class-representative word weights results in better accuracy when classifying the political affiliation of tweets from political actors. Through these analyses of the constructed dataset, we show the viability of using written text to find political patterns and linguistic differences. We demonstrate how these can be used for feature engineering in supervised machine learning algorithms and the possibility of classifying the political partisanship of text. |
| Type | Master's thesis [Academic thesis] |
| Year | 2018 |
| Publisher | Technical University of Denmark, Department of Applied Mathematics and Computer Science |
| Address | Richard Petersens Plads, Building 324, DK-2800 Kgs. Lyngby, Denmark, compute@compute.dtu.dk |
| Series | DTU Compute M.Sc.-2018 |
| Note | Supervised by Sune Lehmann, sljo@dtu.dk (Associate Professor, DTU, DTU Compute). |
| Electronic version(s) | [pdf] |
| Publication link | http://www.compute.dtu.dk/English.aspx |
| BibTeX data | [bibtex] |
| IMM Group(s) | Intelligent Signal Processing |