Detecting trolls on Reddit: Introduction to Computational Text Analysis and Supervised Machine Learning in R

Presented by Vlad Achimescu, University of Mannheim

Computational propaganda is becoming a non-negligible presence on news forums and social media, and it is crucial to be able to distinguish real users from social bots and trolls. Following Twitter, Reddit released a list of accounts suspected of being state-sponsored trolls, users who wrote more than 15,000 posts and comments between 2015 and 2018. How precisely can these posts be detected based on their content and the available metadata, and what techniques can be used to achieve maximum accuracy?

Event details

This workshop offers an introduction to supervised machine learning techniques for classification in R. In the first part of the workshop, we will learn to work with text data by pre-processing the corpus of Reddit submissions step by step: tokenization, stop word removal, stemming, trimming and weighting. The result is a document-term matrix that serves as input for the second part.
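The pre-processing steps listed above could look roughly like the following in quanteda. This is a minimal sketch, not the workshop's actual script: the four toy texts are invented stand-ins for the Reddit submissions, and the trimming threshold (`min_docfreq = 2`) and tf-idf weighting are assumptions chosen for illustration.

```r
library(quanteda)

# Hypothetical toy texts standing in for the Reddit submissions
texts <- c(
  d1 = "The election was rigged, share this now!",
  d2 = "Look at my cat sleeping on the sofa",
  d3 = "Share this: the media is lying about the election",
  d4 = "My cat and my dog are sleeping"
)
corp <- corpus(texts)

# Tokenization: split into words, dropping punctuation and numbers
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)

# Stop word removal: drop common English function words
toks <- tokens_remove(toks, stopwords("en"))

# Stemming: reduce words to their stems (e.g. "sleeping" becomes "sleep")
toks <- tokens_wordstem(toks, language = "english")

# Build the document-term matrix (a "dfm" in quanteda's terminology)
dtm <- dfm(toks)

# Trimming: keep only terms that occur in at least two documents
# (the threshold is an assumption for this toy example)
dtm <- dfm_trim(dtm, min_docfreq = 2)

# Weighting: tf-idf is one common choice, assumed here
dtm_weighted <- dfm_tfidf(dtm)

print(dtm_weighted)
```

On a real corpus the trimming threshold would normally be much higher, since rare terms add many columns but carry little signal for a classifier.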

In the second part of the workshop, two machine learning methods are introduced: regularized logistic regression and random forests. All the basic steps of supervised machine learning will be covered: feature selection, model specification, training, tuning, cross-validation and evaluation with different performance measures, all applied to the dataset of Reddit submissions. Feature importance and partial dependence plots will be examined to find the variables or keywords most associated with political trolling.
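As a rough illustration of the two methods named above, the sketch below fits a lasso-penalized logistic regression with glmnet and a random forest with randomForest. The data are simulated, not the Reddit dataset: the feature matrix, the labels ("troll" vs. "regular"), the 150/50 split and all tuning values are assumptions made only so that the example runs on its own.

```r
library(glmnet)
library(randomForest)

set.seed(42)

# Simulated stand-in for a document-term matrix:
# 200 documents, 20 term-count features (all hypothetical)
n <- 200
p <- 20
x <- matrix(rpois(n * p, lambda = 1), nrow = n,
            dimnames = list(NULL, paste0("term", seq_len(p))))

# Hypothetical labels that depend only on term1 and term2, plus noise
y <- factor(ifelse(x[, "term1"] - x[, "term2"] + rnorm(n) > 0,
                   "troll", "regular"))

# Hold out 50 documents as a test set
train <- sample(n, 150)

# Regularized logistic regression: alpha = 1 is the lasso penalty,
# and lambda is tuned by 5-fold cross-validation
fit_glm <- cv.glmnet(x[train, ], y[train], family = "binomial",
                     alpha = 1, nfolds = 5)
pred_glm <- as.vector(predict(fit_glm, newx = x[-train, ],
                              s = "lambda.min", type = "class"))

# Random forest with permutation-based importance enabled
fit_rf <- randomForest(x = x[train, ], y = y[train],
                       ntree = 200, importance = TRUE)
pred_rf <- predict(fit_rf, newdata = x[-train, ])

# Test-set accuracy of each model
acc_glm <- mean(pred_glm == as.character(y[-train]))
acc_rf  <- mean(pred_rf == y[-train])
print(c(glmnet = acc_glm, randomForest = acc_rf))

# Which features matter: non-zero lasso coefficients and forest importance
print(coef(fit_glm, s = "lambda.min"))
print(importance(fit_rf))
```

In this simulation only `term1` and `term2` carry signal, so they would be expected to stand out in both the coefficient table and the importance scores. `randomForest::partialPlot()` can then be used to draw a partial dependence plot for a single feature.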

By the end of the workshop, attendees should be able to run their own models on a training set and check the performance of their predictions on a test set.
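Checking performance on a test set usually means comparing predicted and true labels with more than one measure. The sketch below uses base R only; the ten labels and predictions are made up for illustration, and treating "troll" as the positive class is an assumption.

```r
# Hypothetical true labels and model predictions for ten test documents
lv <- c("regular", "troll")
truth <- factor(c("troll", "troll", "troll", "regular", "regular",
                  "regular", "regular", "troll", "regular", "regular"),
                levels = lv)
pred <- factor(c("troll", "troll", "regular", "regular", "regular",
                 "troll", "regular", "troll", "regular", "regular"),
               levels = lv)

# Confusion matrix: rows are predictions, columns are true labels
cm <- table(predicted = pred, actual = truth)
print(cm)

# Counts with "troll" treated as the positive class
tp <- cm["troll", "troll"]     # trolls correctly flagged
fp <- cm["troll", "regular"]   # regular users wrongly flagged
fn <- cm["regular", "troll"]   # trolls that were missed

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1))
```

Accuracy alone can be misleading when trolls are a small share of all posts, which is why precision and recall are reported alongside it.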

Prerequisites: basics of R and logistic regression

R packages required: quanteda, caret, glmnet, randomForest, dplyr

Datasets will be provided by the instructor.


Clayden Computational Lab