Intro
I worked with Max Gannett, a fellow Information Science master's student here at CU, on this project, and we wanted to see if we could predict how different localities (cities and states) lean politically based on the network of authorship between different subreddits.
We started by building off of a study by Waller and Anderson in which they identified political polarization levels across different communities (subreddits) on Reddit. We compiled a list of locality-based subreddits (e.g., /r/washingtondc) ourselves and took a list of political subreddits (e.g., /r/democrats) from the Waller and Anderson study, which were prelabeled with a political partisanship score. This score ranged from -1 to 1, with -1 representing an extreme liberal bias and 1 representing an extreme conservative bias. I then used PRAW (the Python Reddit API Wrapper) to scrape the most recent 150 posts from each of our selected subreddits, resulting in 46,797 posts total. After cleaning the data to omit null values and formatting it for network analysis, the data looked like this:
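For reference, here is a minimal sketch of that collection step, assuming hypothetical API credentials and an illustrative subset of our subreddit lists:

```python
import praw
import pandas as pd

# Hypothetical credentials; replace with your own Reddit API keys.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="locality-politics-scraper",
)

# Illustrative subset; our full lists covered many more subreddits.
subreddits = ["washingtondc", "denver", "democrats", "Conservative"]

rows = []
for name in subreddits:
    # .new(limit=150) yields the most recent 150 posts in the subreddit
    for post in reddit.subreddit(name).new(limit=150):
        rows.append({"author": str(post.author), "subreddit": name})

posts = pd.DataFrame(rows)
posts = posts[posts["author"] != "None"]  # drop deleted/null authors

# Collapse to a weighted edge list: one row per (author, subreddit) pair
edges = posts.groupby(["author", "subreddit"]).size().reset_index(name="count")
```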
In this edge list we can see the structure of a bipartite network: the author nodes form a distinct class from the subreddit nodes, and the count, the number of times an author posted to a subreddit, serves as the edge weight. Authors connect to subreddits by posting to them, but do not connect to other authors; likewise, subreddits do not connect to other subreddits. We therefore hypothesized that the best way to see which political subreddits had the most influence over locality-based subreddits would be to analyze how those subreddits were connected through shared authors. Below, we can see a visualization of this network structure created by Max Gannett:
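For concreteness, here is a minimal sketch of assembling that bipartite graph with networkx from the edge list above (the node attribute name is our own choice):

```python
import networkx as nx

G = nx.Graph()
# Mark the two node classes so the graph stays explicitly bipartite
G.add_nodes_from(edges["author"].unique(), bipartite="author")
G.add_nodes_from(edges["subreddit"].unique(), bipartite="subreddit")
# Each (author, subreddit, count) row becomes a weighted edge
G.add_weighted_edges_from(
    edges[["author", "subreddit", "count"]].itertuples(index=False)
)
```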
Feature Engineering
We then decided to use this data to try to predict the political leanings of these different localities using a variety of machine learning models. Transforming this network data into the feature matrix required for machine learning took several data transformations. I iterated through the nodes list data frame of locality-based subreddits, and within that loop iterated through our edge list data frame. For each edge, I checked whether the author had posted to the locality-based subreddit. If so, I then iterated through the list of political subreddits and checked whether the author had also posted to any of them. If they had, I added the edge weight (the number of times the author posted to the political subreddit) to our new feature matrix.

In this feature matrix, every observation/row is a locality-based subreddit and the features/columns represent the political subreddits. Each column holds the aforementioned sum of posts that users of the locality-based subreddit made to that political subreddit. In addition, we had a column with the “raw_partisan” score for each political subreddit, calculated in Waller and Anderson's study. I added 0.25 to each count to account for zero values, performed a log10 transformation on the count feature to normalize the data, and then multiplied the log10 count by the partisan score to produce a new influence metric. Finally, I dropped the raw partisan score feature, as it has no variance across localities. We gathered 2020 electoral results for each of these localities to serve as our labels, with 0 representing a Democratic victory and 1 representing a Republican victory, determined by each party's share of the total vote. Now the data was ready for machine learning.
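The nested iteration above can also be expressed as a vectorized pandas join, which is how I would sketch it here; `locality_subs`, `political_subs`, and the `partisan` mapping are assumed to come from our subreddit lists, and the partisan score is folded directly into the influence metric rather than stored as a separate column:

```python
import numpy as np
import pandas as pd

# edges: (author, subreddit, count) rows from the scraping step.
locality_edges = edges[edges["subreddit"].isin(locality_subs)]
political_edges = edges[edges["subreddit"].isin(political_subs)]

# Join on shared authors: for each locality, sum how often its authors
# also posted to each political subreddit.
shared = locality_edges.merge(
    political_edges, on="author", suffixes=("_locality", "_political")
)
features = shared.pivot_table(
    index="subreddit_locality",
    columns="subreddit_political",
    values="count_political",
    aggfunc="sum",
    fill_value=0,
).reindex(locality_subs, fill_value=0)

# Add 0.25 to handle zeros, log10-transform, then scale each column by
# that political subreddit's partisan score to get the influence metric.
features = np.log10(features + 0.25)
features = features.mul(features.columns.map(partisan), axis=1)
```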
Baseline Models
Now that we had our feature matrix prepared, it was time to conduct our machine learning for the purpose of predicting whether each locality leans Democratic or Republican. This was an imbalanced dataset: 74 Democratic localities to 26 Republican localities. In order to have a more balanced training set, I randomly sampled 80% of the Republican observations and matched them with the same number of Democratic observations, giving a training dataset of 21 Democratic and 21 Republican localities. The remaining observations formed our testing set, with 53 Democratic localities and 5 Republican localities. Given that 74% of the data was Democratic, I felt that sampling by class for the training data was the best way to give our model a real chance of predicting Republican observations.
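A sketch of that split, assuming a `labels` Series of 0/1 outcomes indexed by locality subreddit (the random seed is arbitrary):

```python
# 80% of the Republican localities, matched 1:1 with Democratic ones
rep = labels[labels == 1].sample(frac=0.8, random_state=42)    # 21 localities
dem = labels[labels == 0].sample(n=len(rep), random_state=42)  # 21 localities

train_idx = rep.index.union(dem.index)
test_idx = labels.index.difference(train_idx)

X_train, y_train = features.loc[train_idx], labels.loc[train_idx]
X_test, y_test = features.loc[test_idx], labels.loc[test_idx]
```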
With our train/test split finalized, we moved forward to creating a ‘dummy model’ for baseline performance. A dummy model simply predicts the most frequent class for every observation. Our dummy model predicted everything in the test set as Democratic, resulting in an accuracy of about 0.914 (53/58) but zero recall and zero precision on the Republican class, since it never predicted a single Republican locality.
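Because our training set is balanced, the majority class there is ambiguous, so a constant Democrat predictor reproduces the behavior described; a minimal sketch with scikit-learn:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Always predict class 0 (Democratic), mirroring the majority of the data
dummy = DummyClassifier(strategy="constant", constant=0)
dummy.fit(X_train, y_train)
pred = dummy.predict(X_test)

print(accuracy_score(y_test, pred))                    # ~0.914
print(precision_score(y_test, pred, zero_division=0))  # 0.0
print(recall_score(y_test, pred))                      # 0.0
```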
I then moved towards tuning and training a logistic regression model as another baseline to measure future models against. After tuning and training, we found an accuracy of about 0.086 and an identical precision of about 0.086, with a recall of 1.0; in other words, the model classified every locality in the test set as Republican. As expected, this model performed very poorly, indicating that this is not a simple problem that can be solved with logistic regression.
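A hedged sketch of that baseline; the grid shown here is illustrative, not the exact one we searched:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

pred = search.predict(X_test)
print(accuracy_score(y_test, pred))
print(precision_score(y_test, pred))
print(recall_score(y_test, pred))
```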
Random Forest Model
I then created a random forest model to try to predict the political affiliations of these localities with stronger performance. I chose a random forest because random forests train quickly and handle large numbers of features well. I tuned the model using a randomized search over the number of trees, the number of features to consider at each split, the maximum depth of the trees, the minimum number of samples required to split a node, the minimum number of samples required at each leaf node, the method of selecting samples for training each tree, and whether the class weights should be balanced. After this randomized search for the optimal hyperparameter values, we fit the random forest on our training dataset. On the test data, we found a mean absolute error of 0.36, an accuracy of about 0.638, a precision of 0.1, and a recall of 0.4. We then calculated Shapley values to conduct a feature importance analysis and understand our model's output. A visualization of the Shapley values can be viewed below, with Class 0 being Democrat and Class 1 being Republican:
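A sketch of the tuning and SHAP analysis; the search distributions below are illustrative stand-ins for the ones we actually used:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
import shap

param_dist = {
    "n_estimators": randint(100, 1000),      # number of trees
    "max_features": ["sqrt", "log2", None],  # features per split
    "max_depth": randint(3, 30),             # maximum tree depth
    "min_samples_split": randint(2, 10),     # samples to split a node
    "min_samples_leaf": randint(1, 5),       # samples per leaf
    "bootstrap": [True, False],              # sampling method per tree
    "class_weight": ["balanced", None],      # class weighting
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)
rf = search.best_estimator_

# Shapley values for per-feature attribution on the test set
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```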
Below, we can see how this random forest model performs at different classification thresholds:
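The sweep behind such a plot can be sketched by varying the probability cutoff for the Republican class, reusing the fitted `rf` from above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

proba = rf.predict_proba(X_test)[:, 1]  # P(Republican) per locality
for t in np.arange(0.1, 1.0, 0.1):
    pred = (proba >= t).astype(int)
    print(
        f"threshold={t:.1f}  "
        f"acc={accuracy_score(y_test, pred):.3f}  "
        f"prec={precision_score(y_test, pred, zero_division=0):.3f}  "
        f"rec={recall_score(y_test, pred):.3f}"
    )
```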
Gradient Boosted Trees Model
While the random forest model performed much better than the logistic regression model, we were still not satisfied with its performance, so we decided to tune and train an XGB model. XGB refers to XGBoost, a popular implementation of gradient-boosted trees. After performing a randomized search for the best hyperparameters, including the max depth, the learning rate, the number of estimators, the fraction of samples per tree, the alpha and lambda regularization levels, and the tree method, we fit our model on the training dataset. On the test dataset, we observed a mean absolute error of 0.29, an accuracy of about 0.707, a recall of 0.6, and a precision of about 0.167. This is better performance than our random forest model. We analyzed the feature importance of this model by calculating and visualizing the Shapley values on the test data, seen below:
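A sketch of the XGBoost tuning, again with illustrative search distributions; the SHAP analysis mirrors the random forest one above:

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_dist = {
    "max_depth": randint(3, 10),          # max tree depth
    "learning_rate": uniform(0.01, 0.3),  # eta
    "n_estimators": randint(100, 1000),   # number of boosting rounds
    "subsample": uniform(0.5, 0.5),       # fraction of samples per tree
    "reg_alpha": uniform(0.0, 1.0),       # L1 regularization (alpha)
    "reg_lambda": uniform(0.0, 2.0),      # L2 regularization (lambda)
    "tree_method": ["exact", "approx", "hist"],
}
search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)
xgb_model = search.best_estimator_
```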
As this was our best performing model, we analyzed how this model performed at different classification thresholds as visualized below:
Discussion
In predicting political leanings and electoral outcomes in different localities using a variety of machine learning models, including logistic regression, random forest, and gradient-boosted trees, we found that the gradient-boosted trees model had the strongest performance. Its results of 50% precision and over 90% accuracy at the 90% classification threshold are very promising, though it is clear that the model does not completely capture the political sentiments of these localities. There is certainly a liberal skew in the data we collected, and that likely limits how effective our models can be. While the network of connections between locality-based subreddits and political subreddits seems to be useful for predicting political affiliations, it does not provide a complete picture. To gain one and to make better predictions, other data sources or a larger sample may be needed.
In the future, we might switch from post counts to comment counts, perhaps weighted by the average comment score of each user divided by the average comment score of the subreddit. Since discussion subreddits are largely driven by comments, we think comment counts are a better indicator of political influence, and weighting by average user comment score over average subreddit comment score would capture the fact that users who earn more upvotes are more influential than those who earn few or whose comments often score negatively. This type of scoring could lead to stronger model performance. Additionally, our dataset was very sparse, so I believe that scraping more posts would also improve our models. However, transforming the network data from even our relatively small sample into a feature matrix took a very significant amount of time, so this form of analysis is computationally expensive.
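As a hypothetical sketch of that weighting (the function and its names are ours, not yet implemented):

```python
def comment_weight(user_avg_score: float, subreddit_avg_score: float) -> float:
    """Hypothetical influence weight: scale a user's comment counts by how
    their average comment score compares to the subreddit's average."""
    return user_avg_score / subreddit_avg_score

# e.g. a user averaging 12 points in a subreddit averaging 4 points would
# have their comment counts scaled by 3.0
```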
Code/Data Resources
GitHub link will be posted at a later date once the code has been cleaned and identifying information removed.