Jay Ghosh

Text Mining Project

Introduction:

The divide between political ideologies has been a recurring issue throughout history. However, with the rise of social media and online communications, this polarization has become more apparent and contentious. In recent years, there has been a growing concern about the negative and divisive nature of online communications between people with opposing political views, and it has sparked debates about whether it is the liberal or conservative ideologies that are more negative and divisive in online discussions.

To fully comprehend the issue at hand, it is necessary to understand the role of social media in shaping public opinion and discourse. Social media platforms have become a major source of information for people, and they provide individuals with a platform to voice their opinions and engage in discussions with others. Unfortunately, discussions on these platforms can often devolve into highly negative rhetoric. Studies suggest that people are more likely to engage in negative and divisive conversations on social media when they are surrounded by like-minded individuals in these online spaces.

In this text-mining project, we aim to analyze the content of political subreddits on Reddit to examine whether liberals or conservatives are more negative online. We will employ natural language processing techniques and machine learning algorithms to uncover patterns and trends in conversations. The project's goal is to provide valuable insights into the current state of online conversations and the role of social media in the increasing political polarization of our society. Through the project's findings, we hope to gain a deeper understanding of how different political ideologies perceive each other online and, ultimately, to take steps towards reducing online negativity and division in society.

In addition to examining the negativity and divisiveness of political discussions, this text-mining project will also seek to explore the language and themes used by individuals with different political ideologies. By analyzing the content of political subreddits on Reddit, we can gain a better understanding of the unique characteristics and perspectives associated with liberal and conservative ideologies, as well as the ways in which they interact with and respond to each other online. Through this analysis, we hope to shed light on the underlying causes of political polarization on social media and contribute to efforts to promote more productive and respectful discussions online.

One crucial aspect to consider in this text-mining project is the potential impact of online polarization on the broader society. Online polarization can have far-reaching consequences, including decreased social cohesion, increased political polarization, and reduced democratic participation. Therefore, it is crucial to analyze the language and themes used in political discussions and explore the potential real-world effects of these online interactions. This project will uncover insights that can inform policy decisions and strategies for promoting more constructive and inclusive online conversations. By identifying the factors contributing to online polarization, we can take steps toward mitigating its adverse effects and fostering a more united and informed society.

Data:

Data Scraping:

Here is the function that was used to filter out any posts that were deleted, removed, blank, or consisted only of images or links, so that only textual data was analyzed.

My filtering function
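A minimal sketch of a filter of this kind is given here for illustration. It assumes that each post body is available as a string and that deleted or removed posts carry the placeholder bodies `[deleted]` or `[removed]`. The name `is_textual_post` and the regular expressions are hypothetical and are not taken from the original function.

```python
import re

# Placeholder bodies assumed to mark deleted or removed posts.
REMOVED_MARKERS = {"[deleted]", "[removed]"}
# Markdown links and images, e.g. [text](url) or ![alt](url).
MD_LINK_PATTERN = re.compile(r"!?\[[^\]]*\]\([^)]*\)")
# Bare URLs.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def is_textual_post(selftext):
    """Return True when a post body still contains words after links are stripped."""
    if selftext is None:
        return False
    body = selftext.strip()
    if not body or body.lower() in REMOVED_MARKERS:
        return False
    # Remove markdown links/images first, then any bare URLs that remain.
    remainder = MD_LINK_PATTERN.sub(" ", body)
    remainder = URL_PATTERN.sub(" ", remainder)
    # A post counts as textual only if some letters are left.
    return re.search(r"[A-Za-z]", remainder) is not None


sample = ["I think the bill will pass.", "[removed]", "https://example.com/a.png"]
kept = [post for post in sample if is_textual_post(post)]
```

With the three sample bodies above, only the first one is kept.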

Then the following dictionary of explicitly political subreddits was created, where each key is a unique subreddit and each value represents that subreddit's self-declared political affiliation. A value of 0 means that the subreddit is liberal and a value of 1 means that the subreddit is conservative.

{'conservatives': 1,
 'Liberal': 0,
 'Democrat': 0,
 'democrats': 0,
 'Republican': 1,
 'trump': 1,
 'hillaryclinton': 0,
 'EnoughLibertarianSpam': 0,
 'progressive': 0,
 'Libertarian': 1,
 'SandersForPresident': 0,
 'neoliberal': 0,
 'EnoughTrumpSpam': 0,
 'Enough_Sanders_Spam': 0,
 'Fuckthealtright': 0,
 'ShitLiberalsSay': 0,
 'BlueMidterm2018': 0,
 'tucker_carlson': 1,
 'Impeach_Trump': 0,
 'HillaryMeltdown': 1,
 'The_Congress': 1,
 'HillaryForAmerica': 0,
 'BernTheConvention': 0}

Then I used the Pushshift.io API to scrape just the posts from these subreddits between 1/1/2014 and 11/1/2022.

How I used PMAW to scrape data from Reddit
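As a rough sketch of how a scrape over this date range could be set up with PMAW, the block below converts the two boundary dates to Unix timestamps and wraps the request in a function. The helper names `to_epoch` and `scrape_subreddit` are hypothetical, and treating the dates as UTC midnight is an assumption. The call to `search_submissions` with `subreddit`, `after`, `before`, and `limit` follows PMAW's documented usage, but parameter names have varied across Pushshift API versions, so treat them as assumptions too.

```python
from datetime import datetime, timezone


def to_epoch(year, month, day):
    """Convert a UTC calendar date to a Unix timestamp in seconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


# Study window from the text: 1/1/2014 to 11/1/2022 (UTC midnight assumed).
START = to_epoch(2014, 1, 1)
END = to_epoch(2022, 11, 1)


def scrape_subreddit(subreddit, limit=None):
    """Fetch submissions for one subreddit in the study window (needs network access)."""
    # Imported lazily so the date helpers above work without pmaw installed.
    from pmaw import PushshiftAPI

    api = PushshiftAPI()
    posts = api.search_submissions(
        subreddit=subreddit, after=START, before=END, limit=limit
    )
    return list(posts)
```

The function would be called once per key of the subreddit dictionary, and the results concatenated before filtering.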

After the scrape finished, there were 124,757 posts, with roughly two thirds of the posts belonging to liberal subreddits. Then the following function was used to remove stop words and to stem and lemmatize the words.

Original data from the scrape

Cleaning function
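The sketch below illustrates the stop word removal and stemming steps, assuming NLTK is installed. It uses a tiny stand-in stop word set rather than the full NLTK English list and leaves out lemmatization, since both require downloading NLTK corpora. The name `clean_post` is hypothetical.

```python
import re

from nltk.stem import PorterStemmer

# Tiny stand-in stop word set; the project used the full NLTK English list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}
STEMMER = PorterStemmer()


def clean_post(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize on letters, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [STEMMER.stem(tok) for tok in tokens if tok not in stop_words]


tokens = clean_post("The senators are voting on the bills")
```

Stemming collapses inflected forms such as "bills" and "bill" into one token, which keeps the vocabulary smaller before vectorization.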

Then the posts were fed into CountVectorizer with max features of 5,000 words to create a bag of words dataframe as seen below. The posts were sampled down to 2,000 rows for this assignment to decrease the computational strain of this operation.

Then the posts were fed into the TfidfVectorizer with max features of 5,000 to create a TF-IDF dataframe of these posts.

Here is a word cloud of a sample of this data:

Clustering

For analysis, TF-IDF was used to normalize the word counts of the bag-of-words representation. This TF-IDF matrix was used for K-Means and Hierarchical Clustering.

TF-IDF DataFrame

K-Means

Multiple K-Means models were created to test K values from 2 to 10.
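A sweep of this kind can be sketched as below. The data are synthetic stand-ins for the TF-IDF matrix, and the `n_init` and `random_state` settings are assumptions; the within-cluster sum of squares (`inertia_`) recorded for each K is the quantity usually drawn in a knee plot.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the TF-IDF matrix: two well-separated blobs.
X = np.vstack([
    rng.normal(0.0, 0.1, (60, 20)),
    rng.normal(1.0, 0.1, (60, 20)),
])

# Fit one model per K and record its within-cluster sum of squares.
inertias = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
```

Plotting `inertias` against K and looking for the point where the curve flattens gives the knee.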

Knee Plot for K-Means models

This shows that two is the optimal number of clusters. The following plot shows the TF-IDF vectors reduced to just 3 dimensions using Principal Component Analysis (PCA). Each point represents a post, with the shape of the point representing its cluster and its color indicating whether the post belongs to a liberal or conservative subreddit: orange for a conservative subreddit and blue for a liberal subreddit. The x-axis shows the first principal component and the y-axis shows the second. Finally, the points are sized by the third principal component.

K-Means Clustering with K=2
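The reduction used for this plot can be sketched as follows, with random data standing in for the dense TF-IDF rows. Rescaling the third component to a positive marker-size range is an assumption about how the point sizes were produced.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((100, 50))  # stand-in for dense TF-IDF rows

# Project every post onto the first three principal components.
pca = PCA(n_components=3, random_state=0)
coords = pca.fit_transform(X)
x, y, third = coords[:, 0], coords[:, 1], coords[:, 2]

# Marker sizes must be positive, so rescale the third component to [20, 200].
sizes = 20 + 180 * (third - third.min()) / (third.max() - third.min())
```

The arrays `x`, `y`, and `sizes` can then be passed to a scatter plot, with marker shape taken from the cluster label and color from the subreddit label.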

Hierarchical Clustering

Hierarchical Clustering was also performed. However, since the word count in this data is extremely high, the interpretability of the hierarchical clustering is severely diminished.

The following shows a dendrogram of the hierarchical clustering in R.

Hierarchical Clustering Dendrogram
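The dendrogram above was produced in R. For consistency with the other examples, here is a rough Python equivalent using SciPy on synthetic data. The Ward linkage and the cut into two clusters are assumptions, not settings taken from the original analysis.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
# Synthetic stand-in for TF-IDF rows: two groups of ten posts each.
X = np.vstack([
    rng.normal(0.0, 0.1, (10, 8)),
    rng.normal(1.0, 0.1, (10, 8)),
])

# Build the merge tree bottom-up; each row of Z records one merge.
Z = linkage(X, method="ward")

# Cut the tree into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")

# Compute the dendrogram layout without drawing it.
tree = dendrogram(Z, no_plot=True)
```

With thousands of posts and thousands of terms, the leaves of such a tree become too dense to read, which matches the interpretability problem noted above.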

K-Modes

In addition to a TF-IDF vector, several sentiment analysis features were extracted, those being: VADER, LIWC, and Syuzhet. VADER provides a base positive-neutral-negative sentiment for each post. Syuzhet provides several emotional sentiment metrics. LIWC provides many different linguistic features for each post. These features were then converted into categorical data for use in K-Modes with a K=2.
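How the numeric sentiment features were converted to categories is not stated, so the sketch below assumes quantile binning into three levels per feature. It also shows the simple matching dissimilarity that K-Modes uses in place of Euclidean distance. The function names and the random feature matrix are hypothetical.

```python
import numpy as np


def to_categories(features, n_bins=3):
    """Bin each numeric column into quantile-based categories 0..n_bins-1."""
    cats = np.empty(features.shape, dtype=int)
    levels = np.linspace(0, 1, n_bins + 1)[1:-1]  # interior quantile levels
    for j in range(features.shape[1]):
        edges = np.quantile(features[:, j], levels)
        cats[:, j] = np.digitize(features[:, j], edges)
    return cats


def matching_dissimilarity(a, b):
    """Count the positions where two categorical rows differ (K-Modes distance)."""
    return int(np.sum(a != b))


rng = np.random.default_rng(0)
features = rng.random((50, 4))  # stand-in for VADER/Syuzhet/LIWC scores
cats = to_categories(features)
distance = matching_dissimilarity(cats[0], cats[1])
```

K-Modes then assigns each post to the cluster whose mode (the most frequent category in each column) has the smallest matching dissimilarity to it.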

The following plot shows the sentiment analysis features reduced to just 3 dimensions using Principal Component Analysis (PCA). Each point represents a post, with the shape of the point representing its cluster and its color indicating whether the post belongs to a liberal or conservative subreddit: orange for a conservative subreddit and blue for a liberal subreddit. The x-axis shows the first principal component and the y-axis shows the second. Finally, the points are sized by the third principal component.

K-Modes Clustering K=2

LDA

LDA was also performed on the TF-IDF vector of these posts. Four topics were selected and can be viewed in the following plots:

LDA Topics

Topic 0 likely indicates posts that deal with electoral politics. Topic 1 seems to revolve around subreddit moderation threads. Topic 2 very likely signals economic policy discussions. Topic 3 seems to focus on the media.

Association Rule Mining

Association Rule Mining (ARM) was also performed using the TF-IDF. A minimum confidence and support threshold of 0.07 was set as the parameters for ARM. The following shows an interactive network visualization of the top 50 rules.

This data was cleaned to remove stop words using the NLTK English stop words list; however, as one can observe, more stop word cleaning needs to be performed.
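To make the support and confidence thresholds concrete, the following pure-Python sketch mines rules with a single word on each side from a handful of made-up posts. It is a simplified illustration under those assumptions, not the tool used for the network visualization; the function name `mine_pair_rules` is hypothetical.

```python
from itertools import combinations


def mine_pair_rules(transactions, min_support=0.07, min_confidence=0.07):
    """Return (lhs, rhs, support, confidence, lift) rules between single words."""
    n = len(transactions)
    item_count, pair_count = {}, {}
    for transaction in transactions:
        items = set(transaction)
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(sorted(items), 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # share of posts containing both words
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                lift = confidence / (item_count[rhs] / n)
                rules.append((lhs, rhs, support, confidence, lift))
    return sorted(rules, key=lambda rule: rule[4], reverse=True)


posts = [
    {"trump", "election", "vote"},
    {"trump", "election"},
    {"tax", "economy"},
    {"tax", "economy", "vote"},
]
rules = mine_pair_rules(posts)
```

In this toy data the rule trump -> election has support 0.5, confidence 1.0, and lift 2.0. Frequent stop words would dominate such rules in the same way, which is why further stop word cleaning matters.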

Clustering Conclusions

Because these data are so high-dimensional, given the size of the vocabulary and the sheer number of analyzed posts, it is challenging to interpret much meaning from the traditional clustering methods: K-Means, K-Modes, and Hierarchical Clustering. Perhaps a supervised learning algorithm would work better since this is labeled data. However, the LDA analysis revealed several interpretable topics within these Reddit posts. The association rule mining shows promise as a way to analyze how different words and perhaps even themes are associated with political speech. Still, much work on filtering out more stop words needs to be done.


Naive Bayes

Naive Bayes (NB) is a classification algorithm used in machine learning to predict the likelihood of a sample belonging to a particular class. It is based on Bayes' theorem, which describes the probability of an event given prior knowledge or evidence. NB can be used for text classification tasks, where the goal is to predict the category or class of a given text document. In this project, we plan to apply NB to our text data consisting of Reddit posts from liberal and conservative subreddits to classify the posts as liberal or conservative. This will help us gain insights into the language and themes used by individuals with different political ideologies and provide valuable information about the nature of online political discourse.

The code used to perform NB on our dataset was written in Python using the scikit-learn library. To use the data with the NB algorithm, we first transformed it into a TF-IDF (Term Frequency-Inverse Document Frequency) matrix. Each entry weights the frequency of a word in a document by how rare that word is across the entire corpus. We then applied NB to the transformed data to train and evaluate our model. The images below show our NB model's confusion matrices and accuracy. The training accuracy of the model was 70%, and the testing accuracy was 66%.

Our results showed that NB could classify Reddit posts from liberal and conservative subreddits with moderate accuracy. The model correctly identified most posts from each group of subreddits but misclassified many posts. This suggests that political discourse on Reddit is complex and cannot be accurately classified based solely on the language and themes used in the posts. Furthermore, the misclassifications indicate some overlap in the language and themes used by individuals with different political ideologies. Overall, our results provide valuable insights into the nature of online political discourse and the challenges of using text mining to analyze it.
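The TF-IDF plus Naive Bayes workflow described in this section can be sketched with scikit-learn as follows. The twelve short posts, the choice of `MultinomialNB`, and the split settings are assumptions for illustration; labels follow the 0 = liberal, 1 = conservative convention used earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented example posts; 0 = liberal subreddit, 1 = conservative subreddit.
posts = [
    "expand healthcare coverage for working families",
    "raise the minimum wage and protect unions",
    "climate action and clean energy jobs now",
    "protect voting rights and expand healthcare",
    "clean energy investment creates union jobs",
    "raise wages and protect voting rights",
    "cut taxes and reduce government spending",
    "secure the border and support law enforcement",
    "defend gun rights and lower taxes",
    "reduce spending and secure the border",
    "support law enforcement and defend gun rights",
    "lower taxes and limit government regulation",
]
labels = [0] * 6 + [1] * 6

X_train, X_test, y_train, y_test = train_test_split(
    posts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Vectorize and classify in one pipeline so the vocabulary is learned on training data only.
model = make_pipeline(TfidfVectorizer(max_features=5000), MultinomialNB())
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_pred = model.predict(X_test)
test_acc = accuracy_score(y_test, test_pred)
cm = confusion_matrix(y_test, test_pred, labels=[0, 1])
```

Comparing `train_acc` with `test_acc` is the same check as the 70% versus 66% comparison reported above, and `cm` holds the counts shown in a confusion matrix plot.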

Naive Bayes Training Confusion Matrix

Naive Bayes Validation Confusion Matrix

Naive Bayes Training Classification Thresholds

Naive Bayes Validation Classification Thresholds

Decision Trees (Random Forest)

Random Forest is a popular supervised learning algorithm, built on decision trees, for classification and regression tasks. It predicts the target variable by combining the predictions of many decision trees, each built from the dataset's features. A decision tree is made by recursively splitting the data on the feature that provides the most information gain. The goal is to create a collection of trees that together separate the data in a way that leads to the highest prediction accuracy.
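Information gain for a single candidate split can be worked through in a few lines. In this sketch the feature is whether a given word appears in a post; the labels and feature values are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(labels, feature_present):
    """Entropy reduction from splitting on a boolean feature."""
    left = [y for y, f in zip(labels, feature_present) if f]
    right = [y for y, f in zip(labels, feature_present) if not f]
    n = len(labels)
    # Weighted average entropy of the two child nodes.
    children = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - children


labels = [1, 1, 0, 0]  # 1 = conservative, 0 = liberal
perfect_word = [True, True, False, False]  # appears only in conservative posts
useless_word = [True, False, True, False]  # appears equally in both classes

gain_perfect = information_gain(labels, perfect_word)
gain_useless = information_gain(labels, useless_word)
```

The word that separates the classes completely has a gain of 1 bit, while the word that appears equally in both classes has a gain of 0, so a tree would split on the first word.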

In this project, we plan to use decision trees to classify whether a post belongs to a liberal or conservative subreddit. This will help us understand the distinguishing characteristics of posts from each political ideology and how these characteristics influence the classification. We expect that decision trees will provide a more accurate classification than our previous model, Naive Bayes.

We will use the same dataset as our previous model, including Reddit posts from liberal and conservative subreddits. We will use Python and the scikit-learn library to build our random forest model. To train the model, we will preprocess the text data using a TF-IDF vectorizer and fit the model on the resulting matrix.

Our random forest model achieved a training accuracy of 97% and a validation accuracy of 73%. This is an improvement over our Naive Bayes model's 66% testing accuracy, although the gap between training and validation accuracy suggests that the forest overfits the training data. The confusion matrix shows that the model accurately predicts liberal posts but struggles to classify conservative posts. This may be due to the nature of the subreddit data, as conservative subreddits tend to have a broader range of topics. We have included three trees to visualize the model's decision-making process. The trees show the most important features for classifying posts, which include words like "Trump" and "Democrats."

Our decision tree model has provided valuable insights into the characteristics of posts from liberal and conservative subreddits. We have learned that certain words and phrases play a significant role in the classification of posts. The model struggles with classifying conservative posts due to the broader range of topics in conservative subreddits. The decision tree model has shown promise in classifying posts by political affiliation and may be helpful in further understanding the polarization of online political discussions.

Random Forest Training Set Confusion Matrix

Random Forest Validation Set Confusion Matrix

Random Forest Training Classification Thresholds

Random Forest Validation Classification Thresholds

SVM

Support Vector Machines (SVMs) are a powerful family of machine learning algorithms used for classification and regression analysis. SVMs work by finding the hyperplane that maximizes the margin between the two classes in a dataset. In my text mining project, I plan to use SVMs to classify Reddit posts as belonging to either liberal or conservative subreddits.

For my project, I used a dataset of Reddit posts that were labeled as belonging to either a liberal or a conservative subreddit. I split my data into a training set and a testing set, where the training set consisted of 1,000 randomly sampled observations and the testing set consisted of 1,000 observations. I chose to downsample my dataset to reduce the computational cost of training my SVM model. SVMs require labeled numeric data, so I converted my text data into TF-IDF features.

The SVM model achieved a training accuracy of 0.897 and a testing accuracy of 0.674. The gap between the two suggests that the model overfits the training sample, and the testing accuracy is comparable to that of the Naive Bayes model. Through this project, I learned that SVMs can be a useful tool for text classification tasks. Choosing an appropriate kernel and tuning the cost parameter are the main ways to improve accuracy in classifying Reddit posts as belonging to liberal or conservative subreddits. This can have applications in various fields such as social media analysis, political analysis, and market research.
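Kernel and cost-parameter tuning can be sketched with a small grid search. The candidate kernels, the `C` values, the three-fold cross-validation, and the toy posts are all assumptions for illustration and do not reflect the settings used for the reported accuracies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented example posts; 0 = liberal subreddit, 1 = conservative subreddit.
posts = [
    "expand healthcare coverage for working families",
    "raise the minimum wage and protect unions",
    "climate action and clean energy jobs now",
    "protect voting rights and expand healthcare",
    "clean energy investment creates union jobs",
    "raise wages and protect voting rights",
    "cut taxes and reduce government spending",
    "secure the border and support law enforcement",
    "defend gun rights and lower taxes",
    "reduce spending and secure the border",
    "support law enforcement and defend gun rights",
    "lower taxes and limit government regulation",
]
labels = [0] * 6 + [1] * 6

pipeline = make_pipeline(TfidfVectorizer(max_features=5000), SVC())

# Try each kernel with each cost value and keep the best cross-validated pair.
param_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(posts, labels)

best_params = search.best_params_
predictions = search.predict(posts)
```

A larger `C` penalizes training errors more heavily, which tends to raise training accuracy and can widen the gap to testing accuracy, so it is worth tuning on held-out folds as done here.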

Confusion Matrix of the SVM Training Data

Confusion Matrix of the SVM Validation Data

Neural Networks

Neural networks (NNs) are a type of machine learning algorithm that is modeled after the structure and function of the human brain. They consist of interconnected nodes, or neurons, that process and transmit information. NNs are designed to learn and make predictions by adjusting the strengths of the connections between the neurons based on patterns in the input data. For example, an image recognition NN might be trained on a labeled image dataset to recognize patterns and features in new, unlabeled images. This process involves feeding the NN many images during training, adjusting the strengths of the connections between the neurons, and then testing the NN's ability to classify new images correctly that it has not seen before. NNs have been successfully applied to various tasks, including image recognition, natural language processing, and even playing games like chess and Go.

Here is an example image of a simple neural network with three input nodes, one hidden layer with four neurons, and two output nodes:
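The same small network can also be written out as a forward pass in NumPy. The weights below are random and untrained, and the ReLU and softmax activations are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Connection strengths: 3 inputs -> 4 hidden neurons -> 2 outputs.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)


def forward(x):
    """Push one input vector through the network and return class probabilities."""
    hidden = np.maximum(0, x @ W1 + b1)  # ReLU activation in the hidden layer
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())  # softmax, shifted for numerical stability
    return exp / exp.sum()


probs = forward(np.array([0.5, -1.0, 2.0]))
```

Training would adjust `W1`, `b1`, `W2`, and `b2` so that the output probabilities match the labels in the training data.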

A total of 124,757 Reddit posts were scraped. These were split into train, validation, and test sets, with the training set sampled so that it would be balanced. This resulted in a 52:24:24 train:validation:test ratio. A series of neural network architectures was trained on text data from numerous politically affiliated subreddits. Feature extraction involved applying the sentiment analysis tools VADER, LIWC, and Syuzhet to the text to examine the base polarity of posts, any emotional sentiment, and key linguistic features. In addition to sentiment analysis, we used GloVe pre-trained word vectors with 100 dimensions on the cleaned text. This extends existing solutions by combining sentiment analysis features and word embeddings for classification, similar to a visual question answering model, which combines two different architectures. Here, those architectures are a dense neural network trained on the sentiment analysis metrics and a recurrent neural network trained on the GloVe word embeddings. Once both networks were created, they were merged to produce the final neural network. The results of this experiment are visualized below.
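The merged architecture can be illustrated with a forward pass in NumPy: one branch is a dense layer over the sentiment features, the other is a simple recurrent layer over a sequence of 100-dimensional word vectors, and their outputs are concatenated before a final sigmoid unit. Only the 100-dimensional embedding size comes from the text; the layer sizes, the number of sentiment features, the tanh activations, and the random untrained weights are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SENT, EMB_DIM, DENSE_UNITS, RNN_UNITS = 12, 100, 8, 16

# Random, untrained weights for both branches and the output unit.
params = {
    "W_dense": rng.normal(0, 0.1, (N_SENT, DENSE_UNITS)),
    "b_dense": np.zeros(DENSE_UNITS),
    "W_x": rng.normal(0, 0.1, (EMB_DIM, RNN_UNITS)),
    "W_h": rng.normal(0, 0.1, (RNN_UNITS, RNN_UNITS)),
    "b_rnn": np.zeros(RNN_UNITS),
    "w_out": rng.normal(0, 0.1, DENSE_UNITS + RNN_UNITS),
    "b_out": 0.0,
}


def dense_branch(sentiment, p):
    """Dense layer over the sentiment analysis features."""
    return np.tanh(sentiment @ p["W_dense"] + p["b_dense"])


def rnn_branch(embeddings, p):
    """Simple recurrent layer: update the hidden state once per word vector."""
    h = np.zeros(RNN_UNITS)
    for x_t in embeddings:
        h = np.tanh(x_t @ p["W_x"] + h @ p["W_h"] + p["b_rnn"])
    return h


def merged_forward(sentiment, embeddings, p):
    """Concatenate both branches and return P(post is conservative)."""
    merged = np.concatenate([dense_branch(sentiment, p), rnn_branch(embeddings, p)])
    logit = merged @ p["w_out"] + p["b_out"]
    return 1.0 / (1.0 + np.exp(-logit))


sentiment = rng.random(N_SENT)             # stand-in for sentiment scores
embeddings = rng.normal(size=(5, EMB_DIM))  # stand-in for a 5-word post
prob = merged_forward(sentiment, embeddings, params)
```

In a deep learning framework the same design is usually expressed as two input layers whose outputs feed a concatenation layer, with all weights trained jointly.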

Conclusion

In conclusion, this text-mining project aimed to analyze the content of political subreddits on Reddit to examine the negativity and divisiveness of online political discussions and explore the language and themes used by individuals with different political ideologies. By employing natural language processing techniques and machine learning algorithms, we uncovered patterns and trends in the conversations and gained a deeper understanding of how different political ideologies perceive each other online. Our findings suggest that both liberals and conservatives engage in negative and divisive conversations, but the specific language and themes used vary by ideology. We also found evidence of echo chambers and polarization within political subreddits.

The insights gained from this project have important implications for understanding the role of social media in shaping public opinion and discourse, as well as the potential impact of online polarization on society. By identifying the factors that contribute to online polarization and negativity, we can take steps toward promoting more productive and respectful discussions online. Ultimately, we hope this project can contribute to reducing the harmful effects of online polarization and to promoting a more informed, inclusive, and united society.