Ever since we first learned about it, we’ve been interested in studying a dataset of Reddit posts from the popular “Am I the Asshole?” (AITA) subreddit. There, people post about confrontations in which they aren’t sure whether they did the right thing or were actually the asshole.
Other users then vote and comment using one of the following verdicts: You’re the Asshole (YTA), Not the Asshole (NTA), Everyone Sucks Here (ESH), No Assholes Here (NAH), or Not Enough Info (INFO).
In addition to the vote results and the number of comments for each post, the dataset includes the text of nearly 97,000 entries. Only roughly 27% of the cases received a verdict of either YTA or ESH, meaning that about 73% of the cases were deemed to be assholery-free.
That offers some comfort about human nature and our propensity to worry about acting morally.
Even though the assholes turned out to be in the minority, we can dig deeper into this rich dataset of complex human situations. Alteryx Designer and the Text Mining palette in the Alteryx Intelligence Suite provide the tools needed to process even large amounts of text.
We used those tools, along with the Data Investigation tool palette, to look for intriguing patterns in the AITA posts. Enjoy this somewhat irreverent tour of sentiment analysis, topic modeling, and correlations. We might learn more about how people behave along the way.
Three Simple Steps To Text Analysis: Don’t Hastily Judge
Because the dataset was fairly clean (in the data sense, at least), we only needed to fix a few minor text formatting issues and create a new variable representing the length of the original post.
We thought it would be interesting to see how post length, which may reflect the complexity of the situation and/or how much the poster felt a need to explain themselves, relates to the other variables.
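As a rough illustration of the post-length variable, here is a minimal Python sketch. The workflow itself was built in Alteryx Designer, and the source does not say whether length was measured in characters or words, so the function name `post_length` and both measures are assumptions.

```python
def post_length(text: str) -> dict:
    """Return two simple length measures for one post body."""
    words = text.split()  # whitespace tokenization; an assumption, not the Alteryx method
    return {"chars": len(text), "words": len(words)}


# Hypothetical example text, not taken from the dataset.
example = "AITA for asking my roommate to do the dishes?"
lengths = post_length(example)
print(lengths)
```

Either measure could then sit alongside the sentiment and topic scores as one more numeric column per post.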
Before doing any other text processing, we used the Sentiment Analysis tool to evaluate each post’s title and body for emotional valence: positive, neutral, or negative.
The algorithm that powers this tool, VADER, is designed to work well even on text containing NSFW terms, emoticons, excessive punctuation, and the other quirks of social media content. For sentiment analysis, all of those features need to remain intact.
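To make the positive/neutral/negative labels concrete: VADER produces a normalized compound score between -1 and 1, and the VADER project suggests cutoffs of 0.05 and -0.05 for turning that score into a label. The sketch below applies that convention in plain Python. Whether the Alteryx Sentiment Analysis tool uses exactly these cutoffs is an assumption, and `label_from_compound` is a hypothetical helper name.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER-style compound score in [-1, 1] to a sentiment label."""
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie between -1 and 1")
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores for illustration only.
for score in (0.62, 0.0, -0.41):
    print(score, label_from_compound(score))
```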
Before topic modeling, however, we gave the text a little more preparation. The Text Pre-processing Tool handled that important job. This tool, which is built on the Python NLP library spaCy, normalizes and filters the text.
One odd thing it does is replace pronouns with the symbol -PRON-. If you have spent a lot of time online, you might guess that -PRON- refers to something other than pronouns. In fact, it is simply spaCy’s textual placeholder for pronouns.
Using the REGEX_Replace function in a Formula Tool, we removed all of those placeholders from the processed text of the posts and their titles.
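For readers working outside Designer, the same cleanup can be sketched with Python's standard `re` module. This is only an equivalent under assumptions: the exact expression used in the Formula Tool is not given in the source, and the pattern and the helper name `strip_pron` are illustrative.

```python
import re

# Matches the -PRON- placeholder together with any whitespace around it.
PRON_PATTERN = re.compile(r"\s*-PRON-\s*")


def strip_pron(text: str) -> str:
    """Remove -PRON- placeholders and tidy the whitespace left behind."""
    cleaned = PRON_PATTERN.sub(" ", text)
    return cleaned.strip()


# Hypothetical pre-processed text, not taken from the dataset.
print(strip_pron("-PRON- tell -PRON- roommate to move out"))
```

Replacing each match with a single space, rather than an empty string, keeps the neighboring words from being fused together.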
The Topic Modeling Tool was then added to the workflow and configured to find three topics in the posts. Check out the GIF below to see the main themes that appeared in the resulting visualization.
The lists of key terms for each topic, combined with an understanding of the AITA context, allow us to label the three topics as “family troubles,” “romantic/friend relationship conflicts,” and “work/job challenges.”
The Intertopic Distance Map shows the three topics well separated, and each topic’s word list makes sense. The Topic Modeling Tool also gives each post in the dataset a score for each topic, indicating how strongly that topic is represented in the post.
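One simple way to use the per-topic scores is to tag each post with its highest-scoring topic. The sketch below assumes each post carries three scores in a fixed order; the label strings, the ordering, and the helper name `dominant_topic` are hypothetical and not part of the Alteryx output.

```python
# Assumed ordering of the three topic scores for each post.
TOPIC_LABELS = ("family", "relationships", "work")


def dominant_topic(scores) -> str:
    """Return the label of the topic with the highest score for one post."""
    if len(scores) != len(TOPIC_LABELS):
        raise ValueError("expected one score per topic")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return TOPIC_LABELS[best_index]


# Hypothetical scores for a single post.
print(dominant_topic([0.10, 0.70, 0.20]))
```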
Finding the main topics in some 97,000 posts and analyzing their sentiment so quickly is great. But were there any connections between those topics and sentiment scores and the AITA users’ verdicts?
To find out, we opened the Data Investigation tool palette to see what patterns we could find in these posts and the responses to them.
Assholery And Sentiment Analysis
The Contingency Table Tool makes it simple to compare categorical variables and see how their values coincide. It’s a great way to examine the sentiment analysis results and AITA verdicts in more detail.
Using the “is asshole” variable in the dataset, we can compare the verdicts for titles and posts with positive or negative sentiment.
Surprisingly, the sentiment of titles and posts judged to contain assholery did not differ much from the sentiment of those that were not. In fact, positive posts were judged YTA or ESH slightly more often than negative ones.
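The comparison above can be sketched as a small contingency table in plain Python. This is an illustration rather than the Contingency Table Tool's implementation: the binary flag follows the YTA-or-ESH definition used earlier, the rows are made up, and the helper names are hypothetical.

```python
from collections import Counter

ASSHOLE_VERDICTS = {"YTA", "ESH"}


def is_asshole(verdict: str) -> bool:
    """True when the verdict assigns blame to the poster (YTA or ESH)."""
    return verdict.upper() in ASSHOLE_VERDICTS


def contingency_table(posts) -> Counter:
    """Count posts for each (sentiment, is_asshole) combination."""
    return Counter((sentiment, is_asshole(verdict)) for sentiment, verdict in posts)


def asshole_share(table: Counter, sentiment: str) -> float:
    """Fraction of posts with the given sentiment that were judged YTA or ESH."""
    blamed = table[(sentiment, True)]
    total = blamed + table[(sentiment, False)]
    return blamed / total if total else 0.0


# Made-up rows for illustration only; these are not values from the dataset.
toy_posts = [
    ("positive", "NTA"), ("positive", "YTA"), ("negative", "NTA"),
    ("negative", "NAH"), ("negative", "ESH"), ("neutral", "INFO"),
]
table = contingency_table(toy_posts)
for sentiment in ("positive", "neutral", "negative"):
    print(sentiment, round(asshole_share(table, sentiment), 2))
```

Comparing the shares row by row is what shows whether positive and negative posts drew different verdicts.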
Going a little further, we can use the Association Analysis Tool to look at the associations among our sentiment valence scores, topic scores, and the post-length variable we created.
We selected the “Target a field for more in-depth analysis” option to obtain p-values for the association between each of these variables and the “is asshole” variable.
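To give a feel for what a correlation and its p-value measure here, the sketch below computes a Pearson correlation between a numeric variable and a 0/1 flag, then estimates a p-value with a permutation test. This is a swapped-in technique chosen because it needs only the standard library; it is not how the Association Analysis Tool computes its p-values, and the data and function names are hypothetical.

```python
import math
import random


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(xs, ys, trials: int = 2000, seed: int = 0) -> float:
    """Two-sided p-value: how often shuffled labels correlate at least as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)


# Made-up post lengths and 0/1 "is asshole" flags, for illustration only.
post_lengths = [120, 340, 560, 210, 980, 450, 300, 760]
asshole_flags = [0, 1, 0, 0, 1, 0, 1, 0]
print(pearson(post_lengths, asshole_flags))
print(permutation_p_value(post_lengths, asshole_flags))
```

A small p-value would suggest the association is unlikely to arise from chance alone, though with tens of thousands of posts even very weak correlations can reach significance.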
Based on these situations and judgments, it appears that bad behavior is distributed fairly evenly across the areas of our lives. The numbers aren’t all that different, but the Reddit voters were a little more forgiving of family and job difficulties and a little harsher in their assessments of romantic and friendship conflicts.