YouTube Chatter: Understanding Online Comments Discourse on Misinformative and Political YouTube Videos

Undergraduate research conducted by Aarash Heydari*, Janny Zhang*, Shaan Appel, and Xinyi Wu under Professor Gireeja Ranade in the University of California — Berkeley’s Electrical Engineering & Computer Science department, as part of Misinformation Research at UC Berkeley (MRUCB).

The full paper (which includes the appendix) can be found at https://arxiv.org/abs/1907.00435.

* primary contributors.

Abstract

Motivation

Inflammatory articles and stories often incorporate video content, and their comment sections provide an approximate measure of how people react to a given story. Despite this, comment threads on YouTube videos have not been studied in the context of fake news as much as on other social media platforms such as Facebook and Reddit, likely since YouTube emphasizes hosting videos whereas websites like Reddit focus on encouraging discourse. Our aim is to understand the difference in YouTube comment responses to misinformative and factual videos, as well as to political and apolitical videos.

Note: We avoid using the term “disinformation” in our work because it implies that misleading or factually incorrect information is being spread purposefully for political, profitable, or other reasons, and as of now we cannot precisely determine the intent of our sources. Like many in the field, we also avoid using the term “fake news” to refrain from bringing its politically charged connotations into our work.

Data

Channel Selection

The biases and factual ratings of our sources are depicted in Figure 1.

Figure 1: A graphic depicting the Media Bias/Fact Check (MBFC) categorizations and MBFC factual ratings (where applicable) for the fifteen channels we scraped. We made some categories of our own, marked as [unofficial], for later reference.

We chose our channels to cover a wide range of media. We included popular news sources such as Fox, BBC, and New York Times; “Questionable Sources” such as Breitbart, American Renaissance, and Infowars; and sources less focused on politics such as The Dodo, Numberphile and Science Magazine to act as a control group. Note that even channels with a high MBFC Factual Rating can be considered biased.

Six of our sources support left-leaning views (New York Times, The Guardian, BBC, Democracy Now, Young Turks, Syriana Analysis); six of our sources support right-leaning views (Fox, InfoWars/Alex Jones, RT, Breitbart, American Renaissance, PragerU); and three of our sources are relatively apolitical (The Dodo, Numberphile, and Science Magazine). All of these categorizations save for The Dodo, Numberphile, and PragerU are by MBFC; we unofficially labelled those three channels based on Wikipedia descriptions and separate them from MBFC-labelled sources in our analyses. For details on each channel, see item 2 in the appendix.

Figure 2 displays the average number of views per video for each of the different bias groups to provide context for our later work.

Figure 2: Views across all sampled videos for the YouTube channels, normalized by the number of videos per group. Grouped visually by category and factual rating. Views measured in thousands. A chart of the views by channel can be found in the appendix, item 3.

Note: Among our chosen channels, we see significant differences in viewership, which could add a confounding variable to our later analyses. Channels with higher view-counts often have lower engagement rates, so if one of our categories has many highly-viewed channels, that category may be erroneously labelled as having less engagement than another category.

As this is an exploratory work, we have included a Released Datasets & Code section to allow others to continue and build on our work.

Data Collection

For each video, we retrieve all video metadata, all comments on the video (direct comments), and all replies to these comments (replies). The video metadata consists of view count, like count, dislike count, favorite count, publish date, date accessed by the scraper, the publisher’s channel title, and the duration of the video. Each comment (which includes both direct comments and replies) also has its own metadata consisting of the author’s display name, timestamp of the comment, and a like count. For more details on comment structure and how the metadata is gathered, please refer to the YouTube API we linked earlier in this section.

Released Datasets & Code

Datasets

Each dataset, save for American Renaissance, has information on 201 videos. The American Renaissance dataset has information on 172 videos.

Code

Results

Comment Engagement For Political Content:

> Comments per view (CPV)
Some channels were significantly more popular than others, with disparities up to a factor of 200 between the total views on the channels we chose (Figure 2). We wanted to see how well our different channels solicited responses from their viewers, so we measured how many comments per view (CPV) each channel received. We measure CPV by channel by finding, for each video in a channel, how many comments that video received per view, then averaging that ratio over all the videos in the channel (Figure 3A). To measure CPV by category, we take the CPV’s of the channels in that category and average those (Figure 3B).
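As a rough illustration of the two-level averaging described above, the sketch below computes CPV per channel and then per category. The record layout (`channel`, `views`, `comments`), the channel names, and the numbers are made up for the example and are not taken from the released datasets.

```python
from collections import defaultdict
from statistics import mean


def cpv_by_channel(videos):
    """Average the per-video comments/views ratio within each channel."""
    ratios = defaultdict(list)
    for video in videos:
        if video["views"] > 0:  # skip unviewed videos to avoid dividing by zero
            ratios[video["channel"]].append(video["comments"] / video["views"])
    return {channel: mean(values) for channel, values in ratios.items()}


def cpv_by_category(channel_cpv, channel_category):
    """Average the channel-level CPVs of the channels in each category."""
    grouped = defaultdict(list)
    for channel, cpv in channel_cpv.items():
        grouped[channel_category[channel]].append(cpv)
    return {category: mean(values) for category, values in grouped.items()}


# Hypothetical records, for illustration only.
videos = [
    {"channel": "A", "views": 1000, "comments": 10},
    {"channel": "A", "views": 2000, "comments": 60},
    {"channel": "B", "views": 500, "comments": 5},
]
channel_cpv = cpv_by_channel(videos)
category_cpv = cpv_by_category(channel_cpv, {"A": "Cat1", "B": "Cat1"})
```

Because the category value is an average of channel averages, every channel carries equal weight regardless of how many views it has.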

Figure 3A: Total CPV, organized by channel, within our dataset.
Figure 3B: Total CPV, by category. Total view counts per channel are provided in our appendix item 3.

After normalizing, we find that both left-leaning and right-leaning political content resulted in more comments and replies per view as compared to non-political content. Content labelled as Left-Bias, Right-Bias, Conspiracy/Pseudoscience, or Questionable Source has 7.5x more comments per view and 10.42x more replies per view on average compared to content labelled as Apolitical or Pro-Science (Figure 3B). This indicates greater engagement in the comment sections on political channels. For our readers’ convenience, we also include a finer-grain breakdown of comments per view per channel (Figure 3A).

Our data also suggests that more polarized content receives greater engagement than less polarized content. Most notably, we find 8.3x more comments per view on channels labelled as Conspiracy-Pseudoscience and Questionable Sources compared to channels labelled as Apolitical or Pro-Science. We also see that Left-Bias channels have 2.56x more comments per view than Left-Center channels on average.

Although we see a difference between Conspiracy-Pseudoscience/Questionable Sources and Apolitical/Pro-Science channels, this comparison does not take into account the large difference in views between the four categories; there are far more views on our Apolitical channels than on channels in the other three categories (Figure 2).

> Comments per video
Focusing on comments per view can sometimes downplay engagement on more popular channels, since channels garnering high viewership do not always receive a proportional number of direct comments. Thus, we also wanted to make a baseline measurement by tracking the average responses per video, as opposed to per view, by channel and by category (Figure 4).

Figure 4: (Left) Total comments, including both direct comments and replies, divided by videos per channel. (Right) Total comments divided by videos per channel, averaged over category.

> Average thread length (ATL)
Since we are also interested in the depth of interaction between users in YouTube comment sections, we measure the average thread length (ATL) across each channel (Figure 5A) and category (Figure 5B).

Figure 5A: Average thread length for each channel. Average thread length is the average number of replies on each direct comment over all the videos we scraped for a particular channel.

We consider ATL a good measure of deeper engagement because longer threads require viewers to interact with other commenters over longer periods of time. We calculate ATL’s for channels by taking the average number of replies on each direct comment in that channel, then for categories by averaging the ATL’s of the channels in that category.
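The ATL calculation can be sketched as follows, assuming each channel is represented by a flat list holding the reply count of every direct comment across its scraped videos. The channel names and counts are invented for illustration.

```python
from statistics import mean


def average_thread_length(reply_counts):
    """Mean number of replies per direct comment (0.0 if there are none)."""
    return mean(reply_counts) if reply_counts else 0.0


def category_atl(channel_reply_counts, channels):
    """Average the channel-level ATLs of the channels in one category."""
    return mean(average_thread_length(channel_reply_counts[c]) for c in channels)


# Hypothetical reply counts, one entry per direct comment.
channel_reply_counts = {
    "A": [0, 0, 3, 1],  # four direct comments, four replies in total
    "B": [0, 1],        # two direct comments, one reply in total
}
atl_a = average_thread_length(channel_reply_counts["A"])
atl_b = average_thread_length(channel_reply_counts["B"])
atl_category = category_atl(channel_reply_counts, ["A", "B"])
```

Direct comments with no replies count as zero-length threads, which is why a channel can have an ATL below 1.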

Figure 5B: Average thread length organized by category.

Despite having higher total view counts per video than most of the news sources (Figure 2), the apolitical channels The Dodo and Numberphile both have significantly shorter ATL’s than the news sources. All our apolitical channels have ATL’s < 1. Shorter ATL’s are not unique to Apolitical channels; some Left-Bias, Questionable Source, Pro-Science and Conspiracy-Pseudoscience channels also have ATL’s < 1 (Figure 5A). However, those particular channels do not have view counts per video comparable to those of the Apolitical channels. If we compare the Apolitical ATL (with an average of 400k views per video) to the PragerU ATL (with 1,050k views per video), we see that PragerU’s ATL is 1.5 replies longer than the Apolitical ATL.

For our chosen channels, the Apolitical ATL is much less than the ATL of any other category. This broadly hints that political content leads to greater engagement. Some further work on ATL can be found in our appendix, items 5 & 6.

> Average comment lengths (ACL)

Figure 6A: Average number of characters in a comment, organized by channel.

As another measure of comment engagement, we analyze the average comment lengths (ACL’s) for each channel. ACL by channel is the average number of characters per comment across every video in the channel’s dataset (Figure 6A). ACL by category is the average of the ACL’s of the channels in each category (Figure 6B).
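A minimal sketch of the ACL calculation, assuming all comments from a channel's videos are pooled into one list before averaging (the text could also be read as averaging per video first). The sample comments are invented.

```python
from statistics import mean


def average_comment_length(comments):
    """Mean number of characters per comment (0.0 if there are no comments)."""
    return mean(len(comment) for comment in comments) if comments else 0.0


def category_acl(channel_comments, channels):
    """Average the channel-level ACLs of the channels in one category."""
    return mean(average_comment_length(channel_comments[c]) for c in channels)


# Hypothetical comments, pooled over each channel's videos.
channel_comments = {
    "A": ["nice", "so cute!!"],         # 4 and 9 characters
    "B": ["I disagree with this take"],  # 25 characters
}
acl_a = average_comment_length(channel_comments["A"])
acl_category = category_acl(channel_comments, ["A", "B"])
```

Note that `len` counts Unicode code points, so emoji and accented characters may be counted differently than a reader would expect.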

Figure 6B: Average comment length, organized by category.

Immediately, we see that comments on PragerU are longer than comments on any other channel (Figure 6A); people are not only creating long threads (Figure 5A), but also writing more in each comment. We see similar behavior across the Democracy Now and American Renaissance channels, suggesting there may be longer comments on videos reflecting more polarized political views (Figure 6A). The Dodo, by contrast, has comparatively short comments.

> Profanity trends on political content
We wanted to see if political channels inspired more profanity than apolitical channels, and whether or not there would be a relationship between MBFC ratings and profanity.

We calculate channel profanity by finding the percentage of profane comments per video, then averaging those percentages over all the videos in a channel (Fig. 7A); category profanity is calculated similarly but averaged over all videos in an MBFC category (Fig. 7B). A comment counts as profane if it contains at least one profane word. To detect profane words, we use profanityfilter, a Python library that checks each comment against a dictionary of profane words.
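The per-video percentage and channel average can be sketched as below. To keep the example self-contained it does not call profanityfilter; instead it uses a tiny placeholder word set and a simple tokenizer, both of which are assumptions made for illustration.

```python
import re
from statistics import mean

# Placeholder word list standing in for a real profanity dictionary.
PROFANE_WORDS = {"darn", "heck"}


def is_profane(comment, words=PROFANE_WORDS):
    """True if the comment contains at least one word from the word list."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    return any(token in words for token in tokens)


def video_profanity_pct(comments):
    """Percentage of a video's comments that are profane."""
    if not comments:
        return 0.0
    return 100.0 * sum(is_profane(c) for c in comments) / len(comments)


def channel_profanity_pct(videos):
    """Average the per-video percentages over a channel's videos."""
    return mean([video_profanity_pct(comments) for comments in videos])


# Hypothetical channel with two videos, each a list of comment strings.
videos = [
    ["What the heck", "nice video"],
    ["lovely", "great", "Darn!", "ok"],
]
channel_pct = channel_profanity_pct(videos)
```

Averaging per-video percentages gives each video equal weight, so a video with few comments influences the channel figure as much as one with thousands.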

Figure 7A: Channel profanities for our dataset.
Figure 7B: Category profanities for our dataset.

Channels with political content have 1.82x more profane comments across their videos when compared to our apolitical and pro-science sources. We also see that Left-Bias channels are 1.82x more profane than Left-Center Bias channels.

We note that Science Magazine (Pro-Science, High Factual Rating), BBC and The Guardian (both Left-Center Bias, High Factual Rating) have levels of profanity comparable to those of Russia Today (Questionable Source), Fox News (Right-Bias, Mixed Factual Rating), and Syriana Analysis (Left-Bias, Mixed Factual Rating), which suggests that factual rating is not necessarily correlated with profanity. Even so, it is interesting to note that the most profane channels, American Renaissance (Questionable Source), Democracy Now (Left-Bias, Mixed Factual Rating), Young Turks (Left-Bias, Mixed Factual Rating), and Breitbart (Questionable Source), had some of the lowest factual ratings.

We made a number of word clouds per channel in the hopes of seeing what commenters on each channel were saying, but drew no conclusions. The word clouds can be found in our appendix (item 4).

Predicting the MBFC Category of a Video using Metadata

> Summary
For our limited dataset, the Random Forest and SVM with RBF kernel are the highest performing models, achieving test accuracies of 80.2% and 79.0% respectively in this 8-class classification problem. The most important features used by the Decision Tree and Random Forest models to determine the bias category of a video were “Like Count” of the video, “Comments Per View”, and “Profanity Rate”. In future work, we hope to increase the size of our dataset and leverage NLP techniques on the text of comments to improve classification performance and gain more insight on differences in engagement and language between videos from different political bias categories.

> Predicting the MBFC category of a video
We combined the videos of our 15 YouTube channels into one large dataset of 2904 videos. Each video was given 10 features, shown in Figure 8A below. These features consist of those engagement statistics which we compared earlier (ex. “Profanity Rate”) and other metadata which we had easy access to (ex. “Like Count”).

Figure 8A: A list of the features we used, as well as each one’s importance in Decision Tree and Random Forest.

Figure 8A depicts the importance of these features according to our Decision Tree and Random Forest models, where feature importance is defined as the normalized total reduction of the split criterion (Gini impurity) provided by that feature.
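To make this definition concrete, the sketch below computes the Gini impurity, the weighted impurity reduction from a single split, and the final normalization across features. It is a hand-rolled illustration on made-up labels, with hypothetical feature names, and is not the code used to produce Figure 8A.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right, n_total):
    """Impurity reduction of one split, weighted by the share of samples at the node."""
    n = len(parent)
    children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return (n / n_total) * (gini(parent) - children)


def normalize(raw_importance):
    """Scale per-feature totals so that they sum to 1."""
    total = sum(raw_importance.values())
    return {f: (v / total if total else 0.0) for f, v in raw_importance.items()}


# Made-up node with four samples from two classes.
parent = ["L", "L", "R", "R"]
raw = {
    # A split that separates the classes perfectly.
    "feature_a": impurity_decrease(parent, ["L", "L"], ["R", "R"], n_total=4),
    # A weaker split that leaves one child mixed.
    "feature_b": impurity_decrease(parent, ["L", "L", "R"], ["R"], n_total=4),
}
importance = normalize(raw)
```

In a full tree, each feature's raw total would be summed over every node that splits on it before normalizing; a forest would then average those values across its trees.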

It is worth noting the slight redundancy in using “Comments Per View”, “Number of Comments”, and “Number of Views” as features; the first is the ratio of the other two. Interestingly, the Decision Tree and Random Forest models find that “Comments Per View” is a substantially more informative feature than either of the two statistics alone.

We visualized the 10-dimensional data, labelled by its bias category, using 2-dimensional t-distributed Stochastic Neighbor Embedding in Figure 8B. The most visible trait of this visualization is that Apolitical videos, which had the lowest average value for many of the features in the original feature space (e.g. “Comments Per View”, “Average Thread Length”, “Profanity Rate”), form a relatively independent cluster.

Figure 8B: A 2-dimensional t-SNE visualization of the training data.

For each model applied to the data, we used 5-fold cross validation on our training set (85% of the total dataset) to find optimal hyper-parameters. After training our models on those hyperparameters, we evaluated them on an unseen test set (15% of the total dataset). The resulting confusion matrices on the unseen test data for each model are in Figure 8C below.
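The evaluation protocol above can be sketched with plain index bookkeeping: hold out 15% of the data as a test set, then rotate five validation folds over the remaining 85%. The function names, the fixed seed, and the `fit_and_score` callback are hypothetical; a library such as scikit-learn provides equivalent utilities.

```python
import random


def train_test_split_indices(n, test_fraction=0.15, seed=0):
    """Shuffle indices 0..n-1 and hold out a fraction of them as the test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists, using each fold once for validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def mean_cv_score(fit_and_score, indices, k=5):
    """Average a user-supplied fit-and-score callback over the k folds."""
    return sum(fit_and_score(tr, va) for tr, va in k_fold(indices, k)) / k


# Example with 100 hypothetical samples.
train_idx, test_idx = train_test_split_indices(100)
folds = list(k_fold(train_idx))
```

Hyper-parameters would be chosen by comparing `mean_cv_score` across candidate settings; the test indices are touched only once, after that choice is fixed.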

Figure 8C: Confusion matrices for models with hyper-parameters chosen by 5-fold cross validation. (Top left) Decision tree with maximum depth = 12: test accuracy = 65.7%. (Top right) Random forest with unlimited max depth, 61 estimators: test accuracy = 80.2%. (Bottom left) Linear SVM with C = 100: test accuracy = 62.5%. (Bottom right) SVM with RBF kernel, with gamma = 0.1, C = 175: test accuracy = 79.0%.

We find that the Random Forest and SVM with RBF kernel perform best, with test accuracies of 80.2% and 79.0% respectively.

Random Forest models generally perform well on datasets with even minimal regularity because of their ensemble nature and their robustness to outliers. It is no surprise that the linear SVM underperforms, because there is no reason to believe that the optimal decision function for this problem is a linear decision boundary. Thus, we hypothesize that the linear SVM suffers from bias (underfitting) to a greater extent than the other models.

The SVM with RBF kernel likely performs well because the RBF kernel function can be interpreted as a similarity score based on Euclidean distance between points in the feature space. Videos from the same channel tend to have similar feature representations, and each bias category in our dataset is composed of videos from no more than three unique channels.
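The similarity interpretation can be seen directly from the kernel formula, k(x, y) = exp(-gamma * ||x - y||^2). The sketch below uses gamma = 0.1, the value reported in Figure 8C; the feature vectors are invented and much shorter than the 10 features used in the study.

```python
import math


def rbf_kernel(x, y, gamma=0.1):
    """RBF similarity: 1.0 for identical points, decaying toward 0 with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Made-up two-feature vectors for three videos.
video = [0.0, 0.0]
same_channel = [0.5, 0.5]   # close in feature space
other_channel = [3.0, 4.0]  # far away in feature space

near = rbf_kernel(video, same_channel)
far = rbf_kernel(video, other_channel)
```

Here `near` is close to 1 and `far` is close to 0, so videos with similar feature values, such as those from the same channel, pull strongly on each other's predicted class.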

For future work, it would be interesting to see if the SVM with RBF kernel continues performing well with greater amounts of broader data. We are also interested in applying NLP techniques to the comments of videos to gain novel insights about differences in the language of comments in videos from different bias categories.

Discussion

Having limited our scope to 15 YouTube channels, a majority of which were American, it remains to be seen if the trends discussed in this paper occur across a larger and more comprehensive dataset. Furthermore, the differences in viewership across the chosen channels may have led us to different conclusions than if the compared channels had more similar viewership. These uncertainties could be resolved by running the same analyses on a larger, more diverse dataset. We would also like to expand our scope to analyze why certain comments receive greater traction and why some channels have longer average thread lengths, as well as how discussions and arguments develop on YouTube.

We sincerely welcome fellow researchers and the open-source community to use and build upon our work. We hope our readers remember that online comments and discussions may never accurately reflect the nature of people’s attitudes towards the content they see — many interactions happen offline, “in real life” — and look to stimulate further analysis in the domain of discourse surrounding misinformation online.

