Using machine learning to quantify and understand internet discourse.
HackerNews (HN) is a news feed where users can post links to articles from anywhere on the web. It is similar to Reddit, except that the rules for posting and commenting are quite strict; the idea is that allowing only vetted users to contribute will keep conversational quality intact. That being said, how “high quality” are HN conversations anyway?
Step 1: Extract, transform, and load data from HackerNews
Extract
Thankfully, HN has its own public API; no account or API token needed! Unfortunately, HN structures comment data as a tree: each comment is attached either to a story (post) or to another comment as a reply. For example, if you want to count the number of comments on a given post, you must begin at the tree head and traverse down through all comments and sub-comments until you reach comments with no further replies, and only then work your way back up to the original post, counting each comment along the way. To handle this, I wrote a web crawler specifically for HN data that traverses the comment tree and extracts everything we need.
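To make the traversal concrete, here is a minimal sketch against the real HN Firebase endpoints. The function names and depth-first structure are my own illustration, not the actual crawler's code:

```python
import requests

HN_API = "https://hacker-news.firebaseio.com/v0"

def fetch_item(item_id):
    """Fetch a single story or comment from the public HN API."""
    return requests.get(f"{HN_API}/item/{item_id}.json").json()

def collect_comments(item_id, comments):
    """Depth-first walk of the comment tree rooted at item_id."""
    item = fetch_item(item_id)
    if item is None or item.get("deleted") or item.get("dead"):
        return
    if item.get("type") == "comment":
        comments.append(item.get("text", ""))
    for kid_id in item.get("kids", []):  # ids of direct replies
        collect_comments(kid_id, comments)

# Walk the 500 most recent top stories and gather every comment
story_ids = requests.get(f"{HN_API}/topstories.json").json()[:500]
all_comments = []
for story_id in story_ids:
    collect_comments(story_id, all_comments)
```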
Side note: In the future I will have the crawler run 24/7 to capture data over a longer period of time. For now, I am limiting myself to the last 500 stories on HackerNews as of May 4, 2023. My hope is to eventually scale up the crawler to collect multiple years' worth of data, but Rome wasn’t built in a day.
Transform
Unfortunately, we cannot cluster unstructured text directly. We need to convert the text into structured vectors, because clustering algorithms understand lists of numbers, not raw text. To do this, I am using the SentenceTransformers package and the pre-trained “all-mpnet-base-v2” model. To learn more about text embeddings, check out this page. Using this pre-trained model, I compute a text embedding for each comment in our data set.
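The encoding step itself is short. A sketch of how it might look, assuming the `all_comments` list from the crawler sketch above:

```python
from sentence_transformers import SentenceTransformer

# Load the pre-trained embedding model
model = SentenceTransformer("all-mpnet-base-v2")

# Encode every comment into a 768-dimensional vector
embeddings = model.encode(all_comments, show_progress_bar=True)
```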
Load
Once I am done encoding each comment, I save the output to a pandas data frame. Since we want to keep the type definitions intact, I export the data frame to a pickle file and then upload the pickle file directly to S3 for future use.
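A sketch of the load step; the bucket name and file paths are placeholders, not the ones I actually use:

```python
import boto3
import pandas as pd

# Keep comments and their embeddings together, with types intact
df = pd.DataFrame({"comment": all_comments, "embedding": list(embeddings)})
df.to_pickle("hn_comments.pkl")

# Upload the pickle to S3 for future use
s3 = boto3.client("s3")
s3.upload_file("hn_comments.pkl", "my-hn-bucket", "hn_comments.pkl")
```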
Step 2: Clustering Comments
From those 500 stories, I pulled 2,107 comments. I was able to assign a cluster label to roughly 25% of them. Why are 75% of the comments left without a cluster? Without getting too deep into the technical specifics, HDBSCAN, the clustering algorithm I am using, looks for pockets of density when assigning clusters, and some comment clusters have such low density that the algorithm ends up missing them. For example, suppose there are only 1-2 comments relating to pet ownership. In a sea of 2.1k comments, it would be difficult for HDBSCAN to flag that cluster as significant. In the future, as I scale up the data across longer time ranges, density across all clusters will increase, improving the chances of picking up currently missed comments. But because our data set is quite small, we must settle for 25%, which is not too bad given the circumstances. Using HDBSCAN, I was able to produce 27 distinct cluster groups.
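A sketch of the clustering call; `min_cluster_size` here is an illustrative guess, not a tuned value. HDBSCAN marks points it cannot confidently assign with the label -1:

```python
import hdbscan

# Cluster the comment embeddings; low-density points are labeled -1 (noise)
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric="euclidean")
labels = clusterer.fit_predict(embeddings)

print(f"Clusters found: {labels.max() + 1}")
print(f"Comments assigned: {(labels >= 0).sum()} / {len(labels)}")
```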
Step 3: Creating Topic Categories
At this point we have clustered our comments. In other words, we can say comments a, b, c belong to cluster 1, comments x, y, z belong to cluster 2, and so on. However, we do not just want cluster numbers; we want to know what each collection of comments is actually talking about. In other words, we want to create a topic category for the body of text that is each cluster group. Since I have API access, I am using GPT-4 (3.5 works too) to read through a subsection of comments in a given cluster and then provide a topic category. Below is the prompt I used.
You are topicGPT, you can read a large body of text and then create a topic category that describes the body of text. For example if you read comments about ways to fix broken engines, smog checks, and check engine lights you would reply with auto maintenance as the topic category. You will limit yourself to 4 words or less per topic category. After you create your topic category double check you meet the word limit requirement before giving your final answer.
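A sketch of how the labeling call might look with the OpenAI Python client; the helper name and the 20-comment sample size are my own choices for illustration:

```python
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are topicGPT, you can read a large body of text and then create a "
    "topic category that describes the body of text. For example if you read "
    "comments about ways to fix broken engines, smog checks, and check engine "
    "lights you would reply with auto maintenance as the topic category. You "
    "will limit yourself to 4 words or less per topic category. After you "
    "create your topic category double check you meet the word limit "
    "requirement before giving your final answer."
)

def label_cluster(cluster_comments, sample_size=20):
    """Ask GPT-4 to name a topic category for a sample of one cluster's comments."""
    sample = random.sample(cluster_comments, min(sample_size, len(cluster_comments)))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "\n---\n".join(sample)},
        ],
    )
    return response.choices[0].message.content.strip()
```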
After GPT-4 gives me the topic categories for all clusters, we can ask more meaningful questions. What are the top clusters? It turns out the HackerNews audience really enjoys talking about the AI industry, housing affordability, and web development (not surprising). What about the smallest clusters? Towards the bottom end we have GPT and chatbot creation. That topic could plausibly be included in the AI industry cluster, but it seems HDBSCAN finds the conversation sufficiently different to warrant a separate cluster. Below is a bar chart showing the comment count for each cluster (y-axis) against each cluster's topic category (x-axis).
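For completeness, a sketch of how such a chart could be produced with matplotlib, assuming a hypothetical `topic_labels` dict that maps each cluster id to its GPT-4 category:

```python
import matplotlib.pyplot as plt

# Count comments per cluster, then swap cluster ids for topic names
df["cluster"] = labels
counts = (
    df[df["cluster"] >= 0]
    .groupby("cluster")
    .size()
    .rename(index=topic_labels)
    .sort_values(ascending=False)
)

counts.plot.bar(figsize=(12, 6))
plt.xlabel("Topic category")
plt.ylabel("Comment count")
plt.tight_layout()
plt.show()
```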