Extracting Subreddit Names From URLs
16 Dec 2017
For my research on Discord, I gathered a list of Reddit URLs that mentioned the platform during the early part of its rise in popularity. I wanted to investigate how Discord grew on Reddit.
I collected those URLs with a Google scraper I had modified. The next step was to pull the links out of the scraped data and build a list of the subreddits where Discord was being mentioned.
The data I collected looked like this:
[{
    "effective_query": "",
    "id": "1721",
    "no_results": "False",
    "num_results": "10",
    "num_results_for_query": "About 2,150,000 results (0.33 seconds)\u00a0",
    "page_number": "9",
    "query": "discord.gg site:reddit.com",
    "requested_at": "2017-11-21 06:58:48.987283",
    "requested_by": "localhost",
    "results": [
        {
            "domain": "www.reddit.com",
            "id": "2665",
            "link": "https://www.reddit.com/r/HeistTeams/comments/6543q7/join_heistteams_offical_discord_server_invite/",
            "link_type": "results",
            "rank": "1",
            "serp_id": "1721",
            "snippet": "http://discord.gg/gtao. ... The good thing about discord is if you're like me and your Mic don't work there's a .... Sub still kinda active, but the discord is much more.",
            "time_stamp": "Apr 13, 2017 - 100+ posts - \u200e100+ authors",
            "title": "Join HeistTeam's Offical Discord Server! Invite: discord.gg/gtao - Reddit",
            "visible_link": "https://www.reddit.com/r/HeistTeams/.../join_heistteams_offical_discord_server_invite..."
        },
        {
            "domain": "www.reddit.com",
            "id": "2666",
            "link": "https://www.reddit.com/r/NeebsGaming/comments/6q3wlk/the_official_neebs_gaming_discord/",
            "link_type": "results",
            "rank": "2",
            "serp_id": "1721",
            "snippet": "Ive changed the link in the sidebar over to the official discord or you can join it by following this link here. http://discord.gg/neebsgaming. Here are the rules for\u00a0...",
            "time_stamp": "Jul 28, 2017 - 5 posts - \u200e4 authors",
            "title": "The Official Neebs Gaming Discord! : NeebsGaming - Reddit",
            "visible_link": "https://www.reddit.com/r/NeebsGaming/.../the_official_neebs_gaming_discord/"
        },
First, I needed only the links from the “results” part of each entry. A simple nested for loop walks every search page and every result on it, appending each link to a list that is used later on:
import json
# Load the scraped search results
with open('discordgg/November2015December2016.json') as f:
    data = json.load(f)
# Collect only the links from each page's "results" list
links = []
for a in data:
    for b in a['results']:
        links.append(b['link'])
        # pprint(b['link'])
Next, I needed to extract just the part of each URL that corresponds to the subreddit name. For example, from the URL “https://www.reddit.com/r/NeebsGaming/comments/6q3wlk/the_official_neebs_gaming_discord/”, I wanted just “NeebsGaming”. Luckily, all of the links I collected from Reddit followed the same pattern, with the subreddit name sitting between “/r/” and the next “/”. So I split each URL on “/”, which gives pieces such as “https:”, an empty string, “www.reddit.com”, “r”, and then the subreddit name, and took the element at index 4:
from pprint import pprint
# Split each link on '/' and keep index 4, the subreddit name
subReddits = []
for y in links:
    subReddits.append(y.split('/')[4])
    pprint(y.split('/')[4])
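Splitting on “/” works because every link in my data has the same shape. If some links did not, a regular expression would be a more defensive option. The sketch below is only an illustration: the pattern, the `extract_subreddit` helper, and the third sample URL are my own assumptions, not part of the original script.

```python
import re

# Hypothetical pattern: capture whatever sits between "/r/" and the
# next "/", "?" or "#" in a reddit.com URL.
SUBREDDIT_RE = re.compile(r"reddit\.com/r/([^/?#]+)")


def extract_subreddit(url):
    """Return the subreddit name from a Reddit URL, or None if there is none."""
    match = SUBREDDIT_RE.search(url)
    return match.group(1) if match else None


sample_links = [
    "https://www.reddit.com/r/HeistTeams/comments/6543q7/join_heistteams_offical_discord_server_invite/",
    "https://www.reddit.com/r/NeebsGaming/comments/6q3wlk/the_official_neebs_gaming_discord/",
    "https://www.reddit.com/user/someone/",  # made-up link with no subreddit
]

sample_subreddits = [extract_subreddit(u) for u in sample_links]
print(sample_subreddits)  # ['HeistTeams', 'NeebsGaming', None]
```

Unlike indexing into the split, this returns None for a link that has no “/r/” segment instead of silently picking up the wrong piece of the URL.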
Code in its totality:
import json
from pprint import pprint
# Load the scraped search results
with open('discordgg/November2015December2016.json') as f:
    data = json.load(f)
# Collect only the links from each page's "results" list
links = []
for a in data:
    for b in a['results']:
        links.append(b['link'])
        # pprint(b['link'])
# Split each link on '/' and keep index 4, the subreddit name
subReddits = []
for y in links:
    subReddits.append(y.split('/')[4])
    pprint(y.split('/')[4])
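The same subreddit can show up many times in subReddits, once per link. One possible next step, which is not part of the script above, is to tally the names with collections.Counter from the standard library. The `sample_names` list here is made-up data standing in for subReddits.

```python
from collections import Counter

# Made-up sample standing in for the subReddits list built above.
sample_names = ["HeistTeams", "NeebsGaming", "HeistTeams"]

# Count how many collected links point at each subreddit.
counts = Counter(sample_names)

# most_common() lists the subreddits from most to least mentioned.
for name, n in counts.most_common():
    print(name, n)
```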
Right now, I’m using the Reddit API to fetch short descriptions of those subreddits, and then a simple bag-of-words algorithm to categorize them. Stay tuned!