Associating Reddit Links to Descriptions using Selenium and Matplotlib
19 Dec 2017
Recently, I wrote about a script for extracting subreddit names from URLs.
From there, I wrote code to extract the timestamps from the JSON data so that I could eventually construct a time series with the Pandas library. Note that I had to take a substring of each result due to some bugs in GoogleScraper:
#Process time stamps correctly for the data.
#Note that some are in the general data instead of the correct time slot
timestamp = []
for a in data:
    for b in a['results']:
        #Some entries have the string 'None' as their time_stamp;
        #as a workaround, fall back to the snippet and slice the date prefix
        if 'None' in b['time_stamp']:
            timestamp.append(b['snippet'][0:12])
        else:
            timestamp.append(b['time_stamp'][0:12])
Next, I worked on finding more info about the subreddit associated with each link. I originally tried to find an existing JSON dataset or a list of subreddits with descriptions. Unfortunately, I couldn’t find any list that covered ALL subreddits. So, I used the official reddit search page at https://www.reddit.com/reddits to get each description.
First, I had to set up Selenium and Jupyter notebooks. It was pretty simple: install the packages from the terminal with apt-get, then download the Chrome web driver from Google and move it to /usr/bin. I added the following imports to the top of my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
After testing the Jupyter notebook to make sure the libraries were working as expected, I went back to the https://www.reddit.com/reddits page to identify patterns in its classes and structure for extracting descriptions. From there, I simply ran a for loop over the subreddit list, which searched for each subreddit name and then extracted the description of the first result (since the names in the list are 100% accurate). I also included a try/except block to prevent an IndexError, because if the subreddit doesn’t have a description, the HTML element with class “md” is not created, which would otherwise raise an error that stops the whole script:
# Path = where the "chromedriver" file is
browser = webdriver.Chrome(executable_path='/usr/bin/chromedriver')
browser.get('https://www.reddit.com/reddits')
#Get info on each search term
inputBox = browser.find_element_by_name("q")
info = []
for channel in subReddits:
    inputBox.clear()
    inputBox.send_keys(channel)
    inputBox.send_keys(Keys.ENTER)
    time.sleep(3)
    #Find the description element of the first result
    try:
        elem = browser.find_elements_by_class_name("md")[1]
        info.append(elem.text)
    except IndexError:
        info.append("")
    #Re-find the search box to avoid a stale element issue,
    #since the page (and the original q element) changes after each search
    inputBox = browser.find_element_by_name("q")
And finally, creating a table for readability:
#Combine data for processing
arrays = list(zip(links, subReddits, timestamp, info))
df = pd.DataFrame(data=arrays, columns=['link', 'subreddit', 'timestamp', 'description'])
df
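Since the eventual goal is a time series (and the title promises Matplotlib), here is a rough sketch of how the results could be counted per month and plotted once the DataFrame exists. This isn't part of the script yet; it assumes the 12-character timestamp slices are date strings that pd.to_datetime can parse, and simply drops any rows that don't parse.
import pandas as pd
import matplotlib.pyplot as plt

#Coerce unparseable timestamp strings to NaT instead of raising
df['parsed_date'] = pd.to_datetime(df['timestamp'], errors='coerce')

#Count how many results fall in each month ('M' = month-end frequency)
monthly_counts = (
    df.dropna(subset=['parsed_date'])
      .set_index('parsed_date')
      .resample('M')
      .size()
)

monthly_counts.plot(kind='line', title='Reddit results per month')
plt.xlabel('Month')
plt.ylabel('Number of results')
plt.show()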
Full code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import urllib, json
import pandas as pd
from pprint import pprint
import time
#Load JSON
data = json.load(open('discordgg/December2016November2017.json'))
#Get only the links from the JSON
links = []
for a in data:
    for b in a['results']:
        links.append(b['link'])
        #pprint(b['link'])
#Process time stamps correctly for the data.
#Note that some are in the general data instead of the correct time slot
timestamp = []
for a in data:
    for b in a['results']:
        #Some entries have the string 'None' as their time_stamp;
        #as a workaround, fall back to the snippet and slice the date prefix
        if 'None' in b['time_stamp']:
            timestamp.append(b['snippet'][0:12])
        else:
            timestamp.append(b['time_stamp'][0:12])
#Slice to the appropriate date from the original data
#Pattern => 11 or 12 characters; the next character is a space.
#A better way would be to find 2016 or 2017 and slice up to that point
for stamp in timestamp:
    pprint(stamp)
#Process the links to get the subreddit names
subReddits = []
for y in links:
    subReddits.append(y.split('/')[4])
# Path = where the "chromedriver" file is
browser = webdriver.Chrome(executable_path='/usr/bin/chromedriver')
browser.get('https://www.reddit.com/reddits')
#Get info on each search term
inputBox = browser.find_element_by_name("q")
info = []
for channel in subReddits:
    inputBox.clear()
    inputBox.send_keys(channel)
    inputBox.send_keys(Keys.ENTER)
    time.sleep(3)
    #Find the description element of the first result
    try:
        elem = browser.find_elements_by_class_name("md")[1]
        info.append(elem.text)
    except IndexError:
        info.append("")
    #Re-find the search box to avoid a stale element issue
    inputBox = browser.find_element_by_name("q")
#Combine data for processing
arrays = list(zip(links, subReddits, timestamp, info))
df = pd.DataFrame(data=arrays, columns=['link', 'subreddit', 'timestamp', 'description'])
And that’s it! ^_^
Later, I plan to categorize each of these links using a bag-of-words algorithm.
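That script doesn't exist yet, but as a rough sketch of the idea (not part of the code above), a bag-of-words representation of the collected descriptions could be built with scikit-learn's CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

#Assumes 'info' is the list of subreddit descriptions collected above
vectorizer = CountVectorizer(stop_words='english')
counts = vectorizer.fit_transform(info)

#Each row is a description, each column a word count
print(counts.shape)
print(vectorizer.get_feature_names()[:10])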
Thanks for reading!