Associating Reddit Links to Descriptions using Selenium and Matplotlib
19 Dec 2017
Recently, I wrote about a script for extracting subreddit names from URLs.
From there, I wrote code to extract the timestamps from the JSON data so that I could eventually construct a time series with the Pandas library. Note that I had to take a substring of each result due to some bugs in GoogleScraper:
#Process time stamps correctly for the data.
#Note that some are in the general data instead of the correct time slot
timestamp = []
for a in data:
    for b in a['results']:
        #Some entries have the string 'None' as their time_stamp;
        #as a workaround, fall back to the snippet and slice the date prefix
        if 'None' in b['time_stamp']:
            timestamp.append(b['snippet'][0:12])
        else:
            timestamp.append(b['time_stamp'][0:12])
Next, I worked on finding more info about the subreddit associated with each link. I originally tried to find an existing JSON dataset or a list of subreddits with descriptions. Unfortunately, I couldn’t find any list that covered ALL subreddits. So, I used the official reddit search page at https://www.reddit.com/reddits to get each description.
First, I had to set up Selenium and Jupyter notebooks. It was pretty simple: install the packages from the terminal with apt-get, then download the Chrome web driver from Google and move it to /usr/bin. I added the following imports to the top of my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
After testing the Jupyter notebook to make sure the libraries were working as expected, I went back to the https://www.reddit.com/reddits page to identify patterns in its classes and structure for extracting descriptions. From there, I simply ran a for loop over the subreddit list, which searched for each subreddit name and then extracted the description of the first result (since the names in the list are 100% accurate). I also included a try/except block to prevent an IndexError, because if the subreddit doesn’t have a description, the HTML element with class “md” is not created, which would otherwise raise an error that stops the whole script:
# Path = where the "chromedriver" file is
browser = webdriver.Chrome(executable_path='/usr/bin/chromedriver')
browser.get('https://www.reddit.com/reddits')
#Get info on each search term
inputBox = browser.find_element_by_name("q")
info = []
for channel in subReddits:
    inputBox.clear()
    inputBox.send_keys(channel)
    inputBox.send_keys(Keys.ENTER)
    time.sleep(3)
    #Find the description element of the first result
    try:
        elem = browser.find_elements_by_class_name("md")[1]
        info.append(elem.text)
    except IndexError:
        info.append("")
    #Re-find the search box to avoid a stale element issue,
    #since the page (and the original q element) changes after each search
    inputBox = browser.find_element_by_name("q")
And finally, creating a table for readability:
#Combine data for processing
arrays = list(zip(links, subReddits, timestamp, info))
df = pd.DataFrame(data=arrays, columns=['link', 'subreddit', 'timestamp', 'description'])
df
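Since the eventual goal is a time series (and the title promises Matplotlib), here is a rough sketch of how the results could be counted per month and plotted once the DataFrame exists. This isn't part of the script yet; it assumes the 12-character timestamp slices are date strings that pd.to_datetime can parse, and simply drops any rows that don't parse.
import pandas as pd
import matplotlib.pyplot as plt

#Coerce unparseable timestamp strings to NaT instead of raising
df['parsed_date'] = pd.to_datetime(df['timestamp'], errors='coerce')

#Count how many results fall in each month ('M' = month-end frequency)
monthly_counts = (
    df.dropna(subset=['parsed_date'])
      .set_index('parsed_date')
      .resample('M')
      .size()
)

monthly_counts.plot(kind='line', title='Reddit results per month')
plt.xlabel('Month')
plt.ylabel('Number of results')
plt.show()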
Full code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import urllib, json
import pandas as pd
from pprint import pprint
import time
#Load JSON
data = json.load(open('discordgg/December2016November2017.json'))
#Get only the links from the JSON
links = []
for a in data:
    for b in a['results']:
        links.append(b['link'])
        #pprint(b['link'])
#Process time stamps correctly for the data.
#Note that some are in the general data instead of the correct time slot
timestamp = []
for a in data:
    for b in a['results']:
        #Some entries have the string 'None' as their time_stamp;
        #as a workaround, fall back to the snippet and slice the date prefix
        if 'None' in b['time_stamp']:
            timestamp.append(b['snippet'][0:12])
        else:
            timestamp.append(b['time_stamp'][0:12])
#Slice to the appropriate date from the original data
#Pattern => 11 or 12 characters; the next character is a space.
#A better way would be to find 2016 or 2017 and slice up to that point
for stamp in timestamp:
    pprint(stamp)
#Process the links to get the subreddit names
subReddits = []
for y in links:
    subReddits.append(y.split('/')[4])
# Path = where the "chromedriver" file is
browser = webdriver.Chrome(executable_path='/usr/bin/chromedriver')
browser.get('https://www.reddit.com/reddits')
#Get info on each search term
inputBox = browser.find_element_by_name("q")
info = []
for channel in subReddits:
    inputBox.clear()
    inputBox.send_keys(channel)
    inputBox.send_keys(Keys.ENTER)
    time.sleep(3)
    #Find the description element of the first result
    try:
        elem = browser.find_elements_by_class_name("md")[1]
        info.append(elem.text)
    except IndexError:
        info.append("")
    #Re-find the search box to avoid a stale element issue
    inputBox = browser.find_element_by_name("q")
#Combine data for processing
arrays = list(zip(links, subReddits, timestamp, info))
df = pd.DataFrame(data=arrays, columns=['link', 'subreddit', 'timestamp', 'description'])
And that’s it! ^_^
Later, I plan to categorize each of these links using a bag-of-words algorithm.
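That script doesn't exist yet, but as a rough sketch of the idea (not part of the code above), a bag-of-words representation of the collected descriptions could be built with scikit-learn's CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

#Assumes 'info' is the list of subreddit descriptions collected above
vectorizer = CountVectorizer(stop_words='english')
counts = vectorizer.fit_transform(info)

#Each row is a description, each column a word count
print(counts.shape)
print(vectorizer.get_feature_names()[:10])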
Thanks for reading!