Scraping Google
17 Nov 2017
Recently, I’ve been working on scraping millions of Google results for a research project of mine that traces how anonymous social platforms have become popular.
I tried many approaches to scraping Google efficiently, and none of them worked except the GoogleScraper library developed by the GitHub user NikolaiT.
When I saw what kind of data the scraper produced, I was amazed. It was able to collect links and metadata, and it could also scrape through proxies.
Unfortunately, there was no way to get the timestamps of posts for the links scraped from Google’s search engine results page.
So, I programmed it.
While the code was well written and commented, its many functions and variables made it hard to see how everything fit together.
I first used Inspect Element on one of my own Google searches and identified a few classes that the result links had in common. Then I searched GoogleScraper’s code for those classes to find the part that parses the HTML of Google’s results page:
normal_search_selectors = {
    'results': {
        'us_ip': {
            'container': '#center_col',
            'result_container': 'div.g ',
            'link': 'h3.r > a:first-child::attr(href)',
            'snippet': 'div.s span.st::text',
            'title': 'h3.r > a:first-child::text',
            'visible_link': 'cite::text'
        },
        'de_ip': {
            'container': '#center_col',
            'result_container': 'li.g ',
            'link': 'h3.r > a:first-child::attr(href)',
            'snippet': 'div.s span.st::text',
            'title': 'h3.r > a:first-child::text',
            'visible_link': 'cite::text'
        },
        'de_ip_news_items': {
            'container': 'li.card-section',
            'link': 'a._Dk::attr(href)',
            'snippet': 'span._dwd::text',
            'title': 'a._Dk::text',
            'visible_link': 'cite::text'
        },
    },
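A note on reading those selectors: the `::text` and `::attr(href)` endings are not standard CSS. They say what to pull out of the matched element (its text, or one of its attributes). Here is a minimal sketch of how such an ending could be separated from the plain CSS part, assuming only the two forms that appear above. `split_selector` is a hypothetical helper I made up for illustration, not GoogleScraper's own code.

```python
import re

# Matches a trailing '::attr(name)' and captures the attribute name.
ATTR_SUFFIX = re.compile(r'::attr\((?P<attr>[^)]+)\)$')


def split_selector(selector):
    """Split 'css::text' or 'css::attr(name)' into (css, kind, attr_name)."""
    if selector.endswith('::text'):
        return selector[:-len('::text')], 'text', None
    match = ATTR_SUFFIX.search(selector)
    if match:
        return selector[:match.start()], 'attr', match.group('attr')
    # No recognised suffix: the selector targets the element itself.
    return selector, 'element', None


for example in ('h3.r > a:first-child::attr(href)',
                'div.s span.st::text',
                '#center_col'):
    print(split_selector(example))
```

Seen this way, each entry in the dictionary is just "find this element, then take this piece of it", which is why adding a new field mostly comes down to adding one more key with the right selector.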
Next, I inspected a Google results page that showed timestamps and found a similar class: “slp f”.
From there, I basically followed its trail through the code and added another entry in each place, using my own “time_stamp” key:
for key, value in parser.search_results.items():
    if isinstance(value, list):
        for link in value:
            parsed = urlparse(link['link'])
            # fill with nones to prevent key errors
            [link.update({key: None}) for key in ('snippet', 'time_stamp', 'title', 'visible_link') if key not in link]
            Link(
                link=link['link'],
                snippet=link['snippet'],
                time_stamp=link['time_stamp'],
                title=link['title'],
                visible_link=link['visible_link'],
                domain=parsed.netloc,
                rank=link['rank'],
                serp=self,
                link_type=key
            )
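The "fill with nones" line matters more once `time_stamp` is in the mix, because many results simply have no timestamp on the page, and `link['time_stamp']` would then raise a `KeyError`. Below is a small standalone sketch of the same idea. The `fill_missing` helper and the sample result are hypothetical, written only to show the pattern in isolation.

```python
OPTIONAL_FIELDS = ('snippet', 'time_stamp', 'title', 'visible_link')


def fill_missing(link, fields=OPTIONAL_FIELDS):
    """Return a copy of `link` with every absent field set to None."""
    filled = dict(link)
    for field in fields:
        # setdefault only writes the key when it is not already present.
        filled.setdefault(field, None)
    return filled


# Hypothetical parsed result that had no timestamp on the results page.
result = {'link': 'https://example.com/a', 'title': 'First', 'rank': 1}
filled = fill_missing(result)
print(filled['time_stamp'])  # None, rather than a KeyError
```

This version copies the dictionary instead of updating it in place, which is a design choice of the sketch and not how the excerpt above does it.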
# The same selectors, now with a 'time_stamp' entry in each group
normal_search_selectors = {
    'results': {
        'us_ip': {
            'container': '#center_col',
            'result_container': 'div.g ',
            'link': 'h3.r > a:first-child::attr(href)',
            'snippet': 'div.s span.st::text',
            'time_stamp': 'div.slp::text',
            'title': 'h3.r > a:first-child::text',
            'visible_link': 'cite::text'
        },
        'de_ip': {
            'container': '#center_col',
            'result_container': 'li.g ',
            'link': 'h3.r > a:first-child::attr(href)',
            'snippet': 'div.s span.st::text',
            'time_stamp': 'div.slp::text',
            'title': 'h3.r > a:first-child::text',
            'visible_link': 'cite::text'
        },
        'de_ip_news_items': {
            'container': 'li.card-section',
            'link': 'a._Dk::attr(href)',
            'snippet': 'span._dwd::text',
            'time_stamp': 'div.slp::text',
            'title': 'a._Dk::text',
            'visible_link': 'cite::text'
        },
    },
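One detail worth spelling out: “slp f” in the HTML is two separate classes, `slp` and `f`, so a selector for `div.slp` still matches it. To show what that selector is meant to pick up, here is a rough sketch that uses only Python's standard library. The HTML fragment, the example URLs, and the `SlpTextExtractor` class are all made up for illustration, and real Google markup is more complicated and changes often.

```python
from html.parser import HTMLParser


class SlpTextExtractor(HTMLParser):
    """Collect the text of every <div> whose class list contains 'slp'."""

    def __init__(self):
        super().__init__()
        self.depth = 0         # greater than 0 while inside a matching div
        self.current = []      # text chunks of the div being read
        self.time_stamps = []  # one string per matching div

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth:
            self.depth += 1    # a nested div inside a match
        else:
            classes = (dict(attrs).get('class') or '').split()
            if 'slp' in classes:
                self.depth = 1
                self.current = []

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.time_stamps.append(''.join(self.current).strip())

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


# Made-up fragment: the first result has a timestamp, the second does not.
SAMPLE_HTML = '''
<div class="g"><h3 class="r"><a href="https://example.com/a">First</a></h3>
  <div class="slp f">Nov 17, 2017</div></div>
<div class="g"><h3 class="r"><a href="https://example.com/b">Second</a></h3></div>
'''

extractor = SlpTextExtractor()
extractor.feed(SAMPLE_HTML)
print(extractor.time_stamps)  # ['Nov 17, 2017']
```

The second result contributes nothing to the list, which is exactly the case the None-filling step has to cover.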
You can see my whole process through my forked version or my pull request! :)
Now I’m working on a way to restrict searches to a specific time range. That will mean writing new functions that turn the common patterns in time-range searches into a workable solution. Stay tuned for the next post!