The first rule of Google scraping is "Don't scrape Google".
If you are someone who deals with web data - an analyst, programmer, data scientist, or journalist perhaps - then sooner or later you are going to want to grab URLs from Google in a format you can use.
But there's no export, and no API. Boo!
If you are anything like me, your journey will go something like this:
Dry your eyes, man! I'm awesome at this type of thing - I'll just quickly write a script that grabs the results from Google. Bet I can do it in less than 5 minutes. I'll drink a beer to celebrate my awesome powers.
30 minutes later:
Ok, I said the word beer, so I really want a beer now. I'll just run this awesome, untested script and grab those results while I drink a beer. I bet it works first time.
0 seconds later:
Mmm....so why....is that not working? Probably just a cheeky bug somewhere, I did write this pretty quick (even if I wasn't quite as quick as I thought I would be).
Another 30 minutes later:
Right, it seems Google doesn't want me to request pages programmatically. They can't defeat me though, I know fancy tricks like setting request headers. They'll never know, and I'll be drinking that beer.
A little while later:
Nailed it! And it's only taken me a couple of hours. I'll set this going and get all the results I need. I can drink that beer as a reward.
Around 5 minutes later:
WTF? Why is that not working anymore? I didn't change anything.
A couple more hours later:
Okay, it finally works. I have my results. Screw that beer, I'm going to bed, it's late.
Sadly, I did this so many times I ended up keeping a little command line script around to help me avoid it. The script followed some simple rules that allowed me to easily grab results, and actually get that beer.
I recently packaged this up into a little tool called Googlespider. It extracts Google search results and exports them to TSV, CSV, or JSON. Here it is in action:
It's open source; if you are a python-head you can install it (ideally inside a virtualenv) like so:
pip install googlespider
However, if you'd prefer to roll your own and apply the golden rules, read on:
Rule 1: Set the user-agent request header
If you don't set the user-agent header, Google will throw you a 403 error straight off the bat. They'll show you a page like this:
You can find a list of valid user-agent strings at UserAgentString - pick a recent one from the browser list.
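Using only Python's standard library, a minimal sketch of this looks like the following. The user-agent string here is just an example - pick your own recent one from a list like UserAgentString:

```python
import urllib.request

# Pick ONE recent browser user-agent string and use it for every
# request (see rule 2). This particular string is an example only.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

def build_request(url):
    """Build a request that identifies itself as a normal browser."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```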
Rule 2: Use a consistent user-agent
Related to #1, this is more about not causing yourself trouble. Some advice out there will suggest that you randomly rotate the User-Agent string. I've found this just causes issues - Google will occasionally present different markup for different browsers, breaking your HTML parsing.
Rule 3: Be polite - request 2 pages per minute maximum
Yes, that's crazy slow. But you want something that works, right? You might get away with a little more, but if you want that beer, slow and steady is the name of the game. Increase it and Google will start forcing all requests originating from that IP to solve captchas before proceeding. Try it out. Be careful though - keep it up and Google will temporarily ban the IP. Once an IP is tarnished it is less reliable. Which leads me to #4...
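A simple sketch of that rate limit in Python - 30 seconds between requests works out to the 2 pages per minute maximum:

```python
import time

MIN_INTERVAL = 30.0  # seconds between requests -> at most 2 pages/minute

class Throttle:
    """Block until at least `interval` seconds since the last request."""

    def __init__(self, interval=MIN_INTERVAL):
        self.interval = interval
        self.last = 0.0

    def wait(self):
        # Sleep off whatever remains of the interval, then record the time.
        elapsed = time.monotonic() - self.last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self.last = time.monotonic()
```

Call `wait()` immediately before each request and you can't accidentally go faster than the limit, however fast your parsing is.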
Rule 4: Use clean IPs
Don't use cheap proxies in an attempt to circumvent this rate limit. Just don't. I promise you will never get that beer. If you need to use proxies run them yourself or get good dedicated ones and test them before paying. Then care for them like they are your own child (or beer).
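If you do end up running proxies, a rough pre-flight health check might look like this sketch - `proxy_works` is a name I've made up, and a real check would also want to verify the exit IP and response content:

```python
import urllib.request

def proxy_works(proxy_url, test_url, timeout=10):
    """Rough health check: can we fetch a page through this proxy at all?"""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    )
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors.
        return False
```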
Rule 5: Prevent redirection
Be careful - when you request a page like https://www.google.com/search?q=hyperion+gray, Google will redirect you to the domain that relates to the country the request originates from, e.g. https://www.google.ca/search?q=hyperion+gray. These results are different. You can control this behaviour by appending the following parameter
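One way to pin results to a single country and language - assuming the `gl` and `hl` query parameters, which are the commonly cited ones for this - is to build the search URL explicitly:

```python
from urllib.parse import urlencode

def google_search_url(query, country="us", language="en"):
    """Build a search URL pinned to one country/language.

    NOTE: `gl` (country) and `hl` (language) are an assumption here,
    not something confirmed by this post.
    """
    params = {"q": query, "gl": country, "hl": language}
    return "https://www.google.com/search?" + urlencode(params)
```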
Rule 6: Exclude universal results
Google intermittently inserts image/news/video results into the organic results. For most data jobs this probably isn't what you are looking for, so make sure your XPaths/CSS selectors exclude them.
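Once you've parsed results into your own structures, filtering out the universal results can be as simple as this sketch - the `type` key is my own illustrative convention, not Googlespider's actual output format:

```python
# Result types to drop; illustrative labels, set by your own parser.
UNIVERSAL = {"image", "news", "video"}

def organic_only(results):
    """Keep only organic results from a parsed result list."""
    return [r for r in results if r.get("type") not in UNIVERSAL]
```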
Googlespider helps to localise results for you by setting appropriate search domains and languages. I used this handy list of Google domains by Distilled as a reference.
Scraping Google is against their TOS, I'm not encouraging it and this post is only for research purposes. ↩︎