Bizarro Google - Crawling All Disallows in Robots.txt

So I've been writing a thing... it's been a while since I wrote a utility to do something, so I figured it was about time. While we were discussing crawling and scraping, Mark of HG had the idea to do an "anti-Google crawl": essentially, a crawl of the stuff that Google is purposefully not crawling.

That translates into geek-speak as crawling the Disallow entries in robots.txt. What does that mean? robots.txt is a file placed in the web root that tells Google (and crawlers in general) which parts of the site to crawl or not crawl. Well-behaved crawlers follow it, and it has become a de facto standard, though it isn't legally binding; following robots.txt is simply polite. But ever since the advent of Google Hacking back in ye olden days, we've seen that indexed content can contain a lot of sensitive info - so what about the content that isn't indexed? Let's take a look at a typical robots.txt file:

User-agent: msnbot-media 
Disallow: /
Allow: /th?

User-agent: Twitterbot
Disallow: 

User-agent: *
Disallow: /account/
Disallow: /amp/
Disallow: /bfp/search
Disallow: /bing-site-safety
Disallow: /blogs/search/
Disallow: /entities/search
Disallow: /fd/
Disallow: /history
Disallow: /hotels/search
Disallow: /images?
Disallow: /images/search?
Disallow: /images/search/?
Disallow: /images/searchbyimage
Disallow: /local
Disallow: /maps/adsendpoint
Disallow: /news/apiclick.aspx
Disallow: /news/search?
Disallow: /notifications/
Disallow: /offers/proxy/dealsserver/api/log
Disallow: /offers/proxy/dealsserver/buy
Disallow: /ping
Disallow: /profile/history?
Disallow: /proFile/history?
Disallow: /Proxy.ashx
Disallow: /results
Disallow: /rewardsapp/
Disallow: /search
Disallow: /Search
Disallow: /settings
Disallow: /shenghuo
Disallow: /shop$
Disallow: /shop?
Disallow: /shop/
Disallow: /social/search?
Disallow: /spbasic
Disallow: /spresults
Disallow: /static/
Disallow: /th?
Disallow: /th$
Disallow: /translator/?ref=
Disallow: /travel/css
Disallow: /travel/flight/flightSearch
Disallow: /travel/flight/flightSearchAction
Disallow: /travel/flight/search?
Disallow: /travel/flight/search/?
Disallow: /travel/hotel/hotelMiniSearchRequest
Disallow: /travel/hotel/hotelSearch
Disallow: /travel/hotels/search?
Disallow: /travel/hotels/search/?
Disallow: /travel/scripts
Disallow: /travel/secure
Disallow: /url
Disallow: /videos?
Disallow: /videos/?
Disallow: /videos/search?
Disallow: /videos/search/?
Disallow: /widget/cr
Disallow: /widget/entity/search/?
Disallow: /widget/render
Disallow: /widget/snapshot
Disallow: /work
Disallow: /academic/search
Disallow: /academic/profile
Disallow: /fun/g/
Disallow: /fun/api/
Disallow: /merchant/reviews?
Disallow: /product/reviews?
Disallow: /hotel/reviews?
Disallow: /hpm
Disallow: /hpmob
Disallow: /HpImageArchive.aspx


Sitemap: http://cn.bing.com/dict/sitemap-index.xml
Sitemap: http://cached.blob.core.windows.net/tmp/sitemap_all_v2.xml
Sitemap: https://www.bing.com/api/maps/mapcontrol/isdk/sitemap.xml
Sitemap: https://www.bing.com/travelguide/sitemaps/sitemap.xml
Sitemap: https://www.bing.com/maps/sitemap.xml

We have a bunch of stuff we don't really care about and then a bunch of Disallow entries. These are entries telling Google (and most other crawlers) NOT to crawl those paths. Oftentimes they give away important secrets the site is trying to hide and become a beacon pointing attackers at those portions of the site. Checking robots.txt for leads like this is standard practice for attackers, red teamers, and pen testers (shameless plug for ourselves). Some prime examples of fun finds are administrative logins, password resets, and even things like phpMyAdmin running on the server.
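If you want to play along, pulling the Disallow entries out of a robots.txt takes only a few lines of Python. This is a minimal sketch, not the actual bizarro-google crawler:

import urllib.request

def fetch_disallows(domain):
    # Grab robots.txt from the web root and collect the Disallow paths.
    url = "https://%s/robots.txt" % domain
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    paths = []
    for line in body.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty Disallow means "allow everything"
                paths.append(path)
    return paths

print("\n".join(fetch_disallows("www.bing.com")))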

So we ran a crawler against all the Disallow entries on Alexa's top 10k (the top 1 million is still running) and, without further ado, here are the crawl results. The format is simple: each record has a Disallowed URL as the key and the response as HTML, plus a "form" field giving you information about forms on the page (more on this later). A quick search utility helps you navigate the information quickly. This lets you do fun stuff like this:

(antigenv) punk@punk-dtsp ~/hg-dev/antig $ python search.py searchurl admin
https://www.twitch.tv//admin/*%0A
https://en.wikipedia.org/wiki/Wikipedia:Requests_for_adminship/
https://en.wikipedia.org//wiki/wikipedia-diskusjon%3aadministratorer%0A
https://en.wikipedia.org/wiki/Wikipedia:Requests_for_adminship
https://en.wikipedia.org/wiki/Wikipedia_talk:Requests_for_adminship
https://wordpress.com/log-in?redirect_to=https%3A%2F%2Fwordpress.com%2Fwp-admin%2F
http://www.espn.com/*/admin/%0A
https://www.pinterest.com//admin/%0A
https://www.pinterest.com//admin/%0A
https://www.pinterest.com//admin/%0A
https://www.thestartmagazine.com//admin/*%0A
https://www.breitbart.com//wp-admin%0A
https://weather.com//admin/%0A
https://weather.com//?q=admin/%0A
https://www.zillow.com/captchaPerimeterX/?url=%2f*%2fadmin-ajax.php%250A&uuid=2cb3c030-c732-11e8-8cdc-ab933f2f1a5a&vid=
https://www.forbes.com/*wp-admin*%250A/
https://admincp.alwafd.news/index.php/auth/login
https://alwafd.news/ad/www/admin/index.php
http://www.hp.com//whpadmin/%0A
https://www.pinterest.com//admin/%0A
https://www.pinterest.com//admin/%0A
https://www.pinterest.com//admin/%0A
https://www.goodreads.com//admin%0A
https://www.zoho.com/crm/help/data-administration/import-data.html
https://www.zoho.com//people/help/administrator/toil-and-over-time.html%0A
https://www.zoho.com/crm/help/data-administration/import-data.html
https://wordpress.com/log-in?redirect_to=https%3A%2F%2Fwordpress.com%2Fwp-admin%2F
https://patch.com//?q=admin/%0A
https://slickdeals.net//admincp/%0A
https://slickdeals.net//forums/admincp/%0A
https://infourok.ru/admin
https://prezi.com//admin/%0A
https://prezi.com//featured/admin/%0A
https://myanimelist.net//admin/%0A

...allowing you to view possible administrative logins hidden from Google. Let's do a slightly more fun one:

(antigenv) punk@punk-dtsp ~/hg-dev/antig $ python search.py searchurl phpmyadmin
http://www.ufrgs.br//lucianoes/phpmyadmin/%0A
https://www.programcreek.com//phpmyadmin/%0A

The first one is a false positive (they likely used to use it), but we find that the second one appears to be a phpMyAdmin instance exposed to the Internet. If you're a security engineer or something like it, you should be shaking your head in exasperation at this point - possibly crying a little.

Anyway, I didn't want to stop there because at this point I was having fun. I went ahead and used Formasaurus - a tool we developed as part of the DARPA Memex program that detects and extracts forms from HTML. Once the forms are extracted, it uses logistic regression (a form of machine learning) to classify each form as a certain type; there's a short usage sketch after the field list below. For you ML folks out there, check out the docs or the repo for more info. The possible form types are:

search
login
registration
password/login recovery
contact/comment
join mailing list
order/add to cart
other

Once it has the form type, it also classifies each field within the form. Formasaurus detects the following field types:

    username
    password
    password confirmation - “enter the same password again”
    email
    email confirmation - “enter the same email again”
    username or email - a field where both username and email are accepted
    captcha - image captcha or a puzzle to solve
    honeypot - this field usually should be left blank
    TOS confirmation - “I agree with Terms of Service”, “I agree to follow website rules”, “It is OK to process my personal info”, etc.
    receive emails confirmation - a checkbox which means “yes, it is ok to send me some sort of emails”
    remember me checkbox - common on login forms
    submit button - a button user should click to submit this form
    cancel button
    reset/clear button
    first name
    last name
    middle name
    full name
    organization name
    gender
    day
    month
    year
    full date
    time zone
    DST - Daylight saving time preference
    country
    city
    state
    address - other address information
    postal code
    phone - phone number or its part
    fax
    url
    OpenID
    about me text
    comment text
    comment title or subject
    security question - “mother’s maiden name”
    answer to security question
    search query
    search category / refinement - search parameter, filtering option
    product quantity
    style select - style/theme select, common on forums
    sorting option - asc/desc order, items per page
    other number
    other read-only - field with information user shouldn’t change
    all other fields are classified as other.
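Here's roughly what using Formasaurus looks like; the HTML below is a made-up stand-in (in the real pipeline the input is each crawled page), and the exact output shape may vary by version:

import formasaurus

html = """
<form action="/login" method="POST">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>
"""

# extract_forms() returns (form_element, info) pairs; with proba=True the
# form type and field types come back as {label: probability} dicts.
for form, info in formasaurus.extract_forms(html, proba=True, threshold=0.05):
    print(info["form"])    # e.g. {'login': 0.95, 'contact/comment': 0.03}
    print(info["fields"])  # e.g. {'username': {'username': 0.9}, ...}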

I ran this through the dataset and made the results searchable using the aforementioned search utility. The dataset is in JSON Lines format and easy to work with for all you data and security folks out there. Here is the search utility in action:

(antigenv) punk@punk-dtsp ~/hg-dev/antig $ python search.py 
usage: python search.py searchurl|searchbody|searchformtype your_search_term

Results are printed to the terminal.
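If you'd rather slice the data yourself, the JSON Lines format makes that trivial. A minimal sketch, assuming each line is a JSON object with "url", "html", and "form" keys per the format described earlier (adjust the key names if your copy of the dataset differs):

import json
import sys

def search_url(path, term):
    # Stream the JSON Lines dataset and yield every URL containing the term.
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if term in record["url"]:
                yield record["url"]

if __name__ == "__main__":
    # e.g. python grep_urls.py dataset.jl phpmyadmin
    for url in search_url(sys.argv[1], sys.argv[2]):
        print(url)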

Searching forms for exposed Disallowed logins is easy. For example:

(antigenv) punk@punk-dtsp ~/hg-dev/antig $ python search.py searchformtype login
===============================================================
https://gfycat.com//*/signup?redirecturi=*%0A
========
search 0.9940972341455536
registration 0.6790357456275667
login 0.30660737096769225
===============================================================
https://gfycat.com//signup?redirecturi=*%0A
========
search 0.9940972341455536
registration 0.6790357456275667
login 0.30660737096769225
===============================================================
https://www.amazon.es/ap/signin?_encoding=UTF8&openid.assoc_handle=esflex&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.ns.pape=http%3A%2F%2Fspecs.openid.net%2Fextensions%2Fpape%2F1.0&openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.amazon.es%2Fgp%2Fyourstore%2Fcard%3Fie%3DUTF8%26ref_%3Dcust_rec_intestitial_signin
========
registration 0.15412381291741203
login 0.8342829078544318
===============================================================
https://dashbird.buzzfeed.com/
========
login 0.9517729344040474
other 0.8236991721104611
search 0.10444331762136966
===============================================================
</snip>

The above gives each URL along with how sure Formasaurus is that the form is each of the types listed. Because we searched for "login", all of the results will have "login" among their guessed form types (the utility doesn't check that it's the top score though - sorry; see the sketch below for a stricter filter). A form's actual type is usually the one with the highest score.
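Here's a hedged sketch of that stricter check - it isn't built into search.py - which keeps only the records whose highest-scoring form type is "login". It assumes each record's "form" field holds the {type: probability} scores shown above, and "dataset.jl" is a stand-in filename:

import json

def top_login_forms(path):
    # Yield (url, score) for records whose top-scoring form type is "login".
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            scores = record.get("form") or {}
            if scores and max(scores, key=scores.get) == "login":
                yield record["url"], scores["login"]

for url, score in top_login_forms("dataset.jl"):
    print(round(score, 3), url)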

As this project continues (and our Alexa Top 1 Million crawl finishes) we'll be doing more data science with this stuff. We'll keep you posted; let me know on Twitter (@_hyp3ri0n) whether you're interested or not, so I know how much to prioritize this.

  • _hyp3ri0n

PS: For reference, the bizarro-google code can be found here: https://github.com/HyperionGray/bizarro-google and the dataset (put it in the project root, please) is here.