So I've been writing a thing... it's been a while since I wrote a utility to do something, so I figured it was about time. While discussing crawling and scraping with Mark of HG, he had the idea to do an "anti-Google crawl". Essentially, this would be a crawl of the stuff that Google is purposefully not crawling.
That translates into geek-speak as crawling the Disallows in robots.txt. What does that mean? Robots.txt is a file placed in the web root that tells Google (and other crawlers) which parts of the site to crawl or not crawl. Crawlers typically follow it, and it has become a de facto standard, though not a legally binding one. It's simply polite to follow robots.txt. But with the advent of Google Hacking back in ye olden days, we've seen that the stuff that is indexed can contain a lot of sensitive info, so how about the stuff that's not? Let's take a look at a typical robots.txt file:
User-agent: msnbot-media
Disallow: /
Allow: /th?

User-agent: Twitterbot
Disallow:

User-agent: *
Disallow: /account/
Disallow: /amp/
Disallow: /bfp/search
Disallow: /bing-site-safety
Disallow: /blogs/search/
Disallow: /entities/search
Disallow: /fd/
Disallow: /history
Disallow: /hotels/search
Disallow: /images?
Disallow: /images/search?
Disallow: /images/search/?
Disallow: /images/searchbyimage
Disallow: /local
Disallow: /maps/adsendpoint
Disallow: /news/apiclick.aspx
Disallow: /news/search?
Disallow: /notifications/
Disallow: /offers/proxy/dealsserver/api/log
Disallow: /offers/proxy/dealsserver/buy
Disallow: /ping
Disallow: /profile/history?
Disallow: /proFile/history?
Disallow: /Proxy.ashx
Disallow: /results
Disallow: /rewardsapp/
Disallow: /search
Disallow: /Search
Disallow: /settings
Disallow: /shenghuo
Disallow: /shop$
Disallow: /shop?
Disallow: /shop/
Disallow: /social/search?
Disallow: /spbasic
Disallow: /spresults
Disallow: /static/
Disallow: /th?
Disallow: /th$
Disallow: /translator/?ref=
Disallow: /travel/css
Disallow: /travel/flight/flightSearch
Disallow: /travel/flight/flightSearchAction
Disallow: /travel/flight/search?
Disallow: /travel/flight/search/?
Disallow: /travel/hotel/hotelMiniSearchRequest
Disallow: /travel/hotel/hotelSearch
Disallow: /travel/hotels/search?
Disallow: /travel/hotels/search/?
Disallow: /travel/scripts
Disallow: /travel/secure
Disallow: /url
Disallow: /videos?
Disallow: /videos/?
Disallow: /videos/search?
Disallow: /videos/search/?
Disallow: /widget/cr
Disallow: /widget/entity/search/?
Disallow: /widget/render
Disallow: /widget/snapshot
Disallow: /work
Disallow: /academic/search
Disallow: /academic/profile
Disallow: /fun/g/
Disallow: /fun/api/
Disallow: /merchant/reviews?
Disallow: /product/reviews?
Disallow: /hotel/reviews?
Disallow: /hpm
Disallow: /hpmob
Disallow: /HpImageArchive.aspx

Sitemap: http://cn.bing.com/dict/sitemap-index.xml
Sitemap: http://cached.blob.core.windows.net/tmp/sitemap_all_v2.xml
Sitemap: https://www.bing.com/api/maps/mapcontrol/isdk/sitemap.xml
Sitemap: https://www.bing.com/travelguide/sitemaps/sitemap.xml
Sitemap: https://www.bing.com/maps/sitemap.xml
We have a bunch of stuff we don't really care about and then a bunch of Disallow entries. These are the entries where the site is telling Google (and most other crawlers) NOT to crawl. Oftentimes these entries give away important secrets the site is trying to hide, and they become a beacon for attackers to look into those portions of the site. Digging through robots.txt for leads like this is standard practice for attackers, red teamers, and pen testers (shameless plug for ourselves). Some prime examples of fun stuff are administrative logins, password resets, and even things like phpMyAdmin running on the server.
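If you want to try the core idea yourself, it's only a few lines of Python: grab robots.txt, pull out the Disallow paths, and turn them into candidate URLs to visit. This is just a rough sketch (our actual crawler does a lot more, like handling wildcards, per-user-agent sections, and rate limiting), and it assumes you have the requests library installed:

import requests

def disallowed_urls(site):
    # Fetch robots.txt and turn its Disallow paths into candidate URLs.
    # Rough sketch only: real entries can contain wildcards (* and $)
    # and live under per-user-agent sections that a proper crawler
    # should take into account when building its target list.
    base = site.rstrip("/")
    resp = requests.get(base + "/robots.txt", timeout=10)
    urls = []
    for line in resp.text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path and path != "/":           # skip empty and blanket disallows
                urls.append(base + path)
    return urls

for url in disallowed_urls("https://www.bing.com"):
    print(url)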
So we ran a crawler against all the Disallow entries on Alexa's top 10k (the top 1 million is still running) and, without further ado, here are the crawl results. The format is simple: the key is a Disallowed URL and the value is the HTML response; there is also a "form" field giving you information about forms on the page (more on this later). A quick search utility helps you navigate the information quickly. This lets you do fun stuff like this:
(antigenv) punk@punk-dtsp ~/hg-dev/antig $ python search.py searchurl admin
https://www.twitch.tv//admin/*%0A
https://en.wikipedia.org/wiki/Wikipedia:Requests_for_adminship/
https://en.wikipedia.org//wiki/wikipedia-diskusjon%3aadministratorer%0A
https://en.wikipedia.org/wiki/Wikipedia:Requests_for_adminship
https://en.wikipedia.org/wiki/Wikipedia_talk:Requests_for_adminship
https://wordpress.com/log-in?redirect_to=https%3A%2F%2Fwordpress.com%2Fwp-admin%2F
http://www.espn.com/*/admin/%0A
https://www.pinterest.com//admin/%0A
https://www.pinterest.com//admin/%0A
https://www.pinterest.com//admin/%0A
https://www.thestartmagazine.com//admin/*%0A
https://www.breitbart.com//wp-admin%0A
https://weather.com//admin/%0A
https://weather.com//?q=admin/%0A
https://www.zillow.com/captchaPerimeterX/?url=%2f*%2fadmin-ajax.php%250A&uuid=2cb3c030-c732-11e8-8cdc-ab933f2f1a5a&vid=
https://www.forbes.com/*wp-admin*%250A/
https://admincp.alwafd.news/index.php/auth/login
https://alwafd.news/ad/www/admin/index.php
http://www.hp.com//whpadmin/%0A
https://www.pinterest.com//admin/%0A
https://www.pinterest.com//admin/%0A
https://www.pinterest.com//admin/%0A
https://www.goodreads.com//admin%0A
https://www.zoho.com/crm/help/data-administration/import-data.html
https://www.zoho.com//people/help/administrator/toil-and-over-time.html%0A
https://www.zoho.com/crm/help/data-administration/import-data.html
https://wordpress.com/log-in?redirect_to=https%3A%2F%2Fwordpress.com%2Fwp-admin%2F
https://patch.com//?q=admin/%0A
https://slickdeals.net//admincp/%0A
https://slickdeals.net//forums/admincp/%0A
https://infourok.ru/admin
https://prezi.com//admin/%0A
https://prezi.com//featured/admin/%0A
https://myanimelist.net//admin/%0A
...allowing you to view possible administrative logins hidden from Google. Let's do a slightly more fun one:
(antigenv) punk@punk-dtsp ~/hg-dev/antig $ python search.py searchurl phpmyadmin
http://www.ufrgs.br//lucianoes/phpmyadmin/%0A
https://www.programcreek.com//phpmyadmin/%0A
The first one is a false positive (they likely used phpMyAdmin at some point in the past), but the second one appears to be a phpMyAdmin instance exposed to the Internet. If you're a security engineer or something like it, you should be shaking your head in exasperation at this point - possibly crying a little.
Anyway, I didn't want to stop there because at this point I was having fun. I went ahead and used Formasaurus, a tool we developed as part of the DARPA Memex program that detects and extracts forms from HTML. Once the forms are extracted, it uses logistic regression (a form of machine learning) to classify each form as a certain type. For the ML folks out there, check out the docs or the repo for more info. The form types it can assign are:
search
login
registration
password/login recovery
contact/comment
join mailing list
order/add to cart
other
Once it has the form type, it also classifies the individual fields within the form. Formasaurus detects the following field types:
username
password
password confirmation - “enter the same password again”
email
email confirmation - “enter the same email again”
username or email - a field where both username and email are accepted
captcha - image captcha or a puzzle to solve
honeypot - this field usually should be left blank
TOS confirmation - “I agree with Terms of Service”, “I agree to follow website rules”, “It is OK to process my personal info”, etc.
receive emails confirmation - a checkbox which means “yes, it is ok to send me some sort of emails”
remember me checkbox - common on login forms
submit button - a button user should click to submit this form
cancel button
reset/clear button
first name
last name
middle name
full name
organization name
gender
day
month
year
full date
time zone
DST - Daylight saving time preference
country
city
state
address - other address information
postal code
phone - phone number or its part
fax
url
OpenID
about me text
comment text
comment title or subject
security question - “mother’s maiden name”
answer to security question
search query
search category / refinement - search parameter, filtering option
product quantity
style select - style/theme select, common on forums
sorting option - asc/desc order, items per page
other number
other read-only - field with information user shouldn’t change

All other fields are classified as other.
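To give you a flavor of what that looks like in code, here's a minimal sketch of running Formasaurus against a single page. The toy HTML is made up for illustration and the exact output depends on the Formasaurus version and model; in the real pipeline this runs over every crawled Disallowed page:

import formasaurus
import lxml.html

# A toy login form; in the real run the HTML comes from the crawled pages.
html = """
<form action="/login" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Sign in">
</form>
"""

tree = lxml.html.fromstring(html)

# extract_forms() finds each <form> element and classifies it.
# With proba=True you get a probability per form type (and per field type),
# which is the kind of score you'll see in the search output further down.
for form, info in formasaurus.extract_forms(tree, proba=True, threshold=0.05):
    print(info["form"])    # e.g. {'login': 0.98, ...}
    print(info["fields"])  # e.g. {'username': {'username': 0.97}, ...}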
I ran this over the dataset and made the results searchable using the aforementioned search utility. The dataset is in JSON lines format, so it's easy to use for all you data and security folks out there. Here is the search utility in action:
(antigenv) punk@punk-dtsp ~/hg-dev/antig $ python search.py
usage: python search.py searchurl|searchbody|searchformtype your_search_term
Results are printed to the terminal.
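There's no magic in the utility, either. If you want to roll your own search over the JSON lines dump, something like the sketch below gets you most of the way there. The keys "url", "html", and "form" (and the filename crawl.jsonl) are illustrative, matching how I described the format above rather than the exact names in the released file:

import json
import sys

def search(path, mode, term):
    # Stream the JSON lines dump and print matching Disallowed URLs.
    # Each record is assumed to look roughly like:
    #   {"url": "...", "html": "...", "form": {"login": 0.83, ...}}
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if mode == "searchurl" and term in record.get("url", ""):
                print(record["url"])
            elif mode == "searchbody" and term in record.get("html", ""):
                print(record["url"])
            elif mode == "searchformtype" and term in record.get("form", {}):
                print(record["url"], record["form"])

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: python search.py searchurl|searchbody|searchformtype your_search_term")
        sys.exit(1)
    search("crawl.jsonl", sys.argv[1], sys.argv[2])  # crawl.jsonl is a placeholder filename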
Searching forms for exposed Disallowed logins is easy. For example:
(antigenv) punk@punk-dtsp ~/hg-dev/antig $ python search.py searchformtype login
===============================================================
https://gfycat.com//*/signup?redirecturi=*%0A
========
search 0.9940972341455536
registration 0.6790357456275667
login 0.30660737096769225
===============================================================
https://gfycat.com//signup?redirecturi=*%0A
========
search 0.9940972341455536
registration 0.6790357456275667
login 0.30660737096769225
===============================================================
https://www.amazon.es/ap/signin?_encoding=UTF8&openid.assoc_handle=esflex&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.ns.pape=http%3A%2F%2Fspecs.openid.net%2Fextensions%2Fpape%2F1.0&openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.amazon.es%2Fgp%2Fyourstore%2Fcard%3Fie%3DUTF8%26ref_%3Dcust_rec_intestitial_signin
========
registration 0.15412381291741203
login 0.8342829078544318
===============================================================
https://dashbird.buzzfeed.com/
========
login 0.9517729344040474
other 0.8236991721104611
search 0.10444331762136966
===============================================================
</snip>
The above gives the URL along with how confident Formasaurus is that the form is of each listed type. Because we searched for "login", every result was assigned some probability of being a "login" form (the search doesn't check that "login" is the top score though - sorry). A form's real type is usually the one with the highest score.
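If you do want to be strict about it, filtering down to results where "login" actually is the top score is a couple of lines of post-processing over the score dict (the example scores below are taken from the output above):

# Keep only forms whose highest-probability type is "login".
def is_probably_login(form_scores):
    return bool(form_scores) and max(form_scores, key=form_scores.get) == "login"

print(is_probably_login({"search": 0.994, "registration": 0.679, "login": 0.307}))  # False
print(is_probably_login({"login": 0.952, "other": 0.824, "search": 0.104}))         # True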
As this project continues (and our Alexa Top 1 Million crawl finishes), we'll be doing more data science with this stuff. We'll keep you posted. Let me know on Twitter (@_hyp3ri0n) whether or not you're interested, so I know how much to prioritize this.