In today's installment of our Dark Web Map series we are going to take a deeper dive into its contents. Mainstream coverage of the dark web portrays it as a spooky morass of stolen identities and top secret documents. But is that a fair representation? My goal here is to draw objective conclusions and to let the data speak for itself.
Surveying The Dark Web
In the first post of this series, we looked at a few random examples of dark web sites. The handful of examples in that article showed a wide range of topics on the dark web: from the banal (an Indian post office) to the illicit (underground casinos) to the pro-democratic (anonymous sourcing for journalists). Of course, that was just a handful of examples out of the 6.6K sites on the map.
In this post, we will develop a broader and more quantitative approach to determining what's out there on the dark web. Let's start by building a topic model. Topic modeling is a natural language processing (NLP) technique that tries to make sense of a large collection of documents by finding common topics and the keywords associated with each topic. In our case, we will treat each dark web page as a "document". We will extract the text content from each page (ignoring the HTML markup) and do some cleanup before building topic models. (For full details, take a look at this Jupyter notebook.)
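To make the text extraction step concrete, here is a minimal sketch using only the Python standard library. It is not the code from the notebook: the `TextExtractor` class and the `page_to_tokens` helper are names I made up for illustration, and the ASCII-only tokenizer assumes any non-Latin text has already been romanized.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_to_tokens(html):
    """Return lowercase word tokens from the visible text of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Keep runs of ASCII letters and digits; everything else is a separator.
    return re.findall(r"[a-z0-9]+", text)


# A made-up page, for illustration only.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Welcome</h1><p>Login to the <b>forum</b></p></body></html>"
)
tokens = page_to_tokens(sample)
print(tokens)  # ['welcome', 'login', 'to', 'the', 'forum']
```

Each page becomes a list of tokens, which is the "document" shape that most topic modeling tools expect as input.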
Below, I have used a topic modeling algorithm to produce the top 25 topics for all of the onions, i.e. the topics that are most common on the dark web. This allows us to quickly summarize this large collection of data.
|1||121||tor, post, forum, 2018, 2017|
|2||71||daniel, hosting, bitch, ~brown, eof|
|3||57||btc, investment, lotery, 100, bonus|
|4||55||congratulate, demo, dark, html, sign|
|5||49||403, forbid, nginx, permission, unix|
|6||44||register, login, password, logout, username|
|7||44||card, credit, number, birth, atm|
|8||44||$, key, person, email, plan|
|9||41||login, password, speedstepper, captcha, surround|
|10||41||coin, blender, mix, deposit, transaction|
|12||37||file, apache2, configuration, web, default|
|13||36||502, disconnected, bad, gateway, nginx|
|14||34||nginx, 504, gateway, 502, welcome|
|15||32||index, modify, size, description, 2017|
|16||31||respond, didn, 504, gateway, awesome|
|17||31||password, session, log, inactivity, length|
|18||30||submit, document, click, securedrop, slider|
|19||29||request, url, port, apache, debian|
|20||26||404, error, nginx, page, internal|
|21||25||http, maintainance, untill, deep, tor|
|22||19||401, authorization, nginx, require, unauthorized|
|23||19||test, myhacker, visit, page, demonstration|
|24||17||room, red, join, leave, day hour|
|25||15||invalid, header, client, receive, credential|
Topic modeling is an inherently noisy process. The algorithm looks for words that occur together more frequently than by random chance. This process misses some obvious connections, like ontologies of concepts (e.g. the relationship between "cryptocurrency" and "bitcoin") and translations of keywords into different languages ("drugs" vs "drogas").
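One way to get a feel for "more frequently than by random chance" is pointwise mutual information (PMI), which compares how often two words share a document with how often they would if they were independent. This is only an illustration of the idea, not the algorithm used to build the topics above, and the toy documents are made up.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """Pointwise mutual information for every word pair that shares a document.

    PMI > 0 means the pair appears together more often than independence
    would predict; pairs that never co-occur get no score at all.
    """
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        words = sorted(set(doc))  # count each word once per document
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))

    scores = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n_docs
        p_a = word_counts[a] / n_docs
        p_b = word_counts[b] / n_docs
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Made-up token lists standing in for four pages.
docs = [
    ["card", "credit", "atm"],
    ["card", "credit", "number"],
    ["nginx", "gateway", "502"],
    ["nginx", "gateway", "504"],
]
scores = pmi_scores(docs)
print(scores[("card", "credit")])  # 1.0
```

Notice that "card" and "nginx" never share a document, so no score links them. The same thing happens to "bitcoin" and "cryptocurrency" if they never appear on the same page, no matter how closely related they are in meaning.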
Nevertheless, some of the topics above are interpretable:
- Topic #2 is related to Daniel's Hosting, which is a service that hosts dark web sites.
- Topic #7 seems to be related to credit card numbers, ATM PINs, and birthdays.
- Topic #18 is related to SecureDrop, which is a service that lets anonymous sources submit documents to journalists.
- Topic #24 is based on the infamous Red Room urban legend.
Other topics are harder to interpret or are the result of errors on the onion web pages:
- Some topics (e.g. #1 and #4) have keywords with no obvious relationship to each other.
- Several topics (#5, #13, #14, #16, #20, and #22) are related to errors stemming from misconfigured webservers.
- Topic #11 is derived from web pages that have incorrectly formatted HTML markup.
One way to interpret a topic is to view some of the documents associated with that topic. Let's try that for topic #4 by listing its top 10 documents:
1 congratulate you you have made site the dark side …
2 congratulate you you have made site the dark side …
3 congratulate you you have made site the dark side …
4 congratulate you you have made site the dark side …
5 congratulate you you have made site the dark side …
6 congratulate you you have made site the dark side …
7 congratulate you you have made site the dark side …
8 congratulate you you have made site the dark side …
9 congratulate you you have made site the dark side …
10 congratulate you you have made site the dark side …
Hey, this "topic" is just the same exact web page repeated over and over! What is going on here? Here's an example of one of these pages in the map:
So this looks like a tool that generates an onion site for you, and it has default content that congratulates you. (How sweet!) If we zoom out, we can see how many different copies of this same page exist:
That is 184 identical pages! This suggests that onions running identical web pages carry undue weight in our topic model and distort the results. Overall, this approach gives us some intuition about what's on the dark web (bitcoin, stolen credit cards, red rooms) but it's muddled by misconfigured websites and repeated pages. Let's take a step back and think about repetition on the dark web.
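A quick way to spot exact repeats like this is to hash the cleaned-up text of every page and count collisions. The sketch below is an assumption about how one might do it, not how the map itself finds duplicates; the onion names are invented and `text_fingerprint` is a hypothetical helper.

```python
import hashlib
from collections import Counter


def text_fingerprint(text):
    """Hash lowercased, whitespace-normalised text so identical pages collide."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def duplicate_counts(pages):
    """Map each fingerprint to the number of onions serving that exact text."""
    return Counter(text_fingerprint(text) for text in pages.values())


# Invented onion names; the first two texts differ only in case and spacing.
pages = {
    "aaaa.onion": "Congratulate you you have made site the dark side",
    "bbbb.onion": "congratulate  you you have made site the dark side",
    "cccc.onion": "welcome to my forum",
}
counts = duplicate_counts(pages)
fingerprint, copies = counts.most_common(1)[0]
print(copies)  # 2
```

Exact hashing only catches pages whose text is identical after normalisation. Near-duplicates, such as a template with one changed word, need a similarity measure instead.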
Repetition & Clustering
Why do all of the congratulations pages cluster together in the first place? As explained in the previous post, the map is organized by placing sites that are structurally similar close together and connecting them with a line. When many sites are all similar or identical to each other, they form clusters.
If we can understand these clusters better, maybe we can solve some of the problems that arose from our first attempt at topic modeling. We will eyeball some of the clusters and then see if we can formalize this analysis later.
First, let's take a look at a zoomed-out overview of the map. (Note that as we update the map over time, it may look different than what you see here.)
An overview of the Dark Web Map.
The biggest cluster appears right in the middle of the map. This cluster has two lobes: a large bottom lobe and a smaller top lobe. Let's zoom in a bit to see it better.
Zoomed in on biggest cluster.
In order to get a sense of what kinds of web sites are in this cluster, I have highlighted three different areas of the cluster in yellow. Let's zoom into those three areas.
Big cluster: zoom #1
This first zoom is from the top lobe of the big cluster. It displays 4 very similar pages: a bold heading at the top followed by a line or two of text. Two of these pages display an error, and the other two display default messages that indicate a site has been installed but no content has been added. Most of the pages in this top part of the big cluster look a lot like the 4 pages seen here.
Big cluster: zoom #2
This second zoom is from the bottom lobe of the big cluster. It displays 4 very similar web pages, each containing a single line of text with a link. Most of the pages in this bottom part look a lot like the 4 pages seen here.
Big cluster: zoom #3
Zoom #3 comes from the middle area between the top and bottom lobes. These pages are all quite simple: a few lines of text and (in one case) an image. This demonstrates the concept of clustering very well:
- Pages in the top and middle parts are similar to each other.
- Pages in the middle and bottom parts are similar to each other.
- Pages in the top and bottom parts are not similar to each other.
Therefore, the map organizes this cluster into these two different lobes, connected together by the middle pages.
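The chaining effect described above (top resembles middle, middle resembles bottom, but top does not resemble bottom) is what you get when clusters are the connected components of a "similar enough" graph. Here is a small sketch of that idea. The feature sets, the Jaccard measure, and the 0.6 threshold are all assumptions chosen for illustration; the real similarity measure is described in the previous post.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of features."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(features, threshold):
    """Group items into connected components of the 'similar enough' graph."""
    parent = {name: name for name in features}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    names = list(features)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(features[a], features[b]) >= threshold:
                parent[find(a)] = find(b)  # draw a line between a and b

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical structural features for three areas of one cluster plus an outlier.
features = {
    "top": {"h1", "p"},
    "middle": {"h1", "p", "a"},
    "bottom": {"p", "a"},
    "other": {"table", "form", "img"},
}
print(cluster(features, threshold=0.6))
# [['bottom', 'middle', 'top'], ['other']]
```

"top" and "bottom" score only 1/3 against each other, below the threshold, yet they land in the same cluster because "middle" links to both.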
The next cluster contains marketplaces that appear to sell illegal goods, such as counterfeit currency and passports. This cluster is smaller than the previous cluster, but it has the same two-lobed structure.
A cluster of illegal marketplaces.
Zoom in on illegal marketplaces.
This cluster demonstrates the way we measure similarity: even though sites look different and sell different products, they still share common page structure. For example, each page has a title at the top, four buttons (Products, Help, Register, and Login), a picture, and a table of prices.
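To illustrate how two pages can differ in content yet match in structure, the sketch below compares only the sequence of HTML tags and ignores all text. This is one possible structural measure, not necessarily the one behind the map, and the three pages are invented.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Record the order of opening tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names in document order."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def structural_similarity(html_a, html_b):
    """Ratio in [0, 1]; 1.0 means the two tag sequences are identical."""
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b))
    return matcher.ratio()


# Invented pages: two shops with different products, and an error page.
shop_a = "<h1>Fake Notes</h1><a>Products</a><a>Login</a><img><table><tr><td>20 EUR</td></tr></table>"
shop_b = "<h1>Passports</h1><a>Products</a><a>Login</a><img><table><tr><td>UK</td></tr></table>"
error_page = "<h1>502 Bad Gateway</h1><p>nginx</p>"

print(structural_similarity(shop_a, shop_b))  # 1.0
print(structural_similarity(shop_a, error_page) < 0.5)  # True
```

The two shops sell different things but produce the same tag sequence, so they score as identical, while the error page shares little more than a heading.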
The next cluster is an apparent bitcoin scam. Compare the shape of this cluster to the shape of the previous two clusters. The circular shape (i.e. lack of lobes) indicates that these sites are all nearly identical to each other.
A circular cluster of bitcoin scams.
Zoom in on bitcoin scams: "How to multiply your Bitcoins hundredfold in a day?"
Why are there so many duplicated bitcoin scams on the dark web? Is it one scammer who is operating a bunch of different sites, or are multiple scammers copying each other (and possibly trying to scam the scammers)?
Here is another circular cluster containing "red room" sites, one of the topics identified by our earlier topic modeling effort.
A cluster of red rooms.
Zoom in on red rooms.
I have no reason to believe that these red rooms are real. The interesting question is why there are so many different red room sites that all have identical structure. The clustering and shape of the red room cluster have more in common with the bitcoin scams than with legitimate sites.
So what can we observe just by looking at these different types of clusters? We see that the largest clusters on the dark web are largely just noise: sites that contain spurious error messages, template pages from when they were set up, or blank/incoherent content. Other large clusters contain cloned sites with indicators of scams and/or criminality.
Now that we have skimmed over a few clusters, let's formalize the analysis a bit. We begin by plotting cluster sizes in a histogram.
The horizontal axis shows cluster sizes. The leftmost column represents small clusters (one or a few sites) and the rightmost column represents the very biggest cluster (866 sites). The vertical axis shows the number of clusters of each size on a log scale: the tallest bar is almost 3,000 and the shortest bars are 1. (This histogram depends on the thresholding described in our previous blog post.)
The data illustrate how unusual the large clusters (>50 sites) are. There are only 10 of these large clusters, but they contain roughly one third of all the sites in the Dark Web Map! Another third of the sites sit in clusters of between 2 and 49 sites each. The remaining third are in clusters of size 1, i.e. a single site that is not connected to any other sites.
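The "thirds" arithmetic is easy to reproduce once you have a list of cluster sizes. The sketch below uses made-up sizes rather than the real map data, and it assumes the "large" band starts at 50 sites; `share_by_band` is a hypothetical helper.

```python
from collections import Counter


def share_by_band(cluster_sizes, large_min=50):
    """Fraction of all sites that fall in singleton, medium and large clusters."""
    total = sum(cluster_sizes)
    sites = Counter()
    for size in cluster_sizes:
        if size == 1:
            band = "singleton"
        elif size < large_min:  # assumed boundary between medium and large
            band = "medium"
        else:
            band = "large"
        sites[band] += size  # weight by sites, not by number of clusters
    return {band: sites[band] / total for band in ("singleton", "medium", "large")}


# Made-up sizes: 60 singletons, three medium clusters, one large cluster.
sizes = [1] * 60 + [10, 20, 30] + [60]
histogram = Counter(sizes)  # cluster size -> number of clusters of that size
shares = share_by_band(sizes)
print(histogram[1], shares)
```

In this toy data a single large cluster holds as many sites as all 60 singletons combined, which is the same kind of imbalance the real histogram shows.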
The distribution reveals an important caveat for our topic modeling effort: the topics are biased towards sites that are heavily repeated. E.g. if there are 184 identical "congratulations" sites on the dark web, then our topic model will be biased towards the "congratulations" topic.
Topic Model Redux
Now that we know that repetition of similar sites biases our topic model, can we use our knowledge of clustering to build a less biased topic model? Here's the game plan: for each cluster (including clusters of size 1), pick one site at random to represent that cluster and then throw the rest away. This process should reduce the bias towards cloned sites, error messages, and other semantically meaningless junk.
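The sampling step can be sketched in a few lines. The cluster contents below are invented, and `pick_representatives` is a hypothetical helper rather than the function used in the notebook.

```python
import random


def pick_representatives(clusters, seed=0):
    """Return one randomly chosen site per cluster (singletons included)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [rng.choice(sorted(members)) for members in clusters]


# Hypothetical clusters of onion names.
clusters = [
    ["congrats1.onion", "congrats2.onion", "congrats3.onion"],
    ["scam1.onion", "scam2.onion"],
    ["lonely.onion"],
]
representatives = pick_representatives(clusters)
print(len(representatives))  # 3
```

However many copies a cluster contains, it contributes exactly one document to the new topic model, so a page cloned 184 times counts the same as a page that appears once.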
Our new top 25 topics are:
|1||177||address, website, want, send, need|
|2||50||password, username, register, log, forget|
|3||49||card, credit, balance, buy, price|
|4||44||post, topic, view, board, total|
|6||39||que, para, por, con, como|
|7||39||forum, member, mybb, board, thread|
|8||37||btc, address, transaction, wallet, send|
|9||37||$, 000, buy, 100, personal|
|10||36||server, port, apache, xmpp, index|
|11||36||die, und, der, ist, von|
|12||36||kazino, igry, tak, den, prosto|
|13||35||key, pgp, public, block, begin|
|14||35||div, class=, <, 0px, >|
|15||33||vous, les, des, sur, pas|
|16||33||file, upload, image, size, png|
|17||32||http, www, wiki, php, index|
|18||29||eur, mg, product, pure, 100|
|19||28||cart, add, sort, sale, g|
|20||27||host, daniel, website, hosting, blog|
|21||25||sign, account, remember, address, forget|
|22||24||debian, package, debconf, update, dsa|
|23||23||casino, game, online, money, play|
|24||21||hello, world, wordpress, find, placeholder|
|25||20||`, ysh, hblvg, =, wly|
The topics are noticeably better now!
- Topic #3 appears to be stolen credit cards.
- Topic #4 depicts common web forum terms.
- Topic #8 is related to bitcoin.
- Topic #12 is Russian-language casinos. ("kazino" is the romanized form of "казино", which is the Russian word for casino.)
- Topic #13 contains cryptography terms related to PGP software.
- Topic #16 reflects sites where users upload and host images.
- Topic #18 appears to reflect illegal drug sales.
Other topics are still problematic:
- Topics #5 and #14 show that broken HTML is common on the dark web.
- Topics #6 and #11 are short, common function words in Spanish and German, respectively. (This does suggest that Spanish and German are popular languages on the dark web, after English.)
- Topic #25 is completely unintelligible.
These sorts of artifacts are pretty typical for topic modeling. For a quick analysis, though, this is a pretty good start. Let's dig deeper on a couple of these topics, starting with #16. Can we confirm our intuition about what this topic means? Let's look at the top 10 pages associated with this topic:
These pages do indicate that this topic centers around private image uploading and hosting. Here is a screenshot of the fifth one:
Screenshot of the share3g3ssxs**** onion in the "image uploading" topic.
Finally, let's dig deeper on topic #25, which looks like gibberish. The top 10 documents are:
1 fpqmnpsb7fjq**** syt khbry thlyly khlmh brgdhry swrty khh sr`t rtbt yntrnty shm khm twnyd nskhh khm hjm jht dstrsy mt…
2 nwsw6u3emdvr**** lpny yvshb ||||||| `=t=`…
3 3jdxfoiidvcz**** bwbh l`rb lry ysyh mwdw`t link link link nsy mhmh rwbt tsl tbr` ldyb wyb ldyb wyb shbkh twr shbkh mt…
4 darkblogkjmw**** hblvg hpl hbyt vdvt tsrv qshr prsmv pvrvm hblvg hpl hblvg hpl nvld b`qbvt rdyph khbrynv nkhqrym vn`n…
5 autshpxnxhuv**** skip navigation skip content autoshop automatic purchases search for menu home all products cart che…
6 kamellkelv2h**** kamel zitouni kamel zitouni kamel zitouni mrhb smy zytwny kml `ysh ljzy bwly wrql whwyty mmrs lshdhw…
7 vmfucmt62525**** khbr lmslmyn mwq` mstql y`ny khbr lmslmyn mry tqryr mswr khbr lmslmyn l`m swtyt ltrykh ljhdy qbl lkh…
8 64cptkueswgy**** jysh mhmd wwrywrz lmqds d`w ll`bd lrhmn lrhym l`ly read more wnhn nshhd mkn `bdtn sqwt wnr~ khwnn kh…
9 x36n2vixwf2y**** hello world you are reading this you either are the sans sec487 osint course and you are starting on…
10 wpcxzq4ykmsx**** mwq` gyr rsmy l`rd sdrt ldwl lslmy categories|lqsm archives| rshyf mwq` dh` lbyn mwq` `mq fri januar…
Although there are two recognizable sites in here (#5 and #9), the others still look like gibberish to my eyes. Let's look at screenshots of a couple of these sites:
Screenshot of fpqmnpsb7fjq****.
Screenshot of 3jdxfoiidvcz****.
Now we can clearly see that these are Arabic sites. As part of the pre-processing, I romanized non-Latin alphabets, which is what causes these sites to look like gibberish to an English speaker. This topic suggests that Arabic is another popular language on the dark web.
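To show why romanized Arabic comes out as strings of consonants, here is a toy transliteration with a hand-written table covering just four letters. The table is my own illustration, with output symbols chosen to mirror the romanized text in the document list above. It is not the pre-processing code, and a real transliteration tool covers entire alphabets.

```python
# Hand-written table for four Arabic letters (an assumption for illustration).
ROMANIZATION = {
    "\u0645": "m",  # ARABIC LETTER MEEM
    "\u0648": "w",  # ARABIC LETTER WAW
    "\u0642": "q",  # ARABIC LETTER QAF
    "\u0639": "`",  # ARABIC LETTER AIN
}


def romanize(text):
    """Replace mapped characters one by one; leave everything else untouched."""
    return "".join(ROMANIZATION.get(ch, ch) for ch in text)


word = "\u0645\u0648\u0642\u0639"  # an Arabic word for "site"
print(romanize(word))  # mwq`
```

Arabic is normally written without its short vowels, so a letter-by-letter mapping has no vowels to emit. That is why a token like "mwq`" looks unpronounceable even though it stands for an ordinary word.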
As I explained above, topic modeling is a noisy process with significant limitations. Furthermore, the dataset we have here is limited by the fact that we only look at the home page of each site. Indeed, topic #2 appears to relate just to login screens, suggesting that there may be a lot of content on the dark web that we can't even see without acquiring a login. Our dataset does not include the traffic going to these websites, so we have to assign equal weight to the most popular and least popular sites alike. And, of course, if an onion operator really wants to make his/her site hard to find, our simple data collection process is likely to miss it.
I suspect that these limitations cause the topic model to underrepresent some categories of content, including drug marketplaces, pornography, and child exploitation. Browsing visually through the Dark Web Map, for example, shows how prevalent pornography is. Porn probably doesn't show up in the topic model because that content is mostly visual and the associated terminology is broad.
So what conclusions can we actually draw? The topic model does reveal the prevalence of certain functional roles in the dark web ecosystem: forums, chat rooms, marketplaces, and file/image hosts. These functional roles certainly support both licit and illicit activities on the dark web. For example, porn is underrepresented in the topic model, but presumably that content is distributed through the forums, chat rooms, and file/image hosts.
The topic model also shows which written languages are prominent on the dark web. English and Russian are unsurprisingly dominant. The importance of Spanish, Portuguese, and Arabic definitely caught me by surprise, as did the notable absence of Chinese.
Thanks for following along today! I hope this was informative without being unbearably technical. If you want to see more detailed analysis, take a look at this Python notebook that shows how I derived the results in this post.
Questions? Feedback? Hyperion Gray fanfic? Hit us up on Twitter