Here at Hyperion Gray, crawling the web is a major part of our business. The company's first major project was an open source web crawler/fuzzer hybrid called PunkSPIDER, which was the subject of a research grant and a DEFCON talk. Since then, we have developed awesome new crawling technology with our partners on the DARPA Memex program, applying machine learning techniques to make crawlers that are more efficient and more focused. And we frequently deploy custom crawlers to support various internal programs.
Crawling is second nature to us.
On the other hand, crawling is still a complex affair, a nuanced and deep specialization. While several high-quality crawlers are available as open source, the most popular ones are difficult to install, configure, and operate, and they require a great deal of expertise just to get started. Customizing an out-of-the-box crawler means wading through arcane XML files or even writing custom code.
Today, we are introducing a new, open source crawler called Starbelly. It is named after the starbellied orbweaver (scientific name *Acanthepeira stellata*), a small spider with a super funky, star-shaped back. (Do spiders even have backs? Hmm… well, we're not experts on that type of spider.)
There are several powerful, open source crawlers already in wide use, like Nutch, Scrapy, and Heritrix. There are also new crawlers being written on top of new tech stacks, like Sparkler, a streaming crawler built on Apache Spark.
Where does Starbelly fit into this landscape? Let's start with the project's goals:
- Simple deployment: easy to install and get running.
- Easy to use: no XML editing or custom code required.
- Real-time: stream crawl results as they happen.
- Flexible: adapt quickly to common crawling issues, such as sites that return incorrect HTTP status codes.
- Showcase innovation: consolidate the advanced crawling technologies we have developed under various research programs into a single application.
It may also be helpful to contrast Starbelly with other mainstream crawlers:
- Nutch, Heritrix: highly scalable but complex to install, configure, and operate. Results are batch-oriented, not real-time.
- Scrapy: combines crawling and scraping into a single process. Requires coding to customize many behaviors.
- Sparkler: highly scalable and real-time, but still fairly complicated to deploy and operate.
In contrast, Starbelly is a user-friendly crawler that is completely GUI-driven and streams crawl data in real time over a WebSocket. The main tradeoff is that Starbelly does not scale the way the applications mentioned above do: instead, we're focusing on the best possible user experience. If you need high scalability, we strongly encourage you to use one of the solutions named above; they are all excellent!
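To give a feel for what consuming a real-time crawl stream might look like, here is a small Python sketch. The message schema, field names, and the commented WebSocket URI are all hypothetical illustrations, not Starbelly's actual wire protocol; consult the API documentation for the real format.

```python
import json

def handle_crawl_item(raw_message: str) -> dict:
    """Decode one crawl-result message from the stream.

    The fields used here (url, status_code, body) are an assumed
    example schema, not Starbelly's actual message format.
    """
    item = json.loads(raw_message)
    return {
        "url": item["url"],
        "status": item["status_code"],
        "size": len(item.get("body", "")),
    }

# In a real client, messages would arrive over a WebSocket, e.g. with
# the third-party `websockets` package (URI below is hypothetical):
#
#   import asyncio, websockets
#
#   async def stream(uri="ws://localhost:8000/ws"):
#       async with websockets.connect(uri) as ws:
#           async for msg in ws:
#               print(handle_crawl_item(msg))
#
#   asyncio.run(stream())

# Standalone demonstration with a sample message:
sample = '{"url": "https://example.com/", "status_code": 200, "body": "<html></html>"}'
print(handle_crawl_item(sample))
```

Because results arrive as individual messages rather than a batch dump, a front end can render pages, errors, and statistics the moment the crawler sees them.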
The following demo shows how quick and easy it is to run a crawl using Starbelly.
We are releasing Starbelly version 1.0 today, after using it on production projects for several months.
Starbelly is a crawling backend, i.e. a component that can be plugged into different front ends (through a stable API) in order to build a variety of applications. You can build a search engine, a scraping system, or something else we haven't even imagined yet!
Hit us up at @hyperiongray if you have questions or feedback!