Here at Hyperion Gray, crawling the web is a major part of our business. The company's first major project was an open source web crawler/fuzzer hybrid called PunkSPIDER, which was the subject of a research grant and a DEFCON talk. Since then, we have developed awesome new crawling technology with our partners on the DARPA Memex program, applying machine learning techniques to make crawlers that are more efficient and more focused. And we frequently deploy custom crawlers to support various internal programs.
Crawling is second nature to us.
On the other hand, crawling is still a complex affair, a nuanced and deep specialization. While several high quality crawlers are available as open source, all of the most popular crawlers are difficult to install, configure, and operate. These crawlers require a great deal of expertise just to get started. Customizing out-of-the-box crawlers requires wading through arcane XML files or even writing custom code.
Today, we are introducing a new, open source crawler we are building called Starbelly. It is named after the Starbellied Orbweaver (scientific name Acanthepeira stellata), a small spider that has a super funky, star-shaped back. (Do spiders even have backs? Hmm… well, we're not experts on that type of spider.)
There are several powerful, open source crawlers already in wide use, like Nutch, Scrapy, and Heritrix. There are also new crawlers being written on top of new tech stacks, like Sparkler, a streaming crawler built on top of Apache Spark.
Where does Starbelly fit into this landscape? Let's start with the project's goals:
- Simple deployment: should be easy to install.
- Easy to use: should not require editing XML or writing custom code.
- Real-time: return crawl results in real-time.
- Flexible: adapt quickly to common crawling issues, such as sites that return incorrect HTTP status codes.
- Showcase innovation: consolidate advanced crawling technologies that we have developed under various research programs into a single application.
It may also be helpful to contrast Starbelly with other mainstream crawlers:
- Nutch, Heritrix: highly scalable but complex to install, configure, and operate. Results are batch-oriented, not real-time.
- Scrapy: combines crawling and scraping into a single process. Requires coding to customize many behaviors.
- Sparkler: highly scalable and real-time, but still fairly complicated to deploy and operate.
In contrast, Starbelly is a user-friendly crawler that is completely GUI-driven and sports a real-time API: it streams crawl data over a WebSocket. The synchronization API is carefully designed so that a client can stream large datasets even during long crawls (multiple days) where the client or server may need to disconnect and reconnect periodically.
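To make the reconnect idea concrete, here is a minimal sketch of one way a resumable stream can work. It is not Starbelly's actual API: `FakeCrawlServer`, `sync_all`, and the integer "sync token" are all hypothetical names invented for this illustration, and the server is an in-memory stand-in rather than a real WebSocket. The assumption is that the client remembers the sequence number of the last item it stored and hands it back on reconnect so the server can pick up where it left off.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


class ConnectionDropped(Exception):
    """Raised by the fake server to simulate a lost connection."""


@dataclass
class FakeCrawlServer:
    """In-memory stand-in for a crawl server (hypothetical, not Starbelly)."""
    items: List[str]
    drop_after: int = 3  # simulate a disconnect after this many items per connection

    def stream(self, sync_token: Optional[int]) -> Iterator[Tuple[int, str]]:
        # The token is the sequence number of the last item the client stored.
        start = 0 if sync_token is None else sync_token + 1
        sent = 0
        for seq in range(start, len(self.items)):
            if sent == self.drop_after:
                raise ConnectionDropped()
            yield seq, self.items[seq]
            sent += 1


def sync_all(server: FakeCrawlServer) -> List[str]:
    """Collect every item, reconnecting with the saved token after each drop."""
    received: List[str] = []
    token: Optional[int] = None
    while True:
        try:
            for seq, item in server.stream(token):
                received.append(item)
                token = seq  # record progress only after the item is stored
            return received
        except ConnectionDropped:
            continue  # reconnect and resume from the saved token


pages = [f"page-{n}" for n in range(8)]
results = sync_all(FakeCrawlServer(items=pages))
```

In this sketch the client never receives a duplicate or misses an item, because progress is recorded per item and the server resumes strictly after the token. A real client would also persist the token so it survives a client restart.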
The main limitation of Starbelly is that it does not have the same scalability as the applications mentioned above: instead, we're focusing on the best user experience possible. If you want high scalability, we strongly encourage you to use one of the solutions named above — they are all excellent!
Let's take a quick tour of Starbelly. First of all, let's start a crawl:
Type in a seed URL, select a crawl policy (more on that in a moment), and click "Start Crawl". That's it!
As the crawl starts running, the results stream into the GUI in real time. You can monitor the status of all crawls on the Dashboard, or you can view detailed status for a single crawl:
The crawl details update in real time and allow you to drill down to view individual responses. When something goes wrong with a crawl (e.g. a crawl exception), you can quickly view what the exception was and which URL caused it.
Each crawl job is based on a "crawl policy", i.e. a set of rules that guide the crawler's decision making. You can configure as many different crawl policies as you like. Every time you kick off a new crawl job, you select one of these policies.
The crawl policy replaces the XML, INI, or custom code that you would typically write to customize a crawler. Instead, the crawl policy can be configured completely through the GUI, allowing you to control which links get followed, what types of media to download, how long/deep the crawl runs, etc.
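For readers who think in code, a crawl policy can be pictured as a small bundle of rules that the crawler consults for each decision. The sketch below is only an illustration of that idea: `CrawlPolicy`, its field names, and the "first matching rule wins" behavior are assumptions made for this example, not Starbelly's actual policy schema (which is configured through the GUI).

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CrawlPolicy:
    """Hypothetical policy; field names are illustrative, not Starbelly's schema."""
    max_depth: int                     # how deep the crawl may go
    url_rules: List[Tuple[str, bool]]  # (regex, follow?) pairs, first match wins
    mime_prefixes: List[str]           # media types to download, e.g. "text/"

    def should_follow(self, url: str, depth: int) -> bool:
        if depth > self.max_depth:
            return False
        for pattern, follow in self.url_rules:
            if re.search(pattern, url):
                return follow
        return False  # no rule matched: do not follow

    def should_download(self, mime_type: str) -> bool:
        return any(mime_type.startswith(p) for p in self.mime_prefixes)


# Example policy: stay on one site, skip logout links, fetch text and PDFs only.
policy = CrawlPolicy(
    max_depth=2,
    url_rules=[(r"/logout", False), (r"^https://example\.com/", True)],
    mime_prefixes=["text/", "application/pdf"],
)
```

Selecting a different policy for each crawl job then amounts to swapping one such rule bundle for another, with no changes to the crawler itself.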
Starbelly is in early alpha status. We are currently conducting crawling tests, fixing some performance issues, and refactoring like crazy. We plan to start using Starbelly on some internal projects within a few months, and we will release a beta at that time. If you're interested in playing with some experimental, undocumented software <insert spooky sound effect> you'll find the source code on GitHub on the refactor branch. (I'm not linking to it here since this branch will eventually be merged into master and then deleted.)
We view Starbelly as a crawling backend, i.e. a component that can be plugged into different front ends (through a stable API) in order to build a variety of applications. You can build a search engine, a scraping system, or something else we haven't even imagined yet!
Hit us up at @hyperiongray if you have questions or feedback!