Quick Links:
eli5 github

Scraping Hub eli5 post

Machine Learning and Statistical Learning are increasingly mainstream. People from all walks of life are finding all kinds of great new applications of known algorithms, and, as a result, most people have used a learning system without even being fully aware of it! These systems range from your Netflix recommendations (“Netflix just really gets me”) to your stock trade suggestions to the automated system you get when you call 1-800 numbers to simple games that try to guess what you are drawing.

While ML-based systems become ever more pervasive, our ability to understand them falls farther and farther behind -- a fact readily admitted by the data science community at large, btw. In some cases it’s not a huge deal if we don’t understand the outputs of an ML system. To be sure, what to watch on Netflix is important!! But if you watch the wrong show, it isn’t really a big deal (unless you’re Alex, in which case it is a HUGE FUCKING DEAL). If you invest in the wrong stocks, however, this could be a very costly mistake, so you probably want to know enough about why those suggestions are being made that you can trust them.

So, as learning systems are integrated further and further into our lives, how do we make sure we understand why specific decisions are being made? (Why does Netflix always think I like Pauly Shore movies? That was one time…) The difficulty with many learning algorithms is visualizing what happens under the hood: the typical model is a very large list of seemingly nonsensical values, which makes it pretty tricky to understand what it all means. A Neural Network, for example, stores information in the weights between pairs of neurons. No matter how hard you stare at these numbers, they will never make sense to a human. If the system makes a mistake, the best you can do is add the situation to the training set, but even then you can’t be sure that the fix will generalize to similar situations. One of the current strategies for explaining these systems is actually to wrap them in another ML system. This has been met with some success, but causes a bit of a chicken-and-egg situation! Other methods try to map network weights to a set of domain rules. This, as it turns out, is really hard to do, but it can work okay.

Segues are weird. So anyway, Hyperion Gray is announcing a new tool: eli5, an awesome little library that gives you the ability to visualize what is going on under the hood of machine/statistical learning algorithms! eli5 (short for Explain Like I’m 5, a la reddit) was initially focused on explaining sklearn models, but has since been extended to cover xgboost and lightgbm, with many more to come. eli5 is the brainchild of two Scrapinghub engineers, Mikhail and Konstantin, who we partner with on the DARPA Memex program. It was built out of our own need to understand our own machine learning based web crawlers and classifiers (and also, admittedly, out of a bid to this other DARPA program, which we didn’t win, but we built it anyway and it’s awesome so HA!). Turns out, it’s super useful to a lot of people, for a lot of things! The Scrapinghub guys have also written a more technical post on what eli5 is able to do, which you can find here.

Let’s talk a bit more about the situation that motivated us to build eli5 and how we’ve used it in our own projects.

Under the Memex program we built a tool called Site Hound, built mostly by our team member Tomas (be sure to check out his most recent blog post below). It is designed to help a non-technical user perform the task of domain discovery (if that term doesn’t mean anything to you, it basically translates to finding information on a topic that you care about). Site Hound works by letting a user provide some basic keywords about their domain, and then it crawls thousands of webpages, and returns only sites that are within the topic requested (there’s a lot that happens in the middle there but that’s the gist of it). In order to return relevant pages to the user, we’re analysing the page content “on the fly” to gauge whether it’s on topic, using a topic model that the user herself trains by annotating a small sample of pages as “relevant” or “not relevant.” This builds a model that the crawler then uses to classify pages during crawling.
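To make the "annotate a few pages, then classify during the crawl" idea concrete, here is a minimal sketch. It is not Site Hound's actual model (this post doesn't describe its internals); it assumes a plain bag-of-words logistic regression written from scratch, and the page texts, labels, and function names are all made up for illustration.

```python
import math
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a page's text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled_pages, epochs=200, lr=0.5):
    """Fit a bag-of-words logistic regression with simple gradient steps.

    labelled_pages is a list of (text, label) pairs,
    where label 1 means "relevant" and 0 means "not relevant".
    """
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in labelled_pages:
            tokens = set(tokenize(text))
            score = bias + sum(weights[t] for t in tokens)
            prob = 1.0 / (1.0 + math.exp(-score))
            error = label - prob  # positive if the page was under-scored
            bias += lr * error
            for t in tokens:
                weights[t] += lr * error
    return dict(weights), bias


def relevance(text, weights, bias):
    """Return the model's probability that a page is on topic."""
    tokens = set(tokenize(text))
    score = bias + sum(weights.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical annotations, standing in for pages a user labelled.
annotated = [
    ("Oban distillery single malt scotch whisky from Scotland", 1),
    ("Tasting notes for a peaty Islay malt whisky", 1),
    ("Cheap flights and hotel deals for your holiday", 0),
    ("Football scores and transfer news", 0),
]

weights, bias = train(annotated)

# During a crawl, every newly fetched page would be scored like this.
p_on = relevance("A guide to scotch whisky distillery tours", weights, bias)
p_off = relevance("Latest football transfer news and scores", weights, bias)
print(f"on-topic page: {p_on:.2f}, off-topic page: {p_off:.2f}")
```

The learned `weights` dictionary is exactly the kind of thing that is hard to eyeball once it holds thousands of entries, which is the problem the rest of this post is about.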

Early versions of Site Hound worked well, but it sometimes wasn’t clear why certain pages were being classified as on or off topic. Furthermore, our model might yield good results on a small set of pages, but when used on a larger set of pages it seemed to break down. We found ourselves asking questions about how Site Hound decided if a website was relevant or not. It became part of a larger conversation on understanding how statistical learning systems operate. It is not always easy to trust the predictions of a system that isn’t itself fully understood (go figure). Most of the time these systems operate like a magic black box, and the reasoning for their decisions is not easily discernible, even by the system’s creators. Our ideas on how to solve these issues turned into what is now eli5, and Site Hound was our first attempt to use it in practice. Konstantin and Tomas worked together on implementation.

Figure 1: Screenshot displaying an initial result for a domain discovery task on the topic of “good whisky.” Shown here is the option to mark each result as Irrelevant, Neutral, or Relevant. These annotations are then used to train the classifier, which the crawler uses on new results.

Figure 2: eli5 displaying an analysis of a site classifier model on the topic of “good whisky.” Damn it, now I want some whisky...

Figure 1 is a screenshot of some initial results from a topical crawl. Shown in the image are a thumbnail of the webpage, a blob of text from the page, and the option to flag the site as irrelevant, neutral, or relevant. Site Hound also has a range of other features that are really awesome, but I’m going to keep focused on the eli5 components here. After labelling some pages you can go back to the dashboard and build the initial classifier model.

Once trained, Site Hound allows you to “release the hounds” -- oww-oww-ooooww!!! -- and crawl the web using your now-trained page classifier. Normally, once the classifier model is built, the only next step would be to initiate a crawl and hope it returns good results. However, with the integration of eli5, we are also giving the user the ability to view a human-readable representation of the model and to predict how well it will perform at scale.

Figure 2 shows how the model has weighted features, what its dataset looks like, and what the initial accuracy was, given the labelled pages. For this example I used Site Hound to look for pages related to Scotch whisky. I started with the keywords ‘oban’ and ‘scotch’. Site Hound also allows a user to specify terms that should be excluded, but I did not add any. (I should also admit that I labelled only a handful of pages, so the model accuracy is a tad lower than it would be if I were actually doing research.)

The feature breakdown in Figure 2 lists the features ordered by weight. Looking at the results, you can see that the model gives a high weight to any URL whose path begins with /ob or ends with ban. I believe this is because many of the web pages I marked as relevant were some version of https://[some-scotch-website]/[scotch-type], so pages ending in https://.../oban were quite common. The classifier was also able to identify the importance of words like ‘distillery’, ‘drink’, ‘Scotland’, and ‘malt’, all of which seem very reasonable to me.
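If you're wondering where odd-looking features like /ob and ban come from, here is a rough sketch. It assumes the URL features are character trigrams taken from the URL path (the post doesn't say exactly how Site Hound builds them), and it ranks them with a simple count difference as a crude stand-in for learned weights. The URLs and helper names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def path_ngrams(url, n=3):
    """Return the character n-grams of a URL's path, in order."""
    path = urlparse(url).path
    return [path[i:i + n] for i in range(len(path) - n + 1)]


# Hypothetical labelled URLs, loosely modelled on the whisky example.
relevant_urls = [
    "https://example.com/oban",
    "https://example.org/scotch/oban",
    "https://example.net/lagavulin",
]
irrelevant_urls = [
    "https://example.com/flights",
]

# +1 for each relevant URL containing the n-gram, -1 for each irrelevant one.
# A real classifier learns its weights; this is only a stand-in for ranking.
scores = Counter()
for url in relevant_urls:
    scores.update(set(path_ngrams(url)))
for url in irrelevant_urls:
    scores.subtract(set(path_ngrams(url)))

# Order features from most "relevant-looking" to least, ties alphabetically.
ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
for feature, score in ranked[:5]:
    print(f"{score:+d}  {feature!r}")
```

With a real model, eli5's `explain_weights` (or `show_weights` in a notebook) produces this sort of ranked table directly from the fitted classifier.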

“How exactly can this help?” you may be asking. Well, in this example I now understand what the classifier model learned from the pages I labelled. It’s no longer a black box!! I can actually see what it’s thinking, in a way, and more importantly I can correct it if it doesn’t align with my semantic understanding of the topic. I could use this knowledge to fine-tune and add new keywords or label additional pages, thus improving my results.

Figure 3: eli5 shows feature weights on live text

I also wanted to point out one other way eli5 allows you to view a model. Figure 3 is a screenshot from the eli5 GitHub README. When building a text classifier, this view can make it much easier to understand why a document was classified the way that it was. The green and red highlights mark features with positive and negative weights, respectively, and the opacity of each highlight indicates the strength of the feature: a more opaque highlight means a stronger weight. With this, a user can simply read text from their data set and quickly see how it aligns with the features of that class.
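The highlighting idea is simple enough to sketch by hand. The snippet below is not eli5's code; it assumes a plain dictionary of per-word weights (the values here are invented) and paints each known word green if its weight is positive and red if negative, with opacity scaled by the size of the weight.

```python
import html
import re


def highlight(text, weights):
    """Return HTML with each weighted word wrapped in a coloured span."""
    # Scale opacity against the largest weight so the strongest word is solid.
    max_abs = max((abs(w) for w in weights.values()), default=1.0) or 1.0
    parts = []
    # Split into runs of word characters and runs of everything else,
    # so spacing and punctuation are preserved as-is.
    for token in re.findall(r"\w+|\W+", text):
        weight = weights.get(token.lower(), 0.0)
        if weight == 0.0:
            parts.append(html.escape(token))
            continue
        colour = "0, 128, 0" if weight > 0 else "255, 0, 0"
        alpha = abs(weight) / max_abs
        parts.append(
            f'<span style="background: rgba({colour}, {alpha:.2f})">'
            f"{html.escape(token)}</span>"
        )
    return "".join(parts)


# Hypothetical weights for a "good whisky" classifier.
example_weights = {"whisky": 2.0, "malt": 1.0, "flights": -1.5}
rendered = highlight("Malt whisky, not flights", example_weights)
print(rendered)
```

In eli5 itself, `show_prediction` (or `explain_prediction` plus one of the formatters) gives you this view for a real model and document.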

eli5 does more than just explain text classifiers, however. It can currently also explain the predictions of linear classifiers and regressors, print decision trees as text or as SVG, show feature importances and explain the predictions of decision trees and tree-based ensembles, and more.

Although I’ve illustrated only a small fraction of what eli5 is now capable of, I hope it is enough to show why it can be so helpful. Please feel free to download and experiment with the package, and to contact us with any bug reports or pull requests with suggestions on how it could be made better!

Thanks for reading!
-- Jason