Hacking Through Data with Metrics

What is a BK Tree?

A Burkhard-Keller (BK) Tree is a data structure that optimizes search in a metric space. It's good because it's fast, and it works with any true metric. It's basically the go-to structure when you want to find all nodes within X distance of another node.

The most common example of a BK Tree's use case is auto-correct and auto-complete. If you're interested in that example, it's already been written up elsewhere. Thanks Xenopax!

Implementation Improvements

The usual BK Tree implementation uses an array or list to map child nodes onto a parent, using the index as the distance.

Pseudocode using difference of values as a distance metric:

parent = {
    'value': 4,
    'children': [{}, {'value': 3, 'children': ...},
                 {'value': 6, 'children': ...}, {}, {}]
}
This is how we originally implemented our BK Tree as well. It has the benefit of being ridiculously fast for search, because the distance is used directly as the index to look up the next node. After way too much debate with a colleague, who will not be named to protect the involved parties, we concluded that a list or array is the wrong approach for us.

Storing nodes in the way above is a good implementation if your only requirement is search speed. Its drawbacks include increased memory usage and some serious precision costs. The memory usage comes from storing empty nodes and from creating arrays sized to match the largest distance being stored (depending on the language and exact implementation, this has further consequences). As for the precision costs, you have to map every distance onto an integer value that can be used as an index.
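
To make the precision cost concrete: a real-valued metric like Jaccard distance (more on that below) lives in [0.0, 1.0], so before it can serve as an array index it has to be scaled and rounded, and that rounding merges distinct distances. A tiny sketch (the function name here is mine, purely for illustration):

def to_index(distance, buckets=100):
    # scale a distance in [0.0, 1.0] into an integer index;
    # the rounding collapses nearby-but-distinct distances
    return round(distance * buckets)

print(to_index(0.121), to_index(0.1249))  # both land in bucket 12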

Great, we have two drawbacks: memory usage and precision issues. Now you say, "So what? Memory is cheap, and we can afford to lose a little precision on a function we're using to fuzzy match search results." (I assume you say that, because that is exactly what I said, dear reader.) Now, for the counter-argument, which I'm sure you'll love.

You're forgetting two giant reasons we're using the BK Tree:

  • We need a way to quickly search through giant datasets, and memory is limited in this case
  • We need a way to give exact distance searches, because we want to come to conclusions about a large dataset... FOR (data) SCIENCE!

Ok, so... let's just pretend that actually convinced you. "Then there must be a better implementation!" you'll say. There is! Instead of using a fixed-size array or list and storing each child directly at its distance index, you store the children in an array sorted by distance. Since it's sorted, you can still do a binary search on the array, making the lookup of the initial child node O(log(n)) instead of O(1).

Luckily (or unluckily) you then have to search through every other child node within the threshold. With the "improved" implementation this should be faster in most cases, because after the initial child is found, every subsequent child checked actually holds a value (except the last one, which falls just past the threshold). The previous implementation, on the other hand, must scan every index in range, which may or may not hold a value. A sketch of this approach follows below.
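
Here's a minimal sketch of the sorted-children approach in Python (the names and structure are mine, not our production code): bisect finds the first candidate child in O(log n), and the scan stops as soon as an edge distance falls past the threshold.

import bisect

class BKNode:
    def __init__(self, value):
        self.value = value
        self.child_keys = []  # edge distances to this node, kept sorted
        self.children = []    # child nodes, parallel to child_keys

    def add(self, value, metric):
        d = metric(self.value, value)
        i = bisect.bisect_left(self.child_keys, d)
        if i < len(self.child_keys) and self.child_keys[i] == d:
            # an edge with this distance already exists; descend into it
            self.children[i].add(value, metric)
        else:
            self.child_keys.insert(i, d)
            self.children.insert(i, BKNode(value))

    def search(self, query, tolerance, metric, results):
        d = metric(self.value, query)
        if d <= tolerance:
            results.append(self.value)
        # by the triangle inequality, only children whose edge distance k
        # satisfies d - tolerance <= k <= d + tolerance can hold matches
        i = bisect.bisect_left(self.child_keys, d - tolerance)
        while i < len(self.child_keys) and self.child_keys[i] <= d + tolerance:
            self.children[i].search(query, tolerance, metric, results)
            i += 1

# usage, with plain numeric difference as the metric:
diff = lambda a, b: abs(a - b)
root = BKNode(4)
for v in (3, 6, 10, 1):
    root.add(v, diff)
matches = []
root.search(5, 2, diff, matches)
print(sorted(matches))  # [3, 4, 6]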

What's a metric space?

Well, glad you (and by "you" I mean "I") asked! A metric space defines the distance between points, where the distances have the following properties:

  • d(x, y) >= 0 Distance cannot be negative.
  • d(x, y) == 0 if and only if x == y Distance is zero exactly when the points are the same.
  • d(x, y) == d(y, x) Distance between two points is symmetrical.
  • d(x, y) + d(y, z) >= d(x, z) Distance meets the triangle inequality.

Note: x, y, and z above represent points.
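
As a quick sanity check, plain absolute difference on numbers satisfies all of these. A throwaway snippet (mine, purely illustrative) that asserts the properties over a few points:

def d(x, y):
    return abs(x - y)

points = [-3, 0, 2, 7]
for x in points:
    for y in points:
        assert d(x, y) >= 0                      # non-negative
        assert (d(x, y) == 0) == (x == y)        # zero iff identical
        assert d(x, y) == d(y, x)                # symmetric
        for z in points:
            assert d(x, y) + d(y, z) >= d(x, z)  # triangle inequality
print('abs(x - y) behaves like a metric on these points')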

Great! Now we know how to check if a metric actually will work, but coming up with distance metrics is like... so hard, okay?

Some Metrics!

Lucky for you, dear reader, I have some distance metrics lying around that you should check out! I'll list them off here, and tell you how we use them in a second.

  • Levenshtein Distance
  • Jaccard Distance
  • Hamming Distance

Names of things are good, but we need to actually learn what the fuck these things are, and why they are useful.

Levenshtein Distance

Levenshtein distance, also known as edit distance, is a really good metric for fuzzy matching between strings. It can be helpful when you're doing things like correcting misspelled text you've scraped. We do a lot of scraping, parsing, crawling, searching, and general interneting here at HG, so any time we want a good option for fuzzy matching, we start by throwing Levenshtein Distance at it.

This is the most common use case I've seen for a BK Tree. In fact, every BK Tree implementation I've come across only implements Levenshtein Distance. For that reason I'm going to kinda skip over this section and move on.
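
If you want something concrete to poke at anyway, here's a bare-bones dynamic-programming sketch in Python (a textbook version, not our implementation):

def levenshtein(a, b):
    # classic dynamic programming: prev[j] holds the edit distance
    # between the prefixes a[:i-1] and b[:j] from the previous row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein('team', 'meat'))  # 2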

Jaccard Distance

Jaccard Distance is much simpler as a distance metric. For a set A and a set B, it is simply:

d(A, B) = 1 - |A ∩ B| / |A ∪ B|

That is, one minus the size of the intersection divided by the size of the union.

“Why's this useful?” I'll tell you why it's not useful: for attempting to fuzzy match strings! This is because it is position indifferent. For instance, if you take "team", the Jaccard Distance to "meat" is 0, because the characters in "team" are exactly the characters in "meat." The Levenshtein Distance, by contrast, would be 2 (of a possible 4).

This same property makes Jaccard Distance an awesome metric for doing things like detecting similar sentences or content, especially if the sets are n-grams or skip-grams, because it shows when sentences draw from the same vocabulary but are simple rearrangements of the same words.
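
Here's a quick sketch of that in Python (the bigrams helper is mine, just for illustration): raw character sets can't tell "team" from "meat", while bigram sets recover some of the ordering.

def bigrams(s):
    # character 2-grams: 'team' -> {'te', 'ea', 'am'}
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard_distance(a, b):
    # 1 - |intersection| / |union|
    return 1 - len(a & b) / len(a | b)

# raw character sets: 'team' and 'meat' look identical (distance 0.0)
print(jaccard_distance(set('team'), set('meat')))
# bigram sets recover some ordering: the distance is 0.8
print(jaccard_distance(bigrams('team'), bigrams('meat')))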

Hamming Distance

Last but not least we have Hamming Distance. Hamming distance is a direct index-by-index comparison of two equal-length arrays.

An Example

Take the differences between two binary strings:

10110101101
00100101110
^  ^     ^^

There are 4 positions where the values differ. This means the Hamming distance is 4.

Similarly, if we look at two hex strings:

a12b456789cd
f12045678931
^  ^      ^^

Again, there are 4 positions where the values differ, so the Hamming distance is 4.

Why would I use Hamming Distance?

Welllll... obviously it's useless for language and word analysis, because the Hamming distance between two words is basically random with respect to how different the words actually are (and it's only defined for strings of equal length).

One thing we use it for is comparing hash values for hashes that embed information, such as image hashes. That lets us compare image similarity cheaply and reduce our search space.
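
Hamming distance itself is nearly a one-liner; this sketch (mine, not our production code) works on any two equal-length sequences, including hex digests like the image hashes above:

def hamming(a, b):
    # index-by-index comparison; only defined for equal lengths
    assert len(a) == len(b), 'Hamming distance needs equal-length inputs'
    return sum(x != y for x, y in zip(a, b))

print(hamming('10110101101', '00100101110'))    # 4
print(hamming('a12b456789cd', 'f12045678931'))  # 4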


So, metrics are cool and important tools for many reasons, but equally important is being able to detect when a function is not a metric. I'll give some examples of tricky metric look-alikes that are simply impostors.

  • Jaro-Winkler Distance A cousin of edit distance (built on Jaro similarity) that puts extra weight on the first few characters, which makes it extremely useful for matching nouns. It breaks the triangle inequality in sparse cases, so it may seem like it's working at first, but then it subtly omits certain search results.
  • Sørensen–Dice Coefficient An adaptation of Jaccard Distance that reduces the effect of length disparity between the sets. It does not meet the triangle inequality.

Disproving a Metric

Disproving a metric is actually fairly simple, since you only need one case where it fails. A quick way to do this is to randomly generate a bunch of strings of varying lengths (they should be on the long side, say 16-30 characters), then test the triangle inequality across all permutations of those strings.

If any triple of strings fails the triangle inequality, the metric is disproved. Please note that if no failing sets are found, this does not prove the function is a metric.

Python-ish Pseudocode for disproving a metric:

import random
import string

def f(x, y):
    # plug the distance function you want to test in here
    raise NotImplementedError

def random_chars(length):
    # random lowercase string of the given length
    return ''.join(random.choices(string.ascii_lowercase, k=length))

def check_metric(func, checks):
    for _ in range(checks):
        a = random_chars(random.randint(16, 30))
        b = random_chars(random.randint(16, 30))
        c = random_chars(random.randint(16, 30))

        distances = [func(a, b), func(b, c), func(a, c)]

        # check all three rotations of the triangle inequality
        for i in range(3):
            first = distances[i]
            second = distances[(i + 1) % 3]
            third = distances[(i + 2) % 3]

            if first + second < third:
                return False

    return True

if check_metric(f, 1000000):
    print('Yaaaay, we have a metric!')
else:
    print('Boo!  This is not a metric!')


Hopefully this gave you a good feel for metrics and motivated you to go grab some data (legally, of course) and dig through it to find some interesting statistics! We have a Rust-based version of a BK Tree implemented, and we will open-source it soon. I'll link it below with an update when the code is available.