2024-10-07 Weekly Update - Levenshtein, Hamming, and Jaccard Distances

acv,4 min read

This week as part of my consulting work I’ve been working on evaluation metrics for LLM-powered text extraction. Essentially, trying to quantify if the LLM grabbed the right text from a document. This means getting hands-on with some common text comparison algorithms. These were largely already familiar to me but I hadn’t had a change to use most of them on a real engineering problem before. I collected my notes and turned them into this week’s post.

If you need something to listen to while you read may I suggest “Something Will Happen” by house/jazz artist Beloiz. Come for the lush melodies, stay for the encouraging voice-overs by Willem Dafoe.

String Metrics

Reference: https://en.wikipedia.org/wiki/String_metric

Formally, all the string metrics measures here are proper metric spaces in the mathematical sense. A metric space is a set MM together with a notion of distance dd between set elements that must satisfy the following four conditions:

Levenshtein Distance

Reference: https://en.wikipedia.org/wiki/Levenshtein_distance

The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. For example:

L(bat, cat) = 1

bat : 0
cat : 1 (b -> c)

L(bat, egg) = 3

bat : 0
eat : 1 (b -> e)
egt : 2 (a -> g)
egg : 3 (T -> g)

The Levenshtein distance can be normalized by dividing by the length of the longer of the two text strings. This results in a range of [0,1][0,1].

Hamming Distance

Reference: https://en.wikipedia.org/wiki/Hamming_distance

Unlike the Levenshtein distance the Hamming distance is only used to compare strings of the same length. As a result it only measure the number of substitutions. So it is the number of positions at which the two strings are different.

Similar to the Levenshtein distance, the Hamming distance can be normalized by dividing by the length of the strings.

Jaccard Distance

Reference: https://en.wikipedia.org/wiki/Jaccard_index

The Jaccard distance is the complement of the Jaccard coefficient (or index), that is, 1J(A,B)1 - J(A,B). Where the Jaccard coefficient is the ratio of the intersection of two sets over their union:

J(A,B)=ABABJ(A,B) = \frac{|{A \cap B}|}{|A \cup B|}

Note that the Jaccard coefficient does not define a metric space because J(A,A)=1J(A,A) = 1 not 00 which is why we subtract 11 to get the Jaccard distance. For text comparisons set elements can be defined in different ways leading to different results. For example, by characters, by words, or by n-grams.

Odds and Ends

Here are some other things that have caught my interest this week.

This work by Alex C. Viana is licensed under CC BY-NC-SA 4