2024-10-07 Weekly Update - Levenshtein, Hamming, and Jaccard Distances
This week, as part of my consulting work, I've been working on evaluation metrics for LLM-powered text extraction. Essentially, trying to quantify whether the LLM grabbed the right text from a document. This means getting hands-on with some common text comparison algorithms. These were largely already familiar to me, but I hadn't had a chance to use most of them on a real engineering problem before. I collected my notes and turned them into this week's post.
If you need something to listen to while you read may I suggest “Something Will Happen” by house/jazz artist Beloiz. Come for the lush melodies, stay for the encouraging voice-overs by Willem Dafoe.
String Metrics
Reference: https://en.wikipedia.org/wiki/String_metric
Formally, all the string metrics discussed here are proper metrics in the mathematical sense: each one defines a metric space on the set of strings. A metric space is a set together with a notion of distance d between set elements that must satisfy the following four conditions:
- Self-Distance is Zero: Distance from a point to itself is zero: d(x, x) = 0
- Positivity: The distance function always returns a positive value for any two distinct points: d(x, y) > 0 when x ≠ y
- Symmetry: The distance function is symmetric: d(x, y) = d(y, x)
- The Triangle Inequality: The following property holds: d(x, z) ≤ d(x, y) + d(y, z)
Levenshtein Distance
Reference: https://en.wikipedia.org/wiki/Levenshtein_distance
The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. For example:
L(bat, cat) = 1
bat : 0
cat : 1 (b -> c)
L(bat, egg) = 3
bat : 0
eat : 1 (b -> e)
egt : 2 (a -> g)
egg : 3 (t -> g)
The Levenshtein distance can be normalized by dividing by the length of the longer of the two text strings. This results in a range of [0, 1].
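To make this concrete, here's a minimal sketch of the classic dynamic-programming (Wagner-Fischer) approach, plus the normalized variant described above. The function names are my own:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # prev[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution
            ))
        prev = curr
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    """Levenshtein distance divided by the longer string's length,
    giving a value in [0, 1]."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))
```

Running this on the examples above gives `levenshtein("bat", "cat") == 1` and `levenshtein("bat", "egg") == 3`, with `normalized_levenshtein("bat", "egg") == 1.0` since every character differs.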
Hamming Distance
Reference: https://en.wikipedia.org/wiki/Hamming_distance
Unlike the Levenshtein distance, the Hamming distance is only used to compare strings of the same length. As a result, it only measures the number of substitutions. So it is simply the number of positions at which the two strings differ.
Similar to the Levenshtein distance, the Hamming distance can be normalized by dividing by the length of the strings.
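Because it's just a positional comparison, the Hamming distance is a one-liner in most languages. A sketch (function names are my own):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))

def normalized_hamming(a: str, b: str) -> float:
    """Hamming distance divided by the (shared) string length."""
    if not a and not b:
        return 0.0
    return hamming(a, b) / len(a)
```

For example, `hamming("bat", "cat")` is 1, matching the Levenshtein result, since the only edit needed is a substitution.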
Jaccard Distance
Reference: https://en.wikipedia.org/wiki/Jaccard_index
The Jaccard distance is the complement of the Jaccard coefficient (or index), that is, d_J(A, B) = 1 − J(A, B), where the Jaccard coefficient is the ratio of the intersection of two sets over their union: J(A, B) = |A ∩ B| / |A ∪ B|.
Note that the Jaccard coefficient does not define a metric space because J(A, A) = 1, not 0, which is why we subtract it from 1 to get the Jaccard distance. For text comparisons, set elements can be defined in different ways, leading to different results. For example, by characters, by words, or by n-grams.
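A sketch of the Jaccard distance over character n-grams, where the `n` parameter controls the granularity (n=1 compares individual characters); the function name is my own:

```python
def jaccard_distance(a: str, b: str, n: int = 1) -> float:
    """Jaccard distance between two strings, comparing their sets
    of character n-grams (n=1 means individual characters)."""
    set_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    set_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not set_a and not set_b:
        return 0.0
    # 1 - |intersection| / |union|
    return 1 - len(set_a & set_b) / len(set_a | set_b)
```

With n=1, `jaccard_distance("bat", "cat")` compares the sets {b, a, t} and {c, a, t}: the intersection has 2 elements and the union has 4, so the distance is 1 − 2/4 = 0.5. Note that because sets ignore order and repetition, character-level Jaccard treats "bat" and "tab" as identical; larger n-grams recover some ordering information.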
Odds and Ends
Here are some other things that have caught my interest this week.
- Python 3.13 features: Python 3.13 was released on 2024-10-08. I recently wrote about some new (to me) Python features I had missed so I was curious to check out what’s new in this release. I found a nice overview on this Real Python Podcast episode which includes links to all the relevant feature PEPs.
- Inline Functions: In another callback to a recent post, I was talking about “taste” in programming, which I loosely defined as being something a little more granular than design patterns. This post on inline functions from John Carmack (and the associated Hacker News discussion) on when to be inconsistent is exactly the kind of programming nuance I’ve been thinking about.
- The MANIAC: I’ve been working through this audiobook about the life and work of John von Neumann. Written in a semi-fictional voice, the book covers everything from his work on the incompleteness theorem with Gödel through the impact of his work on recent Nobel Prize winner Demis Hassabis and Lee Sedol’s legendary Go match against AlphaGo. It’s a dark, almost creepy, read on technological progress but I’m still enjoying it.