Hierarchical Embeddings for Text Search
List-processing tricks for generating embeddings at different levels of document abstraction to allow efficient semantic searching.
A proposal for better retrieval on large, complex, hierarchically-structured document corpora, which can be implemented straightforwardly using heuristics and pairwise embedding distances.
We embed each atomic element, then repeatedly merge those embeddings heuristically into larger ones.
Then we search arbitrary text inputs in the reverse direction hierarchically, from the largest embedding to the smallest, to return the best matches in size order while excluding spuriously-similar hits.
High-dimensional text embeddings have heavily supplemented (or even replaced) traditional keyword search these days: a neural net encodes each document as a numeric fingerprint, a list of numbers which acts like a sort of ‘thumbprint’ or ‘semantic summary’, and retrieval is then just finding the numerically-closest embeddings among all the other documents available. The embedded text can also be a processed version, like summaries which draw on multiple documents, and the pipeline can become quite complex, like RAPTOR, which summarizes clusters of documents at multiple levels to try to provide useful summaries for downstream Q&A tasks.
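For concreteness, a minimal sketch of the retrieval step in Python/NumPy; `doc_vecs` is assumed to be a precomputed matrix of document embeddings from whatever embedding model one prefers (all names here are hypothetical):

```python
import numpy as np

def nearest_documents(query_vec: np.ndarray, doc_vecs: np.ndarray,
                      k: int = 10) -> np.ndarray:
    """Return indices of the k documents whose embeddings are most
    cosine-similar to the query embedding."""
    sims = (doc_vecs @ query_vec) / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)
    return np.argsort(-sims)[:k]
```

(In practice one would use an approximate-nearest-neighbor index rather than this brute-force scan, but it shows the idea.)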
I would like to consider a simpler task here: extending recommender-system-like ‘suggested reading’ lists of retrieval hits, based on arbitrary parts of documents rather than the entire document.
Magically Similar Links
This is how ‘similar links’ is done on Gwern.net: the abstract and some additional metadata get turned into an embedding, which is then compared to all the others (a k-nearest-neighbor search), and the closest subset is returned.
We then additionally ‘seriate’ them into a list: sorting them to try to minimize the ‘distance’ between each one, so they don’t look like they were randomly shuffled, but follow a certain internal logic. (We call this use of seriation, ‘sort by magic’.)
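A minimal sketch of one way to seriate, as a greedy nearest-neighbor ordering (the actual Gwern.net implementation may use something smarter, like an exact or 2-opt ordering; this is just the simplest approximation):

```python
import numpy as np

def seriate_greedy(vecs: np.ndarray) -> list:
    """Order items so adjacent pairs are similar: start at the first item,
    then repeatedly append the closest not-yet-used item."""
    n = len(vecs)
    dists = np.linalg.norm(vecs[:, None, :] - vecs[None, :, :], axis=-1)
    order, remaining = [0], set(range(1, n))
    while remaining:
        nxt = min(remaining, key=lambda j: dists[order[-1], j])
        order.append(nxt)
        remaining.remove(nxt)
    return order
```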
Intra-Document Search
This works well at the document level, but it doesn’t handle anything inside documents. If you want to search by a specific passage, you can’t do that. On a text-document-centric website like Gwern.net, LessWrong, or English Wikipedia, power users would sometimes like to do that, in a way which is better than a keyword search.
If you embedded a passage and searched it against the documents, that’s helpful but still not too useful, since there are lots of other passages which might be similar, but which cannot be seen in this way. OK, so you decide you will embed every passage… but what is a ‘passage’? Is it every paragraph? What is a paragraph? Is it just every <p>…</p> element? (What if it’s a single sentence?) What about lists and blockquotes and footnotes and tables and…? What if the paragraphs are not individually good matches, but an entire section (or the second half of an essay) is what the reader is looking for? How do you specify all this in advance? We could try to pre-define a hierarchical scheme: embed each HTML block element, then embed each section, then embed each document, search over them all, and hope that we sliced it the right way. But if you get it wrong, you partially defeat the point, because your a priori slicing will not necessarily work for every part of every document. (How does this handle pages with deeply-nested section hierarchies…?)
You can’t embed every possible way to slice a page, because there are combinatorially many slicings, and embeddings are not that cheap. (Even for a relatively small corpus like Gwern.net using only document-level embeddings, it’s not hard to hit 1GB of embeddings, and that’s without possibly generating thousands of hypothetical embeddings per page.) You also risk becoming a victim of your own success: the more embeddings, the more spurious or overlapping hits. (If there is 1 key paragraph, and it’s contained within each of a dozen different ways of slicing that page, then a search will turn up a dozen redundant hits—which is both a nuisance and crowds out meaningful alternatives. This gets especially bad when you consider how many paragraphs contain little unique information, like in scientific paper abstracts, where the ‘introduction’ and ‘conclusion’ paragraphs might be both short & stereotyped and could practically be copy-pasted among hundreds of papers with a little editing.) This sort of retrieval-by-embedding problem has many solutions, none particularly satisfying, and many quite complex.
List Clustering
We have a problem of breaking up a document into natural ‘clusters’.
As it happens, I faced a similar problem on Gwern.net, in breaking up the list of similar documents into clusters which I could easily label with a name. I solved it when I noticed, looking at them, that the seriation resulted in clear clusters in the list, and these clusters corresponded to sudden ‘jumps’ in the embedding distance. This made sense in retrospect: if there is a topic A and a topic B, then the seriation will tend to result in a list like ‘A A AB B B’, and at ‘AB’, there will be the largest distance in the list. So you can extract n clusters simply by going through the list, recording the indexes of the largest n−1 pairwise distances, and splitting the list at those. So if you specified n = 2, then the ‘A A AB B B’ list cleanly splits into ‘A A A’ and ‘B B B’ sub-lists.
How do you decide n? You could try to define some distance threshold which corresponds to a ‘jump’, but I settled on a heuristic: the number of clusters tends to go up fairly steadily with the total number of documents, faster than logarithmically but slower than linearly, and a square root seemed to work reasonably well. So if there are k documents, one just extracts n = √k clusters (rounding up, for a ceiling, as I find a few too many clusters better than too few). This ghetto clustering algorithm worked surprisingly well in practice, and wasn’t too hard to implement.
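Putting the gap-splitting and the square-root heuristic together, a sketch (reusing `seriate_greedy` from above; `vecs` is the embedding matrix and `order` the seriated index list):

```python
import math
import numpy as np

def split_at_largest_gaps(order: list, vecs: np.ndarray) -> list:
    """Split a seriated list into ceil(sqrt(k)) clusters by cutting at the
    n-1 largest adjacent embedding distances (the 'jumps')."""
    k = len(order)
    n = math.ceil(math.sqrt(k))  # square-root heuristic for cluster count
    gaps = [np.linalg.norm(vecs[order[i]] - vecs[order[i + 1]])
            for i in range(k - 1)]
    cuts = sorted(np.argsort(gaps)[-(n - 1):]) if n > 1 else []
    clusters, start = [], 0
    for c in cuts:
        clusters.append(order[start:c + 1])
        start = c + 1
    clusters.append(order[start:])
    return clusters
```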
Documents As Lists-Of-Lists
So I think you could do something similar for our document problem: embed every block element, like paragraphs and list items; apply the seriation clustering algorithm to determine the break points; then apply this recursively to build up a data-driven hierarchy.
Each cluster’s embeddings get averaged together, to create the next level. So a document might have 100 block elements, which yield √100 = 10 clusters; each of those 10 clusters’ ~10 embeddings gets averaged into a ‘prototypical’ embedding of that cluster, its center; then the seriation is done again, and √10 ≈ 3 clusters are taken; then 2 clusters; and then the entire document. (One tracks the ‘spatial’ extent of each cluster as well, for use later on in search.) Now one has broken up a document in semantically logical ways, without being too hardwired to a specific hierarchy. And there will not be too many embeddings, because it is hierarchical: the square-root heuristic means we don’t have to store too many embeddings beyond the atomic level (total embeddings: 𝒪(n + √n), so still 𝒪(n)), which we can tune the size of. (If block level is too expensive, we can work with a few blocks at a time, etc.)
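A sketch of the recursive build, reusing the two helpers above; the `Node` structure records each cluster’s mean embedding plus the span of atomic blocks it covers (its ‘spatial’ extent), and the forced merge handles the small-k cases where the square-root heuristic alone would never reach a single root:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Node:
    """One cluster in the hierarchy: a mean embedding plus the span of
    atomic blocks it covers, for later range lookups & transclusion."""
    vec: np.ndarray
    start: int                      # index of first atomic block covered
    end: int                        # index of last atomic block covered
    children: list = field(default_factory=list)

def build_hierarchy(block_vecs: np.ndarray) -> Node:
    """Recursively cluster block embeddings: seriate, split at the largest
    gaps, average each cluster into a parent node, until one root remains."""
    nodes = [Node(v, i, i) for i, v in enumerate(block_vecs)]
    while len(nodes) > 1:
        vecs = np.stack([nd.vec for nd in nodes])
        order = seriate_greedy(vecs)                   # sketch above
        clusters = split_at_largest_gaps(order, vecs)  # sketch above
        if len(clusters) >= len(nodes):                # no progress: force a merge
            clusters = [order]
        nodes = [Node(vec=np.mean([nodes[i].vec for i in cl], axis=0),
                      start=min(nodes[i].start for i in cl),
                      end=max(nodes[i].end for i in cl),
                      children=[nodes[i] for i in cl])
                 for cl in clusters]
    return nodes[0]
```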
Reverse Tree Search
How do we search an arbitrary text range against this usefully, since there will now be so many atomic embeddings potentially cluttering results? We can do hierarchical search in the reverse direction (a bit like HNSW): we first do the normal lookup against the top level (document embeddings) and show the k best results from there (seriated, of course!); then we descend into each matching top-level document, search only the clusters there, and return the l best results; then we descend into those and return the best m results; and so on. (Because the hits operate at different levels of abstraction, we may find we need to be very generous about distance at the top level, but increasingly strict as we drill down to the atomic level.)
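A sketch of the descent, reusing the `Node` hierarchy above (per-level distance cutoffs and the per-level seriation of hits are omitted for brevity):

```python
import numpy as np

def hierarchical_search(query_vec: np.ndarray, roots: list,
                        ks=(5, 3, 3, 3)) -> list:
    """Search top-down: score the document roots, keep the k best, then
    descend into only their children, narrowing at each level.
    Returns one list of hit Nodes per level, most abstract first."""
    per_level_hits, frontier = [], roots
    for k in ks:
        if not frontier:
            break
        scored = sorted(frontier,
                        key=lambda nd: np.linalg.norm(nd.vec - query_vec))[:k]
        per_level_hits.append(scored)
        frontier = [child for nd in scored for child in nd.children]
    return per_level_hits
```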
By restricting our increasingly-granular search to the scope defined by the hierarchy, we exclude spuriously-similar hits from other documents, avoiding issues like boilerplate ‘in conclusion’ paragraphs matching hundreds of other documents—they would only match a similar ‘in conclusion’ paragraph in the same scope (and that would probably not exist).
This provides a list of useful semantic hits from general to specific, allowing the reader to drill down to an exact paragraph, at a total embedding cost roughly linear in the number of atomic text blocks.
Website Implementation
HTML-implementation-wise, we will assume that IDs are available which define ranges inside documents; each cluster then propagates its range by taking the IDs of its beginning & end blocks, a bit like a range tree. This would work well with the Gwern.net transclusion popups, which can transclude specific ID ranges.
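The range propagation is then trivial given the `Node` spans from the sketch above; `block_ids` is the (assumed) document-order list of block-element IDs:

```python
def node_range_ids(node: "Node", block_ids: list) -> tuple:
    """Map a cluster node back to the HTML IDs of its first & last atomic
    blocks, yielding a transcludable ID range."""
    return block_ids[node.start], block_ids[node.end]
```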
We can also enable the ‘drilling down’ nicely by using lazy-loading collapses/disclosure blocks: we can show the first k hits at each level, but then append a collapse block to each hit, which contains the k best hits within that hit, so a user can choose which way to drill—either ‘vertically’ or ‘horizontally’. So a text input returns a popup containing a nested list of “Document-level hits: A + [within A], B + [within B], C + [within C] / §-level: E.1+… U.8+… Z.2+… … / paragraphs-level: M.5.1+… … / paragraph-level: N.22.6.9+…”, each of which can be either popped-up or uncollapsed. (And like the seriated sort-by-magic clusters, we can label each range using an LLM to summarize it with a human-readable name, if it does not match up nicely with a section header or something.)
We can also avoid dynamism by changing the search: you can’t search for arbitrary text; instead, if you mouse over some text, you simply search for the cluster which contains that text. (If you search on a paragraph, then you start with that atomic element; if you search 2 adjacent paragraphs, whatever is the lowest cluster that contains them both.) The results can then be precomputed, or deferred to an API service.
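Finding that lowest containing cluster is a simple descent over the `Node` spans from the earlier sketch:

```python
def smallest_enclosing_cluster(root: "Node", start: int, end: int) -> "Node":
    """Descend from the document root to the lowest cluster whose span
    contains the selected atomic blocks [start, end]."""
    node = root
    while True:
        inner = [c for c in node.children
                 if c.start <= start and end <= c.end]
        if not inner:
            return node
        node = inner[0]
```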
Because embeddings are fairly cheap and independent at the document level, we can simply throw away a document’s embeddings whenever it is modified, and re-embed that document hierarchically from scratch.