Better Vector Clustering With Head Noun Extraction


An exploration of how standard embeddings can create a semantic soup by grouping search queries by adjectives rather than head nouns during clustering.


Imagine looking at a list of items that includes blue thermal socks, cheap gaming laptops, and rental bulldozers. If you had to group them, you would probably make three neat piles: socks, laptops, and bulldozers. That is how the human brain naturally categorizes the world.

But what happens when we ask a machine to do the same task? If we convert those search queries into mathematical vectors and cluster them by similarity, we get a very different result.

Instead of grouping by the actual objects, the machine groups them by their adjectives. It puts all the "cheap" things together, all the "blue" things together, and all the "used" things together.

This happens because standard embeddings create a semantic soup. The vector for "cheap laptop" is a mathematical average of "cheap" and "laptop." Because "cheap" is such a strong concept, it pulls the vector toward other cheap items, completely ignoring the physical object itself.

An analysis of search queries reveals a wide variety of these patterns, combining adjectives, nouns, and verbs in complex ways. So, what do we do about this machine learning blind spot? To be continued.

Let’s do a mental exercise.

Glance over the following list and group the items in your mind:

blue thermal socks
cheap diesel bulldozer
cheap gaming laptops
blue rental bulldozer
cheap ankle socks
used cushioned socks
blue lightweight laptops
cheap striped socks
used touchscreen laptops
blue compact bulldozer
cheap business laptops
blue ultraportable laptops
used electric bulldozer
cheap mini bulldozer
blue compression socks

Most people arrive at the following clustering schema:

| Socks | Laptops | Bulldozers |
| --- | --- | --- |
| blue thermal socks | cheap gaming laptops | cheap diesel bulldozer |
| cheap ankle socks | blue lightweight laptops | blue rental bulldozer |
| used cushioned socks | used touchscreen laptops | blue compact bulldozer |
| cheap striped socks | cheap business laptops | used electric bulldozer |
| blue compression socks | blue ultraportable laptops | cheap mini bulldozer |

What would a machine do?

Let’s find out.

We’ll vectorise these search queries using EmbeddingGemma:

0,1,...,255
0.01809046,0.014781968,...,-0.09089249
0.036337394,0.06969773,...,0.0038870324
...etc

Note: In the above example we’re using Matryoshka Representation Learning (MRL) to truncate the embeddings to 256 dimensions.
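To make the dimensionality reduction concrete, here is a minimal sketch of MRL-style truncation. It assumes the model was trained so that the leading dimensions carry the most information, in which case shortening a vector means keeping its first components and rescaling to unit length. The function name `truncate_embedding` and the toy values are illustrative only and do not come from any library.

```python
import math

def truncate_embedding(vector, dims=256):
    """Keep the first `dims` components and rescale the result to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero head cannot be normalised; return it unchanged.
        return list(head)
    return [x / norm for x in head]

# Made-up 4-dimensional "full" embedding, truncated to 2 dimensions.
full = [3.0, 4.0, 0.5, -0.25]
short = truncate_embedding(full, dims=2)
print(short)
```

Rescaling after truncation keeps cosine and dot-product comparisons consistent between shortened vectors.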

After that we’ll cluster them by the similarity of their embeddings. In this specific example we’ll use a FAISS index, which builds implicit clusters represented as Voronoi cells, each with a “topical centroid”.
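The idea of Voronoi cells around centroids can be sketched without FAISS. The snippet below is a plain-Python stand-in, not the FAISS API: it runs a few k-means iterations and assigns every vector to its nearest centroid, which is what defines a cell. The function names, the naive initialisation, and the 2-D points are all assumptions made for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_cells(vectors, centroids):
    """For each vector, return the index of its nearest centroid (its Voronoi cell)."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(v, centroids[c]))
        for v in vectors
    ]

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(vectors, k, iterations=10):
    # Naive initialisation (an assumption of this sketch): the first k vectors.
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        labels = assign_cells(vectors, centroids)
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cell is empty
                centroids[c] = mean(members)
    return centroids, assign_cells(vectors, centroids)

# Made-up 2-D points standing in for query embeddings.
points = [[0.0, 0.1], [5.0, 5.1], [0.1, 0.0], [5.1, 5.0]]
centroids, labels = kmeans(points, k=2)
print(labels)
```

With real query embeddings the cell a query lands in depends entirely on which direction dominates its vector, which is where the grouping problem in this article comes from.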

And you end up with a grouping like this:

| ? | ? | ? |
| --- | --- | --- |
| cheap ankle socks | blue thermal socks | used cushioned socks |
| cheap striped socks | blue compression socks | used touchscreen laptops |
| cheap gaming laptops | blue lightweight laptops | used electric bulldozer |
| cheap business laptops | blue ultraportable laptops | |
| cheap diesel bulldozer | blue rental bulldozer | |
| cheap mini bulldozer | blue compact bulldozer | |

What happened?

We ended up with head nouns grouped by adjectives.

Standard embeddings create a “semantic soup.” The vector for “cheap laptop” behaves roughly like a mathematical average of “cheap” and “laptop.” Because “cheap” is a very strong concept, it pulls the vector towards other “cheap” things, ignoring the physical object.
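The averaging intuition can be illustrated with a toy calculation. The word vectors below are hand-made, not taken from any model: each axis stands for one concept, and the adjectives are deliberately given a larger magnitude than the nouns to mimic a “strong” concept. All names and values are assumptions for this sketch.

```python
import math

def average(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up axes: (cheapness, blueness, laptop-ness, sock-ness).
WORDS = {
    "cheap": [3.0, 0.0, 0.0, 0.0],
    "blue": [0.0, 3.0, 0.0, 0.0],
    "laptop": [0.0, 0.0, 1.0, 0.0],
    "socks": [0.0, 0.0, 0.0, 1.0],
}

def phrase_vector(phrase):
    """Average the toy word vectors of a phrase."""
    return average([WORDS[w] for w in phrase.split()])

cheap_laptop = phrase_vector("cheap laptop")
same_adjective = cosine(cheap_laptop, phrase_vector("cheap socks"))
same_noun = cosine(cheap_laptop, phrase_vector("blue laptop"))
print(round(same_adjective, 2), round(same_noun, 2))
```

In this toy setup “cheap laptop” scores about 0.9 against “cheap socks” but only about 0.1 against “blue laptop”, so a similarity-based clustering would pair it with the socks.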

Of course, real queries are not as simple as the example above. Our large-scale NLP analysis of search queries reveals a wide variety of part-of-speech patterns:

| pattern | freq |
| --- | --- |
| ADJ NOUN NOUN | 45154 |
| NOUN NOUN NOUN | 28902 |
| NOUN NOUN | 25469 |
| ADJ NOUN NOUN NOUN | 25036 |
| ADJ NOUN | 14539 |
| NOUN NOUN NOUN NOUN | 11848 |
| NOUN | 6732 |
| ADJ NOUN NOUN NOUN NOUN | 5403 |
| ADJ ADJ NOUN NOUN | 4033 |
| NOUN ADJ NOUN NOUN | 3684 |
| NOUN VERB NOUN | 3492 |
| NOUN ADJ NOUN | 3367 |
| ADJ ADJ NOUN | 3304 |
| ADJ NOUN VERB NOUN | 2968 |
| ADJ NOUN ADJ NOUN | 2726 |
| NOUN NOUN VERB | 2137 |
| ADV NOUN | 2063 |
| ADJ NOUN VERB | 2037 |
| NOUN NOUN VERB NOUN | 2001 |
| NOUN VERB | 1898 |
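A frequency table of this kind can be produced by joining each query’s part-of-speech tags into a pattern string and counting the results. The sketch below assumes the queries have already been tagged. The tags shown are hand-assigned for illustration, whereas a real analysis would take them from a POS tagger, and this is not necessarily how the table above was built.

```python
from collections import Counter

def pos_pattern(tagged_query):
    """Join the part-of-speech tags of one query into a pattern string."""
    return " ".join(tag for _token, tag in tagged_query)

# Hypothetical pre-tagged queries (tags assigned by hand for this sketch).
tagged_queries = [
    [("cheap", "ADJ"), ("gaming", "NOUN"), ("laptops", "NOUN")],
    [("cheap", "ADJ"), ("diesel", "NOUN"), ("bulldozer", "NOUN")],
    [("rental", "NOUN"), ("bulldozer", "NOUN")],
    [("socks", "NOUN")],
]

counts = Counter(pos_pattern(q) for q in tagged_queries)
for pattern, freq in counts.most_common():
    print(pattern, freq)
```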

So what do we do?

To be continued…

Dan Petrovic · Nov 28, 17:10