From Query to Meaning: Reshaping the Landscape of Google Search Results Through Semantic Search
I’ve always hated French class.
Not only was the class incredibly boring, but it was impossible to scam the system. Using Google Translate was strictly off-limits in that class.
Whenever I zoned out in the middle of a lesson, I was always questioning how my teachers just knew if our writing was authentic French or copied from Google Translate.
And, my semester-long question was answered when a classmate tried to translate ‘I am excited’ and wrote an inappropriate response on his test.
If you know, you know.
Google Translate often renders its translation word-by-word and doesn’t consider the context of the phrase. So, instead of realizing it’s providing the wrong translation for the phrase being asked, it just translates each word individually and calls it a day.
That’s how we get caught in class — when the results are contextually wrong.
Although I’m done with French class for the rest of my life, this problem still exists. It extends beyond that one moment and into our everyday lives.
With the usage of technology rising exponentially, it’s important that we are understood by our computers. It has a huge impact on so many aspects of our lives, with a significant focus on one essential element: information search.
Searching for information on engines like Google has become a fundamental part of our daily routines, and we couldn’t see the world without it.
But, there are flaws.
Sometimes, even powerful search engines like Google can’t decipher what you’re looking for.
So, let’s optimize these engines to understand us humans, while increasing the value and efficiency of information retrieval.
Understanding the Problem.
Let’s start by dumbing the problem down and understanding it at its fundamental level.
In order for us to get the correct translation or results based on the search we write, the computer needs to understand what we’re saying.
But, this is trickier than it sounds.
Humans excel at understanding and communicating through words. That’s why you’re understanding my words so easily. But, computers only understand numbers, specifically binary numbers.
There’s a disconnect between the two, so for us to sit down and have a conversation, we need a coherent translator that both parties can understand.
And, this translation needs to be as seamless as possible.
Just like this Friends episode.
What we’re looking at is a notion called Natural Language Processing (NLP), which describes the ability of a computer to understand spoken and written human language.
For years, it wasn’t even close to its full potential. For a search engine to be of true value, users shouldn’t have to think twice about whether their question is worded 100% perfectly. And, the first few results should actually be useful.
But, that isn’t the case.
The Problem Lies In Keyword Search.
In a statement released a couple of years ago, Google acknowledged that its search functionality was not performing optimally. They admitted that users needed to incorporate specific “keywords” into their queries to align with the keyword search algorithm, even if it resulted in a less authentic or natural way of asking for information.
“At its core, Search is about understanding language. It’s our job to figure out what you’re searching for and surface helpful information from the web, no matter how you spell or combine the words in your query. While we’ve continued to improve our language understanding capabilities over the years, we sometimes still don’t quite get it right, particularly with complex or conversational queries. In fact, that’s one of the reasons why people often use “keyword-ese,” typing strings of words that they think we’ll understand, but aren’t actually how they’d naturally ask a question.”
Before the advent of semantic search, the most popular method of searching was through keyword search.
Let’s say I typed into the search bar “What is an Apple, a Banana and Grapes?”
And I have a sample set of responses…
1) The Apple is a fruit.
2) The Pear is a fruit.
3) The Apple and Banana are fruits.
4) An Apple is a fruit, a Banana is a fruit, and Grapes are not fruits.
5) There’s a tree outside.
The process involves the model analyzing the question and finding the words shared between the query and each response.
“What is an Apple, a Banana and Grapes?”
1) The Apple is a fruit. [3 words]
2) The Pear is a fruit. [2 words]
3) The Apple and Banana are fruits. [3 words]
4) An Apple is a fruit, a Banana is a fruit, and Grapes are not fruits. [9 words]
5) There’s a tree outside. [1 word]
Then, the response with the highest number of common words is returned to the user.
So, let’s rerank our responses now…
“What is an Apple, a Banana and Grapes?”
1) An Apple is a fruit, a Banana is a fruit, and Grapes are not fruits. [9 words]
2) The Apple and Banana are fruits. [3 words]
3) The Apple is a fruit. [3 words]
4) The Pear is a fruit. [2 words]
5) There’s a tree outside. [1 word]
There’s a clear issue with this.
The returned response is misleading, as responses #2 and #3 are actually accurate. However, due to the program’s focus on identifying common words, response #1 is selected and returned simply because it mentions all three fruits.
This algorithm is why it’s so easy to receive irrelevant responses when you search a question on the internet: it may fail to understand the actual ask behind your words and instead just computes a ‘similarity’ score between the words involved.
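To make this concrete, here’s a minimal sketch of keyword search in Python, using the made-up sentences above: count the words each response shares with the query and rank by that count. (The exact counts depend on how you tokenize, but the ranking comes out the same.)

# A toy illustration of keyword search: rank responses by how many
# words they share with the query.
query = "What is an Apple, a Banana and Grapes?"
responses = [
    "The Apple is a fruit.",
    "The Pear is a fruit.",
    "The Apple and Banana are fruits.",
    "An Apple is a fruit, a Banana is a fruit, and Grapes are not fruits.",
    "There's a tree outside.",
]

def tokenize(text):
    # lowercase and strip punctuation so "Apple," matches "apple"
    return [word.strip(".,?!") for word in text.lower().split()]

query_words = set(tokenize(query))

def overlap(response):
    # count every word in the response that also appears in the query
    return sum(1 for word in tokenize(response) if word in query_words)

# rank responses by their overlap with the query, highest first
for response in sorted(responses, key=overlap, reverse=True):
    print(overlap(response), response)

Notice that the misleading response still wins, exactly as in the reranked list above, because it simply mentions more of the query’s words.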
An Upgraded Solution — Semantic Search.
TL;DR till this point — the problem is that computers evaluate our sentences based on individual words, which results in misunderstandings and inaccurate answers.
A quick way to address this issue is to shift the computer’s reliance away from common words and instead focus on the contextual meaning of the query and response. By understanding the intent behind the ask, the computer can improve its ability to provide accurate and meaningful answers.
And, we can do this through a new NLP approach called Semantic Search.
Let’s break our term into its two pieces.
Search = the process of locating and retrieving a specific piece of information.
Semantic = the philosophical study of meaning.
And, this philosophical definition proves true within computer science as well.
Now, let’s reconnect the dots with our two components and understand semantic search.
At its core, semantic search is a data-searching technique that aims to go beyond interpreting the keywords typed into the search bar and instead determine the intent and contextual meaning behind a search query.
In short, semantic search can be simplified down to a three-step process:
- Text Embedding Is Used To Turn Words into Vectors
- Similarity Methods to Find the Vector Among the Responses Which is The Most Similar to the Vector Corresponding to the Query
- The Corresponding Response to This Most Similar Vector is Outputted
Let’s take a deeper look.
1) Text Embeddings
Our end objective is to have a smooth, knowledgeable translator between the computer and the human, and text embeddings make this a reality by assigning each word an array of numbers.
But, this array of numbers is very intentional and captures the word’s true meaning.
Let’s say I have a randomized list of words…
- Airpods
- Apple
- Banana
- Bracelet
- Earrings
- iPhone
- Macbook
- Necklace
- Pear
Automatically, when you look at this set of words, you can internally divide them into three distinct categories: fruits, jewelry, and Apple products.
So, let’s rearrange…
Category #1: Fruits
- Apple
- Banana
- Pear
Category #2: Jewelry
- Bracelet
- Earrings
- Necklace
Category #3: Apple Products
- Airpods
- iPhone
- Macbook
Now, if I were to give you the word ‘Ring’, you’d probably think of category #2, since it fits with the theme of jewelry. And, you can see visually how this separation would play out on an x-y axis.
Here, the categories are as far apart as possible. And, this is a very simple version of how text embeddings can assign numbers to a word based on its location among the categories, e.g. Ring == [8, 6].
An ideal word embedding is characterized by close proximity between similar words and greater distances between dissimilar words.
But, this isn’t all.
We don’t want to rely on one form of ‘categories’ because that isn’t enough for the computer to understand the object individually. So, we want to build a text embedding that captures the relations between words, as relations between numbers.
Kind of like how these four objects are mapped out…
Now, there are two ways these four objects can be sorted into their respective categories. The two axes represent different relationships between the words: the vertical axis captures size (bigger items), the horizontal axis captures the object’s category.
So, it seems that this embedding comprehends that the words in it have two main properties. And, our computer also understands that if we included a bracelet and an iPad, they’d probably land somewhere in the middle of the horizontal axis. Basically, it’s identifying features.
In terms of our array of numbers, each feature corresponds to two coordinates, since each feature can be pictured as its own two-dimensional ‘graph’.
Eg. a text embedding could look like [8, 6, 4, 3]
But, a good text embedding has way more than two coordinates to truly capture the essence of a word. Companies like Cohere that have deployed text embedding API tools associate 4096 coordinates with each word, so they capture 2048 features. Some of these features may even be beyond human comprehension; that’s how good these models are at conceptualizing the meaning of words.
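To make this a bit more tangible, here’s a tiny sketch using made-up two-dimensional coordinates for the words above (real embeddings have thousands of coordinates, but the idea is the same): similar words sit close together, so a new word like ‘Ring’ at [8, 6] lands right beside the jewelry.

import numpy as np

# Hypothetical 2D "embeddings": the coordinates are invented for
# illustration, grouped so that related words sit near each other.
embeddings = {
    "Apple":    [1, 1], "Banana":   [2, 1], "Pear":     [1, 2],  # fruits
    "Bracelet": [8, 7], "Earrings": [9, 6], "Necklace": [7, 6],  # jewelry
    "Airpods":  [2, 9], "iPhone":   [3, 9], "Macbook":  [2, 8],  # Apple products
}

def nearest_words(vector, k=3):
    # rank every known word by its distance to the new vector
    distances = {
        word: np.linalg.norm(np.array(coords) - np.array(vector))
        for word, coords in embeddings.items()
    }
    return sorted(distances, key=distances.get)[:k]

# 'Ring' at [8, 6] ends up closest to the jewelry cluster
print(nearest_words([8, 6]))  # ['Bracelet', 'Earrings', 'Necklace']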
2) Similarity Methods
Overall, text embedding is the heart of NLP & works towards truly understanding the intention behind a user’s words. But, our end goal is to ensure that the query is understood and matches the responses.
So, if we had this set of questions…
- Where does the bear live?
- Where is the world cup?
- What color is the sky?
- What is an apple?
And, this set of answers…
- The bear lives in the woods
- The world cup is in Qatar
- The sky is blue
- An apple is a fruit
And, we can encode these 8 sentences through text embeddings. Just like before, we can plot the sentences in the plane with 2 coordinates. But, in order to match the query and response, we need to evaluate the similarities between the text embeddings.
This is done through similarity methods — a way to tell if two pieces of text are similar or different.
There’s two common ways to assess similarity:
- Dot Product Similarity
- Cosine Similarity
Both involve comparing two documents or pieces of text and representing the result as a number. Broadly, comparing a text to itself gives the highest number, comparing two similar pieces of text gives a relatively high number, and comparing two different pieces of text gives a small number.
Today, we’re going to focus on cosine similarity, a metric commonly used to measure the similarity between two vectors in a vector space. In general its values range from -1 to 1, but for the text embeddings used here they fall roughly between zero and one: the similarity between a text and itself is always one, and completely unrelated texts score close to zero.
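As a minimal sketch, cosine similarity is just the dot product of two vectors divided by the product of their lengths. Here’s a NumPy version with made-up vectors:

import numpy as np

def cosine_similarity(a, b):
    # dot product of the two vectors, normalized by their lengths
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = [1.0, 2.0, 3.0]  # hypothetical embedding of one sentence
v2 = [1.5, 2.5, 2.0]  # hypothetical embedding of a similar sentence

print(cosine_similarity(v1, v1))  # 1.0 -- any text compared with itself
print(cosine_similarity(v1, v2))  # ~0.94 -- high, because the vectors point the same way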
So, back to our questions & answers.
If we start with our first question, “Where is the world cup?”, it gets compared to all 8 sentences, and each comparison is given a score from zero to one. But, you can’t just output the highest score, because that would be the question itself, with a score of 1. So the system eliminates it and turns to the nearest ‘neighbour’, which falls at a score of ~0.7.
And, this process is repeated for each sentence.
Plotting every pairwise similarity score as a matrix, these properties can be derived:
- In the diagonal of the matrix, you will find all entries filled with the value of 1, indicating that the similarity between each sentence and itself is 1, as expected.
- For the pairs of questions and their corresponding answers, the similarity scores hover around 0.7, which suggests a relatively high degree of similarity between these sentence pairs.
- And, any other comparisons have a value lower than 0.7.
Once the sentence embeddings have been reviewed, the queries will match with their corresponding answers.
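Here’s a small sketch of that matching step, again with made-up two-dimensional embeddings (cosine similarity only cares about the direction of a vector, so each question/answer pair is placed at roughly the same angle). Each question is paired with the answer whose vector scores highest:

import numpy as np

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 2D embeddings for the four questions and four answers
questions = {
    "Where does the bear live?": [1.0, 9.0],
    "Where is the world cup?":   [4.0, 7.0],
    "What color is the sky?":    [7.0, 4.0],
    "What is an apple?":         [9.0, 1.5],
}
answers = {
    "The bear lives in the woods": [1.5, 8.5],
    "The world cup is in Qatar":   [4.5, 7.5],
    "The sky is blue":             [7.5, 4.5],
    "An apple is a fruit":         [8.5, 1.0],
}

for question, q_vec in questions.items():
    # keep the answer whose embedding is most similar to the question's
    best = max(answers, key=lambda ans: cosine_similarity(q_vec, answers[ans]))
    print(f"{question} -> {best}")

In the matrix view described above, this is the same as reading off the highest non-diagonal entry in each question’s row.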
3) Output of Response
And, once the computer understands how the sentence embeddings correspond with one another, it can finally take what it’s learned and output the matching response to the user.
It’s as simple as that.
Activating a Semantic Search Engine With Cohere & Pinecone!
Now, let’s apply all of that theoretical knowledge: we’ll use the Cohere Embed API endpoint to generate language embeddings, and then index those embeddings in the Pinecone vector database for fast and scalable vector search.
Let’s walk through a step-by-step annotation together.
After setting up our environment and downloading the required libraries, we’ll start off by forming embeddings.
Create Embeddings
To do this, we need to initialize our connection to Cohere, through importing the libraries and utilizing our API keys.
import cohere
co = cohere.Client("<YOUR API KEY>")
And, we need a question dataset to work with. For this, we can load the Text Retrieval Conference (TREC) question classification dataset and scale it down to the first 1,000 questions.
from datasets import load_dataset
# load the first 1K rows of the TREC dataset
trec = load_dataset('trec', split='train[:1000]')
Within each sample in the TREC dataset, there are two label features as well as the text feature. Our focus will be on utilizing the text feature. By taking the questions from the text feature, we can pass them to Cohere in order to generate embeddings.
embeds = co.embed(
    texts=trec['text'],
    model='small',
    truncate='LEFT'
).embeddings
We can also check the dimensionality of the returned vectors to ensure everything is working correctly so far, and it is! We get the expected shape of (1000, 1024).
import numpy as np
shape = np.array(embeds).shape
print(shape)
[Out]:
(1000, 1024)
With our embeddings in hand, the next step is to proceed with indexing them in the Pinecone vector database.
Storing Embeddings
To begin, we establish the connection to Pinecone, initializing it for further interactions. Following that, we create a fresh index dedicated to storing the embeddings, which we’ll name “cohere-pinecone-trec.”
During the index creation process, we specify our preference for the cosine similarity metric to align with the embeddings from Cohere. Additionally, we provide the embedding dimensionality we obtained a step earlier.
import pinecone

# initialize connection to pinecone
pinecone.init(api_key="<YOUR API KEY>", environment="<YOUR ENVIRONMENT>")

index_name = 'cohere-pinecone-trec'

# if the index does not exist, we create it
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=shape[1],
        metric='cosine'
    )

# connect to index
index = pinecone.Index(index_name)
With the index prepared, we can populate it with our embeddings. In our case, we will store the original text of the embeddings as part of the metadata, which enables us to associate the embeddings with their corresponding textual content for future reference and analysis.
batch_size = 128

ids = [str(i) for i in range(shape[0])]
# create list of metadata dictionaries
meta = [{'text': text} for text in trec['text']]

# create list of (id, vector, metadata) tuples to be upserted
to_upsert = list(zip(ids, embeds, meta))

for i in range(0, shape[0], batch_size):
    i_end = min(i+batch_size, shape[0])
    index.upsert(vectors=to_upsert[i:i_end])

# let's view the index statistics
index.describe_index_stats()
[Out]:
{'dimension': 1024,
'index_fullness': 0.0,
'namespaces': {'': {'vector_count': 1000}}}
The output reveals that we have successfully populated a 1024-dimensional index with a total of 1000 embeddings. The index_fullness metric sits at 0.0 simply because 1,000 vectors use only a tiny fraction of the index’s capacity, not because the index is empty.
Semantic Search
Now that our vectors are indexed, we can move on to performing several search queries.
To conduct a search, we start by embedding our query using Cohere, which generates a vector representation. From there, we use this returned vector to search within the Pinecone vector database we’ve set up.
query = "What caused the 1929 Great Depression?"
# create the query embedding
xq = co.embed(
    texts=[query],
    model='small',
    truncate='LEFT'
).embeddings
print(np.array(xq).shape)
# query, returning the top 5 most similar results
res = index.query(xq, top_k=5, include_metadata=True)
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
[OUT]:
0.83: Why did the world enter a global depression in 1929 ?
0.75: When was `` the Great Depression '' ?
0.50: What crop failure caused the Irish Famine ?
0.34: What war did the Wanna-Go-Home Riots occur after ?
0.34: What were popular songs and types of songs in the 1920s ?
Here, the output showcases the questions most related to the query, ranked in order by their similarity scores.
The paraphrased version of the query receives a high rating of 0.83, indicating a strong degree of similarity. On the other hand, a less relevant question related to entertainment in the 1920s has a lower score of 0.34, signifying a reduced level of similarity.
It’s going great so far. But now, we’re going to test the model and see whether it can look past an imprecise word in the query using its contextual understanding of the event.
So, let’s change ‘depression’ to ‘recession.’
query = "What was the cause of the major recession in the early 20th century?"
# create the query embedding
xq = co.embed(
    texts=[query],
    model='small',
    truncate='LEFT'
).embeddings
# query, returning the top 5 most similar results
res = index.query(xq, top_k=5, include_metadata=True)
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
[OUT]:
0.66: Why did the world enter a global depression in 1929 ?
0.61: When was `` the Great Depression '' ?
0.43: What crop failure caused the Irish Famine ?
0.43: What are some of the significant historical events of the 1990s ?
0.37: What were popular songs and types of songs in the 1920s ?
As you can see, the similarity scores for this set of responses are relatively lower compared to the previous ones. But, despite the lower scores, the computer comprehends that the user intends to refer to the Great Depression.
So, it generates a similar range of questions as before, maintaining consistency in understanding the user’s intended meaning.
Pretty cool, if you ask me.
Now, we’ll do one last search. But instead of giving the computer keywords to bounce off of, we’ll use the definition of the Great Depression to truly test its understanding.
query = "Why was there a long-term economic downturn in the early 20th century?"
# create the query embedding
xq = co.embed(
    texts=[query],
    model='small',
    truncate='LEFT'
).embeddings
# query, returning the top 10 most similar results
res = index.query(xq, top_k=10, include_metadata=True)
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
[OUT]:
0.71: Why did the world enter a global depression in 1929 ?
0.62: When was `` the Great Depression '' ?
0.40: What crop failure caused the Irish Famine ?
0.38: What are some of the significant historical events of the 1990s ?
0.38: When did the Dow first reach ?
0.35: What were popular songs and types of songs in the 1920s ?
0.33: What was the education system in the 1960 's ?
0.32: Give a reason for American Indians oftentimes dropping out of school .
0.31: What war did the Wanna-Go-Home Riots occur after ?
0.30: What historical event happened in Dogtown in 1899 ?
And, our highest similarity score is 0.71. That’s amazing: the computer recognizes the event based on its definition and timing, and outputs relevant questions.
These examples clearly demonstrate the effectiveness of a semantic search pipeline for accurately identifying the meaning behind each query. By leveraging these embeddings alongside Pinecone, we can efficiently retrieve the most similar questions from the pre-indexed TREC dataset.
Closing Thoughts
Wow, this is all so so cool.
The future of semantic search is expected to bring improved understanding of user intent, multimodal search incorporating various input types, advances in natural language processing (NLP), integration with knowledge graphs, domain-specific search, and continued progress in machine learning. All of these developments will result in more personalized, accurate, and comprehensive search experiences, completely changing how we interact with information and access knowledge.
What started as a mere rabbit hole into a cool-sounding algorithm turned into a project building a working system and a journey into real technical depth in a new area. Semantic search makes me so very excited for the future of Natural Language Processing, and I cannot wait to see what’s next in this space as inspirational companies like Cohere continue working towards making an impact.
Hey, I’m Priyal, a 17-year-old whose ambition revolves around solving complex problems to meaningfully contribute to the world. If you enjoyed my article or learned something new, feel free to subscribe to my monthly newsletter to keep up with my progress in quantum computing & AI development, along with an insight into everything I’ve been up to. You can also connect with me on LinkedIn and follow my Medium for more content!
Thank you so much for reading. I appreciate it.
— Priyal