Customizing ‘Relevancy’ In Elasticsearch - Part 1

Data Pipeline and Analytics | May 22, 2018

4 min read

Customizing ‘Relevancy’ in Elasticsearch – Part 1

I often come across a debate amongst techies about ‘speed vs relevance’ while they are building search/ retrieval information applications. It is an important consideration for this kind of an application. However, there is no single answer to it. I am going to share how I approach this challenge.
I start by asking 2 questions:

How long do you wait on a page if you are searching for something and waiting for the result? Few secs, few milliseconds? No, we are talking about micro or nanoseconds.
Do you get upset when your search query provides result that are not relevant to your query? Ofcourse. So, what we do. We try to adjust our search term/word to get the relevant results/answers.

Relevancy has become a really important consideration when we are searching or reading up on the internet or any app especially when the data size is huge.
Relevancy ranking is the process of sorting the document results so that those documents which are most likely to be relevant to your query are shown right at the top.
But this process needs to be defined in order to appear when one is looking for it. And, this is how it can be defined:

Typo: Will rank higher when a word doesn’t contain a typo or a spelling mistake
Words: Will rank higher when an object matches the words typed in by a user – if the query contains more than one word
Proximity: Will rank higher when words are having no unwanted conjunctions. For eg: If you’re looking up George Clooney then people won’t type George ‘and’ Clooney or “George ‘some word’ Clooney. Word proximity matters in ranking.
Attribute: Will rank higher when the match is a more important attribute (Title, Description).

Say: There is a big article and it has titles, description, subtitles, footnotes etc., then one can define the key/important words and base the ranking accordingly. Or an algorithm may work on its own and pick random words based on the Title.

Words position: Will rank higher when the accurate words are at the beginning of an attribute
Exact: Will rank higher when words match exactly without any suffix (Algolia natively handles prefix search)

There are a few algorithms designed to find relevant results. Most commonly used ones are:

TF-IDF
Vector Space Model
Boolean Model
K Nearest Neighbor (KNN)
Custom Build

Again, the documents which are most relevant are not necessarily those which are most useful to display on the first page of search results. For example, two duplicate documents might be individually considered quite relevant, but it is only useful to display one of them. A measure called “maximal marginal relevance” (MMR) has been proposed to overcome this shortcoming. It considers the relevance of each document only in terms of how much new information it brings given the previous results.
Apache Lucene, on which the Elasticsearch engine is built, uses TF-IDF, Vector Space Model as well as Boolean Model. One can actually build or change their own method of searching based on Elasticsearch.
Let’s find out how TF-IDF works?
TF- Term Frequency. Number of occurrence of the word in given document/s
IDF- Inverse Document Frequency. Number of given document/s has that given word
TF: One can find the value of given term (word) in given document/s like tf (t,d).This means that number of times the term t occurs in document d.
Let’s take an example using 2 sentences as below:
I have document d1: This is called Cricket.
And I have document d2: This is another game.

Terms d1	Frequency	Terms d2	Frequency
this	1	this	1
is	1	is	1
called	2	another	2
cricket	1	game	3

I need to find out weightage of each word in the document so that it is easy to provide a result. By tf formula, suppose I have to find weightage of the word “this” which appears only once in both documents. But the number of words in d2 is more.
Tf{“this”,d1}= 1/5= 0.2
Tf{“this”,d2}=1/7 ~0.14

IDF, in this case, is same for both documents as there are 2 number of documents and term appears twice so
Idf{“this”,D}=log{2/2}=0

So tf-idf in this case is
Tfidf{“this”,d1}=0.2×0=0
Tfidf{“this”,d2}=0.14×0=0

So this is not a very interesting or relevant term in given documents.
Let’s look at the term ‘game’ here.
Tf{“game”,d1}=0/5=0
Tf{“game”,d2}=3/7=~0.429
Idf{“game”D}=log(2/1)=0.301

Finally,
Tfidf{“game”,d1}=0x0.301=0
Tfidf{“game”,d2}=0.429×0.301=~0.13

So, the word ‘game’ is the most relevant word in document d2.
And, that’s how TF-IDF algorithm works.
In case of ElasticSearch engine which is based on Apache Lucene, uses “_score” filed as metadata which gives the weightage as “_score” on each document. This value “_score” calculated based on tfidf and some other algorithms helps provide relevant results. We can customize this search results by providing other mechanisms like ‘Boosting’.
I’ll be taking this up in my next blog – which will cover features/API to be used in ElasticSearch to achieve relevant search results and other popular algorithms. Watch out for the June issue!
Ajit Gadge I Senior Database Consultant

Ajit brings over 16 years of solid experience in solution architecting and implementation. He has successfully delivered solutions on various database technologies including PostgreSQL, MySQL, SQLServer. His derives his strength from his passion for database technologies and love of pre-sales engagement with customers. Ajit has successfully engaged with customers from across the globe including South East Asia, USA and India. His commitment and ability to find solutions has earned him great respect from customers. Prior to Ashnik he has worked with EnterpriseDB for 7 years and helped it grow in South East Asia and India

What Does "Database High Availability" Really Mean?

Jul 30, 2020 | 4 MIN READ

Elastic Stack Sizing consideration and architecture

Feb 13, 2019 | 7 MIN READ

Relevance Search with Elastic – Part 2

Jul 13, 2018 | 6 MIN READ

Cookie	Duration	Description
cookielawinfo-checbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Bolt.new, Bolt.DIY & DeepSeek-V3: AI Transforming DevOps from Development to Deployment - Watch Now!

Revolutionize Your CX with
Unified Observability

CloudOps Automation tool for Infrastructure monitoring and deployment.

Indonesia’s top digital credit service provider leverages Ashnik’s PostgreSQL expertise and services

Revolutionize Your CX with Unified Observability

Automate and monitor your PostgreSQL with ease.

The CloudOps Automation Tool for easy Infrastructure deployment and monitoring

Maximize Potential of Your Data with Streaming Data Pipeline Architecture

End-to-End Traceability and Unified Observability for the Modern Infrastructure

Watch: How to auto-scale in deployments using Kubernetes(K8s): A Technical Demo

Customizing ‘Relevancy’ in Elasticsearch – Part 1

Read More

What Does "Database High Availability" Really Mean?

Elastic Stack Sizing consideration and architecture

Relevance Search with Elastic – Part 2

Products