Things learned from implementing Elasticsearch


Introduction

Recently, I undertook a project to transition an Azure App Service from ‘Azure Cognitive Search’ to Elasticsearch. This was driven by several factors:

  • Improved Performance
  • Enhanced Backup/Restore Functionality (ISO Delegation)
  • Cost Efficiency

Let’s start with a couple of things I’ve learned:

NuGet support

Elasticsearch is currently migrating its .NET tooling from the NEST NuGet package to the Elasticsearch.Net package. This means that Elasticsearch.Net doesn’t have all of NEST’s functionality (yet).

Be careful which one you pick. Personally, I used NEST for the following reasons:

  1. Serializing enums to strings using Nest.JsonNetSerializer is easy, similar to using a JsonConverter in Newtonsoft.Json (see the sketch after this list).

Note: Since Elasticsearch is a ‘Vector Search’ database, this didn’t seem to be a performance issue.

  2. NEST has all functionality at a higher level, and I didn’t want to be bottlenecked by anything (a precaution). Elasticsearch.Net is considered a low-level client (e.g. compare how the two handle schema generation).
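
For illustration, here’s a minimal setup sketch (the endpoint URL, default index and the Product POCO are my own hypothetical examples, not taken from the project): it plugs Nest.JsonNetSerializer into the NEST client so that Newtonsoft attributes such as StringEnumConverter apply when documents are serialized.

using System;
using Elasticsearch.Net;
using Nest;
using Nest.JsonNetSerializer;
using Newtonsoft.Json;
using Newtonsoft.Json.Converters;

public enum ProductStatus { Active, Inactive }

public class Product
{
    public string Name { get; set; }

    // The Newtonsoft converter is honoured by Nest.JsonNetSerializer,
    // so the enum is indexed as "Active" instead of 0.
    [JsonConverter(typeof(StringEnumConverter))]
    public ProductStatus Status { get; set; }

    public DateTime CreatedOn { get; set; }
}

public static class ElasticClientFactory
{
    public static ElasticClient Create()
    {
        var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));

        // Swap NEST's default source serializer for the Json.NET-based one.
        var settings = new ConnectionSettings(pool, JsonNetSerializer.Default)
            .DefaultIndex("ecom_products");

        return new ElasticClient(settings);
    }
}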

Schema generation

NEST provides auto-mapping functionality for schema generation, while the Elasticsearch.Net library only exposes the raw mapping APIs.

I defined some properties manually (e.g. the version) and let the other properties be auto-mapped, as sketched below.
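
As a sketch of what that can look like with NEST (reusing the hypothetical Product class and client factory from the earlier sketch; the overridden fields are my own example):

var client = ElasticClientFactory.Create();

var createResponse = client.Indices.Create("ecom_products", c => c
    .Map<Product>(m => m
        .AutoMap()                                    // infer mappings from the POCO
        .Properties(p => p
            .Keyword(k => k.Name(n => n.Name))        // override: exact-match keyword
            .Date(d => d.Name(n => n.CreatedOn)))));  // override: explicit date field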

Indices vs hot, warm & cold storage

Data in Elasticsearch is organized per index.

E.g. take an index named ’ecom_orders’: all order data is stored in the ’ecom_orders’ index. Since it’s the only index being queried and/or written to, this index remains in hot storage.

If we can split the orders per year, we could also split the index per year, which gives us the following indices and storage behavior:

  • ecom_orders_2024 - hot storage => being written to
  • ecom_orders_2023 - warm storage => being queried
  • ecom_orders_2022 - cold storage => index is no longer being updated and seldom queried

Choose a logical structure for your indices so costs can be optimized (cold storage is much cheaper). Additionally, it’s easier to implement data policies (e.g. for GDPR reasons) and delete data based on its index (e.g. ecom_orders_2022 may be deleted after 360 days of not being queried).

If possible, target the right indices when querying. E.g.:

  • To query solely the orders from 2024, query the index “ecom_orders_2024”.
  • To query all orders across the years, query the index pattern “ecom_orders_*”, which expands to the indices ’ecom_orders_2022’, ’ecom_orders_2023’ and ’ecom_orders_2024’.
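
A minimal NEST sketch of both cases (assuming a hypothetical Order POCO mapped to these indices, plus the client from the earlier setup sketch):

// Only the hot 2024 index.
var orders2024 = client.Search<Order>(s => s
    .Index("ecom_orders_2024")
    .Query(q => q.MatchAll()));

// The wildcard pattern expands to ecom_orders_2022, _2023 and _2024.
var allOrders = client.Search<Order>(s => s
    .Index("ecom_orders_*")
    .Query(q => q.MatchAll()));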

Read more about it here

Querying

Consider the following entity:

{
  "product": {
    "name": "Product 1",
    "status": "Active",
    "createdOn": "2023-05-07T05:03:02.020"
  }
}

In a traditional SQL query, to retrieve all active products since 2023, you might use the following logic:

SELECT * FROM Product 
WHERE status = 'Active' AND createdOn >= '2023-01-01'
ORDER BY createdOn DESC

However, Elasticsearch operates differently. Here’s the equivalent Elasticsearch query:

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "Active" }}
      ],
      "must": [
        { "range": { "createdOn": { "gte": "2023-01-01" }}}
      ]
    }
  },
  "sort": [
    { "createdOn": { "order": "desc" }}
  ]
}
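
With NEST, the same query could look roughly like this (a sketch assuming the hypothetical Product POCO from the earlier setup):

var response = client.Search<Product>(s => s
    .Query(q => q
        .Bool(b => b
            .Filter(f => f
                .Term(t => t.Field(p => p.Status).Value("Active")))    // cached filter context
            .Must(m => m
                .DateRange(r => r
                    .Field(p => p.CreatedOn)
                    .GreaterThanOrEquals("2023-01-01")))))              // scored query context
    .Sort(so => so
        .Descending(p => p.CreatedOn)));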

Let’s dive into the three main options for querying: ‘Filter’, ‘Must’, and ‘Should’:

Querying: ‘Filter context’

"filter": [
  { "term": { "status": "Active" }}
]

The ‘filter context’ is suitable for frequent filtering and is automatically cached by Elasticsearch to improve performance. In our example, we’re interested in ‘Active’ products, and this context ensures efficient querying.

Querying: ‘Must context’

"must": [
  { "range": { "createdOn": { "gte": "2023-01-01" }}}
]

The ‘must’ context runs in query context: its clauses contribute to the relevance score and the results aren’t cached. It’s ideal for conditions that change frequently, such as search parameters.

Querying: ‘Should context’

At first, I found the ‘should context’ misleading, since it only affects scoring (and thus ordering) behavior.

Let’s consider a new query with should clauses:

"should": [
  { "match": { "name": "premium" }},
  { "term": { "status": "Active" }},
  { "range": { "createdOn": { "gte": "2023-01-01" }}}
]

Here’s what could be returned:

[
  {
    "id": "123456",
    "name": "Premium Laptop",
    "status": "Active",
    "createdOn": "2023-02-15",
    "score": 0.9
  },
  {
    "id": "789012",
    "name": "Premium Headphones",
    "status": "Active",
    "createdOn": "2023-05-20",
    "score": 0.9
  },
  {
    "id": "098765",
    "name": "Premium Old Camera",
    "status": "Active",
    "createdOn": "2022-10-10",
    "score": 0.66
  }
]

Pay close attention to the following:

[
  //...
  {
    //...
    "createdOn": "2022-10-10",
    "score": 0.66
  }
]

There is a result from before 2023, and it’s ranked lower. The score here is 0.66 (roughly: 1 of the 3 clauses failed to match). But… the result is still returned.

So ‘should’ only affects the scoring mechanism within Elasticsearch. The more clauses a document matches, the better its score and the higher it ranks!

Note: To drop lower-ranked results from the result set, you can fine-tune ‘minimum_should_match’ in your query, as sketched below.
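
For example, here’s a NEST sketch (same hypothetical Product POCO) where a document has to match at least two of the three should clauses to be returned at all:

var premiumProducts = client.Search<Product>(s => s
    .Query(q => q
        .Bool(b => b
            .Should(
                sh => sh.Match(ma => ma.Field(p => p.Name).Query("premium")),
                sh => sh.Term(t => t.Field(p => p.Status).Value("Active")),
                sh => sh.DateRange(r => r
                    .Field(p => p.CreatedOn)
                    .GreaterThanOrEquals("2023-01-01")))
            // Drop hits that match fewer than 2 of the should clauses.
            .MinimumShouldMatch(2))));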

Optimistic concurrency

Elasticsearch provides two ways to handle optimistic concurrency:

  1. Version (built-in or external)

The built-in version is exposed through a _version property; return it by querying your entities with version=true.

If you want to enable external versioning, you’ll need to add a version property to your entity and write your data with the parameter “version_type=external” (see the sketch after this list). More info here

  2. seq_no and primary_term

More info here, but I didn’t go down this road.
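
Here’s a rough sketch of the versioning option with NEST (the Order type, its ExternalVersion property and the index name are hypothetical):

// External versioning: the version travels with the document, e.g. coming
// from the source system. Indexing only succeeds if the supplied version is
// higher than the version Elasticsearch already has stored.
var order = new Order { Id = "123456", ExternalVersion = 7 };

var indexResponse = client.Index(order, i => i
    .Index("ecom_orders_2024")
    .Id(order.Id)
    .Version(order.ExternalVersion)
    .VersionType(VersionType.External));

// Built-in versioning: ask Elasticsearch to return _version with every hit
// (the version=true parameter mentioned above).
var withVersions = client.Search<Order>(s => s
    .Index("ecom_orders_2024")
    .Version(true)
    .Query(q => q.MatchAll()));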

Final Notes

I’ll update my ramblings here when I discover something new ;)
