4 concepts to get started with Elasticsearch

Are you bummed that the only database you can query is SQL? Don’t fret, as you too can soon be developing with Elasticsearch. It’s true. Let’s review a case study from the Payment Screening team to see how.

Understanding Real-Time Payment Screening

Our team is responsible for flagging suspicious transactions in real-time. Is the beneficiary of the payment sanctioned? Does this look like a bribe to a politician? We also screen the reference text field in payments for names, or even particular words, that warrant human review. Depending on the search configuration and data, this type of screening can yield many false positive matches.

Challenges with Single-Word Aliases

The issue stems from arbitrary words in the reference text field matching persons with single-word aliases. For example, Ali Saddam Hussein al Tikriti (the son of Saddam Hussein) has the alias “Hassan”. Often such aliases are marked as “weak”, which enables us to filter out low-quality matches. But this isn’t foolproof. “Hassan” is marked as weak in some sanction lists, but non-weak in others.

Identifying Non-Weak Single-Word Aliases

The Payment Screening team was asked how many non-weak single-word aliases exist. Then introduce a new screening option to filter these out. Elasticsearch powers reference text screening. So responding to the problem required a working knowledge of this database. However such knowledge only requires you to understand four core concepts. We will see how our team used these to introduce the remove_one_word_person_alias search option for Payment Screening. After, you will be able to debug issues, answer support requests, or build new features for any app using Elasticsearch.

Getting Started with Elasticsearch: Setting Up Your Environment

We will be querying a local version of the Elasticsearch cluster populated with fake data. Clone the repository and build the docker service yourself. Then run these queries with me to see the responses.

Concept 1: The Index Mapping

What is it? It’s the schema of documents stored in Elasticsearch.

How did this help the team? We used the mapping to check if Elasticsearch stored the information needed to query for one-word aliases.

Elasticsearch is a document-oriented database. Collections of documents are stored in an “index”. Confusingly, this term doesn't relate to the traditional concept of database indexes. The index mapping describes the schema of the documents it stores.

The mapping shows us the document fields and data types. This will guide what queries we can perform later. Use this request to view all the cluster’s mappings (also available in the repository):

curl http://localhost:9200/_mapping

Which returns...

{
  "entities": {
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "addresses": {
          "type": "text"
        },
        "aliases": {
          "type": "nested",
          "properties": {
            "aml_type": {
              "type": "keyword"
            },
            "is_weak": {
              "type": "boolean"
            },
            "name": {
              "type": "keyword"
            },
            "word_count": {
              "type": "integer"
            }
          }
        },
        "birthdate": {
          "type": "date"
        },
        "country_code": {
          "type": "keyword"
        },
        "created_at": {
          "type": "date"
        },
        "primary_name": {
          "type": "keyword"
        },
        "type": {
          "type": "keyword"
        }
      }
    }
  }
}

The entities are stored in the entities index, so that’s the mapping we’re interested in. Querying the Search API without a request body returns the first 10 documents in the index. We can use the next request to see example values for the document fields. (Later, we will construct more specific queries.)

curl http://localhost:9200/entities/_search

The shape of the entity objects looks like this:

{
    "primary_name": "Vladimir Putin",
    "created_at": "28-03-2014",
    "birthdate": "07-10-1952",
    "type": "person",
    "country_code": "RU",
    "addresses": ["Vosdvizhenka str. 1. Moscow 121019. Russia"]
    "aliases": [
        {
            "name": "Vladimir Putin",
            "word_count": 2,
            "is_weak": false,
            "aml_type": "sanction"
        },
        {
            "name": "Putin Vladimir Gennad evich",
            "word_count": 4,
            "is_weak": false,
            "aml_type": "sanction"
        }
    ]
}

Our team needed to filter for person entities that have at least one non-weak one-word alias. What document fields could help us? Well, the aliases for a given entity are stored in a nested aliases field. And the name, type, word_count, is_weak fields seem relevant. But what are the data types we’re storing?

The number types are self-explanatory (long, integer, float, etc.). The keyword type is a literal string, while text is unstructured full-text data. When Elasticsearch indexes documents it also transforms (“analyses” in Elasticsearch parlance) text data. These configurable transformations improve the usefulness of full-text search. Lowercasing all text is an obvious example. And stemming helps matching between conjugations (i.e. run == runner == running).

Because our reference text queries don’t use full-text search, I won't discuss it further. If you want to learn more, read the Elasticsearch documentation. It provides all the details of text analysis and the full-text Query DSL.

Concept 2: The Term Query

What is it? The term query is used to match against exact strings, booleans and numbers.

How did this help the team? We used a term query to match against fields like name, type, word_count, and is_weak.

Time for your first Elasticsearch query. Wow, congratulations! A general syntax for the Search API's request body is:

{
  "query": {
    "<query type>": {
      "<field name>": <value>
    }
  },
  "size": <n docs returned>
}

We want to find the count of non-weak one-word person aliases. First, let us filter documents by the entity type. The index mapping showed a type keyword field that we can use:

curl http://localhost:9200/entities/_search \
--header 'Content-Type: application/json' \
--data '{
  "query": {
    "term": {
      "type": "person"
    }
  },
  "size": 1
}'

That’s a start, but we need to match against multiple fields. Let’s move to our next concept…

Concept 3: The Bool Query

What is it? The bool query composes subqueries. It can be used to match several terms and provide more complex conditional logic.

How did this help the team? We used a bool with term queries to match against all our required fields (name, type, word_count, and is_weak).

The bool query isn’t a true query itself. It composes other queries to provide conditional logic (i.e. find documents that match this query and that query, or this other query). It comprises four fields, each of which are optional. Each field can accept an array of queries (also referred to as clauses) to match. These fields are:

must - Equivalent to AND. Documents must match all queries provided. Query matches contribute to a document’s relevancy score.

filter - Equivalent to AND. Like must, documents need to match all queries provided. However, query matches won’t contribute to the relevancy score. Prefer filter over must if the ordering of returned documents isn't important. It will improve search performance.

must_not - Equivalent to NOT. A document can not match any query provided.

should - Equivalent to OR. A document must match at least one of the provided queries.

A general syntax looks like:

{
  "query": {
    "bool": {
      "must": [query1, query2],
      "filter": [query3],
      "must_not": [query4],
      "should": [query5]
    }
}

The bool query is the simplest way to implement negation logic into a search. For example, let’s filter out person entities that aren’t from the United States. The index mapping shows there is a country_code field we can take advantage of. Our search uses the bool query to compose multiple term queries:

curl http://localhost:9200/entities/_search \
--header 'Content-Type: application/json' \
--data '{
  "query": {
    "bool": {
      "filter": [{
        "term": { "type": "person" }
      }],
      "must_not": [{
        "term": { "country_code": "US" }
      }]
    }
  },
  "size": 1
}'

Unfortunately, we can’t naively use a bool query on all the fields we highlighted earlier. Most are in the nested field aliases. “Nested” means something specific in Elasticsearch… Next concept! 👇

Concept 4: Nested Fields

What is it? Nested fields are separately indexed documents within a document. Analogous to a one-to-many relation between two SQL tables.

How did this help the team? The nested alias documents are available to the Elasticsearch Query DSL. Therefore, we could use a nested query to filter aliases by type, word_count, and is_weak.

A stored document can contain a nested array of objects on a field. Like the aliases field below:

{
    "primary_name": "Vladimir Putin",
    "country_code": "RU",
    "aliases": [
        {
            "name": "Vladimir Putin",
            "word_count": 2,
            "is_weak": false,
            ...
        },
        {
            "name": "Putin Vladimir Gennad evich",
            "word_count": 4,
            "is_weak": false,
            ...
        },
    ],
    ...
}

Typically, nested objects are not indexed by Elasticsearch. As a result, they cannot be searched by the Query DSL. A nested field like aliases is indexed and queryable. But it requires a special, awkward query syntax:

{
  "query": {
    "nested": {
      "path": <field to perform query on>,
      "inner_hits" {}  // to control the output from the nested documents
      "query": <query to apply>
    }
}

Putting it all together

Our team wanted to create a query to return all persons that have at least one non-weak one-word alias. How did the four concepts above help us?

The index mapping showed the relevant fields for us to query.
A term query can filter for any one of these relevant fields.
A bool query can combine multiple of these term queries.
A nested query can search fields within the nested aliases field.

Combining this knowledge, the query to return entities with at least one non-weak one-word aliases looks like:

curl http://localhost:9200/entities/_search \
--header 'Content-Type: application/json' \
--data '{
  "_source": false,
  "size": 10,
  "query": {
    "bool": {
      "filter": [
        {
          "term": { "type": "person" }
        },
        {
          "nested": {
            "path": "aliases",
            "inner_hits": {"_source": ["aliases.name"]},
            "query": {
              "bool": {
                "filter": [
                {
                  "term": { "aliases.word_count": 1 }
                }],
                "must_not": [{
                  "term": {
                    "aliases.is_weak": true
                  }
                }]
              }
            }
          }
        }
      ]
    }
  }
}'

A few points to note:

We used multiple bool queries to help us match both nested and non-nested fields. It might look confusing, but it isn’t really. It only contains the basic queries we’ve already learnt.
"size": 10 will show the results from only the first 10 entities.
"_source": false will stop Elasticsearch from returning the full entity document.
"inner_hits": {"_source": ["aliases.name"]} will return only the name from the nested aliases that match the query.

Additional search information is returned in the hits field of the response. hits.total.value is the total document count that matched the query.

    
  ...
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 0,
    "hits": [
      {
      ...

So, how many entities matched the query above? The hits.total.value suggests exactly 10,000 🤔 This is not the true value, but the Search API only returns the first 10,000 results. There are two options to see the true value. The first is to update the query to track the total hits. The second option is to perform an aggregation on the query results. But that’s a more advanced topic. For now, you can track the total hits, tweak your query as needed, and count the returned documents.

From our investigation last year, we found there were tens of thousands of entities with at least one non-weak one-word alias 😣 We responded to our Product Manager and moved forward introducing the new remove_one_word_person_alias. Instead of returning all the one-word aliases, this search option filters them out. I’ll leave it to you to figure out what that query should look like 😉

A Few Gotchas

Here are some final points to note before you rush out to develop the next Elasticsearch-powered super app.

Don’t use a term query on a text field! When Elasticsearch analyses a text field, it may normalise and truncate the terms being indexed in your query. The search results will be different to what you expect. There's a whole section in the documentation about this phenomenon.

nested queries return the full document, even if only one nested object matched. Aggregations are run on all documents returned from a query. If you’re aggregating on a nested field, you will need to explicitly write an nested aggregation. Otherwise, you will be including nested objects in your aggregation that you thought were filtered out by your query.

The Search API really does only return the first 10,000 results. Pagination using a combination of from and size will throw an exception if trying to return results outside this window. For example, the query request body {"from": 10000, "size": 10} will return:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Result window is too large, from + size must be less than or equal to: [10000] but was [10010]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
      }
    ],

And that's it! Good luck and happy coding 🙂

Author: Callum Lamont