If you can’t find it, it might as well not exist — that’s the takeaway for all organizations providing content or products to their users. It’s become a practical requirement to allow ways to search for and discover data records, or “documents,” amongst a large collection.
Thus, in this highly competitive world, applications must have search functionality so users can find what they are looking for as easily as possible, or they will assume you don’t have it and visit your competitors.
Implementing a great search feature deserves attention to detail. How lenient is the system to typos or voice misinterpretations? Can it provide context-sensitive guided/faceted navigation? Are matches highlighted for clear visibility? And ultimately, are the results relevant to the user’s query?
Beyond end-user keyword search, we also want to be able to perform analytics on our query and system logs so that we can make the right business decisions. Analytics also encompasses security auditing and anomaly detection. Hence, being able to analyze data easily is another important aspect of building any software today.
What is Elasticsearch?
The Elasticsearch search engine provides powerful, scalable search and analytics capabilities for all types of data, including structured and unstructured text, numerics, vectors, and geospatial shapes and coordinates. It is used for full-text search, structured search, analytics, and many forms of data exploration. The code itself lives within an open-source project, and the company behind it provides a fully supported commercial offering for on-premises deployment or SaaS hosting.
Built upon the well-known Apache Lucene search engine library, Elasticsearch can quickly return search results over large volumes of data. Lucene indexes content primarily into an inverted-index structure, which makes matching queries to content fast. An inverted index is built by tokenizing text into terms (usually single words) and arranging those terms in a lexicographically ordered data structure, as in the diagram below.
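To make the idea concrete, here is a minimal Python sketch of an inverted index, assuming naive lowercasing and whitespace tokenization. It only illustrates the structure; Lucene's real implementation also stores term positions, offsets, and statistics used for scoring.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox",
    2: "Quick brown foxes leap",
}

# Map each term to the set of document IDs that contain it.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():  # naive tokenization: lowercase + split on whitespace
        inverted_index[term].add(doc_id)

# The term dictionary is kept in lexicographic order, so a query term can be
# located quickly and its posting list of matching documents retrieved.
for term in sorted(inverted_index):
    print(term, sorted(inverted_index[term]))
# brown [1, 2]
# fox [1]
# foxes [2]
# ...
```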
Lucene, and thus Elasticsearch, supports the majority of written/textual languages and many query types, including fuzzy and partial matching, boolean operators, phrases and proximity, numerical ranges, and geospatial proximity. A key differentiating feature of Elasticsearch, again thanks to its Lucene heart, is relevancy ranking. Results are scored based on how well they match the query, which may also include boosting criteria to provide the best documents for the query, user, and context.
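As an illustration of a few of these query types, the snippet below sends a search request over the REST API. It assumes a local, security-disabled cluster, and the `products` index with its `title` and `price` fields is invented for the example. A fuzzy `match` clause tolerates the typo in "wirless", a `range` filter restricts the price, and hits come back ranked by relevancy score.

```python
import requests

# Assumption: Elasticsearch running locally without security enabled;
# the "products" index and its fields are hypothetical.
response = requests.post(
    "http://localhost:9200/products/_search",
    json={
        "query": {
            "bool": {
                "must": [
                    # Fuzzy matching tolerates the misspelling "wirless".
                    {"match": {"title": {"query": "wirless headphones",
                                         "fuzziness": "AUTO"}}}
                ],
                "filter": [
                    # Numerical range clause; filters restrict results without affecting the score.
                    {"range": {"price": {"lte": 200}}}
                ],
            }
        }
    },
)

# Hits are returned in descending order of relevancy score.
for hit in response.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```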
Relevancy scoring includes content-based factors such as term frequency (how many times the word occurs in a product description, for example) and document frequency (how many products in your catalog contain the query term). The rarer a term is across the collection, the more weight it carries; the more common it is, the less. The length of the matching field is also a content-based scoring factor: a match in a shorter (title-like) field carries more weight than the same word in a longer (description-like) field.
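Elasticsearch's default similarity is BM25, which refines these ideas. The sketch below is a simplified TF-IDF-style formula, not the exact BM25 math; it is only meant to show how the three factors above interact.

```python
import math

def illustrative_score(term_freq, doc_freq, total_docs, field_len, avg_field_len):
    """Simplified TF-IDF-style score; NOT Elasticsearch's exact BM25 formula."""
    tf = math.sqrt(term_freq)                     # more occurrences help, with diminishing returns
    idf = math.log(1 + (total_docs - doc_freq + 0.5) / (doc_freq + 0.5))  # rarer terms weigh more
    norm = 1 / math.sqrt(field_len / avg_field_len)  # matches in shorter fields weigh more
    return tf * idf * norm

# The same single match scores higher in a 5-term title than in an 80-term description.
print(illustrative_score(term_freq=1, doc_freq=10, total_docs=1000, field_len=5,  avg_field_len=20))
print(illustrative_score(term_freq=1, doc_freq=10, total_docs=1000, field_len=80, avg_field_len=20))
```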
The Elasticsearch ecosystem includes other complementary tools: Logstash, Kibana, and Beats. These tools, known as the Elastic Stack and originally called the ELK stack (Elasticsearch, Logstash, and Kibana), provide a powerful combination of tools for ingesting and visualizing large volumes of data.
Logstash is used to collect, parse, transform, enrich, and index logs from various sources, including system logs, web server logs, application logs, and event logs.
Kibana is an open-source data visualization tool that lets you interact with data in Elasticsearch through charts, graphs, and maps, which can be combined into custom dashboards to help you understand complex datasets.
Beats are lightweight data-shipping agents installed on your servers, whereas Logstash offers a more sophisticated framework of input, filter, and output plugins.
Why use Elasticsearch?
Developers have flocked to Elasticsearch because it pairs the power of Lucene with a scalable, distributed infrastructure and a developer-centric API. JSON is the ubiquitous format for all interactions, and a robust ecosystem of client-side libraries is available.
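The snippet below shows what "JSON for all interactions" looks like in practice: a document is indexed and fetched back over the REST API with plain HTTP calls. It assumes a local, security-disabled cluster, and the index name and fields are invented for the example; in a real application you would more likely use one of the official client libraries.

```python
import requests

# Hypothetical document for a hypothetical "products" index.
doc = {"title": "Wireless Headphones", "price": 149.99, "in_stock": True}

# Index (create or overwrite) the document under ID 1; the request body is JSON.
requests.put("http://localhost:9200/products/_doc/1", json=doc)

# Fetch it back by ID; the response is JSON too.
print(requests.get("http://localhost:9200/products/_doc/1").json()["_source"])
```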
Along with its developer-friendly APIs, Elasticsearch defaults to accepting and indexing your data without requiring up-front schema design, making it easy to get started and straightforward to configure as your application grows. With this flexibility, different types of content can be indexed into a single collection (sometimes referred to as "multi-type"). For example, content from a relational database could be indexed alongside documents crawled from Google Drive and logging events extracted from log files. Alternatively, each type of data could be indexed into its own collection and queried independently or searched together (called "multi-index"), as in the sketch below.
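Here is a minimal sketch of both ideas, assuming the same local cluster and invented index names: no mappings are defined up front, and a single search request targets several indices by listing them in the URL path.

```python
import requests

base = "http://localhost:9200"  # assumed local, security-disabled cluster

# No schema was declared; Elasticsearch infers field mappings on first index.
# refresh=true makes the documents searchable immediately (fine for a demo).
requests.post(f"{base}/crm-contacts/_doc", params={"refresh": "true"},
              json={"name": "Ada Lovelace", "company": "Acme"})
requests.post(f"{base}/app-logs/_doc", params={"refresh": "true"},
              json={"level": "ERROR", "message": "login failed for Ada Lovelace"})

# Multi-index search: list the indices (or a wildcard pattern) in the path.
response = requests.post(
    f"{base}/crm-contacts,app-logs/_search",
    json={"query": {"query_string": {"query": "Ada"}}},
)
print([hit["_index"] for hit in response.json()["hits"]["hits"]])
```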
Elasticsearch can scale horizontally to handle massive amounts of content, provide fault tolerance, and serve heavy query loads. Horizontal scaling is achieved by sharding collections: a collection is broken into smaller pieces, called shards, that are deployed across multiple nodes in a cluster. The diagram below illustrates data nodes hosting primary and replica shards distributed across an Elasticsearch cluster.
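As a quick illustration of how shard and replica counts are configured, the sketch below creates an index with three primary shards and one replica of each, then asks the _cat API where the shards landed. The index name and shard counts are illustrative, not recommendations, and a local security-disabled cluster is assumed.

```python
import requests

# Create a hypothetical "catalog" index with 3 primary shards and 1 replica each.
requests.put(
    "http://localhost:9200/catalog",
    json={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
)

# The _cat/shards API lists each shard, whether it is a primary (p) or replica (r),
# and which node in the cluster currently hosts it.
print(requests.get("http://localhost:9200/_cat/shards/catalog?v").text)
```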