Using the Power of Real-Time Distributed Search with ElasticSearch. Part 1
The Internet is a place where everyone in the world can find any information they want. But with billions of documents available on the web, how is it possible to find exactly what we want in seconds or less?
For this purpose, special programs called ‘search engines’ have been developed. They rely on many algorithms for analyzing text, stemming, building indexes and searching for query terms. In the Java world, one of the most popular open source libraries is Lucene from Apache. It is a high-performance, reliable and widely used full-featured information retrieval library written in Java. Several servers are built on top of it, such as Solr, ElasticSearch and others.
Nowadays most companies are trying to move their computation into the cloud, and search is no exception. In this article, I would like to look at ElasticSearch, which, besides many other features, was designed from the start to work in the cloud and is quite successful at it.
When developing high-load systems, you run into the problem of providing fast, up-to-date and convenient search. ElasticSearch meets all of those requirements and more. Here are the major advantages of the engine:
- High performance
- Open source
- Near-real-time indexing
- Ability to run in any Cloud
- Information exchange in JSON format via HTTP
- Simple installation and configuration procedures
- Provided interfaces (REST, Java and Groovy API)
Let’s dig deeper
ElasticSearch is a flexible and powerful open source, distributed, real-time search and analytics engine. It uses Apache Lucene as a base and makes it easier to create and implement large search systems. It is schema-free and document-oriented, both of which are very important technical innovations. ElasticSearch has been designed with the cloud in mind. Indexing and searching are performed via simple HTTP requests.
ElasticSearch uses a document-based data structure. Each document has an index, an id and a set of fields. When a new document type or field comes in, ES builds a schema for it dynamically, so there is no need to define a strict structure for each type. Of course, it’s also possible to define the structure manually in a mapping file or via the Mapping API. The following parameters can be specified:
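To make the "simple HTTP requests" point concrete, here is a minimal sketch in Python that only builds the requests and does not send them. The host and port (localhost:9200), the index name "hotels", the type name "hotel" and the document fields are assumptions made up for illustration; the URL layout follows the index/type/id scheme used by the ElasticSearch releases this article describes.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a local node listening on the default HTTP port.
BASE_URL = "http://localhost:9200"


def index_request(index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one JSON document."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def search_request(index, query_string):
    """Build (but do not send) a GET request for a simple query-string search."""
    query = urllib.parse.urlencode({"q": query_string})
    return urllib.request.Request(f"{BASE_URL}/{index}/_search?{query}", method="GET")


# Hypothetical document and query, for illustration only.
put = index_request("hotels", "hotel", "1", {"name": "Grand", "city": "Paris"})
get = search_request("hotels", "city:Paris")
print(put.get_method(), put.full_url)
print(get.get_method(), get.full_url)
```

With a node actually running, each request could be sent with urllib.request.urlopen and the JSON response decoded with json.loads.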
- field names and their types
- whether they are required or not
- the way in which those properties are indexed
- which one should be used as a unique key
- which should be stored
- whether a field should be searchable through ‘_all’ or not
- if its value should be “highlightable”
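As a rough illustration of a manually defined structure, the sketch below builds a mapping as a Python dictionary and prints it as JSON. The type name "hotel" and its fields are hypothetical, and the option names ("string" type, "index", "store", "include_in_all") follow the older mapping syntax contemporary with this article; later ElasticSearch releases changed or removed several of them, so treat this as an assumption to check against the version in use.

```python
import json

# Hypothetical mapping for a "hotel" type. Field names are made up, and the
# option syntax is the legacy one from the ElasticSearch era of this article.
hotel_mapping = {
    "hotel": {
        "properties": {
            "name": {
                "type": "string",
                "index": "analyzed",      # run through an analyzer before indexing
                "store": True,            # keep the original value retrievable
                "include_in_all": True,   # searchable through _all
            },
            "city": {
                "type": "string",
                "index": "not_analyzed",  # indexed as one exact term
                "include_in_all": False,  # excluded from _all
            },
            "price": {"type": "double"},
        }
    }
}

mapping_json = json.dumps(hotel_mapping, indent=2)
print(mapping_json)
```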
An index in ElasticSearch may store documents of different “mapping types”, and multiple mapping definitions can be associated with it, one for each mapping type. A mapping type is a way of separating the documents in an index into logical groups (like tables in a relational database).
Indexing is one of the most important procedures search engines perform. Without it, a search would take considerably more time and consume a huge amount of resources. In reality, searchers don’t run queries against the saved text but against indexes. That’s why it’s highly important to create indexes efficiently.
It’s like looking for the page that contains some word in a book. We go to the back of the book and check the index of words instead of reading every page. This type of indexing is called inverted indexing (it inverts the “page -> words” structure into a keyword-centric “word -> pages” data structure).
Before storing text in an index, ES analyzes it. Currently, there are a few default analyzers, but it’s also possible to add your own. One of the most efficient is the snowball analyzer, which works very well with the stems and roots of words. For example, suppose a document contains the words “searchable”, “searched” and “searching”. All of them will be transformed into “search” by the analyzer and then added to the index, pointing back to the full version of the document. The same happens when a user searches for a word: first ES analyzes the structure of the word and then uses its root to query the index and get a list of matching documents.
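The book analogy can be sketched in a few lines of Python. This is only a toy illustration of the “word -> pages” idea, with made-up page texts; it is not how Lucene stores its indexes.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to the sorted list of page numbers it appears on."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


# Hypothetical "book": page number -> page text.
pages = {
    1: "search engines build indexes",
    2: "indexes make search fast",
    3: "books have indexes too",
}
index = build_inverted_index(pages)
print(index["search"])   # [1, 2]
print(index["indexes"])  # [1, 2, 3]
```

Looking up a word is now a single dictionary access instead of a scan over every page.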
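The analysis step can be illustrated with a deliberately naive sketch. The suffix list and the toy_stem function below are hypothetical and are not the Snowball algorithm, which applies far more careful rules; the point is only that the same normalization runs at index time and at query time.

```python
import re

# Toy suffix list, checked in order; a real stemmer is much more careful.
SUFFIXES = ("able", "ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase the text, split it into word tokens, and stem each token."""
    return [toy_stem(token) for token in re.findall(r"[a-z]+", text.lower())]


print(analyze("Searchable, searched and searching"))
# ['search', 'search', 'and', 'search']
print(analyze("searches"))
# ['search']
```

Because the query term goes through the same function as the document text, a search for "searches" ends up looking for the term "search" in the index.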
Modern applications require not only full-text search by keyword but also more complex queries that make it possible, for example, to filter out unnecessary data, return results in a certain order or get statistics for each term in the query (e.g., how many documents a word occurs in). ES makes this easy and returns results very quickly.
It should be noted that filters can be cached, so subsequent searches with the same filter return results almost immediately.
A very powerful feature of ES is faceting. It allows getting aggregated data along with the results of a standard search query. Here is a list of some facet types:
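The benefit of caching a filter can be shown with a small in-memory sketch. The FilterCache class, the filter key and the sample documents are all hypothetical, and this says nothing about how ES implements its cache internally; it only shows that a repeated filter does not have to be evaluated again.

```python
class FilterCache:
    """Remember which document ids matched each named filter."""

    def __init__(self, documents):
        self.documents = documents  # doc id -> dict of fields
        self.cache = {}             # filter key -> frozenset of matching ids
        self.evaluations = 0        # how many times a filter was really run

    def matching_ids(self, key, predicate):
        """Return ids matching the filter, evaluating it only on a cache miss."""
        if key not in self.cache:
            self.evaluations += 1
            self.cache[key] = frozenset(
                doc_id for doc_id, doc in self.documents.items() if predicate(doc)
            )
        return self.cache[key]


docs = {
    1: {"city": "Paris", "price": 120},
    2: {"city": "Rome", "price": 90},
    3: {"city": "Paris", "price": 200},
}
cache = FilterCache(docs)


def is_paris(doc):
    return doc["city"] == "Paris"


first = cache.matching_ids("city:Paris", is_paris)   # evaluated on every document
second = cache.matching_ids("city:Paris", is_paris)  # served from the cache
print(sorted(first), cache.evaluations)  # [1, 3] 1
```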
- Terms (get the most popular terms)
- Filter (number of hits matching the filter)
- Histogram (statistics per interval/bucket)
- Statistics (count, total, sum of squares, mean (average), minimum, maximum, variance, and standard deviation)
- Geo-distance (within 500m, 1km)
ES can retrieve a lot of useful information that a software application can use to solve quite complex tasks. For example, it’s possible to get the locations of hotels close to the user’s current location (geo-distance facet), use the terms facet for auto-complete functionality, or get a histogram of prices per month, and so on.
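To show what kind of aggregated data a facet returns, here is a plain-Python sketch of a terms facet and a histogram facet computed over a small list of documents. The function names, the hotel documents and the bucket interval are hypothetical; ES computes such results on the server side over the documents matching the query.

```python
from collections import Counter


def terms_facet(documents, field, size=10):
    """Return the most frequent values of a field together with their counts."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    return counts.most_common(size)


def histogram_facet(documents, field, interval):
    """Count documents per bucket; each bucket key is the bucket's lower bound."""
    buckets = Counter(
        (doc[field] // interval) * interval for doc in documents if field in doc
    )
    return dict(sorted(buckets.items()))


# Hypothetical search hits.
hotels = [
    {"city": "Paris", "price": 120},
    {"city": "Rome", "price": 90},
    {"city": "Paris", "price": 200},
    {"city": "Paris", "price": 140},
]
print(terms_facet(hotels, "city"))           # [('Paris', 3), ('Rome', 1)]
print(histogram_facet(hotels, "price", 50))  # {50: 1, 100: 2, 200: 1}
```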