Elasticsearch Server - Third Edition - Sample Chapter
Elasticsearch is a very fast and scalable open source
search engine, designed with distribution and cloud in mind,
complete with all the goodies that Apache Lucene has to
offer. Elasticsearch's schema-free architecture allows
developers to index and search unstructured content,
making it perfectly suited for both small projects and
big data warehouses, even those with petabytes of
unstructured data.
This book will guide you through the world of the most
commonly used Elasticsearch server functionalities.
You'll start off by getting an understanding of the basics
of Elasticsearch and its data indexing functionality. Next,
you will see the querying capabilities of Elasticsearch,
followed by a thorough explanation of scoring and search
relevance. After this, you will explore the aggregation and
data analysis capabilities of Elasticsearch and will learn
how cluster administration and scaling can be used to boost
your applications' performance. You'll find out how to use
the friendly REST APIs and how to tune Elasticsearch to
make the most of it. By the end of this book, you will be
able to create amazing search solutions in line with your
project's specifications.
Leverage Elasticsearch to create a robust, fast, and flexible
search solution with ease
Rafał Kuć
Marek Rogoziński
Marek Rogoziński is a software architect and consultant with more than 10 years
of experience. His specialization concerns solutions based on open source search
engines, such as Solr and Elasticsearch, and the software stack for big data analytics
including Hadoop, HBase, and Twitter Storm.
He is also a cofounder of the solr.pl site, which publishes information and tutorials
about Solr and Lucene libraries. He is the coauthor of ElasticSearch Server and its
second edition, and the first and second editions of Mastering ElasticSearch, all
published by Packt Publishing.
He is currently the chief technology officer and lead architect at ZenCard, a company
that processes and analyzes large quantities of payment transactions in real time,
allowing automatic and anonymous identification of retail customers on all retailer
channels (m-commerce/e-commerce/brick&mortar) and giving retailers a customer
retention and loyalty tool.
Preface
Welcome to Elasticsearch Server, Third Edition. This is the third instalment of the
book, dedicated to yet another major release of Elasticsearch, this time version 2.2.
In the third edition, we have decided to take a similar route to the one we took when
we wrote the second edition of the book. We not only updated the content to match the
new version of Elasticsearch, but also restructured the book by removing and adding
sections and chapters. We read the suggestions we got from you, the readers of
the book, and we carefully tried to incorporate the suggestions and comments
received since the release of the first and second editions.
While reading this book, you will be taken on a journey to the wonderful world of
full-text search provided by the Elasticsearch server. We will start with a general
introduction to Elasticsearch, which covers how to start and run Elasticsearch, its
basic concepts, and how to index and search your data in the most basic way. This
book will also discuss the query language, so called Query DSL, that allows you
to create complicated queries and filter returned results. In addition to all of this,
you'll see how you can use the aggregation framework to calculate aggregated data
based on the results returned by your queries. We will implement the autocomplete
functionality together and learn how to use Elasticsearch spatial capabilities and
prospective search.
Finally, this book will show you Elasticsearch's administration API capabilities
with features such as shard placement control, cluster handling, and more, ending
with a dedicated chapter that will discuss Elasticsearch's preparation for small and
large deployments, both ones that concentrate on indexing and ones that
concentrate on querying.
Chapter 11, Scaling by Example, is dedicated to scaling and tuning. We will start with
hardware preparations and considerations, and tuning related to a single Elasticsearch
node. We will go through cluster setup and vertical scaling, ending the chapter
with high querying and indexing use cases and cluster monitoring.
Document: This is the main data carrier used during indexing and
searching, comprising one or more fields that contain the data we
put in and get from Lucene.
Apache Lucene writes all the information to a structure called the inverted
index. It is a data structure that maps the terms in the index to the documents and
not the other way around as a relational database does in its tables. You can think
of an inverted index as a data structure where data is term-oriented rather than
document-oriented. Let's see how a simple inverted index can look. For example,
let's assume that we have documents with only a single field called title to be
indexed. A very simplified visualization of the Lucene inverted index built for
such data could look as follows.
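For illustration, assume three simple titles (the exact titles are just an example chosen for this sketch): document 1 with Elasticsearch Server, document 2 with Elasticsearch Server Second Edition, and document 3 with Mastering Elasticsearch Second Edition. The inverted index for the title field would then map each term to the number of documents containing it and to the identifiers of those documents, roughly like this:

edition        2  <2, 3>
elasticsearch  3  <1, 2, 3>
mastering      1  <3>
second         2  <2, 3>
server         2  <1, 2>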
Each term points to the number of documents it is present in. For example, the
term edition is present twice in the second and third documents. Such a structure
allows for very efficient and fast search operations in term-based queries (but
not exclusively). Because the occurrences of the term are connected to the terms
themselves, Lucene can use the information about term occurrences to perform fast
and precise scoring by giving each document a value that represents how well it
matched the query.
Of course, the actual index created by Lucene is much more complicated and advanced
because of additional files that include information such as term vectors (a per-document
inverted index), doc values (column-oriented field information), stored fields (the
original, not analyzed, value of the field), and so on. However, all you need to
know for now is how the data is organized and not what exactly is stored.
Apart from the tokenizer, the Lucene analyzer is built of zero or more token filters
that are used to process tokens in the token stream. Some examples of filters are
as follows:
Synonyms filter: Changes one token to another on the basis of synonym rules
Filters are processed one after another, so by chaining multiple filters we have
almost unlimited analytical possibilities.
Finally, the character mappers operate on non-analyzed text; they are used before
the tokenizer. Therefore, we can easily remove HTML tags from whole parts of text
without worrying about tokenization.
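As an illustration, the following sketch creates an index with a custom analyzer that puts these pieces together; the index name analysis_test and the analyzer name html_content are made up for this example:

curl -XPUT 'localhost:9200/analysis_test' -d '{
 "settings" : {
  "analysis" : {
   "analyzer" : {
    "html_content" : {
     "type" : "custom",
     "char_filter" : [ "html_strip" ],
     "tokenizer" : "standard",
     "filter" : [ "lowercase" ]
    }
   }
  }
 }
}'

The html_strip character filter removes HTML tags before the standard tokenizer splits the text, and the lowercase token filter normalizes the resulting tokens.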
What you should remember about indexing and querying analysis is that the indexed
terms should match the query terms. If they don't match, Lucene won't return the desired
documents. For example, if you use stemming and lowercasing during indexing, you
need to ensure that the terms in the query are also lowercased and stemmed, or your
queries won't return any results at all. For example, let's get back to our LightRed
term that we analyzed during indexing; we have it as two terms in the index: light
and red. If we run a LightRed query against that data and don't analyze it, we won't
get the document in the results because the query term does not match the indexed terms.
It is important to keep the token filters in the same order during indexing and query
time analysis so that the terms resulting from such an analysis are the same.
Index
An index is the logical place where Elasticsearch stores the data. Each index can be
spread onto multiple Elasticsearch nodes and is divided into one or more smaller
pieces called shards that are physically placed on the hard drives. If you are coming
from the relational database world, you can think of an index like a table. However,
the index structure is prepared for fast and efficient full text searching and, in
particular, does not store original values. That structure is called an inverted index
(https://en.wikipedia.org/wiki/Inverted_index).
If you know MongoDB, you can think of the Elasticsearch index as a collection in
MongoDB. If you are familiar with CouchDB, you can think about an index as you
would about the CouchDB database. Elasticsearch can hold many indices located on
one machine or spread them over multiple servers. As we have already said, every
index is built of one or more shards, and each shard can have many replicas.
Document
The main entity stored in Elasticsearch is a document. A document can have
multiple fields, each having its own type and treated differently. Using the analogy
to relational databases, a document is a row of data in a database table. When you
compare an Elasticsearch document to a MongoDB document, you will see that
both can have different structures. The thing to keep in mind when it comes to
Elasticsearch is that fields that are common to multiple types in the same index
need to have the same type. This means that all the documents with a field called
title need to have the same data type for it, for example, string.
Documents consist of fields, and each field may occur several times in a single
document (such a field is called multivalued). Each field has a type (text, number,
date, and so on). The field types can also be complex: a field can contain other
subdocuments or arrays. The field type is important to Elasticsearch because type
determines how various operations such as analysis or sorting are performed.
Fortunately, this can be determined automatically (however, we still suggest
using mappings; take a look at what follows).
Unlike the relational databases, documents don't need to have a fixed structure;
every document may have a different set of fields, and in addition to this, fields
don't have to be known during application development. Of course, one can
force a document structure with the use of schema. From the client's point of
view, a document is a JSON object (see more about the JSON format at https://
en.wikipedia.org/wiki/JSON). Each document is stored in one index and has its
own unique identifier, which can be generated automatically by Elasticsearch, and
document type. The thing to remember is that the document identifier needs to be
unique only within a given type inside an index. This means that, in a single
index, two documents can have the same unique identifier if they are not of the
same type.
Document type
In Elasticsearch, one index can store many objects serving different purposes. For
example, a blog application can store articles and comments. The document type
lets us easily differentiate between the objects in a single index. Every document
can have a different structure, but in real-world deployments, dividing documents
into types significantly helps in data manipulation. Of course, one needs to keep the
limitations in mind. That is, different document types can't set different types for the
same property. For example, a field called title must have the same type across all
document types in a given index.
Mapping
In the section about the basics of full text searching (the Full text searching section),
we wrote about the process of analysis, the preparation of the input text for
indexing and searching done by the underlying Apache Lucene library. Every field
of the document must be properly analyzed depending on its type. For example,
a different analysis chain is required for the numeric fields (numbers shouldn't be
sorted alphabetically) and for the text fetched from web pages (for example, the
first step would require you to omit the HTML tags as it is useless information).
To be able to properly analyze at indexing and querying time, Elasticsearch stores
the information about the fields of the documents in so-called mappings. Every
document type has its own mapping, even if we don't explicitly define it.
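For example, a mapping for a made-up posts index with a post type could be provided explicitly when the index is created (the index name, type name, and fields here are only an illustration):

curl -XPUT 'localhost:9200/posts' -d '{
 "mappings" : {
  "post" : {
   "properties" : {
    "title" : { "type" : "string" },
    "votes" : { "type" : "integer" },
    "published" : { "type" : "date" }
   }
  }
 }
}'

If we index a document with a field that is not listed in the mappings, Elasticsearch will try to guess its type and add the field to the mappings automatically.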
Shards
When we have a large number of documents, we may come to a point where a single
node may not be enough, for example, because of RAM limitations, hard disk
capacity, insufficient processing power, and an inability to respond to client requests
fast enough. In such cases, an index (and the data in it) can be divided into smaller
parts called shards (where each shard is a separate Apache Lucene index). Each
shard can be placed on a different server, and thus your data can be spread among
the cluster nodes. When you query an index that is built from multiple shards,
Elasticsearch sends the query to each relevant shard and merges the result in such a
way that your application doesn't know about the shards. In addition to this, having
multiple shards can speed up indexing, because documents end up in different
shards and thus the indexing operation is parallelized.
Replicas
In order to increase query throughput or achieve high availability, shard replicas can
be used. A replica is just an exact copy of the shard, and each shard can have zero
or more replicas. In other words, Elasticsearch can have many identical shards and
one of them is automatically chosen as a place where the operations that change the
index are directed. This special shard is called a primary shard, and the others are
called replica shards. When the primary shard is lost (for example, a server holding
the shard data is unavailable), the cluster will promote the replica to be the new
primary shard.
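Both values can be set when an index is created. A minimal sketch, using a made-up notes index with two primary shards and one replica of each, could look like this:

curl -XPUT 'localhost:9200/notes' -d '{
 "settings" : {
  "number_of_shards" : 2,
  "number_of_replicas" : 1
 }
}'

Note that the number of replicas can be changed at any time, while the number of primary shards is fixed once the index has been created.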
Gateway
The cluster state is held by the gateway, which stores the cluster state and indexed
data across full cluster restarts. By default, every node has this information stored
locally; it is synchronized among nodes. We will discuss the gateway module in
The gateway and recovery modules section of Chapter 9, Elasticsearch Cluster, in detail.
The following diagram shows how an indexing request is handled: the application sends the indexing request to one of the Elasticsearch nodes in the cluster, and that node forwards it to the node holding the relevant primary shard (Shard 1 or Shard 2), with the replica shards placed on the other nodes of the Elasticsearch cluster.
When you send a new document to the cluster, you specify a target index and send
it to any of the nodes. The node knows how many shards the target index has and is
able to determine which shard should be used to store your document. Elasticsearch
can alter this behavior; we will talk about this in the Introduction to routing section in
Chapter 2, Indexing Your Data. The important information that you have to remember
for now is that Elasticsearch calculates the shard in which the document should be
placed using the unique identifier of the document; this is one of the reasons each
document needs a unique identifier. After the indexing request is sent to a node, that
node forwards the document to the target node, which hosts the relevant shard.
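Conceptually, the default shard selection can be thought of as the following calculation (a simplification; the exact hash function Elasticsearch uses is an implementation detail):

target_shard = hash(document_identifier) % number_of_primary_shards

This is also why the number of primary shards cannot be changed once the index has been created: changing it would invalidate the placement of already indexed documents.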
Now, let's look at the following diagram on searching request execution:
In the diagram, the application sends a query to one of the Elasticsearch nodes in the cluster; during the scatter phase, that node forwards the query to the shards (Shard 1 and Shard 2), and during the gather phase it collects the results and returns them to the application.
When you try to fetch a document by its identifier, the node you send the query to
uses the same routing algorithm to determine the shard and the node holding the
document and again forwards the request, fetches the result, and sends the result to
you. On the other hand, the querying process is a more complicated one. The node
receiving the query forwards it to all the nodes holding the shards that belong to a
given index and asks for minimum information about the documents that match the
query (by default, the identifier and the score), unless routing is used, when
the query will go directly to a single shard only. This is called the scatter phase. After
receiving this information, the aggregator node (the node that receives the client
request) sorts the results and sends a second request to get the documents that are
needed to build the results list (all the other information apart from the document
identifier and score). This is called the gather phase. After this phase is executed,
the results are returned to the client.
Now the question arises: what is the replica's role in the previously described
process? While indexing, replicas are only used as an additional place to store the
data. When executing a query, by default, Elasticsearch will try to balance the load
among the shard and its replicas so that they are evenly stressed. Also, remember
that we can change this behavior; we will discuss this in the Understanding the
querying process section in Chapter 3, Searching Your Data.
Installing Java
Elasticsearch is a Java application and to use it we need to make sure that the Java SE
environment is installed properly. Elasticsearch requires Java Version 7 or later to run.
You can download it from http://www.oracle.com/technetwork/java/javase/
downloads/index.html. You can also use OpenJDK (http://openjdk.java.net/)
if you wish. You can, of course, use Java Version 7, but it is not supported by Oracle
anymore, at least without commercial support. For example, you can't expect new,
patched versions of Java 7 to be released. Because of this, we strongly suggest that you
install Java 8, especially given that Java 9 seems to be right around the corner with the
general availability planned to be released in September 2016.
Installing Elasticsearch
To install Elasticsearch you just need to go to https://www.elastic.co/
downloads/elasticsearch, choose the latest stable version of Elasticsearch,
download it, and unpack it. That's it! The installation is complete.
The main interface to communicate with Elasticsearch is based on the HTTP protocol
and REST. This means that you can even use a web browser for some basic queries
and requests, but for anything more sophisticated you'll need to use additional
software, such as the cURL command. If you use Linux or OS X, the
cURL package should already be available. If you use Windows, you can download
the package from http://curl.haxx.se/download.html.
Running Elasticsearch
Let's run our first instance that we just downloaded as the ZIP archive and unpacked.
Go to the bin directory and run the following commands depending on the OS:
Linux or OS X: ./elasticsearch
Windows: elasticsearch.bat
Now, we will use the cURL program to communicate with Elasticsearch. For example,
to check the cluster health, we will use the following command:
curl -XGET http://127.0.0.1:9200/_cluster/health?pretty
The -X parameter is a definition of the HTTP request method. The default value is
GET (so in this example, we can omit this parameter). For now, do not worry about
the GET value; we will describe it in more detail later in this chapter.
As a standard, the API returns information in a JSON object in which new line
characters are omitted. The pretty parameter added to our requests forces
Elasticsearch to add a new line character to the response, making the response
more user-friendly. You can try running the preceding query with and without
the ?pretty parameter to see the difference.
Elasticsearch is useful in small and medium-sized applications, but it has been
built with large clusters in mind. So, now we will set up our big two-node cluster.
Unpack the Elasticsearch archive in a different directory and run the second instance.
If we look at the log, we will see the following:
[2016-01-13 20:07:58,561][INFO ][cluster.service          ] [Big Man] detected_master {Blob}{5QPh00RUQraeLHAInbR4Jw}{127.0.0.1}{127.0.0.1:9300}, added {{Blob}{5QPh00RUQraeLHAInbR4Jw}{127.0.0.1}{127.0.0.1:9300},}, reason: zen-disco-receive(from master [{Blob}{5QPh00RUQraeLHAInbR4Jw}{127.0.0.1}{127.0.0.1:9300}])
This means that our second instance (named Big Man) discovered the previously
running instance (named Blob). Here, Elasticsearch automatically formed a new
two-node cluster. Starting from Elasticsearch 2.0, this will only work with nodes
running on the same physical machine, because Elasticsearch 2.0 no longer supports
multicast. To allow your cluster to form, you need to inform Elasticsearch about the
nodes that should be contacted initially using the discovery.zen.ping.unicast.hosts
array in elasticsearch.yml. For example, like this:
discovery.zen.ping.unicast.hosts: ["192.168.2.1", "192.168.2.2"]
There are several ways to shut Elasticsearch down. The easiest one is to press Ctrl + C
in the console where it is running. The second option is to kill the server process by
sending the TERM signal (see the kill command on Linux boxes and Program Manager
on Windows).
The previous versions of Elasticsearch exposed a dedicated
shutdown API but, in 2.0, this option has been removed
because of security reasons.
The Elasticsearch installation directory contains, among others, the following directories:

config - The configuration files, such as elasticsearch.yml and logging.yml
lib - The libraries used by Elasticsearch
modules - The modules shipped with Elasticsearch

After Elasticsearch starts, it will create the following directories (if they don't exist):

data - The data of the indices and shards stored by the node
logs - The log files
plugins - The plugins installed by the user
work - Temporary files
Configuring Elasticsearch
One of the reasons, of course not the only one, why Elasticsearch is gaining more
and more popularity is that getting started with Elasticsearch is quite easy. Because
of the reasonable default values and automatic settings for simple environments,
we can skip the configuration and go straight to indexing and querying (or to the
next chapter of the book). We can do all this without changing a single line in our
configuration files. However, in order to truly understand Elasticsearch, it is worth
understanding some of the available settings.
We will now explore the default directories and the layout of the files provided with
the Elasticsearch tar.gz archive. The entire configuration is located in the config
directory. We can see two files here: elasticsearch.yml (or elasticsearch.json,
which will be used if present) and logging.yml. The first file is responsible for setting
the default configuration values for the server. This is important because some of
these values can be changed at runtime and can be kept as a part of the cluster state,
so the values in this file may not be accurate. The two values that we cannot change
at runtime are cluster.name and node.name.
The cluster.name property is responsible for holding the name of our cluster.
The cluster name separates different clusters from each other. Nodes configured
with the same cluster name will try to form a cluster.
The second value is the instance (the node.name property) name. We can leave
this parameter undefined. In this case, Elasticsearch automatically chooses a unique
name for itself. Note that this name is chosen during each startup, so the name can be
different on each restart. Defining the name can be helpful when referring to concrete
instances by the API or when using monitoring tools to see what is happening
to a node during long periods of time and between restarts. Think about giving
descriptive names to your nodes.
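For example, the relevant part of elasticsearch.yml could look as follows (the cluster and node names used here are just examples):

cluster.name: books-cluster
node.name: first-node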
Other parameters are commented well in the file, so we advise you to look through
it; don't worry if you do not understand the explanation. We hope that everything
will become clearer after reading the next few chapters.
Remember that most of the parameters that have been set in the
elasticsearch.yml file can be overwritten with the use of the
Elasticsearch REST API. We will talk about this API in The update
settings API section of Chapter 9, Elasticsearch Cluster in Detail.
The second file (logging.yml) defines how much information is written to the system
logs, defines the log files, and creates new log files periodically. Changes in this file are
usually required only when you need to adapt to monitoring or backup solutions
or during system debugging; however, if you want more detailed logging,
you need to adjust it accordingly.
Let's leave the configuration files for now and look at the base for all the applications:
the operating system. Tuning your operating system is one of the key points to ensure
that your Elasticsearch instance will work well. During indexing, especially when
having many shards and replicas, Elasticsearch will create many files; so, the system
must not limit the number of open file descriptors to less than 32,000. For Linux servers, this can
usually be changed in /etc/security/limits.conf and the current value can be
displayed using the ulimit command. If you end up reaching the limit, Elasticsearch
will not be able to create new files; so merging will fail, indexing may fail, and new
indices will not be created.
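For example, a minimal sketch of such a change, assuming the process runs as a user named elasticsearch (use the account that actually runs Elasticsearch), is to add a line like the following to /etc/security/limits.conf and verify the new limit with ulimit -n after logging in again:

elasticsearch - nofile 65535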
On Microsoft Windows platforms, the default limit is more than 16
million handles per process, which should be more than enough.
You can read more about file handles on the Microsoft Windows
platform at https://blogs.technet.microsoft.com/markrussinovich/2009/09/29/
pushing-the-limits-of-windows-handles/.
The next set of settings is connected to the Java Virtual Machine (JVM) heap memory
limit for a single Elasticsearch instance. For small deployments, the default memory
limit (1,024 MB) will be sufficient, but for large ones it will not be enough. If you spot
entries that indicate OutOfMemoryError exceptions in a log file, set the ES_HEAP_SIZE
variable to a value greater than 1024. When choosing the right amount of memory
size to be given to the JVM, remember that, in general, no more than 50 percent of
your total system memory should be given. However, as with all the rules, there are
exceptions. We will discuss this in greater detail later, but you should always monitor
your JVM heap usage and adjust it when needed.
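For example, on a machine with 8 GB of RAM that is dedicated to Elasticsearch, we could give half of it to the JVM heap before starting the server (the 4g value is only an illustration of the rule mentioned above):

export ES_HEAP_SIZE=4g
bin/elasticsearch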
Alternatively, you can add the remote repository and install Elasticsearch from it
(this command needs to be run as root as well):
rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch
This command adds the GPG key and allows the system to verify that the fetched
package really comes from Elasticsearch developers. In the second step, we need to
create the repository definition in the /etc/yum.repos.d/elasticsearch.repo file.
We need to add the following entries to this file:
[elasticsearch-2.2]
name=Elasticsearch repository for 2.2.x packages
baseurl=http://packages.elastic.co/elasticsearch/2.x/centos
gpgcheck=1
gpgkey=http://packages.elastic.co/GPG-KEY-elasticsearch
enabled=1
Now it's time to install the Elasticsearch server, which is as simple as running the
following command (again, don't forget to run it as root):
yum install elasticsearch
It is as simple as that. Another way, which is similar to what we did with RPM
packages, is creating a new package source and installing Elasticsearch from
the remote repository. The first step is to add the public GPG key used for package
verification. We can do that using the following command:
wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
The second step is adding the DEB package location. We need to add the
following line to the /etc/apt/sources.list file:
deb http://packages.elastic.co/elasticsearch/2.2/debian stable main
This defines the source for the Elasticsearch packages. The last step is updating the
list of remote packages and installing Elasticsearch using the following command:
sudo apt-get update && sudo apt-get install elasticsearch
Depending on the Linux distribution, the environment configuration for the Elasticsearch service (for example, the heap size settings) can be found in /etc/sysconfig/elasticsearch or /etc/default/elasticsearch.
If you want Elasticsearch to start automatically every time the operating system
starts, you can set up Elasticsearch as a system service by running the following
command:
/bin/systemctl enable elasticsearch.service
On Windows, Elasticsearch can be installed as a service by running the service.bat
install command from the bin directory. You'll be asked for permission to do so.
If you allow the script to run, Elasticsearch will be installed as a Windows service.
If you would like to see all the commands exposed by the service.bat script file,
just run the following command in the same directory as earlier:
service.bat
For example, to start Elasticsearch, we will just run the following command:
service.bat start
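An example document describing an article for a blogging platform, very close to the one we will index in a moment, could look as follows (the exact values are only an illustration):

{
 "id": "1",
 "title": "New version of Elasticsearch released!",
 "content": "Version 2.2 released today!",
 "priority": 10,
 "tags": ["announce", "elasticsearch", "release"]
}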
As you can see in the preceding code snippet, the JSON document is built with
a set of fields, where each field can have a different format. In our example, we
have a set of text fields (id, title, and content), we have a number (the priority field),
and an array of text values (the tags field). We will show documents that are more
complicated in the next examples.
One of the changes introduced in Elasticsearch 2.0 has been that field
names can't contain the dot character. Such field names were possible
in older versions of Elasticsearch, but could result in serialization
errors in certain cases and thus Elasticsearch creators decided to
remove that possibility.
Let's now index our document and make it available for retrieval and searching.
We will index our articles to an index called blog under a type named article.
We will also give our document an identifier of 1, as this is our first document.
To index our example document, we will execute the following command:
curl -XPUT 'http://localhost:9200/blog/article/1' -d '{
 "title": "New version of Elasticsearch released!",
 "content": "Version 2.2 released today!",
 "priority": 10,
 "tags": ["announce", "elasticsearch", "release"]
}'
Note a new option to the curl command, the -d parameter. The value of this option is
the text that will be used as a request payload, a request body. This way, we can send
additional information such as the document definition. Also, note that the unique
identifier is placed in the URL and not in the body. If you omit this identifier (while
using the HTTP PUT request), the indexing request will return the following error:
No handler found for uri [/blog/article] and method [PUT]
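If the identifier is provided and everything goes well, Elasticsearch will instead return a confirmation similar to the following one (the exact values depend on your cluster):

{
 "_index":"blog",
 "_type":"article",
 "_id":"1",
 "_version":1,
 "_shards":{
  "total":2,
  "successful":1,
  "failed":0},
 "created":true
}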
In the preceding response, Elasticsearch included information about the status of the
operation, index, type, identifier, and version. We can also see information about the
shards that took part in the operation: all of them, the ones that were successful, and
the ones that failed.
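Elasticsearch can also generate the document identifier for us. To do that, we send the document to the type endpoint without specifying the identifier, for example like this (the document body is the same as before):

curl -XPOST 'http://localhost:9200/blog/article/' -d '{"title": "New version of Elasticsearch released!", "content": "Version 2.2 released today!", "priority": 10, "tags": ["announce", "elasticsearch", "release"]}'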
We've used the HTTP POST method instead of PUT and we've omitted the identifier.
The response produced by Elasticsearch in such a case would be as follows:
{
"_index":"blog",
"_type":"article",
"_id":"AU1y-s6w2WzST_RhTvCJ",
"_version":1,
"_shards":{
"total":2,
"successful":1,
"failed":0},
"created":true
}
As you can see, the response returned by Elasticsearch is almost the same as in the
previous example, with a minor difference: the _id field is returned. Now, instead
of the 1 value, we have a value of AU1y-s6w2WzST_RhTvCJ, which is the identifier
Elasticsearch generated for our document.
Retrieving documents
We now have two documents indexed into our Elasticsearch instance: one using an
explicit identifier and one using a generated identifier. Let's now try to retrieve one
of the documents using its unique identifier. To do this, we will need information
about the index the document is indexed in, what type it has, and of course what
identifier it has. For example, to get the document from the blog index with the
article type and the identifier of 1, we would run the following HTTP GET request:
curl -XGET 'localhost:9200/blog/article/1?pretty'
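The response should look similar to the following one (the version number depends on how many times the document has been indexed or updated):

{
 "_index" : "blog",
 "_type" : "article",
 "_id" : "1",
 "_version" : 1,
 "found" : true,
 "_source" : {
  "title" : "New version of Elasticsearch released!",
  "content" : "Version 2.2 released today!",
  "priority" : 10,
  "tags" : [ "announce", "elasticsearch", "release" ]
 }
}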
As you can see in the preceding response, Elasticsearch returned the _source field,
which is the original document sent to Elasticsearch, and a few additional fields that tell
us about the document, such as the index, type, identifier, document version, and of
course information as to whether the document was found or not (the found property).
If we try to retrieve a document that is not present in the index, such as the one with
the 12345 identifier, we get a response like this:
{
"_index" : "blog",
"_type" : "article",
"_id" : "12345",
"found" : false
}
As you can see, this time the value of the found property was set to false and there
was no _source field because the document has not been retrieved.
Updating documents
Updating documents in the index is a more complicated task compared to indexing.
When the document is indexed and Elasticsearch flushes the document to a disk,
it creates segments, an immutable structure that is written once and read many
times. This is done because the inverted index created by Apache Lucene is currently
impossible to update (at least most of its parts). To update a document, Elasticsearch
internally first fetches the document using the GET request, modifies its _source field,
removes the old document, and indexes a new document using the updated content.
The content update is done using scripts in Elasticsearch (we will talk more about
scripting in Elasticsearch in the Scripting capabilities of Elasticsearch section in Chapter
6, Make Your Search Better).
Please note that the following document update examples
require you to put the script.inline: on property into your
elasticsearch.yml configuration file. This is needed because
inline scripting is disabled in Elasticsearch for security reasons.
The other way to handle updates is to store the script content in
the file in the Elasticsearch configuration directory, but we will
talk about that in the Scripting capabilities of Elasticsearch section
in Chapter 6, Make Your Search Better.
Let's now try to update our document with identifier 1 by modifying its content field
to contain the This is the updated document sentence. To do this, we need to run
a POST HTTP request on the document path using the _update REST end-point. Our
request to modify the document would look as follows:
curl -XPOST 'http://localhost:9200/blog/article/1/_update' -d '{
"script" : "ctx._source.content = new_content",
"params" : {
"new_content" : "This is the updated document"
}
}'
As you can see, we've sent the request to the /blog/article/1/_update REST endpoint. In the request body, we've provided two parameters: the update script in the
script property and the parameters of the script. The script is very simple; it takes
the _source field and modifies the content field by setting its value to the value of
the new_content parameter. The params property contains all the script parameters.
For the preceding update command execution, Elasticsearch would return the
following response:
{"_index":"blog","_type":"article","_id":"1","_version":2,"_shards":{"
total":2,"successful":1,"failed":0}}
The thing to look at in the preceding response is the _version field. Right now, the
version is 2, which means that the document has been updated (or re-indexed) once.
Basically, each update makes Elasticsearch update the _version field.
We could also update the document using the doc section and providing the
changed field, for example:
curl -XPOST 'http://localhost:9200/blog/article/1/_update' -d '{
"doc" : {
"content" : "This is the updated document"
}
}'
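Partial updates only work for documents that already exist. For instance, suppose we first try to increment the priority field of a document with the identifier 2, which we have not indexed yet, using a command like this one (a sketch that uses the same script as the upsert example below):

curl -XPOST 'http://localhost:9200/blog/article/2/_update' -d '{
 "script" : "ctx._source.priority += 1"
}'

Elasticsearch will reject such a request with an error saying that the document is missing.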
As you can imagine, the document has not been updated because it doesn't exist.
So now, let's modify our request to include the upsert section in our request body
that will tell Elasticsearch what to do when the document is not present. The new
command would look as follows:
curl -XPOST 'http://localhost:9200/blog/article/2/_update' -d '{
"script" : "ctx._source.priority += 1",
"upsert" : {
"title" : "Empty document",
"priority" : 0,
"tags" : ["empty"]
}
}'
With the modified request, a new document would be indexed; if we retrieve it using
the GET API, it will look as follows:
{
"_index" : "blog",
"_type" : "article",
"_id" : "2",
"_version" : 1,
"found" : true,
"_source" : {
"title" : "Empty document",
"priority" : 0,
"tags" : [ "empty" ]
}
}
As you can see, the fields from the upsert section of our update request were taken
by Elasticsearch and used as document fields.
Let's imagine that we would like to update our initial document and add a new field
called count to it (setting it to 1 initially). We would also like to index the document
under the specified identifier if the document is not present. We can do this by
running the following command:
curl -XPOST 'http://localhost:9200/blog/article/1/_update' -d '{
 "doc" : {
  "count" : 1
 },
 "doc_as_upsert" : true
}'
We specified the new field in the doc section and we said that we want the doc
section to be treated as the upsert section when the document is not present
(with the doc_as_upsert property set to true).
If we now retrieve that document, we see the following response:
{
"_index" : "blog",
"_type" : "article",
"_id" : "1",
"_version" : 3,
"found" : true,
"_source" : {
"title" : "New version of Elasticsearch released!",
"content" : "This is the updated document",
"priority" : 10,
"tags" : [ "announce", "elasticsearch", "release" ],
"count" : 1
}
}
Deleting documents
Now that we know how to index documents, update them, and retrieve them,
it is time to learn about how we can delete them. Deleting a document from an
Elasticsearch index is very similar to retrieving it, but with one major difference:
instead of using the HTTP GET method, we have to use the HTTP DELETE method.
For example, if we would like to delete the document indexed in the blog index under
the article type and with an identifier of 1, we would run the following command:
curl -XDELETE 'localhost:9200/blog/article/1'
The response from Elasticsearch indicates that the document has been deleted and
should look as follows:
{
"found":true,
"_index":"blog",
"_type":"article",
"_id":"1",
"_version":4,
"_shards":{
"total":2,
"successful":1,
"failed":0
}
}
Of course, deleting a single document is not the only option. We can also remove
whole indices. For example, if we would like to delete the entire
blog index, we should just omit the identifier and the type, so the command would
look like this:
curl -XDELETE 'localhost:9200/blog'
The preceding command would result in the deletion of the blog index.
Versioning
Finally, there is one last thing that we would like to talk about when it comes
to data manipulation in Elasticsearch: the great feature of versioning. As you
may have already noticed, Elasticsearch increments the document version when
it does updates to it. We can leverage this functionality and use optimistic locking
(http://en.wikipedia.org/wiki/Optimistic_concurrency_control), and
avoid conflicts and overwrites when multiple processes or threads access the same
document concurrently. For example, your indexing application may try to update a
document at the same time as a user updates it while doing some manual work. The
question that arises is: which document should be the correct one, the one updated by
the indexing application, the one updated by the user, or the merged document of the
changes? What if the changes are conflicting?
To handle such cases, we can use versioning.
Usage example
Let's index a new document to our blog index, one with an identifier of 10,
and let's index its second version soon after we do that. The commands that
do this look as follows:
curl -XPUT 'localhost:9200/blog/article/10' -d '{"title":"Test document"}'
curl -XPUT 'localhost:9200/blog/article/10' -d '{"title":"Updated test document"}'
Because we've indexed the document with the same identifier twice, it should have a
version of 2 (you can check it using the GET request, as shown below).
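For example, we can fetch the document and look at its version:

curl -XGET 'localhost:9200/blog/article/10?pretty'

In the response, the _version field should now be equal to 2.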
Now, let's try deleting the document we've just indexed but let's specify a version
property equal to 1. By doing this, we tell Elasticsearch that we are interested in
deleting the document with the provided version. Because the document is at a different
version now, Elasticsearch shouldn't allow the removal with version 1. Let's check if what
we say is true. The command we will use to send the delete request looks as follows:
curl -XDELETE 'localhost:9200/blog/article/10?version=1'
The response returned by Elasticsearch will be similar to the following one:
{
  "error" : {
    "root_cause" : [ {
      "type" : "version_conflict_engine_exception",
      "reason" : "[article][10]: version conflict, current [2], provided [1]",
      "shard" : 1,
      "index" : "blog"
    } ],
    "type" : "version_conflict_engine_exception",
    "reason" : "[article][10]: version conflict, current [2], provided [1]",
    "shard" : 1,
    "index" : "blog"
  },
  "status" : 409
}
As you can see, the delete operation was not successful: the versions didn't match.
If we set the version property to 2, the delete operation would be successful:
curl -XDELETE 'localhost:9200/blog/article/10?version=2&pretty'
This time the delete operation has been successful because the provided version
was proper.
Sample data
For the purpose of this section of the book, we will create a simple index with two
document types. To do this, we will run the following six commands:
curl -XPOST 'localhost:9200/books/es/1' -d '{"title":"Elasticsearch Server", "published": 2013}'
curl -XPOST 'localhost:9200/books/es/2' -d '{"title":"Elasticsearch Server Second Edition", "published": 2014}'
curl -XPOST 'localhost:9200/books/es/3' -d '{"title":"Mastering Elasticsearch", "published": 2013}'
curl -XPOST 'localhost:9200/books/es/4' -d '{"title":"Mastering Elasticsearch Second Edition", "published": 2015}'
curl -XPOST 'localhost:9200/books/solr/1' -d '{"title":"Apache Solr 4 Cookbook", "published": 2012}'
curl -XPOST 'localhost:9200/books/solr/2' -d '{"title":"Solr Cookbook Third Edition", "published": 2015}'
Running the preceding commands will create the books index with two types: es
and solr. The title and published fields will be indexed and thus searchable.
URI search
All queries in Elasticsearch are sent to the _search endpoint. You can search a single
index or multiple indices, and you can restrict your search to a given document type
or multiple types. For example, in order to search our books index, we will run the
following command:
curl -XGET 'localhost:9200/books/_search?pretty'
The results returned by Elasticsearch will include all the documents from our books
index (because no query has been specified) and should look similar to the following:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 6,
"max_score" : 1.0,
"hits" : [ {
"_index" : "books",
"_type" : "es",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"title" : "Elasticsearch Server Second Edition",
"published" : 2014
}
}, {
"_index" : "books",
"_type" : "es",
"_id" : "4",
"_score" : 1.0,
"_source" : {
"title" : "Mastering Elasticsearch Second Edition",
As you can see, the response has a header that tells you the total time of the query and
the shards used in the query process. In addition to this, we have documents matching
the query: the top 10 documents by default. Each document is described by the index,
type, identifier, score, and the source of the document, which is the original document
sent to Elasticsearch.
We can also run queries against many indices. For example, if we had another index
called clients, we could also run a single query against these two indices as follows:
curl -XGET 'localhost:9200/books,clients/_search?pretty'
We can also run queries against all the data in Elasticsearch by omitting the index
names completely or by using _all as the index name:
curl -XGET 'localhost:9200/_search?pretty'
curl -XGET 'localhost:9200/_all/_search?pretty'
In a similar manner, we can also choose the types we want to use during searching.
For example, if we want to search only in the es type in the books index, we run a
command as follows:
curl -XGET 'localhost:9200/books/es/_search?pretty'
Please remember that, in order to search for a given type, we need to specify
the index or multiple indices. Elasticsearch allows us to have quite a rich semantics
when it comes to choosing index names. If you are interested, please refer to
https://www.elastic.co/guide/en/elasticsearch/reference/current/
multi-index.html; however, there is one thing we would like to point out. When
running a query against multiple indices, it may happen that some of them do not
exist or are closed. In such cases, the ignore_unavailable property comes in handy.
When set to true, it tells Elasticsearch to ignore unavailable or closed indices.
For example, let's try running the following query:
curl -XGET 'localhost:9200/books,non_existing/_search?pretty'
Now let's check what will happen if we add ignore_unavailable=true to our
request and execute the following command:
curl -XGET 'localhost:9200/books,non_existing/_search?pretty&ignore_unavailable=true'
In this case, Elasticsearch would return the results without any error.
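The q parameter of the URI search allows us to pass a simple query directly in the request. For example, to find the documents that contain the term elasticsearch in the title field of our books index, we could run a command similar to the following one:

curl -XGET 'localhost:9200/books/_search?pretty&q=title:elasticsearch'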
The response returned by Elasticsearch for the preceding request will be as follows:
{
"took" : 37,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 0.625,
"hits" : [ {
"_index" : "books",
"_type" : "es",
"_id" : "1",
"_score" : 0.625,
"_source" : {
"title" : "Elasticsearch Server",
"published" : 2013
}
}, {
"_index" : "books",
"_type" : "es",
"_id" : "2",
"_score" : 0.5,
"_source" : {
"title" : "Elasticsearch Server Second Edition",
"published" : 2014
}
}, {
"_index" : "books",
"_type" : "es",
"_id" : "4",
"_score" : 0.5,
"_source" : {
"title" : "Mastering Elasticsearch Second Edition",
"published" : 2015
}
}, {
"_index" : "books",
"_type" : "es",
"_id" : "3",
"_score" : 0.19178301,
"_source" : {
"title" : "Mastering Elasticsearch",
"published" : 2013
}
} ]
}
}
The first section of the response gives us information about how much time the
request took (the took property is specified in milliseconds), whether it was timed
out (the timed_out property), and information about the shards that were queried
during the request execution: the number of queried shards (the total property of
the _shards object), the number of shards that returned the results successfully (the
successful property of the _shards object), and the number of failed shards (the
failed property of the _shards object). The query may also time out if it is executed
for a longer period than we want. (We can specify the maximum query execution
time using the timeout parameter.) The failed shard means that something went
wrong with that shard or it was not available during the search execution.
Of course, the mentioned information can be useful, but usually, we are interested in
the results that are returned in the hits object. We have the total number of documents
returned by the query (in the total property) and the maximum score calculated (in
the max_score property). Finally, we have the hits array that contains the returned
documents. In our case, each returned document contains its index name (the _index
property), the type (the _type property), the identifier (the _id property), the score
(the _score property), and the _source field (usually, this is the JSON object sent
for indexing).
Query analysis
You may wonder why the query we've run in the previous section worked.
We indexed the Elasticsearch term and ran a query for Elasticsearch and even
though they differ (capitalization), the relevant documents were found. The reason
for this is the analysis. During indexing, the underlying Lucene library analyzes the
documents and indexes the data according to the Elasticsearch configuration. By
default, Elasticsearch will tell Lucene to index and analyze both string-based data
as well as numbers. The same happens during querying because the URI request
query maps to the query_string query (which will be discussed in Chapter 3,
Searching Your Data), and this query is analyzed by Elasticsearch.
Let's use the indices-analyze API (https://www.elastic.co/guide/en/
elasticsearch/reference/current/indices-analyze.html). It allows us to see
how the analysis process is done. With this, we can see what happened to one of the
documents during indexing and what happened to our query phrase during querying.
In order to see what was indexed in the title field of the Elasticsearch server phrase,
we will run the following command:
curl -XGET 'localhost:9200/books/_analyze?pretty&field=title' -d 'Elasticsearch Server'
The response returned by Elasticsearch should be similar to the following one:
{
  "tokens" : [ {
    "token" : "elasticsearch",
    "start_offset" : 0,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "server",
    "start_offset" : 14,
    "end_offset" : 20,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}
You can see that Elasticsearch has divided the text into two terms: the first one has
a token value of elasticsearch and the second one has a token value of server.
Now let's look at how the query text was analyzed. We can do this by running the
following command:
curl -XGET 'localhost:9200/books/_analyze?pretty&field=title' -d
'elasticsearch'
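The response should again contain a single token and look similar to the following one:

{
  "tokens" : [ {
    "token" : "elasticsearch",
    "start_offset" : 0,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 0
  } ]
}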
We can see that the word is the same as the original one that we passed to the query.
We won't get into the Lucene query details and how the query parser constructed
the query, but in general the indexed term after the analysis was the same as the
one in the query after the analysis; so, the document matched the query and the
result was returned.
Please remember to enclose the URL of the request using the ' characters because,
on Linux-based systems, the & character will be analyzed by the Linux shell.
The query
The q parameter allows us to specify the query that we want our documents to
match. It allows us to specify the query using the Lucene query syntax described
in the Lucene query syntax section later in this chapter. For example, a simple
query would look like this: q=title:elasticsearch.
Analyzer
The analyzer property allows us to define the name of the analyzer that should
be used to analyze our query. By default, our query will be analyzed by the same
analyzer that was used to analyze the field contents during indexing.
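For example, to force the standard analyzer to be used when analyzing the query, we could run a command similar to this one:

curl -XGET 'localhost:9200/books/_search?pretty&q=title:solr&analyzer=standard'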
Query explanation
If we set the explain parameter to true, Elasticsearch will include additional
explain information with each document in the result, such as the shard from
which the document was fetched and the detailed information about the scoring
calculation (we will talk more about it in the Understanding the explain information
section in Chapter 6, Make Your Search Better). Also remember not to fetch the explain
information during normal search queries because it requires additional resources
and degrades the performance of the queries. For example, a query that includes
explain information could look as follows:
curl -XGET 'localhost:9200/books/_search?pretty&explain=true&q=title:solr'
The results returned by Elasticsearch for the preceding query would be as follows:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.70273256,
"hits" : [ {
"_shard" : 2,
"_node" : "v5iRsht9SOWVzu-GY-YHlA",
"_index" : "books",
"_type" : "solr",
"_id" : "2",
"_score" : 0.70273256,
"_source" : {
"title" : "Solr Cookbook Third Edition",
"published" : 2015
},
"_explanation" : {
"value" : 0.70273256,
"description" : "weight(title:solr in 0)
[PerFieldSimilarity], result of:",
"details" : [ {
"value" : 0.70273256,
"description" : "fieldWeight in 0, product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(freq=1.0), with freq of:",
"details" : [ {
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
} ]
}, {
"value" : 1.4054651,
"description" : "idf(docFreq=1, maxDocs=3)",
"details" : [ ]
}, {
To query for the elasticsearch book phrase in the title field, we will pass the
following query:
title:"elasticsearch book"
You may have noticed the name of the field at the beginning and the term or the
phrase after it.
As we already said, the Lucene query syntax supports operators. For example, the +
operator tells Lucene that the given part must be matched in the document, meaning
that the term we are searching for must be present in the field in the document. The -
operator is the opposite, which means that such a part of the query can't be present
in the document. A part of the query without the + or - operator will be treated as
the given part of the query that can be matched but it is not mandatory. So, if we
want to find a document with the book term in the title field and without the cat
term in the description field, we send the following query:
+title:book -description:cat
We can also group multiple terms with parentheses, as shown in the following query:
title:(crime punishment)
We can also boost parts of the query (this increases their importance for the scoring
algorithm; the higher the boost, the more important the query part is) with the
^ operator and the boost value after it, as shown in the following query:
title:book^4
These are the basics of the Lucene query language and should allow you to use
Elasticsearch and construct queries without any problems. However, if you are
interested in the Lucene query syntax and you would like to explore that in
depth, please refer to the official documentation of the query parser available at
http://lucene.apache.org/core/5_4_0/queryparser/org/apache/lucene/
queryparser/classic/package-summary.html.
Summary
In this chapter, we learned what full text search is and the contribution Apache
Lucene makes to this. In addition to this, we are now familiar with the basic
concepts of Elasticsearch and its top-level architecture. We used the Elasticsearch
REST API not only to index data, but also to update, retrieve, and finally delete it.
We've learned what versioning is and how we can use it for optimistic locking in
Elasticsearch. Finally, we searched our data using the simple URI query.
In the next chapter, we'll focus on indexing our data. We will see how Elasticsearch
indexing works and what the role of primary shards and replicas is. We'll see
how Elasticsearch handles data that it doesn't know and how to create our own
mappings, the JSON structure that describes the structure of our index. We'll
also learn how to use batch indexing to speed up the indexing process and what
additional information can be stored along with our index to help us achieve our
goal. In addition, we will discuss what an index segment is, what segment merging
is, and how to tune a segment. Finally, we'll see how routing works in Elasticsearch
and what options we have when it comes to both indexing and querying routing.