✒️ Elasticsearch Certified Engineer notes
✍ Personal notes and exercises created to pass the Elastic Certified Engineer certification
🏅 Certification successfully passed
⚠ The following are my personal notes, so I assume no responsibility or liability for any errors or omissions in the content.
📧 Found an error or have a question? Write to me or leave a comment
🗂 Index
- 🗂 Index
- 🗺️ Course summary
- 🎓 Topics notes
- 👨🏭 How to
- 🐳 Deepenings
- 💊 Pills
- 🤝 Advices
- 📔 Dictionary
- 🙏 Resources
🗺️ Course summary
Official summary Link - Certification FAQ - Stack subscriptions
An Elastic Certified Engineer can deploy a cluster, write precise queries and complex aggregations, create optimized mappings with custom analyzers, manage shard allocation as ingest increases, troubleshoot node issues, and more. - blog
-
Summary
- ⚠️ Warnings: topics seen on the exam but not covered by these notes
- Data Management
- Define an index that satisfies a given set of requirements
- Use the Data Visualizer to upload a text file into Elasticsearch *
- Define and use an index template for a given pattern that satisfies a given set of requirements
- Define and use a dynamic template that satisfies a given set of requirements
- Define an Index Lifecycle Management policy for a time-series index *
- Define an index template that creates a new data stream *
- Searching Data
- Write and execute a search query for terms and/or phrases in one or more fields of an index
- Write and execute a search query that is a Boolean combination of multiple queries and filters
- Write an asynchronous search *
- Write and execute metric and bucket aggregations
- Write and execute aggregations that contain sub-aggregations
- Write and execute a query that searches across multiple clusters
- Developing Search Applications
- Highlight the search terms in the response of a query
- Sort the results of a query by a given set of requirements
- Implement pagination of the results of a search query
- Define and use index aliases
- Define and use a search template
- Data Processing
- Define a mapping that satisfies a given set of requirements
- Define and use a custom analyzer that satisfies a given set of requirements
- Define and use multi-fields with different data types and/or analyzers
- Use the Reindex API and Update By Query API to reindex and/or update documents
- Define and use an ingest pipeline that satisfies a given set of requirements, including the use of Painless to modify documents
- Configure an index so that it properly maintains the relationships of nested arrays of objects
- Cluster Management
- Diagnose shard issues and repair a cluster’s health
- Backup and restore a cluster and/or specific indices
- Configure a snapshot to be searchable
- Configure a cluster for cross-cluster search
- Implement cross-cluster replication *
- Define role-based access control using Elasticsearch Security
🎓 Topics notes
📝 - We will explore each exam topic listed on the exam page
🧰 - Exam specs:
∙ Exam ES version: 7.13
∙ Elasticsearch 7.13 guide
∙ Kibana 7.13 guide
Legend:
⭐ - Relevant topic
💡 - Tips
🦂 - Tricky point
Aware:
🔗 - Original links to sources are often provided
🖱️ - The code sometimes contains invalid inline
comments, included for study purposes
Examples:
🤖 - There are a lot of code examples; we suggest
trying everything on your machine
💻 - All the examples were executed on a laptop
with 16GB RAM and an i5 CPU, inside Docker containers
🔷 Data Management
-
Questions
🔹 Define an index that satisfies a given set of requirements
- See 🔹 Define a mapping that satisfies a given set of requirements chapter
🔹 Use the Data Visualizer to upload a text file into Elasticsearch
- Data visualizer
- [video] How to import data on Data visualizer
Elasticsearch Data Visualizer for Files - section to upload log data - doc
Kibana → Analytics → Machine Learning → Data Visualizer - Basically, you can upload CSV, TSV, JSON, and log data directly from the Kibana UI
- Want to try? Here is a CSV with Italian cities info
- Index Pattern
- [video] How to import data on Data visualizer
🔹 Define and use an index template for a given pattern that satisfies a given set of requirements
🔗 Official docs
- An index template is a way to tell Elasticsearch how to configure an index when it is created.
- It can be composed of component templates: reusable building blocks that configure mappings, settings, and aliases
- Composable templates: the new (ES v7.8+) index templates that replace the legacy templates - link
- 💡 If an index is created with explicit settings and also matches an index template, the settings from the create index request take precedence
- Changes to index templates do not affect existing indices, including the existing backing indices of a data stream.
- Create a template
-
API docs
-
🖱️ Code example
- 🦂 Note that Kibana suggestions don’t show the template field, although it is required to wrap the mappings and settings fields
GET _cat/indices?v # --- # Index template # --- # Create some composable templates PUT _component_template/component_template1 { "template": { "mappings": { "properties": { "@timestamp": { "type": "date" }, "name": { "type": "keyword" }, "bio":{ "type": "text", "analyzer": "simple" } } } } } PUT _component_template/runtime_component_template { "template": { "mappings": { "runtime": { "day_of_week": { "type": "keyword", "script": { "source": "emit(doc['@timestamp'].value.dayOfWeekEnum.getDisplayName(TextStyle.FULL, Locale.ROOT))" } } } } } } GET _component_template/runtime_component_template # Create the template PUT _index_template/template_1 { "index_patterns": ["te*", "bar*"], "template": { "settings": { "number_of_shards": 1 }, /* The following mapping overwrite potential template specs*/ "mappings": { "_meta": { "description": "Generated using `template_1` " }, "_source": { "enabled": false }, "properties": { "host_name": { "type": "keyword" }, "created_at": { "type": "date", "format": "EEE MMM dd HH:mm:ss Z yyyy" } } }, "aliases": { "mydata": { } } }, "priority": 500, "composed_of": ["component_template1", "runtime_component_template"], "version": 1, "_meta": { "description": "Testing templating system" } } # Create an index that uses the template PUT bar_index_1 GET bar_index_1
-
🔹 Define and use a dynamic template that satisfies a given set of requirements
🔗 Official docs
-
Greater control of how Elasticsearch maps your data beyond the default dynamic field mapping rules.
-
💡 You can create rules to map new fields (dynamically added, so not explicitly declared in the index’s original mapping) to the desired types
-
🖱️ Code example
# ───────────────────────────────────────────── # Basic example with dynamic_templates: # Map all fields that start with *ip** to IP type # ───────────────────────────────────────────── # Create the index PUT my-index-000001/ { "mappings": { "dynamic": "true", "dynamic_templates": [ { "strings_as_ip": { "match_mapping_type": "string", "match": "ip*", "runtime": { "type": "ip" } } } ] } } # One field PUT my-index-000001/_doc/1 { "ip_host":"0.0.0.0", "host_ip":"0.0.0.0" } GET my-index-000001/_search { "query": { "term": { "host_ip": "0.0.0.0/16" } } } # > 0 hits found GET my-index-000001/_search { "query": { "term": { "ip_host": "0.0.0.0/16" } } } # > 1 hit found DELETE my-index-000001
-
🖱️ Code example
# ───────────────────────────────────────────── # Dynamic templates example: # create a *full_name* field with desired format. # # Relevant fields involved: # - Patch match/unmatch # - copy_to # ───────────────────────────────────────────── PUT my-index-000001 { "mappings": { "dynamic_templates": [ { "full_name": { "path_match": "name.*", "path_unmatch": "*.middle", "mapping": { "type": "keyword", "copy_to": "full_name" } } } ] } } PUT my-index-000001/_doc/1 { "name": { "first": "John", "middle": "Winston", "last": "Lennon" } } GET my-index-000001/_search { "query": { "match": { "full_name": { "query": "John Winston", "operator" : "and" } } } } # > 0 hits GET my-index-000001/_search { "query": { "match": { "full_name": { "query": "John Lennon", "operator" : "and" } } } } # > 1 hits DELETE my-index-000001
🔹 Define an Index Lifecycle Management (ILM) policy (ILP) for a time-series index
-
⭐ ILM
🔗 Official docs
- 💡 Automatically manage indices according to your performance, resiliency, and retention requirements.
- 🦂 Don’t confuse ILM with the backup system: for backups there is a dedicated feature named SLM (Snapshot Lifecycle Management)
- Actions you could trigger (see the sketch of the standalone APIs after this list):
  - Rollover: create a new index when the current one reaches some limit
  - Shrink: reduce the number of primary shards - also possible with the shrink API
    - 💡 Tip: why shrink an index? → To reduce overhead - link
  - Force merge: reduce the number of segments in the index’s shards - also possible with the force merge API
  - Freeze: freeze an index - also possible with the freeze API
  - Delete: delete the index
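- 🖱️ A minimal sketch of the standalone APIs behind these actions; the index names (my-index, my-shrunken-index) and the node name (es01) are hypothetical:
# --- Shrink API ---
# Prerequisites: all shards copied to one node and the index set to read-only
PUT my-index/_settings
{
  "index.routing.allocation.require._name": "es01",
  "index.blocks.write": true
}
POST my-index/_shrink/my-shrunken-index
{
  "settings": { "index.number_of_shards": 1 }
}

# --- Force merge API ---
POST my-index/_forcemerge?max_num_segments=1

# --- Freeze API ---
POST my-index/_freeze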
- Index lifecycle temperatures
- Lifecycle phases - with data behavior description
  - Hot: inserts and queries
  - Warm: queries
  - Cold: queried infrequently
  - Frozen: queried rarely
  - Delete: no longer used
- ILM moves indices through the lifecycle according to their age
- Actions available for each phase: list
- Lifecycle phases are useful when we move data to less expensive hardware,
  i.e. to nodes belonging to different data tiers
- Create an ILM
- Using Kibana: Stack Management > Index Lifecycle Policies - example
- API - docs
-
🦂 Hot phase and read-only action:
You can set read-only under the hot phase during policy creation; without context this doesn’t make sense.
- The read-only action refers to the indices archived after a rollover, as described in the GitHub issue
- Moreover: to enable read-only in the API call, the rollover action must be present - doc
-
🦂 What happens if we set min_age > 0ms in the hot phase?
- No official answer was found, but in the Kibana Edit policy section you cannot set the min_age parameter, so we can assume this parameter is ignored for the hot phase
-
Get an index’s lifecycle status using the ILM explain API - docs
GET <target>/_ilm/explain
-
🖱️ Code example
# ───────────────────────────────────────────── # ILM # > Example of almost all ILM settings available # ───────────────────────────────────────────── PUT _ilm/policy/my_policy { "policy": { "phases": { "hot": { "min_age": "0ms", # Assumption: will be ignored (see notes) "actions": { "readonly": {}, # Referred to rollover indexes "shrink": { "number_of_shards": 1 # Reduce the number of primary shards }, "rollover": { # Move the insertions to new indexes "max_primary_shard_size": "50gb", "max_age": "30d", "max_docs": 1000 }, "forcemerge": { # Merge the lucene segments "max_num_segments": 1, "index_codec": "best_compression" }, "set_priority": { # Set index recovery priority "priority": 100 } } }, "warm": { "min_age": "7d", # Min age to enter in the phase "actions": { "set_priority": { "priority": 50 }, "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 }, "allocate": { "number_of_replicas": 3 }, "readonly": {} } }, "cold": { "min_age": "14d", "actions": { "set_priority": { "priority": 0 } } }, "delete": { "min_age": "365d", "actions": { "delete": { "delete_searchable_snapshot": true } } } } } } GET _ilm/policy/my_policy # Create an index use the policy PUT my-index-3 { "settings": { "number_of_shards": 3, "number_of_replicas": 1, "index.lifecycle.name": "my_policy" } } # Check GET my-index-3/_ilm/explain
-
-
Time series & Data streams
🔗 Time series docs
🦂 Some documentation is under the How To section,
which may not be available during the exam
🔗 Data streams docs
-
Time series:
- “A series of data points indexed (or listed or graphed) in time order” - wiki
-
Data streams
ES structure to manage time-series data
- 💡 “Data Streams is just an improved API and a better user-experience for using the Rollover API for partitioning data into more indexes” - web
- “A data stream lets you store append-only time series data across multiple indices while giving you a single named resource for requests.” - doc
- Data streams components:
  - ILM policy with rollover definition
    - Each time the rollover process runs, a new backing index is created
    - Search queries are forwarded to all the backing indices
    - When you index a new document, only the latest backing index is used
    - You cannot add new documents to backing indices other than the latest one, even by sending requests directly to the index shard
  - Index template
    - "data_stream": { } - mandatory parameter, specifies that the created index is a “data stream”
    - "@timestamp" - mandatory field, used to time-order the data
    - You cannot delete a template used by a data stream
    - See the chapter Define an index template that creates a new data stream for the Kibana code
-
-
ILM & Time series
ES offers features to help you store, manage, and search time series data,
such as logs and metrics - doc
- 💡 To manage time-series data, ES offers different technologies that you should use together (see the data tiers sketch after this list):
  - [optional] Data tiers: create multiple nodes with different HW specs
    (fast HW for hot data, slow and cheap HW for cold data)
  - [optional] Create a snapshot repository: store the data on distributed file storage (e.g. Google Cloud Storage) for backup purposes
  - Create an ILM policy: define the backing indices’ lifecycle, from creation to storage on cold tiers and eventual deletion
  - Create an index template: the time-series field mapping must contain @timestamp
  - Create the index that uses the template and start ingesting data
- 🔗 More on ES documentation
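- 💡 A minimal sketch of how data tiers can be declared per node; the node names (es01/es02/es03) and the role split are chosen for illustration:
# elasticsearch.yml of a hot-tier node
node.name: es01
node.roles: [ master, data_hot, data_content ]

# elasticsearch.yml of a warm-tier node
node.name: es02
node.roles: [ data_warm ]

# elasticsearch.yml of a cold-tier node
node.name: es03
node.roles: [ data_cold ]

# Check the roles from Kibana
GET _cat/nodes?v&h=name,node.role
# > shows the role letters per node (e.g. h = data_hot, w = data_warm, c = data_cold)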
🔹 Define an index template that creates a new data stream
-
What is a data stream? → see the previous chapter
- tl;dr; “Data Streams is just an improved API and a better user-experience for using the Rollover API for partitioning data into more indexes” - web
-
⭐ Create a data stream
📎 Official docs
-
Five steps
- [optional] Create an index lifecycle policy
- [optional] Create component templates
- Create an index template
- Create the data stream
- [optional] Secure the data stream
-
Basic data stream creation (only steps 3 & 4)
📎 Data stream creation API
-
💡 A data stream index must include the @timestamp field - doc
-
💡 For the data stream index naming there is an official naming scheme
-
🖱️ Code example
- 🦂 The following example is good for getting the hang of the API, but not so useful in a real-world scenario: without ILM, a data stream is not so different from a normal index
# ───────────────────────────────────────────── # Basic data stream creation: # define an index template that # creates a new data stream # ───────────────────────────────────────────── # --- # Create an index template # --- # Create data stream template PUT _index_template/my-stream-template { "index_patterns": [ "my-logs-backend-*" ], "data_stream": {}, "template": { "mappings": { "properties": { "@timestamp": { "type": "date" }, "message": { "type": "text" }, "relevance": { "type": "long" } } } } } # Note: # - data stream must include `@timestamp` field # - template must include `data_stream` field # --- # Create the data stream # --- PUT _data_stream/my-logs-backend-test GET _data_stream/my-logs-backend-test # > 200 # Note: # - "index_name" is not a human-friendly string, and start with a point # - "timestamp_field" is automatically linked to "@timestamp" # - "template" used is the previous "my-stream-template" GET _cat/indices # > 200 # Note: the previous "index_name" is present # --- # Load & search some data # --- POST my-logs-backend-test/_doc { "@timestamp": "2020-01-01T00:00:00", "message": "my first message", "relevance": 1 } POST my-logs-backend-test/_doc { "@timestamp": "2020-01-02T00:00:00", "message": "bla bla bla bla - my second message", "relevance": 2 } POST my-logs-backend-test/_doc { "@timestamp": "2020-01-02T00:00:01", "message": "low level message: relevance 3", "relevance": 3 } GET my-logs-backend-test/_search { "query": { "match": { "message": "message" } } } # > all docs returned GET my-logs-backend-test/_search { "query": { "bool": { "filter": [ { "term": { "relevance": 3 } } ] } } } # > 1 hit found
-
-
-
Use a data stream
🔗 Official doc
-
🖱️ Code example
- Section 1: exploring ILM operations# Cluster to use: `04_snapshots-locals` # ------------------------------------- # Create ILM: exploring ILM operations # ------------------------------------- DELETE _ilm/policy/my-hwc-policy PUT _ilm/policy/my-hwc-policy { "policy": { "phases": { "hot": { "actions": { "rollover": { "max_docs": 1 }, "set_priority": { "priority": 100 }, "forcemerge": { "max_num_segments": 1 } }, "min_age": "0ms" }, "warm": { "min_age": "10s", "actions": { "set_priority": { "priority": 50 }, "allocate": { "number_of_replicas": 0 } } }, "cold": { "min_age": "1m", "actions": { "set_priority": { "priority": 0 } } } } } } # > 200 # Note: rollover after 1 document, # warm after 10s, cold after 1m PUT _cluster/settings { "transient": { "indices.lifecycle.poll_interval": "3s" } } # > 200 # Note: https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-with-existing-indices.html#ilm-existing-indices-reindex # --- # Test the ILM # --- DELETE test-index-01 PUT test-index-01 { "settings": { "number_of_shards": 1, "number_of_replicas": 0, "index.lifecycle.name": "my-hwc-policy" } } # Note: no alias provided GET _cat/shards/test*?v # > node es01 GET _cat/indices/test*?v # > test-index-01, count 0 PUT test-index-01/_doc/01 {"foo":"bar"} GET test-index-01/_ilm/explain # > stack_trace: java.lang.IllegalArgumentException: setting [index.lifecycle.rollover_alias] for index [test-index-01] is empty or not defined # [!] To use rollover, we need to set the alias! DELETE test-index-01 PUT test-index-01 { "settings": { "number_of_shards": 1, "number_of_replicas": 0, "index.lifecycle.name": "my-hwc-policy", "index.lifecycle.rollover_alias": "test-index-01-alias" } } # > 200 POST _aliases { "actions": [ { "add": { "index": "test-index-01", "alias": "test-index-01-alias", "is_write_index": true } } ] } # [!] Alias must exist to have rolling system PUT test-index-01/_doc/01 {"foo":"bar"} PUT test-index-01/_doc/02 {"foo":"bar"} GET test-index-01/_count # > 2 GET _cat/shards/test*?v # > Index: test-index-000002 - Node: es01 # > Index: test-index-01 - Node: es01 # Wait for 10s ... GET _cat/shards/test*?v # > Index: test-index-000002 - Node: es01 # > Index: test-index-01 - Node: es01 # [!] Index not moved because no template was defined, # so the `test-index-000002`hasn't the IL # --- # Tip: se alias for enable writing # --- POST _aliases { "actions": [ { "add": { "index": "test-index-01", "alias": "test-index-01-alias", "is_write_index": true } } ] } PUT test-index-01-alias/_doc/03 {"foo":"bar"} # > 200 PUT _cluster/settings { "transient": { "indices.lifecycle.poll_interval": null } } # > 200 # Note: https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-with-existing-indices.html#ilm-existing-indices-reindex
-
🖱️ Code example
- Section 2: ILM data-stream like# Cluster to use: `04_snapshots-locals` # ------------------------------------- # Create index with data-stream like behaviour # ------------------------------------- DELETE _ilm/policy/my-hwc-policy PUT _ilm/policy/my-hwc-policy { "policy": { "phases": { "hot": { "actions": { "rollover": { "max_docs": 1 }, "set_priority": { "priority": 100 }, "forcemerge": { "max_num_segments": 1 } }, "min_age": "0ms" }, "warm": { "min_age": "10s", "actions": { "set_priority": { "priority": 50 }, "allocate": { "number_of_replicas": 0 } } }, "cold": { "min_age": "1m", "actions": { "set_priority": { "priority": 0 } } } } } } # > 200 # Note: rollover after 1 document, # warm after 10s, cold after 1m PUT _cluster/settings { "transient": { "indices.lifecycle.poll_interval": "3s" } } # > 200 # Note: https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-with-existing-indices.html#ilm-existing-indices-reindex # --- # Create an index that rolls over after 1 document, # move the old index to warm and then to the cold, # and meanwhile an alias is updated accordingly # --- DELETE _index_template/test-index-template PUT _index_template/test-index-template { "index_patterns": ["test-index-*"], "template": { "settings": { "number_of_shards": 1, "number_of_replicas": 0, "index.lifecycle.name": "my-hwc-policy", "index.lifecycle.rollover_alias": "test-index-alias" } } } DELETE test-index-000001 PUT test-index-000001 # [!] Warning: name MUST end with `-000001`, or # the ILM process will broke POST _aliases { "actions": [ { "add": { "index": "test-index-000001", "alias": "test-index-alias", "is_write_index": true } } ] } GET _cat/shards/test-index*?v # > index: test-index; node: es01 PUT test-index-000001/_doc/01 {"foo":"bar"} PUT test-index-alias/_doc/02 {"foo":"bar"} GET test-index-000001/_count # > 2 GET _cat/shards/*test*?v # > index: test-index-000001; node: es01 # > index: test-index-000002; node: es01 GET test-index-000001/_ilm/explain # Note: use to check if errors occur # Wait for 10s... GET _cat/shards/*test*?v # > index: test-index-000001; node: es02 # > index: test-index-000002; node: es01 # Note: the first index is moved to es02 (warm) # a new index (...0002) is created, and # the alias is updated, with ...0002 index as writing index GET _alias/test-index-alias # > test-index-000002; "is_write_index" : true # > test-index-000001; "is_write_index" : false # wait 60s... GET _cat/shards/*test*?v # > index: test-index-000001; node: es03 # > index: test-index-000002; node: es01 # --- # And so on... # --- PUT test-index-alias/_doc/03 {"foo":"bar"} GET test-index-alias/_count GET _cat/shards/*test*?v # > index: test-index-000001; node: es03 # > index: test-index-000002; node: es01 # > index: test-index-000003; node: es01
-
🖱️ Code example
- Section 3: use data stream# --- # Create ILM # --- PUT _ilm/policy/hwc-policy { "policy": { "phases": { "hot": { "actions": { "rollover": { "max_docs": 1 }, "set_priority": { "priority": 100 } }, "min_age": "0ms" }, "warm": { "min_age": "10s", "actions": { "set_priority": { "priority": 50 }, "allocate": { "number_of_replicas": 0 } } }, "cold": { "min_age": "60s", "actions": { "set_priority": { "priority": 0 } } } } } } # > 200 # Note: rollover after 1 doc, warm after 10s, cold after 60s # --- # Create data stream # --- PUT _index_template/data-stream-template { "index_patterns": ["test-data-stream*"], "data_stream": { }, "template": { "settings": { "index.lifecycle.name": "hwc-policy", "number_of_replicas": 0 }, "mappings": { "properties": { "@timestamp": { "type": "date" } } } } } # [!] one `date` field is mandatory(@timestamp) # Note: we don't need to define alias even if we are using # the rollover functionality (like without data_stream) PUT _data_stream/my-new-data-stream # > 400; no matching index template found DELETE _data_stream/test-data-stream-01 PUT _data_stream/test-data-stream-01 # >200 # Note: index name must coincide with one index_patterns # of a template with data_stream enabled GET _data_stream # > "name" : "test-data-stream-01" # > "index_name" : ".ds-test-data-stream-01-2021.12.05-000001", GET _cat/indices/*test*?v # > .ds-test-data-stream-01-2021.12.05-000001 GET _alias # > ".ds-test-data-stream-01-2021.12.05-000001" : { # > "aliases" : { } # > }, # Note: alias not yet created GET _cat/shards/*test*?v # > index: .ds-test-data-stream-01-2021.12.05-000001; node: 01 # --- # Ingest some data # --- PUT test-data-stream-01/_bulk { "create":{ } } { "@timestamp": "2099-05-06T16:21:15.000Z", "message": "192.0.2.42 - - [06/May/2099:16:21:15 +0000] \"GET /images/bg.jpg HTTP/1.0\" 200 24736" } { "create":{ } } { "@timestamp": "2099-05-06T16:25:42.000Z", "message": "192.0.2.255 - - [06/May/2099:16:25:42 +0000] \"GET /favicon.ico HTTP/1.0\" 200 3638" } GET test-data-stream-01/_count # > count: 2 # Wait for 10s... GET _cat/indices/*test*?v # > ...001 # > ...002 # Note: new index ...002 created GET _cat/shards/*test*?v # > ...001; node: es02 # > ...002; node: es01 # Wait for 60s... GET _cat/shards/*test*?v # > ...01; node: es03 # > ...02; node: es01 # --- # Explore aliases # --- GET _alias # > .ds-test-data-stream-01-2021.12.05-000001; no alias # > .ds-test-data-stream-01-2021.12.05-000002; no alias GET _data_stream # > indices" : [ ...01, ...02 GET _cat/indices # Note: data stream not shown! PUT .ds-test-data-stream-01-2021.12.05-000001/_doc/99 { "foo": "bar", "@timestamp": "2099-05-06T16:21:15.000Z" } # > 400
-
🔷 Searching Data
-
Questions
🔹 Write and execute a search query for terms and/or phrases in one or more fields of an index
-
Elasticsearch is queried through a specific language called the Query DSL (Domain Specific Language); we will combine multiple sections and elements of this language to retrieve the data with the desired characteristics - more on the doc
-
Use Full text queries to search analyzed text fields - doc
- Some interesting queries from the list:
  - match - the standard search mode
  - term - search for an exact term
  - 💡 match_phrase - search for phrases, use when the order of the words is important
    - Use the slop parameter to set the maximum number of intervening unmatched positions (see the sketch after this list)
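- 💡 A minimal sketch of the slop parameter, using a hypothetical index and field:
# match_phrase with slop: "quick fox" also matches "quick brown fox"
# (hypothetical index and field names)
GET my-test-index/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "quick fox",
        "slop": 1
      }
    }
  }
}
# > with "slop": 0 (the default) the same query would not match "quick brown fox"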
-
🖱️ Code example
- 💡 The following code block answers a question in this form:
  “Search all results that must satisfy clause X, where satisfying clause Y is a nice-to-have.”
  How can we solve this? By using a filter for X plus a should clause with a normal match for Y.
-
Highlight extracted from the next code block:
# Search all docs where the comment must contain the word "film",
# and is a "nice-to-have" if the "phrase" field contains the "life" word
GET test-index-01/_search
{
  "query": {
    "bool": {
      "filter": { "term": { "comment": "film" } },
      "should": [ { "match": { "phrase": "life" } } ]
    }
  }
}
-
# ───────────────────────────────────────────── # Search of text and keywords types # with multiple "full text queries" and # multiple index fields # ───────────────────────────────────────────── DELETE test-index-01 PUT test-index-01 { "mappings": { "properties": { "phrase": { "type": "text" }, "book": { "type": "keyword" }, "author": { "type": "keyword" }, "comment": { "type": "text" }, "review_date": { "type": "date", "format": "yyyy/MM/dd HH:mm:ss||HH:mm:ss yyyy/MM/dd" } } } } PUT test-index-01/_doc/1 { "phrase": "It was a bright cold day in April, and the clocks were striking thirteen", "book": "1984", "author": "George-Orwell", "comment": "A book everyone should read - recommended", "review_date": "2021/05/25 12:10:30" } PUT test-index-01/_doc/2 { "phrase": "Mr. Jones, of the Manor Farm, had locked the hen-houses for the night, but was too drunk to remember to shut the pop-holes", "book": "Animal Farm", "author": "George-Orwell", "comment": "A great classic of literature - recommended", "review_date": "2021/02/02 16:10:30" } PUT test-index-01/_doc/3 { "phrase": "Review the software license agreements for currently shipping Apple products", "book": "Software License Agreements", "author": "Apple", "comment": "Important but boring EULA informations - not recommended", "review_date": "2021/10/25 12:10:30" } PUT test-index-01/_doc/4 { "phrase": "Tyler gets me a job as a waiter, after that Tyler's pushing a gun in my mouth and saying, the first step to eternal life is you have to die.", "book": "Fight Club", "author": "Chuck Palahniuk", "comment": "The book behind the grat film - recommended", "review_date": "2019/10/25 12:10:30" } PUT test-index-01/_doc/5 { "phrase": "noise noise noise", "book": "Test book 1", "author": "Jhon Doe", "comment": "The book behind the grat film - not yet recommended", "review_date": "2021/01/10 09:10:30" } PUT test-index-01/_doc/6 { "phrase": "noise noise noise", "book": "Test book 2", "author": "Jhon Doe", "comment": "Not a book, not a film - recommended", "review_date": "10:10:30 2021/10/25" } PUT test-index-01/_doc/7 { "phrase": "Mr. Jones, of the Manor Farm, had locked the hen-houses for the night, but was too drunk to remember to shut the pop-holes", "book": "Animal Farm - not recommended", "author": "George-Orwell", "comment": "Test test test test - not recommended", "review_date": "2019/12/31 00:01:30" } PUT test-index-01/_doc/8 { "phrase": "It's my life", "book": "foo", "author": "John Doe", "comment": "Only a boook - recommended" } # --- # Search recommended film # # > different attempts reported # with discussion on each behavior # --- GET test-index-01/_search { "query": { "bool": { "must_not": [ { "term": { "comment": { "value": "not recommended" } } } ] } } } # > wrong - all docs returned # Note: you sould not use term query on text fields, # as suggested on the official documentation. GET test-index-01/_search { "query": { "bool": { "must_not": [ { "match": { "comment": "not recommended" } } ] } } } # > wrong - no docs returned # Note: the query is asking for all docs # that doesn't have words "not" AND "recommended". # Because all docs have the word "recommended" no # results are found. 
Same output of `"comment": "recommended"` GET test-index-01/_search { "query": { "bool": { "must_not": [ { "match": { "comment": "not" } } ] } } } # > correct - but not resilient # Note: searching all comments doesn't have the # word `not` work for this small example but isn't # a reliable solution, the `not` term can easily be # used on the comment before the final verdict. # [!] Like the test document "6" that is wrongly not retrieved. GET test-index-01/_search { "query": { "bool": { "must_not": [ { "bool": { "must": [ { "match": { "comment": "not" } }, { "match": { "comment": "recommended" } } ] } } ] } } } # > correct - but not resilient # Note: same problem of previous example GET test-index-01/_search { "query": { "query_string": { "default_field": "comment", "query": "NOT not recommended" } } } # > correct - but not resilient # Note: same problem of previous example. # This query is similar of the last one, # but write in more concise format and # return a _score for each hit GET test-index-01/_search { "query": { "bool": { "must_not": [ { "match_phrase": { "comment": "not recommended" } } ] } } } # > correct - can be improved # Note: with match_phrase we are asking for # documents that have both words "not recommended" # one word after the other word (ordering of the words is important). # [!] The test document "5" is returned with a # text "not yet recommended", this behaviour may or # may not be desired depending on the use case. GET test-index-01/_search { "query": { "bool": { "must_not": [ { "match_phrase": { "comment": "not recommended" } }, { "match_phrase": { "comment": "not yet recommended" } } ] } } } # > correct # Note: exclude all text phrases # that are not the simple "recommended" # Note: another approach could be to use `slope` # --- # Recommended books written by Orwell # --- GET test-index-01/_search { "query": { "bool": { "must": [ { "match": { "author": "George-Orwell" } }, { "bool": { "must_not": [ { "match_phrase": { "comment": "not recommended" } }, { "match_phrase": { "comment": "not yet recommended" } } ] } } ] } } } # > correct - can be improved # Note: because we aren't required how good # "George-Orwell" match with the author field, # `match` isn't the best API to use GET test-index-01/_search { "query": { "bool": { "must_not": [ { "match_phrase": { "comment": "not recommended" } }, { "match_phrase": { "comment": "not yet recommended" } } ], "filter": [ { "term": { "author": "George-Orwell" } } ] } } } # > correct # Note: with `filter` API we could # speed up the search using caching # --- # Books with comments must talk about "film", # possibly with phrase spoke about `life` # --- GET test-index-01/_search { "query": { "bool": { "should": [ { "match": { "comment": "film" } }, { "match": { "phrase": "life" } } ] } } } # > wrong - doc 8 should not be present # Note: the "must spoke about film" restriction # is not represented in this query GET test-index-01/_search { "query": { "bool": { "filter": { "term": { "comment": "film" } }, "should": [ { "match": { "phrase": "life" } } ] } } } # > correct # Note: for the conjunction between two assertions: # "X must be true" and "Y is nice-to-have" # we could put X into a `filter` and use the # standard `match` to evaluate the Y # --- # Documents with comments written in 2021 # --- GET test-index-01/_search { "query": { "range": { "review_date": { "gte": "2021/01/01 00:00:00", "lte": "2022/01/01 00:00:00" } } } } # > correct # Note: all docs have the same _score # --- # All books where the author # have a name that 
contains # the character "o" # --- GET test-index-01/_search { "query": { "wildcard": { "author": { "value": "*o*", "case_insensitive":true } } } } # --- # All docs are written by an author # with a name similar to `Jeorge-Orbell` # --- GET test-index-01/_search { "query": { "fuzzy": { "author": { "value": "Jeorge-Orbell", "fuzziness": 2 } } } } # > All the books that are written by "George-Orwell"
🔹 Write and execute a search query that is a Boolean combination of multiple queries and filters
- See the previous question “Write and execute a search query for terms and/or phrases in one or more fields of an index”
- On the search API, bool statement usages (see the sketch below):
  - must → the query must be satisfied, and the score is tracked
  - filter → like must, but without scoring
  - should → a match is not required, but if it occurs the score increases
  - must_not → if the query matches, the document is discarded
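- 💡 A minimal sketch combining all four bool occurrence types on the test-index-01 index created in the previous code example (the specific terms searched are illustrative):
# Combine the four bool occurrence types:
# - must: required and scored
# - filter: required, not scored, cacheable
# - should: optional, increases the score when it matches
# - must_not: excludes matching documents
GET test-index-01/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "comment": "book" } }
      ],
      "filter": [
        { "term": { "author": "George-Orwell" } }
      ],
      "should": [
        { "match": { "phrase": "life" } }
      ],
      "must_not": [
        { "match_phrase": { "comment": "not recommended" } }
      ]
    }
  }
}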
🔹 Write an asynchronous search *
-
“Asynchronous search makes long-running queries feasible and reliable” - blog
-
“The async search API lets you asynchronously execute a search request, monitor its progress, and retrieve partial results as they become available.” - doc
- Create a standard query and run it asynchronously to receive a token that is used to monitor the query’s progress and gather data as it executes
-
🖱️ Code example
- You can also specify how long the async search needs to remain available through the keep_alive parameter - doc (see the sketch after the code example below)
- 💡 Async search does not support scroll - doc
# ───────────────────────────────────────────── # Make an async search query # ───────────────────────────────────────────── # Add "Sample eCommerce orders" sample # data directly from Kibana: # https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana # --- # Check data existence # --- GET _cat/indices?v # > "kibana_sample_data_ecommerce" PUT _cluster/settings { "transient": { "search.max_buckets": 2290000 } } # Note: increase max_buckets to allow # a heavy query to be performed. # [!] Warning: the following is a heavy query GET kibana_sample_data_ecommerce/_search { "query": { "range": { "order_date": { "gte": "now-1d" } } }, "aggs": { "time_buckets": { "date_histogram": { "field": "order_date", "fixed_interval": "1s", "extended_bounds": { "min": "now-1d" }, "min_doc_count": 0 } } }, "size": 0 } # > [wait ~1m] - results returned # --- # Try Async search # --- POST kibana_sample_data_ecommerce/_async_search?size=0 { "query": { "range": { "order_date": { "gte": "now-1d" } } }, "aggs": { "time_buckets": { "date_histogram": { "field": "order_date", "fixed_interval": "1s", "extended_bounds": { "min": "now-1d" }, "min_doc_count": 0 } } }, "size": 0 } # > "is_running": true - no hits # Note: copy the "id" value (we will refer as $ID) GET _async_search/status/FmxTakt4Z0dDU0MyaG9TUC1GNVhqamcbVHhDQV9EcENTV21EOHNtWWt0b3hIdzo0ODg2 # > wait for "is_running": false GET _async_search/FmxTakt4Z0dDU0MyaG9TUC1GNVhqamcbVHhDQV9EcENTV21EOHNtWWt0b3hIdzo0ODg2 # > the query results
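- 💡 A minimal sketch of the keep_alive parameter and the async search lifecycle APIs ($ID is a placeholder for the id returned by the submit call):
# Keep the async search available for 2 days instead of the default 5d,
# and return within 1 second even if the search is still running
POST kibana_sample_data_ecommerce/_async_search?keep_alive=2d&wait_for_completion_timeout=1s
{
  "query": {
    "match_all": {}
  }
}
# > copy the returned "id" (referred to as $ID below)

# Extend the retention while fetching the results
GET _async_search/$ID?keep_alive=3d

# Delete the async search (it is also cancelled if still running)
DELETE _async_search/$ID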
🔹 Write and execute metric and bucket aggregations
“Aggregation summarizes your data as metrics, statistics, or other analytics” - doc
📎 High-Level Concepts - doc
📎 Official doc
-
⭐ Aggregation is a powerful resource offered by Elasticsearch: it consists of the ability to group data into buckets and calculate metrics on those buckets.
- Some powerful characteristics:
  - Efficient: aggregations use internal structures for fast calculation and leverage the ES cluster’s scaling system
  - Near real time: as soon as a document is indexed, it is counted in the aggregation
  - Powerful: the aggregation structure supports nesting, letting the user aggregate and measure any sort of data; moreover, aggregations can be used in conjunction with the usual search system (the query field)
-
Bucket aggregations
-
“Bucket aggregations that group documents into buckets, also called bins, based on field values, ranges, or other criteria.” - doc
-
🖱️ Code example
# ───────────────────────────────────────────── # Bucket aggregations # ───────────────────────────────────────────── # --- # Data creation # --- # Add "Sample eCommerce orders" data directly from kibana, # Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana GET _cat/indices?v # > `kibana_sample_data_ecommerce` GET kibana_sample_data_ecommerce/_search?size=1 # > get and idea of the doc structure # --- # Bucket aggregation # --- GET kibana_sample_data_ecommerce # > check "category" field: is indexed both as text and keyword # How many "category" exist? GET kibana_sample_data_ecommerce/_search { "size": 0, "aggs": { "category_census": { "terms": { "field": "category.keyword" } } } } # > There are 2024 "Men's Clothing", 1136 "Women's Shoes"... # Note: # - "size": 0 because we don't need to # use the "search" system, this spec speed-up the process # - we use `terms` although only one field is used # - `category.keyword` required because is a "Multi-Fields" field [1] # What are the manufacturers for each category? GET kibana_sample_data_ecommerce/_search { "size": 0, "aggs": { "sale_categories": { "terms": { "field": "category.keyword" }, "aggs": { "category_manufacturers": { "terms": { "field": "manufacturer.keyword" } } } } } } # > e.g. "Elitelligence" is the most important manufacturer of category "Men's Clothing" ... # In which categories the top 3 manufacturers sell products? GET kibana_sample_data_ecommerce/_search { "size": 0, "aggs": { "manufacturers": { "terms": { "field": "manufacturer.keyword", "size": 3 }, "aggs": { "categories": { "terms": { "field": "category.keyword" } } } } } } # Warning: description of the `aggs.manufacturers.terms.size` parameter # (in other words the `size` parameter on an `aggs` field) # is not found on the official API documentation. So it "should be" # the top 3 manufacturers, the count order is not guaranteed # --- # Resources # --- # [1] https://www.elastic.co/guide/en/elasticsearch/reference/7.13/mapping-types.html#types-multi-fields
-
-
Metric aggregations
Calculate metrics on data that has been searched and/or grouped into buckets
-
“Calculate metrics, such as a sum or average, from field values.” - doc
-
🖱️ Code example
# ───────────────────────────────────────────── # Metrics aggregations # ───────────────────────────────────────────── # --- # Data creation # --- # Add "Sample eCommerce orders" data directly from kibana, # Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana GET _cat/indices?v # > `kibana_sample_data_ecommerce` GET kibana_sample_data_ecommerce/_search?size=1 # > get and idea of the doc structure # --- # Metrics aggregation # --- # AVG of products "price" GET kibana_sample_data_ecommerce/_search { "size": 0, "aggs": { "avg_price": { "avg": { "field": "products.price" } } } } # > average products price: 34.78€ # Most recent order GET kibana_sample_data_ecommerce/_search { "size": 0, "aggs": { "max_order_date": { "max": { "field": "order_date" } } } } # > "2021-11-13T23:45:36.000Z" # Older order GET kibana_sample_data_ecommerce/_search { "size": 0, "aggs": { "min_order_date": { "min": { "field": "order_date" } } } } # > "2021-10-14T00:04:19.000Z"
-
-
Bucket and metrics together
-
We can combine both functionalities to calculate metrics for all buckets simultaneously
-
🦂 Inside the aggs field of the query request, we can use a bucket or metric predicate interchangeably in the same place. It is the meaning of the predicate that differentiates a bucketization from a stats calculation.
- e.g. the terms predicate will create sub-groups (buckets), while avg will calculate the bucket’s average value
-
🖱️ Code example
# ───────────────────────────────────────────── # Bucket & Metrics aggregations # # Note: # - We will mix `query`, `bucket` and `aggs` terms # ───────────────────────────────────────────── # --- # Data creation # --- # Add "Sample eCommerce orders" data directly from kibana, # Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana GET _cat/indices?v # > `kibana_sample_data_ecommerce` GET kibana_sample_data_ecommerce/_search?size=1 # > get and idea of the doc structure # --- # Bucket & Metrics aggregation # --- # AVG price per category GET kibana_sample_data_ecommerce/_search?size=0 { "aggs": { "avg_categories_price": { "terms": { "field": "category.keyword" }, "aggs": { "avg_price": { "avg": { "field": "products.price" } } } } } } # > "Men's Clothing" avg price: 33.44 # > "Women's Clothing" avg price: 32.91 # > [...] # AVG price of "Men's Clothing" GET kibana_sample_data_ecommerce/_search { "size": 0, "query": { "bool": { "filter": [ { "match": { "category.keyword": "Men's Clothing" } } ] } }, "aggs": { "category_mens_clothing_avg_price": { "avg": { "field": "products.base_price" } } } } # > "value": 33.44 # Note: same result as before, but without the overhead # of calculate all categories AVG # Number of products bought per day POST kibana_sample_data_ecommerce/_search?size=0 { "aggs": { "daily_orders": { "date_histogram": { "field": "order_date", "calendar_interval": "day" }, "aggs": { "products_counter": { "value_count": { "field": "products._id.keyword" } } } } } } # > 2021-10-21 - 318 products bought # > 2021-10-22 - 334 products bought # Number of products bought per day - another solution GET kibana_sample_data_ecommerce/_search?size=0 { "aggs": { "LEVEL_1": { "date_histogram": { "field": "order_date", "interval": "day" }, "aggs": { "LEVEL_2": { "cardinality": { "field": "products._id.keyword" } } } } } } # Date with the max products bought GET kibana_sample_data_ecommerce/_search?size=0 { "aggs": { "daily_orders": { "date_histogram": { "field": "order_date", "calendar_interval": "day" }, "aggs": { "products_counter": { "value_count": { "field": "products._id.keyword" } } } }, "date_max_products_bought": { "max_bucket": { "buckets_path": "daily_orders.products_counter" } } } } # > "2021-10-29T00:00:00.000Z" with 368 products boughts # Note: "max_bucket" term use "daily_orders.products_counter" # that are both user-defined
-
🔹 Write and execute aggregations that contain sub-aggregations
-
With sub-aggregations we can go deeper in analyzing the data: we can create buckets with some criteria inside other buckets.
- E.g. we can create one bucket per day, and then aggregate inside each day with some other policy
-
🖱️ Code example
# ───────────────────────────────────────────── # Aggregations with sub-aggregations # ───────────────────────────────────────────── # Add "Sample eCommerce orders" sample # data directly from Kibana: # https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana # --- # Check data existence # --- GET _cat/indices?v # > "kibana_sample_data_ecommerce" GET kibana_sample_data_ecommerce/_search?size=1 # --- # manufacturers inside an each category # --- GET kibana_sample_data_ecommerce/_search { "size": 0, "aggs": { "category_family": { "terms": { "field": "category.keyword" }, "aggs": { "manufacturer_family": { "terms": { "field": "manufacturer.keyword" } } } } } } # > "Men's Clothing" category have 1242 products boughts from "Elitelligence" manufacturer # Note: the "manufacturer_family" is a sub-aggregation # Check the above affirmation is true GET kibana_sample_data_ecommerce/_count { "query": { "bool": { "must": [ { "match": { "category.keyword": "Men's Clothing" } }, { "match": { "manufacturer.keyword": "Elitelligence" } } ] } } } # > 1242 # Note: the last query comment is true
🔹 Write and execute a query that searches across multiple clusters
🔗 Official doc
-
You can connect ES clusters to allow a search query to be performed across all of them
- 🦂 Not all APIs are allowed; here is the complete list.
-
e.g. you cannot get a document from a remote cluster by _doc id:
GET local-index/_doc/01                   # ← allowed
GET remote-cluster:remote-index/_doc/01   # ← not allowed
-
- 🦂 Not all API are allowed, here is the complete list.
-
Basically use the following format if you want to query a remote cluster:
GET <remote-cluster-name>:<remote-index-name>/<API>
-
💡 You could check the remote cluster connection using the _remote API
GET /_remote/info
# > "<cluster name>"
# > "connected" : true,
# > "num_nodes_connected" : 1,
-
🖱️ Code example
- The cluster used for the following example is multicluster-configured
# ───────────────────────────────────────────── # Multiple clsuters search # # Architecture: # Two clusters (`cluster1`,`cluster2`), # with `cluster1` configured to be # connected to `cluster2` # Note: # - To run the experiment use the cluster "multicluster-configured": # https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles # - Pay attention to the comments: some kibana # code should be run on a different host # ───────────────────────────────────────────── # ───────────────────────────────────────────── # --- # Kibana code for `cluster2` # Tip: open `cluster2` kibana at localhost:5602 # and paste the following code # --- GET / # > `cluster2` GET /_remote/info # > no connections # Create some data PUT c2-index/_doc/01 { "msg": "Hello world form cluster 2!" } GET cluster1:c1-index/_search { "query": { "match_all": {} } } # > index not found # --- # [!] Kibana code for `cluster1` # Tip: open `cluster1` kibana at localhost:5601 # and paste the following code # --- GET / # > `cluster1` GET /_remote/info # `cluster2` connected # Note: "num_nodes_connected" should be at least 1 GET cluster2:c2-index/_search { "query": { "match_all": {} } } # > "msg" : "Hello world form cluster 2!" GET /c1-index,cluster2:c2-index/_search { "query": { "match_all": {} } } # > "Hello world form cluster 1!" # > "Hello world form cluster 2!" PUT c1-index/_doc/01 { "msg": "Hello world form cluster 1!" } # --- # Kibana code for `cluster2` # Tip: open `cluster2` kibana at localhost:5602 # and paste the following code # --- PUT _cluster/settings { "persistent": { "cluster": { "remote": { "cluster1": { "mode": "sniff", "seeds": [ "c1n1:9300" ], "transport.ping_schedule": "30s" } } } } } # 200 GET cluster1:c1-index/_search { "query": { "match_all": {} } } # > "msg" : "Hello world form cluster 1!"
-
🔷 Developing Search Applications
-
Questions
🔹 Highlight the search terms in the response of a query
-
“enable you to get highlighted snippets from one or more fields in your search results so you can show users where the query matches are” - doc
-
💡 ES internals:
At indexing time the text is parsed and tokenized for search purposes, and the tokens are used to build the inverted index. In this system the requirements for “highlighting” a piece of the original text aren’t met: e.g. we would also need to store/compute the original positions of the tokens.
ES has multiple solutions to this problem, explained in the Offset Strategy chapter.
-
🖱️ Code example
- The cluster used for the following example is single-node
# ───────────────────────────────────────────── # Highlighting # ───────────────────────────────────────────── # --- # Data creation # --- # Add "Sample eCommerce orders" data directly from kibana, # Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana # --- # Basic highlight test # --- PUT test-index-01 { "mappings": { "properties": { "msg": { "type": "text" } }, "_source": { "enabled": false } } } PUT test-index-02 { "mappings": { "properties": { "msg": { "type": "text" } }, "_source": { "enabled": true } } } PUT test-index-01/_doc/01 { "msg": "To be, or not to be, that is the question" } PUT test-index-02/_doc/01 { "msg": "To be, or not to be, that is the question" } GET test-index-01/_doc/01 # > "found" : true # Note: the body is not returned, because it wasn't stored GET test-index-02/_doc/01 # > "found" : true + "_source" with body GET test-index-02/_search { "query": { "match": { "msg": "to be" } }, "highlight": { "fields": { "msg": {} } } } # > <em>To</em> <em>be</em> GET test-index-01/_search { "query": { "match": { "msg": "to be" } }, "highlight": { "fields": { "msg": {} } } } # > hits with score, no highlight # Note: if the source text is not stored, # you cannot get the highlight # --- # Index settings and highlights behaviour # Experiment 1 # --- PUT test-index-03 { "mappings": { "_source": { "enabled": false }, "properties": { "msg":{ "type": "text", "term_vector": "with_positions_offsets" } } } } PUT test-index-03/_doc/01 { "msg": "To be, or not to be, that is the question" } GET test-index-03/_search { "query": { "match": { "msg": "to be" } }, "highlight": { "fields": { "msg": { "type": "unified" } } } } # > no highlights # Note: even with "term_vector": "with_positions_offsets", # if the source is not stored, the highlight couldn't # work. 
As described on the official documentation: # https://www.elastic.co/guide/en/elasticsearch/reference/7.13/mapping-source-field.html # --- # Index settings and highlights behaviour # Experiment 2 # --- GET test-index-02 # > _source enabled and "msg" type "text" GET test-index-02/_search { "query": { "match": { "msg": "to be" } }, "highlight": { "fields": { "msg": { "type": "fvh" } } } } # > error # Note: you should index the termvector # if you want to use the `fvh` highlighter PUT test-index-04 { "mappings": { "_source": { "enabled": true }, "properties": { "msg": { "type": "text", "term_vector": "with_positions_offsets" } } } } PUT test-index-04/_doc/01 { "msg": "To be, or not to be, that is the question" } GET test-index-04/_search { "query": { "match": { "msg": "to be" } }, "highlight": { "fields": { "msg": { "type": "fvh" } } } } # > <em>To</em> <em>be</em> # Note: this is the faster highlight mode available, # but to enable the `fvh` highlighter you need an # index with "term_vector": "with_positions_offsets", # and this parameter will double the size of the field: # https://www.elastic.co/guide/en/elasticsearch/reference/7.13/term-vector.html # --- # Highlights behaviour # --- PUT test-index-05 { "mappings": { "_source": { "enabled": true }, "properties": { "msg": { "type": "text" } } } } PUT test-index-05/_doc/01 { "msg": "To be, or not to be, that is the question" } GET test-index-05/_search { "query": { "match": { "msg": "to be" } }, "highlight": { "fields": { "msg": { "type": "unified" } }, "boundary_scanner": "word" } } # > "<em>To</em>", # --- # Highlight and bool # --- PUT test-index-06 { "mappings": { "_source": { "enabled": true }, "properties": { "phrase": { "type": "text" }, "comment": { "type": "text" } } } } PUT test-index-06/_doc/01 { "phrase": "To be, or not to be, that is the question", "comment": "he is questioning the value of life..." } PUT test-index-06/_doc/02 { "phrase": "The greatest glory in living lies not in never falling, but in rising every time we fall", "comment": "he was speaking about the power of persistence..." } PUT test-index-06/_doc/03 { "phrase": "I’ve learned that life is one crushing defeat after another until you just wish Flanders was dead.", "comment": "he is obese, immature, outspoken, aggressive, balding, lazy, ignorant,..." } GET test-index-06/_search { "query": { "bool": { "should": [ { "match": { "phrase": "Flanders" } }, { "match": { "comment": "he is" } } ] } }, "highlight": { "fields": { "phrase": { "type": "plain" } } } } # > <em>Flanders</em> # Note: only the `phrase` field is highlighted GET test-index-06/_search { "query": { "bool": { "must": [ { "match": { "phrase": "Flanders" } }, { "match": { "comment": "immature ignorant" } } ] } }, "highlight": { "fields": { "phrase": { "type": "plain" }, "comment": { "type": "plain" } } } } # > both fields highlighted GET test-index-06/_search { "query": { "bool": { "must": [ { "match": { "phrase": "Flanders" } }, { "match": { "comment": "he is" } } ] } }, "highlight": { "fields": { "comment": { "highlight_query": { "match": { "comment": "ignorant" } } } } } } # > <em>ignorant</em> # Note: the query search on some fields with some queries, # but we highlight work on something different.
🔹 Sort the results of a query by a given set of requirements
🔗 Official doc
-
“Allows you to add one or more sorts on specific fields.” - doc
-
Usually the _score field is used to order the documents, but other fields can also be involved in the ordering
-
🦂 If you use the sort field to order the results, the max_score value will be lost; use "track_scores": true if you want to keep it
-
🖱️ Code example
- The cluster used for the following example is single-node
# ───────────────────────────────────────────── # Sorting the search results # ───────────────────────────────────────────── # Add "Sample eCommerce orders" sample # data directly from Kibana: # https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana GET _cat/indices?v # > "kibana_sample_data_ecommerce" GET kibana_sample_data_ecommerce GET kibana_sample_data_ecommerce/_search?size=1 GET kibana_sample_data_ecommerce/_search { "_source": ["products.product_name"], "query": { "match": { "products.product_name": "basic dark" } } } # > "max_score" : 5.0528083, # > "Basic T-shirt - Dark Salmon" GET kibana_sample_data_ecommerce/_search { "_source": ["products.product_name"], "query": { "match": { "products.product_name": "basic dark" } }, "sort": [ { "_score": { "order": "desc" } } ] } # > "max_score" : 5.0528083, # > "Basic T-shirt - Dark Salmon" # Note: same results as before because ES # order by _score by default GET kibana_sample_data_ecommerce/_search { "_source": ["products.product_name"], "query": { "match": { "products.product_name": "basic dark" } }, "sort": [ { "_score": { "order": "asc" } } ] } # > "max_score" : null, # > "Cocktail dress / Party dress - peacoat" | "_score" : 0.81010413 # Note: we have reversed the score order, so now ES # don't know the max_score value and the first hit # is the less relevant respect to the query GET kibana_sample_data_ecommerce/_search { "_source": [ "products.product_name", "order_id" ], "query": { "match": { "products.product_name": "basic dark" } }, "sort": [ { "order_id": { "order": "asc" } } ] } # > "max_score" : null, # > "order_id" : 550375, # > "product_name" : "Basic T-shirt - Medium Slate Blue" # Note: the results has lost the _score for the same # cause of the last query. Here we are ordering on # `order_id` value: this means that the first hit # is the document that match (maybe with the lowest # score but this doesn't matter) the query AND have # the biggest `order_id` value GET kibana_sample_data_ecommerce/_search { "_source": [ "products.product_name", "order_id" ], "track_scores": true, "query": { "match": { "products.product_name": "basic dark" } }, "sort": [ { "order_id": { "order": "asc" } } ] } # > "max_score" : 5.0528083, # Note: we can specify that we want the max score also # --- # Order an aggregation # --- # Number o products bought per day, in asc order, # of the "Elitelligence" manufacturer GET kibana_sample_data_ecommerce/_search?size=0 { "query": { "bool": { "filter": [ { "term": { "manufacturer.keyword": "Elitelligence" } } ] } }, "aggs": { "daily_bucket": { "date_histogram": { "field": "order_date", "interval": "day" }, "aggs": { "n_products": { "value_count": { "field": "products._id.keyword" } }, "n_products_sort": { "bucket_sort": { "sort": [ { "n_products": { "order": "asc" } } ] } } } } } } # > "key_as_string" : "2021-11-07T00:00:00.000Z", # > "n_products" : { "value" : 70 } # Note: for the bucket ordering we haven't use # the "sort term", for buckets we need # buckets pipeline aggregation fields. # In this case the `Bucket sort` pipeline # was used: https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-pipeline-bucket-sort-aggregation.html
🔹 Implement pagination of the results of a search query
🔗 Official doc
-
A query can match a large number of docs, so usually we retrieve the results not all at once but one page after another.
-
There are two main ways to paginate the documents:
- using the from and size fields: recommended when the total hits to paginate are fewer than 10,000 (the default index.max_result_window limit)
- using the search_after field: recommended when the total hits to paginate exceed 10,000
-
Both pagination approaches assume that the index does not change between one page request and the next.
- To overcome this problem, Elasticsearch has a feature named Point In Time (PIT) - 🔗 doc
- 💡 Basically, it generates a token that represents a frozen view of the index at a given moment; you then pass this token along with each pagination request.
-
🖱️ Code example
- 🦂 Note the difference when paginating with search_after: without a PIT we neither receive nor use the _shard_doc value (see the Dictionary paragraph and the following code for more details)
# ───────────────────────────────────────────── # Queries pagination # ───────────────────────────────────────────── # --- # Data creation # --- # Add "Sample eCommerce orders" data directly from kibana, # Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana # --- # Use the basic "from" / "size" duo # --- GET kibana_sample_data_ecommerce/_search?size=1 GET kibana_sample_data_ecommerce/_count { "query": { "match": { "manufacturer": "Elitelligence" } } } # > 1370 hits GET kibana_sample_data_ecommerce/_search { "_source": ["order_id"], "from": 0, "size": 10, "query": { "match": { "manufacturer": "Elitelligence" } }, "sort": [ { "order_id": { "order": "desc" } } ] } # > last `order_id` returned is '723049' GET kibana_sample_data_ecommerce/_search { "_source": ["order_id"], "from": 9, "size": 20, "query": { "match": { "manufacturer": "Elitelligence" } }, "sort": [ { "order_id": { "order": "desc" } } ] } # > first `order_id` returned is '723049' POST kibana_sample_data_ecommerce/_doc { "manufacturer": "Elitelligence", "order_id" : 723050 } GET kibana_sample_data_ecommerce/_search { "_source": ["order_id"], "from": 9, "size": 20, "query": { "match": { "manufacturer": "Elitelligence" } }, "sort": [ { "order_id": { "order": "desc" } } ] } # > first `order_id` returned is '723050' # Note: same query, different result ('723050'!= '723049'), # because meanwhile a new document was indexed. # This behavior highlight how this approach # redoes the search, without persistence, # each time we ask for a new page. # --- # Use PIT: Point In Time with "from" / "size" duo # --- GET kibana_sample_data_ecommerce/_search { "_source": [ "order_id", "type" ], "from": 0, "size": 10, "query": { "match": { "geoip.region_name": "Cairo Governorate" } }, "sort": [ { "order_id": { "order": "desc" } } ] } # > hits.total: 491 # > first document returned: "order_id" : 723213 POST kibana_sample_data_ecommerce/_pit?keep_alive=60m # > "id" : "85ez..." POST kibana_sample_data_ecommerce/_doc { "order_id": 730000, "geoip":{ "region_name": "Cairo Governorate" } } GET kibana_sample_data_ecommerce/_search { "_source": [ "order_id", "type" ], "from": 0, "size": 10, "query": { "match": { "geoip.region_name": "Cairo Governorate" } }, "sort": [ { "order_id": { "order": "desc" } } ] } # > hits.total: 492 # > first document returned: "order_id" : 730000 # Note: same problem as before, an indexing # is occurred between two queries POST _search { "_source": [ "order_id", "type" ], "from": 0, "size": 10, "query": { "match": { "geoip.region_name": "Cairo Governorate" } }, "pit": { "id": "85ezAwEca2liYW5hX3NhbXBsZV9kYXRhX2Vjb21tZXJjZRZHSGJGY19SNlJ6MkhyMTZTemNCWm5BABZ0MlBFd2hNQVRybUZ5SzZRcEU4WTF3AAAAAAAAAAWLFkVxTndfdC1VUXU2cEJTVVVkRDJzcmcAARZHSGJGY19SNlJ6MkhyMTZTemNCWm5BAAA" }, "sort": [ { "order_id": { "order": "desc" } } ] } # > hits.total: 491 # > first document returned: "order_id" : 723213 # Note: like before the indexing of the document. # With PIT we could search and paginate docs # without inconsistency. 
# Note: you need to replace the "pit.id" value # Note: with the use of PIT, we had received an # additional parameter under the "sort" array: # this prameter's field is named `_shard_doc` # --- # Use `search_after` for paginate # # Tip: recommended approach to paginate > 10.000 hits # --- GET kibana_sample_data_ecommerce/_search { "_source": [ "order_id", "type" ], "from": 0, "size": 10, "query": { "match": { "geoip.region_name": "Cairo Governorate" } }, "sort": [ { "order_id": { "order": "desc" } } ] } # > last document "order_id" : 722406 GET kibana_sample_data_ecommerce/_search { "_source": [ "order_id", "type" ], "from": -1, "size": 10, "query": { "match": { "geoip.region_name": "Cairo Governorate" } }, "sort": [ { "order_id": { "order": "desc" } } ], "search_after": [ 722406 ] } # > first document "order_id" : 722373 # Note: `from` value is -1 because is not relevant: # we are asking for 10 documents after the # 722406 `order_id` document GET kibana_sample_data_ecommerce/_search { "_source": [ "order_id", "type" ], "from": 10, "size": 11, "query": { "match": { "geoip.region_name": "Cairo Governorate" } }, "sort": [ { "order_id": { "order": "desc" } } ] } # > first document "order_id" : 722373 # Note: same result as before, because both # methods have the same goal: paginate # --- # Recommended way to paginate >10 000 hits: # queries with both `search_after` and `PIT` fields # --- POST kibana_sample_data_ecommerce/_pit?keep_alive=60m # > "id" : "85ezA..." GET _search { "_source": [ "order_id", "type" ], "from": -1, "size": 10, "query": { "match": { "geoip.region_name": "Cairo Governorate" } }, "sort": [ { "order_id": { "order": "desc" } } ], "pit": { "id": "85ezAwEca2liYW5hX3NhbXBsZV9kYXRhX2Vjb21tZXJjZRZHSGJGY19SNlJ6MkhyMTZTemNCWm5BABZ0MlBFd2hNQVRybUZ5SzZRcEU4WTF3AAAAAAAAAAqbFkVxTndfdC1VUXU2cEJTVVVkRDJzcmcAARZHSGJGY19SNlJ6MkhyMTZTemNCWm5BAAA=" } } # > last document "order_id" : 722406 # > last document "sort": ["722406", 4634] # Note: we don't specify the index in the GET API # Note: the 4634 is the `_shard_doc` value and need # to be used on the next request GET _search { "_source": [ "order_id", "type" ], "from": -1, "size": 10, "query": { "match": { "geoip.region_name": "Cairo Governorate" } }, "sort": [ { "order_id": { "order": "desc" } } ], "pit": { "id": "85ezAwEca2liYW5hX3NhbXBsZV9kYXRhX2Vjb21tZXJjZRZHSGJGY19SNlJ6MkhyMTZTemNCWm5BABZ0MlBFd2hNQVRybUZ5SzZRcEU4WTF3AAAAAAAAAAqbFkVxTndfdC1VUXU2cEJTVVVkRDJzcmcAARZHSGJGY19SNlJ6MkhyMTZTemNCWm5BAAA=" }, "search_after": [ "722406", 4634 ] } # > first document "order_id" : 722373 # Note: we need to use both the "order_id" # pagination value and the `_shard_doc` value # Note: you need to replace the pit.id # -------------------------------- # Other examples, based on Kibana # provided example indices # -------------------------------- GET kibana_sample_data_flights/_search { "_source": [ "FlightNum" ], "from": 0, "size": 3, "query": { "match_all": {} }, "sort": [ { "FlightNum": { "order": "asc" } } ] } # Second to last: 009NIGR # Last flight: 00CGX81 GET kibana_sample_data_flights/_search { "_source": [ "FlightNum" ], "from": 0, "size": 3, "query": { "match_all": {} }, "sort": [ { "FlightNum": { "order": "asc" } } ], "search_after": [ "009NIGR" ], "track_total_hits": false } # 1st: 00CGX81 # Note: due we have used the "second to last" of previous # query, now the 1st element is the same as the last query
🔹 Define and use index aliases
🔗 Official doc
-
“An index alias is a secondary name for one or more indices.” - doc
-
You could use aliases for multiple purposes, e.g. querying an index through a secondary name, grouping multiple indices under a single name, or exposing a filtered view of an index (all shown in the code example below)
-
🖱️ Code example
# ───────────────────────────────────────────── # Indexes alias # ───────────────────────────────────────────── # --- # Data creation # --- # Add "Sample eCommerce orders" data directly from kibana, # Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana # --- # Alias basics # --- GET kibana_sample_data_ecommerce/_search { "_source": [ "customer_full_name" ], "query": { "bool": { "filter": [ { "term": { "order_id": 584677 } } ] } } } # > hits.total.value = 1 # > "customer_full_name" : "Eddie Underwood" POST /_aliases { "actions": [ { "add": { "index": "kibana_sample_data_ecommerce", "alias": "my-index-001" } } ] } # > 200 GET my-index-001/_search { "_source": [ "customer_full_name" ], "query": { "bool": { "filter": [ { "term": { "order_id": 584677 } } ] } } } # > hits.total.value = 1 # > "customer_full_name" : "Eddie Underwood" # Note: we have used the alias as index name GET _cat/aliases?v # > my-index-001 | kibana_sample_data_ecommerce # --- # Alias and multiple indexes # --- GET kibana_sample_data_ecommerce/_search { "_source": [ "customer_full_name", "order_id", "type" ], "query": { "bool": { "filter": [ { "term": { "customer_full_name.keyword": "Eddie Underwood" } } ] } } } # > "customer_full_name" : "Eddie Underwood", # > "type" : "order", # > "order_id" : 584677 PUT customers-additional-info { "mappings": { "properties": { "customer_full_name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }, "customer_favorite_colour": { "type": "keyword" } } } } # > 200 PUT customers-additional-info/_doc/01 { "customer_full_name": "Eddie Underwood", "customer_favorite_colour": "red" } # > 200 POST /_aliases { "actions": [ { "add": { "indices": [ "kibana_sample_data_ecommerce", "customers-additional-info" ], "alias": "customers-info" } } ] } # > "acknowledged" : true GET customers-info/_search { "_source": [ "customer_full_name", "customer_gender", "customer_favorite_colour" ], "query": { "bool": { "filter": [ { "term": { "customer_full_name.keyword": "Eddie Underwood" } } ] } } } # > "customer_full_name" : "Eddie Underwood", # > "customer_favorite_colour" : "red" # > "customer_gender" : "MALE" # Note: the informations retrieved are from both the indices # --- # Alias as index filter # --- POST _aliases { "actions": [ { "add": { "index": "kibana_sample_data_ecommerce", "alias": "men-clothing", "filter": { "bool": { "filter": [ { "term": { "category.keyword": "Men's Clothing" } } ] } } } } ] } # > "acknowledged" : true GET men-clothing/_search?size=3 # > Only docs with "category" : [ "Men's Clothing" ]
🔹 Define and use a search template
🔗 Official doc
-
“A search template is a stored search you can run with different variables.” - doc
- Useful for many things:
- To avoid exposing the ES query syntax externally: the final API will be more user-friendly and straightforward to use (you only need to provide the id of the stored query and the runtime parameters)
- If an app makes the query, we can change the query structure without changing the app’s code
-
To parametrize the query use the Mustache variables
-
🖱️ Code example
# ───────────────────────────────────────────── # Indexes templates # ───────────────────────────────────────────── # --- # Data creation # --- # Add "Sample eCommerce orders" data directly from kibana, # Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana # --- # Create and use index template # --- PUT _scripts/ten-products-by-category-template { "script": { "lang": "mustache", "source": { "query": { "match": { "category": "{{product_category}}" } }, "size": 10 }, "params": { "product_category": "The category name to search" } } } # > "acknowledged" : true # Note: API to call is `_scripts` and we define # the query structure using moustache language # for the parameters placeholders POST _render/template { "id": "ten-products-by-category-template", "params": { "product_category": "category to search" } } # > "template_output" ... # Note: with _render we can see the final # query body produced from the template GET kibana_sample_data_ecommerce/_search/template { "id": "ten-products-by-category-template", "params": { "product_category": "Men's Clothing" } } # > "category" : [ "Men's Clothing", ... # Note: the query format is simpler than ES DSL query # Note: we have used a template against a # specific index, but the template # could be used with other indexes also PUT test-index-001/_doc/01 { "category": "my-test-category" } # > 200 GET test-index-001/_search/template { "id": "ten-products-by-category-template", "params": { "product_category": "test" } } # > "category" : "my-test-category" # Note: same template, different index # --- # Search templates with default values # --- GET kibana_sample_data_ecommerce/_search
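The last snippet above (“Search templates with default values”) is cut off in these notes; below is a minimal sketch, not from the original, of how a default value can be expressed with a Mustache inverted section (template id and parameter names are invented):
PUT _scripts/products-by-category-with-default
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "match": {
          "category": "{{product_category}}{{^product_category}}Men's Clothing{{/product_category}}"
        }
      },
      "size": 10
    }
  }
}
# > "acknowledged" : true
# Note: the inverted section {{^product_category}}...{{/product_category}}
#       is rendered only when the parameter is missing,
#       so it behaves as a default value

GET kibana_sample_data_ecommerce/_search/template
{
  "id": "products-by-category-with-default"
}
# > docs with "category" : [ "Men's Clothing", ... ]
#   even if no "product_category" param was passed

GET kibana_sample_data_ecommerce/_search/template
{
  "id": "products-by-category-with-default",
  "params": {
    "product_category": "Women's Shoes"
  }
}
# > the explicit parameter overrides the default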
-
🔷 Data Processing
-
Questions
🔹 Define a mapping that satisfies a given set of requirements
🔗 Official doc
-
Mapping is the process of defining how a document and the fields it contains are stored, analyzed and indexed
-
Mapping process
-
The mapping is specified using the mappings field of the create index API
-
⭐ Inside the mappings object we can use two families of fields: Metadata Fields and Mapping Parameters
-
Metadata fields - doc
- Fields associated with the document but not directly created by it,
e.g. _id is the unique internal (inside the ES world) document id
  - 🦂 Some Kibana suggestions are deprecated, like the "_all": {"enabled": true} metadata field, which is deprecated and unsupported
  - 💡 All the metadata fields start with the underscore “_” and they are “internal” fields, with the same meaning as private class attributes in Python
-
Mapping parameters - doc
- Fields that specify how to manage the document fields,
e.g. the properties field is used to declare the document fields with their related settings (like the type, the analyzer to use, etc.)
-
🖱️ Code example
# ───────────────────────────────────────────── # Basic Index mapping # ───────────────────────────────────────────── PUT my-index-01 { "mappings": { "_source": { "enabled": true }, "properties": { "name": { "type": "text", "analyzer": "simple" }, "country": { "type": "keyword", "store": false } } } } # > 200 # Note: `_source` is a metadata field, if enabled # the original document data is stored. # Inside `properties` we will define how and # which fields store
-
-
For the field type declaration we can use two approaches: Dynamic mapping and Explicit mapping
field type declaration: declares what a document field will contain (e.g. text, object, date, keyword)
-
Explicit mapping - doc
- Specify the fields information (name, type, analyzer etc.) at index creation time
- 💡 You can’t change the mapping or field type of an existing field.
(although there are some exceptions - info)
- 🦂 If you want to change the index mapping, you need to reindex the data
-
Dynamic mapping - doc
- Using the dynamic field, we can declare how new/not already declared fields are managed
-
Dynamic field allows the following values:
🔗 list copied from docs
- true - New fields are added to the mapping (default)
- runtime - New fields are added to the mapping as runtime fields.
These fields are not indexed and are loaded from _source at query time.
- false - New fields are ignored and only stored in _source
- strict - If new fields are detected, an exception is thrown
-
💡 We can also use the dynamic templates to declare how to define certain fields matching some naming criteria, see the dedicated chapter for more
-
- Elasticsearch automatically assigns a type to new fields found on a new document using these rules
- 💡 You can use both mapping approaches at the same time: explicitly declare the fields you already know and delegate the type management of new fields to Elasticsearch using the dynamic parameter
-
🖱️ Code example
# ───────────────────────────────────────────── # Explicit / Dynamic Mapping # ───────────────────────────────────────────── # --- # Explicit mapping # --- PUT /my-index-01 { "mappings": { "properties": { "age": { "type": "integer" }, "email": { "type": "keyword" }, "name": { "type": "text" } } } } # > "acknowledged" : true GET _cat/shards/my-index-01?v # > 2 entries # Note: by default, elasticsearch create an index # with 1 primary shard and 1 replica PUT my-index-01/_doc/01 { "name" : "tyler", "age" : 33, "email" : "tyler@hotmail.com", "employee-id" : 1 } # > 200 # Note: "employee-id" wasn't declared in the mapping, # this field is dynamically mapped GET my-index-01/_doc/01 # > "employee-id" : 1 PUT /my-index-01/_mapping { "properties": { "employee-id": { "type": "long", "index": false } } } # > 400; "conflicts with existing mapper" # Note: we cannot update a field already exist, # indifferently if it is dynamic or explicit PUT /my-index-01/_mapping { "properties": { "born-city": { "type": "text", "analyzer": "simple" } } } # > 200 # Note: this update work because no document # indexed have "born-city" field, neither # the mapping properties PUT /my-index-01/_doc/02 { "name": "magda", "age": 45, "email": "magda@hotmail.com", "employee-id": 2, "born-city": "New York" } # > 200 GET my-index-01/_search { "query": { "match": { "born-city": "new" } } } # > "name" : "magda", # Note: this search is possible because the # `born-city` name use the `simple` analyzer # --- # Dynamic mapping # --- DELETE my-index-02 PUT my-index-02 { "mappings": { "dynamic": "strict", "properties": { "user": { "properties": { "name": { "type": "text" }, "social_networks": { "dynamic": true, "properties": {} } } } } } } # > 200 # Note: we have provided the field "dynamic" : "strict", # so no new fields are allowed on this index PUT my-index-02/_doc/1 { "user": { "name": "tyler" } } # > 200 PUT my-index-02/_doc/2 { "user": { "name": "tyler" }, "provider": "AWS" } # > 400; "strict_dynamic_mapping_exception" PUT my-index-02/_doc/2 { "user": { "name": "tyler", "social_networks": { "fb": "tyler-official" } } } # > 200 # Note: here the new field is correctly indexed, # altought not declared at mapping time, # because of the socia_networks.dynamic: true PUT my-index-03 { "mappings": { "dynamic": "false", "properties": { "user": { "properties": { "name": { "type": "text" }, "social_networks": { "dynamic": true, "properties": {} } } } } } } # 200 # Note: we have provided the field "dynamic" : "false", # new filelds will be ignored PUT my-index-03/_doc/01 { "user": { "name": "giorgio", "social_networks": { "fb": "giorgione" } } } # > 200 PUT my-index-03/_doc/02 { "user": { "name": "maccio", "social_networks": { "fb": "The Real Maccio" }, "provider": "AWS" } } # > 200 GET my-index-03/_search { "query": { "match": { "user.name": "maccio" } } } # > 200; hits.total: 1 # Note: the document retrieved has "provider" : "AWS" GET my-index-03/_search { "query": { "match": { "provider": "AWS" } } } # > 0 hits found # Note: `provider` is stored in the `_source` field # but isn't indexed for search because we had # defined "dynamic": "false" at mapping time
-
-
-
Field data types
🔗 Official doc
-
Each field has a field data type, whether it was user-defined (explicit mapping) or inferred by Elasticsearch (dynamic mapping)
-
Interesting data types:
🔗 complete list
- keyword: used for structured content (e.g. ID, mail, tags).
- The keyword family also includes constant_keyword and wildcard, see Keyword type family
- alias: defines an alternate name for a field in the index
- object: A JSON object.
- join: special field that creates parent/child relation
- range: continuous range of values between an upper and lower bound
- aggregate_metric_double: Pre-aggregated metric values.
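A minimal sketch (index and field names invented for illustration) that tries a few of the types listed above:
PUT test-types-index
{
  "mappings": {
    "properties": {
      "duration_ms":  { "type": "long" },
      "latency":      { "type": "alias", "path": "duration_ms" },
      "valid_period": { "type": "date_range", "format": "yyyy-MM-dd" },
      "relation":     { "type": "join", "relations": { "question": "answer" } }
    }
  }
}
# > 200
# Note: `latency` is only an alternate name for `duration_ms`,
#       usable in queries, aggregations and sorting
# Note: `date_range` is one member of the range family
#       (integer_range, float_range, date_range, ip_range, ...)

PUT test-types-index/_doc/01
{
  "duration_ms": 350,
  "valid_period": { "gte": "2021-01-01", "lte": "2021-12-31" }
}
# > 200

GET test-types-index/_search
{
  "query": { "range": { "latency": { "gte": 100 } } }
}
# > 1 hit: the query went through the alias field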
-
-
Runtime Fields
🔗 Official doc
-
🖱️ Code example
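No snippet was present here in the notes; the following is a minimal sketch of a runtime field declared in the mapping (index and field names are invented):
PUT test-runtime-index
{
  "mappings": {
    "runtime": {
      "price_with_vat": {
        "type": "double",
        "script": {
          "source": "emit(doc['price'].value * 1.22)"
        }
      }
    },
    "properties": {
      "price": { "type": "double" }
    }
  }
}
# > 200
# Note: the runtime field is not indexed, its value is
#       computed at query time by the `emit` call

PUT test-runtime-index/_doc/01
{
  "price": 100.0
}
# > 200

GET test-runtime-index/_search
{
  "fields": [ "price_with_vat" ],
  "query": { "match_all": {} }
}
# > "price_with_vat" : [ 122.0 ]
# Note: runtime fields are returned via the `fields` option,
#       they are not part of `_source`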
-
🦂 We cannot define multi-fields on a field of type object or nested
PUT my-index-000004 { "mappings": { "properties": { "my-field": { "type": "object", "fields": { "raw": { "type": "keyword" } } } } } } # > 400; Failed to parse mapping [_doc]: Mapping definition for [my-field] has unsupported parameters PUT my-index-000004 { "mappings": { "properties": { "my-field": { "type": "text", "fields": { "raw": { "type": "keyword" } } } } } } # > 200
# ───────────────────────────────────────────── # Index mapping wrap up # ───────────────────────────────────────────── # --- # Try different mapping parameters & types # # All list here: # https://www.elastic.co/guide/en/elasticsearch/reference/7.13/mapping-params.html # --- PUT test-index-01 { "mappings": { "properties": { "name": { "type": "text", "copy_to": "full_name", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "surname": { "type": "text", "copy_to": "full_name", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "full_name": { "type": "keyword" } } } } # > 200 # Note: properties `name` and `surname` are # indexed both as `text` and `keyword` types PUT test-index-01/_doc/01 { "name":"bat", "surname": "man" } GET test-index-01/_doc/01 # > 200; `full_name` not present GET test-index-01/_search { "query": { "match": { "full_name": "man" } } } # > 200; document found: `full_name` is only queryable PUT test-index-02 { "mappings": { "properties": { "full_name": { "type": "text", "analyzer": "simple", "term_vector": "with_positions", "fields": { "keyword": { "type": "keyword" } } }, "hobbies": { "type": "nested", "properties": { "name": { "type": "keyword" }, "outdoor": { "type": "boolean" } } }, "personal_info":{ "type": "flattened" } } } } # > 200 # Note: we use some interesting data types and fields, # like `term_vector`, `nested` and `flattened` type PUT test-index-02/_doc/01 { "full_name": "Angry Bird", "hobbies": [ { "name": "fly", "outdoor": true }, { "name": "Mobile gaming", "outdoor": "IDK" } ] } # > 400; Failed to parse value [IDK] # Note: we had defined `outdoor` as # boolean field PUT test-index-02/_doc/01 { "full_name": "Angry Bird", "hobbies": [ { "name": "fly", "outdoor": true }, { "name": "Mobile gaming", "outdoor": false } ], "personal_info":{ "born_on":"20101105", "android_user": true, "labels":[ "green", "red" ] } } # > 200 GET test-index-02/_doc/01 # > 200 GET test-index-02/_doc/01/_termvectors # > 200 # Note: with `termvectors` enabled we could # explore how the analyzer had tokenized # the text # --- # Explicit & Dynamic mapping # --- PUT test-index-03 { "mappings": { "properties": { "product_id": { "type": "keyword" } } } } # > 200 # Note: explicit mapping PUT test-index-04 { "mappings": { "dynamic": "true" } } # > 200 # Note: index that could ingest new fields, # and automatically detect field type # Warning: `dynamic` field is not suggested # by Kibana Webapp PUT test-index-04/_doc/01 { "product_id": "kiqhfi2iu3hf" } # > 200 GET test-index-04 # > "product_id" : "type" : "text" | "type" : "keyword" PUT test-index-05 { "mappings": { "dynamic": "runtime" } } # > 200 PUT test-index-05/_doc/01 { "product_id": "kiqhfi2iu3hf" } # > 200 GET test-index-05 # > "product_id" : "type" : "keyword" # Note: product_id isn't indexed as `text` field # like before in "test-index-04" # --- # Dynamic templates # --- # We can also define how to index some fields # without explicitly specifying the field's name, # but instead use some matching conditions # # More on the dedicated chapter and the official doc: # https://www.elastic.co/guide/en/elasticsearch/reference/7.13/dynamic-templates.html PUT test-index-06 { "mappings": { "dynamic": "true", "dynamic_templates": [ { "customer_info_as_keywords": { "match_mapping_type": "string", "match": "customer_*", "mapping": { "type": "keyword" } } } ] } } # > 200 # Note: all fields start wit `customer_` prefix # will be indexed as `keyword` type. 
# Instead, others fields will be indexed as # bot `text` and `keyword` type because of # `dynamic`:`true` parameter # Warning: Kibana webapp doesn't suggest # "dynamic_templates" as available field PUT test-index-06/_doc/01 { "customer_name": "giorgio", "customer_gender": "male", "product_comment": "Everything about the physical device, i feel like is pretty well made." } # > 200 GET test-index-06 # > 200 # Note: fields indexed as expected # customer_gender -> keyword # customer_name -> keyword # product_comment -> text and keyword
-
🔹 Define and use a custom analyzer that satisfies a given set of requirements
-
Analyzer
-
Analyzers are components applied to text fields that provide different ways to analyze and search the text.
- Through the analyzers, ES can return all relevant results, rather than just exact matches.
-
The analyzer is composed of multiple components:
🔗 Original doc
- An analyzer may have zero or more character filters, which are applied in order.
- e.g. mapping character filter, that replaces a sequence of characters with another sequence following a provided map
- An analyzer must have exactly one tokenizer.
- e.g. character group tokenizer, that split the text in tokens whenever it encounters a character which is in a provided set
- An analyzer may have zero or more token filters, which are applied in order.
- e.g. stop token filter, used to remove the stop words from the text before the insertion of the inverted index - doc
- An analyzer may have zero or more character filters, which are applied in order.
-
🦂 Note that the analyzer modifies the text only to enhance the search process, your original document text will not be changed when retrieved and displayed.
- This could lead to some mismatches, e.g. as explained in the documentation, the highlight process can be invalidated if the analyzer changes the length of the original text
-
Since the analyzer could alter the text, it should be used for both the documents text and the query - link
- When an analyzer is used to parse new text that will be stored in an index, it is called an index analyzer, while an analyzer used to parse the query at search time is called a search analyzer
- In most cases, the same analyzer should be used at index and search time.
However, sometimes they could be different,
e.g. here is a good example of the use of different analyzers at indexing and query time: link
-
Create an analyzer at search time
- You can create a new analyzer at search time, but be aware: that analyzer will be used only on the search query text, not on the documents already indexed (see the sketch below)
- 💡 This makes sense because at indexing time the analyzers parse the document and build all the internal structures used then to fast search through the documents. With a “runtime” analyzer those internal structures cannot be created/updated “on the fly” only for one query.
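A minimal sketch of overriding the analyzer only at search time (index and field names are invented):
PUT test-search-analyzer
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "standard" }
    }
  }
}
# > 200

PUT test-search-analyzer/_doc/01
{
  "title": "The Quick-Brown Fox"
}
# > 200

GET test-search-analyzer/_search
{
  "query": {
    "match": {
      "title": {
        "query": "quick-brown",
        "analyzer": "whitespace"
      }
    }
  }
}
# > 0 hits
# Note: the `analyzer` parameter only changes how the *query text*
#       is analyzed ("quick-brown" stays a single token), while the
#       documents were indexed with the `standard` analyzer
#       ("quick", "brown"), so the terms no longer match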
-
-
🖱️ Code example
# ───────────────────────────────────────────── # Create a custom analyzer that # define all the three components: # 1. character filter [0+] # 2. tokenizer [1] # 3. token filters [0+] # # Legend: [X] = how many items of that # type we can define # ───────────────────────────────────────────── # --- # 1. Character filter # --- # Replace digits and emoji with text GET /_analyze { "tokenizer": "keyword", "char_filter": [ { "type": "mapping", "mappings": [ "0 => zero", "1 => one", "2 => two", "3 => three", "4 => FOUR", "5 => five", "6 => six", "7 => seven", "8 => eight", "9 => nine" ] } ], "text": "I have 2 bike and 4 laptop :)" } GET /_analyze { "tokenizer": "keyword", "char_filter": [ { "type": "mapping", "mappings": [ ":) => _happy_" ] } ], "text": "I have 2 bike and 4 laptop :)" } # --- # 2. Tokenizer # --- # By default, `lowercase` tokenizer remove digits from text # but we replace digits with text using the character filter # before the tokenization process POST _analyze { "tokenizer": "lowercase", "text": "I have 2 bike and 4 laptop :)" } # --- # 3. Token filters # --- # We will use filters to replace `two bike` with `bikes` GET /_analyze { "tokenizer" : "whitespace", "filter" : [ { "type": "common_grams", "common_words": ["two", "bike"] } ], "text" : "I have two bike and four laptop :)" } GET /_analyze { "tokenizer": "whitespace", "filter": [ { "type": "pattern_replace", "pattern": "(two_bike)", "replacement": "bikes" } ], "text": "I have two_bike and four laptop :)" } # --- # Create the index # --- DELETE my-index-0001 PUT my-index-0001 { "settings": { "analysis": { "analyzer": { "my_analyzer": { "char_filter": [ "my_mapping_numbers_to_text", "my_mapping_emoji" ], "tokenizer": "lowercase", "filter": [ "my_ngram_filter", "my_plural_bike_filter" ] } }, "char_filter": { "my_mapping_numbers_to_text": { "type": "mapping", "mappings": [ "0 => zero", "1 => one", "2 => two", "3 => three", "4 => four", "5 => five", "6 => six", "7 => seven", "8 => eight", "9 => nine" ] }, "my_mapping_emoji": { "type": "mapping", "mappings": [ ":) => _happy_" ] } }, "filter": { "my_ngram_filter": { "type": "common_grams", "common_words": [ "two", "bike" ] }, "my_plural_bike_filter": { "type": "pattern_replace", "pattern": "(two_bike)", "replacement": "bikes" } } } }, "mappings": { "properties": { "my_field": { "type": "text", "analyzer": "my_analyzer" } } } } PUT my-index-0001/_doc/1 { "my_field": "I have 2 bike and 4 laptop :)" } GET my-index-0001/_doc/1 # > Original text returned # [!] Remember: analyzer apply changes # only for search purposes and don't change # the original document text. To analyze the # terms used for search use `termvectors` GET my-index-0001/_termvectors/1?fields=my_field&field_statistics=false # > "two", "four", "_happy_", "bikes" tokens found # # [!] Tokenizer 'lowercase' remove the numbers from the text, # why we have "four" and "two" on the termvectors? Because # before the tokenizer run we map all digits to words # using the characters filter 'my_mapping_numbers_to_text' # # [!] "bikes" token is present because the token filters # are applied in order, and the "my_ngram_filter" build # the token "two_bike" before "my_plural_bike_filter" # apply the conversion from "two_bike" to "bikes". # And in fact, the token "two_bike" doesn't reported # on the termvectors list
🔹 Define and use multi-fields with different data types and/or analyzers
🔗 Official doc
-
A way to “index the same field in different ways for different purposes” - doc
- Different ways include using the different analyzer and different field’s type
-
🦂 In the documentation, the multi-fields specs can be found under the following path of the official index page:
Mapping → Mapping parameters → fields -
🖱️ Code example
# ───────────────────────────────────────────── # Multi-fields examples # ───────────────────────────────────────────── # --- # Basic usage # --- DELETE test-index-01 PUT test-index-01 { "mappings": { "properties": { "movie_title": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }, "commentary": { "type": "text" } } } } # > 200 PUT test-index-01/_doc/01 { "movie_title": "american history x", "commentary": "American History X is a 1998 American crime drama film directed by Tony Kaye and written by David McKenna." } # > 200 GET test-index-01/_search { "query": { "bool": { "filter": [ { "term": { "movie_title.keyword": "american" } } ] } } } # > 0 hit GET test-index-01/_search { "query": { "bool": { "filter": [ { "term": { "movie_title": "american" } } ] } } } # > 1 hit # --- # Define multiple types # --- PUT test-index-02 { "mappings": { "properties": { "movie_title": { "type": "text", "analyzer": "english", "fields": { "sayt": { "type": "search_as_you_type" }, "keyword": { "type": "keyword" } } } } } } # > 200 # Note: "sayt" as acronym of "Search As You Type" # Note: we had defined two "different ways" to index # the same document field PUT test-index-02/_doc/01?refresh { "movie_title": "The Lord of the Rings: The Return of the King" } # > 200 GET test-index-02/_search { "query": { "prefix": { "movie_title": { "value": "of the" } } } } # > 0 hits GET test-index-02/_search { "query": { "prefix": { "movie_title.sayt": { "value": "of the" } } } } # > 1 hit # Note: same field but with `text` and `english` anlyzers # we cannot use the stopwords # --- # Define multiple analyzers # --- DELETE test-index-03 PUT test-index-03 { "settings": { "analysis": { "analyzer": { "agnostic_analyzer": { "type": "custom", "tokenizer": "standard", "filter": [ "agnostic_filter" ] } }, "filter": { "agnostic_filter": { "type": "pattern_replace", "pattern": "(christianity)", "replacement": "<religion>" } } } }, "mappings": { "properties": { "user_id": { "type": "keyword" }, "user_opinion": { "type": "text", "analyzer": "english", "term_vector": "with_positions_offsets_payloads", "store": true, "fields": { "agnostic": { "type": "text", "analyzer": "agnostic_analyzer", "search_analyzer": "agnostic_analyzer", "term_vector": "with_positions_offsets_payloads", "store": true } } } } } } # > 200 PUT test-index-03/_doc/01 { "user_id": "A001", "user_opinion": "I have a long family tradition around christianity and their celebrations" } # > 200 GET test-index-03/_search { "query": { "match": { "user_opinion": "christianity tradition" } } } # > 0.575 GET test-index-03/_search { "query": { "match": { "user_opinion": "buddhist tradition" } } } # > 0.28 score GET test-index-03/_termvectors/01 # > "<religion>" is present with "term_freq" : 1 GET test-index-03/_search { "query": { "match": { "user_opinion.agnostic": "<religion> tradition" } } } # > 0.28 score # Note: should be higher than 0.28, # TODO follow the ticket: # https://discuss.elastic.co/t/custom-analyzer-with-token-replacement/289236 GET test-index-03/_search { "query": { "match": { "user_opinion.agnostic": "tradition" } } } # > 0.28 score, like with <religion> tag -
🔹 Use the Reindex API and Update By Query API to reindex and/or update documents
-
🦂 In the documentation, these specs can be found under the following path of the official index page:
REST APIs → Document APIs → [Update by query | Reindex] -
Reindex
🔗 Reindex API official doc
-
“Copies documents from a source to a destination.” - doc
-
Basically, it uses a source index (which must have _source enabled) as the origin of the documents and indexes the data into a destination index
-
The reindexing process is useful in many scenarios, thanks also to properties like:
- Reindex from multiple sources
- Reindex only data that match a specific query
- Reindex with a max cap of documents
- … more examples in the code block
-
You could also reindex data from a remote cluster - doc
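A minimal sketch of a reindex from a remote cluster (host and credentials are placeholders):
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://remote-cluster:9200",
      "username": "elastic",
      "password": "changeme"
    },
    "index": "my-remote-index",
    "query": { "match_all": {} }
  },
  "dest": {
    "index": "my-local-copy"
  }
}
# Note: the remote host must also be listed under
#       `reindex.remote.whitelist` in elasticsearch.yml,
#       otherwise the request is rejected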
-
🖱️ Code example
# ───────────────────────────────────────────── # Reindex API # ───────────────────────────────────────────── # Add "Sample eCommerce orders" sample # data directly from Kibana: # https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana GET _cat/indices/kibana*?v # > "kibana_sample_data_ecommerce" # --- # Change index settings and reindex # --- GET kibana_sample_data_ecommerce # > "number_of_shards" : "1" PUT test-index-01 { "settings": { "number_of_shards": 3 }, "mappings": { "properties": { "category": { "type": "text", "fields": { "keyword": { "type": "keyword" } } } } } } # > 200 # Note: original index mapping properties skipped # for space and readability but in real world # scenario all the mapping properties should # be reported POST _reindex { "source": { "index": "kibana_sample_data_ecommerce" }, "dest": { "index": "test-index-01" } } # > 200; "total" : 4675, GET _cat/shards?v # > kibana_sample_data_ecommerce | one primary shard # > test-index-01 | three rows for primary shards # --- # Alias + reindex for transparent # index structure changes # --- PUT test-index-02 { "aliases": { "movies-info": {} }, "mappings": { "properties": { "title": { "type": "text" } } } } # > 200 PUT movies-info/_doc/01 { "title": "Star Wars" } # > 200 GET movies-info/_search { "query": { "match": { "title": "Star" } } } # > "title" : "Star Wars" # [!] Now we want the `search_as_you_type` # field type under the `title` field. # One of the ways to get this functionality # on the already indexed documents also # is with the reindex process. PUT test-index-03 { "mappings": { "properties": { "title": { "type": "text", "fields": { "sayt": { "type": "search_as_you_type" } } } } } } # > 200 # Note: create the index with the new requirements PUT test-index-02/_settings { "settings": { "index.blocks.write": true } } # > 200 # Note: block the insertion of new documents # on the "source" index PUT movies-info/_doc/02 { "title": "Fight Club" } # > 403; index [test-index-02] blocked # Note: alias movies-info refer to test-index-02 POST _reindex { "source": { "index": "test-index-02" }, "dest": { "index": "test-index-03" } } # > 200 POST /_aliases { "actions": [ { "add": { "index": "test-index-03", "alias": "movies-info" } } ] } # > 200 # Note: now behind `movies-info` we have two indexes GET movies-info/_search { "query": { "match": { "title": "Star" } } } # > "title" : "Star Wars" # > "title" : "Star Wars" # Note: One hit from `test-iondex-02`, one # from `test-index-03` DELETE test-index-02 # > 200 # Note: remove old data GET movies-info/_search { "query": { "multi_match": { "query": "star", "type": "bool_prefix", "fields": [ "title.sayt", "title.sayt._2gram", "title.sayt._3gram" ] } } } # > "title" : "Star Wars" # Note: query possible only with the # mapping of `test-index-03`
-
-
Update by query
🔗 Update by query official doc
-
“Updates documents that match the specified query” - doc
-
Useful to apply some changes sequentially to a big number of documents that satisfy a query
-
🦂 During the update-by-query run the documents could change (and accordingly the _version field too); we can use the conflicts field to specify how to handle this event - doc -
🖱️ Code example
# ───────────────────────────────────────────── # Update by query API # ───────────────────────────────────────────── # Add "Sample eCommerce orders" sample # data directly from Kibana: # https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana GET _cat/indices/kibana*?v # > "kibana_sample_data_ecommerce" # --- # Basic usage # --- PUT test-index-01/_doc/01 { "movie_name": "once upon a time in hollywood", "director": "Quentin Tarantino" } # > 200 GET test-index-01/_doc/01 # > _version: 1 POST test-index-01/_update_by_query # > 200 GET test-index-01/_doc/01 # > _version: 2 # Note: the `_update_by_query` take the document in # `_source` and use it to re-index the data # on the index. This process increases the _version # --- # Use update by query to change # the documents fields contents # --- PUT test-index-01/_doc/02 { "movie_name": "Paz!", "director": "Renato De Maria" } # > 200 GET test-index-01/_search { "query": { "match": { "movie_name": "once upon" } } } # > 1 hit; "movie_name" : "once upon a time in hollywood", POST test-index-01/_update_by_query { "conflicts": "proceed", "query": { "match": { "director": "quentin" } }, "script": { "source": "ctx._source.movie_name='obfuscated'", "lang": "painless" } } # > 200 # Note: we change only the movie_name # of docs with "director": "quentin" GET test-index-01/_doc/01 # > "movie_name" : "obfuscated" GET test-index-01/_doc/02 # > ovie_name" : "Paz!" GET test-index-01/_search { "query": { "match": { "movie_name": "once upon" } } } # > no hits # Note: the change has involved also # the structure used to search # --- # Special attributes # --- GET kibana_sample_data_flights/_search?version=true { "query": { "wildcard": { "Dest": "Sydney Kingsford *" } } } # > _version: 1 POST kibana_sample_data_flights/_update_by_query?conflicts=proceed { "query": { "wildcard": { "Dest": "Sydney Kingsford *" } }, "script": { "source": """ long version = ctx['_version']; ctx["_source"]["dangerous"] = true; ctx["_version"] = version; """, "lang": "painless" } } GET kibana_sample_data_flights/_search?version=true { "query": { "wildcard": { "Dest": "Sydney Kingsford *" } } } # > "dangerous" : true, # > _version: 2 # Note: version is read only and cannot be managed
-
🔹 Define and use an ingest pipeline that satisfies a given set of requirements, including the use of Painless to modify documents
🔗 Official doc
-
Ingest pipeline
-
“perform common transformations on your data before indexing” - doc
-
With an ingest pipeline we can parse the input documents and change their structure and content (unlike the analyzer component, which parses and transforms the document only for internal search purposes).
-
An ingest pipeline is composed of one or more processors, the “working units” that apply specific changes to the document
- The processors list is here, but it can also be retrieved by API (see the sketch after this list) or extended with plugins
- Each processor is configurable, for the max flexibility there is the Script processor that runs a stored script
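A minimal sketch of retrieving the available processors via the nodes info API (the exact output shape depends on your cluster):
GET _nodes/ingest?filter_path=nodes.*.ingest.processors
# > lists, for each node, the processor types it supports
#   e.g. "append", "csv", "script", "set", ...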
-
💡 A good approach could be to create the ingest pipeline from Kibana GUI (Stack Management → Ingest Node Pipelines) and then use the Show request button to get the equivalent Kibana code
-
🖱️ Code example
- You can also create an ingest pipeline from Kibana GUI
- At least one node should have the ingest role
# ───────────────────────────────────────────── # Ingest pipeline # ───────────────────────────────────────────── # --- # Cluster # --- # The cluster must have at least one `ingest` role, # or an "illegal_state_exception" exception will be returned # --- # Pipeline basics # --- PUT _ingest/pipeline/test-pipeline-01 { "description": "Basic pipeline example: test 'split' and 'rename' processors", "processors": [ { "split": { "field": "folder_path", "separator": "/" }, "rename": { "field": "folder_path", "target_field": "folder_path_parsed" }, "set": { "field": "parsed", "value": true } } ], "version": 1 } # > 200 POST _ingest/pipeline/test-pipeline-01/_simulate { "docs": [ { "_source": { "folder_path": "/foo/bar/folder/file.txt" } }, { "_source": { "folder_path": "file.txt" } } ] } # > "folder_path_parsed" : "", "foo", "bar", "folder", "file.txt" # > "parsed" : true # Note: use the `_simulate` endpoint to be sure # the pipeline perform as desired PUT test-index-01/ { "settings": { "default_pipeline": "test-pipeline-01" } } # > 200 # Note: the `default_pipeline` field is not suggested by Kibana PUT test-index-01/_doc/01 { "folder_path": "/foo/bar/folder/file.txt" } # > 200 GET test-index-01/_doc/01 # > "folder_path_parsed" : "", "foo", "bar", "folder", "file.txt" # --- # Modify already indexed documents # --- PUT test-index-02/_doc/01 { "user_id": 123, "nikname": "Dr1ppy" } # > 200 PUT _ingest/pipeline/test-pipeline-02 { "description": "Enrich forum data pipeline", "processors": [ { "set": { "field": "forum", "value": "warrock" } } ], "version": 1 } # > 200 POST test-index-02/_update_by_query?pipeline=test-pipeline-02 # > 200 GET test-index-02/_doc/01 # > "forum" : "warrock" # Note: we have updated one filed of **all** the documents # inside the index. What about to update only "some" documents? POST _bulk {"index":{"_index":"test-index-02","_id":"02"}} {"user_id":234,"nikname":"BadKarma"} {"index":{"_index":"test-index-02","_id":"03"}} {"user_id":234,"nikname":"DankGamer","forum":"steam"} # > 200 # Note: the last entry have "forum":"steam", # how we can set "forum" only for documents that # doesn't have it? # [Solution 1]: use the `query` inside update_by_query POST test-index-02/_update_by_query?pipeline=test-pipeline-02 { "query": { "bool": { "must_not": [ { "exists": { "field": "forum" } } ] } } } # > 200 # Note: To find documents that are missing an indexed value for a field see # https://www.elastic.co/guide/en/elasticsearch/reference/7.13/query-dsl-exists-query.html#find-docs-null-values GET test-index-02/_doc/02 # > "forum" : "warrock" GET test-index-02/_doc/03 # > "forum" : "steam" # Note: the document that already had the # "forum" field was not modified, as desired
-
-
Painless language
-
“With scripting, you can evaluate custom expressions in Elasticsearch” - doc
- There are some scripts languages: painless, expression, mustache, java, see the doc to understand which use
-
“Painless is a performant, secure scripting language designed specifically for Elasticsearch” - doc
-
We can use painless - and scripts in other languages - for a wide range of reasons:
- “You can write a script to do almost anything, and sometimes, that’s the trouble” - doc
-
💡 Store the script when possible: the script compiler process is heavy. For the same reason don’t hardcode parameters inside the script.
-
🦂 Scripts are incredibly useful, but can’t use Elasticsearch’s index structures or related optimizations - doc
-
💡 Painless takeaways
- Access to document fields using
doc['field_name']
- To use the field content we need to specify what we want:
doc['goals'].length
← count list lenghtdoc['name.keyword'].value
← access to the keyword content
- To use the field content we need to specify what we want:
- Define and declare variables, e.g.
int total=0
- Access to document
_source
usingctx._source
- Use
params.<parameter_name>
to parametrize a script - Painless debug is based on use
Debug.explain
utility that throws an exception and print useful information like the type of an object - Use
emit
to return calculated values insideruntime_mapping
- doc - 🦂 When use
doc
andctx._source
?
“Depending on where a script is used” - stack overflow
- Access to document fields using
-
🖱️ Code example
-
Official guides and docs
-
Use a Painless script in an update by query operation to add, modify, or delete fields within each of a set of documents collected as the result of query - doc
-
🦂 For integers it looks like that
.value
is not required:
total += doc['grades'][i]
and not~~total += doc['grades'][i].value~~
# ───────────────────────────────────────────── # Painless language # ───────────────────────────────────────────── # --- # Add some data # --- PUT test-index-01/_doc/01 { "name": "John", "grades": [ 9.4, 8.0, 3.0 ] } # > 200 PUT test-index-01/_doc/02 { "name": "Bob", "grades": [ 10.0, 7.0, 8.5, 9.0 ] } # > 200 PUT test-index-01/_doc/03 { "name": "Zen", "grades": [ 4.4, 5.0 ] } # > 200 # --- # Use Painless for search # --- GET test-index-01/_search { "query": { "bool": { "must": [ { "script": { "script": "doc['grades'].length > 3" } } ] } } } # > _id: 02 # Note: with painless we could return only # documents with more than 3 grades. # This approach is inefficient, let's # use script parameters GET test-index-01/_search { "query": { "bool": { "must": [ { "script": { "script": { "source": "doc[params.field_name].length > params.min_cardinality", "lang": "painless", "params": { "field_name": "grades", "min_cardinality": 3 } } } } ] } } } # > _id: 02 # Note: same results as before, but with params # Note: the structure is "must -> script -> script" PUT _scripts/list-cardinality-filter { "script": { "lang": "painless", "source": """ doc[params.field_name].length > params.min_cardinality """ } } # > 200 # Note: the params are automatically # inferred by the script content GET test-index-01/_search { "query": { "bool": { "must": [ { "script": { "script": { "id": "list-cardinality-filter", "params": { "field_name": "grades", "min_cardinality": 3 } } } } ] } } } # > _id: 02 # Note: we can also use stored scripts, same results as # last two queries # Note: pay attention to the query nested structure # --- # Update document using painless # --- GET test-index-01/_search { "query": { "bool": { "filter": [ { "script": { "script": { "lang": "painless", "source": """ int total = 0; for (int i = 0; i < doc['grades'].length; i++) { total += doc['grades'][i]; } float avg = total / doc['grades'].length; avg > 7.0 """ } } } ] } } } # > _id : 02 # Note: only documents with an average # grade of 7.0 are returned POST test-index-01/_update_by_query { "script": { "source": """ int total = 0; for (int i = 0; i < ctx._source[params.field_name].length; i++) { total += ctx._source.grades[i]; } float avg = total / ctx._source.grades.length; if (avg > params.threshold){ ctx._source.elegible = true; } else { ctx._source.elegible = false; } """, "params": { "field_name": "grades", "threshold": 7 }, "lang": "painless" } } # > 200 # Note: we are updating the documents _source: # set elegible=true if the grades AVG is > 7.0 # Note: use params and ctx in the form `_source[params.field_name]` # Note: we have switched from `doc` to `ctx._source` because # we are in `update_by_query` API GET test-index-01/_search?size=10 # > _id : 01 -> elegible : false # > _id : 02 -> elegible : true # > _id : 03 -> elegible : false
-
-
-
⭐ Ingest Pipeline & Painless
# ───────────────────────────────────────────── # Ingest pipeline & Painless # ───────────────────────────────────────────── # --- # Basic ingest pipeline # --- PUT _ingest/pipeline/test-ingest-01 { "description": "Lowercase the csv row and extract the fields", "version": 1, "processors": [ { "lowercase": { "field": "csv_data" } }, { "csv": { "field": "csv_data", "target_fields": [ "nickname", "city", "degree", "role" ], "separator": ";", "trim": true, "empty_value": "None", "tag": "my_csv_processor" } } ] } # > 200 # Note: created from Kibana GUI and then # pasted using "Show request" PUT test-index-01/_doc/01?pipeline=test-ingest-01 { "csv_data": "pistocop; Bologna; CS; Data Engineer;" } # > 200 GET test-index-01/_doc/01 # > "role" : "data engineer" # > "city" : "bologna", # > "nickname" : "pistocop", # > "degree" : "cs" # --- # Insert Painless script # --- PUT _ingest/pipeline/test-ingest-02 { "description": """Lowercase the csv row, extract the fields and create "cities_short" field""", "version": 1, "processors": [ { "lowercase": { "field": "csv_data" } }, { "csv": { "field": "csv_data", "target_fields": [ "nickname", "city", "degree", "role" ], "separator": ";", "trim": true, "empty_value": "None", "tag": "my_csv_processor" } }, { "script": { "lang": "painless", "source": """ Map cities = new HashMap(); cities.put('bologna','bo'); cities.put('roma','rm'); cities.put('milano','mi'); String city_shorted = cities.get(ctx[params.city_field]); ctx[params.city_shorted_field] = city_shorted; """, "params": { "city_field": "city", "city_shorted_field": "city_shorted" } } } ] } # > 200 # Note: use the `script` processor # to run Painless code # Note: Painless is Java-like, do not forget to # declare variables type # Note: we are using on painless a field # that is just created from the pipeline: # "city", so is important that the script # processor is executed after the csv processor # Warning: we don't use `ctx._source` PUT test-index-02/_doc/01?pipeline=test-ingest-02 { "csv_data": "pistocop; Bologna; CS; Data Engineer;" } # > 200 GET test-index-02/_doc/01 # > ... same as before # > "city_shorted" : "bo" PUT test-index-02/_doc/02?pipeline=test-ingest-02 { "csv_data": "magneto; MiLaNo; History; Teacher;" } GET test-index-02/_doc/02 # > ... # > "city_shorted" : "mi", # --- # Create a dispacher # using pipeline + stored script # --- PUT _scripts/my-script { "script":{ "lang": "painless", "source": """ String checkString = ctx[params['fieldToCheck']]; if (checkString == params['checkValue']){ ctx["_index"] = params['destinationIndex']; } """ } } PUT _ingest/pipeline/my-dispacher-pipeline { "processors": [ { "script": { "id": "my-script", "params": { "fieldToCheck": "dispacher-type", "checkValue": "storic", "destinationIndex": "storic-index" } } } ] } # Note: params setted at pipeline level POST _ingest/pipeline/my-dispacher-pipeline/_simulate { "docs": [ { "_source": { "my-keyword-field": "FOO", "dispacher-type": "storic" } }, { "_source": { "my-keyword-field": "BAR" } } ] } # > "_index" : "storic-index" PUT storic-index PUT my-index-01 { "settings": { "number_of_shards": 1, "default_pipeline": "my-dispacher-pipeline" } } PUT my-index-01/_doc/01 { "my-keyword-field": "FOO", "dispacher-type": "storic" } PUT my-index-01/_doc/02 { "my-keyword-field": "FOO", "dispacher-type": "non-storic" } GET storic-index/_search # > _id" : "01" GET my-index-01/_search # > "_id" : "02",
🔹 Configure an index so that it properly maintains the relationships of nested arrays of objects
🔗 Official doc
-
“allows arrays of objects to be indexed in a way that they can be queried independently of each other.” - doc
-
In the ES and NoSQL world there are no true relationships between data: each document is independent and operations like SQL joins are not expected.
Nevertheless, real-world data has relations, and in ES this aspect can be modeled using specific (nested) fields, 💡 but be aware: we are “merely” storing all the related data together.
🦂 Nested vs Object vs Arrays:
- An array like ["text1", "text2"] should be indexed as “string” and not as “object”
  - In other words, declare the field as usual (e.g. text) and then index an array of values (see the sketch below).
    It is important that all the elements have the same type.
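A minimal sketch of the plain-array case above (index name invented):
PUT test-plain-array
{
  "mappings": {
    "properties": {
      "tags": { "type": "keyword" }
    }
  }
}
# > 200
# Note: there is no dedicated "array" type in ES

PUT test-plain-array/_doc/01
{
  "tags": [ "text1", "text2" ]
}
# > 200

GET test-plain-array/_search
{
  "query": { "term": { "tags": "text2" } }
}
# > 1 hit: any element of the array can match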
-
To catch the relationship between the information we could create/use three different fields:
-
Object arrays - doc
-
The default type when object sub-fields are dynamically detected; it stores the data by grouping each key with the list of its associated values
# e.g.
# > Document:
{
  "my_subfield": [
    { "key1": "val11", "key2": "val12" },
    { "key1": "val21", "key2": "val22" },
    { "key3": "val3" }
  ]
}
# Will be mapped as:
# key1 : [val11, val21]
# key2 : [val12, val22]
# key3 : [val3]
-
-
Nested arrays - doc
-
Treats each entry of the array as independent (stored as a hidden document under the hood)
# e.g.
# > Document:
{
  "my_subfield": [
    { "key1": "val11", "key2": "val12" },
    { "key1": "val21", "key2": "val22" },
    { "key3": "val3" }
  ]
}
# Will be mapped as:
# > Hidden doc 1: { "key1": "val11", "key2": "val12" }
# > Hidden doc 2: { "key1": "val21", "key2": "val22" }
# > Hidden doc 3: { "key3": "val3" }
-
-
Flattened - doc
-
Stores all the values of the object's keys in a single field of type keyword
# e.g.
# > Document:
{
  "my_subfield": [
    { "key1": "val11", "key2": "val12" },
    { "key1": "val21", "key2": "val22" },
    { "key3": "val3" }
  ]
}
# Will be mapped as:
# ["val11", "val12", "val21", "val22", "val3"]
-
🖱️ Code example
# --- # First test with flattened # --- DELETE test04 PUT test04 { "mappings": { "properties": { "f-flat":{ "type": "flattened" } } } } PUT test04/_doc/02 { "f-flat": [ { "field1": { "sub1": "sky", "sub2": "earth" } }, { "field1": { "sub1": "sky", "sub2": "earth" } } ] } GET test04 GET test04/_search { "query": { "match": { "f-flat.field1": "sky" } } } # > 0 hits GET test04/_search { "query": { "match": { "f-flat": "sky" } } } # > 1 hit
-
-
-
🖱️ Code example
-
🦂 In order to search using a nested field, the search body must include a nested clause with the path parameter
# Example
GET my-index-01/_search
{
  "query": {
    "nested": {
      "path": "my_nested_collection",
      "query": {...}
    }
  }
}
-
🦂 To get highlights from nested sub-fields, use the inner_hits field at the same level as the nested parameters (path, query)
# ───────────────────────────────────────────── # Mapping relationships: objects, nested, flattened # ───────────────────────────────────────────── # --- # Basic object mapping # --- PUT test-index-01/_doc/01 { "user_id": 1, "user_stats": { "last_access": "20211117T101500", "device": "smartphone", "ip_country": "italy" }, "user_friends": [ { "name": "markus", "nationality": "canadian" }, { "name": "alice", "nationality": "belgian" }, { "name": "stephen", "deleted": true } ] } # > 200 # Note: we are using Object types in both # "user_stats" and "user_friends" (type dynamically inferred) # They will be flattened like # "user_stats.device = smartphone" # "user_friends.name = ["markus", "alice", "stephen"] GET test-index-01/_search { "query": { "bool": { "filter": [ { "term": { "user_friends.name.keyword": "markus" } }, { "term": { "user_friends.nationality.keyword": "belgian" } } ] } } } # > "user_id" : 1 # Warning: if the desired query was "return users with # at least one belgian friend named markus" # this result is wrong. This is because we haven't # used nested field. # We will resolve this issue in the next block. # --- # Basic nested mapping # --- PUT test-index-02/ { "mappings": { "properties": { "user_friends": { "type": "nested", "properties": { "name": { "type": "keyword" }, "nationality": { "type": "keyword" }, "deleted": { "type": "boolean" } } } } } } # > 200 # Note: "nested" type used, each entries on the # field will be treated as individual. # Note: not all properties was mapped, # the others will be inferred dynamically PUT test-index-02/_doc/01 { "user_id": 1, "user_stats": { "last_access": "20211117T101500", "device": "smartphone", "ip_country": "italy" }, "user_friends": [ { "name": "markus", "nationality": "canadian" }, { "name": "alice", "nationality": "belgian" }, { "name": "stephen", "deleted": true } ] } # > 200 # Note: same document as 1st block, # but different index GET test-index-02/_search { "query": { "nested": { "path": "user_friends", "query": { "bool": { "filter": [ { "term": { "user_friends.name": "markus" } }, { "term": { "user_friends.nationality": "belgian" } } ] } } } } } # > 0 hits # Note: same query as before, but now we don't # get any results because each entry in # `user_friends` is managed as independent # document and there isn't friends with # name `markus` and nationality `belgian` # Note: In order to search using a nested field, # the search body must include nested with # path parameters GET test-index-02/_search { "query": { "nested": { "path": "user_friends", "query": { "bool": { "filter": [ { "term": { "user_friends.name": "markus" } }, { "term": { "user_friends.nationality": "canadian" } } ] } } } } } # > "user_id" : 1 # Note: correct match, markus is canadian and # belong to user_id 1 friends list # --- # Flattened field # --- PUT test-index-03/ { "mappings": { "properties": { "user_friends": { "type": "flattened" } } } } # > 200 # Note: with `flattened` the entire object # is mapped as a single field. 
PUT test-index-03/_doc/01 { "user_id": 1, "user_stats": { "last_access": "20211117T101500", "device": "smartphone", "ip_country": "italy" }, "user_friends": [ { "name": "markus", "nationality": "canadian" }, { "name": "alice", "nationality": "belgian" }, { "name": "stephen", "deleted": true } ] } # > 200 # Note: same document as 1st block, # but different index GET test-index-03/_search { "query": { "bool": { "filter": [ { "term": { "user_friends": "markus" } }, { "term": { "user_friends": "belgian" } } ] } } } # > "_id" : "01" # Note: we had to change the query structure # brecause now ".<subfield>.keyword" is not # longer supported: all the keys values are # stored as keyword. # --- # Object vs Flattened # # What is the difference? # -> in object we aggregate subfield # values based on keys # -> in flattened we only store keys # as keywords family # --- GET test-index-01 # > user_friends - type not specified: is Object GET test-index-03 # > "user_friends" : "type" : "flattened" PUT test-index-01/_doc/02 { "user_friends": [ { "name": "mike" }, { "name": "robert" } ] } # > 200 PUT test-index-03/_doc/02 { "user_friends": [ { "name": "mike" }, { "name": "robert" } ] } # > 200 GET test-index-01/_search { "query": { "bool": { "minimum_should_match": 2, "should": [ { "match_phrase": { "user_friends.name.keyword": "mike" } }, { "match_phrase": { "user_friends.name.keyword": "robert" } } ] } }, "highlight": { "fields": { "user_friends.name.keyword": {} } } } # > "_id" : "02" + highlight # Note: with `minimum_should_match` we are sure # that both the queries has a match. # A better way is to put the queries in "and", # this format is only for study purposes. GET test-index-03/_search { "query": { "bool": { "minimum_should_match": 2, "should": [ { "match_phrase": { "user_friends": "mike" } }, { "match_phrase": { "user_friends": "robert" } } ] } }, "highlight": { "fields": { "user_friends": {} } } } # > "_id" : "02" - but without highlight # Note: with `flattened` we cannot get highlights # --- # Nested highlight # --- GET test-index-02 # > "type" : "nested" PUT test-index-02/_doc/02 { "user_friends": [ { "name": "mike", "age": 22 }, { "name": "robert", "age": 30 } ] } # > 200 GET test-index-02/_search { "query": { "nested": { "path": "user_friends", "query": { "bool": { "minimum_should_match": 2, "should": [ { "match_phrase": { "user_friends.name": "mike" } }, { "match_phrase": { "user_friends.name": "robert" } } ] } } } } } # > 0 hits # Note: same query as last block, but here # no hits are returned because each # subfield is managed individually GET test-index-02/_search { "query": { "nested": { "path": "user_friends", "query": { "bool": { "minimum_should_match": 2, "should": [ { "match_phrase": { "user_friends.name": "mike" } }, { "match_phrase": { "user_friends.age": 22 } } ] } }, "inner_hits": { "highlight": { "fields": { "user_friends.name": {} } } } } } } # > "_id" : "02" + highlights # Note: for highlighting we require to use # a special field named `inner_hits`, # placed at the **same level as `nested`** field
-
-
🔷 Cluster Management
-
Questions
🔹 Diagnose shard issues and repair a cluster’s health
-
Repair corrupted shard
We will corrupt a shard to simulate a hardware issue, explore how ES behaves, and then recover the corrupted shard using the CLI utilities.
Note: in the real world you should, if possible, use a backup system to recover the index shard - the following approach may lose index data
-
🖱️ Code example
- ⚠️ We will break ES files, so be sure to run on a containerized environment created only for this exercise
- 💡 We will use the elasticsearch-shard CLI program
- 💡 The cluster used for the next code block is 07_autorun_disabled
# ───────────────────────────────────────────── # Shard issues repair # ───────────────────────────────────────────── # --- # Init # --- # 1. Start the cluster, `bash rerun` if # you are using https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/07_autorun_disabled # 2. The cluster will start one master node: es01 GET _cat/nodes?v # > master: * ; name: es01 GET _cluster/health?human # > "status" : "green" # --- # Start es02 # --- # 1. Open new WSL/CLI # 2. Run `$ docker exec -it es02 /bin/bash` # 3. Run `$ su - elasticsearch bin/elasticsearch &` # 4. Close the shell # Now we have started a new node on the second node GET _cat/nodes?v # > master: * ; name: es01 # > master: - ; name: es02 # --- # Data creation # --- # Add "Sample eCommerce orders" data directly from kibana, # Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana PUT /kibana_sample_data_ecommerce/_settings { "index": { "number_of_replicas": 0, "auto_expand_replicas": false } } # > 200 GET _cat/shards/kib*?v # > prirep : p ; node: es02 # Note: if the primary shard isn't on es02, # restart the cluster and the tutorial GET _cat/indices/kib*?v # > 4675 # --- # Invalidate the index # --- # 1. Go into es02: `$ docker exec -it es02 /bin/bash` # 2. Find where `kibana_sample_data_ecommerce` are: # - Go into `/usr/share/elasticsearch/data/nodes/0/indices` folder # - Search for a folder ~4.1M using `du -h` # - Go into the folder, e.g. `./G0u2hp4aSb2YUb_ukHaSNA/0/index` # - Open the first file and "mess" with the code, e.g. `vi _0.cfs` # - [!] Tricky point: write some data, save the file and check # with the following query if the index is broken. # Mess with the data until the next query don't return: # `corrupt_index_exception` GET kibana_sample_data_ecommerce/_search { "query": { "match": { "manufacturer": "Oceanavigations" } } } # > `corrupt_index_exception` # --- # Remove corrupted shard # --- # 1. Go into es02: `$ docker exec -it es02 /bin/bash` # 2. Stop the ES instance: # - Read the program ID using `$ps -aux` # - Kill the program using `kill <pid>` # 3. Run the recovery program: # `$ bin/elasticsearch-shard remove-corrupted-data --index kibana_sample_data_ecommerce --shard-id 0` # 4. Answer yes to all questions # 5. [!] Copy the last block of code, after the note: # "You should run the following command to allocate this shard:" # printed on CLI by the recovery program # 6. Paste the code on kibana, it should looks like the next block # 7. Re-run ES on es02 node: `$ su - elasticsearch bin/elasticsearch &` # 8. RUn the code you have pasted, with accept_data_loss set to true POST /_cluster/reroute { "commands" : [ { "allocate_stale_primary" : { "index" : "kibana_sample_data_ecommerce", "shard" : 0, "node" : "uG690rhBQ9GJTfDGqe9BIg", "accept_data_loss" : true } } ] } # > 200 GET kibana_sample_data_ecommerce/_search { "query": { "match": { "manufacturer": "Oceanavigations" } } } # > 200 # Note: the query now is working! GET kibana_sample_data_ecommerce/_count # > 4597 # Note: Originally we had 4675 documents, now 4597, # because the recovery process could lost some # data as advertised by the CLI program
-
-
Red or yellow cluster status
🔗 official doc
We will simulate an HW crash with a node shutdown
# ───────────────────────────────────────────── # Repair cluster health # ───────────────────────────────────────────── # --- # Start the cluster # --- # 1. Run the cluster named `08_autorun-disabled-3nodes` # $ bash rerun # 2. Run ES on es02: # $ docker exec -u elasticsearch es02 /usr/share/elasticsearch/bin/elasticsearch # Tip: you could escape from the command (`ctrl + c`) without problems: # the ES instance will continue to run # 3. Wait... GET _cat/nodes?v # > name: es01; master: *; node.role: m # > name: es02; master: - PUT test-index-01 { "settings": { "number_of_shards": 1, "number_of_replicas": 0 } } # > 200 GET _cat/shards/test*?v # > shard:0; prurep: p; node: es02 # Note: we have the primary shard of the index # stored inside node es02 GET _cluster/health # > status: green # --- # Go to yellow state # --- PUT test-index-01/_settings { "index" : { "number_of_replicas" : 1 } } # > 200 GET _cluster/health # > status: yellow # > "unassigned_shards" : 1 # Note: ES should allocate a replca shard, # but no nodes are available. # `es01` is tecnically an available index # but doesn't have role `data` GET _cluster/allocation/explain { "index": "test-index-01", "shard": 0, "primary": true, "current_node": "es02" } # > 200 # --- # Go to greed state: start new instance # --- # 1. Start ES instance inside es03: # $ docker exec -u elasticsearch es03 /usr/share/elasticsearch/bin/elasticsearch # 2. Wait... GET _cat/nodes?v # > ...same as before # > name: es03 GET _cat/shards/test*?v # > prirep:r; node: es03 # Note: the replica shard was created # and placed on node es03 GET _cluster/allocation/explain { "index": "test-index-01", "shard": 0, "primary": false, "current_node": "es03" } # > 200 GET _cluster/health # > "status" : "green" # --- # VM fault simulation # --- # > What happen if we kill # the node with the primary shard? GET _cat/shards/test*? # > primary shard on es02 # 1. Connect to the node # $ docker exec -u root -it es02 /bin/bash # 2. Find the ES prigram PID and kill it # $ ps -aux # $ kill 11 GET _cat/nodes?v # > node es02 disappeared GET _cat/shards/test*?v # > node: es03; prirep: p # Note: the replica shard allocated to es03 now, # after the es02 kill, it is converted to primary GET _cluster/health # > "status" : "yellow" GET _cluster/allocation/explain { "index": "test-index-01", "shard": 0, "primary": true, "current_node": "es03" } # > 200 # --- # VM recovery # --- # > What happen if the node # come back in function? # 1. Start ES on es02 node # $ docker exec -u elasticsearch es02 /usr/share/elasticsearch/bin/elasticsearch # 2. Wait... GET _cat/nodes?v # > name: es02 GET _cat/shards/test*?v # > prirep: r; node: es02 GET _cluster/health # > "status" : "green",
🔹 Backup and restore a cluster and/or specific indices
🔗 Official doc
-
🔗 For more info see the chapter of this guide under Deepenings → Index management → Backup/restore snapshots chapter
-
💡 Takeaways
- use snapshots to store ES resources on disk, i.e. indexes and settings
- we will create a family of snapshots inside a resource named repository
- the path where snapshot files are stored is defined inside the repository and must be declared in each node's settings (elasticsearch.yml - see doc and the sketch after this list)
- we can schedule the snapshots lifecycle (when to take a snapshot, when to delete it, etc.) using Snapshot Lifecycle Management (SLM) - doc
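A minimal sketch of the node-level setting required before registering the repository; the location matches the one used in the code block below, adjust it to your own mount point:
# elasticsearch.yml - must be present on every node of the cluster
path.repo: ["/mnt/bkp"]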
-
🖱️ Code example
- Almost all the functionalities and tasks related to the snapshot ecosystem can also be performed from the Kibana UI, as an alternative to the following code block
- The cluster to use for the next is 04_snapshots-locals
# ───────────────────────────────────────────── # Backup and restore a cluster and/or specific indices # ───────────────────────────────────────────── # Run a cluster with a repo path registered: # https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/04_snapshots-locals # --- # Register the repository # --- PUT /_snapshot/my-repository { "type": "fs", "settings": { "location": "/mnt/bkp" } } # > 200 # Note: "location" value must coincide with # informations stored on settings inside # the elasticsearch.yml file of each node # --- # Create a snapshot # --- PUT test-index-01/_doc/01 { "name": "donald", "surname": "duck" } # > 200 PUT test-index-02/_doc/01 { "song": "song2" } # > 200 PUT _snapshot/my-repository/my-first-snapshot { "indices": "test-index-01,test-index-02", "ignore_unavailable": true, "include_global_state": false, "metadata": { "taken_by": "es exercises", "taken_because": "test the backup system" } } # > "state" : "SUCCESS" # Note: we are creating a snapshot named `my-first-snapshot`, # it will include `test-index-01` and `test-index-02` # Warning: don't put spaces on "indices" field, no error will # be raised and the second index will not be included GET _cat/snapshots/my-repository?v # > id: my-first-snapshot # > failed_shards: 0 # --- # Recovery from a snapshot # --- PUT test-index-01/_doc/02 { "name": "donald", "surname": "Knuth" } # > 200 DELETE test-index-01/_doc/01 GET test-index-01/_search # > 1 hit, donald Knuth POST test-index-01/_close # > 200 # Note: A closed index is blocked for read/write operations, # we need to close an index before restore it POST /_snapshot/my-repository/my-first-snapshot/_restore { "indices": "test-index-01", "ignore_unavailable": true, "include_global_state": false, "include_aliases": false } # > 200 GET test-index-01/_search # > 2 hits, both Knuth and duck # Note: the index has recovered the # deleted document. # --- # Recover a changed document # --- PUT test-index-01/_doc/01 { "name":"salvo", "surname": "errori" } # > 200 POST test-index-01/_close # > 200 POST /_snapshot/my-repository/my-first-snapshot/_restore { "indices": "test-index-01", "ignore_unavailable": true, "include_global_state": false, "include_aliases": false } # > 200 GET test-index-01/_search # > 2 hits, both Knuth and duck # Note: the changed document is overwritted by the snapshot recovery PUT /_slm/policy/nightly-snapshots { "schedule": "0 30 1 * * ?", "name": "<nightly-snap-{now/d}>", "repository": "my-repository", "config": { "indices": [ "*" ] }, "retention": { "expire_after": "30d", "min_count": 5, "max_count": 50 } } # > 200 # Note: the policy will create a snapshot of all indexes # daily at 1:30AM UTC, then clean snapshot if they are # more than 50 or are older than 1 month. # The above rules doesn't apply if the snapshots created # are less than 5. 
# --- # Restore on different index # --- POST /_snapshot/my-repository/my-first-snapshot/_restore { "indices": "*", "ignore_unavailable": true, "include_global_state": false, "rename_pattern": "index_*", "rename_replacement": "restored_index_$1", "include_aliases": false } # > 500 # Note: "index_out_of_bounds_exception", this error is dued # the fact we cannot use "index_*" as parameter POST /_snapshot/my-repository/my-first-snapshot/_restore { "indices": "*", "ignore_unavailable": true, "include_global_state": false, "rename_pattern": "test-index-(.+)", "rename_replacement": "restored-$0", "include_aliases": false } # > 200 # Note: the restored indexes will have # the naming form of restored-<original index name> # because we had used $0 as variable GET _cat/indices/restored*?v # > restored-test-index-01 # > restored-test-index-02 GET restored-test-index-01/_search # > both donald duck and knuth # --- # Store and restore the cluster # --- POST _aliases { "actions": [ { "add": { "index": "test-index-01", "alias": "users-census" } } ] } # > 200 PUT _snapshot/my-repository/my-cluster-snapshot { "indices": "*", "ignore_unavailable": true, "include_global_state": true, "metadata": { "taken_by": "es exercises", "taken_because": "first cluster complete backup" } } # > 200 # Note: we have setted "*" to say "all indices" and # "include_global_state": true to store GET _snapshot/my-repository/my-cluster-snapshot # > indices: .kibana_task_manager... # Note: the indices stored are more than the defined by us, # this is because system indices are included in the backup
🔹 Configure a snapshot to be searchable
- “use snapshots to search infrequently accessed and read-only data” - doc
- With searchable snapshots we can search through data stored in a repository without loading all the indexes - at the cost of slower searches we save node HW resources
- We will see different searchable snapshot usages in the code block because searchable snapshots are a versatile functionality
    - e.g. we will create indices that after X seconds are converted into searchable snapshots,
    - how to use it in a Hot-Warm-Cold architecture,
    - how to mount an already-taken snapshot as a searchable index,
    - how to integrate a searchable snapshot inside a data stream
- Some Q&A about searchable snapshots:
    - Can we use searchable snapshots without templates?
        - Yes, just attach the searchable snapshot functionality to the ILM policy
    - Can we set the searchable snapshot functionality on a hot index?
        - Yes, but you must use the rollover functionality (see the sketch after this list)
    - Should we take the snapshot before creating the searchable snapshot?
        - Yes and no: if the searchable snapshot functionality is defined inside an ILM policy, the snapshot will be created automatically and mounted to be searched.
        If you already have a snapshot, you can mount it and search it
    - Can we create an ILM policy without rollover?
        - Yes
    - 💡 Can we create an ILM policy with rollover and no index template?
        - Yes, but you must specify the index alias using the parameter index.lifecycle.rollover_alias at index creation: no rollover will be triggered if ES cannot know how to update the alias name
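A minimal sketch covering the two rollover-related answers above, assuming a snapshot repository named my-repository is already registered (policy and index names are illustrative):
PUT _ilm/policy/hot-searchable-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_age": "30d" },
          "searchable_snapshot": { "snapshot_repository": "my-repository" }
        }
      }
    }
  }
}

PUT test-hot-idx-000001
{
  "aliases": { "test-hot-idx": { "is_write_index": true } },
  "settings": {
    "index.lifecycle.name": "hot-searchable-policy",
    "index.lifecycle.rollover_alias": "test-hot-idx"
  }
}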
-
🖱️ Code example
-
- 🦂 if you create a new Index Lifecycle Policy from the Kibana UI you will not be able to enable the Searchable snapshot option: this isn’t related to some index settings you must follow, it is instead a license-related problem.
You must enable the functionality by activating the license, go to:
Stack Management → License management → Start a 30-day trial
(or use the Kibana code as described in the next code block)
🦂 Often the ILM system isn’t really responsive, especially if the timing between the phases is in the order of seconds. This delay is caused by the ILM checking system, described here.
-
To increase the ILM checking frequency, set the following cluster parameter - doc
# default for indices.lifecycle.poll_interval is 10m
PUT _cluster/settings { "persistent": { "indices.lifecycle.poll_interval": "5s" } }
-
-
💡 min_age parameter between phases calculation - blog
- If the rollover is used, min_age is calculated off the rollover date
- Otherwise, min_age is calculated off the original index’s creation date.
-
🖱️ Section 1: explore ILM and searchable snapshot
- 🦂 In one example, during the Searchable snapshot phase ILM changes the name of the index and creates an alias pointing to the “original” name. The new index is restored-<original-name> and is a new index with the snapshot mounted.
# ───────────────────────────────────────────── # Configure a snapshot to be searchable # # Section 1: explore ILM and searchable snapshot # ───────────────────────────────────────────── # Cluster requirements: # - nodes with Hot & Cold tiers # - path registered for snapshots # Cluster to use: # `04_snapshots-locals` # https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/04_snapshots-locals GET _cat/nodes?v # > es03 node.role: cm # Note: the es03 node has the cold role # --- # Cluster init # --- PUT /_snapshot/my-repository { "type": "fs", "settings": { "location": "/mnt/bkp/" } } # > 200 PUT _ilm/policy/my_policy { "policy": { "phases": { "cold": { "actions": { "searchable_snapshot": { "snapshot_repository": "my-repository" } } } } } } # > 400 # > "current license is non-compliant for [searchable-snapshots]" # Note: the basic license doesn't allow searchable-snapshots functionality GET _license # > "type" : "basic" POST /_license/start_trial?acknowledge=true # > "trial_was_started" : true # Note: now functionalities like searchable-snapshots # are unblocked PUT _ilm/policy/my_policy { "policy": { "phases": { "cold": { "actions": { "searchable_snapshot": { "snapshot_repository": "my-repository" } } } } } } # > 200 # Note: now we can use searchable snapshot functionality DELETE _ilm/policy/my_policy # > 200 PUT _cluster/settings { "persistent": { "indices.lifecycle.poll_interval": "5s" } } # > 200 # Note: increase the pool checking interval # because we will test ILM policies with # time between phases in the order of seconds # --- # Basic ILP # --- PUT _ilm/policy/test-policy-1 { "policy": { "phases": { "hot": { "min_age": "0ms", "actions": { "set_priority": { "priority": 100 } } }, "warm": { "min_age": "10s", "actions": { "set_priority": { "priority": 50 } } }, "cold": { "min_age": "60s", "actions": { "set_priority": { "priority": 0 } } } } } } # > 200 # Note: move the index to warm after 10s # and to cold after 60s # Tip: code generated from Kibana webapp # under `Index Lifecycle Policies` PUT test-index-01 { "settings": { "number_of_shards": 1, "number_of_replicas": 0, "index.lifecycle.name": "test-policy-1" } } # > 200 # Note: is important set replicas to 0, # with only 3 nodes (hot - warm - cold) # the replica shard cannot be instantiated GET _cat/shards/test*?v # > node: es01 PUT test-index-01/_doc/01 { "msg": "payload" } # > 200 # Wait 10s... GET _cat/shards/test*?v # > node: es02 # Note: now is in warm node es02 # Wait 60s... GET _cat/shards/test*?v # > node: es03 # Note: the shard is finally moved to the cold node es03 # --- # ILP with a searchable snapshot # --- PUT _ilm/policy/test-policy-2 { "policy": { "phases": { "hot": { "min_age": "0ms", "actions": { "set_priority": { "priority": 100 } } }, "warm": { "min_age": "10s", "actions": { "set_priority": { "priority": 50 } } }, "cold": { "min_age": "60s", "actions": { "set_priority": { "priority": 0 }, "searchable_snapshot": { "snapshot_repository": "my-repository" } } } } } } # > 200 # Note: same policy as before but with # snapshot_repository in the cold phase PUT test-index-02 { "settings": { "number_of_shards": 1, "number_of_replicas": 0, "index.lifecycle.name": "test-policy-2" } } # > 200 PUT test-index-02/_doc/1 { "msg": "payload" } # > 200 GET _cat/shards/test-index-02?v # > es01 # Wait 10s... GET _cat/shards/test-index-02?v # > es02 # Wait 60s... 
(maybe >> 60s) GET _cat/shards/test-index-02?v # > index: restored-test-index-02 # > node: es03 (memo: es03 is the cold node) # Note: the index name is changed! Under the hood # the ILM system did some things, let's explore... # From CLI we can visit the `04_snapshots-locals/backup` folder, # inside we can find some files: they are the test-index-02 # searchable snapshot GET test-index-02/_ilm/explain # > index" : "restored-test-index-02" # > "phase" : "cold" GET _cat/aliases/test*?v # > alias: test-index-02 # > index: restored-test-index-02 # Note: the ILM has created an alias with the # index name and a redirection to the restored index GET test-index-02/_search # > "_id" : "1" # Note: we can use the index for search PUT test-index-02/_doc/2 { "msg": "2nd payload" } PUT restored-test-index-02/_doc/2 { "msg": "2nd payload" } # > 403 - cluster_block_exception # Nore: we cannot store new data on an index # that have the snapshot stored on a file-system GET /_searchable_snapshots/stats # > restored-test-index-02 | "num_files" : 1 # --- # ILP with searchable snapshot on hot phase # --- PUT _ilm/policy/test-policy-3 { "policy": { "phases": { "hot": { "min_age": "0ms", "actions": { "set_priority": { "priority": 100 }, "searchable_snapshot": { "snapshot_repository": "my-repository" } } } } } } # > 400 - the [searchable_snapshot] action(s) could not be used in the [hot] phase without an accompanying [rollover] action # Note: we cannot create a searchable snapshot in the hot # phase without the rollover functionality PUT _ilm/policy/test-policy-3 { "policy": { "phases": { "hot": { "actions": { "rollover": { "max_age": "10s" }, "set_priority": { "priority": 100 }, "searchable_snapshot": { "snapshot_repository": "my-repository", "force_merge_index" : true } }, "min_age": "0ms" } } } } # > 200 # Note: searchable snapshot and rollover # Note: "force_merge_index" : true is a best practice, see # https://www.elastic.co/guide/en/elasticsearch/reference/7.13/ilm-searchable-snapshot.html#ilm-searchable-snapshot-options PUT test-index-03 { "settings": { "number_of_shards": 1, "number_of_replicas": 0, "index.lifecycle.name": "test-policy-3" } } # > 200 GET _cat/shards/test-index-03?v # > node: es01 PUT test-index-03/_doc/01 { "msg": "payload" } # > 200 GET _cat/indices/test-index-03*?v GET _cat/shards/test-index-03?v # > node: es01 # Note: the rollover cannot be done # because no index alias is found POST _aliases { "actions": [ { "add": { "index": "test-index-03", "alias": "test-index-03-alias" } } ] } # > 200 # > setting [index.lifecycle.rollover_alias] for index [test-index-03] is empty or not defined # Note: try the API multiple times to get the error # Note: the rollover cannot be completed # because we haven't set the rollover alias name PUT test-index-03/_settings { "index.lifecycle.name": "test-policy-3", "index.lifecycle.rollover_alias": "test-index-03-alias" } # > 200 # Note: we need to provide the index alias to update after the rollout, # api body structure from # https://www.elastic.co/guide/en/elasticsearch/reference/7.13/getting-started-index-lifecycle-management.html#ilm-gs-alias-apply-policy GET _cat/indices/test*?v # > test-index-000004 | docs.count: 0 # > restored-test-index-03 | docs.count: 1 # Note: the `test-index-000004` is the index created # after the rollover # Note: the `restored-test-index-03` is the "original" index # after the rollover process, stored as a searchable index GET /_searchable_snapshots/stats # > restored-test-index-03 | "num_files" : 1 GET test-index-03/_search # > 
"_id" : "01" GET test-index-03-alias/_search # > 0 hit # Note: why zero hits? # -> because the alias now point to # the index created by the rollover process GET _cat/aliases/test*?v # > alias: test-index-03-alias | index: test-index-000004 # > alias: test-index-03 | restored-test-index-03 # Note: like before a new alias is created that point # to the new index with searchable snapshot PUT test-index-03/_doc/02 { "msg": "2nd payload" } # > cluster_block_exception # Note: cannot insert data on a snapshot GET _cat/snapshots/my-repository?v # > 2 entries: the test-index-02 and test-index-03 searchable snapshots
-
🖱️ Section 2: real-world usages
# ───────────────────────────────────────────── # Configure a snapshot to be searchable # # Section 2: real-world usages # ───────────────────────────────────────────── # Cluster requirements: # - nodes with Hot & Cold tiers # - path registered for snapshots # Cluster to use: # `04_snapshots-locals` - https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/04_snapshots-locals GET _cat/nodes?v # > es03 node.role: cm # Note: the es03 node have a cold role # --- # Cluster init # --- POST /_license/start_trial?acknowledge=true PUT _cluster/settings { "persistent": { "indices.lifecycle.poll_interval": "5s" } } PUT /_snapshot/my-repository { "type": "fs", "settings": { "location": "/mnt/bkp/" } } # --- # Make indices in existing snapshot searchable # --- PUT test-index-01/_doc/01 { "msg": "payload" } PUT /_snapshot/my-repository/test-index-01-snapshot?wait_for_completion=true { "indices": "test-index-01", "include_global_state": false } # > "state" : "SUCCESS" GET _cat/snapshots/my-repository?v # > successful_shards: 1 POST /_snapshot/my-repository/test-index-01-snapshot/_mount?wait_for_completion=true { "index": "test-index-01", "renamed_index": "test-index-01-snapshot", "index_settings": { "index.number_of_replicas": 0 } } # > "successful" : 1 # Note: we have just mounted a snapshot as a new # index named `test-index-01-snapshot`, it # is a searchable snapshot GET _cat/indices/test*?v # > index: test-index-01 | health yellow | docs.count 1 # > index: test-index-01-snapshot | health green | docs.count 1 # Note: test-index-01 is yellow because it would instantiate # a replica shard but we cannot do it (no other indices with hot role). # Instead test-index-01-snapshot is green because has replica set to 0 PUT test-index-01-snapshot/_doc/02 { "msg": "2nd payload" } # > cluster_block_exception # Note: cannot insert data on a searchable snapshot PUT test-index-01/_doc/02 { "msg": "2nd payload" } # > 200 # Note: the "normal" index continue to work as usual GET test-index-01-snapshot/_doc/02 # > found: false # Note: how can align the two indexes? PUT /_snapshot/my-repository/test-index-01-snapshot02?wait_for_completion=true { "indices": "test-index-01", "include_global_state": false } DELETE test-index-01-snapshot POST /_snapshot/my-repository/test-index-01-snapshot02/_mount?wait_for_completion=true { "index": "test-index-01", "renamed_index": "test-index-01-snapshot", "index_settings": { "index.number_of_replicas": 0 } } GET test-index-01-snapshot/_doc/02 # > "_id" : "02" # Apply best practices: # > To mount an index from a snapshot that contains multiple indices, # we recommend creating a clone of the snapshot that contains only the # index you want to search, and mounting the clone. 
# https://www.elastic.co/guide/en/elasticsearch/reference/7.13/searchable-snapshots.html#using-searchable-snapshots PUT /_snapshot/my-repository/test-index-01-snapshot02/_clone/test-index-01-snapshot02-searchable { "indices": "test-index-01" } # > 200 DELETE test-index-01-snapshot POST /_snapshot/my-repository/test-index-01-snapshot02-searchable/_mount?wait_for_completion=true { "index": "test-index-01", "renamed_index": "test-index-01-snapshot", "index_settings": { "index.number_of_replicas": 0 } } GET test-index-01-snapshot/_doc/02 # > "_id" : "02" # Info: the above process could be automatizated # using the ILM functionalities + aliases: let's do it # --- # Searchable snapshot and Hot-Warm-Cold ILM # --- PUT _ilm/policy/test-policy-02 { "policy": { "phases": { "hot": { "actions": { "rollover": { "max_docs": 1 }, "set_priority": { "priority": 100 }, "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 } }, "min_age": "0ms" }, "warm": { "min_age": "0d", "actions": { "set_priority": { "priority": 50 } } }, "cold": { "min_age": "60s", "actions": { "set_priority": { "priority": 0 }, "searchable_snapshot": { "snapshot_repository": "my-repository" } } } } } } # > 200 # Note: create new index after 1 document indexed, # move the old index to warm immediately, # then wait 1m and move to cold node # and make the index a searchable snapshot PUT test-index-02-000001 { "aliases": { "test-index-02": {} }, "settings": { "number_of_shards": 1, "number_of_replicas": 0, "index.lifecycle.name": "test-policy-02", "index.lifecycle.rollover_alias": "test-index-02" } } # > 200 # Warning: without `index.lifecycle.rollover_alias` # the rollover will not start GET _cat/aliases/test*?v # > alias: test-index-02 | index: test-index-02-000001 PUT test-index-02/_doc/01 { "msg": "payload" } # > 200 GET _cat/indices/test-index-02*?v # > index: test-index-02-000001 | health green # > index: test-index-02-000002 | health yellow # Note: the rollover have created the new index `test-index-02-000002`, # but without a template, the new index will have a replica setings # set to 1 and no nodes with hot role for replica shards are available GET _cat/shards/test-index-02*?v # > test-index-02-000001 | node: es02 GET _cat/shards/test-index-02*?v # > restored-test-index-02-000001 | node: es03 # Note: the searchable index is set with the `restored...` index GET _cat/aliases/test*?v # > alias: test-index-02-00001 | index=restored-test-index-02-000001 PUT test-index-02/_doc/02 { "msg": "2nd payload" } GET _cat/shards/test-index-02*?v # > restored-test-index-02-000001: the searchable snapshot # > test-index-02-000002: the index created from rollover # Note: no new indices are created when the new document is indexed. 
# This is because `test-index-02-00002` created from the rollover # process doesn't have the policy attached (no template was used) # --- # Searchable snapshot and data stream # --- PUT _ilm/policy/test-policy-03 { "policy": { "phases": { "hot": { "actions": { "rollover": { "max_docs": 1 }, "set_priority": { "priority": 100 }, "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 } }, "min_age": "0ms" }, "warm": { "min_age": "0d", "actions": { "set_priority": { "priority": 50 } } }, "cold": { "min_age": "60s", "actions": { "set_priority": { "priority": 0 }, "searchable_snapshot": { "snapshot_repository": "my-repository" } } } } } } # > 200 PUT _index_template/my-index-template { "index_patterns": [ "test-index-03*" ], "data_stream": {}, "template": { "mappings": { "properties": { "@timestamp": { "type": "date", "format": "date_optional_time||epoch_millis" } } }, "settings": { "number_of_shards": 1, "number_of_replicas": 0, "index.lifecycle.name": "test-policy-03" } }, "priority": 500 } # > 200 # Note: under settings we don't need to specify the # "index.lifecycle.rollover_alias" parameter, # will be the data_stream to manage this parameter PUT _data_stream/test-index-03 GET _data_stream/test-index-03 # > 200 POST test-index-03/_doc?refresh=true { "@timestamp": "2020-01-01T00:00:00", "msg": "payload" } # Note: differently from normal indices, # data streams want a POST API to index # new data and NOT specify the index ID GET _cat/shards/*03*?v # > index: xxxx-000001 | node: es02 # > index: xxxx-000002 | node: es01 # Note: the ILM policy have created the new index and moved the old # Wait 60s... GET _cat/shards/*03*?v # > index: restored-xxxx-000001 | node: es03 # > index: xxxx-000002 | node: es01 # Note: the ILM policy have created the searchable snapshot POST test-index-03/_doc?refresh=true { "@timestamp": "2021-01-01T00:00:00", "msg": "2nd payload" } GET _cat/shards/*03*?v # > index: restored-xxxx-000001 | node: es03 # > index: xxxx-000002 | node: es02 # > index: xxxx-000003 | node: es01 # Wait 60s... GET _cat/shards/*03*?v # > index: restored-xxxx-000001 | node: es03 # > index: restored-xxxx-000002 | node: es03 # > index: xxxx-000001 | node: es01 # Note: the data stream continue to apply the policy of # rollover when a new index is uploaded, move the old # index to warm node and after 1m move to cold node # and create a searchable snapshot.
-
🔹 Configure a cluster for cross-cluster search (remote cluster)
🔗 Official doc
-
“You can connect a local cluster to other Elasticsearch clusters, known as remote clusters.” - doc
-
To get cross-cluster functionality you must configure a connection to the remote cluster; afterwards you are able to search across all the configured clusters
- Not only simple searches:
    - here the list of available APIs that can be used on remote clusters
    - we will also see how to sync data between clusters using cross-cluster replication (next exam question)
-
How to configure a remote cluster
Steps to connect cluster2 as a remote cluster on cluster1
-
Run cluster 1 with (at least) one node with the remote_cluster_client role - info1 info2 (see the elasticsearch.yml sketch after this list)
Be sure the cluster 2 nodes can be reached
- e.g. if you are on docker-compose: open a shell on a node in cluster 1 and use curl to test the connection
-
Connect the remote cluster
There are two ways to create the connection
- Hot mode
- Open Kibana and connect the remote cluster using the dedicated API
- 🦂 Warning: you must specify in the API the remote cluster host and port, pay attention that the port to use isn’t the 9200 but instead the transport port (default 9300) - doc
- Cold mode
- Editing the elasticsearch.yml settings file of the remote_cluster_client node - doc (see the elasticsearch.yml sketch after this list)
There are also two connection architectures
- Sniff mode (default)
- (remote) cluster state is retrieved from one of the seed nodes and up to three gateway nodes are selected as part of remote cluster requests
- 🦂 Dedicated master nodes (on the remote cluster) are never selected as gateway nodes - we will test this behavior in the 🖱️ Code example block
- Proxy mode
- a cluster is created using a name and a single proxy address
- The proxy is required to route those connections to the remote cluster.
- The proxy mode is not the default connection mode and must be configured
-
Search on the remote cluster using cluster2:<idx name> as the index name; you can also search across multiple remote clusters and local indices in the same request - doc (see the example below)
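A minimal sketch of a single search spanning the local cluster and two remote clusters (index names and cluster aliases are illustrative):
GET local-index,cluster2:my-index,cluster3:my-index/_search
{
  "query": { "match_all": {} }
}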
-
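A minimal elasticsearch.yml sketch for the steps above, assuming the ES 7.x node.roles syntax and the host names used in the docker-compose files of these notes (cluster2 reachable at es02:9300):
# elasticsearch.yml of the cluster1 node that opens the remote connections
node.roles: [ master, data, remote_cluster_client ]

# "Cold mode" alternative: declare the remote cluster statically
cluster:
  remote:
    cluster2:
      mode: sniff
      seeds:
        - es02:9300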
-
🖱️ Code example
# ───────────────────────────────────────────── # Configure a cluster for cross-cluster search # ───────────────────────────────────────────── # Cluster requirements: # - 3 clusters # - 3 networks # - 1 node with some specs: # - registered on all 3 the networks # - with the role `remote_cluster_client` # Cluster to use: # `10_cross-cluster` - https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/10_cross-cluster # --- # Connect cluster2 as remote cluster of cluster1 # # > Run the following code on Cluster 1, # using kibana at localhost:5601 # --- GET _cat/nodes?v # > name: es01 | role: dmr # Note: the node *must* have the `r` role, it represent `remote_cluster_client` role # Optional: check the clusters2 connection # - From CLI enter in es01 and query cluster2: # - $ docker exec -u elasticsearch -it es01 /bin/bash # - $ curl es02:9200 # - > ..."cluster_name" : "cluster2"... PUT _cluster/settings { "persistent": { "cluster": { "remote": { "cluster2": { "seeds": [ "es02:9300" ] } } } } } # > "acknowledged" : true # Note: the port 9300 is used GET _remote/info # > "connected" : true # Note: if receive `node [es01] does not have the [remote_cluster_client] role` # you shuld add to the master node the `remote_cluster_client` role # --- # Insert data on cluster2 # # > Run the following code on Cluster 2, # using kibana at localhost:5602 # --- GET _cat/nodes?v # > name: es02 | role: dm GET _remote/info # > 400 | "node [es02] does not have the [remote_cluster_client] role # Note: the `remote_cluster_client` isn't required on the remote cluster PUT idx-cluster2/_doc/01 { "msg" : "Hello from `cluster2`!" } # > 200 # --- # Query data from cluster 2 # # > Run the following code on Cluster 1, # using kibana at localhost:5601 # --- GET cluster2:idx-cluster2/_search { "query": { "match_all": {} } } # > "msg" : "Hello from `cluster2`!" # --- # Run cluster 3 nodes # --- # Run the master and data ES nodes of cluster 3 # - Connect to both the nodes and run the ES program, # - $ docker exec -u elasticsearch -it es03d /bin/bash # - $ bin/elasticsearch & # - $ exit # - $ docker exec -u elasticsearch -it es03m /bin/bash # - $ bin/elasticsearch & # - $ exit # - wait ~1m # > Run the following code on Cluster 3, # using kibana at localhost:5603 GET _cat/nodes?v # > name: es03m | role: m # > name: es03d | role: d PUT idx-cluster3/_doc/01 { "msg" : "Hello from `cluster3`!" } # > 200 # --- # Connect cluster3 as remote cluster of cluster1 # # > Run the following code on Cluster 1, # using kibana at localhost:5601 # --- GET _cat/nodes?v # > name: es01 | role: dmr PUT _cluster/settings { "persistent": { "cluster": { "remote": { "cluster3m": { "seeds": [ "es03m:9300" ], "transport.ping_schedule": "30s" }, "cluster3d": { "seeds": [ "es03d:9300" ], "transport.ping_schedule": "30s" } } } } } # > "acknowledged" : true # Note: we try to connect both at the # "only master" node and the "data" node GET _remote/info # > cluster3d.num_nodes_connected: 1 # > cluster3m.num_nodes_connected: 1 GET cluster3m:idx-cluster3/_search { "query": { "match_all": {} } } # > "msg" : "Hello from `cluster3`!" GET cluster3d:idx-cluster3/_search { "query": { "match_all": {} } } # > "msg" : "Hello from `cluster3`!"
🔹 Implement cross-cluster replication *
🔗 Official doc
-
“With cross-cluster replication (CCR) you can replicate indices across clusters” - doc
-
Benefits:
- In case of disaster, you have a hot backup - doc
- Distribute searchable copies near the users' geolocation to cut network latency - doc
- Implement different architectures that provide project-required functionalities like disaster recovery resilience, increased data availability, etc.
-
CCR is an xpack functionality and requires the license to be activated
-
CCR works in an active-passive model (see the sketch after this list):
- “You index to a leader index, and the data is replicated to one or more read-only follower indices” - doc
- When the leader index indexes new data, the follower indices pull the changes from the leader index
- You can also chain replicas: attach a follower index to another follower index
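A minimal sketch of the leader/follower relationship, assuming a remote cluster alias cluster2 is already configured and holds the leader index (index names are illustrative):
PUT /follower-index/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster": "cluster2",
  "leader_index": "leader-index"
}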
-
Replication mechanism - doc
- Elasticsearch achieves replication at the shard level, so the follower index will have the same number of shards as its leader index.
- As a matter of fact, you cannot change the shard number in the create follower index API
- The follower index shard updates its shard information and immediately sends another read request to the leader index shard
- If the follower index read request fails:
    - If the read fails with an error that can auto-recover (e.g. a network issue), the follower index enters a retry loop
    - For errors that cannot auto-recover, the follower index pauses the read requests until you resume it
    - Tip: we will test both cases in the 🖱️ Code example block
- Cross-cluster replication works by replaying the history of individual write operations that were performed on the shards of the leader index.
- This can work only if the leader index has history retention (soft deletes) activated - doc (a sketch of the relevant index settings follows this list)
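A minimal sketch of the leader-side index settings involved, assuming the index.soft_deletes settings available in 7.x (the retention lease period shown is the documented default):
PUT leader-index
{
  "settings": {
    "index.soft_deletes.enabled": true,
    "index.soft_deletes.retention_lease.period": "12h"
  }
}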
-
How to set up cross-cluster replication
🔗 Official tutorial
- Set up both clusters
    - There are cluster-wide settings (elasticsearch.yml) to tune different CCR aspects (e.g. the requested chunk size)
    - A license that includes cross-cluster replication must be activated on both clusters.
- Set up the leader cluster
    - The leader indices must have the soft-deletes feature activated - API
- Set up the follower cluster
-
🔗 Resources
-
🖱️ Code example
-
- 🦂 You must enable the Cross Cluster functionality by activating the license, go to:
Stack Management → License management → Start a 30-day trial
(or use the Kibana code as described in the next code block)
🖱️ Section 1: create a follower index
# ───────────────────────────────────────────── # Configure a cluster for cross-cluster search # # Section 1: create a follower index # ───────────────────────────────────────────── # Cluster to use for the test: `10_cross-cluster` # https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/10_cross-cluster # --- # Connect es02 as remote cluster # # > Run the following code on Cluster 1, # using kibana at localhost:5601 # --- GET _cat/nodes?v # > name: es01 | role: dmr POST _license/start_trial?acknowledge=true GET _license # > "status" : "active", PUT _cluster/settings { "persistent": { "cluster": { "remote": { "cluster2": { "seeds": [ "es02:9300" ] } } } } } # > "acknowledged" : true, GET _remote/info # > "num_nodes_connected" : 1 # --- # Create indices on cluster2 # # > Run the following code on Cluster 2, # using kibana at localhost:5602 # --- GET _cat/nodes?v # > name: es02 | role: dm POST _license/start_trial?acknowledge=true GET _license # > "status" : "active" PUT idx-cluster2 { "settings": { "index.soft_deletes.enabled": true } } # > 200 PUT idx-cluster2-nosoft { "settings": { "index.soft_deletes.enabled": false } } # > 200 PUT idx-cluster2/_doc/01 { "msg" : "Hello from `cluster2`!" } # > 200 PUT idx-cluster2-nosoft/_doc/01 { "msg" : "Hello from `cluster2`!" } # > 200 # --- # Create follower index on cluster1 # # > Run the following code on Cluster 1, # using kibana at localhost:5601 # --- GET _cat/nodes?v # > name: es01 | role: dmr PUT follower-idx-cluster2/_ccr/follow?wait_for_active_shards=1 { "remote_cluster": "cluster2", "leader_index": "idx-cluster2" } # > "follow_index_shards_acked" : true GET follower-idx-cluster2/_search?size=10 # > "msg" : "Hello from `cluster2`!" PUT follower-idx-cluster2-nosoft/_ccr/follow?wait_for_active_shards=1 { "remote_cluster" : "cluster2", "leader_index" : "idx-cluster2-nosoft" } # > 400 | leader index [idx-cluster2-nosoft] does not have soft deletes enabled # Note: indices without soft-delete parameter enabled cannot be # used for cross-cluster replications GET follower-idx-cluster2/_ccr/info # > remote_cluster" : "cluster2" # > "status" : "active" GET follower-idx-cluster2/_ccr/stats # > "remote_cluster" : "cluster2" PUT follower-idx-cluster2/_doc/99 { "foo": "bar" } # > 403 | status_exception # --- # Add more data on cluster2 # # > Run the following code on Cluster 2, # using kibana at localhost:5602 # --- PUT idx-cluster2/_doc/02 { "msg" : "2nd msg" } # > 200 # --- # Check automatically updated data # # > Run the following code on Cluster 1, # using kibana at localhost:5601 # --- GET follower-idx-cluster2/_search?size=10 # > "msg" : "2nd msg"
-
🖱️ Section 2: simulate different outages
# ───────────────────────────────────────────── # Configure a cluster for cross-cluster search # # Section 2: simulate different outages # ───────────────────────────────────────────── # Cluster to use for the test: `10_cross-cluster` # https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/10_cross-cluster # --- # Run cluster 3 nodes # --- # Run the master and data ES nodes of cluster 3 # - Connect to both the nodes and run the ES program, # - $ docker exec -u elasticsearch -it es03d /bin/bash # - $ bin/elasticsearch & # - $ exit # - $ docker exec -u elasticsearch -it es03m /bin/bash # - $ bin/elasticsearch & # - $ exit # - wait ~1m # > Run the following code on Cluster 3, # using kibana at localhost:5603 GET _cat/nodes?v # > name: es03m | role: m # > name: es03d | role: d PUT idx-cluster3/_doc/01 { "msg" : "Hello from `cluster3`!" } # > 200 POST _license/start_trial?acknowledge=true GET _license # > "status" : "active" # --- # Create follower index on cluster1 # # > Run the following code on Cluster 1, # using kibana at localhost:5601 # --- POST _license/start_trial?acknowledge=true GET _license # > "status" : "active" PUT _cluster/settings { "persistent": { "cluster": { "remote": { "cluster3": { "seeds": [ "es03m:9300" ] } } } } } # > "acknowledged" : true GET _remote/info # > cluster3 | "connected" : true PUT /follower-idx-cluster3/_ccr/follow { "remote_cluster": "cluster3", "leader_index": "idx-cluster3", "max_read_request_operation_count": 5120, "max_outstanding_read_requests": 12, "max_read_request_size": "32mb", "max_write_request_operation_count": 5120, "max_write_request_size": "9223372036854775807b", "max_outstanding_write_requests": 9, "max_write_buffer_count": 2147483647, "max_write_buffer_size": "512mb", "max_retry_delay": "500ms", "read_poll_timeout": "1m" } # > "follow_index_created" : true # Note: the above configuration could be # created from Kibana GUI under # Stack Management -> Cross-Cluster Replication GET follower-idx-cluster3/_search?size=10 # > "msg" : "Hello from `cluster3`!" 
# --- # Simulate connection interruption # --- # From CLI: # - $ docker network disconnect 10_cross-cluster_cluster03net es01 # - $ docker exec -u elasticsearch -it es01 /bin/bash # - $ curl es03m:9200 # > curl: (6) Could not resolve host: es03m # - $ exit # --- # Add data on cluster3 index # # > Run the following code on Cluster 3, # using kibana at localhost:5603 # --- PUT idx-cluster3/_doc/02 { "msg" : "2nd payload" } # > 200 PUT idx-cluster3/_doc/03 { "msg" : "3th payload" } # > 200 # --- # Check index isn't updated # # > Run the following code on Cluster 1, # using kibana at localhost:5601 # --- GET follower-idx-cluster3/_search?size=10 # > total.value: 1 # Note: new data from idx-cluster3 aren't fetched # --- # Reestablish connection # --- # From CLI: # - $ docker network connect 10_cross-cluster_cluster03net es01 # - $ docker exec -u elasticsearch -it es01 /bin/bash # - $ curl es03m:9200 # > "cluster_name" : "cluster3" # - $ exit # --- # Check index is automatically updated # # > Run the following code on Cluster 1, # using kibana at localhost:5601 # --- GET follower-idx-cluster3/_search?size=10 # > total.value: 3 # Note: new data fetched from idx-cluster3 # --- # Simulate outage # # Tip: ES automatically handle reconnection to the remote cluster # if the problem is at network level (like before), but suspend the # reconnection if the problem is from different nature # --- # From CLI: # - $ docker exec -u elasticsearch -it es03d /bin/bash # - $ ps -aux # # copy the PID of ES process # - $ kill -s SIGKILL <ES PID> # - $ exit # - wait ~1m # --- # Check "follower index" connection error # # > Run the following code on Cluster 1, # using kibana at localhost:5601 # --- GET follower-idx-cluster3/_ccr/info # > "status" : "active" # Note: the index is active, but let's check the stats GET follower-idx-cluster3/_ccr/stats # > java.lang.IllegalStateException: Unable to open any connections to remote cluster [cluster3] # Note: the follower index cannot connect to the remote index, # because we have shut down the data node on cluster3, # essential for the cluster functioning # --- # Recovery from the outage # --- # From CLI: # - $ docker exec -u elasticsearch -it es03d /bin/bash # - $ bin/elasticsearch & # - $ exit # # --- # Add some data on cluster3 # # > Run the following code on Cluster 3, # using kibana at localhost:5603 # --- GET _cat/nodes?v # > name: es03m | role: m # > name: es03d | role: d # Note: after the restart, wait ~1m if the es03d isn't displayed PUT idx-cluster3/_doc/01 { "msg" : "Hello from `cluster3`! ---updated---" } # 200 # --- # Check if index is automatically recovered (no) # # > Run the following code on Cluster 1, # using kibana at localhost:5601 # --- GET follower-idx-cluster3/_search?size=10 # > total.value: 3 # > "msg" : "Hello from `cluster3`!" # Note: the msg of document 01 isn't updated GET follower-idx-cluster3/_ccr/stats # > java.lang.IllegalStateException: Unable to open any connections to remote cluster [cluster3] # Note: ES hasn't recovered the connection although we had # restarted the service on es03d. # We need to restart the following process POST follower-idx-cluster3/_ccr/pause_follow # > ack: true # Note: we need to both pause & resume the index POST follower-idx-cluster3/_ccr/resume_follow # > ack: true GET follower-idx-cluster3/_search?size=10 # > "msg" : "Hello from `cluster3`! ---updated---" # Note: the index is again up to date with cluster3 data
-
🔹 Define role-based access control (RBAC) using Elasticsearch Security
🔗 Official doc
-
“The Elastic Stack security features add authorization, which is the process of determining whether the user behind an incoming request is allowed to execute the request.” - doc
-
Security is based on two different processes:
-
User authentication:
the process of identifying a specific user (username + password) - doc
- Basic security features (like RBAC and a basic logging system) are included in the ES basic license; for more advanced features buy a license or enable the 30-day trial
- Must be enabled on all nodes, in elasticsearch.yml, using the xpack.security.enabled: true setting (a minimal sketch follows this list)
- For a complete cluster setup see the Minimal security guide
- There are some special built-in users that serve specific purposes and are not intended for general use (e.g. the underlying Kibana connection) - doc
    - The elastic built-in user can be used to set all of the built-in user passwords (superuser)
    - The kibana_system built-in user is used by Kibana to connect and communicate with Elasticsearch.
    - These built-in users are stored in a special .security index that is a full-fledged index: “If your .security index is deleted or restored from a snapshot, however, any changes you have applied are lost” - doc
        - What happens if we lose the admin credentials? How can we continue to use the cluster? A solution could be to recreate a superuser account
    - The provided CLI program /bin/elasticsearch-setup-passwords can be used to set up the built-in passwords - doc
        - Warning: you cannot run the elasticsearch-setup-passwords command a second time.
- “Standard” users (e.g. people that will work with the ES infrastructure) are managed through realms, which handle the login process - doc
    - Realms are basically the who and the how of checking user credentials; some realms are: - doc
        - native: an internal realm where users are stored in a dedicated Elasticsearch index
        - kerberos: authenticates a user using Kerberos authentication
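A minimal sketch of the setting mentioned above; note that with security enabled a multi-node cluster additionally requires transport TLS to be configured (see the Minimal security guide):
# elasticsearch.yml - every node of the cluster
xpack.security.enabled: true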
-
User authorization:
the process of checking whether a user can access a specific resource (e.g. cluster settings) - doc
- We can create users with specific roles that specify the permissions they are allowed to perform
- “assigning privileges to roles and assigning roles to users or groups” - doc
- Glossary
- Secured Resource = what will be protected, could be “indices, aliases, documents, fields, users, and the Elasticsearch cluster itself”
- Privilege = what the user could do with the resource
- Permissions = set of privileges; available privileges list
- Role = permissions + a name to identify the set
- User = authenticated user
- Group = set of users
- The users, the roles and the mapping between the two can be managed either via the security APIs or from the Kibana UI (Management → Security) - see the sketch below and the following code block
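A minimal sketch of the API alternative to the GUI flow used in the code block below; role, user and index names mirror that block, and the password is a placeholder:
# Create a role allowed to write only to the protected index
PUT _security/role/protected-index-writer
{
  "indices": [
    {
      "names": [ "test-index-protected" ],
      "privileges": [ "write" ]
    }
  ]
}

# Create a user and assign the role to it
PUT _security/user/protected-writer
{
  "password": "a-long-placeholder-password",
  "roles": [ "protected-index-writer" ],
  "full_name": "Protected index writer"
}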
-
-
🖱️ Code example
- 🦂 ES passwords can be generated only once all the ES instances are up & running
# ───────────────────────────────────────────── # Configure a RBAC access # ───────────────────────────────────────────── # Cluster to use: 12_basic-security # https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles # --- # Generate the built-in credentials # --- # Open new CLI: # $ docker exec -u elasticsearch -it es01 /bin/bash # $ bin/elasticsearch # Open new CLI: # $ bin/elasticsearch-setup-passwords auto # > store the printed psw # $ exit # --- # Connect Kibana to ES # --- # Open new CLI: # $ docker exec -u kibana -it kibana /bin/bash # $ ./bin/kibana-keystore create # $ ./bin/kibana-keystore add elasticsearch.password # [ Paste the psw of the user: kibana_system # $ curl es01:9200 # > Error: security_exception # $ curl --user kibana_system:<PASSWORD> es01:9200 # > "cluster_name" : "es-docker-cluster" # $ bin/kibana # --- # Connect to Kibana # --- # Open Kibana at http://localhost:5601/ # User Usr and psw of user `elastic` GET .security-7/_count # > 55 GET .security-7/_search { "_source": [ "type", "password" ] } # > _id: reserved-user-kibana_system # Note: psw are hashed # Create indices for future tests PUT test-index-01/_doc/01 { "foo": "bar" } # > 200 PUT test-index-protected # > 200 # --- # Create Users and Roles # --- # Two possible approaches: # - API: https://www.elastic.co/guide/en/elasticsearch/reference/7.13/security-api-put-user.html # - GUI: management -> security -> create user # From GUI, create: # - New User `test-user` with role `editor` # --- # Test the user `test-user` roles # --- # Open a new Incognito page on the browser # Open Kibana at http://localhost:5601/ # User credentials of the just created `test-user` GET _cat/indices # > Error, "type" : "security_exception" PUT test-index-02 # > Error, "type" : "security_exception" GET test-index-01/_search # > 200; "foo" : "bar" # --- # Create new role # --- # From whe 1st Kibana page (user `elasticsearch`) # Create new role from GUI: management -> security -> Roles # New role info: # - Name: `protected-index-writer` # - Indices: add `test-index-protected` with privileges `write` # Under Kibana section (bottom page): # - Add Kibana privilege -> all spaces -> All privileges -> Create # Create new user from GUI: management -> security -> users # New User info: # - Name: `protected-writer` # - Role: `protected-index-writer` # Open a new Incognito page on the browser # Open Kibana at http://localhost:5601/ # User credentials of the just created `protected-writer` PUT test-index-protected/_doc/01 { "foo": "bar" } # > 200 GET test-index-protected/_search # > 403; security_exception
-
👨🏭 How to
Guides to setting up ES and running experiments
Run ES locally: docker setup
🔗 Docker based: official guide
🔗 Docker compose based: official guide
-
ES docker setup
Single instance
# Run es node on `elastic` network docker network create elastic docker pull docker.elastic.co/elasticsearch/elasticsearch:7.13.0 docker run --name es01-test --net elastic -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.13.0 # Run kibana service on `elastic` network docker pull docker.elastic.co/kibana/kibana:7.13.0 docker run --name kib01-test --net elastic -p 5601:5601 -e "ELASTICSEARCH_HOSTS=http://es01-test:9200" docker.elastic.co/kibana/kibana:7.13.0 # Stop everything docker stop es01-test docker stop kib01-test
Multiple instances
- Use docker compose with these configurations:
- https://github.com/pistocop/elastic-certified-engineer
- 🔗 Original documentation - doc
- Use docker compose with those configurations:
-
Docker containers troubleshooting
Use case: you have messed with the elasticsearch.yml file and now the container doesn’t start.
-
Steps:
# Create a new image from the container $ docker commit $CONTAINER_NAME user/test_image # Create a new container from the image $ docker run -ti --entrypoint=bash user/test_image # Explore the image and find the problem. # E.g. an error on the file `/usr/share/elasticsearch/config/elasticsearch.yml` # Copy the file from the container $ docker cp $CONTAINER_NAME:/usr/share/elasticsearch/config/elasticsearch.yml . # Apply the fix changes on the file $ vi ./elasticsearch.yml # Replace the config file of the container $ docker cp ./elasticsearch.yml $CONTAINER_NAME:/usr/share/elasticsearch/config/
-
Test hot-warm-cold architecture
-
Process based on docker containers
-
Start an ES cluster with 3 nodes, each of which with a different role
- Tip: use the hot-warm-cold architecture from elastic-certified-engineer repo
-
Kibana code
- 🦂 The min_age parameter indicates the time to wait before moving to the next phase, but in practice you may wait longer before the shards are actually moved
# Check the cluster status GET _cluster/health GET _cat/nodes?v # > you should have 3 nodes with [mw, hms, cm] roles # Create the policy PUT _ilm/policy/hwc-policy { "policy": { "phases": { "hot": { "actions": { "rollover": { "max_age": "30d", "max_primary_shard_size": "50gb", "max_docs": 5 }, "set_priority": { "priority": 100 }, "readonly": {} }, "min_age": "0ms" }, "warm": { "min_age": "5m", "actions": { "set_priority": { "priority": 50 } } }, "cold": { "min_age": "15m", "actions": { "set_priority": { "priority": 0 } } } } } } GET _ilm/policy # Create the indexes template PUT _template/my-index-template { "index_patterns": [ "my-index-*" ], "settings": { "number_of_shards": 1, "number_of_replicas": 0, "index.lifecycle.name": "hwc-policy", "index.lifecycle.rollover_alias": "my-index" }, "mappings": { "properties": { "foo": { "type": "keyword" } } } } # Create the index PUT my-index-01 { "aliases": { "my-index": { "is_write_index": true } } } # Check alias creation GET _cat/aliases # Check index ILM GET my-index-01/_ilm/explain?human # > "phase": "hot" # > "policy" : "hwc-policy" # Check shard allocation GET _cat/shards/my-index*?v # > index `my-index-01` primary shard on node `es01` (hot node) # Fill the index PUT my-index/_doc/1 { "foo":"bar" } PUT my-index/_doc/2 { "foo":"bar" } PUT my-index/_doc/3 { "foo":"bar" } PUT my-index/_doc/4 { "foo":"bar" } PUT my-index/_doc/5 { "foo":"bar" } PUT my-index/_doc/6 { "foo":"bar" } # Wait 5 minutes... GET _cat/indices/my-*?v # > New index: my-index-000002 PUT my-index-01/_doc/7 { "foo":"bar" } # > Error: policy set "old" indexes to `Read only` PUT my-index/_doc/7 { "foo":"bar" } # > 200: the alias point to new the new index GET _cat/shards/my-index*?v # > `my-index-01` is on `es01` node # Wait 20/30 minutes... # [hot -> warm] GET my-index-01/_ilm/explain # > "phase":"warm" GET _cat/shards/my-index*?v # > index `my-index-01` primary shard on node `es02` (warm node) # Wait 20/30 minutes... # [warm -> cold] GET my-index-01/_ilm/explain # > "phase":"cold" GET _cat/shards/my-index*?v # > index `my-index-01` primary shard on node `es03` (cold node)
-
Configure a multicluster architecture
-
Process based on docker containers
-
Create two clusters and two networks with only one node (c1n1) on both networks, then use the code to connect cluster1 to cluster2 and query one of its indices
- 🔗 Cluster creation & configuration docker files: GitHub
-
🖱️ Code tutorial
- Another cluster could be used: 10_cross-cluster
# ───────────────────────────────────────────── # Connect a remote cluster # # Note: # - To run the experiment cluster architecture: # https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/05_multicluster # - Pay attention to the comments: some kibana # code should be run on a different host # ───────────────────────────────────────────── # --- # Kibana code for `cluster2` # Tip: open `cluster2` kibana at localhost:5602 # and paste the following code # --- GET / # > `cluster2` GET _cat/nodes # > 1 node # Check remote cluster GET /_remote/info # > no results # Create some data PUT c2-index/_doc/01 { "msg": "Hello world form cluster 2!" } GET c2-index/_doc/01 # > 200 # --- # [!] Kibana code for `cluster1` # Tip: open `cluster1` kibana at localhost:5601 # and paste the following code # --- GET / # > `cluster1` GET /_cat/nodes?v # > 2 nodes # Check remote cluster GET /_remote/info # > no results # Connect to `cluster2` PUT _cluster/settings { "persistent": { "cluster": { "remote": { "cluster2": { "mode": "sniff", "seeds": [ "c2n1:9300" ], "transport.ping_schedule": "30s" } } } } } # Check remote cluster GET /_remote/info # > `cluster2` found # Note: "num_nodes_connected" : 1, # if a wrong port is specified on the # seeds list (e.g. 9200) this number is zero GET c2-index/_doc/01 # > Error: index not found GET cluster2:c2-index/_search { "query": { "match_all": {} } } # > "msg" : "Hello world form cluster 2!" GET cluster2:c2-index/_doc/01 # > error # Note: not all the API are allowed to be # done on remote cluster PUT c1-index/_doc/01 { "msg": "Hello world form cluster 1!" } # --- # Kibana code for `cluster2` # Tip: open `cluster2` kibana at localhost:5602 # and paste the following code # --- GET cluster1:c1-index/_search { "query": { "match_all": {} } } # > Error # Note: a connection is not bidirectional, # you should also open a connection # from `cluster2` to `cluster1`
-
🐳 Deepenings
More in-depth topics useful for a more comprehensive learning
Cluster infrastructure
Cluster formation
-
How to set configurations when creating a new cluster
- Some glossary before start:
- bootstrapping = the event of starting a cluster for the very first time
- master = a node with the master role; it takes part in the voting system and is responsible (along with the other master nodes) for managing the cluster - doc
- voting system = cluster-level decisions (like deciding which shards to allocate to which nodes) are taken by master nodes, but because ES is distributed some master nodes could be unreachable (connection error, node fault, etc.). To avoid two sub-groups working independently after a connection error (split brain), there is a voting system with a quorum for taking decisions - doc
- Cluster (in)formation
- At bootstrapping, nodes don’t know how many of them are present in the cluster, nor how many and which the master nodes are; moreover, both of these can change over time as nodes are added or removed.
- ES has a system to “automagically” create the cluster, balance the voting system, and allow resizing of the nodes, but some information must be provided in order to allow those functionalities to work well
- Information to provide
- At bootstrapping, each node with the master role should have the
cluster.initial_master_nodes
parameter set with the list of the master-eligible nodes
- This parameter should be removed after the bootstrapping
- This information, after the first start, will be stored (with other cluster information) inside the data folder of each node
- Nodes without the master role should instead have the
discovery.seed_hosts
parameter set. This parameter contains a list of hosts to contact when the node starts, in order to ask to join the cluster. Those hosts do not necessarily coincide with the master nodes, but it is a good idea if they do, because we should provide resilient and stable nodes.
- 🦂 We say “master nodes” to indicate nodes running an ES instance with the master role, but after the cluster bootstrapping there is only one elected master in a cluster, chosen by the voting system
- 🦂 Note that both the
cluster.initial_master_nodes
and
discovery.seed_hosts
parameters are required for each master-eligible node at bootstrap time. This makes sense because the first parameter is used only once and should be removed after bootstrapping, so the latter is essential for the node to keep working (see the sketch below)
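A minimal elasticsearch.yml sketch of the two parameters, assuming a hypothetical 3-node cluster with master-eligible nodes named es01, es02 and es03:
# elasticsearch.yml of node es01
cluster.name: es-docker-cluster
node.name: es01
# hosts to contact to join the cluster (better if stable, master-eligible nodes)
discovery.seed_hosts:
  - es02
  - es03
# used only at the very first cluster bootstrap, remove it afterwards
cluster.initial_master_nodes:
  - es01
  - es02
  - es03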
Index management
Removal of mapping types
-
ES has decided to remove the concept of *mapping types* from Elasticsearch.
- “In an Elasticsearch index, fields that have the same name in different mapping types are backed by the same Lucene field internally” - link
- Alternatives to types
- Have an index per document type
- Custom type field - link
- implement your own custom type field which will work in a similar way to the old _type
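A minimal sketch of the custom type field approach (index and field names are hypothetical):
PUT catalog
{
  "mappings": {
    "properties": {
      "type": { "type": "keyword" },
      "name": { "type": "text" }
    }
  }
}
PUT catalog/_doc/1
{ "type": "book", "name": "Elasticsearch: The Definitive Guide" }
PUT catalog/_doc/2
{ "type": "author", "name": "Some Author" }
# Filter on the custom "type" field, similarly to the old _type
GET catalog/_search
{
  "query": {
    "bool": {
      "filter": [ { "term": { "type": "book" } } ]
    }
  }
}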
Change Static Index modules (reindex)
-
change a mapping *static* parameter and use *reindex/aliases* to update the indices
- Index modules - doc
- Basically all the information linked to the index (e.g. shards, replicas, analyzers…): some parameters are static and cannot be changed without reindexing the data, others are dynamic and can be changed at runtime (e.g. the replicas through the
_settings
index endpoint, dynamic mapping parameters through the
_mapping
index endpoint)
PUT test02 { "mappings": { "properties": { "text-field":{ "index_options": "docs", "type": "text" } } } } PUT test02/_doc/01 { "text-field": "hello i'm a computer and this is a test" } # --- # We want different index_options: "offset" # this parameter cannot be "hot changed" # --- PUT test02/_mapping { "properties": { "text-field": { "index_options": "offsets", "type": "text" } } } # > 400; Mapper for [text-field] conflicts with existing mapper PUT test03 { "mappings": { "properties": { "text-field": { "index_options": "offsets", "type": "text" } } } } POST _reindex { "source": { "index": "test02" }, "dest": { "index": "test03" } } GET test03/_search GET test03 # Check everything is fine # --- # Two solutions: # 1. Delete test02 and use an alias to redirect index02 to index03 # 2. Delete test02 and reindex test03 to test02 # --- DELETE test02 # > 200 POST _aliases { "actions": [ { "add": { "index": "test03", "alias": "test02" } } ] } # > 200 GET test02/_search { "query": { "match": { "text-field": "hellooooo test" } }, "highlight": { "fields": { "text-field": {} } } } # > "hello i'm a computer and this is a <em>test</em>"
Search
Access the analyzers tokens
-
Define custom analyzers through templates and inspect their tokens
- The following code covers various topics that you should already be familiar with: composable templates, custom analyzers, custom tokenizers, termvectors, and subfields
# ───────────────────────────────────────────── # Intermediate example: # Create a template that defines custom analyzers # and inspect their behaviour # ───────────────────────────────────────────── # --- # Create the template # --- # Tip: test analyzer's behaviour before define it: GET _analyze { "tokenizer": { "type": "char_group", "tokenize_on_chars": [ "," ] }, "text": [ "To be, or not to be, that is the question" ] } # Template components PUT _component_template/whitespace_analyzer_template { "template": { "settings": { "analysis": { "analyzer": { "my_whitespace_analyzer": { "type": "custom", "tokenizer": "whitespace" } } } }, "mappings": { "properties": { "text_whitespace_field": { "type": "text", "analyzer": "my_whitespace_analyzer", "fields": { "length": { "type": "token_count", "analyzer": "my_whitespace_analyzer" } } } } } } } PUT _component_template/ngram_analyzer_template { "template": { "settings": { "analysis": { "tokenizer": { "my_ngram_tokenizer": { "type": "ngram", "min_gram": 3, "max_gram": 3 } }, "analyzer": { "my_ngram_analyzer": { "type":"custom", "tokenizer": "my_ngram_tokenizer" } } } }, "mappings": { "properties": { "text_ngram_field": { "type": "text", "analyzer": "my_ngram_analyzer", "fields": { "length": { "type": "token_count", "analyzer": "my_ngram_analyzer" } } } } } } } PUT _component_template/char_group_analyzer_template { "template": { "settings": { "analysis": { "tokenizer": { "my_chargroup_tokenizer": { "type": "char_group", "tokenize_on_chars": [ "," ] } }, "analyzer": { "my_char_group_analyzer": { "type": "custom", "tokenizer": "my_chargroup_tokenizer" } } } }, "mappings": { "properties": { "text_chargroup_field": { "type": "text", "analyzer": "my_char_group_analyzer", "fields": { "length": { "type": "token_count", "analyzer": "my_char_group_analyzer" } } } } } } } PUT _component_template/pattern_analyzer_template { "template": { "settings": { "analysis": { "tokenizer": { "my_pattern_tokenizer": { "type": "pattern", "pattern": "to be" } }, "analyzer": { "my_pattern_analyzer": { "type": "custom", "tokenizer": "my_pattern_tokenizer", "filter": [ "lowercase" ] } } } }, "mappings": { "properties": { "text_pattern_field": { "type": "text", "analyzer": "my_pattern_analyzer", "fields": { "length": { "type": "token_count", "analyzer": "my_pattern_analyzer" } } } } } } } PUT _component_template/pattern_analyzer_enhanced_template { "template": { "settings": { "analysis": { "tokenizer": { "my_pattern_enhanced_tokenizer": { "type": "pattern", "pattern": "[Tt]o be" } }, "analyzer": { "my_pattern_enhanced_analyzer": { "type": "custom", "tokenizer": "my_pattern_enhanced_tokenizer", "filter": [ "lowercase" ] } } } }, "mappings": { "properties": { "text_pattern_enhanced_field": { "type": "text", "analyzer": "my_pattern_enhanced_analyzer", "fields": { "length": { "type": "token_count", "analyzer": "my_pattern_enhanced_analyzer" } } } } } } } # Create the template POST _index_template/analyzer_family_template { "index_patterns": ["test_*"], "composed_of": [ "whitespace_analyzer_template", "ngram_analyzer_template", "char_group_analyzer_template", "pattern_analyzer_template", "pattern_analyzer_enhanced_template" ] } # --- # Index creation & insertion # --- DELETE test_index PUT test_index { "mappings": { "properties": { "text_standard_field": { "type": "text", "analyzer": "standard", "fields": { "length": { "type": "token_count", "analyzer": "standard" } } }, "text_simple_field": { "type": "text", "analyzer": "simple", "fields": { "length": { "type": "token_count", 
"analyzer": "simple" } } }, "text_stop_field": { "type": "text", "analyzer": "stop", "fields": { "length": { "type": "token_count", "analyzer": "stop", "enable_position_increments": "false" } } }, "text_keyword_field": { "type": "text", "analyzer": "keyword", "fields": { "length": { "type": "token_count", "analyzer": "keyword" } } }, "keyword_field": { "type": "keyword", "fields": { "length": { "type": "token_count", "analyzer": "keyword" } } } } } } # [!] Note the "enable_position_increments": "false", # here why: https://github.com/elastic/elasticsearch/issues/39276#issuecomment-466278696 GET test_index PUT test_index/_doc/1 { "text_standard_field": "To be, or not to be, that is the question", "text_chargroup_field": "To be, or not to be, that is the question", "text_ngram_field": "To be, or not to be, that is the question", "text_whitespace_field": "To be, or not to be, that is the question", "text_pattern_field": "To be, or not to be, that is the question", "text_pattern_enhanced_field": "To be, or not to be, that is the question", "text_simple_field": "To be, or not to be, that is the question", "text_stop_field": "To be, or not to be, that is the question", "text_keyword_field": "To be, or not to be, that is the question", "keyword_field": "To be, or not to be, that is the question" } # --- # Inspect the analyzer's behaviour # --- GET test_index/_search { "_source": [ "" ], "fields": [ "*.length" ], "query": { "term": { "_id": 1 } } } # > stop_field.length = 1 because only "question" isn't a stopword # # > pattern_field.length = 2 because we split on "to be" text. # Tip: note that the first section of the sentence "To be" is not # used for the split but is reported on the results text. # This occour because the tokenizer run before the `lowercase` filter # # > text_pattern_enhanced_field.length = 2 because we split on "[tT]o be" text. # Tip: use the next termvectors API to compare this resault with `pattern_field` # # > text_whitespace_field.length, # text_standard_field.length, # text_simple_field.length # = 10 because the sentence is composed by 10 words # # > text_ngram_field.length = 39 because the string is 41 characters and # we have 39 positions for a sliding window of size 3 # # > text_keyword_field.length, # keyword_field.length # = 1 because a keywork token is created with all field text # # > text_chargroup_field.length = 3 because we will create one token # for each comma, and the sentence contain three commas # Inspect the analyzers tokens GET test_index/_termvectors/1?fields=text_stop_field&field_statistics=false GET test_index/_termvectors/1?fields=text_pattern_field&field_statistics=false GET test_index/_termvectors/1?fields=text_pattern_enhanced_field&field_statistics=false GET test_index/_termvectors/1?fields=text_whitespace_field&field_statistics=false GET test_index/_termvectors/1?fields=text_standard_field&field_statistics=false GET test_index/_termvectors/1?fields=text_simple_field&field_statistics=false GET test_index/_termvectors/1?fields=text_ngram_field&field_statistics=false GET test_index/_termvectors/1?fields=text_keyword_field&field_statistics=false GET test_index/_termvectors/1?fields=keyword_field&field_statistics=false GET test_index/_termvectors/1?fields=text_chargroup_field&field_statistics=false # Tip: test an index analyzer on the fly GET test_index/_analyze { "analyzer": "my_ngram_analyzer", "text": ["Text not indexed"] }
Backup
Backup/restore snapshots
-
The supported way to back up a cluster is by taking a snapshot
🔗 docs
-
For a complete cluster backup you should:
-
Back up the data:
- Based on *snapshot* API, you can backup a cluster including all its data streams and indices
- Elasticsearch takes snapshots incrementally
- Snapshot repository
-
The snapshot could be stored on different repositories, like GCS or S3.
Here the list of available repositories. -
API to create a snapshot repository:
PUT /_snapshot/my_repository { "type": "fs", # Types: [fs, source, url] "settings": { "location": "my_backup_location", # only if "fs" type - folder path e.g. /mnt/my-fs/ "url": "url_root_filesystem", # only if "url" type - URL location of the root of the shared filesystem "compress": true , # metadata (e.g. mappings) compressed "max_number_of_snapshots": 500 # Maximum number of snapshots the repository can contain } }
- For a complete example see the Shared file system repository official guide
- 🦂 Be aware: although the distributed file system where the backup is made must be mounted on each node at the same path, you still need to register the path in the elasticsearch.yml file and perform a rolling restart. - see the official guide
- If the path isn’t registered on elasticsearch.yml file, an error like this will be returned:
"[my_backup] location [/this/path/doesnt/exist] doesn’t match any of the locations specified by path.repo because this setting is empty”
-
- Create a snapshot
-
API to create a snapshot:
# Default: includes all data streams and open indices in the cluster PUT /_snapshot/my_repository/my_snapshot # Query parameters PUT /_snapshot/my_repository/snapshot_2?wait_for_completion=true # request returns a response when the snapshot is complete { "indices": "index_1,index_2", "ignore_unavailable": true, # ignores missing or closed data streams and indices "include_global_state": false, # store global state also (Index templates, ILM, ...) "metadata": { # arbitrary metadata "taken_by": "user123", "taken_because": "backup before upgrading" }, "partial": true, # do not fail if one or more indices included in the snapshot do not have all primary shards available }
-
A snapshot could also be searched
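A minimal sketch of mounting an index directly from a snapshot (searchable snapshot), assuming the my_repository / my_snapshot names used above and a license that includes the feature:
# Mount the index without restoring it as regular shards
POST /_snapshot/my_repository/my_snapshot/_mount?wait_for_completion=true
{
  "index": "index_1",
  "renamed_index": "mounted-index_1"
}
GET mounted-index_1/_search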
-
-
Use the SLM (Snapshot lifecycle management) to automatically take and manage snapshots
-
API to create a SLM
# Create the policy PUT /_slm/policy/nightly-snapshots { "schedule": "0 30 1 * * ?", # cron syntax "name": "<nightly-snap-{now/d}>", "repository": "my_repository", "config": { # same info of snapshot API "indices": ["*"] }, "retention": { "expire_after": "30d", # period after which a snapshot is considered expired "min_count": 5, "max_count": 50 # Maximum number of snapshots to retain - should not exceed 200 } } # Test the policy POST /_slm/policy/nightly-snapshots/_execute # trigger the policy GET /_slm/policy/nightly-snapshots?human # get info about the execution
-
-
Restore the data
🔗 official docs
-
API to restore a snapshot
-
🦂
index_out_of_bounds_exception
error"type" : "index_out_of_bounds_exception", "reason" : "index_out_of_bounds_exception: No group 1"
- If the above error is raised, probably you have messed with the parameters
rename_pattern
andrename_replacement
, try to change those settings (i.e. remove the*
usage)
# [opt] Close local index that exist POST /index_1/_close # Restore a snapshot POST /_snapshot/my_repository/snapshot_2/_restore?wait_for_completion=true # the request returns a response when the restore operation completes { "indices": "index_1,index_2", # which index restore "ignore_unavailable": true, "include_global_state": false, "rename_pattern": "index_(.+)", # index match this pattern... "rename_replacement": "restored_index_$1", # ... will be renamed with this pattern "include_aliases": false # do not restore aliases from snapshot } # [opt] Open local index POST /index_1/_open
-
-
💡 You are not obligated to restore everything from the snapshot:
“You can select specific data streams or indices to restore.”
-
🦂 “Existing indices can only be restored if they are closed and have the same number of shards as the indices in the snapshot."
-
-
🖱️ Code example
# Register snapshot repository PUT /_snapshot/fs_bkp { "type": "fs", "settings": { "location": "/mnt/cluster_fs/es_bkp/" } } POST /_snapshot/fs_bkp/_verify # > Check passed # [opt] Create an index PUT test_index # Create a cluster snapshot PUT /_snapshot/fs_bkp/snapshot_001?wait_for_completion=true { "metadata":{ "taken_by": "My first bkp attempt", "taken_because": "Test es bkp functionality, all snapshot defaults maintained" } } GET /_snapshot/fs_bkp/_current? # > "state" : "IN_PROGRESS" # Waiting... GET /_snapshot/fs_bkp/_current? # > [<empty>] GET _snapshot/fs_bkp/snapshot_001 # > "state": "SUCCESS" PUT test_index/_doc/bkp_test_01 { "foo":"bar" } GET test_index/_doc/bkp_test_01 # > 200 POST _snapshot/fs_bkp/snapshot_001/_restore { "indices": "test_index", "rename_pattern": "test_(.+)", "rename_replacement": "restored_$1" } GET restored_index/_doc/bkp_test_01 # > "found": false GET test_index/_doc/bkp_test_01 # > "found": true # --- # Make a policy for daily snapshots # --- PUT /_slm/policy/daily-snapshots { "schedule": "0 30 22 * * ?", "name": "<daily-snap-{now/d}>", "repository": "fs_bkp", "config": { "ignore_unavailable": false, "include_global_state": true, "metadata":{ "taken_by": "Policy named: `daily-snapshots`" } }, "retention": { "expire_after": "30d", "min_count": 7, "max_count": 60 } }
-
Security
Unsecured node connects to a cluster with minimal security
-
Set up a minimal ES security system and connect an unsecured node
- We will see how a new node on the cluster can get access to “secured” data.
- This example describes why “If your cluster has multiple nodes, you must enable minimal security and then configure Transport Layer Security (TLS) between nodes.” - doc
# ───────────────────────────────────────────── # Setup a minimal ES security system # and connect unsecured node # ───────────────────────────────────────────── # Cluster to use: 11_blank-minicluster # https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/11_blank-minicluster # --- # Configure ES security # --- # Open new shell: # $ docker exec -u elasticsearch -it es01 /bin/bash # $ echo "xpack.security.enabled: true" >> config/elasticsearch.yml # $ bin/elasticsearch # Open new shell: # $ docker exec -u elasticsearch -it es01 /bin/bash # $ ./bin/elasticsearch-setup-passwords auto # > store all the psw (we will use kibana & elastic) # Test the credentials: # $ curl es01:9200 # > missing authentication credentials # $ curl --user elastic:oM2vXErEqaxhznsDilB0 -XGET localhost:9200 # > "cluster_name" : "es-docker-cluster" # --- # Start kibana # --- # Open new shell: # $ docker exec -u kibana -it kibana /bin/bash # $ echo "elasticsearch.username: kibana_system" >> config/kibana.yml # [1] $ echo "elasticsearch.password: YVU119gR44nO0Qh6A0Zt" >> config/kibana.yml # $ bin/kibana # Create new index: # Visit localhost:5601 & use usr:"elastic" psw:"<es psw generated before>" GET _cat/nodes?v # > name: es01 PUT secret_index/_doc/01 { "psw": "secret" } # 200 # Open new shell: # $ docker exec -u elasticsearch -it es01 /bin/bash # $ curl -XGET "http://es01:9200/secret_index/_search" # > error: security_exception # [1] Note: this is an insecure mode to set the password, use # instead keystore: https://www.elastic.co/guide/en/elasticsearch/reference/7.13/security-minimal-setup.html#add-built-in-users # --- # Connect insicure node # --- # Open new shell: # $ docker exec -u elasticsearch -it es02 /bin/bash # $ cat config/elasticsearch.yml # node.name: es02 # cluster.name: es-docker-cluster # network.host: 0.0.0.0 # discovery.seed_hosts: # - es01 # cluster.initial_master_nodes: # - es01 # bootstrap.memory_lock: true # $ bin/elasticsearch # Open new shell: # $ docker exec -u elasticsearch -it es02 /bin/bash # $ curl -XGET "http://es01:9200/secret_index/_search" # > error: security_exception # $ curl -XGET "http://es02:9200/secret_index/_search" # [!] > 200; "psw": "secret" # Note: new node without psw have read index content
💊 Pills
Bullets for a last-minute review
-
Hot topics
-
Use
_index_template
instead of
_template
(the latter is deprecated)
Attach multiple analyzers to a field using
fields: { "raw": { type: ...
, this process is named
multi-fields
-
🦂 Pay attention to / do not rely on Kibana suggestions: they are often misleading and incorrect.
Always open the documentation page. -
Painless functions available for strings object: Painless doc → Painless API Reference (contain all API available) → Ingest API (API available during ingestion pipeline) → String (API list)
-
Under
query -> match
function there are a lot of settings,
e.g.operator=AND
to force the search of all words. Use it for an exact match or match_phrase if the order is relevant# Example GET test01/_search { "query": { "match": { "message_field": { "query": "the old", "operator": "OR" } } } }
-
On search API,
bool
statement usages (see the example below):
- must → the query must be satisfied and it tracks the score
- filter → like must, but without the score
- should → match not required, but if verified the score is increased
- must_not → if it matches, the doc is discarded
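A minimal bool query sketch combining the four clauses (index and field names are hypothetical):
GET test01/_search
{
  "query": {
    "bool": {
      "must":     [ { "match": { "message_field": "old man" } } ],
      "filter":   [ { "range": { "publish_date": { "gte": "2020-01-01" } } } ],
      "should":   [ { "match_phrase": { "message_field": "the old man" } } ],
      "must_not": [ { "term": { "status_field": "draft" } } ]
    }
  }
}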
-
To query a
date
use thequery.range.<field_name>
API field -
wildcard
query could be done on bothkeyword
andtext
fields -
access to object type keys using the
.
, e.g.products.price
-
pipeline/nested
aggs
should be read top-down: the 1st level aggregation/metric is done before the nested one.# E.g. to calculate products bought daily # Before (1_level) aggregate by day number and # **then** (2_level) calculate the value for each bucket POST kibana_sample_data_ecommerce/_search?size=0 { "aggs": { "1_level": { "date_histogram": { "field": "order_date", "calendar_interval": "day" }, "aggs": { "2_level": { "value_count": { "field": "products._id.keyword" } } } } } }
-
Highlight system requires an offset strategy to know “where” the matched sections are,
and
_source.store: enabled
because the source text is used by the highlighter
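A minimal highlight sketch (hypothetical index and field; by default matched terms are wrapped in <em> tags and the highlighted text is taken from the _source):
GET test01/_search
{
  "query": { "match": { "message_field": "old man" } },
  "highlight": {
    "fields": { "message_field": {} }
  }
}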
Mapping a field has a lot of parameters, don’t forget to use them
-
Pagination
- There are two main ways to paginate the documents:
- using from and size fields: recommended if the total hits to paginate are < 10.000
- using search_after field: recommended if the total hits to paginate are > 10.000
- Could be used only if the
sort
order is provided (memo: keyword fields could be ordered in alphabetical order)
- Both pagination systems could use PIT: generate a token that represents the status of the cluster and then pass this token during the pagination (see the sketch below).
Usage- generate from index
POST kibana_sample_data_ecommerce/_pit?keep_alive=60m
- pass the received id to the query
pit.id:...
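A minimal PIT + search_after sketch (assumes the kibana_sample_data_ecommerce sample data; the PIT id and the search_after values are placeholders taken from the previous responses):
# 1. Open a point in time on the index
POST kibana_sample_data_ecommerce/_pit?keep_alive=5m
# 2. First page: note that the request has no index in the path, the PIT already targets it
GET _search
{
  "size": 10,
  "query": { "match_all": {} },
  "pit": { "id": "<PIT_ID>", "keep_alive": "5m" },
  "sort": [ { "order_date": "asc" } ]
}
# 3. Next page: pass the "sort" values of the last hit as "search_after"
GET _search
{
  "size": 10,
  "query": { "match_all": {} },
  "pit": { "id": "<PIT_ID>", "keep_alive": "5m" },
  "sort": [ { "order_date": "asc" } ],
  "search_after": [ "<sort values of the last hit>" ]
}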
-
Aliases can map multiple indices behind a single name and apply a filter to the data
-
Search template: store script with parameters and the query using mustache under
PUT _scripts/<script_name>
and then use it at query time usingGET <index>/_search/template{ "id":<script_name>
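A minimal search template sketch (script id, index and parameter names are hypothetical):
# Store the template (mustache)
PUT _scripts/my-search-template
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": { "match": { "{{my_field}}": "{{my_value}}" } },
      "size": "{{my_size}}"
    }
  }
}
# Use it at query time
GET test01/_search/template
{
  "id": "my-search-template",
  "params": {
    "my_field": "message_field",
    "my_value": "old man",
    "my_size": 5
  }
}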
-
Dynamic mapping
-
Dynamic field mapping = how the index automatically manages new fields that weren’t declared (e.g.
strict
raise an error).-
💡 A subfield could overwrite the
dynamic
parameter, in this way we could for example “restrict” the insertion of only some subfields. Example:
PUT my-index-02 { "mappings": { "dynamic": "strict", "properties": { "user": { "properties": { "name": { "type": "text" }, "social_networks": { "dynamic": true, "properties": {} } } } } } } # > 200 # Note: we have provided the field "dynamic" : "strict", # so no new fields are allowed on this index PUT my-index-02/_doc/1 { "user": { "name": "tyler" } } # > 200 PUT my-index-02/_doc/2 { "user": { "name": "tyler" }, "otherfield": "foo" } # > 400; mapping set to strict PUT my-index-02/_doc/2 { "user": { "name": "tyler", "surname": "foo" } } # > 400; mapping set to strict PUT my-index-02/_doc/2 { "user": { "name": "tyler", "social_networks": { "facebook": { "nick": "foo" } } } } # > 200 # Note: possible because the latter "dynamic": true, # overwrite the general "dynamic": "strict"
-
-
Dynamic template = we declare some matching rules to catch the new fields (e.g.
location-*
) and how to manage them (see the sketch below)
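A minimal dynamic template sketch that catches the new location-* fields (index name and target mapping type are hypothetical):
PUT my-index-03
{
  "mappings": {
    "dynamic_templates": [
      {
        "locations_as_keyword": {
          "match": "location-*",
          "mapping": { "type": "keyword" }
        }
      }
    ]
  }
}
PUT my-index-03/_doc/1
{ "location-home": "Rome", "location-office": "Milan" }
GET my-index-03/_mapping
# > both location-* fields are mapped as keyword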
-
-
Nested arrays of objects
-
If the field will store unknown fields, we can easily store them as an object.
Use nested or flattened only for arrays of objectsDELETE test04 PUT test04 { "mappings": { "properties": { "f-obj":{ "type": "object" }, "f-nested":{ "type": "nested" }, "f-flat":{ "type": "flattened" } } } } PUT test04/_doc/01 { "f-obj": { "field1": "mouse", "field2": "keyboard" }, "f-nested": { "field1": "mouse", "field2": "keyboard" }, "f-flat": { "field1": "mouse", "field2": "keyboard" } } PUT test04/_doc/02 { "f-obj": { "field1": "keyboard", "field2": "mouse" }, "f-nested": { "field1": "keyboard", "field2": "mouse" }, "f-flat": { "field1": "keyboard", "field2": "mouse" } } # --- # Test searches # --- GET test04/_search { "query": { "match": { "f-obj.field1": "mouse" } } } # "_id" : "01", as expected
-
Object vs flattened types: *object* maintains the *keys* information, while *flattened* only stores an array with all the values of the JSON
-
-
A custom analyzer is composed of (and applied in this order):
- character filters - preprocess characters
- 🦂 tokenizer - split in tokens and could do other things (e.g. lowercase, remove punctuation etc.)
- token filter - manage the tokens: remove (stopword), add (synonyms), lowercase
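A minimal custom analyzer sketch with the three stages in order (index/analyzer names and the char_filter mapping are hypothetical):
PUT my-index-04
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": { "type": "mapping", "mappings": [ "& => and" ] }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "my_char_filter" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
GET my-index-04/_analyze
{ "analyzer": "my_analyzer", "text": "The Quick & Brown Fox" }
# > tokens: [the, quick, and, brown, fox]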
-
_update_by_query
take the document in_source
and use it to re-index the data on the index. This process increases the_version
-
In the
_reindex
API we could specify aprocessor
- doc -
Ingest pipeline + script
- Write & Store script under
_script
with parameters - Create pipeline component of type
_script
and set parameters values - Call the pipeline using
PUT <index>... ?pipeline=<pipName>
or set during index mapping usingdefault_pipeline=<pipName>
-
Example for fast look
# --- # Create a dispacher # using pipeline + stored script # --- PUT _scripts/my-script { "script":{ "lang": "painless", "source": """ String checkString = ctx[params['fieldToCheck']]; if (checkString == params['checkValue']){ ctx["_index"] = params['destinationIndex']; } """ } } PUT _ingest/pipeline/my-dispacher-pipeline { "processors": [ { "script": { "id": "my-script", "params": { "fieldToCheck": "dispacher-type", "checkValue": "storic", "destinationIndex": "storic-index" } } } ] } # Note: params setted at pipeline level POST _ingest/pipeline/my-dispacher-pipeline/_simulate { "docs": [ { "_source": { "my-keyword-field": "FOO", "dispacher-type": "storic" } }, { "_source": { "my-keyword-field": "BAR" } } ] } # > "_index" : "storic-index" PUT storic-index DELETE my-index-01 PUT my-index-01 { "settings": { "number_of_shards": 1, "default_pipeline": "my-dispacher-pipeline" } } PUT my-index-01/_doc/01 { "my-keyword-field": "FOO", "dispacher-type": "storic" } PUT my-index-01/_doc/02 { "my-keyword-field": "FOO", "dispacher-type": "non-storic" } GET storic-index/_search # > _id" : "01" GET my-index-01/_search # > "_id" : "02",
- Write & Store script under
-
Snapshots
- 💡 If you register the same snapshot repository with multiple clusters, only one cluster should have write access to the repository (others readonly activated)
- Register where to store snapshots (elasticsearch.yml) on each node and create a repository to use it.
Then you could make snapshots using thePUT /_snapshot
API, moreover, schedule snapshot lifecycle (SLM) usingPUT /_slm/policy/
API.
Note: everything could be done also with Kibana UI. - Restore using
POST /_snapshot/<repoName>/<snapName>/_restore
API, we could restore everything or cherry-pick only some indices
- we need to close an index before restoring it from a snapshot
-
Searchable snapshots
- 💡 Searchable snapshots is the functionality, it could be part of an ILM or we could mount a snapshot
- mount = restore an index stored into a snapshot without creating new shards on the cluster but instead searching directly into the snapshot
- ilm = we could include searchable snapshots (ss) inside the ILM phases (hot or cold, usually in the latter)
- cold phase + ss = when reaching the cold phase, under the hood the ILM:
store the index on a snapshot, delete the original index, mount the index on the snapshot on a new index (restored-<indexName>
), create an alias<index-name> --> restored-<indexName>
- Best practices are to reserve a clone of the snapshot only for the mounting and ss service
- 🦂 Use Kibana Code instead of the GUI. The Searchable snapshot button from GUI is disabled and only shown under the cold section
- Follow the ILM process through
GET test-index-03/_ilm/explain
API
-
Cross-cluster (CC)
- Create monodirectional connections between two ES clusters, for cross-search or data replication
- The node of the cluster that wants to establish the connection must have the
remote_cluster_client
role - The node chosen as seed node on the cluster to reach for the connection must be reached at the transport port (
:9300
) and should be stable (better choose master)
- The node of the cluster that wants to establish the connection must have the
- CC replication
- Copy indices (leader) to remote cluster (replica).
- If we want to create a replica of idx1 on c1 to idxr1 on c2:
- c2 must have
remote_cluster_client
role and connect to c1 - Use the
PUT /idxr1/_ccr/follow?
API on c2
- c2 must have
- The leader indices must have the soft-deletion feature activated - API
- 🦂 In the cluster that will have follower indices all nodes with the master node role must also have the remote_cluster_client role - doc
- If a “not auto recovering” outage appears: pause & resume the follower index (see the sketch below)
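A minimal CCR sketch, assuming the idx1/idxr1 names above and a remote connection named cluster1 already configured on the follower cluster:
# Run on the follower cluster (c2)
PUT /idxr1/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster": "cluster1",
  "leader_index": "idx1"
}
# Check the replication status
GET /idxr1/_ccr/stats
# Pause & resume the follower in case of a "not auto recovering" outage
POST /idxr1/_ccr/pause_follow
POST /idxr1/_ccr/resume_follow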
-
Security
-
Enable security on the cluster
xpack.security.enabled: true xpack.license.self_generated.type: trial # <-- optional but good to have
-
Run ES, generate keys, run Kibana, create keystore, add
elasticsearch.password
to keystore, usekibana_system
as username (elasticsearch.yml → elasticsearch.username: kibana_system: kibana_system
), run kibana, access with elastic generated credentials, create a new role, create new user can use that role
-
-
Data streams
- How to set up a data stream (see the sketch below):
- Create ILM
- Create index template with mandatory:
@timestamp
anddata_stream:{}
- 💡 Note that
data_stream:{}
is an index_template parameter!
- Create data stream using dedicated API:
PUT _data_stream/<dataStreamName>
- Use
<dataStreamName>
like a normal index, under the hood ES automatically rollover the index and apply ILM
- The difference with ILM “standard”:
We can obtain similar functionalities without specificdata_stream
parameter:- Create ILM with rollover
- Create index template with
index.lifecycle.rollover_alias: <aliasName>
parameter - Create an index use the template
- 🦂💡 The index name must be in the form
<index-name>-000001
- 🦂💡 The index name must be in the form
- Create alias that link
<aliasName>
and<index-name>-000001
and"is_write_index": true
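A minimal data stream sketch (policy, template and data stream names are hypothetical):
# 1. ILM policy with rollover
PUT _ilm/policy/my-ds-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "30d", "max_primary_shard_size": "50gb" }
        }
      }
    }
  }
}
# 2. Index template with the data_stream parameter
PUT _index_template/my-ds-template
{
  "index_patterns": [ "my-ds*" ],
  "data_stream": {},
  "template": {
    "settings": { "index.lifecycle.name": "my-ds-policy" },
    "mappings": { "properties": { "@timestamp": { "type": "date" } } }
  }
}
# 3. Create the data stream and use it like a normal index
PUT _data_stream/my-ds
POST my-ds/_doc
{ "@timestamp": "2021-06-01T00:00:00Z", "msg": "hello" }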
-
-
Less relevant topics
-
To use the Data Visualizer to upload a file, at least 1 ingest node must be declared
-
ssh {{username}}@{{remote_host}}
to ssh as specific user -
For a Debian installation, the doc specifies all the relevant paths and endpoints (logs, config…)
-
curl usages
# Tip: write the curl on vim and use 2nd CLI to run the script # Tip: in vim ":set tabstop=2" for a better indentation # Curl with body # Tip: generate using a Kibana UI and adapt curl -XPUT localhost:9200/test-index-01 -H 'Content-Type: application/json' -d' { "mappings": { "properties": { "foo": { "type":"text" } } } }' # Curl with body and security enabled curl --user elastic:LxZ9PHGTh07oOWhnwKjn -XPUT localhost:9200/test-index-01 -H 'Content-Type: application/json' -d' { "mappings": { "properties": { "foo": { "type":"text" } } } }'
-
Remember use
size=0
duringaggs
if no query is provided -
We can connect a remote cluster through the Kibana UI
-
Fast highlight with
fvh
highlighter, but it require field has indexed with"term_vector": "with_positions_offsets"
and this double the size of the field - doc -
If
sort
is specified andmax_score
is required set"track_scores": true
on the query -
There is a page named “Fix common cluster issues” under the “How To” section: useful as a guide on how to resolve some cluster problems
-
the path where snapshot files are stored is defined inside the repository and must be declared in each node’s settings (elasticsearch.yml - see doc)
-
In the hot-warm-cold architecture, the number of replicas for each phase is defined inside the ILM policy
-
🤝 Advices
Some exam advice and tips
-
Use Kibana shortcuts
- Use the Kibana shortcuts, the complete list on Kibana UI “help” window
ctrl + i
→ indent the blockctrl + ↑
andctrl + ↓
→ navigate b etween blocksCtrl + /
→ open API documentation page
- Use the Kibana shortcuts, the complete list on Kibana UI “help” window
-
Search through documentation
- At the exam, the official documentation will be provided
- To better search through the documentation, expand all the sections of the official Guide and use the browser finder (
ctrl + f
- Where to click to expand all sections: image
- We can also use the integrated website search system, but if you are familiar with the documentation the ctrl + f approach is faster
-
Use the API *common options*
- API parameters useful to better work on Kibana, see the examples to understand how to use - doc
- Most useful:
?v
- add the output columns name-
Example
GET _cat/shards # > .kibana_7.13.0_001 0 p STARTED 105 5.1mb 172.20.0.5 es01 GET _cat/shards?v # > index shard prirep state docs store ip node # > .kibana_7.13.0_001 0 p STARTED 105 5.1mb 172.20.0.5 es01
-
-
Make an index backup
-
During the exam you could mess with the index, so making a backup before running index changes could be a good thing - idea from Guido Lena Cota post
# --- # Clone an index # --- PUT kibana_sample_data_ecommerce/_settings { "settings": { "index.blocks.write": true } } POST kibana_sample_data_ecommerce/_clone/bkp_kibana_sample_data_ecommerce PUT kibana_sample_data_ecommerce/_settings { "settings": { "index.blocks.write": false } }
-
-
Useful bash commands
- You will ssh into a VM, so it is better to know some useful commands
# Get VM users $ cat /etc/passwd # Get all processes $ ps aux # Run as <user> su - <user> # e.g. `su - elasticsearch bin/elasticsearch`
📔 Dictionary
Relevant Keywords/Concepts explanations
-
Closed index
- A closed index is blocked for read/write/search operations
- “A closed index is blocked for read/write operations and does not allow all operations that opened indices allow. […] resulting in a smaller overhead on the cluster.” - doc
- Usually you close an index before some maintenance operations
-
Data tier
“A data tier is a collection of nodes with the same data role that typically share the same hardware profile” - doc -
(ECS) Elastic Common Schema
- “common set of fields to be used when storing event data in Elasticsearch, such as logs and metrics.” - doc
- With ECS we can use a standardised form of value mapping; this lets us achieve better data analytics, charts, and other common goals (ECS fields integrate with several Elastic Stack features by default)
-
Heap size
- The JVM heap: area of memory used to store objects instantiated by applications running on the JVM. Objects in the heap can be shared between threads. - 📎 azul
- Elasticsearch automatically sets the heap size based on the node’s role - doc
- You can always override the heap size using the parameter ES_JAVA_OPTS - doc:
- E.g. on Docker it is useful to limit the heap memory using the env variable:
"ES_JAVA_OPTS=-Xms512m -Xmx512m"
-
History retention
- “Elasticsearch keeps track of the operations it expects to need to replay in future using a mechanism called shard history retention leases.” - doc
- 💡 ES stores the operations (insertions and deletions) done on an index, so they can be replayed.
In CCR the leader index sends only the last operations done and the follower index will replay them
-
⭐ Mapping *Fields* term
🔗 official doc
- “index the same field in different ways for different purposes” - doc
- Use the term fields inside the document field:
-
🖱️ Code example
PUT my-index-000001 { "mappings": { "properties": { "city": { "type": "text", "fields": { "raw": { # <--- we will refer as city.raw "type": "keyword" } } } } } }
-
-
Memory locking requested
-
When you run an ES instance, for example on a new container instance, you could receive an error like this on the command line logs:
bootstrap check failure [1] of [1]: memory locking requested for elasticsearch process but memory is not locked
-
To resolve this issue, lock the memory on the machine, e.g. for docker compose use the parameter
ulimits: memlock: soft: -1 hard: -1
-
-
References:
ulimit
set to-1
- so
-
-
⭐ Node roles
- Each running instance of Elasticsearch is a node - docs
- Usually, you start multiple nodes on different VMs with different hardware (e.g. for the Hot-Warm-Cold Architecture, or for ML purposes)
- 🦂 If you set custom node.roles, ensure you specify every node role your cluster needs - docs
- e.g. If you don’t use the role data, be sure to have defined both data_content and data_hot
- List of available nodes roles
- 💡 data_content is the node preferred to put data that doesn’t fit a time series - docs
-
Realm
“The authentication process is handled by one or more authentication services called realms” - doc
- A realm is used to resolve and authenticate users based on authentication tokens. - doc
- The system that stores and checks the user credentials. There are internal and external realms (external realms like Kerberos require interaction between ES and 3rd parties)
-
Remote recovery process
- In CCR (Cross Cluster Replication), it is the process of copying data from the leader index to the follower
- information about an in-progress remote recovery: - cat-recovery API
-
Runtime Fields
- A runtime field is a field that is evaluated at query time - docs
- 💡 Useful for adding fields to existing documents without reindexing your data
- Defined at mapping time or query time
-
Soft deletes
- See the history retention dictionary entry, they are the same thing
- Soft deletes is a feature ES provides that is activated when history retention is activated
-
Seed node | Seed hosts
Official doc
- Inside each node configuration (elasticsearch.yml) there is a field named discovery.seed_hosts that takes a list of host names.
This host list will be used to join the cluster and for the cluster formation - “In short discovery.seed_hosts is the list of master nodes” - so
-
Segments
Info from medium article
- “The Lucene index is divided into smaller files called segments.
A segment is a small Lucene index. Lucene searches in all segments sequentially.” - Segments are immutable
- More segments: slower searches (because ES searches them sequentially)
- So you can merge: “During a merge, Lucene takes 2 segments, and moves the content into a third, new one”
- This allows us to not copy “deleted” documents into the new segment
-
Shard Doc (search field)
🔗 Official doc
- “The
_shard_doc
value is the combination of the shard index within the PIT and the Lucene’s internal doc ID, it is unique per document and constant within a PIT” - Used inside the _search API to paginate using the search after functionality
- “The
-
Soft delete
The underlying mechanism used to retain these operations (history of individual write operations) is soft deletes. - doc
- ES maintains a history file with individual write operations, useful for example during the update of the follower index in a Cross Cluster Replication architecture.
The retaining mechanism of this file is called Soft Delete - doc
-
X-pack
- X-pack is an ES Stack extension
- Provides security, alerting, monitoring, reporting, machine learning, and many other capabilities.
- X-Pack is open, but not everything is free
- “Many features in X-Pack are free […], some features in X-Pack are paid” - link
- You always have a 30-day trial
- Here the list of what is free and what is paid
🙏 Resources
-
Useful online resources
- Preparing for the Elastic Certified Engineer Exam - Get Elasticsearch Certified - youtube
- ⭐ Elastic Certified Engineer Exam - My Experience and How I Prepped - linkedin
- ⭐ Guido Lena Cota medium posts (2019)
- Elastic Certified Engineer Exam — what to expect and how to rock it - medium
- Exercises for the Elastic Certified Engineer Exam: Deploy and Operate a Cluster - medium
- Exercises for the Elastic Certified Engineer Exam: Store Data into Elasticsearch - medium
- Exercises for the Elastic Certified Engineer Exam: Model Data into Elasticsearch - medium
- Exercises for the Elastic Certified Engineer Exam: Search and Aggregations - medium
- Querying and aggregating time series data in Elasticsearch - ES blog
- Designing the Perfect Elasticsearch Cluster: the (almost) Definitive Guide - medium
- Official Elasticsearch examples - GitHub
- Searchable Snapshots - Daily Elastic Byte S01E14 - yt
- Troubleshooting Elasticsearch ILM: Common issues and fixes - blog
- 📚 Books
- Old but gold: Elasticsearch: The Definitive Guide - physical book, online version
- 🦂 A lot of the new features were created after the book (first edition: 2015) and some APIs used in the book are now deprecated.
- 💡 Anyway, the book is really well written and has some meaningful insights and descriptions about the internal workings of Elasticsearch that aren’t version-specific and aren’t easily deducible from the official documentation
- A book about running Elasticsearch: running-elasticsearch-fun-profit - web, github
- Old but gold: Elasticsearch: The Definitive Guide - physical book, online version