Wednesday, October 3, 2018

Elasticsearch monitoring

What is Elasticsearch?
  • Elasticsearch is an open source distributed document store and search engine that stores and retrieves data structures in near real-time.
  • Elasticsearch represents data in the form of structured JSON documents, and makes full-text search accessible via RESTful API and web clients for languages like PHP, Python, and Ruby.
Key areas to monitor for Elasticsearch in Datadog:
  • Search and indexing performance
  • Memory and garbage collection
  • Host-level system and network metrics
  • Cluster health and node availability
  • Resource saturation and errors

Search and indexing performance:
Search Performance Metrics:
  1. Query load: Monitoring the number of queries currently in progress gives you a rough idea of how many requests your cluster is handling at any given moment.
  2. Query latency: Elasticsearch does not expose this metric directly, but monitoring tools can calculate the average query latency by sampling the total number of queries and the total elapsed query time at regular intervals.
  3. Fetch latency: The fetch phase, the second part of the search process, should typically take much less time than the query phase. If this metric increases consistently, it could indicate slow disks, enriching of documents (highlighting relevant text in search results, etc.), or requests for too many results.
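The query latency calculation described above can be sketched roughly as follows. This is a minimal illustration, assuming each sample is the `indices.search` section of one node's stats (with the `query_total` and `query_time_in_millis` counters); the function name and the sample values are made up for the example.

```python
def average_query_latency_ms(earlier, later):
    """Average latency (ms) of the queries completed between two samples.

    Each sample is assumed to be the `indices.search` section of a node's stats.
    """
    queries = later["query_total"] - earlier["query_total"]
    elapsed_ms = later["query_time_in_millis"] - earlier["query_time_in_millis"]
    if queries <= 0:
        # No queries finished in the interval, or the counters were reset.
        return None
    return elapsed_ms / queries


# Hypothetical samples taken one polling interval apart (values are made up).
sample_t0 = {"query_total": 1200, "query_time_in_millis": 36000}
sample_t1 = {"query_total": 1500, "query_time_in_millis": 48000}

print(average_query_latency_ms(sample_t0, sample_t1))  # 40.0
```

The same approach should work for fetch latency by swapping in the fetch counters from the same stats section.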


Indexing Performance Metrics:
  1. Indexing latency:
If you notice the latency increasing, you may be trying to index too many documents at one time (Elasticsearch’s documentation recommends starting with a bulk indexing size of 5 to 15 megabytes and increasing slowly from there).
If you are planning to index a lot of documents and you don’t need the new information to be immediately available for search, you can optimize for indexing performance over search performance by decreasing the refresh frequency until you are done indexing.
  2. Flush latency:
If you see this metric increasing steadily, it could indicate a problem with slow disks; this problem may escalate and eventually prevent you from being able to add new information to your index.
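As a rough sketch of the refresh tuning mentioned under indexing latency: the refresh frequency is controlled by the `index.refresh_interval` index setting, which can be changed through the index settings endpoint (`PUT /index_name/_settings`). The helper below only builds the request bodies; the function name and the `30s` value are assumptions for illustration.

```python
import json


def refresh_settings(interval):
    """Request body that sets index.refresh_interval on an index."""
    return {"index": {"refresh_interval": interval}}


# Assumed values: refresh less often during the bulk load, then restore
# the Elasticsearch default of one second afterwards.
during_bulk_load = refresh_settings("30s")
after_bulk_load = refresh_settings("1s")

print(json.dumps(during_bulk_load))
print(json.dumps(after_bulk_load))
```

Each body would be sent with a PUT request to the index's `_settings` endpoint, once before and once after the indexing job.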

Memory usage and garbage collection:
  • JVM heap in use:
If any node is consistently using over 85 percent of its heap memory, the rate of garbage collection isn’t keeping up with the rate of garbage creation. To address this problem, you can either increase your heap size (as long as it stays within Elasticsearch’s recommended heap sizing guidelines) or scale out the cluster by adding more nodes.

  • JVM heap used vs. JVM heap committed:
The amount of heap memory in use will typically take on a sawtooth pattern that rises when garbage accumulates and dips when garbage is collected. If the pattern starts to skew upward over time, this means that the rate of garbage collection is not keeping up with the rate of object creation, which could lead to slow garbage collection times and, eventually, OutOfMemoryErrors.
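A minimal sketch of the 85 percent heap check, assuming the input has the shape of a Node Stats API response (`nodes.<id>.jvm.mem.heap_used_percent`). The function name, node IDs, and values are hypothetical.

```python
HEAP_ALERT_PERCENT = 85  # threshold taken from the guideline above


def nodes_over_heap_threshold(node_stats, threshold=HEAP_ALERT_PERCENT):
    """Return (node name, heap percent) pairs above the threshold, highest first."""
    hot = []
    for node_id, node in node_stats["nodes"].items():
        used = node["jvm"]["mem"]["heap_used_percent"]
        if used > threshold:
            hot.append((node.get("name", node_id), used))
    return sorted(hot, key=lambda item: item[1], reverse=True)


# Made-up response fragment for two nodes.
sample_stats = {
    "nodes": {
        "a1": {"name": "datanode1", "jvm": {"mem": {"heap_used_percent": 91}}},
        "b2": {"name": "datanode2", "jvm": {"mem": {"heap_used_percent": 62}}},
    }
}

print(nodes_over_heap_threshold(sample_stats))  # [('datanode1', 91)]
```

A single sample only shows a momentary peak of the sawtooth, so in practice you would want several consecutive samples above the threshold before treating it as a problem.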

Host-level network and system metrics
Host metrics to monitor:
  • Disk space
  • I/O utilization
  • Open file descriptors
Resource saturation and errors
Thread pool queues and rejections:
In general, the most important thread pools to monitor are search, index, merge, and bulk, which correspond to the request types (search, index, merge, and bulk operations). The size of each thread pool’s queue represents how many requests are waiting to be served while the node is at capacity. Once a queue is full, further requests are rejected.
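Because the rejection counts in node stats are cumulative, it is usually the change between two samples that matters. Here is a small sketch, assuming each sample is the `thread_pool` section of one node's stats with `queue` and `rejected` fields per pool. Pool names vary between Elasticsearch versions, so pools missing from a sample are simply skipped; the function name and values are made up.

```python
WATCHED_POOLS = ("search", "index", "merge", "bulk")


def rejection_deltas(earlier, later, pools=WATCHED_POOLS):
    """New rejections per thread pool between two samples of `thread_pool` stats."""
    deltas = {}
    for pool in pools:
        # Skip pools that this node or version does not report.
        if pool in earlier and pool in later:
            deltas[pool] = later[pool]["rejected"] - earlier[pool]["rejected"]
    return deltas


# Hypothetical samples taken one polling interval apart.
pools_t0 = {"search": {"queue": 0, "rejected": 3}, "bulk": {"queue": 12, "rejected": 40}}
pools_t1 = {"search": {"queue": 2, "rejected": 3}, "bulk": {"queue": 50, "rejected": 65}}

print(rejection_deltas(pools_t0, pools_t1))  # {'search': 0, 'bulk': 25}
```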

Some useful commands to monitor Elasticsearch cluster and node status
Node Stats API:
curl localhost:9200/_nodes/stats
curl localhost:9200/_nodes/datanode1/stats/jvm,http
curl localhost:9200/_nodes/node1,node2/stats
Cluster Stats API:
curl localhost:9200/_cluster/stats
Index Stats API:
curl 'localhost:9200/index_name/_stats?pretty=true'
Cluster Health HTTP API:
curl 'localhost:9200/_cluster/health?pretty=true'
Pending Tasks API:
curl localhost:9200/_cluster/pending_tasks

Monitoring Elasticsearch performance
Problem #1:
My cluster status is red or yellow. What should I do?
Cluster status is reported as red if one or more primary shards (and their replicas) are missing, and yellow if one or more replica shards are missing. Normally, this happens when a node drops off the cluster for whatever reason (hardware failure, long garbage collection time, etc.).
Once the node recovers, its shards will remain in an initializing state before they transition back to active status.
If you notice that your cluster status is lingering in a red or yellow state for an extended period of time, verify that the cluster recognizes the correct number of Elasticsearch nodes, either by consulting Datadog’s dashboard or by querying the Cluster Health API:
curl 'localhost:9200/_cluster/health?pretty=true'
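To make that check concrete, here is a rough sketch that reads a Cluster Health API response (its `status`, `number_of_nodes`, and `unassigned_shards` fields) and lists what looks wrong. The function name, the expected node count, and the sample response are assumptions for illustration.

```python
def diagnose_health(health, expected_nodes):
    """List the problems visible in a Cluster Health API response."""
    findings = []
    if health["status"] != "green":
        findings.append(f"status is {health['status']}")
    if health["number_of_nodes"] < expected_nodes:
        missing = expected_nodes - health["number_of_nodes"]
        findings.append(f"{missing} node(s) missing from the cluster")
    if health.get("unassigned_shards", 0) > 0:
        findings.append(f"{health['unassigned_shards']} unassigned shard(s)")
    return findings


# Made-up response for a three-node cluster that has lost one node.
sample_health = {"status": "yellow", "number_of_nodes": 2, "unassigned_shards": 5}

for finding in diagnose_health(sample_health, expected_nodes=3):
    print(finding)
```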
Problem #2:
If the number of active nodes is lower than expected, at least one of your nodes lost its connection and hasn’t been able to rejoin the cluster. To find out which node(s) left the cluster, check the Elasticsearch logs.
- If it is a temporary failure, you can try to get the disconnected node(s) to recover and rejoin the cluster.
- If it is a permanent failure and you are not able to recover the node, you can add new nodes and let Elasticsearch take care of recovering from any available replica shards; replica shards can be promoted to primary shards and redistributed across the new nodes you just added.
Problem #3:
My data nodes are running out of disk space. What should I do?
- If all of your data nodes are running low on disk space, you will need to add more data nodes to your cluster.
- If only certain nodes are running out of disk space, this is usually a sign that you initialized an index with too few shards; a few very large shards are hard for Elasticsearch to distribute evenly across nodes.
Two remedies for low disk space:
1) Remove outdated data and store it off the cluster.
2) If you need to continue storing all of your data on the cluster, scale vertically or horizontally. Scaling vertically means upgrading your hardware. Scaling horizontally means rolling over the index by creating a new index and using an alias to join the two indices together under one namespace.
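The alias step of the rollover approach might look something like this. The sketch builds the request body for the index aliases endpoint (`POST /_aliases`), adding the same alias to both the old and the new index; the alias name, index names, and function name are hypothetical.

```python
import json


def alias_actions(alias, indices):
    """Request body that points one alias at several indices."""
    return {"actions": [{"add": {"index": index, "alias": alias}} for index in indices]}


# Hypothetical names: the original index plus the newly created one.
body = alias_actions("logs", ["logs-000001", "logs-000002"])

print(json.dumps(body, indent=2))
```

Searches sent to the alias then cover both indices, so clients do not need to know that the data was split.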
Problem #4:
What should I do about all these bulk thread pool rejections?
Thread pool rejections are typically a sign that you are sending too many requests to your nodes, too quickly. If this is a temporary situation (for instance, you have to index an unusually large amount of data this week, and you anticipate that it will return to normal soon), you can try to slow down the rate of your requests. However, if you want your cluster to be able to sustain the current rate of requests, you will probably need to scale out your cluster by adding more data nodes. In order to utilize the processing power of the increased number of nodes, you should also make sure that your indices contain enough shards to be able to spread the load evenly across all of your nodes.
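One common way to slow down the rate of requests on the client side is to wait longer after each rejected bulk request before retrying. The sketch below only computes such a wait schedule (exponential backoff with a cap); this is a general technique, not something Elasticsearch prescribes, and the function name and default values are assumptions.

```python
def backoff_delays(retries, base_seconds=1.0, factor=2.0, cap_seconds=30.0):
    """Seconds to wait before each retry: grows by `factor`, never above the cap."""
    return [min(cap_seconds, base_seconds * factor ** attempt) for attempt in range(retries)]


print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

A bulk indexing client would sleep for the next delay in the list each time a request is rejected, and give up or alert once the list is exhausted.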

