
Elastic Stack Sizing Considerations and Architecture

Written by Ajit Gadge | Feb 13, 2019 | 7 min read

We have recently completed a few challenging but successful implementations using the Elastic Stack. One of the biggest challenges we came across is: ‘How does one size an Elastic cluster?’ And my best response to this is, ‘It depends!’ There is no magic wand; there are many parameters to be considered.
In this article I am sharing some of these considerations, based on my experience of recent Elastic implementations.
There are many factors you need to consider while sizing and architecting an Elastic cluster. I usually start by asking all my customers the questions below before architecting and deploying an Elastic cluster:

  • What is the use case for the Elastic cluster? Centralized logging OR infrastructure monitoring OR business analytics OR site or enterprise search OR APM, etc.
  • What is the source of the data? Metric data through Beats OR log file data through Filebeat OR RDBMS data OR application logs OR data from the web, etc.
  • At what frequency is data ingested into the cluster? Continuously, from multiple devices, files, etc.?
  • Do you know the high, low, and average volumes of data ingestion?
  • What output are you expecting from the Elastic cluster? Only search OR dashboards OR alerts OR machine learning OR analytics, etc.
  • What performance is acceptable for that output? Real-time in sub-seconds OR less than a minute OR minutes?
  • How many users are going to access the Elastic output?
  • What timeframe of data archiving are you looking at? 7 days OR 1 month OR 6 months?
  • Do you know the size of each document/event/record that is getting ingested into Elastic?

We have experienced that, many times, the information for the above questions is not available when the project starts. And if you begin the implementation without this data, you will face bottlenecks once data starts flowing into the Elastic cluster.
While Elastic is a very versatile, fast and flexible technology, and one can derive many use cases from the same Elastic cluster, users/organizations need to have answers to the above questions before going into production.
In this article, we will see how answering these questions helps identify the possible options and decide on the right architecture for deploying an Elastic cluster.
As you are aware, Elasticsearch is a very powerful search engine, which is the primary reason users opt for it. Companies like Facebook, Dell, eBay, Uber, Netflix, and many more use Elasticsearch as a search engine. If you know your data source very well, along with its growth pattern, then it is very easy to factor the sizing of an Elasticsearch cluster.
E.g., if your data is coming from an RDBMS as a source and getting ingested into Elastic, or your data is coming from data files like .csv or .txt, and you are aware of its daily growth pattern, then it is very easy to size Elasticsearch.
But if your use case is infrastructure monitoring or centralized logging, then it is difficult to size your Elastic cluster.
In such use cases, you typically run different types of Beats which pump in data very fast, and you have many different data sources which send data through filters like Logstash, or directly to Elasticsearch. Sizing here becomes a challenge, and knowing the growth of these sources also becomes difficult. So, we need to be clear about which use case we need from our Elastic cluster in order to address the sizing question.
The table below shows a few sample use cases, their typical data sources, and customers who are using these use cases.

| Use Case | Data Source Example | Customers |
| --- | --- | --- |
| Application Search | R/DBMS, application logs, application data, external files, etc. | Grab, Argos, Dell, Facebook, eBay, etc. |
| Site Search | Web crawling data | SurveyMonkey, Shopify, TechCrunch, etc. |
| Business Analytics | R/DBMS, events, app logs, app data, external files, etc. | Sun Hotels, Dell, Sprint, etc. |
| Enterprise Search | R/DBMS, application logs, application data, external files, or any searchable data | Goldman Sachs, Orange, Airbus, etc. |
| Metric Analytics | Event and metric data, infra log data, etc. | Microsoft, Nvidia, Walmart, etc. |
| Operational Log Analytics | Metric data, infra log data, network data, etc. | Grab, TV2, GoDaddy, etc. |
| Security Analytics | Metric data, event logs, info logs, network data, etc. | Symantec, NetApp, Slack, etc. |

Have you chosen your data wisely?

As I mentioned above, different use cases require different types of source data. So, you need to decide carefully which data is useful for achieving your desired output. E.g., ‘Do you really require Metricbeat as well as Winlogbeat on the same Windows machine?’ OR ‘Do you really need to capture TCP packet data through Packetbeat all the time?’ Also, decide which fields are absolutely required to get the desired output. You can filter out the fields you are never going to use, as in the sketch below.
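To make this concrete, here is a minimal sketch of dropping unused fields on the client side before indexing, using the official Python Elasticsearch client. The index name, field whitelist, and sample event are all hypothetical; adapt them to your own data.

```python
# Minimal sketch: strip fields you never query before ingesting.
# Index name, whitelist and sample event below are hypothetical.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

# Keep only the fields you actually search, visualize, or alert on.
KEEP_FIELDS = {"@timestamp", "host", "level", "message"}

def slim(event: dict) -> dict:
    """Drop every field that is not on the whitelist."""
    return {k: v for k, v in event.items() if k in KEEP_FIELDS}

raw_events = [
    {"@timestamp": "2019-02-13T10:00:00Z", "host": "web-01", "level": "ERROR",
     "message": "timeout", "agent_version": "6.6.0", "os_build": "17134"},
]

# Index the slimmed documents in one bulk request ("_doc" type, 6.x style).
actions = ({"_index": "app-logs", "_type": "_doc", "_source": slim(e)}
           for e in raw_events)
bulk(es, actions)
```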
Elastic is flexible and smart enough to automatically understand your data pattern, based on which it creates its own dynamic mapping, but this may cause issues on your production setup and may ingest unnecessary fields and events. So, I would recommend creating your own fixed index template with only the required fields. You can always keep some buffer to add fields at a later stage if needed.
Also, while ingesting string/text data, Elasticsearch by default creates both ‘text’ and ‘keyword’ types on the same field, but you may not want this. E.g., if you are ingesting content from your own documents, you might need only the ‘text’ and not the ‘keyword’, while you may want a Category field as the ‘keyword’ type and not ‘text’. All these fields use resources from your cluster, so you need to be very careful. The sketch below illustrates both ideas.
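Here is a hedged example of a fixed index template in the Elasticsearch 6.x format, mapping only the required fields and avoiding the default text-plus-keyword multi-field where it is not needed. All names here (index pattern, fields) are assumptions for illustration.

```python
# Sketch of a fixed index template (Elasticsearch 6.x style).
# The index pattern and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

template = {
    "index_patterns": ["app-logs-*"],
    "settings": {"number_of_shards": 1, "number_of_replicas": 1},
    "mappings": {
        "_doc": {
            # "strict" rejects fields you did not plan for, instead of
            # letting dynamic mapping grow the index silently.
            "dynamic": "strict",
            "properties": {
                "@timestamp": {"type": "date"},
                "message":    {"type": "text"},     # full-text only, no keyword twin
                "category":   {"type": "keyword"},  # exact filters/aggregations only
                "host":       {"type": "keyword"},
            },
        }
    },
}

es.indices.put_template(name="app-logs", body=template)
```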

Frequency of data:

While we are trying to ingest our own data into Elasticsearch in the form of metrics, various logs, structured data, etc. for multiple use cases, it is important to know at what frequency we are going to ingest this data. Apart from this, we should also consider the size per document/event. The data might not flow consistently, so we should also understand the peak rate at which documents/events get ingested. These are again very important considerations while you are designing an Elastic cluster; the sketch below shows the basic arithmetic.
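As a back-of-envelope sizing sketch, the numbers below are pure assumptions that you should replace with measurements from your own sources; the point is the arithmetic, not the figures.

```python
# Back-of-envelope storage estimate. Every number here is an assumption;
# measure your own peak rate and average document size.
events_per_sec_peak = 5_000      # peak ingest rate across all beats/sources
avg_doc_bytes       = 800        # average JSON document size
replicas            = 1          # one replica doubles raw storage
retention_days      = 30
indexing_overhead   = 1.1        # rough allowance for index structures

daily_gb = events_per_sec_peak * avg_doc_bytes * 86_400 / 1024**3
total_gb = daily_gb * retention_days * (1 + replicas) * indexing_overhead
print(f"~{daily_gb:.1f} GB/day, ~{total_gb:.0f} GB for {retention_days} days")
```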
In typical infra monitoring or log analysis cases, you have many sources of data, in various shapes and sizes, so the frequency of ingesting this data is very high. Also, for search-type use cases, we need to find out how many users are sending search requests and at what frequency. Elasticsearch maintains a fixed thread pool for each of these operations: in the 6.x releases, the ‘write’ thread pool is sized around the number of available processors (with a maximum of 1 + the number of available processors) and has a queue of 200 requests, while the ‘get’ thread pool is sized at the number of available processors with a queue of 1000 requests. You can change these settings if you really want to, but that might hurt other operations like search and analysis, so it is not recommended. The Elasticsearch documentation has the complete list of thread pool sizes. This will help you size the number of Elasticsearch data and ingest nodes, along with their number of cores. You can inspect the live values as shown below.
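Rather than relying on defaults from memory, you can ask a running cluster directly via the `_cat/thread_pool` API; the localhost URL below is an assumption.

```python
# Inspect live thread pool sizes and queue limits via the _cat API.
# The cluster URL is an assumption; adjust for your deployment.
import requests

resp = requests.get(
    "http://localhost:9200/_cat/thread_pool/write,get,search",
    params={"v": "true", "h": "node_name,name,type,size,queue_size,rejected"},
)
print(resp.text)
```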
Another possible angle to consider here is: do we need a message queue like Kafka or JMS in such a project? If yes, then when? These things really matter and are often not considered at the beginning of the project.
It is also better to understand the size of each document/event. If we have multiple types of documents/events, check the size of each type. If possible, also check the size in JSON format by ingesting a sample, as in the snippet below. I will talk about how to ingest sample data and benchmark sizing separately in another blog post.
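In the meantime, a quick local estimate of the serialized size of a sample event is easy to get; the sample document below is hypothetical.

```python
# Estimate per-document size from a sample event before ingesting it.
# The sample document is hypothetical.
import json

sample = {
    "@timestamp": "2019-02-13T10:00:00Z",
    "host": "web-01",
    "level": "ERROR",
    "message": "upstream timed out while reading response header",
}

raw_bytes = len(json.dumps(sample).encode("utf-8"))
print(f"~{raw_bytes} bytes per event as JSON (before index overhead)")
```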

What do you expect from the Elastic Stack?

I have often seen many organizations and individuals start exploring Elastic technology without being clear on the outcome they’d like to build. Though they know there are a few use cases, they are not sure about the path forward. As I mentioned above, using the Elastic Stack one can derive many use cases for a business, but how exactly you need the output is rarely certain at the beginning of the project. E.g., if you are building log analytics using Elastic, the dashboards and visualizations you would want to see may not be known at the beginning of the project.
Also, once data starts onboarding into an Elastic cluster, you may like to build multiple use cases that you had not thought of at the beginning of the project. Using Elastic X-Pack, one can configure alerts on any searchable data; one can build a machine learning job, such as pattern matching for anomaly detection; one can build a recommendation engine; etc. So, using the same data, one can build different use cases and outputs. But if we do this at a later stage, we face performance and other challenges. Hence, if we know what output we are expecting while designing the project itself, it helps in sizing the Elastic deployment. As an illustration, a simple alert might look like the sketch below.
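Here is a hedged sketch of an X-Pack Watcher alert, using the Elasticsearch 6.x endpoint; the index pattern, threshold, schedule, and credentials are all assumptions for illustration, not a recommended production watch.

```python
# Sketch of an X-Pack Watcher alert (6.x endpoint). Index pattern,
# threshold, schedule and credentials are illustrative assumptions.
import requests

watch = {
    "trigger": {"schedule": {"interval": "10m"}},
    "input": {
        "search": {
            "request": {
                "indices": ["app-logs-*"],
                "body": {"query": {"match": {"level": "ERROR"}}},
            }
        }
    },
    # Fire only when the search returned more than 100 error events.
    "condition": {"compare": {"ctx.payload.hits.total": {"gt": 100}}},
    "actions": {
        "log_error_spike": {
            "logging": {"text": "More than 100 errors in the last window"}
        }
    },
}

requests.put(
    "http://localhost:9200/_xpack/watcher/watch/error_spike",
    json=watch,
    auth=("elastic", "changeme"),  # X-Pack security credentials, if enabled
)
```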
We have done a few deployments of Elasticsearch where customers were not sure, at the beginning of their projects, what exactly they wanted to visualize from the data, though they knew they wanted to go for log analytics and infra monitoring. After setting up the Elastic platform, once the data started onboarding, there were plenty of dashboards, visualizations, alerts, etc., and that is when the customers started building expectations. All of this impacts your initial cluster sizing considerations, your deployment architecture, etc.

Have you decided on your archiving plan?

Elasticsearch is known for real-time data search and analytics, and people often expect results in sub-seconds/seconds. But to achieve that goal, you need to plan properly. You might expect real-time results on your freshly ingested data, such as data from the last hour or day, and you may like to archive older data (after every 7 days or 1 month or 6 months, etc.) to another, slower system, or delete it if you really do not need it. So, you need to decide on the archiving policy for your Elasticsearch data at the beginning of the project. One can build a HOT-WARM architecture using Elasticsearch node attributes, which keeps more important, fresh data on HOT nodes and less important/older data on slower hardware on WARM nodes, as in the sketch below. But it is very important to decide which data you need to search in sub-seconds/seconds and which data is okay to search on a slower platform.
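A minimal hot-warm sketch: nodes are started with a custom attribute (e.g. `node.attr.box_type: hot` or `warm` in `elasticsearch.yml`), new indices are pinned to hot nodes, and aging indices are relocated to warm nodes. The attribute name and daily index names below are common conventions, not requirements.

```python
# Hot-warm allocation sketch. Assumes nodes carry a custom attribute
# node.attr.box_type (hot|warm); attribute and index names are conventions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Today's index is written on fast hot hardware.
es.indices.put_settings(
    index="app-logs-2019.02.13",
    body={"index.routing.allocation.require.box_type": "hot"},
)

# After 7 days, relocate the index to cheaper warm hardware.
es.indices.put_settings(
    index="app-logs-2019.02.06",
    body={"index.routing.allocation.require.box_type": "warm"},
)
```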
There is also the Curator tool, which helps you delete data based on conditions, as well as archive data to your storage or S3. There are plenty of options available, but we need to decide on the required output and performance, and design the architecture accordingly. The sketch below shows the idea behind an age-based delete.
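For illustration, here is a hedged sketch of what Curator’s age-based delete does, written with the plain Python client; it assumes daily indices named `app-logs-YYYY.MM.DD` and a 30-day retention, both of which are assumptions.

```python
# Sketch of an age-based delete (what Curator automates). Assumes daily
# indices named app-logs-YYYY.MM.DD and a 30-day retention.
from datetime import datetime, timedelta
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
cutoff = datetime.utcnow() - timedelta(days=30)

for name in es.indices.get(index="app-logs-*"):
    try:
        day = datetime.strptime(name, "app-logs-%Y.%m.%d")
    except ValueError:
        continue  # skip indices that do not follow the naming convention
    if day < cutoff:
        es.indices.delete(index=name)
```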
One can use a message queue like Kafka, a processing engine like Spark, or big data storage like Hadoop along with the Elastic Stack, depending on the use case. I will discuss such scenarios in detail in my next articles.
As I have mentioned in this article, there are many parameters that go into designing an Elastic cluster, and you need to consider the above points carefully before you start deploying it.
