{"id":20590,"date":"2019-06-10T16:30:26","date_gmt":"2019-06-10T15:30:26","guid":{"rendered":"https:\/\/www.intercom.com\/blog\/?p=20590"},"modified":"2020-07-30T12:54:38","modified_gmt":"2020-07-30T11:54:38","slug":"upgrading-elasticsearch","status":"publish","type":"post","link":"https:\/\/www.intercom.com\/blog\/upgrading-elasticsearch\/","title":{"rendered":"A step-by-step guide to how we upgraded Elasticsearch with no downtime"},"content":{"rendered":"<p>Elasticsearch is a <a href=\"https:\/\/www.intercom.com\/blog\/run-less-software\/\" target=\"_blank\" rel=\"noopener noreferrer\">core technology<\/a> at Intercom. It powers everything from article, conversation and user search to reporting, billing, message delivery and even our internal log management and analytics.<\/p>\n<p>Because Elasticsearch has been at <a href=\"https:\/\/www.intercom.com\/blog\/elasticsearch-at-intercom\/\" target=\"_blank\" rel=\"noopener noreferrer\">the core of Intercom for a long time<\/a>, upgrading it is a challenging problem. Any version upgrade needs to be completely invisible to our customers. 
To quote <a href=\"https:\/\/www.youtube.com\/watch?v=6t8ae1Jgf_s&amp;feature=youtu.be\" target=\"_blank\" rel=\"noopener noreferrer\">Aaron Brady from Shopify<\/a>, \u201cOur customers should not be disadvantaged by the fact that we have chosen to upgrade our infrastructure.\u201d<\/p>\n<p>This story begins a few months ago.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20592\" src=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Date-9-months-ago.jpg\" alt=\"\" width=\"1382\" height=\"254\" srcset=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Date-9-months-ago.jpg 1382w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Date-9-months-ago-300x55.jpg 300w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Date-9-months-ago-768x141.jpg 768w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Date-9-months-ago-700x129.jpg 700w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Date-9-months-ago-600x110.jpg 600w\" sizes=\"auto, (max-width: 1382px) 100vw, 1382px\" \/><\/p>\n<p>It\u2019s September 2018, and there are 7 Elasticsearch clusters at Intercom. These are single purpose clusters, and they are owned by independent product teams. Intercom believes that teams should own their own infrastructure. 
This lets us <a href=\"https:\/\/www.intercom.com\/blog\/move-fast-and-optimize-for-the-long-term\/\" target=\"_blank\" rel=\"noopener noreferrer\">move fast and optimize for the long term<\/a> (one of our core engineering values) by giving us control over decisions such as indices, mappings and types without being blocked by approvals and red tape.<\/p>\n<blockquote class=\"pullquote-style-one\"><p>&#8220;We strive to ensure it is always fast, safe and easy for these product teams to utilize our infrastructure services to build amazing things&#8221;<\/p><\/blockquote>\n<p>At Intercom, we are focused on <a href=\"https:\/\/www.intercom.com\/blog\/shipping-is-your-companys-heartbeat\/\" target=\"_blank\" rel=\"noopener noreferrer\">shipping great product at high velocity<\/a>, and it&#8217;s critical that product teams are empowered and enabled to do this. We strive to ensure it&#8217;s always fast, safe and easy for these product teams to utilize our infrastructure services to build amazing things. This is why we made the decision to centralize this upgrade effort within a single infrastructure team.<\/p>\n<p>We chose to upgrade all of our Elasticsearch clusters, starting with our oldest cluster, which was running version 2.3.3. There were 54 releases of Elasticsearch between versions 2.3.3 and 6.3.0. 
That\u2019s a long jump, and one we had to land perfectly.<\/p>\n<h2 id=\"why-keep-elasticsearch-up-to-date\">Why keep Elasticsearch up to date<\/h2>\n<p>Before starting this project, we made sure we were solving the right problem by asking ourselves, \u201cWhy should we keep Elasticsearch up to date?\u201d<\/p>\n<p>Running lots of different versions of Elasticsearch is not optimal for a few reasons:<\/p>\n<ul>\n<li>It\u2019s difficult to debug problems with individual clusters because engineers can\u2019t transfer their knowledge and context of how one cluster works to another.<\/li>\n<li>It\u2019s impossible to provide generic shared tooling for all clusters because they all support different APIs.<\/li>\n<li>The older a cluster grows, the harder it becomes to upgrade.<\/li>\n<\/ul>\n<p>Running outdated versions of Elasticsearch (and software in general) is <em>not okay<\/em> for a few reasons:<\/p>\n<ul>\n<li><strong>Security:<\/strong>\u00a0It would be unacceptable to our customers if we built Intercom on top of software or systems that are at risk of being exploited.<\/li>\n<li><strong>Productivity:<\/strong>\u00a0Most software tends to improve over time, adding features, fixing bugs and supporting new ways of doing things. 
A lot of software tends to get faster over time.<\/li>\n<li><strong>Maintainability:<\/strong>\u00a0Deploying software updates ensures that your team knows how to build, test and deploy the software they own.<\/li>\n<\/ul>\n<h2 id=\"the-effect-of-continuous-integration-on-the-elasticsearch-upgrade\">The effect of continuous integration on the Elasticsearch upgrade<\/h2>\n<p>Intercom has an amazing <a href=\"https:\/\/www.intercom.com\/blog\/why-continuous-deployment-just-keeps-on-giving\/\" target=\"_blank\" rel=\"noopener noreferrer\">continuous integration (CI) pipeline<\/a>. On a good day, we <a href=\"https:\/\/www.intercom.com\/blog\/moving-faster-with-smaller-steps\/\" target=\"_blank\" rel=\"noopener noreferrer\">ship to production over a hundred times<\/a>. In order to do this safely, every code change is subjected to an ever-growing battery of unit and integration tests. At the time of writing, we were running 47,080 tests against each build in CI.<\/p>\n<p>By first upgrading Elasticsearch in CI, we were able to flush out most of the breaking changes that affected our specific use of Elasticsearch. This helped us find things like mapping changes, invalid cluster settings and deprecated query syntax. We shipped fixes and iterated here until our CI was passing with both Elasticsearch 2.3.3 and Elasticsearch 6.3.0.<\/p>\n<p>We know our CI test coverage is really good, but it\u2019s not perfect, and it never will be. So before moving on, we captured a week&#8217;s worth of real ingestion and search traffic and replayed it against a test cluster running Elasticsearch 6.3.0. 
We used this mechanism to verify that every request had the same response from both the new and old versions of Elasticsearch.<\/p>\n<h2 id=\"the-step-by-step-upgrade-process\">The step-by-step upgrade process<\/h2>\n<p>Elasticsearch has good documentation on <a href=\"https:\/\/www.elastic.co\/guide\/en\/elasticsearch\/reference\/6.3\/setup-upgrade.html\" target=\"_blank\" rel=\"noopener noreferrer\">version upgrades<\/a>. However, it notes that upgrades across major versions prior to Elasticsearch 6.0 require a full cluster restart. We wanted a zero-downtime, minimum-risk upgrade, where we could fail back to the old cluster instantly at the first sign of trouble.<\/p>\n<p>Our upgrade was a two-step process: first from 2.3.3 to 5.6.9, and then from 5.6.9 to 6.3.0. This was necessary because of our use of <a href=\"https:\/\/www.elastic.co\/guide\/en\/elasticsearch\/reference\/6.0\/modules-snapshots.html\" target=\"_blank\" rel=\"noopener noreferrer\">snapshot and restore<\/a> \u2013 newer versions of Elasticsearch can only read snapshots created from the current or previous major version.<\/p>\n<p>Before we upgraded Elasticsearch, our clusters had:<\/p>\n<ul>\n<li>40 dedicated data nodes<\/li>\n<li>3 dedicated master-eligible nodes<\/li>\n<li>3 dedicated coordinating-only nodes<\/li>\n<li>10k documents per second peak ingestion rate<\/li>\n<li>4k documents per second peak search rate<\/li>\n<li>2.4 billion documents<\/li>\n<li>6.8 terabytes of data<\/li>\n<\/ul>\n<h3>How to upgrade Elasticsearch from version 2.3.3 to 6.3.0<\/h3>\n<p style=\"padding-left: 40px;\"><strong>1.<\/strong> Set up a new 6.3.0 cluster with identical hardware and enable dual writing.<\/p>\n<p><em><strong>Important<\/strong>: It\u2019s crucial not to perform any real deletes on the 6.3.0 cluster while dual writing. If a document is deleted from the 2.3.3 cluster, then it should only be marked as deleted in the 6.3.0 cluster. 
The reason for this is that later we will restore from a backup and we don\u2019t want to restore documents which have been deleted in the meantime. We achieved this constraint by adding a boolean field called &#8220;Deleted&#8221; to each document. We then transformed deletions into updates during dual writing.<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20595\" src=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-1-Setup-new-cluster.jpg\" alt=\"step 1 set up the new elasticsearch cluster\" width=\"1072\" height=\"584\" srcset=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-1-Setup-new-cluster.jpg 1072w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-1-Setup-new-cluster-300x163.jpg 300w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-1-Setup-new-cluster-768x418.jpg 768w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-1-Setup-new-cluster-700x381.jpg 700w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-1-Setup-new-cluster-600x327.jpg 600w\" sizes=\"auto, (max-width: 1072px) 100vw, 1072px\" \/><\/p>\n<p style=\"padding-left: 40px;\"><strong>2.<\/strong> Take a snapshot of the old 2.3.3 cluster.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20596\" src=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-2-Take-snapshot-of-old-cluster.jpg\" alt=\"Step 2 take a snapshot of the old cluster\" width=\"1078\" height=\"975\" srcset=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-2-Take-snapshot-of-old-cluster.jpg 1078w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-2-Take-snapshot-of-old-cluster-300x271.jpg 300w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-2-Take-snapshot-of-old-cluster-768x695.jpg 768w, 
https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-2-Take-snapshot-of-old-cluster-700x633.jpg 700w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-2-Take-snapshot-of-old-cluster-600x543.jpg 600w\" sizes=\"auto, (max-width: 1078px) 100vw, 1078px\" \/><\/p>\n<p style=\"padding-left: 40px;\"><strong>3.<\/strong> Set up a temporary 5.6.9 cluster.<\/p>\n<p style=\"padding-left: 80px;\"><strong>a.<\/strong> Restore from the 2.3.3 snapshot taken in the previous step.<\/p>\n<p style=\"padding-left: 80px;\"><strong>b.<\/strong> Reindex into the Elasticsearch 5.x format.<\/p>\n<p style=\"padding-left: 80px;\"><strong>c.<\/strong> Take another snapshot.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20597\" src=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-3-Setup-temporary-cluster.jpg\" alt=\"Step 3 setup a temporary cluster and restore the snapshot\" width=\"1340\" height=\"1137\" srcset=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-3-Setup-temporary-cluster.jpg 1340w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-3-Setup-temporary-cluster-300x255.jpg 300w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-3-Setup-temporary-cluster-768x652.jpg 768w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-3-Setup-temporary-cluster-700x594.jpg 700w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-3-Setup-temporary-cluster-600x509.jpg 600w\" sizes=\"auto, (max-width: 1340px) 100vw, 1340px\" \/><\/p>\n<p style=\"padding-left: 40px;\"><strong>4.<\/strong> Delete the temporary 5.6.9 cluster.<\/p>\n<p style=\"padding-left: 80px;\"><strong>a.<\/strong> Restore from the 5.6.9 snapshot taken in the previous step into a new temporary index.<\/p>\n<p style=\"padding-left: 80px;\"><strong>b.<\/strong> Reindex from the temporary index into the live index. The data will now be 
in the Elasticsearch 6.x format.<\/p>\n<p style=\"padding-left: 80px;\"><strong>c.<\/strong> Delete the temporary index.<\/p>\n<p><em><strong>Important<\/strong>: If a document already exists in the live index, we do not want to overwrite it with an older version during the reindexing operation. To achieve this, we set op_type to create, which makes the reindex create only the documents that are missing from the target index. Every document that already exists triggers a version conflict, so we also set conflicts to proceed, which lets the reindex continue past those conflicts instead of aborting.<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20598\" src=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-4-Delete-temporary-cluster.jpg\" alt=\"Step 4 delete the temporary cluster\" width=\"1078\" height=\"1188\" srcset=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-4-Delete-temporary-cluster.jpg 1078w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-4-Delete-temporary-cluster-272x300.jpg 272w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-4-Delete-temporary-cluster-768x846.jpg 768w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-4-Delete-temporary-cluster-635x700.jpg 635w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-4-Delete-temporary-cluster-544x600.jpg 544w\" sizes=\"auto, (max-width: 1078px) 100vw, 1078px\" \/><\/p>\n<p style=\"padding-left: 40px;\"><strong>5.<\/strong> At this point, we have two identical Elasticsearch clusters. They have the exact same hardware, the exact same data and they are both being kept in sync. 
We can now switch the read load over to the new cluster at any point and tear down the old cluster.<\/p>\n<p><em><strong>Important<\/strong>: Before switching over for real, we used <a href=\"https:\/\/github.com\/github\/scientist\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub Scientist<\/a> to verify that both clusters had a 100% match rate on the read path.<\/em><\/p>\n<p style=\"padding-left: 40px;\"><strong>6.<\/strong> Turn off dual writing and make the new cluster the authority for both reads and writes. We can now delete all documents that have the boolean field &#8220;Deleted&#8221; set to true. We used the <a href=\"https:\/\/www.elastic.co\/guide\/en\/elasticsearch\/reference\/current\/docs-delete-by-query.html\" target=\"_blank\" rel=\"noopener noreferrer\">delete_by_query<\/a> API for this.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20599\" src=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-6-Retire-old-cluster.jpg\" alt=\"Step 6 retire the old cluster and use the new cluster\" width=\"448\" height=\"603\" srcset=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-6-Retire-old-cluster.jpg 448w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-6-Retire-old-cluster-223x300.jpg 223w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Step-6-Retire-old-cluster-446x600.jpg 446w\" sizes=\"auto, (max-width: 448px) 100vw, 448px\" \/><\/p>\n<p>Finally, we can retire the old cluster.<\/p>\n<h2 id=\"the-issues-we-discovered-during-the-upgrade\">The issues we discovered during the upgrade<\/h2>\n<p>We kept the old and new clusters dual writing and dual reading (step 5 above) for more than a week so we could verify that the new cluster was 100% stable and returning the correct documents. 
This turned out to be a really good idea because we discovered two serious performance issues with both Elasticsearch 5.6.9 and 6.3.0.<\/p>\n<p>Keeping the old cluster around bought us the time we needed to report these issues to Elastic, and work with them on eventually rolling out fixes:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/elastic\/elasticsearch\/pull\/31105\" target=\"_blank\" rel=\"noopener noreferrer\">elastic\/elasticsearch\/pull\/31105<\/a>: This was an indexing performance regression that affected indexes with a large number of fields. We saw elevated CPU usage on data nodes, really bad ingestion latency and regular stop-the-world garbage collection stalls.<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20591\" src=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Comparing-elasticsearch-versions.jpg\" alt=\"Graphs showing performance regression from the Elasticsearch 2.3.3. to 5.6.9. in CPU usage, indexing rate, indexing latency, young garbage collection time and old garbage collection time\" width=\"1382\" height=\"1143\" srcset=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Comparing-elasticsearch-versions.jpg 1382w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Comparing-elasticsearch-versions-300x248.jpg 300w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Comparing-elasticsearch-versions-768x635.jpg 768w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Comparing-elasticsearch-versions-700x579.jpg 700w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Comparing-elasticsearch-versions-600x496.jpg 600w\" sizes=\"auto, (max-width: 1382px) 100vw, 1382px\" \/><\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/elastic\/elasticsearch\/issues\/32537\" target=\"_blank\" rel=\"noopener noreferrer\">elastic\/elasticsearch\/issues\/32537<\/a>: This was a memory leak caused by slow logging. 
After a few days we started to see regular stop-the-world garbage collection pauses and after five days we saw data nodes completely fall over with OutOfMemory exceptions.<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20600\" src=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/The-memory-leak.jpg\" alt=\"Memory leak caused by slow logging over 3 days\" width=\"1338\" height=\"1404\" srcset=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/The-memory-leak.jpg 1338w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/The-memory-leak-286x300.jpg 286w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/The-memory-leak-768x806.jpg 768w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/The-memory-leak-667x700.jpg 667w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/The-memory-leak-572x600.jpg 572w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/The-memory-leak-1334x1400.jpg 1334w\" sizes=\"auto, (max-width: 1338px) 100vw, 1338px\" \/><\/p>\n<h2 id=\"the-benefits-of-upgrading-elasticsearch\">The benefits of upgrading Elasticsearch<\/h2>\n<p>The <a href=\"https:\/\/www.elastic.co\/products\/elasticsearch\" target=\"_blank\" rel=\"noopener noreferrer\">Elasticsearch<\/a> and <a href=\"https:\/\/lucene.apache.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Lucene<\/a> teams have done some incredible work over the past few years.<\/p>\n<ul>\n<li><strong>Performance:<\/strong> The Elasticsearch 5x release was focused on ingestion and search performance. The Elastic blog promised \u201c<a href=\"https:\/\/www.elastic.co\/blog\/elasticsearch-5-0-0-released\" target=\"_blank\" rel=\"noopener noreferrer\">somewhere between 25% &#8211; 80% improvement to indexing throughput<\/a>,\u201d and we saw exactly this after we applied the two bug fixes that we previously discussed. 
Most of our clusters showed a greater than 50% improvement in average indexing and search latency and a 40% reduction in average CPU usage.<\/li>\n<\/ul>\n<div id=\"attachment_20593\" style=\"width: 1271px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-20593\" class=\"size-full wp-image-20593\" src=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Improved-performance-from-Elasticsearch-2.3.3.-to-5.69..jpg\" alt=\"Graphs showing improved performance in average ingestion latency, average search latency and average CPU usage from Elasticsearch 2.3.3. to 5.6.9.\" width=\"1261\" height=\"1340\" srcset=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Improved-performance-from-Elasticsearch-2.3.3.-to-5.69..jpg 1261w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Improved-performance-from-Elasticsearch-2.3.3.-to-5.69.-282x300.jpg 282w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Improved-performance-from-Elasticsearch-2.3.3.-to-5.69.-768x816.jpg 768w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Improved-performance-from-Elasticsearch-2.3.3.-to-5.69.-659x700.jpg 659w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Improved-performance-from-Elasticsearch-2.3.3.-to-5.69.-565x600.jpg 565w\" sizes=\"auto, (max-width: 1261px) 100vw, 1261px\" \/><p id=\"caption-attachment-20593\" class=\"wp-caption-text\"><em>Elasticsearch 2.3.3 is plotted in red and 5.6.9 is in blue. Both clusters are under an identical (ingestion + search) work load.<\/em><\/p><\/div>\n<ul>\n<li><strong>Resilience:<\/strong>\u00a0Elasticsearch 6x was focused on \u201c<a href=\"https:\/\/www.elastic.co\/blog\/elasticsearch-6-0-0-released\" target=\"_blank\" rel=\"noopener noreferrer\">faster restarts and recoveries with sequence IDs.<\/a>\u201d This has been a complete game changer for us in terms of cluster maintenance. 
We now regularly perform rolling restarts of all of our Elasticsearch clusters in order to install security patches and even Linux kernel upgrades. These rolling restarts are automated and have zero impact on our clusters, which are continuously under heavy ingestion and search loads.<\/li>\n<li><strong>Efficiency:<\/strong>\u00a0Elasticsearch 6.0 also includes Lucene 7.0, which has a <a href=\"https:\/\/www.elastic.co\/blog\/minimize-index-storage-size-elasticsearch-6-0\" target=\"_blank\" rel=\"noopener noreferrer\">major storage benefit in how sparsely populated fields are stored<\/a>. As a result, we saw a massive reduction in disk usage (&gt;40%) across some of our larger clusters.<\/li>\n<li><strong>Cost:<\/strong> After the version upgrade, our clusters were massively over-provisioned, which enabled us to move to a newer EC2 instance family and reduce our overall instance counts. This was expected, but we made a conscious decision not to change the hardware and software at the same time. This made it much easier to debug the two performance issues outlined above.<\/li>\n<\/ul>\n<p>A month after the upgrades, we moved from c3.8xlarge to m5d.4xlarge for all of our Elasticsearch data nodes. 
Due to the combined effects of upgrading both Elasticsearch and EC2 instances, we cut the cost of running Elasticsearch at Intercom in half.<\/p>\n<h2 id=\"elasticsearch-at-intercom-today\">Elasticsearch at Intercom today<\/h2>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20594\" src=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Publish-date.jpg\" alt=\"\" width=\"1382\" height=\"252\" srcset=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Publish-date.jpg 1382w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Publish-date-300x55.jpg 300w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Publish-date-768x140.jpg 768w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Publish-date-700x128.jpg 700w, https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Publish-date-600x109.jpg 600w\" sizes=\"auto, (max-width: 1382px) 100vw, 1382px\" \/><\/p>\n<p>It\u2019s now June 2019, and there are 10 Elasticsearch clusters at Intercom, all running the latest and greatest version of Elasticsearch. They are still owned by individual teams because we still believe that product teams should own their own infrastructure, but we now also have a single team for cross cutting concerns like Elasticsearch major version upgrades.<\/p>\n<p class=\"inline-cta-quote\"><a href=\"https:\/\/www.intercom.com\/blog\/careers\" target=\"_blank\" rel=\"noopener noreferrer\">If you&#8217;re excited about Elasticsearch and other large scale distributed datastore related projects, please reach out to us. We&#8217;re hiring!<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>At Intercom, we rely on Elasticsearch as a core technology, which made the task of upgrading with zero disruption to our customers all the more challenging. Here&#8217;s how we did it. 
<\/p>\n","protected":false},"author":408,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"category":[12898],"tags":[14178,335,329],"coauthors":[17264],"class_list":["post-20590","post","type-post","status-publish","format-standard","hentry","category-engineering","tag-elasticsearch","tag-engineering","tag-infrastructure"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.3 (Yoast SEO v27.3) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>A Guide to How We Upgraded Elasticsearch With No Downtime<\/title>\n<meta name=\"description\" content=\"At Intercom, we rely on Elasticsearch as a core technology, which made the task of upgrading with zero disruption to our customers all the more challenging.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.intercom.com\/blog\/upgrading-elasticsearch\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"A step-by-step guide to how we upgraded Elasticsearch with no downtime\" \/>\n<meta property=\"og:description\" content=\"At Intercom, we rely on Elasticsearch as a core technology, which made the task of upgrading with zero disruption to our customers all the more challenging.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.intercom.com\/blog\/upgrading-elasticsearch\/\" \/>\n<meta property=\"og:site_name\" content=\"The Intercom Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/intercominc\" \/>\n<meta property=\"article:published_time\" content=\"2019-06-10T15:30:26+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-07-30T11:54:38+00:00\" \/>\n<meta 
property=\"og:image\" content=\"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Upgrading-Elasticsearch-social-share-image.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"900\" \/>\n\t<meta property=\"og:image:height\" content=\"471\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Cathal Coffey\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@coffeycathal\" \/>\n<meta name=\"twitter:site\" content=\"@intercom\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Cathal Coffey\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/upgrading-elasticsearch\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/upgrading-elasticsearch\\\/\"},\"author\":{\"name\":\"Cathal Coffey\",\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/#\\\/schema\\\/person\\\/0df97d4b8c4ddd3ae5f71e436d2426bb\"},\"headline\":\"A step-by-step guide to how we upgraded Elasticsearch with no 
downtime\",\"datePublished\":\"2019-06-10T15:30:26+00:00\",\"dateModified\":\"2020-07-30T11:54:38+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/upgrading-elasticsearch\\\/\"},\"wordCount\":1729,\"publisher\":{\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/upgrading-elasticsearch\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/06\\\/Date-9-months-ago.jpg\",\"keywords\":[\"elasticsearch\",\"Engineering\",\"infrastructure\"],\"articleSection\":[\"Engineering\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/upgrading-elasticsearch\\\/\",\"url\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/upgrading-elasticsearch\\\/\",\"name\":\"A Guide to How We Upgraded Elasticsearch With No Downtime\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/upgrading-elasticsearch\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/upgrading-elasticsearch\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/06\\\/Date-9-months-ago.jpg\",\"datePublished\":\"2019-06-10T15:30:26+00:00\",\"dateModified\":\"2020-07-30T11:54:38+00:00\",\"description\":\"At Intercom, we rely on Elasticsearch as a core technology, which made the task of upgrading with zero disruption to our customers all the more 
challenging.\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.intercom.com\\\/blog\\\/upgrading-elasticsearch\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/upgrading-elasticsearch\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/06\\\/Date-9-months-ago.jpg\",\"contentUrl\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/06\\\/Date-9-months-ago.jpg\",\"width\":1382,\"height\":254},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/\",\"name\":\"The Intercom Blog\",\"description\":\"Articles and Podcasts on Customer Service, AI and Automation, Product, and more\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/#organization\",\"name\":\"The Intercom Blog\",\"url\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/08\\\/Intercom-logo-sq-black-trans.png\",\"contentUrl\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/08\\\/Intercom-logo-sq-black-trans.png\",\"width\":1000,\"height\":1000,\"caption\":\"The Intercom 
Blog\"},\"image\":{\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/intercominc\",\"https:\\\/\\\/x.com\\\/intercom\",\"https:\\\/\\\/www.instagram.com\\\/intercom\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/2491343\",\"https:\\\/\\\/www.pinterest.ie\\\/intercom\\\/\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UCJG0MvLP03kyzzAkD-w98aQ\",\"https:\\\/\\\/en.wikipedia.org\\\/wiki\\\/Intercom_(company)\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/#\\\/schema\\\/person\\\/0df97d4b8c4ddd3ae5f71e436d2426bb\",\"name\":\"Cathal Coffey\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/3c5074a919d282da51af6da86497009894be4a2263d108dbba780de9b40f08e3?s=96&d=mm&r=pge63c16a9bf8aabc28995075e7ebc3d56\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/3c5074a919d282da51af6da86497009894be4a2263d108dbba780de9b40f08e3?s=96&d=mm&r=pg\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/3c5074a919d282da51af6da86497009894be4a2263d108dbba780de9b40f08e3?s=96&d=mm&r=pg\",\"caption\":\"Cathal Coffey\"},\"description\":\"Cathal is an engineer with a mixed background of both industry experience (Intercom, Amazon, IBM, Microsoft) and academic research (Maynooth University, UC Berkeley). He is the creator of the open source project DocX which has over 100,000 downloads on CodePlex.\",\"sameAs\":[\"https:\\\/\\\/www.linkedin.com\\\/in\\\/cathalcoffey\\\/\",\"https:\\\/\\\/x.com\\\/coffeycathal\"],\"url\":\"https:\\\/\\\/www.intercom.com\\\/blog\\\/author\\\/coffeycathal\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"A Guide to How We Upgraded Elasticsearch With No Downtime","description":"At Intercom, we rely on Elasticsearch as a core technology, which made the task of upgrading with zero disruption to our customers all the more challenging.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.intercom.com\/blog\/upgrading-elasticsearch\/","og_locale":"en_US","og_type":"article","og_title":"A step-by-step guide to how we upgraded Elasticsearch with no downtime","og_description":"At Intercom, we rely on Elasticsearch as a core technology, which made the task of upgrading with zero disruption to our customers all the more challenging.","og_url":"https:\/\/www.intercom.com\/blog\/upgrading-elasticsearch\/","og_site_name":"The Intercom Blog","article_publisher":"https:\/\/www.facebook.com\/intercominc","article_published_time":"2019-06-10T15:30:26+00:00","article_modified_time":"2020-07-30T11:54:38+00:00","og_image":[{"width":900,"height":471,"url":"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Upgrading-Elasticsearch-social-share-image.jpg","type":"image\/jpeg"}],"author":"Cathal Coffey","twitter_card":"summary_large_image","twitter_creator":"@coffeycathal","twitter_site":"@intercom","twitter_misc":{"Written by":"Cathal Coffey","Est. 
reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.intercom.com\/blog\/upgrading-elasticsearch\/#article","isPartOf":{"@id":"https:\/\/www.intercom.com\/blog\/upgrading-elasticsearch\/"},"author":{"name":"Cathal Coffey","@id":"https:\/\/www.intercom.com\/blog\/#\/schema\/person\/0df97d4b8c4ddd3ae5f71e436d2426bb"},"headline":"A step-by-step guide to how we upgraded Elasticsearch with no downtime","datePublished":"2019-06-10T15:30:26+00:00","dateModified":"2020-07-30T11:54:38+00:00","mainEntityOfPage":{"@id":"https:\/\/www.intercom.com\/blog\/upgrading-elasticsearch\/"},"wordCount":1729,"publisher":{"@id":"https:\/\/www.intercom.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.intercom.com\/blog\/upgrading-elasticsearch\/#primaryimage"},"thumbnailUrl":"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Date-9-months-ago.jpg","keywords":["elasticsearch","Engineering","infrastructure"],"articleSection":["Engineering"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.intercom.com\/blog\/upgrading-elasticsearch\/","url":"https:\/\/www.intercom.com\/blog\/upgrading-elasticsearch\/","name":"A Guide to How We Upgraded Elasticsearch With No Downtime","isPartOf":{"@id":"https:\/\/www.intercom.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.intercom.com\/blog\/upgrading-elasticsearch\/#primaryimage"},"image":{"@id":"https:\/\/www.intercom.com\/blog\/upgrading-elasticsearch\/#primaryimage"},"thumbnailUrl":"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Date-9-months-ago.jpg","datePublished":"2019-06-10T15:30:26+00:00","dateModified":"2020-07-30T11:54:38+00:00","description":"At Intercom, we rely on Elasticsearch as a core technology, which made the task of upgrading with zero disruption to our customers all the more 
challenging.","inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.intercom.com\/blog\/upgrading-elasticsearch\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.intercom.com\/blog\/upgrading-elasticsearch\/#primaryimage","url":"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Date-9-months-ago.jpg","contentUrl":"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/06\/Date-9-months-ago.jpg","width":1382,"height":254},{"@type":"WebSite","@id":"https:\/\/www.intercom.com\/blog\/#website","url":"https:\/\/www.intercom.com\/blog\/","name":"The Intercom Blog","description":"Articles and Podcasts on Customer Service, AI and Automation, Product, and more","publisher":{"@id":"https:\/\/www.intercom.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.intercom.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.intercom.com\/blog\/#organization","name":"The Intercom Blog","url":"https:\/\/www.intercom.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.intercom.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/08\/Intercom-logo-sq-black-trans.png","contentUrl":"https:\/\/www.intercom.com\/blog\/wp-content\/uploads\/2019\/08\/Intercom-logo-sq-black-trans.png","width":1000,"height":1000,"caption":"The Intercom 
Blog"},"image":{"@id":"https:\/\/www.intercom.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/intercominc","https:\/\/x.com\/intercom","https:\/\/www.instagram.com\/intercom\/","https:\/\/www.linkedin.com\/company\/2491343","https:\/\/www.pinterest.ie\/intercom\/","https:\/\/www.youtube.com\/channel\/UCJG0MvLP03kyzzAkD-w98aQ","https:\/\/en.wikipedia.org\/wiki\/Intercom_(company)"]},{"@type":"Person","@id":"https:\/\/www.intercom.com\/blog\/#\/schema\/person\/0df97d4b8c4ddd3ae5f71e436d2426bb","name":"Cathal Coffey","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/3c5074a919d282da51af6da86497009894be4a2263d108dbba780de9b40f08e3?s=96&d=mm&r=pge63c16a9bf8aabc28995075e7ebc3d56","url":"https:\/\/secure.gravatar.com\/avatar\/3c5074a919d282da51af6da86497009894be4a2263d108dbba780de9b40f08e3?s=96&d=mm&r=pg","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/3c5074a919d282da51af6da86497009894be4a2263d108dbba780de9b40f08e3?s=96&d=mm&r=pg","caption":"Cathal Coffey"},"description":"Cathal is an engineer with a mixed background of both industry experience (Intercom, Amazon, IBM, Microsoft) and academic research (Maynooth University, UC Berkeley). 
He is the creator of the open source project DocX which has over 100,000 downloads on CodePlex.","sameAs":["https:\/\/www.linkedin.com\/in\/cathalcoffey\/","https:\/\/x.com\/coffeycathal"],"url":"https:\/\/www.intercom.com\/blog\/author\/coffeycathal\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.intercom.com\/blog\/wp-json\/wp\/v2\/posts\/20590","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.intercom.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.intercom.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.intercom.com\/blog\/wp-json\/wp\/v2\/users\/408"}],"replies":[{"embeddable":true,"href":"https:\/\/www.intercom.com\/blog\/wp-json\/wp\/v2\/comments?post=20590"}],"version-history":[{"count":0,"href":"https:\/\/www.intercom.com\/blog\/wp-json\/wp\/v2\/posts\/20590\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.intercom.com\/blog\/wp-json\/wp\/v2\/media?parent=20590"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.intercom.com\/blog\/wp-json\/wp\/v2\/category?post=20590"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.intercom.com\/blog\/wp-json\/wp\/v2\/tags?post=20590"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.intercom.com\/blog\/wp-json\/wp\/v2\/coauthors?post=20590"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}