Demystified – Zero Downtime with AWS Elastic Beanstalk

I honestly feel that this is going to be one of the most important posts on this blog, and it is my great pleasure to bring it to you. Please let us know your opinion in the comments section after you finish reading. Let us dissect the topic “Zero Downtime with Amazon Elastic Beanstalk” by analyzing the options at our disposal. Let us get straight into the various methods of achieving zero downtime deployment, with a detailed description of each method along with its pros and cons.

1. Using CNAME swap feature provided by Elastic Beanstalk

You can read more about it in the Amazon Elastic Beanstalk documentation, and there is a related discussion in the AWS forums.

I shall also try to explain the method. When you create an Elastic Beanstalk application you need to give your environment a name; let us say the name is “my-app”. Once Beanstalk has successfully created the environment, it provides a URL of the form “my-app.elasticbeanstalk.com”, and your application becomes accessible through that URL.

But if you want your users to access your application using your own domain name instead of the Beanstalk URL, you need to create a CNAME entry in your DNS records. For example, if your domain name is “my-domain.com” and you want your users to access your application by typing “www.my-domain.com” in the browser, you should create a CNAME record for “www.my-domain.com” and set its value to “my-app.elasticbeanstalk.com”.

That’s it. You now have a production version running and everything is looking hunky-dory.

Now, to deploy a new version of your application, you launch another Beanstalk environment. Let us say you name this one “my-app-staging”. Once it is up, you can access the new version at “my-app-staging.elasticbeanstalk.com” and test it. When you are happy with how it works, make it available to your users with the “Swap Environment URL” item in the “Actions” drop-down of the Elastic Beanstalk console. Alternatively, you can use the command elastic-beanstalk-swap-environment-cnames, which is part of the Elastic Beanstalk command line utilities.

Once traffic to the earlier environment has completely stopped, you can terminate that environment.
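For those who would rather script the swap than click through the console, here is a minimal sketch in Python. It assumes the boto3 library rather than the command line utility named above, and the environment names are simply the ones used in this example. The client is passed in so that the function can be tried out without touching a real AWS account.

```python
def swap_cnames(eb_client, source_env, destination_env):
    """Swap the CNAMEs of two Elastic Beanstalk environments.

    eb_client is assumed to be a boto3 client created with
    boto3.client("elasticbeanstalk"); any object with the same
    swap_environment_cnames method will do for a dry run.
    """
    if source_env == destination_env:
        raise ValueError("source and destination must be different environments")
    request = {
        "SourceEnvironmentName": source_env,
        "DestinationEnvironmentName": destination_env,
    }
    eb_client.swap_environment_cnames(**request)
    return request
```

With a real client the call would look like swap_cnames(boto3.client("elasticbeanstalk"), "my-app", "my-app-staging"). Calling it a second time with the same arguments swaps the URLs back.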

Pros

Extremely simple to use, and it can be done from the Elastic Beanstalk console. In a production environment I would prefer not to use the console, but rather a script that calls the command line utility elastic-beanstalk-swap-environment-cnames to do the job.

Cons

It takes a long time to create a new environment from scratch. How long depends on your application’s startup tasks and on what you do during setup, such as installing required packages or downloading essential files. In our case, deploying a new application version of www.hudku.com this way takes around 20 minutes.

You cannot terminate your older environment immediately after the CNAME swap. How long you must wait depends on the TTL (Time To Live) settings of your DNS records, and you should also be aware that not all DNS servers on the Internet strictly honor TTL settings. In our case we use the default TTL that Amazon provides / recommends when we create a Route53 record, which is currently 300 seconds. With that setting we have seen traffic to the old environment come to a complete halt only about 4 hours after the CNAME swap. So if we care about our users and do not want them to see a “Server Unreachable” error, we should add a couple more hours as a buffer and keep the old environment for at least 6 hours. This adds to the cost: during those 6 hours two environments are running, so one extra instance is billed at the “On Demand Instance” rate, and it does not matter which of the two you count as the extra one.
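The arithmetic above is small enough to do in your head, but a tiny sketch makes the assumptions explicit. The 4 hours and the 2 hour buffer are the figures from our experience quoted above; the rate of 10 cents per hour is purely a made-up number for illustration, not an actual AWS price.

```python
import math


def overlap_hours(observed_drain_hours, buffer_hours):
    """Whole billed hours for which the old environment is kept alive."""
    return math.ceil(observed_drain_hours + buffer_hours)


def overlap_cost_cents(hours, hourly_rate_cents):
    """Extra On Demand cost of the one instance that would otherwise be gone."""
    return hours * hourly_rate_cents


hours = overlap_hours(4, 2)                    # 4 h observed drain + 2 h buffer
print(hours, overlap_cost_cents(hours, 10))    # prints: 6 60
```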

Even if you can afford the cost, it is still a waste of resources: you keep an unwanted machine running because you have no other option, so electricity, CPU time, etc. are definitely wasted.

In this method you use a “CNAME” DNS record for your site’s hostname, which is slightly slower for your users than a DNS “A” record because resolving it takes an extra lookup step. You should read about DNS records elsewhere, as they fall beyond the scope of this discussion. You do not have to worry too much about this performance issue; I mention it just to make the analysis complete.

2. Using Route53 to do URL update

This is almost identical to the first method, but instead of using Beanstalk’s CNAME swap you use Route53 to do the job. Refer to Amazon’s documentation on Route53 to understand the kinds of entries you can create, and in particular the “AliasTarget” record.

In this method you create an AliasTarget record, for example “my-app-production.my-domain.com”: choose “A” for the type, click the “Yes” radio button next to “Alias”, and fill in the “Alias Target” field with the DNS name of the ELB (Elastic Load Balancer) created by Elastic Beanstalk.

Then you create two more AliasTarget entries, “my-domain.com” and “www.my-domain.com”. This time do not provide the load balancer as the value; instead provide “my-app-production.my-domain.com” for both entries. Your application then becomes accessible at “my-domain.com” as well as “www.my-domain.com”.

Now, just as in the first method, to deploy a new version you bring up a new staging environment and call it “my-app-staging”. As you did for production, make an AliasTarget entry “my-app-staging.my-domain.com” and for its AliasTarget value provide the DNS name of the new staging load balancer. You can then test the new staging environment using the URL “my-app-staging.my-domain.com”.

Once you are satisfied with your testing and are ready to deploy, just update the entry “my-app-production.my-domain.com” and change its AliasTarget to the DNS name of the new staging load balancer, and voila, you have done a zero downtime deployment using Route53.
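As a rough sketch of what a script for this update could look like, the Python below builds the change request for an alias “A” record and hands it to Route53. It assumes the boto3 library; the function names are my own, and the zone IDs and load balancer name you pass in are placeholders you would have to look up yourself.

```python
def alias_upsert_batch(record_name, elb_dns_name, elb_hosted_zone_id):
    """Build a Route53 ChangeBatch that points an alias A record at an ELB."""
    return {
        "Comment": "Point %s at %s" % (record_name, elb_dns_name),
        "Changes": [
            {
                # UPSERT creates the record if missing, otherwise updates it.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "AliasTarget": {
                        # Hosted zone of the load balancer itself,
                        # not the zone that holds my-domain.com.
                        "HostedZoneId": elb_hosted_zone_id,
                        "DNSName": elb_dns_name,
                        "EvaluateTargetHealth": False,
                    },
                },
            }
        ],
    }


def repoint(route53_client, domain_zone_id, batch):
    """Apply the batch; route53_client is assumed to be boto3.client("route53")."""
    return route53_client.change_resource_record_sets(
        HostedZoneId=domain_zone_id, ChangeBatch=batch
    )
```

For a classic load balancer, the DNS name and the hosted zone ID of the load balancer can be read from the output of describe_load_balancers on the boto3 “elb” client (the fields CanonicalHostedZoneName and CanonicalHostedZoneNameID).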

Pros

This method is also simple, though you might have to read the Route53 documentation and understand the service. That is actually good for you, and your application can also make use of the other benefits that Route53 provides. Here too you can use the Route53 console to update the DNS entry and change the value of AliasTarget, but again I would use a script to do the job if I am serious about my website.

Cons

All the disadvantages mentioned for the first method apply here too, except for the last one regarding the CNAME. In this method a CNAME record is not used, and hence this method does enjoy that small gain in performance. For your information, AliasTarget is a concept present only in Route53, and it essentially behaves like an “A” record.

3. Lending EC2 Instances to Production Load Balancer

As far as we know, at the time of writing this method is not mentioned anywhere on the Internet, including Amazon’s documentation and the AWS forums. We at hudku are delighted to share this solution with you.
 
Let us assume, just as in the two methods above, that your production environment is running and you are bringing up a new staging environment with a new version of your application. Instead of using the Elastic Beanstalk console as in the first method or the Route53 console as in the second, this time head straight to the EC2 console and click on “Load Balancers”. Correctly identify the production load balancer and click on it. At the bottom of the page click on the “Instances” tab, then click on the “+/-” icon to bring up the popup titled “Add and Remove Instances”. Select the newly created EC2 instance that belongs to the staging environment and push the “Save” button.

Alternatively you could use the Elastic Load Balancer command line utility “elb-register-instances-with-lb” to register the instance with the load balancer.

Now the production load balancer has two instances as its members, one belonging to the production environment and the other belonging to the newly created staging. Then go to the Elastic Beanstalk console and deploy the new application version to the production environment.

Please note that I mentioned “Production Environment” as in “PRODUCTION”, “P R O D U C T I O N”.

Please pardon me. I just did that to draw your attention to the fact that you are now deploying your new application version directly on the production environment.

Now stay normal. Try wearing a smile. Do not be too tense that you are deploying directly on to the production machine. Have a sense of great satisfaction that you have proven it to the machine that after all it is dumb and its job is to simply obey your commands. You are the boss and are in complete control. Feel empowered. Just feee…eeel it!

During the deployment, let Beanstalk take your production machine down and let the website become unavailable from that machine for the duration of the deployment. Your users would never notice, as they would be served happily by the instance belonging to the staging environment. If you follow the above steps, you are indeed doing a zero downtime deployment in great style.

We have tested this and ensured that this method really works. There is no problem with an EC2 instance being a member of multiple load balancers, and why should there be?

Once the production machine is back and its environment is shown as GREEN, simply terminate the staging environment. I hope that termination automatically deregisters an instance from all the load balancers with which it is registered. If that is not the case, deregister the staging instance from the production load balancer first and then terminate the staging environment.
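The whole lending sequence can be sketched as a script. The Python below assumes boto3 clients for “elb” (classic load balancers) and “elasticbeanstalk”; all names passed in are placeholders. Waiting for the lent instance to report InService before deploying is a precaution I have added on top of the steps described above, and the final step deregisters explicitly rather than relying on termination to do it.

```python
import time


def wait_until(check, attempts, poll_seconds, sleep):
    """Poll check() until it returns True; give up after `attempts` tries."""
    for _ in range(attempts):
        if check():
            return True
        sleep(poll_seconds)
    return False


def lend_and_deploy(elb, eb, prod_lb, prod_env, staging_instance_id,
                    version_label, attempts=80, poll_seconds=15,
                    sleep=time.sleep):
    lent = [{"InstanceId": staging_instance_id}]

    # 1. Lend the staging instance to the production load balancer.
    elb.register_instances_with_load_balancer(
        LoadBalancerName=prod_lb, Instances=lent)

    # Added precaution: do not touch production until the lent instance
    # has passed the load balancer's health checks.
    def lent_in_service():
        states = elb.describe_instance_health(
            LoadBalancerName=prod_lb, Instances=lent)["InstanceStates"]
        return bool(states) and states[0]["State"] == "InService"

    if not wait_until(lent_in_service, attempts, poll_seconds, sleep):
        raise RuntimeError("staging instance never became InService")

    # 2. Deploy the new version directly onto production.
    eb.update_environment(EnvironmentName=prod_env, VersionLabel=version_label)

    # 3. Wait for production to come back Ready / Green on the new version.
    def production_green():
        env = eb.describe_environments(
            EnvironmentNames=[prod_env])["Environments"][0]
        return (env["Status"] == "Ready" and env["Health"] == "Green"
                and env.get("VersionLabel") == version_label)

    if not wait_until(production_green, attempts, poll_seconds, sleep):
        # The lent instance is deliberately left registered here.
        raise RuntimeError("production did not return to Ready/Green")

    # 4. Take the loan back; the staging environment can then be terminated.
    elb.deregister_instances_from_load_balancer(
        LoadBalancerName=prod_lb, Instances=lent)
```

If production never turns green, the function raises and leaves the staging instance in the load balancer, so users keep being served while you investigate.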

I never ever thought that lending or taking loan could be so innovative, beautiful and joyful.

Pros

This method eliminates ALL the disadvantages mentioned in the earlier two methods. Just for the joy of it let us examine one by one.

The problem of TTL is just not there. Hence there is no need to keep both the environments running. Once the production environment gets back to Ready / GREEN state and is functional, you can immediately terminate the staging environment.

No criminal wastage of resources such as electricity or cpu cycles. Would definitely make Al Gore happy.

Old application gets replaced with the new application version instantly. So if there is any bug in the earlier version, users do not have to live with it for a longer time which is the case in earlier two methods.

CNAME swap or URL update is not necessary.

Cons

A new staging environment still has to be created from scratch, which takes a long time.

Even though the staging environment can be terminated as soon as the production environment is back, you incur an instance cost of at least one full hour even if the staging environment is terminated within half an hour. In other words, creating a staging environment is not avoided, and hence you still incur the cost of running it.

3b. Lending Previous Production Load Balancer to New EC2 Instances

This is a minor variation to the previous method. In previous method we lent the services of EC2 instance to a load balancer. In this method we are trying to lend the services of a load balancer to an EC2 instance. Just the reverse of the earlier one.

When would we need this?
You would need this if for some reason you HAVE to terminate the previous production environment and if you MUST use the newly created staging environment. This could be the case if there is a change in the Elastic Beanstalk Environment Configuration itself which results in terminating the production load balancer and re-creating a new one. If production load balancer itself has to be changed and cannot be retained then we cannot use the method 3 mentioned above which reuses the production environment.

You would also need this if you have used either the first method or the second method to do the zero downtime deployment. Even when TTL issues are present, you want to force and ensure that all the users are using the new version of the application and the previous version is to be decommissioned immediately. This might be the case if you have a serious bug in your previous application version and you want to force the usage of the newer version.

So you do the CNAME swap as mentioned in method one or do the Route53 URL update as mentioned in method two or do both. (We at hudku actually do both. More on that in later blog posts). The first two methods by themselves provide the zero downtime deployment. Then the new environment starts getting the traffic but the old environment also continues to get traffic because of the TTL issues discussed earlier.

To decommission the old application immediately, first add the new EC2 instance as a member of the old load balancer and save. Then open the “Add and Remove Instances” popup of the old load balancer again, unselect the older EC2 instance and click “Save”, thereby removing the older instance from the older load balancer. We do this in two steps to ensure that things happen in the proper order.

Do not first remove the old instance and then add the new instance as it will result in small downtime. That is like from a tree cutting and detaching the very branch on which we are sitting. The branch would get unnecessarily broken into more pieces than necessary because of our weight when we both fall down. Just kiddin…..
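The ordering matters enough that it is worth writing down as a sketch. The Python below assumes a boto3 “elb” client for a classic load balancer; the function name and its arguments are my own. It adds the new instance first, waits until the load balancer reports it InService (an extra check I have added, the console steps above do not include it), and only then removes the old one.

```python
import time


def replace_member(elb, lb_name, new_instance_id, old_instance_id,
                   attempts=40, poll_seconds=15, sleep=time.sleep):
    """Add the new instance to the load balancer before removing the old one."""
    new = [{"InstanceId": new_instance_id}]
    old = [{"InstanceId": old_instance_id}]

    # Step 1: add first, so the load balancer is never left empty.
    elb.register_instances_with_load_balancer(
        LoadBalancerName=lb_name, Instances=new)

    # Added check: wait until the new member actually receives traffic.
    for _ in range(attempts):
        states = elb.describe_instance_health(
            LoadBalancerName=lb_name, Instances=new)["InstanceStates"]
        if states and states[0]["State"] == "InService":
            break
        sleep(poll_seconds)
    else:
        # Never saw InService: keep the old instance where it is.
        raise RuntimeError("new instance not InService; old instance kept")

    # Step 2: only now remove the old instance.
    elb.deregister_instances_from_load_balancer(
        LoadBalancerName=lb_name, Instances=old)
```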

Then after waiting for required amount of time to take care of TTL issues, terminate the previous production environment.

Since this is not a zero downtime method by itself, it cannot be considered a zero downtime deployment solution. This technique comes into play when method one or method two is used and we then want to force the users to see only the new version of the application.

Note:
You may be thinking: hey, we only need the previous load balancer, and we do not need the previous EC2 instance, which is disconnected anyway. Why not terminate it?

Your thinking is correct, but you would get into bigger trouble than simply paying the cost advised by the billing meter.

Elastic Beanstalk keeps monitoring the environment. As long as its EC2 instance is alive (whether connected to a load balancer or not), Beanstalk is happy. The moment you terminate the instance, Beanstalk immediately starts a new one. Not only do you pay the full hourly price for the terminated instance, you now start paying for the newly created instance as well. This is called getting damned twice.

But actually it is much worse than that. After starting the new instance, which is running the older application version anyway, Beanstalk will again connect it to the load balancer, defeating the whole purpose.

4. Running Secondary Tomcat Instance

Here in this post let me just explain the concept. In the next few posts we shall provide all the scripts you would need to set up and run a secondary Tomcat instance.

Of course this method is applicable only if you are running an Elastic Beanstalk Tomcat application, which is what we use for our hudku website. Here the Apache web server listens to the outside world on HTTP port 80. The Tomcat instance also runs and listens on port 8080, but that port is open only internally and is not accessible from the outside world. When an HTTP request arrives, Apache receives it and proxies it to port 8080. Whatever content Tomcat responds with is received by Apache and sent back to the client that made the HTTP request.

The Tomcat instance by default has three ports configured, viz., the HTTP Connector listening on port 8080, the AJP Connector listening on port 8009, and a redirect port 8443. By default Elastic Beanstalk uses the name “tomcat7” for its Tomcat instance, and its files are present in the /usr/share/tomcat7/ folder.

Now we need to bring up another Tomcat instance; let us call it “tomcat-secondary”. Copy the contents of the primary instance present in “/usr/share/tomcat7/” to the folder “/usr/share/tomcat-secondary/”. In the file server.xml in the “conf” folder, change the values of the three ports. Let us say we want to bump them up by 100, so in server.xml change 8080 ==> 8180, 8009 ==> 8109 and 8443 ==> 8543. Now start the secondary Tomcat instance and it should come up without any problem. For that matter, you can run not just two but any number of Tomcat instances in parallel, as long as there is no conflict in the port numbers they use.
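The port bump in server.xml can be sketched as a small text rewrite. This is not the script we use, just an illustration in Python: the function name is my own and it works on the file contents as a string. One thing to be aware of is that a stock Tomcat server.xml also declares a shutdown port on the Server element (8005 by default), which would need to differ between instances as well; the sketch leaves it alone unless you list it.

```python
import re

# Matches port="NNNN" and redirectPort="NNNN" attributes.
PORT_ATTR = re.compile(r'\b(port|redirectPort)="(\d+)"')


def bump_ports(server_xml, offset=100, ports=(8080, 8009, 8443)):
    """Return server_xml with each listed port number raised by `offset`."""
    def bump(match):
        number = int(match.group(2))
        if number not in ports:
            return match.group(0)          # leave unrelated ports untouched
        return '%s="%d"' % (match.group(1), number + offset)

    return PORT_ATTR.sub(bump, server_xml)


SAMPLE = (
    '<Server port="8005" shutdown="SHUTDOWN">\n'
    '  <Connector port="8080" protocol="HTTP/1.1" redirectPort="8443" />\n'
    '  <Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />\n'
    '</Server>\n'
)
print(bump_ports(SAMPLE))
```

Passing ports=(8005, 8080, 8009, 8443) would move the shutdown port too. Because this is a plain regular expression, it also rewrites matching attributes inside XML comments, which is harmless here.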

Once the secondary Tomcat instance is up, do the internal port redirection using the Linux “iptables” command. Internally redirect the ports 8080 ==> 8180, 8009 ==> 8109 and 8443 ==> 8543. Once this is done the primary instance becomes a dummy and all requests get handled by the secondary Tomcat instance.
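To give an idea of what such redirection rules might look like, here is a Python sketch that only builds the iptables command lines and does not run them. The exact chain is my assumption: since Apache proxies to Tomcat from the same machine, the sketch uses the nat table’s OUTPUT chain restricted to the loopback interface. If your traffic reaches Tomcat some other way, the rules would need to be different.

```python
PORT_MAP = {8080: 8180, 8009: 8109, 8443: 8543}


def redirect_rules(port_map, action="-A"):
    """Build iptables argument lists; "-A" adds the rules, "-D" deletes them."""
    if action not in ("-A", "-D"):
        raise ValueError("action must be -A (add) or -D (delete)")
    rules = []
    for primary, secondary in sorted(port_map.items()):
        rules.append([
            "iptables", "-t", "nat", action, "OUTPUT",
            "-o", "lo",                      # only locally proxied traffic
            "-p", "tcp", "--dport", str(primary),
            "-j", "REDIRECT", "--to-ports", str(secondary),
        ])
    return rules


for rule in redirect_rules(PORT_MAP):
    print(" ".join(rule))
```

Each list could be handed to subprocess.run as root. Building the same rules with action="-D" gives the commands that remove the redirection again.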

Now go to the Elastic Beanstalk console and deploy the new application version onto the production machine. Here too it is the production machine, and it is currently being served by the secondary Tomcat instance. Once the deployment is complete, just remove the redirection rules using iptables, so that this time the secondary instance becomes the dummy and all requests get handled by the primary Tomcat instance.

Shut down the secondary tomcat instance, sing a song and be happy.

Pros

If you have proper scripts in place then bringing up a secondary instance or shutting it down is a breeze.

None of the disadvantages mentioned in any of the above methods are applicable here. This method is blemishless.

No new environment is created. The billing meter is kept quiet and the deployment happens within minutes. To deploy new version of hudku using this method it takes less than two minutes.

No resources are wasted and no, no, we are not causing any damage to the ozone layer when we are using this method.

Because of the presence of secondary instance, at any point of time we can fall back to that version of the application in case we have a serious bug in the newly deployed version and need to quickly fall back.

Cons

This method reuses the existing machine / environment. Only in the case where we MUST create a new environment can this method not be used. That is very rare, and in making this point I am being highly critical.

Conclusion

That’s it folks. This is our take on “Zero downtime deployment with Amazon Elastic Beanstalk”. If you spot any other method or variation, please feel free to share it for the benefit of everybody.

Of all the methods mentioned, none is simply superior to the others. Depending on the situation at hand, we could be using any of them. That is the reason I have painstakingly tried to explain each in detail: we need to be aware of all of them, and at hudku we actually use each one.

It makes me sad that the last method is applicable only if you are running a Tomcat application and not if you are running PHP, Python or Ruby. I am not an expert in this area, but I want to leave a thought in case it could help our brethren from the PHP, Python and Ruby communities.

Suppose a PHP application is running from the directory /var/www/my-php-app. Copy the contents of this directory to /var/www/my-php-app-secondary. Then, using the “sed” command, substitute my-php-app with my-php-app-secondary in the elasticbeanstalk.conf file in the /etc/httpd/conf.d folder. Then issue the “service httpd reload” command to reload the Apache web server, so that after the reload it starts running the secondary PHP application.

Now deploy the new application using the Elastic Beanstalk console. You should ensure that during deployment the conf file you changed gets overwritten with your original content, so that it refers to the primary folder /var/www/my-php-app again. Once the deployment is complete, reload Apache again and you should be back on the primary application, now running the new version.
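The substitution step of this idea could be sketched as follows. It is written in Python rather than sed and is untested against a real Beanstalk PHP environment; the function name and the sample conf text are made up for illustration. The only subtlety it handles is not rewriting a path that already points at the secondary folder.

```python
import re


def point_at_secondary(conf_text, primary="/var/www/my-php-app",
                       suffix="-secondary"):
    """Rewrite references to the primary folder so they name the secondary one."""
    # The lookahead skips paths that already carry the suffix (or any other
    # longer folder name), so running this twice changes nothing more.
    pattern = re.compile(re.escape(primary) + r"(?![\w-])")
    return pattern.sub(lambda match: primary + suffix, conf_text)


sample = (
    'DocumentRoot "/var/www/my-php-app"\n'
    '<Directory "/var/www/my-php-app/public">\n'
)
print(point_at_secondary(sample))
```

After writing the result back to the conf file, the “service httpd reload” step described above would make Apache pick it up.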

Do you think this is possible and would work? Also, please feel free to point out anything I have got wrong, or any method that needs correction or can be enhanced further.

If you like this post then please try to look for all those social buttons you can find on this page and nail each one of them.

I would appreciate it if you could take a few minutes and leave your comments.




Arun Kumar
Arun Kumar is an engineering graduate and started his career as Software Engineer in 1987.

  • bakura

    Hi,

    Thanks a lot for sharing this, solution n°3 makes a lot of sense. I think there is one more drawback: if you use resources created by Beanstalk such as RDS (or even SQS, ElastiCache… now that Beanstalk allows creating environment resources).

    Because staging environment will create its own resources, if you add the EC2 instance(s) from the staging environment to the ELB, it will use its own resources (insert messages into its own SQS queue…). So during the slight delay when we update the production environment, everything will be created to the wrong resources which can be problematic.

    The only solution I see is to override staging environment’s configuration so that it does not create its resources, but rather reuse existing ones from production.

    But maybe you have a better solution? (well, of course we could decide to create RDS, ElastiCache, SQS… separately, but I quite like environment resources).

  • http://www.hudku.com/ hudku

    Solution no. 1 & 2 are safe. Solution no. 4 is the best and is the most recommended.

    Solution no. 3 is actually a hack which we happened to try and ensured that it can work. Definitely this method is not recommended and should not be considered as a regular option. Just for the completeness of the article we shared our understanding and experience of using AWS candidly without holding anything back.

    As mentioned in the article, being aware of various methods helps us deploy the right one depending on the situation.

    Solution 4 is the best and that’s what we use most of the time.

    • bakura

      Thanks for your answer. Unfortunately solution 4 is not applicable as I use PHP. In solution 1 (swap CNAME) there is still something I cannot really understand. I tried it with two test environments and it confuses me. Let’s say that I have one domain name that points to prod.exemple.com. If I swap environments, the domain name will now point to prod.exemple.com which will be, in fact, the staging environment now. This means that the staging environment will stay the “production” environment until we swap again? This confuses me. I’ve re-read your article several times but I still can’t grasp it.

  • http://www.hudku.com/ hudku

    If you are using Beanstalk, then the URL cannot be prod.example.com. It will be in the format prod-myapp.elasticbeanstalk.com. Then in your DNS records, which may be with for example GoDaddy or whichever DNS provider you registered your domain name with, you should make a CNAME entry such as www.myapp.com and make it point to prod-myapp.elasticbeanstalk.com.

    At this point your production is setup and is accessible using http://www.myapp.com.

    Now you create a new environment, let us call it staging and it becomes available at staging-myapp.elasticbeanstalk.com. You access that url directly, test it and once you are satisfied do the CNAME swap.

    The url of the new staging would become prod-myapp.elasticbeanstalk.com and soon would start receiving all the traffic coming through http://www.myapp.com.

    Thus after the swap, prod becomes staging and staging becomes prod. One more swap will reverse it again.

    But the main DNS entry http://www.myapp.com will always point to prod-myapp.elasticbeanstalk.com and whichever environment has that URL will start handling the traffic.

    I tried my best. Hope it is clear now.

    If you are not using Beanstalk and are only using Apache and PHP then write to me. We could discuss a different solution for that scenario.

    • bakura

      Hi again. Thanks a lot for taking that time. I think it makes a bit more sense. But I indeed use Beanstalk with PHP. Makes everything so much easier.

  • bfreis

    Very interesting article. However, I’m not sure you can get absolutely 0 downtime with your 3rd method, as you propose.

    The problem occurs at the exact moment that you ask Elastic Beanstalk to deploy your new version to your production environment.

    As you explained very well, your application will become unavailable on the production instances during this update — the application server (say, Tomcat) will have to undeploy the older version, which will make it stop processing requests (and send 5xx errors to any requests), then it will deploy the new version, then there’s the application startup time, and only then it will start serving requests again.

    You then say: “But your users would never notice it as they would be served happily by the instance belonging to the staging environment. If you follow the above steps then you indeed are doing a zero downtime deployment in great style.”

    The problem here is that you assume that the Elastic Load Balancer will **instantly** know that the production instances are unavailable, which is not true.

    The ELB will only stop sending requests to an instance after it has determined that it is **unhealthy**. Instances are considered unhealthy according to the Health Check configuration of the ELB. The minimum amount of failed checks (either 5xx errors or a timeout) for ELB to mark an instance as unhealthy is 2, and the minimum amount of time between 2 consecutive checks is 0.1 minutes (ie, 6s). In the best case (ie, if the ELB performs one health check at the EXACT moment your application stops serving requests on the production instances), you will have at least 6 seconds in which the ELB will believe the instance is working (since it has only failed a single health check for now), and it will continue sending requests there, which will necessarily fail.

    This down time will occur. Maybe only a fraction of the users will notice it, and maybe for only a few seconds, but it is definitely not 0.

    What you could do to solve the problem is remove the production instances from the ELB before asking EB to deploy the new version, then wait until the environment is green again, then add these instances back to the ELB, and then remove the staging instances from the ELB.

  • Dave North

    Outstanding article. We’ve been using Beanstalk for several years now and we use #2 (Route 53 Alias switch). It works very well BUT we’ve observed that the TTL is ignored by a great number of ISPs. In our case, we use a 60 second TTL but we observe on every deployment that we get traffic on the ‘old’ environment for 3-4 days after switching over. It’s not (currently) the end of the world for us but it does bug me. I will have to experiment with the ‘loan instances’ approach and see how that works for us.

    Thanks for the great post!