
Building Container Maintenance

Automatic web container image updates for managed Cloud Containers - the hows and whys.

This week we introduced a new feature to our Cloud Container platform called Container Maintenance. This feature is designed to help our customers keep their websites and applications up-to-date and secure by fully automating the process.

This helps achieve one of the goals of our Cloud Container platform – which is to maintain consistency across the whole fleet of servers. This consistency improves maintainability and automation, which in turn lets us improve the platform faster and with less risk.

In this post I will explain some of the design decisions we made and the technical challenges we had to solve to implement them.

The Design

When we sat down to design this feature we quickly agreed on a few key priorities which would drive all decisions around the implementation. These were:

  • Minimal downtime for the customers' websites

  • A rock solid rollback method if something goes wrong

  • Safely preserve any configuration changes customers have made

  • Have a basic health check that customers can expand

  • Reduce risk by targeting SiteHost-built ‘web containers’ only

The Technical Details

The Cloud Container platform is built on top of Docker (hence the name). Each of the customers’ websites exists inside a separate Docker container that sits behind a reverse HTTP proxy. This proxy, managed by us, enables virtual hosts (aliases) on your container and terminates the TLS connection. Once a week during the Cloud Container server’s maintenance window, we schedule a job to process all the containers on your server, looking for newer image versions that are available. If a new version is available, we begin the update process.
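In rough terms, the "newer image available" check boils down to something like the sketch below: pull the tag and compare the local image ID before and after. This is a simplified illustration rather than our production implementation (the helper names are invented), assuming only that the Docker CLI is available on the host.

    import subprocess

    def image_id(image: str) -> str:
        """Return the local ID of an image, or an empty string if it is missing."""
        result = subprocess.run(
            ["docker", "image", "inspect", "--format", "{{.Id}}", image],
            capture_output=True, text=True,
        )
        return result.stdout.strip() if result.returncode == 0 else ""

    def update_available(image: str) -> bool:
        """Pull the image tag and report whether the pull changed the local ID."""
        before = image_id(image)
        subprocess.run(["docker", "pull", image], check=True, capture_output=True)
        return image_id(image) != before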

Merging Configuration Files

The first challenge we decided to take on was merging configuration files. In Cloud Containers customers have the ability to customise the configs for their containers via SSH, but this can lead to problems when a customer has edited the same lines in a config that a newer version of the image also changes. This is a common problem in distributed version control systems.

We decided to solve this by using a three-way merge (Wikipedia), always preferring customer changes in a conflict scenario. Because Git is tried and trusted, we opted to use its git merge-file --ours command. This got us most of the way there, but we still needed to handle the addition and deletion of files because git merge-file operates on file contents, not directories.

Customer Config ↔ Original    Latest Config ↔ Original    Resolution
Identical                     Identical                   No operation.
Identical                     Different                   Use the latest configs.
Different                     Identical                   Keep the customer's configs.
Different                     Different                   Attempt three-way merge.

While this logic may not be perfect (computers struggle with context), this approach gave us the highest rate of customer configuration changes persisting between updates, and the highest rate of successful container updates overall.
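To make the table concrete, the per-file decision can be sketched along these lines. This is a simplified illustration rather than our production code: the function names are made up, and the handling of added and deleted files is left out.

    import filecmp
    import shutil
    import subprocess

    def resolve_config(customer: str, original: str, latest: str, output: str) -> None:
        """Pick or merge a single config file, per the resolution table above."""
        customer_unchanged = filecmp.cmp(customer, original, shallow=False)
        latest_unchanged = filecmp.cmp(latest, original, shallow=False)

        if latest_unchanged:
            # Rows 1 and 3: the new image didn't change this file, so whatever
            # the customer has (edited or not) is kept.
            shutil.copyfile(customer, output)
        elif customer_unchanged:
            # Row 2: the customer never touched the file; take the latest version.
            shutil.copyfile(latest, output)
        else:
            # Row 4: both sides changed. Three-way merge, preferring the
            # customer's side for conflicting hunks. git merge-file writes the
            # result into its first argument, so merge into a copy of the
            # customer's file.
            shutil.copyfile(customer, output)
            # The exit status may be non-zero when conflicting hunks had to be
            # auto-resolved, so it is not treated as fatal here.
            subprocess.run(["git", "merge-file", "--ours", output, original, latest])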


Failing Early by Health Checking your Application

Once the configuration files are merged and the new Docker image is pulled onto the server, we test that the new image and merged config files are compatible with the customer’s application. At this point, a Docker Compose override file is generated to replace the image and the mount point of the configuration files. An identical container will be spun up using the override and the original Compose files. This is a special health check container that can be accessed by its internal IP address only - no public traffic is routed to this container yet.
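As an illustration, the override might be generated along these lines; the service name, image tag and mount paths below are invented for the example and don't reflect our real file layout. Compose merges volume entries by their target path, so mounting the merged configs over the same target replaces the original config mount.

    import yaml  # PyYAML

    def build_override(service: str, new_image: str, merged_config_dir: str) -> str:
        """Build a Compose override that swaps only the image and config mount."""
        override = {
            "services": {
                service: {
                    "image": new_image,
                    # Mount the merged configs over the same target path so the
                    # override replaces the original config mount.
                    "volumes": [f"{merged_config_dir}:/container/config"],
                }
            }
        }
        return yaml.safe_dump(override, sort_keys=False)

    # e.g. docker compose -f docker-compose.yml -f override.yml up -d
    print(build_override("web", "registry.example.com/php8.2:latest",
                         "/tmp/maintenance-scratch/config"))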

The health check makes 10 requests, with a one second delay between attempts, to check whether the application is healthy. By default we perform a basic GET request to the homepage and expect a 200 OK response. If the container has multiple virtual hosts (e.g. sitehost.nz and www.sitehost.co.nz) and all other hosts redirect to the preferred one, the check will follow any 301 and 302 redirects it receives. A redirect to a location outside of the container fails the check immediately, because we would then be health checking an external application rather than the customer’s.

If the customer’s website forces a TLS connection (good on you), the check honours that by resending the request over a plain HTTP connection with the headers X-Forwarded-SSL: on and X-Forwarded-Proto: https set. This emulates the behaviour of the reverse HTTP proxy in front of your site, which handles TLS termination.
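Put together, the default check behaves roughly like the sketch below. The header names mirror what we described above; the function names, timeouts and cap on redirect hops are illustrative rather than our exact implementation, and for brevity the sketch always sends the forwarded headers instead of only on the retry.

    import http.client
    import time
    from urllib.parse import urlparse

    # Headers that emulate the reverse proxy's TLS termination.
    FORWARDED = {"X-Forwarded-SSL": "on", "X-Forwarded-Proto": "https"}

    def check_container(internal_ip: str, vhosts: list[str]) -> bool:
        """Make up to 10 attempts, one second apart, expecting a 200 homepage."""
        for _ in range(10):
            if request_ok(internal_ip, vhosts[0], "/", vhosts):
                return True
            time.sleep(1)
        return False

    def request_ok(ip: str, host: str, path: str, vhosts: list[str], hops: int = 5) -> bool:
        """GET a path, following 301/302 only while they stay on the container."""
        conn = http.client.HTTPConnection(ip, timeout=5)
        try:
            conn.request("GET", path, headers={"Host": host, **FORWARDED})
            resp = conn.getresponse()
            if resp.status == 200:
                return True
            if resp.status in (301, 302) and hops > 0:
                target = urlparse(resp.getheader("Location") or "")
                if target.hostname and target.hostname not in vhosts:
                    # Redirecting off the container means we would be checking
                    # someone else's application, so fail immediately.
                    return False
                return request_ok(ip, target.hostname or host, target.path or "/",
                                  vhosts, hops - 1)
            return False
        except (OSError, http.client.HTTPException):
            return False
        finally:
            conn.close()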

If an application specific or more thorough health check is needed, you can implement a custom health check endpoint. When /sitehost/health is present (i.e. does not return a 404), that URI will be used to test your application. If healthy, the page must respond with a 200 status code and the JSON object {"status": "healthy"}. Malformed JSON or any other status code is assumed to be unhealthy. In this endpoint you might check that all the required PHP modules and OS-level dependencies are present, or even run your application's unit tests.
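The contract for the custom endpoint is easy to express in code. This sketch just validates a response that has already been fetched; the function name is illustrative.

    import json

    def healthy_response(status: int, body: bytes) -> bool:
        """A 200 response with the JSON body {"status": "healthy"} is healthy;
        malformed JSON or any other status code is not."""
        if status != 200:
            return False
        try:
            payload = json.loads(body)
        except (ValueError, UnicodeDecodeError):
            return False
        return isinstance(payload, dict) and payload.get("status") == "healthy"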

Updating with Minimal Downtime and Reliable Rollbacks

Once your application is deemed to be alive and healthy, we mark the health check container as healthy and begin forwarding live external traffic to it. At this point the server has two running containers, one with the new image and one with the old image, and traffic is round-robined between them by the reverse HTTP proxy. Once the configs are replaced and the original container has been restarted with the new image, we bring down the health check container and clean up the temporary files used during the process.

The original container is never touched until the health check has completed. A temporary scratch directory is created to hold the merged config files and the Compose override file. The override file lets us modify only the image and the config directory mount point, while keeping everything else (environment variables, labels, etc.) unmodified. If errors occur, the rollback procedure is simply to bring down the health check container and remove the scratch directory.
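Putting those pieces together, the overall flow and its rollback path look roughly like the sketch below. The docker compose invocations, the separate project name for the health check container, and the helper callables are illustrative guesses, not the exact commands we run.

    import shutil
    import subprocess
    from typing import Callable

    def compose(*args: str, project: str, files: list[str]) -> None:
        """Run docker compose against a set of Compose files under a project name."""
        cmd = ["docker", "compose", "-p", project]
        for f in files:
            cmd += ["-f", f]
        subprocess.run(cmd + list(args), check=True)

    def run_maintenance(scratch: str, compose_files: list[str], override: str,
                        is_healthy: Callable[[], bool],
                        promote: Callable[[], None]) -> None:
        check_files = compose_files + [override]
        try:
            # Start the health check copy under its own project name; the
            # original container keeps serving traffic untouched.
            compose("up", "-d", project="healthcheck", files=check_files)
            if not is_healthy():
                raise RuntimeError("health check failed")
            # Healthy: route live traffic to the new container, swap the merged
            # configs into place and restart the original with the new image.
            promote()
        finally:
            # On success this is the normal clean-up; on failure it is the whole
            # rollback, because the original container was never modified.
            compose("down", project="healthcheck", files=check_files)
            shutil.rmtree(scratch, ignore_errors=True)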

Future Improvements

In our tests using siege to flood the web container with traffic, a typical website saw less than a second of downtime and a few dropped requests during the maintenance. We suspect this is because we are not yet able to drain connections from the proxy before stopping containers.

In some situations there is also a chance that a new container starts receiving real world traffic before it is completely ready, for example with applications that are slower to start up.

While neither of these issues is a deal breaker, they are things we would like to tidy up in the future if and when time allows, which always seems to be a hard thing to come by!

If you made it all the way through this post, congratulations. There was quite a bit to take in, but we hope you enjoyed this short overview of some of the technical challenges we faced while building this feature. Thanks for reading.