
Building Auto Healing Clusters With AWS and Ansible

As we all know, this is the era of cloud servers. With the emergence of the cloud, we no longer need to worry about the difficulties of hosting servers on premises. But if you are a cloud engineer, you know that anything can happen to your machines, and unlike on-premises hardware, you can't just walk over and fix them yourself. I have faced plenty of such weird issues, including one where my cloud service provider terminated my servers, my Postgres DB master among them. So having a self-healing cluster helps a lot, especially when a server goes down in the middle of our sleep. Stateless services are the easiest candidates for self healing compared to DBs, especially if we are using a Master-Slave DB architecture.

For those who are using Docker and Mesos, Marathon provides scaling features similar to Amazon ASG. We define the number of instances that have to be running, and Marathon makes sure that number always exists. Like Amazon ASG, it will relaunch a new container if any container accidentally terminates. I personally tested this feature of Marathon a while back, and it's really promising. There are indeed other automated container management systems, but Marathon is quite flexible for my needs.
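For context, keeping a Marathon app at a fixed instance count is just a field in its app definition; a minimal sketch against Marathon's REST API (the host and app ID here are hypothetical) could look like this:

# hedged sketch: pin a Marathon app at 3 instances via its REST API
# (hypothetical Marathon endpoint and app id)
curl -X PUT http://marathon.example.com:8080/v2/apps/my-service \
     -H "Content-Type: application/json" \
     -d '{"instances": 3}'

If any of those 3 containers dies, Marathon schedules a replacement to restore the count.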

But at Clementine, in our current architecture, we are not yet using Docker in production, and we use AWS heavily for all our clusters. With more features like secure messaging, VOIP etc. added to our product, we are expanding tremendously, and so is our infrastructure. Being a DevOps engineer, I need to maintain uptime. So this time I decided to prototype a self-healing cluster using Amazon ASG and Ansible.

Design

For the auto healing, I am going to use Amazon ASG and Ansible. Since Ansible is an agentless application, we need to either use Ansible in stand-alone mode and provision the machine via a cloud-init script, or use ansible-pull. Or, as the company recommends, use Ansible Tower, which is a paid solution. But we have built our own higher-level API solution over Ansible called Bootstrapper. Bootstrapper exposes a REST API which we can invoke for all our Ansible management. Our in-house version of Bootstrapper can perform various actions like EC2 instance launch with/without EIP, ad hoc command execution, server bootstrapping, code update etc.
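For teams without such an API layer, the ansible-pull route mentioned above can be as simple as a boot-time or cron invocation; a minimal sketch, assuming a hypothetical playbook repository:

# hedged sketch: pull the playbook repo and apply local.yml on this host
# (the repository URL is hypothetical)
ansible-pull -U https://github.com/example/ansible-playbooks.git local.yml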

But again, if we use a plain AMI and try to bootstrap the server completely during startup, it adds a heavy delay, especially when PyPI gives you timeouts while installing pip packages. So we decided to use a custom AMI which has our latest build baked in. Jenkins takes care of this part. Our build flow is like this:

Dev pushes code to Master/Dev => Jenkins performs build tests => if the build succeeds, Jenkins builds our master/dev packages => uploads the packages to our APT repo => Packer builds the latest image via packer-aws-chroot
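The last step of that flow is a plain packer build; a hedged sketch of the Jenkins step (the template name and variable are hypothetical):

# hypothetical Jenkins build step: bake the AMI with the freshly published package
packer build -var "package_version=${BUILD_NUMBER}" packer-aws-chroot.json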

While building the image, we add two custom scripts to our images: 1) set_eip.sh (manages the EIP for the instance via Bootstrapper), and 2) ans_bootstrap.sh (manages server bootstrapping via Ansible). Both are shown below.
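Both scripts read the instance's role and env from our custom Ansible local facts; a hypothetical clem.fact (the values here are purely illustrative) could be seeded like this:

# hypothetical example: the custom local-facts file both scripts parse
# (ansible local facts live under /etc/ansible/facts.d/ as INI-style *.fact files)
cat > /etc/ansible/facts.d/clem.fact <<'EOF'
[general]
role=voip
env=staging
EOF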

#!/bin/bash
# set_eip.sh
# fetch this instance's id from the EC2 metadata service
inst_id=`curl -s http://169.254.169.254/latest/meta-data/instance-id`
role=`grep role /etc/ansible/facts.d/clem.fact | cut -d '=' -f2`  # our custom ansible facts
env=`grep env /etc/ansible/facts.d/clem.fact | cut -d '=' -f2`    # our custom ansible facts
if [ "$env" == "staging" ]; then
  bootstrapper_url="xxx.xxx.xxx.xxx:yyyy/ansible/set_eip/"
else
  bootstrapper_url="xxx.xxx.xxx.xxx:yyyy/ansible/set_eip/"     # our bootstrapper api for EIP management
fi
curl -X POST -s -d "instance_id=$inst_id&role=$role&env=$env" "$bootstrapper_url"

The above POST request performs EIP management: Bootstrapper invokes Ansible, which assigns the proper EIP to the machine without any collision. We keep an EIP mapping for each cluster, which makes sure that we never assign a wrong EIP to a machine. If no EIP is available, we raise an exception and email/Slack the infra team about the instance and cluster.
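Under the hood, the attach step boils down to a single EIP association; a minimal sketch of the equivalent call, shown here via the AWS CLI rather than our internal playbook (the instance and allocation IDs are hypothetical):

# hedged sketch: the EIP association the playbook ultimately performs
# (hypothetical ids; VPC instances associate by allocation id)
aws ec2 associate-address --instance-id i-0abc1234 --allocation-id eipalloc-0def5678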

#!/bin/bash
# ans_bootstrap.sh
# fetch this instance's private ip from the EC2 metadata service
local_ip=`curl -s http://169.254.169.254/latest/meta-data/local-ipv4`
role=`grep role /etc/ansible/facts.d/clem.fact | cut -d '=' -f2`  # our custom ansible facts
env=`grep env /etc/ansible/facts.d/clem.fact | cut -d '=' -f2`    # our custom ansible facts
if [ "$env" == "staging" ]; then
  bootstrapper_url="xxx.xxx.xxx.xxx:yyyy/ansible/role/"
else
  bootstrapper_url="xxx.xxx.xxx.xxx:yyyy/ansible/role/"     # our bootstrapper api for role based playbook execution; multiple roles can be passed to the API
fi
curl -X POST -d "host=$local_ip&role=$role&env=$env" "$bootstrapper_url"

These two scripts are executed via the cloud-init user data during machine boot-up. Once we have the image ready, we need to create a launch configuration for the ASG. Below is a sample user data script:

#! /bin/bash

echo "Starting EIP management via Bootstrapper"
/usr/local/src/bootstrap_scripts/set_eip.sh
echo "starting server bootstrap"
/usr/local/src/bootstrap_scripts/ans_bootstrap.sh

Now create an Auto Scaling group with the required number of nodes. On the scaling policies, select "Keep this group at its initial size". Once the ASG is up, it starts the nodes based on the AMI and subnet specified. As each machine boots, the cloud-init script executes our user data scripts, which in turn talk to our Bootstrapper-Ansible and start assigning the EIP and executing the playbooks on the host.
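For reference, the same setup can also be scripted; a hedged sketch via the AWS CLI (all names, the AMI and the subnet are hypothetical). Keeping min, max and desired equal is exactly what "Keep this group at its initial size" amounts to:

# hedged sketch: launch config + ASG via the AWS CLI (hypothetical names/ids)
aws autoscaling create-launch-configuration \
    --launch-configuration-name voip-lc \
    --image-id ami-0abc1234 \
    --instance-type m3.medium \
    --user-data file://userdata.sh

aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name voip-asg \
    --launch-configuration-name voip-lc \
    --min-size 4 --max-size 4 --desired-capacity 4 \
    --vpc-zone-identifier subnet-0abc1234

Below is a sample log from our Bootstrapper for EIP management, invoked by an ASG node while it was booting up.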

01:33:11 default: bootstrap.ansble_set_eip(u'staging', u'<ansible_role>', u'i-xxxxx', '<remote_user>', '<remote_key>') (4df8aee9-ab0e-4152-973a-b227ddac91a1)
EIP xxx.xxx.xxx.xxx is attached to instance i-xyxyxyxy
EIP xxx.xxx.xxx.xxx is attached to instance i-xyxyxyxy
EIP xxx.xxx.xxx.xxx is attached to instance i-xyxyxyxy
EIP xxx.xxx.xxx.xxx is attached to instance i-xyxyxyxy
EIP xxx.xxx.xxx.xxx is attached to instance i-xyxyxyxy
Free EIP available for <our-cluster-name> is xxx.xxx.xxx.xxx   # Bootstrapper found the free ip that is allowed to be assigned for this particular node

PLAY [localhost] **************************************************************

TASK: [adding EIP to the instance] ********************************************
changed: [127.0.0.1]
01:33:12 Job OK, result = {'127.0.0.1': {'unreachable': 0, 'skipped': 0, 'ok': 2, 'changed': 1, 'failures': 0}}

I've tested this prototype with one of our VOIP clusters, and the cluster is working perfectly with the corresponding EIPs as mapped. We terminated the machines multiple times to make sure that the EIP management is working properly and the servers are getting bootstrapped. The results are promising, and this motivates us to migrate all of our stateless clusters to self healing, so that a cluster auto-heals whenever a machine becomes unhealthy. No need for any human intervention, unless Amazon really screws up their ASG :p
