Ansible

Ping Your Ansible From Slack

We have been using Ansible for all our deployments, code updates and inventory management. We have also built a higher-level API for Ansible called bootstrapper. Bootstrapper provides a REST API for executing various complex tasks via Ansible. There is also a simple Python CLI for bootstrapper, called bootstrappercli, which talks to the bootstrapper servers in our prod/staging environments and performs the tasks. We have been using Slack for our team interactions, and it has become a permanent fixture among my Chrome tabs. We also have a custom Slack callback plugin that gives us a real-time callback for each step of a playbook execution.
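
The callback plugin itself is out of scope for this post, but the idea can be sketched in a few lines. The snippet below is only an illustration, assuming the pre-2.0 style callback interface and a Slack incoming webhook (the URL is a placeholder); our real plugin is more elaborate.

# callback_plugins/slack_notify.py -- stripped-down sketch, not the real plugin
# Assumes Ansible 1.x style callbacks and a Slack incoming webhook (placeholder URL)
import json
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/xxx/yyy/zzz"

def notify(text):
    # Post a simple text message to the Slack incoming webhook
    requests.post(SLACK_WEBHOOK_URL, data=json.dumps({'text': text}))

class CallbackModule(object):
    def playbook_on_task_start(self, name, is_conditional):
        notify("Starting task: %s" % name)

    def runner_on_ok(self, host, res):
        notify("OK: %s" % host)

    def runner_on_failed(self, host, res, ignore_errors=False):
        notify("FAILED: %s => %s" % (host, json.dumps(res)))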

Since bootstrapper already provides a sweet API, we decided to make Slack talk to Ansible directly. After googling for a while, I came across a simple slackbot called limbo, written in Python (of course), and the best part is its plugin system. It's super easy to write custom plugins. As an initial step, I wrote a couple of plugins for performing code updates, ad hoc execution and EC2 management. Below are some sample plugins.
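
A limbo plugin is just a Python module exposing an on_message(msg, server) hook, the same hook the plugins below use. As a purely illustrative toy example, a plugin that replies "pong" to "ping" needs only a few lines:

# plugins/pingpong.py -- toy limbo plugin, purely illustrative
import re

def on_message(msg, server):
    # Reply "pong" whenever someone says exactly "ping"
    if re.match(r'^ping$', msg.get('text', ''), re.IGNORECASE):
        return "pong"
    return ""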

Code-Update Plugin
"""Only direct messages are accepted, format is @user codeupdate <tag> <env>"""

import requests
import json
import unicodedata

ALLOWED_CHANNELS = []
ALLOWED_USERS = ['xxxxx']
ALLOWED_ENV = ['xxx', 'xxx']
ALLOWED_TAG = ['xxx', 'xxx']

BOOTSTRAPPER_PROD_URL = "http://xxx.xxx.xxx.xxx:yyyy"
BOOTSTRAPPER_STAGING_URL = "http://xxx.xxx.xxx.xxx:yyyy"

def get_url(infra_env):
    if infra_env == "prod":
        return BOOTSTRAPPER_PROD_URL
    else:
        return BOOTSTRAPPER_STAGING_URL

def codeupdate(ans_env, ans_tag):
    base_url = get_url(ans_env)
    req_url = base_url + '/ansible/update-code/'
    resp = requests.post(req_url, data={'tags': ans_tag, 'env': ans_env})
    return [resp.status_code, resp.text]

def on_message(msg, server):
    # Normalize the Slack message to plain ASCII once and split it into words
    words = unicodedata.normalize('NFKD', msg['text']).encode('ascii', 'ignore').split(' ')
    if len(words) < 4 or words[1] != 'codeupdate':
        return ""
    ### slightly hackish way to check whether the message is addressed to the bot
    elif words[0] == '<@xxxxxx>:':    # Message is actually a direct message to the Bot
        orig_msg = words[2:]
        msg_user = msg['user']
        msg_channel = msg['channel']
        if msg_user in ALLOWED_USERS and msg_channel in ALLOWED_CHANNELS:
            env = orig_msg[0]
            if env not in ALLOWED_ENV:
                return "Please Pass a Valid Env (xxxx/yyyy)"
            tag = orig_msg[1]
            if tag not in ALLOWED_TAG:
                return "Please Pass a Valid Tag (xxxx/yyyy)"
            return codeupdate(env, tag)
        else:
            return ("!!! You are not Authorized !!!")
    else:
        return "invalid Message Format, Format: <user> codeupdate <env> <tag>"

Adhoc Execution Plugin
"""Only direct messages are accepted, format is <user> <env> <host> <mod> <args_if_any"""

import requests
import json
from random import shuffle
import unicodedata

ALLOWED_CHANNELS = []
ALLOWED_USERS = ['xxxxx']
ALLOWED_ENV = ['xxx', 'xxx']
ALLOWED_TAG = ['xxx', 'xxx']

BOOTSTRAPPER_PROD_URL = "http://xxx.xxx.xxx.xxx:yyyy"
BOOTSTRAPPER_STAGING_URL = "http://xxx.xxx.xxx.xxx:yyyy"

def get_url(infra_env):
    if infra_env == "prod":
        return BOOTSTRAPPER_PROD_URL
    else:
        return BOOTSTRAPPER_STAGING_URL

def adhoc(ans_host, ans_mod, ans_arg, ans_env):
    base_url = get_url(ans_env)
    req_url = base_url + '/ansible/adhoc/'
    if not ans_arg:
        resp = requests.get(req_url, params={'host': ans_host, 'mod': ans_mod})
    else:
        resp = requests.get(req_url, params={'host': ans_host, 'mod': ans_mod, 'args': ans_arg})
    return resp.text

def on_message(msg, server):
    # Normalize the Slack message to plain ASCII once and split it into words
    words = unicodedata.normalize('NFKD', msg['text']).encode('ascii', 'ignore').split(' ')
    if len(words) < 5 or words[1] != 'adhoc':
        return ""
    ### slightly hackish way to check whether the message is addressed to the bot
    elif words[0] == '<@xxxxxxx>:':    # Message is actually a direct message to the Bot
        orig_msg = words[2:]
        msg_user = msg['user']
        msg_channel = msg['channel']
        if msg_user in ALLOWED_USERS and msg_channel in ALLOWED_CHANNELS:
            env = orig_msg[0]
            if env not in ALLOWED_ENV:
                return "Only Dev Environment is Supported Currently"
            host = orig_msg[1]
            mod = orig_msg[2]
            arg = ' '.join(orig_msg[3:])    # remaining words, if any, form the module arguments
            return adhoc(host, mod, arg, env)
        else:
            return ("!!! You are not Authorized !!!")
    else:
        return "Invalid Message, Format: <user> <env> <host> <mod> <args_if_any>"

Configure limbo as per the README in the repo and start the limbo binary. Now let's try executing some Ansible operations from Slack.

Let’s try a simple ping module

Let’s try updating a staging cluster,

Woohoo, it worked 😉 This opens up a new way for us to manage our infrastructure from Slack. But of course ACL is an important factor, as we need to restrict access to specific operations. Currently this ACL logic lives inside each plugin: it checks which user is running the command and from which channel, and only if both match does the bot start talking to bootstrapper; otherwise it returns an error to the user. I'm pretty excited to play with this, and I'm looking forward to automating as many Ansible tasks as possible from Slack 🙂
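
If more plugins get added, that check could be pulled out into a tiny shared helper instead of being duplicated per plugin. A possible sketch:

# Possible shared helper for the per-plugin ACL check (sketch only)
def is_authorized(msg, allowed_users, allowed_channels):
    """Allow a command only when both the Slack user and the channel are whitelisted."""
    return msg.get('user') in allowed_users and msg.get('channel') in allowed_channels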

etcd, serf

Building Self Discovering Clusters With ETCD and SERF

Nowadays it's very difficult to contain all services in a single box, and companies have started adopting microservice-based architectures. Docker users can achieve a highly scalable system, similar to an Amazon ASG, using Apache Mesos and supporting frameworks like Marathon. But there are a lot of services where a new node needs to know the existing members of a cluster in order to join successfully. The common approach is to assign static IPs to the machines, but we have slowly started moving all our clusters towards being immutable via ASGs. We could use Elastic IPs as static addresses, but then what happens when we want to scale the cluster up? Also, if the instances are inside a private network, assigning static local IPs is difficult too. So when a new machine is launched, or when the ASG relaunches a faulty machine, that machine needs to know the existing cluster nodes, and not all applications provide service-discovery features. I was testing a new Erlang-based MQTT broker called Verne; for a new Verne node to join the cluster, we need to pass the IPs of all existing nodes to the JOIN command.

A quick solution that came to mind was using a distributed key-value store like etcd: keep keys that tell us which nodes are present in a cluster. But I quickly hit another hurdle. Imagine a node belonging to a cluster goes down; its key will still be present in etcd. If we have something like an ASG, it will launch a replacement machine, and that new machine's IP will not be present under the etcd key. We could delete the stale key manually and add the new IP, but our ultimate goal is to automate all of this; otherwise our clusters can never become fully immutable and auto-healing.

So we need a system that can keep track of our nodes' activity, at least when they join or leave the cluster.

SERF

Enter SERF. Serf is a decentralized solution for cluster membership, failure detection, and orchestration. Serf relies on an efficient and lightweight gossip protocol to communicate with nodes. The Serf agents periodically exchange messages with each other in much the same way that a zombie apocalypse would occur: it starts with one zombie but soon infects everyone. Serf is able to quickly detect failed members and notify the rest of the cluster. This failure detection is built into the heart of the gossip protocol used by Serf.

Serf also needs to know the IPs of the other Serf agents in the cluster. So when we initially bring up the cluster, we know the IPs of the other Serf agents, and those agents populate our cluster keys in etcd. Now, when a new machine is launched or relaunched, its etcd process uses the discovery URL to connect to the etcd cluster. Once it has joined the existing cluster, the local Serf agent starts, talks to the etcd agent running locally, and creates the necessary key for the node. Once the key is created, the Serf agent joins the Serf cluster using one of the existing node IPs found in etcd.
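
As a rough sketch of that last step (assuming a local etcd answering on client port 4001, the /cluster/<cluster-name>/ key layout used later in this post, Serf agents bound on port 5000, and the serf binary on the PATH), the join could be scripted like this:

# bootstrap_join.py -- rough sketch of joining Serf via the IPs stored in etcd
import subprocess
import requests

CLUSTER_URL = "http://127.0.0.1:4001/v2/keys/cluster/test/"   # assumed local etcd and cluster name

resp = requests.get(CLUSTER_URL)
if resp.status_code == 200:
    # Each child key under the cluster directory is one member IP
    members = [n['key'].rsplit('/', 1)[1] for n in resp.json()['node'].get('nodes', [])]
    if members:
        # One live member is enough; Serf's gossip spreads the join to the rest
        subprocess.call(['serf', 'join', members[0] + ':5000'])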

ETCD Discovery URL

A bit about the etcd discovery URL. It's a built-in discovery mechanism in etcd for identifying the dynamic members of an etcd cluster. First, we need an etcd server; this server won't be part of our cluster and is used only for providing the discovery URL. We can use this server for multiple etcd clusters, provided each cluster has a unique URL. All the new nodes talk to their unique URL and join up with the rest of the etcd nodes.

# For starting the ETCD Discovery Server

$ ./etcd -name test --listen-peer-urls "http://172.16.16.99:2380,http://172.16.16.99:7001" --listen-client-urls "http://172.16.16.99:2379,http://172.16.16.99:4001" --advertise-client-urls "http://172.16.16.99:2379,http://172.16.16.99:4001" --initial-advertise-peer-urls "http://172.16.16.99:2380,http://172.16.16.99:7001" --initial-cluster 'test=http://172.16.16.99:2380,test=http://172.16.16.99:7001'

Now we need to create a unique discovery URL for our cluster.

$ curl -X PUT 172.16.16.99:4001/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83/_config/size -d value=3

 where, 6c007a14875d53d9bf0ef5a6fc0257c817f0fb83 => a unique UUID
        value => the size of the cluster; if the number of etcd nodes that try to join via this URL exceeds this value, the extra etcd processes fall back to being proxies by default.
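
To verify that the key went in, reading it back should return the size we just set (the UUID is the one generated above):

# Read back the cluster size stored under the discovery URL
import json
import requests

url = ("http://172.16.16.99:4001/v2/keys/discovery/"
       "6c007a14875d53d9bf0ef5a6fc0257c817f0fb83/_config/size")
print json.loads(requests.get(url).text)['node']['value']   # should print '3'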

SERF Events

Serf comes with a built-in event handler system. It ships with handlers for member events such as member-join and member-leave, and we are going to concentrate on these two. Below is a simple handler script that adds/removes keys in ETCD whenever a node joins/leaves the cluster.

# handler.py
#!/usr/bin/python
import sys
import os
import requests
import json

base_url = "http://<ip-of-the-node>:4001"   # Use any config mgmt tool or user-data script to populate this value for each host
base_key = "/v2/keys"
cluster_key = "/cluster/<cluster-name>/"    # Use any config mgmt tool or user-data script to populate this value for each host

# Serf passes the event type via the SERF_EVENT env variable and the affected
# members on stdin, one tab-separated line per member
for line in sys.stdin:
    event_type = os.environ['SERF_EVENT']
    if event_type == 'member-join':
        address = line.split('\t')[1]
        full_url = base_url + base_key + cluster_key + address
        r = requests.get(full_url)
        if r.status_code == 404:
            # The key doesn't exist yet, so create it
            r = requests.put(full_url, data={'value': 'member'})
            if r.status_code == 201:
                print "Key Successfully created in ETCD"
            else:
                print "Failed to create the Key in ETCD"
        else:
            print "Key Already exists in ETCD"
    if event_type == 'member-leave':
        address = line.split('\t')[1]
        full_url = base_url + base_key + cluster_key + address
        r = requests.get(full_url)
        if r.status_code == 200:
            # The node left the cluster but its key still exists, so remove it
            r = requests.delete(full_url)
            if r.status_code == 200:
                print "Key Successfully removed from ETCD"
            else:
                print "Failed to remove the Key from ETCD"
        else:
            print "Key already removed by another Serf Agent"

Design

Now that we have some idea of Serf and etcd, let's start building our self-discovering cluster. Make sure the etcd discovery node is up and that a unique URL has been generated for it. Once the URL is ready, let's start our ETCD cluster.

# Node1

$ ./etcd -name test1 --listen-peer-urls "http://172.16.16.102:2380,http://172.16.16.102:7001" --listen-client-urls "http://172.16.16.102:2379,http://172.16.16.102:4001" --advertise-client-urls "http://172.16.16.102:2379,http://172.16.16.102:4001" --initial-advertise-peer-urls "http://172.16.16.102:2380,http://172.16.16.102:7001" --discovery http://172.16.16.99:4001/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83/

# Sample Output

2015/06/01 19:51:10 etcd: listening for peers on http://172.16.16.102:2380
2015/06/01 19:51:10 etcd: listening for peers on http://172.16.16.102:7001
2015/06/01 19:51:10 etcd: listening for client requests on http://172.16.16.102:2379
2015/06/01 19:51:10 etcd: listening for client requests on http://172.16.16.102:4001
2015/06/01 19:51:10 etcdserver: datadir is valid for the 2.0.1 format
2015/06/01 19:51:10 discovery: found self a9e216fb7b8f3283 in the cluster
2015/06/01 19:51:10 discovery: found 1 peer(s), waiting for 2 more      # Since our cluster size is 3 and this is the first node, it's waiting for the other 2 nodes to join

Let's start the other two nodes:

./etcd -name test2 --listen-peer-urls "http://172.16.16.101:2380,http://172.16.16.101:7001" --listen-client-urls "http://172.16.16.101:2379,http://172.16.16.101:4001" --advertise-client-urls "http://172.16.16.101:2379,http://172.16.16.101:4001" --initial-advertise-peer-urls "http://172.16.16.101:2380,http://172.16.16.101:7001" --discovery http://172.16.16.99:4001/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83/

./etcd -name test3 --listen-peer-urls "http://172.16.16.100:2380,http://172.16.16.100:7001" --listen-client-urls "http://172.16.16.100:2379,http://172.16.16.100:4001" --advertise-client-urls "http://172.16.16.100:2379,http://172.16.16.100:4001" --initial-advertise-peer-urls "http://172.16.16.100:2380,http://172.16.16.100:7001" --discovery http://172.16.16.99:4001/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83/

Now that the ETCD cluster is up and running, let's configure Serf.

$ ./serf agent -node=test1 -bind=172.16.16.102:5000 -rpc-addr=172.16.16.102:7373 -log-level=debug -event-handler=/root/handler.py

$ ./serf agent -node=test2 -bind=172.16.16.101:5000 -rpc-addr=172.16.16.101:7373 -log-level=debug -event-handler=/root/handler.py

$ ./serf agent -node=test3 -bind=172.16.16.100:5000 -rpc-addr=172.16.16.100:7373 -log-level=debug -event-handler=/root/handler.py

Right now these 3 Serf agents are running in standalone mode and are not aware of each other. Let's make the agents join and form the Serf cluster.

$ ./serf join -rpc-addr=172.16.16.100:7373 172.16.16.101:5000

$ ./serf join -rpc-addr=172.16.16.102:7373 172.16.16.100:5000

Let’s query the serf agents and see if the cluster has been created.

$ root@sentinel:~# ./serf members -rpc-addr=172.16.16.100:7373
  test3  172.16.16.100:5000  alive
  test2  172.16.16.101:5000  alive
  test1  172.16.16.102:5000  alive

$ root@sentinel:~# ./serf members -rpc-addr=172.16.16.101:7373
  test3  172.16.16.100:5000  alive
  test2  172.16.16.101:5000  alive
  test1  172.16.16.102:5000  alive

$ root@sentinel:~# ./serf members -rpc-addr=172.16.16.102:7373
  test3  172.16.16.100:5000  alive
  test2  172.16.16.101:5000  alive
  test1  172.16.16.102:5000  alive

Now our serf nodes are up, let’s check our etcd cluster to see if serf has created the keys.

$ ./etcdctl -C 172.16.16.100:4001 ls /cluster/test/
  /cluster/test/172.16.16.100
  /cluster/test/172.16.16.101
  /cluster/test/172.16.16.102

Serf has successfully created the keys. Now let's terminate one of the nodes. One of our Serf agents will immediately receive a member-leave event and run the handler. Below is the log from the first agent that received the event.

2015/06/01 22:56:43 [INFO] serf: EventMemberLeave: test1 172.16.16.102
2015/06/01 22:56:44 [INFO] agent: Received event: member-leave
2015/06/01 22:56:45 [DEBUG] agent: Event 'member-leave' script output: member-leave => 172.16.16.102
Key Successfully removed from ETCD                                
2015/06/01 22:56:45 [DEBUG] memberlist: Initiating push/pull sync with: 172.16.16.100:5000

This event later gets propagated to the other agents:

2015/06/01 22:56:43 [INFO] serf: EventMemberLeave: test1 172.16.16.102
2015/06/01 22:56:44 [INFO] agent: Received event: member-leave
2015/06/01 22:56:45 [DEBUG] agent: Event 'member-leave' script output: member-leave => 172.16.16.102
Key already removed by another Serf Agent

Let's make sure that the key was successfully removed from ETCD.

$ ./etcdctl -C 172.16.16.100:4001 ls /cluster/test/
  /cluster/test/172.16.16.100
  /cluster/test/172.16.16.101

Now let’s bring the old node back online,

$ ./serf agent -node=test1 -bind=172.16.16.102:5000 -rpc-addr=172.16.16.102:7373 -log-level=debug -event-handler=/root/handler.py
  ==> Starting Serf agent...
  ==> Starting Serf agent RPC...
  ==> Serf agent running!
           Node name: 'test1'
           Bind addr: '172.16.16.102:5000'
            RPC addr: '172.16.16.102:7373'
           Encrypted: false
            Snapshot: false
             Profile: lan

  ==> Log data will now stream in as it occurs:

      2015/06/01 23:01:55 [INFO] agent: Serf agent starting
      2015/06/01 23:01:55 [INFO] serf: EventMemberJoin: test1 172.16.16.102
      2015/06/01 23:01:56 [INFO] agent: Received event: member-join
      2015/06/01 23:01:56 [DEBUG] agent: Event 'member-join' script output: member-join => 172.16.16.102
  Key Successfully created in ETCD

The standalone agent has already created its key. Now let's make the Serf agent join the cluster. It can use any one of the IPs from the etcd key-value store apart from its own IP.

$ ./serf join -rpc-addr=172.16.16.102:7373 172.16.16.100:5000
  Successfully joined cluster by contacting 1 nodes.

Once the node joins the cluster, the other existing Serf nodes receive the member-join event and also try to execute the handler. But since the key has already been created, they won't recreate it.

2015/06/01 23:04:56 [INFO] agent: Received event: member-join
2015/06/01 23:04:57 [DEBUG] agent: Event 'member-join' script output: member-join => 172.16.16.102
Key Already exists in ETCD

Let’s check our ETCD cluster to see if the keys are recreated.

$ ./etcdctl -C 172.16.16.100:4001 ls /cluster/test/
  /cluster/test/172.16.16.100
  /cluster/test/172.16.16.101
  /cluster/test/172.16.16.102

So now we have the IPs of the active nodes of the cluster stored in etcd. Next we need to extract those IPs from etcd and use them with our applications. Below is a simple Python script that returns the IPs as a Python list.

import requests
import json

base_url = "http://<local_ip_of_server>:4001"
base_key = "/v2/keys"
cluster_key = "/cluster/<cluster_name>/"
full_url = base_url + base_key + cluster_key
r = requests.get(full_url)
json_data = json.loads(r.text)
node_list = json_data['node']['nodes']
nodes = []
for i in node_list:
    nodes.append(i['key'].split(cluster_key)[1])
print nodes

We can use ETCD and confd to generate the configs dynamically for our applications.
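
Until that confd piece is wired up, even a couple of extra lines on top of the snippet above can render the discovered peers into whatever format an application expects; the file path and format below are made up purely for illustration.

# Illustrative only: dump the discovered IPs into a config file an application could read
with open('/etc/myapp/cluster.conf', 'w') as f:
    f.write("cluster_nodes = %s\n" % ",".join(nodes))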

Conclusion

ETCD and SERF provide a powerful combination for adding a self-discovery capability to our clusters. But it still needs heavy QA testing before being used in production. In particular, we need to test the confd automation part more, since we are not using a single key-value pair: SERF creates a separate key for each node, and the IP discovery from ETCD is done by iterating over the key directory. But these tools are backed by an awesome community, so I'm sure they will soon help us reach a whole new level of clustering for our applications.
