Docker, FreeSwitch, sipp, sippy_cup

Sippy_cup – FreeSwitch Load Test Simplified

Ever since the arrival of Docker, everyone has been busy porting their applications to Docker containers. With tools like Mesos, CoreOS etc., we can easily achieve scalability as well. At Plivo we always dedicate ourselves to playing around with such new technologies. In my previous blog posts, I've explained how to containerize FreeSWITCH and how to perform some basic load tests using simple dialplans. My previous load tests required a bunch of basic FreeSWITCH servers to originate calls and flood the FreeSWITCH container. So this time I'm going to use a simpler method, which everyone can use even from their laptops.

Enter SIPp. SIPp is a free, open-source test tool / traffic generator for the SIP protocol. But the main issue for a beginner like me is generating a proper XML scenario for SIPp that matches my exact production scenarios. After googling, I came across sippy_cup, a super simple Ruby wrapper over SIPp. We just need to create a simple YAML file; sippy_cup parses this yml file and generates the equivalent XML, which is then used to generate calls. sippy_cup can also be used to generate only the XML file for SIPp.

Setting up sippy_cup is very simple. There are only two dependencies:

      1) ruby (2.1.2 recommended)
      2) SIPp

Another important dependency is our local internet bandwidth. Flooding too many calls will definitely result in network bottlenecks, which I faced when I generated 1k calls from my laptop. Now let's install SIPp.

sudo apt-get install pcaputils libpcap-dev libncurses5-dev

wget 'http://sourceforge.net/projects/sipp/files/sipp/3.2/sipp.svn.tar.gz/download' -O sipp.svn.tar.gz

tar zxvf sipp.svn.tar.gz

# cd into the extracted SIPp source directory (the folder name may differ depending on the tarball)
cd sipp.svn

# compile sipp
make

# compile sipp with pcapplay support
make pcapplay

Once we have installed SIPp and ruby, we can install sippy_cup via ruby gems.

gem install sippy_cup

Configuring sippy_cup

First we need to create a yml file for our call flow. There is good documentation available in the Readme on the various options that can be used to make the yml match our call flow. My call flow is pretty simple: I have a dialplan in my Docker FS which plays an mp3 file. Below is a simple yml config for this call flow; a filled-in example follows the skeleton.

source: <local_machine_ip>
destination: <docker_fs_ip>:<fs_port>
max_concurrent: <no_of_concurrent_calls>
calls_per_second: <calls_per_second>
number_of_calls: <total_no_of_calls>
to_user: <to_number>            # => should match the FS Dialplan
steps:                      # call flow steps
  - invite                  # Initial Call INVITE
  - wait_for_answer         # Waiting for Answer, handles 100, 180/183 and finally 200 OK
  - ack_answer              # ACK for the 200 OK
  - sleep 1000              # Sleeps for 1000 seconds
  - send_bye                # Sends BYE signal to FS
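To make that concrete, here is a hypothetical filled-in version of the same config. The IPs, port and call volumes below are made-up placeholders (they mirror the 20-call / 10-concurrent sample run shown later), so adjust them to your own setup:

$ cat > test.yml <<'EOF'
source: 192.168.1.146
destination: 10.10.0.5:5060
max_concurrent: 10
calls_per_second: 2
number_of_calls: 20
to_user: 111222333
steps:
  - invite
  - wait_for_answer
  - ack_answer
  - sleep 1000
  - send_bye
EOF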

Now let’s run sippy_cup using our config yml

sippy_cup -r test.yml

Below is the output of a sample load test. Total 20 calls with 10 concurrent calls

         INVITE ---------->      20        1         0
         100 <----------         20        0         0         0
         180 <----------         0         0         0         0
         183 <----------         0         0         0         0
         200 <----------  E-RTD1 20        0         0         0
         ACK ---------->         20        0
              [ NOP ]
         Pause [    30.0s]       20                            0
         BYE ---------->         20        0
------------------------------ Test Terminated --------------------------------


----------------------------- Statistics Screen ------- [1-9]: Change Screen --
  Start Time             | 2014-10-22   19:12:40.494470 1414030360.494470
  Last Reset Time        | 2014-10-22   19:13:45.355358 1414030425.355358
  Current Time           | 2014-10-22   19:13:45.355609 1414030425.355609
-------------------------+---------------------------+--------------------------
  Counter Name           | Periodic value            | Cumulative value
-------------------------+---------------------------+--------------------------
  Elapsed Time           | 00:00:00:000000           | 00:01:04:861000
  Call Rate              |    0.000 cps              |    0.308 cps
-------------------------+---------------------------+--------------------------
  Incoming call created  |        0                  |        0
  OutGoing call created  |        0                  |       20
  Total Call created     |                           |       20
  Current Call           |        0                  |
-------------------------+---------------------------+--------------------------
  Successful call        |        0                  |       20
  Failed call            |        0                  |        0
-------------------------+---------------------------+--------------------------
  Response Time 1        | 00:00:00:000000           | 00:00:01:252000
  Call Length            | 00:00:00:000000           | 00:00:31:255000
------------------------------ Test Terminated --------------------------------


I, [2014-10-22T19:13:45.357508 #17234]  INFO -- : Test completed successfully!

I tried to perform a larger-scale load test by making 1k calls with 250 concurrent calls. My local internet connection was flooded with network traffic, as there were real media packets coming back from the servers. Even though it bottlenecked my connection, I was still able to make 994 successful calls. I suggest doing such heavy load tests on machines which have good network throughput. Below is the output for this test.

------------------------------ Scenario Screen -------- [1-9]: Change Screen --
  Call-rate(length)   Port   Total-time  Total-calls  Remote-host
   5.0(0 ms)/1.000s   8836     585.61 s         1000  54.235.170.44:5060(UDP)

  Call limit reached (-m 1000), 0.507 s period  1 ms scheduler resolution
  6 calls (limit 250)                    Peak was 176 calls, after 150 s
  0 Running, 8 Paused, 1 Woken up
  604 dead call msg (discarded)          0 out-of-call msg (discarded)
  3 open sockets
  1490603 Total RTP pckts sent           0.000 last period RTP rate (kB/s)

                                 Messages  Retrans   Timeout   Unexpected-Msg
         INVITE ---------->      1000      332       0
         100 <----------         954       53        0         0
         180 <----------         0         0         0         0
         183 <----------         0         0         0         0
         200 <------2014-10-22  19:19:23.202714 1414030763.202714: Dead call 990-17510@192.168.1.146 (successful), 

received 'SIP/2.0 200 OK
Via: SIP/2.0/UDP 192.168.1.146:8836;received=208.66.27.62;branch=z9hG4bK-17510-990-8
From: "sipp" <sip:sipp@192.168.1.146>;tag=990
To: <sip:14158872327@54.235.170.44:5060>;tag=9p6t351mvXZXg
Call-ID: 990-17510@192.168.1.146
CSeq: 2 BYE
User-Agent: Plivo
Allow: INVITE, ACK, BYE, CANCEL, OPTIONS, MESSAGE, INFO, UPDATE, REFER, NOTIFY
Supported: timer, precondition, path, replaces
Conte----  E-RTD1        994       126       0         0
         ACK ---------->         994       126
              [ NOP ]
       Pause [    30.0s]         994                           0
         BYE ---------->         994       0
------------------------------ Test Terminated --------------------------------


----------------------------- Statistics Screen ------- [1-9]: Change Screen --
  Start Time             | 2014-10-22   19:15:29.941276 1414030529.941276
  Last Reset Time        | 2014-10-22   19:25:15.056475 1414031115.056475
  Current Time           | 2014-10-22   19:25:15.564038 1414031115.564038
-------------------------+---------------------------+--------------------------
  Counter Name           | Periodic value            | Cumulative value
-------------------------+---------------------------+--------------------------
  Elapsed Time           | 00:00:00:507000           | 00:09:45:622000
  Call Rate              |    0.000 cps              |    1.708 cps
-------------------------+---------------------------+--------------------------
  Incoming call created  |        0                  |        0
  OutGoing call created  |        0                  |     1000
  Total Call created     |                           |     1000
  Current Call           |        6                  |
-------------------------+---------------------------+--------------------------
  Successful call        |        0                  |      994
  Failed call            |        0                  |        0
-------------------------+---------------------------+--------------------------
  Response Time 1        | 00:00:00:000000           | 00:00:01:670000
  Call Length            | 00:00:00:000000           | 00:00:31:673000
------------------------------ Test Terminated --------------------------------

Sippy_cup is definitely a good tool for all beginners who find it really hard to work with SIPp XMLs. I'm really excited to see how Docker is going to contribute to the VoIP world.

CollectD, Elasticsearch, Kibana, logstash, Monitoring, Redis

Monitoring Redis Using CollectD and ELK

Redis is an open-source, networked, in-memory, key-value data store. It's heavily used everywhere, from web stacks to monitoring to message queues. Monitoring tools like Sensu already have some good scripts to monitor Redis. Last month during PyCon 2014, Plivo open-sourced a new rate-limited queue called SHARQ, which is based on Redis. So apart from just monitoring checks, we decided to keep a time series of what's happening in our Redis cluster. Since we are heavily using the ELK stack to visualize our infrastructure, we decided to go ahead with the same.

CollectD Redis Plugin

There is a cool CollectD plugin for Redis. It pulls a variety of data from Redis, including memory used, commands processed, number of connected clients and slaves, number of blocked clients, number of keys stored per db, uptime and changes since last save. The installation is pretty simple and straightforward.

$ apt-get update && apt-get install collectd

$ git clone https://github.com/powdahound/redis-collectd-plugin.git /tmp/redis-collectd-plugin

Now place the redis_info.py file into the collectd plugin folder and enable the Python plugin so that collectd can use this Python file. Below is our collectd conf.

Hostname    "<redis-server-fqdn>"
Interval 10
Timeout 4
Include "/etc/collectd/filters.conf"
Include "/etc/collectd/thresholds.conf"
LoadPlugin network
ReportStats true

        LogLevel info

Include "/etc/collectd/redis.conf"      # This is the configuration for the Redis plugin
<Plugin network>
    Server "<logstash-fqdn>" "<logstash-collectd-port>"
</Plugin>

Now copy the Redis Python plugin and its conf file into the collectd folder.

$ mkdir /etc/collectd/plugin            # This is where we are going to place our custom plugins

$ cp /tmp/redis-collectd-plugin/redis_info.py /etc/collectd/plugin/

$ cp /tmp/redis-collectd-plugin/redis.conf /etc/collectd/

By default, the plugin folder in redis.conf is defined as '/opt/collectd/lib/collectd/plugins/python'. Make sure to replace this with the location where we are copying the plugin file, in our case "/etc/collectd/plugin". Now let's restart the collectd daemon to enable the Redis plugin.
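If you prefer to script that change, a one-liner like the following should do it, assuming the stock path from the repository is still the one present in /etc/collectd/redis.conf:

$ sed -i 's|/opt/collectd/lib/collectd/plugins/python|/etc/collectd/plugin|' /etc/collectd/redis.conf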

$ /etc/init.d/collectd stop

$ /etc/init.d/collectd start

In my previous blog, I've mentioned how to enable and use the CollectD input in Logstash and how to use Kibana to plot the data coming from collectd (a minimal Logstash input sketch follows the list below). Below are the data points that we are receiving from CollectD on Logstash:

  1) type_instance: blocked_clients
  2) type_instance: evicted_keys
  3) type_instance: connected_slaves
  4) type_instance: commands_processed
  5) type_instance: connected_clients
  6) type_instance: used_memory 
  7) type_instance: <dbname>-keys
  8) type_instance: changes_since_last_save
  9) type_instance: uptime_in_seconds
10) type_instance: connections_received
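For reference, the Logstash side of this is just collectd's network output decoded by Logstash. A minimal sketch, assuming the collectd codec on a UDP input and collectd's default network port 25826 (the file path here is only an example):

$ cat > /etc/logstash/conf.d/collectd-input.conf <<'EOF'
input {
  udp {
    port  => 25826
    codec => collectd { }
  }
}
EOF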

Now we need to visualize these via Kibana. Let's create some Elasticsearch queries so that we can visualize them directly. Below are some sample queries created in the Kibana UI.

1) type_instance: "commands_processed" AND host: "<redis-host-fqdn>"
2) type_instance: "used_memory" AND host: "<redis-host-fqdn>"
3) type_instance: "connections_received" AND host: "<redis-host-fqdn>"
4) type_instance: "<dbname>-keys" AND host: "<redis-host-fqdn>"

Now that we have some sample queries, let's visualize them.

Histograms for the other metrics can be created in the same way by changing the selected queries.

Docker, Marathon, Mesos

Running Mesos With Native Docker Support

In my previous blog, i’ve explained how to manage Docker cluster using Mesos and Marathon. Now Mesos has released 0.20 with Native Docker support. Till Mesos 0.19, we had to use mesosphere’s Deimos for running Docker containers on Mesos slaves. Now from Mesos 0.20, Mesos has the ability to run Docker containers directly. As per Mesos Documentation we can run containers in two ways. 1) as a Task and 2) as an Executor. Currently Mesos 0.20 supports only the host (–net=host) Docker networking mode.

Install the latest Mesos 0.20 using Mesosphere's Mesos Debian package. Once we have set up ZooKeeper and Docker, we need to make a few changes to enable the native Docker support. We need to start all the Mesos slaves with the --containerizers=docker,mesos flag. To enable this flag, we create a file named containerizers in /etc/mesos-slave/ with the content "docker,mesos". The mesos-slave process, when starting up, will read that folder and enable the flag.

$ echo 'docker,mesos' > /etc/mesos-slave/containerizers

$ service mesos-slave restart

I was trying to start a container with port 9999 via the Marathon REST API, but the task kept failing. Upon investigating the logs, I found that the resource details the slave was sending to the Mesos master were total allocatable: cpus():0.8; mem():801; disk():35164; ports():[31000-31099, 31101-32000]. So the allowed port range was [31000-31099, 31101-32000]. If we want to use any other custom port, we need to define it in the /etc/mesos-slave folder by creating a config file.

    $ echo "ports(*):[31000-31099, 31101-32000, 9998-9999]" > "/etc/mesos-slave/resources"

Now the allocatable resource details sent to the Mesos master became total allocatable: ports():[31000-31099, 31101-32000, 9998-9998]; cpus():0.7; mem():801; disk():35164. The above config is crucial if we want to allocate our own custom range of ports, and it gave me a good idea of how to control resource allocation by creating custom configuration files. Now let's restart the mesos-slave process.

    $ service mesos-slave restart

    $ ps axf | grep mesos-slave | grep -v grep
     1889 ?        Ssl    0:26 /usr/local/sbin/mesos-slave --master=zk://localhost:2181,localhost:2182,localhost:2183/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --resources=ports(*):[31000-31099, 31101-32000, 9998-9999]
     1900 ?        S      0:00  \_ logger -p user.info -t mesos-slave[1889]
     1901 ?        S      0:00  \_ logger -p user.err -t mesos-slave[1889]

Now we have the Mesos slave running with native Docker support. But the current stable version of Marathon still doesn't support the native Docker feature of Mesos, so we need to set up Marathon 0.7 from scratch. I've written a blog on upgrading Marathon from 0.6.x to 0.7. Once we have set up Marathon 0.7, we can start using Mesos with native Docker support. Let's create a new app in Marathon via its REST API.

Create a JSON file in the new container format, say ubuntu.json.

{
    "id": "jackex",
    "container": {
        "docker": {
            "image": "ubuntu:14.04"
        },
        "type": "DOCKER",
        "volumes": []
    },
    "cmd": "while sleep 10; do date -u +%T; done",
    "cpus": 0.2,
    "mem": 200,
    "ports": [9999],
    "requirePorts": true,
    "instances": 1
}

Now, we can use the Marathon API to launch the Docker task,

$ curl -X POST -H "Content-Type: application/json" localhost:8080/v2/apps -d@ubuntu.json

{"id":"/jackex","cmd":"while sleep 10; do date -u +%T; done","args":null,"user":null,"env":{},"instances":1,"cpus":0.2,"mem":200.0,"disk":0.0,"executor":"","constraints":[],"uris":[],"storeUrls":[],"ports":[9999],"requirePorts":true,"backoffSeconds":1,"backoffFactor":1.15,"container":{"type":"DOCKER","volumes":[],"docker":{"image":"ubuntu:14.04"}},"healthChecks":[],"dependencies":[],"upgradeStrategy":{"minimumHealthCapacity":1.0},"version":"2014-08-31T16:03:50.593Z"}

$ docker ps
CONTAINER ID        IMAGE               COMMAND                CREATED             STATUS              PORTS               NAMES
25ba3950a9e1        ubuntu:14.04        /bin/sh -c 'while sl   2 minutes ago       Up 2 minutes                            mesos-a19af637-ffb9-4d3c-8b61-778d47087ace
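We can also ask Marathon itself about the app. A quick check against the REST API, using the "jackex" id from the JSON above (GET /v2/apps/<id> returns the current app definition and status):

$ curl localhost:8080/v2/apps/jackex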

Mesos UI

Marathon UI

With the new native Docker support, Mesos is becoming more and more Docker friendly. And this is a great achievement for people like me who are trying to use Mesos and Docker in our infrastructure. A big kudos to the Mesos and Mesosphere teams for their great work :)

Docker, Marathon, Mesos

Upgrading Marathon for Mesos Native Docker Support

A few days ago Mesos 0.20 was released with native Docker support. Until then, Mesos-Docker integration was performed using Mesosphere's Deimos application. With the new Mesos, there is no need for any external application for launching Docker containers on Mesos slaves. As per the Mesos documentation, we can run containers in two ways: 1) as a Task and 2) as an Executor. Currently Mesos 0.20 supports only the host (--net=host) Docker networking mode.

The current stable release of Marathon is 0.6.x, and the Debian package provided by Mesosphere also ships the 0.6.x version. The container format has been completely changed in 0.7.x, so we cannot use the 0.6.x version along with Mesos 0.20. More details are available in the Marathon documentation.

The current Marathon master branch is at version := "0.7.0-SNAPSHOT", so we need to compile Marathon from source. The only dependency for compiling Marathon is scala-sbt. Get the latest Debian package for sbt from here. The current sbt version is 0.13.5.

$ wget http://dl.bintray.com/sbt/debian/sbt-0.13.5.deb

$ dpkg -i sbt-0.13.5.deb

Now let’s clone the Marathon githib repo.

$ git clone https://github.com/mesosphere/marathon.git && cd marathon

$ sbt assembly

$ ./bin/build-distribution  # for building jar file

The above build command will create an executable jar file, "marathon-runnable.jar", under the "target" folder. We can use this jar file along with the Marathon 0.6.x start script. If we check the upstart script of Marathon 0.6.x, it uses the /usr/local/bin/marathon binary for starting the service. So we need to edit two lines in this file to use our new 0.7 jar file.

First let’s move our jar file to say “/opt/ folder.

$ cp marathon-runnable.jar /opt/marathon.jar

Now let’s edit the Marathon binary. Make the below changes in the binary file,

marathon_jar="/opt/marathon.jar" # Line number 21

and

exec java "${vm_opts[@]}" -jar "$marathon_jar" "$@"  # -jar option added to use the jar file instead of the default -cp option. Line number 62
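If you'd rather not edit the file by hand, the first change can be scripted. This is only a rough sketch and assumes the 0.6.x start script assigns the jar path to a marathon_jar variable at the start of a line, as shown above (the -cp to -jar change on the exec line is easiest to do manually, since its exact contents vary between packages):

$ sed -i 's|^marathon_jar=.*|marathon_jar="/opt/marathon.jar"|' /usr/local/bin/marathon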

Now let’s restart the Marathon service.

$ service marathon restart

Now let’s use the ps command and verify the service status.

$ ps axf | grep marathon | grep -v grep

22653 ?        Ssl    6:09 java -Xmx512m -Djava.library.path=/usr/local/lib -Djava.util.logging.SimpleFormatter.format=%2$s %5$s%6$s%n -jar /opt/marathon.jar --zk zk://localhost:2181,localhost:2182,localhost:2183/marathon --master zk://localhost:2181,localhost:2182,localhost:2183/mesos
22663 ?        S      0:00  \_ logger -p user.info -t marathon[22653]
22664 ?        S      0:00  \_ logger -p user.notice -t marathon[22653]

Now Marathon is running with the version 0.7 jar file. Let's verify this by creating a new Docker task. Make sure that Mesos 0.20 is running on the host and not an older version of Mesos. More details about the new container format are available on the Marathon documentation page.

Create a JSON file in the new container format, say ubuntu.json.

{
    "id": "mesos-docker-test",
    "container": {
        "docker": {
            "image": "ubuntu:14.04"
        },
        "type": "DOCKER",
        "volumes": []
    },
    "cmd": "while sleep 10; do date -u +%T; done",
    "cpus": 0.2,
    "mem": 200,
    "ports": [9999],
    "requirePorts": false,
    "instances": 1
}

Now, we can use the Marathon API to launch the Docker task,

$ curl -X POST -H "Content-Type: application/json" localhost:8080/v2/apps -d@ubuntu.json

{"id":"/mesos-docker-test","cmd":"while sleep 10; do date -u +%T; done","args":null,"user":null,"env":{},"instances":1,"cpus":0.2,"mem":200.0,"disk":0.0,"executor":"","constraints":[],"uris":[],"storeUrls":[],"ports":[9999],"requirePorts":false,"backoffSeconds":1,"backoffFactor":1.15,"container":{"type":"DOCKER","volumes":[],"docker":{"image":"ubuntu:14.04"}},"healthChecks":[],"dependencies":[],"upgradeStrategy":{"minimumHealthCapacity":1.0},"version":"2014-08-31T16:03:50.593Z"}

$ docker ps
CONTAINER ID        IMAGE               COMMAND                CREATED             STATUS              PORTS               NAMES
25ba3950a9e1        ubuntu:14.04        /bin/sh -c 'while sl   2 minutes ago       Up 2 minutes                            mesos-a19af637-ffb9-4d3c-8b61-778d47087ace

Here i’ve used Port number 9999. But by default, in Mesos 0.20, default port range resourced by Mesos Slave is [31000-32000]. Please refer my next Blog post on how to modify and set a custom port range.

With the new native Docker support, Mesos is becoming more and more Docker friendly. And this is a great achievement for people like me who are trying to use Mesos and Docker in our infrastructure. A big kudos to the Mesos and Mesosphere teams for their great work :)

Marathon, Mesos

Marathon Event Bus

It’s almost 3 months since i’ve started with Mesos + Marathon. Yesterday, i was going through the Marathon Doc Site, and found that Marathon has a cool internal Event Bus. Currently Marathon has an ibuilt HTTP callback subscriber that POSTs events in JSON format to one or more endpoints. So i decided to give it a try to my Mesos test environment. More documentations of the Event Bus are available in the Documentation page.

Currently, in my test environment, Mesos and Marathon are installed from the Mesosphere Debian packages. Whether we use the packaged init or upstart script, both of them directly call the /usr/local/bin/marathon binary. For the Event Bus with the HTTP callback, we need to enable two flags (--event_subscriber http_callback --http_endpoints http://host1/foo,http://host2/bar) while starting the Marathon service, i.e. --event_subscriber, the type of subscriber that we are going to use, and --http_endpoints, the endpoints corresponding to that subscriber.

From the marathon binary, I found that it looks for files under the /etc/marathon/conf/ folder, where each file is named after the flag to be enabled and the content of the file is the flag's value. So in our case, we need two files inside "/etc/marathon/conf/": 1) event_subscriber and 2) http_endpoints.
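Creating those two files is just a couple of echo commands; assuming the same local endpoint used for the netcat test later in this post:

$ mkdir -p /etc/marathon/conf

$ echo 'http_callback' > /etc/marathon/conf/event_subscriber

$ echo 'http://localhost:1234/' > /etc/marathon/conf/http_endpoints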

root@vagrant-ubuntu-trusty-64:/etc/marathon/conf# ls /etc/marathon/conf/
event_subscriber  http_endpoints

And the content of these files are,

root@vagrant-ubuntu-trusty-64:/etc/marathon/conf# cat /etc/marathon/conf/event_subscriber
http_callback

root@vagrant-ubuntu-trusty-64:/etc/marathon/conf# cat /etc/marathon/conf/http_endpoints
http://localhost:1234/

Now, for testing purposes, let's start a minimal web server using netcat, listening on port 1234.

$ nc -l -p 1234

Now netcat is listening on port 1234. Let's restart Marathon with our new flags.

$ service marathon restart

Now let's check whether the flags are enabled properly:

root@vagrant-ubuntu-trusty-64:/etc/marathon/conf# ps axf | grep marathon | grep -v grep
2780 ?        Sl     0:36 java -Xmx512m -Djava.library.path=/usr/local/lib -Djava.util.logging.SimpleFormatter.format=%2$s %5$s%6$s%n -cp /usr/local/bin/marathon mesosphere.marathon.Main --zk zk://localhost:2181,localhost:2182,localhost:2183/marathon --master zk://localhost:2181,localhost:2182,localhost:2183/mesos --http_endpoints http://localhost:1234/ --event_subscriber http_callback
2791 ?        S      0:00  \_ logger -p user.info -t marathon[2780]
2792 ?        S      0:00  \_ logger -p user.notice -t marathon[2780]

From the above ps command, we can see that the flags are enabled properly. Now I'm going to start a Docker container, since as per the Event Bus documentation, callbacks are fired every time Marathon receives an API request that modifies an app (create, update, delete).

$ curl -X POST -H "Content-Type: application/json" localhost:8080/v2/apps -d@ubuntu.json

where ubuntu.json is,

{
    "container": {
    "image": "docker:///libmesos/ubuntu",
    "options" : []
  },
  "id": "ubuntu2",
  "instances": "1",
  "cpus": ".3",
  "mem": "200",
  "uris": [ ],
  "ports": [9999],
  "cmd": "while sleep 10; do date -u +%T; done"
}

Once the Marathon API call is fired, we get a POST request on our netcat with the event details. Below is the event received on my netcat server.

root@vagrant-ubuntu-trusty-64:/var/tmp# nc -l -p 1234
POST / HTTP/1.1
Host: localhost:1234
Accept: application/json
User-Agent: spray-can/1.2.1
Content-Type: application/json; charset=UTF-8
Content-Length: 427

{"clientIp":"127.0.0.1","uri":"/v2/apps","appDefinition":{"id":"ubuntu2","cmd":"while sleep 10; do date -u +%T; done","env":{},"instances":1,"cpus":0.3,"mem":200.0,"disk":0.0,"executor":"","constraints":[],"uris":[],"ports":[9999],"taskRateLimit":1.0,"container":{"image":"docker:///libmesos/ubuntu","options":[]},"healthChecks":[],"version":{"dateTime":{}}},"eventType":"api_post_event","timestamp":"2014-08-25T13:23:27.059Z"}

This Event Bus is really useful, as it notifies us of all the event changes happening on our Mesos cluster. We can also build an event notification system based on these callbacks. Though the current events carry very minimal details, I'm sure that more event types will get added to Marathon soon.

Docker, Marathon, Mesos, Zookeeper

Managing HA Docker Cluster Using Multiple Mesos Masters

In my previous blog, i’ve described how to manage Docker cluster using Mesos and Marathon. It was using a single Mesos Master. But for production, we cannot go with a single Mesos master, as it will result in a sinlge point of failure. Mesos supports Multi master via Zookeeper. So in this blog i’m going to explain how to setup a Multi Master Mesos Cluster using Zookeper for Automatic promotion of Mesos master when the current Active Mesos Master fails. This time i’m going to setup 3 Zookeeper services and 2 Mesos Master on a single Vagrant box.

Setting up ZooKeeper Cluster

First let's download the latest stable ZooKeeper source code.

$ wget http://mirrors.ukfast.co.uk/sites/ftp.apache.org/zookeeper/stable/zookeeper-3.4.6.tar.gz

Now extract the tar file and create 3 copies of it, say zookeeper1, zookeeper2 and zookeeper3. ZooKeeper also needs Java on the machine, so let's install the Java dependencies.

$ apt-get install openjdk-7-jdk openjdk-7-jre

Now, inside each extracted ZooKeeper folder, we need to create a config file, zoo.cfg.

For zookeeper1 folder,

# content of zoo.cfg
tickTime=2000
dataDir=/var/lib/zookeeper1/
clientPort=2181
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890

For zookeeper2 folder,

# content of zoo.cfg
tickTime=2000
dataDir=/var/lib/zookeeper2/
clientPort=2182
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890

For zookeeper3 folder,

# content of zoo.cfg
tickTime=2000
dataDir=/var/lib/zookeeper3/
clientPort=2183
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
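One step that goes along with the configs above: in a multi-server setup, ZooKeeper requires each node's dataDir to exist and to contain a myid file whose number matches that node's server.N entry. With the dataDir values used here, that is:

$ mkdir -p /var/lib/zookeeper1 /var/lib/zookeeper2 /var/lib/zookeeper3

$ echo 1 > /var/lib/zookeeper1/myid

$ echo 2 > /var/lib/zookeeper2/myid

$ echo 3 > /var/lib/zookeeper3/myid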

Here, since the 3 ZooKeeper services are running on the same host, I've assigned separate client ports. If the ZooKeeper nodes are running on separate instances, we can use the same ports on all of them. Each server entry has two port numbers: the first is the one followers use to connect to the leader, and the second is used for leader election.

Now we can start ZooKeeper in the foreground using ./bin/zkServer.sh start-foreground from each of the 3 ZooKeeper folders. netstat -nltp | grep 288 will display the port of the ZooKeeper service which is the current leader of the cluster. We can also check connectivity using the ZooKeeper client binary available in the ZooKeeper folder. Once the ZooKeeper cluster is up, we can go ahead and set up Mesos.
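A quick sanity check from the shell (zkCli.sh ships in the same bin folder; the grep simply shows which node has bound its leader port):

$ netstat -nltp | grep 288

$ ./bin/zkCli.sh -server localhost:2181,localhost:2182,localhost:2183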

Setting up Multiple Mesos Masters

The Mesosphere team has already built packages for Mesos, Marathon and Deimos, so one of my Mesos masters will come from the package. But I also wanted to play with the Mesos source, so I decided to build Mesos from source; my second Mesos master will be built from scratch.

First Master from the Mesosphere package.

$ apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF

$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')

$ CODENAME=$(lsb_release -cs)

# Add the repository
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | sudo tee /etc/apt/sources.list.d/mesosphere.list

$ apt-get -y update

$ apt-get -y install mesos marathon deimos

Now we need to define the ZooKeeper cluster details so that the Mesos master and slave can connect. Edit /etc/mesos/zk and add zk://localhost:2181,localhost:2182,localhost:2183/mesos. More details about the ZooKeeper URL are available here. More details about installing Mesos from Mesosphere's Debian package are also available on their website.
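In other words:

$ echo 'zk://localhost:2181,localhost:2182,localhost:2183/mesos' > /etc/mesos/zk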

Now that we have one master ready, we can start the Mesos master, Mesos slave and Marathon, and ensure that they can connect to our ZooKeeper cluster properly. We can also see the connection status in the stdout of the ZooKeeper processes, as they are running in the foreground.
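With the Mesosphere packages these are plain init services, so (assuming the package defaults) starting them looks like this:

$ service mesos-master start

$ service mesos-slave start

$ service marathon start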

Now let's build the second Mesos master from scratch. First, download the Mesos source code.

$ wget http://archive.apache.org/dist/mesos/0.19.0/mesos-0.19.0.tar.gz

$ tar xvzf mesos-0.19.0.tar.gz && cd mesos-0.19.0

$ ./bootstrap && ./configure --prefix=/usr/local/mesos

$ make && make install

Once the compilation has succeeded, we can start the new Mesos master service. The default port 5050 is used by the existing master, so we need to run this new service on a different port.

$ /usr/local/mesos/sbin/mesos-master --zk=zk://localhost:2181,localhost:2182,localhost:2183/mesos --port=5054 --quorum=1 --registry=in_memory --work_dir=/var/lib/mesos/

Once the new Mesos master is up, we can create a test container via the Marathon REST API. Create a simple JSON file called ubuntu.json.

{
    "container": {
    "image": "docker:///libmesos/ubuntu",
    "options" : []
  },
  "id": "ubuntu",
  "instances": "1",
  "cpus": ".3",
  "mem": "200",
  "uris": [ ],
  "cmd": "while sleep 10; do date -u +%T; done"
}


$ curl -X POST -H "Content-Type: application/json" localhost:8080/v2/apps -d@ubuntu.json

This will launch a single container. We can check the status of the container via the Marathon UI as well as via the REST API. Now comes the critical part for production: what happens when the master fails? The Mesos documentation says that if the current master fails in a multi-master Mesos cluster, ZooKeeper will elect a new master, in our case the second Mesos master service. Mesos claims that the running services will not crash and that their state will be taken over by the newly promoted master. In our case, we have a Docker container running as a long-running service.
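Before triggering the failover, it's worth grabbing a baseline of the app over the REST API; a quick sketch using the "ubuntu" app id from the JSON above:

$ curl localhost:8080/v2/apps/ubuntu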

First i’ll stop one of the Mesos master service,

$ service mesos-master stop

Immediately afterwards, on the stdout of the second Mesos master (which is connected to ZooKeeper), we can see logs corresponding to the election process. Below is the relevant output:

I0814 19:41:07.889044  9093 contender.cpp:243] New candidate (id='7') has entered the contest for leadership
2014-08-14 19:41:07,896:9088(0x7ff8e2ffd700):ZOO_INFO@check_events@1750: session establishment complete on server [127.0.0.1:2183], sessionId=0x347d606d0d90003, negotiated timeout=10000
I0814 19:41:07.899509  9092 group.cpp:310] Group process ((10)@10.0.2.15:5054) connected to ZooKeeper
I0814 19:41:07.900049  9092 group.cpp:784] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0814 19:41:07.900069  9092 group.cpp:382] Trying to create path '/mesos' in ZooKeeper
I0814 19:41:07.915621  9092 detector.cpp:135] Detected a new leader: (id='6')
I0814 19:41:07.915699  9092 group.cpp:655] Trying to get '/mesos/info_0000000006' in ZooKeeper
I0814 19:41:07.920266  9094 network.hpp:423] ZooKeeper group memberships changed
I0814 19:41:07.920910  9094 group.cpp:655] Trying to get '/mesos/log_replicas/0000000003' in ZooKeeper
I0814 19:41:07.924144  9095 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.0.2.15:5054 }
I0814 19:41:07.926575  9092 detector.cpp:377] A new leading master (UPID=master@10.0.2.15:5054) is detected
I0814 19:41:07.926679  9092 master.cpp:957] The newly elected leader is master@10.0.2.15:5054 with id 20140814-194033-251789322-5050-8808

As per the above logs, the second Mesos master running on port 5054 was promoted as the new Mesos master. This is reflected in Marathon as well. We can go to the Marathon UI and make sure that the Docker process we started using the old Mesos master is still running fine. We can even make some scaling changes and verify that the new master can alter the process.

In my testing, i’ve tried stoping each of the service and ensured that only one Mesos master is running and also tried scaling the Apps from one Master service to make sure that the new Master was able to keep track of all these changes. The results were quite promising. Mesos + Marrthon + Docker indeed is killer combo. We can really built a cross vendor independent cluster with scaling capabilites.

Docker, FreeSwitch, Ubuntu

Load Test on Docker Freeswitch – Part 1

Docker is a very powerful tool for managing Linux containers. In my previous blog I've explained how to set up a Docker FreeSWITCH. Docker is very mature now; version 1.0 has already been released, and Docker is now supported by all major cloud vendors. Docker was showing promising results in my initial testing, so this time I decided to perform a heavy load test on the FreeSWITCH container to ensure that Docker can really enter telephony. Like any normal sysadmin, I was googling for FreeSWITCH load tests, and most of the results pointed to SIPp, an open-source test tool / traffic generator for the SIP protocol. SIPp didn't help me, as it started throwing errors beyond 320 simultaneous calls: the UDP connections were timing out, and increasing the timeout didn't help much.

So the next choice was to use FreeSWITCH itself to generate calls, using FreeSWITCH's originate command to create simultaneous calls and hit the Docker FreeSWITCH container. I also decided to collect all the system metrics, so that I know how the machine behaves under various load test conditions. For this I decided to use the CollectD and Graphite combo. CollectD 5+ has an inbuilt graphite plugin which can send the collectd metrics to a Graphite server.

I’ve already setup an Ubuntu-Freeswitch Docker image. First we need to pull the images from the Docker hub.

$ docker pull deepakmdass88/fs-ubuntu

Now i’m going to start the Docker FreeSwitch container in foreground.

$ docker run --rm --privileged -i -t -p 5060:5060/tcp -p 5060:5060/udp -p 16384:16384/udp -p 16385:16385/udp -p 16386:16386/udp -p 16387:16387/udp -p 16388:16388/udp -p 16389:16389/udp -p 16390:16390/udp -p 16391:16391/udp -p 16392:16392/udp -p 16393:16393/udp -p 5080:5080/tcp -p 5080:5080/udp deepakmdass88/fs-ubuntu /bin/bash

The --privileged option was enabled because the FreeSWITCH init script sets some custom ulimit values, so the container has to be given special privileges. The corresponding SIP and RTP ports are forwarded from the host to the container.

Now, before starting the FreeSWITCH service, we can set up the CollectD agent. By default, the Ubuntu repository contains CollectD version 4.10, but the graphite plugin is only available from version 5.0 onwards, so we can use a PPA which has the corresponding version available.

$ apt-get install python-software-properties

$ add-apt-repository ppa:joey-imbasciano/collectd5

$ apt-get update && apt-get install collectd

Now, in /etc/collectd.conf, uncomment LoadPlugin write_graphite. Also, in the same file, uncomment the plugin definition and fill in the server details.

<Plugin write_graphite>
  <Carbon>
        Host "dockergraphite.example.com"
        Port "2003"
        Protocol "tcp"
        LogSendErrors true
        Prefix "collectd."
        StoreRates true
        AlwaysAppendDS false
        EscapeCharacter "_"
  </Carbon>
</Plugin>

I’ve enabled a custom freeswitch plugin, which will extract the current ongoing calls count from freeswitch and sends it to the graphite server. Once the config changes are done we can restart the CollectD service. Now we can check our graphite UI to see if the default metrics like memory, load, cpu etc. are reaching the graphite server. Once CollectD-Graphite setup is ready, we can go ahead with our load test. So, once the call has reached the server, we need some Dialplan to continue the calls. So the simplest method is to create an infinite loop of playing some file, or some conference. Below are some dialplans that i’ve created in the public.xml

# Infinite Play Loop

 <extension name="111222333">
       <condition field="destination_number" expression="^111222333$">
         <action application="answer"/>
         <action application="playback" data="sounds/music/8000/got.wav"/>
         <action application="transfer" data="111222333 XML public"/>
       </condition>
    </extension>

# Test conference

  <extension name="docker-fs-test-conf">
    <condition field="destination_number" expression="^112233">
      <action application="answer"/>
      <action application="sleep" data="500"/>
      <action application="conference" data="docker-test@public"/>
    </condition>
  </extension>


# Default IVR menu

    <extension name="ivr_demo">
      <condition field="destination_number" expression="^5000$">
        <action application="answer"/>
        <action application="sleep" data="2000"/>
        <action application="ivr" data="demo_ivr"/>
      </condition>
    </extension>

Now that we have the dialplans ready, the next thing is authentication. By default there are two ways: digest auth and IP whitelisting. Here I'm going to use IP whitelisting, so we need to whitelist our IP in the acl.conf file.

 <list name="domains" default="deny">
      <!-- domain= is special it scans the domain from the directory to build the ACL -->
      <node type="allow" domain="$${domain}"/>
      <node type="allow" cidr="xxx.xxx.xxx.xxx/32"/>                 # IP of FS from which we are going to send the calls
      <!-- use cidr= if you wish to allow ip ranges to this domains acl. -->
      <!-- <node type="allow" cidr="192.168.0.0/24"/> -->
 </list>

Now we can start the Freeswitch service.

$ /etc/init.d/freeswitch start

We can check the freeswich service using the fs_cli command.

$ /usr/local/freeswitch/bin/fs_cli -x "show status"

UP 0 years, 0 days, 6 hours, 34 minutes, 59 seconds, 648 milliseconds, 56 microseconds
FreeSWITCH (Version 1.5.13b git 39200cd 2014-07-02 21:55:21Z 64bit) is ready
1068 session(s) since startup
0 session(s) - peak 299, last 5min 0
0 session(s) per Sec out of max 30, peak 29, last 5min 0
1000 session(s) max
min idle cpu 0.00/100.00
Current Stack Size/Max 240K/8192K

Now FreeSWITCH is ready to accept connections, and we can start sending calls from our load test FreeSWITCH. Below is the script that was used to originate the calls from the load test FreeSWITCH machine. It keeps the number of simultaneous calls towards the Docker FS at the requested level.

SIP_URI="sip:111222333@<docker-fs-ip>:5060"
MAX_CALLS=$1

while [ 1 ]; do

    declare -i req      # current channel count, treated as an integer
    req=$(/usr/local/freeswitch/bin/fs_cli -q -b -x "show channels count" | awk '{print $1}')
    if [ $req -lt $MAX_CALLS ]; then
        /usr/local/freeswitch/bin/fs_cli -q -b -x "bgapi originate sofia/external/$SIP_URI loadtest"
    else
        echo "sleep a bit ..."
        sleep 10s
    fi
done
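Assuming the loop above is saved as something like loadtest.sh (the name is arbitrary), a 500-call run is kicked off with:

$ bash loadtest.sh 500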

While bulk calls are being made from the load test FreeSWITCH machines, it's a good idea to test the quality in real time by dialing the extension directly from a SIP phone/client and making sure the voice quality is good. Below is my Graphite dashboard for the load test.

Default Graphite UI

Tessera UI

The FS was stable up to 500 simultaneous calls. After that there was a sudden drop in calls, the voice quality started degrading, and within a minute FreeSWITCH crashed with a segmentation fault. I'm going to analyze the core dump file to understand more about the crash. The other smaller drops that we see in the graph were caused by the load test FreeSWITCH machine, as its load was getting high when the number of calls was increased. But 500 simultaneous calls is pretty decent, and there was no issue in voice quality until the number of calls crossed 500. Though it's very difficult to make a final confirmation, I decided to go ahead with the phase 2 load test.

In the phase 2 test, I'm planning to use multiple FS load test machines to generate a large number of simultaneous calls, and to run 2 separate FS containers on the same host and split the incoming calls between them. Once the phase 2 test is completed, I'll share the results in another blog post. Docker is still under heavy development, and I'm sure Docker will be entering telephony soon.
