Elasticsearch, Kibana, logstash, Monitoring, Plivo, SIP, Ubuntu, Voip

Extending ELK Stack to VOIP Infrastructure

Being a DevOps guy, I have always loved metrics. Visualized metrics give a good picture of what’s happening in our live battle stations, and there are now quite a lot of open source tools for monitoring and visualization. It’s been more than a year since I started using Logstash, and it has never let me down. Elasticsearch-Logstash-Kibana (ELK) is a killer combination. I started with Elasticsearch + Logstash as a log analyzer, and later StatsD and Graphite took it to the next level. A simple infrastructure is easy to monitor, but once the infra starts scaling, it becomes quite difficult to keep track of all the events happening inside each node. Service checks help, but they have their limits: I have faced a lot of scenarios where things break while the service checks still pass. In such scenarios logs are the only hope, because they capture all of these events.

At Plivo, we manage a variety of servers: SIP, media, proxy, web servers, DBs, etc. Since we are a fully cloud-based system, I really wanted a system that could keep track of all the live events and status of what’s really happening inside our infra. So my plan was to collect two important kinds of stats: 1) server events and 2) application events.

Collectd and Logstash

Collectd is a daemon which periodically collects system performance statistics. Since we have a lot of servers that handle real-time media, it’s a very critical component for us: we need to ensure that the servers are not getting overloaded and that there is no latency in the network. I’ve been using Logstash heavily for stashing all my logs, and it has a stable collectd input plugin, so collectd can send all of its system metrics to Logstash.

First we need to enable the network plugin, and then point it at our Logstash server’s IP and port so that collectd can start injecting metrics. Below is a sample collectd configuration.

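Hostname "test.plivo.com"
Interval 10
Timeout 4
Include "/etc/collectd/filters.conf"
Include "/etc/collectd/thresholds.conf"
# LogLevel is only valid inside a logging plugin block; syslog is assumed here
LoadPlugin syslog
<Plugin syslog>
    LogLevel info
</Plugin>
LoadPlugin interface
LoadPlugin load
LoadPlugin memory
LoadPlugin network
# Collect traffic counters for eth0 only
<Plugin interface>
    Interface "eth0"
    IgnoreSelected false
</Plugin>
# Send metrics to Logstash; ReportStats is an option of the network plugin
<Plugin network>
    Server "{logstash_server_ip}" "{logstash_server_port}"    # if no port number is mentioned, it will take the default port number (25826)
    ReportStats true
</Plugin>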

Now, on the Logstash server, we need to add the collectd plugin to the input section of Logstash’s config file.

input {
  collectd {
    port => 5555    # must match the port collectd sends to; the plugin default is 25826
  }
}

Now we are set. Based on the plugins enabled in the collectd config file, collectd will start sending metrics to Logstash at the Interval set in the config (the default is 10s). In my case, I wanted load, CPU usage, memory usage, bandwidth (TX and RX), etc. There are default plugins for all of these metrics, which we can simply enable in the config file. We also had some custom plugins to collect some custom metrics. BTW, writing a custom plugin is pretty easy in collectd.

Using Logstash’s Elasticsearch output plugin, we can store these metrics in Elasticsearch. This is where Kibana comes in: we can start visualizing these metrics via Kibana. We need to create a custom Lucene query for each metric, and once we have the queries, we can create a custom histogram for each of them. Below are some sample Lucene queries that we can use with Kibana.

For Load -> collectd_type:"load" AND host:"test.plivo.com"
For Network usage -> collectd_type:"if_octets" AND host:"test.plivo.com"
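
The two queries above differ only in the metric type, so they are easy to generate per host. Here is a minimal Python sketch of that idea. The helper names `collectd_query` and `search_body` are made up for illustration, and wrapping the string in a `query_string` query is just one way to send the same Lucene syntax to Elasticsearch directly, not necessarily what Kibana does internally.

```python
import json

def collectd_query(collectd_type, host):
    """Build a Lucene query string for one collectd metric type on one host."""
    return f'collectd_type:"{collectd_type}" AND host:"{host}"'

def search_body(lucene_query):
    """Wrap a Lucene query string in an Elasticsearch query_string query."""
    return {"query": {"query_string": {"query": lucene_query}}}

# Same query as the "For Load" example above
load_query = collectd_query("load", "test.plivo.com")
print(load_query)
print(json.dumps(search_body(load_query)))
```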

Below is a screenshot of the histograms for Load and Network (TX and RX).

Log Events

Next is collecting events from the application logs. We use the SIP protocol for all our VoIP sessions, so our SIP servers are very critical for us. SIP is pretty similar to HTTP, and its response codes follow the same pattern as HTTP’s, i.e. 1xx, 2xx, 3xx, 4xx, 5xx, with SIP adding a 6xx class for global failures. So I wrote some custom grok patterns to keep track of all of these responses and store them in Elasticsearch.
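
The grok patterns themselves are not shown here, but the idea can be sketched with an ordinary regex. The Python below is only an illustration, not the actual patterns: it assumes the logs contain a SIP status line such as `SIP/2.0 486 Busy Here`, and the names `sip_response_class` and `count_by_class` are hypothetical.

```python
import re
from collections import Counter

# Assumed log format: a SIP status line "SIP/2.0 <3-digit code> <reason>"
SIP_STATUS = re.compile(r"SIP/2\.0 ([1-6]\d{2})\b")

def sip_response_class(line):
    """Return the response class ("1xx" to "6xx"), or None if no status line is found."""
    match = SIP_STATUS.search(line)
    if match is None:
        return None
    return match.group(1)[0] + "xx"

def count_by_class(lines):
    """Count responses per class, skipping lines without a status line."""
    classes = (sip_response_class(line) for line in lines)
    return Counter(c for c in classes if c is not None)

sample_lines = [
    "SIP/2.0 100 Trying",
    "SIP/2.0 200 OK",
    "SIP/2.0 486 Busy Here",
    "INVITE sip:bob@example.com SIP/2.0",  # a request, not a response
]
print(count_by_class(sample_lines))
```

In a grok filter the same capture would become a named field, which is what makes the per-class histograms possible in Kibana.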

The second set of stats I was interested in came from our SIP registrar server. We provide SIP endpoints to our customers so that they can use them with SIP phones and softphones, so I was most interested in stats like the number of registrations/sec and auth error rates. Plus, using Elasticsearch’s map facets, I can create a BetterMap. In my previous blog posts I’ve described how to create these bettermaps using Kibana and Elasticsearch. The bettermap screenshot below shows the SIP endpoint registrations from various locations over the last 2 hours.

Now, using Kibana, we can start visualizing all of this data. Below is a sample dashboard that I’ve created using Kibana.

The ELK stack proved to be an amazing combination. We are currently injecting 3 million events every day, and Elasticsearch has been blazingly fast at indexing all of them.
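
To make those two registrar stats concrete, here is a rough Python sketch of how they could be computed from parsed events. Everything in it is an assumption for illustration: the `(timestamp, status)` event shape, the function names, and especially which codes count as auth errors. Only 403 is counted by default, since a 401 on REGISTER is usually just the normal digest challenge.

```python
from collections import Counter

def registrations_per_second(events):
    """Count successful REGISTER responses (status 200) per whole second."""
    return Counter(int(ts) for ts, status in events if status == 200)

def auth_error_rate(events, error_codes=(403,)):
    """Fraction of responses whose status is in error_codes (an assumed set)."""
    if not events:
        return 0.0
    errors = sum(1 for _, status in events if status in error_codes)
    return errors / len(events)

# Hypothetical (unix_timestamp, sip_status) pairs for REGISTER responses
events = [(100.1, 200), (100.7, 200), (101.2, 403), (101.9, 200)]
print(registrations_per_second(events))
print(auth_error_rate(events))
```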


16 thoughts on “Extending ELK Stack to VOIP Infrastructure”

  1. Lovely dashboard. Curious to know where you define those lucene queries in the histogram panel? I don’t see a place to put the query on the histogram panel config. Did you have to customize kibana?

    • Hello Ramesh,

      Yes, you can define the Lucene queries in the query section, and when you create a histogram you can manually select queries so that only their results are displayed. While creating the histogram, choose “selected” from the drop-down in the query option; it will show you all the queries that you have entered in Kibana’s query section, and you can select whichever you want.

  2. Eric says:

    How do you tell Kibana to graph the CPU load values? I can only get a graph of the number of CPU metrics collected, not the values inside the metric.

    • Hello Eric,

      By default, the load plugin of collectd sends out 3 parameters.

      plugin load

      longterm 0.13
      midterm 0.14
      shortterm 0.12

      These correspond to the 15-, 5-, and 1-minute load averages in *nix, respectively. So while creating the histogram, do not select “count”: count only gives the number of events received. Instead, select the “Chart” value as “min” and the value field as “shortterm”, since shortterm corresponds to the 1-minute load avg 🙂

      • Eric says:

        Thanks Deepak, I figured it out, though I used “total” instead of “min” and it seemed to work. Do you know the difference between the two? Which values is it taking a total of, or taking the min of? Is it taking the total or the min of the shortterm values over some time period?

  3. Interesting, I never thought of sending collectd metrics to logstash. Do you still send them to statsd/graphite as well? Or are you using logstash/elasticsearch/kibana for all of your metric needs?

    • Justyn, I’m still using collectd-logstash + ELK for all my system metrics. For now I’m just plotting my system stats without much computation, so ELK is sufficient for me. But there are some cool UIs coming for Graphite too, plus I want to try out metric2.0 (metric20.org) with Graphite soon.

  4. Pingback: Monitoring Redis Using CollectD and ELK | beingasysadmin

  5. Rajkumar Rajendran says:

    Hi Deepak,

    It is a wonderful article. Please let me know how you created the dashboard for live calls and the other dashboards, such as outbound SIP, and which fields you took into consideration when creating them.

    It would be helpful for my elk solution.

    Thanks,
    Raj

  6. Rajkumar Rajendran says:

    It is a great article. Just a small query: is it possible to monitor the VoIP server, I mean the call quality with respect to latency, packet loss, and jitter, in Kibana (ELK stack)?

    Need your input on this.
