
Monitoring Redis Using CollectD and ELK

Redis is an open-source, networked, in-memory, key-value data store. It's heavily used everywhere, from web stacks to monitoring to message queues. Monitoring tools like Sensu already have some good scripts to monitor Redis. Last month during PyCon 2014, Plivo open-sourced a new rate-limited queue called SHARQ, which is built on top of Redis. So apart from the usual monitoring checks, we decided to keep a time-series record of what's happening in our Redis cluster. Since we already use the ELK stack heavily to visualize our infrastructure, we decided to go ahead with the same.

CollectD Redis Plugin

There is a cool CollectD plugin for Redis. It pulls a variety of data from Redis, including memory used, commands processed, number of connected clients and slaves, number of blocked clients, number of keys stored per database, uptime, and changes since the last save. The installation is pretty simple and straightforward.

$ apt-get update && apt-get install collectd

$ git clone https://github.com/powdahound/redis-collectd-plugin.git /tmp/redis-collectd-plugin

Now place the redis_info.py file in the collectd folder and enable the Python plugin so that collectd can use this Python file. Below is our collectd conf.

Hostname    "<redis-server-fqdn>"
Interval 10
Timeout 4
Include "/etc/collectd/filters.conf"
Include "/etc/collectd/thresholds.conf"

LoadPlugin syslog                       # assuming syslog for collectd's own logging; LogLevel is a per-plugin option
<Plugin syslog>
    LogLevel info
</Plugin>

LoadPlugin network
Include "/etc/collectd/redis.conf"      # This is the configuration for the Redis plugin
<Plugin network>
    Server "<logstash-fqdn>" "<logstash-collectd-port>"
    ReportStats true                    # statistics about the network plugin itself
</Plugin>

Now copy the Redis Python plugin and its conf file to the collectd folder.

$ mkdir /etc/collectd/plugin            # This is where we are going to place our custom plugins

$ cp /tmp/redis-collectd-plugin/redis_info.py /etc/collectd/plugin/

$ cp /tmp/redis-collectd-plugin/redis.conf /etc/collectd/

By default, the plugin folder (ModulePath) in redis.conf is defined as '/opt/collectd/lib/collectd/plugins/python'. Make sure to replace this with the location where we copied the plugin file, in our case "/etc/collectd/plugin".
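After the edit, the relevant part of redis.conf should look roughly like the snippet below (the structure follows the conf file shipped with the plugin; the Redis host and port here are illustrative):

<LoadPlugin python>
    Globals true
</LoadPlugin>

<Plugin python>
    ModulePath "/etc/collectd/plugin"   # where we copied redis_info.py
    Import "redis_info"

    <Module redis_info>
        Host "localhost"                # illustrative; point this at your Redis instance
        Port 6379
        Verbose false
    </Module>
</Plugin>

Now let's restart the collectd daemon to enable the Redis plugin.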

$ /etc/init.d/collectd stop

$ /etc/init.d/collectd start

In my previous blog post, I've mentioned how to enable and use the collectd input plugin in Logstash and how to use Kibana to plot the data coming from collectd. Below are the metrics that we are receiving from collectd in Logstash (a sample event is shown after the list):

  1) type_instance: blocked_clients
  2) type_instance: evicted_keys
  3) type_instance: connected_slaves
  4) type_instance: commands_processed
  5) type_instance: connected_clients
  6) type_instance: used_memory 
  7) type_instance: <dbname>-keys
  8) type_instance: changes_since_last_save
  9) type_instance: uptime_in_seconds
10) type_instance: connections_received
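
For reference, a single collectd metric, once decoded by Logstash, ends up as a document roughly like the one below (the field names come from the collectd input plugin; the values are purely illustrative):

{
  "host"          : "<redis-server-fqdn>",
  "@timestamp"    : "2014-06-01T10:00:00.000Z",
  "plugin"        : "redis_info",
  "collectd_type" : "gauge",
  "type_instance" : "connected_clients",
  "value"         : 42
}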

Now we need to visualize these via Kibana. Let's create some Elasticsearch queries so that we can visualize them directly. Below are some sample queries created in the Kibana UI.

1) type_instance: "commands_processed" AND host: "<redis-host-fqdn>"
2) type_instance: "used_memory" AND host: "<redis-host-fqdn>"
3) type_instance: "connections_received" AND host: "<redis-host-fqdn>"
4) type_instance: "<dbname>-keys" AND host: "<redis-host-fqdn>"

Now that we have some sample queries, let's visualize them.

Now create histograms following the same procedure, changing the selected queries for each panel.


Extending ELK Stack to VOIP Infrastructure

Being a DevOps guy, I always love metrics. Visualized metrics give a good picture of what's happening in our live battle stations. There are now quite a lot of open-source tools for monitoring and visualizing. It's more than a year since I started using Logstash, and it has never turned me down. Elasticsearch-Logstash-Kibana (ELK) is a killer combination. Though I started with Elasticsearch + Logstash as a log analyzer, StatsD and Graphite later took it to the next level. When we have a simple infrastructure, it's easy to monitor. But when the infra starts scaling, it becomes quite difficult to keep track of all the events happening inside each node. Service checks can help, but they have their limitations. I've faced a lot of scenarios where things break while the service checks stay green. In such scenarios, logs are the only hope; they have all these events captured.

At Plivo, we manage a variety of servers: SIP, media, proxy, web servers, databases, etc. Being a fully cloud-based system, I really wanted something that could keep track of all the live events and the status of what's really happening inside our infra. So my plan was to collect two kinds of stats: 1) server events and 2) application events.

Collectd and Logstash

Collectd is a daemon which collects system performance statistics periodically. Since we have a lot of servers that handle realtime media, it's a very critical component for us. We need to ensure that the servers are not getting overloaded and that there is no latency in the network. I've been using Logstash heavily for stashing all my logs, and there is a stable input plugin for collectd to send all the system metrics to Logstash.

First we need to enable the network plugin, and then we need to mention our Logstash server IP and port so that collectd can start shipping metrics. Below is a sample collectd configuration.

Hostname    "test.plivo.com"
Interval 10
Timeout 4
Include "/etc/collectd/filters.conf"
Include "/etc/collectd/thresholds.conf"

LoadPlugin syslog                       # assuming syslog for collectd's own logging; LogLevel is a per-plugin option
<Plugin syslog>
    LogLevel info
</Plugin>

LoadPlugin interface
LoadPlugin load
LoadPlugin memory
LoadPlugin network

<Plugin interface>
    Interface "eth0"
    IgnoreSelected false
</Plugin>

<Plugin network>
    Server "{logstash_server_ip}" "{logstash_server_port}"    # if no port number is mentioned, it will take the default port number (25826)
    ReportStats true                    # statistics about the network plugin itself
</Plugin>

Now on the Logstash server, we need to add the collectd input plugin to the input section of Logstash's config file.

input {
  collectd {
    port => "5555"    # default port is 25826
  }
}

Now we are set. Based on the plugins enabled in the collectd config file, collectd will start sending the metrics to Logstash at the interval mentioned in the config (the default is 10s). In my case, I wanted load, CPU usage, memory usage, bandwidth (TX and RX), etc. There are default plugins for all of these metrics, which we can simply enable in the config file. We also have some custom plugins to collect some custom metrics; by the way, writing a custom plugin for collectd is pretty easy (see the sketch below).
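
As an illustration of how small a custom plugin can be, here is a minimal sketch of a collectd Python plugin (the plugin name and the metric it reports are made up for the example; it is loaded the same way as the Redis plugin above, via the Python plugin's ModulePath and Import directives):

# my_metric.py -- a hypothetical, minimal collectd Python plugin
import collectd

def read_callback():
    # Fetch or compute whatever value you care about (hard-coded here for illustration)
    value = 42
    metric = collectd.Values(plugin='my_custom_plugin',    # hypothetical plugin name
                             type='gauge',
                             type_instance='my_metric')
    metric.dispatch(values=[value])

# collectd calls this function on every read interval
collectd.register_read(read_callback)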

Now, using Logstash's elasticsearch output plugin, we can keep these metrics in Elasticsearch. This is where Kibana comes in: we can start visualizing these metrics via Kibana. We need to create a custom Lucene query, and once we have the query, we can create a custom histogram for each of them.
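
A minimal output section would look something like the snippet below (this assumes the Logstash 1.x elasticsearch output syntax; the hostname is a placeholder for your Elasticsearch node):

output {
  elasticsearch {
    host => "<elasticsearch-fqdn>"    # placeholder; the node Logstash should index into
  }
}

Below are some sample Lucene queries that we can use with Kibana.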

For Load -> collectd_type:"load" AND host:"test.plivo.com"
For Network usage -> collectd_type:"if_octets" AND host:"test.plivo.com"

Below is a screenshot of the histograms for load and network (TX and RX).

Log Events

Next is to collect the events from the application logs. We use the SIP protocol for all our VoIP sessions, so all our SIP servers are very critical for us. SIP is pretty similar to HTTP; its response codes closely mirror HTTP responses, i.e. 1xx, 2xx, 3xx, 4xx, 5xx, 6xx. So I wrote some custom grok patterns to keep track of all of these responses and store them in Elasticsearch.
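
The exact grok patterns depend on the SIP server's log format, so the snippet below is only a rough sketch; it assumes the raw SIP status line (e.g. "SIP/2.0 486 Busy Here") appears in the log message, and the field names sip_response_code and sip_response_reason are ones I've chosen for the example:

filter {
  grok {
    match => [ "message", "SIP/2.0 %{NUMBER:sip_response_code:int} %{GREEDYDATA:sip_response_reason}" ]
  }
}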

The second set of stats I was interested in came from our SIP registrar server. We provide SIP endpoints to our customers so that they can use them with SIP/soft phones. So I was more interested in stats like the number of registrations per second and auth error rates. Plus, using Elasticsearch's map facets, I can create a BetterMap. In my previous blog posts I've mentioned how to create these bettermaps using Kibana and Elasticsearch. The bettermap screenshot below shows us the SIP endpoint registrations from various locations in the last 2 hours.

Now, using Kibana, we can start visualizing all this data. Below is a sample dashboard that I've created using Kibana.

The ELK stack proved to be an amazing combination. We are currently injecting 3 million events every day, and Elasticsearch has been blazingly fast in indexing all of them.


Near RealTime Dashboard with Kibana and Elasticsearch

Being in DevOps means always multitasking: regular customer queries, monitoring, troubleshooting, etc. And of course, when things break, it really becomes hardcore multitasking. Especially with a rapidly scaling infrastructure, we need to understand what's really happening inside it. Yes, we do have many new-generation cloud monitoring tools like Sensu, but what if we had a near real-time system that could tell us about each and every event happening in our infrastructure? Logs are the best place to keep track of events, even those the monitoring tools have missed. We have a lot of log aggregation tools like Logstash, Splunk, Apache Kafka, etc., and for log-based event collection the common choice is usually Logstash -> StatsD -> Graphite, with Elasticsearch for indexing.

My requirement was pretty straightforward: record the events, aggregate them, and keep track of them over time. Kibana uses Elasticsearch facets for aggregating the search query results; facets provide aggregated data based on a search query. So as a first task, I decided to visualize the location of users who are registering their SIP endpoints on our SIP registrar server. Kibana gives us a good interface for a 2D heat map, as well as a newer option called BetterMap. A bettermap uses geographic coordinates to create clusters of markers on a map and shades them orange, yellow, and green depending on the density of the cluster.

From the logs, I extracted the register events and used custom regex patterns in Logstash to pull out details like the source IP, usernames, etc. Using Logstash's GeoIP filter, the geo location of each IP can be identified. For the BetterMap, we need coordinates in GeoJSON format, which is a [longitude, latitude] array. From the geo location identified by the GeoIP filter, we can build this GeoJSON for each event that we receive. Below is a sample snippet that I've used in logstash.conf for creating the GeoJSON in Logstash.

if [source_ip] {
  geoip {
    type => "kamailio-registers"
    source => "source_ip"
    target => "geoip"
    add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
    add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}" ]
  }
  mutate {
    convert => [ "[geoip][coordinates]", "float" ]
  }
}

The above filter will create a GeoJSON array, "geoip.coordinates", which can be used for creating the BetterMap in Kibana. Below are the settings for creating a BetterMap panel in the Kibana dashboard: while adding a new panel, select "bettermap" as the panel type, and the coordinate field should be the one which contains the GeoJSON data. Make sure that the data is in [longitude, latitude] order, i.e. longitude first, followed by latitude.
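
For reference, the resulting panel in the Kibana 3 dashboard JSON ends up looking roughly like this (exact option names can vary between Kibana 3 releases, and only the relevant settings are shown):

{
  "type"  : "bettermap",
  "title" : "SIP endpoint registrations",
  "field" : "geoip.coordinates",
  "size"  : 1000,
  "span"  : 12
}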

Moving ahead, I decided to collect the events happening on various other servers. We were one of the earliest companies to start using SIP (Session Initiation Protocol). SIP employs design elements similar to the HTTP request/response transaction model. So, similar to web traffic, I decided to collect events related to 4XX, 5XX, and 6XX error responses, as these are very important to us. Once the logs are shipped to Logstash, another custom grok pattern extracts the error code and error response, including the server which returned it. This data can also be used for future analysis, so I decided to store it in Elasticsearch.

Now we have the real-time event data stored, but how do we visualize it? Since I don't have to perform much mathematical analysis on the data, I decided to drop Graphite. Kibana has a wonderful GUI for visualizing data, so I decided to go ahead with Kibana. One option is the "histogram" panel type: using a histogram we can visualize the data as a regular bar graph as well as an area graph. There is another panel type called "terms" which can be used to display the aggregated events as a pie chart, bar chart, or table.
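
As an example, assuming grok has extracted a numeric response-code field (say sip_response_code, a name chosen for illustration), Lucene range queries like these can drive the histogram or terms panels:

sip_response_code:[400 TO 499] AND host:"<sip-server-fqdn>"
sip_response_code:[500 TO 599] AND host:"<sip-server-fqdn>"

And below is what I achieved with Kibana.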

This is just an initial setup, and I'm going to add more events to it. As of now, Kibana + Elasticsearch is proving to be a promising combination for displaying all the near real-time events happening in my infrastructure.
