Reliably monitoring Cassandra with Statsd

I’m a newbie at Lookout, but already loving working here. One of the cool projects I was asked to help on was to merge the few Cassandra instances we have at Lookout into one. One of my freshly adopted rules is that “if it moves, graph it”. So, I started researching monitoring Cassandra.

I found that the stuff that’s out there is pretty scattered. Both nodetool and DataStax OpsCenter seem to rely on JMX to gather their statistics. JMX is really a “poll operation”, which means you have to stand up another service to do the polling if you want to look at trends.

The first thing I looked at was integrating with New Relic. It appeared to have a semi-supported Cassandra plugin with some issues. A few were easy to fix, but it would take a lot more work to make it secure enough for us. The first problem was that JMX didn’t use any credentials. This is fixable; it’s not hard to add credentials to JMX. In general, though, any ingress to a critical service like our Cassandra servers would need authentication, authorization, a security audit, and periodic vulnerability reviews.

The other problem with this approach is that the poller machine itself might fail, which means that all nodes might stop reporting. We might think our entire cluster has failed, and wake someone up, just because the reporting node crashes. The real issue here is that there’s no redundancy in this solution.

Lookout also uses graphite and statsd for gathering and displaying information, so I headed off to check that out instead, especially since Cassandra has native support for graphite. This would have worked for us, but we would still have some authorization issues. UDP outbound-only reporting of statsd is really what we wanted.

Despite Cassandra claiming that it actually has pluggable metrics, what that really means is you can plug in any metric reporter you want, as long as it’s console, csv, ganglia or graphite. If you want Cassandra to report directly to statsd via the Codahale metrics library using another open source project, you’re pretty much out of luck, unless…

An awesome way to plug extra code into an existing executable is to use a Java agent. This approach was taken by another open source project to get statsd support into Cassandra. This works great: each Cassandra node independently sends a UDP packet with its statistics to the statsd server. We don’t have to open any ports except for inbound UDP on the statsd server. Since UDP is connectionless, if the server is down we just lose a packet, and we’ll send another in 10 seconds.
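To make the fire-and-forget reporting concrete, here is a minimal sketch in Ruby of what sending one statsd gauge over UDP looks like. This is not the agent’s code (the agent is Java running inside Cassandra’s JVM); the helper names and the metric name are hypothetical, and only the `name:value|g` gauge line format comes from statsd itself.

```ruby
require 'socket'

# Format one statsd gauge line: "<name>:<value>|g".
def statsd_gauge(name, value)
  "#{name}:#{value}|g"
end

# Send a single datagram and never raise: if statsd is unreachable we
# simply lose this sample and the next interval sends a fresh one.
# Returns the number of bytes handed to the socket, or nil on failure.
def report_gauge(socket, host, port, name, value)
  socket.send(statsd_gauge(name, value), 0, host, port)
rescue SystemCallError, SocketError
  nil
end
```

A call such as `report_gauge(UDPSocket.new, 'localhost', 8125, 'cassandra.example.pending_tasks', 42)` would emit one sample; 8125 is the conventional statsd port and the metric name is made up for illustration.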

There were a few bugs and enhancements we needed to make this work with our infrastructure, so we forked and created a github repo for this.

Putting this all together

If you want statsd for cassandra, it’s super easy now. First, grab these two jars:

curl -L https://dl.bintray.com/lookout/systems/com/github/lookout/metrics/agent/1.0/agent-1.0.jar -o agent-1.0.jar
curl -L https://oss.sonatype.org/content/groups/public/com/bealetech/metrics-statsd/2.3.0/metrics-statsd-2.3.0.jar -o metrics-statsd-2.3.0.jar

Put these jars in cassandra’s lib directory.

Change cassandra startup to add this agent. This can be done in a stock install by adding the following to /etc/default/cassandra:

export JVM_OPTS="-javaagent:/usr/share/cassandra/lib/agent-1.0.jar=localhost"

Note the ‘=localhost’ at the end. If your statsd server is somewhere else, you should change this.

There are some additional installation details available in the README, including options to change the port number or reporting interval.


With this easily-cheffable plugin, we’re off to add additional monitoring to our cassandra clusters. What a great way to start 2015!




JRuby, sponsored in part by Lookout

In my time at Lookout we’ve had to grow in just about every way imaginable. New products, new applications, new people, new teams; growth in every dimension. In order to help engineering continue to grow and be successful within our service-oriented infrastructure, we created a team named “Core Systems” late this year. Among the team’s responsibilities are building tools, support engineering and maintaining systems foundational to a number of our products such as Kafka, Storm, Cassandra and so on. The systems we’ll cover in later blog posts, but for this post we’ll focus on our support of the JRuby toolchain.

JRuby Jay!

Since the beginning Lookout has primarily been a Ruby-based engineering team. As our deployment and tooling requirements have evolved, we’ve become more and more invested in the JVM, which means JRuby and some of the associated tooling (e.g. jbundler) are critical to the efficiency of our day-to-day work.

To help us support the JRuby toolchain internally we’ve welcomed Christian Meier, a very talented JRuby hacker, to the Lookout engineering team.

The benefits have been almost instantaneous, with a great number of changes contributed by Christian on Lookout’s behalf making their way into: JRuby 1.7.17, JRuby 1.7.18, jbundler 0.7.0 and jruby-openssl 0.9.6.

This post doesn’t aim to cover each commit and bug fix specifically, but will provide a general overview of the areas of focus for our bug fixes and improvements.

JRuby

In JRuby 1.7.13 and 1.7.14 the LoadService was refactored quite a bit to support operations such as: File.exists?, Dir['*'], and require 'somefile'. Despite best intentions, a number of regressions did sprout up after the releases went public.

File operations without “native” support

One series of regressions occurred when the native support from FFI would not be properly loaded, and the fallback to Java failed. The cases where the FFI libraries cannot load aren’t necessarily bugs since JRuby supports explicitly disabling loading of “native code” at runtime.

The failure of the Java-based fallback code would impact operations such as: jruby -S gem install my.gem if ~/.gem/jruby/1.9 was not a directory. Basically, any operation that relies on:

  • File.exists?
  • File.file?
  • File.directory?
  • File.executable?

And so on. The JRuby test suite does contain many tests for these calls, but they were only being run with native code enabled.

Fixing this class of bugs required not only fixing the tests to run both with and without native code enabled, but also fixing a number of underlying bugs in Java code.
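As a rough illustration of the kind of check involved, the sketch below exercises the predicates listed above against a missing path, a directory and an executable file. It is not taken from the JRuby test suite; `file_predicates` and `smoke_test` are hypothetical helpers, and the predicate is spelled `File.exist?` because the `exists?` alias is deprecated in newer Rubies. The idea would be to run the same script twice under JRuby, once normally and once with native support disabled (I believe that is `-Xnative.enabled=false`, but treat the flag as an assumption), and compare the results.

```ruby
require 'tmpdir'

# Collect the results of the file predicates for one path.
def file_predicates(path)
  {
    exist: File.exist?(path),
    file: File.file?(path),
    directory: File.directory?(path),
    executable: File.executable?(path)
  }
end

# Build a throwaway directory holding one executable script, then report
# the predicates for a missing path, the directory and the script.
def smoke_test
  Dir.mktmpdir do |dir|
    script = File.join(dir, 'run.sh')
    File.write(script, "#!/bin/sh\n")
    File.chmod(0o755, script)
    {
      missing: file_predicates(File.join(dir, 'missing')),
      directory: file_predicates(dir),
      script: file_predicates(script)
    }
  end
end
```

Both runs should produce identical hashes; any difference points at the Java fallback path.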

Testing on various JRuby “installations”

Previously much of JRuby’s testing focus was on the JRuby which you might install directly on your system. Because of the versatility and embeddability of JRuby however, there are a number of different forms of “installations” which look different than what you might get from RVM:

  • org.jruby:jruby:pom artifact
  • org.jruby:jruby:pom:noasm artifact which is the above with asm classes repacked to org.jruby.org.asm packages
  • jruby-jars.gem which is jruby.jar (from the install bundle) and org.jruby:jruby-stdlib:jar
  • org.jruby:jruby-complete:jar (which we use in our frankenwar and frankenjar applications)

The biggest difference between those methods and the regular JRuby installation (via rbenv, rvm, etc) is that the JRuby stdlib is inside a jar. Some features like the “default gems” bundled with JRuby (e.g. jruby-openssl) do not work when you set jruby.home to be classpath:/META-INF/jruby.home. This means users could not pick the correct version of jruby-openssl, or would even get a nasty MethodNotException if they attempted to.

JRuby gives some (unnecessary) preference for the Thread.currentThread().getContextClassLoader() and assumes that JRuby is loaded via this classloader. When running inside of a Frankenwar, or a Storm topology this is not always going to be the case.

By adding some tests to exercise these alternative forms of JRuby installations, a number of bugs were identified and corrected here as well.

URI-like paths

As mentioned in the previous section, JRuby supports a number of different forms of installations and can be easily embedded. A side-effect of this functionality is that a “file path” in JRuby can be a number of different things:

  • classpath:my/path/to/a/file/on/the/classpath
  • uri:bundle:0.17://my/path/inside/an/osgi/bundle
  • uri:classloader://my/path/to/classloader/of/jruby
  • file://my/path
  • file://my.jar!/some/path/inside/this/jar
  • my.jar!/some/path/inside/this/jar
  • jar:file://my.jar!/some/path/inside/this/jar

These URI-like paths can pop up very quickly when using constructs such as File.dirname(__FILE__) from a file within the stdlib, JRuby kernel or even code packed inside of a .jar file.
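To get a feel for why these paths are awkward to handle, here is a small sketch that classifies the forms listed above by prefix and splits off the entry inside a jar. It is purely illustrative string handling: `path_kind` and `split_jar_entry` are hypothetical helpers, and JRuby’s real resolution lives in its Java code, not in anything like this.

```ruby
# Classify a JRuby-style "file path" by its prefix and jar separator.
def path_kind(path)
  case path
  when /\Aclasspath:/       then :classpath
  when /\Auri:bundle:/      then :osgi_bundle
  when /\Auri:classloader:/ then :classloader
  when /\Ajar:/             then :jar
  when /\Afile:/            then path.include?('!/') ? :jar : :file
  else                           path.include?('!/') ? :jar : :plain
  end
end

# Split "archive!/entry" into [archive, entry]; entry is nil when the
# path does not point inside an archive.
def split_jar_entry(path)
  archive, entry = path.split('!/', 2)
  [archive, entry]
end
```

Note how the same jar can be reached through three different spellings, which is exactly what makes code that assumes a plain filesystem path break.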

Sometimes these URI-like paths need to be translated into a java.io.InputStream, as with X509Store#add_cert. The LoadService refactoring already had some support for handling these use cases, but it was error-prone. The older mechanisms for doing this had already been dropped in the next major development branch of JRuby (also known as JRuby 9000), but the 1.7.x branch needed some fixes too.

jruby-openssl

Concurrency problems

The X509Store had some synchronization issues discovered by some of our applications running in production. While not an egregious bug, fixing it was a good exercise for Lookout in getting production-level issues reported and triaged, and fixes merged into an open source project.

In this case the mutex for an X509Store instance was used by several classes and needed to be fixed and synchronized properly.
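The general shape of that kind of fix can be sketched in plain Ruby: the store owns a single lock, and every read or write of its shared state goes through it. `ToyCertStore` is a hypothetical stand-in and has nothing to do with the actual X509Store code, which is implemented in Java.

```ruby
# Hypothetical stand-in for a certificate store; NOT jruby-openssl's
# X509Store. One mutex, owned by the store, guards all shared state.
class ToyCertStore
  def initialize
    @lock = Mutex.new
    @certs = []
  end

  # Append under the lock so concurrent callers cannot interleave.
  def add_cert(cert)
    @lock.synchronize { @certs << cert }
    self
  end

  # Hand back a copy so callers never touch the guarded array directly.
  def certs
    @lock.synchronize { @certs.dup }
  end
end
```

On JRuby threads run truly in parallel, so unsynchronized access to shared state that a global interpreter lock tends to hide elsewhere can surface as real bugs in production.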

Certificate loading from URI-like paths

Related to the URI-like paths changes above, jruby-openssl also needed some updates to ensure that it could load certificates from within .jar files like jruby-stdlib.jar. Previously attempting to load these bundled certificates would always fail due to the aforementioned URI-like paths, but starting with JRuby 1.7.17 this behavior has finally been corrected.

Misc. fixes

jruby-openssl did see a lot of refactoring around Digest and Certificates prior to the 0.9.6 release. Ruby code was moved into Java, in addition to some cases where code was using Bouncycastle directly instead of the Java Crypto Extensions. After that refactoring there were some hairy test regressions which needed to be fixed against both JRuby 9000 and the JRuby 1.7.x branch of development in order to get the jruby-openssl 0.9.6 gem released.


Overall we’re thrilled to be sponsoring JRuby development and helping “raise all the boats” by getting bugs we’re finding into the open source ecosystem.

I’m certainly looking forward to what we can do in 2015 as an engineering organization, building great products with great technology, with regular open source contributions of course!

If you’re interested in helping to build a great security company and enjoy challenges, join Lookout. We’ve got a number of positions open, not only on Core Systems but across the board in engineering from Android, to iOS, Security, Operations, and ‘Platform and Infrastructure.’




An Introduction to Kafka (cross-post)

Last week a member of our Operations team, Brandon Burton, had an article published for sysadvent: Introduction to Kafka.

I wanted to provide an introduction to the operational side of Kafka. I won’t really get into using Kafka as part of your application but will provide some jumping off points to where to learn more about accomplishing that near the end of this post.

At a high-level, the Lookout engineering team has been migrating to Kafka as an alternative to ActiveMQ that better meets the evolving needs of the Lookout service-oriented architecture.

In the future we’ll have more blog posts on our usage of Kafka from the application/services standpoint, but if you’ve ever wanted to understand some core concepts around Kafka, Brandon’s post is a good introduction.
