Amazon IAM out of beta

I missed this yesterday, but Amazon Web Services has just announced the general availability (leaving beta) of their Identity and Access Management (IAM) service. One of the cool things about this is that there is now a tab for IAM in the web-based management console, which means you finally have an alternative to the previous “you have to do it all through the API for now” situation.
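
If you have been scripting IAM in the meantime, boto exposes the same API, and that option isn’t going anywhere. Here is a minimal sketch of what that looks like; the user name is made up, and it assumes a boto version with IAM support plus AWS credentials configured in your environment or boto config:

    import boto

    # Rough sketch of poking at IAM through boto instead of the new console
    # tab. The user name is purely illustrative.
    conn = boto.connect_iam()

    # Create a user and generate an access key pair for them.
    conn.create_user('example-deploy-user')
    key_response = conn.create_access_key('example-deploy-user')

    # The response contains the new AccessKeyId/SecretAccessKey pair.
    print(key_response)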

More details can be found on the Amazon Web Services blog.

S3 access log parsing/storage with Tamarin

We have been helping one of our clients move their massive collection of audio and video media to S3 over the last few weeks. After most of the files were in place, we saw that the usage reports for one of the buckets were showing much higher usage than expected. We ran some CSV usage report dumps to try to get a better idea of what was going on, but found ourselves wanting more details. For example:

  • Who are the biggest consumers of our media? (IP Addresses)
  • What are the most frequently downloaded files?
  • Are there any patterns suggesting that we are having our content scraped by bots or malicious users?
  • How do the top N users compare to the average user in resource consumption?

Enter: Bucket Logging

One of S3’s many useful features is Server Access Logging. The basic idea is that you go to the bucket you’d like to log, enable bucket logging, and tell S3 where to dump the logs. You then end up with a bunch of log keys in a format that resembles something you’d get from Apache or Nginx. We ran some quick and dirty scripts against a few days’ worth of data, but quickly found ourselves wanting to form more specific queries on the fly without having to maintain a bunch of utility scripts. We also needed to prepare for the scenario where we would have to automatically block users consuming disproportionately large amounts of bandwidth.
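
If you’d rather flip the switch from code than from the console, boto can do it too. A quick sketch, assuming boto 2.x; the bucket names and log prefix are made up:

    import boto

    # Rough sketch of enabling S3 server access logging with boto.
    # Bucket names are illustrative. The target bucket has to let the S3
    # log delivery group write into it, which set_as_logging_target() handles.
    conn = boto.connect_s3()
    source = conn.get_bucket('our-media-bucket')
    target = conn.get_bucket('our-log-bucket')

    # Grant S3's log delivery group permission to drop logs in the target.
    target.set_as_logging_target()

    # Dump access logs for the media bucket under the given key prefix.
    source.enable_logging(target, target_prefix='logs/media/')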

Tamarin screeches its way into existence

The answer for us ended up being to write an S3 access log parser with pyparsing, dumping the results into a Django model. We did the necessary legwork to get the parser working and tossed it up on GitHub as Tamarin. Complete documentation can be found at the link below.

Tamarin contains no real analytical tools itself; it is just a parser, two Django models, and a log puller (which retrieves S3 log keys and tosses them at the parser). Our analytical needs are going to be different from the next person’s, and we like to keep apps like this as focused as possible. We may well release apps in the future that leverage Tamarin for things like the automated blocking of bandwidth hogs mentioned above, or apps that plot out pretty graphs. However, those are best left to other apps so Tamarin can stay light, simple, and easy to tweak as needed.
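
To give a flavor of what keeping the analysis outside Tamarin looks like, here is a rough sketch of answering the first two questions above with the Django ORM. The model and field names here are assumptions for illustration only; check the Tamarin documentation for the actual schema:

    # Sketch of answering "who are the biggest consumers?" and "what gets
    # downloaded the most?" against Tamarin's log records. S3LogRecord and
    # the field names (remote_ip, request_key, bytes_sent) are assumptions
    # for illustration -- consult the Tamarin docs for the real ones.
    from django.db.models import Count, Sum

    from tamarin.models import S3LogRecord

    # Top 10 IPs by total bytes served.
    top_ips = (S3LogRecord.objects
               .values('remote_ip')
               .annotate(total_bytes=Sum('bytes_sent'))
               .order_by('-total_bytes')[:10])

    # Top 10 most frequently requested keys.
    top_keys = (S3LogRecord.objects
                .values('request_key')
                .annotate(hits=Count('id'))
                .order_by('-hits')[:10])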

Going back to our customer with higher-than-expected bandwidth usage, we ended up finding that aside from a few bots from Nigeria and Canada, usage patterns were pretty normal. The media that was uploaded into that bucket was never tracked for bandwidth usage on the old setup, so the high numbers were actually legitimate. With this in mind, we were able to go back to our client and present concrete evidence that they simply had a lot more traffic than previously imagined.

Where to go from here

If anyone ends up using Tamarin, please do leave a comment for me with any interesting queries you’ve built. We can toss some of them up on the documentation site for other people to draw inspiration from.

Source: https://github.com/duointeractive/tamarin

Documentation: http://duointeractive.github.com/tamarin/

django-ses + celery = Sea Cucumber

Maintaining, monitoring, and keeping a mail server in good standing can be pretty time-consuming. Having to worry about things like PTR records and being blacklisted because of a false positive stinks pretty bad. We also didn’t want to have to run and manage yet another machine. Fortunately, the recently released Amazon Simple Email Service (SES) takes care of this for us, with no fuss and at very cheap rates.

We (DUO Interactive) started using django-ses in production a few weeks ago, and things have hummed along without a hitch. We were able to drop django-ses into our deployment with maybe three lines altered. It’s just an email backend, so this is to be expected.
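
Roughly speaking, the swap is just a few settings.py lines along these lines (the credential values below are obviously placeholders; see the django-ses README for the authoritative version):

    # settings.py -- roughly the extent of the change needed to route
    # Django's outgoing mail through django-ses. Credentials are placeholders.
    EMAIL_BACKEND = 'django_ses.SESBackend'
    AWS_ACCESS_KEY_ID = 'your-access-key-id'
    AWS_SECRET_ACCESS_KEY = 'your-secret-access-key'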

Our initial deployment was for a project running on Amazon EC2, so the latency between it and SES was tiny, and reliability has been great. However, we wanted to be able to make use of SES on our Django projects that were outside of Amazon’s network. Also, even projects internal to AWS should have delivery re-tries and non-blocking sending (more on that later).

Slow-downs and hiccups and errors, oh my!

The big problem we saw with using django-ses on a deployment external to Amazon Web Services was that any kind of momentary slow-down or API error (they happen, but very rarely) resulted in a lost email. The django-ses email backend uses boto’s new SES API, which is blocking, so we also saw email-sending views slow down when there were bumps in network performance. This was obviously just bad design on our part, as views should not block waiting for email to be handed off to an external service.

django-ses is meant to be as simple as possible. We wanted to take django-ses’s simplicity and add the following:

  • Non-blocking calls for email sending from views. The user shouldn’t see a visible slow-down.
  • Automatic re-try for API calls to SES that fail. Ensures messages get delivered.
  • The ability to send emails through SES quickly, reliably, and efficiently from deployments external to Amazon Web Services.

The solution: Sea Cucumber

We ended up taking Harry Marr’s excellent django-ses and adapting it to use the (also awesome) django-celery. Celery has all of the things we needed built in (auto retry, async hand-off of background tasks), and we already have it in use for a number of other purposes. The end result is the now open-sourced Sea Cucumber Django app. It was more appropriate to fork the project, rather than tack something on to django-ses, as what we wanted to do did not mesh well with what was already there.
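
To make the moving parts concrete, here is a minimal sketch of the general pattern: a celery task hands the message to SES and retries if the API call hiccups. This is not Sea Cucumber’s actual implementation; the task name and retry settings are made up, and it assumes boto 2.x and celery/django-celery 2.x:

    # NOT Sea Cucumber's actual code -- just a minimal sketch of the
    # async-send-with-retry pattern that celery gives us for free.
    import boto
    from celery.task import task


    @task(default_retry_delay=60, max_retries=5)
    def send_via_ses(source, subject, body, to_addresses):
        """Hand one message off to SES, retrying if the API call fails."""
        try:
            conn = boto.connect_ses()
            conn.send_email(source, subject, body, to_addresses)
        except Exception as exc:
            # Re-queue this task; celery waits default_retry_delay seconds
            # between attempts, up to max_retries times.
            send_via_ses.retry(
                args=[source, subject, body, to_addresses], exc=exc)

A view then just calls send_via_ses.delay(...) and returns immediately, with the actual API call happening in a worker.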

An additional perk is that combining Sea Cucumber with django-celery’s handy admin views for monitoring tasks lets us have peace of mind that everything is working as it should.

Requirements

  • boto 2.0b4+
  • Django 1.2 and up, but we won’t turn down patches for restoring compatibility with earlier versions.
  • Python 2.5+
  • celery 2.x and django-celery 2.x

Using Sea Cucumber

  • You may install Sea Cucumber via pip: pip install seacucumber
  • You’ll also probably want to make sure you have the latest boto: pip install --upgrade boto
  • Register for SES.
  • Look at the Sea Cucumber README; a rough settings sketch follows below.
  • Profit.
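
By way of illustration, the settings side of things looks something like the following. This is a sketch only; double-check the backend path and the rest of the setup against the README:

    # settings.py -- a rough sketch; confirm the backend path and the rest
    # of the setup (INSTALLED_APPS entries for 'djcelery' and 'seacucumber',
    # celery config) against the Sea Cucumber README.
    EMAIL_BACKEND = 'seacucumber.backend.SESBackend'
    AWS_ACCESS_KEY_ID = 'your-access-key-id'          # placeholder
    AWS_SECRET_ACCESS_KEY = 'your-secret-access-key'  # placeholder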

Getting help

If you run into any issues, have questions, or would like to offer suggestions or ideas, you are encouraged to open an issue on our issue tracker. We also haunt the #duo room on FreeNode on weekdays.

Credit where it’s due

Harry Marr put a ton of work into boto’s new SES support within a day of Amazon releasing the service. He then went on to write django-ses. We are extremely thankful for all of his hard work, and thank him for cranking out a good chunk of code that Sea Cucumber still uses.

Amazon EC2 and long restart delays

For the benefit of others either considering Amazon’s EC2, or who are already there, I thought I’d point something out. I am not sure if this is an Ubuntu EC2 AMI issue, an EC2 issue, an EBS issue, or some combination of all of these, but we are experiencing some really erratic restart times. We run our EC2 instances with the following basic configuration:

  • Size: small and medium high-cpu
  • Root device: EBS
  • Distro: Ubuntu 10.10 (the latest)

Symptoms

The behavior we are seeing is that even our simplest instances, with no extra EBS volumes attached, periodically hang on boot when restarted. A restart can take anywhere from under 60 seconds, to 5 minutes, to not completing at all. In the ‘not at all’ case, something goes wrong to the point where the instance never gets far enough for us to even SSH in, and the syslog shown in the web-based management console looks to be out of date.

In the case of a complete hang on restart (it seems to be 50-50 right now), we have to reboot the machine again from the web-based AWS EC2 management console. This second reboot usually results in a full startup.

Our hunch

From what we can tell, this may be EBS-related. Even though we specifically have nobootwait set on our swap partition (which lives on the ephemeral drive included as standard with the Ubuntu AMI), it seems like Ubuntu may be freaking out when it can’t reach the root EBS volume. It’s also possible that, despite the nobootwait entry in the fstab, the same thing could be happening with the swap partition as well. We haven’t had a lot of time to try different combinations, as we’re insanely busy right now.
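
For reference, the sort of fstab entry we are talking about looks roughly like this. It is illustrative, not a copy of our actual file, and device names vary by instance type:

    # Illustrative /etc/fstab entry only; device names vary by instance type.
    # nobootwait tells Ubuntu's mountall not to hold up the boot if the
    # device never shows up.
    /dev/sda3   none   swap   sw,nobootwait   0   0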

If anyone else has experienced something similar, please chime in.