2014-03-10

LogStash ElasticSearch Index Cleanup

LogStash is a great way to track logs from lots of different sources and store them in a central location where you can do metrics and monitoring. I've started pushing LOTS of data into our setup, which uses the ElasticSearch back end. To quote their site, "ElasticSearch is a flexible and powerful open source, distributed, real-time search and analytics engine." I think it has a really bright future... but currently it's soaking up a lot of disk space. I'm sure I'm not the only one with this issue; after all, when something can handle LOADS of data, you want to give it all you've got! We've got 3 hosts running ElasticSearch processes, each with 250GB of data storage, and sometimes one will start to fill up.

Looking into the API, I found it's really, REALLY easy to delete old data to keep the size within requested parameters. First off, LogStash's ElasticSearch plugin notes that by default LogStash indexes are named "logstash-%{+YYYY.MM.dd}". Keeping that in mind, the following would work for any index as long as you know its name, but let's start off simple.

curl -s -XDELETE  'http://127.0.0.1:9200/logstash-2014.02.28'

That'll delete the "logstash-2014.02.28" index. I've had to connect in and do this by hand a few times. It's great when you need it on demand, but we can do better. Assuming I'm cool with keeping the last 7 days up there, let's write up a quick bash script:

#!/bin/bash
# Work out the index date from 7 days back (GNU date syntax) and delete that day's index
DATETODELETE=`date +%Y.%m.%d -d '7 days ago'`
curl -s -XDELETE http://127.0.0.1:9200/logstash-${DATETODELETE}
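
One caveat: that only removes the index for a single day, so if the job ever skips a run, the missed index sticks around. A variation on the same idea (just a sketch, assuming GNU date and the default index naming) is to sweep everything from 7 days old back out to 30:

#!/bin/bash
# Sweep a range of old daily indexes instead of a single day.
# Deleting an index that doesn't exist just returns an error from
# ElasticSearch, which is harmless here.
for DAYSAGO in `seq 7 30`
do
    DATETODELETE=`date +%Y.%m.%d -d "${DAYSAGO} days ago"`
    curl -s -XDELETE http://127.0.0.1:9200/logstash-${DATETODELETE}
done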

Now, we could put one of those in the crontab, have it run once or twice a day, and be good to go... And if you knew you could ALWAYS keep 7 days' worth of data on your system, that'd be acceptable. But let's have some more fun. Let's assume that we want to keep as much as we can on our system while still keeping 10% of the space free, and that the drive we store this on is mounted on /data:

#!/bin/bash
# About 10% of a 250GB (decimal GB, not GiB) volume, expressed in the
# 1K blocks that df reports by default
DESIRED=24410000
# Pull the "Available" column for the /data filesystem
AVAIL=`df /data|grep -v Filesystem|awk '{print $4}'`
if [ $AVAIL -lt $DESIRED ]
then
    # Delete the oldest logstash index reported by the _stats API
    curl -s -XDELETE 127.0.0.1:9200/`curl -s 127.0.0.1:9200/_stats?pretty|grep logstash|sort|awk -F'"' '{ print $2 }'|head -n1`
fi

Let's explain this sample a bit... First off, we set DESIRED to the amount of "Available" space we want the system to retain. In our case above, I calculated 10% of a 250GB drive and put that in. So if the available space ever drops below 10% remaining (90%+ used), the if statement will fire.

Next, I pull the available space. If you take what's in the backquotes and put it on a command line, you'll see what happens: I run df limited to just the filesystem I care about, grep strips the header line, and awk pulls out the 4th column (Avail). That number gets stored in AVAIL and we move on.
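
For example (the device name and numbers here are made up for illustration), on a mostly full 250GB /data volume it looks something like this:

$ df /data
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sdb1      244140625 225000000  14182016  95% /data

$ df /data|grep -v Filesystem|awk '{print $4}'
14182016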

The if statement then compares the two: if AVAIL is less than DESIRED, we're bumping up against our limit and something's got to give, so we run the curl... That curl is actually a combination of two. Working from the inside out, we do a "curl -s 127.0.0.1:9200/_stats?pretty", which prints out a list of indexes and a bunch of cool stats about them. Then we grep for logstash to get rid of all the cool stats and just keep the lines with the index names, sort puts them in order from oldest to newest (since the names contain dates like 2014.03.04, that works), and a little awk magic pulls out JUST the name of the index, stripping the other characters 'pretty' adds. head keeps the first (oldest) one, which gets placed back in the right spot for the outer curl to execute a DELETE on it, and bye-bye, index!
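
Here's the inner pipeline run in stages; the index names below are just examples, but the shape of the output is what matters:

$ curl -s 127.0.0.1:9200/_stats?pretty|grep logstash|sort
    "logstash-2014.03.04" : {
    "logstash-2014.03.05" : {
    "logstash-2014.03.06" : {

$ curl -s 127.0.0.1:9200/_stats?pretty|grep logstash|sort|awk -F'"' '{ print $2 }'|head -n1
logstash-2014.03.04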

If you put this script in your crontab and run it often (it won't do anything if the drive has more than the desired available space remaining), you'll be able to maintain free space on your ElasticSearch hosts without having to set a hard limit on the number of days to keep.
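
For example, if you saved the script somewhere like /usr/local/bin/es-cleanup.sh (the name and path are just placeholders), an hourly entry would look like this:

# m h dom mon dow command
0 * * * * /usr/local/bin/es-cleanup.sh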

Thinking about it further, you can use the same script with different commands inside the if statement to keep free space on many other systems as well.
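
As a quick sketch of that idea (the log path and threshold here are placeholders, not from any real setup), the same skeleton could prune the oldest rotated log instead:

#!/bin/bash
# Same pattern, different cleanup action: if /data dips below the
# threshold, remove the oldest rotated log file.
DESIRED=24410000
AVAIL=`df /data|grep -v Filesystem|awk '{print $4}'`
if [ $AVAIL -lt $DESIRED ]
then
    # ls -1t lists newest first, so tail -n1 gives the oldest
    OLDEST=`ls -1t /data/logs/*.gz 2>/dev/null|tail -n1`
    [ -n "$OLDEST" ] && rm -f "$OLDEST"
fi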
