
2016-12-21

Keeping a process running with Flock and Cron

We've got a few processes here that aren't system services, but need to be running in the background and should be restarted if they die. This method also works for a cron job that often runs past its normal re-execution time (say, an every-5-minutes cron that sometimes runs for 7 minutes); it will prevent multiple executions from running simultaneously.

First off, in your crontab, you can add a line like this:

* * * * * flock -x -n /tmp/awesomenessRunning.lock -c "/usr/local/bin/myAwesomeScript.sh" >/dev/null 2>&1

What happens here is fairly straightforward:
  • Every minute, cron runs flock, which executes your script, in this case "/usr/local/bin/myAwesomeScript.sh".
  • flock takes an exclusive lock on the lock file, here named "/tmp/awesomenessRunning.lock". When the script finishes, it releases the lock.
  • The next time this cron runs, flock will again attempt to get an exclusive lock on that lock file... but it can't while the script is still running, so (thanks to -n) it gives up and tries again the next time the cron runs.

Now, generally, if I'm doing this as a system-level item, I'll put the following in a file named for the job (or what the job is doing) and drop it in /etc/cron.d/. All the files there get compiled together into the system cron, which helps other admins (or your later self) find and disable it later. If you do that, remember to stick the user to execute the cron as between the *'s and the flock!
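For example, such a file might look like this (a sketch; the file name and the choice of root as the user are my assumptions):

# /etc/cron.d/awesomeness
# min hour dom mon dow user command
* * * * * root flock -x -n /tmp/awesomenessRunning.lock -c "/usr/local/bin/myAwesomeScript.sh" >/dev/null 2>&1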

2016-06-24

Growing an mdadm RAID-5 while online

Today, I decided that my 3-drive RAID-5 setup just wasn't big enough. I've got two 1.5TB drives and a 1TB drive, with the first 1TB of each in a RAID-5 using mdadm. The extra space on the 1.5s lets me use that area as scratch disks for other things that don't need the speed or resilience of the RAID-5. But now it's time to throw another TB at it and make it a 4-drive RAID-5. Mdadm can help us out here in just a few steps. First, though, we have to partition the new drive so we can use it. We also need to know which one it is... First I'll see what we had. I know it's mounted on /array/, so a mount output tells me the device name, and from there I can find out which drives are part of it:

# mount |grep array
/dev/md3 on /array type ext4 (rw,noatime,data=ordered)

# mdadm --misc --detail /dev/md3
/dev/md3:
        Version : 0.91
  Creation Time : Sun Jul 17 21:20:35 2011
     Raid Level : raid5
...
    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       97        2      active sync   /dev/sdg1

So, sdc1, sdd1, and sdg1 are all part of this array. After inserting the new disk, I run `dmesg|grep TB`, since I know the drive will be listed with a #TB size, and look for a device that isn't one of the above:
# dmesg |grep TB
[    3.264540] sd 2:0:0:0: [sdc] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
[    3.329286] sd 2:0:1:0: [sdd] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
[    3.329630] sd 3:0:0:0: [sde] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
[    5.930020] sd 9:0:0:0: [sdg] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)

Hey look, a new one, and we shall call you 'sde'. Time to make some partitions:

# fdisk /dev/sde

Welcome to fdisk (util-linux 2.26.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Device does not contain a recognized partition table.
Created a new DOS disklabel with disk identifier 0x8c82f3b1.

Command (m for help): p
Disk /dev/sde: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x8c82f3b1

Command (m for help): n
Partition type
   p   primary (0 primary, 0 extended, 4 free)
   e   extended (container for logical partitions)
Select (default p): p
Partition number (1-4, default 1): 1
First sector (2048-1953525167, default 2048): 2048
Last sector, +sectors or +size{K,M,G,T,P} (2048-1953525167, default 1953525167): 1953525167

Created a new partition 1 of type 'Linux' and of size 931.5 GiB.

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

So, above, we hit 'p' to print what was there. All it did was tell us about the disk again, because there weren't any existing partitions... This is still a good check to make sure there's nothing on the drive already and that it's really the disk you want. If this isn't a new drive and you want to clear it out first, hit 'd' and pick the number corresponding to the partition you want to delete, until there aren't any left. We then just created a default partition as the first and only one, taking up the whole drive.

If you wanted to do as I did with the 1.5TB drives and create an array that doesn't take up the whole drive (say you got a 2TB drive but wanted only the first 1TB in this array), create a partition as normal, but change that "Last sector" number to match one of the other drives' partitions. Running fdisk /dev/<otherDevice>, hitting 'p' to print its table, and then 'q' to quit will show you the sector count on that drive, which you can just match here. Feel free to make extra partitions for whatever else you want to do with the remainder of the disk.
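If you'd rather not poke around the interactive prompt, fdisk -l on one of the existing members shows the same information (a sketch; these numbers are what this array's members would report, yours will differ):

# fdisk -l /dev/sdc
...
Device     Boot Start        End    Sectors   Size Id Type
/dev/sdc1        2048 1953525167 1953523120 931.5G 83 Linux

That "End" value is the number to reuse as the "Last sector" on the new drive. With the partition made, we can add the new disk and grow the array: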

# mdadm --grow --raid-devices=4 --add /dev/md3 /dev/sde1
mdadm: added /dev/sde1
mdadm: Need to backup 192K of critical section..

# top
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
  205 root      20   0       0      0      0 S  29.2  0.0   0:16.08 md3_raid5                                                                           
 4668 root      20   0       0      0      0 D  11.0  0.0   0:04.95 md3_resync    
...
You can see that mdadm is using up some decent CPU now (this is a quad-core 2GHz Core 2 based Pentium D), crunching all those RAID-5 checksums.

# iostat -m 1 100
 
Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc             162.00        46.23        32.00         46         32
sdd             164.00        47.38        32.00         47         32
sde              91.00         0.00        29.83          0         29
md3               0.00         0.00         0.00          0          0
sdg             151.00        44.73        28.00         44         28

And iostat shows that sdc, sdd, sdg and the new sde are all moving lots of MB/sec. Interestingly, since sde is new, you can tell it's not being read from, only written to.

If you want to see detailed progress, you can run this:

# watch cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10] 
md128 : active raid1 sdb1[1] sda1[0]
      112972800 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md3 : active raid5 sde1[3] sdg1[2] sdd1[1] sdc1[0]
      1953519872 blocks super 0.91 level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
      [>....................]  reshape =  3.7% (36999992/976759936) finish=504.7min speed=31026K/sec

Here you can see both 'md128', my boot-drive RAID-1 mirror (sda1 and sdb1), and the now-expanding RAID-5 'md3' using sdg1, sdd1, sdc1, and of course, the new sde1. Because that's run via 'watch', it'll update every 2 seconds by default. Taking the 'watch' off the front will give you a one-time status page.

And now we wait... about 504.7min, apparently. ... Finally, you'll see:

# cat /proc/mdstat
...
md3 : active raid5 sde1[3] sdg1[2] sdd1[1] sdc1[0]
      2930279808 blocks level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
...
# dmesg | tail
...
[42646.351875] md: md3: reshape done.
[42648.156073] RAID conf printout:
[42648.156078]  --- level:5 rd:4 wd:4
[42648.156081]  disk 0, o:1, dev:sdc1
[42648.156084]  disk 1, o:1, dev:sdd1
[42648.156086]  disk 2, o:1, dev:sdg1
[42648.156088]  disk 3, o:1, dev:sde1
[42648.156094] md3: detected capacity change from 2000404348928 to 3000606523392
[42649.508764] VFS: busy inodes on changed media or resized disk md3

But our filesystem, according to df, still shows the old size. Well, the filesystem and its allocation tables were written while the device was smaller, so, if it's formatted with any of the ext filesystem types, it can be enlarged online with the following command:

# resize2fs /dev/md3
resize2fs 1.42.12 (29-Aug-2014)
Filesystem at /dev/md3 is mounted on /array; on-line resizing required
old_desc_blocks = 117, new_desc_blocks = 175
The filesystem on /dev/md3 is now 732569952 (4k) blocks long.
# dmesg|tail
...
[52020.706909] EXT4-fs (md3): resizing filesystem from 488379968 to 732569952 blocks
[52023.727545] EXT4-fs (md3): resized filesystem to 732569952

Huzzah! We have our space! Running a quick df will also show the capacity increase! It's online and ready to use. Hope this helped you, and thanks for reading!

2016-05-24

Expanding a non-LVM disk on Linux

The example below was done on Ubuntu 14.04 LTS, but it really is about the same in any 'modern' Linux distribution.

You can see below that our /storage mount is full. Time to add some more storage. Now, there are two options here. Luckily, this is a virtual machine, so I can just tell VMware I want to make that disk bigger, reboot or rescan the disk, and it'll pick it up... but it won't resize the partition. If this is a physical box, however, there could still be a simple solution: if the disk has multiple partitions on it, you can consume the next one on the disk to create one big partition and likewise solve this issue. If you're completely out of space on that disk, you'll need some more magic to make it happen, and that won't be covered here.

root@web01.example.com:~# df
Filesystem     1K-blocks      Used Available Use% Mounted on
dev              1012984        12   1012972   1% /dev
tmpfs             204896       756    204140   1% /run
/dev/dm-0      255583756   9385312 233192416   4% /
none                   4         0         4   0% /sys/fs/cgroup
none                5120         0      5120   0% /run/lock
none             1024468         0   1024468   0% /run/shm
none              102400         0    102400   0% /run/user
/dev/sda1         240972    104857    123674  46% /boot
/dev/sdb1      515929528 515912512         0 100% /storage

The steps you'll need to follow to expand the partition above, '/dev/sdb1', are as follows:

  1. Unmount the disk with the standard 'umount /dev/sdb1' command.
  2. If you're consuming a partition, skip to step 4. If you're running this as a VM and can simply expand the disk, do so and reboot or rescan (different virtualization programs allow different options here: some allow online expansion, others require a shutdown first). A rescan sketch appears after this list.
  3. After booting back up, make sure the drive is not mounted and open fdisk on that drive. You should see your updated drive size.
  4. Using fdisk, you'll need to remember the partition number you're resizing, the type, and the starting position. If you're consuming the next partition and not going to the end of the disk, you'll need that number as well. 
  5. root@web01.example.com:~# fdisk /dev/sdb
    
    Command (m for help): p
    
    Disk /dev/sdb: 805.3 GB, 805306368000 bytes
    193 heads, 8 sectors/track, 1018694 cylinders, total 1572864000 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x4f10cdef
    
       Device Boot      Start         End      Blocks   Id  System
    /dev/sdb1            2048  1048575999   524286976   83  Linux
    
    Command (m for help): d
    Selected partition 1   
    
  6. Delete the partition. If you're consuming another, delete that as well. Now, you haven't deleted data, just pointers to where it starts.
  7. Create a new partition. Start where the old one started, end either at the end of the disk, or the end of the partition you're consuming and hit "w" to write the changes to disk.
  8. Command (m for help): n
    Partition type:   
       p   primary (0 primary, 0 extended, 4 free)
       e   extended
    Select (default p): p
    Partition number (1-4, default 1):
    Using default value 1
    First sector (2048-1572863999, default 2048):
    Using default value 2048
    Last sector, +sectors or +size{K,M,G} (2048-1572863999, default 1572863999):
    Using default value 1572863999
    
    Command (m for help): w
    The partition table has been altered!
    
    Calling ioctl() to re-read partition table.
    Syncing disks.
    
  9. Now the system knows about the new partition, but the filesystem inside it only knows about the old size. We have to resize the filesystem to fill the space, and in order to resize, we must first verify that the filesystem is clean.
  10. root@web01.example.com:~# resize2fs /dev/sdb1
    resize2fs 1.42.9 (4-Feb-2014)
    Please run 'e2fsck -f /dev/sdb1' first.
    
  11. As resize2fs just pointed out, we have to check the filesystem before it will resize it:
  12. root@web01.example.com:~# e2fsck -f /dev/sdb1
    e2fsck 1.42.9 (4-Feb-2014)
    Pass 1: Checking inodes, blocks, and sizes
    Pass 2: Checking directory structure
    Pass 3: Checking directory connectivity
    Pass 4: Checking reference counts
    Pass 5: Checking group summary information
    /dev/sdb1: 1369/32768000 files (0.6% non-contiguous), 131067490/131071744 blocks
    
  13. Now we can resize the filesystem and then mount it!
  14. root@web01.example.com:~# resize2fs /dev/sdb1
    resize2fs 1.42.9 (4-Feb-2014)
    Resizing the filesystem on /dev/sdb1 to 196607744 (4k) blocks.
    The filesystem on /dev/sdb1 is now 196607744 blocks long.
    root@web01.example.com:~# mount -a
    root@web01.example.com:~# df
    Filesystem     1K-blocks      Used Available Use% Mounted on
    udev             1012984        12   1012972   1% /dev
    tmpfs             204896       756    204140   1% /run
    /dev/dm-0      255583756   9385316 233192412   4% /
    none                   4         0         4   0% /sys/fs/cgroup
    none                5120         0      5120   0% /run/lock
    none             1024468         0   1024468   0% /run/shm
    none              102400         0    102400   0% /run/user
    /dev/sda1         240972    104857    123674  46% /boot
    /dev/sdb1      773960448 515911432 218711088  71% /storage
    
  15. And a quick 'df' shows that we've got some breathing room!
  16. Go grab a beer or glass of wine and pat yourself on the back.
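
For reference, the rescan mentioned in step 2 can be done online like this (a sketch, assuming the disk lives at SCSI address 2:0:1:0; check dmesg or /sys/class/scsi_device/ for your address):

root@web01.example.com:~# echo 1 > /sys/class/scsi_device/2\:0\:1\:0/device/rescan
root@web01.example.com:~# dmesg | tail

The tail of dmesg should then show the new, larger capacity for /dev/sdb.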

2015-06-05

Resizing Online Disks in Linux with LVM and No Reboots

When we set this up, we only had a 16GB primary disk... plenty of space. Until you start to write lots of logs and data... and then it fills up quick. So let's talk about how to resize an LVM-based partition on a live server without reboots... Reboots are for Windows! This system is a CentOS 6.x machine running in VMware 5.x that currently has a 16GiB VMDK-based drive. Let's see what we've got to work with:
$ df -h
Filesystem                                         Size  Used Avail Use% Mounted on
/dev/mapper/myfulldisk--vg-root                     12G   11G     0 100% /
none                                               4.0K     0  4.0K   0% /sys/fs/cgroup
udev                                               7.4G  4.0K  7.4G   1% /dev
tmpfs                                              1.5G  572K  1.5G   1% /run
none                                               5.0M     0  5.0M   0% /run/lock
none                                               7.4G     0  7.4G   0% /run/shm
none                                               100M     0  100M   0% /run/user
/dev/sda1                                          236M   37M  187M  17% /boot
/dev/sdc1                                          246G   44G  190G  19% /data

Hmm... time to get on this, then. Now, luckily we're running in VMware; a quick edit to our VM to enlarge the VMDK (not covered in this how-to) will fix this... First, what device are we talking about?

$ dmesg|grep sd
[    1.562363] sd 2:0:0:0: Attached scsi generic sg1 type 0
[    1.562384] sd 2:0:0:0: [sda] 33554432 512-byte logical blocks: (17.1 GB/16.0 GiB)
[    1.562425] sd 2:0:0:0: [sda] Write Protect is off
[    1.562426] sd 2:0:0:0: [sda] Mode Sense: 61 00 00 00
[    1.562460] sd 2:0:0:0: [sda] Cache data unavailable
[    1.562461] sd 2:0:0:0: [sda] Assuming drive cache: write through
[    1.563331] sd 2:0:0:0: [sda] Cache data unavailable
[    1.563451] sd 2:0:1:0: Attached scsi generic sg2 type 0
[    1.563452] sd 2:0:1:0: [sdb] 8388608 512-byte logical blocks: (4.29 GB/4.00 GiB)
[    1.563479] sd 2:0:1:0: [sdb] Write Protect is off
[    1.563481] sd 2:0:1:0: [sdb] Mode Sense: 61 00 00 00
[    1.563507] sd 2:0:1:0: [sdb] Cache data unavailable
[    1.563508] sd 2:0:1:0: [sdb] Assuming drive cache: write through
[    1.563755] sd 2:0:2:0: Attached scsi generic sg3 type 0
[    1.563881] sd 2:0:2:0: [sdc] 524288000 512-byte logical blocks: (268 GB/250 GiB)
[    1.563942] sd 2:0:2:0: [sdc] Write Protect is off
[    1.563944] sd 2:0:2:0: [sdc] Mode Sense: 61 00 00 00
[    1.564008] sd 2:0:2:0: [sdc] Cache data unavailable
[    1.564010] sd 2:0:2:0: [sdc] Assuming drive cache: write through
[    1.564282] sd 2:0:2:0: [sdc] Cache data unavailable
[    1.564283] sd 2:0:2:0: [sdc] Assuming drive cache: write through
[    1.564360] sd 2:0:1:0: [sdb] Cache data unavailable
[    1.564362] sd 2:0:1:0: [sdb] Assuming drive cache: write through
[    1.564989] sd 2:0:0:0: [sda] Assuming drive cache: write through
[    1.571010]  sdb: sdb1
[    1.571426] sd 2:0:1:0: [sdb] Cache data unavailable
[    1.571514] sd 2:0:1:0: [sdb] Assuming drive cache: write through
[    1.571626] sd 2:0:1:0: [sdb] Attached SCSI disk
[    1.574181]  sda: sda1 sda2 < sda5 >
[    1.574797] sd 2:0:0:0: [sda] Cache data unavailable
[    1.574888] sd 2:0:0:0: [sda] Assuming drive cache: write through
[    1.575003] sd 2:0:0:0: [sda] Attached SCSI disk
[    1.579250]  sdc: sdc1
[    1.579805] sd 2:0:2:0: [sdc] Cache data unavailable
[    1.579944] sd 2:0:2:0: [sdc] Assuming drive cache: write through
[    1.580141] sd 2:0:2:0: [sdc] Attached SCSI disk
[    6.922330] Adding 4193276k swap on /dev/sdb1.  Priority:-1 extents:1 across:4193276k FS
[    7.137134] EXT4-fs (sda1): mounting ext2 file system using the ext4 subsystem
[    7.142419] EXT4-fs (sda1): mounted filesystem without journal. Opts: (null)
[    7.218150] EXT4-fs (sdc1): mounted filesystem with ordered data mode. Opts: (null)
[    7.384566] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).

The first one is the 16GB drive in question. Take the SCSI address on that line (here, 2:0:0:0) and use it in the next step:

$ echo 1 > /sys/class/scsi_device/2\:0\:0\:0/device/rescan
$ dmesg |tail
[1918441.322362] sd 2:0:0:0: [sda] 209715200 512-byte logical blocks: (107 GB/100 GiB)
[1918441.322596] sd 2:0:0:0: [sda] Cache data unavailable
[1918441.330685] sd 2:0:0:0: [sda] Assuming drive cache: write through
[1918441.489622] sda: detected capacity change from 17179869184 to 107374182400

So, that's good; it sees our increased size. Now, let's enlarge that volume group. First, we get info about it:

$ vgdisplay
  --- Volume group ---
  VG Name               myfulldisk-vg
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  3
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               15.76 GiB
  PE Size               4.00 MiB
  Total PE              4034
  Alloc PE / Size       4028 / 15.73 GiB
  Free  PE / Size       6 / 24.00 MiB
  VG UUID               dv3URd-EVvz-oTwY-WiDW-RPt1-4rbD-FnPxxM

That 'Free PE / Size' of 6 is only 24MiB; the VG doesn't see our new space yet. To get it to, we need to make a new partition of the right type, then add it to the volume group. We'll then end up with more free PEs. Here we go:
$ fdisk /dev/sda
Command (m for help): p

Disk /dev/sda: 107.4 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders, total 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000ade37

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      499711      248832   83  Linux
/dev/sda2          501758    33552383    16525313    5  Extended
/dev/sda5          501760    33552383    16525312   8e  Linux LVM

Command (m for help): n
Partition type:
   p   primary (1 primary, 1 extended, 2 free)
   l   logical (numbered from 5)
Select (default p): p
Partition number (1-4, default 3): 3
First sector (499712-209715199, default 499712): 33552384
Last sector, +sectors or +size{K,M,G} (33552384-209715199, default 209715199): 
Using default value 209715199

Command (m for help): t
Partition number (1-5): 3
Hex code (type L to list codes): 8e
Changed system type of partition 3 to 8e (Linux LVM)

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
Syncing disks.

At this point you could reboot... but we're not going to. Even though this is our root drive, which makes things a little trickier, it's nothing we can't fix:
$ partprobe /dev/sda

Hopefully, partprobe has found your new partition for you and enlightened the kernel with its wisdom (or at least a fresh load of zeros). Now we need to make that partition available for expanding the disk. This consists of making it a 'physical volume', then adding that physical volume to the volume group containing the logical volume we want to expand.

$ pvcreate /dev/sda3
  Physical volume "/dev/sda3" successfully created
$ pvdisplay
  --- Physical volume ---
  PV Name               /dev/sda5
  VG Name               myfulldisk-vg
  PV Size               15.76 GiB / not usable 2.00 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              4034
  Free PE               6
  Allocated PE          4028
  PV UUID               a3mhvZ-ogyk-ao4y-2JSM-KVfL-i9no-q0LAUk
   
  "/dev/sda3" is a new physical volume of "84.00 GiB"
  --- NEW Physical volume ---
  PV Name               /dev/sda3
  VG Name               
  PV Size               84.00 GiB
  Allocatable           NO
  PE Size               0   
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               IpyjOU-1GDy-bTLL-U9kE-iSGP-BYg1-a25LIm
   
$ vgextend /dev/myfulldisk-vg /dev/sda3
  Volume group "myfulldisk-vg" successfully extended
$ vgdisplay
  --- Volume group ---
  VG Name               myfulldisk-vg
  System ID             
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               99.76 GiB
  PE Size               4.00 MiB
  Total PE              25538
  Alloc PE / Size       4028 / 15.73 GiB
  Free  PE / Size       21510 / 84.02 GiB
  VG UUID               dv3URd-EVvz-oTwY-WiDW-RPt1-4rbD-FnPxxM
Awesome, we now have 21510 free PEs that we can use... that's, apparently, 84.02GiB in this case. Next up, we need to know which portion of the VG to extend. Looking back up at the 'df' output, and knowing our system, it says "root" in there. A quick ls of /dev/myfulldisk-vg/ shows that there's really only a choice between "root" and "swap". So, knowing it's root, we move on with:
$ lvextend -L95G /dev/myfulldisk-vg/root
  Extending logical volume root to 95.00 GiB
  Logical volume root successfully resized
$ df -h
Filesystem                                         Size  Used Avail Use% Mounted on
/dev/mapper/myfulldisk--vg-root                     12G   11G  114M  99% /
none                                               4.0K     0  4.0K   0% /sys/fs/cgroup
udev                                               7.4G  4.0K  7.4G   1% /dev
tmpfs                                              1.5G  576K  1.5G   1% /run
none                                               5.0M     0  5.0M   0% /run/lock
none                                               7.4G     0  7.4G   0% /run/shm
none                                               100M     0  100M   0% /run/user
/dev/sda1                                          236M   37M  187M  17% /boot
/dev/sdc1                                          246G   44G  190G  19% /data
Okay, the VG might be bigger, but no one else knows that, because the filesystem ON the logical volume is still the same size. Luckily, there's a command for that too!
$ resize2fs /dev/myfulldisk-vg/root
resize2fs 1.42.9 (4-Feb-2014)
Filesystem at /dev/myfulldisk-vg/root is mounted on /; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 6
The filesystem on /dev/myfulldisk-vg/root is now 24903680 blocks long.

$ df -h
Filesystem                                         Size  Used Avail Use% Mounted on
/dev/mapper/myfulldisk--vg-root                     94G   11G   79G  12% /
none                                               4.0K     0  4.0K   0% /sys/fs/cgroup
udev                                               7.4G  4.0K  7.4G   1% /dev
tmpfs                                              1.5G  576K  1.5G   1% /run
none                                               5.0M     0  5.0M   0% /run/lock
none                                               7.4G     0  7.4G   0% /run/shm
none                                               100M     0  100M   0% /run/user
/dev/sda1                                          236M   37M  187M  17% /boot
/dev/sdc1                                          246G   44G  190G  19% /data


Ha, there we go! 79GB available. Enjoy!
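As a closing tip, newer LVM releases can collapse the last two steps into one: lvextend's -r (--resizefs) flag runs the filesystem resize for you after extending the LV. A sketch using this post's volume name:

$ lvextend -r -L95G /dev/myfulldisk-vg/root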

2014-03-10

LogStash ElasticSearch Index Cleanup

LogStash is a great way to track logs from lots of different sources and store them in a central location where metrics and monitoring can occur. I've started pushing LOTS of data into our setup, which uses the ElasticSearch back end. To quote their site, "ElasticSearch is a flexible and powerful open source, distributed, real-time search and analytics engine," and I think it has a really bright future... but currently it's soaking up a lot of disk space. I'm sure I'm not the only one with this issue; after all, when something can handle LOADS of data, you want to give it all you've got! So, we've got 3 hosts running ElasticSearch processes, each with 250GB of data storage, and sometimes one will start to fill up. Looking into the API, I found it's really, REALLY easy to delete old data to keep the size within requested parameters. First off, looking at LogStash's ElasticSearch plugin, it notes that by default LogStash indexes are named "logstash-%{+YYYY.MM.dd}". Keeping that in mind, the following would work for anything as long as you know the indexes you want to delete, but let's start off simple.

curl -s -XDELETE  'http://127.0.0.1:9200/logstash-2014.02.28'

That'll delete the "logstash-2014.02.28" index. I've had to connect in and do this sometimes. Great to do when you need it on demand, but we can do better. Assuming that I'm cool keeping the last 7 days up there, let's edit up a quick bash script:

#!/bin/bash
DATETODELETE=`date +%Y.%m.%d -d '7 days ago'`
curl -s -XDELETE  http://127.0.0.1:9200/logstash-${DATETODELETE}

Now, we could put that in the crontab, have it run once or twice a day, and be good to go... And if you knew you could ALWAYS keep 7 days' worth of data on your system, that'd be acceptable. But let's have some more fun. Let's assume that we want to keep as much as we can on our system while still keeping 10% of the space free, and that the drive we store this on is mounted on /data:

#!/bin/bash
#This is about 10% of a 250GB volume (not GiB) using 1000 = k 
DESIRED=24410000
AVAIL=`df /data|grep -v Filesystem|awk '{print $4}'`
if [ $AVAIL -lt $DESIRED ] 
then
    curl -s -XDELETE 127.0.0.1:9200/`curl -s 127.0.0.1:9200/_stats?pretty|grep logstash|sort|awk -F'"' '{ print $2 }'|head -n1`
fi

Let's explain this sample a bit... First off, we set DESIRED to the amount of "Available" space we want the system to retain. In our case above, I calculated 10% of a 250GB drive and put that in. So if free space ever drops below 10% (90%+ used), the if statement will fire.

Next, I pull the available space. If you take what's in the backquotes and put it on a command line, you'll see what happens: I run df limited to just the filesystem I care about, grep gets rid of the header line, and then awk pulls out the 4th column (Avail). This number gets stored in AVAIL and we move on.

The if statement then compares the two: if AVAIL is less than DESIRED, we're bumping our limit and something's got to give, so we run the curl... This curl is a combination of two, actually. Starting from the inside out, we do a "curl -s 127.0.0.1:9200/_stats?pretty", which prints out a list of indexes and a bunch of cool stats about them... then we grep for logstash to get rid of all the cool stats and just keep the names, then we use sort to make sure they're in order from oldest to newest (since the names contain dates like 2014.03.04, that works), and then we use some awk magic to pull out JUST the name of the oldest index and strip the other characters 'pretty' adds. That then gets placed back in the right place for the outer curl to execute a delete on it, and bye, bye index!

If you put this in your crontab and run it often (it won't do anything if the drive has more than the desired available space remaining), you'll be able to maintain free space on your ElasticSearch hosts without having to set a hard limit on days to keep.
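
For example, a crontab entry like this would run the check hourly (a sketch; the script path is a name I made up):

0 * * * * /usr/local/bin/trimElasticSearch.sh >/dev/null 2>&1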

Thinking further into it, you can use the same script with different commands inside the if statement to keep free space on many other systems as well.
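
As a minimal sketch of that idea, here's the same skeleton pruning the oldest file from a spool directory instead (the /data/spool path is hypothetical):

#!/bin/bash
DESIRED=24410000
AVAIL=`df /data|grep -v Filesystem|awk '{print $4}'`
if [ $AVAIL -lt $DESIRED ]
then
    # ls -t lists newest first, so tail -n1 gives the oldest file
    rm -f "/data/spool/`ls -t /data/spool|tail -n1`"
fi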

2014-01-03

WordPress asks for FTP Credentials

Being a modern systems administrator, I'm sometimes asked to manage things that throw me for a loop. WordPress is one of those things. It's both really simple and really complex, and sometimes not direct with its response to problems. I noticed on a fresh WordPress install that when my users wanted to upload a new theme, they were presented with a normal 'upload' link: click this button, browse to your file, hit okay, then hit upload. All well and good. But then it prompted them for FTP or SFTP credentials. No error message, no reason why FTP would be needed in light of the previous, seemingly successful upload. We don't run FTP here in relation to WordPress, nor would I want to add that complexity to the setup or open another potential access point for an attacker.

After digging a bit, the reason came down to my being a bit too secure and clamping the permissions my web server user had on the WordPress files down too far. I found the following issues during troubleshooting:

  • The default install from a tarball doesn't create the wp-content/uploads directory. You've got to make it yourself.
  • The uploads dir must allow writes (for obvious reasons) by apache or www-data or whoever your web user is.
  • The target of your upload has to go somewhere... Theme uploads need the web user to be able to write to wp-content/themes/, upgrades to wp-content/upgrades/, etc.
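
For example, on a Debian/Ubuntu-style install where the web user is www-data, the fix might look like this (a sketch; the install path is hypothetical, and your web user may be apache instead):

# cd /var/www/wordpress
# mkdir -p wp-content/uploads
# chown -R www-data wp-content/uploads wp-content/themes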

Fixing these issues made the FTP prompt stop showing up, and we were good to go.

2013-12-18

More OpenShift Oddities

I had to fight with OpenShift a bit more today to get my application up and running after a botched code push. Restarting from the website didn't work, and simply re-pushing git code didn't help either... so, time to dig in. As you can see below, [node] being in brackets means it wasn't really running; it was in the process of starting or stopping... in fact, it kept doing that quite frequently according to a tail -f on /nodejs/logs/node.log ... So, I decided I had to stop it restarting, but how?

[(app name).rhcloud.com (username)]\> ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1313     240483  0.0  0.0 105068  3152 ?        S    17:02   0:00 sshd: (user)@pts/1
1313     240486  0.0  0.0 108608  2100 pts/1    Ss   17:02   0:00 /bin/bash --init-file /usr/bin/rhcsh -i
1313     249661  0.1  0.4 397100 35224 ?        Sl   17:08   0:00 /usr/bin/mongod --auth -f /var/lib/openshift/(user)/mongodb//conf/mongodb.conf run
1313     261473  5.5  0.0      0     0 ?        R    17:15   0:00 [node]
1313     261476  2.0  0.0 110244  1156 pts/1    R+   17:15   0:00 ps aux
1313     390906  8.1  0.2 1021240 20196 ?       Sl   Dec10 321:14 node /opt/rh/nodejs010/root/usr/bin/supervisor -e node|js|coffee -p 1000 -- server.js
[(app name).rhcloud.com (username)]\> kill 390906

That killed the "supervisor" process that re-spawns the node process. This is generally helpful, but today it was continually incrementing the PID, and it seemed like that was happening more often than the gear could attempt to stop it. Unfortunately, now I couldn't restart it (rerunning that command from the ps output just gave me an error complaining about an Unhandled 'error' event in the supervisor script), so I decided to start the node service myself.

There are a few ways of doing this. You can go to your code and run 'node', or you can use gear start. But if you try gear start, well, it won't start if it thinks it's already running. After killing supervisor, the node process was not attempting to restart, but gear start didn't work either. I tried tricking it by clearing out the $OPENSHIFT_NODEJS_PID_DIR/cartridge.pid file, but that didn't work either... It did point out something I could use, though.

[(appname).rhcloud.com (username)]\> gear stop
Stopping gear...
Stopping NodeJS cartridge
usage: kill [ -s signal | -p ] [ -a ] pid ...
       kill -l [ signal ]
Stopping MongoDB cartridge
[(appname).rhcloud.com (username]\> gear start
Starting gear...
Starting MongoDB cartridge
Starting NodeJS cartridge
Application 'deploy' failed to start
An error occurred executing 'gear start' (exit code: 1)
Error message: Failed to execute: 'control start' for /var/lib/openshift/(username)/nodejs

For more details about the problem, try running the command again with the '--trace' option.

What I found interesting about that was that it apparently passed the empty PID from the $OPENSHIFT_NODEJS_PID_DIR/cartridge.pid file along to kill, and kill didn't know what to do with that. In fact, kill returns a failure code if you don't tell it what to kill OR if you tell it to kill something that isn't there (the original issue), so instead of getting an 'okay' back when the gear script ran kill, it got a failure, and that meant problems for gear. So, I thought if I got something running on a PID that it COULD kill and put that PID in the file, it'd kill it successfully and everything would be back to normal. The easiest thing I could think of was to stick in the '}' I'd forgotten in my script and run that.

The node code is stored in /app-deployments/<datestamp>/repo/ ... but don't expect things you put here to stick around.

\> node server.js 
^Z
\> ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1313     240483  0.0  0.0 105068  3152 ?        S    17:02   0:00 sshd: (user)@pts/1
1313     240486  0.0  0.0 108608  2124 pts/1    Ss   17:02   0:00 /bin/bash --init-file /usr/bin/rhcsh -i
1313     275483  0.3  0.4 467788 36892 ?        Sl   17:24   0:01 /usr/bin/mongod --auth -f /var/lib/openshift/(user)/mongodb//conf/mongodb.conf run
1313     284292  2.5  0.6 732440 45924 pts/1    Sl   17:30   0:02 node server.js
1313     287036  2.0  0.0 110240  1156 pts/1    R+   17:32   0:00 ps aux
\> echo "284292" > $OPENSHIFT_NODEJS_PID_DIR/cartridge.pid

So, the PID was in the file, and the PID was a valid running node process. Then I did my git commit of the fix and ran git push... and it was back to normal!

Counting objects: 5, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 344 bytes | 0 bytes/s, done.
Total 3 (delta 2), reused 0 (delta 0)
remote: Stopping NodeJS cartridge
remote: Stopping MongoDB cartridge
remote: Saving away previously installed Node modules
remote: Building git ref 'master', commit f5e40ef
remote: Building NodeJS cartridge
remote: npm info it worked if it ends with ok
...
remote: npm info ok 
remote: Preparing build for deployment
remote: Deployment id is aa38fed5
remote: Activating deployment
remote: Starting MongoDB cartridge
remote: Starting NodeJS cartridge
remote: Result: success
remote: Activation status: success
remote: Deployment completed with status: success

So, now that the PID was stable and correct, it seemed to deploy properly and I've had no troubles since!

2013-12-10

Too many authentication failures for

Lately I've been getting this lovely error when trying to ssh to certain hosts (not all, of course):

# ssh ssh.example.com
Received disconnect from 192.168.1.205: 2: Too many authentication failures for 

My first thought is "But you didn't even ASK me for a password!" My second thought is "And you're supposed to be using ssh keys anyway!"

So, I decide I need to specify a specific key to use on the command line with the -i option.

# ssh ssh.example.com -i myAwesomeKey
Received disconnect from 192.168.1.205: 2: Too many authentication failures for 

Well, that didn't help. Adding a -v shows that it tried a lot of keys... including the one I asked for. And that, apparently, is the crux of the issue. You see, ssh looks through the config file (of which mine is fairly extensive, as I deal with a few hundred hosts, most of which share a subset of keys, but not all of them), and it doesn't necessarily try the key I specified FIRST. So, if you have more than about 5 keys defined, it may not use the key you want first; it will offer anything from the config file. Yes, even if you have keys defined per host. For instance, my config file goes something like this:

Host src.example.com
 User frank.user
 Compression yes
 CompressionLevel 9
 IdentityFile /home/username/.ssh/internal

Host puppet.example.com
 User john.doe
 Compression yes
 CompressionLevel 9
 IdentityFile /home/username/.ssh/jdoe


Apparently, this means ssh will try both of these keys for any host that isn't those two. If the third Host entry you define, "Host ssh.example.com" in our case, is the one you want, ssh will try that key THIRD, even though the Host line matches. The fix is simple: tack "IdentitiesOnly yes" in there. It tells ssh to apply ONLY the IdentityFile entries having to do with that host TO that host.

Host src.example.com
 User frank.user
 Compression yes
 CompressionLevel 9
 IdentitiesOnly yes
 IdentityFile /home/username/.ssh/internal

The side effect of the default behavior is that you don't have to define an IdentityFile line for EVERY HOST: ssh will offer all the keys it knows about to every Host entry in the config, and indeed to every ssh you attempt, listed or not. This is why it didn't always fail; there was a good chance the first one or two keys in the list worked. It was only when the first 5 it tried didn't work that it failed.
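
You can also get the same behavior for a one-off connection, since any config option can be passed on the command line with -o:

# ssh -o IdentitiesOnly=yes -i myAwesomeKey ssh.example.com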

2013-11-20

Adding Swap Space in Linux Without a Reboot

So, let's say you've got a server running out of memory. Not just RAM, but swap too. Now, generally, there are a few well known ways to solve this issue.

  • Close/Kill processes you don't need
  • Reboot
  • Add another swap partition
  • Buy more RAM
  • Buy more Hardware

Now, in our scenario, the first option isn't helping, and the second one is just the nuclear version of the first. We've got one huge process, and it's not all active memory... it's just consuming a lot of RAM and swap, and we want it to succeed. Buying more RAM is the best idea, but this server won't take any more, or we're not sure we'll have this workload often, so we can't justify wasting money on more hardware. We've gotta get creative before memory fills up and the process gets OOM-killed. Adding another swap partition is a great idea, but we're out of available disk partitions and drives to throw at it. However, we do have some free space on an existing partition, and we can leverage that.

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md2               47G   11G   35G  23% /
/dev/hda1              99M   20M   74M  22% /boot

Alright, looking at top or vmstat, we know we've got 4GB of RAM in here and another 2GB of swap. Knowing the size of the process, we figure doubling that swap will give us plenty of overhead for the moment. Let's do this!

$ dd if=/dev/zero of=/newswap bs=32k count=64k

65536+0 records in
65536+0 records out
2147483648 bytes (2.1 GB) copied, 18.9618 seconds, 113 MB/s

$ ls -al /newswap
-rw-r--r-- 1 root root 2147483648 Nov 19 23:02 /newswap
$ mkswap /newswap
Setting up swapspace version 1, size = 2147479 kB
$ swapon /newswap

And that's it. A quick check should find that we now have another 2GB of swap space and a system that can breathe a little more.

Note: The size of the swap space is determined by the size of the file: 'bs' is the block size, 'count' is the number of blocks, and the total size is bs times count. I generally stick to 32k or 64k block sizes and adjust the count from there: 64k x 64k is 4GB, 64k x 128k is 8GB, etc.
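
For instance, an 8GB swap file would look like this (a sketch; /newswap2 is a made-up name):

$ dd if=/dev/zero of=/newswap2 bs=64k count=128k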

Now, this won't stick after a reboot as-is. If you'd like it to, change the process a bit: it's the same up through the mkswap command, but instead of running swapon, open /etc/fstab in your favorite editor (vi /etc/fstab) and add another swap line, after the disk the file lives on, like so:

/newswap         swap                    swap    defaults        0 0

Then you can run 'swapon -a' and it will enable ALL swap entries in the fstab.

Note: Swap automatically stripes across multiple swap partitions of the same priority. It might be useful to make swap partitions on multiple drives to allow for faster RAID-0 type speeds across drives!
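
If you do that, equal priorities can be set explicitly with the pri= option in the fstab entries (a sketch; the device names here are made up):

/dev/sdb2        swap                    swap    defaults,pri=1  0 0
/dev/sdc2        swap                    swap    defaults,pri=1  0 0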

Hope this helped someone out. I had to use it the other day and was able to save a long-running process that was eating up RAM like candy. It finished a few hours after I put this fix in place. Since I don't run that process often, I simply removed the line from /etc/fstab, and the next time the machine rebooted, it was back to its normal swap size. I then deleted the file, and it was like nothing ever happened!

2013-09-23

Setting up MySQL Replication Across 3+ Cluster Nodes

I've been running a multinode MySQL replication loop for a while and thought it'd be useful to write up how replication is re-synced between the peers. I'm writing this up as if there are three nodes, Adam, Ben, and Charlie, in a loop, but you can do this with any number of nodes. In our setup, Adam and Ben are in our primary location, with Charlie sitting in our DR site. This means Charlie is always on but doesn't get many queries on a regular basis. Adam and Ben have heartbeat set up as a failover pair, so that when Adam has a vacation (downtime), everything fails over to Ben and continues to run. We do this simply with a floating IP.

1. Decide which node to start with

Because of this setup, prior to doing a resync, it's important to connect to the node that currently has the active IP and start things from there. To find out which one that is, connect to the nodes and run:
$ /sbin/ip addr
You'll see something like:
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:50:56:82:29:01 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.56/24 brd 192.168.1.255 scope global eth0
    inet 192.168.1.59/24 brd 192.168.1.255 scope global secondary eth0
Note how this one has both the primary eth0 address and a "secondary eth0" address on it. We know our floating IP is 192.168.1.59, so it's active here. Because of that, this node has the latest info and will thus be the beginning of our resync process.

2. Stop the Slaves

To stop the slaves, connect to mysql as follows:
$ mysql -p -u root

mysql> stop slave;
Query OK, 0 rows affected (0.01 sec)
I recommend going ahead and stopping the slaves on the other nodes as well at this point.

3. Get a fresh mysqldump

To start the resync process, we need to take a dump of all databases, with master data, in a single transaction:
/bin/mysqldump -p -u root --all-databases --single-transaction --master-data | bzip2 > 20120912-midsync-`hostname`.sql.bz2
This makes sure we get all the info we need in the backup. Without --master-data, we'd have to remember exactly when the backup was taken in order to not miss any updates on restore, which is almost impossible when running against an active loop. When this is done, transfer the backup from Adam to Ben. Then, if you haven't already, stop the slaves around the circle.

4. Transfer dump to the next node in the circle

You can use scp or whatever you'd like. Just get the backup over there.

5. Import the dump

Then, on the first target (Ben in this case), bunzip2 the file, and then import it:
$ mysql -p -u root < 20120912-midsync-adam.sql
Because we included master data, this will automatically set the master position so the slave can resume correctly... but don't start it yet.

6. Rinse, repeat

We're going to continue around the circle, making a backup of each node, passing it on to the next, and restoring. In our case, we need a fresh backup of Ben, made the same way as the original Adam backup. Transfer that one to Charlie, bunzip2 it, and import it with the same mysql command. If you have more nodes, continue this around until all have fresh restores.
If you have multiple primary or online nodes, you'll want to 'start slave;' right after you import each backup. This makes sure they don't end up with stale data or data that conflicts with another node's.
For us, we're going to take advantage of the fact that Charlie is in our DR site and thus doesn't have much written to it; still, we'll need to do this quickly.

7. The Last Hop

Okay, you've got all of them up to date, but we haven't made that last hop, where the first node pulls from the last (Adam pulls from Charlie, in our case). Instead of the normal restore process, we're going to switch it up a bit:
$ ls
20120912-BenBackup.sql.bz2
$ bunzip2 20120912-BenBackup.sql.bz2
$ echo "reset master;" >> 20120912-BenBackup.sql
$ mysql -u root -p < 20120912-BenBackup.sql
Adding 'reset master;' to the end resets the master binlog files and position counter. Now, bring up both this node's terminal and Adam's terminal next to each other. Log in to Adam's MySQL prompt and type "CHANGE MASTER TO MASTER_LOG_FILE='', MASTER_LOG_POS=###;", but don't hit enter. On Charlie, at the MySQL prompt, type 'show master status;' and hit enter. Copy the file name in the first column over to your command on Adam. Then move your cursor to the ###'s, run the 'show master status' command again, copy the position number over, and hit enter as quickly as you can do so accurately.
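
Put together, the dance looks something like this (a sketch; the file name and position are hypothetical values, and the 'show master status' output is trimmed to the relevant columns):

charlie mysql> show master status;
+--------------------+----------+
| File               | Position |
+--------------------+----------+
| charlie-bin.000001 |      107 |
+--------------------+----------+

adam mysql> CHANGE MASTER TO MASTER_LOG_FILE='charlie-bin.000001', MASTER_LOG_POS=107;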

8. Gentlemen, start your slaves!

Once you have it set, run 'start slave;' on Adam, then on Charlie, and finally on Ben. It's important that the position number you entered matches exactly what Charlie reported, or Adam will miss or replay updates.

9. Check your status

When a slave is started, it pulls any changes since the backup's restore (or master reset) from the node behind it (Charlie pulls from Ben, for instance). By working backward, we have a better chance of the replication circle staying in sync, and once they're all up, you're almost done. Give it a minute and run 'show slave status\G' on each. You should see something similar to this:
mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.57
                  Master_User: repluser
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: ben-bin.000022
          Read_Master_Log_Pos: 888045912
               Relay_Log_File: charlie-relay-bin.000002
                Relay_Log_Pos: 81957
        Relay_Master_Log_File: ben-bin.000022
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 888045912
              Relay_Log_Space: 82121
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 2
1 row in set (0.00 sec)