2016-06-24

Growing an mdadm RAID-5 while online

Today, I decided that my 3-drive RAID-5 setup just wasn't big enough. I've got two 1.5TB drives and a 1TB drive, with the first 1TB of each in a RAID-5 built with mdadm. The extra space on the 1.5TB drives gets used as scratch area for things that don't especially need the speed or resilience of the RAID-5. But now it's time to throw another TB at it and make it a 4-drive RAID-5. Mdadm can help us out here in just a few steps. First, though, we have to partition the new drive so we can use it, and we need to figure out which device it is. Let's start with what we already have. I know the array is mounted on /array/, so mount tells me the device name, and from there mdadm tells me which drives are part of it:

# mount |grep array
/dev/md3 on /array type ext4 (rw,noatime,data=ordered)

# mdadm --misc --detail /dev/md3
/dev/md3:
        Version : 0.91
  Creation Time : Sun Jul 17 21:20:35 2011
     Raid Level : raid5
...
    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       97        2      active sync   /dev/sdg1

So sdc1, sdd1, and sdg1 are all part of this array. After inserting the new disk, I run `dmesg | grep TB`, since I know the new drive will be listed with a size in TB, and look for a device that isn't already part of the array:
# dmesg |grep TB
[    3.264540] sd 2:0:0:0: [sdc] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
[    3.329286] sd 2:0:1:0: [sdd] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
[    3.329630] sd 3:0:0:0: [sde] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
[    5.930020] sd 9:0:0:0: [sdg] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
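
If dmesg has already scrolled those lines away, lsblk is another quick way to spot the newcomer; the new drive is the one with no partitions or mountpoints hanging off of it. (This is just an aside -- the column list below is one I happen to like, nothing magic about it.)

# lsblk -o NAME,SIZE,TYPE,MOUNTPOINT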

Hey look, a new one, and we shall call you 'sde'. Time to make some partitions:

# fdisk /dev/sde

Welcome to fdisk (util-linux 2.26.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Device does not contain a recognized partition table.
Created a new DOS disklabel with disk identifier 0x8c82f3b1.

Command (m for help): p
Disk /dev/sde: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x8c82f3b1

Command (m for help): n
Partition type
   p   primary (0 primary, 0 extended, 4 free)
   e   extended (container for logical partitions)
Select (default p): p
Partition number (1-4, default 1): 1
First sector (2048-1953525167, default 2048): 2048
Last sector, +sectors or +size{K,M,G,T,P} (2048-1953525167, default 1953525167): 1953525167

Created a new partition 1 of type 'Linux' and of size 931.5 GiB.

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
So, above, we hit 'p' to print what was there. All it did was tell us about the disk again, because there weren't any existing partitions. It's still a good check to make sure there's nothing on the drive already and that it's really the disk you want. If this isn't a new drive and you want to clear it out first, just hit 'd' and pick the number of each partition you want to delete until there aren't any left. We then created a default partition as the first and only one, taking up the whole drive.
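
If you do need to clear out an old drive first, the keystrokes are roughly this -- a sketch only, since the exact prompts vary a bit between fdisk versions, and sdX is just a stand-in for whatever the old device is:

# fdisk /dev/sdX
Command (m for help): d     (repeat until no partitions are left)
Command (m for help): p     (confirm the table is now empty)
Command (m for help): w     (write it out and quit)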

If you wanted to do as I did with the 1.5TB drives and make a RAID partition that doesn't take up the whole drive (say I'd bought a 2TB one, but only wanted the first 1TB in this array), create a partition as normal, but change that "Last sector" number to match the last sector of the corresponding partition on one of the other drives. Running fdisk /dev/<otherDevice>, hitting 'p' to print its table, and then 'q' to quit will show you the sector number to match here. Feel free to make extra partitions afterwards for whatever else you want to do with the remainder of the disk.
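
If you'd rather not drop into the interactive prompt just to peek at another drive's table, this prints the same thing non-interactively (an aside, not something I needed above); the "End" value of the existing RAID partition is the number to reuse as "Last sector" on the new drive:

# fdisk -l /dev/sdc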

# mdadm --grow --raid-devices=4 --add /dev/md3 /dev/sde1
mdadm: added /dev/sde1
mdadm: Need to backup 192K of critical section..
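
That "critical section" message is normal -- mdadm took care of the backup on its own here. Some reshapes (notably ones that don't add a disk, or older mdadm builds) will refuse to start unless you point them at a --backup-file kept on a disk outside the array. If you ever hit that, the invocation looks roughly like this, with the file path being just an example:

# mdadm --grow --raid-devices=4 --backup-file=/root/md3-grow.backup /dev/md3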

# top
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
  205 root      20   0       0      0      0 S  29.2  0.0   0:16.08 md3_raid5                                                                           
 4668 root      20   0       0      0      0 D  11.0  0.0   0:04.95 md3_resync    
...
You can see that the md kernel threads are using up some decent CPU now (this is a quad core 2GHz Core 2 based Pentium D), crunching all that RAID-5 parity.

# iostat -m 1 100
 
Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc             162.00        46.23        32.00         46         32
sdd             164.00        47.38        32.00         47         32
sde              91.00         0.00        29.83          0         29
md3               0.00         0.00         0.00          0          0
sdg             151.00        44.73        28.00         44         28

And iostat shows that sdc, sdd, sdg and the new sde are all moving lots of MB/sec. Interestingly, since sde is new, you can tell it's not being read from, only written to.
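
If you want more than raw throughput, the extended stats are one flag away; the await and %util columns make it easy to see which member disk is the bottleneck (assuming a reasonably recent sysstat, which nearly anything current is):

# iostat -xm 1 100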

If you want to see detailed progress, you can run this:

# watch cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10] 
md128 : active raid1 sdb1[1] sda1[0]
      112972800 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md3 : active raid5 sde1[3] sdg1[2] sdd1[1] sdc1[0]
      1953519872 blocks super 0.91 level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
      [>....................]  reshape =  3.7% (36999992/976759936) finish=504.7min speed=31026K/sec

Here you can see both 'md128', the RAID-1 mirror for my boot drive (sda1 and sdb1), and the now-expanding RAID-5 'md3' using sdc1, sdd1, sdg1, and of course, the new sde1. Because that's run via 'watch', it'll update every 2 seconds by default. Taking the 'watch' off the front will give you a one-time status instead.
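
A couple of asides while it churns. The same `mdadm --misc --detail /dev/md3` from earlier also reports the reshape progress, handy if you just want a percentage without the rest of mdstat. And the kernel's reshape/resync speed is governed by a pair of sysctls that you can nudge upward if the defaults feel conservative -- on a box this old the disks and CPU are the real limit, so treat the value below as an example rather than a recommendation:

# mdadm --misc --detail /dev/md3 | grep -i reshape

# sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
# sysctl -w dev.raid.speed_limit_min=50000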

And now we wait... about 504.7min, apparently. ... Finally, you'll see:

# top
# cat /proc/mdstat
...
md3 : active raid5 sde1[3] sdg1[2] sdd1[1] sdc1[0]
      2930279808 blocks level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
...
# dmesg | tail
...
[42646.351875] md: md3: reshape done.
[42648.156073] RAID conf printout:
[42648.156078]  --- level:5 rd:4 wd:4
[42648.156081]  disk 0, o:1, dev:sdc1
[42648.156084]  disk 1, o:1, dev:sdd1
[42648.156086]  disk 2, o:1, dev:sdg1
[42648.156088]  disk 3, o:1, dev:sde1
[42648.156094] md3: detected capacity change from 2000404348928 to 3000606523392
[42649.508764] VFS: busy inodes on changed media or resized disk md3

But according to df, our filesystem is still the old size. That's because the filesystem and its allocation tables were laid out while the device was smaller, so, if it's formatted with one of the ext filesystem types, it can be enlarged online with the following command.

# resize2fs /dev/md3
resize2fs 1.42.12 (29-Aug-2014)
Filesystem at /dev/md3 is mounted on /array; on-line resizing required
old_desc_blocks = 117, new_desc_blocks = 175
The filesystem on /dev/md3 is now 732569952 (4k) blocks long.
# dmesg|tail
...
[52020.706909] EXT4-fs (md3): resizing filesystem from 488379968 to 732569952 blocks
[52023.727545] EXT4-fs (md3): resized filesystem to 732569952

Huzzah! We have our space! Running a quick df will also show the capacity increase! It's online and ready to use. Hope this helped you, and thanks for reading!
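
P.S. One bit of housekeeping I'll mention but won't walk through: if your distro assembles arrays from an mdadm.conf (often /etc/mdadm.conf or /etc/mdadm/mdadm.conf) and that file carries explicit ARRAY lines, it's worth refreshing the md3 entry now so nothing still describes the old 3-disk layout. mdadm will print a suitable line for you to paste in:

# mdadm --detail --scan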