2013-12-18

More OpenShift Oddities

I had to fight with OpenShift a bit more today to get my application up and running after a botched code push. Restarting from the website didn't work, and simply re-pushing the git code didn't help either... so, time to dig in. As you can see below, [node] being in brackets meant it wasn't really running; it was in the process of starting or stopping. In fact, it kept doing that quite frequently according to a tail -f on /nodejs/logs/node.log. So, I decided I had to stop it restarting, but how?

[(app name).rhcloud.com (username)]\> ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1313     240483  0.0  0.0 105068  3152 ?        S    17:02   0:00 sshd: (user)@pts/1
1313     240486  0.0  0.0 108608  2100 pts/1    Ss   17:02   0:00 /bin/bash --init-file /usr/bin/rhcsh -i
1313     249661  0.1  0.4 397100 35224 ?        Sl   17:08   0:00 /usr/bin/mongod --auth -f /var/lib/openshift/(user)/mongodb//conf/mongodb.conf run
1313     261473  5.5  0.0      0     0 ?        R    17:15   0:00 [node]
1313     261476  2.0  0.0 110244  1156 pts/1    R+   17:15   0:00 ps aux
1313     390906  8.1  0.2 1021240 20196 ?       Sl   Dec10 321:14 node /opt/rh/nodejs010/root/usr/bin/supervisor -e node|js|coffee -p 1000 -- server.js
[(app name).rhcloud.com (username)]\> kill 390906

That killed the "supervisor" process that re-spawns the node process. That's generally helpful, but today it meant the node PID kept changing faster than the gear could manage to stop it. Unfortunately, now I couldn't restart it (re-running the supervisor command from the ps output just gave me an error complaining about an Unhandled 'error' event in the supervisor script), so I decided to start the node service myself.

There are a few ways of doing this. You can go to your code and run 'node' or you can use gear start. But if you try gear start, well, it won't start if it thinks it's already running. After killing supervisor, the node process was not attempting to restart, but gear start didn't work either. I tried tricking it by clearing out the $OPENSHIFT_NODEJS_PID_DIR/cartridge.pid file, but that didn't work either... It did point out something I could use though.

[(appname).rhcloud.com (username)]\> gear stop
Stopping gear...
Stopping NodeJS cartridge
usage: kill [ -s signal | -p ] [ -a ] pid ...
       kill -l [ signal ]
Stopping MongoDB cartridge
[(appname).rhcloud.com (username)]\> gear start
Starting gear...
Starting MongoDB cartridge
Starting NodeJS cartridge
Application 'deploy' failed to start
An error occurred executing 'gear start' (exit code: 1)
Error message: Failed to execute: 'control start' for /var/lib/openshift/(username)/nodejs

For more details about the problem, try running the command again with the '--trace' option.

What I found interesting about that was that it apparently passed the empty PID from the $OPENSHIFT_NODEJS_PID_DIR/cartridge.pid file along to kill, and kill didn't know what to do with that. In fact, kill returns a failure exit code if you don't tell it what to kill OR if you tell it to kill something that isn't there (the original issue), so instead of getting an 'okay' back from the kill command when the gear script ran it, it got a failure, and that meant problems for gear. So, I figured that if I got something running on a PID it COULD kill and put that PID in the file, it'd kill it successfully and everything would be back to normal. The easiest thing I could think of was to add the '}' I'd forgotten to my script and run it by hand.
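For what it's worth, here's a rough sketch of the check I was doing in my head: it reads the same $OPENSHIFT_NODEJS_PID_DIR/cartridge.pid the cartridge uses and tests it with kill -0, which sends no signal and only checks whether the PID exists.

PIDFILE="$OPENSHIFT_NODEJS_PID_DIR/cartridge.pid"
PID=$(cat "$PIDFILE" 2>/dev/null)
if [ -n "$PID" ] && kill -0 "$PID" 2>/dev/null; then
    echo "cartridge.pid ($PID) points at a live process - gear stop's kill should succeed"
else
    echo "cartridge.pid is empty or stale - gear stop's kill will fail"
fi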

The node code is stored in /app-deployments/<datestamp>/repo/ ... but don't expect things you put here to stick around.

\> node server.js 
^Z
\> ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1313     240483  0.0  0.0 105068  3152 ?        S    17:02   0:00 sshd: (user)@pts/1
1313     240486  0.0  0.0 108608  2124 pts/1    Ss   17:02   0:00 /bin/bash --init-file /usr/bin/rhcsh -i
1313     275483  0.3  0.4 467788 36892 ?        Sl   17:24   0:01 /usr/bin/mongod --auth -f /var/lib/openshift/(user)/mongodb//conf/mongodb.conf run
1313     284292  2.5  0.6 732440 45924 pts/1    Sl   17:30   0:02 node server.js
1313     287036  2.0  0.0 110240  1156 pts/1    R+   17:32   0:00 ps aux
\> echo "284292" > $OPENSHIFT_NODEJS_PID_DIR/cartridge.pid

So, the PID is in the file, and it points at a valid running node process. Then I committed my fix and ran git push... and it was back to normal!

Counting objects: 5, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 344 bytes | 0 bytes/s, done.
Total 3 (delta 2), reused 0 (delta 0)
remote: Stopping NodeJS cartridge
remote: Stopping MongoDB cartridge
remote: Saving away previously installed Node modules
remote: Building git ref 'master', commit f5e40ef
remote: Building NodeJS cartridge
remote: npm info it worked if it ends with ok
...
remote: npm info ok 
remote: Preparing build for deployment
remote: Deployment id is aa38fed5
remote: Activating deployment
remote: Starting MongoDB cartridge
remote: Starting NodeJS cartridge
remote: Result: success
remote: Activation status: success
remote: Deployment completed with status: success

So, now that the PID was stable and correct, it seemed to deploy properly and I've had no troubles since!

2013-12-12

OpenShift: Solving the 'PID Does not match' Error

OpenShift is a great free service (with paid tiers for larger requirements) for running your Java, Node.js, Ruby, Python and Perl apps in the cloud quickly and easily. Basically, for the uninitiated, it works like this:
  • Sign Up
  • Get assigned a git repo
  • Clone the repo locally
  • Put code in the repo
  • git commit and then git push

Upon pushing the code up, it will execute your server (sometimes you'll need to write a small config file to tell it what to run) and you're done. I'm using it for Node.js along with what they call a cartridge for MongoDB. It was just purring right along until yesterday, when I got this error during a git push:

remote: Stopping NodeJS cartridge
remote: Warning: Application '(appname)' nodejs PID (322361) does not match '$OPENSHIFT_NODEJS_PID_DIR/cartridge.pid' (14154
remote: 390925).  Use force-stop to kill.
remote: An error occurred executing 'gear prereceive' (exit code: 141)
remote: Error message: Failed to execute: 'control stop' for /var/lib/openshift/(username)/nodejs
remote: 
remote: For more details about the problem, try running the command again with the '--trace' option.
To ssh://(username)@(app name).rhcloud.com/~/git/(app name).git/
 ! [remote rejected] master -> master (pre-receive hook declined)
error: failed to push some refs to 'ssh://(username)@(app name).rhcloud.com/~/git/(app name).git/'

Well, that's annoying. I did find that I could connect in and manually restart the app by killing the running node PID (ps aux lists it, then kill (pid) to kill it). Because they're running 'supervisor', it re-spawns the node process. However, that didn't actually pick up my git push either, so I was still running the old code. Not very handy. Of course, there's no way I can git push 'force-stop' and have it actually be valid, so I was left wondering what I could do to get back up and developing.

Turns out, it's not that hard to fix. Observe:
# ssh (username)@(app name).rhcloud.com
(app name) (username)]> ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1313     315834  0.3  0.4 471888 34752 ?        Sl   21:51   0:02 /usr/bin/mongod --auth -f /var/lib/openshift/(username)/mongodb//conf/mongodb.
1313     319791  0.0  0.0 104916  3120 ?        S    21:52   0:00 sshd: (username)@pts/0
1313     319809  0.0  0.0 108608  2064 pts/0    Ss   21:52   0:00 /bin/bash --init-file /usr/bin/rhcsh -i
1313     322361  0.6  0.7 733380 57516 ?        Sl   21:53   0:03 node server.js
1313     360061  2.0  0.0 110240  1148 pts/0    R+   22:02   0:00 ps aux
1313     390906  8.1  0.1 1021104 9008 ?        Sl   Dec10 226:34 node /opt/rh/nodejs010/root/usr/bin/supervisor -e node|js|coffee -p 1000 -- server.js
(app name) (username)]> vi $OPENSHIFT_NODEJS_PID_DIR/cartridge.pid

Now, that will open up vi with the PID file in it. Your PIDs will vary, but you're going to want to delete whatever is in this file and put in the PID of the node process (the 'node server.js' line above). In my case, it was 322361. Once I put that in there and saved it (ESC :wq <= for you non-vi types), you should be back in business. Run another git push and you should see your normal git output, something along these lines:
Counting objects: 8, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 679 bytes | 0 bytes/s, done.
Total 6 (delta 4), reused 0 (delta 0)
remote: Stopping NodeJS cartridge
remote: Stopping MongoDB cartridge
remote: Saving away previously installed Node modules
remote: Building git ref 'master', commit f9d21d1
remote: Building NodeJS cartridge
remote: npm info it worked if it ends with ok
remote: npm info using npm@1.2.17
remote: npm info using node@v0.10.5
...
remote: npm info ok 
remote: Preparing build for deployment
remote: Deployment id is a878ff76
remote: Activating deployment
remote: Starting MongoDB cartridge
remote: Starting NodeJS cartridge
remote: Result: success
remote: Activation status: success
remote: Deployment completed with status: success
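As an aside, if you'd rather not open vi at all, the same fix can be done as a one-liner. This is just a sketch; it assumes pgrep is available on the gear and that your app's entry point really is server.js:

# Write the PID of the running 'node server.js' process into the cartridge PID file
pgrep -f 'node server\.js$' | head -1 > $OPENSHIFT_NODEJS_PID_DIR/cartridge.pid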

2013-12-10

Too many authentication failures for

Lately I've been getting this lovely error when trying to ssh to certain hosts (not all, of course):

# ssh ssh.example.com
Received disconnect from 192.168.1.205: 2: Too many authentication failures for 

My first thought is "But you didn't even ASK me for a password!" My second thought is "And you're supposed to be using ssh keys anyway!"

So, I decide I need to specify a specific key to use on the command line with the -i option.

# ssh ssh.example.com -i myAwesomeKey
Received disconnect from 192.168.1.205: 2: Too many authentication failures for 

Well, that didn't help. Adding a -v shows that it tried a lot of keys... including the one I asked it to. This, apparently, is the crux of the issue. ssh reads through the config file (mine is fairly extensive, as I deal with a few hundred hosts, most of which share a subset of keys, but not all of them), and it doesn't necessarily try the key I specified FIRST. So if you have more than, say, 5 keys defined, ssh may offer the other keys from the config file before the one you want, and the server gives up on you before it gets there. Yes, even if you have them defined per host. For instance, my config file goes something like this:

Host src.example.com
 User frank.user
 Compression yes
 CompressionLevel 9
 IdentityFile /home/username/.ssh/internal

Host puppet.example.com
 User john.doe
 Compression yes
 CompressionLevel 9
 IdentityFile /home/username/.ssh/jdoe


Apparently, this means ssh will offer both of these keys even for hosts that aren't those two. If the third one you define, "Host ssh.example.com" in our case, is the one you want, it'll try that one THIRD, even though the Host line matches. The fix is simple: tack "IdentitiesOnly yes" in there. It tells ssh to offer ONLY the IdentityFile entries defined for that host TO that host.

Host src.example.com
 User frank.user
 Compression yes
 CompressionLevel 9
 IdentitiesOnly yes
 IdentityFile /home/username/.ssh/internal

The flip side of this behaviour is that you don't have to define an IdentityFile line for EVERY host: without IdentitiesOnly, ssh will offer every key it knows about to every Host entry in the config, and indeed to every ssh you attempt, listed or not. That's why it didn't always fail; there was a good chance the first one or two keys in the list worked. It was only when the first 5 it tried didn't work that the server cut me off.
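If you just want to prove the theory from the command line before editing the config, the same option can be passed with -o. These are standard OpenSSH options; the key name and host are just the examples from above:

# Offer only the named key, nothing else from the config or agent
ssh -o IdentitiesOnly=yes -i ~/.ssh/myAwesomeKey ssh.example.com

# Watch which keys actually get offered, and in what order
ssh -v ssh.example.com 2>&1 | grep -i offering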

2013-11-22

Writing a C/C++ App run from apache

A few weeks back I started wondering about ways to improve the performance of certain areas of my site (as I do periodically), and while I came across examples of PHP compilers in their many forms, I started wondering if I should just go whole hog and write some parts of the application right in C/C++. Sure, everyone is jumping on nodejs, but I've worked in both C and C++ before, and while it has been some time, I thought it might be an interesting exercise to try writing some of the back-end processes in either one. After all, I'm running MySQL as my DB, and there are MySQL C headers and an API, so it should be fairly straightforward. I quickly found that some of the documentation was lacking. Sure, I can create a C/C++ executable that'll give me info from my DB on demand, but getting it back to the requester was a pain. Digging around and experimenting led me to a few things I'd like to share here.
#include <iostream>
using namespace std;

// Print the CGI header plus the start of the HTML document.
void MIMEHeader() {
     cout << "Content-type: text/html" << endl << endl << endl;
     cout << "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">" << endl;
     cout << "<HTML xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' lang='en'>" << endl;
}

int main() {
     MIMEHeader();
     cout << "<head></head><body>" << endl;
     cout << "Hello World!" << endl;
     cout << "</body></HTML>" << endl;
     return 0;
}
First off, yes, you really do want those extra "endl"s on the first line: the first one ends the Content-type header, and the blank line that follows tells Apache the headers are done and that what comes next is the HTML it should serve. The next couts set up the doctype and opening HTML tag and get things ready for you to output your info. Calling this function from main() means everything is ready for output; in the example here, we tack on the proverbial "Hello World".
The next step is to get Apache to hand the output of our program to the end user's web browser. You'll need a block like this in the affected section of your Apache config:
<Directory "/path/to/your/website/x">
     AllowOverride None
     Options ExecCGI
     Order allow,deny
     Allow from all
</Directory>
ScriptAlias /x /path/to/your/website/x
In the example, I used /x as my subdir of compiled code. So, resolving it would be "http://example.com/x/programName".
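For completeness, here's roughly how I'd get a binary into that directory and test it. The paths and the binary name 'hello' are just placeholders matching the config above:

# Build the example, drop it into the ScriptAlias'd directory and hit it with curl
g++ -o hello hello.cpp
cp hello /path/to/your/website/x/
chmod 755 /path/to/your/website/x/hello
curl http://example.com/x/hello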

At this point you're good to go, and boy is it FAST. In my non-exhaustive testing, a simple app involving a MySQL query takes less than 1ms more than the actual MySQL query itself. Obviously, your results may vary. Here's the example C++ I started with before adding fancier API-type features to it. I compiled it on CentOS 5.4 with the stock CentOS-distributed MySQL 5.1:

// Compile this using:
// g++ -o mysqltest mysqltest.cpp -L/usr/lib64/mysql -lmysqlclient -I/usr/include/mysql && ./mysqltest
//

#include <iostream>
#include <mysql/mysql.h>
using namespace std;

MYSQL *connection, mysql;
MYSQL_RES *result;
MYSQL_ROW row;
int query_state;

// CGI header plus the start of the page.
void MIMEHeader() {
    cout << "Content-type: text/html" << endl << endl << endl << "<HTML><BODY><PRE>";
}

void Footer() {
    cout << "</PRE></BODY></HTML>";
}

int main() {
    MIMEHeader();

    // Connect to the local MySQL server.
    mysql_init(&mysql);
    connection = mysql_real_connect(&mysql, "localhost", "myUs3r", "myPa55word", "myBookmarks", 0, 0, 0);
    if (connection == NULL) {
        cout << mysql_error(&mysql) << endl;
        return 1;
    }

    // Run the query and bail out (printing the error) if it fails.
    query_state = mysql_query(connection, "select * from links where id like '%blogger.com';");
    if (query_state != 0) {
        cout << mysql_error(&mysql) << endl;
        return 1;
    }

    // Dump the result set as a simple HTML table.
    result = mysql_store_result(connection);
    cout << "<table>" << endl;
    while ((row = mysql_fetch_row(result)) != NULL) {
        cout << "<tr><td>" << row[0] << "</td><td>" << row[1] << "</td><td>" << row[2] << "</td></tr>" << endl;
    }
    cout << "</table>" << endl;

    Footer();
    mysql_free_result(result);
    mysql_close(connection);
    return 0;
}

2013-11-20

Adding Swap Space in Linux Without a Reboot

So, let's say you've got a server running out of memory. Not just RAM, but swap too. Now, generally, there are a few well known ways to solve this issue.

  • Close/Kill processes you don't need
  • Reboot
  • Add another swap partition
  • Buy more RAM
  • Buy more Hardware

Now, in our scenario, the first option isn't helping, and the second is just the nuclear version of the first. We've got one huge process, and it's not all active memory... it's just consuming a lot of RAM and swap, and we want it to succeed. Buying more RAM is the best idea, but this server won't take any more, or we're not sure we'll have this workload often enough to justify spending money on more hardware. We've gotta get creative before it fills up and gets OOM-killed. Adding another swap partition is a great idea, but we're out of available disk partitions or drives to throw at it. However, we do have some free space on an existing partition, so we can leverage that.

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md2               47G   11G   35G  23% /
/dev/hda1              99M   20M   74M  22% /boot

Alright, looking at a top or vmstat, we know we've got 4GB of RAM in here, and another 2GB of swap. Knowing the size of the process, we figure doubling that swap will give us plenty of overhead at the moment. Let's do this!

$ dd if=/dev/zero of=/newswap bs=32k count=64k

65536+0 records in
65536+0 records out
2147483648 bytes (2.1 GB) copied, 18.9618 seconds, 113 MB/s

$ ls -al /newswap
-rw-r--r-- 1 root root 2147483648 Nov 19 23:02 /newswap
$ mkswap /newswap
Setting up swapspace version 1, size = 2147479 kB
$ swapon /newswap

And that's it. A quick check should find that we now have another 2GB of swap space and a system that can breathe a little more.

Note: The size of the swap space is determined by the size of the file: 'bs' is the block size and 'count' is the number of blocks, so the file ends up bs x count bytes. I generally stick to 32k or 64k block sizes and adjust the count from there: bs=64k count=64k gives 4GB, bs=64k count=128k gives 8GB, and so on.

Now, this won't stick after a reboot as-is. If you'd like it to, change the process a bit: it's the same up through the mkswap command, but instead of running swapon, open up /etc/fstab in your favorite editor (vi /etc/fstab) and add another swap line after the entry for the disk the file lives on, like so:

/newswap         swap                    swap    defaults        0 0

Then you can run 'swapon -a' and it will activate ALL of the swap entries in fstab.
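For reference, here's the whole procedure as one short sketch. The chmod step is my own addition, since newer mkswap/swapon versions complain about world-readable swap files; everything else matches the commands above:

dd if=/dev/zero of=/newswap bs=32k count=64k   # 32k blocks x 65536 = 2GB file
chmod 600 /newswap                             # keep the swap file readable by root only
mkswap /newswap
swapon /newswap                                # or add the fstab line and run 'swapon -a'
swapon -s                                      # confirm the new swap is active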

Note: Swap automatically stripes across multiple swap partitions of the same priority. It might be useful to make swap partitions on multiple drives to allow for faster RAID-0 type speeds across drives!

Hope this helps someone out. I had to use it the other day and was able to save a long-running process that was eating up RAM like candy; it finished a few hours after I put this fix in place. Since I don't run that process often, I simply removed the line from /etc/fstab, and the next time the server rebooted it was back to its normal swap sizes. I then deleted the file and it was like nothing ever happened!





2013-11-19

Cacti 0.8.8a and Plugins

Today I needed to install some plugins into my Cacti 0.8.8a install. Cacti has been a great graphing tool for many years here, and the 0.8.8a update added a lot of cool things. One of them is the built-in Plugin Architecture (previous versions required you to install Cacti, then download the "PIA" and patch files and make manual edits to get it to work). However, being used to the old ways, I was having trouble with the "simple new way" and didn't find much out in the wild about making it work. It really is simple! Just:
  1. Download a plugin from somewhere like the official Cacti plugins site
  2. Extract the contents of the tarball (it should make a subdirectory)
  3. Move the subdirectory to the cacti/plugins/ directory (mine is /var/www/html/cacti/plugins/)
  4. Open up Cacti in a browser
  5. If you see Plugin Management on the left, skip the next step and just go there
  6. Go to User Management, pick your user and check the box next to "Plugin Management"... refresh the page and then go to "Plugin Management" in the left menu.
  7. You should see your plugin listed. Hit the arrow down button to install it, and then the arrow right to enable it!

NOTE: The problem I had is this: the directory your plugin is in (for instance, mine was "cacti/plugins/aggregate-0.7B2") CANNOT HAVE NON-ALPHANUMERIC CHARACTERS.

So, basically, I followed all of these instructions, wondered where the heck my plugin was, and started scraping through code. Only when I finally found that the line that looks for available plugins runs input_validate_input_regex(get_request_var("id"), "^([a-zA-Z0-9]+)$"); did I realize, "Hey, maybe if I took that - and the . out of the directory name..."

So I renamed the directory, hit refresh on the Plugin Management page, and saw my plugin appear.
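In other words, the whole workaround boils down to something like this. The tarball name and paths are just examples; the point is that the final directory name has to match ^([a-zA-Z0-9]+)$:

cd /var/www/html/cacti/plugins
tar xzf /tmp/aggregate-0.7B2.tar.gz     # extracts into aggregate-0.7B2/
mv aggregate-0.7B2 aggregate            # rename so only alphanumerics remain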

2013-11-07

PHPUnit 3.5.15 for Zend Framework ZF1 on CentOS

We're still running ZF1 here, and as the ZF2 update is no small body of work, we continue to need PHPUnit 3.5.15 in order to run our unit tests. I've found that this is not as straightforward as installing the latest version of PHPUnit (up to 3.7.24 as of this writing). PHPUnit 3.6 and later are geared toward ZF2 development and testing and no longer work AT ALL with ZF1. So, before you think "I can just run pear install phpunit/PHPUnit and be good!" and then bang your head against the desk trying to get it to work, this is the sequence of commands I've gone through to install it on CentOS 5.x and 6.x hosts.

pear channel-discover pear.symfony-project.com
pear channel-discover components.ez.no
pear install --alldeps pear.phpunit.de/DbUnit-1.0.3
pear install phpunit/PHPUnit_TokenStream-1.1.5    (this one fails with "No releases available", as you'll see in the transcript below, and things work anyway)
pear install pear.phpunit.de/PHPUnit_Selenium-1.0.1
pear install phpunit/File_Iterator-1.2.3
pear install pear.phpunit.de/PHP_CodeCoverage-1.0.2
pear install pear.phpunit.de/PHPUnit-3.5.15

In fact, here's a full output of running this in a terminal on a CentOS 5.10 box

[root@mydev ~]# pear channel-discover components.ez.no
Adding Channel "components.ez.no" succeeded
Discovery of channel "components.ez.no" succeeded
[root@mydev ~]# pear channel-discover pear.symfony-project.com
Adding Channel "pear.symfony-project.com" succeeded
Discovery of channel "pear.symfony-project.com" succeeded
[root@mydev ~]# pear install --alldeps pear.phpunit.de/DbUnit-1.0.3
downloading DbUnit-1.0.3.tgz ...
Starting to download DbUnit-1.0.3.tgz (39,292 bytes)
..........done: 39,292 bytes
downloading YAML-1.0.6.tgz ...
Starting to download YAML-1.0.6.tgz (10,010 bytes)
...done: 10,010 bytes
install ok: channel://pear.symfony-project.com/YAML-1.0.6
install ok: channel://pear.phpunit.de/DbUnit-1.0.3
[root@mydev ~]# pear install phpunit/PHPUnit_TokenStream-1.1.5
No releases available for package "pear.phpunit.de/PHPUnit_TokenStream"
install failed
[root@mydev ~]# pear install pear.phpunit.de/PHPUnit_Selenium-1.0.1
downloading PHPUnit_Selenium-1.0.1.tgz ...
Starting to download PHPUnit_Selenium-1.0.1.tgz (15,285 bytes)
.....done: 15,285 bytes
install ok: channel://pear.phpunit.de/PHPUnit_Selenium-1.0.1
[root@mydev ~]# pear install phpunit/File_Iterator-1.2.3
downloading File_Iterator-1.2.3.tgz ...
Starting to download File_Iterator-1.2.3.tgz (3,406 bytes)
....done: 3,406 bytes
install ok: channel://pear.phpunit.de/File_Iterator-1.2.3
[root@mydev ~]# pear install pear.phpunit.de/PHP_CodeCoverage-1.0.2
downloading PHP_CodeCoverage-1.0.2.tgz ...
Starting to download PHP_CodeCoverage-1.0.2.tgz (109,280 bytes)
.........................done: 109,280 bytes
downloading ConsoleTools-1.6.1.tgz ...
Starting to download ConsoleTools-1.6.1.tgz (869,994 bytes)
...done: 869,994 bytes
downloading PHP_TokenStream-1.2.1.tgz ...
Starting to download PHP_TokenStream-1.2.1.tgz (9,854 bytes)
...done: 9,854 bytes
downloading Text_Template-1.1.4.tgz ...
Starting to download Text_Template-1.1.4.tgz (3,701 bytes)
...done: 3,701 bytes
downloading Base-1.8.tgz ...
Starting to download Base-1.8.tgz (236,357 bytes)
...done: 236,357 bytes
install ok: channel://pear.phpunit.de/PHP_TokenStream-1.2.1
install ok: channel://pear.phpunit.de/Text_Template-1.1.4
install ok: channel://components.ez.no/Base-1.8
install ok: channel://components.ez.no/ConsoleTools-1.6.1
install ok: channel://pear.phpunit.de/PHP_CodeCoverage-1.0.2
[root@mydev ~]# pear install pear.phpunit.de/PHPUnit-3.5.15
Did not download optional dependencies: pear/XML_RPC2, use --alldeps to download automatically
phpunit/PHPUnit can optionally use package "pear/XML_RPC2"
phpunit/PHPUnit can optionally use PHP extension "dbus"
downloading PHPUnit-3.5.15.tgz ...
Starting to download PHPUnit-3.5.15.tgz (118,859 bytes)
..........................done: 118,859 bytes
downloading PHP_Timer-1.0.5.tgz ...
Starting to download PHP_Timer-1.0.5.tgz (3,597 bytes)
...done: 3,597 bytes
downloading PHPUnit_MockObject-1.2.3.tgz ...
Starting to download PHPUnit_MockObject-1.2.3.tgz (20,390 bytes)
...done: 20,390 bytes
install ok: channel://pear.phpunit.de/PHP_Timer-1.0.5
install ok: channel://pear.phpunit.de/PHPUnit_MockObject-1.2.3
install ok: channel://pear.phpunit.de/PHPUnit-3.5.15
[root@mydev ~]# phpunit --version
PHPUnit 3.5.15 by Sebastian Bergmann.

And there was much rejoicing!
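One caveat that isn't in the transcript above: if a newer PHPUnit is already installed through pear on the box, I'd remove it first so the 3.5.15 files don't collide with it. Something along these lines:

pear uninstall phpunit/PHPUnit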

2013-09-23

Setting up MySQL Replication Across 3+ Cluster Nodes

I've been running a multi-node MySQL replication loop for a while and thought it'd be useful to write up how replication gets re-synced between the peers. I'm writing this up as if there are three nodes, Adam, Ben and Charlie, in a loop, but you can do this with any number of nodes. In our setup Adam and Ben are in our primary location, with Charlie sitting in our DR site. This means Charlie is always on but doesn't get many queries on a regular basis. Adam and Ben have Heartbeat set up as a failover pair, so that when Adam takes a vacation (downtime) everything fails over to Ben and continues to run. We do this simply with a floating IP.

1. Decide which node to start with

Because of this setup, prior to doing a resync, it's important to connect to the node that currently has the active IP and start things from there. To find out which one that is, connect to the nodes and run:
$ /sbin/ip addr
You'll see something like:
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:50:56:82:29:01 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.56/24 brd 192.168.1.255 scope global eth0
    inet 192.168.1.59/24 brd 192.168.1.255 scope global secondary eth0
Note how this one has both the primary (eth0) address and a "secondary eth0" address on it. We know our floating IP is 192.168.1.59, so we know it lives here right now. Because of that, this node has the latest info and will be the beginning of our resync process.
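If you'd rather script that check, a quick sketch (the 192.168.1.59 is the floating IP from the example output; substitute your own):

if /sbin/ip addr show | grep -q "inet 192.168.1.59/"; then
    echo "this node holds the floating IP - start the resync from here"
fi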

2. Stop the Slaves

To stop the slaves, connect to mysql as follows:
$ mysql -p -u root

mysql> stop slave;
Query OK, 0 rows affected (0.01 sec)
I recommend going ahead and stopping the slave on the other nodes as well at this point.

3. Get a fresh mysqldump

To start the resync process, we need to take a dump of all databases with --master-data in a single transaction:
/bin/mysqldump -p -u root --all-databases --single-transaction --master-data | bzip2 > 20120912-midsync-`hostname`.sql.bz2
This makes sure we get all the info we need in the backup. Without the master data we'd have to know exactly where in the binary logs the backup was taken in order not to miss any updates, which is almost impossible when running this against an active loop. When this is done, transfer the backup from Adam to Ben. Then, if you haven't already, stop the slaves around the circle.

4. Transfer dump to the next node in the circle

You can use scp or whatever you'd like. Just get the backup over there.

5. Import the dump

Then, on the first target (Ben in this case), bunzip2 the file, and then import it:
$ mysql -p -u root < 20120912-midsync-adam.sql
Because we included the master data, this will automatically set the master position so the slave can resume correctly... but don't start it yet.

6. Rinse, repeat

We're going to continue around the circle, making a backup of each node, passing it on to the next and restoring it there. In our case, we need a fresh backup of Ben, taken the same way we did the original Adam backup. Transfer that one to Charlie, bunzip2 it, and import it with the same mysql command. If you have more nodes, continue this around until all of them have fresh restores.
If you have multiple primary or online nodes, you'll want to 'start slave;' right after you import the backup. This will make sure they don't have stale data or data that will conflict with another node's data.
For us, we're going to take advantage of the fact that Charlie is in our DR site and thus doesn't have much written to it, but still we'll need to do this quickly.

7. The Last Hop

Okay, you've got all of them up to date, but we haven't made that last hop, where the first node (Adam) pulls from the last one (Charlie). Instead of the normal restore process, we're going to switch it up a bit:
$ ls
20120912-BenBackup.sql.bz2
$ bunzip2 20120912-BenBackup.sql.bz2
$ echo "reset master;" >> 20120912-BenBackup.sql
$ mysql -u root -p < 20120912-BenBackup.sql
Adding 'reset master' onto the end will reset the master's binlog files and position counter. Now, bring this node's terminal and Adam's terminal up next to each other. Log in to Adam's MySQL prompt and type "CHANGE MASTER TO MASTER_LOG_FILE='', MASTER_LOG_POS=###;", but don't hit enter. On Charlie, at the MySQL prompt, type 'show master status;' and hit enter. Copy the file name in the first data column over into your command on Adam. Then move your cursor to the ###'s, run 'show master status' again, copy the position number over, and hit enter as quickly (and accurately) as you can.
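If racing two terminals makes you nervous, the same dance can be scripted from Adam. This is just a sketch; it assumes Adam can reach Charlie's MySQL over the network and the host name and credentials are placeholders:

# On Adam: grab Charlie's current binlog file and position, then point the slave at them
read FILE POS <<< "$(mysql -h charlie -u root -p'secret' -N -e 'SHOW MASTER STATUS;' | awk '{print $1, $2}')"
mysql -u root -p'secret' -e "CHANGE MASTER TO MASTER_LOG_FILE='$FILE', MASTER_LOG_POS=$POS;"

With no copy-and-paste delay, the position is far less likely to have moved by the time the CHANGE MASTER runs.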

8. Gentlemen, start your slaves!

Once you have it set, run "start slave" on Adam, then on Charlie, and finally on Ben. It's important that the position you pasted on Adam matches where Charlie actually is, which is why speed matters on that last step.

9. Check your status

When a slave is started, it pulls any changes made since the backup it restored (or the master reset) from the node behind it (Charlie pulls from Ben, for instance). By working backward, we have a better chance of the replication circle staying in sync, and once they're all up, you're almost done. Give it a minute and run "show slave status\G" on each. You should see something similar to this:
mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.57
                  Master_User: repluser
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: ben-bin.000022
          Read_Master_Log_Pos: 888045912
               Relay_Log_File: charlie-relay-bin.000002
                Relay_Log_Pos: 81957
        Relay_Master_Log_File: ben-bin.000022
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 888045912
              Relay_Log_Space: 82121
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 2
1 row in set (0.00 sec)

2013-08-21

Puppet: Exiting; no certificate found and waitforcert is disabled

I have a number of servers that were built using Puppet. They contact a central puppet master and pull configs. This had been working quite well for a while. Then I noticed that they had suddenly started silently failing to do any updates. So I tried it manually:
# puppet agent --test
Exiting; no certificate found and waitforcert is disabled
Well, that's not too useful. Other puppet slaves are running, and the puppet master doesn't have a full disk or anything. Then I noticed the following:
ls -al /var/lib/puppet/ssl/certificate_requests/
-rw-r----- 1 puppet puppet 1610 Jan 17  2013 hostname.example.net.pem
Weird, why was there a request sitting there? Not sure. But doing a quick rm of that file and then re-running "puppet agent --test" made puppet generate a new certificate request and submit it to the master. I then ran "puppet cert --sign --all" on the master and it's good to go! I'm not sure about the root cause yet, but this solution helped me out and I wanted to share.
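Condensed into a sketch, the whole fix looks like this (the hostname is the example one from the ls output; the signing step runs on the puppet master):

# On the agent: clear the stale request and generate a fresh one
rm /var/lib/puppet/ssl/certificate_requests/hostname.example.net.pem
puppet agent --test

# On the master: sign the new request
puppet cert --sign --all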

2013-05-20

MySQL 5.6.x Admin Password

So, I was rolling out a fresh MySQL server the other day, as I've done many times before, and ran into some odd behaviour. I downloaded a fresh set of RPMs from Oracle's download page and did my usual:
# yum install MySQL-*-5.6.11*.rpm -y
# mysql_install_db
# mysql_secure_installation
And then I got the root password prompt. Usually you just hit enter because the root password is blank, but it wasn't taking that today. I wiped the DB, erased and reinstalled, and it still didn't take the empty password. Until I did:
# ls -al
total 366900
dr-xr-x---.  4 root root      4096 May 20 12:08 .
dr-xr-xr-x. 24 root root      4096 May 20 11:26 ..
-rw-r--r--.  1 root root  23010735 May 20 11:07 MySQL-client-5.6.11-2.linux_glibc2.5.x86_64.rpm
-rw-r--r--.  1 root root   4554269 May 20 11:07 MySQL-devel-5.6.11-2.linux_glibc2.5.x86_64.rpm
-rw-r--r--.  1 root root 112519557 May 20 11:08 MySQL-embedded-5.6.11-2.linux_glibc2.5.x86_64.rpm
-rw-------.  1 root root       192 May 20 12:00 .mysql_secret
-rw-r--r--.  1 root root  56354288 May 20 11:09 MySQL-server-5.6.11-2.el6.x86_64.rpm
-rw-r--r--.  1 root root  88319899 May 20 11:10 MySQL-server-5.6.11-2.linux_glibc2.5.x86_64.rpm
-rw-r--r--.  1 root root   2389748 May 20 11:10 MySQL-shared-5.6.11-2.linux_glibc2.5.x86_64.rpm
-rw-r--r--.  1 root root   5180812 May 20 11:10 MySQL-shared-compat-5.6.11-2.linux_glibc2.5.x86_64.rpm
-rw-r--r--.  1 root root  72675691 May 20 11:11 MySQL-test-5.6.11-2.linux_glibc2.5.x86_64.rpm
#
Wait, what's that ".mysql_secret" file?! I didn't put it there. Apparently the installer did... Turns out, upon initial install of the MySQL packages, it runs mysql_install_db automatically... and then sets the root password to something random.
# cat .mysql_secret

# The random password set for the root user at Mon May 20 12:00:16 2013 (local time): mS2tzW4Z
So, you'd think you could just run mysql_secure_installation using that password. Except you end up getting the message "ERROR 1862 (HY000): Your password has expired. To log in you must change it using a client that supports expired passwords." Which, apparently, the normal text client is. So, feel free to reset your password the old-fashioned way:
# mysql -u root -pmS2tzW4Z

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 6
Server version: 5.6.11

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> SET PASSWORD FOR 'root'@'localhost' = PASSWORD('ub3rS3cureP455word!');
Query OK, 0 rows affected (0.00 sec)

mysql> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.00 sec)

Then you can run mysql_secure_installation and enjoy your new MySQL Install.
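If you'd rather script that first login, you can pull the random password straight out of .mysql_secret. This sketch assumes the password is the last field on that comment line, which is how the file looked on my box:

# Grab the generated root password and log in with it
PW=$(awk 'NF { pw = $NF } END { print pw }' /root/.mysql_secret)
mysql -u root -p"$PW"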

2013-02-09

Nginx + FastCGI with the Quickness

This is on a CentOS 6.x install (I've got 6.3 here). First off, you'll need nginx and spawn-fcgi as well as PHP. For simplicity's sake I'll just go with the 5.3.3 that yum pulls in, but really any version will work (I ran it with PHP 5.4.8 for this example), as long as it's compiled with CGI support and you have 'php-cgi' available, since that's what spawn-fcgi executes.
yum install nginx spawn-fcgi php -y
This will install them with their default configs. You'll need to tweak a few things. First off, let's tackle /etc/sysconfig/spawn-fcgi

Spawn-fcgi Config

SOCKET=/var/run/php-fcgi.sock
OPTIONS="-u nginx -g nginx -s $SOCKET -S -M 0600 -C 8 -F 1 -P /var/run/spawn-fcgi.pid -- /usr/bin/php-cgi"
By default this ships with -C 32, which means it'll start 32 php-cgi processes. That seems like a lot in my experience; we have some very busy image servers and they do well with 4 to 8. I usually go with the "# of cores + 2" rule and it's worked well for me so far. Anyway, you'll also want to make sure you remember where that SOCKET is defined. It doesn't really matter where it is, but it matters that you remember it!

Nginx Config

server {
  listen  80;
  server_name zabbix.example.com;
  root   /var/www/zabbix;
 
  location / {
   index  index.html index.htm index.php;
  }
 
  location ~ \.php$ {
   include /etc/nginx/fastcgi.conf;
   fastcgi_pass unix:/var/run/php-fcgi.sock;
   fastcgi_index index.php;
  }
 }
This server block goes in either your main nginx.conf file (/etc/nginx/nginx.conf on CentOS) or a file included from it. It defines a vhost listening on that "server_name", hosted in that "root". Requests use the settings in /etc/nginx/fastcgi.conf and get passed over to the socket defined above (I told you to remember that!). Basically, the second location block tells nginx that any file ending in .php should be handed to php-cgi over the FastCGI socket. ...and that's it! I highly recommend trawling through the php.ini options, as well as the rest of the nginx options, to make sure there aren't any red flags flying (I know I've tweaked a lot outside of this), but this should get you serving PHP!
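To actually bring both services up on CentOS 6 and smoke-test the result (the URL matches the server_name above):

chkconfig spawn-fcgi on && chkconfig nginx on   # start on boot
service spawn-fcgi start
service nginx start
curl -I http://zabbix.example.com/index.php     # any .php under the root should now execute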

2013-02-05

Puppet Error: header too long

If you're working with Puppet and you find that you get this error:
puppet cert --list
Error: header too long
Be mindful of your free space! I've now rolled out 20 servers or so in my Puppet setup (soon to be duplicated to over 142 servers once I get these running right; all I'll have to do is spin up a new server, give it an IP and hostname, tell it where the puppet master is, and Puppet will handle the rest!), and I've found that I'm starting to fill up the drive with old reports, especially when re-running puppet syncs more frequently than the normal 30-minute run-interval. I started getting the above error from a lot of different puppet commands, the simplest being just trying to list certs. Then I checked "df -h":
# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              16G   15G     0 100% /
Oops! Using the following script I was able to clean up old reports easily. Set the "days" variable as high as you want for your setup. I'm using Puppet Dashboard to pull the reports into a DB, so I don't need to keep the YAML files around very long.
#!/bin/sh
days="+1"       # only touch reports more than a day old

# Each host gets its own subdirectory under reports/; clean them one at a time,
# keeping the most recent of the old reports in each and deleting the rest.
for d in `find /var/lib/puppet/reports -mindepth 1 -maxdepth 1 -type d`
do
        find "$d" -type f -name \*.yaml -mtime $days |
        sort -r |
        tail -n +2 |
        xargs /bin/rm -f
done
In my case, since it had tried to sync a new server's SSL cert while the drive was full, the error turned out to be due not only to the lack of free space but also to a corrupt cert. To find the offending cert and fix the issue, you'll need to look through the /var/lib/puppet dir for the file. The host I was looking for was 'betamem.example.com' and I found it like this:
# cd /var/lib/puppet
# find ./|grep betamem
./ssl/ca/requests/betamem.example.com
I then removed the cert request (held in /var/lib/puppet/ssl/certificate_requests/) from the agent on 'betamem' and told it to try again by cycling its puppet agent.
# rm -f /var/lib/puppet/ssl/certificate_requests/*
# /etc/init.d/puppet restart
Stopping puppet agent:                                     [  OK  ]
Starting puppet agent:                                     [  OK  ]
Tailing /var/log/messages on the master shows it's got a new request, so let's sign it:
# tail /var/log/messages -n1
puppet-master[22486]: betamem.example.com has a waiting certificate request
# puppet cert --sign betamem.example.com
Signed certificate request for betamem.example.com
Removing file Puppet::SSL::CertificateRequest at '/var/lib/puppet/ssl/ca/requests/betamem.example.com.pem'
Go back to the puppet agent and cycle it again, or just wait until the next run-interval and it should be back to normal!
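One follow-up I'd suggest, though it isn't something shown above: drop the cleanup script into cron so the reports directory can't quietly fill the disk again. Assuming you saved it as /usr/local/bin/clean-puppet-reports.sh (the path and schedule are just examples):

chmod +x /usr/local/bin/clean-puppet-reports.sh
echo '10 3 * * * root /usr/local/bin/clean-puppet-reports.sh' > /etc/cron.d/clean-puppet-reports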