The JML Continuum: More OpenShift Oddities

I had to fight with OpenShift a bit more today to get my application up and running after a botched code push. Restarting from the website didn't work, and simply re-pushing the code with git didn't help either... so it was time to dig in. As you can see in the ps output below, [node] shows up in brackets, which is what ps prints when it can't read a process's arguments. Here that meant node wasn't really running; it was in the process of starting or stopping. In fact, it kept doing that quite frequently, according to a tail -f on /nodejs/logs/node.log. So I decided I had to stop it from restarting, but how?

[(app name).rhcloud.com (username)]\> ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1313     240483  0.0  0.0 105068  3152 ?        S    17:02   0:00 sshd: (user)@pts/1
1313     240486  0.0  0.0 108608  2100 pts/1    Ss   17:02   0:00 /bin/bash --init-file /usr/bin/rhcsh -i
1313     249661  0.1  0.4 397100 35224 ?        Sl   17:08   0:00 /usr/bin/mongod --auth -f /var/lib/openshift/(user)/mongodb//conf/mongodb.conf run
1313     261473  5.5  0.0      0     0 ?        R    17:15   0:00 [node]
1313     261476  2.0  0.0 110244  1156 pts/1    R+   17:15   0:00 ps aux
1313     390906  8.1  0.2 1021240 20196 ?       Sl   Dec10 321:14 node /opt/rh/nodejs010/root/usr/bin/supervisor -e node|js|coffee -p 1000 -- server.js
[(app name).rhcloud.com (username)]\> kill 390906

That killed supervisor, the process that re-spawns the node process. Respawning is generally helpful, but today node's PID kept incrementing with every restart, and that seemed to be happening more often than the gear could attempt to stop it. Unfortunately, now I couldn't restart supervisor (rerunning that command from the ps output just gave me an error complaining about an Unhandled 'error' event in the supervisor script), so I decided to start the node service myself.
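To make the "incrementing PID" symptom concrete, here is a minimal sketch of a respawn loop. It is not the real supervisor script; respawn is a hypothetical helper that just reruns a command that keeps exiting and prints the PID of each attempt.

```shell
# Hypothetical stand-in for a supervisor: rerun a command a fixed number of
# times and print the PID of every attempt. Not the real supervisor script.
respawn() {
    local attempts="$1"; shift
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        "$@" &              # start the child in the background
        echo "$!"           # $! is the PID of the child just started
        wait "$!" || true   # let it exit; ignore a non-zero exit status
        i=$((i + 1))
    done
}

# Example: a command that fails immediately gets a new PID on every attempt.
# respawn 3 false
```

Every attempt is a brand-new process, so anything that read the PID a moment ago is already holding a stale number.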

There are a few ways of doing this. You can go to your code and run 'node', or you can use gear start. But gear start won't start anything if it thinks the app is already running. After I killed supervisor, the node process was no longer attempting to restart, but gear start didn't work either. I tried tricking it by clearing out the $OPENSHIFT_NODEJS_PID_DIR/cartridge.pid file, and that failed too... It did point out something I could use, though.

[(appname).rhcloud.com (username)]\> gear stop
Stopping gear...
Stopping NodeJS cartridge
usage: kill [ -s signal | -p ] [ -a ] pid ...
       kill -l [ signal ]
Stopping MongoDB cartridge
[(appname).rhcloud.com (username)]\> gear start
Starting gear...
Starting MongoDB cartridge
Starting NodeJS cartridge
Application 'deploy' failed to start
An error occurred executing 'gear start' (exit code: 1)
Error message: Failed to execute: 'control start' for /var/lib/openshift/(username)/nodejs

For more details about the problem, try running the command again with the '--trace' option.

What I found interesting was that gear apparently passed the empty PID from the $OPENSHIFT_NODEJS_PID_DIR/cartridge.pid file along to kill, and kill didn't know what to do with that. In fact, kill returns a non-zero exit status if you don't tell it what to kill OR if you tell it to kill something that isn't there (the original issue). So instead of getting an 'okay' back when the gear script ran kill, it got a failure, and that meant problems for gear. I figured that if I got something running on a PID that it COULD kill and put that PID in the file, it would kill it successfully and everything would be back to normal. The easiest thing I could think of was to stick in the '}' that I'd forgotten in my script and run that.
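For comparison, here is a minimal sketch of a stop step that tolerates both failure cases described above. This is not OpenShift's control script; stop_from_pidfile and its messages are hypothetical, and it assumes the pid file holds at most a single number.

```shell
# Hypothetical helper, not OpenShift's control script: stop the PID recorded
# in a pid file, treating "nothing to kill" as success rather than failure.
stop_from_pidfile() {
    local pidfile="$1"
    local pid=""

    # A missing or unreadable file simply leaves pid empty.
    if [ -r "$pidfile" ]; then
        pid="$(tr -d '[:space:]' < "$pidfile")"
    fi

    case "$pid" in
        ''|*[!0-9]*)
            # Empty or non-numeric: running kill with this would just fail.
            echo "no usable pid recorded"
            return 0
            ;;
    esac

    # kill -0 sends no signal; it only checks that the PID can be signalled.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "pid $pid not running"
        return 0
    fi

    kill "$pid" && echo "stopped $pid"
}
```

With a check like this, an empty or stale cartridge.pid would be reported and skipped instead of turning into a failed stop.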

The node code is stored in /app-deployments/<datestamp>/repo/, but don't expect things you put there to stick around.

\> node server.js 
^Z
\> ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1313     240483  0.0  0.0 105068  3152 ?        S    17:02   0:00 sshd: (user)@pts/1
1313     240486  0.0  0.0 108608  2124 pts/1    Ss   17:02   0:00 /bin/bash --init-file /usr/bin/rhcsh -i
1313     275483  0.3  0.4 467788 36892 ?        Sl   17:24   0:01 /usr/bin/mongod --auth -f /var/lib/openshift/(user)/mongodb//conf/mongodb.conf run
1313     284292  2.5  0.6 732440 45924 pts/1    Sl   17:30   0:02 node server.js
1313     287036  2.0  0.0 110240  1156 pts/1    R+   17:32   0:00 ps aux
\> echo "284292" > $OPENSHIFT_NODEJS_PID_DIR/cartridge.pid

So, the PID is in the file, and it belongs to a valid, running node process. Then I committed my fix with git and ran git push... and it was back to normal!
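Copying the PID out of ps by hand works, but the shell already knows it. Here is a minimal sketch, assuming the process is started in the background from the same shell; start_with_pidfile is a hypothetical helper, not part of the gear tooling.

```shell
# Hypothetical helper: start a command in the background and record its PID
# in a pid file, instead of copying the number out of ps by hand.
start_with_pidfile() {
    local pidfile="$1"; shift
    "$@" >/dev/null 2>&1 &     # run the command in the background, output discarded
    echo "$!" > "$pidfile"     # $! is the PID of that background job
}

# On the gear this would look something like:
# start_with_pidfile "$OPENSHIFT_NODEJS_PID_DIR/cartridge.pid" node server.js
```

The point is the same as in the manual version: the file ends up holding the PID of a process that really exists, so the next stop has something it can kill.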

Counting objects: 5, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 344 bytes | 0 bytes/s, done.
Total 3 (delta 2), reused 0 (delta 0)
remote: Stopping NodeJS cartridge
remote: Stopping MongoDB cartridge
remote: Saving away previously installed Node modules
remote: Building git ref 'master', commit f5e40ef
remote: Building NodeJS cartridge
remote: npm info it worked if it ends with ok
...
remote: npm info ok 
remote: Preparing build for deployment
remote: Deployment id is aa38fed5
remote: Activating deployment
remote: Starting MongoDB cartridge
remote: Starting NodeJS cartridge
remote: Result: success
remote: Activation status: success
remote: Deployment completed with status: success

So, now that the PID was stable and correct, it seemed to deploy properly and I've had no troubles since!

2013-12-18
