[torqueusers] Troubleshooting blcr checkpoint/restart

Phil Regier pregier at ittc.ku.edu
Tue Jan 22 14:33:24 MST 2013


Hi, all. I've got a strange issue here that might be a simple misconfiguration, a deeper problem with our setup, or perhaps even a bug in the torque code, and I'm having a difficult time figuring out where to look.

We're trying to implement blcr checkpointing for an existing cluster and are testing it on a temporary test cluster. blcr is operational locally on the one node I'm using to test qhold/qrls, and I've found that jobs often suspend and resume just fine using a minimally modified version of the scripts on the website (changed just enough to work, plus a few added log entries):

http://www.clusterresources.com/torquedocs21/2.6jobcheckpoint.shtml

Occasionally, however, a job will be dropped from the queue after a restart, sometimes (but not always) leaving its newly restarted process running unmonitored.  Once in a great while, the MOM will crash altogether, though this is very infrequent.  I scripted a qhold/qrls loop (first against a prime sieve, and now even with /bin/sleep as the executable) to get a rough idea of the expected time to live, and was quite surprised to find that, depending on the delay between qhold and qrls invocations, the failure can be very predictable.  For example, with 10 seconds between commands, the job will suspend and resume successfully for exactly 5 minutes (i.e., exactly 15 total checkpoint/restart cycles) and then disappear.  The time to failure varies reproducibly (though not monotonically) with the delay; I can forward the script and sample measurements if that might be useful.
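
To make the test procedure concrete, here is a minimal sketch of a qhold/qrls loop of the kind described above. It is not the actual script used for these measurements. It assumes qhold, qrls and qstat are on the PATH, and it assumes that a non-zero exit status from qstat means the job has left the queue (this may not hold if the server keeps completed jobs around). The names cycle_until_lost and run_command, and the run/sleep parameters, are hypothetical and exist only so the loop can be exercised without a live cluster.

```python
import subprocess
import time


def run_command(argv):
    """Run a command and return its exit status (0 means success)."""
    return subprocess.call(argv)


def cycle_until_lost(job_id, delay, max_cycles=1000,
                     run=run_command, sleep=time.sleep):
    """Hold and release job_id until qstat no longer reports it.

    Returns the number of complete hold/release cycles the job survived.
    """
    cycles = 0
    while cycles < max_cycles:
        run(["qhold", job_id])   # place a hold on the job
        sleep(delay)
        run(["qrls", job_id])    # release the hold again
        sleep(delay)
        # Assumption: a non-zero qstat status means the job is gone.
        if run(["qstat", job_id]) != 0:
            break
        cycles += 1
    return cycles
```

Called as cycle_until_lost("36320.fusion-devel.local", 10), a job that behaves like the one above would be expected to return 15, since each cycle takes roughly 20 seconds and the job is lost after about 5 minutes.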

More to the point, where should I be looking for a root cause when blcr checkpoint/restart fails intermittently for even the most trivial jobs?  The server logs appear normal for these tests, and the MOM logs are quite cryptic; the closest I've found to an indicator would be

01/22/2013 15:19:46;0080;   pbs_mom.28548;Job;36320.fusion-devel.local;checking job w/subtask pid=0 (child pid=1601)
01/22/2013 15:19:46;0008;   pbs_mom.28548;Job;scan_for_terminated;pid 1601 not tracked, statloc=0, exitval=0
01/22/2013 15:19:46;0080;   pbs_mom.28548;Req;dis_request_read;decoding command DeleteJob from PBS_Server
01/22/2013 15:19:46;0100;   pbs_mom.28548;Req;;Type DeleteJob request received from PBS_Server at fusion-devel.local, sock=8
01/22/2013 15:19:46;0008;   pbs_mom.28548;Job;mom_process_request;request type DeleteJob from host fusion-devel.local received
01/22/2013 15:19:46;0008;   pbs_mom.28548;Job;mom_process_request;request type DeleteJob from host fusion-devel.local allowed
01/22/2013 15:19:46;0008;   pbs_mom.28548;Job;mom_dispatch_request;dispatching request DeleteJob on sd=8
01/22/2013 15:19:46;0008;   pbs_mom.28548;Job;36320.fusion-devel.local;deleting job

...but this doesn't seem to be much to go on (though I don't understand MOM logs very well).  What could cause this behavior, and where should I look for further clues?  Many thanks in advance for any guidance.
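
In case it helps anyone sift through similar logs, here is a rough sketch for splitting MOM log lines like the ones above into fields and pulling out the entries that mention a given pid or job id. It assumes only what is visible in the excerpt, namely six semicolon-separated fields per line. The field names (timestamp, event, daemon, kind, obj, message) are my own labels, not official torque terminology, and parse_mom_line and matching are hypothetical helper names.

```python
from collections import namedtuple

# Field names are illustrative labels for the six semicolon-separated columns.
MomLogRecord = namedtuple(
    "MomLogRecord", "timestamp event daemon kind obj message")


def parse_mom_line(line):
    """Split one log line into a MomLogRecord, or return None if malformed."""
    # Split at most 5 times so semicolons inside the message are preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return MomLogRecord(*(p.strip() for p in parts))


def matching(lines, *needles):
    """Yield records whose obj or message field contains any needle."""
    for line in lines:
        rec = parse_mom_line(line)
        if rec is None:
            continue
        if any(n in rec.obj or n in rec.message for n in needles):
            yield rec
```

For example, list(matching(open(path), "36320.fusion-devel.local", "1601")), with path pointing at a MOM log file, would collect both the entries naming the job and the scan_for_terminated entry for child pid 1601.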

Phil Regier
pregier at ittc.ku.edu

