[torqueusers] Troubleshooting blcr checkpoint/restart
pregier at ittc.ku.edu
Tue Jan 22 14:33:24 MST 2013
Hi all, I've got a strange issue here that might be a simple misconfiguration, a deeper problem with our setup, or perhaps even a bug in the torque code, and I'm having a difficult time figuring out where to look.
We're trying to implement blcr checkpointing for an existing cluster and are testing on a temporary test cluster. blcr works locally on the one node I'm using to test qhold/qrls, and jobs often suspend and resume just fine using a minimally modified version (just enough to work, plus a few added log entries) of the scripts on the website:
Occasionally, however, a job will be dropped from the queue after a restart, sometimes (but not always) leaving its newly restarted process running unmonitored. Once in a great while the MOM will crash altogether, though this is very infrequent. I scripted a qhold/qrls loop (against a prime sieve, and now even /bin/sleep, as my executable) to get a rough idea of the expected time to live, and was quite surprised to find that, depending on the delay between qhold and qrls invocations, the failure can be very predictable. For example, with 10 seconds between commands, the job will suspend and resume successfully for exactly 5 minutes (i.e., exactly 15 total checkpoint/restart cycles at 20 seconds per cycle) and then disappear. This pattern varies consistently (though not monotonically) with delay time; I can forward the script and sample measurements if that might be useful.
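For reference, a hold/release loop of the kind described above could be sketched roughly as follows. This is not the script used for these measurements, only a minimal Python illustration. The function names (job_is_queued, cycle_until_lost) and the run/sleep parameters are made up for the example, and it assumes that qstat exits non-zero once the server no longer knows the job, which may not hold if keep_completed leaves finished jobs visible.

```python
import subprocess
import time


def job_is_queued(job_id, run=subprocess.run):
    """Return True if `qstat <job_id>` exits with status 0.

    Assumption: qstat returns a non-zero status for a job the server
    no longer knows about.
    """
    result = run(["qstat", job_id], capture_output=True, text=True)
    return result.returncode == 0


def cycle_until_lost(job_id, delay, max_cycles=1000,
                     run=subprocess.run, sleep=time.sleep):
    """Alternate qhold and qrls, waiting `delay` seconds after each command.

    Returns the number of full hold/release cycles the job survived
    before it vanished from the queue, or max_cycles if it never did.
    """
    for cycle in range(max_cycles):
        for command in ("qhold", "qrls"):
            run([command, job_id], capture_output=True, text=True)
            sleep(delay)
            if not job_is_queued(job_id, run=run):
                return cycle
    return max_cycles


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python cycle.py <job_id> <delay_seconds>
    survived = cycle_until_lost(sys.argv[1], float(sys.argv[2]))
    print("job survived", survived, "full cycles")
```

With a 10 second delay after each command, one full cycle takes about 20 seconds, so a return value of 15 would correspond to the 5 minute lifetime mentioned above.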
More to the point, where should I be looking for a root cause when blcr checkpoint/restart fails intermittently for even the most trivial jobs? The server logs appear normal for these tests, and the MOM logs are quite cryptic; the closest I've found to an indicator would be
01/22/2013 15:19:46;0080; pbs_mom.28548;Job;36320.fusion-devel.local;checking job w/subtask pid=0 (child pid=1601)
01/22/2013 15:19:46;0008; pbs_mom.28548;Job;scan_for_terminated;pid 1601 not tracked, statloc=0, exitval=0
01/22/2013 15:19:46;0080; pbs_mom.28548;Req;dis_request_read;decoding command DeleteJob from PBS_Server
01/22/2013 15:19:46;0100; pbs_mom.28548;Req;;Type DeleteJob request received from PBS_Server at fusion-devel.local, sock=8
01/22/2013 15:19:46;0008; pbs_mom.28548;Job;mom_process_request;request type DeleteJob from host fusion-devel.local received
01/22/2013 15:19:46;0008; pbs_mom.28548;Job;mom_process_request;request type DeleteJob from host fusion-devel.local allowed
01/22/2013 15:19:46;0008; pbs_mom.28548;Job;mom_dispatch_request;dispatching request DeleteJob on sd=8
01/22/2013 15:19:46;0008; pbs_mom.28548;Job;36320.fusion-devel.local;deleting job
...but this doesn't seem to be much to go on (though I don't understand MOM logs very well). What could possibly cause this behavior, and/or where would I look for further clues? Many thanks in advance for any guidance.
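One way to get more context out of the MOM log is to split each line on its semicolons and pull out whatever the MOM recorded just before it deleted the job. The sketch below does that for lines in the format quoted above. The field labels (timestamp, code, daemon, kind, name, message) are my own guesses at what the six columns mean, not official torque names, and entries_before_delete is a hypothetical helper.

```python
from collections import namedtuple

# Field labels are illustrative guesses, not official torque terminology.
MomLogEntry = namedtuple("MomLogEntry", "timestamp code daemon kind name message")


def parse_mom_line(line):
    """Split one MOM log line into six semicolon-separated fields.

    Returns None for lines that do not have at least six fields.
    The message field keeps any further semicolons it contains.
    """
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return MomLogEntry(*(p.strip() for p in parts))


def entries_before_delete(lines, job_id, window=20):
    """Return up to `window` entries preceding the first 'deleting job'
    entry recorded for job_id, or an empty list if there is none."""
    entries = [e for e in map(parse_mom_line, lines) if e is not None]
    for i, entry in enumerate(entries):
        if entry.name == job_id and entry.message == "deleting job":
            return entries[max(0, i - window):i]
    return []


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python momlog.py <mom_log_file> <job_id>
    with open(sys.argv[1]) as log:
        for e in entries_before_delete(log, sys.argv[2]):
            print(e.timestamp, e.code, e.kind, e.name, e.message)
```

Raising window shows more of what the MOM logged ahead of the DeleteJob request, which may help to tell whether the server or the MOM first decided the job was gone.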