[torqueusers] Torque 4.1.4 with CPUSETS and BLCR = problems.
johny2015 at wp.pl
Wed Feb 27 05:41:47 MST 2013
We use Torque 4.1.4 on our cluster, compiled with CPUSETS support, and it
works great. Now I want to add BLCR to it, so I compiled Torque with BLCR
and ran some tests. It seems that Torque does not work correctly with both
options enabled.
Here are my results:
* Running a checkpointable job works fine.
* Taking a checkpoint of a running job also works without problems.
* Torque has problems creating the CPUSETS attached to the job when
restarting from a checkpoint. I dug into the code and found that the
function "TMomFinalizeJob1" returns "FAILURE" (when there is a checkpoint
file attached to the job), which prevents the next stages
(TMomFinalizeJob2, and so on) from running. Those stages are responsible
for creating the cpusets for that job. This has two consequences:
** A couple of minutes after the job restarts, Torque decides that the
restarted job has no processes (it looks for the PIDs associated with the
job's cpusets, which are empty) and deletes the job from its list.
** Torque is unable to kill the job's processes because there is no cpuset
directory associated with that job, so as far as Torque is concerned the
job has no processes to kill.
e.g.:
02/27/2013 10:53:00;0008; pbs_mom.25054;Job;80.fqsrv.cis.gov.pl;kill_job found a task to kill
pbs_mom.25054;Job;80.fqsrv.cis.gov.pl;kill_task: sending signal 15 to task 0, session 25050
pbs_mom.25054;Job;80.fqsrv.cis.gov.pl;kill_task: could not send signal 15 to task 0 (session 25050 )--no process was found with this session ID (marking task as killed)!
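The cpuset side of this symptom can be checked by hand. Below is a minimal Python sketch, not Torque code: it reports whether a cpuset directory exists for a job and which PIDs it holds. The location `/dev/cpuset/torque/<jobid>` and the helper name `cpuset_pids` are assumptions made for illustration; the mount point and layout may differ on a given installation.

```python
import os

# Assumption: the cpuset filesystem is mounted at /dev/cpuset and per-job
# cpusets live in a "torque" subdirectory. Adjust for the local setup.
CPUSET_ROOT = "/dev/cpuset/torque"


def cpuset_pids(job_id, root=CPUSET_ROOT):
    """Return the PIDs listed in the job's cpuset.

    Returns None when the job has no cpuset directory (or no "tasks" file),
    and an empty list when the cpuset exists but holds no processes.
    """
    tasks_path = os.path.join(root, job_id, "tasks")
    try:
        with open(tasks_path) as fh:
            # The "tasks" file lists one PID per line.
            return [int(line) for line in fh if line.strip()]
    except (FileNotFoundError, NotADirectoryError):
        return None
```

Calling `cpuset_pids("80.fqsrv.cis.gov.pl")` on the compute node distinguishes the two cases: `None` means no cpuset was created for the job at all, while `[]` means the cpuset exists but no process was placed in it.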
But such processes do exist:
ps aux | grep restart
root 25050 0.0 0.1 8968 1036 ? Ss 13:20 0:00 /bin/bash /var/spool/torque/mom_priv/blcr_restart_script 22850 80.fqsrv.cis.gov.pl peterw all
root 25219 0.5 0.3 50384 3604 ? S 13:20 0:00 sudo -u peterw cr_restart ckpt.80.fqsrv.cis.gov.pl.1361967507
peterw 25250 0.0 0.0 18356 472 ? Sl 13:20 0:00
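In the listing, PID 25050 has state "Ss", where the lowercase "s" marks a session leader, so session 25050 is still alive. To cross-check this independently of cpusets, the session membership can be read straight from /proc. The following is a minimal Python sketch for Linux; the function names are made up for illustration and this is not how pbs_mom itself performs the lookup.

```python
import os


def parse_session_id(stat_line):
    """Extract the session ID from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' before splitting on whitespace.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # Fields after the command name: state, ppid, pgrp, session, ...
    return int(after_comm[3])


def pids_in_session(session_id, proc_root="/proc"):
    """Return the sorted PIDs whose session ID equals session_id."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                if parse_session_id(fh.read()) == session_id:
                    matches.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return sorted(matches)
```

Running `pids_in_session(25050)` on the node lists every process that belongs to that session, regardless of whether any cpuset directory exists for the job.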
Does anyone have an idea how to make this work, or where to look for the
source of the problem?
Thanks in advance.