[torqueusers] Torque 4.1.4 with CPUSETS and BLCR = problems.

Johny johny2015 at wp.pl
Wed Feb 27 05:41:47 MST 2013


We use Torque 4.1.4 on our cluster, compiled with CPUSETS support, and it 
works great. Now I want to add BLCR to it, so I compiled Torque 
with BLCR and ran some tests. It seems that Torque does not work 
with both options enabled.

Here are my results:
* Running a checkpointable job works OK.
* Taking a checkpoint of a running job also works without problems.
* Torque has problems creating the CPUSETS attached to a job when 
restarting it from a checkpoint. I dug into the code and found that the 
function "TMomFinalizeJob1" returns "FAILURE" (when there is a checkpoint 
file attached to the job), which prevents the next stages 
(TMomFinalizeJob2, and so on) from running. Those stages are responsible 
for creating the cpusets for that job. This has two consequences:
** A couple of minutes after the job restart, Torque thinks that the 
restarted job has no processes (it looks for PIDs associated with the 
cpusets, which are empty) and deletes that job from the list.
** Torque is unable to kill the job's PIDs because there is no cpuset 
directory associated with that job, so for Torque these jobs have no 
processes to kill.

e.g.:

02/27/2013 10:53:00;0008; pbs_mom.25054;Job;80.fqsrv.cis.gov.pl;kill_job found a task to kill
02/27/2013 10:53:23;0008; pbs_mom.25054;Job;80.fqsrv.cis.gov.pl;kill_task: sending signal 15 to task 0, session 25050
02/27/2013 10:55:46;0008; pbs_mom.25054;Job;80.fqsrv.cis.gov.pl;kill_task: could not send signal 15 to task 0 (session 25050 )--no process was found with this session ID (marking task as killed)!
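
The empty-cpuset theory can be checked directly on the compute node by looking at the job's cpuset. Below is a minimal sketch, assuming that the cpuset filesystem is mounted at /dev/cpuset and that Torque keeps per-job directories under /dev/cpuset/torque/<jobid>. Both paths are assumptions about the local setup, and the names cpuset_task_count and CPUSET_ROOT are made up for this example.

```shell
#!/bin/sh
# Assumption: Torque keeps per-job cpusets under this directory.
# Override CPUSET_ROOT if the cpuset filesystem is mounted elsewhere.
CPUSET_ROOT="${CPUSET_ROOT:-/dev/cpuset/torque}"

# cpuset_task_count JOBID   (hypothetical helper, not part of Torque)
# Prints the number of PIDs listed in the job's cpuset "tasks" file,
# or "missing" when the cpuset directory or its tasks file is not readable.
cpuset_task_count() {
    tasks_file="$CPUSET_ROOT/$1/tasks"
    if [ -r "$tasks_file" ]; then
        # One PID per line; strip the padding some wc versions add.
        wc -l < "$tasks_file" | tr -d ' '
    else
        echo "missing"
    fi
}

# Example: cpuset_task_count 80.fqsrv.cis.gov.pl
```

A result of "missing" or 0 for the restarted job, while its processes are still visible in ps, would be consistent with the cpuset never having been created on restart.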

But such processes exist:

ps aux | grep restart
root     25050  0.0  0.1   8968  1036 ?        Ss   13:20   0:00 /bin/bash /var/spool/torque/mom_priv/blcr_restart_script 22850 80.fqsrv.cis.gov.pl peterw all
root     25219  0.5  0.3  50384  3604 ?        S    13:20   0:00 sudo -u peterw cr_restart ckpt.80.fqsrv.cis.gov.pl.1361967507
peterw  25250  0.0  0.0  18356   472 ?        Sl   13:20   0:00 cr_restart ckpt.80.fqsrv.cis.gov.pl.1361967507
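
To see where the restarted processes actually ended up, the kernel reports each process's cpuset in /proc/<pid>/cpuset (see cpuset(7)). The following is only a sketch: pid_cpuset is a hypothetical helper, and PROC_ROOT is an invented override that exists so the helper can be tried against a fake directory tree. On a real node it should be left unset.

```shell
#!/bin/sh
# PROC_ROOT is a hypothetical override for testing; the default is the real /proc.
PROC_ROOT="${PROC_ROOT:-/proc}"

# pid_cpuset PID   (hypothetical helper, not part of Torque or BLCR)
# Prints the cpuset path the kernel reports for PID, relative to the
# cpuset filesystem root, or "unknown" if the file cannot be read.
pid_cpuset() {
    cpuset_file="$PROC_ROOT/$1/cpuset"
    if [ -r "$cpuset_file" ]; then
        cat "$cpuset_file"
    else
        echo "unknown"
    fi
}

# Example, using the PIDs from the ps output above:
#   for pid in 25050 25219 25250; do
#       echo "$pid $(pid_cpuset "$pid")"
#   done
```

If the processes report the root cpuset ("/") rather than a per-job path, that would suggest they were never attached to a job cpuset after the restart.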

Does anyone have an idea how to make this work, or where to look 
for the source of the problem?
Thanks in advance.


