[torqueusers] Torque 4.1.4 with CPUSETS and BLCR = problems.
johny2015 at wp.pl
Mon Mar 4 03:15:03 MST 2013
On 2013-02-27 13:41, Johny wrote:
> We use Torque 4.1.4 on our cluster, compiled with CPUSETS support, and it
> works great. Now I want to add BLCR to it, so I compiled Torque with BLCR
> and ran some tests. It seems that Torque does not work correctly with
> both options enabled.
> Here are my results:
> * Running a checkpointable job works OK.
> * Taking a checkpoint of a running job also works without problems.
> * Torque has problems creating the cpusets attached to a job when
> restarting it from a checkpoint. I dug into the code and found that the
> function "TMomFinalizeJob1" returns "FAILURE" (when there is a checkpoint
> file attached to the job), which prevents the next stages
> (TMomFinalizeJob2, and so on) from running. Those stages are responsible
> for creating the cpusets for that job. This has two consequences:
> ** A couple of minutes after the job restart, Torque thinks that the
> restarted job has no processes (it looks for the PIDs associated with the
> cpusets, which are empty) and deletes the job from its list.
> ** Torque is unable to kill the job's processes because there is no cpuset
> directory associated with the job, so for Torque these jobs have no
> processes to kill.
> e.g.:
> 02/27/2013 10:53:00;0008; pbs_mom.25054;Job;80.fqsrv.cis.gov.pl;kill_job found a task to kill
> 02/27/2013 10:53:23;0008; pbs_mom.25054;Job;80.fqsrv.cis.gov.pl;kill_task: sending signal 15 to task 0, session 25050
> 02/27/2013 10:55:46;0008; pbs_mom.25054;Job;80.fqsrv.cis.gov.pl;kill_task: could not send signal 15 to task 0 (session 25050 )--no process was found with this session ID (marking task as killed)!
> But such processes exist:
> ps aux | grep restart
> root 25050 0.0 0.1 8968 1036 ? Ss 13:20 0:00 /bin/bash /var/spool/torque/mom_priv/blcr_restart_script 22850 80.fqsrv.cis.gov.pl peterw all
> root 25219 0.5 0.3 50384 3604 ? S 13:20 0:00 sudo -u peterw cr_restart ckpt.80.fqsrv.cis.gov.pl.1361967507
> peterw 25250 0.0 0.0 18356 472 ? Sl 13:20 0:00 cr_restart ckpt.80.fqsrv.cis.gov.pl.1361967507
> Does anyone have an idea how to make this work, or where to look for the
> source of the problem?
> Thanks in advance.
Another data point: Torque built with BLCR only also works great (all
job processes are created and killed correctly).
The problems occur only when both options (--enable-cpuset and
--enable-blcr) are enabled.
Here is my configure line for Torque:
./configure --with-default-server=qsrv.cis.gov.pl --with-rcp=/usr/bin/scp --enable-cpuset --enable-nvidia-gpus --enable-blcr
Then I built the RPMs (the configure.ac and torque.spec files had some
mistakes, so I patched them) and installed them on our server and nodes.
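
To confirm the mismatch described above on a compute node, something like the following sketch could be used. It compares the PIDs recorded in a job's cpuset with the processes that actually belong to the job's session. This is only a diagnostic sketch, not part of Torque: the cpuset location /dev/cpuset/torque/<jobid>/tasks is an assumption about how the cpuset filesystem is mounted on the node, and the function names are made up for illustration.

```python
import os


def cpuset_tasks(job_id, cpuset_root="/dev/cpuset/torque"):
    """Return the PIDs listed in the job's cpuset, or None if it does not exist.

    The default cpuset_root is an assumption; adjust it to the node's mount point.
    """
    tasks_file = os.path.join(cpuset_root, job_id, "tasks")
    try:
        with open(tasks_file) as fh:
            return sorted(int(line) for line in fh if line.strip())
    except FileNotFoundError:
        return None


def session_pids(session_id, proc_root="/proc"):
    """Return the PIDs of all processes whose session ID equals session_id."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                stat = fh.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The command name (field 2) may contain spaces or parentheses,
        # so only parse what follows the last ')'.
        fields = stat[stat.rfind(")") + 1:].split()
        # fields are: state, ppid, pgrp, session, ...
        if len(fields) > 3 and int(fields[3]) == session_id:
            pids.append(int(entry))
    return sorted(pids)


def untracked_pids(job_id, session_id,
                   cpuset_root="/dev/cpuset/torque", proc_root="/proc"):
    """Return session PIDs that are not listed in the job's cpuset."""
    in_cpuset = set(cpuset_tasks(job_id, cpuset_root) or [])
    return [pid for pid in session_pids(session_id, proc_root)
            if pid not in in_cpuset]
```

For the job in the log above, untracked_pids("80.fqsrv.cis.gov.pl", 25050) would be the call to try. A non-empty result while cpuset_tasks() returns None would match the symptom: the processes exist, but there is no cpuset directory through which pbs_mom could find them. Restarted processes that ended up in a different session would not show up in this check.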