[torqueusers] Torque 4.1.4 with CPUSETS and BLCR = problems.
Johny
johny2015 at wp.pl
Mon Mar 4 03:15:03 MST 2013
On 2013-02-27 13:41, Johny wrote:
> Hello,
>
> We use Torque 4.1.4 on our cluster, compiled with CPUSETS support, and it
> works great. Now I want to add BLCR to it, so I have compiled Torque
> with BLCR and run some tests. It seems that Torque does not work
> properly with both of these options enabled.
>
> Here are my results:
> * Running a checkpointable job works OK.
> * Taking a checkpoint of a running job also works without problems.
> * Torque has problems creating the CPUSETS attached to a job when
> restarting it from a checkpoint. I dug into the code and found that the
> function "TMomFinalizeJob1" returns "FAILURE" (when there is a checkpoint
> file attached to the job), which prevents the next stages from running
> (TMomFinalizeJob2, and so on). Those stages are responsible for creating
> the cpusets for that job. This has two consequences:
> ** A couple of minutes after the job restart, Torque thinks that the
> restarted job has no processes (it looks for the PIDs associated with the
> cpusets, which are empty) and deletes the job from its list.
> ** Torque is unable to kill the job's PIDs because there is no cpuset
> directory associated with the job, so for Torque these jobs have no
> processes to kill.
>
> e.g.:
> 02/27/2013 10:53:00;0008; pbs_mom.25054;Job;80.fqsrv.cis.gov.pl;kill_job found a task to kill
> 02/27/2013 10:53:23;0008; pbs_mom.25054;Job;80.fqsrv.cis.gov.pl;kill_task: sending signal 15 to task 0, session 25050
> 02/27/2013 10:55:46;0008; pbs_mom.25054;Job;80.fqsrv.cis.gov.pl;kill_task: could not send signal 15 to task 0 (session 25050 )--no process was found with this session ID (marking task as killed)!
>
> But such processes do exist:
>
> ps aux | grep restart
> root 25050 0.0 0.1 8968 1036 ? Ss 13:20 0:00 /bin/bash /var/spool/torque/mom_priv/blcr_restart_script 22850 80.fqsrv.cis.gov.pl peterw all /var/spool/torque/checkpoint/80.fqsrv.cis.gov.pl.CK ckpt.80.fqsrv.cis.gov.pl.1361967507
> root 25219 0.5 0.3 50384 3604 ? S 13:20 0:00 sudo -u peterw cr_restart ckpt.80.fqsrv.cis.gov.pl.1361967507
> peterw 25250 0.0 0.0 18356 472 ? Sl 13:20 0:00 cr_restart ckpt.80.fqsrv.cis.gov.pl.1361967507
>
> Does anyone have an idea how to make this work, or where to look
> for the source of the problem?
> Thanks in advance.
>
> Regards,
> Peter.
Hi.
One more observation: Torque built with BLCR only also works great (all
processes for jobs are created and killed correctly).
The problems occur only when both options (--enable-cpuset and
--enable-blcr) are enabled.
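To check on a node whether the restarted processes ever made it into the job's cpuset, something like the following rough sketch could be used. It assumes the cpuset filesystem is mounted at /dev/cpuset and that Torque keeps per-job cpusets under /dev/cpuset/torque/<jobid>; both paths are assumptions and may differ on other installations. The helper name missing_from_cpuset is made up for illustration.

```shell
# missing_from_cpuset TASKS_FILE PID...
# Prints every PID given on the command line that is NOT listed in
# TASKS_FILE (the cpuset "tasks" file, one PID per line).
# If TASKS_FILE does not exist, every PID is reported as missing.
missing_from_cpuset() {
    tasks_file=$1
    shift
    for pid in "$@"; do
        # -x: whole-line match, -F: fixed string, -q: no output
        if ! grep -qxF -- "$pid" "$tasks_file" 2>/dev/null; then
            echo "$pid"
        fi
    done
}

# Example (session ID and job ID taken from the log above; the
# /dev/cpuset/torque/<jobid> location is an assumption):
#   pids=$(ps -e -o pid= -o sid= | awk '$2 == 25050 { print $1 }')
#   missing_from_cpuset /dev/cpuset/torque/80.fqsrv.cis.gov.pl/tasks $pids
```

If every PID of the restart session is printed, the processes are running outside the job's cpuset (or the cpuset directory was never created), which would match the behaviour described above.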
Here is my configure line for Torque:
./configure --with-default-server=qsrv.cis.gov.pl --with-rcp=/usr/bin/scp --enable-cpuset --enable-nvidia-gpus --enable-blcr
Then I built the RPMs (the configure.ac and torque.spec files had some
mistakes, so I patched them) and installed them on our server and nodes.
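On the nodes, affected jobs can be spotted in the pbs_mom log by the "no process was found with this session ID" message quoted above. A minimal sketch, assuming the log entries use the semicolon-separated format shown above with one entry per line (not wrapped as in this mail). The function name lost_sessions is hypothetical, and the log directory in the usage comment is an assumption based on the spool path used in this post.

```shell
# lost_sessions
# Reads pbs_mom log lines on stdin and prints "<jobid> <session id>" for
# every entry where kill_task could not find a process for the session.
lost_sessions() {
    awk -F';' '
        /no process was found with this session ID/ {
            # Field 5 is the job ID; the session ID follows "session ".
            if (match($0, /session [0-9]+/))
                print $5, substr($0, RSTART + 8, RLENGTH - 8)
        }
    '
}

# Example (log location is an assumption):
#   cat /var/spool/torque/mom_logs/* | lost_sessions
```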
Regards,
Peter.