[torqueusers] Torque 4.1.4 with CPUSETS and BLCR = problems.

Johny johny2015 at wp.pl
Mon Mar 4 03:15:03 MST 2013


On 2013-02-27 13:41, Johny wrote:
> Hello,
>
> We use Torque 4.1.4 on our cluster, compiled with CPUSETS support, and it
> works great. Now I want to add BLCR to it, so I compiled Torque with BLCR
> and ran some tests. It seems that Torque does not work with both of these
> options enabled.
>
> Here are my results:
> * Running a checkpointable job works OK.
> * Taking a checkpoint of a running job also works without problems.
> * Torque has problems creating the CPUSETS attached to a job when
> restarting from a checkpoint. I dug into the code and found that the
> function "TMomFinalizeJob1" returns "FAILURE" (when there is a checkpoint
> file attached to the job), which prevents the next stages
> (TMomFinalizeJob2, and so on) from running; those stages are responsible
> for creating the cpusets for that job. This has two consequences:
> ** A couple of minutes after the job restart, Torque thinks that the
> restarted job has no processes (it looks for PIDs associated with the
> cpusets, which are empty) and deletes that job from the list.
> ** Torque is unable to kill the job's PIDs because there is no cpuset
> directory associated with that job, so for Torque these jobs have no
> processes to kill.
>
> e.g: 02/27/2013 10:53:00;0008;
> pbs_mom.25054;Job;80.fqsrv.cis.gov.pl;kill_job found a task to kill
> 02/27/2013 10:53:23;0008;
> pbs_mom.25054;Job;80.fqsrv.cis.gov.pl;kill_task: sending signal 15 to
> task 0, session 25050
> 02/27/2013 10:55:46;0008;
> pbs_mom.25054;Job;80.fqsrv.cis.gov.pl;kill_task: could not send signal
> 15 to task 0 (session 25050 )--no process was found with this session ID
> (marking task as killed)!
>
> But such processes exist:
>
> ps aux | grep restart
> root     25050  0.0  0.1   8968  1036 ?        Ss   13:20   0:00
> /bin/bash /var/spool/torque/mom_priv/blcr_restart_script 22850
> 80.fqsrv.cis.gov.pl peterw all
> /var/spool/torque/checkpoint/80.fqsrv.cis.gov.pl.CK
> ckpt.80.fqsrv.cis.gov.pl.1361967507
> root     25219  0.5  0.3  50384  3604 ?        S    13:20   0:00 sudo -u
> peterw cr_restart ckpt.80.fqsrv.cis.gov.pl.1361967507
> peterw  25250  0.0  0.0  18356   472 ?        Sl   13:20   0:00
> cr_restart ckpt.80.fqsrv.cis.gov.pl.1361967507
>
> Does anyone have an idea how to make it work, or where to look for the
> source of the problem?
> Thanks in advance.
>
> Regards,
> Peter.
Hi.

Another fact is that Torque built only with BLCR also works great (all 
processes for jobs are created and killed correctly).
The problems occur only when both options (--enable-cpuset and 
--enable-blcr) are enabled.
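
One way to cross-check the symptom described in the quoted message (kill_task finds no process for session 25050 while ps still shows the restart processes) is to compare what /proc reports for a session ID with what the job's cpuset lists. The following is only a minimal diagnostic sketch, not anything taken from Torque itself: the helper names are hypothetical, it assumes a Linux /proc filesystem, and the location of the cpuset "tasks" file depends on where the cpuset filesystem is mounted on the node, so the path must be supplied by the caller.

```python
import os


def parse_session(stat_text):
    """Return the session ID from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')'.
    rest = stat_text.rsplit(")", 1)[1].split()
    # Fields after the command name: state, ppid, pgrp, session, ...
    return int(rest[3])


def pids_in_session(sid, proc_root="/proc"):
    """List the PIDs whose session ID equals sid (Linux /proc assumed)."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "stat")) as f:
                if parse_session(f.read()) == sid:
                    pids.append(int(name))
        except (OSError, IndexError, ValueError):
            # The process exited while we were scanning, or the entry
            # could not be parsed; skip it.
            continue
    return sorted(pids)


def cpuset_tasks(tasks_file):
    """Return the PIDs listed in a cpuset 'tasks' file, or None if the
    file (and hence presumably the cpuset directory) does not exist."""
    try:
        with open(tasks_file) as f:
            return [int(tok) for tok in f.read().split()]
    except FileNotFoundError:
        return None
```

For example, pids_in_session(25050) run on the node would list whatever processes still belong to the session from the log above, and cpuset_tasks() pointed at the job's "tasks" file would return None if the cpuset directory was never created. If the first is non-empty while the second is None or empty, that would be consistent with the behaviour reported above.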

Here is my configuration line for torque:

./configure --with-default-server=qsrv.cis.gov.pl --with-rcp=/usr/bin/scp --enable-cpuset --enable-nvidia-gpus --enable-blcr

Then I built the RPMs (the configure.ac and torque.spec files had some 
mistakes, so I patched them) and installed them on our server and nodes.

Regards,
Peter.