[torqueusers] Torque/Maui kills jobs running on the same node

Jerry Smith jdsmit at sandia.gov
Tue Feb 23 11:32:26 MST 2010


Evgeni,

IMHO, that would be it.  That pkill -U is going to kill every process 
owned by you, whether it comes from job a, b, or c.
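
If you want the cleanup to stay scoped to one job, a minimal sketch
(an illustration on my part, not our production script) is to use the
session id that the epilogue receives as argument 5 and kill only that
session on the node where the epilogue runs:

#!/bin/sh
# Sketch of a job-scoped cleanup.  $5 is the job's session id on this
# node, so pkill -s limits the kill to that session rather than to
# every process the user owns.  The session id is only meaningful on
# the node running the epilogue, not on nodes reached over ssh.
jobid=$1
userid=$2
sid=$5

pkill -s "$sid"              # kill only this job's session
rm -rf "/tmp/pbstmp.$jobid"  # remove this job's scratch directory
exit 0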

--Jerry

Evgeni Bezus wrote:
> Jerry,
>
> Here is our epilogue script:
>
> #!/bin/sh
> # epilogue gets 9 arguments:
> # 1 -- jobid
> # 2 -- userid
> # 3 -- grpid
> # 4 -- job name
> # 5 -- sessionid
> # 6 -- resource limits
> # 7 -- resources used
> # 8 -- queue
> # 9 -- account
> #
> jobid=$1
> userid=$2
>
> nodefile=/var/spool/pbs/aux/$jobid
> if [ -r $nodefile ] ; then
>     nodes=$(sort $nodefile | uniq)
> else
>     nodes=localhost
> fi
> tmp=/tmp/pbstmp.$jobid
> for i in $nodes ; do
>     ssh $i pkill -U $2
>     ssh $i rm -rf $tmp
> done
> exit 0
>
>
> I suppose the line "ssh $i pkill -U $2" kills all the jobs that
> belong to the user with userid $2, so this script is the cause of the
> problem?
>
> Regards,
> Evgeni
>
> 2010/2/20 Smith, Jerry Don II <jdsmit at sandia.gov>:
>   
>> Evgeni,
>>
>> I meant the script that pbs runs at the end of the job on each node:
>>
>> $PBS_HOME/mom_priv/epilogue.parallel.
>>
>> Many of us do process cleanup (making sure all of a user's processes are removed before scheduling the next job there) in the epilogue scripts.
>>
>> We usually run SINGLEJOB, but we have some rather large SMP nodes for which we had to adjust the epilogues to account for the fact that we share those nodes.
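>>
>> (For illustration only: the sharing itself is a one-line knob in maui.cfg.
>> If exclusive nodes are acceptable, something like
>>
>>     NODEACCESSPOLICY SINGLEJOB
>>
>> avoids the problem entirely; with SHARED, as in your config, the epilogue
>> has to avoid user-wide kills.)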
>>
>> If you can post your epilogue.parallel, we can see if this is what is happening.  And I would be happy to share our scripts if it would help.
>>
>>
>>
>> Jerry
>>
>>
>> ----- Original Message -----
>> From: Evgeni Bezus <evgeni.bezus at gmail.com>
>> To: Smith, Jerry Don II; jbernstein at penguincomputing.com <jbernstein at penguincomputing.com>; torqueusers at supercluster.org <torqueusers at supercluster.org>
>> Sent: Sat Feb 20 00:50:37 2010
>> Subject: Re: [torqueusers] Torque/Maui kills jobs running on the same node
>>
>> Josh,
>> It seems not: maui and pbs_server are in the process list, but no
>> pbs_sched.
>>
>> Jerry,
>>
>> Do you mean the cleanup in the job code? The simplest example consists of
>> the following two scripts:
>>
>> test.sh:
>>
>> #PBS -l nodes=1:ppn=1
>> #PBS -l walltime=10:00:00
>> sleep 30
>>
>> and
>>
>> test2.sh:
>>
>> #PBS -l nodes=1:ppn=1
>> #PBS -l walltime=10:00:00
>> sleep 5
>>
>>
>> I submitted the jobs almost simultaneously, and they were allocated to the
>> same node. Here are the tracejob results for these jobs:
>>
>> Job 1044 - first script (test.sh):
>> -------------------------------------------
>> Job: 1044.master.ssau.ru
>>
>> 02/18/2010 09:33:12  S    ready to commit job completed
>> 02/18/2010 09:33:12  S    committing job
>> 02/18/2010 09:33:12  A    queue=workq
>> 02/18/2010 09:33:12  S    ready to commit job
>> 02/18/2010 09:33:14  S    entering post_sendmom
>> 02/18/2010 09:33:14  A    user=bezus group=matlabusergroup
>> jobname=test.sh queue=workq ctime=1266474792 qtime=1266474792
>> etime=1266474792 start=1266474794 exec_host=n14.ssau.ru/0
>> Resource_List.cput=10000:00:00 Resource_List.ncpus=1
>> Resource_List.neednodes=n14.ssau.ru Resource_List.nodect=1
>> Resource_List.nodes=1:ppn=1 Resource_List.walltime=10:00:00
>> 02/18/2010 09:33:23  S    removed job script
>> 02/18/2010 09:33:23  A    user=bezus group=matlabusergroup
>> jobname=test.sh queue=workq ctime=1266474792 qtime=1266474792
>> etime=1266474792 start=1266474794 exec_host=n14.ssau.ru/0
>> Resource_List.cput=10000:00:00 Resource_List.ncpus=1
>> Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
>> Resource_List.nodes=1:ppn=1 Resource_List.walltime=10:00:00
>> session=19618 end=1266474803 Exit_status=271
>> resources_used.cput=00:00:00 resources_used.mem=2568kb
>> resources_used.vmem=25240kb resources_used.walltime=00:00:08
>> 02/18/2010 09:33:27  S    removed job file
>> -------------------------------------------
>>
>> Job 1045 - second script (test2.sh):
>> -------------------------------------------
>> Job: 1045.master.ssau.ru
>>
>> 02/18/2010 09:33:16  S    ready to commit job completed
>> 02/18/2010 09:33:16  S    committing job
>> 02/18/2010 09:33:16  A    queue=workq
>> 02/18/2010 09:33:16  S    ready to commit job
>> 02/18/2010 09:33:17  S    entering post_sendmom
>> 02/18/2010 09:33:17  A    user=bezus group=matlabusergroup
>> jobname=test2.sh queue=workq ctime=1266474796 qtime=1266474796
>> etime=1266474796 start=1266474797 exec_host=n14.ssau.ru/1
>> Resource_List.cput=10000:00:00 Resource_List.ncpus=1
>> Resource_List.neednodes=n14.ssau.ru Resource_List.nodect=1
>> Resource_List.nodes=1:ppn=1 Resource_List.walltime=10:00:00
>> 02/18/2010 09:33:23  S    removed job script
>> 02/18/2010 09:33:23  S    removed job file
>> 02/18/2010 09:33:23  A    user=bezus group=matlabusergroup
>> jobname=test2.sh queue=workq ctime=1266474796 qtime=1266474796
>> etime=1266474796 start=1266474797 exec_host=n14.ssau.ru/1
>> Resource_List.cput=10000:00:00 Resource_List.ncpus=1
>> Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
>> Resource_List.nodes=1:ppn=1 Resource_List.walltime=10:00:00
>> session=19719 end=1266474803 Exit_status=0
>> resources_used.cput=00:00:00 resources_used.mem=0kb
>> resources_used.vmem=0kb resources_used.walltime=00:00:05
>> -------------------------------------------
>>
>> According to the tracejob results, job 1044 was killed at the moment
>> job 1045 finished.
>>
>>
>> Regards,
>> Evgeni
>>
>>
>> 2010/2/19 Jerry Smith <jdsmit at sandia.gov>:
>>     
>>> Evgeni,
>>>
>>> Are you doing any process cleanup in the epilogue?  If so you may be killing
>>> all of that user's jobs when the first job exits.
>>>
>>> --Jerry
>>>
>>>
>>> Evgeni Bezus wrote:
>>>       
>>>> Hi all,
>>>>
>>>> We are running Maui and Torque on a 14-node cluster. Each node has 8 cores
>>>> (two 4-core processors). When running two (or more) jobs from a single
>>>> user on the same node, Maui (or Torque?) stops all of the jobs when one of
>>>> them finishes. The finished job has Exit_status=0; the killed jobs have
>>>> Exit_status=271. The value of the NODEACCESSPOLICY parameter in
>>>> maui.cfg is SHARED. This problem does not occur when running jobs from
>>>> a single user on different nodes or when running jobs from different
>>>> users on the same node.
>>>>
>>>> Does anyone know how to resolve the problem?
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>
>>>>
>>>>
>>>>         
>>>       
>>     
>
>
>   