[torqueusers] Torque/Maui kills jobs running on the same node

Coyle, James J [ITACD] jjc at iastate.edu
Tue Feb 23 11:27:25 MST 2010


Evgeni,

   Only issue the kill command on nodes that were fully dedicated to the job.

Instead of the line:

    nodes=$(sort $nodefile | uniq)

try 

  nodes=$(awk -f /var/spool/torque/mom_priv/dedicated_nodes.awk $nodefile)

where the awk script above for a node with ppn=4 is:

{a[$1]++;if (a[$1] == 4) {print}}
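
For clarity, here is a commented version of that filter (a sketch only; the
ppn value of 4 and the script path are the ones above):

# /var/spool/torque/mom_priv/dedicated_nodes.awk
# The nodefile contains one line per allocated slot, so a node name that
# appears 4 times (ppn=4) is fully dedicated to this job.  Print each such
# node exactly once, when its count first reaches 4; nodes the job only
# partially occupies never reach 4 and are never printed, so their
# processes are left alone.
{
    count[$1]++
    if (count[$1] == 4)
        print $1
}

If a different ppn is ever used, the hard-coded 4 could be replaced by a
variable passed on the command line, e.g. awk -v ppn=8 -f dedicated_nodes.awk
and compare against ppn instead.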

 - Jim C.

 James Coyle, PhD
 High Performance Computing Group     
 115 Durham Center            
 Iowa State Univ.          
 Ames, Iowa 50011           web: http://www.public.iastate.edu/~jjc

   
-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Evgeni Bezus
Sent: Tuesday, February 23, 2010 12:13 PM
To: Smith, Jerry Don II; torqueusers at supercluster.org
Subject: Re: [torqueusers] Torque/Maui kills jobs running on the same node

Jerry,

Here is our epilogue script:

#!/bin/sh
# epilogue gets 9 arguments:
# 1 -- jobid
# 2 -- userid
# 3 -- group id
# 4 -- job name
# 5 -- sessionid
# 6 -- resource limits
# 7 -- resources used
# 8 -- queue
# 9 -- account
#
jobid=$1
userid=$2

nodefile=/var/spool/pbs/aux/$jobid
if [ -r $nodefile ] ; then
    nodes=$(sort $nodefile | uniq)
else
    nodes=localhost
fi
tmp=/tmp/pbstmp.$jobid
for i in $nodes ; do
    ssh $i pkill -U $2
    ssh $i rm -rf $tmp
done
exit 0


I suppose the line "ssh $i pkill -U $2" kills all the processes that
belong to the user with userid $2, so this script is the cause of the
problem?
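
For illustration, the sequence on n14 looks roughly like this (a sketch
based on the tracejob output quoted below; the user name and node are taken
from that output):

# Job 1044 (sleep 30) and job 1045 (sleep 5), both owned by bezus, are
# running on n14.  Job 1045 exits first and its epilogue runs:
ssh n14.ssau.ru pkill -U bezus    # matches every process owned by bezus,
                                  # including job 1044's sleep 30
# pbs_mom then records job 1045 with Exit_status=0 and job 1044 with
# Exit_status=271 (256 + 15, i.e. killed by SIGTERM).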

Regards,
Evgeni

2010/2/20 Smith, Jerry Don II <jdsmit at sandia.gov>:
> Evgeni,
>
> I meant the script that pbs runs at the end of the job on each node:
>
> $PBS_HOME/mom_priv/epilogue.parallel.
>
> Many of us do process cleanup (making sure all of a user's processes are removed before scheduling the next job there) in the epilogue scripts.
>
> We usually run SINGLEJOB, but I have some rather large SMP nodes for which we had to adjust the epilogues to take into account that those nodes are shared.
>
> If you can post your epilogue.parallel, we can see if this is what is happening.  And I would be happy to share our scripts if it would help.
>
>
>
> Jerry
>
>
> ----- Original Message -----
> From: Evgeni Bezus <evgeni.bezus at gmail.com>
> To: Smith, Jerry Don II; jbernstein at penguincomputing.com <jbernstein at penguincomputing.com>; torqueusers at supercluster.org <torqueusers at supercluster.org>
> Sent: Sat Feb 20 00:50:37 2010
> Subject: Re: [torqueusers] Torque/Maui kills jobs running on the same node
>
> Josh,
> It seems not: there are maui and pbs_server in the process list, but no
> pbs_sched.
>
> Jerry,
>
> Do you mean the cleanup in the job code? The simplest example consists of
> the following two scripts:
>
> test.sh:
>
> #PBS -l nodes=1:ppn=1
> #PBS -l walltime=10:00:00
> sleep 30
>
> and
>
> test2.sh:
>
> #PBS -l nodes=1:ppn=1
> #PBS -l walltime=10:00:00
> sleep 5
>
>
> I submitted the jobs almost simultaneously, and they were allocated to the
> same node. Here are the tracejob results for these jobs:
>
> Job 1044 - first script (test.sh):
> -------------------------------------------
> Job: 1044.master.ssau.ru
>
> 02/18/2010 09:33:12  S    ready to commit job completed
> 02/18/2010 09:33:12  S    committing job
> 02/18/2010 09:33:12  A    queue=workq
> 02/18/2010 09:33:12  S    ready to commit job
> 02/18/2010 09:33:14  S    entering post_sendmom
> 02/18/2010 09:33:14  A    user=bezus group=matlabusergroup
> jobname=test.sh queue=workq ctime=1266474792 qtime=1266474792
> etime=1266474792 start=1266474794 exec_host=n14.ssau.ru/0
> Resource_List.cput=10000:00:00 Resource_List.ncpus=1
> Resource_List.neednodes=n14.ssau.ru Resource_List.nodect=1
> Resource_List.nodes=1:ppn=1 Resource_List.walltime=10:00:00
> 02/18/2010 09:33:23  S    removed job script
> 02/18/2010 09:33:23  A    user=bezus group=matlabusergroup
> jobname=test.sh queue=workq ctime=1266474792 qtime=1266474792
> etime=1266474792 start=1266474794 exec_host=n14.ssau.ru/0
> Resource_List.cput=10000:00:00 Resource_List.ncpus=1
> Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
> Resource_List.nodes=1:ppn=1 Resource_List.walltime=10:00:00
> session=19618 end=1266474803 Exit_status=271
> resources_used.cput=00:00:00 resources_used.mem=2568kb
> resources_used.vmem=25240kb resources_used.walltime=00:00:08
> 02/18/2010 09:33:27  S    removed job file
> -------------------------------------------
>
> Job 1045 - second script (test2.sh):
> -------------------------------------------
> Job: 1045.master.ssau.ru
>
> 02/18/2010 09:33:16  S    ready to commit job completed
> 02/18/2010 09:33:16  S    committing job
> 02/18/2010 09:33:16  A    queue=workq
> 02/18/2010 09:33:16  S    ready to commit job
> 02/18/2010 09:33:17  S    entering post_sendmom
> 02/18/2010 09:33:17  A    user=bezus group=matlabusergroup
> jobname=test2.sh queue=workq ctime=1266474796 qtime=1266474796
> etime=1266474796 start=1266474797 exec_host=n14.ssau.ru/1
> Resource_List.cput=10000:00:00 Resource_List.ncpus=1
> Resource_List.neednodes=n14.ssau.ru Resource_List.nodect=1
> Resource_List.nodes=1:ppn=1 Resource_List.walltime=10:00:00
> 02/18/2010 09:33:23  S    removed job script
> 02/18/2010 09:33:23  S    removed job file
> 02/18/2010 09:33:23  A    user=bezus group=matlabusergroup
> jobname=test2.sh queue=workq ctime=1266474796 qtime=1266474796
> etime=1266474796 start=1266474797 exec_host=n14.ssau.ru/1
> Resource_List.cput=10000:00:00 Resource_List.ncpus=1
> Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
> Resource_List.nodes=1:ppn=1 Resource_List.walltime=10:00:00
> session=19719 end=1266474803 Exit_status=0
> resources_used.cput=00:00:00 resources_used.mem=0kb
> resources_used.vmem=0kb resources_used.walltime=00:00:05
> -------------------------------------------
>
> According to the tracejob results, job 1044 was killed at the moment
> job 1045 finished.
>
>
> -Regards,
> Evgeni
>
>
> 2010/2/19 Jerry Smith <jdsmit at sandia.gov>:
>> Evgeni,
>>
>> Are you doing any process cleanup in the epilogue?  If so, you may be killing
>> all of that user's jobs when the first job exits.
>>
>> --Jerry
>>
>>
>> Evgeni Bezus wrote:
>>>
>>> Hi all,
>>>
>>> We are running Maui and Torque on a 14-node cluster. Each node has 8 cores
>>> (two 4-core processors). When two (or more) jobs from a single user run on
>>> the same node, Maui (or Torque?) stops all of them as soon as one of them
>>> finishes. The finished job has Exit_status=0; the killed jobs have
>>> Exit_status=271. The value of the NODEACCESSPOLICY parameter in
>>> maui.cfg is SHARED. The problem does not occur when running jobs from
>>> a single user on different nodes or when running jobs from different
>>> users on the same node.
>>>
>>> Does anyone know how to resolve the problem?
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>>
>>>
>>
>>
>
>
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

