[torqueusers] Torque/Maui kills jobs running on the same node

Evgeni Bezus evgeni.bezus at gmail.com
Sat Feb 20 00:50:37 MST 2010


Josh,
It seems not: maui and pbs_server are both in the process list, but there
is no pbs_sched.
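
I checked with something along these lines (approximate, from memory):

ps -ef | egrep 'pbs_server|pbs_sched|maui' | grep -v grep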

Jerry,

Do you mean cleanup in the job code? The simplest example consisted of the
following two scripts:

test.sh:

#PBS -l nodes=1:ppn=1
#PBS -l walltime=10:00:00
sleep 30

and

test2.sh:

#PBS -l nodes=1:ppn=1
#PBS -l walltime=10:00:00
sleep 5
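
For completeness, the submission was plain qsub, roughly like this (the
exact commands were not saved):

qsub test.sh
qsub test2.sh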


I submitted the jobs almost simultaneously, and they were allocated to the
same node. Here are the tracejob results for these jobs:
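(Obtained with tracejob on the server; most likely just "tracejob 1044"
and "tracejob 1045", though the exact invocation was not noted.)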

Job 1044 - first script (test.sh):
-------------------------------------------
Job: 1044.master.ssau.ru

02/18/2010 09:33:12  S    ready to commit job completed
02/18/2010 09:33:12  S    committing job
02/18/2010 09:33:12  A    queue=workq
02/18/2010 09:33:12  S    ready to commit job
02/18/2010 09:33:14  S    entering post_sendmom
02/18/2010 09:33:14  A    user=bezus group=matlabusergroup
jobname=test.sh queue=workq ctime=1266474792 qtime=1266474792
etime=1266474792 start=1266474794 exec_host=n14.ssau.ru/0
Resource_List.cput=10000:00:00 Resource_List.ncpus=1
Resource_List.neednodes=n14.ssau.ru Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 Resource_List.walltime=10:00:00
02/18/2010 09:33:23  S    removed job script
02/18/2010 09:33:23  A    user=bezus group=matlabusergroup
jobname=test.sh queue=workq ctime=1266474792 qtime=1266474792
etime=1266474792 start=1266474794 exec_host=n14.ssau.ru/0
Resource_List.cput=10000:00:00 Resource_List.ncpus=1
Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 Resource_List.walltime=10:00:00
session=19618 end=1266474803 Exit_status=271
resources_used.cput=00:00:00 resources_used.mem=2568kb
resources_used.vmem=25240kb resources_used.walltime=00:00:08
02/18/2010 09:33:27  S    removed job file
-------------------------------------------

Job 1045 - second script (test2.sh):
-------------------------------------------
Job: 1045.master.ssau.ru

02/18/2010 09:33:16  S    ready to commit job completed
02/18/2010 09:33:16  S    committing job
02/18/2010 09:33:16  A    queue=workq
02/18/2010 09:33:16  S    ready to commit job
02/18/2010 09:33:17  S    entering post_sendmom
02/18/2010 09:33:17  A    user=bezus group=matlabusergroup
jobname=test2.sh queue=workq ctime=1266474796 qtime=1266474796
etime=1266474796 start=1266474797 exec_host=n14.ssau.ru/1
Resource_List.cput=10000:00:00 Resource_List.ncpus=1
Resource_List.neednodes=n14.ssau.ru Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 Resource_List.walltime=10:00:00
02/18/2010 09:33:23  S    removed job script
02/18/2010 09:33:23  S    removed job file
02/18/2010 09:33:23  A    user=bezus group=matlabusergroup
jobname=test2.sh queue=workq ctime=1266474796 qtime=1266474796
etime=1266474796 start=1266474797 exec_host=n14.ssau.ru/1
Resource_List.cput=10000:00:00 Resource_List.ncpus=1
Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 Resource_List.walltime=10:00:00
session=19719 end=1266474803 Exit_status=0
resources_used.cput=00:00:00 resources_used.mem=0kb
resources_used.vmem=0kb resources_used.walltime=00:00:05
-------------------------------------------

According to the tracejob results, job 1044 was killed at the moment job
1045 finished: both have end=1266474803, job 1045 with Exit_status=0 and
job 1044 with Exit_status=271. An exit status above 256 means the job was
killed by a signal (271 - 256 = 15, i.e. SIGTERM).
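
As for the epilogue question: if by cleanup you mean an epilogue script on
the nodes, my understanding is that the problematic pattern would look
something like the hypothetical sketch below (not taken from our cluster;
argument positions per my reading of the Torque prologue/epilogue
documentation). Killing by user name would take down the user's other jobs
on the same node, which is exactly the symptom:

#!/bin/sh
# Hypothetical epilogue sketch, for illustration only.
# Torque passes (among other things) the job owner as $2 and the
# job's session id as $5 to the epilogue.
JOBUSER="$2"
SESSION="$5"

# A cleanup like this would kill every process owned by the user,
# including processes belonging to the user's OTHER jobs that are
# still running on this node:
#pkill -u "$JOBUSER"

# A safer cleanup limits itself to the finished job's own session:
[ -n "$SESSION" ] && pkill -s "$SESSION"

(pkill -s matches by session id, so only processes from the finished
job's session are touched.)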


-Regards,
Evgeni


2010/2/19 Jerry Smith <jdsmit at sandia.gov>:
> Evgeni,
>
> Are you doing any process cleanup in the epilogue? If so, you may be
> killing all of that user's jobs when the first job exits.
>
> --Jerry
>
>
> Evgeni Bezus wrote:
>>
>> Hi all,
>>
>> We are running Maui and Torque on a 14-node cluster. Each node has 8 cores
>> (two 4-core processors). When two or more jobs from a single user run on
>> the same node, Maui (or Torque?) stops all of the jobs as soon as one of
>> them finishes. The finished job has Exit_status=0; the killed jobs have
>> Exit_status=271. The NODEACCESSPOLICY parameter in maui.cfg is set to
>> SHARED. The problem does not occur when a single user's jobs run on
>> different nodes, or when jobs from different users run on the same node.
>>
>> Does anyone know how to resolve the problem?

