[torqueusers] some parallel jobs do not clean up

Alessandro Federico alessandro.federico at caspur.it
Mon Jan 18 04:59:44 MST 2010


Hi,

We are using Torque 2.3.6 and Moab 5.3.5.

Sometimes pbs_mom fails to clean up the processes created by some parallel
jobs. It always happens with the same kind of jobs from the same users. The
executables run by those jobs are compiled with Open MPI 1.2.5 (which is built
with Task Manager support, --with-tm). Please note that most of the
executables running on our cluster are compiled with that version of Open MPI,
and they are correctly killed by the MOMs when the job they belong to is
removed or terminated.

The tracejob output from the Torque server, the master MOM and a typical
sister MOM is as follows:

====================== SERVER ===========================
$ tracejob -q -n 4 312345

Job: 312345.master.cvos.cluster

01/16/2010 17:01:45  S    enqueuing into route, state 1 hop 1
01/16/2010 17:01:45  S    dequeuing from route, state QUEUED
01/16/2010 17:01:45  S    enqueuing into mpp_small, state 1 hop 1
01/16/2010 17:01:45  S    Job Queued at request of
girichid at matrix2.cvos.cluster, owner = girichid at matrix2.cvos.cluster, job
name
                          = PL15-s-1, queue = mpp_small
01/16/2010 17:01:45  A    queue=route
01/16/2010 17:01:45  A    queue=mpp_small
01/16/2010 22:00:41  S    Job Modified at request of
root at master.cvos.cluster
01/16/2010 22:00:41  S    Job Run at request of root at master.cvos.cluster
01/16/2010 22:00:41  S    Job Modified at request of
root at master.cvos.cluster
01/16/2010 22:00:41  A    user=girichid group=ajr account=cmp09-849
jobname=PL15-s-1 queue=mpp_small ctime=1263657705
                          qtime=1263657705 etime=1263657705 start=1263675641
owner=girichid at matrix2.cvos.cluster

 exec_host=neo075/7+neo075/6+neo075/5+neo075/4+neo075/3+neo075/2+neo075/1+neo075/0+neo152/7+neo152/6+neo152/5+neo152/4+neo152/3+neo152/2+neo152/1+neo152/0+neo155/7+neo155/6+neo155/5+neo155/4+neo155/3+neo155/2+neo155/1+neo155/0+neo066/7+neo066/6+neo066/5+neo066/4+neo066/3+neo066/2+neo066/1+neo066/0+neo125/7+neo125/6+neo125/5+neo125/4+neo125/3+neo125/2+neo125/1+neo125/0+neo127/7+neo127/6+neo127/5+neo127/4+neo127/3+neo127/2+neo127/1+neo127/0+neo180/7+neo180/6+neo180/5+neo180/4+neo180/3+neo180/2+neo180/1+neo180/0+neo158/7+neo158/6+neo158/5+neo158/4+neo158/3+neo158/2+neo158/1+neo158/0

 Resource_List.neednodes=neo075:ppn=8+neo152:ppn=8+neo155:ppn=8+neo066:ppn=8+neo125:ppn=8+neo127:ppn=8+neo180:ppn=8+neo158:ppn=8
                          Resource_List.nodect=8 Resource_List.nodes=8:ppn=8
Resource_List.walltime=24:00:00
01/17/2010 09:55:06  S    Unauthorized Request, request type: 6, Object:
Job, Name: 312345.master.cvos.cluster, request from:
                          montefer at matrix2.cvos.cluster
01/17/2010 22:10:42  S    Job deleted at request of root at master.cvos.cluster
01/17/2010 22:10:42  S    Job sent signal SIGTERM on delete
01/17/2010 22:10:42  S    purging job without checking MOM
01/17/2010 22:10:42  S    dequeuing from mpp_small, state RUNNING
01/17/2010 22:10:42  A    requestor=root at master.cvos.cluster


====================== MOM (mother) ===========================
$ tracejob -q -n 4 312345

Job: 312345.master.cvos.cluster

01/16/2010 22:00:41  M    job 312345.master.cvos.cluster reported successful
start on 8 node(s)
01/16/2010 22:00:41  M    modifying job
01/16/2010 22:00:41  M    Job Modified at request of
PBS_Server at master.cvos.cluster
01/16/2010 22:00:41  M    all sisters have reported in, launching job
locally
01/16/2010 22:00:41  M    phase 2 of job launch successfully completed
01/16/2010 22:00:41  M    job successfully started
01/16/2010 22:00:42  M    start_process: task started, tid 2, sid 9640, cmd
orted
01/17/2010 00:00:43  M    received request 'ALL_OKAY' for job
312345.master.cvos.cluster from 10.141.0.152:15003
01/17/2010 00:00:43  M    received request 'ALL_OKAY' for job
312345.master.cvos.cluster from 10.141.0.155:15003
01/17/2010 00:00:43  M    received request 'ALL_OKAY' for job
312345.master.cvos.cluster from 10.141.0.125:15003
01/17/2010 00:00:43  M    received request 'ALL_OKAY' for job
312345.master.cvos.cluster from 10.141.0.127:15003
01/17/2010 00:00:43  M    received request 'ALL_OKAY' for job
312345.master.cvos.cluster from 10.141.0.66:15003
01/17/2010 00:00:43  M    received request 'ALL_OKAY' for job
312345.master.cvos.cluster from 10.141.0.180:15003
01/17/2010 00:00:43  M    received request 'ALL_OKAY' for job
312345.master.cvos.cluster from 10.141.0.158:15003
01/18/2010 09:44:26  M    no active process found
01/18/2010 09:44:26  M    no active process found
01/18/2010 09:44:26  M    job was terminated
01/18/2010 09:44:26  M    master task has exited - sent kill job request to
7 sisters
01/18/2010 09:44:26  M    task is dead
01/18/2010 09:44:26  M    task is dead
01/18/2010 09:44:26  M    job is in non-exiting substate RUNNING, no obit
sent at this time
01/18/2010 09:44:26  M    received request 'ERROR' for job
312345.master.cvos.cluster from 10.141.0.152:15003
01/18/2010 09:44:26  M    received request 'ERROR' for job
312345.master.cvos.cluster from 10.141.0.155:15003
01/18/2010 09:44:26  M    received request 'ERROR' for job
312345.master.cvos.cluster from 10.141.0.66:15003
01/18/2010 09:44:26  M    received request 'ERROR' for job
312345.master.cvos.cluster from 10.141.0.125:15003
01/18/2010 09:44:26  M    received request 'ERROR' for job
312345.master.cvos.cluster from 10.141.0.127:15003
01/18/2010 09:44:26  M    received request 'ERROR' for job
312345.master.cvos.cluster from 10.141.0.180:15003
01/18/2010 09:44:26  M    received request 'ERROR' for job
312345.master.cvos.cluster from 10.141.0.158:15003
01/18/2010 09:44:26  M    sending preobit jobstat
01/18/2010 09:44:26  M    deleting job
01/18/2010 09:44:26  M    deleting job 312345.master.cvos.cluster in state
PREOBIT


====================== MOM (sisters) ===========================
$ tracejob -q -n 4 312345

Job: 312345.master.cvos.cluster

01/16/2010 22:00:41  M    received request 'JOIN_JOB' for job
312345.master.cvos.cluster from 10.141.0.75:1023
01/16/2010 22:00:41  M    im_request: JOIN_JOB 312345.master.cvos.cluster
node 2
01/16/2010 22:00:41  M    JOIN JOB as node 2
01/16/2010 22:00:42  M    received request 'SPAWN_TASK' for job
312345.master.cvos.cluster from 10.141.0.75:1023
01/16/2010 22:00:42  M    INFO:     received request 'SPAWN_TASK' from
10.141.0.75:1023 for job '312345.master.cvos.cluster' (spawning task on node
'0' with taskid=4, globid='none'
01/16/2010 22:00:42  M    start_process: task started, tid 4, sid 16531, cmd
orted
01/17/2010 00:00:43  M    received request 'POLL_JOB' for job
312345.master.cvos.cluster from 10.141.0.75:1023
01/17/2010 22:11:03  M    deleting job
01/17/2010 22:11:03  M    deleting job 312345.master.cvos.cluster in state
RUNNING
01/18/2010 09:44:26  M    received request 'KILL_JOB' for job
312345.master.cvos.cluster from 10.141.0.75:1023
01/18/2010 09:44:26  M    ERROR:    received request 'KILL_JOB' from
10.141.0.75:1023 for job '312345.master.cvos.cluster' (job does not exist
locally)
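
The long lines in the excerpts above appear wrapped by the mail client. In
case it helps when reading them, here is a minimal sketch of a filter that
puts each record back on one line and orders records by time, so that, for
example, the time at which a sister deletes the job can be compared with the
time at which the master MOM sends KILL_JOB. The helper names join_tracejob
and sort_tracejob are made up for this example and are not part of Torque;
the sketch assumes a POSIX awk and a sort that supports -s (stable sort).

```shell
# Sketch only: join_tracejob and sort_tracejob are hypothetical helper names,
# not Torque commands.

# Put each tracejob record back on one line. A record starts with
# "MM/DD/YYYY HH:MM:SS"; any other non-empty line is taken as a
# continuation of the previous record. Lines before the first record
# (shell prompt, "Job:" header) are dropped.
join_tracejob() {
    awk '
        /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
            if (rec != "") print rec
            rec = $0
            next
        }
        rec != "" && NF > 0 {
            sub(/^[ \t]+/, "")      # strip the indentation of wrapped lines
            rec = rec " " $0
        }
        END { if (rec != "") print rec }
    '
}

# Order joined records chronologically. A sortable key (YYYYMMDD followed by
# HH:MM:SS) is prepended, used for a stable sort, and then cut off again.
sort_tracejob() {
    awk '{ split($1, d, "/"); printf "%s%s%s%s\t%s\n", d[3], d[1], d[2], $2, $0 }' |
        LC_ALL=C sort -s -k1,1 | cut -f2-
}

# Possible usage on a node (assumes tracejob is in PATH):
#   tracejob -q -n 4 312345 | join_tracejob | sort_tracejob
```

Records that share the same second keep their original relative order because
of the stable sort.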


Can anybody help us resolve this issue, please?

Regards,
Ale

-- 
All work and no play makes Jack a dull boy.
   All work and no play makes Jack a dull
 boy. All work and no play makes Jack...