[torqueusers] Job starting on two nodes only one node asked and seen in qstat

Fabien Archambault fabien.archambault at univ-provence.fr
Fri Oct 14 07:16:26 MDT 2011


  Dear list,

I have an issue on a cluster I administer. It does not always appear, 
but when it does, it affects many jobs.

The problem is that a job requesting a single node is dispatched to 
lame6 (in the example below), and qstat shows it only there, but the 
job is also started on lame5.

Here is a recent example with some output. If you need more, please 
do not hesitate to ask.

Maui version 3.3.1
# /usr/local/sbin/pbs_server -v
version: 2.5.7

The job is:
# qstat -f 61209
Job Id: 61209.master.up.univ-mrs.fr
     Job_Name = Ru-41-b3lypMKvdw
     Job_Owner = biosci@master.up.univ-mrs.fr
     resources_used.cput = 06:44:35
     resources_used.mem = 206392kb
     resources_used.vmem = 6091668kb
     resources_used.walltime = 01:43:10
     job_state = R
     queue = journey
     server = master.up.univ-mrs.fr
     Checkpoint = u
     ctime = Fri Oct 14 13:20:46 2011
     Error_Path = master.up.univ-mrs.fr:/home/biosci/Michel/laccase/Ru-41-b3lyp
     MKvdw.e61209
     exec_host = lame6/3+lame6/2+lame6/1+lame6/0
     Hold_Types = n
     Join_Path = oe
     Keep_Files = n
     Mail_Points = a
     mtime = Fri Oct 14 13:22:08 2011
     Output_Path = master.up.univ-mrs.fr:/home/biosci/Michel/laccase/Ru-41-b3ly
     pMKvdw.o61209
     Priority = 0
     qtime = Fri Oct 14 13:20:46 2011
     Rerunable = True
     Resource_List.mem = 4gb
     Resource_List.neednodes = 1:ppn=4
     Resource_List.nice = 10
     Resource_List.nodect = 1
     Resource_List.nodes = 1:ppn=4
     Resource_List.other = disk=200gb
     Resource_List.walltime = 48:00:00
     session_id = 20049
     Shell_Path_List = /bin/sh
     substate = 42
     Variable_List = PBS_O_QUEUE=defaut,PBS_O_HOME=/home/biosci,
     PBS_O_LANG=fr_FR.UTF-8,PBS_O_LOGNAME=biosci,
     PBS_O_PATH=/home/biosci/COSMOlogic10/TURBOMOLE/bin/em64t-unknown-linu
     x-gnu:/home/biosci/COSMOlogic10/TURBOMOLE/scripts:/usr/local/maui/bin:
     /usr/local/Mercury_2.4/bin:/usr/local/Ampac-9.2/exec:/usr/local/Ampac-
     9.1/exec:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/biosci/b
     in,PBS_O_MAIL=/var/spool/mail/biosci,PBS_O_SHELL=/bin/bash,
     PBS_O_HOST=master.up.univ-mrs.fr,PBS_SERVER=master,
     PBS_O_WORKDIR=/home/biosci/Michel/laccase
     euser = biosci
     egroup = biosci
     hashname = 61209.master.up.univ-mrs.fr
     queue_rank = 7385
     queue_type = E
     etime = Fri Oct 14 13:20:46 2011
     submit_args = /home/biosci/Michel/laccase/Ru-41-b3lypMKvdw.job
     start_time = Fri Oct 14 13:21:17 2011
     Walltime.Remaining = 166545
     start_count = 2
     fault_tolerant = False
     submit_host = master.up.univ-mrs.fr
     init_work_dir = /home/biosci/Michel/laccase
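
As a minimal sketch (not part of the original report), the set of nodes 
a job is officially assigned to can be read from the exec_host line of 
`qstat -f`. The function name `exec_hosts` is hypothetical, and the 
sketch assumes the exec_host value fits on one line; qstat wraps long 
values, which is not handled here.

```python
import re


def exec_hosts(qstat_text):
    """Return the set of node names found in the exec_host attribute."""
    match = re.search(r"^\s*exec_host = (\S+)", qstat_text, re.MULTILINE)
    if match is None:
        return set()  # job not running, or attribute missing
    # Each entry looks like "node/slot"; entries are joined with "+".
    return {entry.split("/")[0] for entry in match.group(1).split("+")}


# Excerpt of the qstat -f output shown above.
sample = """\
     job_state = R
     exec_host = lame6/3+lame6/2+lame6/1+lame6/0
     Hold_Types = n
"""
print(exec_hosts(sample))  # {'lame6'}
```

For job 61209 this gives only lame6, which is what the server believes.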

# diagnose -j 61209
Name   State    Par  Proc  QOS  WCLimit     R  Min  User    Group   Account  QueuedTime  Network  Opsys   Arch    Mem  Disk  Procs  Class        Features

61209  Running  DEF  4     DEF  2:00:00:00  1  4    biosci  biosci  -        00:15:20    [NONE]   [NONE]  [NONE]  >=0  >=0   NC0    [journey:1]  [NONE]
WARNING:  job '61209' utilizes more procs than dedicated (3.92 > 1)

# pestat
...
   lame5  busy*   11*  24103   8  32104   1333  4/3    8      0    61208 biosci 61209 biosci 61210 fredbip7
   lame6  excl     8   24103   8  32104   1193  3/3    8      0    61209 biosci 61211 fredbip7
...
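
A rough sketch of how the pestat lines could be scanned for nodes that 
list a given job. The column layout is an assumption taken from the two 
sample lines above (ten status columns followed by "jobid user" pairs), 
not from pestat documentation, and `nodes_listing_job` is a hypothetical 
helper name.

```python
def nodes_listing_job(pestat_text, job_id):
    """Return node names whose pestat line lists job_id among its jobs."""
    nodes = []
    for line in pestat_text.splitlines():
        fields = line.split()
        # Assumed layout: 10 status columns, then "jobid user" pairs.
        if len(fields) <= 10:
            continue
        job_ids = fields[10::2]  # every second field is a job id
        if job_id in job_ids:
            nodes.append(fields[0])
    return nodes


# The two node lines from the pestat output above, unwrapped.
sample = """\
   lame5  busy*   11*  24103   8  32104   1333  4/3    8      0    61208 biosci 61209 biosci 61210 fredbip7
   lame6  excl     8   24103   8  32104   1193  3/3    8      0    61209 biosci 61211 fredbip7
"""
print(nodes_listing_job(sample, "61209"))  # ['lame5', 'lame6']
```

Comparing this list with the exec_host value from qstat (lame6 only) 
makes the mismatch explicit: lame5 reports the job but is not in 
exec_host.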


On the master:
# tracejob -q 61209

Job: 61209.master.up.univ-mrs.fr

10/14/2011 13:20:46  S    enqueuing into defaut, state 1 hop 1
10/14/2011 13:20:46  S    dequeuing from defaut, state QUEUED
10/14/2011 13:20:46  S    enqueuing into journey, state 1 hop 1
10/14/2011 13:20:46  S    Job Queued at request of biosci@master.up.univ-mrs.fr, owner = biosci@master.up.univ-mrs.fr, job name = Ru-41-b3lypMKvdw, queue = journey
10/14/2011 13:20:46  A    queue=defaut
10/14/2011 13:20:46  A    queue=journey
10/14/2011 13:20:57  S    Job Run at request of root@master.up.univ-mrs.fr
10/14/2011 13:21:07  S    unable to run job, MOM rejected/timeout
10/14/2011 13:21:17  S    Job Run at request of root@master.up.univ-mrs.fr
10/14/2011 13:21:22  A    user=biosci group=biosci jobname=Ru-41-b3lypMKvdw queue=journey ctime=1318591246 qtime=1318591246 etime=1318591246 start=1318591282 owner=biosci@master.up.univ-mrs.fr exec_host=lame6/3+lame6/2+lame6/1+lame6/0
                           Resource_List.mem=4gb Resource_List.neednodes=1:ppn=4 Resource_List.nice=10 Resource_List.nodect=1 Resource_List.nodes=1:ppn=4 Resource_List.other=disk=200gb Resource_List.walltime=48:00:00
10/14/2011 13:22:08  S    Not sending email: User does not want mail of this type.
10/14/2011 14:51:48  S    enqueuing into journey, state 4 hop 1
10/14/2011 14:51:48  S    Requeueing job, substate: 42 Requeued in queue: journey


On lame5:
# tracejob -q 61209

Job: 61209.master.up.univ-mrs.fr

10/14/2011 13:21:02  M    evaluating limits for job
10/14/2011 13:21:02  M    about to fork child which will become job
10/14/2011 13:21:02  M    phase 2 of job launch successfully completed
10/14/2011 13:21:07  M    job not ready after 5 second timeout, MOM will recheck
10/14/2011 13:21:08  M    checking job start in TMOMScanForStarting - examining pipe from child
10/14/2011 13:21:09  M    job 61209.master.up.univ-mrs.fr child not started, will check later
10/14/2011 13:21:40  M    task/session info loaded
10/14/2011 13:21:40  M    job 61209.master.up.univ-mrs.fr reported successful start

On lame6:
# tracejob -q 61209

Job: 61209.master.up.univ-mrs.fr

10/14/2011 13:21:17  M    evaluating limits for job
10/14/2011 13:21:17  M    about to fork child which will become job
10/14/2011 13:21:17  M    phase 2 of job launch successfully completed
10/14/2011 13:21:22  M    job not ready after 5 second timeout, MOM will recheck
10/14/2011 13:21:22  M    checking job start in TMOMScanForStarting - examining pipe from child
10/14/2011 13:21:23  M    job 61209.master.up.univ-mrs.fr child not started, will check later
10/14/2011 13:21:55  M    task/session info loaded
10/14/2011 13:21:55  M    job 61209.master.up.univ-mrs.fr reported successful start
10/14/2011 13:22:08  M    encoding "send flagged" attr: session_id
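
To read the three tracejob outputs side by side, a small sketch that 
merges a few of their entries into one chronological list. The 
timestamps are copied from the logs above; the messages are abbreviated 
by hand, and `merged_timeline` is a hypothetical helper, not a TORQUE 
tool.

```python
from datetime import datetime

# (source, timestamp, abbreviated message) copied from the tracejob outputs.
events = [
    ("master", "10/14/2011 13:20:57", "Job Run at request of root"),
    ("master", "10/14/2011 13:21:07", "unable to run job, MOM rejected/timeout"),
    ("master", "10/14/2011 13:21:17", "Job Run at request of root"),
    ("lame5", "10/14/2011 13:21:02", "about to fork child which will become job"),
    ("lame5", "10/14/2011 13:21:40", "reported successful start"),
    ("lame6", "10/14/2011 13:21:17", "about to fork child which will become job"),
    ("lame6", "10/14/2011 13:21:55", "reported successful start"),
]


def parse_stamp(stamp):
    """Parse the MM/DD/YYYY HH:MM:SS timestamps used by tracejob."""
    return datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")


def merged_timeline(entries):
    """Sort entries by time; ties keep their input order (stable sort)."""
    return sorted(entries, key=lambda entry: parse_stamp(entry[1]))


for source, stamp, message in merged_timeline(events):
    print(f"{stamp}  {source:<6}  {message}")
```

Read this way, the first run request (13:20:57) lines up with the launch 
on lame5 (13:21:02), the server logs "MOM rejected/timeout" at 13:21:07, 
and the second run request (13:21:17) lines up with the launch on lame6, 
which would be consistent with start_count = 2 in the qstat output. This 
is only a reading of the timestamps, not a confirmed diagnosis.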


Thanks for your help,
Fabien


