[torqueusers] Job starting on two nodes only one node asked and seen in qstat
Fabien Archambault
fabien.archambault at univ-provence.fr
Fri Oct 14 07:16:26 MDT 2011
Dear list,
I have an issue on a cluster I administer. It does not always occur,
but when it does, it affects many jobs.
The problem is that a job is submitted and, according to qstat, runs
only on lame6 (in the example below), yet the job is also started on
lame5.
Here is a recent example with some output. If you need more, please
do not hesitate to ask.
Maui version 3.3.1
# /usr/local/sbin/pbs_server -v
version: 2.5.7
The job is:
# qstat -f 61209
Job Id: 61209.master.up.univ-mrs.fr
Job_Name = Ru-41-b3lypMKvdw
Job_Owner = biosci at master.up.univ-mrs.fr
resources_used.cput = 06:44:35
resources_used.mem = 206392kb
resources_used.vmem = 6091668kb
resources_used.walltime = 01:43:10
job_state = R
queue = journey
server = master.up.univ-mrs.fr
Checkpoint = u
ctime = Fri Oct 14 13:20:46 2011
Error_Path = master.up.univ-mrs.fr:/home/biosci/Michel/laccase/Ru-41-b3lypMKvdw.e61209
exec_host = lame6/3+lame6/2+lame6/1+lame6/0
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Fri Oct 14 13:22:08 2011
Output_Path = master.up.univ-mrs.fr:/home/biosci/Michel/laccase/Ru-41-b3lypMKvdw.o61209
Priority = 0
qtime = Fri Oct 14 13:20:46 2011
Rerunable = True
Resource_List.mem = 4gb
Resource_List.neednodes = 1:ppn=4
Resource_List.nice = 10
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=4
Resource_List.other = disk=200gb
Resource_List.walltime = 48:00:00
session_id = 20049
Shell_Path_List = /bin/sh
substate = 42
Variable_List = PBS_O_QUEUE=defaut,PBS_O_HOME=/home/biosci,
PBS_O_LANG=fr_FR.UTF-8,PBS_O_LOGNAME=biosci,
PBS_O_PATH=/home/biosci/COSMOlogic10/TURBOMOLE/bin/em64t-unknown-linux-gnu:/home/biosci/COSMOlogic10/TURBOMOLE/scripts:/usr/local/maui/bin:/usr/local/Mercury_2.4/bin:/usr/local/Ampac-9.2/exec:/usr/local/Ampac-9.1/exec:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/biosci/bin,
PBS_O_MAIL=/var/spool/mail/biosci,PBS_O_SHELL=/bin/bash,
PBS_O_HOST=master.up.univ-mrs.fr,PBS_SERVER=master,
PBS_O_WORKDIR=/home/biosci/Michel/laccase
euser = biosci
egroup = biosci
hashname = 61209.master.up.univ-mrs.fr
queue_rank = 7385
queue_type = E
etime = Fri Oct 14 13:20:46 2011
submit_args = /home/biosci/Michel/laccase/Ru-41-b3lypMKvdw.job
start_time = Fri Oct 14 13:21:17 2011
Walltime.Remaining = 166545
start_count = 2
fault_tolerant = False
submit_host = master.up.univ-mrs.fr
init_work_dir = /home/biosci/Michel/laccase
# diagnose -j 61209
Name   State    Par  Proc  QOS  WCLimit     R  Min  User
61209  Running  DEF  4     DEF  2:00:00:00  1  4    biosci

Group   Account  QueuedTime  Network  Opsys   Arch    Mem  Disk
biosci  -        00:15:20    [NONE]   [NONE]  [NONE]  >=0  >=0

Procs  Class        Features
NC0    [journey:1]  [NONE]
WARNING: job '61209' utilizes more procs than dedicated (3.92 > 1)
# pestat
...
lame5  busy*  11*  24103  8  32104  1333  4/3  8  0  61208 biosci 61209 biosci 61210 fredbip7
lame6  excl   8    24103  8  32104  1193  3/3  8  0  61209 biosci 61211 fredbip7
...
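The mismatch between exec_host and the pestat listing can be checked mechanically. The following is a minimal Python sketch, not part of the original report, that compares the nodes named in exec_host with the nodes on which the job ID is actually observed. The helper names (hosts_from_exec_host, unexpected_hosts) are hypothetical, and the observed-node mapping is hard-coded from the pestat excerpt above rather than collected live.

```python
def hosts_from_exec_host(exec_host):
    """Return the set of node names in a TORQUE exec_host string.

    exec_host is a '+'-separated list of 'node/slot' entries,
    e.g. 'lame6/3+lame6/2+lame6/1+lame6/0'.
    """
    return {entry.split("/", 1)[0] for entry in exec_host.split("+") if entry}


def unexpected_hosts(exec_host, observed):
    """Return nodes where the job was observed but that exec_host does not list."""
    return sorted(set(observed) - hosts_from_exec_host(exec_host))


# Values copied from the qstat and pestat output for job 61209.
exec_host = "lame6/3+lame6/2+lame6/1+lame6/0"
jobs_seen_on_node = {
    "lame5": ["61208", "61209", "61210"],
    "lame6": ["61209", "61211"],
}
observed = [node for node, jobs in jobs_seen_on_node.items() if "61209" in jobs]

print(unexpected_hosts(exec_host, observed))  # prints ['lame5']
```

In a real check the mapping would have to be built from live node data; how to gather it is left open here.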
On the master:
# tracejob -q 61209
Job: 61209.master.up.univ-mrs.fr
10/14/2011 13:20:46  S    enqueuing into defaut, state 1 hop 1
10/14/2011 13:20:46  S    dequeuing from defaut, state QUEUED
10/14/2011 13:20:46  S    enqueuing into journey, state 1 hop 1
10/14/2011 13:20:46  S    Job Queued at request of
                          biosci at master.up.univ-mrs.fr, owner =
                          biosci at master.up.univ-mrs.fr, job name =
                          Ru-41-b3lypMKvdw, queue = journey
10/14/2011 13:20:46  A    queue=defaut
10/14/2011 13:20:46  A    queue=journey
10/14/2011 13:20:57  S    Job Run at request of root at master.up.univ-mrs.fr
10/14/2011 13:21:07  S    unable to run job, MOM rejected/timeout
10/14/2011 13:21:17  S    Job Run at request of root at master.up.univ-mrs.fr
10/14/2011 13:21:22  A    user=biosci group=biosci
                          jobname=Ru-41-b3lypMKvdw queue=journey
                          ctime=1318591246 qtime=1318591246
                          etime=1318591246 start=1318591282
                          owner=biosci at master.up.univ-mrs.fr
                          exec_host=lame6/3+lame6/2+lame6/1+lame6/0
                          Resource_List.mem=4gb
                          Resource_List.neednodes=1:ppn=4
                          Resource_List.nice=10
                          Resource_List.nodect=1
                          Resource_List.nodes=1:ppn=4
                          Resource_List.other=disk=200gb
                          Resource_List.walltime=48:00:00
10/14/2011 13:22:08  S    Not sending email: User does not want mail of
                          this type.
10/14/2011 14:51:48  S    enqueuing into journey, state 4 hop 1
10/14/2011 14:51:48  S    Requeueing job, substate: 42 Requeued in
                          queue: journey
On lame5:
# tracejob -q 61209
Job: 61209.master.up.univ-mrs.fr
10/14/2011 13:21:02  M    evaluating limits for job
10/14/2011 13:21:02  M    about to fork child which will become job
10/14/2011 13:21:02  M    phase 2 of job launch successfully completed
10/14/2011 13:21:07  M    job not ready after 5 second timeout, MOM will
                          recheck
10/14/2011 13:21:08  M    checking job start in TMOMScanForStarting -
                          examining pipe from child
10/14/2011 13:21:09  M    job 61209.master.up.univ-mrs.fr child not
                          started, will check later
10/14/2011 13:21:40  M    task/session info loaded
10/14/2011 13:21:40  M    job 61209.master.up.univ-mrs.fr reported
                          successful start
On lame6:
# tracejob -q 61209
Job: 61209.master.up.univ-mrs.fr
10/14/2011 13:21:17  M    evaluating limits for job
10/14/2011 13:21:17  M    about to fork child which will become job
10/14/2011 13:21:17  M    phase 2 of job launch successfully completed
10/14/2011 13:21:22  M    job not ready after 5 second timeout, MOM will
                          recheck
10/14/2011 13:21:22  M    checking job start in TMOMScanForStarting -
                          examining pipe from child
10/14/2011 13:21:23  M    job 61209.master.up.univ-mrs.fr child not
                          started, will check later
10/14/2011 13:21:55  M    task/session info loaded
10/14/2011 13:21:55  M    job 61209.master.up.univ-mrs.fr reported
                          successful start
10/14/2011 13:22:08  M    encoding "send flagged" attr: session_id
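The three tracejob outputs are easier to compare once merged into a single timeline. The sketch below, written in Python and not part of the original report, sorts a hand-picked subset of the events above by timestamp. The function name merged_timeline is hypothetical, and the event list is copied from the logs with the messages shortened. It assumes the clocks on the master, lame5 and lame6 are synchronized, which the logs do not confirm.

```python
from datetime import datetime

# (source, timestamp, message) copied from the tracejob outputs above.
EVENTS = [
    ("master", "10/14/2011 13:20:57", "Job Run at request of root"),
    ("master", "10/14/2011 13:21:07", "unable to run job, MOM rejected/timeout"),
    ("master", "10/14/2011 13:21:17", "Job Run at request of root"),
    ("lame5", "10/14/2011 13:21:02", "about to fork child which will become job"),
    ("lame5", "10/14/2011 13:21:40", "reported successful start"),
    ("lame6", "10/14/2011 13:21:17", "about to fork child which will become job"),
    ("lame6", "10/14/2011 13:21:55", "reported successful start"),
]


def merged_timeline(events):
    """Parse the timestamps and return the events in chronological order."""
    parsed = [
        (datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"), source, message)
        for source, stamp, message in events
    ]
    # sorted() is stable, so events sharing a timestamp keep their input order.
    return sorted(parsed, key=lambda event: event[0])


timeline = merged_timeline(EVENTS)
for when, source, message in timeline:
    print(f"{when:%H:%M:%S}  {source:<6}  {message}")
```

In the merged order, lame5 forks the job at 13:21:02, the server logs a MOM rejection/timeout at 13:21:07 and runs the job again at 13:21:17, and lame5 still reports a successful start at 13:21:40. Whether the first run attempt explains the duplicate start is an assumption; the logs alone do not establish it.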
Thanks for your help,
Fabien