[torqueusers] Some jobs not running after upgrade

Wayne Mallett wayne.mallett at jcu.edu.au
Mon Jan 26 17:30:50 MST 2009


G'day all,

I recently upgraded torque and maui.  Everything looked good in testing, so I 
upgraded the production environment.  Since the upgrade, I've noticed 
that some jobs won't run; from qsub I get:

qsub: waiting for job 92414.pbs.cluster to start
qsub: job 92414.pbs.cluster ready

qsub: job 92414.pbs.cluster completed
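
(For reference, that's an interactive submission with no job script, 
i.e. roughly a bare:

    qsub -I

which is also why the job name shows up as STDIN in the trace below.)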

There is no noticeable delay between "ready" and "completed".  A tracejob 
for this job produces:

Job: 92414.pbs.cluster

01/27/2009 10:22:23  S    enqueuing into feeder, state 1 hop 1
01/27/2009 10:22:23  S    dequeuing from feeder, state QUEUED
01/27/2009 10:22:23  S    enqueuing into infinity, state 1 hop 1
01/27/2009 10:22:23  S    Job Queued at request of sci-wam at cluster.cluster, owner = sci-wam at cluster.cluster, job name = STDIN, queue = infinity
01/27/2009 10:22:23  S    Job Run at request of sci-wam at cluster.cluster
01/27/2009 10:22:23  A    queue=feeder
01/27/2009 10:22:23  A    queue=infinity
01/27/2009 10:22:28  S    child reported success for job after 5 seconds (dest=node72), rc=0
01/27/2009 10:22:28  A    user=sci-wam group=sci-wam jobname=STDIN queue=infinity ctime=1233015743 qtime=1233015743 etime=1233015743 start=1233015748 owner=sci-wam at cluster.cluster exec_host=node72/0 Resource_List.cput=9999:00:00 Resource_List.ncpus=1 Resource_List.neednodes=batch Resource_List.nice=5 Resource_List.nodect=1 Resource_List.walltime=9999:00:00
01/27/2009 10:22:29  S    obit received - updating final job usage info
01/27/2009 10:22:29  S    sending 'a' mail for job 92414.pbs.cluster to sci-wam at cluster.cluster (Job cannot be executed
01/27/2009 10:22:29  S    job exit status -1 handled
01/27/2009 10:22:29  S    Exit_status=-1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:06 Error_Path=/dev/pts/0 Output_Path=/dev/pts/0
01/27/2009 10:22:29  S    on_job_exit task assigned to job
01/27/2009 10:22:29  S    req_jobobit completed
01/27/2009 10:22:29  S    JOB_SUBSTATE_EXITING
01/27/2009 10:22:29  S    JOB_SUBSTATE_STAGEOUT
01/27/2009 10:22:29  S    no files to copy - deleting job
01/27/2009 10:22:29  S    JOB_SUBSTATE_STAGEDEL
01/27/2009 10:22:29  S    JOB_SUBSTATE_EXITED
01/27/2009 10:22:29  S    JOB_SUBSTATE_COMPLETE
01/27/2009 10:22:29  S    dequeuing from infinity, state COMPLETE
01/27/2009 10:22:29  A    user=sci-wam group=sci-wam jobname=STDIN queue=infinity ctime=1233015743 qtime=1233015743 etime=1233015743 start=1233015748 owner=sci-wam at cluster.cluster exec_host=node72/0 Resource_List.cput=9999:00:00 Resource_List.ncpus=1 Resource_List.neednodes=batch Resource_List.nice=5 Resource_List.nodect=1 Resource_List.walltime=9999:00:00 session=0 end=1233015749 Exit_status=-1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:06 Error_Path=/dev/pts/0 Output_Path=/dev/pts/0
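
For anyone wanting more context: the server side only records 
Exit_status=-1 and "Job cannot be executed", so the execution-side 
reason should be in the MOM log on node72 for the same timestamps.  
Roughly (assuming a default install with PBS_HOME under /var/spool/torque):

    # on node72: look for the job in that day's MOM log
    grep 92414 /var/spool/torque/mom_logs/20090127

    # or query the MOM from the server
    momctl -d 3 -h node72

I can post whatever that turns up if it helps.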

If anyone has any ideas, they would be greatly appreciated.  I've seen a 
similar message in the archives that talks about 2 MOMs, but I don't see 
that on my systems.
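
(To be specific about how I looked for the 2-MOM situation, it was 
something along the lines of:

    # on each compute node, node72 in particular
    ps -ef | grep pbs_mom | grep -v grep

and only a single pbs_mom turns up on each node.)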

Thanks,
Wayne
-- 
Dr. Wayne Mallett
High Performance & Research Computing Support

Phone:	0747815084
Email:	Wayne.Mallett at jcu.edu.au
Smail:	James Cook University
	Townsville  Qld  4811
	Australia

