[torqueusers] Some jobs not running after upgrade
Wayne Mallett
wayne.mallett at jcu.edu.au
Mon Jan 26 17:30:50 MST 2009
G'day all,
I recently upgraded torque and maui. All looked good in testing so I
upgraded the production environment. After the upgrade, I've noticed
some jobs won't run - I get (from qsub):
qsub: waiting for job 92414.pbs.cluster to start
qsub: job 92414.pbs.cluster ready
qsub: job 92414.pbs.cluster completed
There is no noticeable delay between ready and completed. A tracejob
produces:
Job: 92414.pbs.cluster
01/27/2009 10:22:23 S enqueuing into feeder, state 1 hop 1
01/27/2009 10:22:23 S dequeuing from feeder, state QUEUED
01/27/2009 10:22:23 S enqueuing into infinity, state 1 hop 1
01/27/2009 10:22:23 S Job Queued at request of
sci-wam at cluster.cluster, owner = sci-wam at cluster.cluster, job name =
STDIN, queue = infinity
01/27/2009 10:22:23 S Job Run at request of sci-wam at cluster.cluster
01/27/2009 10:22:23 A queue=feeder
01/27/2009 10:22:23 A queue=infinity
01/27/2009 10:22:28 S child reported success for job after 5 seconds
(dest=node72), rc=0
01/27/2009 10:22:28 A user=sci-wam group=sci-wam jobname=STDIN
queue=infinity ctime=1233015743 qtime=1233015743
etime=1233015743 start=1233015748
owner=sci-wam at cluster.cluster exec_host=node72/0
Resource_List.cput=9999:00:00
Resource_List.ncpus=1 Resource_List.neednodes=batch
Resource_List.nice=5 Resource_List.nodect=1
Resource_List.walltime=9999:00:00
01/27/2009 10:22:29 S obit received - updating final job usage info
01/27/2009 10:22:29 S sending 'a' mail for job 92414.pbs.cluster to
sci-wam at cluster.cluster (Job cannot be executed
01/27/2009 10:22:29 S job exit status -1 handled
01/27/2009 10:22:29 S Exit_status=-1 resources_used.cput=00:00:00
resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:06
Error_Path=/dev/pts/0 Output_Path=/dev/pts/0
01/27/2009 10:22:29 S on_job_exit task assigned to job
01/27/2009 10:22:29 S req_jobobit completed
01/27/2009 10:22:29 S JOB_SUBSTATE_EXITING
01/27/2009 10:22:29 S JOB_SUBSTATE_STAGEOUT
01/27/2009 10:22:29 S no files to copy - deleting job
01/27/2009 10:22:29 S JOB_SUBSTATE_STAGEDEL
01/27/2009 10:22:29 S JOB_SUBSTATE_EXITED
01/27/2009 10:22:29 S JOB_SUBSTATE_COMPLETE
01/27/2009 10:22:29 S dequeuing from infinity, state COMPLETE
01/27/2009 10:22:29 A user=sci-wam group=sci-wam jobname=STDIN
queue=infinity ctime=1233015743 qtime=1233015743
etime=1233015743 start=1233015748
owner=sci-wam at cluster.cluster exec_host=node72/0
Resource_List.cput=9999:00:00
Resource_List.ncpus=1 Resource_List.neednodes=batch
Resource_List.nice=5 Resource_List.nodect=1
Resource_List.walltime=9999:00:00 session=0
end=1233015749 Exit_status=-1
resources_used.cput=00:00:00 resources_used.mem=0kb
resources_used.vmem=0kb
resources_used.walltime=00:00:06 Error_Path=/dev/pts/0
Output_Path=/dev/pts/0
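For anyone comparing several failed jobs, here is a rough Python sketch that pulls the interesting fields out of the final accounting ("A") record above. The record string is a hand-joined subset of the wrapped lines (the owner field is left out because the archive rewrites the address with spaces), and parse_record is just a name made up for this example, not part of any torque tool.

```python
import re

# Final accounting ("A") record from the tracejob output, re-joined onto
# one line by hand. Only a subset of the fields is kept here.
record = (
    "user=sci-wam group=sci-wam jobname=STDIN queue=infinity "
    "ctime=1233015743 qtime=1233015743 etime=1233015743 start=1233015748 "
    "exec_host=node72/0 session=0 end=1233015749 Exit_status=-1 "
    "resources_used.walltime=00:00:06"
)


def parse_record(text):
    """Split a whitespace-separated key=value record into a dict of strings."""
    return dict(re.findall(r"(\S+?)=(\S+)", text))


fields = parse_record(record)

# Seconds between being queued and being started, and between start and end.
queued_for = int(fields["start"]) - int(fields["qtime"])
ran_for = int(fields["end"]) - int(fields["start"])

print("exec_host  :", fields["exec_host"])
print("session    :", fields["session"])
print("Exit_status:", fields["Exit_status"])
print("queued for :", queued_for, "s")
print("ran for    :", ran_for, "s")
```

For this job it gives 5 seconds between qtime and start, 1 second between start and end, session=0 and Exit_status=-1, which matches the "Job cannot be executed" mail in the server log.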
If anyone has any ideas, they would be greatly appreciated. I've seen a
similar message in the archives, where there is talk about 2 MOMs. I
don't see this on my systems, however.
Thanks,
Wayne
--
Dr. Wayne Mallett
High Performance & Research Computing Support
Phone: 0747815084
Email: Wayne.Mallett at jcu.edu.au
Smail: James Cook University
Townsville Qld 4811
Australia