[torqueusers] Job stuck in limbo causing logs to fill
Jonas_Berlin at harte-hanks.com
Mon Feb 13 08:02:30 MST 2006
I am running a job on two machines, etlpoc4 and etlpoc3.
When I run the job as a user that exists in LDAP, it first fails to execute
and then gets stuck when it fails to clean up. When run as a local user, the
job runs fine.
The state of the job switches between running and queued.
This is the state of the job:
Job Id: 66.etlpoc4
Job_Name = dummy_sort.4035
Job_Owner = jberlin at etlpoc4
job_state = R
queue = batch
server = etlpoc4
Checkpoint = u
ctime = Fri Feb 10 16:36:00 2006
Error_Path = etlpoc4:/sandbox/jberlin/scratch/run/dummy_sort.4035.e66
exec_host = etlpoc3/0+etlpoc4/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Feb 13 09:44:37 2006
Output_Path = etlpoc4:/sandbox/jberlin/scratch/run/dummy_sort.4035.o66
Priority = 0
qtime = Fri Feb 10 16:36:00 2006
Rerunable = True
Resource_List.neednodes = 2
Resource_List.nodect = 2
Resource_List.nodes = 2
Resource_List.walltime = 01:00:00
Shell_Path_List = /bin/ksh
substate = 40
Variable_List = PBS_O_HOME=/home/jberlin,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=jberlin,
PBS_O_PATH=/prod/software/bin:/usr/local/bin:/opt/syncsort/bin:/opt/SUNWspro/bin:/tools/bin:/bin:/usr/bin:/usr/ucb:/usr/ccs/bin:/etc:/usr/etc:/usr/bin/X11:/bin:.:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/u01/app/oracle/product/10.1.0.3:/u01/app/oracle/product/10.1.0.3/bin:/u01/app/oracle/product/10.1.0.3/lib,
PBS_O_MAIL=/var/spool/mail/jberlin,PBS_O_SHELL=/bin/ksh,
PBS_O_HOST=etlpoc4,PBS_O_WORKDIR=/sandbox/jberlin/scratch/run,
PBS_O_QUEUE=batch
euser = jberlin
egroup = 107
hashname = 66.etlpoc4
queue_rank = 38
queue_type = E
comment = Job started on Mon Feb 13 at 09:44
etime = Fri Feb 10 16:36:00 2006
exit_status = -3
The server is stuck at:
02/13/2006 09:51:45;0080;PBS_Server;Req;req_reject;Reject reply
code=15016(Request invalid for state of job), aux=0, type=JobObituary,
from pbs_mom at etlpoc3
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusQueue request received
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type SelStat request received
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type ResourceQuery request
received from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type RunJob request received from
Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0008;PBS_Server;Job;66.etlpoc4;Job Run at request of
Scheduler at etlpoc4
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command
recyc
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received
from pbs_mom at etlpoc3, sock=10
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command new
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusServer request
received from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusNode request received
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received
from pbs_mom at etlpoc3, sock=10
02/13/2006 09:51:45;0009;PBS_Server;Job;66.etlpoc4;obit received for job
66.etlpoc4 from host etlpoc3 with bad state (state: QUEUED)
02/13/2006 09:51:45;0080;PBS_Server;Req;req_reject;Reject reply
code=15016(Request invalid for state of job), aux=0, type=JobObituary,
from pbs_mom at etlpoc3
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusQueue request received
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type SelStat request received
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type ResourceQuery request
received from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type RunJob request received from
Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0008;PBS_Server;Job;66.etlpoc4;Job Run at request of
Scheduler at etlpoc4
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command
recyc
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received
from pbs_mom at etlpoc3, sock=10
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command new
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusServer request
received from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusNode request received
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received
from pbs_mom at etlpoc3, sock=10
02/13/2006 09:51:45;0009;PBS_Server;Job;66.etlpoc4;obit received for job
66.etlpoc4 from host etlpoc3 with bad state (state: QUEUED)
02/13/2006 09:51:45;0080;PBS_Server;Req;req_reject;Reject reply
code=15016(Request invalid for state of job), aux=0, type=JobObituary,
from pbs_mom at etlpoc3
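To see which messages dominate the loop without reading the raw log, the records can be tallied. Below is a minimal Python sketch. It assumes each record sits on one unwrapped line with six semicolon-separated fields (timestamp, event code, daemon, class, object, message), as the excerpts above suggest. The function names are illustrative and not part of TORQUE.

```python
from collections import Counter


def parse_record(line):
    """Split one unwrapped log record into its six ';'-separated fields.

    Returns None for lines that do not have the assumed layout.
    """
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    stamp, code, daemon, klass, obj, message = (p.strip() for p in parts)
    return {
        "time": stamp,
        "code": code,
        "daemon": daemon,
        "class": klass,
        "object": obj,
        "message": message,
    }


def summarize(lines):
    """Count how often each (class, message) pair occurs."""
    counts = Counter()
    for line in lines:
        rec = parse_record(line)
        if rec is None:
            continue  # skip wrapped fragments or unrelated text
        counts[(rec["class"], rec["message"])] += 1
    return counts
```

Feeding it the lines of the server log and printing `summarize(lines).most_common(10)` should put the repeating RunJob, JobObituary and reject records at the top if the loop is as tight as it looks here.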
At the same time the mom_log on etlpoc3 keeps repeating:
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type Commit request received from
PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type StatusJob request received
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution
(15023) in 66.etlpoc4, job_start_error from node 172.21.148.216:15003 in
job_start_error
02/13/2006 09:48:44;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution
(15023) in 66.etlpoc4, abort attempted 16 times in job_start_error.
ignoring abort request from node 172.21.148.216:15003
02/13/2006 09:48:44;0008; pbs_mom;Req;send_sisters;sending ABORT to
sisters
02/13/2006 09:48:44;0001; pbs_mom;Req;obit reply;Job not found for obit
reply
02/13/2006 09:48:44;0001; pbs_mom;Job;66.etlpoc4;server rejected job
obit - unexpected job state
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type DeleteJob request received
from PBS_Server at etlpoc4, sock=13
02/13/2006 09:48:44;0080; pbs_mom;Req;req_reject;Reject reply
code=15001(Unknown Job Id REJHOST=etlpoc3 MSG=cannot locate job to
delete), aux=0, type=DeleteJob, from PBS_Server at etlpoc4
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type QueueJob request received
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type JobScript request received
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type ReadyToCommit request
received from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type Commit request received from
PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type StatusJob request received
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution
(15023) in 66.etlpoc4, job_start_error from node 172.21.148.216:15003 in
job_start_error
02/13/2006 09:48:44;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution
(15023) in 66.etlpoc4, abort attempted 16 times in job_start_error.
ignoring abort request from node 172.21.148.216:15003
02/13/2006 09:48:44;0008; pbs_mom;Req;send_sisters;sending ABORT to
sisters
02/13/2006 09:48:44;0001; pbs_mom;Req;obit reply;Job not found for obit
reply
02/13/2006 09:48:44;0001; pbs_mom;Job;66.etlpoc4;server rejected job
obit - unexpected job state
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type DeleteJob request received
from PBS_Server at etlpoc4, sock=13
02/13/2006 09:48:44;0080; pbs_mom;Req;req_reject;Reject reply
code=15001(Unknown Job Id REJHOST=etlpoc3 MSG=cannot locate job to
delete), aux=0, type=DeleteJob, from PBS_Server at etlpoc4
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type QueueJob request received
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type JobScript request received
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type ReadyToCommit request
received from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type Commit request received from
PBS_Server at etlpoc4, sock=10
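Since the mom log reports "Bad UID for job execution (15023)" and the problem only shows up for the LDAP account, one thing that could be checked is whether the job owner resolves on every execution node. The sketch below is only an assumption about where to look, not a confirmed cause. It uses Python's standard `pwd` module, which goes through the system's normal account lookup, so it should see LDAP users if the node is configured for them. The function name `lookup_user` is made up for illustration.

```python
import pwd
import sys


def lookup_user(name):
    """Return (uid, gid, home) if the name resolves on this host, else None."""
    try:
        entry = pwd.getpwnam(name)
    except KeyError:
        # The account database on this host does not know the name.
        return None
    return entry.pw_uid, entry.pw_gid, entry.pw_dir


if __name__ == "__main__":
    # Default to the job owner from the qstat output above.
    user = sys.argv[1] if len(sys.argv) > 1 else "jberlin"
    result = lookup_user(user)
    if result is None:
        print("%s: not resolvable on this host" % user)
    else:
        print("%s: uid=%d gid=%d home=%s" % ((user,) + result))
```

Running it on both etlpoc3 and etlpoc4 and comparing the output would show whether the account is missing on one node or resolves to different uid/gid values (the job reports egroup = 107).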
Any ideas on how to diagnose this would be appreciated.
Thanks,
Jonas