[torqueusers] Job stuck in limbo causing logs to fill
nathaniel.x.woody at gsk.com
Mon Feb 13 11:35:07 MST 2006
Well, I'll throw in my two cents, for what it's worth, since no one else has
chimed in...
It seems like your first problem is this:
02/13/2006 09:48:44;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution
(15023) in 66.etlpoc4, job_start_error from node 172.21.148.216:15003 in
job_start_error
02/13/2006 09:48:44;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution
(15023) in 66.etlpoc4, abort attempted 16 times in job_start_error.
ignoring abort request from node 172.21.148.216:15003
My worry is that the LDAP lookup is failing to map the user to a UID
correctly on at least one of the nodes. It looks like you have moms running
on etlpoc3 and etlpoc4, with the server running on etlpoc4? The mom log you
show is from etlpoc3; what does the mom log on etlpoc4 say?
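One quick way to test the mapping is to ask each node's name service for the user directly. The sketch below is only an illustration: the function name describe_user is made up for this example, and it assumes pbs_mom resolves users through the same system lookup that Python's pwd module uses, which may not hold if the daemon runs in a different environment than your shell.

```python
import grp
import pwd


def describe_user(username):
    """Return (uid, gid, group_name) if the name resolves on this host, else None."""
    try:
        entry = pwd.getpwnam(username)
    except KeyError:
        # The name service (local files, LDAP, ...) does not know this user here.
        return None
    try:
        group_name = grp.getgrgid(entry.pw_gid).gr_name
    except KeyError:
        # The primary gid exists in the passwd entry but has no group entry.
        group_name = None
    return entry.pw_uid, entry.pw_gid, group_name


# Example call; run it as root on both etlpoc3 and etlpoc4 and compare.
# print(describe_user("jberlin"))
```

If the call returns None on one node, or the uid/gid differ between the two nodes, that would fit the "Bad UID for job execution" message.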
I don't know why the job doesn't seem to clean itself up properly, but in
order to get the job to execute correctly, you'll need to get rid of the
above error first.
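To see whether the error has really gone away after a change, it can help to tally the distinct messages in the mom log rather than scroll through the repeats. This is a rough sketch, assuming each record looks like the excerpts in this thread (a timestamp followed by semicolon-separated fields, with the message last) and that wrapped lines belong to the previous record. The names join_records and tally_messages are invented for the example.

```python
import re
from collections import Counter

# A record starts with "MM/DD/YYYY HH:MM:SS;". Any other line is treated as a
# continuation of the previous record (assumption based on the wrapped excerpts).
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")


def join_records(lines):
    """Glue wrapped continuation lines back onto the record they belong to."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def tally_messages(lines):
    """Count how often each message text (the sixth field) appears."""
    counts = Counter()
    for record in join_records(lines):
        fields = record.split(";", 5)
        if len(fields) == 6:
            counts[fields[5]] += 1
    return counts
```

Calling tally_messages on an open log file and printing most_common(5) should show at a glance whether "Bad UID for job execution" is still the dominant entry.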
Nate
Jonas_Berlin at harte-hanks.com
Sent by: torqueusers-bounces at supercluster.org
13-Feb-2006 10:02
To: torqueusers at supercluster.org
cc:
Subject: [torqueusers] Job stuck in limbo causing logs to fill
I am running a job on two machines, etlpoc4 and etlpoc3.
When I run the job as a user that exists in ldap, it first fails to execute,
then gets stuck when it fails to clean up. When run as a local user, the
job runs fine.
The state of the job switches between running and queued.
This is the state of the job:
Job Id: 66.etlpoc4
Job_Name = dummy_sort.4035
Job_Owner = jberlin at etlpoc4
job_state = R
queue = batch
server = etlpoc4
Checkpoint = u
ctime = Fri Feb 10 16:36:00 2006
Error_Path = etlpoc4:/sandbox/jberlin/scratch/run/dummy_sort.4035.e66
exec_host = etlpoc3/0+etlpoc4/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Feb 13 09:44:37 2006
Output_Path = etlpoc4:/sandbox/jberlin/scratch/run/dummy_sort.4035.o66
Priority = 0
qtime = Fri Feb 10 16:36:00 2006
Rerunable = True
Resource_List.neednodes = 2
Resource_List.nodect = 2
Resource_List.nodes = 2
Resource_List.walltime = 01:00:00
Shell_Path_List = /bin/ksh
substate = 40
Variable_List = PBS_O_HOME=/home/jberlin,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=jberlin,
PBS_O_PATH=/prod/software/bin:/usr/local/bin:/opt/syncsort/bin:/opt/SU
NWspro/bin:/tools/bin:/bin:/usr/bin:/usr/ucb:/usr/ccs/bin:/etc:/usr/etc
:/usr/bin/X11:/bin:.:/usr/kerberos/bin:/usr/local/bi
n:/bin:/usr/bin:/usr/X11R6/bin:/u01/app/oracle/product/10.1.0.3:/u01/app/oracle/pr
oduct/10.1.0.3/bin:/u01/app/oracle/product/10.1.0.3/lib,
PBS_O_MAIL=/var/spool/mail/jberlin,PBS_O_SHELL=/bin/ksh,
PBS_O_HOST=etlpoc4,PBS_O_WORKDIR=/sandbox/jberlin/scratch/run,
PBS_O_QUEUE=batch
euser = jberlin
egroup = 107
hashname = 66.etlpoc4
queue_rank = 38
queue_type = E
comment = Job started on Mon Feb 13 at 09:44
etime = Fri Feb 10 16:36:00 2006
exit_status = -3
The server is stuck at:
02/13/2006 09:51:45;0080;PBS_Server;Req;req_reject;Reject reply
code=15016(Request invalid for state of job), aux=0, type=JobObituary,
from pbs_mom at etlpoc3
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusQueue request received
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type SelStat request received
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type ResourceQuery request
received from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type RunJob request received from
Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0008;PBS_Server;Job;66.etlpoc4;Job Run at request of
Scheduler at etlpoc4
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command
recyc
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received
from pbs_mom at etlpoc3, sock=10
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command new
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusServer request
received from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusNode request received
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received
from pbs_mom at etlpoc3, sock=10
02/13/2006 09:51:45;0009;PBS_Server;Job;66.etlpoc4;obit received for job
66.etlpoc4 from host etlpoc3 with bad state (state: QUEUED)
02/13/2006 09:51:45;0080;PBS_Server;Req;req_reject;Reject reply
code=15016(Request invalid for state of job), aux=0, type=JobObituary,
from pbs_mom at etlpoc3
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusQueue request received
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type SelStat request received
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type ResourceQuery request
received from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type RunJob request received from
Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0008;PBS_Server;Job;66.etlpoc4;Job Run at request of
Scheduler at etlpoc4
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command
recyc
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received
from pbs_mom at etlpoc3, sock=10
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command new
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusServer request
received from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusNode request received
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received
from pbs_mom at etlpoc3, sock=10
02/13/2006 09:51:45;0009;PBS_Server;Job;66.etlpoc4;obit received for job
66.etlpoc4 from host etlpoc3 with bad state (state: QUEUED)
02/13/2006 09:51:45;0080;PBS_Server;Req;req_reject;Reject reply
code=15016(Request invalid for state of job), aux=0, type=JobObituary,
from pbs_mom at etlpoc3
At the same time the mom_log on etlpoc3 keeps repeating:
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type Commit request received from
PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type StatusJob request received
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution
(15023) in 66.etlpoc4, job_start_error from node 172.21.148.216:15003 in
job_start_error
02/13/2006 09:48:44;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution
(15023) in 66.etlpoc4, abort attempted 16 times in job_start_error.
ignoring abort request from node 172.21.148.216:15003
02/13/2006 09:48:44;0008; pbs_mom;Req;send_sisters;sending ABORT to
sisters
02/13/2006 09:48:44;0001; pbs_mom;Req;obit reply;Job not found for obit
reply
02/13/2006 09:48:44;0001; pbs_mom;Job;66.etlpoc4;server rejected job
obit - unexpected job state
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type DeleteJob request received
from PBS_Server at etlpoc4, sock=13
02/13/2006 09:48:44;0080; pbs_mom;Req;req_reject;Reject reply
code=15001(Unknown Job Id REJHOST=etlpoc3 MSG=cannot locate job to
delete), aux=0, type=DeleteJob, from PBS_Server at etlpoc4
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type QueueJob request received
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type JobScript request received
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type ReadyToCommit request
received from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type Commit request received from
PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type StatusJob request received
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution
(15023) in 66.etlpoc4, job_start_error from node 172.21.148.216:15003 in
job_start_error
02/13/2006 09:48:44;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution
(15023) in 66.etlpoc4, abort attempted 16 times in job_start_error.
ignoring abort request from node 172.21.148.216:15003
02/13/2006 09:48:44;0008; pbs_mom;Req;send_sisters;sending ABORT to
sisters
02/13/2006 09:48:44;0001; pbs_mom;Req;obit reply;Job not found for obit
reply
02/13/2006 09:48:44;0001; pbs_mom;Job;66.etlpoc4;server rejected job
obit - unexpected job state
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type DeleteJob request received
from PBS_Server at etlpoc4, sock=13
02/13/2006 09:48:44;0080; pbs_mom;Req;req_reject;Reject reply
code=15001(Unknown Job Id REJHOST=etlpoc3 MSG=cannot locate job to
delete), aux=0, type=DeleteJob, from PBS_Server at etlpoc4
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type QueueJob request received
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type JobScript request received
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type ReadyToCommit request
received from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100; pbs_mom;Req;;Type Commit request received from
PBS_Server at etlpoc4, sock=10
Any ideas on how to diagnose this would be appreciated.
Thanks,
Jonas