[torqueusers] Job stuck in limbo causing logs to fill

Jonas_Berlin at harte-hanks.com
Mon Feb 13 08:02:30 MST 2006


I am running a job on two machines, etlpoc4 and etlpoc3.

When I run the job as a user that exists in LDAP, it first fails to execute 
and then gets stuck when it fails to clean up. When run as a local user, the 
job runs fine.
The state of the job switches back and forth between running and queued.
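
Since the failure only shows up for the LDAP account, one thing that may be
worth checking is whether each node can resolve that account at all. The
following is a minimal sketch, not part of TORQUE: describe_account is a
hypothetical helper that uses Python's standard pwd and grp modules to ask
the host's name service (which includes LDAP only if the host is configured
to consult it). It assumes the "Bad UID for job execution" message in the
mom_log further down stems from a failed account lookup, which is a guess
rather than a confirmed cause.

```python
import grp
import pwd


def describe_account(username):
    """Return what this host's name service reports for username,
    or None if the account cannot be resolved here."""
    try:
        entry = pwd.getpwnam(username)
    except KeyError:
        return None  # account unknown to this host
    try:
        group_name = grp.getgrgid(entry.pw_gid).gr_name
    except KeyError:
        group_name = None  # primary GID has no matching group entry
    return {
        "uid": entry.pw_uid,
        "gid": entry.pw_gid,
        "group": group_name,
        "home": entry.pw_dir,
        "shell": entry.pw_shell,
    }


if __name__ == "__main__":
    import sys

    # "jberlin" is the job owner from the qstat output; pass another name
    # as the first argument to check a different account.
    name = sys.argv[1] if len(sys.argv) > 1 else "jberlin"
    info = describe_account(name)
    if info is None:
        print(f"{name}: not resolvable on this host")
    else:
        print(f"{name}: {info}")
```

Running this on both etlpoc3 and etlpoc4 and comparing the results would
show whether the two nodes agree on the UID, GID, home directory and shell
for the account.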

This is the state of the job:

Job Id: 66.etlpoc4
    Job_Name = dummy_sort.4035
    Job_Owner = jberlin at etlpoc4
    job_state = R
    queue = batch
    server = etlpoc4
    Checkpoint = u
    ctime = Fri Feb 10 16:36:00 2006
    Error_Path = etlpoc4:/sandbox/jberlin/scratch/run/dummy_sort.4035.e66
    exec_host = etlpoc3/0+etlpoc4/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Feb 13 09:44:37 2006
    Output_Path = etlpoc4:/sandbox/jberlin/scratch/run/dummy_sort.4035.o66
    Priority = 0
    qtime = Fri Feb 10 16:36:00 2006
    Rerunable = True
    Resource_List.neednodes = 2
    Resource_List.nodect = 2
    Resource_List.nodes = 2
    Resource_List.walltime = 01:00:00
    Shell_Path_List = /bin/ksh
    substate = 40
    Variable_List = PBS_O_HOME=/home/jberlin,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=jberlin,
        PBS_O_PATH=/prod/software/bin:/usr/local/bin:/opt/syncsort/bin:
        /opt/SUNWspro/bin:/tools/bin:/bin:/usr/bin:/usr/ucb:/usr/ccs/bin:
        /etc:/usr/etc:/usr/bin/X11:/bin:.:/usr/kerberos/bin:/usr/local/bin:
        /bin:/usr/bin:/usr/X11R6/bin:/u01/app/oracle/product/10.1.0.3:
        /u01/app/oracle/product/10.1.0.3/bin:
        /u01/app/oracle/product/10.1.0.3/lib,
        PBS_O_MAIL=/var/spool/mail/jberlin,PBS_O_SHELL=/bin/ksh,
        PBS_O_HOST=etlpoc4,PBS_O_WORKDIR=/sandbox/jberlin/scratch/run,
        PBS_O_QUEUE=batch
    euser = jberlin
    egroup = 107
    hashname = 66.etlpoc4
    queue_rank = 38
    queue_type = E
    comment = Job started on Mon Feb 13 at 09:44
    etime = Fri Feb 10 16:36:00 2006
    exit_status = -3

The server log is stuck repeating the following:

02/13/2006 09:51:45;0080;PBS_Server;Req;req_reject;Reject reply 
code=15016(Request invalid for state of job), aux=0, type=JobObituary, 
from pbs_mom at etlpoc3
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusQueue request received 
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type SelStat request received 
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type ResourceQuery request 
received from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type RunJob request received from 
Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0008;PBS_Server;Job;66.etlpoc4;Job Run at request of 
Scheduler at etlpoc4
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command 
recyc
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received 
from pbs_mom at etlpoc3, sock=10
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command new
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusServer request 
received from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusNode request received 
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received 
from pbs_mom at etlpoc3, sock=10
02/13/2006 09:51:45;0009;PBS_Server;Job;66.etlpoc4;obit received for job 
66.etlpoc4 from host etlpoc3 with bad state (state: QUEUED)
02/13/2006 09:51:45;0080;PBS_Server;Req;req_reject;Reject reply 
code=15016(Request invalid for state of job), aux=0, type=JobObituary, 
from pbs_mom at etlpoc3
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusQueue request received 
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type SelStat request received 
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type ResourceQuery request 
received from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type RunJob request received from 
Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0008;PBS_Server;Job;66.etlpoc4;Job Run at request of 
Scheduler at etlpoc4
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command 
recyc
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received 
from pbs_mom at etlpoc3, sock=10
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command new
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusServer request 
received from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusNode request received 
from Scheduler at etlpoc4, sock=13
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received 
from pbs_mom at etlpoc3, sock=10
02/13/2006 09:51:45;0009;PBS_Server;Job;66.etlpoc4;obit received for job 
66.etlpoc4 from host etlpoc3 with bad state (state: QUEUED)
02/13/2006 09:51:45;0080;PBS_Server;Req;req_reject;Reject reply 
code=15016(Request invalid for state of job), aux=0, type=JobObituary, 
from pbs_mom at etlpoc3

At the same time the mom_log on etlpoc3 keeps repeating:

02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type Commit request received from 
PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type StatusJob request received 
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0001;   pbs_mom;Svr;pbs_mom;Bad UID for job execution 
(15023) in 66.etlpoc4, job_start_error from node 172.21.148.216:15003 in 
job_start_error
02/13/2006 09:48:44;0001;   pbs_mom;Svr;pbs_mom;Bad UID for job execution 
(15023) in 66.etlpoc4, abort attempted 16 times in job_start_error. 
ignoring abort request from node 172.21.148.216:15003
02/13/2006 09:48:44;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
sisters
02/13/2006 09:48:44;0001;   pbs_mom;Req;obit reply;Job not found for obit 
reply
02/13/2006 09:48:44;0001;   pbs_mom;Job;66.etlpoc4;server rejected job 
obit - unexpected job state
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type DeleteJob request received 
from PBS_Server at etlpoc4, sock=13
02/13/2006 09:48:44;0080;   pbs_mom;Req;req_reject;Reject reply 
code=15001(Unknown Job Id REJHOST=etlpoc3 MSG=cannot locate job to 
delete), aux=0, type=DeleteJob, from PBS_Server at etlpoc4
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type QueueJob request received 
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type JobScript request received 
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type ReadyToCommit request 
received from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type Commit request received from 
PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type StatusJob request received 
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0001;   pbs_mom;Svr;pbs_mom;Bad UID for job execution 
(15023) in 66.etlpoc4, job_start_error from node 172.21.148.216:15003 in 
job_start_error
02/13/2006 09:48:44;0001;   pbs_mom;Svr;pbs_mom;Bad UID for job execution 
(15023) in 66.etlpoc4, abort attempted 16 times in job_start_error. 
ignoring abort request from node 172.21.148.216:15003
02/13/2006 09:48:44;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
sisters
02/13/2006 09:48:44;0001;   pbs_mom;Req;obit reply;Job not found for obit 
reply
02/13/2006 09:48:44;0001;   pbs_mom;Job;66.etlpoc4;server rejected job 
obit - unexpected job state
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type DeleteJob request received 
from PBS_Server at etlpoc4, sock=13
02/13/2006 09:48:44;0080;   pbs_mom;Req;req_reject;Reject reply 
code=15001(Unknown Job Id REJHOST=etlpoc3 MSG=cannot locate job to 
delete), aux=0, type=DeleteJob, from PBS_Server at etlpoc4
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type QueueJob request received 
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type JobScript request received 
from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type ReadyToCommit request 
received from PBS_Server at etlpoc4, sock=10
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type Commit request received from 
PBS_Server at etlpoc4, sock=10
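
Because both logs repeat the same cycle many times per second, it can help
to collapse them into message counts before reading further. The sketch
below is a hypothetical helper, not a TORQUE tool. It assumes the record
layout seen in the excerpts above (timestamp;code;daemon;class;object;message)
and that any line not starting with a timestamp is a wrapped continuation
of the previous record, as in the e-mail wrapping here.

```python
import re
import sys
from collections import Counter

# A record starts with a timestamp such as "02/13/2006 09:48:44;".
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")


def join_wrapped(lines):
    """Re-join records that were wrapped across several lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line):
            records.append(line)
        elif records and line.strip():
            # Continuation line: append to the previous record.
            records[-1] = records[-1].rstrip() + " " + line.strip()
    return records


def summarize(lines):
    """Count how often each (daemon, message) pair occurs."""
    counts = Counter()
    for record in join_wrapped(lines):
        fields = record.split(";", 5)
        if len(fields) < 6:
            continue  # not in the assumed six-field layout
        daemon, message = fields[2].strip(), fields[5].strip()
        counts[(daemon, message)] += 1
    return counts


if __name__ == "__main__":
    # Reads a log from standard input and prints the most frequent messages.
    for (daemon, message), n in summarize(sys.stdin).most_common(10):
        print(f"{n:6d}  {daemon}: {message}")
```

Feeding a log file to the script on standard input prints the ten most
frequent messages, which makes it easier to see which step of the
run/abort/obit cycle dominates.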


Any ideas on how to diagnose this would be appreciated.

Thanks,

Jonas