[torqueusers] Job stuck in limbo causing logs to fill

nathaniel.x.woody at gsk.com
Mon Feb 13 11:35:07 MST 2006


Well, I'll throw my two cents in, for what it's worth, since no one else has 
chimed in...

It seems like your first problem is this:
02/13/2006 09:48:44;0001;   pbs_mom;Svr;pbs_mom;Bad UID for job execution 
(15023) in 66.etlpoc4, job_start_error from node 172.21.148.216:15003 in 
job_start_error 
02/13/2006 09:48:44;0001;   pbs_mom;Svr;pbs_mom;Bad UID for job execution 
(15023) in 66.etlpoc4, abort attempted 16 times in job_start_error. 
ignoring abort request from node 172.21.148.216:15003 

I would worry that whatever LDAP is trying to do is failing to map the 
users correctly. It looks like you have moms running on etlpoc3 and 
etlpoc4, and the server is running on etlpoc4? The mom log you show is 
from etlpoc3; what does the mom log on etlpoc4 say?

I don't know why the job doesn't seem to clean itself up properly, but in 
order to get the job to execute correctly, you'll need to get rid of the 
above error first.
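
One quick way to check the user mapping is to run the same lookup on both 
etlpoc3 and etlpoc4 and compare the results. The sketch below is only an 
illustration: it assumes pbs_mom resolves the job owner through the normal 
system passwd lookup (which is where LDAP would normally plug in), and the 
helper name describe_user is made up for this example. If the lookup fails 
on one node, or the UID/GID differ between nodes, that would fit the "Bad 
UID for job execution" message.

```python
import grp
import pwd
import socket


def describe_user(name):
    """Return (uid, gid, primary group name) for name, or None if the lookup fails."""
    try:
        entry = pwd.getpwnam(name)  # goes through the system name service
    except KeyError:
        return None  # this host cannot resolve the user at all
    try:
        group = grp.getgrgid(entry.pw_gid).gr_name
    except KeyError:
        group = None  # the GID has no matching group entry on this host
    return entry.pw_uid, entry.pw_gid, group


if __name__ == "__main__":
    # "jberlin" is the job owner from the qstat output; run this on each node.
    user = "jberlin"
    print(socket.gethostname(), user, describe_user(user))
```

Run it as root as well as a normal account on each node, since the mom 
normally runs as root and may see a different lookup result.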

Nate





Jonas_Berlin at harte-hanks.com 
Sent by: torqueusers-bounces at supercluster.org
13-Feb-2006 10:02
 
To
torqueusers at supercluster.org
cc

Subject
[torqueusers] Job stuck in limbo causing logs to fill







I am running a job on two machines etlpoc4 and etlpoc3. 

When I run a job as a user that exists in ldap it first fails to execute, 
then gets stuck when it fails to clean up. When run as a local user the 
job runs fine. 
The state of the job switches between running and queued. 

This is the state of the job: 

Job Id: 66.etlpoc4 
    Job_Name = dummy_sort.4035 
    Job_Owner = jberlin at etlpoc4 
    job_state = R 
    queue = batch 
    server = etlpoc4 
    Checkpoint = u 
    ctime = Fri Feb 10 16:36:00 2006 
    Error_Path = etlpoc4:/sandbox/jberlin/scratch/run/dummy_sort.4035.e66 
    exec_host = etlpoc3/0+etlpoc4/0 
    Hold_Types = n 
    Join_Path = n 
    Keep_Files = n 
    Mail_Points = a 
    mtime = Mon Feb 13 09:44:37 2006 
    Output_Path = etlpoc4:/sandbox/jberlin/scratch/run/dummy_sort.4035.o66 

    Priority = 0 
    qtime = Fri Feb 10 16:36:00 2006 
    Rerunable = True 
    Resource_List.neednodes = 2 
    Resource_List.nodect = 2 
    Resource_List.nodes = 2 
    Resource_List.walltime = 01:00:00 
    Shell_Path_List = /bin/ksh 
    substate = 40 
    Variable_List = PBS_O_HOME=/home/jberlin,PBS_O_LANG=en_US.UTF-8, 
        PBS_O_LOGNAME=jberlin, 
        PBS_O_PATH=/prod/software/bin:/usr/local/bin:/opt/syncsort/bin:/opt/SUNWspro/bin:/tools/bin:/bin:/usr/bin:/usr/ucb:/usr/ccs/bin:/etc:/usr/etc:/usr/bin/X11:/bin:.:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/u01/app/oracle/product/10.1.0.3:/u01/app/oracle/product/10.1.0.3/bin:/u01/app/oracle/product/10.1.0.3/lib, 
        PBS_O_MAIL=/var/spool/mail/jberlin,PBS_O_SHELL=/bin/ksh, 
        PBS_O_HOST=etlpoc4,PBS_O_WORKDIR=/sandbox/jberlin/scratch/run, 
        PBS_O_QUEUE=batch 
    euser = jberlin 
    egroup = 107 
    hashname = 66.etlpoc4 
    queue_rank = 38 
    queue_type = E 
    comment = Job started on Mon Feb 13 at 09:44 
    etime = Fri Feb 10 16:36:00 2006 
    exit_status = -3 

The server is stuck at: 

02/13/2006 09:51:45;0080;PBS_Server;Req;req_reject;Reject reply 
code=15016(Request invalid for state of job), aux=0, type=JobObituary, 
from pbs_mom at etlpoc3 
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusQueue request received 
from Scheduler at etlpoc4, sock=13 
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type SelStat request received 
from Scheduler at etlpoc4, sock=13 
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type ResourceQuery request 
received from Scheduler at etlpoc4, sock=13 
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type RunJob request received from 
Scheduler at etlpoc4, sock=13 
02/13/2006 09:51:45;0008;PBS_Server;Job;66.etlpoc4;Job Run at request of 
Scheduler at etlpoc4 
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command 
recyc 
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received 
from pbs_mom at etlpoc3, sock=10 
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command new 

02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusServer request 
received from Scheduler at etlpoc4, sock=13 
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusNode request received 
from Scheduler at etlpoc4, sock=13 
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received 
from pbs_mom at etlpoc3, sock=10 
02/13/2006 09:51:45;0009;PBS_Server;Job;66.etlpoc4;obit received for job 
66.etlpoc4 from host etlpoc3 with bad state (state: QUEUED) 
02/13/2006 09:51:45;0080;PBS_Server;Req;req_reject;Reject reply 
code=15016(Request invalid for state of job), aux=0, type=JobObituary, 
from pbs_mom at etlpoc3 
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusQueue request received 
from Scheduler at etlpoc4, sock=13 
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type SelStat request received 
from Scheduler at etlpoc4, sock=13 
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type ResourceQuery request 
received from Scheduler at etlpoc4, sock=13 
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type RunJob request received from 
Scheduler at etlpoc4, sock=13 
02/13/2006 09:51:45;0008;PBS_Server;Job;66.etlpoc4;Job Run at request of 
Scheduler at etlpoc4 
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command 
recyc 
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received 
from pbs_mom at etlpoc3, sock=10 
02/13/2006 09:51:45;0040;PBS_Server;Svr;etlpoc4;Scheduler sent command new 

02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusServer request 
received from Scheduler at etlpoc4, sock=13 
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type StatusNode request received 
from Scheduler at etlpoc4, sock=13 
02/13/2006 09:51:45;0100;PBS_Server;Req;;Type JobObituary request received 
from pbs_mom at etlpoc3, sock=10 
02/13/2006 09:51:45;0009;PBS_Server;Job;66.etlpoc4;obit received for job 
66.etlpoc4 from host etlpoc3 with bad state (state: QUEUED) 
02/13/2006 09:51:45;0080;PBS_Server;Req;req_reject;Reject reply 
code=15016(Request invalid for state of job), aux=0, type=JobObituary, 
from pbs_mom at etlpoc3 

At the same time the mom_log on etlpoc3 keeps repeating: 

02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type Commit request received from 
PBS_Server at etlpoc4, sock=10 
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type StatusJob request received 
from PBS_Server at etlpoc4, sock=10 
02/13/2006 09:48:44;0001;   pbs_mom;Svr;pbs_mom;Bad UID for job execution 
(15023) in 66.etlpoc4, job_start_error from node 172.21.148.216:15003 in 
job_start_error 
02/13/2006 09:48:44;0001;   pbs_mom;Svr;pbs_mom;Bad UID for job execution 
(15023) in 66.etlpoc4, abort attempted 16 times in job_start_error. 
ignoring abort request from node 172.21.148.216:15003 
02/13/2006 09:48:44;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
sisters 
02/13/2006 09:48:44;0001;   pbs_mom;Req;obit reply;Job not found for obit 
reply 
02/13/2006 09:48:44;0001;   pbs_mom;Job;66.etlpoc4;server rejected job 
obit - unexpected job state 
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type DeleteJob request received 
from PBS_Server at etlpoc4, sock=13 
02/13/2006 09:48:44;0080;   pbs_mom;Req;req_reject;Reject reply 
code=15001(Unknown Job Id REJHOST=etlpoc3 MSG=cannot locate job to 
delete), aux=0, type=DeleteJob, from PBS_Server at etlpoc4 
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type QueueJob request received 
from PBS_Server at etlpoc4, sock=10 
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type JobScript request received 
from PBS_Server at etlpoc4, sock=10 
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type ReadyToCommit request 
received from PBS_Server at etlpoc4, sock=10 
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type Commit request received from 
PBS_Server at etlpoc4, sock=10 
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type StatusJob request received 
from PBS_Server at etlpoc4, sock=10 
02/13/2006 09:48:44;0001;   pbs_mom;Svr;pbs_mom;Bad UID for job execution 
(15023) in 66.etlpoc4, job_start_error from node 172.21.148.216:15003 in 
job_start_error 
02/13/2006 09:48:44;0001;   pbs_mom;Svr;pbs_mom;Bad UID for job execution 
(15023) in 66.etlpoc4, abort attempted 16 times in job_start_error. 
ignoring abort request from node 172.21.148.216:15003 
02/13/2006 09:48:44;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
sisters 
02/13/2006 09:48:44;0001;   pbs_mom;Req;obit reply;Job not found for obit 
reply 
02/13/2006 09:48:44;0001;   pbs_mom;Job;66.etlpoc4;server rejected job 
obit - unexpected job state 
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type DeleteJob request received 
from PBS_Server at etlpoc4, sock=13 
02/13/2006 09:48:44;0080;   pbs_mom;Req;req_reject;Reject reply 
code=15001(Unknown Job Id REJHOST=etlpoc3 MSG=cannot locate job to 
delete), aux=0, type=DeleteJob, from PBS_Server at etlpoc4 
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type QueueJob request received 
from PBS_Server at etlpoc4, sock=10 
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type JobScript request received 
from PBS_Server at etlpoc4, sock=10 
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type ReadyToCommit request 
received from PBS_Server at etlpoc4, sock=10 
02/13/2006 09:48:44;0100;   pbs_mom;Req;;Type Commit request received from 
PBS_Server at etlpoc4, sock=10 
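
Since the same entries repeat over and over, a small tally can make the 
loop easier to see than scrolling through the raw log. The following is a 
rough sketch, not a TORQUE tool: it assumes each entry sits on one line in 
the real log file (the excerpts above are wrapped by the mail client) and 
that the fields are separated by semicolons in the order timestamp, event 
code, daemon, class, object, message, as they appear above. The function 
name summarize is made up for this example.

```python
from collections import Counter


def summarize(lines):
    """Count log entries by (daemon, message), ignoring timestamps."""
    counts = Counter()
    for line in lines:
        # Split into at most six fields so semicolons in the message survive.
        parts = line.rstrip("\n").split(";", 5)
        if len(parts) != 6:
            continue  # not a complete log entry; skip it
        _stamp, _code, daemon, _cls, _obj, message = parts
        counts[(daemon.strip(), message.strip())] += 1
    return counts


if __name__ == "__main__":
    # Two entries copied from the mom log above, joined onto single lines.
    sample = [
        "02/13/2006 09:48:44;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters",
        "02/13/2006 09:48:44;0001;   pbs_mom;Req;obit reply;Job not found for obit reply",
        "02/13/2006 09:48:44;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters",
    ]
    for (daemon, message), n in summarize(sample).most_common():
        print(n, daemon, message)
```

Feeding it the full mom and server logs should show which handful of 
messages account for the growth.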


Any ideas of how to diagnose would be appreciated. 

Thanks, 

Jonas