Bug 209 - pbs_server rejects Obits with cray_enabled
Status: RESOLVED FIXED
Product: TORQUE
Component: pbs_server
Version: 4.0.*
Hardware: PC Linux
Importance: P5 major
Assigned To: David Beer
 
Reported: 2012-07-25 21:48 MDT by Matt Ezell
Modified: 2012-09-13 19:30 MDT (History)

Description Matt Ezell 2012-07-25 21:48:50 MDT
With cray_enabled, I have been seeing rejected Obits that leave "orphaned"
jobs.


07/25/2012 20:10:21;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=JobObituary, from pbs_mom@c1-batch3
07/25/2012 20:10:21;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type JobObituary on socket 8
07/25/2012 20:10:21;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Unknown Job Id Error (15001) in 370821.c1-sys0.ncrc.gov, Job Obit notice received from c1-batch3 has error 15001


c1-sys0:/var/spool/torque/server_logs # qdel 370821
qdel: Server could not connect to MOM 370821.c1-sys0.ncrc.gov

The problem is that req_jobobit() checks that the requesting host's address
matches ji_qs.ji_un.ji_exect.ji_momaddr.  For cray_enabled, this is set to the
first Cray node, not the login_node, so an Obit sent by the pbs_mom on the
login node fails the check and is rejected.

I'm not sure if it's better for req_jobobit() to become cray_enabled-aware and
realize that the connecting host's address should match the login_node, or if
set_job_exec_info() should set the login node as the mother superior.
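
To make the two options concrete, here is a minimal sketch in C.  It is not
TORQUE source: job_sketch, login_addr, and all three functions are
hypothetical stand-ins, and only the field name ji_momaddr comes from the
report above.  It assumes addresses can be compared as plain 32-bit values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for the job record.
 * Only ji_momaddr is named in the report; login_addr is an assumption. */
typedef struct {
    uint32_t ji_momaddr;  /* address recorded as the job's mother superior */
    uint32_t login_addr;  /* assumed: address of the login node's pbs_mom */
} job_sketch;

/* The check as described in the report: the Obit sender must match
 * the recorded mother superior address. */
static inline bool obit_sender_ok_strict(const job_sketch *job,
                                         uint32_t sender)
{
    return sender == job->ji_momaddr;
}

/* Option 1 (sketch): make the Obit check cray_enabled-aware, so the
 * sender is compared against the login node when cray_enabled is set. */
static inline bool obit_sender_ok_cray_aware(const job_sketch *job,
                                             uint32_t sender,
                                             bool cray_enabled)
{
    uint32_t expected = cray_enabled ? job->login_addr : job->ji_momaddr;
    return sender == expected;
}

/* Option 2 (sketch): record the login node as mother superior when the
 * exec info is set, so the strict check passes without modification. */
static inline void set_exec_info_sketch(job_sketch *job,
                                        uint32_t first_exec_addr,
                                        uint32_t login_addr,
                                        bool cray_enabled)
{
    job->login_addr = login_addr;
    job->ji_momaddr = cray_enabled ? login_addr : first_exec_addr;
}
```

In this sketch, option 1 keeps the first Cray node recorded in ji_momaddr and
changes only the comparison, while option 2 changes what is recorded and
leaves the comparison alone.  Which of these (if either) matches the actual
fix is not stated in this report.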

Let me know if you need more logs or want me to gather more info when we get a
job in this state.
Comment 1 Matt Ezell 2012-08-16 19:59:21 MDT
It looks like this might be fixed by revision 6675.  Next time I get a chance,
I'll upgrade and report back.
Comment 2 David Beer 2012-08-17 09:05:09 MDT
Matt,

Sorry I neglected to update this bug. Yes, this should be fixed as of revision
6675. I'll wait for you to verify before I close the bug.

David

(In reply to comment #1)
> It looks like this might be fixed by revision 6675.  Next time I get a chance,
> I'll upgrade and report back.
Comment 3 Matt Ezell 2012-09-13 19:30:34 MDT
I haven't seen this again.  Closing.