Bugzilla – Bug 209
pbs_server rejects Obits with cray_enabled
Last modified: 2012-09-13 19:30:34 MDT
With cray_enabled, I have been seeing rejected Obits that leave "orphaned" jobs:
07/25/2012 20:10:21;0080;PBS_Server;Req;req_reject;Reject reply
code=15001(Unknown Job Id Error), aux=0, type=JobObituary, from
07/25/2012 20:10:21;0008;PBS_Server;Job;reply_send_svr;Reply sent for request
type JobObituary on socket 8
07/25/2012 20:10:21;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Unknown Job Id
Error (15001) in 370821.c1-sys0.ncrc.gov, Job Obit notice received from
c1-batch3 has error 15001
c1-sys0:/var/spool/torque/server_logs # qdel 370821
qdel: Server could not connect to MOM 370821.c1-sys0.ncrc.gov
The problem is that req_jobobit() checks that the requesting host's address
matches ji_qs.ji_un.ji_exect.ji_momaddr. For cray_enabled, this is set to the
first Cray node, not the login_node.
I'm not sure if it's better for req_jobobit() to become cray_enabled-aware and
realize that the connecting host's address should match the login_node, or if
set_job_exec_info() should set the login node as the mother superior.
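To make the two options concrete, here is a minimal sketch of the address check. It is not the actual TORQUE source: `struct job_sketch`, `login_addr`, and both function names are hypothetical, and addresses are reduced to plain 32-bit integers. Only the idea that the Obit sender is compared against `ji_momaddr` comes from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the job fields discussed above. */
struct job_sketch {
    uint32_t momaddr;    /* stands in for ji_qs.ji_un.ji_exect.ji_momaddr */
    uint32_t login_addr; /* hypothetical field: address of the login node */
};

/* Strict check as described in this report: the Obit is accepted only
 * when the sender's address equals the stored mother superior address. */
bool obit_sender_ok_strict(const struct job_sketch *job, uint32_t sender)
{
    return sender == job->momaddr;
}

/* Sketch of the first option: a cray_enabled-aware check that compares
 * the sender against the login node instead of the first Cray node. */
bool obit_sender_ok_cray_aware(const struct job_sketch *job, uint32_t sender,
                               bool cray_enabled)
{
    if (cray_enabled)
        return sender == job->login_addr;
    return sender == job->momaddr;
}
```

With `momaddr` set to the first Cray node and the Obit arriving from the login node, the strict check fails, which would correspond to the 15001 rejection in the logs, while the cray-aware variant accepts it. The second option would need no change to the check at all: if set_job_exec_info() stored the login node's address in `momaddr`, the strict comparison would already pass.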
Let me know if you need more logs or want me to gather more info when we get a
job in this state.
It looks like this might be fixed by revision 6675. Next time I get a chance,
I'll upgrade and report back.
Sorry I neglected to update this bug. Yes, this should be fixed as of revision
6675. I'll wait for you to verify before I close the bug.
(In reply to comment #1)
> It looks like this might be fixed by revision 6675. Next time I get a chance,
> I'll upgrade and report back.
I haven't seen this again. Closing.