[torquedev] [Bug 209] New: pbs_server rejects Obits with cray_enabled
bugzilla-daemon at supercluster.org
Wed Jul 25 21:48:50 MDT 2012
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=209
Summary: pbs_server rejects Obits with cray_enabled
Product: TORQUE
Version: 4.0.*
Platform: PC
OS/Version: Linux
Status: NEW
Severity: major
Priority: P5
Component: pbs_server
AssignedTo: dbeer at adaptivecomputing.com
ReportedBy: ezellma at ornl.gov
CC: torquedev at supercluster.org
Estimated Hours: 0.0
With cray_enabled set, I have been seeing pbs_server reject Obits, which leaves
"orphaned" jobs. The server log shows:
07/25/2012 20:10:21;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=JobObituary, from pbs_mom at c1-batch3
07/25/2012 20:10:21;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type JobObituary on socket 8
07/25/2012 20:10:21;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Unknown Job Id Error (15001) in 370821.c1-sys0.ncrc.gov, Job Obit notice received from c1-batch3 has error 15001
c1-sys0:/var/spool/torque/server_logs # qdel 370821
qdel: Server could not connect to MOM 370821.c1-sys0.ncrc.gov
The problem is that req_jobobit() checks that the requesting host's address
matches ji_qs.ji_un.ji_exect.ji_momaddr. With cray_enabled, that field is set to
the first Cray node, not the login_node, so an Obit arriving from the login node
fails the check and is rejected with error 15001.
I'm not sure if it's better for req_jobobit() to become cray_enabled-aware and
realize that the connecting host's address should match the login_node, or if
set_job_exec_info() should set the login node as the mother superior.
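The two options above can be sketched in C as follows. This is only an
illustration of the logic described in this report, not the TORQUE source: the
struct job_sketch type, its login_addr field, and all three function names are
hypothetical, and the real req_jobobit() and set_job_exec_info() signatures are
not reproduced here. Addresses are modeled as plain 32-bit integers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the job fields involved.
 * This is NOT the real TORQUE job struct. */
struct job_sketch {
    uint32_t momaddr;    /* models ji_qs.ji_un.ji_exect.ji_momaddr */
    uint32_t login_addr; /* assumption: address of the job's login node */
};

/* The check as described in the report: the Obit sender must match the
 * stored mother superior address. */
bool obit_sender_ok_current(const struct job_sketch *job, uint32_t sender)
{
    return sender == job->momaddr;
}

/* Option 1 (hypothetical): make the Obit check cray_enabled-aware, so that
 * the sender is compared against the login node instead. */
bool obit_sender_ok_cray_aware(const struct job_sketch *job, uint32_t sender,
                               bool cray_enabled)
{
    uint32_t expected = cray_enabled ? job->login_addr : job->momaddr;
    return sender == expected;
}

/* Option 2 (hypothetical): when the exec info is recorded, store the login
 * node as the mother superior, so the existing check passes unchanged. */
void set_exec_info_sketch(struct job_sketch *job, uint32_t first_exec_addr,
                          uint32_t login_addr, bool cray_enabled)
{
    job->login_addr = login_addr;
    job->momaddr = cray_enabled ? login_addr : first_exec_addr;
}
```

With momaddr holding the first Cray node and the Obit arriving from the login
node, obit_sender_ok_current() returns false, which corresponds to the rejection
in the log. Either option makes the same Obit pass; the difference is whether
the special case lives in the Obit check or in the stored job state.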
Let me know if you need more logs or want me to gather more info when we get a
job in this state.