[torquedev] [Bug 209] New: pbs_server rejects Obits with cray_enabled

bugzilla-daemon at supercluster.org
Wed Jul 25 21:48:50 MDT 2012


           Summary: pbs_server rejects Obits with cray_enabled
           Product: TORQUE
           Version: 4.0.*
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P5
         Component: pbs_server
        AssignedTo: dbeer at adaptivecomputing.com
        ReportedBy: ezellma at ornl.gov
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0

With cray_enabled, I have been seeing rejected Obits that leave "orphaned" jobs:

07/25/2012 20:10:21;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=JobObituary, from pbs_mom at c1-batch3
07/25/2012 20:10:21;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type JobObituary on socket 8
07/25/2012 20:10:21;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Unknown Job Id Error (15001) in 370821.c1-sys0.ncrc.gov, Job Obit notice received from c1-batch3 has error 15001

c1-sys0:/var/spool/torque/server_logs # qdel 370821
qdel: Server could not connect to MOM 370821.c1-sys0.ncrc.gov

The problem is that req_jobobit() checks that the requesting host's address
matches ji_qs.ji_un.ji_exect.ji_momaddr.  For cray_enabled, this is set to the
first Cray node, not the login_node.
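
To make the mismatch concrete, here is a minimal, self-contained model of the
check described above. It is not TORQUE source: struct job_model,
obit_sender_matches(), and the numeric addresses used with it are hypothetical
stand-ins. Only the idea of comparing the sender's address against the stored
mother superior address (ji_momaddr) is taken from the description.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of the server's job record. */
struct job_model {
    uint32_t momaddr;   /* models ji_qs.ji_un.ji_exect.ji_momaddr */
};

/*
 * Models the check described for req_jobobit(): the Obit is accepted only
 * if it arrives from the address recorded as the job's mother superior.
 * (Assumption: a plain equality test on one IPv4-style address.)
 */
static bool obit_sender_matches(const struct job_model *job,
                                uint32_t sender_addr)
{
    return job->momaddr == sender_addr;
}
```

In this model, a job whose momaddr holds the first Cray node's address fails
the check whenever the Obit arrives from the login node, which would match the
rejection seen in the logs.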

I'm not sure if it's better for req_jobobit() to become cray_enabled-aware and
realize that the connecting host's address should match the login_node, or if
set_job_exec_info() should set the login node as the mother superior.
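
The two options can be sketched side by side in a small, self-contained model.
Everything here is an assumption made for illustration: struct job_model, the
login_addr field, the cray_enabled parameter, obit_sender_ok(), and
set_exec_info_model() are hypothetical names, not TORQUE's actual structures or
function signatures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of the server's job record. */
struct job_model {
    uint32_t momaddr;     /* models ji_momaddr */
    uint32_t login_addr;  /* hypothetical: address of the job's login node */
};

/*
 * Option 1: the Obit check becomes cray_enabled-aware and compares the
 * sender against the login node instead of the stored momaddr.
 */
static bool obit_sender_ok(const struct job_model *job, uint32_t sender,
                           bool cray_enabled)
{
    uint32_t expected = cray_enabled ? job->login_addr : job->momaddr;
    return sender == expected;
}

/*
 * Option 2: when exec info is set, record the login node as the mother
 * superior, so an unchanged equality check on momaddr passes.
 */
static void set_exec_info_model(struct job_model *job, uint32_t first_node,
                                uint32_t login, bool cray_enabled)
{
    job->login_addr = login;
    job->momaddr = cray_enabled ? login : first_node;
}
```

Option 1 keeps the stored address untouched and changes only the comparison.
Option 2 changes what is stored, which may affect any other code that reads
the mother superior address.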

Let me know if you need more logs or want me to gather more info when we get a
job in this state.

