[torqueusers] ComputeNodes' /var/log/messages flooded with "unknown command 5"

Ezell, Matthew A. ezellma at ornl.gov
Mon Oct 28 10:41:26 MDT 2013


I see this message in the mom_log on my desktop where I'm running
pbs_sched.  I briefly looked into it.

The fifo scheduler of pbs_sched has a talk_with_mom() function that
contacts each pbs_mom to find out information like ncpus, arch, max_load,
ideal_load, etc.  This uses the RM protocol on the RM port (the same one
momctl uses). Back in 2011, commit
577e8cb29263075c2d38155e8fc6686b88e0d5af changed the RM protocol, but
pbs_sched didn't get the memo.  (By the way, are the PBS ERS and IDS still
being updated?)  A new field was added to the "header" to indicate how
many commands were coming across the wire.  Since pbs_sched doesn't send
this, the pbs_mom reads the command as the number of commands and the
first string as the command.  This happens to be "ncpus", a 5-character
string.  When read with disrui() instead of diswcs(), you get command #5
(followed by garbage on the wire).
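
To illustrate the mismatch, here is a minimal sketch in Python rather than the real C code. It assumes a simplified model of the DIS wire encoding, in which an unsigned integer is written as "+" plus its digits (with a digit-count prefix for multi-digit values) and a counted string is written as its length followed by the raw characters. The function names, the command value 2, and the reduced message layout (the protocol/version fields are left out) are all hypothetical and chosen only for illustration.

```python
# Simplified model of DIS-style encoding. This is NOT the actual libdis
# code; it only mimics the layout closely enough to show the mismatch.

def encode_uint(n: int) -> str:
    """Encode an unsigned int: optional digit-count prefix, '+', digits."""
    digits = str(n)
    prefix = ""
    count = len(digits)
    while count > 1:          # multi-digit values carry a count prefix
        c = str(count)
        prefix = c + prefix
        count = len(c)
    return prefix + "+" + digits


def encode_string(s: str) -> str:
    """Encode a counted string: length as an unsigned int, then the bytes."""
    return encode_uint(len(s)) + s


def read_uint(buf: str, pos: int, count: int = 1):
    """Read an unsigned int starting at pos; return (value, new_pos)."""
    if buf[pos] == "+":
        start = pos + 1
        return int(buf[start:start + count]), start + count
    ndigits = int(buf[pos:pos + count])   # this was a digit count, recurse
    return read_uint(buf, pos + count, ndigits)


def read_string(buf: str, pos: int):
    """Read a counted string starting at pos; return (string, new_pos)."""
    length, pos = read_uint(buf, pos)
    return buf[pos:pos + length], pos + length


HYPOTHETICAL_CMD_REQUEST = 2  # assumed value, for illustration only


def old_sender(resource: str) -> str:
    """Old-style client: sends the command, then the resource name."""
    return encode_uint(HYPOTHETICAL_CMD_REQUEST) + encode_string(resource)


def new_receiver(wire: str):
    """New-style server: expects a command count before the command."""
    num_commands, pos = read_uint(wire, 0)   # actually the command
    command, pos = read_uint(wire, pos)      # actually the string length
    return num_commands, command, wire[pos:]


wire = old_sender("ncpus")
result = new_receiver(wire)
print(repr(wire), result)
```

In this model the receiver takes the real command as the command count, then reads the length prefix of "ncpus" as the command, which yields 5 and leaves the characters of "ncpus" unread on the wire.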

I'm not sure if the fifo scheduler actually *needs* this information from
the mom, so it might be OK to just comment out the talk_with_mom() function.
If it is needed, then the RM functions in the PBS API need to be updated
(and potentially some code in the fifo scheduler also).

~Matt

---
Matt Ezell
HPC Systems Administrator
Oak Ridge National Laboratory




On 9/5/13 7:09 PM, "David Beer" <dbeer at adaptivecomputing.com> wrote:

>No worries, I was just curious to make sure the rest of it was typed
>correctly.
>
>
>I don't know of anything that runs momctl - that is usually a user
>command that has to be run by root. I'm really at a loss for what might
>cause it to get run and even more for why it'd be getting run with the
>wrong command.
>
>
>
>On Thu, Sep 5, 2013 at 4:47 PM, Kamran Khan
><kamran at pssclabs.com> wrote:
>
>Hi David,
>
>
>
>Sorry, that was a typo.  I didn't paste it, typed it out.  It does say
>"rm_request"
>
>
>
>Where would that command '5' be coming from?  Is there a spot that I can
>check which runs momctl every 10 seconds or so?
>
>
>
>Please let me know.
>
>
>
>Thanks.
>--
>Kamran Khan
>PSSC Labs
>HPC Software Technical Engineer
>
>
>________________________________________
>
>From: "David Beer" <dbeer at adaptivecomputing.com>
>To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>Sent: Thursday, September 5, 2013 2:53:13 PM
>Subject: Re: [torqueusers] ComputeNodes' /var/log/messages flooded with
>"unknown command 5"
>
>
>
>Can you be sure you're pasting the exact message from the syslog? I'm
>just suspicious because that says "rpm_request" when it should say
>"rm_request." Assuming the rest of it is correct, command 5 would mean
>someone is sending a command '5' via the momctl command, which isn't a
>recognized command.
>
>
>
>
>On Thu, Sep 5, 2013 at 3:22 PM, Kamran Khan
><kamran at pssclabs.com> wrote:
>
>Hi All,
>
>I have a HeadNode and (11) ComputeNodes, all configured with Torque.
>
>On the ComputeNodes, the /var/log/messages files are being flooded every
>10 seconds with the following message:
>
>n001 pbs_mom: LOG_ERROR: :rpm_request, unknown command 5
>
>
>So far as I can tell, I am having no problems running any jobs through
>Torque, but this cluster is for a customer who may see the logs and start
>freaking out.  Is this a common error?  Is there any way to get rid of
>these messages?
>
>Any help would be appreciated.
>
>Thanks.
>--
>Kamran Khan
>PSSC Labs
>HPC Software Technical Engineer
>
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
>
>
>-- 
>David Beer | Senior Software Engineer
>Adaptive Computing


