[torqueusers] Unable to get a simple job unqueued...

skip at pobox.com
Mon Oct 11 10:40:46 MDT 2010


I'm having trouble getting a new Torque installation running on a different
subnet here.  So far I have pbs_server, pbs_mom and maui all running on the
same host, known locally as druserver16.wackerbcp.  It has a private IP
address, 192.168.66.214, and its name resolves in both directions (forward
and reverse):

    % host druserver16
    druserver16.wackerbcp.<TLD> is an alias for druserver16.vlan100.wackerbcp.<TLD>.
    druserver16.vlan100.wackerbcp.<TLD> has address 192.168.66.214
    druserver16.vlan100.wackerbcp.<TLD> mail is handled by 5 mailhost.wackerbcp.<TLD>.
    druserver16.vlan100.wackerbcp.<TLD> mail is handled by 10 druserver16.vlan100.wackerbcp.<TLD>.
    % host druserver16.wackerbcp
    druserver16.wackerbcp.<TLD> is an alias for druserver16.vlan100.wackerbcp.<TLD>.
    druserver16.vlan100.wackerbcp.<TLD> has address 192.168.66.214
    druserver16.vlan100.wackerbcp.<TLD> mail is handled by 10 druserver16.vlan100.wackerbcp.<TLD>.
    druserver16.vlan100.wackerbcp.<TLD> mail is handled by 5 mailhost.wackerbcp.<TLD>.
    % host 192.168.66.214
    214.66.168.192.in-addr.arpa domain name pointer druserver16.vlan100.wackerbcp.<TLD>.

("<TLD>" is our top-level domain.)
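For anyone who wants to repeat the both-ways check from a script, here is a
minimal sketch in Python.  The function name check_both_ways and its
forward/reverse parameters are made up for illustration.  It uses the system
resolver (socket.gethostbyname and socket.gethostbyaddr), which may consult
/etc/hosts as well as DNS, so it is not guaranteed to agree with host(1).
Whether a difference between the name you start from and the name the
reverse lookup returns matters to pbs_server is an open question here, not
something this script settles.

```python
import socket


def _norm(host):
    """Lower-case a host name and drop any trailing dot."""
    return host.rstrip(".").lower()


def check_both_ways(name, forward=socket.gethostbyname,
                    reverse=socket.gethostbyaddr):
    """Resolve name to an address, resolve that address back, and compare.

    Returns (address, reverse_names, same_name).  same_name is True only
    if the name we started from is among the names the reverse lookup
    returned (primary name or aliases).
    """
    address = forward(name)
    primary, aliases, _addresses = reverse(address)
    reverse_names = [_norm(primary)] + [_norm(a) for a in aliases]
    return address, reverse_names, _norm(name) in reverse_names


if __name__ == "__main__":
    import sys
    for host in sys.argv[1:]:
        try:
            print(host, *check_both_ways(host))
        except OSError as exc:  # gaierror and herror are OSError subclasses
            print(host, "lookup failed:", exc)
```

Run it with the short name, the locally known name and the fully qualified
name as arguments to see which of them survive the round trip unchanged.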

I successfully submitted a simple job:

    echo 'echo hi' | qsub

but that job remains queued and won't run:

    % qstat -1n

    druserver16:
                                                                             Req'd  Req'd   Elap
    Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
    -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
    2.druserver16        skipm    batch    STDIN               --    --   --    --    --  Q   --     --

Looking in server_logs/YYYYMMDD I see this warning:

    10/11/2010 11:28:29;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node druserver16.wackerbcp

but there is no further explanation of why the contact attempt failed.  The
mom_logs/YYYYMMDD file shows:

    10/11/2010 11:28:26;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server druserver16
    10/11/2010 11:32:35;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.8, loglevel = 0

which looks okay to me.
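To line up the server and mom log entries by time, something like the
following sketch may help.  It assumes every record has the six
semicolon-separated fields seen in the excerpts above; the field labels in
FIELDS and the helper names are my own, not anything the logs define.

```python
from datetime import datetime

# Labels are illustrative only; the log files do not name their columns.
FIELDS = ("when", "code", "daemon", "klass", "name", "message")


def parse_log_line(line):
    """Split one 'date time;code;daemon;class;name;message' record."""
    # Limit the split so semicolons inside the message are preserved.
    parts = [p.strip() for p in line.strip().split(";", 5)]
    if len(parts) != 6:
        raise ValueError("not a six-field log record: %r" % line)
    record = dict(zip(FIELDS, parts))
    record["when"] = datetime.strptime(record["when"], "%m/%d/%Y %H:%M:%S")
    return record


def seconds_between(earlier, later):
    """Elapsed seconds from one parsed record to another."""
    return (later["when"] - earlier["when"]).total_seconds()


if __name__ == "__main__":
    server = parse_log_line(
        "10/11/2010 11:28:29;0004;PBS_Server;Svr;WARNING;"
        "ALERT: unable to contact node druserver16.wackerbcp")
    mom = parse_log_line(
        "10/11/2010 11:28:26;0002;   pbs_mom;n/a;"
        "mom_server_check_connection;sending hello to server druserver16")
    print(seconds_between(mom, server))  # 3.0
```

On the two records quoted here, the mom's hello is logged three seconds
before the server's warning.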

I don't know enough about how a queued job is handled to tell whether Maui
has done its part by this point, but I do see these warnings in the maui.log
file:

    10/11 11:28:41 WARNING:  no resources detected
    10/11 11:28:41 WARNING:  no workload detected

Any suggestions about where to look for the barrier to execution?

Thanks,

-- 
Skip Montanaro - skip at pobox.com - http://www.smontanaro.net/

