[torqueusers] Using TORQUE in a supercomputer with lots of CPU's - one node - gets job-exclusive

Silas Silva silasdb at gmail.com
Wed Feb 12 11:34:05 MST 2014


On Tue, Feb 11, 2014 at 03:49:01PM -0200, Silas Silva wrote:
> 
> Hi there!
> 
> After installing TORQUE with NUMA support, cpus are recognized as
> independent, separated by NUMA nodes.  The mom.layout file was generated
> by the mom_gencfg script in the contrib/ directory.  After configuring,
> NUMA nodes appear like a charm in pbsnodes.
> 
> But there is a problem, I have some nodes just free (I could even
> allocate a job for bachianas-8) but others are just down.  Anybody could
> help me with this?
> 
> Just below is the output of pbsnodes.
> 
> Thank you very much.

Here I am again.  It seems this is not a Maui issue but a resource
manager (TORQUE) issue.  For some reason, TORQUE reports the other nodes
as unavailable, although they appear as "free" in pbsnodes.  Here is the
TORQUE log:

    02/12/2014 16:14:11;0080;PBS_Server.117097;Req;dis_request_read;decoding command AuthenticateUser from root
    02/12/2014 16:14:11;0008;PBS_Server.117097;Job;dispatch_request;dispatching request AuthenticateUser on sd=10
    02/12/2014 16:14:11;0200;PBS_Server.117097;trqauthd;req_authenuser;addr: 3364214663  port: 2681
    02/12/2014 16:14:11;0008;PBS_Server.117097;Job;reply_send_svr;Reply sent for request type AuthenticateUser on socket 10
    02/12/2014 16:14:11;0080;PBS_Server.117097;Req;dis_request_read;decoding command Disconnect from root
    02/12/2014 16:14:11;0080;PBS_Server.140569;Req;dis_request_read;decoding command RunJob from root
    02/12/2014 16:14:11;0008;PBS_Server.140569;Job;dispatch_request;dispatching request RunJob on sd=8
    02/12/2014 16:14:11;0040;PBS_Server.140569;Req;set_nodes;allocating nodes for job 50.bachianas with node expression 'bachianas-7'
    02/12/2014 16:14:11;0040;PBS_Server.140569;Req;node_spec;entered spec=bachianas-7
    02/12/2014 16:14:11;0040;PBS_Server.140569;Req;node_spec;job allocation debug: 1 requested, 136 svr_clnodes, 1 svr_totnodes
    02/12/2014 16:14:11;0001;PBS_Server.140569;Svr;PBS_Server;LOG_DEBUG::gpu_count, Counted 0 gpus available on node bachianas-7
    02/12/2014 16:14:11;0001;PBS_Server.140569;Svr;PBS_Server;LOG_DEBUG::gpu_count, Counted 0 gpus free on node bachianas-7
    02/12/2014 16:14:11;0040;PBS_Server.140569;Req;node_spec;job allocation request exceeds currently available cluster nodes, 1 requested, 0 available
    02/12/2014 16:14:11;0008;PBS_Server.140569;Job;50.bachianas;could not locate requested resources 'bachianas-7' (node_spec failed) job allocation request exceeds currently available cluster nodes, 1 requested, 0 available
    02/12/2014 16:14:11;0080;PBS_Server.140569;Req;req_reject;Reject reply code=15046(Resource temporarily unavailable MSG=job allocation request exceeds currently available cluster nodes, 1 requested, 0 available), aux=0, type=RunJob, from root at bachianas
    02/12/2014 16:14:11;0008;PBS_Server.140569;Job;reply_send_svr;Reply sent for request type RunJob on socket 8
    02/12/2014 16:14:11;0080;PBS_Server.140569;Req;dis_request_read;decoding command Disconnect from root

Any clue?  Is there anything strange about the gpu_count lines?  How can
I find out which resource is unavailable, so I can debug this further?
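
To narrow things down, here is a minimal sketch of how the node_spec
lines could be pulled out of a log like the one above.  It assumes only
what the excerpt shows: six semicolon-separated fields per line and the
"N requested, M available" wording.  The field labels and the function
names (parse_line, allocation_failures) are my own, not anything TORQUE
defines.

```python
import re

# Labels for the six semicolon-separated fields seen in the excerpt.
# These names are illustrative guesses, not official TORQUE terminology.
FIELDS = ("timestamp", "event", "source", "cls", "name", "message")

# Matches the shortfall wording seen in the node_spec messages above.
COUNTS = re.compile(r"(\d+) requested, (\d+) available")


def parse_line(line):
    """Split one log line into a dict, or return None if it does not fit."""
    parts = line.strip().split(";", 5)  # the message itself may contain ';'
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def allocation_failures(lines):
    """Yield (timestamp, requested, available) for node_spec shortfalls."""
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["name"] != "node_spec":
            continue
        match = COUNTS.search(rec["message"])
        if match:
            yield rec["timestamp"], int(match.group(1)), int(match.group(2))


# Two lines copied from the log above.
sample = [
    "02/12/2014 16:14:11;0040;PBS_Server.140569;Req;node_spec;"
    "entered spec=bachianas-7",
    "02/12/2014 16:14:11;0040;PBS_Server.140569;Req;node_spec;"
    "job allocation request exceeds currently available cluster nodes, "
    "1 requested, 0 available",
]

for timestamp, requested, available in allocation_failures(sample):
    print(timestamp, requested, available)
```

On a real log file, the same generator could be fed the open file object
instead of the sample list.  It only tells me when node_spec came up
short, not why, so it does not answer the question by itself.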

Thanks!

-- 
Silas Silva

