[torqueusers] node allocation issue (UNCLASSIFIED)

Hazelrig, Chris AMRDEC/SIMTECH chris.hazelrig at us.army.mil
Mon Sep 15 09:50:23 MDT 2008


Classification:  UNCLASSIFIED 
Caveats: NONE

Greetings, fellow torque users!

I'm having a problem with node allocation.  The first node in the nodes list
is always loaded with one job too many.  This happens no matter which node
is listed first and no matter what value of np is used for that node.  My
configuration is as follows:

SLES 10 (x86_64)

kernel 2.6.16.21-0.8-smp

TORQUE version 2.2.1

stock fifo scheduler

server_priv/nodes:  n065 np=7 PE1950 fastest
                    n066 np=8 PE1950 fastest
                    n067 np=8 PE1950 fastest
                    n068 np=8 PE1950 fastest
                       .
                       .
                       .
                    n075 np=8 PE1950 fastest
                    n076 np=8 PE1950 fastest
                    n077 np=8 PE1950 fastest
                    n000 np=8 PE1950 fastest
                    n033 np=2 PE1850 faster
                    n034 np=2 PE1850 faster
                       .
                       .
                       .
                    n063 np=2 PE1850 faster
                    n064 np=2 PE1850 faster

(The PE1950s have dual quad-core CPUs, hence np=8, and the PE1850s have dual
single-core CPUs, so np=2.  The head node, n000, is included as a compute
node.)
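
For reference, a quick sanity check of the logical core count on any node --
just generic Linux, nothing TORQUE-specific:

    grep -c ^processor /proc/cpuinfo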

qmgr -c 'l s':  Server n000.unclassified.vtc
                        server_state = Active
                        scheduling = True
                        total_jobs = 989
                        state_count = Transit:0 Queued:817 Held:0 Waiting:0 Running:172 Exiting:0
                        acl_roots = root
                        managers = root at n000.unclassified.vtc
                        operators = root at n000.unclassified.vtc
                        default_queue = submit
                        log_events = 511
                        mail_from = adm
                        query_other_jobs = True
                        resource_available.ncpus = 176
                        resource_default.ncpus = 1
                        resource_max.ncpus = 176
                        resource_assigned.ncpus = 172
                        scheduler_iteration = 600
                        node_check_rate = 150
                        tcp_timeout = 6
                        mom_job_sync = True
                        pbs_version = 2.2.1
                        log_file_max_size = 100000
                        log_file_roll_depth = 99999
                        net_counter = 6 4 45

qmgr -c 'l q submit':  Queue submit
                               queue_type = Route
                               total_jobs = 0
                               state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
                               mtime = Fri Jun 27 12:00:26 2008
                               route_destinations = default
                               enabled = True
                               started = True

qmgr -c 'l q default':  Queue default
                               queue_type = Execution
                               total_jobs = 989
                               state_count = Transit:0 Queued:817 Held:0 Waiting:0 Running:172 Exiting:0
                               from_route_only = True
                               mtime = Fri Jun 27 12:01:07 2008
                               resources_assigned.ncpus = 172
                               enabled = True
                               started = True

qmgr -c 'l n n065':  Node n065
                             state = job-sharing
                             np = 7
                             properties = PE1950,fastest
                             ntype = cluster
                             jobs = 0/11029.n000.unclassified.vtc, 0/11004.n000.unclassified.vtc,
                                    0/10994.n000.unclassified.vtc, 0/10983.n000.unclassified.vtc,
                                    0/10979.n000.unclassified.vtc, 0/10832.n000.unclassified.vtc,
                                    0/10825.n000.unclassified.vtc, 0/10813.n000.unclassified.vtc
                             status = opsys=linux,
                                      uname=Linux n065 2.6.16.21-0.8-smp #1 SMP Mon Jul 3 18:25:39 UTC 2006 x86_64,
                                      sessions=? 15201,nsessions=? 15201,nusers=0,idletime=452668,
                                      totmem=18545960kb,availmem=16898436kb,physmem=16441488kb,
                                      ncpus=8,loadave=9.00,netload=3901175490,state=free,
                                      jobs=10818.unclassified.vtc 10825.unclassified.vtc 10832.unclassified.vtc
                                           10979.unclassified.vtc 10983.unclassified.vtc 10994.unclassified.vtc
                                           11004.unclassified.vtc 11029.unclassified.vtc,
                                      varattr=,rectime=1221488389
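
Note that the dump above shows eight job entries on n065 even though np=7.
A quick way to count them, assuming the stock pbsnodes output format (a
single "jobs = ..." line with comma-separated slot/jobid entries):

    pbsnodes n065 | grep 'jobs = ' | tr ',' '\n' | wc -l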

pbsnodes -l:  n044
              n045

torque configure script:  #!/bin/bash
                          ../configure \
                          --libdir=/usr/local/lib64 \
                          --enable-docs \
                          --enable-server \
                          --enable-mom \
                          --enable-clients \
                          --with-tmpdir=/usr/tmp \
                          --enable-syslog \
                          --with-sched=c \
                          --with-rcp=scp \
                          --disable-rpp \
                          --enable-tcl-qstat \
                          CC="gcc -m64"

I thought that setting the value of np for the first node to (total cores -
1) would be an easy fix, but that causes an error message about
"nps needed/free: 1/-1" to be issued over and over in the server log for
every queued job (a quick way to count the occurrences is sketched after the
tracejob excerpt below).  tracejob reports it as well, e.g.:


Job 11046.n000.unclassified.vtc

09/15/2008 06:54:29  S    Job modified at request of Scheduler at n000.unclassified.vtc
09/15/2008 06:54:29  S    could not locate requested resources '1#shared' (node_spec failed) cannot allocate node 'n065' to job - node not currently available (nps needed/free: 1/-1,  joblist: 10994.n000.unclassified.vtc:0,10983.n000.unclassified.vtc:0,10979.n000.unclassified.vtc:0,10955.n000.unclassified.vtc:0,10909.n000.unclassified.vtc:0,10832.n000.unclassified.vtc:0,10825.n000.unclassified.vtc:0,10818.n000.unclassified.vtc:0
09/15/2008 06:57:03  S    could not locate requested resources '1#shared' (node_spec failed) cannot allocate node 'n065' to job - node not currently available (nps needed/free: 1/-1,  joblist: 11004.n000.unclassified.vtc:0,10994.n000.unclassified.vtc:0,10983.n000.unclassified.vtc:0,10979.n000.unclassified.vtc:0,10909.n000.unclassified.vtc:0,10832.n000.unclassified.vtc:0,10825.n000.unclassified.vtc:0,10818.n000.unclassified.vtc:0
09/15/2008 08:40:23  S    could not locate requested resources '1#shared' (node_spec failed) cannot allocate node 'n065' to job - node not currently available (nps needed/free: 1/-1,  joblist: 11029.n000.unclassified.vtc:0,11004.n000.unclassified.vtc:0,10994.n000.unclassified.vtc:0,10983.n000.unclassified.vtc:0,10979.n000.unclassified.vtc:0,10832.n000.unclassified.vtc:0,10825.n000.unclassified.vtc:0,10818.n000.unclassified.vtc:0
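
To gauge how badly the log gets flooded, I count occurrences in the current
server log like so (this assumes the default /var/spool/torque spool
directory and TORQUE's YYYYMMDD log file naming):

    grep -c 'nps needed/free: 1/-1' /var/spool/torque/server_logs/20080915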


I found torqueusers threads related to this message, but the symptoms
reported there are different.

Any help would be greatly appreciated.


Regards,
Chris

___________________________

Chris Hazelrig
Simulation Technologies, Inc.
Rm. H456, Bldg. 5400, RSA, AL
phone:  (256)955-7305
        (256)876-4204
FAX:    (256)955-7376
email:  Chris.Hazelrig at us.army.mil

Hardware In The Loop Simulation
Systems Simulation and Development Directorate
Aviation & Missile Research, Development, & Engineering Center (AMRDEC)
US Army Research, Development, & Engineering Command (AMSRD-AMR-SS-HW)
Redstone Arsenal, AL

Classification:  UNCLASSIFIED 
Caveats: NONE
