[Mauiusers] Re: [torqueusers] One queue (fast) works - one queue doesn't (medium)

Thomas H Dr Pierce TPierce at rohmhaas.com
Thu May 1 08:12:13 MDT 2008


Hi Steve,

Thanks for thinking about this.

But there are other nodes in the medium queue that do not have a load. 

I think the culprit is maui-3.2.6p20, with a problem handling multiple 
queues somewhere in the config. I am considering installing Maui p18, 
which used to work with multiple queues.

===========================================================

[  ~]$ qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
392.ralphie         ba               6xwwww                 0 Q medium
395.ralphie         Ba               xxxxxx          22:22:03 R fast
396.ralphie         Ba               eeeeee          22:21:18 R fast
397.ralphie         Ba               ffffff          20:39:29 R fast
[ ~]$ checknode node41


checking node node41

State:      Idle  (in current state for 18:06:02)
Configured Resources: PROCS: 2  MEM: 7977M  SWAP: 7977M  DISK: 1M
Utilized   Resources: [NONE]
Dedicated  Resources: [NONE]
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       2.010
Network:    [DEFAULT]
Features:   [d1850]
Attributes: [Batch]
Classes:    [medium 2:2][batch 2:2]

Total Time: 1:19:41:35  Up: 1:15:39:08 (90.75%)  Active: 00:02:12 (0.08%)

Reservations:
NOTE:  no reservations on node
ALERT:  node is in state Idle but load is high (2.010)

[  ~]$ checknode node42


checking node node42

State:      Idle  (in current state for 18:06:13)
Configured Resources: PROCS: 2  MEM: 7977M  SWAP: 7977M  DISK: 1M
Utilized   Resources: [NONE]
Dedicated  Resources: [NONE]
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       0.830
Network:    [DEFAULT]
Features:   [d1850]
Attributes: [Batch]
Classes:    [medium 2:2][batch 2:2]

Total Time: 1:19:41:46  Up: 1:19:40:54 (99.97%)  Active: 00:00:00 (0.00%)

Reservations:
NOTE:  no reservations on node

[  ~]$ checknode node43


checking node node43

State:      Idle  (in current state for 18:06:13)
Configured Resources: PROCS: 2  MEM: 7977M  SWAP: 7977M  DISK: 1M
Utilized   Resources: [NONE]
Dedicated  Resources: [NONE]
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       0.000
Network:    [DEFAULT]
Features:   [d1850]
Attributes: [Batch]
Classes:    [medium 2:2][batch 2:2]

Total Time: 1:19:41:46  Up: 1:19:41:05 (99.97%)  Active: 00:00:00 (0.00%)

Reservations:
NOTE:  no reservations on node

[ ~]$ checknode node46


checking node node46

State:      Idle  (in current state for 18:06:13)
Configured Resources: PROCS: 2  MEM: 7977M  SWAP: 7977M  DISK: 1M
Utilized   Resources: [NONE]
Dedicated  Resources: [NONE]
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       1.090
Network:    [DEFAULT]
Features:   [d1850]
Attributes: [Batch]
Classes:    [medium 2:2][batch 2:2]

Total Time: 1:19:41:46  Up: 18:42:50 (42.83%)  Active: 00:00:00 (0.00%)

Reservations:
NOTE:  no reservations on node
ALERT:  node is in state Idle but load is high (1.090)

[  ~]$ checknode node47


checking node node47

State:      Idle  (in current state for 18:06:35)
Configured Resources: PROCS: 2  MEM: 3042M  SWAP: 4881M  DISK: 1M
Utilized   Resources: [NONE]
Dedicated  Resources: [NONE]
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       0.060
Network:    [DEFAULT]
Features:   [d1850]
Attributes: [Batch]
Classes:    [medium 2:2][batch 2:2]

Total Time: 82:05:22:27  Up: 72:17:20:32 (88.44%)  Active: 00:00:00 (0.00%)

Reservations:
NOTE:  no reservations on node

------
Sincerely,

   Tom Pierce




From: Steve Young <slyoung at hamilton.edu>
Date: 04/30/2008 07:03 PM
To: Thomas H Dr Pierce <TPierce at rohmhaas.com>
Subject: Re: [torqueusers] One queue (fast) works - one queue doesn't (medium)

Hi,
If you look at your checknode output, you'll notice an alert: the machine 
is idle (no jobs scheduled on it), but its load is already at the max of 
2.0. Because of this, node46 won't get allocated until the load goes 
back down. 
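
The check described above can be sketched in a few lines of Python. This is only an illustration, not Maui's actual logic: the function name idle_but_loaded is hypothetical, and the rule "load at or above the processor count means saturated" is an assumption drawn from the "max of 2.0" on a 2-processor node.

```python
import re

def idle_but_loaded(checknode_text):
    """Return (procs, load, flagged) for one checknode report.

    Assumption: a node counts as saturated once its load reaches
    its configured processor count. Maui's real threshold may differ.
    """
    state = re.search(r"State:\s+(\w+)", checknode_text).group(1)
    procs = int(re.search(r"PROCS:\s+(\d+)", checknode_text).group(1))
    load = float(re.search(r"Load:\s+([\d.]+)", checknode_text).group(1))
    return procs, load, state == "Idle" and load >= procs

# Excerpt of the node46 report quoted in this thread.
node46 = """State:      Idle  (in current state for 1:03:36)
Configured Resources: PROCS: 2  MEM: 7977M  SWAP: 7977M  DISK: 1M
Speed:      1.00  Load:       2.000
"""
print(idle_but_loaded(node46))
```

Running the same function over the checknode output of every node in a queue would show whether any candidate node is idle and below the assumed threshold.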

-Steve

On Apr 30, 2008, at 4:41 PM, Thomas H Dr Pierce wrote:


Dear Schedulers, 
I cannot tell if this is a Torque or a Maui issue. 

Basically, one queue runs jobs (the fast queue) and the other does not 
(the medium queue). It seems to want 16 GB of SWAP, a setting I 
cannot override.   

Why would one queue work and the other not? 

Thanks for your help. 

 Qstat: 
Job id              Name             User            Time Use S Queue 
------------------- ---------------- --------------- -------- - ----- 
392.ralphie         base_83          xxxxxx                 0 Q medium 
395.ralphie         Ba               yyyyyy          05:18:33 R fast 
396.ralphie         Ba               yyyyyy          05:18:16 R fast 
397.ralphie         Ba               yyyyyy          03:27:32 R fast 
402.ralphie         cfd              zzzzzz          00:25:23 R fast 
404.ralphie         test_8300        wwwwww                 0 Q medium 


 checkjob 404 


checking job 404 

State: Idle  EState: Deferred 
Creds:  user:rrrrrr  group:users  class:medium  qos:DEFAULT 
WallTime: 00:00:00 of 99:23:59:59 
SubmitTime: Wed Apr 30 16:20:46 
  (Time Queued  Total: 00:04:03  Eligible: 00:00:01) 

Total Tasks: 2 

Req[0]  TaskCount: 2  Partition: ALL 
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0 
Opsys: [NONE]  Arch: [NONE]  Features: [d1850] 
Dedicated Resources Per Task: PROCS: 1  MEM: 250M  SWAP: 16G 


IWD: [NONE]  Executable:  [NONE] 
Bypass: 0  StartCount: 0 
PartitionMask: [ALL] 
Flags:       RESTARTABLE 

job is deferred.  Reason:  NoResources  (cannot create reservation for job 
'404' (intital reservation attempt) 
) 
Holds:    Defer  (hold reason:  NoResources) 
PE:  15.80  StartPriority:  1 
cannot select job 404 for partition DEFAULT (job hold active) 
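
One way to test the SWAP suspicion is to compare the "Dedicated Resources Per Task" line from checkjob with the "Configured Resources" line from checknode. The sketch below is a simplified comparison, not Maui's real matching algorithm; the helper names parse_resources and unmet are hypothetical, and the thread does not confirm that swap is the actual cause.

```python
import re

# Sizes are normalised to megabytes; a bare number (e.g. PROCS) is left as-is.
UNITS = {"": 1, "M": 1, "G": 1024}

def parse_resources(line):
    """Parse 'PROCS: 1  MEM: 250M  SWAP: 16G' style text into a dict."""
    found = re.findall(r"(PROCS|MEM|SWAP|DISK):\s+(\d+)([MG]?)", line)
    return {key: int(num) * UNITS[unit] for key, num, unit in found}

def unmet(per_task, node):
    """List resources where one task asks for more than the node offers."""
    return [k for k, need in per_task.items() if need > node.get(k, 0)]

# Lines copied from the checkjob 404 and checknode node46 output.
job = parse_resources(
    "Dedicated Resources Per Task: PROCS: 1  MEM: 250M  SWAP: 16G")
node = parse_resources(
    "Configured Resources: PROCS: 2  MEM: 7977M  SWAP: 7977M  DISK: 1M")
print(unmet(job, node))
```

With the values above, a single task's 16G swap request exceeds the 7977M of swap the node reports, which would be consistent with the NoResources deferral.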

=========================================================================================== 

Output of qmgr -c "p s": 
# Create and define queue fast 
# 
create queue fast 
set queue fast queue_type = Execution 
set queue fast Priority = 40 
set queue fast max_running = 64 
set queue fast acl_host_enable = False 
set queue fast acl_hosts = node19 
set queue fast acl_hosts += node09 
set queue fast acl_hosts += node18 
set queue fast acl_hosts += node08 
set queue fast resources_default.neednodes = d1950 
set queue fast resources_default.nodes = 1 
set queue fast resources_available.nodect = 64 
set queue fast enabled = True 
set queue fast started = True 
# 
# Create and define queue medium 
# 
create queue medium 
set queue medium queue_type = Execution 
set queue medium Priority = 40 
set queue medium max_running = 10 
set queue medium acl_host_enable = False 
set queue medium acl_hosts = node49 
set queue medium acl_hosts += node48 
set queue medium acl_hosts += node41 
set queue medium acl_hosts += node46 
set queue medium acl_hosts += node43 
set queue medium acl_hosts += node42 
set queue medium resources_max.mem = 32gb 
set queue medium resources_max.vmem = 32gb 
set queue medium resources_default.neednodes = d1850 
set queue medium resources_default.nodes = 1 
set queue medium resources_available.nodect = 40 
set queue medium enabled = True 
set queue medium started = True 
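
To answer "why would one queue work and the other not", it can help to list the attributes that are set on one queue but not the other. The following Python sketch parses qmgr "set queue" lines and reports them; the function names queue_settings and only_in are hypothetical, and the sample is an excerpt of the output above rather than the full listing.

```python
import re

SET_LINE = re.compile(r"set queue (\S+) (\S+) (\+?=) (.+)")

def queue_settings(qmgr_text, queue):
    """Collect 'set queue <queue> attr = value' lines into {attr: [values]}."""
    settings = {}
    for line in qmgr_text.splitlines():
        m = SET_LINE.match(line.strip())
        if not m or m.group(1) != queue:
            continue
        attr, op, value = m.group(2), m.group(3), m.group(4).strip()
        if op == "+=":
            settings.setdefault(attr, []).append(value)  # list-valued attribute
        else:
            settings[attr] = [value]
    return settings

def only_in(a, b):
    """Attributes set on queue a that are not set at all on queue b."""
    return sorted(set(a) - set(b))

sample = """set queue fast resources_default.neednodes = d1950
set queue fast resources_default.nodes = 1
set queue medium resources_max.mem = 32gb
set queue medium resources_max.vmem = 32gb
set queue medium resources_default.neednodes = d1850
set queue medium resources_default.nodes = 1
"""
fast = queue_settings(sample, "fast")
medium = queue_settings(sample, "medium")
print(only_in(medium, fast))
```

On this excerpt the only attributes unique to the medium queue are resources_max.mem and resources_max.vmem. Whether the 32gb vmem limit is related to the SWAP: 16G shown by checkjob for a 2-task job is a guess, not something established here.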

================================================================================================== 

pbsnodes -a 

node41 
     state = free 
     np = 2 
     properties = d1850 
     ntype = cluster 
     status = opsys=linux,uname=Linux node41 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64,sessions=4329 4379,nsessions=2,nusers=2,idletime=33784,totmem=5678720kb,availmem=2859524kb,physmem=8169096kb,ncpus=4,loadave=1.98,netload=4294967294,state=free,jobs=? 0,rectime=1209587507 

node42 
     state = free 
     np = 2 
     properties = d1850 
     ntype = cluster 
     status = opsys=linux,uname=Linux node42 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64,sessions=4321 6593,nsessions=2,nusers=2,idletime=818682,totmem=4826752kb,availmem=1952512kb,physmem=8169096kb,ncpus=4,loadave=0.87,netload=4294967294,state=free,jobs=? 0,rectime=1209587506 

node43 
     state = free 
     np = 2 
     properties = d1850 
     ntype = cluster 
     status = opsys=linux,uname=Linux node43 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64,sessions=4319,nsessions=1,nusers=1,idletime=818333,totmem=6006400kb,availmem=5874292kb,physmem=8169096kb,ncpus=4,loadave=0.00,netload=38605147,state=free,jobs=? 0,rectime=1209587506 

===================================================================================================== 

checknode node46 


checking node node46 

State:      Idle  (in current state for 1:03:36) 
Configured Resources: PROCS: 2  MEM: 7977M  SWAP: 7977M  DISK: 1M 
Utilized   Resources: [NONE] 
Dedicated  Resources: [NONE] 
Opsys:         linux  Arch:      [NONE] 
Speed:      1.00  Load:       2.000 
Network:    [DEFAULT] 
Features:   [d1850] 
Attributes: [Batch] 
Classes:    [medium 2:2][batch 2:2] 

Total Time: 1:02:46:37  Up: 1:47:41 (6.70%)  Active: 00:00:00 (0.00%) 

Reservations: 
NOTE:  no reservations on node 
ALERT:  node is in state Idle but load is high (2.000) 




