From naveed at caltech.edu Fri Apr 6 12:29:45 2012
From: naveed at caltech.edu (Naveed Near-Ansari)
Date: Fri, 06 Apr 2012 11:29:45 -0700
Subject: [Mauiusers] priority job failing to get reservation
Message-ID: <4F7F3619.9030102@caltech.edu>

Hi all,

I am having an issue with a priority job not getting a reservation. When I set the reservation depth to 2, the second-priority job does get a reservation, though.

The cluster has 3552 cores available for the queue the job is submitted to; at the moment they are all in use. Since the job has the highest priority, it should start reserving nodes (and it does try). When I change RESERVATIONDEPTH to 2, the second-highest-priority job does get a reservation, though it is a much smaller job.

We don't have a size limit on jobs, and the cluster does have the resources for this job.

Does anyone know what may be going on here? We have a workflow in which some people submit very large jobs and some submit small ones, so I would like to figure out what is happening.

Here is the checkjob output. As you can see, the job isn't requesting any resources other than cores. I have no idea where it is getting the idle procs figure from, since none are actually idle:

checking job 213152

State: Idle
Creds: user:user group:group class:default qos:dedicated
WallTime: 00:00:00 of 1:12:00:00
SubmitTime: Fri Apr 6 03:35:23
(Time Queued Total: 7:45:59 Eligible: 1:30:06)

Total Tasks: 1501

Req[0] TaskCount: 1501 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [default]

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE PREEMPTEE DEDICATEDNODE
Attr: PREEMPTEE

PE: 1501.00 StartPriority: 144235
job cannot run in partition DEFAULT (insufficient idle procs available: 1056 < 1501)

Here are the relevant log entries:

04/06 03:35:24 MJobPReserve(213152,DEFAULT,ResCount,ResCountRej)
04/06 03:35:24 INFO: 3552 feasible tasks found for job 213152:0 in partition DEFAULT (1501 Needed)
04/06 03:35:24 ALERT: job 213152 cannot run in any partition
04/06 03:35:24 ALERT: cannot create new reservation for job 213152 (shape[1] 1501)
04/06 03:35:24 ALERT: cannot create new reservation for job 213152
04/06 03:35:24 ALERT: job '213152' cannot run (deferring job for 3600 seconds)
04/06 03:35:24 WARNING: cannot reserve priority job '213152'

--
Naveed Near-Ansari
E: naveed at caltech.edu
O: 626-395-2212
M: 626-394-3845
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4887 bytes
Desc: S/MIME Cryptographic Signature
Url : http://www.supercluster.org/pipermail/mauiusers/attachments/20120406/7c49f443/attachment.bin

From medernac at clermont.in2p3.fr Thu Apr 19 02:36:09 2012
From: medernac at clermont.in2p3.fr (Emmanuel Medernach)
Date: Thu, 19 Apr 2012 10:36:09 +0200
Subject: [Mauiusers] Problem with reservation: job does not start
Message-ID: <4F8FCE79.9050109@clermont.in2p3.fr>

Hello,

We are using Maui on a WLCG grid cluster and have a standing reservation for group OPS:

SRCFG[ops] FLAGS=SPACEFLEX
SRCFG[ops] TASKCOUNT=1 RESOURCES=PROCS:1
SRCFG[ops] PERIOD=INFINITY
SRCFG[ops] CLASSLIST=ops,dteam

However, for the past few days OPS jobs have not started, even though the reservation is free:

# showres -n | grep ops
<****> User ops.0.0 N/A 1 -00:09:14 INFINITE Thu Apr 19 10:20:38

The worker node is marked as free by pbsnodes and is not full. However, the jobs don't start:

# showq | grep ops
5656815 opssgm Idle 1 3:00:00:00 Thu Apr 19 10:23:23
5656819 opssgm Idle 1 3:00:00:00 Thu Apr 19 10:29:18

# checkjob 5656815
checking job 5656815

State: Idle
Creds: user:opssgm group:ops class:ops qos:DEFAULT
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Thu Apr 19 10:23:23
(Time Queued Total: 00:09:08 Eligible: 00:09:08)

Total Tasks: 1

Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE

PE: 1.00 StartPriority: 14611
job can run in partition DEFAULT (1 procs available. 1 procs required)

# checkjob 5656819
checking job 5656819

State: Idle
Creds: user:opssgm group:ops class:ops qos:DEFAULT
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Thu Apr 19 10:29:18
(Time Queued Total: 00:03:46 Eligible: 00:03:46)

StartDate: -00:03:45 Thu Apr 19 10:29:19
Total Tasks: 1

Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE

PE: 1.00 StartPriority: 14585
job can run in partition DEFAULT (1 procs available. 1 procs required)

What could we do to solve this issue?

Best regards,

--
-------------- next part --------------
A non-text attachment was scrubbed...
Name: medernac.vcf
Type: text/x-vcard
Size: 259 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/mauiusers/attachments/20120419/7f36518f/attachment.vcf

From naveed at caltech.edu Fri Apr 20 15:27:57 2012
From: naveed at caltech.edu (Naveed Near-Ansari)
Date: Fri, 20 Apr 2012 14:27:57 -0700
Subject: [Mauiusers] priority job failing to get reservation
Message-ID: <4F91D4DD.2010204@caltech.edu>

I know this isn't technically Torque, but I haven't seen any activity on the Maui list and I thought there might be some overlap in users here.

I am having an issue with a priority job not getting a reservation. When I set the reservation depth to 2, the second-priority job does get a reservation, though.

The cluster has 3552 cores available for the queue the job is submitted to; at the moment they are all in use. Since the job has the highest priority, it should start reserving nodes (and it does try). When I change RESERVATIONDEPTH to 2, the second-highest-priority job does get a reservation, though it is a much smaller job. Perhaps I am misunderstanding how these reservations work. Is there a timeframe in which it has to reserve nodes?

We don't have a size limit on jobs, and the cluster does have the resources for this job.
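For reference, the setting being changed would typically sit in maui.cfg next to the reservation policy. The fragment below is only a sketch: the values are illustrative and are not copied from this cluster's actual configuration, and it assumes the default CURRENTHIGHEST policy is in use.

```
# maui.cfg fragment (illustrative values, not the poster's actual file)
# Reserve resources for the current highest-priority idle jobs ...
RESERVATIONPOLICY CURRENTHIGHEST
# ... and allow up to two such priority reservations at once.
RESERVATIONDEPTH  2
```

With a depth of 1, only the single highest-priority idle job is a candidate for a priority reservation in each scheduling iteration.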
Does anyone know what may be going on here? We have a workflow in which some people submit very large jobs and some submit small ones, so I would like to figure out what is happening. Do you have any good strategies for dealing with this type of workflow?

Here is the checkjob output. As you can see, the job isn't requesting any resources other than cores. I have no idea where it is getting the idle procs figure from, since none are actually idle. Perhaps it has to do with reservable nodes? The idle procs figure tends to fluctuate over time.

checking job 213152

State: Idle
Creds: user:user group:group class:default qos:dedicated
WallTime: 00:00:00 of 1:12:00:00
SubmitTime: Fri Apr 6 03:35:23
(Time Queued Total: 7:45:59 Eligible: 1:30:06)

Total Tasks: 1501

Req[0] TaskCount: 1501 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [default]

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE PREEMPTEE DEDICATEDNODE
Attr: PREEMPTEE

PE: 1501.00 StartPriority: 144235
job cannot run in partition DEFAULT (insufficient idle procs available: 1056 < 1501)

Here are the relevant log entries:

04/06 03:35:24 MJobPReserve(213152,DEFAULT,ResCount,ResCountRej)
04/06 03:35:24 INFO: 3552 feasible tasks found for job 213152:0 in partition DEFAULT (1501 Needed)
04/06 03:35:24 ALERT: job 213152 cannot run in any partition
04/06 03:35:24 ALERT: cannot create new reservation for job 213152 (shape[1] 1501)
04/06 03:35:24 ALERT: cannot create new reservation for job 213152
04/06 03:35:24 ALERT: job '213152' cannot run (deferring job for 3600 seconds)
04/06 03:35:24 WARNING: cannot reserve priority job '213152'

--
Naveed Near-Ansari
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4887 bytes
Desc: S/MIME Cryptographic Signature
Url : http://www.supercluster.org/pipermail/mauiusers/attachments/20120420/c870c203/attachment-0001.bin

From patrick.jaeger at fr.ibm.com Fri Apr 20 20:02:54 2012
From: patrick.jaeger at fr.ibm.com (Patrick Jaeger)
Date: Sat, 21 Apr 2012 04:02:54 +0200
Subject: [Mauiusers] AUTO: Patrick Jaeger is out of the office - (returning 28/04/2012)
Message-ID:

I am out of the office until 28/04/2012.

If necessary, please contact my manager, Sylvain Chebassier, who will be able to direct you, or my backup, Antoine Tabary.

Note: This is an automated response to your message "mauiusers Digest, Vol 93, Issue 1" sent on 20/4/2012 23:28:34.

This is the only notification you will receive while this person is away.

From s.breedveld at erasmusmc.nl Thu Apr 5 05:32:40 2012
From: s.breedveld at erasmusmc.nl (Sebastiaan Breedveld)
Date: Thu, 05 Apr 2012 11:32:40 -0000
Subject: [Mauiusers] Simple Torque+Maui setup: jobs stay queued, no resources
Message-ID: <4F7D82D3.3000409@erasmusmc.nl>

Dear list,

I am trying to set up a very basic Torque+Maui system. I have been running a Torque cluster for a year now and wanted to improve the scheduling with Maui. To this end, I installed a fresh test system, with server and node on a single computer.

Torque version: 2.4.16
Maui version: 3.3.1
uname: Linux testing 3.2.0-20-generic #33-Ubuntu SMP Tue Mar 27 16:42:26 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

I was able to run (simple) jobs with the Torque scheduler. After I replaced the scheduler with Maui, jobs stay queued.

Jobs are submitted by:

$ qsub -q batch test-script.sh

where test-script.sh is nothing more than a 'sleep 1m' script.
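A script of the kind described could look like the sketch below. The original test-script.sh was not posted, so the function name `sleep_job`, the echoed messages, and the `PBS_JOBID` guard are assumptions added for illustration only.

```shell
#!/bin/sh
# Hypothetical stand-in for test-script.sh; the original was not posted.

# Sleep for the given duration (default: one minute, as in the post)
# and print a line before and after so the job output shows progress.
sleep_job() {
    duration="${1:-1m}"
    echo "sleeping for ${duration}"
    sleep "${duration}"
    echo "done"
}

# Torque exports PBS_JOBID to running jobs, so the one-minute sleep
# only happens when the script is executed as a batch job.
if [ -n "${PBS_JOBID:-}" ]; then
    sleep_job 1m
fi
```

It would be submitted exactly as in the post, with `qsub -q batch test-script.sh`.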
Checking the job:

# checkjob -v 55
checking job 55 (RM job '55.testing.azr.nl')

State: Idle EState: Deferred
Creds: user:sebastiaan group:sebastiaan class:batch qos:DEFAULT
WallTime: 00:00:00 of 6:00:00
SubmitTime: Thu Apr 5 13:21:33
(Time Queued Total: 00:00:32 Eligible: 00:00:01)

Total Tasks: 1

Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 15G
Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1 MEM: 2000M SWAP: 15G
NodeAccess: SHARED
NodeCount: 1

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE

job is deferred. Reason: NoResources (cannot create reservation for job '55' (intital reservation attempt) )
Holds: Defer (hold reason: NoResources)
PE: 16.03 StartPriority: 1
cannot select job 55 for partition DEFAULT (job hold active)

This shows that there are no resources available. The node is free and unloaded:

# checknode testing
checking node testing.azr.nl

State: Idle (in current state for 2:23:54)
Configured Resources: PROCS: 2 MEM: 984M SWAP: 1996M DISK: 1M
Utilized Resources: SWAP: 149M
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 0.050
Network: [DEFAULT]
Features: [NONE]
Attributes: [Batch]
Classes: [batch 2:2]

Total Time: 16:11:49 Up: 16:11:49 (100.00%) Active: 00:01:00 (0.10%)

Reservations:
NOTE: no reservations on node

When the job is added, maui.log shows this:

04/05 13:21:34 MPBSJobLoad(55,55.testing.azr.nl,J,TaskList,0)
04/05 13:21:34 MReqCreate(55,SrcRQ,DstRQ,DoCreate)
04/05 13:21:34 INFO: processing node request line '1'
04/05 13:21:34 MJobSetCreds(55,sebastiaan,sebastiaan,)
04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
04/05 13:21:34 INFO: job '55' loaded: 1 sebastiaan sebastiaan 21600 Idle 0 1333624893 [NONE] [NONE] [NONE] >= 0 >= 0 [1][ppn=1] 1333624894
04/05 13:21:34 INFO: 12 PBS jobs detected on RM TESTING
04/05 13:21:34 INFO: jobs detected: 12
04/05 13:21:34 MStatClearUsage(node,Active)
04/05 13:21:34 MClusterUpdateNodeState()
04/05 13:21:34 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
04/05 13:21:34 INFO: job '40' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '41' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '42' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '44' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '45' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '47' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '48' Priority: 16
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '49' Priority: 12
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '52' Priority: 8
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '53' Priority: 1
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '54' Priority: 60
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '55' Priority: 1
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 MStatClearUsage([NONE],Active)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 [EState: 11]
04/05 13:21:34 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
04/05 13:21:34 INFO: job '40' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '41' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '42' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '44' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '45' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '47' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '48' Priority: 16
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '49' Priority: 12
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '52' Priority: 8
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '53' Priority: 1
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '54' Priority: 60
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '55' Priority: 1
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 MStatClearUsage([NONE],Idle)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 [EState: 11]
04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1
04/05 13:21:34 MQueueScheduleRJobs(Q)
04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1
04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
04/05 13:21:34 INFO: total jobs selected in partition DEFAULT: 1/1
04/05 13:21:34 MQueueScheduleIJobs(Q,DEFAULT)
04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in partition DEFAULT (1 Needed)
04/05 13:21:34 MJobPReserve(55,DEFAULT,ResCount,ResCountRej)
04/05 13:21:34 MJobReserve(55,Priority)
04/05 13:21:34 ALERT: job 55 cannot run in any partition
04/05 13:21:34 ALERT: cannot create new reservation for job 55 (shape[1] 1)
04/05 13:21:34 ALERT: cannot create new reservation for job 55
04/05 13:21:34 MJobSetHold(55,16,1:00:00,NoResources,cannot create reservation for job '55' (intital reservation attempt) )
04/05 13:21:34 ALERT: job '55' cannot run (deferring job for 3600 seconds)
04/05 13:21:34 WARNING: cannot reserve priority job '55'
Active Jobs------ ------------------
04/05 13:21:34 INFO: resources available after scheduling: N: 1 P: 2
04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
04/05 13:21:34 INFO: total jobs selected in partition DEFAULT: 0/1 [EState: 1]
04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE)
04/05 13:21:34 INFO: total jobs selected in partition ALL: 0/1 [EState: 1]
04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
04/05 13:21:34 INFO: total jobs selected in partition ALL: 0/1 [EState: 1]
04/05 13:21:34 MSchedUpdateStats()
04/05 13:21:34 INFO: iteration: 288 scheduling time: 0.002 seconds
04/05 13:21:34 MResUpdateStats()
04/05 13:21:34 INFO: current util[288]: 0/1 (0.00%) PH: 0.00% active jobs: 0 of 2 (completed: 1)
04/05 13:21:34 MQueueCheckStatus()
04/05 13:21:34 MNodeCheckStatus()
04/05 13:21:34 MUClearChild(PID)
04/05 13:21:34 INFO: scheduling complete. sleeping 30 seconds

I think the relevant line is:

04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in partition DEFAULT (1 Needed)

but I have no idea how to make a feasible task for the job. I have tried queueing with -l nodes=1:ppn=1 -l walltime=2:00:00, etc., but none of these seems to have had any effect.

Torque config (I have tried setting different attributes on the queue, hoping that it would have some effect):

# qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch Priority = 20
set queue batch max_running = 8
set queue batch resources_max.ncpus = 8
set queue batch resources_max.nodect = 10
set queue batch resources_max.nodes = 2
set queue batch resources_min.ncpus = 0
set queue batch resources_default.mem = 2000mb
set queue batch resources_default.ncpus = 1
set queue batch resources_default.neednodes = 1:ppn=1
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.pvmem = 16000mb
set queue batch resources_default.walltime = 06:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = testing.azr.nl
set server log_events = 511
set server mail_from = adm
set server resources_available.nodect = 10
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 56

Maui configuration, untouched:

# maui.cfg 3.3.1

SERVERHOST testing
# primary admin must be first in list
ADMIN1 root

# Resource Manager Definition

RMCFG[TESTING] TYPE=PBS

# Allocation Manager Definition

AMCFG[bank] TYPE=NONE

# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL 00:00:30

SERVERPORT 42559
SERVERMODE NORMAL

# Admin: http://supercluster.org/mauidocs/a.esecurity.html

LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT 1

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html

#FSPOLICY PSDEDICATED
#FSDEPTH 7
#FSINTERVAL 86400
#FSDECAY 0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST # Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html NODEALLOCATIONPOLICY MINRESOURCE # QOS: http://supercluster.org/mauidocs/7.3qos.html # QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE # Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html # SRSTARTTIME[test] 8:00:00 # SRENDTIME[test] 17:00:00 # SRDAYS[test] MON TUE WED THU FRI # SRTASKCOUNT[test] 20 # SRMAXTIME[test] 0:30:00 # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html # USERCFG[DEFAULT] FSTARGET=25.0 # USERCFG[john] PRIORITY=100 FSTARGET=10.0- # GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi # CLASSCFG[batch] FLAGS=PREEMPTEE # CLASSCFG[interactive] FLAGS=PREEMPTOR Any ideas? Thanks in advance, Sebastiaan -- Sebastiaan Breedveld, MSc. Ph.D. student Erasmus MC - Daniel den Hoed Cancer Center Department of Radiation Oncology Groene Hilledijk 301 3075 EA Rotterdam The Netherlands Phone: +31 10 7042693 Room: Gs-20 -------------- next part -------------- A non-text attachment was scrubbed... Name: s_breedveld.vcf Type: text/x-vcard Size: 365 bytes Desc: not available Url : http://www.supercluster.org/pipermail/mauiusers/attachments/20120405/b806864c/attachment-0001.vcf From s.breedveld at erasmusmc.nl Tue Apr 10 04:39:33 2012 From: s.breedveld at erasmusmc.nl (Sebastiaan Breedveld) Date: Tue, 10 Apr 2012 10:39:33 -0000 Subject: [Mauiusers] Simple Torque+Maui setup: jobs stay queued, no resources Message-ID: <4F840DDF.2020103@erasmusmc.nl> Dear list, I am trying to setup a very basic Torque+Maui system. I am running a Torque cluster for a year now, and wanted to improve the scheduling with Maui. To this end, I installed a fresh test-system, with server and node on a single computer. 
Torque version: 2.4.16 Maui version: 3.3.1 uname: Linux testing 3.2.0-20-generic #33-Ubuntu SMP Tue Mar 27 16:42:26 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux I was able to run (simple) jobs with the Torque scheduler. When I replaced the scheduler with Maui, jobs stay queued. Jobs are submitted by: $ qsub -q batch test-script.sh where test-script.sh is nothing more than a 'sleep 1m' script. Checking the job: # checkjob -v 55 checking job 55 (RM job '55.testing.azr.nl') State: Idle EState: Deferred Creds: user:sebastiaan group:sebastiaan class:batch qos:DEFAULT WallTime: 00:00:00 of 6:00:00 SubmitTime: Thu Apr 5 13:21:33 (Time Queued Total: 00:00:32 Eligible: 00:00:01) Total Tasks: 1 Req[0] TaskCount: 1 Partition: ALL Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 15G Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1] Exec: '' ExecSize: 0 ImageSize: 0 Dedicated Resources Per Task: PROCS: 1 MEM: 2000M SWAP: 15G NodeAccess: SHARED NodeCount: 1 IWD: [NONE] Executable: [NONE] Bypass: 0 StartCount: 0 PartitionMask: [ALL] Flags: RESTARTABLE job is deferred. Reason: NoResources (cannot create reservation for job '55' (intital reservation attempt) ) Holds: Defer (hold reason: NoResources) PE: 16.03 StartPriority: 1 cannot select job 55 for partition DEFAULT (job hold active) show that there are no resources available. 
The node is free, and unloaded: # checknode testing checking node testing.azr.nl State: Idle (in current state for 2:23:54) Configured Resources: PROCS: 2 MEM: 984M SWAP: 1996M DISK: 1M Utilized Resources: SWAP: 149M Dedicated Resources: [NONE] Opsys: linux Arch: [NONE] Speed: 1.00 Load: 0.050 Network: [DEFAULT] Features: [NONE] Attributes: [Batch] Classes: [batch 2:2] Total Time: 16:11:49 Up: 16:11:49 (100.00%) Active: 00:01:00 (0.10%) Reservations: NOTE: no reservations on node When the job is added, maui.log shows this: 04/05 13:21:34 MPBSJobLoad(55,55.testing.azr.nl,J,TaskList,0) 04/05 13:21:34 MReqCreate(55,SrcRQ,DstRQ,DoCreate) 04/05 13:21:34 INFO: processing node request line '1' 04/05 13:21:34 MJobSetCreds(55,sebastiaan,sebastiaan,) 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) 04/05 13:21:34 INFO: job '55' loaded: 1 sebastiaan sebastiaan 21600 Idle 0 1333624893 [NONE] [NONE] [NONE] >= 0 >= 0 [1][ppn=1] 1333624894 04/05 13:21:34 INFO: 12 PBS jobs detected on RM TESTING 04/05 13:21:34 INFO: jobs detected: 12 04/05 13:21:34 MStatClearUsage(node,Active) 04/05 13:21:34 MClusterUpdateNodeState() 04/05 13:21:34 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg) 04/05 13:21:34 INFO: job '40' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '41' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '42' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '44' Priority: 22 04/05 13:21:34 INFO: 
Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '45' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '47' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '48' Priority: 16 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '49' Priority: 12 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '52' Priority: 8 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '53' Priority: 1 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '54' Priority: 60 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '55' Priority: 1 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 MStatClearUsage([NONE],Active) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 [EState: 11] 04/05 13:21:34 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg) 04/05 13:21:34 INFO: job '40' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 
22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '41' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '42' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '44' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '45' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '47' Priority: 22
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '48' Priority: 16
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '49' Priority: 12
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '52' Priority: 8
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '53' Priority: 1
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '54' Priority: 60
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 INFO: job '55' Priority: 1
04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
04/05 13:21:34 MStatClearUsage([NONE],Idle)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 [EState: 11]
04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1
04/05 13:21:34 MQueueScheduleRJobs(Q)
04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1
04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
04/05 13:21:34 INFO: total jobs selected in partition DEFAULT: 1/1
04/05 13:21:34 MQueueScheduleIJobs(Q,DEFAULT)
04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in partition DEFAULT (1 Needed)
04/05 13:21:34 MJobPReserve(55,DEFAULT,ResCount,ResCountRej)
04/05 13:21:34 MJobReserve(55,Priority)
04/05 13:21:34 ALERT: job 55 cannot run in any partition
04/05 13:21:34 ALERT: cannot create new reservation for job 55 (shape[1] 1)
04/05 13:21:34 ALERT: cannot create new reservation for job 55
04/05 13:21:34 MJobSetHold(55,16,1:00:00,NoResources,cannot create reservation for job '55' (intital reservation attempt) )
04/05 13:21:34 ALERT: job '55' cannot run (deferring job for 3600 seconds)
04/05 13:21:34 WARNING: cannot reserve priority job '55'
Active Jobs------ ------------------
04/05 13:21:34 INFO: resources available after scheduling: N: 1 P: 2
04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
04/05 13:21:34 INFO: total jobs selected in partition DEFAULT: 0/1 [EState: 1]
04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE)
04/05 13:21:34 INFO: total jobs selected in partition ALL: 0/1 [EState: 1]
04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
04/05 13:21:34 INFO: total jobs selected in partition ALL: 0/1 [EState: 1]
04/05 13:21:34 MSchedUpdateStats()
04/05 13:21:34 INFO: iteration: 288 scheduling time: 0.002 seconds
04/05 13:21:34 MResUpdateStats()
04/05 13:21:34 INFO: current util[288]: 0/1 (0.00%) PH: 0.00% active jobs: 0 of 2 (completed: 1)
04/05 13:21:34 MQueueCheckStatus()
04/05 13:21:34 MNodeCheckStatus()
04/05 13:21:34 MUClearChild(PID)
04/05 13:21:34 INFO: scheduling complete. sleeping 30 seconds

I think the relevant line is:

04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in partition DEFAULT (1 Needed)

but I have no idea how to make a task feasible for the job. I have tried
queueing with -l nodes=1:ppn=1 -l walltime=2:00:00, etc., but none of these
seems to have had any effect.

Torque config. I have tried setting different attributes on the queue,
hoping that it would have some effect:

# qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch Priority = 20
set queue batch max_running = 8
set queue batch resources_max.ncpus = 8
set queue batch resources_max.nodect = 10
set queue batch resources_max.nodes = 2
set queue batch resources_min.ncpus = 0
set queue batch resources_default.mem = 2000mb
set queue batch resources_default.ncpus = 1
set queue batch resources_default.neednodes = 1:ppn=1
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.pvmem = 16000mb
set queue batch resources_default.walltime = 06:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = testing.azr.nl
set server log_events = 511
set server mail_from = adm
set server resources_available.nodect = 10
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 56

Maui configuration, untouched:

# maui.cfg 3.3.1

SERVERHOST testing
# primary admin must be first in list
ADMIN1 root

# Resource Manager Definition

RMCFG[TESTING] TYPE=PBS

# Allocation Manager Definition

AMCFG[bank] TYPE=NONE

# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL 00:00:30

SERVERPORT 42559
SERVERMODE NORMAL

# Admin: http://supercluster.org/mauidocs/a.esecurity.html

LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT 1

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html

#FSPOLICY PSDEDICATED
#FSDEPTH 7
#FSINTERVAL 86400
#FSDECAY 0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY MINRESOURCE

# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test] 17:00:00
# SRDAYS[test] MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test] 0:30:00

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html

# USERCFG[DEFAULT] FSTARGET=25.0
# USERCFG[john] PRIORITY=100 FSTARGET=10.0-
# GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch] FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR

Any ideas?

Thanks in advance,
Sebastiaan

--
Sebastiaan Breedveld, MSc.
Ph.D. student
Erasmus MC - Daniel den Hoed Cancer Center
Department of Radiation Oncology
Groene Hilledijk 301
3075 EA Rotterdam
The Netherlands
Phone: +31 10 7042693
Room: Gs-20

From martin.ratschek at tugraz.at Wed Apr 11 09:00:20 2012
From: martin.ratschek at tugraz.at (Martin Ratschek)
Date: Wed, 11 Apr 2012 15:00:20 -0000
Subject: [Mauiusers] Limit number of processors used per user - Soft and Hard limit
Message-ID: <20120411170013.06adee7b@fexphdyn24.tu-graz.ac.at>

Hello!

I am using Maui 3.3.1 + Torque 2.4.16 on a single multicore server which
is shared between several users. Each user should have limited access to
the resources, in my case the processors.

The 16 processor cores in total should be shared equally between 2 users.
So far the configuration is working. (USERCFG[DEFAULT] MAXPROC=8)

If not all 16 cores are in use, one of the users should be able to access
the spare capacity, exceeding the limit "MAXPROC=8". This should be
possible by using soft limits, as far as I understand.

Unfortunately I was not able to create a working soft/hard limit
configuration. "USERCFG[DEFAULT] MAXPROC=8,16" seems to work sometimes but
stops working after a while. Not even restarting torque_server and maui
resolves this issue; only a restart of the server does.
(maui log tells me: "job XXX violates active SOFT MAXPROC limit of 8 for
user XXX")

I tried different settings in maui.cfg:

USERCFG[DEFAULT] MAXPROC=8,16
CLASSCFG[queuename] MAXPROCPERUSER=8,16
USERCFG[user1] MAXPROC=8,16
USERCFG[user2] MAXPROC=8,16
MAXPROCPERUSERPOLICY ON
MAXPROCPERUSERCOUNT 16
SMAXPROCPERUSERCOUNT 8
USERCFG[user1] QDEF=users
USERCFG[user2] QDEF=users
QOSCFG[users] MAXPROC=8,16

Rest of maui.cfg:

# maui.cfg 3.3.1
SERVERHOST myservername
ADMIN1 root
ADMIN2 user1
RMCFG[MYSERVERNAME] TYPE=PBS
RMPOLLINTERVAL 00:00:05 (reduced value for testing)
JOBAGGREGATIONTIME 00:00:30 (reduced value for testing)
SERVERPORT 42559
SERVERMODE NORMAL
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
QUEUETIMEWEIGHT 1
ENABLEMULTINODEJOBS FALSE
JOBPRIOACCRUALPOLICY QUEUEPOLICY
NODEAVAILABILITYPOLICY DEDICATED:PROCS
NODEALLOCATIONPOLICY FIRSTAVAILABLE
BACKFILLPOLICY NONE
RESERVATIONPOLICY NEVER
JOBMAXOVERRUN 12:00:00

Thanks,
Martin Ratschek

From tanbd at ihpc.a-star.edu.sg Tue Apr 24 21:10:32 2012
From: tanbd at ihpc.a-star.edu.sg (Ta Nguyen Binh Duong (IHPC))
Date: Wed, 25 Apr 2012 03:10:32 -0000
Subject: [Mauiusers] Configuring target queue time
Message-ID:

Hi all,

I am trying to configure a target queue time, say 30 minutes, for jobs
that belong to our hi-priority class. I added this to maui.cfg:

QOSCFG[hiprio] PRIORITY=50 QTTARGET=30

But there was no effect at all on job priority as the target queue time
approached.

I would appreciate any help. Thanks a lot in advance.

Regards,
Duong

IHPC Values :: Impact :: Honesty :: Performance :: Co-operation

This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please do
not copy or use it for any purpose, or disclose its contents to any other
person. Thank you.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20120425/8ef409ec/attachment-0001.html
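
Two of the threads above hinge on the "feasible tasks found" line in
maui.log. A minimal sketch for pulling those lines out of a log and
listing the jobs where fewer tasks are feasible than needed might look
like the following. The function name find_infeasible and the sample
string are only illustrative and are not part of Maui; the pattern is
inferred from the two log excerpts quoted in this archive, so other Maui
versions or log levels may format the line differently.

```python
import re

# Pattern inferred from log lines quoted in this archive, e.g.:
#   04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in partition DEFAULT (1 Needed)
FEASIBLE_RE = re.compile(
    r"INFO:\s+(\d+) feasible tasks found for job (\S+?):(\d+) "
    r"in partition (\S+) \((\d+) Needed\)"
)


def find_infeasible(log_text):
    """Return (job, partition, feasible, needed) where feasible < needed."""
    hits = []
    for match in FEASIBLE_RE.finditer(log_text):
        feasible, job, _req_index, partition, needed = match.groups()
        if int(feasible) < int(needed):
            hits.append((job, partition, int(feasible), int(needed)))
    return hits


# Sample built from the two log lines quoted in the threads above.
sample = (
    "04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 "
    "in partition DEFAULT (1 Needed)\n"
    "04/06 03:35:24 INFO: 3552 feasible tasks found for job 213152:0 "
    "in partition DEFAULT (1501 Needed)\n"
)
print(find_infeasible(sample))
```

On this sample only job 55 is reported. Job 213152 has enough feasible
tasks (3552 for 1501 needed), which is consistent with that thread's
problem lying in the reservation step rather than in feasibility.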