From akshar.bhosale at gmail.com Sat Mar 3 04:37:56 2012
From: akshar.bhosale at gmail.com (akshar bhosale)
Date: Sat, 3 Mar 2012 17:07:56 +0530
Subject: [Mauiusers] different nodes for single job
Message-ID:

Hi,

We have torque 2.5.8 and maui 2.3.6 configured on a centos 5.2 cluster. We have 2 partitions of the nodes, parA and parB, set up using maui.
checkjob says that the job should go to one of the nodes from parA, say node5.
checknode node5 says that it is waiting for the job to get on it.
showstart says that the job should start now on node5, but unfortunately the job remains in the idle state in spite of the availability of the node.
Also, checkjob -vvv does not show any of the nodes from parA and shows the job as rejected for all the nodes from parB.
It should have shown at least some nodes from parA.
This happens only with a particular type of job; all the other jobs don't have this problem.

From adaptivecomputing at bridgemailsystem.com Thu Mar 1 09:25:33 2012
From: adaptivecomputing at bridgemailsystem.com (Adaptive Computing)
Date: Thu, 1 Mar 2012 08:25:33 -0800 (PST)
Subject: [Mauiusers] News About Moab Technology from Adaptive Computing
Message-ID: <7170901.1330619136259.JavaMail.root@mail2.bms.local>

An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20120301/5858ff10/attachment.html

From jayavant.patil82 at gmail.com Mon Mar 5 01:36:45 2012
From: jayavant.patil82 at gmail.com (Jayavant Patil)
Date: Mon, 5 Mar 2012 14:06:45 +0530
Subject: [Mauiusers] job does not start
Message-ID:

>Hi,
>We have Torque Server Version 2.5.8 and maui version 3.2.6p1 installed on
>rhel 5.2 server. "showstart" for one of the jobs says that job should start
>now i.e.
>Earliest start in 00:00:00 on current time.
>########################
>checkjob -vv says that
>checkjob -vv 62235
>checking job 62235 (RM job '62235.yc9.cn.yuva.param')
>State: Idle
>Creds: user:abcd group:pqr account:PQR-PR class:q1 qos:q1-qos
>WallTime: 00:00:00 of 2:05:00:00
>SubmitTime: Thu Feb 23 18:56:26
>(Time Queued Total: 1:21:27:05 Eligible: 1:21:27:05)
>Total Tasks: 2
>Req[0] TaskCount: 2 Partition: ALL
>Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
>Opsys: [NONE] Arch: [NONE] Features: [NONE]
>Exec: '' ExecSize: 0 ImageSize: 0
>Dedicated Resources Per Task: PROCS: 1
>NodeAccess: SHARED
>NodeCount: 0
>IWD: [NONE] Executable: [NONE]
>Bypass: 51 StartCount: 0
>PartitionMask: [ALL]
>Reservation '62235' (00:00:00 -> 2:05:00:00 Duration: 2:05:00:00)
>PE: 2.00 StartPriority: 2727
>job cannot run in partition DEFAULT (insufficient idle procs available: 0 < 2)
>job can run in partition P1 (32 procs available. 2 procs required)
>job can run in partition P2 (48 procs available. 2 procs required)
>########################
>showres -n 62235 says that
>reservations on Sat Feb 25 16:28:10
> NodeName Type ReservationID JobState Task Start Duration StartTime
> node16.clusternode Job 62235 Idle 2 00:00:00 2:05:00:00 Sat Feb 25 16:28:10
>1 nodes reserved
############################
>checknode node16.clusternode says that node is available for job run.
>but somehow job is not going and is not giving any error in maui, pbs_server,pbs_mom logs also.
>What can be the issue?

Have you seen in maui.log whether Maui is starting the job? If yes, then there might be a communication problem with TORQUE.

>What can be done to make job run and avoid the same in future?

How many partitions do you have in your cluster?

Can you try to submit the job by specifying the PARTITION as follows:
qsub -q -l nodes= -W x=PARTITION:

>thank you
>-pankakjd

--
Thanks & Regards,
Jayavant Ningoji Patil
+91 9923536030.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20120305/a41be37b/attachment-0001.html

From abhig at princeton.edu Wed Mar 14 16:35:36 2012
From: abhig at princeton.edu (Abhishek Gupta)
Date: Wed, 14 Mar 2012 18:35:36 -0400
Subject: [Mauiusers] setting priorities
Message-ID: <4F611D38.5020806@princeton.edu>

Hi,

I am trying to use the setspri command to change the priorities of jobs running on the system. As per the documentation, the value can go higher than 1000 when changing relative priority, but it does not let me do that.

Example:
139811 172972 100.0(24997:14797) 0.0( 0.0)

setspri -r 10000 139811
ERROR: system priority must be in the range 0 - 1000

So I did the following:
139811 172974* 100.0(24997:14797) 0.0( 1.3)

The overall priority change is 2. I do not understand why the change is so small, or whether I need to tweak the maui config file to make it work. Is there any other command to control priorities?

Thanks,
Abhi.

From stevenx.a.duchene at intel.com Fri Mar 16 16:42:23 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Fri, 16 Mar 2012 22:42:23 +0000
Subject: [Mauiusers] had to alter extern lines in src/maui/MPBSI.c
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABC813@ORSMSX106.amr.corp.intel.com>

In order to get maui to compile with the include files from a fresh install of Torque-4.0 I had to make the following modifications to src/moab/MPBSI.c

--- maui-3.3.1/src/moab/MPBSI.c 2011-03-04 08:28:25.000000000 -0800
+++ maui-3.3.1_altered/src/moab/MPBSI.c 2012-03-16 14:20:02.732259530 -0700
@@ -174,8 +174,8 @@
 extern int pbs_errno;
-extern int get_svrport(const char *,char *,int);
-extern int openrm(char *,int);
+extern unsigned int get_svrport(char *,char *,unsigned int);
+extern int openrm(char *,unsigned int);
 extern int addreq(int,char *);
 extern int closerm(int);
 extern int pbs_stagein(int,char *,char *,char *);

Prior to making this modification the compile would error out with:

--- maui-3.3.1/src/moab/MPBSI.c 2011-03-04 08:28:25.000000000 -0800
+++ maui-3.3.1_altered/src/moab/MPBSI.c 2012-03-16 14:20:02.732259530 -0700
@@ -174,8 +174,8 @@
 extern int pbs_errno;
-extern int get_svrport(const char *,char *,int);
-extern int openrm(char *,int);
+extern unsigned int get_svrport(char *,char *,unsigned int);
+extern int openrm(char *,unsigned int);
 extern int addreq(int,char *);
 extern int closerm(int);
 extern int pbs_stagein(int,char *,char *,char *);

After making the above modifications the build was successful and I was able to run maui-3.3.1
--
Steven DuChene
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20120316/f1856920/attachment.html

From stevenx.a.duchene at intel.com Fri Mar 16 17:19:46 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Fri, 16 Mar 2012 23:19:46 +0000
Subject: [Mauiusers] checknode and mdiag do not work with Torque-4.0?
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABC843@ORSMSX106.amr.corp.intel.com>

I just got maui-3.3.1 to compile on a system where I have Torque-4.0 running. The maui daemon is running and I can run maui commands like diagnose or mdiag, but when I try to run checknode I get the following error:

[root@elogin2 src]# checknode eatom009
ERROR: 'checknode' failed
ERROR: cannot locate node 'eatom009'

I can, however, run the mdiag command to get node information about that same node. I get stuff back, but it seems to be incomplete:

[root@elogin2 src]# mdiag -n eatom009
diagnosing node table (5120 slots)
Name State Procs Memory Disk Swap Speed Opsys Arch Par Load Res Classes Network Features

----- --- 0:0 0:0 0:0 0:0

Total Nodes: 0 (Active: 0 Idle: 0 Down: 0)

No matter what node I give to checknode it gives the same output.
If I run the same checknode command on another cluster that is running maui-3.3.1 with Torque-2.5.7 I get back the expected output:

[maui@emcutil1 ~]$ checknode eviking09

checking node eviking09

State: Idle (in current state for 00:00:00)
Configured Resources: PROCS: 8 MEM: 9881M SWAP: 15G DISK: 1M
Utilized Resources: SWAP: 306M
Dedicated Resources: [NONE]
Opsys: RHEL61 Arch: xeon
Speed: 1.00 Load: 0.000
Network: [DEFAULT]
Features: [Viking]
Attributes: [Batch]
Classes: [batch 8:8]

Total Time: 00:24:40 Up: 00:00:00 (0.00%) Active: 00:00:00 (0.00%)

Reservations:
NOTE: no reservations on node

The mdiag output for a node on this same cluster also produces much more complete information:

[maui@emcutil1 ~]$ mdiag -n eviking09
diagnosing node table (5120 slots)
Name State Procs Memory Disk Swap Speed Opsys Arch Par Load Res Classes Network Features

eviking09 Down 0:8 1:1 1:1 10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [batch_8:8] [DEFAULT] [Viking]

----- --- 0:8 1:1 1:1 10:10

Total Nodes: 1 (Active: 0 Idle: 0 Down: 1)

Are checknode and mdiag in maui not compatible with the data formats provided by Torque-4.0?
--
Steven DuChene
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20120316/c7fa9010/attachment-0001.html

From jerome.pansanel at iphc.cnrs.fr Mon Mar 5 01:45:31 2012
From: jerome.pansanel at iphc.cnrs.fr (Jérôme Pansanel)
Date: Mon, 05 Mar 2012 08:45:31 -0000
Subject: [Mauiusers] job does not start
In-Reply-To:
References:
Message-ID: <1330937128.2595.16.camel@sbgat346>

Hi,

We got a lot of errors with maui version 3.2.6p1 (mainly segfaults). Since the update to version 3.3.4, it works fine.

Best regards,
Jerome Pansanel

On Mon, 2012-03-05 at 14:06 +0530, Jayavant Patil wrote:
> >Hi,
>
> >We have Torque Server Version 2.5.8 and maui version 3.2.6p1 installed on
> >rhel 5.2 server. "showstart" for one of the jobs says that job should start
> >now i.e.
>
> >Earliest start in 00:00:00 on current time.
> >########################
> >checkjob -vv says that
>
> >checkjob -vv 62235
> >checking job 62235 (RM job '62235.yc9.cn.yuva.param')
> >State: Idle
> >Creds: user:abcd group:pqr account:PQR-PR class:q1 qos:q1-qos
> >WallTime: 00:00:00 of 2:05:00:00
> >SubmitTime: Thu Feb 23 18:56:26
> >(Time Queued Total: 1:21:27:05 Eligible: 1:21:27:05)
>
> >Total Tasks: 2
>
> >Req[0] TaskCount: 2 Partition: ALL
> >Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> >Opsys: [NONE] Arch: [NONE] Features: [NONE]
> >Exec: '' ExecSize: 0 ImageSize: 0
> >Dedicated Resources Per Task: PROCS: 1
> >NodeAccess: SHARED
> >NodeCount: 0
> >IWD: [NONE] Executable: [NONE]
> >Bypass: 51 StartCount: 0
> >PartitionMask: [ALL]
> >Reservation '62235' (00:00:00 -> 2:05:00:00 Duration: 2:05:00:00)
> >PE: 2.00 StartPriority: 2727
> >job cannot run in partition DEFAULT (insufficient idle procs available: 0 < 2)
> >job can run in partition P1 (32 procs available. 2 procs required)
> >job can run in partition P2 (48 procs available. 2 procs required)
> >########################
> >showres -n 62235 says that
>
> >reservations on Sat Feb 25 16:28:10
>
> > NodeName Type ReservationID JobState Task Start Duration StartTime
>
> > node16.clusternode Job 62235 Idle 2 00:00:00 2:05:00:00 Sat Feb 25 16:28:10
> >1 nodes reserved
> ############################
> >checknode node16.clusternode says that node is available for job run.
>
> >but somehow job is not going and is not giving any error in maui, pbs_server,pbs_mom logs also.
>
> >What can be the issue?
>
> Have you seen in maui.log whether Maui is starting the job? If yes, then
> there might be a communication problem with TORQUE.
>
> >What can be done to make job run and avoid the same in future?
>
> How many partitions do you have in your cluster?
>
> Can you try to submit the job by specifying the PARTITION as follows:
>
> qsub -q -l nodes= -W x=PARTITION: name>
>
> >thank you
>
> >-pankakjd
>
> --
>
> Thanks & Regards,
> Jayavant Ningoji Patil
> +91 9923536030.
>
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/mauiusers

--
Jerome Pansanel
IPHC
23 rue du Loess, BP 28
F-67037 STRASBOURG Cedex 2
T. +33 (0)3 88 10 66 24
P. +33 (0)6 25 19 24 43
F. +33 (0)3 88 10 62 34

From rf at q-leap.de Mon Mar 5 11:06:41 2012
From: rf at q-leap.de (rf at q-leap.de)
Date: Mon, 05 Mar 2012 18:06:41 -0000
Subject: [Mauiusers] [torqueusers] [Patch] GPUs by the way of GRES
In-Reply-To: <20120203095810.6ba1833b@RunningPenguin.chalmion.homelinux.net>
References: <20120203095810.6ba1833b@RunningPenguin.chalmion.homelinux.net>
Message-ID: <20309.169.908792.976630@gargle.gargle.HOWL>

>>>>> "Jonathan" == Jonathan Michalon writes:

Hi Jonathan,

while your patch adds some functionality to count allocated GPUs as a GRES, it lacks the important functionality to tell the job which GPUs are available for it. If the latest torque 2.5.x is built with GPU support, you have the option to specify a nodes spec like "-l nodes=1:gpus=1", and within the running job you can check $GPUFILE to see which GPUs you're allocated. Now the problem is that a job with a "-l nodes=1:gpus=1" specification won't be started with maui even if it has your patch. On the other hand, using your "-W x=GRES:gpu@1" spec (without a "-l nodes=1:gpus=1" spec) makes the job run, but it doesn't have an idea which GPU to use. Is there an easy way to extend your patch, so that maui will make a job run with the "-l nodes=1:gpus=1" spec?

Cheers,

Roland

Jonathan> Hi Maui folks, GPUs in Maui are a long standing
Jonathan> problem. Last year a patch was sent by Mariusz Mamoński
Jonathan> [1], which works based on GRES parameters.
I've just made
Jonathan> GPUs kind of working, by enhancing that patch. Please find
Jonathan> attached the resulting patch, which works well for Maui
Jonathan> 3.3.1. It defines a special GRES named "gpu" which works
Jonathan> as expected on my test cases.

Jonathan> Note that GRES behaviour seems quite confused as sometimes
Jonathan> they are mentioned as consumable. This patch annihilates
Jonathan> this behaviour, for the needs of GPUs.

Jonathan> To use the patch: get the sources of maui-3.3.1 and patch
Jonathan> them: patch -p1 < ../Patch-for-gpu-GRES.patch then compile
Jonathan> as usual.

Jonathan> You have to configure the GPUs in maui.cfg:
Jonathan> NODECFG[nodename] GRES=gpu:2

Jonathan> Then when queuing jobs you can request GPUs with (Torque
Jonathan> syntax): qsub -W x=GRES:gpu@1

Jonathan> I hope this helps, please test this and enhance to your
Jonathan> needs!

Jonathan> [1]
Jonathan> http://www.supercluster.org/pipermail/mauiusers/2011-April/004622.html

Jonathan> Regards,

Jonathan> PS. This is the second attempt to send the mail...

Jonathan> -- Jonathan Michalon IT student in Strasbourg

From brandor5 at gmail.com Mon Mar 5 11:48:52 2012
From: brandor5 at gmail.com (Brandon Sawyers)
Date: Mon, 05 Mar 2012 18:48:52 -0000
Subject: [Mauiusers] policy not working as expected?
Message-ID:

Hello everyone:

We are bringing up a new system and are running into an issue with maui.

We want jobs to behave like this: a user requests a number of nodes regardless of ppn and gets that number of nodes (nodes=6:ppn=1). At the same time, we want only one job to be running on a node at one time. So that user would get 6 nodes, and no other jobs would be able to run on those nodes while that job is running.

We expected the following two config changes to make this happen.
JOBNODEMATCHPOLICY EXACTNODE
NODEACCESSPOLICY SINGLEJOB

Only one job runs on a node, as we want, but (using the example above) all 6 cores of that node are getting used, instead of 1 core on each of 6 different nodes.

Interestingly:
nodes=1:ppn=1 gives me 1 core from 1 node.
nodes=1:ppn=(2-6) gives me 6 cores.

What are we missing?

Thanks,
Brandon
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20120305/53c2b1d8/attachment-0001.html

From jonathan.michalon at etu.unistra.fr Fri Mar 9 12:59:57 2012
From: jonathan.michalon at etu.unistra.fr (Jonathan Michalon)
Date: Fri, 09 Mar 2012 19:59:57 -0000
Subject: [Mauiusers] [torqueusers] [Patch] GPUs by the way of GRES
In-Reply-To: <20309.169.908792.976630@gargle.gargle.HOWL>
References: <20120203095810.6ba1833b@RunningPenguin.chalmion.homelinux.net> <20309.169.908792.976630@gargle.gargle.HOWL>
Message-ID: <20120309205947.12f0dd26@RunningPenguin.chalmion.homelinux.net>

On Mon, 5 Mar 2012 19:06:33 +0100, rf at q-leap.de wrote:

> Hi Jonathan,
>
> while your patch adds some functionality to count allocated GPUs as
> a GRES, it lacks the important functionality to tell the job which GPUs
> are available for it. If latest torque 2.5.x is built with GPU support,
> you have the option to specify a nodes spec like "-l nodes=1:gpus=1" and
> within the running job you can check $GPUFILE what GPUs you're
> allocated. Now the problem is that a job with a "-l nodes=1:gpus=1"
> specification won't be started with maui even if it has your patch. On
> the other hand, using your "-W x=GRES:gpu@1" spec (without a "-l
> nodes=1:gpus=1" spec) makes the job run, but
> it doesn't have an idea which GPU to use. Is there an easy way to extend
> your patch, so that maui will make a job run with the "-l
> nodes=1:gpus=1" spec?
>
> Cheers,
>
> Roland

Hi Roland,

Hum, since you are speaking about maui / torque interaction, I suspect the problem is really hard (at least for a non-maui-dev like me). The GRES support is quite a hack to my mind, and adding correct support would probably need full real GPU handling and torque communication in both directions, as you mention environment variables to be set. Plus the fact that maui should work with non-torque setups too.

My job was to find a working solution for the setup here, and I have no time to really dig into maui internals... sorry. In addition, admins here decided (AFAIK) to switch to slurm this summer...

--
Jonathan Michalon
IT student in Strasbourg

From adaptivecomputing at bridgemailsystem.com Wed Mar 14 07:10:08 2012
From: adaptivecomputing at bridgemailsystem.com (Adaptive Computing)
Date: Wed, 14 Mar 2012 13:10:08 -0000
Subject: [Mauiusers] What's New from Adaptive - Moab 7.0
Message-ID: <29980184.1331730596341.JavaMail.root@mail2.bms.local>

An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20120314/33918b6b/attachment-0001.html

From adaptivecomputing at bridgemailsystem.com Thu Mar 15 12:15:07 2012
From: adaptivecomputing at bridgemailsystem.com (Adaptive Computing)
Date: Thu, 15 Mar 2012 18:15:07 -0000
Subject: [Mauiusers] Live Cloud & HPC Webinars from Adaptive Computing
Message-ID: <28131491.1331835301120.JavaMail.root@mail4.bridgemailsystem.com>

An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20120315/23d7d86e/attachment-0001.html

From cwebber at ucr.edu Mon Mar 19 11:40:10 2012
From: cwebber at ucr.edu (Christopher R Webber)
Date: Mon, 19 Mar 2012 17:40:10 -0000
Subject: [Mauiusers] CLASSCFG and USERCFG
Message-ID: <28A19697-DFF9-4292-B296-CCB018EE1A0C@ucr.edu>

All,

I have applied the following parameters via changeparam but cannot seem to find a way to output from maui all of the throttling policies.
I have tried using the various diagnose flags and showconfig -v. Any thoughts?

CLASSCFG[batch] MAXPROCPERUSER=64
USERCFG[user] MAXPROC=128

Thanks.

--
cwebber

Christopher Webber - Systems Administrator
Bioinformatics - University of California, Riverside
Twitter: @cwebber
Tel: 951.867.7108
http://cwebber.ucr.edu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20120319/28da7d58/attachment-0001.html

From brandor5 at gmail.com Fri Mar 23 15:14:28 2012
From: brandor5 at gmail.com (Brandon Sawyers)
Date: Fri, 23 Mar 2012 21:14:28 -0000
Subject: [Mauiusers] MAXNODE throttling
Message-ID:

Hello everyone:

We are attempting to limit the number of nodes that a user gets via the MAXNODE setting.

When setting USERCFG[DEFAULT] MAXNODE=12,12 it sort of works, as long as we have MAXPROCS set to 72,72 as well (6 cores per cpu). I say sort of, because if someone requests nodes=13:ppn=1 the job will run (13 is less than 72, but MAXNODE fails to trigger).

Is there any way to force the nodes to behave like we want, i.e. reject/block a job that requests more than 12?

Thanks for your help.
Brandon
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20120323/756fc486/attachment-0001.html
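
A check of the kind asked for in the MAXNODE message above could also be made before submission, outside Maui. The following is only a minimal sketch, assuming a site wraps qsub with its own script; the function names count_requested_nodes and within_limit and the default limit of 12 are hypothetical and are not part of Maui or Torque. It does nothing more than count the nodes in a Torque-style nodes= value such as 13:ppn=1.

```python
def count_requested_nodes(nodes_spec: str) -> int:
    """Count the nodes asked for in a Torque-style nodes value.

    Examples of accepted input: '6:ppn=1', '13:ppn=1', '2:ppn=4+node07'.
    Hypothetical helper for illustration; not a Maui or Torque API.
    """
    total = 0
    # Chunks separated by '+' are independent node requests.
    for chunk in nodes_spec.split("+"):
        head = chunk.split(":", 1)[0].strip()
        # A leading integer is a node count; anything else is treated
        # as one named node (or one node selected by a property).
        total += int(head) if head.isdigit() else 1
    return total


def within_limit(nodes_spec: str, max_nodes: int = 12) -> bool:
    """Return True if the request stays at or below max_nodes (assumed limit)."""
    return count_requested_nodes(nodes_spec) <= max_nodes


if __name__ == "__main__":
    for spec in ("6:ppn=1", "13:ppn=1", "2:ppn=4+node07"):
        verdict = "accept" if within_limit(spec) else "reject"
        print(f"nodes={spec}: {count_requested_nodes(spec)} node(s), {verdict}")
```

A wrapper script could call within_limit on the nodes value it finds in the qsub arguments and refuse to submit when it returns False. Whether Maui itself can be configured to enforce the same limit is not answered by this sketch.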