From fabien.archambault at univ-amu.fr Wed Feb 1 01:15:41 2012
From: fabien.archambault at univ-amu.fr (Fabien Archambault)
Date: Wed, 1 Feb 2012 09:15:41 +0100
Subject: [torqueusers] error opening pipe in PBSD_munge_authenticate: errno = 24
Message-ID:
Dear list,
I updated torque to version 2.5.9 and also updated CentOS. Since this
update I have had some issues with munge. I do not know what it is used for,
but it is a dependency of python-pbs, which is required by jobmonarch.
I searched online for information on this message but did not find
anything.
This is the munge version: munge-0.5.8-8.el5
I created the key using create-munge-key on the master, then copied it to all
nodes (all nodes have the same version).
In /var/log/messages I do not have any reference to munge.
# service munge status
munged (pid 9524) is running... (also on nodes)
# tail /var/log/munge/munged.log
2012-01-31 11:24:47 Notice: Exiting on signal=15
2012-01-31 11:24:47 Info: Wrote 1024 bytes to PRNG seed "/var/lib/munge/munge.seed"
2012-01-31 11:24:47 Notice: Stopping munge-0.5.8 daemon (pid 1704)
2012-01-31 11:24:47 Notice: Running on "slater.up.univ-mrs.fr" (147.94.185.151)
2012-01-31 11:24:47 Info: PRNG seeded with 1024 bytes from "/var/lib/munge/munge.seed"
2012-01-31 11:24:47 Info: Updating supplementary group mapping every 3600 seconds
2012-01-31 11:24:47 Info: Enabled supplementary group mtime check of "/etc/group"
2012-01-31 11:24:47 Notice: Starting munge-0.5.8 daemon (pid 9524)
2012-01-31 11:24:47 Info: Created 2 work threads
2012-01-31 11:24:47 Info: Found 35 users with supplementary groups in 0.001 seconds
Torque is configured with
./configure --disable-mom --disable-cpuset --disable-gui --with-rcp=scp
If anyone has a clue, I would appreciate it. Thanks,
Fabien
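A note on the error code in the subject: on Linux, errno 24 is EMFILE ("Too many open files"), so one thing that may be worth checking is whether the process reporting the error is close to its file descriptor limit. The following is only a minimal sketch, assuming a Linux /proc filesystem; the function name fd_usage is hypothetical and is not part of TORQUE or munge.

```shell
# Print "<open fds> <soft limit>" for a given PID (Linux /proc only).
# fd_usage is a hypothetical helper, not a TORQUE or munge command.
fd_usage() {
    pid=$1
    # Each entry under /proc/<pid>/fd is one open file descriptor.
    open=$(ls "/proc/$pid/fd" | wc -l | tr -d ' ')
    # The fourth field of the "Max open files" row is the soft limit.
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "$open $limit"
}
```

Calling fd_usage with the PID of whichever process logs the message would show how close that process is to its limit. Which process that is on this system is not stated in the report, so that part remains an assumption.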
From listsarnau at gmail.com Wed Feb 1 03:32:21 2012
From: listsarnau at gmail.com (Arnau Bria)
Date: Wed, 1 Feb 2012 11:32:21 +0100
Subject: [torqueusers] momctl -h $node issue
Message-ID: <20120201113221.2483b218@amarrosa.pic.es>
Hi all,
Whenever I run momctl -d3 -h $node, the first attempt fails with this error:
# momctl -d3 -h $workernode
ERROR: query[0] 'diag3' failed on $workernode (errno=0-Success: 5-Input/output error)
The second attempt works fine:
# momctl -d3 -h $workernode
Host: $workernode/$workernode Version: 2.5.9 PID: 13822
Init Msgs Received: 0 hellos/1 cluster-addrs
[...]
I'm wondering whether we have a network issue or whether this is normal
behaviour for the command.
Has anyone had this problem before? What is the source of this error
message?
TIA,
Arnau
In case it is relevant, here is our network configuration:
Our server is not on the same network/VLAN as our clients:
server: 193.109.174
clients: 192.168.101
tracepath from server to client:
# tracepath $workernode
1: $server (193.109.174.13) 0.218ms pmtu 1500
1: CISCO ROUTER (193.109.174.1) 0.868ms
2: ARISTA SWITCH/ROUTER (192.168.50.) 0.307ms
3: $workernode (192.168.101.108) 0.192ms reached
Resume: pmtu 1500 hops 3 back 3
MTU on server is 1500, on nodes 9000.
The nodes and the client use active-backup bonding.
From Gareth.Williams at csiro.au Thu Feb 2 20:17:36 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Fri, 3 Feb 2012 14:17:36 +1100
Subject: [torqueusers] requesting gpus
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74DC9@exvic-mbx04.nexus.csiro.au>
Hi All,
I added basic GPU count information to one of our compute nodes with:
qmgr -c 's n n121 gpus = 2'
and it seems fine:
> pbsnodes -a n121
n121
state = free
np = 12
ntype = cluster
status = rectime=1328238593,varattr=,jobs=,state=free,size=133709780kb:144492840kb,netload=156768229618,gres=,loadave=2.00,ncpus=24,physmem=99195396kb,availmem=95103784kb,totmem=101299868kb,idletime=173222,nusers=0,nsessions=0,uname=Linux n121 2.6.32.49-0.3-default #1 SMP 2011-12-02 11:28:04 +0100 x86_64,opsys=sles11,arch=x86_64
mom_service_port = 15002
mom_manager_port = 15003
gpus = 2
However when I run a job with the recommended syntax:
http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/3.7schedulinggpus.php
I get:
> qsub -I -q viz -l nodes=1:ppn=1:gpus=1
qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes
The torque version is 3.0.3-snap.201108261653
Note that this is _not_ the --enable-nvidia-gpus functionality.
Also note that the server has not been restarted.
The scheduler is moab but I'm pretty sure the job gets rejected well before moab comes into the picture.
Does anyone have such a setup working or can anyone see what is wrong (or have an idea of where to look)?
Regards,
Gareth
From cwest at vpac.org Thu Feb 2 22:55:18 2012
From: cwest at vpac.org (Craig West)
Date: Fri, 03 Feb 2012 16:55:18 +1100
Subject: [torqueusers] requesting gpus
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102CDD74DC9@exvic-mbx04.nexus.csiro.au>
References: <007DECE986B47F4EABF823C1FBB19C620102CDD74DC9@exvic-mbx04.nexus.csiro.au>
Message-ID: <4F2B76C6.2080104@vpac.org>
Hi Gareth,
> However when I run a job with the recommended syntax:
> http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/3.7schedulinggpus.php
>
> I get:
>> qsub -I -q viz -l nodes=1:ppn=1:gpus=1
> qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes
>
> The torque version is 3.0.3-snap.201108261653
>
> Note that this is _/not/_ the --enable-nvidia-gpus functionality.
> Also note that the server has not been restarted.
> The scheduler is moab but I'm pretty sure the job gets rejected well
> before moab comes into the picture.
>
> Does anyone have such a setup working or can anyone see what is wrong
> (or have an idea of where to look)?
Your pbsnodes output looks correct and similar to our systems.
A few questions for you:
1. What version of Moab are you running?
2. Does the Viz queue have the ability to schedule to that node?
3. What is in the "Configured Resources" line of "checknode n121"?
It should have a "GPUS: 2" parameter.
Cheers,
Craig.
--
Craig West Systems Manager
Victorian Partnership for Advanced Computing
110 Victoria Street, Carlton South VIC 3053
P: +61 3 9925 4751 E: cwest at vpac.org
http://www.vpac.org
From jonathan.michalon at etu.unistra.fr Fri Feb 3 01:58:10 2012
From: jonathan.michalon at etu.unistra.fr (Jonathan Michalon)
Date: Fri, 3 Feb 2012 09:58:10 +0100
Subject: [torqueusers] [Patch] GPUs by the way of GRES
Message-ID: <20120203095810.6ba1833b@RunningPenguin.chalmion.homelinux.net>
Hi Maui folks,
GPUs in Maui are a long-standing problem. Last year a patch was sent by Mariusz
Mamoński [1], which works based on GRES parameters.
I have just got GPUs more or less working by enhancing that patch. Please find
attached the resulting patch, which works well for Maui 3.3.1.
It defines a special GRES named "gpu" which works as expected on my test cases.
Note that GRES behaviour seems rather inconsistent, as GRES are sometimes
described as consumable. This patch overrides this behaviour, for the needs of GPUs.
To use the patch:
get the sources of maui-3.3.1 and patch them:
patch -p1 < ../Patch-for-gpu-GRES.patch
then compile as usual.
You have to configure the GPUs in maui.cfg:
NODECFG[nodename] GRES=gpu:2
Then when queuing jobs you can request GPUs with (Torque syntax):
qsub -W x=GRES:gpu@1
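For a cluster with several GPU nodes, the maui.cfg lines described above could be generated rather than typed by hand. This is only a sketch: the helper name gen_gpu_nodecfg and the node names in the usage line are made up for illustration, and the output format simply mirrors the NODECFG line shown in this message.

```shell
# Emit one NODECFG line per node, declaring a GRES named "gpu".
# gen_gpu_nodecfg is a hypothetical helper, not part of Maui.
gen_gpu_nodecfg() {
    count=$1          # number of GPUs per node
    shift             # remaining arguments are node names
    for node in "$@"; do
        printf 'NODECFG[%s] GRES=gpu:%s\n' "$node" "$count"
    done
}
```

For example, gen_gpu_nodecfg 2 node01 node02 prints two lines that could be appended to maui.cfg after review.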
I hope this helps, please test this and enhance to your needs!
[1]
http://www.supercluster.org/pipermail/mauiusers/2011-April/004622.html
Regards,
PS. This is the second attempt to send the mail?
--
Jonathan Michalon
IT student in Strasbourg
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Patch-for-gpu-GRES.patch
Type: text/x-patch
Size: 4803 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120203/e6847559/attachment-0001.bin
From Gareth.Williams at csiro.au Fri Feb 3 02:08:18 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Fri, 3 Feb 2012 20:08:18 +1100
Subject: [torqueusers] requesting gpus
In-Reply-To: <4F2B76C6.2080104@vpac.org>
References: <007DECE986B47F4EABF823C1FBB19C620102CDD74DC9@exvic-mbx04.nexus.csiro.au>
<4F2B76C6.2080104@vpac.org>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74DD2@exvic-mbx04.nexus.csiro.au>
> -----Original Message-----
> From: Craig West [mailto:cwest at vpac.org]
> Sent: Friday, 3 February 2012 4:55 PM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] requesting gpus
>
>
> Hi Gareth,
>
> > However when I run a job with the recommended syntax:
> > http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/3.7schedulinggpus.php
> >
> > I get:
> >> qsub -I -q viz -l nodes=1:ppn=1:gpus=1
> > qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes
> >
> > The torque version is 3.0.3-snap.201108261653
> >
> > Note that this is _/not/_ the --enable-nvidia-gpus functionality.
> > Also note that the server has not been restarted.
> > The scheduler is moab but I'm pretty sure the job gets rejected well
> > before moab comes into the picture.
> >
> > Does anyone have such a setup working or can anyone see what is wrong
> > (or have an idea of where to look)?
>
> Your pbsnodes output looks correct and similar to our systems.
>
> Few questions for you:
> 1. What version of Moab are you running?
> 2. Does the Viz queue have the ability to schedule to that node?
> 3. What is in the "Configured Resources" line of "checknode n121"?
> It should have a "GPUS: 2" parameter.
>
> Cheers,
> Craig.
-snip-
1) Moab Version: 6.0.2 - due for an upgrade anytime
2) yes - and I can get jobs there with gpus as a gres, but that doesn't count them correctly
3) > checknode n121 | grep Configu
Configured Resources: PROCS: 12 MEM: 94G SWAP: 96G DISK: 137G GPUS: 2
But I don't think moab gets a chance to play a role. I've looked at the logs, but I confess that I've not turned up the logging level yet.
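The check in answer 3 can be scripted so that it is easy to repeat across nodes. A minimal sketch, assuming the "Configured Resources" line has the layout shown above; configured_gpus is a hypothetical helper name, not a Moab command.

```shell
# Extract the GPUS value from the "Configured Resources" line of
# checknode output read on stdin. configured_gpus is a hypothetical name.
configured_gpus() {
    awk '/Configured Resources:/ {
        # Walk the "KEY: value" pairs and print the value after GPUS:
        for (i = 1; i < NF; i++)
            if ($i == "GPUS:") print $(i + 1)
    }'
}
```

Usage would be along the lines of: checknode n121 | configured_gpus. It prints nothing if the line carries no GPUS entry.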
Gareth
From david at unistra.fr Fri Feb 3 02:20:24 2012
From: david at unistra.fr (R. David)
Date: Fri, 3 Feb 2012 10:20:24 +0100
Subject: [torqueusers] [Patch] GPUs by the way of GRES
In-Reply-To: <20120203095810.6ba1833b@RunningPenguin.chalmion.homelinux.net>
References: <20120203095810.6ba1833b@RunningPenguin.chalmion.homelinux.net>
Message-ID: <9DB98485-EECB-48D1-8AEC-5F0877E6704D@unistra.fr>
Hello,
Here at the Computing center of the University of Strasbourg, we have been using this patch with great success.
It makes GPU access much easier for our users, and our batch configuration is now fully operational for GPUs.
Regards,
On 3 Feb 2012, at 09:58, Jonathan Michalon wrote:
> Hi Maui folks,
>
> GPUs in Maui are a long-standing problem. Last year a patch was sent by Mariusz
> Mamoński [1], which works based on GRES parameters.
> I have just got GPUs more or less working by enhancing that patch. Please find
> attached the resulting patch, which works well for Maui 3.3.1.
> It defines a special GRES named "gpu" which works as expected on my test cases.
>
> Note that GRES behaviour seems rather inconsistent, as GRES are sometimes
> described as consumable. This patch overrides this behaviour, for the needs of GPUs.
>
> To use the patch:
> get the sources of maui-3.3.1 and patch them:
> patch -p1 < ../Patch-for-gpu-GRES.patch
> then compile as usual.
>
> You have to configure the GPUs in maui.cfg:
> NODECFG[nodename] GRES=gpu:2
>
> Then when queuing jobs you can request GPUs with (Torque syntax):
> qsub -W x=GRES:gpu@1
>
> I hope this helps, please test this and enhance to your needs!
>
> [1]
> http://www.supercluster.org/pipermail/mauiusers/2011-April/004622.html
>
> Regards,
>
> PS. This is the second attempt to send the mail?
>
> --
> Jonathan Michalon
> IT student in Strasbourg
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
---------------------------------------------------------
R. David - david at unistra.fr
Responsable du meso-centre
UdS / Direction Informatique
Tel. : 03 68 85 45 48
---------------------------------------------------------
From leggett at mcs.anl.gov Fri Feb 3 08:27:11 2012
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Fri, 3 Feb 2012 09:27:11 -0600
Subject: [torqueusers] Torque not honoring max_user_queuable
Message-ID:
We've set queue limits that don't seem to be honored:
sdb:~ # qstat | grep linpyl | grep batch | wc
945 5670 82215
sdb:~ # qmgr -c "print queue batch"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch max_user_queuable = 500
set queue batch resources_min.mppwidth = 1
set queue batch resources_default.mppwidth = 24
set queue batch resources_default.walltime = 00:10:00
set queue batch acl_group_enable = False
set queue batch resources_available.nodes = 726
set queue batch enabled = True
set queue batch started = True
How would it be possible for a user to have 945 jobs in the queue when the limit should be 500?
From leggett at mcs.anl.gov Fri Feb 3 08:29:41 2012
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Fri, 3 Feb 2012 09:29:41 -0600
Subject: [torqueusers] Torque 2.5.9 MOMs keep segfaulting
In-Reply-To:
References: <1ebb959b-2dde-4ef0-9f8e-089d2ffb5d29@mail>
Message-ID: <37E90BE2-E4D5-4345-BD8E-510A5A03BC97@mcs.anl.gov>
Some more information on this problem. The issue is triggered by one user who is using the Intel MPI implementation and using MPDs instead of hydra. My guess is the MPDs are trying to communicate outside of the MOM and this is confusing the MOMs and causing them to bail. I've asked the user to switch to hydra instead but haven't heard back yet.
On Jan 16, 2012, at 10:44 AM, Ti Leggett wrote:
> They seem to die immediately. I can't really run them in gdb since it's randomly on nodes and I haven't found a way to trigger the failure.
>
> On Jan 11, 2012, at 2:52 PM, David Beer wrote:
>
>> Do they segfault right away? If you can't find a core file, would it be possible to run the mom in gdb and get a backtrace of the crash when it happens?
>>
>> David
>>
>> ----- Original Message -----
>>> torque was configured with --with-debug, "ulimit -c unlimited" is in
>>> the init script right before the moms are started like
>>> "/usr/sbin/pbs_mom -p -d /var/spool/torque" but I'm still not seeing
>>> a core file anywhere.
>>>
>>> On Jan 11, 2012, at 10:26 AM, David Beer wrote:
>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> I finally got around to doing this, but I don't see a core file in
>>>>> /var/spool/torque or in /usr/sbin. Where would the core get
>>>>> dumped?
>>>>>
>>>>
>>>> A mom's core file would be in /var/spool/torque/mom_priv. You need
>>>> to make sure ulimit -c is unlimited or set to a very large number.
>>>>
>>>> David
>>>>
>>>>> On Dec 20, 2011, at 3:03 PM, Ken Nielson wrote:
>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Troy Baer"
>>>>>>> To: "Torque Users Mailing List"
>>>>>>> Sent: Tuesday, December 20, 2011 8:59:56 AM
>>>>>>> Subject: Re: [torqueusers] Torque 2.5.9 MOMs keep segfaulting
>>>>>>>
>>>>>>> On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote:
>>>>>>>> I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then, MOMs
>>>>>>>> keep randomly segfaulting and dying. I see this in the MOM log
>>>>>>>> right before dying:
>>>>>>>>
>>>>>>>> 12/08/2011 10:09:14;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
>>>>>>>>
>>>>>>>>
>>>>>>>> And something similar to this in dmesg:
>>>>>>>>
>>>>>>>> pbs_mom[22354]: segfault at 0000000000000008 rip 00002b585249ed6f rsp 00007fff19e96df0 error 4
>>>>>>>
>>>>>>> We've also seen this on one of our systems and had to fall back to 2.5.8
>>>>>>> on it.
>>>>>>>
>>>>>>> --Troy
>>>>>>> --
>>>>>>> Troy Baer, HPC System Administrator
>>>>>>> National Institute for Computational Sciences, University of
>>>>>>> Tennessee
>>>>>>> http://www.nics.tennessee.edu/
>>>>>>> Phone: 865-241-4233
>>>>>>
>>>>>> Could someone configure TORQUE using --with-debug and then send a
>>>>>> stack trace of the crash?
>>>>>>
>>>>>> Ken
>>>>>
>>>>
>>>> --
>>>> David Beer
>>>> Direct Line: 801-717-3386 | Fax: 801-717-3738
>>>> Adaptive Computing
>>>> 1712 S East Bay Blvd, Suite 300
>>>> Provo, UT 84606
>>>
>>>
>>
>> --
>> David Beer
>> Direct Line: 801-717-3386 | Fax: 801-717-3738
>> Adaptive Computing
>> 1712 S East Bay Blvd, Suite 300
>> Provo, UT 84606
>>
>
From dbeer at adaptivecomputing.com Fri Feb 3 09:06:45 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 3 Feb 2012 09:06:45 -0700
Subject: [torqueusers] Torque not honoring max_user_queuable
In-Reply-To:
References:
Message-ID:
Ti,
How are you submitting the jobs? I assume this is TORQUE 2.5.9?
David
On Fri, Feb 3, 2012 at 8:27 AM, Ti Leggett wrote:
> We've set queue limits that don't seem to be honored:
>
> sdb:~ # qstat | grep linpyl | grep batch | wc
> 945 5670 82215
>
> sdb:~ # qmgr -c "print queue batch"
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch max_user_queuable = 500
> set queue batch resources_min.mppwidth = 1
> set queue batch resources_default.mppwidth = 24
> set queue batch resources_default.walltime = 00:10:00
> set queue batch acl_group_enable = False
> set queue batch resources_available.nodes = 726
> set queue batch enabled = True
> set queue batch started = True
>
> How would it be possible for a user to have 945 jobs in the queue when the
> limit should be 500?
--
David Beer | Software Engineer
Adaptive Computing
From leggett at mcs.anl.gov Fri Feb 3 09:21:01 2012
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Fri, 3 Feb 2012 10:21:01 -0600
Subject: [torqueusers] Torque not honoring max_user_queuable
In-Reply-To:
References:
Message-ID: <22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov>
I'm assuming using qsub, but it's other users doing this so I'm not 100% sure. Is there a way to find out from logs or other tools?
On Feb 3, 2012, at 10:06 AM, David Beer wrote:
> Ti,
>
> How are you submitting the jobs? I assume this is TORQUE 2.5.9?
>
> David
From dbeer at adaptivecomputing.com Fri Feb 3 10:03:17 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 3 Feb 2012 10:03:17 -0700
Subject: [torqueusers] Torque not honoring max_user_queuable
In-Reply-To: <22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov>
References:
<22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov>
Message-ID:
If you qstat -f a few of the jobs you can see the submit arguments. At
higher log levels the entire job submission is there, but I don't know if
your log levels would be that high.
David
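The qstat -f suggestion above can be sketched as a small filter that pulls out only the submit_args attribute. One assumption is made here and should be checked against real output: that long attribute values are wrapped onto continuation lines starting with a tab. The function name submit_args is hypothetical.

```shell
# Print the submit_args attribute of each job from `qstat -f` output on stdin,
# rejoining continuation lines (assumed to start with a tab).
# submit_args is a hypothetical helper name, not a TORQUE command.
submit_args() {
    awk '
        /^[ ]*submit_args = / {
            sub(/^[ ]*submit_args = /, "")
            line = $0; in_args = 1; next
        }
        # A tab-indented line continues the attribute value being collected.
        in_args && /^\t/ { sub(/^\t/, ""); line = line $0; next }
        # Any other line ends the attribute: print what was collected.
        in_args { print line; in_args = 0 }
        END { if (in_args) print line }
    '
}
```

Usage would be along the lines of: qstat -f | submit_args, which prints one line per job that carries the attribute.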
On Fri, Feb 3, 2012 at 9:21 AM, Ti Leggett wrote:
> I'm assuming using qsub, but it's other users doing this so I'm not 100%
> sure. Is there a way to find out from logs or other tools?
--
David Beer | Software Engineer
Adaptive Computing
From leggett at mcs.anl.gov Fri Feb 3 10:15:14 2012
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Fri, 3 Feb 2012 11:15:14 -0600
Subject: [torqueusers] Torque not honoring max_user_queuable
In-Reply-To:
References:
<22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov>
Message-ID: <93149F70-EFCC-44F6-9CE3-022C475BEE70@mcs.anl.gov>
submit_args = -A CI-MCB000083 -l walltime=48:00:00,mppwidth=48 /lustre/beagle/linpyl/project.qsub
On Feb 3, 2012, at 11:03 AM, David Beer wrote:
> If you qstat -f a few of the jobs you can see the submit arguments. At higher log levels the entire job submission is there, but I don't know if your log levels would be that high.
>
> David
From jjc at iastate.edu Fri Feb 3 10:36:05 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Fri, 3 Feb 2012 17:36:05 +0000
Subject: [torqueusers] Torque not honoring max_user_queuable : Two
commands to check
In-Reply-To:
References:
Message-ID: <242421BFAF465844BE24EB90BB97E22101825614@ITSDAG1D.its.iastate.edu>
Ti Leggett,
I'd suggest checking two commands to confirm
that there is a problem:
1) Really simple issue:
Make sure your count is correct:
Issue:
qstat -u linpyl | awk '$3 == "batch" {print}' | wc -l
to see if this exceeds 500.
The command that you displayed would also count jobs named batchjob
submitted by any user whose name includes linpyl as part of the name
(so user linpylon could be adding to the total, or
linpyl could have jobs in two different queues, both named batchjob).
I encountered these issues because I have users who have similar names
and I have users who use the same name for every job.
The command above should avoid these issues and give a reliable count.
2) Did a torque admin change the max_user_queuable
before/after these jobs were submitted?
Check the pbs_server logs to see if max_user_queuable
was changed after these jobs were submitted.
I am a torque admin, so I could get around max_user_queuable by changing it
and changing it back, as could any other torque admin, and as could
someone who has root privileges (knows the root password or has sudo capability).
The evidence should be in the logs then, though.
grep max_user_queuable /var/spool/torque/server_logs/2012*
should answer this question.
I have two backups, and a user could call them to ask them to
raise the limit temporarily. If you see evidence of this, I'd ask the
other torque admins first.
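The difference between the two counts in check 1 can be sketched with a small filter. It assumes, as the awk command above does, that the queue is the third column of qstat -u output; count_jobs_in_queue is a hypothetical helper name.

```shell
# Count jobs whose queue column equals QUEUE exactly, reading
# `qstat -u USER` style output on stdin (queue assumed to be column 3).
# count_jobs_in_queue is a hypothetical helper name.
count_jobs_in_queue() {
    # n + 0 forces a numeric 0 when no line matched.
    awk -v q="$1" '$3 == q { n++ } END { print n + 0 }'
}
```

Usage would be along the lines of: qstat -u linpyl | count_jobs_in_queue batch. Unlike a plain grep for "batch", a job named batch or batchjob that sits in another queue is not counted, because only the queue column is compared.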
James Coyle, PhD
High Performance Computing Group
115 Durham Center
Iowa State Univ. phone: (515)-294-2099
Ames, Iowa 50011 web: http://jjc.public.iastate.edu/
>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
>bounces at supercluster.org] On Behalf Of Ti Leggett
>Sent: Friday, February 03, 2012 9:27 AM
>To: Torque Users Mailing List
>Subject: [torqueusers] Torque not honoring max_user_queuable
>
>We've set queue limits that don't seem to be honored:
>
>sdb:~ # qstat | grep linpyl | grep batch | wc
> 945 5670 82215
>
>sdb:~ # qmgr -c "print queue batch"
>#
># Create queues and set their attributes.
>#
>#
># Create and define queue batch
>#
>create queue batch
>set queue batch queue_type = Execution
>set queue batch max_user_queuable = 500
>set queue batch resources_min.mppwidth = 1
>set queue batch resources_default.mppwidth = 24
>set queue batch resources_default.walltime = 00:10:00
>set queue batch acl_group_enable = False
>set queue batch resources_available.nodes = 726
>set queue batch enabled = True
>set queue batch started = True
>
>How would it be possible for a user to have 945 jobs in the queue
>when the limit should be 500?
From dbeer at adaptivecomputing.com Fri Feb 3 10:44:55 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 3 Feb 2012 10:44:55 -0700
Subject: [torqueusers] Torque not honoring max_user_queuable
In-Reply-To: <93149F70-EFCC-44F6-9CE3-022C475BEE70@mcs.anl.gov>
References:
<22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov>
<93149F70-EFCC-44F6-9CE3-022C475BEE70@mcs.anl.gov>
Message-ID:
I'm also curious - is this done through a routing queue or routing queues?
Is it class remapping in Moab? It looks like it isn't qsub -q
David
On Fri, Feb 3, 2012 at 10:15 AM, Ti Leggett wrote:
> submit_args = -A CI-MCB000083 -l walltime=48:00:00,mppwidth=48 /lustre/beagle/linpyl/project.qsub
>
> On Feb 3, 2012, at 11:03 AM, David Beer wrote:
>
> > If you qstat -f a few of the jobs you can see the submit arguments. At
> higher log levels the entire job submission is there, but I don't known if
> your log levels would be that high.
> >
> > David
> >
> > On Fri, Feb 3, 2012 at 9:21 AM, Ti Leggett wrote:
> > I'm assuming using qsub, but it's other users doing this so I'm not 100%
> sure. Is there a way to find out from logs or other tools?
> >
> > On Feb 3, 2012, at 10:06 AM, David Beer wrote:
> >
> > > Ti,
> > >
> > > How are you submitting the jobs? I assume this is TORQUE 2.5.9?
> > >
> > > David
> > >
> > > On Fri, Feb 3, 2012 at 8:27 AM, Ti Leggett wrote:
> > > We've set queue limits that don't seem to be honored:
> > >
> > > sdb:~ # qstat | grep linpyl | grep batch | wc
> > > 945 5670 82215
> > >
> > > sdb:~ # qmgr -c "print queue batch"
> > > #
> > > # Create queues and set their attributes.
> > > #
> > > #
> > > # Create and define queue batch
> > > #
> > > create queue batch
> > > set queue batch queue_type = Execution
> > > set queue batch max_user_queuable = 500
> > > set queue batch resources_min.mppwidth = 1
> > > set queue batch resources_default.mppwidth = 24
> > > set queue batch resources_default.walltime = 00:10:00
> > > set queue batch acl_group_enable = False
> > > set queue batch resources_available.nodes = 726
> > > set queue batch enabled = True
> > > set queue batch started = True
> > >
> > > How would it be possible for a user to have 945 jobs in the queue when the limit should be 500?
> > > _______________________________________________
> > > torqueusers mailing list
> > > torqueusers at supercluster.org
> > > http://www.supercluster.org/mailman/listinfo/torqueusers
--
David Beer | Software Engineer
Adaptive Computing
From gadre at wisc.edu Fri Feb 3 10:57:06 2012
From: gadre at wisc.edu (Milind)
Date: Fri, 03 Feb 2012 11:57:06 -0600
Subject: [torqueusers] Maui does not know queue to node map? - queue system
is failing, please HELP !
In-Reply-To: <76f0eb901311f0.4f2c1fcc@wiscmail.wisc.edu>
References: <777090be12ac6d.4f2c1036@wiscmail.wisc.edu>
<7560af4b12f902.4f2c1074@wiscmail.wisc.edu>
<7620b4a21297e8.4f2c10b0@wiscmail.wisc.edu>
<75b0f66212cca8.4f2c10ed@wiscmail.wisc.edu>
<75b09046129462.4f2c1129@wiscmail.wisc.edu>
<75d0d3ee129b08.4f2c1166@wiscmail.wisc.edu>
<7730b28512af6d.4f2c1259@wiscmail.wisc.edu>
<7780ba0b12ab65.4f2c1295@wiscmail.wisc.edu>
<75d0eec212aeb7.4f2c12d2@wiscmail.wisc.edu>
<778092d912f8fe.4f2c130e@wiscmail.wisc.edu>
<76f0c0d5129241.4f2c138b@wiscmail.wisc.edu>
<75e097c412b192.4f2c13ca@wiscmail.wisc.edu>
<76208747134a35.4f2c1715@wiscmail.wisc.edu>
<7770a2ea137481.4f2c1752@wiscmail.wisc.edu>
<75b088e8133cd6.4f2c178e@wiscmail.wisc.edu>
<773097d6136bf7.4f2c1cf5@wiscmail.wisc.edu>
<7770ba80131d91.4f2c1ed8@wiscmail.wisc.edu>
<7560dc5c134556.4f2c1f53@wiscmail.wisc.edu>
<75d0f8dd135f34.4f2c1f90@wiscmail.wisc.edu>
<76f0eb901311f0.4f2c1fcc@wiscmail.wisc.edu>
Message-ID: <7730cf1e133768.4f2bcb92@wiscmail.wisc.edu>
Hello,
I am a cluster administrator at the University of Wisconsin-Madison. Our cluster runs Maui (3.2.5) and OpenPBS 2.3 on the ROCKS 5.3 system.
For the last few days our queue system has been misbehaving: PBS accepts jobs and puts them in the right queues, but the scheduler then places a job on a 'wrong' compute node (one that is not supposed to be in that queue), while PBS still lists the job as running under the right queue.
For example, PBS shows this:
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
60606.bardeen             Cu1_a60_mov                      00:52:05 R fast
but the job is on a compute node that is not in the queue "fast" at all. The PBS node list (/opt/torque/server_priv/nodes) looks fine, and there are no errors in the Maui logs.
In the PBS logs, I get this message:
10:54:19;0008;PBS_Server;Job;60606.bardeen.msae.wisc.edu;Job Modified at request of maui at bardeen.msae.wisc.edu
My guess is that Maui is doing something wrong, or that it does not know the correct queue-to-node mapping.
Can someone suggest what is going on or guide me in solving this issue?
Thanks!
Milind
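To see which jobs the scheduler has touched, the server log can be filtered for entries like the one quoted in this message. A minimal sketch, assuming the semicolon-separated layout shown there (time;event;PBS_Server;Job;job id;text); jobs_modified_by is a hypothetical helper, and the location of the log file depends on the installation:

```shell
#!/bin/sh
# jobs_modified_by REQUESTER
# Reads pbs_server log lines on stdin and prints the distinct job IDs that
# have a "Job Modified at request of REQUESTER..." entry. REQUESTER is
# matched as a prefix, so "maui" matches the requester on any host.
jobs_modified_by() {
    awk -F';' -v who="$1" '
        $4 == "Job" && index($6, "Job Modified at request of " who) == 1 {
            print $5
        }
    ' | sort -u
}
```

Such an entry only shows that the scheduler changed the job; it does not by itself say which attribute was changed.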
From jjc at iastate.edu Fri Feb 3 11:39:01 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Fri, 3 Feb 2012 18:39:01 +0000
Subject: [torqueusers] Maui does not know queue to node map? - queue
system is failing, please HELP !
In-Reply-To: <7730cf1e133768.4f2bcb92@wiscmail.wisc.edu>
References: <777090be12ac6d.4f2c1036@wiscmail.wisc.edu>
<7560af4b12f902.4f2c1074@wiscmail.wisc.edu>
<7620b4a21297e8.4f2c10b0@wiscmail.wisc.edu>
<75b0f66212cca8.4f2c10ed@wiscmail.wisc.edu>
<75b09046129462.4f2c1129@wiscmail.wisc.edu>
<75d0d3ee129b08.4f2c1166@wiscmail.wisc.edu>
<7730b28512af6d.4f2c1259@wiscmail.wisc.edu>
<7780ba0b12ab65.4f2c1295@wiscmail.wisc.edu>
<75d0eec212aeb7.4f2c12d2@wiscmail.wisc.edu>
<778092d912f8fe.4f2c130e@wiscmail.wisc.edu>
<76f0c0d5129241.4f2c138b@wiscmail.wisc.edu>
<75e097c412b192.4f2c13ca@wiscmail.wisc.edu>
<76208747134a35.4f2c1715@wiscmail.wisc.edu>
<7770a2ea137481.4f2c1752@wiscmail.wisc.edu>
<75b088e8133cd6.4f2c178e@wiscmail.wisc.edu>
<773097d6136bf7.4f2c1cf5@wiscmail.wisc.edu>
<7770ba80131d91.4f2c1ed8@wiscmail.wisc.edu>
<7560dc5c134556.4f2c1f53@wiscmail.wisc.edu>
<75d0f8dd135f34.4f2c1f90@wiscmail.wisc.edu>
<76f0eb901311f0.4f2c1fcc@wiscmail.wisc.edu>
<7730cf1e133768.4f2bcb92@wiscmail.wisc.edu>
Message-ID: <242421BFAF465844BE24EB90BB97E22101825674@ITSDAG1D.its.iastate.edu>
This is the Torque mailing list; OpenPBS has not been
maintained in a long time.
I upgraded to Torque when OpenPBS stopped being supported,
around 2004 if I recall correctly.
Torque is available from http://www.adaptivecomputing.com/products/torque.php
>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
>bounces at supercluster.org] On Behalf Of Milind
>Sent: Friday, February 03, 2012 11:57 AM
>To: torqueusers at supercluster.org
>Subject: [torqueusers] Maui does not know queue to node map? - queue
>system is failing, please HELP !
>
>Hello,
>
>I am a cluster administrator at
>the University of Wisconsin-Madison. At our cluster we have Maui
>(3.2.5), OpenPBS 2.3 on the ROCKS 5.3 system.
>For last few days, our queue system has been haywire : the PBS
>accepts jobs and puts them in right queues, but the scheduler
>somehow does something in the middle, and the job ends up on a
>'wrong' compute node (which is not supposed to be in that queue),
>all the while PBS still lists that job as running under the right
>queue.
>
>example, PBS shows this:
>
>Job id                    Name             User            Time Use S Queue
>------------------------- ---------------- --------------- -------- - -----
>60606.bardeen             Cu1_a60_mov                      00:52:05 R fast
>
>but the job is on a compute node which is not at all in the queue
>"fast" ! The pbs nodelist (/opt/torque/server_priv/nodes ) is all
>fine, no errors in maui logs.
>In pbs logs, I get this message
>
>10:54:19;0008;PBS_Server;Job;60606.bardeen.msae.wisc.edu;Job Modified at request of maui at bardeen.msae.wisc.edu
>
>My guess is that maui is doing something wrong / it does not know
>the correct queue - to - node mapping.
>
>Can someone suggest what is going on or guide me to solve this issue
>??
>
>thanks !!
>
>Milind
From gadre at wisc.edu Fri Feb 3 12:13:32 2012
From: gadre at wisc.edu (Milind)
Date: Fri, 03 Feb 2012 13:13:32 -0600
Subject: [torqueusers] Maui does not know queue to node map? - queue system
is failing, please HELP !
In-Reply-To: <7730b26c1324af.4f2c31cf@wiscmail.wisc.edu>
References: <7730b26c1324af.4f2c31cf@wiscmail.wisc.edu>
Message-ID: <76208faa13491a.4f2bdd7c@wiscmail.wisc.edu>
Hello,
I am a cluster administrator at the University of Wisconsin-Madison. Our cluster runs Maui (3.2.5) and PBS 2.4.6 on the ROCKS 5.3 system. (Sorry, I wrote OpenPBS in my last email.)
For the last few days our queue system has been misbehaving: PBS accepts jobs and puts them in the right queues, but the scheduler then places a job on a 'wrong' compute node (one that is not supposed to be in that queue), while PBS still lists the job as running under the right queue.
For example, PBS shows this:
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
60606.bardeen             Cu1_a60_mov                      00:52:05 R fast
but the job is on a compute node that is not in the queue "fast" at all. The PBS node list (/opt/torque/server_priv/nodes) looks fine, and there are no errors in the Maui logs.
In the PBS logs, I get this message:
10:54:19;0008;PBS_Server;Job;60606.bardeen.msae.wisc.edu;Job Modified at request of maui at bardeen.msae.wisc.edu
My guess is that Maui is doing something wrong, or that it does not know the correct queue-to-node mapping.
Can someone suggest what is going on or guide me in solving this issue?
Thanks!
Milind
From Gareth.Williams at csiro.au Fri Feb 3 14:45:24 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Sat, 4 Feb 2012 08:45:24 +1100
Subject: [torqueusers] requesting gpus
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102CDD74DD2@exvic-mbx04.nexus.csiro.au>
References: <007DECE986B47F4EABF823C1FBB19C620102CDD74DC9@exvic-mbx04.nexus.csiro.au>
<4F2B76C6.2080104@vpac.org>
<007DECE986B47F4EABF823C1FBB19C620102CDD74DD2@exvic-mbx04.nexus.csiro.au>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74DD4@exvic-mbx04.nexus.csiro.au>
Matt Ismail at Warwick in the UK knew the problem/solution.
> I reported this issue to Adaptive in August last year and it got fixed in torque-3.0.3-snap.201111071556. From the CHANGELOG: "Fixed a problem in qsub where you could not submit a job in interactive mode with gpus in the resource list."
> If it is the same issue you're seeing it'll only be affecting interactive job submissions, i.e. qsub -I.
I can confirm that in our current setup non-interactive jobs are OK - and we'll upgrade to make interactive jobs work too.
Thanks,
Gareth
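Since the fix is tied to a snapshot build, an installed snapshot can be compared against it. A small sketch, assuming version strings of the form 3.0.3-snap.YYYYMMDDHHMM as quoted in this thread; has_interactive_gpu_fix is an illustrative name, and other release lines are deliberately not judged:

```shell
#!/bin/sh
# has_interactive_gpu_fix VERSION
# Exit status 0: VERSION is a 3.0.3 snapshot at or after
#                3.0.3-snap.201111071556 (the build named in the CHANGELOG
#                quote in this thread).
# Exit status 1: VERSION is an earlier 3.0.3 snapshot.
# Exit status 2: VERSION is not a 3.0.3 snapshot string; no verdict given.
has_interactive_gpu_fix() {
    fix=201111071556
    case "$1" in
        3.0.3-snap.[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) return 2 ;;
    esac
    stamp=${1##*-snap.}
    # Both stamps are 12 digits wide, so byte-wise sorting is chronological.
    earliest=$(printf '%s\n%s\n' "$fix" "$stamp" | LC_ALL=C sort | head -n 1)
    [ "$earliest" = "$fix" ]
}
```

For the version reported earlier in the thread, `has_interactive_gpu_fix 3.0.3-snap.201108261653` returns status 1, which matches the decision to upgrade.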
> -----Original Message-----
> From: Gareth.Williams at csiro.au [mailto:Gareth.Williams at csiro.au]
> Sent: Friday, 3 February 2012 8:08 PM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] requesting gpus
>
> > -----Original Message-----
> > From: Craig West [mailto:cwest at vpac.org]
> > Sent: Friday, 3 February 2012 4:55 PM
> > To: torqueusers at supercluster.org
> > Subject: Re: [torqueusers] requesting gpus
> >
> >
> > Hi Gareth,
> >
> > > However when I run a job with the recommended syntax:
> > > http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/3.7schedulinggpus.php
> > >
> > > I get:
> > >> qsub -I -q viz -l nodes=1:ppn=1:gpus=1
> > > qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes
> > >
> > > The torque version is 3.0.3-snap.201108261653
> > >
> > > Note that this is _/not/_ the --enable-nvidia-gpus functionality.
> > > Also note that the server has not been restarted.
> > > The scheduler is moab but I'm pretty sure the job gets rejected well
> > > before moab comes into the picture.
> > >
> > > Does anyone have such a setup working or can anyone see what is wrong
> > > (or have an idea of where to look)?
> >
> > Your pbsnodes output looks correct and similar to our systems.
> >
> > Few questions for you:
> > 1. What version of Moab are you running?
> > 2. Does the Viz queue have the ability to schedule to that node?
> > 3. What is in the "Configured Resources" line of "checknode n121"?
> > It should have a "GPUS: 2" parameter.
> >
> > Cheers,
> > Craig.
> -snip-
>
> 1) Moab Version: 6.0.2 - due for an upgrade anytime
> 2) yes - and I can get jobs there with gpus as a gres but that doesn't
> count them right
> 3) > checknode n121 | grep Configu
> Configured Resources: PROCS: 12 MEM: 94G SWAP: 96G DISK: 137G GPUS: 2
>
> But I think moab is not getting to play a role. I've looked at logs but
> confess that I've not turned up the logging level yet.
>
> Gareth
From nt_mahmood at yahoo.com Sat Feb 4 03:58:49 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Sat, 4 Feb 2012 02:58:49 -0800 (PST)
Subject: [torqueusers] changing the column width of "qstat"
Message-ID: <1328353129.37528.YahooMailNeo@web111704.mail.gq1.yahoo.com>
Dear all,
Is it possible to change the column width of "qstat"? The terminal is wide enough, but job names are shown truncated, for example "tiger-st-5000-64" when the complete job name is "tiger-st-5000-64-64", so the last three characters are cut off.
// Naderan *Mahmood;
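One way to see the untruncated name is the full job listing. A minimal sketch, assuming that qstat -f prints a "Job Id:" line followed by an indented "Job_Name = ..." attribute for each job; full_job_names is a hypothetical helper:

```shell
#!/bin/sh
# full_job_names
# Reads `qstat -f` output on stdin and prints "<job id><TAB><full job name>"
# for each job, without the column truncation of the default listing.
full_job_names() {
    awk '
        /^Job Id:/ { id = $3 }
        /^ *Job_Name = / {
            sub(/^ *Job_Name = /, "")
            printf "%s\t%s\n", id, $0
        }
    '
}
```

Run as `qstat -f | full_job_names`.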
From leggett at mcs.anl.gov Sat Feb 4 10:44:40 2012
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Sat, 4 Feb 2012 11:44:40 -0600
Subject: [torqueusers] Torque not honoring max_user_queuable
In-Reply-To:
References:
<22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov>
<93149F70-EFCC-44F6-9CE3-022C475BEE70@mcs.anl.gov>
Message-ID:
All jobs go through a routing queue.
On Feb 3, 2012, at 11:44 AM, David Beer wrote:
> I'm also curious - is this done through a routing queue or routing queues? Is it class remapping in Moab? It looks like it isn't qsub -q
>
> David
>
> On Fri, Feb 3, 2012 at 10:15 AM, Ti Leggett wrote:
> submit_args = -A CI-MCB000083 -l walltime=48:00:00,
> mppwidth=48 /lustre/beagle/linpyl/project.qsub
>
> On Feb 3, 2012, at 11:03 AM, David Beer wrote:
>
> > If you qstat -f a few of the jobs you can see the submit arguments. At higher log levels the entire job submission is there, but I don't know if your log levels would be that high.
> >
> > David
> >
> > On Fri, Feb 3, 2012 at 9:21 AM, Ti Leggett wrote:
> > I'm assuming using qsub, but it's other users doing this so I'm not 100% sure. Is there a way to find out from logs or other tools?
> >
> > On Feb 3, 2012, at 10:06 AM, David Beer wrote:
> >
> > > Ti,
> > >
> > > How are you submitting the jobs? I assume this is TORQUE 2.5.9?
> > >
> > > David
> > >
> > > On Fri, Feb 3, 2012 at 8:27 AM, Ti Leggett wrote:
> > > We've set queue limits that don't seem to be honored:
> > >
> > > sdb:~ # qstat | grep linpyl | grep batch | wc
> > > 945 5670 82215
> > >
> > > sdb:~ # qmgr -c "print queue batch"
> > > #
> > > # Create queues and set their attributes.
> > > #
> > > #
> > > # Create and define queue batch
> > > #
> > > create queue batch
> > > set queue batch queue_type = Execution
> > > set queue batch max_user_queuable = 500
> > > set queue batch resources_min.mppwidth = 1
> > > set queue batch resources_default.mppwidth = 24
> > > set queue batch resources_default.walltime = 00:10:00
> > > set queue batch acl_group_enable = False
> > > set queue batch resources_available.nodes = 726
> > > set queue batch enabled = True
> > > set queue batch started = True
> > >
> > > How would it be possible for a user to have 945 jobs in the queue when the limit should be 500?
From Gareth.Williams at csiro.au Sun Feb 5 21:50:16 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Mon, 6 Feb 2012 15:50:16 +1100
Subject: [torqueusers] node table is corrupt at index 0
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74DE1@exvic-mbx04.nexus.csiro.au>
Hi,
Is anybody familiar with the following server log entry, or should I go look at the source code :-(
We are having some issues with bad client connections and I went looking at the logs. I don't think this entry is actually related, but I can't be sure. There have only been a few recently, except for one day a few weeks ago when there were 11545!
Gareth
PBS_Server;Svr;WARNING;ALERT: node table is corrupt at index 0
From pat.callahan at gd-ais.com Sun Feb 5 09:15:56 2012
From: pat.callahan at gd-ais.com (Callahan, Patrick M.)
Date: Sun, 5 Feb 2012 11:15:56 -0500
Subject: [torqueusers] changing column width
Message-ID: <6DDE2978C880C64EB5323275164A30886335144CF2@EADC-E-MABPRD01.ad.gd-ais.com>
You may change the column width (job name) in the source code and recompile. I have done that with the versions we use. Look at either MAXNAMELEN in pbs_ifl.h or PBS_JOBBASE in server_limits.h. I don't have the source code in front of me so ymmv.
Patrick
From s.breedveld at erasmusmc.nl Sun Feb 5 13:03:12 2012
From: s.breedveld at erasmusmc.nl (Sebastiaan Breedveld)
Date: Sun, 05 Feb 2012 21:03:12 +0100
Subject: [torqueusers] unsubscribe
Message-ID: <4F2EE080.8030008@erasmusmc.nl>
unsubscribe
--
Sebastiaan Breedveld, MSc.
Ph.D. student
Erasmus MC - Daniel den Hoed Cancer Center
Department of Radiation Oncology
Groene Hilledijk 301
3075 EA Rotterdam
The Netherlands
Phone: +31 10 7042693
Room: Gs-20
From dbeer at adaptivecomputing.com Mon Feb 6 10:19:43 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 6 Feb 2012 10:19:43 -0700
Subject: [torqueusers] node table is corrupt at index 0
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102CDD74DE1@exvic-mbx04.nexus.csiro.au>
References: <007DECE986B47F4EABF823C1FBB19C620102CDD74DE1@exvic-mbx04.nexus.csiro.au>
Message-ID:
To me this looks like a non-issue, but the message comes from a function
that logs bad client connections, so perhaps that is the relation. The error
is logged if a node slot or that node's addresses are NULL, but this
condition is permitted elsewhere in the code. I would work on solving the
client connection problems first and then see if this still causes you issues.
David
On Sun, Feb 5, 2012 at 9:50 PM, wrote:
> Hi,
>
> Is anybody familiar with the following server log entry or should I go
> look at source code :-(
> We are having some issues with bad client connections and I went looking
> at logs. I don't think it is actually related but can't be sure. There's
> only been a few recently except one day a few weeks ago there were 11545!
>
> Gareth
>
> PBS_Server;Svr;WARNING;ALERT: node table is corrupt at index 0
--
David Beer | Software Engineer
Adaptive Computing
From jjc at iastate.edu Mon Feb 6 10:56:54 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Mon, 6 Feb 2012 17:56:54 +0000
Subject: [torqueusers] Using Torque as a meta-scheduler over several
clusters; any advice?
Message-ID: <242421BFAF465844BE24EB90BB97E22101948F66@ITSDAG1D.its.iastate.edu>
Currently, we use a different head node for each cluster we deploy.
The head node runs the server and scheduler. I've been asked if it is possible to
schedule several clusters from a single login node.
From a hardware standpoint, one could use virtual machines to accomplish
this, but the idea was to direct all the jobs from a single login node
to make it easier for users.
What is being described sounds a lot like a grid.
Has anyone done this?
Can schedulers run on the "head node" of each cluster with a single pbs_server
running on the master head node to interact with the users?
Is the way to do this to set specific properties on the nodes from specific clusters
(cluster1, cluster2, ...), to use MAUI, and to have need_nodes set for different queues
(small_cluster1, ...)?
Has someone "rolled their own" meta-scheduler?
Thanks,
- Jim
James Coyle, PhD
High Performance Computing Group
115 Durham Center
Iowa State Univ. phone: (515)-294-2099
Ames, Iowa 50011 web: http://jjc.public.iastate.edu/
From dbeer at adaptivecomputing.com Mon Feb 6 10:59:42 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 6 Feb 2012 10:59:42 -0700
Subject: [torqueusers] Using Torque as a meta-scheduler over several
clusters; any advice?
In-Reply-To: <242421BFAF465844BE24EB90BB97E22101948F66@ITSDAG1D.its.iastate.edu>
References: <242421BFAF465844BE24EB90BB97E22101948F66@ITSDAG1D.its.iastate.edu>
Message-ID:
James,
I don't know about a free version, but Moab has done grid scheduling for
years.
David
On Mon, Feb 6, 2012 at 10:56 AM, Coyle, James J [ITACD] wrote:
> Currently, we use a different head node for each cluster we deploy.
> The head node runs the server and scheduler. I've been asked if it
> possible to schedule several clusters from a single login node.
> From a hardware standpoint, one could use virtual machines to accomplish
> this, but the idea was to somehow direct all the jobs from a single login
> node to somehow make it easier for users.
>
> What is being described sounds a lot like a grid.
>
> Has anyone done this?
>
> Can schedulers run on the "head node" of each clusters with a single
> pbs_server running on the master head node to interact with the users?
>
> Is the way to do this to set specific properties on the nodes from
> specific clusters (cluster1, cluster2, ...) to use MAUI and to have
> need_nodes set for different queues small_cluster1 ... ?
>
> Has someone "rolled their own" meta-scheduler?
>
> Thanks,
> - Jim
>
> James Coyle, PhD
> High Performance Computing Group
> 115 Durham Center
> Iowa State Univ. phone: (515)-294-2099
> Ames, Iowa 50011 web: http://jjc.public.iastate.edu/
>
--
David Beer | Software Engineer
Adaptive Computing
From mej at lbl.gov Mon Feb 6 13:21:10 2012
From: mej at lbl.gov (Michael Jennings)
Date: Mon, 6 Feb 2012 12:21:10 -0800
Subject: [torqueusers] Using Torque as a meta-scheduler over several
clusters; any advice?
In-Reply-To: <242421BFAF465844BE24EB90BB97E22101948F66@ITSDAG1D.its.iastate.edu>
References: <242421BFAF465844BE24EB90BB97E22101948F66@ITSDAG1D.its.iastate.edu>
Message-ID: <20120206202109.GI2104@lbl.gov>
On Monday, 06 February 2012, at 17:56:54 (+0000),
Coyle, James J [ITACD] wrote:
> Currently, we use a different head node for each cluster we deploy.
> The head node runs the server and scheduler. I've been asked if it possible to
> schedule several clusters from a single login node.
> From a hardware standpoint, one could use virtual machines to accomplish
> this, but the idea was to somehow direct all the jobs from a single login node
> to somehow make it easier for users.
>
> What is being described sounds a lot like a grid.
>
> Has anyone done this?
We have a single instance of TORQUE and Moab governing roughly 20
clusters, but you can also use Moab in a grid scenario as
master/slaves or equal peers.
Our "supercluster" clusters all share common interactive nodes,
a login gateway, NFS- and Lustre-based storage, and a master node.
> Can schedulers run on the "head node" of each clusters with a single pbs_server
> running on the master head node to interact with the users?
>
> Is the way to do this to set specific properties on the nodes from specific clusters
> (cluster1, cluster2, ...) to use MAUI and to have need_nodes set for different queues
> small_cluster1 ... ?
That's certainly one way to do it. We have one or more queues for
each cluster, and the ACLs are set up within TORQUE and Moab to
restrict access to those users/groups who own each cluster.
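The queue-per-cluster layout described above could be generated along these lines. This is only a sketch: the queue, node property, and group names are placeholders, and mapping a queue to nodes through resources_default.neednodes is one common TORQUE approach, not necessarily what this site uses:

```shell
#!/bin/sh
# cluster_queue_cmds QUEUE NODE_PROPERTY GROUP
# Prints qmgr commands for an execution queue that only GROUP may use and
# whose jobs request nodes carrying NODE_PROPERTY by default.
# Nothing is changed on the server; the commands are only written to stdout.
cluster_queue_cmds() {
    q=$1
    prop=$2
    group=$3
    cat <<EOF
create queue $q
set queue $q queue_type = Execution
set queue $q acl_group_enable = True
set queue $q acl_groups = $group
set queue $q resources_default.neednodes = $prop
set queue $q enabled = True
set queue $q started = True
EOF
}
```

The output can be reviewed first and then piped to qmgr, for example `cluster_queue_cmds small_cluster1 cluster1 physics | qmgr`.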
> Has someone "rolled their own" meta-scheduler?
You're likely to spend a lot more money doing this than it would cost
for a Moab license. I don't know of any meta-scheduling packages
which currently exist, but Moab's architecture supports a wide variety
of functionality in this vein.
HTH,
Michael
--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615
From dbeer at adaptivecomputing.com Mon Feb 6 16:48:30 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 6 Feb 2012 16:48:30 -0700
Subject: [torqueusers] Updated Beta Build
Message-ID:
All,
More fixes have been made for the beta, most importantly this build has
fixes to the gpu reporting process (for nvidia-enabled gpus).
http://www.adaptivecomputing.com/resources/downloads/torque/4.0-beta/torque-4.0.0-snap.201202061640.tar.gz
--
David Beer | Software Engineer
Adaptive Computing
From jascha.wang at gmail.com Tue Feb 7 02:33:12 2012
From: jascha.wang at gmail.com (Xiangqian Wang)
Date: Tue, 7 Feb 2012 17:33:12 +0800
Subject: [torqueusers] queue to node mapping is wrong when use '-l procs'
option
Message-ID:
I am having trouble with the queue-to-node mapping feature of the torque/maui system. I
use torque 2.5.8 and maui 3.2.6p21. The simple job script contains a procs
option:
#!/bin/sh
#PBS -N simple-job
#PBS -l procs=3
#PBS -q fluent
#PBS -d /opt/share/job
cd $PBS_O_WORKDIR
date
sleep 30
date
The 'fluent' queue is mapped to a node 'cnode01' with 4 processors, the
setting is shown below:
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Create and define queue fluent
#
create queue fluent
set queue fluent queue_type = Execution
set queue fluent acl_host_enable = False
set queue fluent acl_hosts = cnode01
set queue fluent enabled = True
set queue fluent started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = snode01
set server acl_roots = root@*
set server managers = root at snode01
set server operators = root at snode01
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
set server auto_node_np = True
set server next_job_number = 94
set server display_job_server_suffix = False
The job should use only the single node 'cnode01', but the allocation also
includes another node. See this part of the 'qstat -f' output:
exec_host = snode01/1+snode01/0+cnode01/0
...
Resource_List.neednodes = cnode01
Resource_List.procs = 3
Can anyone give me some suggestions? Any help would be greatly appreciated.
Xiangqian
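The mismatch in the qstat -f output above can be checked mechanically. A minimal sketch that lists every host in an exec_host value that is not in an allowed list; hosts_outside is a hypothetical helper name:

```shell
#!/bin/sh
# hosts_outside ALLOWED_HOSTS
# Reads an exec_host value (for example "snode01/1+snode01/0+cnode01/0")
# on stdin and prints each distinct host that is not in the
# space-separated ALLOWED_HOSTS list.
hosts_outside() {
    tr '+' '\n' | sed 's|/.*$||' | sort -u | while read -r host; do
        [ -n "$host" ] || continue
        case " $1 " in
            *" $host "*) ;;                 # host is allowed
            *) printf '%s\n' "$host" ;;     # host is outside the list
        esac
    done
}
```

For the value in this message, `printf '%s\n' 'snode01/1+snode01/0+cnode01/0' | hosts_outside cnode01` prints snode01.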
From sm4082 at nyu.edu Tue Feb 7 06:40:45 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Tue, 7 Feb 2012 08:40:45 -0500
Subject: [torqueusers] queue to node mapping is wrong when use '-l
procs' option
In-Reply-To:
References:
Message-ID: <434A0810-1C69-408F-9372-484586F7CF4D@nyu.edu>
Hi,
Instead of using
> set queue fluent acl_host_enable = False
> set queue fluent acl_hosts = cnode01
I assigned a feature to the nodes that I wanted my jobs to run on (or that I wanted to place under a special queue) and requested that feature in the PBS script like this:
#PBS -l feature=
Moab can put the jobs on the nodes with those features. I'm not sure how maui does it. I have a qsub wrapper that adds this feature line depending on users' requests.
To give features to nodes, I used
qmgr -c 'set node properties += '
For example, our p48 nodes have features like chassis0, chassis1, etc. to indicate the chassis they belong to. Since we ask for a specific queue with specific features, jobs always land on the right nodes with the right feature.
Sreedhar.
On Feb 7, 2012, at 4:33 AM, Xiangqian Wang wrote:
> I failed to test queue to node mapping feature of torque/maui system, I use torque 2.5.8 and maui 3.2.6p21. the simple job script contains a procs option:
>
> #!/bin/sh
> #PBS -N simple-job
> #PBS -l procs=3
> #PBS -q fluent
> #PBS -d /opt/share/job
> cd $PBS_O_WORKDIR
> date
> sleep 30
> date
>
> The 'fluent' queue is mapped to a node 'cnode01' with 4 processors, the setting is shown below:
>
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 01:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Create and define queue fluent
> #
> create queue fluent
> set queue fluent queue_type = Execution
> set queue fluent acl_host_enable = False
> set queue fluent acl_hosts = cnode01
> set queue fluent enabled = True
> set queue fluent started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = snode01
> set server acl_roots = root@*
> set server managers = root at snode01
> set server operators = root at snode01
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server mom_job_sync = True
> set server keep_completed = 300
> set server auto_node_np = True
> set server next_job_number = 94
> set server display_job_server_suffix = False
>
> The job should use a single node 'cnode01' , while the allocated node contains another node. see part of 'qstat -f' output:
>
> exec_host = snode01/1+snode01/0+cnode01/0
> ...
> Resource_List.neednodes = cnode01
> Resource_List.procs = 3
>
> Can anyone give me some suggestion, it'll be greatly appreciated.
>
> Xiangqian
---
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110
From david at unistra.fr Tue Feb 7 14:27:29 2012
From: david at unistra.fr (R. David)
Date: Tue, 7 Feb 2012 22:27:29 +0100
Subject: [torqueusers] queue to node mapping is wrong when use '-l
procs' option
In-Reply-To:
References:
Message-ID: <5692CC52-F8C0-4002-A129-07EF722F6B74@unistra.fr>
On 7 Feb 2012, at 10:33, Xiangqian Wang wrote:
> I failed to test queue to node mapping feature of torque/maui system, I use torque 2.5.8 and maui 3.2.6p21. the simple job script contains a procs option:
>
> #!/bin/sh
> #PBS -N simple-job
> #PBS -l procs=3
> #PBS -q fluent
> #PBS -d /opt/share/job
> cd $PBS_O_WORKDIR
> date
> sleep 30
> date
>
> The 'fluent' queue is mapped to a node 'cnode01' with 4 processors, the setting is shown below:
>
[...]
> The job should use a single node 'cnode01' , while the allocated node contains another node. see part of 'qstat -f' output:
>
> exec_host = snode01/1+snode01/0+cnode01/0
> ...
> Resource_List.neednodes = cnode01
> Resource_List.procs = 3
>
> Can anyone give me some suggestion, it'll be greatly appreciated.
You should probably use -l nodes=XX rather than procs=XX.
Depending on how you configured maui, you will have to write either nodes=XX:ppn=YY or just nodes=XX.
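As a rough sketch of that suggestion, the original job script could request three processors on a single node with nodes=1:ppn=3 in place of procs=3. The file name simple-job.sh is only an illustration, and whether the ppn part is needed depends on the maui configuration. The snippet only writes the script to disk; the submission is left as a comment.

```shell
# Write a variant of the original job script that asks for 3 processors
# on one node (nodes=1:ppn=3) instead of using procs=3.
cat > simple-job.sh <<'EOF'
#!/bin/sh
#PBS -N simple-job
#PBS -l nodes=1:ppn=3
#PBS -q fluent
#PBS -d /opt/share/job
date
sleep 30
date
EOF

# Submit it on a host with the Torque client tools installed:
#   qsub simple-job.sh
```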
Regards,
R. David
From jascha.wang at gmail.com Tue Feb 7 18:51:18 2012
From: jascha.wang at gmail.com (Xiangqian Wang)
Date: Wed, 8 Feb 2012 09:51:18 +0800
Subject: [torqueusers] queue to node mapping is wrong when use '-l
procs' option
In-Reply-To: <434A0810-1C69-408F-9372-484586F7CF4D@nyu.edu>
References:
<434A0810-1C69-408F-9372-484586F7CF4D@nyu.edu>
Message-ID:
It seems that the '-l feature' option has no effect with maui-3.2.6p21; the
jobs run on nodes without the requested feature.
2012/2/7 Sreedhar Manchu
> Hi,
>
> Instead of using
>
> set queue fluent acl_host_enable = False
> set queue fluent acl_hosts = cnode01
>
>
> I set a feature to the node I wanted my jobs to run or wanted it to be
> under a special queue, I gave a certain feature to the nodes and put it in
> the pbs script like this:
>
> #PBS -l feature=
>
> Moab can put the jobs on the nodes with those features. I'm not sure how
> maui does it. I have a qsub wrapper that adds this feature line depending
> on users' requests.
>
> To give features to nodes, I used
>
> qmgr -c 'set node properties += '
>
> For example, our p48 nodes have features like chassis0, chassis1, etc to
> indicate the chassis they belong to. Since we are asking for a specific
> queue with specific features, jobs always go onto right nodes with right
> feature.
>
> Sreedhar.
>
> On Feb 7, 2012, at 4:33 AM, Xiangqian Wang wrote:
>
> I failed to test queue to node mapping feature of torque/maui system, I
> use torque 2.5.8 and maui 3.2.6p21. the simple job script contains a procs
> option:
>
> #!/bin/sh
> #PBS -N simple-job
> #PBS -l procs=3
> #PBS -q fluent
> #PBS -d /opt/share/job
> cd $PBS_O_WORKDIR
> date
> sleep 30
> date
>
> The 'fluent' queue is mapped to a node 'cnode01' with 4 processors, the
> setting is shown below:
>
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 01:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Create and define queue fluent
> #
> create queue fluent
> set queue fluent queue_type = Execution
> set queue fluent acl_host_enable = False
> set queue fluent acl_hosts = cnode01
> set queue fluent enabled = True
> set queue fluent started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = snode01
> set server acl_roots = root@*
> set server managers = root at snode01
> set server operators = root at snode01
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server mom_job_sync = True
> set server keep_completed = 300
> set server auto_node_np = True
> set server next_job_number = 94
> set server display_job_server_suffix = False
>
> The job should use a single node 'cnode01' , while the allocated node
> contains another node. see part of 'qstat -f' output:
>
> exec_host = snode01/1+snode01/0+cnode01/0
> ...
> Resource_List.neednodes = cnode01
> Resource_List.procs = 3
>
> Can anyone give me some suggestion, it'll be greatly appreciated.
>
> Xiangqian
>
>
> ---
> Sreedhar Manchu
> HPC Support Specialist
> New York University
> 251 Mercer Street
> New York, NY 10012-1110
>
>
>
>
>
From jascha.wang at gmail.com Tue Feb 7 19:02:50 2012
From: jascha.wang at gmail.com (Xiangqian Wang)
Date: Wed, 8 Feb 2012 10:02:50 +0800
Subject: [torqueusers] queue to node mapping is wrong when use '-l
procs' option
In-Reply-To: <5692CC52-F8C0-4002-A129-07EF722F6B74@unistra.fr>
References:
<5692CC52-F8C0-4002-A129-07EF722F6B74@unistra.fr>
Message-ID:
Hello David,
I also tested the '-l nodes:ppn' option, and the node allocation is right, as
you said.
For the '-l procs' option combined with queue to node mapping, I wonder if it
is possible to configure it to work?
Xiangqian
2012/2/8 R. David
>
> Le 7 f?vr. 2012 ? 10:33, Xiangqian Wang a ?crit :
>
> > I failed to test queue to node mapping feature of torque/maui system, I
> use torque 2.5.8 and maui 3.2.6p21. the simple job script contains a procs
> option:
> >
> > #!/bin/sh
> > #PBS -N simple-job
> > #PBS -l procs=3
> > #PBS -q fluent
> > #PBS -d /opt/share/job
> > cd $PBS_O_WORKDIR
> > date
> > sleep 30
> > date
> >
> > The 'fluent' queue is mapped to a node 'cnode01' with 4 processors, the
> setting is shown below:
> >
>
> [...]
>
> > The job should use a single node 'cnode01' , while the allocated node
> contains another node. see part of 'qstat -f' output:
> >
> > exec_host = snode01/1+snode01/0+cnode01/0
> > ...
> > Resource_List.neednodes = cnode01
> > Resource_List.procs = 3
> >
> > Can anyone give me some suggestion, it'll be greatly appreciated.
>
> You should probably use the -l nodes=XX rather than procs=XX
>
> Depending on how you configured maui, you will have to write
> nodes=XX:ppn=YY or just nodes=XX
>
>
> Regards,
> R. David
>
>
>
From Gareth.Williams at csiro.au Tue Feb 7 19:24:30 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Wed, 8 Feb 2012 13:24:30 +1100
Subject: [torqueusers] queue to node mapping is wrong when use
'-l procs' option
In-Reply-To:
References:
<5692CC52-F8C0-4002-A129-07EF722F6B74@unistra.fr>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74DF5@exvic-mbx04.nexus.csiro.au>
Sounds like there is a bug.
You might be able to work around it by using the queue to node mapping described here:
http://www.supercluster.org/pipermail/mauiusers/2012-February/004839.html
Gareth
> I failed to test queue to node mapping feature of torque/maui system, I use torque 2.5.8 and maui 3.2.6p21. the simple job script contains a procs option:
>
> #!/bin/sh
> #PBS -N simple-job
> #PBS -l procs=3
> #PBS -q fluent
> #PBS -d /opt/share/job
> cd $PBS_O_WORKDIR
> date
> sleep 30
> date
-snip-
From U0850037 at hud.ac.uk Wed Feb 8 06:57:20 2012
From: U0850037 at hud.ac.uk (Ibad Kureshi U0850037)
Date: Wed, 8 Feb 2012 13:57:20 +0000
Subject: [torqueusers] Step Change in Job Arrays
Message-ID:
Hello,
I was wondering if someone could tell me how to adjust the step size in a job array. We are running Torque 2.5.7 with the PBS_SCHEDD on a small cluster and our users want to submit arrays.
On the SGE and the Moab/Torque based systems
$ -t 1-20:2
or
#PBS -t 1-20:2
respectively, gives them 10 jobs with even ID numbers.
How can this be done with Torque? It throws a "qsub: Bad Job Array Request" error.
Have not been able to find much literature on this.
Thanks
-Ibad Kureshi
---
This transmission is confidential and may be legally privileged. If you receive it in error, please notify us immediately by e-mail and remove it from your system. If the content of this e-mail does not relate to the business of the University of Huddersfield, then we do not endorse it and will accept no liability.
From glen.beane at gmail.com Wed Feb 8 07:02:52 2012
From: glen.beane at gmail.com (Glen Beane)
Date: Wed, 8 Feb 2012 09:02:52 -0500
Subject: [torqueusers] Step Change in Job Arrays
In-Reply-To:
References:
Message-ID:
On Wed, Feb 8, 2012 at 8:57 AM, Ibad Kureshi U0850037
wrote:
> Hello,
>
> I was wondering is someone could tell me how to adjust the step size in a job array. We are running Torque 2.5.7 with the PBS_SCHEDD on a small cluster and our users want to submit arrays.
>
> One the SGE and the Moab/Torque based systems
>
> $ -t 1-20:2
>
> or
>
> #PBS -t 1-20:2
>
> respectively, gives them 10 jobs with even ID numbers.
>
> How can this be done with Torque? It throws out "qsub: Bad Job Array Request" error
>
> Have not been able to find much literature on this.
>
> Thanks
This is not currently supported, but it is a great feature request.
Unfortunately, the only option would be to explicitly specify each array ID:
#PBS -t 2,4,6,8,10 ...20
From acaird at umich.edu Wed Feb 8 07:28:16 2012
From: acaird at umich.edu (Andrew Caird)
Date: Wed, 8 Feb 2012 09:28:16 -0500
Subject: [torqueusers] Step Change in Job Arrays
In-Reply-To:
References:
Message-ID:
On Wed, Feb 8, 2012 at 9:02 AM, Glen Beane wrote:
> On Wed, Feb 8, 2012 at 8:57 AM, Ibad Kureshi U0850037
> wrote:
> > Hello,
> >
> > I was wondering is someone could tell me how to adjust the step size in
> a job array. We are running Torque 2.5.7 with the PBS_SCHEDD on a small
> cluster and our users want to submit arrays.
> >
> > One the SGE and the Moab/Torque based systems
> >
> > $ -t 1-20:2
> >
> > or
> >
> > #PBS -t 1-20:2
> >
> > respectively, gives them 10 jobs with even ID numbers.
> >
> > How can this be done with Torque? It throws out "qsub: Bad Job Array
> Request" error
> >
> > Have not been able to find much literature on this.
> >
> > Thanks
>
>
> this is not currently supported, but it is a great feature request.
>
> unfortunately the only option would be to explicitly specify each array ID:
>
> #PBS -t 2,4,6,8,10 ...20
Or:
qsub -t `seq -s, 2 2 20` pbsfile.txt
in case you don't want to type all the numbers.
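For reference, a small sketch of what the seq call above expands to. It assumes GNU coreutils seq, where -s sets the separator and the three arguments are first, increment and last; other seq implementations may add a trailing separator. The variable name ids is only for illustration.

```shell
# Build the comma-separated list of array IDs: start at 2, step 2, stop at 20.
ids=$(seq -s, 2 2 20)
echo "$ids"

# The list would then be passed to qsub (left commented out here):
#   qsub -t "$ids" pbsfile.txt
```

Changing the first and last arguments (for example seq -s, 1 2 19) gives the odd IDs instead.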
--andy
From U0850037 at hud.ac.uk Wed Feb 8 07:47:36 2012
From: U0850037 at hud.ac.uk (Ibad Kureshi U0850037)
Date: Wed, 8 Feb 2012 14:47:36 +0000
Subject: [torqueusers] Step Change in Job Arrays
In-Reply-To:
References:
Message-ID:
Thanks Glen, Andy
Andy: Nice!
-Ibad
________________________________________
From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] On Behalf Of Andrew Caird [acaird at umich.edu]
Sent: Wednesday, February 08, 2012 2:28 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Step Change in Job Arrays
On Wed, Feb 8, 2012 at 9:02 AM, Glen Beane > wrote:
On Wed, Feb 8, 2012 at 8:57 AM, Ibad Kureshi U0850037
> wrote:
> Hello,
>
> I was wondering is someone could tell me how to adjust the step size in a job array. We are running Torque 2.5.7 with the PBS_SCHEDD on a small cluster and our users want to submit arrays.
>
> One the SGE and the Moab/Torque based systems
>
> $ -t 1-20:2
>
> or
>
> #PBS -t 1-20:2
>
> respectively, gives them 10 jobs with even ID numbers.
>
> How can this be done with Torque? It throws out "qsub: Bad Job Array Request" error
>
> Have not been able to find much literature on this.
>
> Thanks
this is not currently supported, but it is a great feature request.
unfortunately the only option would be to explicitly specify each array ID:
#PBS -t 2,4,6,8,10 ...20
Or:
qsub -t `seq -s, 2 2 20` pbsfile.txt
in case you don't want to type all the numbers.
--andy
From R.M.Krug at gmail.com Thu Feb 9 02:16:07 2012
From: R.M.Krug at gmail.com (Rainer M Krug)
Date: Thu, 09 Feb 2012 10:16:07 +0100
Subject: [torqueusers] Specifying nodes which can be used in array job
Message-ID:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi
Assume I have a cluster of 10 nodes (node01, ... node10), of which I
am not the administrator.
Some nodes are set up slightly differently, so that a certain job only
runs on nodes node01 to node05.
So I would like to submit an array job and specify "only use
node01, node02, node03, node04 or node05 to run each individual job".
How can I do that? I know that I can use -l to specify resource
requirements, but if I specify nodes=..., *each* job will allocate
*all* nodes for the job, which is not what I want - each individual
job should run on one of the nodes.
so:
qsub the_script.sub -t 1-10
and how do I specify the nodes?
Thanks,
Rainer
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAk8zjtcACgkQoYgNqgF2egpfBwCfdntKs0vrjLQzJP5soVA0s4+5
Ui4An2IoPirSo0oBKd/CRWRmo1paHGD+
=1dMi
-----END PGP SIGNATURE-----
From jascha.wang at gmail.com Thu Feb 9 03:17:53 2012
From: jascha.wang at gmail.com (Xiangqian Wang)
Date: Thu, 9 Feb 2012 18:17:53 +0800
Subject: [torqueusers] queue to node mapping is wrong when use '-l
procs' option
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102CDD74DF5@exvic-mbx04.nexus.csiro.au>
References:
<5692CC52-F8C0-4002-A129-07EF722F6B74@unistra.fr>
<007DECE986B47F4EABF823C1FBB19C620102CDD74DF5@exvic-mbx04.nexus.csiro.au>
Message-ID:
Thank you, Gareth
It seems that adding a standing reservation to the maui config has the same
effect as my earlier torque config. Here is what I did:
I added the following configuration to maui.cfg:
SRCFG[queue01] CLASSLIST=queue01
SRCFG[queue01] NODEFEATURES=queue01
SRCFG[queue01] PERIOD=INFINITY
SRCFG[queue01] RESOURCES=PROCS:1
and created a queue named 'queue01', then set the feature 'queue01' on node
'cnode01'.
When I submit a job through queue01, cnode01 is not allocated.
It seems to be a bug in maui.
Xiangqian
2012/2/8
> Sounds like there is a bug.
>
> You might be able to work around it by using the queue to node mapping
> described here:
> http://www.supercluster.org/pipermail/mauiusers/2012-February/004839.html
>
> Gareth
>
> > I failed to test queue to node mapping feature of torque/maui system, I
> use torque 2.5.8 and maui 3.2.6p21. the simple job script contains a procs
> option:
> >
> > #!/bin/sh
> > #PBS -N simple-job
> > #PBS -l procs=3
> > #PBS -q fluent
> > #PBS -d /opt/share/job
> > cd $PBS_O_WORKDIR
> > date
> > sleep 30
> > date
>
> -snip-
>
From jascha.wang at gmail.com Thu Feb 9 03:33:21 2012
From: jascha.wang at gmail.com (Xiangqian Wang)
Date: Thu, 9 Feb 2012 18:33:21 +0800
Subject: [torqueusers] How to set the calling interval of prologue script
when job queued by it
Message-ID:
I need a prologue script to ensure some preparation is done before my job
starts, here is my simple script file:
#!/bin/sh
if [ -f /opt/share/prepared ]
then
    echo `date` ": ready"
    exit 0
fi
echo `date` ": not ready"
exit 2
Using the following job script, I can prevent the job from running before
the preparation file appears.
#!/bin/sh
#PBS -N prologure-job
#PBS -l nodes=snode01
#PBS -l prologue=/opt/share/shell/prologue.scs
#PBS -q batch
#PBS -d /opt/share/job
#PBS -p 10
#PBS -o $PBS_JOBID.o
#PBS -e $PBS_JOBID.e
# cd $PBS_O_WORKDIR
date
ping localhost -c 20
date
What I'm not satisfied with is that the prologue script is called very
frequently while the job is queued, approximately once per second; see my job
output file:
Thu Feb 9 18:07:32 CST 2012 : not ready
Thu Feb 9 18:07:33 CST 2012 : not ready
Thu Feb 9 18:07:34 CST 2012 : not ready
Thu Feb 9 18:07:35 CST 2012 : not ready
Thu Feb 9 18:07:36 CST 2012 : not ready
Thu Feb 9 18:07:37 CST 2012 : not ready
Thu Feb 9 18:07:38 CST 2012 : not ready
Thu Feb 9 18:07:39 CST 2012 : not ready
Thu Feb 9 18:07:40 CST 2012 : not ready
Thu Feb 9 18:07:41 CST 2012 : not ready
Thu Feb 9 18:07:42 CST 2012 : not ready
...
and the job state switches between 'Q' and 'R', irregularly.
What I want to know is:
1. How can I set a longer interval between calls to the prologue script?
Something like 5+ minutes would be fine.
2. Is it normal for the job state to switch between 'Q' and 'R'? Shouldn't it
always be 'Q'?
Thanks for your help.
Xiangqian
From Gareth.Williams at csiro.au Thu Feb 9 03:36:16 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Thu, 9 Feb 2012 21:36:16 +1100
Subject: [torqueusers] queue to node mapping is wrong when use
'-l procs' option
In-Reply-To:
References:
<5692CC52-F8C0-4002-A129-07EF722F6B74@unistra.fr>
<007DECE986B47F4EABF823C1FBB19C620102CDD74DF5@exvic-mbx04.nexus.csiro.au>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74E05@exvic-mbx04.nexus.csiro.au>
Xiangqian,
You also need:
CLASSCFG[queue01] DEFAULT.FEATURES=queue01
To make the queue go to the nodes with the feature.
The SRCFG part stops other jobs going to the nodes with the feature.
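Put together, the two pieces quoted in this thread would look roughly like the sketch below. It only writes the combined lines to a scratch file for review; the file name maui-queue01.cfg is made up for this example, and in practice the lines would be merged into maui.cfg by hand. The node feature itself is assumed to be set separately on the Torque side.

```shell
# Collect the standing reservation (SRCFG) and the class default (CLASSCFG)
# for queue01 in one scratch file. The SRCFG lines keep other jobs off the
# nodes carrying the feature; the CLASSCFG line steers queue01 jobs to them.
cat > maui-queue01.cfg <<'EOF'
SRCFG[queue01] CLASSLIST=queue01
SRCFG[queue01] NODEFEATURES=queue01
SRCFG[queue01] PERIOD=INFINITY
SRCFG[queue01] RESOURCES=PROCS:1
CLASSCFG[queue01] DEFAULT.FEATURES=queue01
EOF

# Show the result so it can be compared with the existing maui.cfg.
cat maui-queue01.cfg
```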
I guess the job went somewhere else. If it was stuck you could run checkjob -v to see why (I think that is OK in maui).
This is turning into a maui discussion so you might want to move to the mauiusers list if you need to continue.
Also, plain text is best for mailing lists. Html is hard to quote sensibly - sorry for the top post :)
Gareth
From: Xiangqian Wang [mailto:jascha.wang at gmail.com]
Sent: Thursday, 9 February 2012 9:18 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] queue to node mapping is wrong when use '-l procs' option
Thank you, Gareth
It seems that adding standing reservation in maui config has the same effect to my torque config before. What I did is:
add following configuration to maui.cfg:
SRCFG[queue01] CLASSLIST=queue01
SRCFG[queue01] NODEFEATURES=queue01
SRCFG[queue01] PERIOD=INFINITY
SRCFG[queue01] RESOURCES=PROCS:1
and create a queue named 'queue01', set feature 'queue01' to a node 'cnode01'
When I submit job through queue01, the cnode01 is not allocated.
Seems that it's a bug of maui.
Xiangqian
2012/2/8
Sounds like there is a bug.
You might be able to work around it by using the queue to node mapping described here:
http://www.supercluster.org/pipermail/mauiusers/2012-February/004839.html
Gareth
> I failed to test queue to node mapping feature of torque/maui system, I use torque 2.5.8 and maui 3.2.6p21. the simple job script contains a procs option:
>
> #!/bin/sh
> #PBS -N simple-job
> #PBS -l procs=3
> #PBS -q fluent
> #PBS -d /opt/share/job
> cd $PBS_O_WORKDIR
> date
> sleep 30
> date
-snip-
From Michael.Zulauf at iberdrolaren.com Thu Feb 9 11:30:09 2012
From: Michael.Zulauf at iberdrolaren.com (Zulauf, Michael)
Date: Thu, 9 Feb 2012 10:30:09 -0800
Subject: [torqueusers] problem with jobs sharing cores
Message-ID:
Hi all. . .
I apologize if this message appears more than once - there was an issue
with my email address and list registration (which I hope is now fixed),
and so I'm having to resend this. . .
Anyway, where I work, we've had a problem for a while that we haven't
been able to resolve. I'm not certain of the cause - if it's related to
Torque, or Maui, or something else. But here goes. . .
We've got a small cluster of 16 nodes, each with dual hex-core
processors. 12 cores per node, 192 cores total. The problem is that if
I launch small jobs, where multiple jobs should be able to share a node
without sharing cores, I instead get cores that are running more than
one process, while other cores are idle. The primary executable is WRF
(weather prediction model), but the problem occurs for other parallel
codes. The codes have been built to utilize MPI (not OpenMP, or
MPI/OpenMP).
As an example, if I launch a series of jobs which request 4 cores each,
I get 3 jobs assigned to each node. That should be fine, as each node
has 12 cores, and there should be no need to share cores. Instead, I
get 4 "overloaded" cores (each running 3 processes) and 8 idle cores.
Obviously not an ideal situation. If I submit only a single small job,
in which case it's alone on a node, then it runs great. Similarly, if I
launch a large job which spans more than one node, it also works well -
as long as it's not sharing nodes with other jobs. The problem only
occurs (and always occurs) when parallel jobs share a node. BTW, the
qsub command does not explicitly request specific cores, or anything
like that.
I'm not the administrator - just the primary user. The administrator
(who was not previously familiar with Torque/Maui) has been struggling
with this for a bit, and is rather busy with other duties, so I thought
I'd check in here to see if anybody had suggestions I could pass along.
Here are some specifics, as far as I know them:
HP blade hardware
dual Intel Xeon X5670 processors
Infiniband interconnect (not an issue in this case?)
the CentOS equivalent of Red Hat 4.1.2-48 (not sure of what that is
exactly)
Torque 3.0.2
mvapich2-1.7rc1
PGI7.2-5 compilers
WRF 3.3.1
Any thoughts? I've probably left out relevant information. If so,
please ask for clarification.
Thanks,
Mike
--
Mike Zulauf
Meteorologist, Lead Senior
Asset Optimization
Iberdrola Renewables
1125 NW Couch, Suite 700
Portland, OR 97209
Office: 503-478-6304 Cell: 503-913-0403
This message is intended for the exclusive attention of the recipient(s) indicated. Any information contained herein is strictly confidential and privileged. If you are not the intended recipient, please notify us by return e-mail and delete this message from your computer system. Any unauthorized use, reproduction, alteration, filing or sending of this message and/or any attached files may lead to legal action being taken against the party(ies) responsible for said unauthorized use. Any opinion expressed herein is solely that of the author(s) and does not necessarily represent the opinion of the Company. The sender does not guarantee the integrity, speed or safety of this message, and does not accept responsibility for any possible damage arising from the interception, incorporation of viruses, or any other damage as a result of manipulation.
From christina.salls at noaa.gov Thu Feb 9 12:47:05 2012
From: christina.salls at noaa.gov (Christina Salls)
Date: Thu, 9 Feb 2012 14:47:05 -0500
Subject: [torqueusers] simple startup troubleshooting
Message-ID:
Hi all,
I am new to Torque. In fact, I have just installed torque-2.5.9
(server) on the head node of a 20 node cluster and torque client and mom
packages on the compute nodes. I used the Torque Administrator's Guide and
the installation process seemed to proceed smoothly (on my second attempt).
My first attempt was complicated by the fact that PBS was pre-installed on
both the head node and server and seemed to be getting in my way because of
processes that were already running and ports that were already in use. I
removed everything I could find of the PBS installation and started from
scratch. I am stuck at the point where I should be seeing my nodes as
free, but they are showing up as down. I am looking for any clues in
troubleshooting this problem. I don't know where to start. I am including
some information to illustrate my setup.
Thanks in advance,
Christina
Here is the output of the pbsnodes command
[root at wings torque-packages]# pbsnodes -a
n001
state = down
np = 1
ntype = cluster
gpus = 0
n002
state = down
np = 1
ntype = cluster
gpus = 0
n003
state = down
np = 1
ntype = cluster
gpus = 0
.....
It is the same for all 20 nodes. I truncated it for the sake of brevity.
On the headnode:
[root at wings server_priv]# ping n001
PING n001.default.domain (10.0.1.1) 56(84) bytes of data.
64 bytes from n001.default.domain (10.0.1.1): icmp_seq=1 ttl=64 time=0.193 ms
64 bytes from n001.default.domain (10.0.1.1): icmp_seq=2 ttl=64 time=0.189 ms
[root at wings server_priv]# qmgr
Max open servers: 10239
Qmgr: list server
Server wings.glerl.noaa.gov
server_state = Active
scheduling = True
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
acl_hosts = wings.glerl.noaa.gov
default_queue = batch
log_events = 511
mail_from = adm
scheduler_iteration = 600
node_check_rate = 150
tcp_timeout = 6
mom_job_sync = True
pbs_version = 2.5.9
keep_completed = 300
next_job_number = 0
net_counter = 4 4 4
Qmgr: list node n001
Node n001
state = down
np = 1
ntype = cluster
gpus = 0
Qmgr: print node n001
#
# Create nodes and set their properties.
#
#
# Create and define node n001
#
create node n001
set node n001 state = down
set node n001 np = 1
set node n001 ntype = cluster
set node n001 gpus = 0
[root at wings server_priv]# ps -ef | grep pbs
root 3925 1 0 Feb03 ? 00:03:00 /usr/local/sbin/pbs_mom -q -d /var/spool/torque
root 7056 1 0 11:47 ? 00:00:02 pbs_server
root 29031 7993 0 12:59 pts/29 00:00:00 grep pbs
[root at wings torque-2.5.9]# qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = wings.glerl.noaa.gov
set server managers = salls at wings.glerl.noaa.gov
set server operators = salls at wings.glerl.noaa.gov
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
From the compute nodes:
root 15891 1 0 11:45 ? 00:00:00 pbs_mom
root 16742 16709 0 13:11 pts/0 00:00:00 grep pbs
[root at n001 ~]# ping wings
PING wings.glerl.noaa.gov (192.94.173.9) 56(84) bytes of data.
64 bytes from wings.glerl.noaa.gov (192.94.173.9): icmp_seq=1 ttl=64 time=0.093 ms
64 bytes from wings.glerl.noaa.gov (192.94.173.9): icmp_seq=2 ttl=64 time=0.165 ms
[root at n001 ~]# qmgr
Max open servers: 10239
Qmgr: list server
Server wings.glerl.noaa.gov
server_state = Active
scheduling = True
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
acl_hosts = wings.glerl.noaa.gov
default_queue = batch
log_events = 511
mail_from = adm
scheduler_iteration = 600
node_check_rate = 150
tcp_timeout = 6
mom_job_sync = True
pbs_version = 2.5.9
keep_completed = 300
next_job_number = 0
net_counter = 6 5 4
Qmgr: list node n001
Node n001
state = down
np = 1
ntype = cluster
gpus = 0
[root at wings server_priv]# qmgr
Max open servers: 10239
Qmgr: print node n001
#
# Create nodes and set their properties.
#
#
# Create and define node n001
#
create node n001
set node n001 state = down
set node n001 np = 1
set node n001 ntype = cluster
set node n001 gpus = 0
I am not sure how to proceed at this point. Any help would be appreciated.
I wasn't sure what other files or output to include. Let me know if any
other information would be useful.
--
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446
From knielson at adaptivecomputing.com Thu Feb 9 13:19:18 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 09 Feb 2012 13:19:18 -0700 (MST)
Subject: [torqueusers] simple startup troubleshooting
In-Reply-To:
Message-ID:
----- Original Message -----
> From: "Christina Salls"
> To: torqueusers at supercluster.org
> Cc: "help >> GLERL IT Help"
> Sent: Thursday, February 9, 2012 12:47:05 PM
> Subject: [torqueusers] simple startup troubleshooting
>
>
> Hi all,
>
>
> I am new to Torque. In fact, I have just installed torque-2.5.9
> (server) on the head node of a 20 node cluster and torque client and
> mom packages on the compute nodes. I used the Torque Administrator's
> Guide and the installation process seemed to proceed smoothly (on my
> second attempt). My first attempt was complicated by the fact that
> PBS was pre-installed on both the head node and server and seemed to
> be getting in my way because of processes that were already running
> and ports that were already in use. I removed everything I could
> find of the PBS installation and started from scratch. I am stuck at
> the point where I should be seeing my nodes as free, but they are
> showing up as down. I am looking for any clues in troubleshooting
> this problem. I don't know where to start. I am including some
> information to illustrate my setup.
>
>
> Thanks in advance,
>
>
> Christina
>
>
Christina,
One thing to check is the server_name file on each of the MOM nodes, that is, $TORQUE_HOME/server_name. It should contain the host name of the machine where pbs_server is running.
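As a rough sketch of that check, the snippet below compares the first line of a server_name file with the host name you expect. The function name check_server_name is made up for illustration, and the path in the usage comment assumes $TORQUE_HOME is /var/spool/torque, which may differ on your install.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): compare a server_name file with
# the host name pbs_server is expected to run on.
check_server_name() {
    file=$1
    expected=$2
    if [ ! -r "$file" ]; then
        echo "missing: $file"
        return 2
    fi
    # Take the first line and strip any whitespace around the host name.
    actual=$(head -n 1 "$file" | tr -d '[:space:]')
    if [ "$actual" = "$expected" ]; then
        echo "ok: $actual"
    else
        echo "mismatch: found '$actual', expected '$expected'"
        return 1
    fi
}

# Example, assuming $TORQUE_HOME is /var/spool/torque:
# check_server_name /var/spool/torque/server_name wings.glerl.noaa.gov
```

Running it on every MOM node (for instance over ssh) should show quickly whether any node points at a different server name.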
Ken
From knielson at adaptivecomputing.com Thu Feb 9 13:27:59 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 09 Feb 2012 13:27:59 -0700 (MST)
Subject: [torqueusers] problem with jobs sharing cores
In-Reply-To:
Message-ID: <0a281c3b-fb63-4720-965b-4a7a5e28eba0@mail>
----- Original Message -----
> From: "Michael Zulauf"
> To: torqueusers at supercluster.org
> Sent: Thursday, February 9, 2012 11:30:09 AM
> Subject: [torqueusers] problem with jobs sharing cores
>
>
>
>
>
> Hi all. . .
>
>
>
> I apologize if this message appears more than once - there was an
> issue with my email address and list registration (which I hope is
> now fixed), and so I'm having to resend this. . .
>
>
>
> Anyway, where I work, we've had a problem for a while that we haven't
> been able to resolve. I'm not certain of the cause - if it's related
> to Torque, or Maui, or something else. But here goes. . .
>
>
>
> We've got a small cluster of 16 nodes, each with dual hex-core
> processors. 12 cores per node, 192 cores total. The problem is that
> if I launch small jobs, where multiple jobs should be able to share
> a node without sharing cores, I instead get cores that are running
> more than one process, while other cores are idle. The primary
> executable is WRF (weather prediction model), but the problem occurs
> for other parallel codes. The codes have been built to utilize MPI
> (not OpenMP, or MPI/OpenMP).
>
>
You do not really schedule cores with Maui/TORQUE or any other scheduler/resource manager. However, there are ways to make sure your job gets cores of its own. In TORQUE, use CPUSETs: http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/3.5linuxcpusets.php
When you set the np count in the nodes file, it is not physically tied to the number of processors on the node. It is really a count that says "this node has this many execution slots available". Most sites set np to the number of cores on the node. Even then, however, once jobs are scheduled their processes are managed by the OS, which runs them on whichever cores it sees fit. CPUSETs allow the user to reserve one or more cores exclusively for a job: the job will not run outside its CPUSET, and no other processes can use that CPUSET.
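To illustrate the usual convention of setting np to the core count, here is a small sketch that prints a nodes-file entry for a host using the number of online cores the OS reports. The function name nodes_line is hypothetical; the "hostname np=N" line format is the one used in the server's nodes file.

```shell
#!/bin/sh
# Hypothetical helper: print a nodes-file entry with np set to the number of
# online cores reported by the OS on the machine where this runs.
nodes_line() {
    host=${1:-$(hostname)}
    cores=$(getconf _NPROCESSORS_ONLN)
    echo "$host np=$cores"
}

# Run on the compute node itself; the host name here is just an example.
nodes_line n001
```

This only sets the slot count. It does not pin jobs to cores; that is what the CPUSET support is for.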
Ken
From sm4082 at nyu.edu Thu Feb 9 13:34:26 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Thu, 9 Feb 2012 15:34:26 -0500
Subject: [torqueusers] simple startup troubleshooting
In-Reply-To:
References:
Message-ID: <4917601D-FC9A-45AE-99CE-BE3FA3795B31@nyu.edu>
Hi Christina,
Recently, I upgraded Torque to the latest version, 2.5.10, on our clusters. Our network is set up so that the compute nodes cannot reach the server node under the name example.sit.nyu.edu; they have to use example.local. So I had to put this in /opt/torque/mom_priv/config:
$pbsserver example.local
Please check your settings against the way your network is set up. The other thing I did was restart pbs_mom on all nodes, and that took care of it. With our setup, immediately after a node came up from installation, pbs_mom tried to talk to the server using the server name configured in /opt/torque (it couldn't read the config file because that was only copied over after the reboot). Once I restarted pbs_mom, it picked up the server from the config file and everything was fine.
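A minimal sketch of adding that line without duplicating it is shown below. The function name set_pbsserver is made up, and both the config path and the server name in the usage comment are simply the ones from our site.

```shell
#!/bin/sh
# Hypothetical helper: append a $pbsserver line to a pbs_mom config file
# unless one is already there.
set_pbsserver() {
    config=$1
    server=$2
    # Leave the file alone if a $pbsserver line is already present.
    if [ -f "$config" ] && grep -q '^\$pbsserver[[:space:]]' "$config"; then
        echo "already set: $(grep '^\$pbsserver[[:space:]]' "$config")"
        return 0
    fi
    printf '$pbsserver %s\n' "$server" >> "$config"
    echo "added: \$pbsserver $server"
}

# Example, using the path and name from our site:
# set_pbsserver /opt/torque/mom_priv/config example.local
```

pbs_mom still has to be restarted afterwards so that it rereads the file.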
Sreedhar.
On Feb 9, 2012, at 2:47 PM, Christina Salls wrote:
> Hi all,
>
> I am new to Torque. In fact, I have just installed torque-2.5.9 (server) on the head node of a 20 node cluster and torque client and mom packages on the compute nodes. I used the Torque Administrator's Guide and the installation process seemed to proceed smoothly (on my second attempt). My first attempt was complicated by the fact that PBS was pre-installed on both the head node and server and seemed to be getting in my way because of processes that were already running and ports that were already in use. I removed everything I could find of the PBS installation and started from scratch. I am stuck at the point where I should be seeing my nodes as free, but they are showing up as down. I am looking for any clues in troubleshooting this problem. I don't know where to start. I am including some information to illustrate my setup.
>
> Thanks in advance,
>
> Christina
>
> Here is the output of the pbsnodes command
>
> [root at wings torque-packages]# pbsnodes -a
> n001
> state = down
> np = 1
> ntype = cluster
> gpus = 0
>
> n002
> state = down
> np = 1
> ntype = cluster
> gpus = 0
>
> n003
> state = down
> np = 1
> ntype = cluster
> gpus = 0
>
> .....
>
> It is the same for all 20 nodes. I truncated it for the sake of brevity.
>
> On the headnode:
>
> [root at wings server_priv]# ping n001
> PING n001.default.domain (10.0.1.1) 56(84) bytes of data.
> 64 bytes from n001.default.domain (10.0.1.1): icmp_seq=1 ttl=64 time=0.193 ms
> 64 bytes from n001.default.domain (10.0.1.1): icmp_seq=2 ttl=64 time=0.189 ms
>
> [root at wings server_priv]# qmgr
> Max open servers: 10239
> Qmgr: list server
> Server wings.glerl.noaa.gov
> server_state = Active
> scheduling = True
> total_jobs = 0
> state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
> acl_hosts = wings.glerl.noaa.gov
> default_queue = batch
> log_events = 511
> mail_from = adm
> scheduler_iteration = 600
> node_check_rate = 150
> tcp_timeout = 6
> mom_job_sync = True
> pbs_version = 2.5.9
> keep_completed = 300
> next_job_number = 0
> net_counter = 4 4 4
> Qmgr: list node n001
> Node n001
> state = down
> np = 1
> ntype = cluster
> gpus = 0
> Qmgr: print node n001
> #
> # Create nodes and set their properties.
> #
> #
> # Create and define node n001
> #
> create node n001
> set node n001 state = down
> set node n001 np = 1
> set node n001 ntype = cluster
> set node n001 gpus = 0
>
>
> [root at wings server_priv]# ps -ef | grep pbs
> root 3925 1 0 Feb03 ? 00:03:00 /usr/local/sbin/pbs_mom -q -d /var/spool/torque
> root 7056 1 0 11:47 ? 00:00:02 pbs_server
> root 29031 7993 0 12:59 pts/29 00:00:00 grep pbs
>
> [root at wings torque-2.5.9]# qmgr -c 'p s'
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 01:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = wings.glerl.noaa.gov
> set server managers = salls at wings.glerl.noaa.gov
> set server operators = salls at wings.glerl.noaa.gov
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server mom_job_sync = True
> set server keep_completed = 300
>
> From the compute nodes:
>
> root 15891 1 0 11:45 ? 00:00:00 pbs_mom
> root 16742 16709 0 13:11 pts/0 00:00:00 grep pbs
>
> [root at n001 ~]# ping wings
> PING wings.glerl.noaa.gov (192.94.173.9) 56(84) bytes of data.
> 64 bytes from wings.glerl.noaa.gov (192.94.173.9): icmp_seq=1 ttl=64 time=0.093 ms
> 64 bytes from wings.glerl.noaa.gov (192.94.173.9): icmp_seq=2 ttl=64 time=0.165 ms
>
> [root at n001 ~]# qmgr
> Max open servers: 10239
> Qmgr: list server
> Server wings.glerl.noaa.gov
> server_state = Active
> scheduling = True
> total_jobs = 0
> state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
> acl_hosts = wings.glerl.noaa.gov
> default_queue = batch
> log_events = 511
> mail_from = adm
> scheduler_iteration = 600
> node_check_rate = 150
> tcp_timeout = 6
> mom_job_sync = True
> pbs_version = 2.5.9
> keep_completed = 300
> next_job_number = 0
> net_counter = 6 5 4
>
> Qmgr: list node n001
> Node n001
> state = down
> np = 1
> ntype = cluster
> gpus = 0
> [root at wings server_priv]# qmgr
> Max open servers: 10239
> Qmgr: print node n001
> #
> # Create nodes and set their properties.
> #
> #
> # Create and define node n001
> #
> create node n001
> set node n001 state = down
> set node n001 np = 1
> set node n001 ntype = cluster
> set node n001 gpus = 0
>
>
> I am not sure how to proceed at this point. Any help would be appreciated. I wasn't sure what other files or output to include. Let me know if any other information would be useful.
>
>
>
> --
> Christina A. Salls
> GLERL Computer Group
> help.glerl at noaa.gov
> Help Desk x2127
> Christina.Salls at noaa.gov
> Voice Mail 734-741-2446
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
---
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110
From christina.salls at noaa.gov Thu Feb 9 14:23:10 2012
From: christina.salls at noaa.gov (Christina Salls)
Date: Thu, 9 Feb 2012 16:23:10 -0500
Subject: [torqueusers] simple startup troubleshooting
In-Reply-To:
References:
Message-ID:
> Something to check is the server_name file on each of the MOM nodes. This
> should have the host name of where pbs_server is running. That is the
> $TORQUE_HOME/server_name file.
>
> Ken
>
Thanks Ken,
I think the server_name is good. Here is what I have:
[root at n001 ~]# cd /var/spool
[root at n001 spool]# ls
abrt abrt-upload anacron at audit cron cups lpd mail plymouth postfix torque up2date
[root at n001 spool]# cd torque
[root at n001 torque]# ls
aux checkpoint mom_logs mom_priv pbs_environment server_name server_name.new spool undelivered
[root at n001 torque]# more server_name
wings.glerl.noaa.gov
[root at n001 torque]# ping wings.glerl.noaa.gov
PING wings.glerl.noaa.gov (192.94.173.9) 56(84) bytes of data.
64 bytes from wings.glerl.noaa.gov (192.94.173.9): icmp_seq=1 ttl=64 time=0.075 ms
64 bytes from wings.glerl.noaa.gov (192.94.173.9): icmp_seq=2 ttl=64 time=0.190 ms
--
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446
From halmabrazi at idtdna.com Thu Feb 9 14:33:02 2012
From: halmabrazi at idtdna.com (Hakeem Almabrazi)
Date: Thu, 9 Feb 2012 21:33:02 +0000
Subject: [torqueusers] submitting a job (interactively) issue
Message-ID:
Hi,
I tried to submit a job using the -I option and got these messages:
qsub: waiting for job # to start
qsub: job # ready
And that is it.
If I run qstat, it shows job # still in the "R" (running) state ...
It looks like I don't fully understand how to use this option, but here is my job submission request:
>qsub -l nodes=1 -N jobName -I -v "some parameters" shellScript
If I run the above request without the -I option, it runs fine without any issue.
Someone might ask the question, why I am running it "interactively"?
Well, I want to force the program which issued the request to wait for the result and do something with it after that.
Thank you for your help.
From christina.salls at noaa.gov Thu Feb 9 14:39:27 2012
From: christina.salls at noaa.gov (Christina Salls)
Date: Thu, 9 Feb 2012 16:39:27 -0500
Subject: [torqueusers] simple startup troubleshooting
In-Reply-To:
References:
Message-ID:
Ok, I got a suggestion and it worked!! This is how I got my nodes to a
free state:
Qmgr: s n n001,n002,n003,n004,n005,n006,n007,n008,n009,n010,n011,n012,n013,n014,n015,n016,n017,n018,n019,n020 state=free
Qmgr: p n n001,n002,n003
#
# Create nodes and set their properties.
#
#
# Create and define node n001
#
create node n001
set node n001 state = free
set node n001 np = 1
set node n001 ntype = cluster
set node n001 gpus = 0
#
# Create nodes and set their properties.
#
#
# Create and define node n002
#
create node n002
set node n002 state = free
set node n002 np = 1
set node n002 ntype = cluster
set node n002 gpus = 0
#
# Create nodes and set their properties.
#
#
# Create and define node n003
#
create node n003
set node n003 state = free
set node n003 np = 1
set node n003 ntype = cluster
set node n003 gpus = 0
Qmgr: quit
[root at wings ~]# pbsnodes -a
n001
state = free
np = 1
ntype = cluster
gpus = 0
n002
state = free
np = 1
ntype = cluster
gpus = 0
n003
state = free
np = 1
ntype = cluster
gpus = 0
n004
state = free
np = 1
ntype = cluster
gpus = 0
n005
state = free
np = 1
ntype = cluster
gpus = 0
n006
state = free
np = 1
ntype = cluster
gpus = 0
n007
state = free
np = 1
ntype = cluster
gpus = 0
n008
state = free
np = 1
ntype = cluster
gpus = 0
n009
state = free
np = 1
ntype = cluster
gpus = 0
n010
state = free
np = 1
ntype = cluster
gpus = 0
n011
state = free
np = 1
ntype = cluster
gpus = 0
n012
state = free
np = 1
ntype = cluster
gpus = 0
n013
state = free
np = 1
ntype = cluster
gpus = 0
n014
state = free
np = 1
ntype = cluster
gpus = 0
n015
state = free
np = 1
ntype = cluster
gpus = 0
n016
state = free
np = 1
ntype = cluster
gpus = 0
n017
state = free
np = 1
ntype = cluster
gpus = 0
n018
state = free
np = 1
ntype = cluster
gpus = 0
n019
state = free
np = 1
ntype = cluster
gpus = 0
n020
state = free
np = 1
ntype = cluster
gpus = 0
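For a larger cluster, the comma-separated node list used in the Qmgr command above could be generated instead of typed. This is only a sketch: the function name node_list is made up, and it assumes node names follow the n001 ... n020 pattern. It prints the command text rather than running qmgr.

```shell
#!/bin/sh
# Hypothetical helper: build a comma-separated list of node names
# n<first> .. n<last>, zero-padded to three digits.
node_list() {
    first=$1
    last=$2
    list=
    i=$first
    while [ "$i" -le "$last" ]; do
        name=$(printf 'n%03d' "$i")
        # Add a comma only when the list already has an entry.
        list=${list:+$list,}$name
        i=$((i + 1))
    done
    echo "$list"
}

# Print the Qmgr command for all 20 nodes (nothing is executed here).
echo "set node $(node_list 1 20) state=free"
```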
Now I have a lot more work to do to figure out submitting jobs, etc....
Thanks to everyone who responded!!
Christina
From leggett at mcs.anl.gov Thu Feb 9 14:41:12 2012
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Thu, 9 Feb 2012 15:41:12 -0600
Subject: [torqueusers] Torque not honoring max_user_queuable
In-Reply-To:
References:
<22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov>
<93149F70-EFCC-44F6-9CE3-022C475BEE70@mcs.anl.gov>
Message-ID: <88388003-8D26-4027-931B-19777974EFCC@mcs.anl.gov>
Is there more configuration I need to do to make this effective with a routing queue?
On Feb 4, 2012, at 11:44 AM, Ti Leggett wrote:
> All jobs go through a routing queue.
>
> On Feb 3, 2012, at 11:44 AM, David Beer wrote:
>
>> I'm also curious - is this done through a routing queue or routing queues? Is it class remapping in Moab? It looks like it isn't qsub -q
>>
>> David
>>
>> On Fri, Feb 3, 2012 at 10:15 AM, Ti Leggett wrote:
>> submit_args = -A CI-MCB000083 -l walltime=48:00:00,
>> mppwidth=48 /lustre/beagle/linpyl/project.qsub
>>
>> On Feb 3, 2012, at 11:03 AM, David Beer wrote:
>>
>>> If you qstat -f a few of the jobs you can see the submit arguments. At higher log levels the entire job submission is there, but I don't know if your log levels would be that high.
>>>
>>> David
>>>
>>> On Fri, Feb 3, 2012 at 9:21 AM, Ti Leggett wrote:
>>> I'm assuming using qsub, but it's other users doing this so I'm not 100% sure. Is there a way to find out from logs or other tools?
>>>
>>> On Feb 3, 2012, at 10:06 AM, David Beer wrote:
>>>
>>>> Ti,
>>>>
>>>> How are you submitting the jobs? I assume this is TORQUE 2.5.9?
>>>>
>>>> David
>>>>
>>>> On Fri, Feb 3, 2012 at 8:27 AM, Ti Leggett wrote:
>>>> We've set queue limits that don't seem to be honored:
>>>>
>>>> sdb:~ # qstat | grep linpyl | grep batch | wc
>>>> 945 5670 82215
>>>>
>>>> sdb:~ # qmgr -c "print queue batch"
>>>> #
>>>> # Create queues and set their attributes.
>>>> #
>>>> #
>>>> # Create and define queue batch
>>>> #
>>>> create queue batch
>>>> set queue batch queue_type = Execution
>>>> set queue batch max_user_queuable = 500
>>>> set queue batch resources_min.mppwidth = 1
>>>> set queue batch resources_default.mppwidth = 24
>>>> set queue batch resources_default.walltime = 00:10:00
>>>> set queue batch acl_group_enable = False
>>>> set queue batch resources_available.nodes = 726
>>>> set queue batch enabled = True
>>>> set queue batch started = True
>>>>
>>>> How would it be possible for a user to have 945 jobs in the queue when the limit should be 500?
From dbeer at adaptivecomputing.com Thu Feb 9 14:59:58 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Thu, 9 Feb 2012 14:59:58 -0700
Subject: [torqueusers] Torque not honoring max_user_queuable
In-Reply-To: <88388003-8D26-4027-931B-19777974EFCC@mcs.anl.gov>
References:
<22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov>
<93149F70-EFCC-44F6-9CE3-022C475BEE70@mcs.anl.gov>
<88388003-8D26-4027-931B-19777974EFCC@mcs.anl.gov>
Message-ID:
Ti,
Sorry, I sort of got pulled off onto other things and haven't looked into
this further. There shouldn't be further configuration that you need. I
will see if I can reproduce this.
David
On Thu, Feb 9, 2012 at 2:41 PM, Ti Leggett wrote:
> Is there more configuration I need to do to make this effective with a
> routing queue?
--
David Beer | Software Engineer
Adaptive Computing
From knielson at adaptivecomputing.com Thu Feb 9 15:39:52 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 09 Feb 2012 15:39:52 -0700 (MST)
Subject: [torqueusers] Specifying nodes which can be used in array job
In-Reply-To:
Message-ID:
----- Original Message -----
> From: "Rainer M Krug"
> To: torqueusers at supercluster.org
> Sent: Thursday, February 9, 2012 2:16:07 AM
> Subject: [torqueusers] Specifying nodes which can be used in array job
>
> Hi
>
> assuming I have a cluster of 10 nodes (node01, ... node10), of which I
> am not the administrator.
>
> Some nodes are set up slightly differently, so that a certain job only
> runs on nodes node01 to node05.
>
> So I would like to submit an array job and specify "only use
> node01, node02, node03, node04 or node05 to run each individual
> job".
>
> How can I do that? I know that I can use -l to specify resource
> requirements, but if I specify nodes=..., *each* job will allocate
> *all* nodes for the job, which is not what I want - each individual
> job should run on one of the nodes.
>
> so:
>
> qsub the_script.sub -t 1-10
>
> and how do I specify the nodes?
>
> Thanks,
>
> Rainer
Rainer,
Are there features (properties) in the nodes file for those hosts that would allow you to request a feature on the qsub line?
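If such a property exists, the request might look like the sketch below. The property name "myfeature" is made up; the real one, if any, is whatever the administrator listed after the host name in the nodes file for node01 to node05. The snippet only prints the command so it can be inspected before use.

```shell
#!/bin/sh
# "myfeature" is a placeholder for a property assumed to be set on
# node01..node05 only; the_script.sub is the script from the question.
feature=myfeature
script=the_script.sub

# Each array element asks for one node that carries the property, so the
# ten sub-jobs are spread over the matching nodes one node at a time.
cmd="qsub -t 1-10 -l nodes=1:$feature $script"
echo "$cmd"
```

Because each sub-job requests nodes=1 plus a property rather than a list of host names, no single job allocates all five nodes.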
Ken
From Gareth.Williams at csiro.au Thu Feb 9 16:04:45 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Fri, 10 Feb 2012 10:04:45 +1100
Subject: [torqueusers] submitting a job (interactively) issue
In-Reply-To:
References:
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74E08@exvic-mbx04.nexus.csiro.au>
> From: Hakeem Almabrazi [mailto:halmabrazi at idtdna.com]
> Sent: Friday, 10 February 2012 8:33 AM
> To: Torque Users Mailing List
> Subject: [torqueusers] submitting a job (interactively) issue
> Hi,
> I have tried to submit a job using the option -I and I got the message
> Qsub: waiting for job # to start
> Qsub: job # ready
> And that is it.
> If I qstat I got a message saying the job # is still "R" running .
> It looks like I have lack of understanding on how to use this option but here is my submit job request:
> >qsub -l nodes=1 -N jobName -I -v "some parameters" shellScript
> If I run the above request without the -I option, it runs fine without any issue.
> Someone might ask the question, why I am running it "interactively"?
> Well, I want to force the program which issued the request to wait for the result and do something with it after that.
> Thank you for your help.
Hi Hakeem,
For an interactive job you have two options:
1) don't supply a script - then you should get an interactive shell session on a compute node when the job starts like:
> qsub -I
qsub: waiting for job 421117.server_host to start
qsub: job 421117.server_host ready
Begin PBS Prologue Fri Feb 10 10:00:31 EST 2012 1328828431
Job ID: 421117.server_host
Username: wil240
Group: asc
Name: STDIN
Resources: neednodes=1,nodes=1,vmem=500mb,walltime=00:10:00
Queue: normal
Nodes: n001
First Node: n001
Fri Feb 10 10:00:32 EST 2012
Directory: /home/asc/wil240
Fri Feb 10 10:00:32 EST 2012
wil240 at n001:~> uname -a
Linux n001 2.6.32.12-0.7-default #1 SMP 2010-05-20 11:14:20 +0200 x86_64 x86_64 x86_64 GNU/Linux
wil240 at n001:~> logout
qsub: job 421117.server_host completed
2) use the -x option and supply a single command (not a script) - like:
wil240 at burnet-login:~> qsub -Ix 'uname -a'
qsub: waiting for job 421120.burnet-srv.idpx.hpsc.csiro.au to start
qsub: job 421120.burnet-srv.idpx.hpsc.csiro.au ready
Begin PBS Prologue Fri Feb 10 10:02:39 EST 2012 1328828559
Job ID: 421120.server_host
Username: wil240
Group: asc
Name: none
Resources: neednodes=1,nodes=1,vmem=500mb,walltime=00:10:00
Queue: normal
Nodes: n001
First Node: n001
Fri Feb 10 10:02:39 EST 2012
Linux n001 2.6.32.12-0.7-default #1 SMP 2010-05-20 11:14:20 +0200 x86_64 x86_64 x86_64 GNU/Linux
qsub: job 421120.server_host completed
I think you want the second option.
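For a calling program that needs to block until the result is back, the second option could be wrapped roughly as follows. This is a sketch only: run_and_wait and the QSUB override are made-up names, and whether the qsub status lines and prologue text end up in the captured output depends on your setup.

```shell
#!/bin/sh
# QSUB can be overridden (for example QSUB=echo) to preview the command
# without submitting anything.
QSUB=${QSUB:-qsub}

# Hypothetical wrapper: submit one command as an interactive job and return
# only when the job has completed.
run_and_wait() {
    # -I: interactive job; -x: run the given command instead of a shell,
    # as in the second transcript above.
    "$QSUB" -I -x "$1"
}

# Example (needs a working Torque client):
# output=$(run_and_wait 'uname -a')
```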
Gareth
ps. It would be better to send plain text email to a mailing list (not html).
From jjc at iastate.edu Thu Feb 9 16:20:44 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Thu, 9 Feb 2012 23:20:44 +0000
Subject: [torqueusers] problem with jobs sharing cores
In-Reply-To:
References:
Message-ID: <242421BFAF465844BE24EB90BB97E2210196746B@ITSDAG1D.its.iastate.edu>
Mike,
We had this issue with OpenMPI and the MCA parameter mpi_paffinity_alone.
Setting mpi_paffinity_alone gives somewhat better performance than
not setting it, due to better cache hits, when there is only one job
running on a node.
However, it places the N MPI processes of each job on cores 0 to N-1,
so for three 4-process MPI programs running on a 12-core node,
cores 0 through 3 would each be running 3 processes.
Doing what you are doing, launching 3 jobs of 4 processes each with
OpenMPI and with mpi_paffinity_alone set (perhaps by default), would
cause exactly the behavior you are seeing: the three rank 0 processes
would run on core 0, the three rank 1 processes on core 1, and so on, with no
MPI processes running on cores 4-11.
Perhaps mvapich has a mechanism similar to mpi_paffinity_alone that you are
running into. man mpirun should help you figure this out, or you could ask
the cluster admin, or whoever is the expert on mvapich in your environment.
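The arithmetic behind that can be checked with a small simulation. It is not anything MPI does internally, just a count under the assumption that every job binds rank r to core r; the function name procs_on_core is made up.

```shell
#!/bin/sh
# Count how many MPI processes land on a given core when every job binds
# rank r to core r (the assumed behaviour with mpi_paffinity_alone set).
procs_on_core() {
    core=$1; jobs=$2; ranks=$3
    count=0
    job=0
    while [ "$job" -lt "$jobs" ]; do
        rank=0
        while [ "$rank" -lt "$ranks" ]; do
            # Each job starts numbering from core 0, so rank r maps to core r.
            if [ "$rank" -eq "$core" ]; then
                count=$((count + 1))
            fi
            rank=$((rank + 1))
        done
        job=$((job + 1))
    done
    echo "$count"
}

# 3 jobs of 4 ranks each on a 12-core node, as in your case.
c=0
while [ "$c" -lt 12 ]; do
    echo "core $c: $(procs_on_core "$c" 3 4) processes"
    c=$((c + 1))
done
```

Cores 0 through 3 come out with 3 processes each and cores 4 through 11 with none, which matches the 4 overloaded and 8 idle cores you describe.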
Below, I have included part of the General run-time tuning portion of the FAQ for OpenMPI
from http://www.open-mpi.org/faq/
I hope this helps
- Jim
James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/
Open MPI 1.2 offers only crude control, with the MCA parameter "mpi_paffinity_alone". For example:
$ mpirun --mca mpi_paffinity_alone 1 -np 4 a.out
(Just like any other MCA parameter, mpi_paffinity_alone can be set
via any of the normal MCA parameter mechanisms.)
On each node where your job is running, your job's MPI processes will be bound, one-to-one, in the order of their global MPI ranks, to the lowest-numbered processing units (for example, cores or hardware threads) on the node as identified by the OS. Further, memory affinity will also be enabled if it is supported on the node, as described in a different FAQ entry.
If multiple jobs are launched on the same node in this manner, they will compete for the same processing units and severe performance degradation will likely result. Therefore, this MCA parameter is best used when you know your job will be "alone" on the nodes where it will run.
Since each process is bound to a single processing unit, performance will likely suffer catastrophically if processes are multi-threaded.
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Zulauf, Michael
Sent: Thursday, February 09, 2012 12:30 PM
To: torqueusers at supercluster.org
Subject: [torqueusers] problem with jobs sharing cores
Hi all. . .
I apologize if this message appears more than once - there was an issue with my email address and list registration (which I hope is now fixed), and so I'm having to resend this. . .
Anyway, where I work, we've had a problem for a while that we haven't been able to resolve. I'm not certain of the cause - if it's related to Torque, or Maui, or something else. But here goes. . .
We've got a small cluster of 16 nodes, each with dual hex-core processors. 12 cores per node, 192 cores total. The problem is that if I launch small jobs, where multiple jobs should be able to share a node without sharing cores, I instead get cores that are running more than one process, while other cores are idle. The primary executable is WRF (weather prediction model), but the problem occurs for other parallel codes. The codes have been built to utilize MPI (not OpenMP, or MPI/OpenMP).
As an example, if I launch a series of jobs which request 4 cores each, I get 3 jobs assigned to each node. That should be fine, as each node has 12 cores, and there should be no need to share cores. Instead, I get 4 "overloaded" cores (each running 3 processes) and 8 idle cores. Obviously not an ideal situation. If I submit only a single small job, in which case it's alone on a node, then it runs great. Similarly, if I launch a large job which spans more than one node, it also works well - as long as it's not sharing nodes with other jobs. The problem only occurs (and always occurs) when parallel jobs share a node. BTW, the qsub command does not explicitly request specific cores, or anything like that.
I'm not the administrator - just the primary user. The administrator (who was not previously familiar with Torque/Maui) has been struggling with this for a bit, and is rather busy with other duties, so I thought I'd check in here to see if anybody had suggestions I could pass along.
Here are some specifics, as far as I know them:
HP blade hardware
dual Intel Xeon X5670 processors
Infiniband interconnect (not an issue in this case?)
the CentOS equivalent of Red Hat 4.1.2-48 (not sure of what that is exactly)
Torque 3.0.2
mvapich2-1.7rc1
PGI7.2-5 compilers
WRF 3.3.1
Any thoughts? I've probably left out relevant information. If so, please ask for clarification.
Thanks,
Mike
--
Mike Zulauf
Meteorologist, Lead Senior
Asset Optimization
Iberdrola Renewables
1125 NW Couch, Suite 700
Portland, OR 97209
Office: 503-478-6304 Cell: 503-913-0403
From r.m.krug at gmail.com Fri Feb 10 00:49:45 2012
From: r.m.krug at gmail.com (Rainer M Krug)
Date: Fri, 10 Feb 2012 08:49:45 +0100
Subject: [torqueusers] Specifying nodes which can be used in array job
In-Reply-To:
References:
Message-ID: <4F34CC19.6070805@gmail.com>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 09/02/12 23:39, Ken Nielson wrote:
>
>
> ----- Original Message -----
>> From: "Rainer M Krug" To:
>> torqueusers at supercluster.org Sent: Thursday, February 9, 2012
>> 2:16:07 AM Subject: [torqueusers] Specifying nodes which can be
>> used in array job
>>
>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
>>
>> Hi
>>
>> assuming I have cluster of 10 nodes (node01, ... node10), of
>> which I am not the administrator.
>>
>> Some nodes are setup slightly different, so that a certain job
>> only runs on nodes node01 to node05.
>>
>> So I would like to submit an array job and specify "only use the
>> node01, node02, node03, node04 or node05 to run the each
>> individual job".
>>
>> How can I do that? I know that I can use -l to specify resource
>> requirements, but if I specify nodes=..., *each* job will
>> allocate *all* nodes for the job, which is not what I want - each
>> individual job should run on one of the nodes.
>>
>> so:
>>
>> qsub the_script.sub -t 1-10
>>
>> and how do I specify the nodes?
>>
>> Thanks,
>>
>> Rainer
>
> Rainer,
>
> Are there feature (properties) in the nodes files of those hosts
> which would allow you to specify a feature on the qsub line?
No - unfortunately not.
>
> Ken
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAk80zBgACgkQoYgNqgF2egri1wCfUUqDmOigKB8hCyCvt30pu5jZ
kewAnjfVc6o7rIjFua0ukEBhkaNe5McS
=nBnt
-----END PGP SIGNATURE-----
From christina.salls at noaa.gov Fri Feb 10 06:57:02 2012
From: christina.salls at noaa.gov (Christina Salls)
Date: Fri, 10 Feb 2012 08:57:02 -0500
Subject: [torqueusers] cluster network configuration
Message-ID:
Hi all,
I am experiencing a problem with my torque server and client
connection. My server has an ethernet interface on the public network that
is the named server in my torque config. There is a second network
interface that is a private network to the cluster on a 10.0.10 network,
with a hostname of admin. The compute nodes are only connected to the
private network, however the named server in both the head node and the
compute nodes is the public interface hostname. A pbsnodes command shows
all nodes as down. The log file on the server shows this error:
02/09/2012 18:10:56;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::is_request,
bad attempt to connect from 192.94.173.9:1022 (address not trusted - check
entry in server_priv/nodes)
That address is the public address of the server.
I am wondering if I should name the server admin? The only compute nodes
that the server will need to access are on the private network. What is
the standard way of setting up a single cluster environment?
I am able to ping the compute nodes from the head node and ssh with no
authentication. From the compute nodes I am able to ping the head node as
well, using the public or private network hostname, and I am able to ssh
either to the "wings" interface or the "admin" interface without
authentication. It seems like the communication lines are open. Any
suggestions would be welcome.
Thanks in advance,
Christina
--
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446
From fotis at cern.ch Sat Feb 11 12:53:24 2012
From: fotis at cern.ch (Fotis Georgatos)
Date: Sat, 11 Feb 2012 21:53:24 +0200
Subject: [torqueusers] problem with jobs sharing cores
In-Reply-To:
References:
Message-ID: <4F36C734.9060602@cern.ch>
Hi Mike,
I had to debug a problem during last week which appears somewhat related;
in short, the mpi stack (openmpi) was intervening in cpu affinity.
I was able to solve it in my case with the following line:
"mpiexec --report-bindings --cpus-per-rank 4 -np ..."
In your case I recommend a check on the equivalent FAQ of your mpi stack like:
http://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
From time to time you would like to check that your scheduler is actually
placing jobs on nodes as you imagine it would; this tool would help in this:
http://fotis.web.cern.ch/fotis/QTOP/
(tarball works fine in userspace, rpm & repo are available for sysadmins).
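One way to check whether the MPI library is pinning ranks is to look at the CPU affinity the kernel reports for each running process. The sketch below is an illustration and not part of the original thread: it assumes a Linux node where /proc/<pid>/status carries a Cpus_allowed_list line, and the helper names (cpus_allowed_from_status, show_affinity) and the wrf.exe process name are made up for the example. If three 4-rank jobs on a 12-core node all report cores 0 to 3, each job has bound its ranks to the lowest-numbered cores, which would give 3 x 4 = 12 processes on 4 cores and leave 8 cores idle, as in the original report.

```shell
#!/bin/sh
# Print the CPU list a process may run on, given a /proc-style status file.
# (Hypothetical helper name.)
cpus_allowed_from_status() {
    awk -F':[ \t]*' '$1 == "Cpus_allowed_list" { print $2 }' "$1"
}

# Print "pid cpu-list" for every running process with the given name.
show_affinity() {
    for pid in $(pgrep -x "$1"); do
        printf '%s %s\n' "$pid" "$(cpus_allowed_from_status "/proc/$pid/status")"
    done
}

# Example call (the process name is an assumption): show_affinity wrf.exe
```

If the lists overlap between jobs, the binding is probably coming from the MPI stack rather than from Torque or Maui. MVAPICH2's user guide describes an MV2_ENABLE_AFFINITY run-time parameter for this; check the guide for the installed version before relying on it.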
enjoy,
Fotis
On 10/02/2012 01:20, torqueusers-request at supercluster.org wrote:
> From:torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Zulauf, Michael
> Sent: Thursday, February 09, 2012 12:30 PM
> To:torqueusers at supercluster.org
> Subject: [torqueusers] problem with jobs sharing cores
>
> Hi all. . .
>
> I apologize if this message appears more than once - there was an issue with my email address and list registration (which I hope is now fixed), and so I'm having to resend this. . .
>
> Anyway, where I work, we've had a problem for a while that we haven't been able to resolve. I'm not certain of the cause - if it's related to Torque, or Maui, or something else. But here goes. . .
>
> We've got a small cluster of 16 nodes, each with dual hex-core processors. 12 cores per node, 192 cores total. The problem is that if I launch small jobs, where multiple jobs should be able to share a node without sharing cores, I instead get cores that are running more than one process, while other cores are idle. The primary executable is WRF (weather prediction model), but the problem occurs for other parallel codes. The codes have been built to utilize MPI (not OpenMP, or MPI/OpenMP).
>
> As an example, if I launch a series of jobs which request 4 cores each, I get 3 jobs assigned to each node. That should be fine, as each node has 12 cores, and there should be no need to share cores. Instead, I get 4 "overloaded" cores (each running 3 processes) and 8 idle cores. Obviously not an ideal situation. If I submit only a single small job, in which case it's alone on a node, then it runs great. Similarly, if I launch a large job which spans more than one node, it also works well - as long as it's not sharing nodes with other jobs. The problem only occurs (and always occurs) when parallel jobs share a node. BTW, the qsub command does not explicitly request specific cores, or anything like that.
>
> I'm not the administrator - just the primary user. The administrator (who was not previously familiar with Torque/Maui) has been struggling with this for a bit, and is rather busy with other duties, so I thought I'd check in here to see if anybody had suggestions I could pass along.
>
> Here are some specifics, as far as I know them:
> HP blade hardware
> dual Intel Xeon X5670 processors
> Infiniband interconnect (not an issue in this case?)
> the CentOS equivalent of Red Hat 4.1.2-48 (not sure of what that is exactly)
> Torque 3.0.2
> mvapich2-1.7rc1
> PGI7.2-5 compilers
> WRF 3.3.1
>
> Any thoughts? I've probably left out relevant information. If so, please ask for clarification.
>
> Thanks,
> Mike
>
> --
> Mike Zulauf
> Meteorologist, Lead Senior
> Asset Optimization
> Iberdrola Renewables
> 1125 NW Couch, Suite 700
> Portland, OR 97209
> Office: 503-478-6304 Cell: 503-913-0403
--
echo "sysadmin know better bash than english" | sed s/min/mins/ \
| sed 's/better bash/bash better/' # Yelling in a CERN forum
From jwbacon at tds.net Sat Feb 11 14:32:21 2012
From: jwbacon at tds.net (Jason bacon)
Date: Sat, 11 Feb 2012 15:32:21 -0600
Subject: [torqueusers] submitting a job (interactively) issue
In-Reply-To:
References:
Message-ID: <4F36DE65.4010304@tds.net>
I've (apparently) had the same issue, but have not found a solution yet.
Can you provide the following:
1. Operating system and version
2. Torque version
3. Relevant entries from server_logs on the submit node and mom_logs
on the allocated compute node
4. Anything unusual in the system log
When I ran into this issue, I found some errors in the logs regarding
failed socket connections. I think it might be a permissions issue, but
have not had time to investigate yet.
-J
On 2/9/12 3:33 PM, Hakeem Almabrazi wrote:
>
> Hi,
>
> I have tried to submit a job using the option -I and I got the message
>
> Qsub: waiting for job # to start
>
> Qsub: job # ready
>
> And that is it.
>
> If I qstat I got a message saying the job # is still "R" running ...
>
> It looks like I have lack of understanding on how to use this option
> but here is my submit job request:
>
> >qsub -l nodes=1 -N jobName -I -v "some parameters" shellScript
>
> If I run the above request without the -I option, it runs fine
> without any issue.
>
> Someone might ask the question, why I am running it "interactively"?
>
> Well, I want to force the program which issued the request to wait for
> the result and do something with it after that.
>
> Thank you for your help.
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
From gus at ldeo.columbia.edu Sat Feb 11 16:01:25 2012
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Sat, 11 Feb 2012 18:01:25 -0500
Subject: [torqueusers] submitting a job (interactively) issue
In-Reply-To: <4F36DE65.4010304@tds.net>
References:
<4F36DE65.4010304@tds.net>
Message-ID: <3237A3BF-58F8-4ECA-927A-A0A291B28045@ldeo.columbia.edu>
Hi Hakeem
Not sure if I understood the issue correctly.
Anyway, did the job return a shell prompt to you?
That is what it is expected to do.
From 'man qsub':
" If the -I option is specified on the command line or in a script
directive, or if the "interactive" job attribute declared true
via the -W option, -W interactive=true, either on the command
line or in a script directive, the job is an interactive job.
The script will be processed for directives, but will not be
included with the job. When the job begins execution, all input
to the job is from the terminal session in which qsub is running."
Only the #PBS directives in your shell script [if any] would be processed, not the commands.
Have you tried to submit the job without the shell script?
>> qsub -l nodes=1 -N jobName -I -v "some parameters"
This should give you a shell prompt in one of the nodes.
From there you can 'cd' to your work directory [cd $PBS_O_WORKDIR],
and run the shell script [./shellscript].
All I/O is on the terminal, no stderr/stdout files are generated.
At the end just do CTRL-D at the shell prompt to end the job.
Check the details in 'man qsub'.
I hope this helps,
Gus Correa
On Feb 11, 2012, at 4:32 PM, Jason bacon wrote:
>
> I've (apparently) had the same issue, but have not found a solution yet.
>
> Can you provide the following:
>
> 1. Operating system and version
> 2. Torque version
> 3. Relevant entries from server_logs on the submit node and mom_logs on the allocated compute node
> 4. Anything unusual in the system log
>
> When I ran into this issue, I found some errors in the logs regarding failed socket connections. I think it might be a permissions issue, but have not had time to investigate yet.
>
> -J
>
> On 2/9/12 3:33 PM, Hakeem Almabrazi wrote:
>> Hi,
>>
>> I have tried to submit a job using the option -I and I got the message
>>
>> Qsub: waiting for job # to start
>> Qsub: job # ready
>>
>> And that is it.
>>
>> If I qstat I got a message saying the job # is still "R" running ...
>>
>> It looks like I have lack of understanding on how to use this option but here is my submit job request:
>>
>> >qsub -l nodes=1 -N jobName -I -v "some parameters" shellScript
>>
>>
>> If I run the above request without the -I option, it runs fine without any issue.
>>
>> Someone might ask the question, why I am running it "interactively"?
>>
>> Well, I want to force the program which issued the request to wait for the result and do something with it after that.
>>
>> Thank you for your help.
>>
From jwbacon at tds.net Sat Feb 11 16:31:10 2012
From: jwbacon at tds.net (Jason bacon)
Date: Sat, 11 Feb 2012 17:31:10 -0600
Subject: [torqueusers] submitting a job (interactively) issue
In-Reply-To: <3237A3BF-58F8-4ECA-927A-A0A291B28045@ldeo.columbia.edu>
References: <4F36DE65.4010304@tds.net>
<3237A3BF-58F8-4ECA-927A-A0A291B28045@ldeo.columbia.edu>
Message-ID: <4F36FA3E.6000701@tds.net>
If I use -I on the command line, my system drops me into a shell on a
scheduled node, which seems to be what the man page is describing.
However, if I use #PBS -I in a script, the job fails, which seems to
contradict the man page.
Example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/bin/sh
#PBS -I
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this case, I get the following email:
From adm at peregrine.hpc.uwm.edu Sat Feb 11 17:18:09 2012
Date: Sat, 11 Feb 2012 17:18:09 -0600 (CST)
From: adm at peregrine.hpc.uwm.edu
To: bacon at peregrine.hpc.uwm.edu
Subject: PBS JOB 742.peregrine.hpc.uwm.edu
PBS Job Id: 742.peregrine.hpc.uwm.edu
Job Name: hostname
Exec host: compute-02/0
Aborted by PBS Server
Job cannot be executed
See Administrator for help
and error messages in the logs.
I haven't considered this a major issue. Instead, I've worked around
the issue using the following script, which waits until the job ends and
then shows the output. It doesn't allow interaction, but if all you
need is to see the output when the job is done, it works fine.
#!/bin/sh
# Script: qsubw
# Submit a job, wait for it to complete, and show the output
if [ $# -ne 1 ]; then
    printf 'Usage: %s script\n' "$0"
    exit 1
fi
script=$1
# Strip out job name and output file options
# FIXME: This will strip out other options on the same line
egrep -v '#PBS -[Noe]' $script > $script.tmp
job_id=`qsub $script.tmp | cut -d '.' -f 1`
printf ' Job ID = %s\n' $job_id
qstat -f $job_id | fgrep exec_host
# Poll until qstat reports the job as complete (state C)
while [ `qstat | grep "^$job_id" | awk ' { print $5 }'` != 'C' ]; do
    sleep 1
done
# Note: the output file names below assume the script is named <stem>.pbs
stem=${script%.*}
for file in $stem.pbs.tmp.o$job_id $stem.pbs.tmp.e$job_id; do
    if [ -s $file ]; then
        printf "\n%s:\n" $file
        cat $file
    fi
    rm $file
done
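The polling loop above can be made a little more robust by asking qstat about one job only and by treating a job that has already left the queue as finished. A minimal sketch, assuming qstat -f prints a "job_state = X" line as it does in Torque; the function names job_state and wait_for_job are hypothetical and not part of the script above:

```shell
#!/bin/sh
# Print the one-letter state of a job (Q, R, C, ...), or nothing if
# qstat no longer knows the job.
job_state() {
    qstat -f "$1" 2>/dev/null | awk -F' = ' '/job_state/ { print $2 }'
}

# Block until the job is complete (C) or gone from the queue.
# $1 = job id, $2 = polling interval in seconds (default 5).
wait_for_job() {
    while :; do
        case "$(job_state "$1")" in
            C|'') return 0 ;;
        esac
        sleep "${2:-5}"
    done
}
```

A caller would run wait_for_job "$job_id" right after qsub and then read the job's output files.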
-J
On 2/11/12 5:01 PM, Gustavo Correa wrote:
> Hi Hakeem
>
> Not sure if I understood right the issue.
> Anyway, did the job return a shell prompt to you?
> That is what it is expected to do.
>
> From 'man qsub':
>
> " If the -I option is specified on the command line or in a script
> directive, or if the "interactive" job attribute declared true
> via the -W option, -W interactive=true, either on the command
> line or in a script directive, the job is an interactive job.
> The script will be processed for directives, but will not be
> included with the job. When the job begins execution, all input
> to the job is from the terminal session in which qsub is running."
>
> Only the #PBS directives in your shell script [if any] would be processed, not the commands.
>
> Have you tried to submit the job without the shell script?
>
>>> qsub -l nodes=1 -N jobName -I -v "some parameters"
>
> This should give you a shell prompt in one of the nodes.
> From there you can 'cd' to your work directory [cd $PBS_O_WORKDIR],
> and run the shell script [./shellscript].
> All I/O is on the terminal, no stderr/stdout files are generated.
> At the end just do CTRL-D at the shell prompt to end the job.
>
> Check the details in 'man qsub'.
>
> I hope this helps,
> Gus Correa
>
> On Feb 11, 2012, at 4:32 PM, Jason bacon wrote:
>
>> I've (apparently) had the same issue, but have not found a solution yet.
>>
>> Can you provide the following:
>>
>> 1. Operating system and version
>> 2. Torque version
>> 3. Relevant entries from server_logs on the submit node and mom_logs on the allocated compute node
>> 4. Anything unusual in the system log
>>
>> When I ran into this issue, I found some errors in the logs regarding failed socket connections. I think it might be a permissions issue, but have not had time to investigate yet.
>>
>> -J
>>
>> On 2/9/12 3:33 PM, Hakeem Almabrazi wrote:
>>> Hi,
>>>
>>> I have tried to submit a job using the option -I and I got the message
>>>
>>> Qsub: waiting for job # to start
>>> Qsub: job # ready
>>>
>>> And that is it.
>>>
>>> If I qstat I got a message saying the job # is still "R" running ...
>>>
>>> It looks like I have lack of understanding on how to use this option but here is my submit job request:
>>>
>>>> qsub -l nodes=1 -N jobName -I -v "some parameters" shellScript
>>>
>>>
>>> If I run the above request without the -I option, it runs fine without any issue.
>>>
>>> Someone might ask the question, why I am running it "interactively"?
>>>
>>> Well, I want to force the program which issued the request to wait for the result and do something with it after that.
>>>
>>> Thank you for your help.
>>>
From sm4082 at nyu.edu Sun Feb 12 08:53:14 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Sun, 12 Feb 2012 10:53:14 -0500
Subject: [torqueusers] Specifying nodes which can be used in array job
In-Reply-To: <4F34CC19.6070805@gmail.com>
References:
<4F34CC19.6070805@gmail.com>
Message-ID: <35ADCB45-7995-488E-A336-515F89ED63BD@nyu.edu>
Hi Rainer,
As Ken wrote, this is possible with the feature (node property) mechanism. I use it heavily to place jobs on specific nodes.
To add a feature to the nodes:
for i in {1..5}; do qmgr -c "set node node0$i properties += arrays"; done
Here the feature is arrays; you can replace that with whatever name you like.
Once you've done this, you can get array jobs placed on these nodes by requesting the feature in qsub, such as
>>> qsub the_script.sub -t 1-10 -l feature='arrays'
This would put your jobs on the nodes that have the property arrays, in this case node01 to node05.
In my case I wrote a qsub wrapper which goes through the PBS scripts and command line and adds a feature line such as #PBS -l feature= to the script so that jobs are placed on the right nodes. This comes in very handy, especially when you have nodes with different amounts of memory under the same queue.
If your scheduler is moab you can do really cool stuff using this feature property.
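The same idea can be written as a small helper that prints the qmgr commands instead of running them, so the node names can be checked first. This is a sketch and not the wrapper mentioned above: node_property_cmds is a hypothetical name, while the node01 to node05 names and the arrays property come from this thread. In plain Torque a property can also be requested as part of the node specification (-l nodes=1:arrays); whether -l feature= is honoured may depend on the scheduler, so either form is worth verifying locally.

```shell
#!/bin/sh
# Print one qmgr command per node, e.g. for node01..node05.
# $1 = name prefix, $2 = first index, $3 = last index, $4 = property
node_property_cmds() {
    i=$2
    while [ "$i" -le "$3" ]; do
        printf 'qmgr -c "set node %s%02d properties += %s"\n' "$1" "$i" "$4"
        i=$((i + 1))
    done
}

node_property_cmds node 1 5 arrays
# Review the output, then pipe it to sh as a Torque manager to apply it.
```

Printing first avoids touching a node that does not exist, which is easy to do when the index range is off by one.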
Hope this helps.
Sreedhar.
On 10-Feb-2012, at 2:49 AM, Rainer M Krug wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 09/02/12 23:39, Ken Nielson wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Rainer M Krug" To:
>>> torqueusers at supercluster.org Sent: Thursday, February 9, 2012
>>> 2:16:07 AM Subject: [torqueusers] Specifying nodes which can be
>>> used in array job
>>>
>>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
>>>
>>> Hi
>>>
>>> assuming I have cluster of 10 nodes (node01, ... node10), of
>>> which I am not the administrator.
>>>
>>> Some nodes are setup slightly different, so that a certain job
>>> only runs on nodes node01 to node05.
>>>
>>> So I would like to submit an array job and specify "only use the
>>> node01, node02, node03, node04 or node05 to run the each
>>> individual job".
>>>
>>> How can I do that? I know that I can use -l to specify resource
>>> requirements, but if I specify nodes=..., *each* job will
>>> allocate *all* nodes for the job, which is not what I want - each
>>> individual job should run on one of the nodes.
>>>
>>> so:
>>>
>>> qsub the_script.sub -t 1-10
>>>
>>> and how do I specify the nodes?
>>>
>>> Thanks,
>>>
>>> Rainer
>>
>> Rainer,
>>
>> Are there feature (properties) in the nodes files of those hosts
>> which would allow you to specify a feature on the qsub line?
>
> No - unfortunately not.
>
>>
>> Ken
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk80zBgACgkQoYgNqgF2egri1wCfUUqDmOigKB8hCyCvt30pu5jZ
> kewAnjfVc6o7rIjFua0ukEBhkaNe5McS
> =nBnt
> -----END PGP SIGNATURE-----
>
From r.m.krug at gmail.com Mon Feb 13 01:37:11 2012
From: r.m.krug at gmail.com (Rainer M Krug)
Date: Mon, 13 Feb 2012 09:37:11 +0100
Subject: [torqueusers] Specifying nodes which can be used in array job
In-Reply-To: <35ADCB45-7995-488E-A336-515F89ED63BD@nyu.edu>
References:
<4F34CC19.6070805@gmail.com>
<35ADCB45-7995-488E-A336-515F89ED63BD@nyu.edu>
Message-ID: <4F38CBB7.8000107@gmail.com>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Thanks a lot - this definitely helps. I will get in contact with our
admin to add the features to the nodes.
Cheers,
Rainer
On 12/02/12 16:53, Sreedhar Manchu wrote:
> Hi Rainer,
>
> Like Ken wrote it is possible with feature property. I use this
> feature heavily to place jobs on specific nodes.
>
> To add feature to nodes
>
> for i in {0..5}; do qmgr -c "set node node0$i properties +=
> arrays"; done
>
> Here feature is arrays. You can replace that with whatever you
> like.
>
> Once you've done this you can get array jobs placed on these nodes
> by requesting this feature in qsub such as
>
>>>> qsub the_script.sub -t 1-10 -l feature='arrays'
>
> This would put your jobs on the nodes that have property arrays. In
> this case the nodes are 0 to 5.
>
> In my case I wrote a qsub wrapper which goes through the pbs
> scripts and command line and adds this feature line such as #PBS -l
> feature= to the script so that they are placed on
> right nodes. This comes very handy especially when you have nodes
> with different amounts of memory under the same queue.
>
> If your scheduler is moab you can do really cool stuff using this
> feature property.
>
> Hope this helps.
>
> Sreedhar.
>
>
>
> On 10-Feb-2012, at 2:49 AM, Rainer M Krug wrote:
>
> On 09/02/12 23:39, Ken Nielson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Rainer M Krug"
>>>>> To: torqueusers at supercluster.org
>>>>> Sent: Thursday, February 9, 2012 2:16:07 AM
>>>>> Subject: [torqueusers] Specifying nodes which can be used in array job
>>>>>
>>>>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
>>>>>
>>>>> Hi
>>>>>
>>>>> assuming I have cluster of 10 nodes (node01, ... node10),
>>>>> of which I am not the administrator.
>>>>>
>>>>> Some nodes are setup slightly different, so that a certain
>>>>> job only runs on nodes node01 to node05.
>>>>>
>>>>> So I would like to submit an array job and specify "only
>>>>> use the node01, node02, node03, node04 or node05 to run the
>>>>> each individual job".
>>>>>
>>>>> How can I do that? I know that I can use -l to specify
>>>>> resource requirements, but if I specify nodes=..., *each*
>>>>> job will allocate *all* nodes for the job, which is not
>>>>> what I want - each individual job should run on one of the
>>>>> nodes.
>>>>>
>>>>> so:
>>>>>
>>>>> qsub the_script.sub -t 1-10
>>>>>
>>>>> and how do I specify the nodes?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Rainer
>>>>
>>>> Rainer,
>>>>
>>>> Are there feature (properties) in the nodes files of those
>>>> hosts which would allow you to specify a feature on the qsub
>>>> line?
>
> No - unfortunately not.
>
>>>>
>>>> Ken
>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAk84y7cACgkQoYgNqgF2egodAgCfXBJiNsn+NtC8B3fO3R1fQTGd
VG0AnjzI5iBr390vLggHRpm4EmRybxSC
=x/dl
-----END PGP SIGNATURE-----
From R.M.Krug at gmail.com Mon Feb 13 04:42:57 2012
From: R.M.Krug at gmail.com (Rainer M Krug)
Date: Mon, 13 Feb 2012 12:42:57 +0100
Subject: [torqueusers] Setting up torque on PC to test scripts?
Message-ID:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi
I am thinking of setting up a dummy cluster on my local PC to test my
submit scripts. Is this easily possible? Or is there even a virtual
machine to download which has torque installed? That would be the easiest.
Thanks,
Rainer
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAk8490EACgkQoYgNqgF2egqJDgCfXYX1gv1i2B/ABAaZjoEtg44C
awMAniAtzHKb8znKxGmbHJUDOcFs5ohT
=RUyd
-----END PGP SIGNATURE-----
From nt_mahmood at yahoo.com Mon Feb 13 06:15:24 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Mon, 13 Feb 2012 05:15:24 -0800 (PST)
Subject: [torqueusers] submitting a job, then modify script and resubmit
Message-ID: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com>
Dear all,
1- Assume I have a script (named scr.sh) like this:
#PBS -N run1
#PBS -V
#PBS -l nodes=1
#PBS -q long
#PBS -o /home/mahmood/run1.out
#PBS -j oe
cd $PBS_O_WORKDIR
./run1 config1
2- Then I run
qsub scr.sh
3- the job's state is 'Q' since all cores are busy. That is fine...
4- while run1 is in 'Q', I reopen scr.sh and change it to
#PBS -N run2
#PBS -V
#PBS -l nodes=1
#PBS -q long
#PBS -o /home/mahmood/run2.out
#PBS -j oe
cd $PBS_O_WORKDIR
./run1 config2
5- Then I run
qsub scr.sh
6- this job also is in 'Q' until some cores become free.
7- After some hours, two cores become free and both "run1" and "run2" change to 'R'.
The question is: when run1 changes to 'R', will it read its own scr.sh (which contains "./run1 config1")?
Since the latest modification to scr.sh belongs to run2, I wonder what happens to run1.
You may say that I can test that myself to see the result, but since the jobs stay in 'Q' for a long time, I want to be sure about what I am doing.
I hope that I have stated the problem correctly.
// Naderan *Mahmood;
From akohlmey at cmm.chem.upenn.edu Mon Feb 13 07:03:20 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Mon, 13 Feb 2012 09:03:20 -0500
Subject: [torqueusers] submitting a job, then modify script and resubmit
In-Reply-To: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com>
References: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com>
Message-ID:
On Mon, Feb 13, 2012 at 8:15 AM, Mahmood Naderan wrote:
> Dear all,
> 1- Assume I have a script (named scr.sh) like this:
>
> #PBS -N run1
> #PBS -V
> #PBS -l nodes=1
> #PBS -q long
> #PBS -o /home/mahmood/run1.out
> #PBS -j oe
> cd $PBS_O_WORKDIR
> ./run1 config1
>
> 2- Then I run
> qsub scr.sh
>
> 3- the jobs state is 'Q' since all cores are busy. That is fine...
> 4- while run1 is in 'Q', I reopen scr.sh and change it to
>
> #PBS -N run2
> #PBS -V
> #PBS -l nodes=1
> #PBS -q long
> #PBS -o /home/mahmood/run2.out
> #PBS -j oe
> cd $PBS_O_WORKDIR
> ./run1 config2
>
>
> 5- Then I run
> qsub scr.sh
>
> 6- this job also is in 'Q' until some cores become free.
> 7- After some hours, two cores become free and both "run1" and "run2" change to 'R'.
>
> The question is: when run1 changes to 'R', will it read its own scr.sh (which contains "./run1 config1")?
> Since the latest modification to scr.sh belongs to run2, I wonder what happens to run1.
qsub makes a copy of the submit script, and the batch system will
execute that copy. that is the only way it can consistently support
feeding the submission from standard input.
axel.
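Because the script is copied at submission time, the two jobs in the question can also be submitted without editing one file in place. A minimal sketch, assuming a hypothetical helper make_job that prints the job script from the question with the job name and config file filled in; its output can be piped to qsub, which reads the script from standard input when no script file is given.

```sh
# make_job NAME CONFIG   (hypothetical helper)
# Prints the job script from the question for one run. The backslash keeps
# $PBS_O_WORKDIR literal so that it is expanded inside the job, not here.
make_job() {
    mj_name=$1
    mj_config=$2
    cat <<EOF
#PBS -N $mj_name
#PBS -V
#PBS -l nodes=1
#PBS -q long
#PBS -o /home/mahmood/$mj_name.out
#PBS -j oe
cd \$PBS_O_WORKDIR
./run1 $mj_config
EOF
}
```

Usage would look like: make_job run1 config1 | qsub, followed by make_job run2 config2 | qsub. Each job then carries its own copy of the script text.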
--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
From nt_mahmood at yahoo.com Mon Feb 13 07:48:50 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Mon, 13 Feb 2012 06:48:50 -0800 (PST)
Subject: [torqueusers] submitting a job, then modify script and resubmit
In-Reply-To:
References: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com>
Message-ID: <1329144530.34446.YahooMailNeo@web111714.mail.gq1.yahoo.com>
Thanks for your answer. I got that.
Another question is about modifying other config files. Does qsub copy all files, or does it only copy the script?
In this case, I have a scr.sh which is:
#PBS ...
./run config.ini
the config.ini file looks like:
exe = bin1
args = 8
The first qsub is:
qsub scr.sh
After submitting that, I modify the config.ini to
exe = bin1
args = 64
Then I resubmit scr.sh again
qsub scr.sh
Please note that the scr.sh doesn't change in this case. However the config.ini is modified.
Does your answer apply to this case?
Regards
// Naderan *Mahmood;
From akohlmey at cmm.chem.upenn.edu Mon Feb 13 07:54:23 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Mon, 13 Feb 2012 09:54:23 -0500
Subject: [torqueusers] submitting a job, then modify script and resubmit
In-Reply-To: <1329144530.34446.YahooMailNeo@web111714.mail.gq1.yahoo.com>
References: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com>
<1329144530.34446.YahooMailNeo@web111714.mail.gq1.yahoo.com>
Message-ID:
On Mon, Feb 13, 2012 at 9:48 AM, Mahmood Naderan wrote:
> Thanks for your answer. I got that.
>
> Another question is about modification of other config files. Does qsub copy all files or it does only copy the script?
it only copies the script. that is all it knows about.
axel.
From nt_mahmood at yahoo.com Mon Feb 13 07:56:46 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Mon, 13 Feb 2012 06:56:46 -0800 (PST)
Subject: [torqueusers] submitting a job, then modify script and resubmit
In-Reply-To:
References: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com>
<1329144530.34446.YahooMailNeo@web111714.mail.gq1.yahoo.com>
Message-ID: <1329145006.7678.YahooMailNeo@web111712.mail.gq1.yahoo.com>
That sounds bad...
Why doesn't it copy all the necessary files for a job?
// Naderan *Mahmood;
From akohlmey at cmm.chem.upenn.edu Mon Feb 13 08:01:46 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Mon, 13 Feb 2012 10:01:46 -0500
Subject: [torqueusers] submitting a job, then modify script and resubmit
In-Reply-To: <1329145006.7678.YahooMailNeo@web111712.mail.gq1.yahoo.com>
References: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com>
<1329144530.34446.YahooMailNeo@web111714.mail.gq1.yahoo.com>
<1329145006.7678.YahooMailNeo@web111712.mail.gq1.yahoo.com>
Message-ID:
On Mon, Feb 13, 2012 at 9:56 AM, Mahmood Naderan wrote:
> That sounds bad...
> Why it doesn't copy all necessary files for a job?
how should it know?
remember the script interpreter for the batch script
can be *anything* that reads a text file. it doesn't
have to be a shell program. and the arguments and
commands in the script can be computed on the
fly. how could qsub know them in advance?
if you want consistent behavior, you have to
make the copies yourself, e.g. create a subdirectory
for each submission and copy stuff in there from
a wrapper script around qsub. you are the only person
who can set this up, since you know which files will be
needed by a job.
only the submit script is required.
axel.
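The wrapper idea above can be sketched as follows. This is only an illustration under stated assumptions: the helper name submit_snapshot, the run.XXXXXX directory naming and the RUN_ROOT and QSUB variables are all hypothetical, and the snapshot directory has to live on a filesystem that the compute nodes can also see. Since the job script does "cd $PBS_O_WORKDIR", submitting from inside the snapshot directory makes the job read the copied config files rather than the ones that are edited later.

```sh
# submit_snapshot SCRIPT [FILE...]   (hypothetical wrapper around qsub)
# Copies SCRIPT and the listed files into a fresh directory and submits the
# job from there. The command in $QSUB (default: qsub) prints its output
# first; the last line printed is the snapshot directory.
submit_snapshot() {
    ss_script=$1
    shift

    # One directory per submission, created under $RUN_ROOT (default: .).
    ss_dir=$(mktemp -d "${RUN_ROOT:-.}/run.XXXXXX") || return 1

    # Freeze the script and every file the job will need.
    cp -- "$ss_script" "$@" "$ss_dir"/ || return 1

    # Submit from inside the snapshot so PBS_O_WORKDIR points at it.
    ( cd "$ss_dir" && ${QSUB:-qsub} "${ss_script##*/}" ) || return 1

    printf '%s\n' "$ss_dir"
}
```

For the config.ini case in this thread the call would be: submit_snapshot scr.sh config.ini. Setting QSUB=echo gives a dry run that only creates the snapshot and prints what would be submitted.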
From nt_mahmood at yahoo.com Mon Feb 13 08:06:50 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Mon, 13 Feb 2012 07:06:50 -0800 (PST)
Subject: [torqueusers] submitting a job, then modify script and resubmit
In-Reply-To:
References: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com>
<1329144530.34446.YahooMailNeo@web111714.mail.gq1.yahoo.com>
<1329145006.7678.YahooMailNeo@web111712.mail.gq1.yahoo.com>
Message-ID: <1329145610.26261.YahooMailNeo@web111717.mail.gq1.yahoo.com>
OK, thanks.
It's my job then.
// Naderan *Mahmood;
From gus at ldeo.columbia.edu Mon Feb 13 08:53:33 2012
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Mon, 13 Feb 2012 10:53:33 -0500
Subject: [torqueusers] Setting up torque on PC to test scripts?
In-Reply-To:
References:
Message-ID:
On Feb 13, 2012, at 6:42 AM, Rainer M Krug wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi
>
> I am thinking of setting up a dummy cluster on my local PC to test my
> submit scripts. Is this easily possible? Or is there even a virtual
> machine to download which has torque installed? That would be the easiest.
>
> Thanks,
>
> Rainer
>
Hi Rainer
We did this on a few machines here, and a lot of people also use Torque on a single
machine, as it is very convenient for scheduling and controlling jobs.
Many Linux distributions have Torque packages.
That is the fast way to install it.
Check with your 'yum', 'apt-get' or similar.
However, I don't think Maui is packaged as well, but I haven't checked this lately.
If you want full control (to choose the release, etc.),
you can download and install both from source on your standalone machine.
Then run the three daemons, pbs_server, maui [or if you prefer it, pbs_sched],
and pbs_mom on that machine.
The setup for queues, etc., is about the same.
It is simpler if you let the server name just default to 'localhost'.
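As a rough illustration of what "about the same" setup means on one machine, here is a minimal sketch, assuming a hypothetical helper single_node_plan that only prints the steps for review and changes nothing. The queue name "batch" and the default Torque directory /var/spool/torque are assumptions; packaged installations may use a different location.

```sh
# single_node_plan [NP] [TORQUE_DIR]   (hypothetical helper, prints only)
# Prints the nodes-file entry and the qmgr commands for a one-machine
# setup with NP cores and a single execution queue called "batch".
single_node_plan() {
    snp_np=${1:-1}
    snp_home=${2:-/var/spool/torque}   # assumed default, check your install
    cat <<EOF
# 1. node list read by pbs_server: $snp_home/server_priv/nodes
localhost np=$snp_np
# 2. with pbs_server running, as root:
qmgr -c "create queue batch queue_type=execution"
qmgr -c "set queue batch enabled=true"
qmgr -c "set queue batch started=true"
qmgr -c "set server default_queue=batch"
qmgr -c "set server scheduling=true"
EOF
}
```

Running single_node_plan 4 prints the plan for a four-core PC; the printed lines can then be applied by hand once they have been checked against the installed version.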
I hope this helps,
Gus Correa
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk8490EACgkQoYgNqgF2egqJDgCfXYX1gv1i2B/ABAaZjoEtg44C
> awMAniAtzHKb8znKxGmbHJUDOcFs5ohT
> =RUyd
> -----END PGP SIGNATURE-----
From Michael.Zulauf at iberdrolaren.com Mon Feb 13 16:22:36 2012
From: Michael.Zulauf at iberdrolaren.com (Zulauf, Michael)
Date: Mon, 13 Feb 2012 15:22:36 -0800
Subject: [torqueusers] problem with jobs sharing cores
Message-ID:
A big thanks to Ken Nielson, Jim Coyle, and Fotis Georgatos - based on
their help I made some significant progress today. I haven't completely
worked out all the details yet, but I've found that by switching to
OpenMPI, I do not get the same problematic behavior. So it seems most
likely that the source of the problem has something to do with a
configuration detail of our mvapich2 installation.
According to some earlier benchmarking I'd done, mvapich2 seems to offer
better performance across our infiniband interconnect, so I'd like to
see if I can get to the bottom of the issue with that. Alternatively, I
could use mvapich2 for the "large" jobs (which span multiple nodes), and
use OpenMPI for the "small" jobs (which will share a node with other
jobs). I'd prefer to avoid the "dual MPI" alternative, as then we'd
have to build all executables twice, and some of them are a bit tricky.
Still, I suppose it's an option.
In any case, thanks again. Now maybe I can go haunt the mvapich2 lists,
or at least start trying to dig up the solution in that documentation.
Happy computing,
Mike
--
Mike Zulauf
Meteorologist, Lead Senior
Asset Optimization
Iberdrola Renewables
1125 NW Couch, Suite 700
Portland, OR 97209
Office: 503-478-6304 Cell: 503-913-0403
From cholam20 at yahoo.co.in Mon Feb 13 22:57:05 2012
From: cholam20 at yahoo.co.in (revathi ganesh)
Date: Tue, 14 Feb 2012 11:27:05 +0530 (IST)
Subject: [torqueusers] this has been your time to shine...
Message-ID: <1329199025.85043.androidMobile@web137305.mail.in.yahoo.com>
it was so difficult living paycheck to paycheck this is the best thing that ever happened to me I had hit an all time low.