From basv at sara.nl Tue Jan 3 06:02:23 2012
From: basv at sara.nl (Bas van der Vlies)
Date: Tue, 3 Jan 2012 14:02:23 +0100
Subject: [torqueusers] ANNOUNCE: New version of torque python interface (4.3.3)
Message-ID:

=========== 4.3.3
* New generated files for pbs_wrap.c and pbs.py to support python 3.X versions
  Reported by: Steve Traylen
  Author: Bas van der Vlies

* Fixed AdvancedParser: when using a 01-12 range, the zero was not appended.
  Reported by: Ramon Bastiaans
  Author: Dennis Stam

* examples/sara_nodes.py: Catch the PBSQuery error if we cannot make a
  connection with the batch server and exit the program.
  Author: Bas van der Vlies

* Remove the debian dependency on AdvancedParser. We now have
  our own PBSAdvanceParser so it does not conflict with other SARA
  packages.
  Reported by: Ramon Bastiaans (SARA)
  Fixed by: Bas van der Vlies

* Make pbs_torque depend only on libtorque instead of torque and adjusted
  the maintainer of the package, closes #30
  Reported by: Guillermo Marcus
  Fixed by: Bas van der Vlies

==============================================================
The latest stable pbs_python interface is available from:
  ftp://ftp.sara.nl/pub/outgoing/pbs_python.tar.gz

Information, documentation and reporting bugs for the package:
  https://subtrac.sara.nl/oss/pbs_python

===== Brief description =========================================

Pbs_python interface is a wrapper class for the TORQUE C LIB API. Now you can write utilities/extensions in Python instead of C.

--- Testing the package:

The test programs are included as a reference for how to use the pbs python module. You have to edit some test programs to reflect your Torque installation.
pbsmon.py        - ascii xpbsmon
rack_pbsmon.py   - ascii xpbsmon by rack layout
pbsnodes-a.py    - pbsnodes -a
pbs_version.py   - print server version
set_property.py  - set some node properties
resmom_info.py   - queries the pbs_mom daemon on the nodes
logpbs.py        - Shows the usage of the PBS logging routines
new_interface.py - Example how to use PBSQuery module
PBSQuery.py      - python /PBSQuery.py (has builtin demo)
sara_nodes.py    - We use this program to set the nodes offline/online.
                   When there are no command line arguments, it will list
                   the nodes that are down/offline.

For more info see:
  https://subtrac.sara.nl/oss/pbs_python/wiki/TorqueExamples

For more info about SARA see:
  http://www.sara.nl/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120103/9c41fe06/attachment-0001.html

From dbeer at adaptivecomputing.com Tue Jan 3 09:21:10 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 03 Jan 2012 09:21:10 -0700 (MST)
Subject: [torqueusers] ANNOUNCE: New version of torque python interface (4.3.3)
In-Reply-To:
Message-ID: <0de0497f-e7d7-422d-af64-3418a24a162d@mail>

Thanks for your work on this Bas!

David

----- Original Message -----
> =========== 4.3.3
> * New generated files for pbs_wrap.c and pbs.py to support python 3.X versions
>   Reported by: Steve Traylen < steve dot traylen add cern dot ch >
>   Author: Bas van der Vlies
>
> * Fixed AdvancedParser: when using a 01-12 range, the zero was not appended.
>   Reported by: Ramon Bastiaans
>   Author: Dennis Stam
>
> * examples/sara_nodes.py: Catch the PBSQuery error if we cannot make a
>   connection with the batch server and exit the program.
>   Author: Bas van der Vlies
>
> * Remove the debian dependency on AdvancedParser. We now have
>   our own PBSAdvanceParser so it does not conflict with other SARA
>   packages.
>   Reported by: Ramon Bastiaans (SARA)
>   Fixed by: Bas van der Vlies
>
> * Make pbs_torque depend only on libtorque instead of torque and adjusted
>   the maintainer of the package, closes #30
>   Reported by: Guillermo Marcus
>   Fixed by: Bas van der Vlies
>
> ==============================================================
> The latest stable pbs_python interface is available from:
>   ftp://ftp.sara.nl/pub/outgoing/pbs_python.tar.gz
>
> Information, documentation and reporting bugs for the package:
>   https://subtrac.sara.nl/oss/pbs_python
>
> ===== Brief description =========================================
>
> Pbs_python interface is a wrapper class for the TORQUE C LIB API. Now
> you can write utilities/extensions in Python instead of C.
>
> --- Testing the package:
>
> The test programs are included as a reference for how to use the pbs
> python module. You have to edit some test programs to reflect your
> Torque installation.
>
> pbsmon.py        - ascii xpbsmon
> rack_pbsmon.py   - ascii xpbsmon by rack layout
> pbsnodes-a.py    - pbsnodes -a
> pbs_version.py   - print server version
> set_property.py  - set some node properties
> resmom_info.py   - queries the pbs_mom daemon on the nodes
> logpbs.py        - Shows the usage of the PBS logging routines
> new_interface.py - Example how to use PBSQuery module
> PBSQuery.py      - python /PBSQuery.py (has builtin demo)
> sara_nodes.py    - We use this program to set the nodes offline/online.
>                    When there are no command line arguments, it will list
>                    the nodes that are down/offline.
> For more info see:
>   https://subtrac.sara.nl/oss/pbs_python/wiki/TorqueExamples
>
> For more info about SARA see:
>   http://www.sara.nl/
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606

From agarwal1975 at gmail.com Tue Jan 3 09:23:38 2012
From: agarwal1975 at gmail.com (Ashish Agarwal)
Date: Tue, 3 Jan 2012 11:23:38 -0500
Subject: [torqueusers] load environment variables before parsing PBS script
Message-ID:

On our cluster we have Python 2.4.3 installed in the system location, but I would like to write a PBS script using Python 2.7.2. The latter requires me to first load some environment variables. How can I do this within a PBS script?

Here's a test script:

---- python.pbs ----
#!/usr/bin/env python

#PBS -o /home//$PBS_JOBID.out
#PBS -j oe

import sys
print sys.version
----

Submitting this will print 2.4.3. Now, how can I make this script print 2.7.2, which requires me to run some bash commands first?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120103/90b97e8c/attachment.html

From akohlmey at cmm.chem.upenn.edu Tue Jan 3 09:26:03 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Tue, 3 Jan 2012 11:26:03 -0500
Subject: [torqueusers] load environment variables before parsing PBS script
In-Reply-To:
References:
Message-ID:

On Tue, Jan 3, 2012 at 11:23 AM, Ashish Agarwal wrote:
> On our cluster we have Python 2.4.3 installed in the system location, but I
> would like to write a PBS script using Python 2.7.2. The latter requires me
> to first load some environment variables. How can I do this within a PBS
> script?

try: man env

axel

>
> Here's a test script:
>
> ---- python.pbs ----
> #!/usr/bin/env python
>
> #PBS -o /home//$PBS_JOBID.out
> #PBS -j oe
>
> import sys
> print sys.version
> ----
>
> Submitting this will print 2.4.3. Now, how can I make this script print
> 2.7.2, which requires me to run some bash commands first?
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

From samuel at unimelb.edu.au Tue Jan 3 17:19:00 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Wed, 04 Jan 2012 11:19:00 +1100
Subject: [torqueusers] load environment variables before parsing PBS script
In-Reply-To:
References:
Message-ID: <4F039AF4.3060609@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 04/01/12 03:23, Ashish Agarwal wrote:

> Submitting this will print 2.4.3. Now, how can I make this script print
> 2.7.2, which requires me to run some bash commands first?

I would suggest you look at Modules ( http://modules.sf.net/ ) and implement that. Then make your pbs script a bash script which can then do:

module load python/2.7.2
python foo.py

Hope this helps!
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk8DmvQACgkQO2KABBYQAh8MHgCePDexK9XUNVOeTd+ocI29Y3iE
bOIAn2qktKXHgd1vwmvaYgj4yidPOvZu
=k0MS
-----END PGP SIGNATURE-----

From Gareth.Williams at csiro.au Tue Jan 3 21:35:04 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Wed, 4 Jan 2012 15:35:04 +1100
Subject: [torqueusers] torque 3.0.3 on uv
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102C8732E86@exvic-mbx04.nexus.csiro.au>

Hi All,

I've started configuring torque 3.0.3 on an SGI UV system (following http://www.clusterresources.com/torquedocs/1.7torqueonnuma.shtml) and am having problems.

I started with a working non-numa 3.0.3 setup as a sanity check.

I configured it with --enable-numa-support and made a nodes file:

cherax-1 np=48 num_numa_nodes=6

and mom.layout

#cpus=0-15 mem=0-1 /boot
cpus=16-23 mem=2
cpus=24-31 mem=3
#cpus=32-47 mem=4-5 /user
cpus=48-55 mem=6
cpus=55-63 mem=7
cpus=64-71 mem=8
cpus=72-79 mem=9

(note that some of the blades are set aside for io etc. and not all are currently on or configured).

'pbsnodes -a' then reports sensible info about (virtual) nodes cherax-1-0 through cherax-1-5

However then it gets messy.

I couldn't submit jobs anymore (ruserok errors).
Putting cherax-1 in a .rhosts file allowed me to submit a job which seemed to run ok but it failed to finish cleanly:

01/04/2012 14:41:46 S Reply sent for request type JobObituary on socket 17
01/04/2012 14:41:46 M scan_for_terminated: job 248.cherax-1.hpsc.csiro.au task 1 terminated, sid=148991
01/04/2012 14:41:46 M job was terminated
01/04/2012 14:41:46 M obit sent to server
01/04/2012 14:41:46 M server rejected job obit - 15001
01/04/2012 14:41:47 M removed job script
01/04/2012 14:41:54 S preparing to send 'a' mail for job 248.cherax-1.hpsc.csiro.au to wil240 at cherax-1.hpsc.csiro.au (Job does not exist on node)

The server log has messages about nodes changing state (I think the state=512 is unexpected):

01/04/2012 15:22:56;0040;PBS_Server;Req;is_stat_get;received status from node cherax-1
01/04/2012 15:22:56;0040;PBS_Server;Req;update_node_state;adjusting state for node cherax-1-4 - state=512, newstate=0

Is it possible that the name 'cherax-1' is being handled badly with a trailing hyphen-digit, similar to the virtual node designation?

Last, I also tried a non-uniform layout with numa_node_str

Nodes: cherax-1 numa_node_str=16,8,8,16

(with a compatible mom.layout) and pbs_server crashed:

Jan 4 13:55:27 cherax-1 kernel: [66593.114396] pbs_server[118916] trap divide error ip:4106f5 sp:7fffcb74ec90 error:0 in pbs_server[400000+58000]

Has anyone used such a setup successfully?

Regards,

Gareth
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120104/94f22b07/attachment-0001.html

From dbeer at adaptivecomputing.com Wed Jan 4 09:53:13 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Wed, 04 Jan 2012 09:53:13 -0700 (MST)
Subject: [torqueusers] torque 3.0.3 on uv
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102C8732E86@exvic-mbx04.nexus.csiro.au>
Message-ID:

----- Original Message -----
> Hi All,
>
> I've started configuring torque 3.0.3 on an SGI UV system (following
> http://www.clusterresources.com/torquedocs/1.7torqueonnuma.shtml )
> and am having problems.
>
> I started with a working non-numa 3.0.3 setup as a sanity check.
>
> I configured it with --enable-numa-support and made a nodes file:
>
> cherax-1 np=48 num_numa_nodes=6
>
> and mom.layout
>
> #cpus=0-15 mem=0-1 /boot
> cpus=16-23 mem=2
> cpus=24-31 mem=3
> #cpus=32-47 mem=4-5 /user
> cpus=48-55 mem=6
> cpus=55-63 mem=7
> cpus=64-71 mem=8
> cpus=72-79 mem=9
>
> (note that some of the blades are set aside for io etc. and not all
> are currently on or configured).

For me this is the first red flag. I don't know that we have anyone successfully using non-sequential layouts (skipping a blade in the middle). I know we have other sites, in fact it is typical, that skip some at the beginning or end for the boot set, but I don't think anyone is skipping in the middle. Would it be possible to move that user either to the front or to the back?

> 'pbsnodes -a' then reports sensible info about (virtual) nodes
> cherax-1-0 through cherax-1-5
>
> However then it gets messy.
>
> I couldn't submit jobs anymore (ruserok errors).
> Putting cherax-1 in a .rhosts file allowed me to submit a job which
> seemed to run ok but it failed to finish cleanly:
>
> 01/04/2012 14:41:46 S Reply sent for request type JobObituary on socket 17
> 01/04/2012 14:41:46 M scan_for_terminated: job 248.cherax-1.hpsc.csiro.au task 1 terminated, sid=148991
> 01/04/2012 14:41:46 M job was terminated
> 01/04/2012 14:41:46 M obit sent to server
> 01/04/2012 14:41:46 M server rejected job obit - 15001
> 01/04/2012 14:41:47 M removed job script
> 01/04/2012 14:41:54 S preparing to send 'a' mail for job 248.cherax-1.hpsc.csiro.au to wil240 at cherax-1.hpsc.csiro.au (Job does not exist on node)
>

Can you turn logging up (say to 10 or so) on the mom and the server and then reproduce this and email it to me? (The complete log)

> The server log has messages about nodes changing state (I think the
> state=512 is unexpected):
>
> 01/04/2012 15:22:56;0040;PBS_Server;Req;is_stat_get;received status from node cherax-1
> 01/04/2012 15:22:56;0040;PBS_Server;Req;update_node_state;adjusting state for node cherax-1-4 - state=512, newstate=0
>

If you could reproduce this in the same set, that'd be great. It's hard to know if 512 is unexpected or not without knowing why TORQUE set the state to 512.

> Is it possible that the name 'cherax-1' is being handled badly with a
> trailing hyphen-digit, similar to the virtual node designation?
>

It is possible, although I know for a fact that we have a site with the same naming convention which doesn't experience these problems. I would look at the non-sequential blades first.

> Last, I also tried a non-uniform layout with numa_node_str
>
> Nodes: cherax-1 numa_node_str=16,8,8,16
>
> (with a compatible mom.layout) and pbs_server crashed:
>
> Jan 4 13:55:27 cherax-1 kernel: [66593.114396] pbs_server[118916] trap divide error ip:4106f5 sp:7fffcb74ec90 error:0 in pbs_server[400000+58000]
>
> Has anyone used such a setup successfully?
>

I can't say for sure (hopefully someone else will chime in) but I thought we had sites using it. Would it be possible for you to enable core dumping and send me the core? If it is too large to email, you can upload it to our scp server. If this is necessary I'll send you the details directly. I would really like to fix this and I'm thinking it should be fairly straightforward.

--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606

From dbeer at adaptivecomputing.com Wed Jan 4 10:51:12 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Wed, 04 Jan 2012 10:51:12 -0700 (MST)
Subject: [torqueusers] torque 3.0.3 on uv
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102C8732E86@exvic-mbx04.nexus.csiro.au>
Message-ID: <419ebb0f-52a4-4228-bb9f-3bfcac59c959@mail>

Actually, I was able to reproduce and fix the crash. Let me know if you would like a snapshot build.

David

----- Original Message -----
> Hi All,
>
> I've started configuring torque 3.0.3 on an SGI UV system (following
> http://www.clusterresources.com/torquedocs/1.7torqueonnuma.shtml )
> and am having problems.
>
> I started with a working non-numa 3.0.3 setup as a sanity check.
>
> I configured it with --enable-numa-support and made a nodes file:
>
> cherax-1 np=48 num_numa_nodes=6
>
> and mom.layout
>
> #cpus=0-15 mem=0-1 /boot
> cpus=16-23 mem=2
> cpus=24-31 mem=3
> #cpus=32-47 mem=4-5 /user
> cpus=48-55 mem=6
> cpus=55-63 mem=7
> cpus=64-71 mem=8
> cpus=72-79 mem=9
>
> (note that some of the blades are set aside for io etc. and not all
> are currently on or configured).
>
> 'pbsnodes -a' then reports sensible info about (virtual) nodes
> cherax-1-0 through cherax-1-5
>
> However then it gets messy.
>
> I couldn't submit jobs anymore (ruserok errors).
> Putting cherax-1 in a .rhosts file allowed me to submit a job which
> seemed to run ok but it failed to finish cleanly:
>
> 01/04/2012 14:41:46 S Reply sent for request type JobObituary on socket 17
> 01/04/2012 14:41:46 M scan_for_terminated: job 248.cherax-1.hpsc.csiro.au task 1 terminated, sid=148991
> 01/04/2012 14:41:46 M job was terminated
> 01/04/2012 14:41:46 M obit sent to server
> 01/04/2012 14:41:46 M server rejected job obit - 15001
> 01/04/2012 14:41:47 M removed job script
> 01/04/2012 14:41:54 S preparing to send 'a' mail for job 248.cherax-1.hpsc.csiro.au to wil240 at cherax-1.hpsc.csiro.au (Job does not exist on node)
>
> The server log has messages about nodes changing state (I think the
> state=512 is unexpected):
>
> 01/04/2012 15:22:56;0040;PBS_Server;Req;is_stat_get;received status from node cherax-1
> 01/04/2012 15:22:56;0040;PBS_Server;Req;update_node_state;adjusting state for node cherax-1-4 - state=512, newstate=0
>
> Is it possible that the name 'cherax-1' is being handled badly with a
> trailing hyphen-digit, similar to the virtual node designation?
>
> Last, I also tried a non-uniform layout with numa_node_str
>
> Nodes: cherax-1 numa_node_str=16,8,8,16
>
> (with a compatible mom.layout) and pbs_server crashed:
>
> Jan 4 13:55:27 cherax-1 kernel: [66593.114396] pbs_server[118916] trap divide error ip:4106f5 sp:7fffcb74ec90 error:0 in pbs_server[400000+58000]
>
> Has anyone used such a setup successfully?
>
> Regards,
>
> Gareth
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606

From tbaer at utk.edu Wed Jan 4 11:35:18 2012
From: tbaer at utk.edu (Troy Baer)
Date: Wed, 4 Jan 2012 13:35:18 -0500
Subject: [torqueusers] torque 3.0.3 on uv
In-Reply-To:
References:
Message-ID: <1325702118.2542.115.camel@browncoat.jics.utk.edu>

On Wed, 2012-01-04 at 09:53 -0700, David Beer wrote:
> ----- Original Message -----
> > I've started configuring torque 3.0.3 on an SGI UV system (following
> > http://www.clusterresources.com/torquedocs/1.7torqueonnuma.shtml )
> > and am having problems.
> >
> > I started with a working non-numa 3.0.3 setup as a sanity check.
> >
> > I configured it with --enable-numa-support and made a nodes file:
> >
> > cherax-1 np=48 num_numa_nodes=6
> >
> > and mom.layout
> >
> > #cpus=0-15 mem=0-1 /boot
> > cpus=16-23 mem=2
> > cpus=24-31 mem=3
> > #cpus=32-47 mem=4-5 /user
> > cpus=48-55 mem=6
> > cpus=55-63 mem=7
> > cpus=64-71 mem=8
> > cpus=72-79 mem=9
> >
> > (note that some of the blades are set aside for io etc. and not all
> > are currently on or configured).
>
> For me this is the first red flag. I don't know that we have anyone
> successfully using non-sequential layouts (skipping a blade in the
> middle). I know we have other sites, in fact it is typical, that skip
> some at the beginning or end for the boot set, but I don't think
> anyone is skipping in the middle. Would it be possible to move that
> user either to the front or to the back?

The way I've handled this is to leave all the NUMA nodes in the mom.layout file, but then fence off the ones I don't want used by jobs by placing standing reservations on them in Moab and/or marking them offline in TORQUE.
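
The mom.layout quoted above can also be checked mechanically before pbs_mom reads it. What follows is a minimal sketch, not a TORQUE tool: it assumes each active line has the form `cpus=A-B mem=C[-D]` and that lines starting with `#` are ignored, as the posted file suggests, and the function names are hypothetical. On the posted layout it finds six active lines (matching num_numa_nodes=6) and reports that `cpus=48-55` and `cpus=55-63` both claim CPU 55, which may simply be a typo in the post.

```python
import re

# Layout exactly as posted in the thread. Treating '#' lines as ignored
# is an assumption based on how the poster used them.
POSTED_LAYOUT = """\
#cpus=0-15 mem=0-1 /boot
cpus=16-23 mem=2
cpus=24-31 mem=3
#cpus=32-47 mem=4-5 /user
cpus=48-55 mem=6
cpus=55-63 mem=7
cpus=64-71 mem=8
cpus=72-79 mem=9
"""

_RANGE = re.compile(r"^(\d+)(?:-(\d+))?$")


def parse_range(text):
    """Turn '16-23' or '2' into an inclusive (first, last) tuple."""
    match = _RANGE.match(text)
    if match is None:
        raise ValueError("bad range: %r" % (text,))
    first = int(match.group(1))
    last = int(match.group(2)) if match.group(2) is not None else first
    if last < first:
        raise ValueError("descending range: %r" % (text,))
    return first, last


def parse_layout(text):
    """Return a list of (cpu_range, mem_range), one entry per active line."""
    nodes = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = dict(part.split("=", 1) for part in line.split() if "=" in part)
        nodes.append((parse_range(fields["cpus"]), parse_range(fields["mem"])))
    return nodes


def total_cpus(nodes):
    """Sum the sizes of all CPU ranges (a shared CPU is counted twice)."""
    return sum(last - first + 1 for (first, last), _mem in nodes)


def overlapping_cpus(nodes):
    """Return index pairs of entries whose CPU ranges share at least one CPU."""
    clashes = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            (a_first, a_last), (b_first, b_last) = nodes[i][0], nodes[j][0]
            if a_first <= b_last and b_first <= a_last:
                clashes.append((i, j))
    return clashes


if __name__ == "__main__":
    layout = parse_layout(POSTED_LAYOUT)
    print("active NUMA nodes:", len(layout))
    print("CPUs listed:", total_cpus(layout))
    print("overlapping entries:", overlapping_cpus(layout))
```

Because CPU 55 appears in two entries, the ranges add up to 49 CPUs rather than the np=48 given in the nodes file. Whether that contributes to the problems described in the thread is not something the thread establishes.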

--Troy
--
Troy Baer, HPC System Administrator
National Institute for Computational Sciences, University of Tennessee
http://www.nics.tennessee.edu/
Phone: 865-241-4233

From dbeer at adaptivecomputing.com Thu Jan 5 10:18:22 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Thu, 05 Jan 2012 10:18:22 -0700 (MST)
Subject: [torqueusers] torque 3.0.3 on uv
In-Reply-To: <1325702118.2542.115.camel@browncoat.jics.utk.edu>
Message-ID:

Here is a snapshot with the fix I mentioned.

http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-3.0.4-snap.201201051014.tar.gz

David

----- Original Message -----
> On Wed, 2012-01-04 at 09:53 -0700, David Beer wrote:
> > ----- Original Message -----
> > > I've started configuring torque 3.0.3 on an SGI UV system
> > > (following
> > > http://www.clusterresources.com/torquedocs/1.7torqueonnuma.shtml )
> > > and am having problems.
> > >
> > > I started with a working non-numa 3.0.3 setup as a sanity check.
> > >
> > > I configured it with --enable-numa-support and made a nodes file:
> > >
> > > cherax-1 np=48 num_numa_nodes=6
> > >
> > > and mom.layout
> > >
> > > #cpus=0-15 mem=0-1 /boot
> > > cpus=16-23 mem=2
> > > cpus=24-31 mem=3
> > > #cpus=32-47 mem=4-5 /user
> > > cpus=48-55 mem=6
> > > cpus=55-63 mem=7
> > > cpus=64-71 mem=8
> > > cpus=72-79 mem=9
> > >
> > > (note that some of the blades are set aside for io etc. and not
> > > all are currently on or configured).
> >
> > For me this is the first red flag. I don't know that we have anyone
> > successfully using non-sequential layouts (skipping a blade in the
> > middle). I know we have other sites, in fact it is typical, that
> > skip some at the beginning or end for the boot set, but I don't think
> > anyone is skipping in the middle. Would it be possible to move that
> > user either to the front or to the back?
>
> The way I've handled this is to leave all the NUMA nodes in the
> mom.layout file, but then fence off the ones I don't want used by jobs
> by placing standing reservations on them in Moab and/or marking them
> offline in TORQUE.
>
> --Troy
> --
> Troy Baer, HPC System Administrator
> National Institute for Computational Sciences, University of Tennessee
> http://www.nics.tennessee.edu/
> Phone: 865-241-4233
>

--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606

From cholam20 at yahoo.co.in Thu Jan 5 21:17:57 2012
From: cholam20 at yahoo.co.in (revathi ganesh)
Date: Fri, 6 Jan 2012 09:47:57 +0530 (IST)
Subject: [torqueusers] Nice opportunity
Message-ID: <1325823477.24797.androidMobile@web137305.mail.in.yahoo.com>


-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120106/5870f468/attachment.html

From mrobbert at mines.edu Fri Jan 6 10:04:26 2012
From: mrobbert at mines.edu (Michael Robbert)
Date: Fri, 6 Jan 2012 10:04:26 -0700
Subject: [torqueusers] Bug fixes for 2.x branch
Message-ID: <3C7383D3-3A79-47CD-9F60-8EDBE680DB40@mines.edu>

Is it true that all development work has moved to 4.0, leaving users with bugs in 2.x waiting for a production release to be ready? I've had a problem with pbs_moms dying or getting stuck in a loop since before Thanksgiving and have been working the case through our vendor Penguin Computing. They told me just before Christmas that they'd contacted Adaptive about the issue and it looked similar to a known bug and that they were working on a fix. When I contacted Penguin after the break they said that they were now being told that the fix will be in 4.0 and we have to wait for that. I find it a little disturbing that support for a product is being dropped before the next production release is ready. I posted my analysis of the problem to this list on December 13 and didn't get any response so maybe that is the case, but I'd like to hear that from the maintainers of the code.

Thanks,
Mike Robbert

From dbeer at adaptivecomputing.com Fri Jan 6 10:12:42 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 06 Jan 2012 10:12:42 -0700 (MST)
Subject: [torqueusers] Bug fixes for 2.x branch
In-Reply-To: <3C7383D3-3A79-47CD-9F60-8EDBE680DB40@mines.edu>
Message-ID: <43f7aa2a-a1ec-4de3-aaa6-3b256f195b57@mail>

----- Original Message -----
> Is it true that all development work has moved to 4.0, leaving
> users with bugs in 2.x waiting for a production release to be ready?
> I've had a problem with pbs_moms dying or getting stuck in a loop
> since before Thanksgiving and have been working the case through our
> vendor Penguin Computing.
> They told me just before Christmas that
> they'd contacted Adaptive about the issue and it looked similar to a
> known bug and that they were working on a fix. When I contacted
> Penguin after the break they said that they were now being told that
> the fix will be in 4.0 and we have to wait for that. I find it a
> little disturbing that support for a product is being dropped before
> the next production release is ready. I posted my analysis of the
> problem to this list on December 13 and didn't get any response so
> maybe that is the case, but I'd like to hear that from the
> maintainers of the code.
>

Mike,

I think there must've been some signals crossed between our support and your vendor at Penguin. Can you email me directly with the case number so I can make sure to straighten things out? We have had development support for bug fixes throughout the push for 4.0. There have been times where new development has taken some of support's resources, but there have also been times where new development has stopped so we can all work on support. I don't know the details of your support ticket, but I would like to apologize that it has taken so long for you to get a solution, and if you message me the details I will be sure to get things worked out.

--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606

From dbeer at adaptivecomputing.com Fri Jan 6 10:23:23 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 06 Jan 2012 10:23:23 -0700 (MST)
Subject: [torqueusers] pbs_mom stuck in loop
In-Reply-To:
Message-ID:

Mike,

To me it would seem that this kind of loop can only happen as a result of some kind of data corruption (probably the same kind that in other cases is causing crashes). Do you have a core file for these crashes? That's probably the best way to find out what is going wrong.
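
One way to read the gdb output quoted further down: the task sits at 0xa5e5100, its ti_jobtask link would sit 8 bytes later at 0xa5e5108 (assuming ti_job is an 8-byte pointer stored first), and both ll_next and ll_prior hold 0xa5e5108 while ll_struct holds 0xa5e5100. In other words the link appears to point back at itself. The Python sketch below only illustrates why following such a link never reaches the end of a list; the `Link` class and the `append` and `walk` helpers are hypothetical and do not reproduce TORQUE's list code.

```python
class Link(object):
    """A node of an intrusive, circular doubly-linked list (illustrative only)."""

    def __init__(self, owner=None):
        self.owner = owner   # plays the role of ll_struct; None marks the list head
        self.next = self     # plays the role of ll_next; a fresh link points at itself
        self.prior = self    # plays the role of ll_prior


def append(head, link):
    """Insert link just before head, i.e. at the tail of the list."""
    link.prior = head.prior
    link.next = head
    head.prior.next = link
    head.prior = link


def walk(start):
    """Follow next pointers from start until the head (owner None) is reached.

    Returns the owners visited in order. Raises RuntimeError when a link is
    visited twice, which is the case an unguarded loop would spin on forever.
    """
    seen = set()
    owners = []
    link = start
    while link.owner is not None:
        if id(link) in seen:
            raise RuntimeError("cycle detected at %r" % (link.owner,))
        seen.add(id(link))
        owners.append(link.owner)
        link = link.next
    return owners


if __name__ == "__main__":
    head = Link()
    first, second = Link("task 1"), Link("task 2")
    append(head, first)
    append(head, second)
    print(walk(head.next))      # healthy list: ends when the head is reached

    detached = Link("task 1")   # next/prior point at itself, as in the gdb dump
    try:
        walk(detached)
    except RuntimeError as err:
        print("would loop forever:", err)
```

A traversal that starts on a link whose next pointer is itself can never reach the list head, so a scan without a guard like the `seen` set keeps revisiting the same task. How the link in the real process ended up in that state is not shown by the dump.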

As I'm looking at things it occurs to me that 2.5.6 had a pretty huge unprotected data bug (someone added a thread without adding any kind of protection) that could cause this exact problem. It is highly likely that upgrading will resolve this issue. I'm sorry to get you this reply late, as I've been busy and haven't been monitoring the mailing list as much as I often do.

David

----- Original Message -----
> We have a cluster of about 64 nodes running Scyld Clusterware 5.6.3
> which ships with Torque 2.5.6 and we are running Maui 3.3.1 as the
> scheduler on top of that. We are seeing nodes approximately daily
> showing up as down according to Torque, but they otherwise look
> normal. Sometimes we find that pbs_mom has crashed, but other times
> we find that it is still running and appears to be stuck in a loop.
> Yesterday I had a node that was stuck in the loop and I was able to
> attach gdb to the process to confirm that. I found that it was stuck
> in the function scan_non_child_tasks inside mom_mach.c. I'm able to
> step forward to confirm that it keeps running a loop in that
> function. I'm not a programmer, but it looks to me like the linked
> list that it is attempting to traverse has the same address for both
> the next and previous tasks no matter how many times we go through
> the loop.
> Here is some output to demonstrate:
>
> 3761 in mom_mach.c
> (gdb) p *task
> $2 = {ti_job = 0xa5e6130, ti_jobtask = {ll_prior = 0xa5e5108,
>     ll_next = 0xa5e5108, ll_struct = 0xa5e5100}, ti_fd = -1, ti_flags = 0,
>   ti_register = 0, ti_obits = {ll_prior = 0xa5e5130, ll_next = 0xa5e5130,
>     ll_struct = 0x0}, ti_info = {ll_prior = 0xa5e5148, ll_next = 0xa5e5148,
>     ll_struct = 0x0}, ti_qs = {
>     ti_parentjobid = "199592.mio.mines.edu", '\000' times>,
>     ti_parentnode = -1, ti_parenttask = 0, ti_task = 1, ti_status = 3,
>     ti_sid = 86040, ti_exitstat = 0, ti_u = {ti_hold = {
>         0 }}}}
> (gdb) n
> 3783 in mom_mach.c
> (gdb) n
> 3761 in mom_mach.c
> (gdb) n
> 3751 in mom_mach.c
> (gdb) n
> 3761 in mom_mach.c
> (gdb) p *task
> $3 = {ti_job = 0xa5e6130, ti_jobtask = {ll_prior = 0xa5e5108,
>     ll_next = 0xa5e5108, ll_struct = 0xa5e5100}, ti_fd = -1, ti_flags = 0,
>   ti_register = 0, ti_obits = {ll_prior = 0xa5e5130, ll_next = 0xa5e5130,
>     ll_struct = 0x0}, ti_info = {ll_prior = 0xa5e5148, ll_next = 0xa5e5148,
>     ll_struct = 0x0}, ti_qs = {
>     ti_parentjobid = "199592.mio.mines.edu", '\000' times>,
>     ti_parentnode = -1, ti_parenttask = 0, ti_task = 1, ti_status = 3,
>     ti_sid = 86040, ti_exitstat = 0, ti_u = {ti_hold = {
>         0 }}}}
> (gdb) bt full
> #0  scan_non_child_tasks () at mom_mach.c:3761
>         dent =
>         task = 0xa5e5100
>         job = 0xa5e6130
>         pdir = 0xa62bd10
>         first_time = 0
> #1  0x0000000000416fe9 in main_loop () at mom_main.c:8251
>         myla = 2.4703282292062327e-323
>         tmpTime =
>         id = "main_loop"
> #2  0x0000000000417221 in main (argc=5, argv=0x7fffa0d03718)
>     at mom_main.c:8406
>         rc = 0
>         tmpFD =
> (gdb)
>
> Any thoughts on how we're getting here and better yet how to prevent
> it?
>
> Thanks,
> Mike Robbert
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606

From mrobbert at mines.edu Fri Jan 6 10:26:58 2012
From: mrobbert at mines.edu (Michael Robbert)
Date: Fri, 6 Jan 2012 10:26:58 -0700
Subject: [torqueusers] Bug fixes for 2.x branch
In-Reply-To: <43f7aa2a-a1ec-4de3-aaa6-3b256f195b57@mail>
References: <43f7aa2a-a1ec-4de3-aaa6-3b256f195b57@mail>
Message-ID: <0E09F022-2700-4E4A-9D88-A247BCFCF8D4@mines.edu>

David,

Thank you for your quick response. Penguin Computing has not kept me in the loop very well with what they're doing on the back end so I don't have any knowledge of a support request other than their telling me that it exists. I have forwarded your response to them in hopes that they'll contact you. I am glad to hear that you don't have the support policy of only supporting the latest version like Penguin.

Thanks,
Mike

On Jan 6, 2012, at 10:12 AM, David Beer wrote:

> ----- Original Message -----
>> Is it true that all development work has moved to 4.0, leaving
>> users with bugs in 2.x waiting for a production release to be ready?
>> I've had a problem with pbs_moms dying or getting stuck in a loop
>> since before Thanksgiving and have been working the case through our
>> vendor Penguin Computing. They told me just before Christmas that
>> they'd contacted Adaptive about the issue and it looked similar to a
>> known bug and that they were working on a fix. When I contacted
>> Penguin after the break they said that they were now being told that
>> the fix will be in 4.0 and we have to wait for that. I find it a
>> little disturbing that support for a product is being dropped before
>> the next production release is ready.
I posted my analysis of the >> problem to this list on December 13 and didn't get any response so >> maybe that is the case, but I'd like to hear that from the >> maintainers of the code. >> > > Mike, > > I think there must've been some signals crossed between our support and your vendor at Penguin. Can you email me directly with the case number and I can make sure to straighten things out? We have had development support for bug fixes throughout the push for 4.0. There have been times where new development has taken some of support's resources, but there have also been times where new development has stopped so we can all work on support. I don't know the details of your support ticket, but I would like to apologize that it has taken so long for you to get a solution, and if you message me the details I will be sure to get things worked out. > > -- > David Beer > Direct Line: 801-717-3386 | Fax: 801-717-3738 > Adaptive Computing > 1712 S East Bay Blvd, Suite 300 > Provo, UT 84606 > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From dbeer at adaptivecomputing.com Fri Jan 6 11:19:45 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Fri, 06 Jan 2012 11:19:45 -0700 (MST) Subject: [torqueusers] Beta Update In-Reply-To: <0c6c7461-2773-4862-bb64-002d3290864e@mail> Message-ID: <7a2cce2c-fe85-4387-8f4e-89a09036a562@mail> All, We have an updated beta snapshot for 4.0.0. 
Fixes from the original include: - compiling with numa enabled - having the README files that were missing - only having one license included - fixing some warnings with munge enabled - fixing a deadlock if you're using job logging - adding some more buffer protections (solving some of the high load crashes) - add some error checking if there are disagreements between the mom_hierarchy and nodes files (nodes present in one but not the other) It can be downloaded here: http://www.adaptivecomputing.com/resources/downloads/torque/4.0-beta/torque-4.0.0-snap.201201061112.tar.gz -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From knielson at adaptivecomputing.com Fri Jan 6 13:39:04 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Fri, 06 Jan 2012 13:39:04 -0700 (MST) Subject: [torqueusers] [torquedev] Beta Update In-Reply-To: Message-ID: <95126526-64f1-4689-82b6-33120058cf9b@mail> Dominique, Our attorney is the one who edited this license. Again the only things that have changed are the removal of two expired provisions which prevented the commercial distribution of TORQUE prior to December 31, 2001 and an update to the the contact information. Users are not in jeopardy of violating the PBS License agreement because of these clarifications. In spite of the editing work done the license has not changed. Ken Nielson Adaptive Computing ----- Original Message ----- > From: "Dominique Belhachemi" > To: "David Beer" , "Torque Developers mailing list" > Cc: torquedev at adaptivecomputing.com, torqueusers at adaptivecomputing.com > Sent: Friday, January 6, 2012 11:48:11 AM > Subject: Re: [torquedev] Beta Update > > There are still two license files included: > > torque-4.0.0/contrib/PBS_License_2.3.txt > torque-4.0.0/PBS_License.txt > > Please revert all the license changes you made so far. Customers > might > loose the right to use torque if they violate the original license. 
> > The same applies to 2.5.x and 3.x versions. > > Thanks > -Dominique > > > On Fri, 6 Jan 2012, David Beer wrote: > > > All, > > > > We have an updated beta snapshot for 4.0.0. Fixes from the original > > include: > > > > - compiling with numa enabled > > - having the README files that were missing > > - only having one license included > > - fixing some warnings with munge enabled > > - fixing a deadlock if you're using job logging > > - adding some more buffer protections (solving some of the high > > load crashes) > > - add some error checking if there are disagreements between the > > mom_hierarchy and nodes files (nodes present in one but not the > > other) > > > > It can be downloaded here: > > http://www.adaptivecomputing.com/resources/downloads/torque/4.0-beta/torque-4.0.0-snap.201201061112.tar.gz > > > > -- > > David Beer > > Direct Line: 801-717-3386 | Fax: 801-717-3738 > > Adaptive Computing > > 1712 S East Bay Blvd, Suite 300 > > Provo, UT 84606 > > > > _______________________________________________ > > torquedev mailing list > > torquedev at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torquedev > > > > _______________________________________________ > torquedev mailing list > torquedev at supercluster.org > http://www.supercluster.org/mailman/listinfo/torquedev > From Gareth.Williams at csiro.au Sat Jan 7 05:47:54 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Sat, 7 Jan 2012 23:47:54 +1100 Subject: [torqueusers] torque 3.0.3 on uv In-Reply-To: <1325702118.2542.115.camel@browncoat.jics.utk.edu> References: <1325702118.2542.115.camel@browncoat.jics.utk.edu> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102C8732E9C@exvic-mbx04.nexus.csiro.au> > -----Original Message----- > From: Troy Baer [mailto:tbaer at utk.edu] > Sent: Thursday, 5 January 2012 5:35 AM > To: David Beer; Torque Users Mailing List > Subject: Re: [torqueusers] torque 3.0.3 on uv > > On Wed, 2012-01-04 at 09:53 -0700, David Beer wrote: > 
> ----- Original Message -----
> > > I've started configuring torque 3.0.3 on an SGI UV system (following
> > > http://www.clusterresources.com/torquedocs/1.7torqueonnuma.shtml )
> > > and am having problems.
> > >
> > > I started with a working non-numa 3.0.3 setup as a sanity check.
> > >
> > > I configured it with --enable-numa-support and made a nodes file:
> > >
> > > cherax-1 np=48 num_numa_nodes=6
> > >
> > > and mom.layout
> > >
> > > #cpus=0-15 mem=0-1 /boot
> > > cpus=16-23 mem=2
> > > cpus=24-31 mem=3
> > > #cpus=32-47 mem=4-5 /user
> > > cpus=48-55 mem=6
> > > cpus=55-63 mem=7
> > > cpus=64-71 mem=8
> > > cpus=72-79 mem=9
> > >
> > > (note that some of the blades are set aside for io etc. and not all
> > > are currently on or configured).
> >
> > For me this is the first red flag. I don't know that we have anyone
> > successfully using non-sequential layouts (skipping a blade in the
> > middle). I know we have other sites, in fact it is typical, that skip
> > some at the beginning or end for the boot set, but I don't think
> > anyone is skipping in the middle. Would it be possible to move that
> > user either to the front or to the back?
>
> The way I've handled this is to leave all the NUMA nodes in the
> mom.layout file, but then fence off the ones I don't want used by jobs
> by placing standing reservations on them in Moab and/or marking them
> offline in TORQUE.
>
> --Troy

Thanks Troy,

I've taken your advice for now and am working through the problem with support from David Beer. It looks like I'll be cranking up the debugger on Monday!
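As an aside on the mom.layout quoted above: a layout like this can be sanity-checked mechanically before pbs_mom reads it. The Python sketch below assumes the file syntax is exactly what appears in this thread (`cpus=A-B mem=C` or `mem=C-D`, with `#` starting a comment); `parse_layout` and `overlapping_cpus` are names made up for illustration and are not part of TORQUE. Run on the quoted file, it reports that CPU 55 is claimed by two lines (`cpus=48-55` and `cpus=55-63`). Nothing in this thread says that overlap is related to the problem being debugged.

```python
import re
from itertools import combinations

# Assumed syntax, taken only from the layout quoted in this thread.
LAYOUT_LINE = re.compile(r"cpus=(\d+)-(\d+)\s+mem=(\d+)(?:-(\d+))?")


def parse_layout(text):
    """Return (first_cpu, last_cpu) for every uncommented layout line."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        match = LAYOUT_LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognised layout line: {raw!r}")
        ranges.append((int(match.group(1)), int(match.group(2))))
    return ranges


def overlapping_cpus(ranges):
    """Return the sorted CPU ids that appear in more than one range."""
    shared = set()
    for (a1, b1), (a2, b2) in combinations(ranges, 2):
        low, high = max(a1, a2), min(b1, b2)
        if low <= high:
            shared.update(range(low, high + 1))
    return sorted(shared)


# The layout exactly as quoted in the message above.
QUOTED_LAYOUT = """\
#cpus=0-15 mem=0-1 /boot
cpus=16-23 mem=2
cpus=24-31 mem=3
#cpus=32-47 mem=4-5 /user
cpus=48-55 mem=6
cpus=55-63 mem=7
cpus=64-71 mem=8
cpus=72-79 mem=9
"""

if __name__ == "__main__":
    print(overlapping_cpus(parse_layout(QUOTED_LAYOUT)))
```

The check is deliberately strict: any line it does not recognise raises an error rather than being silently skipped, so a layout using syntax not shown in this thread would need the pattern extended first.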
Gareth > -- > Troy Baer, HPC System Administrator > National Institute for Computational Sciences, University of Tennessee > http://www.nics.tennessee.edu/ > Phone: 865-241-4233 > > From Gareth.Williams at csiro.au Sun Jan 8 22:16:39 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Mon, 9 Jan 2012 16:16:39 +1100 Subject: [torqueusers] torque 3.0.3 on uv In-Reply-To: <1325702118.2542.115.camel@browncoat.jics.utk.edu> References: <1325702118.2542.115.camel@browncoat.jics.utk.edu> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102C8732EA4@exvic-mbx04.nexus.csiro.au> Hi All, With David's help, I've found that the problem was triggered by having a hostname that ended with and dash and a digit (like the virtual node name), in combination with the root of this name also being a valid host. I don't suppose anyone else is likely to hit the same problem! Changing the host name avoided the issue (it is a dev system so this was not too nasty). Gareth From cholam20 at yahoo.co.in Sun Jan 8 22:59:19 2012 From: cholam20 at yahoo.co.in (revathi ganesh) Date: Mon, 9 Jan 2012 11:29:19 +0530 (IST) Subject: [torqueusers] I am my own boss Message-ID: <1326088759.62486.androidMobile@web137303.mail.in.yahoo.com>


-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120109/dd44416d/attachment.html

From avladim at stanford.edu Fri Jan 6 17:39:00 2012
From: avladim at stanford.edu (Andrey Vladimirov)
Date: Fri, 6 Jan 2012 16:39:00 -0800
Subject: [torqueusers] qstat: End of file
Message-ID:

I have just run into the same problem: deleted some faulty jobs from /var/spool/torque/server_priv/jobs, and the command "qstat" began to print the message "qstat: end of file", followed by a crash of pbs_server.

In my case, the problem was caused by the fact that deleted jobs belonged to a job array. The solution was to delete the respective array files from /var/spool/torque/server_priv/arrays

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120106/dfb43073/attachment.html

From dbeer at adaptivecomputing.com Mon Jan 9 09:54:54 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 09 Jan 2012 09:54:54 -0700 (MST)
Subject: [torqueusers] qstat: End of file
In-Reply-To:
Message-ID: <3d35547e-f39e-457c-841d-98c33509d91a@mail>

----- Original Message -----
> I have just run into the same problem: deleted some faulty jobs from
> /var/spool/torque/server_priv/jobs,
> and the command "qstat" began to print the message "qstat: end of
> file", followed by a crash of pbs_server.
>
> In my case, the problem was caused by the fact that deleted jobs
> belonged to a job array.
> The solution was to delete the respective array files from
> /var/spool/torque/server_priv/arrays
>

As developers, we can't really guarantee that TORQUE will work correctly if you delete the internal files that TORQUE uses. There are many ways to clean up jobs that are faulty, and deleting files should definitely be considered a last resort option. That being said, I'm surprised to hear that it crashed.
If you have a core file and a version number we can consider adding some extra checks to prevent the crash.

--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606

From jkusznir at gmail.com Mon Jan 9 11:52:09 2012
From: jkusznir at gmail.com (Jim Kusznir)
Date: Mon, 9 Jan 2012 10:52:09 -0800
Subject: [torqueusers] Jobs getting "stuck"
Message-ID:

Hi all:

I have an issue where some of my users are running jobs that get "stuck". In this case, "stuck" means that the job ends, but torque doesn't know. It shows the job as still running. Eventually, the walltime runs out and torque tries to kill the job, but does not remove it from the job list. It does send an e-mail to the owner notifying them that the job has exceeded walltime. Then a few minutes later, it e-mails and tries again to kill. This continues until I use the qdel -p command on it.

One user seems to have it happen to the majority of his jobs; a few others have had it happen to theirs. I haven't found a pattern yet; some jobs are spawned through OpenMPI (which has torque integration); others are non-MPI jobs (multi-threaded single-process or even just single-process jobs). About 80% of the jobs do end correctly, but there's that rather large percentage that I still have to purge by hand.

What causes this? What can I / my users do to fix this?

Thanks!
--Jim

From sergey_bulk at list.ru Tue Jan 10 07:24:41 2012
From: sergey_bulk at list.ru (Sergey Bulk)
Date: Tue, 10 Jan 2012 18:24:41 +0400
Subject: [torqueusers] memory limit -l mem is not working
In-Reply-To: <242421BFAF465844BE24EB90BB97E221017F05AC@ITSDAG1D.its.iastate.edu>
References: <242421BFAF465844BE24EB90BB97E221017F05AC@ITSDAG1D.its.iastate.edu>
Message-ID:

James, thank you for your answer.

Unfortunately, it seems that neither pmem nor vmem works either.
For example, if I run 4 jobs requesting 6 CPUs and 48 GB each on a 48 GB, 24-core node

#PBS -l pmem=8gb,nodes=node32:ppn=6,vmem=48gb

they all run simultaneously.

30 December 2011, 22:14 from "Coyle, James J [ITACD]":
> Sergey,
>
> There are two options:
>
> 1) For each queue, set reasonably low defaults for pmem and vmem,
> e.g. for nodes which have 512gb and 32 processor cores, set to
> 512gb/32=16gb
> qmgr -c 'set queue large resources_default.pmem = 16gb'
> qmgr -c 'set queue large resources_default.vmem = 16gb'
>
> This will force users to specify pmem= and vmem=
> if they want more than this, otherwise they just get
> 16gb for both.
>
> 2) Write a submit filter which scans for mem= (and maybe vmem=,
> pmem=) and nodes=N:ppn=M.
> Then you can alter the job submitted.
> E.g. on a 512GB node with 32 processors (i.e. 16gb per processor),
> the submit filter could calculate ceiling(mem/mem_per_processor)
> = ceiling(400gb/16gb)
> = 25
> Then use that value to change the ppn= to (max(8,25))
> in the job request. This just reserves as many processors as
> needed with each getting their share of the memory, unless
> they already have more processors reserved.
>
> - Jim C.
>
> >-----Original Message-----
> >From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Sergey Bulk
> >Sent: Thursday, December 29, 2011 6:19 AM
> >To: torqueusers at supercluster.org
> >Subject: [torqueusers] memory limit -l mem is not working
> >
> >I have torque 2.5.7-9.el6 from epel repo on SL6.
> >
> >When requesting resources with
> >
> >#PBS -l mem=400gb,nodes=node01:ppn=8
> >
> >torque does not take mem parameter into account.
> >
> >So users can run 2 jobs requesting 800gb memory in total
> >on a 500gb memory node.
> >
> >How to address this issue?
> >
> >Thank you,
> >SN
> >_______________________________________________
> >torqueusers mailing list
> >torqueusers at supercluster.org
> >http://www.supercluster.org/mailman/listinfo/torqueusers
>

From toth at fi.muni.cz Tue Jan 10 07:36:08 2012
From: toth at fi.muni.cz ("Mgr. Šimon Tóth")
Date: Tue, 10 Jan 2012 15:36:08 +0100
Subject: [torqueusers] memory limit -l mem is not working
In-Reply-To:
References: <242421BFAF465844BE24EB90BB97E221017F05AC@ITSDAG1D.its.iastate.edu>
Message-ID: <4F0C4CD8.4080905@fi.muni.cz>

> Unfortunately, it seems that neither pmem nor vmem
> works either.
>
> For example, if I run 4 jobs requesting 6 CPUs and 48 GB each
> on a 48 GB, 24-core node
>
> #PBS -l pmem=8gb,nodes=node32:ppn=6,vmem=48gb
>
> they all run simultaneously.

Static allocation must be done using a scheduler. The system will only enforce the maximum amount of memory usable per process. And if a process goes over this limit it will be killed.

If you want system-level enforcement of resource allocations, you will have to use a modified version of Torque with support for CGROUPS.

--
Mgr. Simon Toth

From listsarnau at gmail.com Tue Jan 10 09:31:39 2012
From: listsarnau at gmail.com (Arnau Bria)
Date: Tue, 10 Jan 2012 17:31:39 +0100
Subject: [torqueusers] memory limit -l mem is not working
In-Reply-To: <4F0C4CD8.4080905@fi.muni.cz>
References: <242421BFAF465844BE24EB90BB97E221017F05AC@ITSDAG1D.its.iastate.edu> <4F0C4CD8.4080905@fi.muni.cz>
Message-ID: <20120110173139.68ab7126@amarrosa.pic.es>

On Tue, 10 Jan 2012 15:36:08 +0100 Mgr. Šimon Tóth wrote:

Hi,

> Static allocation must be done using a scheduler. The system will
> only enforce the maximum amount of memory usable per process. And if
> a process goes over this limit it will be killed.

Regarding this statement, a few weeks ago I asked something about this:

http://www.supercluster.org/pipermail/torqueusers/2011-December/013892.html

torque was not limiting but reserving.
Is there any extra consideration to take into account when talking about limiting resources? Which version are you running?

Cheers,
Arnau

From samuel at unimelb.edu.au Tue Jan 10 17:21:18 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Wed, 11 Jan 2012 11:21:18 +1100
Subject: [torqueusers] memory limit -l mem is not working
In-Reply-To:
References:
Message-ID: <4F0CD5FE.5040607@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 29/12/11 23:18, Sergey Bulk wrote:

> torque does not take mem parameter into account.

You probably want a more sophisticated scheduler like Maui which can handle those sorts of scheduling decisions.

All the best,
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk8M1f4ACgkQO2KABBYQAh834wCfUx4YVKememYcYjt08j1h9tL9
umgAniQSwoVMw9RKzhHbHeCVSSbZLeoj
=EMIm
-----END PGP SIGNATURE-----

From samuel at unimelb.edu.au Tue Jan 10 17:24:38 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Wed, 11 Jan 2012 11:24:38 +1100
Subject: [torqueusers] memory limit -l mem is not working
In-Reply-To: <4F0C4CD8.4080905@fi.muni.cz>
References: <242421BFAF465844BE24EB90BB97E221017F05AC@ITSDAG1D.its.iastate.edu> <4F0C4CD8.4080905@fi.muni.cz>
Message-ID: <4F0CD6C6.9080006@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/01/12 01:36, "Mgr. Šimon Tóth" wrote:

> If you want system-level enforcement of resource allocations, you will
> have to use a modified version of Torque with support for CGROUPS.

Or the attached patch which sets RLIMIT_AS instead of RLIMIT_DATA for mem and pmem requests.

RLIMIT_DATA is only honoured when malloc() calls brk()/sbrk() for trivial allocations.
For any non-trivial allocation it calls mmap() which ignores RLIMIT_DATA and looks at RLIMIT_AS instead. cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk8M1sYACgkQO2KABBYQAh/GYwCff8ypRhrdK7ozyIhIGQCHNO2s Q/UAmwdTMOvbTGR45vZw2koLXJczs8t3 =9LYU -----END PGP SIGNATURE----- -------------- next part -------------- A non-text attachment was scrubbed... Name: 0007-rlimit-as-not-data.patch Type: text/x-patch Size: 1238 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120111/cc3a03f5/attachment.bin From leggett at mcs.anl.gov Wed Jan 11 09:05:17 2012 From: leggett at mcs.anl.gov (Ti Leggett) Date: Wed, 11 Jan 2012 10:05:17 -0600 Subject: [torqueusers] Torque 2.5.9 MOMs keep segfaulting In-Reply-To: References: Message-ID: I finally got around to doing this, but I don't see a core file in /var/spool/torque or in /usr/sbin. Where would the core get dumped? On Dec 20, 2011, at 3:03 PM, Ken Nielson wrote: > ----- Original Message ----- >> From: "Troy Baer" >> To: "Torque Users Mailing List" >> Sent: Tuesday, December 20, 2011 8:59:56 AM >> Subject: Re: [torqueusers] Torque 2.5.9 MOMs keep segfaulting >> >> On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote: >>> I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then, MOMs >>> keep randomly segfaulting and dying. 
I see this in the MOM log >>> right before dying: >>> >>> 12/08/2011 10:09:14;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file >>> descriptor (9) in tm_request, comm failed Protocol failure in >>> commit >>> >>> >>> And something similar to this in dmesg: >>> >>> pbs_mom[22354]: segfault at 0000000000000008 rip 00002b585249ed6f >>> rsp 00007fff19e96df0 error 4 >> >> We've also seen this on one of our systems and had to fall back to >> 2.5.8 >> on it. >> >> --Troy >> -- >> Troy Baer, HPC System Administrator >> National Institute for Computational Sciences, University of >> Tennessee >> http://www.nics.tennessee.edu/ >> Phone: 865-241-4233 > > Could someone configure TORQUE using --with-debug and then send a stack trace of the crash? > > Ken > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 203 bytes Desc: Message signed with OpenPGP using GPGMail Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120111/e0264439/attachment.bin From dbeer at adaptivecomputing.com Wed Jan 11 09:26:12 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Wed, 11 Jan 2012 09:26:12 -0700 (MST) Subject: [torqueusers] Torque 2.5.9 MOMs keep segfaulting In-Reply-To: Message-ID: <6e9b7d3f-e1d7-41af-a549-1da9d7fa2ff5@mail> ----- Original Message ----- > I finally got around to doing this, but I don't see a core file in > /var/spool/torque or in /usr/sbin. Where would the core get dumped? > A mom's core file would be in /var/spool/torque/mom_priv. You need to make sure ulimit -c is unlimited or set to a very large number. 
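A quick way to check both conditions on a Linux node is to read the limits the running daemon actually has, rather than the limits of the shell that started it. The Python sketch below assumes a Linux /proc filesystem; `parse_core_limit`, `describe_core_pattern` and `report` are illustrative names, not TORQUE tools. It parses the "Max core file size" row of /proc/<pid>/limits and classifies /proc/sys/kernel/core_pattern, which decides whether the kernel writes the core relative to the process's working directory, to an absolute path, or pipes it to a helper program.

```python
from pathlib import Path

CORE_LIMIT_LABEL = "Max core file size"


def parse_core_limit(limits_text):
    """Return (soft, hard) core-size limits from the text of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith(CORE_LIMIT_LABEL):
            # Remaining columns are: soft limit, hard limit, units.
            fields = line[len(CORE_LIMIT_LABEL):].split()
            return fields[0], fields[1]
    raise ValueError("no 'Max core file size' row found")


def describe_core_pattern(pattern):
    """Describe where the kernel sends a core dump for a given core_pattern."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        return "piped to helper program: " + pattern[1:].strip()
    if pattern.startswith("/"):
        return "written to absolute path template: " + pattern
    return "written relative to the process's working directory as: " + pattern


def report(pid="self"):
    """Print the core-dump settings that apply to the given PID (Linux only)."""
    limits = Path(f"/proc/{pid}/limits").read_text()
    pattern = Path("/proc/sys/kernel/core_pattern").read_text()
    soft, hard = parse_core_limit(limits)
    print(f"core size limit: soft={soft} hard={hard}")
    print("core_pattern: " + describe_core_pattern(pattern))
```

Calling `report()` with the PID of a running pbs_mom (found with `pgrep pbs_mom`, for instance) prints both values. A soft limit of 0, or a core_pattern that starts with `|` or `/`, would be one possible reason for a core file not showing up under mom_priv.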
David > On Dec 20, 2011, at 3:03 PM, Ken Nielson wrote: > > > ----- Original Message ----- > >> From: "Troy Baer" > >> To: "Torque Users Mailing List" > >> Sent: Tuesday, December 20, 2011 8:59:56 AM > >> Subject: Re: [torqueusers] Torque 2.5.9 MOMs keep segfaulting > >> > >> On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote: > >>> I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then, > >>> MOMs > >>> keep randomly segfaulting and dying. I see this in the MOM log > >>> right before dying: > >>> > >>> 12/08/2011 10:09:14;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad > >>> file > >>> descriptor (9) in tm_request, comm failed Protocol failure in > >>> commit > >>> > >>> > >>> And something similar to this in dmesg: > >>> > >>> pbs_mom[22354]: segfault at 0000000000000008 rip 00002b585249ed6f > >>> rsp 00007fff19e96df0 error 4 > >> > >> We've also seen this on one of our systems and had to fall back to > >> 2.5.8 > >> on it. > >> > >> --Troy > >> -- > >> Troy Baer, HPC System Administrator > >> National Institute for Computational Sciences, University of > >> Tennessee > >> http://www.nics.tennessee.edu/ > >> Phone: 865-241-4233 > > > > Could someone configure TORQUE using --with-debug and then send a > > stack trace of the crash? 
> > > > Ken > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From leggett at mcs.anl.gov Wed Jan 11 12:05:25 2012 From: leggett at mcs.anl.gov (Ti Leggett) Date: Wed, 11 Jan 2012 13:05:25 -0600 Subject: [torqueusers] Torque 2.5.9 MOMs keep segfaulting In-Reply-To: <6e9b7d3f-e1d7-41af-a549-1da9d7fa2ff5@mail> References: <6e9b7d3f-e1d7-41af-a549-1da9d7fa2ff5@mail> Message-ID: <4D00315C-F7DE-4186-9806-2DE7253D72F6@mcs.anl.gov> torque was configured with --with-debug, "ulimit -c unlimited" is in the init script right before the moms are started like "/usr/sbin/pbs_mom -p -d /var/spool/torque" but I'm still not seeing a core file anywhere. On Jan 11, 2012, at 10:26 AM, David Beer wrote: > > > ----- Original Message ----- >> I finally got around to doing this, but I don't see a core file in >> /var/spool/torque or in /usr/sbin. Where would the core get dumped? >> > > A mom's core file would be in /var/spool/torque/mom_priv. You need to make sure ulimit -c is unlimited or set to a very large number. > > David > >> On Dec 20, 2011, at 3:03 PM, Ken Nielson wrote: >> >>> ----- Original Message ----- >>>> From: "Troy Baer" >>>> To: "Torque Users Mailing List" >>>> Sent: Tuesday, December 20, 2011 8:59:56 AM >>>> Subject: Re: [torqueusers] Torque 2.5.9 MOMs keep segfaulting >>>> >>>> On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote: >>>>> I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then, >>>>> MOMs >>>>> keep randomly segfaulting and dying. 
I see this in the MOM log >>>>> right before dying: >>>>> >>>>> 12/08/2011 10:09:14;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad >>>>> file >>>>> descriptor (9) in tm_request, comm failed Protocol failure in >>>>> commit >>>>> >>>>> >>>>> And something similar to this in dmesg: >>>>> >>>>> pbs_mom[22354]: segfault at 0000000000000008 rip 00002b585249ed6f >>>>> rsp 00007fff19e96df0 error 4 >>>> >>>> We've also seen this on one of our systems and had to fall back to >>>> 2.5.8 >>>> on it. >>>> >>>> --Troy >>>> -- >>>> Troy Baer, HPC System Administrator >>>> National Institute for Computational Sciences, University of >>>> Tennessee >>>> http://www.nics.tennessee.edu/ >>>> Phone: 865-241-4233 >>> >>> Could someone configure TORQUE using --with-debug and then send a >>> stack trace of the crash? >>> >>> Ken >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > -- > David Beer > Direct Line: 801-717-3386 | Fax: 801-717-3738 > Adaptive Computing > 1712 S East Bay Blvd, Suite 300 > Provo, UT 84606 > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 203 bytes Desc: Message signed with OpenPGP using GPGMail Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120111/c5272f95/attachment.bin From dbeer at adaptivecomputing.com Wed Jan 11 13:52:14 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Wed, 11 Jan 2012 13:52:14 -0700 (MST) Subject: [torqueusers] Torque 2.5.9 MOMs keep segfaulting In-Reply-To: <4D00315C-F7DE-4186-9806-2DE7253D72F6@mcs.anl.gov> Message-ID: <1ebb959b-2dde-4ef0-9f8e-089d2ffb5d29@mail> Do they segfault right away? If you can't find a core file, would it be possible to run the mom in gdb and get a backtrace of the crash when it happens? David ----- Original Message ----- > torque was configured with --with-debug, "ulimit -c unlimited" is in > the init script right before the moms are started like > "/usr/sbin/pbs_mom -p -d /var/spool/torque" but I'm still not seeing > a core file anywhere. > > On Jan 11, 2012, at 10:26 AM, David Beer wrote: > > > > > > > ----- Original Message ----- > >> I finally got around to doing this, but I don't see a core file in > >> /var/spool/torque or in /usr/sbin. Where would the core get > >> dumped? > >> > > > > A mom's core file would be in /var/spool/torque/mom_priv. You need > > to make sure ulimit -c is unlimited or set to a very large number. > > > > David > > > >> On Dec 20, 2011, at 3:03 PM, Ken Nielson wrote: > >> > >>> ----- Original Message ----- > >>>> From: "Troy Baer" > >>>> To: "Torque Users Mailing List" > >>>> Sent: Tuesday, December 20, 2011 8:59:56 AM > >>>> Subject: Re: [torqueusers] Torque 2.5.9 MOMs keep segfaulting > >>>> > >>>> On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote: > >>>>> I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then, > >>>>> MOMs > >>>>> keep randomly segfaulting and dying. 
I see this in the MOM log > >>>>> right before dying: > >>>>> > >>>>> 12/08/2011 10:09:14;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad > >>>>> file > >>>>> descriptor (9) in tm_request, comm failed Protocol failure in > >>>>> commit > >>>>> > >>>>> > >>>>> And something similar to this in dmesg: > >>>>> > >>>>> pbs_mom[22354]: segfault at 0000000000000008 rip > >>>>> 00002b585249ed6f > >>>>> rsp 00007fff19e96df0 error 4 > >>>> > >>>> We've also seen this on one of our systems and had to fall back > >>>> to > >>>> 2.5.8 > >>>> on it. > >>>> > >>>> --Troy > >>>> -- > >>>> Troy Baer, HPC System Administrator > >>>> National Institute for Computational Sciences, University of > >>>> Tennessee > >>>> http://www.nics.tennessee.edu/ > >>>> Phone: 865-241-4233 > >>> > >>> Could someone configure TORQUE using --with-debug and then send a > >>> stack trace of the crash? > >>> > >>> Ken > >>> _______________________________________________ > >>> torqueusers mailing list > >>> torqueusers at supercluster.org > >>> http://www.supercluster.org/mailman/listinfo/torqueusers > >> > >> > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > >> > > > > -- > > David Beer > > Direct Line: 801-717-3386 | Fax: 801-717-3738 > > Adaptive Computing > > 1712 S East Bay Blvd, Suite 300 > > Provo, UT 84606 > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From cholam20 at yahoo.co.in Thu Jan 12 06:11:06 2012 From: cholam20 at yahoo.co.in (revathi ganesh) Date: Thu, 12 Jan 2012 18:41:06 +0530 (IST) Subject: [torqueusers] I am my own boss... Message-ID: <1326373866.66322.androidMobile@web137302.mail.in.yahoo.com>


From dbeer at adaptivecomputing.com Thu Jan 12 15:24:15 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Thu, 12 Jan 2012 15:24:15 -0700 (MST)
Subject: [torqueusers] Deleting Old Snapshots
In-Reply-To:
Message-ID: <2d46b826-5097-4e0f-8385-d9190a1471c6@mail>

All,

One of the IT guys here asked me about deleting some of the old snapshots in the snapshots directory (just as a maintenance task). Would this be ok with the community? In efforts to provide better support, we definitely want to push people to use actual releases and not snapshots, and it also seems unnecessary to keep snapshots for releases that have been released already.

Does anyone have issues with deleting some of these older snapshots? I notice that we have snapshots that are about 4.5 years old.

--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606

From dbeer at adaptivecomputing.com Thu Jan 12 15:27:37 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Thu, 12 Jan 2012 15:27:37 -0700 (MST)
Subject: [torqueusers] Release Candidates for 2.5.10 and 3.0.4
In-Reply-To: <56c6aadb-ac76-4c42-a1e0-b0bc990e5efa@mail>
Message-ID: <68152ea9-10fd-4297-965a-165fef79c3ae@mail>

All,

We have two release candidate snapshots that can be downloaded here:

2.5.10
http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-2.5.10-snap.201201121434.tar.gz

3.0.4
http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-3.0.4-snap.201201121518.tar.gz

--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606

From jascha.wang at gmail.com Fri Jan 13 02:10:56 2012
From: jascha.wang at gmail.com (Xiangqian Wang)
Date: Fri, 13 Jan 2012 17:10:56 +0800
Subject: [torqueusers] only one processor is used when using qsub -l procs flag
Message-ID:

My demo torque+maui cluster has one node with np=4 set for it. I want to submit a job requesting 3 processors, but when it starts to run, I see only one processor is used (qstat shows "exec_host = snode02/0").

I use torque 2.5.6 and maui 3.3.1. If anyone can help me out, it'll be greatly appreciated.

The submit script is something like:

#!/bin/sh
#PBS -N procsjob
##PBS -l procs=3
#PBS -q batch

The output of checkjob is:

checking job 33
State: Running
Creds: user:wangxq group:wangxq class:batch qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Fri Jan 13 17:07:43
(Time Queued Total: 00:00:01 Eligible: 00:00:01)
StartTime: Fri Jan 13 17:07:44
Total Tasks: 1
Req[0] TaskCount: 1 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
Utilized Resources Per Task: [NONE]
Avg Util Resources Per Task: [NONE]
Max Util Resources Per Task: [NONE]
NodeAccess: SHARED
NodeCount: 0
Allocated Nodes:
[snode02:1]
Task Distribution: snode02

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE
Reservation '33' (00:00:00 -> 1:00:00 Duration: 1:00:00)
PE: 1.00 StartPriority: 1

From ngsbioinformatics at gmail.com Fri Jan 13 08:59:25 2012
From: ngsbioinformatics at gmail.com (Ryan Golhar)
Date: Fri, 13 Jan 2012 10:59:25 -0500
Subject: [torqueusers] Do I have to define the ncpus for a compute node?
Message-ID:

Hi - I have a ROCKS cluster running and installed Torque. I'm able to submit 1 core, 1 cpu jobs without problem. I tried submitting a job that requested 4 cpus on 1 node using

#PBS -l nodes=1:ppn=4

in my job submission script.
When I submit the job however, I get the error:

qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes (nodes file is empty or requested nodes exceed all systems)

If I run anodes, I see:

compute-0-0
     state = free
     np = 8
     ntype = cluster
     status = rectime=1326469800,varattr=,jobs=,state=free,netload=1720539412488,gres=,loadave=0.01,ncpus=8,physmem=16431248kb,availmem=17311704kb,totmem=17451364kb,idletime=339141,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux compute-0-0.local 2.6.18-238.19.1.el5 #1 SMP Fri Jul 15 07:31:24 EDT 2011 x86_64,opsys=linux
     gpus = 0

All my compute nodes have 8 cpus. Do I need to tell Torque this? I thought Torque could figure this out from np=8 or ncpus=8.

Ryan

From akohlmey at cmm.chem.upenn.edu Fri Jan 13 09:18:27 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Fri, 13 Jan 2012 11:18:27 -0500
Subject: [torqueusers] Do I have to define the ncpus for a compute node?
In-Reply-To:
References:
Message-ID:

On Fri, Jan 13, 2012 at 10:59 AM, Ryan Golhar wrote:
> [...]
> qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes
> (nodes file is empty or requested nodes exceed all systems)
> [...]
> All my compute nodes have 8 cpus. Do I need to tell Torque this? I thought
> Torque could figure this out from np=8 or ncpus=8.

the error message says that the request exceeds the queue configuration. that is being checked before it looks at any nodes. thus you probably have to adjust the queue configuration.

axel.

--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

From gus at ldeo.columbia.edu Fri Jan 13 09:26:30 2012
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Fri, 13 Jan 2012 11:26:30 -0500
Subject: [torqueusers] only one processor is used when using qsub -l procs flag
In-Reply-To:
References:
Message-ID: <2480F82F-48E3-432B-A291-01C0E8DC99F8@ldeo.columbia.edu>

Hi Xiangqian

Is it a typo in your email or did you comment out this line in your Torque/PBS script? [Note the double hash ##.]

> ##PBS -l procs=3

Have you tried this form instead?

#PBS -l nodes=1:ppn=3

For more details check 'man qsub' and 'man pbs_resources'.

I hope it helps,
Gus Correa

On Jan 13, 2012, at 4:10 AM, Xiangqian Wang wrote:
> [...]

From ngsbioinformatics at gmail.com Fri Jan 13 10:30:57 2012
From: ngsbioinformatics at gmail.com (Ryan Golhar)
Date: Fri, 13 Jan 2012 12:30:57 -0500
Subject: [torqueusers] Do I have to define the ncpus for a compute node?
In-Reply-To:
References:
Message-ID:

So that's what's throwing me off.
I already configured the queue using:

[root at bic database]# qmgr -c 'create queue batch'
[root at bic database]# qmgr -c 'set queue batch queue_type = execution'
[root at bic database]# qmgr -c 'set queue batch started = true'
[root at bic database]# qmgr -c 'set queue batch enabled = true'
[root at bic database]# qmgr -c 'set queue batch resources_default.nodes=1:ppn=1'
[root at bic database]# qmgr -c "set queue batch keep_completed=120"
[root at bic database]# qmgr -c "set server default_queue=batch"
[root at bic database]# qmgr -c "set server query_other_jobs = true"

I assumed, by default, if the user doesn't specify any resources, a job would consume 1 core on 1 node.

My nodes file shows:

[root at bic hg19]# cat /var/spool/torque/server_priv/nodes
compute-0-0 np=8
compute-0-1 np=8
compute-0-2 np=8

So Torque knows there are 8 cpus per node, and I haven't set a maximum limit to how many resources a job could use. To me, requesting 2 cpus on 1 node should have succeeded.

On Fri, Jan 13, 2012 at 11:18 AM, Axel Kohlmeyer < akohlmey at cmm.chem.upenn.edu> wrote:
> [...]
> the error message says that the request exceeds the queue configuration.
> that is being checked before it looks at any nodes. thus you probably have
> to adjust the queue configuration.
>
> axel.

From dbeer at adaptivecomputing.com Fri Jan 13 11:23:39 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 13 Jan 2012 11:23:39 -0700 (MST)
Subject: [torqueusers] Release Candidates for 2.5.10 and 3.0.4
In-Reply-To: <68152ea9-10fd-4297-965a-165fef79c3ae@mail>
Message-ID: <2ed29f8d-83b3-40be-bdb3-6f39e51a52a9@mail>

All,

We found an additional bug that was added for munge authentication: there was a race condition between reading output of the unmunge command and receiving a SIGCHLD. We block the SIGCHLD during reading to avoid this problem.
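For readers unfamiliar with the technique, here is a minimal sketch of blocking SIGCHLD while reading a child's output. It is written in Python rather than TORQUE's C and is not taken from the TORQUE source; the helper name read_child_output and the example command are hypothetical. A C implementation would presumably do the same thing with sigprocmask().

```python
import signal
import subprocess
import sys


def read_child_output(argv):
    """Run argv and return its stdout, keeping SIGCHLD blocked during the read.

    Hypothetical helper for illustration only; not TORQUE code.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    # Block SIGCHLD in the calling thread so that the child's exit cannot
    # interrupt the read below. The signal stays pending until unblocked.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        data = proc.stdout.read()  # read until the child closes its stdout
    finally:
        # Restore the previous mask; a pending SIGCHLD is delivered now.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        proc.stdout.close()
    proc.wait()  # reap the child
    return data


if __name__ == "__main__":
    # Stand-in command; the real case in this thread is the unmunge command.
    print(read_child_output([sys.executable, "-c", "print('ok')"]))
```

The mask is restored in a finally clause so that SIGCHLD handling returns to normal even if the read fails.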
New snapshots with the fix can be found here:

2.5.10
http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-2.5.10-snap.201201131119.tar.gz

3.0.4
http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-3.0.4-snap.201201131120.tar.gz

Cheers,

----- Original Message -----
> [...]

--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606

From Z.Rashid at uu.nl Fri Jan 13 14:43:21 2012
From: Z.Rashid at uu.nl (Rashid, Z. (Zahid))
Date: Fri, 13 Jan 2012 21:43:21 +0000
Subject: [torqueusers] TORQUE on Mac OS X 10.7.2
Message-ID:

Dear All,

I am trying to compile TORQUE on Mac Book Pro (Intel Core i7, NCores = 4) with OS X 10.7.2, Xcode 4.2, and gcc 4.2.1 [Target: i686-apple-darwin11
Configured with: /private/var/tmp/llvmgcc42/llvmgcc42-2336.1~1/src/configure --disable-checking --enable-werror --prefix=/Developer/usr/llvm-gcc-4.2 --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-prefix=llvm- --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin11 --enable-llvm=/private/var/tmp/llvmgcc42/llvmgcc42-2336.1~1/dst-llvmCore/Developer/usr/local --program-prefix=i686-apple-darwin11- --host=x86_64-apple-darwin11 --target=i686-apple-darwin11 --with-gxx-include-dir=/usr/include/c++/4.2.1
Thread model: posix
gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.1.00)] installed.

I want to compile TORQUE on this machine.

Torque-3.0.3 configured with

only "configure" gives the following error when I do "make"

gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -I../../../src/lib/Libdis -DIFF_PATH=\"/usr/local/torque/sbin/pbs_iff\" -DPBS_DEFAULT_FILE=\"/var/spool/torque/server_name\" -DPBS_SERVER_HOME=\"/var/spool/torque\" -g -O2 -MT pbsD_connect.lo -MD -MP -MF .deps/pbsD_connect.Tpo -c ../Libifl/pbsD_connect.c -fno-common -DPIC -o .libs/pbsD_connect.o
../Libifl/pbsD_connect.c: In function ‘send_unix_creds’:
../Libifl/pbsD_connect.c:688: error: ‘struct ucred’ has no member named ‘cr_uid’
../Libifl/pbsD_connect.c:689: error: ‘struct ucred’ has no member named ‘cr_groups’
make[3]: *** [pbsD_connect.lo] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

With

configure --disable-unixsockets --disable-gcc-warnings

or

configure --with-default-server=name --with-server-home=/var/spool/pbs --with-rcp=scp --disable-unixsockets --disable-gcc-warnings

or

configure --with-default-server=name --with-server-home=/var/spool/pbs --with-rcp=scp --host=x86_64-apple-darwin11 --build=x86_64-apple-darwin11 --target=x86_64-apple-darwin11 --disable-unixsockets --disable-gcc-warnings

always gives the following error when I run "make" command.

if gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -g -O2 -MT u_dynamic_string.o -MD -MP -MF ".deps/u_dynamic_string.Tpo" -c -o u_dynamic_string.o u_dynamic_string.c; \
then mv -f ".deps/u_dynamic_string.Tpo" ".deps/u_dynamic_string.Po"; else rm -f ".deps/u_dynamic_string.Tpo"; exit 1; fi
u_threadpool.c: In function ‘work_thread’:
u_threadpool.c:246: error: ‘CLOCK_REALTIME’ undeclared (first use in this function)
u_threadpool.c:246: error: (Each undeclared identifier is reported only once
u_threadpool.c:246: error: for each function it appears in.)
make[3]: *** [u_threadpool.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

In "configure" it does not show any error or warning message.

I also tried the older versions of TORQUE
TORQUE-3.0.0 version gives error

if gcc -DHAVE_CONFIG_H -I. -I. -I../../src/include -I../../src/include -DPBS_SERVER_HOME=\"/var/spool/pbs\" -DPBS_ENVIRON=\"/var/spool/pbs/pbs_environment\" -g -O2 -MT req_runjob.o -MD -MP -MF ".deps/req_runjob.Tpo" -c -o req_runjob.o req_runjob.c; \
then mv -f ".deps/req_runjob.Tpo" ".deps/req_runjob.Po"; else rm -f ".deps/req_runjob.Tpo"; exit 1; fi
req_runjob.c: In function ‘post_sendmom’:
req_runjob.c:1135: error: ‘ulong’ undeclared (first use in this function)
req_runjob.c:1135: error: (Each undeclared identifier is reported only once
req_runjob.c:1135: error: for each function it appears in.)
req_runjob.c:1135: error: expected ‘;’ before ‘addr’
req_runjob.c:1266: error: ‘addr’ undeclared (first use in this function)
make[2]: *** [req_runjob.o] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

while TORQUE-2.5.0 or earlier versions give

if gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -DPBS_MOM -DDEMUX=\"/usr/local/torque/sbin/pbs_demux\" -g -O2 -MT mom_mach.o -MD -MP -MF ".deps/mom_mach.Tpo" -c -o mom_mach.o mom_mach.c; \
then mv -f ".deps/mom_mach.Tpo" ".deps/mom_mach.Po"; else rm -f ".deps/mom_mach.Tpo"; exit 1; fi
mom_mach.c:130:27: error: ufs/ufs/quota.h: No such file or directory
mom_mach.c: In function ‘quota’:
mom_mach.c:3002: error: storage size of ‘qi’ isn’t known
mom_mach.c:3166: error: ‘Q_GETQUOTA’ undeclared (first use in this function)
mom_mach.c:3166: error: (Each undeclared identifier is reported only once
mom_mach.c:3166: error: for each function it appears in.)
make[3]: *** [mom_mach.o] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

I also tried the Intel icc instead of gcc in all cases, but these do not seem to be compiler-dependent errors.
With version 3.0.3, perhaps "clock_gettime(CLOCK_REALTIME, &ts);" is not available in Mac OS X, but how can I get around this? Any help with any of the above versions of TORQUE would be appreciated.

Regards.

Zahid

From glen.beane at gmail.com Fri Jan 13 15:03:24 2012
From: glen.beane at gmail.com (glen.beane at gmail.com)
Date: Fri, 13 Jan 2012 17:03:24 -0500
Subject: [torqueusers] TORQUE on Mac OS X 10.7.2
In-Reply-To:
References:
Message-ID: <34FEBE03-F358-48F4-A739-52D14E679354@gmail.com>

I know earlier versions of 2.5 will work -- I used to do my torque development on OSX but haven't done any in over a year

Sent from my iPhone

On Jan 13, 2012, at 4:43 PM, "Rashid, Z. (Zahid)" wrote:
> [...]

From jonb at lanl.gov Fri Jan 13 15:02:19 2012
From: jonb at lanl.gov (Jon Bringhurst)
Date: Fri, 13 Jan 2012 15:02:19 -0700
Subject: [torqueusers] TORQUE on Mac OS X 10.7.2
In-Reply-To:
References:
Message-ID:

Instead of clock_gettime(...) you need to use mach_absolute_time(...).

http://developer.apple.com/library/mac/#qa/qa1398/_index.html

Example patch that was used for mysqld:

http://lists.mysql.com/commits/70966

-Jon

On Jan 13, 2012, at 2:43 PM, Rashid, Z. (Zahid) wrote:
> [...]

From jcducom at gmail.com Fri Jan 13 16:03:23 2012
From: jcducom at gmail.com (Jean-Christophe Ducom)
Date: Fri, 13 Jan 2012 15:03:23 -0800
Subject: [torqueusers] Missing resource_used info in /var/spool/torque/server_priv/accounting
Message-ID: <4F10B83B.4080004@gmail.com>

Hi-

Our HPC cluster is running Torque+Maui. The RPMs available for Opensuse11.3 have been used:
----------------------------------
# rpm -qa | grep torque
torque-server-2.5.9-1.1.x86_64
torque-client-2.5.9-1.1.x86_64
torque-2.5.9-1.1.x86_64
torque-pam-2.5.9-1.1.x86_64
torque-gui-2.5.9-1.1.x86_64
libtorque2-2.5.9-1.1.x86_64
torque-mom-2.5.9-1.1.x86_64
torque-devel-2.5.9-1.1.x86_64
torque-scheduler-2.5.9-1.1.x86_64

# rpm -qa | grep maui
maui-client-3.3-21.3.x86_64
maui-3.3-21.3.x86_64
maui-devel-3.3-21.3.x86_64
-----------------------------------
While qstat -f job_id correctly returns info regarding a job, e.g.
----------------------------------
# qstat -f 2193248
Job Id: 2193248.garibaldi01-adm.cluster.net
    Job_Name = test
    Job_Owner = xxx
    resources_used.cput = 05:15:42
    resources_used.mem = 62968kb
    resources_used.vmem = 266224kb
    resources_used.walltime = 04:08:52
    job_state = R
    queue = workq
    server = garibaldi01-adm.cluster.net
    Checkpoint = u
    ctime = Fri Jan 13 10:35:38 2012
[...]
-----------------------------------
The /var/spool/torque/server_priv/accounting files are missing all the resources_used.{cput,mem,vmem,walltime} fields:
-----------------------------------
#tail /var/spool/torque/server_priv/accounting/20120113
[...]
1/13/2012 14:52:36;S;2292144.garibaldi01-adm.cluster.net;user=sgadvise group=its jobname=SET_169644.job queue=workq ctime=1326494912 qtime=1326494912 etime=1326494912 start=1326495156 owner=sgadvise at node0675.cluster.net exec_host=node0970/6 Resource_List.cput=120:00:00 Resource_List.mem=8gb Resource_List.ncpus=1 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=120:00:00
01/13/2012 14:52:36;Q;2294097.garibaldi01-adm.cluster.net;queue=workq
01/13/2012 14:52:36;E;2292033.garibaldi01-adm.cluster.net;user=sgadvise group=its jobname=SET_169612.job queue=workq ctime=1326494899 qtime=1326494899 etime=1326494899 start=1326495134 owner=sgadvise at node0675.cluster.net exec_host=node0670/1 Resource_List.cput=120:00:00 Resource_List.mem=8gb Resource_List.ncpus=1 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=120:00:00 session=5214 end=1326495156 Exit_status=0
01/13/2012 14:52:36;Q;2294098.garibaldi01-adm.cluster.net;queue=workq
01/13/2012 14:52:36;E;2292123.garibaldi01-adm.cluster.net;user=sgadvise group=its jobname=SET_169638.job queue=workq ctime=1326494910 qtime=1326494910 etime=1326494910 start=1326495153 owner=sgadvise at node0675.cluster.net exec_host=node0948/2 Resource_List.cput=120:00:00 Resource_List.mem=8gb Resource_List.ncpus=1 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=120:00:00 session=6168 end=1326495156 Exit_status=0
[...]
-----------------------------------------

I suspect that something in our configuration files is missing but I can not pinpoint it.
Thank you in advance for any help

Best,

JC

-----------
Jean-Christophe Ducom, PhD
The Scripps Research Institute
10550 N. Torrey Pines Rd
La Jolla, CA 92037

Configuration files (maui.cfg, mom config, pbs server)
-------------------------------
# cat /var/spool/maui/maui.cfg
# maui.cfg 3.3.1

SERVERHOST garibaldi01.scripps.edu
# primary admin must be first in list
ADMIN1 root

# Resource Manager Definition
RMTYPE[0] PBS

# Allocation Manager Definition
AMCFG[bank] TYPE=NONE
## default is 10
#AMCFG[bank] TIMEOUT=30

# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL 00:00:10
## specifies the number of scheduling iterations between scheduler initiated node manager queries
NODEPOLLFREQUENCY 8

SERVERPORT 40559
SERVERMODE NORMAL

USERCFG[DEFAULT] MAXJOB=800
USERCFG[DEFAULT] MAXPROC=800
USERCFG[DEFAULT] MAXNODE=400
USERCFG[DEFAULT] MAXMEM=3000

## specifies number of time a job will be allowed to fail in its
## start attempts before being deferred.
DEFERSTARTCOUNT 2

## specifies whether or not the scheduler will allow jobs to span more than one node
ENABLEMULTINODEJOBS TRUE

## specifies whether or not the scheduler will allow jobs to specify
## multiple independent resource requests
## (i.e., pbs jobs with resource specifications such as '-l nodes=3:fast+1:io')
ENABLEMULTIREQJOBS TRUE

## amount of time Maui will allow a job to exceed its wallclock limit
## before it is terminated
JOBMAXOVERRUN 2:00:00

# Admin: http://supercluster.org/mauidocs/a.esecurity.html
LOGFILE /var/spool/maui/maui.log
LOGFILEMAXSIZE 100000000
LOGLEVEL 1
LOGFILEROLLDEPTH 10

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html
QUEUETIMEWEIGHT 1

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
#FSPOLICY PSDEDICATED
#FSDEPTH 7
#FSINTERVAL 86400
#FSDECAY 0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html
# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html
BACKFILLPOLICY BESTFIT
# Dec 13, 2011
BACKFILLMETRIC PROCS
#RESERVATIONPOLICY CURRENTHIGHEST
#RESERVATIONDEPTH 2
RESERVATIONPOLICY NEVER

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html
# NODEALLOCATIONPOLICY MINRESOURCE
#NODEALLOCATIONPOLICY PRIORITY
#NODECFG[DEFAULT] PRIORITYF='CPROCS + AMEM - 10 * JOBCOUNT'
NODEALLOCATIONPOLICY MINRESOURCE
#NODEAVAILABILITYPOLICY DEDICATED:PROCS COMBINED:MEM

# QOS: http://supercluster.org/mauidocs/7.3qos.html
# QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html
# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test] 17:00:00
# SRDAYS[test] MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test] 0:30:00

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
# USERCFG[DEFAULT] FSTARGET=25.0
# USERCFG[john] PRIORITY=100 FSTARGET=10.0-
# GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch] FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR

**********************************************
# cat /var/spool/torque/mom_priv/config
$pbsserver garibaldi01-adm
$log_keep_days 30
$prologalarm 90
$igncput true
$rcpcmd /usr/bin/rcp
$spool_as_final_name true
**********************************************
#qmgr ' p s'
Max open servers: 10239
Qmgr: p s
#
# Create queues and set their attributes.
#
#
# Create and define queue workq
#
create queue workq
set queue workq queue_type = Execution
set queue workq resources_max.cput = 200000:00:00
set queue workq resources_max.mem = 3000gb
set queue workq resources_max.ncpus = 800
set queue workq resources_max.nodect = 200
set queue workq resources_max.nodes = 90:ppn=8
set queue workq resources_max.walltime = 400:00:00
set queue workq resources_default.cput = 01:00:00
set queue workq resources_default.mem = 2gb
set queue workq resources_default.ncpus = 1
set queue workq resources_default.nodect = 1
set queue workq resources_default.nodes = 1
set queue workq resources_default.walltime = 01:00:00
set queue workq max_user_run = 800
set queue workq keep_completed = 0
set queue workq enabled = True
set queue workq started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = garibaldi01-adm
set server default_queue = workq
set server log_events = 511
set server mail_from = hpc_ca
set server query_other_jobs = True
set server scheduler_iteration = 300
set server node_check_rate = 150
set server tcp_timeout = 6
set server mail_domain = scripps.edu
set server allow_node_submit = True
set server auto_node_np = True
set server next_job_number = 2296229
set server record_job_info = False
set server job_log_keep_days = 30

From Gareth.Williams at csiro.au Sat Jan 14 03:08:25 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Sat, 14 Jan 2012 21:08:25 +1100
Subject: [torqueusers] Do I have to define the ncpus for a compute node?
In-Reply-To:
References:
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102C8732ED6@exvic-mbx04.nexus.csiro.au>

Hi Ryan,

Unset queue batch resources_default.nodes - you don't need that.

The nodes resource is fighting with the procs resource. You need to only set one or the other for a given job (neither is OK for serial tasks).
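
The "one or the other" rule above can be sketched as a small Python check. This is only an illustration, assuming the request is available as the comma-separated string given to qsub -l; the function names are hypothetical and are not part of Torque or pbs_python.

```python
def parse_resource_list(spec):
    """Split a qsub-style '-l' string into a dict of resource name -> raw value.

    Example input: 'nodes=1:ppn=4,walltime=01:00:00'
    """
    resources = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first '=' only, so 'nodes=1:ppn=4' keeps '1:ppn=4' intact.
        name, _, value = item.partition("=")
        resources[name.strip()] = value.strip()
    return resources


def mixes_nodes_and_procs(spec):
    """Return True when a request sets both 'nodes' and 'procs'."""
    resources = parse_resource_list(spec)
    return "nodes" in resources and "procs" in resources


if __name__ == "__main__":
    # A queue default of nodes=1:ppn=1 combined with a job asking for procs=4
    # (hypothetical merged request, for illustration only).
    print(mixes_nodes_and_procs("nodes=1:ppn=1,procs=4"))
    print(mixes_nodes_and_procs("procs=4"))
```

A check like this could be run on the combined queue default and job request before submitting; whether the server itself merges the two in this way is an assumption here, not something the thread confirms.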

Gareth

From: Ryan Golhar [mailto:ngsbioinformatics at gmail.com]
Sent: Saturday, 14 January 2012 4:31 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Do I have to define the ncpus for a compute node?

So that's what's throwing me off. I already configured the queue using:

[root at bic database]# qmgr -c 'create queue batch'
[root at bic database]# qmgr -c 'set queue batch queue_type = execution'
[root at bic database]# qmgr -c 'set queue batch started = true'
[root at bic database]# qmgr -c 'set queue batch enabled = true'
[root at bic database]# qmgr -c 'set queue batch resources_default.nodes=1:ppn=1'
[root at bic database]# qmgr -c "set queue batch keep_completed=120"
[root at bic database]# qmgr -c "set server default_queue=batch"
[root at bic database]# qmgr -c "set server query_other_jobs = true"

I assumed, by default, if the user doesn't specify any resources, a job would consume 1 core on 1 node. My nodes file shows:

[root at bic hg19]# cat /var/spool/torque/server_priv/nodes
compute-0-0 np=8
compute-0-1 np=8
compute-0-2 np=8

So Torque knows there are 8 cpus per node, and I haven't set a maximum limit to how many resources a job could use. To me, requesting 2 cpus on 1 node should have succeeded.

On Fri, Jan 13, 2012 at 11:18 AM, Axel Kohlmeyer wrote:
On Fri, Jan 13, 2012 at 10:59 AM, Ryan Golhar wrote:
> Hi - I have a ROCKS cluster running and installed Torque. I'm able to
> submit 1 core, 1 cpu jobs without problem. I tried submitting a job that
> requested 4 cpus on 1 node using
>
> #PBS -l nodes=1:ppn=4
>
> in my job submission script. When I submit the job however, I get the
> error:
>
> qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes
> (nodes file is empty or requested nodes exceed all systems)
>
> If I run anodes, I see:
>
> compute-0-0
>     state = free
>     np = 8
>     ntype = cluster
>     status = rectime=1326469800,varattr=,jobs=,state=free,netload=1720539412488,gres=,loadave=0.01,ncpus=8,physmem=16431248kb,availmem=17311704kb,totmem=17451364kb,idletime=339141,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux compute-0-0.local 2.6.18-238.19.1.el5 #1 SMP Fri Jul 15 07:31:24 EDT 2011 x86_64,opsys=linux
>     gpus = 0
>
>
> All my compute nodes have 8 cpus. Do I need to tell Torque this? I thought
> Torque could figure this out from np=8 or ncpus=8.

the error message says that the request exceeds the queue configuration.
that is being checked before it looks at any nodes. thus you probably have
to adjust the queue configuration.

axel.

>
> Ryan
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
Dr. Axel Kohlmeyer akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120114/8bd4c452/attachment.html

From ngsbioinformatics at gmail.com Sat Jan 14 06:48:18 2012
From: ngsbioinformatics at gmail.com (Ryan Golhar)
Date: Sat, 14 Jan 2012 08:48:18 -0500
Subject: [torqueusers] Do I have to define the ncpus for a compute node?
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102C8732ED6@exvic-mbx04.nexus.csiro.au>
References: <007DECE986B47F4EABF823C1FBB19C620102C8732ED6@exvic-mbx04.nexus.csiro.au>
Message-ID:

Thanks Gareth. I removed that setting, using

qmgr -c 'unset queue batch resources_default.nodes'

but I'm still getting the same error. I can submit jobs that request 1-3 ppn, but not 4 ppn.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120114/1f885475/attachment-0001.html

From andre.gemuend at scai.fraunhofer.de Sat Jan 14 08:47:07 2012
From: andre.gemuend at scai.fraunhofer.de (André Gemünd)
Date: Sat, 14 Jan 2012 16:47:07 +0100 (CET)
Subject: [torqueusers] Do I have to define the ncpus for a compute node?
In-Reply-To:
Message-ID:

Are you by any chance using Maui or some other external scheduler? I think it's suspicious that you can run ppn=3, equaling your node count. Perhaps your scheduler allocates separate nodes.

Greetings
André

--
André Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend at scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

From ngsbioinformatics at gmail.com Sat Jan 14 11:12:10 2012
From: ngsbioinformatics at gmail.com (Ryan Golhar)
Date: Sat, 14 Jan 2012 13:12:10 -0500
Subject: [torqueusers] Do I have to define the ncpus for a compute node?
In-Reply-To:
References:
Message-ID:

I only did it as a test. I'm using Torque and nothing else...I can submit jobs requiring 1, 2, and 3 cores. 4 cores doesn't work...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120114/b7a95767/attachment.html

From ngsbioinformatics at gmail.com Sat Jan 14 12:07:51 2012
From: ngsbioinformatics at gmail.com (Ryan Golhar)
Date: Sat, 14 Jan 2012 14:07:51 -0500
Subject: [torqueusers] Do I have to define the ncpus for a compute node?
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102C8732ED6@exvic-mbx04.nexus.csiro.au>
References: <007DECE986B47F4EABF823C1FBB19C620102C8732ED6@exvic-mbx04.nexus.csiro.au>
Message-ID:

By removing this I have the problem that all submitted jobs go to the same core.

[ryang at bic ~]$ qstat -f | grep exec_host
    exec_host = compute-0-0/0
    exec_host = compute-0-0/0
    exec_host = compute-0-0/0
    exec_host = compute-0-0/0

I added that resource default to keep jobs going to different cores.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120114/9b15725f/attachment-0001.html

From jascha.wang at gmail.com Sun Jan 15 23:43:42 2012
From: jascha.wang at gmail.com (Xiangqian Wang)
Date: Mon, 16 Jan 2012 14:43:42 +0800
Subject: [torqueusers] only one processor is used when using qsub -l procs flag
In-Reply-To: <2480F82F-48E3-432B-A291-01C0E8DC99F8@ldeo.columbia.edu>
References: <2480F82F-48E3-432B-A291-01C0E8DC99F8@ldeo.columbia.edu>
Message-ID:

Thanks, Gustavo.

Sorry for the typo in my previous email. I rechecked it, and the corrected report is as follows:

I tested Torque 2.5.8 and Maui 3.3.1 on a CentOS 6.0 node. The job script is:

#!/bin/sh
#PBS -N procsjob
#PBS -l procs=3
#PBS -q batch
ping localhost -c 100

and qstat shows "exec_host = snode02/0".

I then replaced it with this job script:

#!/bin/sh
#PBS -N procsjob
#PBS -l nodes=1:ppn=3
#PBS -q batch
ping localhost -c 100

and qstat shows "exec_host = snode02/2+snode02/1+snode02/0".

I changed Maui 3.3.1 to Maui 3.2.6p21 and tested again; qstat shows "exec_host = snode02/2+snode02/1+snode02/0" for both scripts. Maybe it's a Maui 3.3.1 problem?

2012/1/14 Gustavo Correa
> Hi Xiangqian
>
> Is it a typo in your email or did you comment out this line in your
> Torque/PBS script?
> [Note the double hash ##.]
>
> > ##PBS -l procs=3
>
> Have you tried this form instead?
>
> #PBS -l nodes=1:ppn=3
>
> For more details check 'man qsub' and 'man pbs_resources'.
>
> I hope it helps,
> Gus Correa
>
> On Jan 13, 2012, at 4:10 AM, Xiangqian Wang wrote:
>
> > my demo torque+maui cluster has one node with np=4 set fot it. i want to
> > submit a job requesting 3 processors, but when it start to run, i see only
> > one processor is used (qstat shows "exec_host = snode02/0").
> >
> > i use torque 2.5.6 and maui 3.3.1. anyone can help me out, it'll be
> > greatly appreciated
> >
> > the submit script is something like:
> >
> > #!/bin/sh
> > #PBS -N procsjob
> > ##PBS -l procs=3
> > #PBS -q batch
> > the output of checkjob is :
> >
> > checking job 33
> > State: Running
> > Creds: user:wangxq group:wangxq class:batch qos:DEFAULT
> > WallTime: 00:00:00 of 1:00:00
> > SubmitTime: Fri Jan 13 17:07:43
> > (Time Queued Total: 00:00:01 Eligible: 00:00:01)
> > StartTime: Fri Jan 13 17:07:44
> > Total Tasks: 1
> > Req[0] TaskCount: 1 Partition: DEFAULT
> > Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> > Opsys: [NONE] Arch: [NONE] Features: [NONE]
> > Exec: '' ExecSize: 0 ImageSize: 0
> > Dedicated Resources Per Task: PROCS: 1
> > Utilized Resources Per Task: [NONE]
> > Avg Util Resources Per Task: [NONE]
> > Max Util Resources Per Task: [NONE]
> > NodeAccess: SHARED
> > NodeCount: 0
> > Allocated Nodes:
> > [snode02:1]
> > Task Distribution: snode02
> >
> > IWD: [NONE] Executable: [NONE]
> > Bypass: 0 StartCount: 1
> > PartitionMask: [ALL]
> > Flags: RESTARTABLE
> > Reservation '33' (00:00:00 -> 1:00:00 Duration: 1:00:00)
> > PE: 1.00 StartPriority: 1
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120116/2850c848/attachment.html

From gus at ldeo.columbia.edu Mon Jan 16 08:21:38 2012
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Mon, 16 Jan 2012 10:21:38 -0500
Subject: [torqueusers] only one processor is used when using qsub -l procs flag
In-Reply-To:
References: <2480F82F-48E3-432B-A291-01C0E8DC99F8@ldeo.columbia.edu>
Message-ID: <25D63447-EB20-48A0-B428-D692BDFE38BB@ldeo.columbia.edu>

Hi Xiangqian

For what it is worth, I use Maui 3.2.6p21, and I don't have the problem you described.
I don't know the behavior in Maui 3.3.1, but as you reported 3.2.6p21 also works correctly for you,
with the nodes=1:ppn=3 syntax.
I am happy with 3.2.6p21.

There is still a chance that a change in maui.cfg 3.3.1 may fix this glitch,
but I don't know what it would be. Most likely it has to do with the node allocation policies,
and how it translates 'procs' into nodes and ppn.
Somebody else more savvy in the list may clarify this point.

I confess I prefer the more detailed syntax 'nodes=X:ppn=Y',
because it specifies more detail about the resources you are requesting,
and apparently avoids the issue that hit you.

Have you tried the 'nodes=1:ppn=3' syntax in Maui 3.3.1?
I wonder if it would work there too.

I hope this helps,
Gus Correa

On Jan 16, 2012, at 1:43 AM, Xiangqian Wang wrote:

> thanks, Gustavo
>
> sorry for the misspelling in the previous email, i recheck it and correct it as following:
>
> i tested torque 2.5.8 and maui 3.3.1 on a centos 6.0 node, the job script is:
>
> #!/bin/sh
> #PBS -N procsjob
> #PBS -l procs=3
> #PBS -q batch
> ping localhost -c 100
>
> and qstat output "exec_host = snode02/0".
> i replace with the new job script, as
>
> #!/bin/sh
> #PBS -N procsjob
> #PBS -l nodes=1:ppn=3
> #PBS -q batch
> ping localhost -c 100
> and qstat output "exec_host = snode02/2+snode02/1+snode02/0".
> > i change maui 3.3.1 to maui 3.2.6p21 and test again, qstat output "exec_host = snode02/2+snode02/1+snode02/0" for both script. maybe it's a maui 3.3.1 problem? > > > 2012/1/14 Gustavo Correa > Hi Xiangqian > > Is it a typo in your email or did you comment out this line in your Torque/PBS script? > [Note the double hash ##.] > > > ##PBS -l procs=3 > > Have you tried this form instead? > > #PBS -l nodes=1:ppn=3 > > For more details check 'man qsub' and 'man pbs_resources'. > > I hope it helps, > Gus Correa > > On Jan 13, 2012, at 4:10 AM, Xiangqian Wang wrote: > > > my demo torque+maui cluster has one node with np=4 set fot it. i want to submit a job requesting 3 processors, but when it start to run, i see only one processor is used (qstat shows "exec_host = snode02/0"). > > > > i use torque 2.5.6 and maui 3.3.1. anyone can help me out, it'll be greatly appreciated > > > > the submit script is something like: > > > > #!/bin/sh > > #PBS -N procsjob > > ##PBS -l procs=3 > > #PBS -q batch > > the output of checkjob is : > > > > checking job 33 > > State: Running > > Creds: user:wangxq group:wangxq class:batch qos:DEFAULT > > WallTime: 00:00:00 of 1:00:00 > > SubmitTime: Fri Jan 13 17:07:43 > > (Time Queued Total: 00:00:01 Eligible: 00:00:01) > > StartTime: Fri Jan 13 17:07:44 > > Total Tasks: 1 > > Req[0] TaskCount: 1 Partition: DEFAULT > > Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 > > Opsys: [NONE] Arch: [NONE] Features: [NONE] > > Exec: '' ExecSize: 0 ImageSize: 0 > > Dedicated Resources Per Task: PROCS: 1 > > Utilized Resources Per Task: [NONE] > > Avg Util Resources Per Task: [NONE] > > Max Util Resources Per Task: [NONE] > > NodeAccess: SHARED > > NodeCount: 0 > > Allocated Nodes: > > [snode02:1] > > Task Distribution: snode02 > > > > IWD: [NONE] Executable: [NONE] > > Bypass: 0 StartCount: 1 > > PartitionMask: [ALL] > > Flags: RESTARTABLE > > Reservation '33' (00:00:00 -> 1:00:00 Duration: 1:00:00) > > PE: 1.00 StartPriority: 1 > > 
_______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From gus at ldeo.columbia.edu Mon Jan 16 08:50:09 2012
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Mon, 16 Jan 2012 10:50:09 -0500
Subject: [torqueusers] only one processor is used when using qsub -l procs flag
In-Reply-To: <25D63447-EB20-48A0-B428-D692BDFE38BB@ldeo.columbia.edu>
References: <2480F82F-48E3-432B-A291-01C0E8DC99F8@ldeo.columbia.edu> <25D63447-EB20-48A0-B428-D692BDFE38BB@ldeo.columbia.edu>
Message-ID: <0DE0CAFE-5F94-4C0A-98DA-07A637AF7C96@ldeo.columbia.edu>

PS - Hi Xiangqian.

Maybe you need to add this line to your maui.cfg [and restart maui],
for the 'procs=Z' syntax to work as you expect:

JOBNODEMATCHPOLICY EXACTNODE

I *think* the default is

JOBNODEMATCHPOLICY EXACTPROC

which expects your node to have the exact number of processors you requested [i.e. 3].

See appendix F of the Maui Administrator Guide for details.

I am not sure, but my recollection is that somebody reported a problem similar to yours
in the list before, and the solution suggested was this one.

I hope this helps,
Gus Correa

On Jan 16, 2012, at 10:21 AM, Gustavo Correa wrote:

> Hi Xiangqian
>
> For what it is worth, I use Maui 3.2.6p21, and I don't have the problem you described.
> I don't know the behavior in Maui 3.3.1, but as you reported 3.2.6p21 also works correctly for you,
> with the nodes=1:ppn=3 syntax.
> I am happy with 3.2.6p21.
>
> There is still a chance that a change in maui.cfg 3.3.1 may fix this glitch,
> but I don't know what it would be.
Most likely it has to do with the node allocation policies, > and how it translates 'procs' into nodes and ppn. > Somebody else more savvy in the list may clarify this point. > > I confess I prefer the more detailed syntax 'nodes=X:ppn=Y', > because it specifies more detail about the resources you are requesting, > and apparently avoids the issue that hit you. > > Have you tried the 'nodes=1:ppn=3' syntax in Maui 3.3.1? > I wonder if it would work there too. > > I hope this helps, > Gus Correa > > > On Jan 16, 2012, at 1:43 AM, Xiangqian Wang wrote: > >> thanks, Gustavo >> >> sorry for the misspelling in the previous email, i recheck it and correct it as following: >> >> i tested torque 2.5.8 and maui 3.3.1 on a centos 6.0 node, the job script is: >> >> #!/bin/sh >> #PBS -N procsjob >> #PBS -l procs=3 >> #PBS -q batch >> ping localhost -c 100 >> >> and qstat output "exec_host = snode02/0". >> i replace with the new job script, as >> >> #!/bin/sh >> #PBS -N procsjob >> #PBS -l nodes=1:ppn=3 >> #PBS -q batch >> ping localhost -c 100 >> and qstat output "exec_host = snode02/2+snode02/1+snode02/0". >> >> i change maui 3.3.1 to maui 3.2.6p21 and test again, qstat output "exec_host = snode02/2+snode02/1+snode02/0" for both script. maybe it's a maui 3.3.1 problem? >> >> >> 2012/1/14 Gustavo Correa >> Hi Xiangqian >> >> Is it a typo in your email or did you comment out this line in your Torque/PBS script? >> [Note the double hash ##.] >> >>> ##PBS -l procs=3 >> >> Have you tried this form instead? >> >> #PBS -l nodes=1:ppn=3 >> >> For more details check 'man qsub' and 'man pbs_resources'. >> >> I hope it helps, >> Gus Correa >> >> On Jan 13, 2012, at 4:10 AM, Xiangqian Wang wrote: >> >>> my demo torque+maui cluster has one node with np=4 set fot it. i want to submit a job requesting 3 processors, but when it start to run, i see only one processor is used (qstat shows "exec_host = snode02/0"). >>> >>> i use torque 2.5.6 and maui 3.3.1. 
anyone can help me out, it'll be greatly appreciated >>> >>> the submit script is something like: >>> >>> #!/bin/sh >>> #PBS -N procsjob >>> ##PBS -l procs=3 >>> #PBS -q batch >>> the output of checkjob is : >>> >>> checking job 33 >>> State: Running >>> Creds: user:wangxq group:wangxq class:batch qos:DEFAULT >>> WallTime: 00:00:00 of 1:00:00 >>> SubmitTime: Fri Jan 13 17:07:43 >>> (Time Queued Total: 00:00:01 Eligible: 00:00:01) >>> StartTime: Fri Jan 13 17:07:44 >>> Total Tasks: 1 >>> Req[0] TaskCount: 1 Partition: DEFAULT >>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 >>> Opsys: [NONE] Arch: [NONE] Features: [NONE] >>> Exec: '' ExecSize: 0 ImageSize: 0 >>> Dedicated Resources Per Task: PROCS: 1 >>> Utilized Resources Per Task: [NONE] >>> Avg Util Resources Per Task: [NONE] >>> Max Util Resources Per Task: [NONE] >>> NodeAccess: SHARED >>> NodeCount: 0 >>> Allocated Nodes: >>> [snode02:1] >>> Task Distribution: snode02 >>> >>> IWD: [NONE] Executable: [NONE] >>> Bypass: 0 StartCount: 1 >>> PartitionMask: [ALL] >>> Flags: RESTARTABLE >>> Reservation '33' (00:00:00 -> 1:00:00 Duration: 1:00:00) >>> PE: 1.00 StartPriority: 1 >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From Z.Rashid at uu.nl Mon Jan 16 09:33:24 2012 From: Z.Rashid at uu.nl (Rashid, Z. 
(Zahid))
Date: Mon, 16 Jan 2012 16:33:24 +0000
Subject: [torqueusers] TORQUE-3.0.3 on Mac OS X 10.7.2
Message-ID:

Instead of using the mach_absolute_time(.....) suggested by Jon, I used gettimeofday(.....) as described on this page:

http://www.clusterresources.com/bugzilla/attachment.cgi?id=95&action=diff

which obviously works.
With another minor change (i.e., changing mom_mach.c:130 #include to #include because the former does not work) I get another error at the link step:

gcc -g -O2 -o .libs/pbs_mom catch_child.o mom_comm.o mom_inter.o mom_main.o mom_server.o prolog.o requests.o start_exec.o checkpoint.o tmsock_recov.o req_quejob.o job_func.o attr_recov.o dis_read.o job_attr_def.o job_recov.o process_request.o reply_send.o resc_def_all.o job_qs_upgrade.o darwin/libmommach.a ../lib/Libattr/libattr.a ../lib/Libsite/libsite.a ../lib/Libutils/libutils.a ../lib/Libpbs/.libs/libtorque.dylib -lpthread -lrt
ld: library not found for -lrt
collect2: ld returned 1 exit status
make[3]: *** [pbs_mom] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

I searched Google for a clue but have not got anywhere yet. Any help/suggestion?

Regards.

Zahid

________________________________________
From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Jon Bringhurst [jonb at lanl.gov]
Sent: 13 January 2012 23:02
To: Torque Users Mailing List
Subject: Re: [torqueusers] TORQUE on Mac OS X 10.7.2

Instead of clock_gettime(...) you need to use mach_absolute_time(...).

http://developer.apple.com/library/mac/#qa/qa1398/_index.html

Example patch that was used for mysqld: http://lists.mysql.com/commits/70966

-Jon

On Jan 13, 2012, at 2:43 PM, Rashid, Z.
(Zahid) wrote:
> Dear All,
>
> I am trying to compile TORQUE on Mac Book Pro (Intel Core i7, NCores = 4) with OS X 10.7.2, Xcode 4.2, and gcc 4.2.1 [Target: i686-apple-darwin11
> Configured with: /private/var/tmp/llvmgcc42/llvmgcc42-2336.1~1/src/configure --disable-checking --enable-werror --prefix=/Developer/usr/llvm-gcc-4.2 --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-prefix=llvm- --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin11 --enable-llvm=/private/var/tmp/llvmgcc42/llvmgcc42-2336.1~1/dst-llvmCore/Developer/usr/local --program-prefix=i686-apple-darwin11- --host=x86_64-apple-darwin11 --target=i686-apple-darwin11 --with-gxx-include-dir=/usr/include/c++/4.2.1
> Thread model: posix
> gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.1.00)] installed.
>
> I want to compile TORQUE on this machine.
>
> Torque-3.0.3 configured with
>
> only "configure" gives the following error when I do "make"
>
> gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -I../../../src/lib/Libdis -DIFF_PATH=\"/usr/local/torque/sbin/pbs_iff\" -DPBS_DEFAULT_FILE=\"/var/spool/torque/server_name\" -DPBS_SERVER_HOME=\"/var/spool/torque\" -g -O2 -MT pbsD_connect.lo -MD -MP -MF .deps/pbsD_connect.Tpo -c ../Libifl/pbsD_connect.c -fno-common -DPIC -o .libs/pbsD_connect.o
> ../Libifl/pbsD_connect.c: In function ‘send_unix_creds’:
> ../Libifl/pbsD_connect.c:688: error: ‘struct ucred’ has no member named ‘cr_uid’
> ../Libifl/pbsD_connect.c:689: error: ‘struct ucred’ has no member named ‘cr_groups’
> make[3]: *** [pbsD_connect.lo] Error 1
> make[2]: *** [all-recursive] Error 1
> make[1]: *** [all-recursive] Error 1
> make: *** [all-recursive] Error 1
>
> With
>
> configure --disable-unixsockets --disable-gcc-warnings
>
> or
>
> configure --with-default-server=name --with-server-home=/var/spool/pbs --with-rcp=scp --disable-unixsockets --disable-gcc-warnings
>
> or
>
> configure --with-default-server=name --with-server-home=/var/spool/pbs --with-rcp=scp --host=x86_64-apple-darwin11 --build=x86_64-apple-darwin11 --target=x86_64-apple-darwin11 --disable-unixsockets --disable-gcc-warnings
>
> always gives the following error when I run "make" command.
>
> if gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -g -O2 -MT u_dynamic_string.o -MD -MP -MF ".deps/u_dynamic_string.Tpo" -c -o u_dynamic_string.o u_dynamic_string.c; \
> then mv -f ".deps/u_dynamic_string.Tpo" ".deps/u_dynamic_string.Po"; else rm -f ".deps/u_dynamic_string.Tpo"; exit 1; fi
> u_threadpool.c: In function ‘work_thread’:
> u_threadpool.c:246: error: ‘CLOCK_REALTIME’ undeclared (first use in this function)
> u_threadpool.c:246: error: (Each undeclared identifier is reported only once
> u_threadpool.c:246: error: for each function it appears in.)
> make[3]: *** [u_threadpool.o] Error 1
> make[3]: *** Waiting for unfinished jobs....
> make[2]: *** [all-recursive] Error 1
> make[1]: *** [all-recursive] Error 1
> make: *** [all-recursive] Error 1
>
> In "configure" it does not show any error or warning message.
>
> I also tried the older versions of TORQUE
> TORQUE-3.0.0 version gives error
>
> if gcc -DHAVE_CONFIG_H -I. -I.
-I../../src/include -I../../src/include -DPBS_SERVER_HOME=\"/var/spool/pbs\" -DPBS_ENVIRON=\"/var/spool/pbs/pbs_environment\" -g -O2 -MT req_runjob.o -MD -MP -MF ".deps/req_runjob.Tpo" -c -o req_runjob.o req_runjob.c; \
> then mv -f ".deps/req_runjob.Tpo" ".deps/req_runjob.Po"; else rm -f ".deps/req_runjob.Tpo"; exit 1; fi
> req_runjob.c: In function ‘post_sendmom’:
> req_runjob.c:1135: error: ‘ulong’ undeclared (first use in this function)
> req_runjob.c:1135: error: (Each undeclared identifier is reported only once
> req_runjob.c:1135: error: for each function it appears in.)
> req_runjob.c:1135: error: expected ‘;’ before ‘addr’
> req_runjob.c:1266: error: ‘addr’ undeclared (first use in this function)
> make[2]: *** [req_runjob.o] Error 1
> make[1]: *** [all-recursive] Error 1
> make: *** [all-recursive] Error 1
>
> while TORQUE-2.5.0 or earlier versions give
>
> if gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -DPBS_MOM -DDEMUX=\"/usr/local/torque/sbin/pbs_demux\" -g -O2 -MT mom_mach.o -MD -MP -MF ".deps/mom_mach.Tpo" -c -o mom_mach.o mom_mach.c; \
> then mv -f ".deps/mom_mach.Tpo" ".deps/mom_mach.Po"; else rm -f ".deps/mom_mach.Tpo"; exit 1; fi
> mom_mach.c:130:27: error: ufs/ufs/quota.h: No such file or directory
> mom_mach.c: In function ‘quota’:
> mom_mach.c:3002: error: storage size of ‘qi’ isn’t known
> mom_mach.c:3166: error: ‘Q_GETQUOTA’ undeclared (first use in this function)
> mom_mach.c:3166: error: (Each undeclared identifier is reported only once
> mom_mach.c:3166: error: for each function it appears in.)
> make[3]: *** [mom_mach.o] Error 1
> make[2]: *** [all-recursive] Error 1
> make[1]: *** [all-recursive] Error 1
> make: *** [all-recursive] Error 1
>
> I also tried the Intel icc instead of gcc in all cases, but the errors do not seem to be compiler dependent.
> With version 3.0.3, perhaps "clock_gettime(CLOCK_REALTIME, &ts);" is not available in Mac OS X, but how can I get around this?
Any help with any of the above versions of TORQUE would be appreciated. > > Regards. > > Zahid > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From leggett at mcs.anl.gov Mon Jan 16 09:44:18 2012 From: leggett at mcs.anl.gov (Ti Leggett) Date: Mon, 16 Jan 2012 10:44:18 -0600 Subject: [torqueusers] Torque 2.5.9 MOMs keep segfaulting In-Reply-To: <1ebb959b-2dde-4ef0-9f8e-089d2ffb5d29@mail> References: <1ebb959b-2dde-4ef0-9f8e-089d2ffb5d29@mail> Message-ID: They seem to die immediately. I can't really run them in gdb since it's randomly on nodes and I haven't found a way to trigger the failure. On Jan 11, 2012, at 2:52 PM, David Beer wrote: > Do they segfault right away? If you can't find a core file, would it be possible to run the mom in gdb and get a backtrace of the crash when it happens? > > David > > ----- Original Message ----- >> torque was configured with --with-debug, "ulimit -c unlimited" is in >> the init script right before the moms are started like >> "/usr/sbin/pbs_mom -p -d /var/spool/torque" but I'm still not seeing >> a core file anywhere. >> >> On Jan 11, 2012, at 10:26 AM, David Beer wrote: >> >>> >>> >>> ----- Original Message ----- >>>> I finally got around to doing this, but I don't see a core file in >>>> /var/spool/torque or in /usr/sbin. Where would the core get >>>> dumped? >>>> >>> >>> A mom's core file would be in /var/spool/torque/mom_priv. You need >>> to make sure ulimit -c is unlimited or set to a very large number. 
>>> >>> David >>> >>>> On Dec 20, 2011, at 3:03 PM, Ken Nielson wrote: >>>> >>>>> ----- Original Message ----- >>>>>> From: "Troy Baer" >>>>>> To: "Torque Users Mailing List" >>>>>> Sent: Tuesday, December 20, 2011 8:59:56 AM >>>>>> Subject: Re: [torqueusers] Torque 2.5.9 MOMs keep segfaulting >>>>>> >>>>>> On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote: >>>>>>> I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then, >>>>>>> MOMs >>>>>>> keep randomly segfaulting and dying. I see this in the MOM log >>>>>>> right before dying: >>>>>>> >>>>>>> 12/08/2011 10:09:14;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad >>>>>>> file >>>>>>> descriptor (9) in tm_request, comm failed Protocol failure in >>>>>>> commit >>>>>>> >>>>>>> >>>>>>> And something similar to this in dmesg: >>>>>>> >>>>>>> pbs_mom[22354]: segfault at 0000000000000008 rip >>>>>>> 00002b585249ed6f >>>>>>> rsp 00007fff19e96df0 error 4 >>>>>> >>>>>> We've also seen this on one of our systems and had to fall back >>>>>> to >>>>>> 2.5.8 >>>>>> on it. >>>>>> >>>>>> --Troy >>>>>> -- >>>>>> Troy Baer, HPC System Administrator >>>>>> National Institute for Computational Sciences, University of >>>>>> Tennessee >>>>>> http://www.nics.tennessee.edu/ >>>>>> Phone: 865-241-4233 >>>>> >>>>> Could someone configure TORQUE using --with-debug and then send a >>>>> stack trace of the crash? 
>>>>>
>>>>> Ken
>>>>> _______________________________________________
>>>>> torqueusers mailing list
>>>>> torqueusers at supercluster.org
>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>
>>>>
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>
>>>
>>> --
>>> David Beer
>>> Direct Line: 801-717-3386 | Fax: 801-717-3738
>>> Adaptive Computing
>>> 1712 S East Bay Blvd, Suite 300
>>> Provo, UT 84606
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>>
>
> --
> David Beer
> Direct Line: 801-717-3386 | Fax: 801-717-3738
> Adaptive Computing
> 1712 S East Bay Blvd, Suite 300
> Provo, UT 84606
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 203 bytes
Desc: Message signed with OpenPGP using GPGMail
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120116/57813c70/attachment.bin

From dbeer at adaptivecomputing.com Mon Jan 16 11:34:25 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 16 Jan 2012 11:34:25 -0700 (MST)
Subject: [torqueusers] 2.5.10 Is Released
In-Reply-To: <5a2434d8-66e1-4d9a-97df-85bda7de9b3f@mail>
Message-ID: <3b4d1134-0d57-4918-8b1e-8b7138eaecda@mail>

All,

TORQUE 2.5.10 is available for download. It can be found here:
http://www.adaptivecomputing.com/resources/downloads/torque/torque-2.5.10.tar.gz

CHANGELOG:

2.5.10
  b - Fixed a problem where pbs_mom will crash if check_pwd returns NULL.
      This could happen for example if LDAP was down and getpwnam returns
      NULL.
  b - Removed a check for Interactive jobs in qsub and the -l flag. This
      check appeared to be code that was never completed and it prevented
      the passing of resource arguments.
  e - Added code to delete a job on the MOM if a job is in the EXITED
      substate and going through the scan_for_exiting code. This happens
      when an obit has been sent and the obit reply received, but the
      PBS_BATCH_DeleteJob request has not been received from the server on
      the MOM. This fix allows the MOM to delete the job and free up
      resources even if the server for some reason does not send the delete
      job request.
  c - fix a crash in the dynamic_string.c code (backported from 3.0.3)
  e - add a mom config option - $ext_pwd_retry - to specify # of retries on
      checking for password validity. (backported from 3.0.3)
  b - TRQ-608: Removed code to check for blocking mode in
      write_nonblocking_socket(). Fixes problem with interactive jobs
      (qsub -I) exiting prematurely.
  c - fix a buffer being overrun with nvidia gpus enabled (backported from
      3.0.4)
  b - To fix a problem in 2.5.9 where the job_array structure was modified
      without changing the version or creating an upgrade path. This made it
      incompatible with previous versions of TORQUE 2.5 and 3.0. Added new
      array structure job_array_259. This is the original torque 2.5.9
      job_array structure with the num_purged element added in the middle of
      the structure. job_array_259 was created so users could upgrade from
      2.5.9 and 3.0.3 to later versions of TORQUE. The job_array structure
      was modified by moving the num_purged element to the bottom of the
      structure. pbsd_init now has an upgrade path for job arrays from
      version 3 to version 4. However, there is an exceptional case when
      upgrading from 2.5.9 or 3.0.3 where pbs_server must be started using a
      new -u option.
  b - no longer leave zombie processes when munge authenticating.
      (backported from 3.0.4)
  b - no longer reject procs if it is the second argument to -l (backported
      from 3.0.4)
  b - when having pbs_mom re-read the config file, old servers were kept,
      and pbs_mom attempted to communicate with those as well. Now they are
      cleared and only the new server(s) are contacted.
      (backported from 3.0.4)
  b - pbsnodes -l can now search on all valid node states (backported from
      3.0.4)
  e - Improvements in munge handling of client connections and
      authentication.
  b - block SIGCHLD while reading the munge file to avoid false errors
      (backported from 3.0.4)

--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606

From dbeer at adaptivecomputing.com Mon Jan 16 11:39:19 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 16 Jan 2012 11:39:19 -0700 (MST)
Subject: [torqueusers] 3.0.4 Is Now Released
In-Reply-To: <7239f124-3987-405d-8ace-dd3c4ab9a9bd@mail>
Message-ID:

All,

3.0.4 is now available for download here:
http://www.adaptivecomputing.com/resources/downloads/torque/torque-3.0.4.tar.gz

CHANGELOG:

3.0.4
  c - fix a buffer being overrun with nvidia gpus enabled
  b - no longer leave zombie processes when munge authenticating.
  b - no longer reject procs if it is the second argument to -l
  b - when having pbs_mom re-read the config file, old servers were kept,
      and pbs_mom attempted to communicate with those as well. Now they are
      cleared and only the new server(s) are contacted.
  b - pbsnodes -l can now search on all valid node states
  e - Added functionality that allows the values for the server parameter
      authorized_users to use wild cards for both the user and host portion.
  e - Improvements in munge handling of client connections and
      authentication.
  b - block SIGCHLD while reading the munge file to avoid false errors.
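
For readers unfamiliar with the last changelog entry, the general idea of blocking SIGCHLD around a read can be sketched as follows. This is only an illustration in Python of the signal-mask pattern, not TORQUE's code (TORQUE is written in C, and its implementation is not shown in this thread). The function name `read_with_sigchld_blocked` is hypothetical, and the sketch assumes a Unix system where `signal.pthread_sigmask` is available.

```python
import signal


def read_with_sigchld_blocked(path):
    """Read a whole file while SIGCHLD is blocked, then restore the old mask.

    Hypothetical helper for illustration only; not part of TORQUE.
    """
    # Block SIGCHLD for this thread. pthread_sigmask returns the previous
    # mask, which is kept so it can be restored afterwards.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        with open(path, "rb") as handle:
            return handle.read()
    finally:
        # Restore the saved mask. A SIGCHLD that arrived while blocked stays
        # pending and is delivered once the signal is unblocked again.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

Calling `read_with_sigchld_blocked("/path/to/file")` returns the file contents as bytes. The try/finally keeps the mask change scoped to the read, so a child process exiting at the wrong moment cannot interrupt it, and the caller's signal mask is left as it was found.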
--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606

From jascha.wang at gmail.com Mon Jan 16 23:56:29 2012
From: jascha.wang at gmail.com (Xiangqian Wang)
Date: Tue, 17 Jan 2012 14:56:29 +0800
Subject: [torqueusers] only one processor is used when using qsub -l procs flag
In-Reply-To: <0DE0CAFE-5F94-4C0A-98DA-07A637AF7C96@ldeo.columbia.edu>
References: <2480F82F-48E3-432B-A291-01C0E8DC99F8@ldeo.columbia.edu> <25D63447-EB20-48A0-B428-D692BDFE38BB@ldeo.columbia.edu> <0DE0CAFE-5F94-4C0A-98DA-07A637AF7C96@ldeo.columbia.edu>
Message-ID:

The processor allocation for a 'nodes=1:ppn=3' request is correct with both maui 3.3.1 and 3.2.6p21.
I tried adding "JOBNODEMATCHPOLICY EXACTNODE" to the maui 3.3.1 config file, but a job submitted with the "procs" syntax is still allocated only one processor.
I compared the config files of maui 3.3.1 and maui 3.2.6p21 and found no differences.
I should probably use maui 3.2.6p21 for the moment if I want to submit jobs with the "procs" syntax.

BTW, I am wondering about the strengths and weaknesses of using "procs". I would rather not have to care about the hardware configuration and its current usage, but maybe this laziness comes at some cost in performance.

Thank you, Gustavo!

Xiangqian

2012/1/16 Gustavo Correa

> PS - Hi Xiangqian.
>
> Maybe you need to add this line to your maui.cfg [and restart maui],
> for the 'procs=Z' syntax to work as you expect:
>
> JOBNODEMATCHPOLICY EXACTNODE
>
> I *think* the default is
>
> JOBNODEMATCHPOLICY EXACTPROC
>
> which expects your node to have the exact number of processors you
> requested [i.e. 3].
>
> See appendix F of the Maui Administrator Guide for details.
>
> I am not sure, but my recollection is that somebody reported a problem
> similar to yours
> in the list before, and the solution suggested was this one.
> > I hope this helps, > Gus Correa > > On Jan 16, 2012, at 10:21 AM, Gustavo Correa wrote: > > > Hi Xiangqian > > > > For what it is worth, I use Maui 3.2.6p21, and I don't have the problem > you described. > > I don't know the behavior in Maui 3.3.1, but as you reported 3.2.6p1 > also works correctly for you, > > with the nodes-1:ppn=3 syntax. > > I am happy with 3.2.6p21. > > > > There is still a chance that a change in maui.cfg 3.3.1 may fix this > glitch, > > but I don't know what it would be. Most likely it has to do with the > node allocation policies, > > and how it translates 'procs' into nodes and ppn. > > Somebody else more savvy in the list may clarify this point. > > > > I confess I prefer the more detailed syntax 'nodes=X:ppn=Y', > > because it specifies more detail about the resources you are requesting, > > and apparently avoids the issue that hit you. > > > > Have you tried the 'nodes=1:ppn=3' syntax in Maui 3.3.1? > > I wonder if it would work there too. > > > > I hope this helps, > > Gus Correa > > > > > > On Jan 16, 2012, at 1:43 AM, Xiangqian Wang wrote: > > > >> thanks, Gustavo > >> > >> sorry for the misspelling in the previous email, i recheck it and > correct it as following: > >> > >> i tested torque 2.5.8 and maui 3.3.1 on a centos 6.0 node, the job > script is: > >> > >> #!/bin/sh > >> #PBS -N procsjob > >> #PBS -l procs=3 > >> #PBS -q batch > >> ping localhost -c 100 > >> > >> and qstat output "exec_host = snode02/0". > >> i replace with the new job script, as > >> > >> #!/bin/sh > >> #PBS -N procsjob > >> #PBS -l nodes=1:ppn=3 > >> #PBS -q batch > >> ping localhost -c 100 > >> and qstat output "exec_host = snode02/2+snode02/1+snode02/0". > >> > >> i change maui 3.3.1 to maui 3.2.6p21 and test again, qstat output > "exec_host = snode02/2+snode02/1+snode02/0" for both script. maybe it's a > maui 3.3.1 problem? 
> >> > >> > >> 2012/1/14 Gustavo Correa > >> Hi Xiangqian > >> > >> Is it a typo in your email or did you comment out this line in your > Torque/PBS script? > >> [Note the double hash ##.] > >> > >>> ##PBS -l procs=3 > >> > >> Have you tried this form instead? > >> > >> #PBS -l nodes=1:ppn=3 > >> > >> For more details check 'man qsub' and 'man pbs_resources'. > >> > >> I hope it helps, > >> Gus Correa > >> > >> On Jan 13, 2012, at 4:10 AM, Xiangqian Wang wrote: > >> > >>> my demo torque+maui cluster has one node with np=4 set fot it. i want > to submit a job requesting 3 processors, but when it start to run, i see > only one processor is used (qstat shows "exec_host = snode02/0"). > >>> > >>> i use torque 2.5.6 and maui 3.3.1. anyone can help me out, it'll be > greatly appreciated > >>> > >>> the submit script is something like: > >>> > >>> #!/bin/sh > >>> #PBS -N procsjob > >>> ##PBS -l procs=3 > >>> #PBS -q batch > >>> the output of checkjob is : > >>> > >>> checking job 33 > >>> State: Running > >>> Creds: user:wangxq group:wangxq class:batch qos:DEFAULT > >>> WallTime: 00:00:00 of 1:00:00 > >>> SubmitTime: Fri Jan 13 17:07:43 > >>> (Time Queued Total: 00:00:01 Eligible: 00:00:01) > >>> StartTime: Fri Jan 13 17:07:44 > >>> Total Tasks: 1 > >>> Req[0] TaskCount: 1 Partition: DEFAULT > >>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 > >>> Opsys: [NONE] Arch: [NONE] Features: [NONE] > >>> Exec: '' ExecSize: 0 ImageSize: 0 > >>> Dedicated Resources Per Task: PROCS: 1 > >>> Utilized Resources Per Task: [NONE] > >>> Avg Util Resources Per Task: [NONE] > >>> Max Util Resources Per Task: [NONE] > >>> NodeAccess: SHARED > >>> NodeCount: 0 > >>> Allocated Nodes: > >>> [snode02:1] > >>> Task Distribution: snode02 > >>> > >>> IWD: [NONE] Executable: [NONE] > >>> Bypass: 0 StartCount: 1 > >>> PartitionMask: [ALL] > >>> Flags: RESTARTABLE > >>> Reservation '33' (00:00:00 -> 1:00:00 Duration: 1:00:00) > >>> PE: 1.00 StartPriority: 1 > >>> 
_______________________________________________ > >>> torqueusers mailing list > >>> torqueusers at supercluster.org > >>> http://www.supercluster.org/mailman/listinfo/torqueusers > >> > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > >> > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120117/55a28089/attachment-0001.html From jonb at lanl.gov Tue Jan 17 09:20:08 2012 From: jonb at lanl.gov (Jon Bringhurst) Date: Tue, 17 Jan 2012 09:20:08 -0700 Subject: [torqueusers] TORQUE-3.0.3 on Mac OS X 10.7.2 In-Reply-To: References: Message-ID: I haven't looked at this particular section of code, so your patch may work just fine. However, I'd just like to point out that gettimeofday is not real-time safe. Assuming something depends on this property, using gettimeofday may cause a (really difficult to debug) race condition. -Jon On Jan 16, 2012, at 9:33 AM, Rashid, Z. (Zahid) wrote: > Instead of using the mach_absolute_time(.....) suggested by Jon, I used gettimeofday(.....) as described on the page; > > http://www.clusterresources.com/bugzilla/attachment.cgi?id=95&action=diff > > which obviously works. 
> With another minor change (i.e., changing mom_mach.c:130 #include to #include because the earlier does not work) I get another error during
>
> gcc -g -O2 -o .libs/pbs_mom catch_child.o mom_comm.o mom_inter.o mom_main.o mom_server.o prolog.o requests.o start_exec.o checkpoint.o tmsock_recov.o req_quejob.o job_func.o attr_recov.o dis_read.o job_attr_def.o job_recov.o process_request.o reply_send.o resc_def_all.o job_qs_upgrade.o darwin/libmommach.a ../lib/Libattr/libattr.a ../lib/Libsite/libsite.a ../lib/Libutils/libutils.a ../lib/Libpbs/.libs/libtorque.dylib -lpthread -lrt
> ld: library not found for -lrt
> collect2: ld returned 1 exit status
> make[3]: *** [pbs_mom] Error 1
> make[2]: *** [all-recursive] Error 1
> make[1]: *** [all-recursive] Error 1
> make: *** [all-recursive] Error 1
>
> I tried Google to get some clue but did not get anywhere yet. Any help/suggestion?
>
> Regards.
>
> Zahid
>
> ________________________________________
> From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Jon Bringhurst [jonb at lanl.gov]
> Sent: 13 January 2012 23:02
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] TORQUE on Mac OS X 10.7.2
>
> Instead of clock_gettime(...) you need to use mach_absolute_time(...).
>
> http://developer.apple.com/library/mac/#qa/qa1398/_index.html
>
> Example patch that was used for mysqld: http://lists.mysql.com/commits/70966
>
> -Jon
>
> On Jan 13, 2012, at 2:43 PM, Rashid, Z. (Zahid) wrote:
>
>> Dear All,
>>
>> I am trying to compile TORQUE on Mac Book Pro (Intel Core i7, NCores = 4) with OS X 10.7.2, Xcode 4.2, and gcc 4.2.1 [Target: i686-apple-darwin11
>> Configured with: /private/var/tmp/llvmgcc42/llvmgcc42-2336.1~1/src/configure --disable-checking --enable-werror --prefix=/Developer/usr/llvm-gcc-4.2 --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-prefix=llvm- --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin11 --enable-llvm=/private/var/tmp/llvmgcc42/llvmgcc42-2336.1~1/dst-llvmCore/Developer/usr/local --program-prefix=i686-apple-darwin11- --host=x86_64-apple-darwin11 --target=i686-apple-darwin11 --with-gxx-include-dir=/usr/include/c++/4.2.1
>> Thread model: posix
>> gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.1.00)] installed.
>>
>> I want to compile TORQUE on this machine.
>>
>> Torque-3.0.3 configured with
>>
>> only "configure" gives the following error when I do "make"
>>
>> gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -I../../../src/lib/Libdis -DIFF_PATH=\"/usr/local/torque/sbin/pbs_iff\" -DPBS_DEFAULT_FILE=\"/var/spool/torque/server_name\" -DPBS_SERVER_HOME=\"/var/spool/torque\" -g -O2 -MT pbsD_connect.lo -MD -MP -MF .deps/pbsD_connect.Tpo -c ../Libifl/pbsD_connect.c -fno-common -DPIC -o .libs/pbsD_connect.o
>> ../Libifl/pbsD_connect.c: In function 'send_unix_creds':
>> ../Libifl/pbsD_connect.c:688: error: 'struct ucred' has no member named 'cr_uid'
>> ../Libifl/pbsD_connect.c:689: error: 'struct ucred' has no member named 'cr_groups'
>> make[3]: *** [pbsD_connect.lo] Error 1
>> make[2]: *** [all-recursive] Error 1
>> make[1]: *** [all-recursive] Error 1
>> make: *** [all-recursive] Error 1
>>
>> With
>>
>> configure --disable-unixsockets --disable-gcc-warnings
>>
>> or
>>
>> configure --with-default-server=name --with-server-home=/var/spool/pbs --with-rcp=scp --disable-unixsockets --disable-gcc-warnings
>>
>> or
>>
>> configure --with-default-server=name --with-server-home=/var/spool/pbs --with-rcp=scp --host=x86_64-apple-darwin11 --build=x86_64-apple-darwin11 --target=x86_64-apple-darwin11 --disable-unixsockets --disable-gcc-warnings
>>
>> always gives the following error when I run "make" command.
>>
>> if gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -g -O2 -MT u_dynamic_string.o -MD -MP -MF ".deps/u_dynamic_string.Tpo" -c -o u_dynamic_string.o u_dynamic_string.c; \
>> then mv -f ".deps/u_dynamic_string.Tpo" ".deps/u_dynamic_string.Po"; else rm -f ".deps/u_dynamic_string.Tpo"; exit 1; fi
>> u_threadpool.c: In function 'work_thread':
>> u_threadpool.c:246: error: 'CLOCK_REALTIME' undeclared (first use in this function)
>> u_threadpool.c:246: error: (Each undeclared identifier is reported only once
>> u_threadpool.c:246: error: for each function it appears in.)
>> make[3]: *** [u_threadpool.o] Error 1
>> make[3]: *** Waiting for unfinished jobs....
>> make[2]: *** [all-recursive] Error 1
>> make[1]: *** [all-recursive] Error 1
>> make: *** [all-recursive] Error 1
>>
>> In "configure" it does not show any error or warning message.
>>
>> I also tried the older versions of TORQUE
>> TORQUE-3.0.0 version gives error
>>
>> if gcc -DHAVE_CONFIG_H -I. -I. -I../../src/include -I../../src/include -DPBS_SERVER_HOME=\"/var/spool/pbs\" -DPBS_ENVIRON=\"/var/spool/pbs/pbs_environment\" -g -O2 -MT req_runjob.o -MD -MP -MF ".deps/req_runjob.Tpo" -c -o req_runjob.o req_runjob.c; \
>> then mv -f ".deps/req_runjob.Tpo" ".deps/req_runjob.Po"; else rm -f ".deps/req_runjob.Tpo"; exit 1; fi
>> req_runjob.c: In function 'post_sendmom':
>> req_runjob.c:1135: error: 'ulong' undeclared (first use in this function)
>> req_runjob.c:1135: error: (Each undeclared identifier is reported only once
>> req_runjob.c:1135: error: for each function it appears in.)
>> req_runjob.c:1135: error: expected ';' before 'addr'
>> req_runjob.c:1266: error: 'addr' undeclared (first use in this function)
>> make[2]: *** [req_runjob.o] Error 1
>> make[1]: *** [all-recursive] Error 1
>> make: *** [all-recursive] Error 1
>>
>> while TORQUE-2.5.0 or earlier versions give
>>
>> if gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -DPBS_MOM -DDEMUX=\"/usr/local/torque/sbin/pbs_demux\" -g -O2 -MT mom_mach.o -MD -MP -MF ".deps/mom_mach.Tpo" -c -o mom_mach.o mom_mach.c; \
>> then mv -f ".deps/mom_mach.Tpo" ".deps/mom_mach.Po"; else rm -f ".deps/mom_mach.Tpo"; exit 1; fi
>> mom_mach.c:130:27: error: ufs/ufs/quota.h: No such file or directory
>> mom_mach.c: In function 'quota':
>> mom_mach.c:3002: error: storage size of 'qi' isn't known
>> mom_mach.c:3166: error: 'Q_GETQUOTA' undeclared (first use in this function)
>> mom_mach.c:3166: error: (Each undeclared identifier is reported only once
>> mom_mach.c:3166: error: for each function it appears in.)
>> make[3]: *** [mom_mach.o] Error 1
>> make[2]: *** [all-recursive] Error 1
>> make[1]: *** [all-recursive] Error 1
>> make: *** [all-recursive] Error 1
>>
>> I also tried the Intel icc instead of gcc in all cases but the errors do not seem to be compiler dependent.
>> With version 3.0.3, perhaps the "clock_gettime(CLOCK_REALTIME, &ts);" is not available in Mac OS X. but how can I get around this?
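Editorial sketch, not part of the original messages: one possible way around a missing clock_gettime, along the lines of the gettimeofday workaround mentioned in this thread, is a small wrapper that calls clock_gettime where CLOCK_REALTIME is defined and falls back to gettimeofday otherwise. The helper name wall_clock_now is hypothetical and is not a TORQUE function, and this has not been checked against the TORQUE sources or the bugzilla patch.

```c
/* Sketch only: wall_clock_now() is a hypothetical helper, not TORQUE code. */
#include <time.h>       /* struct timespec, clock_gettime, CLOCK_REALTIME */
#include <sys/time.h>   /* gettimeofday, struct timeval */

/* Fill *ts with the current wall-clock time.
 * Returns 0 on success and -1 on failure. */
int wall_clock_now(struct timespec *ts)
{
#if defined(CLOCK_REALTIME)
    /* POSIX clocks are available (older glibc may also need -lrt to link). */
    return clock_gettime(CLOCK_REALTIME, ts);
#else
    /* Fallback for systems without clock_gettime, such as OS X 10.7. */
    struct timeval tv;

    if (gettimeofday(&tv, NULL) != 0)
        return -1;

    ts->tv_sec  = tv.tv_sec;
    ts->tv_nsec = (long)tv.tv_usec * 1000L;   /* microseconds to nanoseconds */
    return 0;
#endif
}
```

A caller would replace a direct clock_gettime(CLOCK_REALTIME, &ts) call with wall_clock_now(&ts). Both branches return wall-clock time, which can jump if the system clock is adjusted, so code that needs a steady interval timer may still want a different source, such as the mach_absolute_time approach suggested elsewhere in this thread.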
Any help with any of the above versions of TORQUE would be appreciated. >> >> Regards. >> >> Zahid >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From jonb at lanl.gov Tue Jan 17 09:25:03 2012 From: jonb at lanl.gov (Jon Bringhurst) Date: Tue, 17 Jan 2012 09:25:03 -0700 Subject: [torqueusers] TORQUE-3.0.3 on Mac OS X 10.7.2 In-Reply-To: References: Message-ID: Oh, I forgot to address your original issue. -lrt is just the realtime lib. So, if you've found a way to safely use gettimeofday, there's (probably) no need for it in the Makefile.am. -Jon On Jan 17, 2012, at 9:20 AM, Jon Bringhurst wrote: > I haven't looked at this particular section of code, so your patch may work just fine. However, I'd just like to point out that gettimeofday is not real-time safe. Assuming something depends on this property, using gettimeofday may cause a (really difficult to debug) race condition. > > -Jon > > On Jan 16, 2012, at 9:33 AM, Rashid, Z. (Zahid) wrote: > >> Instead of using the mach_absolute_time(.....) suggested by Jon, I used gettimeofday(.....) as described on the page; >> >> http://www.clusterresources.com/bugzilla/attachment.cgi?id=95&action=diff >> >> which obviously works. 
>> With another minor change (i.e., changing mom_mach.c:130 #include to #include because the earlier does not work) I get another error during >> >> gcc -g -O2 -o .libs/pbs_mom catch_child.o mom_comm.o mom_inter.o mom_main.o mom_server.o prolog.o requests.o start_exec.o checkpoint.o tmsock_recov.o req_quejob.o job_func.o attr_recov.o dis_read.o job_attr_def.o job_recov.o process_request.o reply_send.o resc_def_all.o job_qs_upgrade.o darwin/libmommach.a ../lib/Libattr/libattr.a ../lib/Libsite/libsite.a ../lib/Libutils/libutils.a ../lib/Libpbs/.libs/libtorque.dylib -lpthread -lrt >> ld: library not found for -lrt >> collect2: ld returned 1 exit status >> make[3]: *** [pbs_mom] Error 1 >> make[2]: *** [all-recursive] Error 1 >> make[1]: *** [all-recursive] Error 1 >> make: *** [all-recursive] Error 1 >> >> I tried Google to get some clue but did not get anywhere yet. Any help/suggestion? >> >> Regards. >> >> Zahid >> >> ________________________________________ >> From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Jon Bringhurst [jonb at lanl.gov] >> Sent: 13 January 2012 23:02 >> To: Torque Users Mailing List >> Subject: Re: [torqueusers] TORQUE on Mac OS X 10.7.2 >> >> Instead of clock_gettime(...) you need to use mach_absolute_time(...). >> >> http://developer.apple.com/library/mac/#qa/qa1398/_index.html >> >> Example patch that was used for mysqld: http://lists.mysql.com/commits/70966 >> >> -Jon >> >> On Jan 13, 2012, at 2:43 PM, Rashid, Z. 
(Zahid) wrote: >> >>> Dear All, >>> >>> I am trying to compile TORQUE on Mac Book Pro (Intel Core i7, NCores = 4) with OS X 10.7.2, Xcode 4.2, and gcc 4.2.1 [Target: i686-apple-darwin11 >>> Configured with: /private/var/tmp/llvmgcc42/llvmgcc42-2336.1~1/src/configure --disable-checking --enable-werror --prefix=/Developer/usr/llvm-gcc-4.2 --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-prefix=llvm- --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin11 --enable-llvm=/private/var/tmp/llvmgcc42/llvmgcc42-2336.1~1/dst-llvmCore/Developer/usr/local --program-prefix=i686-apple-darwin11- --host=x86_64-apple-darwin11 --target=i686-apple-darwin11 --with-gxx-include-dir=/usr/include/c++/4.2.1 >>> Thread model: posix >>> gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.1.00)] installed. >>> >>> I want to compile TORQUE on this machine. >>> >>> Torque-3.0.3 configured with >>> >>> only "configure" gives the following error when I do "make" >>> >>> gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -I../../../src/lib/Libdis -DIFF_PATH=\"/usr/local/torque/sbin/pbs_iff\" -DPBS_DEFAULT_FILE=\"/var/spool/torque/server_name\" -DPBS_SERVER_HOME=\"/var/spool/torque\" -g -O2 -MT pbsD_connect.lo -MD -MP -MF .deps/pbsD_connect.Tpo -c ../Libifl/pbsD_connect.c -fno-common -DPIC -o .libs/pbsD_connect.o >>> ../Libifl/pbsD_connect.c: In function ?send_unix_creds?: >>> ../Libifl/pbsD_connect.c:688: error: ?struct ucred? has no member named ?cr_uid? >>> ../Libifl/pbsD_connect.c:689: error: ?struct ucred? has no member named ?cr_groups? 
>>> make[3]: *** [pbsD_connect.lo] Error 1 >>> make[2]: *** [all-recursive] Error 1 >>> make[1]: *** [all-recursive] Error 1 >>> make: *** [all-recursive] Error 1 >>> >>> With >>> >>> configure --disable-unixsockets --disable-gcc-warnings >>> >>> or >>> >>> configure --with-default-server=name --with-server-home=/var/spool/pbs --with-rcp=scp --disable-unixsockets --disable-gcc-warnings >>> >>> or >>> >>> configure --with-default-server=name --with-server-home=/var/spool/pbs --with-rcp=scp --host=x86_64-apple-darwin11 --build=x86_64-apple-darwin11 --target=x86_64-apple-darwin11 --disable-unixsockets --disable-gcc-warnings >>> >>> always gives the following error when I run "make" command. >>> >>> if gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -g -O2 -MT u_dynamic_string.o -MD -MP -MF ".deps/u_dynamic_string.Tpo" -c -o u_dynamic_string.o u_dynamic_string.c; \ >>> then mv -f ".deps/u_dynamic_string.Tpo" ".deps/u_dynamic_string.Po"; else rm -f ".deps/u_dynamic_string.Tpo"; exit 1; fi >>> u_threadpool.c: In function ?work_thread?: >>> u_threadpool.c:246: error: ?CLOCK_REALTIME? undeclared (first use in this function) >>> u_threadpool.c:246: error: (Each undeclared identifier is reported only once >>> u_threadpool.c:246: error: for each function it appears in.) >>> make[3]: *** [u_threadpool.o] Error 1 >>> make[3]: *** Waiting for unfinished jobs.... >>> make[2]: *** [all-recursive] Error 1 >>> make[1]: *** [all-recursive] Error 1 >>> make: *** [all-recursive] Error 1 >>> >>> In "configure" it does not show any error or warning message. >>> >>> I also tried the older versions of TORQUE >>> TORQUE-3.0.0 version gives error >>> >>> if gcc -DHAVE_CONFIG_H -I. -I. 
-I../../src/include -I../../src/include -DPBS_SERVER_HOME=\"/var/spool/pbs\" -DPBS_ENVIRON=\"/var/spool/pbs/pbs_environment\" -g -O2 -MT req_runjob.o -MD -MP -MF ".deps/req_runjob.Tpo" -c -o req_runjob.o req_runjob.c; \ >>> then mv -f ".deps/req_runjob.Tpo" ".deps/req_runjob.Po"; else rm -f ".deps/req_runjob.Tpo"; exit 1; fi >>> req_runjob.c: In function ?post_sendmom?: >>> req_runjob.c:1135: error: ?ulong? undeclared (first use in this function) >>> req_runjob.c:1135: error: (Each undeclared identifier is reported only once >>> req_runjob.c:1135: error: for each function it appears in.) >>> req_runjob.c:1135: error: expected ?;? before ?addr? >>> req_runjob.c:1266: error: ?addr? undeclared (first use in this function) >>> make[2]: *** [req_runjob.o] Error 1 >>> make[1]: *** [all-recursive] Error 1 >>> make: *** [all-recursive] Error 1 >>> >>> while TORQUE-2.5.0 or earlier versions give >>> >>> if gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -DPBS_MOM -DDEMUX=\"/usr/local/torque/sbin/pbs_demux\" -g -O2 -MT mom_mach.o -MD -MP -MF ".deps/mom_mach.Tpo" -c -o mom_mach.o mom_mach.c; \ >>> then mv -f ".deps/mom_mach.Tpo" ".deps/mom_mach.Po"; else rm -f ".deps/mom_mach.Tpo"; exit 1; fi >>> mom_mach.c:130:27: error: ufs/ufs/quota.h: No such file or directory >>> mom_mach.c: In function ?quota?: >>> mom_mach.c:3002: error: storage size of ?qi? isn?t known >>> mom_mach.c:3166: error: ?Q_GETQUOTA? undeclared (first use in this function) >>> mom_mach.c:3166: error: (Each undeclared identifier is reported only once >>> mom_mach.c:3166: error: for each function it appears in.) >>> make[3]: *** [mom_mach.o] Error 1 >>> make[2]: *** [all-recursive] Error 1 >>> make[1]: *** [all-recursive] Error 1 >>> make: *** [all-recursive] Error 1 >>> >>> I also tried the Intel icc instead of gcc in all cases but it does not seem compiler dependent errors. >>> With version 3.0.3, perhaps the "clock_gettime(CLOCK_REALTIME, &ts);" is not available in Max OS X. 
but how can I get around this? Any help with any of the above versions of TORQUE would be appreciated. >>> >>> Regards. >>> >>> Zahid >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From jonb at lanl.gov Tue Jan 17 09:33:13 2012 From: jonb at lanl.gov (Jon Bringhurst) Date: Tue, 17 Jan 2012 09:33:13 -0700 Subject: [torqueusers] TORQUE-3.0.3 on Mac OS X 10.7.2 In-Reply-To: References: Message-ID: One more time. It looks like -lrt is set through the configure.ac. So, just to see if things are working, you can remove it from there and rerun 'autogen.sh && ./configure && make all' For a more permanent patch, it might be a good idea to merge the check into torque/buildutils/acx_pthread.m4 and remove the (redundant?) check from configure.ac. -Jon On Jan 17, 2012, at 9:25 AM, Jon Bringhurst wrote: > Oh, I forgot to address your original issue. -lrt is just the realtime lib. So, if you've found a way to safely use gettimeofday, there's (probably) no need for it in the Makefile.am. > > -Jon > > On Jan 17, 2012, at 9:20 AM, Jon Bringhurst wrote: > >> I haven't looked at this particular section of code, so your patch may work just fine. However, I'd just like to point out that gettimeofday is not real-time safe. Assuming something depends on this property, using gettimeofday may cause a (really difficult to debug) race condition. 
>> >> -Jon >> >> On Jan 16, 2012, at 9:33 AM, Rashid, Z. (Zahid) wrote: >> >>> Instead of using the mach_absolute_time(.....) suggested by Jon, I used gettimeofday(.....) as described on the page; >>> >>> http://www.clusterresources.com/bugzilla/attachment.cgi?id=95&action=diff >>> >>> which obviously works. >>> With another minor change (i.e., changing mom_mach.c:130 #include to #include because the earlier does not work) I get another error during >>> >>> gcc -g -O2 -o .libs/pbs_mom catch_child.o mom_comm.o mom_inter.o mom_main.o mom_server.o prolog.o requests.o start_exec.o checkpoint.o tmsock_recov.o req_quejob.o job_func.o attr_recov.o dis_read.o job_attr_def.o job_recov.o process_request.o reply_send.o resc_def_all.o job_qs_upgrade.o darwin/libmommach.a ../lib/Libattr/libattr.a ../lib/Libsite/libsite.a ../lib/Libutils/libutils.a ../lib/Libpbs/.libs/libtorque.dylib -lpthread -lrt >>> ld: library not found for -lrt >>> collect2: ld returned 1 exit status >>> make[3]: *** [pbs_mom] Error 1 >>> make[2]: *** [all-recursive] Error 1 >>> make[1]: *** [all-recursive] Error 1 >>> make: *** [all-recursive] Error 1 >>> >>> I tried Google to get some clue but did not get anywhere yet. Any help/suggestion? >>> >>> Regards. >>> >>> Zahid >>> >>> ________________________________________ >>> From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Jon Bringhurst [jonb at lanl.gov] >>> Sent: 13 January 2012 23:02 >>> To: Torque Users Mailing List >>> Subject: Re: [torqueusers] TORQUE on Mac OS X 10.7.2 >>> >>> Instead of clock_gettime(...) you need to use mach_absolute_time(...). >>> >>> http://developer.apple.com/library/mac/#qa/qa1398/_index.html >>> >>> Example patch that was used for mysqld: http://lists.mysql.com/commits/70966 >>> >>> -Jon >>> >>> On Jan 13, 2012, at 2:43 PM, Rashid, Z. 
(Zahid) wrote: >>> >>>> Dear All, >>>> >>>> I am trying to compile TORQUE on Mac Book Pro (Intel Core i7, NCores = 4) with OS X 10.7.2, Xcode 4.2, and gcc 4.2.1 [Target: i686-apple-darwin11 >>>> Configured with: /private/var/tmp/llvmgcc42/llvmgcc42-2336.1~1/src/configure --disable-checking --enable-werror --prefix=/Developer/usr/llvm-gcc-4.2 --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-prefix=llvm- --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin11 --enable-llvm=/private/var/tmp/llvmgcc42/llvmgcc42-2336.1~1/dst-llvmCore/Developer/usr/local --program-prefix=i686-apple-darwin11- --host=x86_64-apple-darwin11 --target=i686-apple-darwin11 --with-gxx-include-dir=/usr/include/c++/4.2.1 >>>> Thread model: posix >>>> gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.1.00)] installed. >>>> >>>> I want to compile TORQUE on this machine. >>>> >>>> Torque-3.0.3 configured with >>>> >>>> only "configure" gives the following error when I do "make" >>>> >>>> gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -I../../../src/lib/Libdis -DIFF_PATH=\"/usr/local/torque/sbin/pbs_iff\" -DPBS_DEFAULT_FILE=\"/var/spool/torque/server_name\" -DPBS_SERVER_HOME=\"/var/spool/torque\" -g -O2 -MT pbsD_connect.lo -MD -MP -MF .deps/pbsD_connect.Tpo -c ../Libifl/pbsD_connect.c -fno-common -DPIC -o .libs/pbsD_connect.o >>>> ../Libifl/pbsD_connect.c: In function ?send_unix_creds?: >>>> ../Libifl/pbsD_connect.c:688: error: ?struct ucred? has no member named ?cr_uid? >>>> ../Libifl/pbsD_connect.c:689: error: ?struct ucred? has no member named ?cr_groups? 
>>>> make[3]: *** [pbsD_connect.lo] Error 1 >>>> make[2]: *** [all-recursive] Error 1 >>>> make[1]: *** [all-recursive] Error 1 >>>> make: *** [all-recursive] Error 1 >>>> >>>> With >>>> >>>> configure --disable-unixsockets --disable-gcc-warnings >>>> >>>> or >>>> >>>> configure --with-default-server=name --with-server-home=/var/spool/pbs --with-rcp=scp --disable-unixsockets --disable-gcc-warnings >>>> >>>> or >>>> >>>> configure --with-default-server=name --with-server-home=/var/spool/pbs --with-rcp=scp --host=x86_64-apple-darwin11 --build=x86_64-apple-darwin11 --target=x86_64-apple-darwin11 --disable-unixsockets --disable-gcc-warnings >>>> >>>> always gives the following error when I run "make" command. >>>> >>>> if gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -g -O2 -MT u_dynamic_string.o -MD -MP -MF ".deps/u_dynamic_string.Tpo" -c -o u_dynamic_string.o u_dynamic_string.c; \ >>>> then mv -f ".deps/u_dynamic_string.Tpo" ".deps/u_dynamic_string.Po"; else rm -f ".deps/u_dynamic_string.Tpo"; exit 1; fi >>>> u_threadpool.c: In function ?work_thread?: >>>> u_threadpool.c:246: error: ?CLOCK_REALTIME? undeclared (first use in this function) >>>> u_threadpool.c:246: error: (Each undeclared identifier is reported only once >>>> u_threadpool.c:246: error: for each function it appears in.) >>>> make[3]: *** [u_threadpool.o] Error 1 >>>> make[3]: *** Waiting for unfinished jobs.... >>>> make[2]: *** [all-recursive] Error 1 >>>> make[1]: *** [all-recursive] Error 1 >>>> make: *** [all-recursive] Error 1 >>>> >>>> In "configure" it does not show any error or warning message. >>>> >>>> I also tried the older versions of TORQUE >>>> TORQUE-3.0.0 version gives error >>>> >>>> if gcc -DHAVE_CONFIG_H -I. -I. 
-I../../src/include -I../../src/include -DPBS_SERVER_HOME=\"/var/spool/pbs\" -DPBS_ENVIRON=\"/var/spool/pbs/pbs_environment\" -g -O2 -MT req_runjob.o -MD -MP -MF ".deps/req_runjob.Tpo" -c -o req_runjob.o req_runjob.c; \ >>>> then mv -f ".deps/req_runjob.Tpo" ".deps/req_runjob.Po"; else rm -f ".deps/req_runjob.Tpo"; exit 1; fi >>>> req_runjob.c: In function ?post_sendmom?: >>>> req_runjob.c:1135: error: ?ulong? undeclared (first use in this function) >>>> req_runjob.c:1135: error: (Each undeclared identifier is reported only once >>>> req_runjob.c:1135: error: for each function it appears in.) >>>> req_runjob.c:1135: error: expected ?;? before ?addr? >>>> req_runjob.c:1266: error: ?addr? undeclared (first use in this function) >>>> make[2]: *** [req_runjob.o] Error 1 >>>> make[1]: *** [all-recursive] Error 1 >>>> make: *** [all-recursive] Error 1 >>>> >>>> while TORQUE-2.5.0 or earlier versions give >>>> >>>> if gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -DPBS_MOM -DDEMUX=\"/usr/local/torque/sbin/pbs_demux\" -g -O2 -MT mom_mach.o -MD -MP -MF ".deps/mom_mach.Tpo" -c -o mom_mach.o mom_mach.c; \ >>>> then mv -f ".deps/mom_mach.Tpo" ".deps/mom_mach.Po"; else rm -f ".deps/mom_mach.Tpo"; exit 1; fi >>>> mom_mach.c:130:27: error: ufs/ufs/quota.h: No such file or directory >>>> mom_mach.c: In function ?quota?: >>>> mom_mach.c:3002: error: storage size of ?qi? isn?t known >>>> mom_mach.c:3166: error: ?Q_GETQUOTA? undeclared (first use in this function) >>>> mom_mach.c:3166: error: (Each undeclared identifier is reported only once >>>> mom_mach.c:3166: error: for each function it appears in.) >>>> make[3]: *** [mom_mach.o] Error 1 >>>> make[2]: *** [all-recursive] Error 1 >>>> make[1]: *** [all-recursive] Error 1 >>>> make: *** [all-recursive] Error 1 >>>> >>>> I also tried the Intel icc instead of gcc in all cases but it does not seem compiler dependent errors. 
>>>> With version 3.0.3, perhaps the "clock_gettime(CLOCK_REALTIME, &ts);" is not available in Max OS X. but how can I get around this? Any help with any of the above versions of TORQUE would be appreciated. >>>> >>>> Regards. >>>> >>>> Zahid >>>> _______________________________________________ >>>> torqueusers mailing list >>>> torqueusers at supercluster.org >>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From glen.beane at gmail.com Tue Jan 17 13:06:04 2012 From: glen.beane at gmail.com (Glen Beane) Date: Tue, 17 Jan 2012 15:06:04 -0500 Subject: [torqueusers] TORQUE-3.0.3 on Mac OS X 10.7.2 In-Reply-To: References: Message-ID: There is a patch that someone posted in bugzilla that fixes this these issues (detects clock_gettime and fails back to gettimeofday, and only links in librt if it is needed). Unfortunately none of the developers have accepted the patch yet... On Tue, Jan 17, 2012 at 11:33 AM, Jon Bringhurst wrote: > One more time. It looks like -lrt is set through the configure.ac. So, just to see if things are working, you can remove it from there and rerun 'autogen.sh && ./configure && make all' > > For a more permanent patch, it might be a good idea to merge the check into torque/buildutils/acx_pthread.m4 and remove the (redundant?) 
check from configure.ac. > > -Jon > > On Jan 17, 2012, at 9:25 AM, Jon Bringhurst wrote: > >> Oh, I forgot to address your original issue. -lrt is just the realtime lib. So, if you've found a way to safely use gettimeofday, there's (probably) no need for it in the Makefile.am. >> >> -Jon >> >> On Jan 17, 2012, at 9:20 AM, Jon Bringhurst wrote: >> >>> I haven't looked at this particular section of code, so your patch may work just fine. However, I'd just like to point out that gettimeofday is not real-time safe. Assuming something depends on this property, using gettimeofday may cause a (really difficult to debug) race condition. >>> >>> -Jon >>> >>> On Jan 16, 2012, at 9:33 AM, Rashid, Z. (Zahid) wrote: >>> >>>> Instead of using the mach_absolute_time(.....) suggested by Jon, I used gettimeofday(.....) as described on the page; >>>> >>>> http://www.clusterresources.com/bugzilla/attachment.cgi?id=95&action=diff >>>> >>>> which obviously works. >>>> With another minor change (i.e., changing mom_mach.c:130 #include to #include because the earlier does not work) I get another error during >>>> >>>> gcc -g -O2 -o .libs/pbs_mom catch_child.o mom_comm.o mom_inter.o mom_main.o mom_server.o prolog.o requests.o start_exec.o checkpoint.o tmsock_recov.o req_quejob.o job_func.o attr_recov.o dis_read.o job_attr_def.o job_recov.o process_request.o reply_send.o resc_def_all.o job_qs_upgrade.o ?darwin/libmommach.a ../lib/Libattr/libattr.a ../lib/Libsite/libsite.a ../lib/Libutils/libutils.a ../lib/Libpbs/.libs/libtorque.dylib -lpthread -lrt >>>> ld: library not found for -lrt >>>> collect2: ld returned 1 exit status >>>> make[3]: *** [pbs_mom] Error 1 >>>> make[2]: *** [all-recursive] Error 1 >>>> make[1]: *** [all-recursive] Error 1 >>>> make: *** [all-recursive] Error 1 >>>> >>>> I tried Google to get some clue but did not get anywhere yet. Any help/suggestion? >>>> >>>> Regards. 
>>>> >>>> Zahid >>>> >>>> ________________________________________ >>>> From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Jon Bringhurst [jonb at lanl.gov] >>>> Sent: 13 January 2012 23:02 >>>> To: Torque Users Mailing List >>>> Subject: Re: [torqueusers] TORQUE on Mac OS X 10.7.2 >>>> >>>> Instead of clock_gettime(...) you need to use mach_absolute_time(...). >>>> >>>> http://developer.apple.com/library/mac/#qa/qa1398/_index.html >>>> >>>> Example patch that was used for mysqld: http://lists.mysql.com/commits/70966 >>>> >>>> -Jon >>>> >>>> On Jan 13, 2012, at 2:43 PM, Rashid, Z. (Zahid) wrote: >>>> >>>>> Dear All, >>>>> >>>>> I am trying to compile TORQUE on Mac Book Pro (Intel Core i7, NCores = 4) with OS X 10.7.2, Xcode 4.2, and gcc 4.2.1 [Target: i686-apple-darwin11 >>>>> Configured with: /private/var/tmp/llvmgcc42/llvmgcc42-2336.1~1/src/configure --disable-checking --enable-werror --prefix=/Developer/usr/llvm-gcc-4.2 --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-prefix=llvm- --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin11 --enable-llvm=/private/var/tmp/llvmgcc42/llvmgcc42-2336.1~1/dst-llvmCore/Developer/usr/local --program-prefix=i686-apple-darwin11- --host=x86_64-apple-darwin11 --target=i686-apple-darwin11 --with-gxx-include-dir=/usr/include/c++/4.2.1 >>>>> Thread model: posix >>>>> gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.1.00)] installed. >>>>> >>>>> I want to compile TORQUE on this machine. >>>>> >>>>> Torque-3.0.3 configured with >>>>> >>>>> only "configure" gives the following error when I do "make" >>>>> >>>>> gcc -DHAVE_CONFIG_H -I. -I. 
-I../../../src/include -I../../../src/include -I../../../src/lib/Libdis -DIFF_PATH=\"/usr/local/torque/sbin/pbs_iff\" -DPBS_DEFAULT_FILE=\"/var/spool/torque/server_name\" -DPBS_SERVER_HOME=\"/var/spool/torque\" -g -O2 -MT pbsD_connect.lo -MD -MP -MF .deps/pbsD_connect.Tpo -c ../Libifl/pbsD_connect.c -fno-common -DPIC -o .libs/pbsD_connect.o
>>>>> ../Libifl/pbsD_connect.c: In function ‘send_unix_creds’:
>>>>> ../Libifl/pbsD_connect.c:688: error: ‘struct ucred’ has no member named ‘cr_uid’
>>>>> ../Libifl/pbsD_connect.c:689: error: ‘struct ucred’ has no member named ‘cr_groups’
>>>>> make[3]: *** [pbsD_connect.lo] Error 1
>>>>> make[2]: *** [all-recursive] Error 1
>>>>> make[1]: *** [all-recursive] Error 1
>>>>> make: *** [all-recursive] Error 1
>>>>>
>>>>> With
>>>>>
>>>>> configure --disable-unixsockets --disable-gcc-warnings
>>>>>
>>>>> or
>>>>>
>>>>> configure --with-default-server=name --with-server-home=/var/spool/pbs --with-rcp=scp --disable-unixsockets --disable-gcc-warnings
>>>>>
>>>>> or
>>>>>
>>>>> configure --with-default-server=name --with-server-home=/var/spool/pbs --with-rcp=scp --host=x86_64-apple-darwin11 --build=x86_64-apple-darwin11 --target=x86_64-apple-darwin11 --disable-unixsockets --disable-gcc-warnings
>>>>>
>>>>> always gives the following error when I run "make" command.
>>>>>
>>>>> if gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -g -O2 -MT u_dynamic_string.o -MD -MP -MF ".deps/u_dynamic_string.Tpo" -c -o u_dynamic_string.o u_dynamic_string.c; \
>>>>> then mv -f ".deps/u_dynamic_string.Tpo" ".deps/u_dynamic_string.Po"; else rm -f ".deps/u_dynamic_string.Tpo"; exit 1; fi
>>>>> u_threadpool.c: In function ‘work_thread’:
>>>>> u_threadpool.c:246: error: ‘CLOCK_REALTIME’ undeclared (first use in this function)
>>>>> u_threadpool.c:246: error: (Each undeclared identifier is reported only once
>>>>> u_threadpool.c:246: error: for each function it appears in.)
>>>>> make[3]: *** [u_threadpool.o] Error 1
>>>>> make[3]: *** Waiting for unfinished jobs....
>>>>> make[2]: *** [all-recursive] Error 1
>>>>> make[1]: *** [all-recursive] Error 1
>>>>> make: *** [all-recursive] Error 1
>>>>>
>>>>> In "configure" it does not show any error or warning message.
>>>>>
>>>>> I also tried the older versions of TORQUE
>>>>> TORQUE-3.0.0 version gives error
>>>>>
>>>>> if gcc -DHAVE_CONFIG_H -I. -I. -I../../src/include -I../../src/include -DPBS_SERVER_HOME=\"/var/spool/pbs\" -DPBS_ENVIRON=\"/var/spool/pbs/pbs_environment\" -g -O2 -MT req_runjob.o -MD -MP -MF ".deps/req_runjob.Tpo" -c -o req_runjob.o req_runjob.c; \
>>>>> then mv -f ".deps/req_runjob.Tpo" ".deps/req_runjob.Po"; else rm -f ".deps/req_runjob.Tpo"; exit 1; fi
>>>>> req_runjob.c: In function ‘post_sendmom’:
>>>>> req_runjob.c:1135: error: ‘ulong’ undeclared (first use in this function)
>>>>> req_runjob.c:1135: error: (Each undeclared identifier is reported only once
>>>>> req_runjob.c:1135: error: for each function it appears in.)
>>>>> req_runjob.c:1135: error: expected ‘;’ before ‘addr’
>>>>> req_runjob.c:1266: error: ‘addr’ undeclared (first use in this function)
>>>>> make[2]: *** [req_runjob.o] Error 1
>>>>> make[1]: *** [all-recursive] Error 1
>>>>> make: *** [all-recursive] Error 1
>>>>>
>>>>> while TORQUE-2.5.0 or earlier versions give
>>>>>
>>>>> if gcc -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -DPBS_MOM -DDEMUX=\"/usr/local/torque/sbin/pbs_demux\" -g -O2 -MT mom_mach.o -MD -MP -MF ".deps/mom_mach.Tpo" -c -o mom_mach.o mom_mach.c; \
>>>>> then mv -f ".deps/mom_mach.Tpo" ".deps/mom_mach.Po"; else rm -f ".deps/mom_mach.Tpo"; exit 1; fi
>>>>> mom_mach.c:130:27: error: ufs/ufs/quota.h: No such file or directory
>>>>> mom_mach.c: In function ‘quota’:
>>>>> mom_mach.c:3002: error: storage size of ‘qi’ isn’t known
>>>>> mom_mach.c:3166: error: ‘Q_GETQUOTA’ undeclared (first use in this function)
>>>>> mom_mach.c:3166: error: (Each undeclared identifier is reported only once
>>>>> mom_mach.c:3166: error: for each function it appears in.)
>>>>> make[3]: *** [mom_mach.o] Error 1
>>>>> make[2]: *** [all-recursive] Error 1
>>>>> make[1]: *** [all-recursive] Error 1
>>>>> make: *** [all-recursive] Error 1
>>>>>
>>>>> I also tried the Intel icc instead of gcc in all cases but it does not seem compiler dependent errors.
>>>>> With version 3.0.3, perhaps the "clock_gettime(CLOCK_REALTIME, &ts);" is not available in Mac OS X, but how can I get around this? Any help with any of the above versions of TORQUE would be appreciated.
>>>>>
>>>>> Regards.
>>>>>
>>>>> Zahid
>>>>> _______________________________________________
>>>>> torqueusers mailing list
>>>>> torqueusers at supercluster.org
>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers

From lloyd_brown at byu.edu Wed Jan 18 15:39:11 2012
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Wed, 18 Jan 2012 15:39:11 -0700
Subject: [torqueusers] Setting up checkpointing
Message-ID:
<4F174A0F.4080308@byu.edu>

Can anyone enlighten me on the current state of BLCR-style checkpointing in Torque? I've been trying to get it to work, and so far, I see that it's invoking my checkpoint script, that script calls cr_checkpoint, and the checkpoint files/directories are created, but something is calling the mom_checkpoint_delete_files function, which in turn calls delete_blcr_files, and the checkpoints get deleted.

Also, when I do a "qhold" on my job to try to initiate the checkpoint, is it really supposed to terminate my job? Perhaps that's related, e.g. the job is ending so the files get cleaned up.

Basically, does anyone have it working, and can give me advice?

Thanks,

--
Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

From daniel.burbano at gmail.com Wed Jan 18 16:08:02 2012
From: daniel.burbano at gmail.com (Daniel Burbano)
Date: Wed, 18 Jan 2012 18:08:02 -0500
Subject: [torqueusers] communication problems between pbs_mom and pbs_server
Message-ID:

Hello,

My name is Daniel Burbano. I am installing a cluster with PBS. I have communication problems when pbs_mom tries to find pbs_server (timeout).

These are the logs:

01/17/2012 17:59:39;0002; pbs_mom;Svr;Log;Log opened
01/17/2012 17:59:39;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.3 , loglevel = 0
01/17/2012 17:59:39;0002; pbs_mom;Svr;setpbsserver;bio01
01/17/2012 17:59:39;0002; pbs_mom;Svr;mom_server_add;server bio01 added
01/17/2012 17:59:39;0002; pbs_mom;n/a;initialize;independent
01/17/2012 17:59:39;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
01/17/2012 17:59:39;0002; pbs_mom;Svr;pbs_mom;Is up
01/17/2012 17:59:39;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/torque/sbin/pbs_mom 1325906867
01/17/2012 17:59:39;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.3 , loglevel = 0
01/17/2012 17:59:39;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server bio01
01/17/2012 18:01:09;0002; pbs_mom;n/a;mom_server_check_connection;connection to server bio01 timeout

[root at bio03 mom_logs]# tail -100 20120117
01/17/2012 18:10:10;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server bio01
01/17/2012 18:11:40;0002; pbs_mom;n/a;mom_server_check_connection;connection to server bio01 timeout
01/17/2012 18:11:40;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server bio01
01/17/2012 18:13:10;0002; pbs_mom;n/a;mom_server_check_connection;connection to server bio01 timeout

On the other hand, I don't have problems when pbs_mom and pbs_server are located on the same machine.

The firewall is disabled on the machines.
The machines are on the same network.
SELinux is disabled on the machines.
Passwordless ssh is configured correctly.
The servers are virtual machines created in AWS.

Any idea?
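
A timeout like the one in these logs can be narrowed down by testing, from the compute node, whether the server's port can be reached at all. The following is only a rough sketch under stated assumptions, not something taken from the post: it assumes pbs_server listens on TORQUE's default port 15001 (a site may have configured another), `can_connect` is a hypothetical helper name, and `bio01` is simply the server name that appears in the logs. It tests TCP only, so a success does not rule out blocked UDP traffic or restrictive AWS security-group rules on other ports.

```python
import socket

# Assumption: TORQUE's default pbs_server port; adjust if your site differs.
PBS_SERVER_PORT = 15001


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Run on the node where pbs_mom times out, `can_connect("bio01", PBS_SERVER_PORT)` returning False would point at name resolution, routing or filtering between the two hosts rather than at TORQUE itself.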
Thanks -- Daniel Burbano, MCpE From samuel at unimelb.edu.au Thu Jan 19 19:44:57 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 20 Jan 2012 13:44:57 +1100 Subject: [torqueusers] limiting resource usage with torque In-Reply-To: <20111215122848.6eab11c0@amarrosa.pic.es> References: <20111215122848.6eab11c0@amarrosa.pic.es> Message-ID: <4F18D529.5050208@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 15/12/11 22:28, Arnau Bria wrote: > After some debugging we found the source. MAUI was reserving 6gb of mem > for each job. so, 4 jobs*6gb of mem = 24gb. All the mem was reserved > for those 4 jobs and the node is not selected for running more. I'm a bit puzzled as to why you think this is a problem - your jobs were requesting 6GB vmem (swap) each and your node has 24GB swap and you didn't set a queue limit on pvmem to stop jobs requesting 6gb pvmem from being requested. cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk8Y1SgACgkQO2KABBYQAh9llgCfcIvSL9rUPAutg389Uot0wyoG fL0AnRSXq8JscYvG5shbgA4QQk5UWZDP =AtbM -----END PGP SIGNATURE----- From samuel at unimelb.edu.au Thu Jan 19 19:47:03 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 20 Jan 2012 13:47:03 +1100 Subject: [torqueusers] Problem with password free scp on small cluster In-Reply-To: References: Message-ID: <4F18D5A7.1050900@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 13/12/11 11:16, Jakob Blomqvist wrote: > The thing is: I would like to have torque simply be able to use all > three machines in mpirun calculations but as I try to set it up I have > serious problems with error messages indicating e.g. 
scp access problem > (caused by the ssh configuration I'm sure). If your compute nodes all mount the same home directory you can completely eliminate the need for scp to copy files back by adding: $usecp *:/home /home to tell the pbs_mom to just use 'cp' instead of 'scp' to copy files back to users home directories. Hope this helps! Chris (catching up on some email) - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk8Y1acACgkQO2KABBYQAh8LdQCeOOO0PDd/A1ECmQ7fba6SuKeC HegAnAlPmuwRHYTbtyOIkk8zySRlJ09h =Zm7R -----END PGP SIGNATURE----- From samuel at unimelb.edu.au Thu Jan 19 19:48:31 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 20 Jan 2012 13:48:31 +1100 Subject: [torqueusers] Jobs getting "stuck" In-Reply-To: References: Message-ID: <4F18D5FF.8050806@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 10/01/12 05:52, Jim Kusznir wrote: > What causes this? What can I / my users do to fix this? No idea of the cause, but we run with this set with qmgr: set server mom_job_sync = True and we don't see that issue at all. Hope this helps! 
Chris
- --
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk8Y1f8ACgkQO2KABBYQAh9h3gCcDjRzVjae099hJ6eOoK4u2WuZ
dBgAn2v7ujMT8q2296OMWCIxCzYxgibM
=4kwf
-----END PGP SIGNATURE-----

From listsarnau at gmail.com Fri Jan 20 02:28:12 2012
From: listsarnau at gmail.com (Arnau Bria)
Date: Fri, 20 Jan 2012 10:28:12 +0100
Subject: [torqueusers] limiting resource usage with torque
In-Reply-To: <4F18D529.5050208@unimelb.edu.au>
References: <20111215122848.6eab11c0@amarrosa.pic.es> <4F18D529.5050208@unimelb.edu.au>
Message-ID: <20120120102812.01ffcf2b@amarrosa.pic.es>

On Fri, 20 Jan 2012 13:44:57 +1100 Christopher Samuel wrote:

Hi,

> > After some debugging we found the source. MAUI was reserving 6gb of
> > mem for each job. so, 4 jobs*6gb of mem = 24gb. All the mem was
> > reserved for those 4 jobs and the node is not selected for running
> > more.
>
> I'm a bit puzzled as to why you think this is a problem - your jobs
> were requesting 6GB vmem (swap) each and your node has 24GB swap and
> you didn't set a queue limit on pvmem to stop jobs requesting 6gb
> pvmem from being requested.

I don't know if I have understood you or not.

My point is that limiting is not the same as reserving. So, if I want torque to limit resource usage (6gb in this case) I don't want torque to tell MAUI to reserve 6 GB for that job.

With the reservation, only 4 jobs (4*6=24) can run, but if jobs behave correctly and don't use more than 2 or 3 GB, we can run up to 8 jobs. So, I'm trying to tell torque: "hey! if the job uses more than 6gb, kill it"...

So, my problem comes from the understanding of limiting/reserving (which are very different concepts).
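
The arithmetic behind the 4-versus-8 figures can be written out as a small worked example. This is a hedged illustration only: `jobs_that_fit` is a hypothetical helper, and it assumes jobs are packed purely by dividing the node's memory by the per-job reservation, ignoring CPU slots and any other scheduler policy.

```python
def jobs_that_fit(node_mem_gb, reserved_per_job_gb):
    """Number of jobs a node can hold if memory is the only constraint."""
    if reserved_per_job_gb <= 0:
        raise ValueError("per-job reservation must be positive")
    return node_mem_gb // reserved_per_job_gb


NODE_MEM_GB = 24  # node size taken from the thread

# Reserving the 6 GB limit for every job fills the node after 4 jobs.
print(jobs_that_fit(NODE_MEM_GB, 6))  # 4

# Reserving what well-behaved jobs actually use (about 3 GB) allows 8.
print(jobs_that_fit(NODE_MEM_GB, 3))  # 8
```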
-l Defines the resources that are required by the job and establishes a limit to the amount of resource that can be consumed ** Or the problem is in my conf. > cheers, > Chris Cheers, Arnau From lindheim at cacr.caltech.edu Mon Jan 23 16:43:08 2012 From: lindheim at cacr.caltech.edu (Jan Lindheim) Date: Mon, 23 Jan 2012 15:43:08 -0800 Subject: [torqueusers] building torque-2.5.10 statically Message-ID: <20120123234306.GT1166@cacr.caltech.edu> If I configure torque-2.5.10 to build statically, it fails when trying to create libtorque.a The reason is two instances of load_config(), one in src/lib/Libifl/torquecfg.c and one in src/cmds/qsub.c The two versions of load_config() are almost identical, one uses the call strncat(home_dir, PBS_SERVER_HOME, MAXPATHLEN); while the other uses strcat(home_dir, PBS_SERVER_HOME); It looks like load_config() can be removed from src/cmds/qsub.c Regards, Jan Lindheim From samuel at unimelb.edu.au Mon Jan 23 17:58:02 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 24 Jan 2012 11:58:02 +1100 Subject: [torqueusers] limiting resource usage with torque In-Reply-To: <20120120102812.01ffcf2b@amarrosa.pic.es> References: <20111215122848.6eab11c0@amarrosa.pic.es> <4F18D529.5050208@unimelb.edu.au> <20120120102812.01ffcf2b@amarrosa.pic.es> Message-ID: <4F1E021A.2080409@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 20/01/12 20:28, Arnau Bria wrote: > I don't know if I have understood you or not. Yeah, I was wondering if we were talking at cross purposes. > My point is that limiting is not the same as reserving. My take is that the scheduler will try and reserve whatever you ask it for (and defer if it cannot), and the queuing system should enforce your request as a limit (if it can). > So, if I want torque to limit resource usage (6gb in this case) I don't > want torque to tell MAUI to reserve 6 GB for that job. If you don't want Maui to reserve 6GB for a job, then don't tell it to. 
Tell it to reserve what you think the job will need. It's then Torque's issue to enforce that reservation and prevent the job using more than that. This then triggers the usual worries about mem/pmem setting resource limitations that are not enforced by the kernel (under Linux). > With the reservation, only 4 jobs (4*6=24) can run, but, if jobs behave > correctly and they don't use more than 2 o 3 GB, we can run up to 8 > jobs. So, I'm trying to tell torque: "ei! if the job usese more than > 6gb, kill it"... If the job should never use more than 3GB then set a limit of that plus a margin of error. > So, my problem comes from the understanding of limiting/reserving > (which are very diferent concepts). > > -l > Defines the resources that are required by the job > and establishes a limit to the amount of resource that can be consumed In Torque you tell the system what the upper bound usage that is allowed for a job and that's what the scheduler has to reserve (to avoid overcommitting resources). Hope this helps! Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk8eAhoACgkQO2KABBYQAh/k1QCffAEKJsBwbDK/1i9yuqSWiCZV 6LoAn00QiB5rn7rDIa1mdctMN8nyR+q9 =zUIq -----END PGP SIGNATURE-----
From halmabrazi at idtdna.com Tue Jan 24 09:08:29 2012
From: halmabrazi at idtdna.com (Hakeem Almabrazi)
Date: Tue, 24 Jan 2012 16:08:29 +0000
Subject: [torqueusers] specify job id
Message-ID:

Hi,

Is there a way to specify the job ID when submitting a job?

Thanks

From dbeer at adaptivecomputing.com Tue Jan 24 09:47:14 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 24 Jan 2012 09:47:14 -0700
Subject: [torqueusers] specify job id
In-Reply-To:
References:
Message-ID:

If you are a root user submitting on behalf of another user (using proxy submission), this can be done. It is done using -P and -J:

qsub script.sh -P -J

David

On Tue, Jan 24, 2012 at 9:08 AM, Hakeem Almabrazi wrote:

> Hi,
>
> Is there a way to specify job id when request a job?
>
> Thanks

--
David Beer | Software Engineer
Adaptive Computing

From halmabrazi at idtdna.com Tue Jan 24 10:33:12 2012
From: halmabrazi at idtdna.com (Hakeem Almabrazi)
Date: Tue, 24 Jan 2012 17:33:12 +0000
Subject: [torqueusers] specify job id
In-Reply-To:
References:
Message-ID:

Thank you David,

I tried this and got an error that I am not a super user. So I looked up who is running the "pbs" and found that it is run under the "root".
So I submitted the following (since I am not root):

>sudo qsub script.sh -P root -J 111

There was no complaint this time, but there are no jobs in the queue when I run qstat. I resent the same command as before, but then it says "Job with requested ID 111 already exist".

I am not sure what to do next. I do not think I can delete it. Anything I should try?

Thank you for your help.

From JMRUSHTON at qinetiq.com Tue Jan 24 11:01:40 2012
From: JMRUSHTON at qinetiq.com (Rushton Martin)
Date: Tue, 24 Jan 2012 18:01:40 -0000
Subject: [torqueusers] UC specify job id
In-Reply-To:
References:
Message-ID: <20120124180043.A7B9583A802B@mail.adaptivecomputing.com>

When you issued the command you changed to being root, then submitted the job on behalf of root. What you should have done is:

>sudo qsub script.sh -P hakeem -J 111

(obviously, change hakeem to your actual user name)

Since the job was submitted to run as root, you will not be able to see it, still less delete it, unless you use sudo again.

Martin Rushton
HPC System Manager, Weapons Technologies
Tel: 01959 514777, Mobile: 07939 219057
email: jmrushton at QinetiQ.com
www.QinetiQ.com
QinetiQ - Delivering customer-focused solutions

Please consider the environment before printing this email.

This email and any attachments to it may be confidential and are intended solely for the use of the individual to whom it is addressed. If you are not the intended recipient of this email, you must neither take any action based upon its contents, nor copy or show it to anyone. Please contact the sender if you believe you have received this email in error. QinetiQ may monitor email traffic data and also the content of email for the purposes of security. QinetiQ Limited (Registered in England & Wales: Company Number: 3796233) Registered office: Cody Technology Park, Ively Road, Farnborough, Hampshire, GU14 0LX http://www.qinetiq.com.

From halmabrazi at idtdna.com Tue Jan 24 11:00:12 2012
From: halmabrazi at idtdna.com (Hakeem Almabrazi)
Date: Tue, 24 Jan 2012 18:00:12 +0000
Subject: [torqueusers] specify job id
In-Reply-To:
References:
Message-ID:

David,

Ignore my previous email. I resolved the issue by using ">sudo qsub script.sh -P username -J 111". This seems to work.

Is there another solution besides being a super user though? If there is, that would be a much better solution.

Thanks

From halmabrazi at idtdna.com Tue Jan 24 11:01:47 2012
From: halmabrazi at idtdna.com (Hakeem Almabrazi)
Date: Tue, 24 Jan 2012 18:01:47 +0000
Subject: [torqueusers] UC specify job id
In-Reply-To: <20120124180043.A7B9583A802B@mail.adaptivecomputing.com>
References: <20120124180043.A7B9583A802B@mail.adaptivecomputing.com>
Message-ID:

Thank you Martin,

I did exactly what you said and I responded back, but I guess you were faster than me. Thank you for your tip though.

Is there a better way than being a super user?

From david at unistra.fr Tue Jan 24 11:08:23 2012
From: david at unistra.fr (R. David)
Date: Tue, 24 Jan 2012 19:08:23 +0100
Subject: [torqueusers] UC specify job id
In-Reply-To:
References: <20120124180043.A7B9583A802B@mail.adaptivecomputing.com>
Message-ID: <83C89061-15DA-40DD-BEB3-58F984B713EF@unistra.fr>

Hello,

Otherwise, if you wanted to change the job ID because job IDs tend to get large on systems that have been running for a long time, you could always do:

qmgr -c "set server next_job_number = 111"

Of course this, in turn, requires admin privileges on the pbs server.

Regards,

On 24 Jan 2012, at 19:01, Hakeem Almabrazi wrote:

> Thank you Martin,
>
> I did exactly what you said and I responded back but I guess you were faster than me. Thank you for your tip though.
>
> Is there a better way than being a super user?

---------------------------------------------------------
R. David - david at unistra.fr
Head of the UdS meso-centre / Direction Informatique
Tel.
: 03 68 85 45 48 --------------------------------------------------------- From JMRUSHTON at qinetiq.com Tue Jan 24 11:16:24 2012 From: JMRUSHTON at qinetiq.com (Rushton Martin) Date: Tue, 24 Jan 2012 18:16:24 -0000 Subject: [torqueusers] UC specify job id In-Reply-To: References: <20120124180043.A7B9583A802B@mail.adaptivecomputing.com> Message-ID: <20120124181528.BD59D83A802B@mail.adaptivecomputing.com> I'm sorry but it was only the sudo error I saw. Setting the job number isn't documented in the online manual, so I'm afraid you're on your own there. http://www.adaptivecomputing.com/resources/docs/torque/4-0/help.htm Martin Rushton HPC System Manager, Weapons Technologies Tel: 01959 514777, Mobile: 07939 219057 email: jmrushton at QinetiQ.com www.QinetiQ.com QinetiQ - Delivering customer-focused solutions Please consider the environment before printing this email. ________________________________ From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Hakeem Almabrazi Sent: 24 January 2012 18:02 To: Torque Users Mailing List Subject: Re: [torqueusers] UC specify job id Thank you Martin, I did exactly what you said and I responded back but I guess you were faster than me. Thank you for your tip though. Is there a better way than being a super user? From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Rushton Martin Sent: Tuesday, January 24, 2012 12:02 PM To: Torque Users Mailing List Subject: Re: [torqueusers] UC specify job id When you issued the command you changed to being root, then submitted the job on behalf of root. What you should have done is: >sudo qsub script.sh -P hakeem -J 111 (obviously, change hakeem to your actual user name) Since the job was submitted to run as root, you will not be able to see, still less delete, it unless you use sudo again. 
Martin Rushton HPC System Manager, Weapons Technologies Tel: 01959 514777, Mobile: 07939 219057 email: jmrushton at QinetiQ.com www.QinetiQ.com QinetiQ - Delivering customer-focused solutions Please consider the environment before printing this email. ________________________________ From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Hakeem Almabrazi Sent: 24 January 2012 17:33 To: Torque Users Mailing List Subject: Re: [torqueusers] specify job id Thank you David, I tried this and got an error that I am not a super user. So I looked up who is running the "pbs" and found that it is run under the "root". So I submitted the following (since I am not a root) >sudo qsub script.sh -P root -J 111. No complaining this time but no jobs in the queue when I qstat. I resend the same command as before but then it says Job with requested ID 111 already exist".... I am not sure what to do next. I do not think I can delete it. Anything I should try?... Thank you for your help. From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer Sent: Tuesday, January 24, 2012 10:47 AM To: Torque Users Mailing List Subject: Re: [torqueusers] specify job id If you are a root user submitting on behalf of another user (using the proxy submission this can be done. It is done using -P and -J : qsub script.sh -P -J David On Tue, Jan 24, 2012 at 9:08 AM, Hakeem Almabrazi wrote: Hi, Is there a way to specify job id when request a job? Thanks _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -- David Beer | Software Engineer Adaptive Computing This email and any attachments to it may be confidential and are intended solely for the use of the individual to whom it is addressed. 
If you are not the intended recipient of this email, you must neither take any action based upon its contents, nor copy or show it to anyone. Please contact the sender if you believe you have received this email in error. QinetiQ may monitor email traffic data and also the content of email for the purposes of security. QinetiQ Limited (Registered in England & Wales: Company Number: 3796233) Registered office: Cody Technology Park, Ively Road, Farnborough, Hampshire, GU14 0LX http://www.qinetiq.com

The QinetiQ e-mail privacy policy and company information is detailed elsewhere in the body of this email.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120124/34f760ed/attachment-0001.html

From jjc at iastate.edu Tue Jan 24 12:37:28 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Tue, 24 Jan 2012 19:37:28 +0000
Subject: [torqueusers] specify job id
In-Reply-To:
References:
Message-ID: <242421BFAF465844BE24EB90BB97E221017F2A04@ITSDAG1D.its.iastate.edu>

Hakeem,

Is there a reason you want the jobid to be something specific? The only thing that I can think of is to use this after the fact to pick up the output filenames when you have not used -e and -o (or to keep track of it for issuing later Torque commands like qdel or qstat).

This can be done for a jobscript myjob (in sh, ksh or bash) via:

    # capture the job id that qsub prints on stdout
    QSUB_OUT=`qsub myjob`
    QSUB_RC=$?
    # strip everything from the first dot on (the server name)
    JID=`echo ${QSUB_OUT} | sed -e 's/\..*$//'`

Then ${JID} can be used later. E.g. if you had not used -e and -o to name the STDERR and STDOUT files, the output should be in myjob.e${JID} and myjob.o${JID}

 - Jim

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/ From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Hakeem Almabrazi Sent: Tuesday, January 24, 2012 10:08 AM To: torqueusers at supercluster.org Subject: [torqueusers] specify job id Hi, Is there a way to specify job id when request a job? Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120124/f1983292/attachment.html From samuel at unimelb.edu.au Tue Jan 24 15:41:10 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Wed, 25 Jan 2012 09:41:10 +1100 Subject: [torqueusers] specify job id In-Reply-To: References: Message-ID: <4F1F3386.6020603@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 25/01/12 03:08, Hakeem Almabrazi wrote: > Is there a way to specify job id when request a job? I would strongly suggest that you try and avoid doing that and just work with whatever job ID the PBS system gives you. It's the only way to do things portably and every HPC application that interfaces with Torque does this. cheers! 
Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk8fM4YACgkQO2KABBYQAh8oJQCeMR8+C0zA/ugkef/05quBDp44 Q4gAnAkrNbOdadYXPGA+JgPbEv2XAS3Q =cb6L -----END PGP SIGNATURE----- From halmabrazi at idtdna.com Wed Jan 25 15:18:41 2012 From: halmabrazi at idtdna.com (Hakeem Almabrazi) Date: Wed, 25 Jan 2012 22:18:41 +0000 Subject: [torqueusers] specify job id In-Reply-To: <242421BFAF465844BE24EB90BB97E221017F2A04@ITSDAG1D.its.iastate.edu> References: <242421BFAF465844BE24EB90BB97E221017F2A04@ITSDAG1D.its.iastate.edu> Message-ID: Thank you everyone for helping me with this. The only reason why I want to do that is to be able to track that job specifically. I am building a service to submit jobs to Torque. The same service will have to wait till the job is done and return the results back to the call. So the service need to know when the job is completed to load back the results. I thought, the best way is for that service to keep asking the Torque on the status of the submitted job rather than looking up file(s) in the file system. So the service has to assign a number to the job and then force Torque to use that number. Now if the service wants to know the status of that job, it can asks the Torque server using the same job number. It might be over kill way of doing things but that is the requirement. Now there might be a better way of doing that. I would love to hear it if someone else has better approach than this. Thanks . 
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Coyle, James J [ITACD]
Sent: Tuesday, January 24, 2012 1:37 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] specify job id

Hakeem,

Is there a reason you want the jobid to be something specific? The only thing that I can think of is to use this after the fact to pick up the output filenames when you have not used -e and -o (or to keep track of it for issuing later Torque commands like qdel or qstat).

This can be done for a jobscript myjob (in sh, ksh or bash) via:

    # capture the job id that qsub prints on stdout
    QSUB_OUT=`qsub myjob`
    QSUB_RC=$?
    # strip everything from the first dot on (the server name)
    JID=`echo ${QSUB_OUT} | sed -e 's/\..*$//'`

Then ${JID} can be used later. E.g. if you had not used -e and -o to name the STDERR and STDOUT files, the output should be in myjob.e${JID} and myjob.o${JID}

 - Jim

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Hakeem Almabrazi
Sent: Tuesday, January 24, 2012 10:08 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] specify job id

Hi,

Is there a way to specify job id when request a job?

Thanks
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120125/1f94ac25/attachment-0001.html

From lloyd_brown at byu.edu Wed Jan 25 15:25:06 2012
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Wed, 25 Jan 2012 15:25:06 -0700
Subject: [torqueusers] specify job id
In-Reply-To:
References: <242421BFAF465844BE24EB90BB97E221017F2A04@ITSDAG1D.its.iastate.edu>
Message-ID: <4F208142.7010505@byu.edu>

I'm a little confused. If all you're doing is adding a frontend to Torque, what's wrong with using the jobid numbers that Torque assigns? They're guaranteed to be unique, and get output via stdout when you call qsub.
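
A minimal sh sketch of that approach, assuming qsub prints the full job id (something like 154029.hpc5) on stdout. The function names job_number and submit_job are made up for illustration, and the qsub call itself needs a working Torque client, so nothing is submitted when the file is merely sourced:

```shell
#!/bin/sh
# Hypothetical helper: strip the server suffix from a full Torque job id,
# e.g. "154029.hpc5" -> "154029". An id without a dot is returned unchanged.
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: submit the script named in $1 and print the job id
# that qsub wrote to stdout. Returns non-zero if qsub itself fails.
submit_job() {
    full_id=$(qsub "$1") || return 1
    printf '%s\n' "$full_id"
}

# Example use (left commented out because it needs a real Torque setup):
#   full_id=$(submit_job myjob) && qstat "$full_id"
```

A front end could store the printed id and hand it to qstat or qdel later, which should not need any superuser rights.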
Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 01/25/2012 03:18 PM, Hakeem Almabrazi wrote:
> Thank you everyone for helping me with this.
>
>
>
> The only reason why I want to do that is to be able to track that job
> specifically. I am building a service to submit jobs to Torque. The
> same service will have to wait till the job is done and return the
> results back to the call. So the service need to know when the job is
> completed to load back the results. I thought, the best way is for that
> service to keep asking the Torque on the status of the submitted job
> rather than looking up file(s) in the file system.
>
>
>
> So the service has to assign a number to the job and then force Torque
> to use that number. Now if the service wants to know the status of that
> job, it can asks the Torque server using the same job number. It might
> be over kill way of doing things but that is the requirement. Now
> there might be a better way of doing that. I would love to hear it if
> someone else has better approach than this.
>

From akohlmey at cmm.chem.upenn.edu Wed Jan 25 15:51:06 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Wed, 25 Jan 2012 17:51:06 -0500
Subject: [torqueusers] specify job id
In-Reply-To:
References: <242421BFAF465844BE24EB90BB97E221017F2A04@ITSDAG1D.its.iastate.edu>
Message-ID:

On Wed, Jan 25, 2012 at 5:18 PM, Hakeem Almabrazi wrote:
> Thank you everyone for helping me with this.
>
>
>
> The only reason why I want to do that is to be able to track that job
> specifically. I am building a service to submit jobs to Torque. The same
> service will have to wait till the job is done and return the results back
> to the call. So the service need to know when the job is completed to load
> back the results. I thought, the best way is for that service to keep
> asking the Torque on the status of the submitted job rather than looking up
> file(s) in the file system.
>
>
>
> So the service has to assign a number to the job and then force Torque to
> use that number. Now if the service wants to know the status of that job,
> it can asks the Torque server using the same job number. It might be over
> kill way of doing things but that is the requirement. Now there might be a
> better way of doing that. I would love to hear it if someone else has
> better approach than this.

well, as was mentioned before, you can just take advantage of the fact that the qsub command returns a canonicalized job id.

if that is too complicated, you could use the -N flag to qsub. this will assign a job name (default is the name of the submitted script unless otherwise specified) and track that one.

both methods have the advantage of not requiring any superuser or queue manager privileges and thus pose no risk of corrupting the queue system status.

cheers,
axel.

>
>
>
> Thanks
>
>
>
> .
>
>
>
>

--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

From nt_mahmood at yahoo.com Thu Jan 26 02:14:55 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Thu, 26 Jan 2012 01:14:55 -0800 (PST)
Subject: [torqueusers] Setting up checkpointing
In-Reply-To: <4F174A0F.4080308@byu.edu>
References: <4F174A0F.4080308@byu.edu>
Message-ID: <1327569295.68967.YahooMailNeo@web111709.mail.gq1.yahoo.com>

If you are using debian based operating system, then you hardly can make BLCR working.
BLCR is primarily designed for redhat based operating systems.

// Naderan *Mahmood;

________________________________
From: Lloyd Brown
To: Torque Users Mailing List
Sent: Thursday, January 19, 2012 2:09 AM
Subject: [torqueusers] Setting up checkpointing

Can anyone enlighten me on the current state of BLCR-style checkpointing in Torque?
I've been trying to get it to work, and so far, I see that it's invoking my checkpoint script, that script calls cr_checkpoint, and the checkpoint files/directories are created, but something is calling the mom_checkpoint_delete_files function, which in turn calls delete_blcr_files, and the checkpoints get deleted.

Also, when I do a "qhold" on my job to try to initiate the checkpoint, is it really supposed to terminate my job? Perhaps that's related, eg. the job is ending so the files get cleaned up.

Basically, does anyone have it working, and can give me advice?

Thanks,

--
Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120126/5f4b1e88/attachment.html

From lloyd_brown at byu.edu Thu Jan 26 08:05:45 2012
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Thu, 26 Jan 2012 08:05:45 -0700
Subject: [torqueusers] Setting up checkpointing
In-Reply-To: <1327569295.68967.YahooMailNeo@web111709.mail.gq1.yahoo.com>
References: <4F174A0F.4080308@byu.edu> <1327569295.68967.YahooMailNeo@web111709.mail.gq1.yahoo.com>
Message-ID: <4F216BC9.3080908@byu.edu>

Thanks for your insight. I apologize that I wasn't clear. I have BLCR working, at least manually. My question has more to do with how the integration with Torque works.

In the meantime, I'm currently pursuing the approach of having the checkpointing occur within the job, eg. scripting to have the job call cr_run, cr_checkpoint, cr_restart, etc., as needed. Some clever work with signals makes it reasonably easy. The real problem is redirecting or relocating the output spool files, eg. /spool/.{OU,ER}.
But if your script is checkpointing something it called, rather than checkpointing itself, and if your script uses shell redirection to files that persist on a central filesystem, that's not too hard. Lloyd Brown Systems Administrator Fulton Supercomputing Lab Brigham Young University http://marylou.byu.edu On 01/26/2012 02:14 AM, Mahmood Naderan wrote: > If you are using debian based operating system, then you hardly can make > BLCR working. > BLCR is primarily designed for redhat based operating systems. > > *// Naderan *Mahmood;* > > ------------------------------------------------------------------------ > *From:* Lloyd Brown > *To:* Torque Users Mailing List > *Sent:* Thursday, January 19, 2012 2:09 AM > *Subject:* [torqueusers] Setting up checkpointing > > Can anyone enlighten me on the current state of BLCR-style checkpointing > in Torque? I've been trying to get it to work, and so far, I see that > it's invoking my checkpoint script, that script calls cr_checkpoint, and > the checkpoint files/directories are created, but something is calling > the mom_checkpoint_delete_files function, which in turn calls > delete_blcr_files, and the checkpoints get deleted. > > Also, when I do a "qhold" on my job to try to initiate the checkpoint, > is it really supposed to terminate my job? Perhaps that's related, eg. > the job is ending so the files get cleaned up. > > Basically, does anyone have it working, and can give me advice? 
> > Thanks, > > -- > Lloyd Brown > Systems Administrator > Fulton Supercomputing Lab > Brigham Young University > http://marylou.byu.edu > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > From j.blank at fz-juelich.de Thu Jan 26 08:56:38 2012 From: j.blank at fz-juelich.de (Joerg Blank) Date: Thu, 26 Jan 2012 16:56:38 +0100 Subject: [torqueusers] Torque4 NUMA Message-ID: Hello, I tried installing Torque4 with NUMA support in a Virtualbox and got stopped by the mom.layout file. The documentation still talks about "cpus" and "mem", but the source code looks like the syntax changed. I tried "nodes=1" (my vbox machine has only one cpu), got the mom to start, but could not get it to connect to pbs_server. How should that file look in Torque4? I also had to deactivate libcpuset support because it could not write into /dev/cpuset/torque, but that may be due to a misconfigured layout. Regards, Joerg Blank From cwebberops at gmail.com Thu Jan 26 09:54:33 2012 From: cwebberops at gmail.com (Christopher Webber) Date: Thu, 26 Jan 2012 08:54:33 -0800 Subject: [torqueusers] Upgrade Path Message-ID: I am looking at upgrading torque as we are currently running 2.5.5. Can I use a staged approach or does it need to be a forklift upgrade? I would imagine that I would upgrade the server first, which hopefully can talk to older clients and then upgrade the clients. Thoughts? -- cwebber From ataufer at adaptivecomputing.com Thu Jan 26 09:55:37 2012 From: ataufer at adaptivecomputing.com (Al Taufer) Date: Thu, 26 Jan 2012 09:55:37 -0700 (MST) Subject: [torqueusers] Setting up checkpointing In-Reply-To: <4F174A0F.4080308@byu.edu> Message-ID: ----- Original Message ----- > Can anyone enlighten me on the current state of BLCR-style > checkpointing > in Torque? 
I've been trying to get it to work, and so far, I see > that > it's invoking my checkpoint script, that script calls cr_checkpoint, > and > the checkpoint files/directories are created, but something is > calling > the mom_checkpoint_delete_files function, which in turn calls > delete_blcr_files, and the checkpoints get deleted. I hope you are seeing normal behavior. If I remember correctly, when a job gets checkpointed, the checkpoint files remain on the mom until the mom completes the job or until the job is put on hold and is no longer on the mom. At that time the checkpoint files are transferred to the server where they remain until the job is removed from the server. When the job gets restarted, which may or may not be on the original mom node, the checkpoint files are transferred to the mom which can then restart the job from the checkpoint file. > > Also, when I do a "qhold" on my job to try to initiate the > checkpoint, > is it really supposed to terminate my job? Perhaps that's related, > eg. > the job is ending so the files get cleaned up. qhold is behaving as designed and as documented in its man page. If you want to just checkpoint the job and allow it to continue running, use qchkpt. > > Basically, does anyone have it working, and can give me advice? > > Thanks, > > -- > Lloyd Brown > Systems Administrator > Fulton Supercomputing Lab > Brigham Young University > http://marylou.byu.edu > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From lloyd_brown at byu.edu Thu Jan 26 11:03:24 2012 From: lloyd_brown at byu.edu (Lloyd Brown) Date: Thu, 26 Jan 2012 11:03:24 -0700 Subject: [torqueusers] Setting up checkpointing In-Reply-To: References: Message-ID: <4F21956C.9040104@byu.edu> Al, Thanks for the update. 
I guess the use case our users are really after is to have either a one-time or a periodic checkpoint, with the wait time before the checkpoint specified by the user. The "-c interval=" parameter to qsub makes it look like this should work. But when I did that, I couldn't get the job to actually checkpoint without manually calling qhold/qchkpt. Maybe I'm just misinterpreting something, or don't have it set up right, but the idea here is to not require the users to manually checkpoint their job. Lloyd Brown Systems Administrator Fulton Supercomputing Lab Brigham Young University http://marylou.byu.edu On 01/26/2012 09:55 AM, Al Taufer wrote: > > ----- Original Message ----- >> Can anyone enlighten me on the current state of BLCR-style >> checkpointing >> in Torque? I've been trying to get it to work, and so far, I see >> that >> it's invoking my checkpoint script, that script calls cr_checkpoint, >> and >> the checkpoint files/directories are created, but something is >> calling >> the mom_checkpoint_delete_files function, which in turn calls >> delete_blcr_files, and the checkpoints get deleted. > > I hope you are seeing normal behavior. If I remember correctly, when a job gets checkpointed, the checkpoint files remain on the mom until the mom completes the job or until the job is put on hold and is no longer on the mom. At that time the checkpoint files are transferred to the server where they remain until the job is removed from the server. When the job gets restarted, which may or may not be on the original mom node, the checkpoint files are transferred to the mom which can then restart the job from the checkpoint file. > >> >> Also, when I do a "qhold" on my job to try to initiate the >> checkpoint, >> is it really supposed to terminate my job? Perhaps that's related, >> eg. >> the job is ending so the files get cleaned up. > > qhold is behaving as designed and as documented in its man page. 
If you want to just checkpoint the job and allow it to continue running, use qchkpt. > >> >> Basically, does anyone have it working, and can give me advice? >> >> Thanks, >> >> -- >> Lloyd Brown >> Systems Administrator >> Fulton Supercomputing Lab >> Brigham Young University >> http://marylou.byu.edu >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From ataufer at adaptivecomputing.com Thu Jan 26 11:18:43 2012 From: ataufer at adaptivecomputing.com (Al Taufer) Date: Thu, 26 Jan 2012 11:18:43 -0700 (MST) Subject: [torqueusers] Setting up checkpointing In-Reply-To: <4F21956C.9040104@byu.edu> Message-ID: Are you just using the "-c interval=x"? If so that just specifies what the checkpoint interval is but it does not enable the checkpointing. Try changing it to "-c periodic,interval=x". ----- Original Message ----- > Al, > > Thanks for the update. I guess the use case our users are really > after > is to have either a one-time or a periodic checkpoint, with the wait > time before the checkpoint specified by the user. The "-c interval=" > parameter to qsub makes it look like this should work. But when I > did > that, I couldn't get the job to actually checkpoint without manually > calling qhold/qchkpt. Maybe I'm just misinterpreting something, or > don't have it set up right, but the idea here is to not require the > users to manually checkpoint their job. > > Lloyd Brown > Systems Administrator > Fulton Supercomputing Lab > Brigham Young University > http://marylou.byu.edu > > On 01/26/2012 09:55 AM, Al Taufer wrote: > > > > ----- Original Message ----- > >> Can anyone enlighten me on the current state of BLCR-style > >> checkpointing > >> in Torque? 
I've been trying to get it to work, and so far, I see > >> that > >> it's invoking my checkpoint script, that script calls > >> cr_checkpoint, > >> and > >> the checkpoint files/directories are created, but something is > >> calling > >> the mom_checkpoint_delete_files function, which in turn calls > >> delete_blcr_files, and the checkpoints get deleted. > > > > I hope you are seeing normal behavior. If I remember correctly, > > when a job gets checkpointed, the checkpoint files remain on the > > mom until the mom completes the job or until the job is put on > > hold and is no longer on the mom. At that time the checkpoint > > files are transferred to the server where they remain until the > > job is removed from the server. When the job gets restarted, > > which may or may not be on the original mom node, the checkpoint > > files are transferred to the mom which can then restart the job > > from the checkpoint file. > > > >> > >> Also, when I do a "qhold" on my job to try to initiate the > >> checkpoint, > >> is it really supposed to terminate my job? Perhaps that's > >> related, > >> eg. > >> the job is ending so the files get cleaned up. > > > > qhold is behaving as designed and as documented in its man page. > > If you want to just checkpoint the job and allow it to continue > > running, use qchkpt. > > > >> > >> Basically, does anyone have it working, and can give me advice? 
> >> > >> Thanks, > >> > >> -- > >> Lloyd Brown > >> Systems Administrator > >> Fulton Supercomputing Lab > >> Brigham Young University > >> http://marylou.byu.edu > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > >> > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From sm4082 at nyu.edu Thu Jan 26 11:24:09 2012 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Thu, 26 Jan 2012 13:24:09 -0500 Subject: [torqueusers] Setting up checkpointing In-Reply-To: References: Message-ID: <90690D50-5169-4DF3-A617-BD70252BAA8C@nyu.edu> Hi, Try this: qmgr -c 'set queue serial checkpoint_defaults="enabled,shutdown,periodic,interval=1,depth=2"' serial is the queue name. depth doesn't work. You need to change the perl script that comes with blcr package to accommodate this variable. Has anyone modified the checkpoint scripts? Does it work? Thanks, Sreedhar. On Jan 26, 2012, at 1:18 PM, Al Taufer wrote: > Are you just using the "-c interval=x"? If so that just specifies what the checkpoint interval is but it does not enable the checkpointing. Try changing it to "-c periodic,interval=x". > > ----- Original Message ----- >> Al, >> >> Thanks for the update. I guess the use case our users are really >> after >> is to have either a one-time or a periodic checkpoint, with the wait >> time before the checkpoint specified by the user. The "-c interval=" >> parameter to qsub makes it look like this should work. But when I >> did >> that, I couldn't get the job to actually checkpoint without manually >> calling qhold/qchkpt. 
Maybe I'm just misinterpreting something, or >> don't have it set up right, but the idea here is to not require the >> users to manually checkpoint their job. >> >> Lloyd Brown >> Systems Administrator >> Fulton Supercomputing Lab >> Brigham Young University >> http://marylou.byu.edu >> >> On 01/26/2012 09:55 AM, Al Taufer wrote: >>> >>> ----- Original Message ----- >>>> Can anyone enlighten me on the current state of BLCR-style >>>> checkpointing >>>> in Torque? I've been trying to get it to work, and so far, I see >>>> that >>>> it's invoking my checkpoint script, that script calls >>>> cr_checkpoint, >>>> and >>>> the checkpoint files/directories are created, but something is >>>> calling >>>> the mom_checkpoint_delete_files function, which in turn calls >>>> delete_blcr_files, and the checkpoints get deleted. >>> >>> I hope you are seeing normal behavior. If I remember correctly, >>> when a job gets checkpointed, the checkpoint files remain on the >>> mom until the mom completes the job or until the job is put on >>> hold and is no longer on the mom. At that time the checkpoint >>> files are transferred to the server where they remain until the >>> job is removed from the server. When the job gets restarted, >>> which may or may not be on the original mom node, the checkpoint >>> files are transferred to the mom which can then restart the job >>> from the checkpoint file. >>> >>>> >>>> Also, when I do a "qhold" on my job to try to initiate the >>>> checkpoint, >>>> is it really supposed to terminate my job? Perhaps that's >>>> related, >>>> eg. >>>> the job is ending so the files get cleaned up. >>> >>> qhold is behaving as designed and as documented in its man page. >>> If you want to just checkpoint the job and allow it to continue >>> running, use qchkpt. >>> >>>> >>>> Basically, does anyone have it working, and can give me advice? 
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Lloyd Brown
>>>> Systems Administrator
>>>> Fulton Supercomputing Lab
>>>> Brigham Young University
>>>> http://marylou.byu.edu
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From jjc at iastate.edu Thu Jan 26 11:59:05 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Thu, 26 Jan 2012 18:59:05 +0000
Subject: [torqueusers] specify job id : Additional solutions which avoid changing jodib
In-Reply-To:
References: <242421BFAF465844BE24EB90BB97E221017F2A04@ITSDAG1D.its.iastate.edu>
Message-ID: <242421BFAF465844BE24EB90BB97E221017F2DFF@ITSDAG1D.its.iastate.edu>

Hakeem,

We have two users who do this.

One captures the job id in the manner that I described earlier, and keeps that in a file on their webserver to use later. This technique does not require the webserver to hold onto the ssh connection, and survives a submittal node restart.

The second (easier) technique submits a job through qsub, and waits for the result. They do this using qsub -I -x (This is available in the 2.5.4 that we run, and likely anything newer.) You can test it on the head node with my example below.

E.g.
If you make a file named tjob which contains the line

  date;sleep 10;date;hostname

and make it executable with

  chmod u+x tjob

then you can submit this and wait for the result using

  qsub -I -x tjob

Log:

  $ date; qsub -I -x tjob; date
  Thu Jan 26 12:52:39 CST 2012
  qsub: waiting for job 154029.hpc5 to start
  qsub: job 154029.hpc5 ready
  Thu Jan 26 12:52:41 CST 2012
  Thu Jan 26 12:52:51 CST 2012
  node178
  qsub: job 154029.hpc5 completed
  Thu Jan 26 12:52:49 CST 2012

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Hakeem Almabrazi
Sent: Wednesday, January 25, 2012 4:19 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] specify job id

Thank you everyone for helping me with this.

The only reason why I want to do that is to be able to track that job specifically. I am building a service to submit jobs to Torque. The same service will have to wait till the job is done and return the results back to the caller. So the service needs to know when the job is completed to load back the results. I thought the best way is for that service to keep asking Torque for the status of the submitted job rather than looking up file(s) in the file system. So the service has to assign a number to the job and then force Torque to use that number. Now if the service wants to know the status of that job, it can ask the Torque server using the same job number.

It might be an overkill way of doing things, but that is the requirement. Now there might be a better way of doing that. I would love to hear it if someone else has a better approach than this.

Thanks.
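A minimal sketch of the polling approach Hakeem describes, using the job id that qsub prints instead of a caller-assigned one. It assumes that qsub writes the id (for example 154029.hpc5) to stdout and that "qstat -f <id>" prints a "job_state = X" line, with C meaning completed. The function names job_state and wait_for_job and the 30-second default interval are made up for illustration and are not part of Torque. A job the server no longer knows about is treated as finished.

```shell
#!/bin/sh
# Hypothetical helpers; not part of Torque.

# Print the one-letter state (Q, R, C, ...) of job $1, or nothing if
# qstat no longer knows the job.
job_state() {
    qstat -f "$1" 2>/dev/null | sed -n 's/^ *job_state = //p'
}

# Block until job $1 is completed (state C) or gone from the server.
# $2 is the polling interval in seconds (default 30).
wait_for_job() {
    while :; do
        state=$(job_state "$1")
        case "$state" in
            C|"") return 0 ;;
        esac
        sleep "${2:-30}"
    done
}

# Usage (myjob is a job script; needs a working Torque client):
#   JID=$(qsub myjob) && wait_for_job "$JID" 10
```

Whether a finished job still shows up in state C or has already disappeared depends on how long the server keeps completed jobs, which is why both cases end the loop.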
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Coyle, James J [ITACD]
Sent: Tuesday, January 24, 2012 1:37 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] specify job id

Hakeem,

Is there a reason you want the jobid to be something specific?

The only thing that I can think of is to use this after the fact to pick up the output filenames when you have not used -e and -o (or to keep track of it for issuing later Torque commands like qdel or qstat).

This can be done for a jobscript myjob (in sh, ksh or bash) via:

  QSUB_OUT=`qsub myjob`
  JID=`echo ${QSUB_OUT} | sed -e 's/\..*$//'`

Then ${JID} can be used later. E.g. if you had not used -e and -o to name the STDERR and STDOUT files, the output should be in myjob.e${JID} and myjob.o${JID}

- Jim

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Hakeem Almabrazi
Sent: Tuesday, January 24, 2012 10:08 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] specify job id

Hi,

Is there a way to specify the job id when requesting a job?

Thanks

From lloyd_brown at byu.edu Thu Jan 26 14:30:41 2012
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Thu, 26 Jan 2012 14:30:41 -0700
Subject: [torqueusers] Setting up checkpointing
In-Reply-To:
References:
Message-ID: <4F21C601.7000803@byu.edu>

Al,

I tried a number of combinations of params, but after your last email, I tried it with "-c periodic,interval=x", and I do see the checkpoint being created in the TORQUEMOMHOME/checkpoints directory. I haven't been able to test beyond that, since some other things go up.

From what you've said, though, I have to ask if there's any way to specify where the checkpoint goes, especially when it would otherwise be copied back to the host where pbs_server is running. You see, our use case involves checkpointing some really big-memory (eg. 256 GB) processes, and we simply don't have the space to store that on the pbs_server host.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 01/26/2012 11:18 AM, Al Taufer wrote:
> Are you just using the "-c interval=x"? If so that just specifies what the checkpoint interval is but it does not enable the checkpointing. Try changing it to "-c periodic,interval=x".

From sm4082 at nyu.edu Thu Jan 26 14:43:44 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Thu, 26 Jan 2012 16:43:44 -0500
Subject: [torqueusers] Setting up checkpointing
In-Reply-To: <4F21C601.7000803@byu.edu>
References: <4F21C601.7000803@byu.edu>
Message-ID: <1518B413-5DC3-4A41-BCB7-A136E264C315@nyu.edu>

Hi,

This is what I did to avoid checkpoint images going onto server node.

Modify the pbs_mom's config file to specify what checkpointing directories are remotely mounted. This can be done by adding something like:

  $remote_checkpoint_dirs /opt/torque/checkpoint

Here /opt/torque/checkpoint is remotely mounted onto /opt/torque/checkpoint on each compute node. It doesn't have to be /opt/torque/checkpoint on server node. It can be any other directory on server node. I linked /opt/torque/checkpoint on server node to some other directory with lots of space.

Best,
Sreedhar.
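The change Sreedhar describes amounts to one extra line in each compute node's pbs_mom config file. A minimal sketch follows, assuming the config lives at /opt/torque/mom_priv/config and that /opt/torque/checkpoint is already mounted from shared storage on the node. The helper name add_mom_option is made up for illustration, and pbs_mom will most likely need a restart before it picks up the new setting.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a pbs_mom config file unless an
# identical line is already there, so repeated runs are harmless.
add_mom_option() {
    config=$1
    line=$2
    grep -qxF -- "$line" "$config" 2>/dev/null || printf '%s\n' "$line" >> "$config"
}

# Usage on a compute node (the path is an assumption, adjust to your install):
#   add_mom_option /opt/torque/mom_priv/config \
#       '$remote_checkpoint_dirs /opt/torque/checkpoint'
```

The single quotes matter in the usage line: they keep the shell from expanding $remote_checkpoint_dirs as a variable.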

On Jan 26, 2012, at 4:30 PM, Lloyd Brown wrote:

> Al,
>
> I tried a number of combinations of params, but after your last email, I
> tried it with "-c periodic,interval=x", and I do see the checkpoint
> being created in the TORQUEMOMHOME/checkpoints directory. I haven't
> been able to test beyond that, since some other things go up.
>
> From what you've said, though, I have to ask if there's any way to
> specify where the checkpoint goes, especially when it would otherwise be
> copied back to the host where pbs_server is running. You see, our use
> case involves checkpointing some really big-memory (eg. 256 GB)
> processes, and we simply don't have the space to store that on the
> pbs_server host.
>
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu

From ataufer at adaptivecomputing.com Thu Jan 26 15:31:37 2012
From: ataufer at adaptivecomputing.com (Al Taufer)
Date: Thu, 26 Jan 2012 15:31:37 -0700 (MST)
Subject: [torqueusers] Setting up checkpointing
In-Reply-To: <1518B413-5DC3-4A41-BCB7-A136E264C315@nyu.edu>
Message-ID: <6b070153-4e0f-4ec3-bcf7-69d2c29524e2@mail>

This is a good method for accomplishing what is wanted. The only thing I can add is that when you configure the server you could use the --with-servchkptdir option to specify where the server will keep its checkpoint files, which can be a remotely mounted path.

----- Original Message -----
> Hi,
>
> This is what I did to avoid checkpoint images going onto server node.
>
> Modify the pbs_mom's config file to specify what checkpointing
> directories are remotely mounted. This can be done by adding
> something like:
>
> $remote_checkpoint_dirs /opt/torque/checkpoint
>
> Here /opt/torque/checkpoint is remotely mounted onto
> /opt/torque/checkpoint on each compute node. It doesn't have to be
> /opt/torque/checkpoint on server node. It can be any other directory
> on server node. I linked /opt/torque/checkpoint on server node to
> some other directory with lots of space.
>
> Best,
> Sreedhar.

From samuel at unimelb.edu.au Thu Jan 26 19:34:08 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Fri, 27 Jan 2012 13:34:08 +1100
Subject: [torqueusers] Torque4 NUMA
In-Reply-To:
References:
Message-ID: <4F220D20.7050907@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 27/01/12 02:56, Joerg Blank wrote:

> I also had to deactivate libcpuset support because it could not write
> into /dev/cpuset/torque, but that may be due to a misconfigured layout.

Doesn't Torque4 use hwloc now?

- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk8iDSAACgkQO2KABBYQAh/uEQCeN2GqQ/CA/gdisMqs9YQD7EDt
YwIAoJhLGq9uMiGFrHeMbvltbxLuphTC
=HSh0
-----END PGP SIGNATURE-----

From dbeer at adaptivecomputing.com Fri Jan 27 09:30:12 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 27 Jan 2012 09:30:12 -0700
Subject: [torqueusers] Upgrade Path
In-Reply-To:
References:
Message-ID:

Chris,

As long as you are only upgrading to the latest 2.5.*, you should be able to do a rolling upgrade, as is defined in the docs:

http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/a.eupgrade.php

Certainly upgrading the server first won't be a problem in this scenario.

David

On Thu, Jan 26, 2012 at 9:54 AM, Christopher Webber wrote:

> I am looking at upgrading torque as we are currently running 2.5.5. Can I
> use a staged approach or does it need to be a forklift upgrade? I would
> imagine that I would upgrade the server first, which hopefully can talk to
> older clients and then upgrade the clients.
>
> Thoughts?
>
> -- cwebber

--
David Beer | Software Engineer
Adaptive Computing

From dbeer at adaptivecomputing.com Fri Jan 27 09:44:25 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 27 Jan 2012 09:44:25 -0700
Subject: [torqueusers] Torque4 NUMA
In-Reply-To:
References:
Message-ID:

Please see my answers inline.

On Thu, Jan 26, 2012 at 8:56 AM, Joerg Blank wrote:

> Hello,
>
> I tried installing Torque4 with NUMA support in a Virtualbox and got
> stopped by the mom.layout file. The documentation still talks about
> "cpus" and "mem", but the source code looks like the syntax changed.
> I tried "nodes=1" (my vbox machine has only one cpu), got the mom to
> start, but could not get it to connect to pbs_server.
> How should that file look in Torque4?

You are correct in that you should use the nodes=X syntax. I wonder how meaningful your NUMA test can be for a node that has only one cpu and one NUMA node, but that is a separate concern. I'm sorry the docs haven't been updated yet, I will get that updated ASAP. I'm surprised you're having trouble getting the mom to connect to the server. What are the errors you're getting here?

> I also had to deactivate libcpuset support because it could not write
> into /dev/cpuset/torque, but that may be due to a misconfigured layout.

This can't have anything to do with the mom.layout file. TORQUE does not edit the permissions of these directories, it just attempts to access them. What was the error?

> Regards,
> Joerg Blank

--
David Beer | Software Engineer
Adaptive Computing

From dbeer at adaptivecomputing.com Fri Jan 27 09:44:51 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 27 Jan 2012 09:44:51 -0700
Subject: [torqueusers] Torque4 NUMA
In-Reply-To: <4F220D20.7050907@unimelb.edu.au>
References: <4F220D20.7050907@unimelb.edu.au>
Message-ID:

On Thu, Jan 26, 2012 at 7:34 PM, Christopher Samuel wrote:

> On 27/01/12 02:56, Joerg Blank wrote:
>
> > I also had to deactivate libcpuset support because it could not write
> > into /dev/cpuset/torque, but that may be due to a misconfigured layout.
>
> Doesn't Torque4 use hwloc now?

Yes, TORQUE 4 uses hwloc.

--
David Beer | Software Engineer
Adaptive Computing

From lloyd_brown at byu.edu Fri Jan 27 11:04:08 2012
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Fri, 27 Jan 2012 11:04:08 -0700
Subject: [torqueusers] Setting up checkpointing
In-Reply-To: <1518B413-5DC3-4A41-BCB7-A136E264C315@nyu.edu>
References: <4F21C601.7000803@byu.edu> <1518B413-5DC3-4A41-BCB7-A136E264C315@nyu.edu>
Message-ID: <4F22E718.2070200@byu.edu>

I have to apologize for being so dense, but it seems I still need a little help.

Thanks to Al's and Sreedhar's help, I've been able to get the checkpoint files to be generated (either in TORQUEMOMDIR/checkpoints, or whatever I specify via "-c dir="). When the job ends (qdel, runs out of walltime, etc.), though, it sounds like it should be copied back to the pbs_server host somewhere, either where specified via configure or qmgr, or in PBSSERVERDIR/checkpoint by default. The thing is that while the checkpoints get deleted on the mom, they never show up on the server. This occurs both with and without "qmgr -c 's q queuename checkpoint_dir=..'", as described in the docs. I haven't tried recompiling the server with the config param Al mentioned yet.

I'm still deciding whether I like the behavior of torque with respect to checkpointing, and whether it will fit with my users' use case, but right now, I can't replicate the behavior yet.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 01/26/2012 02:43 PM, Sreedhar Manchu wrote:
> Hi,
>
> This is what I did to avoid checkpoint images going onto server node.
>
> Modify the pbs_mom's config file to specify what checkpointing directories are remotely mounted. This can be done by adding something like:
>
> $remote_checkpoint_dirs /opt/torque/checkpoint
>
> Here /opt/torque/checkpoint is remotely mounted onto /opt/torque/checkpoint on each compute node.
> It doesn't have to be /opt/torque/checkpoint on server node. It can be any other directory on server node. I linked /opt/torque/checkpoint on server node to some other directory with lots of space.
>
> Best,
> Sreedhar.

From sm4082 at nyu.edu Fri Jan 27 11:33:55 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Fri, 27 Jan 2012 13:33:55 -0500
Subject: [torqueusers] Setting up checkpointing
In-Reply-To: <4F22E718.2070200@byu.edu>
References: <4F21C601.7000803@byu.edu> <1518B413-5DC3-4A41-BCB7-A136E264C315@nyu.edu> <4F22E718.2070200@byu.edu>
Message-ID:

Hi Lloyd,

I had the same problem. Then I realized that the per-job checkpoint directory, whether it sits in the server checkpoint directory or in a directory the user named on the qsub command line, is deleted as soon as the job either completes or is deleted with qdel.
I don't remember exactly, but I think that if you keep the job information on the server with "set server keep_completed = 300" (300 seconds), you can restart from the checkpoint that qhold creates. If the server is set up to keep the information long enough after the job is done, and the PBS script issues qhold just before the walltime ends, then you should be able to restart the job from the checkpoint created right before the job ends. Otherwise, once the job ends, Torque removes the job information and BLCR no longer has the files and information it needs to restart the job.

Since I use the "$spool_as_final_name true" parameter in /opt/torque/mom_priv/config on each compute node, I couldn't make Torque restart jobs from checkpoint images. I think that, for whatever reason, BLCR sees the files as modified since they were created. Because of this I never used checkpointing from the Torque side: for us it is more useful to have the files in the user directories from the beginning of the job itself than to copy them from the compute nodes after the job is done.

Regarding qdel: it deletes all the checkpoint files because it also deletes all the job information from the server (if I am right). Without that information from Torque, the checkpoint images would not be of much use, since BLCR (I mean Torque compiled with BLCR) needs all of it to restart. I guess issuing qdel is like saying "I don't care about this job, so I don't need anything from it anymore". The same applies once the job is done: Torque sees that the job has finished successfully, so there is no need to keep the checkpoint images. This is why you need to issue qhold just before the walltime ends; Torque then keeps the job information for the time you set in the keep_completed server parameter with qmgr -c.

A long time ago, I successfully checkpointed and restarted jobs with Torque (a simple C executable).
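The steps described above could be sketched as follows. This is only an illustration under assumptions: the helper names (run, keep_completed, hold_job, release_job) are made up for this note, the script prints the commands by default instead of running them, and using qrls to release the held job is an assumption that is not stated in the message.

```shell
#!/bin/sh
# Sketch only. By default (DRY_RUN=1) the commands are printed, not executed.
# Set DRY_RUN=0 on a host where the Torque client tools are installed.
DRY_RUN="${DRY_RUN:-1}"

# Print or execute a command, depending on DRY_RUN.
# Note: the printed form does not preserve shell quoting.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Keep information about completed jobs on the server for $1 seconds
# (300 in the message above).
keep_completed() {
    run qmgr -c "set server keep_completed = $1"
}

# Checkpoint and hold the job with id $1.
hold_job() {
    run qhold "$1"
}

# Release the hold on job $1 (assumption: the job is then restarted
# from its checkpoint, if the setup supports it).
release_job() {
    run qrls "$1"
}
```

For example, "keep_completed 300" followed by "hold_job 1234.headnode" (a hypothetical job id) prints the two commands that would be issued; the dry-run default makes it safe to try on a machine without Torque.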
The other thing I noticed was that it deletes the first checkpoint file as soon as you create the second checkpoint (I guess it thinks there is no need for the first checkpoint once a later one exists). This is helpful when we consider space usage.

You can try out these things. I might be wrong about all of these statements. I tried my best for days to make it work the way I wanted, but eventually I realized it wasn't going to (especially with the spool_as_final_name parameter), and so I gave up. Now I am trying to do it with just BLCR and some scripts. It would be great if it worked with Torque. If you succeed, please let us know how you did it. It would be really helpful if someone could help; I know for sure that there are people using Torque with BLCR support.

Good luck,
Sreedhar.

On Jan 27, 2012, at 1:04 PM, Lloyd Brown wrote:

> I have to apologize for being so dense, but it seems I still need a
> little help.
>
> Thanks to Al's and Sreedhar's help, I've been able to get the checkpoint
> files to be generated (either in TORQUEMOMDIR/checkpoints, or whatever I
> specify via "-c dir="). When the job ends (qdel, runs out of walltime,
> etc.), though, it sounds like it should be copied back to the pbs_server
> host somewhere, either where specified via configure or qmgr, or in
> PBSSERVERDIR/checkpoint by default. The thing is that while the
> checkpoints get deleted on the mom, they never show up on the server.
> This occurs both with and without "qmgr -c 's q queuename
> checkpoint_dir=..'", as described in the docs. I haven't tried
> recompiling the server with the config param Al mentioned yet.
>
> I'm still deciding whether I like the behavior of torque with respect to
> checkpointing, and whether it will fit with my users' use case, but
> right now, I can't replicate the behavior yet.
> > Lloyd Brown > Systems Administrator > Fulton Supercomputing Lab > Brigham Young University > http://marylou.byu.edu > > On 01/26/2012 02:43 PM, Sreedhar Manchu wrote: >> Hi, >> >> This is what I did to avoid checkpoint images going onto server node. >> >> Modify the pbs_mom's config file to specify what checkpointing directories are remotely mounted. This can be done by adding something like: >> >> $remote_checkpoint_dirs /opt/torque/checkpoint >> >> Here /opt/torque/checkpoint is remotely mounted onto /opt/torque/checkpoint on each compute node. It doesn't have to be /opt/torque/checkpoint on server node. It can be any other directory on server node. I linked /opt/torque/checkpoint on server node to some other directory with lots of space. >> >> Best, >> Sreedhar. >> >> >> On Jan 26, 2012, at 4:30 PM, Lloyd Brown wrote: >> >>> Al, >>> >>> I tried a number of combinations of params, but after your last email, I >>> tried it with "-c periodic,interval=x", and I do see the checkpoint >>> being created in the TORQUEMOMHOME/checkpoints directory. I haven't >>> been able to test beyond that, since some other things go up. >>> >>>> From what you've said, though, I have to ask if there's any way to >>> specify where the checkpoint goes, especially when it would otherwise be >>> copied back to the host where pbs_server is running. You see, our use >>> case involves checkpointing some really big-memory (eg. 256 GB) >>> processes, and we simply don't have the space to store that on the >>> pbs_server host. >>> >>> >>> >>> Lloyd Brown >>> Systems Administrator >>> Fulton Supercomputing Lab >>> Brigham Young University >>> http://marylou.byu.edu >>> >>> On 01/26/2012 11:18 AM, Al Taufer wrote: >>>> Are you just using the "-c interval=x"? If so that just specifies what the checkpoint interval is but it does not enable the checkpointing. Try changing it to "-c periodic,interval=x". >>>> >>>> ----- Original Message ----- >>>>> Al, >>>>> >>>>> Thanks for the update. 
I guess the use case our users are really >>>>> after >>>>> is to have either a one-time or a periodic checkpoint, with the wait >>>>> time before the checkpoint specified by the user. The "-c interval=" >>>>> parameter to qsub makes it look like this should work. But when I >>>>> did >>>>> that, I couldn't get the job to actually checkpoint without manually >>>>> calling qhold/qchkpt. Maybe I'm just misinterpreting something, or >>>>> don't have it set up right, but the idea here is to not require the >>>>> users to manually checkpoint their job. >>>>> >>>>> Lloyd Brown >>>>> Systems Administrator >>>>> Fulton Supercomputing Lab >>>>> Brigham Young University >>>>> http://marylou.byu.edu >>>>> >>>>> On 01/26/2012 09:55 AM, Al Taufer wrote: >>>>>> >>>>>> ----- Original Message ----- >>>>>>> Can anyone enlighten me on the current state of BLCR-style >>>>>>> checkpointing >>>>>>> in Torque? I've been trying to get it to work, and so far, I see >>>>>>> that >>>>>>> it's invoking my checkpoint script, that script calls >>>>>>> cr_checkpoint, >>>>>>> and >>>>>>> the checkpoint files/directories are created, but something is >>>>>>> calling >>>>>>> the mom_checkpoint_delete_files function, which in turn calls >>>>>>> delete_blcr_files, and the checkpoints get deleted. >>>>>> >>>>>> I hope you are seeing normal behavior. If I remember correctly, >>>>>> when a job gets checkpointed, the checkpoint files remain on the >>>>>> mom until the mom completes the job or until the job is put on >>>>>> hold and is no longer on the mom. At that time the checkpoint >>>>>> files are transferred to the server where they remain until the >>>>>> job is removed from the server. When the job gets restarted, >>>>>> which may or may not be on the original mom node, the checkpoint >>>>>> files are transferred to the mom which can then restart the job >>>>>> from the checkpoint file. 
>>>>>> >>>>>>> >>>>>>> Also, when I do a "qhold" on my job to try to initiate the >>>>>>> checkpoint, >>>>>>> is it really supposed to terminate my job? Perhaps that's >>>>>>> related, >>>>>>> eg. >>>>>>> the job is ending so the files get cleaned up. >>>>>> >>>>>> qhold is behaving as designed and as documented in its man page. >>>>>> If you want to just checkpoint the job and allow it to continue >>>>>> running, use qchkpt. >>>>>> >>>>>>> >>>>>>> Basically, does anyone have it working, and can give me advice? >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> -- >>>>>>> Lloyd Brown >>>>>>> Systems Administrator >>>>>>> Fulton Supercomputing Lab >>>>>>> Brigham Young University >>>>>>> http://marylou.byu.edu >>>>>>> _______________________________________________ >>>>>>> torqueusers mailing list >>>>>>> torqueusers at supercluster.org >>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>>>>> >>>>>> _______________________________________________ >>>>>> torqueusers mailing list >>>>>> torqueusers at supercluster.org >>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>>> _______________________________________________ >>>>> torqueusers mailing list >>>>> torqueusers at supercluster.org >>>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>>> >>>> _______________________________________________ >>>> torqueusers mailing list >>>> torqueusers at supercluster.org >>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers 
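For reference, the qsub options quoted in this thread ("-c periodic,interval=x" and "-c dir=") could be assembled as in the sketch below. The function names are hypothetical, and combining dir= with the other two options in one comma-separated list is an assumption; the thread only shows them separately. The script just builds and prints the command line, it does not submit anything.

```shell
#!/bin/sh
# Sketch only: build the value for qsub's -c option.
#   $1 = checkpoint interval (passed through unchanged)
#   $2 = optional checkpoint directory
checkpoint_opts() {
    opts="periodic,interval=$1"
    if [ -n "${2:-}" ]; then
        opts="$opts,dir=$2"
    fi
    echo "$opts"
}

# Print the full qsub command line for a job script.
#   $1 = job script, $2 = interval, $3 = optional directory
qsub_cmd() {
    echo "qsub -c $(checkpoint_opts "$2" "${3:-}") $1"
}
```

Calling "qsub_cmd job.sh 30 /scratch/someuser/ckpt" (hypothetical script name and path) prints the command for review before anyone runs it. Without "periodic", as noted in the quoted reply, the interval alone does not enable checkpointing, which is why the sketch always includes it.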
---
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110

From torque4.mailinglist at gmail.com Sun Jan 29 22:54:53 2012
From: torque4.mailinglist at gmail.com (Torque4 User)
Date: Mon, 30 Jan 2012 13:54:53 +0800
Subject: [torqueusers] Error in launching pbs_mom on Cygwin
Message-ID:

I managed to compile Torque 3.0.4 on Cygwin 1.7.9 but am unable to launch pbs_mom. Has anyone successfully run pbs_mom as a service on Windows 7? I'd appreciate any help and pointers.

Here is what I did:

$ ./configure --disable-daemons --disable-unixsockets --disable-gcc-warnings
$ make
$ make install
$ uname -a
CYGWIN_NT-6.1-WOW64 AdaptiveC-THINK 1.7.9(0.237/5/3) 2011-03-29 10:10 i686 Cygwin
$ ./contrib/AddPrivileges --add mom
mkpasswd (103): [1355] The specified domain either does not exist or could not be contacted.
mkgroup (102): [1355] The specified domain either does not exist or could not be contacted.
Adaptive .C is a local administrator
Reboot your computer that the SeCreateTokenPrivilege has taken effect
Passwd&Group files and additional privileges were set successfully
Should run 'editrights -l -u UserAdmin'
to ascertain of the privileges installation
Warning!!! You have to understand that the installing of additional privileges
can decrease your OS security level
$ cygrunsrv.exe -I pbs_mom -p /usr/local/sbin/pbs_mom.exe -u Administrator -w password

When I tried to start it as a service, the Event Properties reported the following:

The description for Event ID 0 from source pbs_mom cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer. If the event originated on another computer, the display information had to be saved with the event.
The following information was included with the event:

pbs_mom: PID 6720: service `pbs_mom' failed: signal 6 raised

$ /usr/local/sbin/pbs_mom
pbs_mom: LOG_ERROR::IamRoot, Cann`t get the name of the primary domain controller
MOM is up
assertion "stream >= 0" failed: file "../Libdis/diswul.c", line 130, function: diswul
Aborted (core dumped)

From j.blank at fz-juelich.de Mon Jan 30 06:18:40 2012
From: j.blank at fz-juelich.de (Joerg Blank)
Date: Mon, 30 Jan 2012 14:18:40 +0100
Subject: [torqueusers] Torque4 NUMA
In-Reply-To:
References:
Message-ID:

Hello,

Thanks for your answers.

> You are correct in that you should use the nodes=X syntax. I wonder how
> meaningful your NUMA test can be for a node that has only one cpu and
> one NUMA node, but that is a separate concern. I'm sorry the docs
> haven't been updated yet, I will get that updated ASAP.

I built a new Debian package for Torque4 and was testing the installation. I figured I could configure it and see if everything connects, etc. No performance tests were done on this virtual machine ;-)

> I'm surprised you're having trouble getting the mom to connect to the
> server. What are the errors you're getting here?

I figured out that the num_numa_nodes parameter was also renamed.
Is there a way to have non-NUMA nodes in the same cluster?

> This can't have anything to do with the mom.layout file. TORQUE does not
> edit the permissions of these directories, it just attempts to access
> them. What was the error?

I removed --enable-cpuset and --enable-libcpuset from the configure call. Do I still get process binding?
Regards,

Jörg Blank

From listsarnau at gmail.com Mon Jan 30 07:45:20 2012
From: listsarnau at gmail.com (Arnau Bria)
Date: Mon, 30 Jan 2012 15:45:20 +0100
Subject: [torqueusers] limiting resource usage with torque
In-Reply-To: <4F1E021A.2080409@unimelb.edu.au>
References: <20111215122848.6eab11c0@amarrosa.pic.es> <4F18D529.5050208@unimelb.edu.au> <20120120102812.01ffcf2b@amarrosa.pic.es> <4F1E021A.2080409@unimelb.edu.au>
Message-ID: <20120130154520.2e1fddb3@amarrosa.pic.es>

On Tue, 24 Jan 2012 11:58:02 +1100 Christopher Samuel wrote:

Hi Christopher,

Sorry for the delay, busy week!

> My take is that the scheduler will try and reserve whatever you ask it
> for (and defer if it cannot), and the queuing system should enforce
> your request as a limit (if it can).

This is the behaviour I'm seeing, but not the one I want :-)
I don't really want that reservation (I'd rather give the job some extra freedom in resource usage).

I really appreciate the explanation you gave me in this mail. I now completely understand how Torque/MAUI behaves when the -l option is added to qsub. In Torque/MAUI it is not possible to kill a job that has used more than X resources without a MAUI reservation...

Many thanks for your reply!

Cheers,
Arnau

From dbeer at adaptivecomputing.com Mon Jan 30 09:32:18 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 30 Jan 2012 09:32:18 -0700
Subject: [torqueusers] Torque4 NUMA
In-Reply-To:
References:
Message-ID:

Joerg,

See below.

On Mon, Jan 30, 2012 at 6:18 AM, Joerg Blank wrote:

> Hello,
>
> Thanks for your answers.
>
> > You are correct in that you should use the nodes=X syntax. I wonder how
> > meaningful your NUMA test can be for a node that has only one cpu and
> > one NUMA node, but that is a separate concern. I'm sorry the docs
> > haven't been updated yet, I will get that updated ASAP.
>
> I brew a new Debian package for Torque4 and was testing the
> installation. I figured I could configure it and see if everything
> connects, etc.
No performance test were done on this virtual machine ;-)
>
>
Understood.

> > I'm surprised you're having trouble getting the mom to connect to the
> > server. What are the errors you're getting here?
>
> I figured out that the num_numa_nodes parameter was also renamed.
> Is there a way to have non-NUMA nodes in the same cluster?
>
>
Sorry about that, I forgot to point that out to you as well: it should be num_node_boards. I will also get this updated in the documentation.

> > This can't have anything to do with the mom.layout file. TORQUE does not
> > edit the permissions of these directories, it just attempts to access
> > them. What was the error?
>
> I removed --enable-cpuset and --enable-libcpuset from the configure
> call. Do I still get process binding?
>
>
Yes, you will still get process binding without these lines.

> Regards,
>
> Jörg Blank
>

--
David Beer | Software Engineer
Adaptive Computing

From dbeer at adaptivecomputing.com Mon Jan 30 13:55:18 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 30 Jan 2012 13:55:18 -0700
Subject: [torqueusers] Beta Update - NUMA
Message-ID:

All,

A new TORQUE 4.0 build was cut today. This build has the most up-to-date fixes for the beta, and in particular it has important fixes for those of you using the NUMA builds of TORQUE. Two of the fixes are:

1. Fixing a compile bug.
2. Previously, only the first node board was reporting in, but now this is resolved.

Special thanks to Peter Enstrom for helping us through these, and thanks to everyone for your beta feedback.
It can be downloaded here:
http://www.adaptivecomputing.com/resources/downloads/torque/4.0-beta/torque-4.0.0-snap.201201301347.tar.gz

--
David Beer | Software Engineer
Adaptive Computing

From sm4082 at nyu.edu Mon Jan 30 15:19:36 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Mon, 30 Jan 2012 17:19:36 -0500
Subject: [torqueusers] Setting up checkpointing
In-Reply-To: <6b070153-4e0f-4ec3-bcf7-69d2c29524e2@mail>
References: <6b070153-4e0f-4ec3-bcf7-69d2c29524e2@mail>
Message-ID: <7EE16CDA-542B-4A1B-BF48-2944647993DA@nyu.edu>

Hi Al,

Is there a way to keep checkpoint files off both the compute nodes and the head node? I mean, I want them to go into the users' working directories. We have /scratch space mounted on the compute nodes but not on the head node. Overall, we have little space on the head node as well as on the compute nodes. If the jobs are huge, I'm afraid the checkpoint images might eventually fill all the space and lead to job failures.

I know that qsub -c dir= puts the file in the specified path. If we do this, does the server still keep the checkpoint image (this directory is remotely mounted on the compute nodes), or does it stay only in the path given after dir=?

I appreciate your help.

Thanks,
Sreedhar.

On Jan 26, 2012, at 5:31 PM, Al Taufer wrote:

> This is a good method for accomplishing what is wanted. The only thing I can add is that when you configure the server you could use the --with-servchkptdir option to specify where the server will keep its checkpoint files, which can be a remotely mounted path.
>
> ----- Original Message -----
>> Hi,
>>
>> This is what I did to avoid checkpoint images going onto server node.
>>
>> Modify the pbs_mom's config file to specify what checkpointing
>> directories are remotely mounted.
This can be done by adding >> something like: >> >> $remote_checkpoint_dirs /opt/torque/checkpoint >> >> Here /opt/torque/checkpoint is remotely mounted onto >> /opt/torque/checkpoint on each compute node. It doesn't have to be >> /opt/torque/checkpoint on server node. It can be any other directory >> on server node. I linked /opt/torque/checkpoint on server node to >> some other directory with lots of space. >> >> Best, >> Sreedhar. >> >> >> On Jan 26, 2012, at 4:30 PM, Lloyd Brown wrote: >> >>> Al, >>> >>> I tried a number of combinations of params, but after your last >>> email, I >>> tried it with "-c periodic,interval=x", and I do see the checkpoint >>> being created in the TORQUEMOMHOME/checkpoints directory. I >>> haven't >>> been able to test beyond that, since some other things go up. >>> >>>> From what you've said, though, I have to ask if there's any way to >>> specify where the checkpoint goes, especially when it would >>> otherwise be >>> copied back to the host where pbs_server is running. You see, our >>> use >>> case involves checkpointing some really big-memory (eg. 256 GB) >>> processes, and we simply don't have the space to store that on the >>> pbs_server host. >>> >>> >>> >>> Lloyd Brown >>> Systems Administrator >>> Fulton Supercomputing Lab >>> Brigham Young University >>> http://marylou.byu.edu >>> >>> On 01/26/2012 11:18 AM, Al Taufer wrote: >>>> Are you just using the "-c interval=x"? If so that just specifies >>>> what the checkpoint interval is but it does not enable the >>>> checkpointing. Try changing it to "-c periodic,interval=x". >>>> >>>> ----- Original Message ----- >>>>> Al, >>>>> >>>>> Thanks for the update. I guess the use case our users are really >>>>> after >>>>> is to have either a one-time or a periodic checkpoint, with the >>>>> wait >>>>> time before the checkpoint specified by the user. The "-c >>>>> interval=" >>>>> parameter to qsub makes it look like this should work. 
But when >>>>> I >>>>> did >>>>> that, I couldn't get the job to actually checkpoint without >>>>> manually >>>>> calling qhold/qchkpt. Maybe I'm just misinterpreting something, >>>>> or >>>>> don't have it set up right, but the idea here is to not require >>>>> the >>>>> users to manually checkpoint their job. >>>>> >>>>> Lloyd Brown >>>>> Systems Administrator >>>>> Fulton Supercomputing Lab >>>>> Brigham Young University >>>>> http://marylou.byu.edu >>>>> >>>>> On 01/26/2012 09:55 AM, Al Taufer wrote: >>>>>> >>>>>> ----- Original Message ----- >>>>>>> Can anyone enlighten me on the current state of BLCR-style >>>>>>> checkpointing >>>>>>> in Torque? I've been trying to get it to work, and so far, I >>>>>>> see >>>>>>> that >>>>>>> it's invoking my checkpoint script, that script calls >>>>>>> cr_checkpoint, >>>>>>> and >>>>>>> the checkpoint files/directories are created, but something is >>>>>>> calling >>>>>>> the mom_checkpoint_delete_files function, which in turn calls >>>>>>> delete_blcr_files, and the checkpoints get deleted. >>>>>> >>>>>> I hope you are seeing normal behavior. If I remember correctly, >>>>>> when a job gets checkpointed, the checkpoint files remain on the >>>>>> mom until the mom completes the job or until the job is put on >>>>>> hold and is no longer on the mom. At that time the checkpoint >>>>>> files are transferred to the server where they remain until the >>>>>> job is removed from the server. When the job gets restarted, >>>>>> which may or may not be on the original mom node, the checkpoint >>>>>> files are transferred to the mom which can then restart the job >>>>>> from the checkpoint file. >>>>>> >>>>>>> >>>>>>> Also, when I do a "qhold" on my job to try to initiate the >>>>>>> checkpoint, >>>>>>> is it really supposed to terminate my job? Perhaps that's >>>>>>> related, >>>>>>> eg. >>>>>>> the job is ending so the files get cleaned up. >>>>>> >>>>>> qhold is behaving as designed and as documented in its man page. 
>>>>>> If you want to just checkpoint the job and allow it to continue >>>>>> running, use qchkpt. >>>>>> >>>>>>> >>>>>>> Basically, does anyone have it working, and can give me advice? >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> -- >>>>>>> Lloyd Brown >>>>>>> Systems Administrator >>>>>>> Fulton Supercomputing Lab >>>>>>> Brigham Young University >>>>>>> http://marylou.byu.edu >>>>>>> _______________________________________________ >>>>>>> torqueusers mailing list >>>>>>> torqueusers at supercluster.org >>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>>>>> >>>>>> _______________________________________________ >>>>>> torqueusers mailing list >>>>>> torqueusers at supercluster.org >>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>>> _______________________________________________ >>>>> torqueusers mailing list >>>>> torqueusers at supercluster.org >>>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>>> >>>> _______________________________________________ >>>> torqueusers mailing list >>>> torqueusers at supercluster.org >>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From sm4082 at nyu.edu Tue Jan 31 08:32:40 2012 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Tue, 31 Jan 2012 10:32:40 -0500 Subject: [torqueusers] Checkpoint script failed with return value of 13 Message-ID: <69382EE8-FE5F-4D28-99E7-D56E196D7A6C@nyu.edu> Hi, When I try to checkpoint a simple job I see the error Checkpoint 
script failed with return value of 13

in qstat -f output.

I see this in the system messages:

Jan 31 10:09:04 compute-4-14 pbs_mom: LOG_ERROR::Operation not permitted (1) in blcr_checkpoint_job, cannot change checkpoint directory owner
Jan 31 10:09:04 compute-4-14 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 13
Jan 31 10:09:37 compute-4-14 pbs_mom: LOG_ERROR::Operation not permitted (1) in blcr_checkpoint_job, cannot change checkpoint directory owner
Jan 31 10:09:37 compute-4-14 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 13

I found this in checkpoint_script:

# Note also that a request was made to identify whether this script was invoked
# by the job's owner or by a system administrator. While this information is
# known to pbs_server, it is not propagated to pbs_mom and thus it is not
# possible to pass this to the script. Therefore, a workaround is to invoke
# qmgr and attempt to set a trivial variable. This will fail if the invoker is
# not a manager.

Does anyone know what exactly I need to do here? I am not sure what trivial variable I need to set with qmgr.

Our server attributes:

# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = crunch.its.nyu.edu
set server acl_hosts += crunch.local
set server managers = root at crunch.local
set server operators = root at crunch.local
set server default_queue = route
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server submit_hosts = login-0-1
set server submit_hosts += login-0-0
set server submit_hosts += login-0-3
set server submit_hosts += login-0-2
set server allow_node_submit = False
set server next_job_number = 139165

If anyone knows how to get around this error, please let me know. I'd appreciate your help.

Thanks,
Sreedhar.
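The log lines above say that pbs_mom could not change the owner of the checkpoint directory. One way to check for that condition by hand is sketched below. It is an illustration only: the function name can_chown and the probe directory name are made up, and the idea that root on a compute node is being denied chown on a remotely mounted directory is an assumption about the cause, not something the log proves.

```shell
#!/bin/sh
# Sketch only: test whether the current user can change the owner of a
# newly created directory under $1 to user $2.
# Returns 0 if chown works, 1 if chown is refused, 2 if the probe
# directory cannot be created at all.
can_chown() {
    probe="$1/.chown_probe.$$"
    mkdir "$probe" 2>/dev/null || return 2
    if chown "$2" "$probe" 2>/dev/null; then
        status=0
    else
        status=1
    fi
    rmdir "$probe"
    return "$status"
}
```

Run as root on a compute node, something like "can_chown /opt/torque/checkpoint someuser; echo $?" (hypothetical path and user) would show whether the operation pbs_mom attempts is permitted on that filesystem.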
---
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110

From sm4082 at nyu.edu Tue Jan 31 09:24:23 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Tue, 31 Jan 2012 11:24:23 -0500
Subject: [torqueusers] Checkpoint script failed with return value of 13
In-Reply-To: <69382EE8-FE5F-4D28-99E7-D56E196D7A6C@nyu.edu>
References: <69382EE8-FE5F-4D28-99E7-D56E196D7A6C@nyu.edu>
Message-ID:

Adding no_root_squash to /etc/exports fixed the issue.

Sreedhar.

On Jan 31, 2012, at 10:32 AM, Sreedhar Manchu wrote:

> Hi,
>
> When I try to checkpoint a simple job I see the error
>
> Checkpoint script failed with return value of 13
>
> in qstat -f output.
>
> I see this in system messages
>
> Jan 31 10:09:04 compute-4-14 pbs_mom: LOG_ERROR::Operation not permitted (1) in blcr_checkpoint_job, cannot change checkpoint directory owner
> Jan 31 10:09:04 compute-4-14 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 13
> Jan 31 10:09:37 compute-4-14 pbs_mom: LOG_ERROR::Operation not permitted (1) in blcr_checkpoint_job, cannot change checkpoint directory owner
> Jan 31 10:09:37 compute-4-14 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 13
>
> I found this in checkpoint_script.
>
> # Note also that a request was made to identify whether this script was invoked
> # by the job's owner or by a system administrator. While this information is
> # known to pbs_server, it is not propagated to pbs_mom and thus it is not
> # possible to pass this to the script. Therefore, a workaround is to invoke
> # qmgr and attempt to set a trivial variable. This will fail if the invoker is
> # not a manager.
>
> Anyone know what exactly do I need to do here? I am not sure what trivial variable I need to set wtih qmgr.
>
> Our Server Attributes:
>
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_host_enable = False
> set server acl_hosts = crunch.its.nyu.edu
> set server acl_hosts += crunch.local
> set server managers = root at crunch.local
> set server operators = root at crunch.local
> set server default_queue = route
> set server log_events = 511
> set server mail_from = adm
> set server query_other_jobs = True
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server mom_job_sync = True
> set server submit_hosts = login-0-1
> set server submit_hosts += login-0-0
> set server submit_hosts += login-0-3
> set server submit_hosts += login-0-2
> set server allow_node_submit = False
> set server next_job_number = 139165
>
> If anyone knows how to get around this error, please let me know. I'd appreciate your help.
>
> Thanks,
> Sreedhar.
>
> ---
> Sreedhar Manchu
> HPC Support Specialist
> New York University
> 251 Mercer Street
> New York, NY 10012-1110

---
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110

From ataufer at adaptivecomputing.com Tue Jan 31 09:40:17 2012
From: ataufer at adaptivecomputing.com (Al Taufer)
Date: Tue, 31 Jan 2012 09:40:17 -0700 (MST)
Subject: [torqueusers] Setting up checkpointing
In-Reply-To: <7EE16CDA-542B-4A1B-BF48-2944647993DA@nyu.edu>
Message-ID: <96d5236c-1550-443d-8c8c-571b13d0be6a@mail>

I am not sure, but I don't think it's currently possible. The server always wants to transfer the checkpoint file back to its checkpoint directory. If this is a remotely accessible path that the compute nodes are set up to use, then the actual transfer will not happen.

----- Original Message -----
> Hi Al,
>
> Is there a way to make checkpoint files not stay on either compute
> nodes or head node?
I mean I want them to go into users' working > directories. We have /scratch space mounted on compute nodes but not > on head node. Over all, we have less space on head node as well as > on compute nodes. If the jobs are huge I'm afraid checkpoint images > might occupy all the space eventually leading to job failures. > > I know that qsub -c dir= puts the file in the > specified path. If we do this, does server still keep the checkpoint > image on it ( this directory is remotely mounted on to compute > nodes) or it stays just in the path specified next to dir. > > I appreciate your help. > > Thanks, > Sreedhar. > > On Jan 26, 2012, at 5:31 PM, Al Taufer wrote: > > > This is a good method for accomplishing what is wanted. The only > > thing I can add is that when you configure the server you could > > use the --with-servchkptdir option to specify where the server > > will keep its checkpoint files, which can be a remotely mounted > > path. > > > > ----- Original Message ----- > >> Hi, > >> > >> This is what I did to avoid checkpoint images going onto server > >> node. > >> > >> Modify the pbs_mom's config file to specify what checkpointing > >> directories are remotely mounted. This can be done by adding > >> something like: > >> > >> $remote_checkpoint_dirs /opt/torque/checkpoint > >> > >> Here /opt/torque/checkpoint is remotely mounted onto > >> /opt/torque/checkpoint on each compute node. It doesn't have to be > >> /opt/torque/checkpoint on server node. It can be any other > >> directory > >> on server node. I linked /opt/torque/checkpoint on server node to > >> some other directory with lots of space. > >> > >> Best, > >> Sreedhar. > >> > >> > >> On Jan 26, 2012, at 4:30 PM, Lloyd Brown wrote: > >> > >>> Al, > >>> > >>> I tried a number of combinations of params, but after your last > >>> email, I > >>> tried it with "-c periodic,interval=x", and I do see the > >>> checkpoint > >>> being created in the TORQUEMOMHOME/checkpoints directory. 
> >>> I haven't been able to test beyond that, since some other things
> >>> go up.
> >>>
> >>> From what you've said, though, I have to ask if there's any way to
> >>> specify where the checkpoint goes, especially when it would
> >>> otherwise be copied back to the host where pbs_server is running.
> >>> You see, our use case involves checkpointing some really big-memory
> >>> (e.g. 256 GB) processes, and we simply don't have the space to
> >>> store that on the pbs_server host.
> >>>
> >>>
> >>>
> >>> Lloyd Brown
> >>> Systems Administrator
> >>> Fulton Supercomputing Lab
> >>> Brigham Young University
> >>> http://marylou.byu.edu
> >>>
> >>> On 01/26/2012 11:18 AM, Al Taufer wrote:
> >>>> Are you just using the "-c interval=x"? If so that just specifies
> >>>> what the checkpoint interval is but it does not enable the
> >>>> checkpointing. Try changing it to "-c periodic,interval=x".
> >>>>
> >>>> ----- Original Message -----
> >>>>> Al,
> >>>>>
> >>>>> Thanks for the update. I guess the use case our users are really
> >>>>> after is to have either a one-time or a periodic checkpoint, with
> >>>>> the wait time before the checkpoint specified by the user. The
> >>>>> "-c interval=" parameter to qsub makes it look like this should
> >>>>> work. But when I did that, I couldn't get the job to actually
> >>>>> checkpoint without manually calling qhold/qchkpt. Maybe I'm just
> >>>>> misinterpreting something, or don't have it set up right, but the
> >>>>> idea here is to not require the users to manually checkpoint
> >>>>> their job.
> >>>>>
> >>>>> Lloyd Brown
> >>>>> Systems Administrator
> >>>>> Fulton Supercomputing Lab
> >>>>> Brigham Young University
> >>>>> http://marylou.byu.edu
> >>>>>
> >>>>> On 01/26/2012 09:55 AM, Al Taufer wrote:
> >>>>>>
> >>>>>> ----- Original Message -----
> >>>>>>> Can anyone enlighten me on the current state of BLCR-style
> >>>>>>> checkpointing in Torque? I've been trying to get it to work,
> >>>>>>> and so far, I see that it's invoking my checkpoint script, that
> >>>>>>> script calls cr_checkpoint, and the checkpoint
> >>>>>>> files/directories are created, but something is calling the
> >>>>>>> mom_checkpoint_delete_files function, which in turn calls
> >>>>>>> delete_blcr_files, and the checkpoints get deleted.
> >>>>>>
> >>>>>> I hope you are seeing normal behavior. If I remember correctly,
> >>>>>> when a job gets checkpointed, the checkpoint files remain on the
> >>>>>> mom until the mom completes the job or until the job is put on
> >>>>>> hold and is no longer on the mom. At that time the checkpoint
> >>>>>> files are transferred to the server where they remain until the
> >>>>>> job is removed from the server. When the job gets restarted,
> >>>>>> which may or may not be on the original mom node, the checkpoint
> >>>>>> files are transferred to the mom which can then restart the job
> >>>>>> from the checkpoint file.
> >>>>>>
> >>>>>>>
> >>>>>>> Also, when I do a "qhold" on my job to try to initiate the
> >>>>>>> checkpoint, is it really supposed to terminate my job? Perhaps
> >>>>>>> that's related, e.g. the job is ending so the files get cleaned
> >>>>>>> up.
> >>>>>>
> >>>>>> qhold is behaving as designed and as documented in its man page.
> >>>>>> If you want to just checkpoint the job and allow it to continue
> >>>>>> running, use qchkpt.
> >>>>>>
> >>>>>>>
> >>>>>>> Basically, does anyone have it working, and can give me
> >>>>>>> advice?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> --
> >>>>>>> Lloyd Brown
> >>>>>>> Systems Administrator
> >>>>>>> Fulton Supercomputing Lab
> >>>>>>> Brigham Young University
> >>>>>>> http://marylou.byu.edu
> >>>>>>> _______________________________________________
> >>>>>>> torqueusers mailing list
> >>>>>>> torqueusers at supercluster.org
> >>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
> >>>>>>>

From dbeer at adaptivecomputing.com Tue Jan 31 13:04:44 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 31 Jan 2012 13:04:44 -0700
Subject: [torqueusers] Beta Update - NUMA
In-Reply-To:
References:
Message-ID:

All,

There was a problem with the build released yesterday (a non-dist file was
referenced in distribution code). This has been corrected, so please use
this link instead:
http://www.adaptivecomputing.com/resources/downloads/torque/4.0-beta/torque-4.0.0-snap.201201311301.tar.gz

On Mon, Jan 30, 2012 at 1:55 PM, David Beer wrote:

> All,
>
> A new TORQUE 4.0 build was cut today. This build has the most up-to-date
> fixes for the beta, and in particular the important fixes for those of
> you using the NUMA builds of TORQUE. Two of the fixes include:
> 1. Fixing a compile bug.
> 2. Previously, only the first node board was reporting in, but now this is
> resolved.
>
> Special thanks to Peter Enstrom for helping us through these, and thanks
> to everyone for your beta feedback.
>
> It can be downloaded here:
> http://www.adaptivecomputing.com/resources/downloads/torque/4.0-beta/torque-4.0.0-snap.201201301347.tar.gz
>
> --
> David Beer | Software Engineer
> Adaptive Computing
>
>

--
David Beer | Software Engineer
Adaptive Computing