From levin108 at gmail.com Sat Dec 1 07:26:59 2012
From: levin108 at gmail.com (levin li)
Date: Sat, 01 Dec 2012 22:26:59 +0800
Subject: [torqueusers] [torquedev] Proposed Migration to GitHub
In-Reply-To:
References:
Message-ID: <50BA13B3.8070901@gmail.com>

David,

I support this idea. GitHub makes it easy for developers to fork this
project, submit patches, and file issues. We have many open source
projects on GitHub, and it makes it easier for developers to join the
projects hosted there. I also think git is a more powerful tool than
subversion, and it provides many useful functions. I like the workflow
of submitting patches to the mailing list using 'git send-email', which
makes it easier for others to review them.

So, I think it's a great idea. If TORQUE were hosted on GitHub, I would
be more inclined to submit patches to make it better.

Thanks,
Levin

On 2012-12-01 05:54, David Beer wrote:
> All,
>
> We have been evaluating options for better coordinating our development
> efforts with the community, and we're very interested in migrating the
> TORQUE project to github. Github is a website for hosting open source
> projects and coordinating efforts in the open source community. Feel
> free to look around and check what the website offers - it's essentially
> an integration of hosting the project and tracking the issues in the
> project, along with many 'social' tools designed to help coordinate
> efforts, such as following a developer or a project.
>
> The big caveat to this is that changing to github would discontinue use
> of subversion and switch us to git. (FYI: Discontinue means no new
> check-ins, not downing the server) There are many websites that document
> the advantages of using git vs. using subversion, but for those that are
> unfamiliar with it there aren't always 1-to-1 switches for commands.
> However, for the most common things that our community members do -
> checking out and creating diffs - it works pretty much the same way.
> > What are your thoughts on this switch? We believe that this switch will > greatly benefit the community. > > -- > David Beer | Senior Software Engineer > Adaptive Computing > > > > _______________________________________________ > torquedev mailing list > torquedev at supercluster.org > http://www.supercluster.org/mailman/listinfo/torquedev > From basv at sara.nl Sat Dec 1 11:45:24 2012 From: basv at sara.nl (Bas van der Vlies) Date: Sat, 1 Dec 2012 18:45:24 +0000 Subject: [torqueusers] [torquedev] Proposed Migration to GitHub In-Reply-To: References: Message-ID: <74EB35DC444C754DA400390F673C23D6083EF8CD@sara-exch-02.ka.sara.nl> +1, Just a question are the maui source still available with svn or also available at github? On 30 nov. 2012, at 22:54, David Beer wrote: > All, > > We have been evaluating options for better coordinating our development efforts with the community, and we're very interested in migrating the TORQUE project to github. Github is a website for hosting open source projects and coordinating efforts in the open source community. Feel free to look around and check what the website offers - its essentially an integration of hosting the project and tracking the issues in the project, along with many 'social' tools designed to help coordinate efforts, such as following a developer or a project. > > The big caveat to this is that changing to github would discontinue use of subversion and switch us to git. (FYI: Discontinue means no new check-ins, not downing the server) There are many websites that document the advantages of using git vs. using subversion, but for those that are unfamiliar with it there aren't always 1-to-1 switches for commands. However, for the most common things that our community members do - checking out and creating diffs - it works pretty much the same way. > > What are your thoughts on this switch? We believe that this switch will greatly benefit the community. 
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>
> _______________________________________________
> torquedev mailing list
> torquedev at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torquedev

--
Bas van der Vlies
mail: basv at sara.nl
SARA - Academic Computing Services, Amsterdam, The Netherlands

From craig.tierney at noaa.gov Sun Dec 2 11:24:00 2012
From: craig.tierney at noaa.gov (Craig Tierney - NOAA Affiliate)
Date: Sun, 2 Dec 2012 11:24:00 -0700
Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat
Message-ID:

Hello all,

I have a question for Torque users regarding the display of completed
jobs in qstat. Do others find that listing completed jobs by default in
qstat makes finding things in the output much more difficult and is
completely unnecessary? Having the completed jobs in qstat can also
significantly slow down qstat if you have a lot (thousands) of completed
jobs, which is another hassle.

I am asking this because I need to be able to get error codes from
completed jobs (for minutes to hours after completion). To do this,
they have to still be in the queue. This function is very important,
but not to anyone who runs qstat by hand. Grid Engine had a way to get
completed jobs, but only when asked for.

Thanks,
Craig

From dbeer at adaptivecomputing.com Mon Dec 3 10:26:57 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 3 Dec 2012 10:26:57 -0700
Subject: [torqueusers] Proposed Migration to GitHub
In-Reply-To: <50B95C8F.3060100@unimelb.edu.au>
References: <50B95C8F.3060100@unimelb.edu.au>
Message-ID:

> People do worry, however, about having a centralised service handling
> such a number of projects, especially given their history of being
> hacked. :-)
>

Adaptive also plans to keep a local mirror of the repository in addition
to the mirrors that all developers here would have.

--
David Beer | Senior Software Engineer
Adaptive Computing

From dbeer at adaptivecomputing.com Mon Dec 3 11:09:00 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 3 Dec 2012 11:09:00 -0700
Subject: [torqueusers] [torquedev] Proposed Migration to GitHub
In-Reply-To: <74EB35DC444C754DA400390F673C23D6083EF8CD@sara-exch-02.ka.sara.nl>
References: <74EB35DC444C754DA400390F673C23D6083EF8CD@sara-exch-02.ka.sara.nl>
Message-ID:

On Sat, Dec 1, 2012 at 11:45 AM, Bas van der Vlies wrote:

> +1. Just a question: is the Maui source still available with svn, or
> also available at github?
>

AFAIK Maui continues to be maintained in svn. I'm not very knowledgeable
about how Maui is managed.

--
David Beer | Senior Software Engineer
Adaptive Computing

From dbeer at adaptivecomputing.com Mon Dec 3 11:18:10 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 3 Dec 2012 11:18:10 -0700
Subject: [torqueusers] Migration to GitHub
Message-ID:

All,

Last Friday we proposed that we migrate TORQUE to GitHub, and so far the
response has been a resounding yes. Barring major objections between now
and then, we are planning to flip the switch this Friday (December 7th)
in the morning. We'll send out a corresponding email. In addition to
announcing the switch, we'll be publishing:

- Contributor Guidelines
- Quick Start Guide
- Links to Tutorials for git and GitHub

Please let us know if there's anything else that you feel would be helpful.
We are optimistic that this shift will help us better communicate and coordinate our efforts. Cheers, -- David Beer | Senior Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121203/2d677ec0/attachment.html From gus at ldeo.columbia.edu Mon Dec 3 12:12:23 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 03 Dec 2012 14:12:23 -0500 Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat In-Reply-To: References: Message-ID: <50BCF997.1000900@ldeo.columbia.edu> On 12/02/2012 01:24 PM, Craig Tierney - NOAA Affiliate wrote: > Hello all, > > I have a question for Torque users regarding the display of completed > jobs in qstat. Do others find the listing of completed jobs by default > via qstat makes finding things in the output much more difficult and > completely unnecessary? Having the completed jobs in qstat can > significantly slow down qstat if you have a lot (thousands) of completed > jobs which is another hassle. > > I asking this because I need to be able to get error codes from > completed jobs (for minutes to hours after completion). To do this, > they have to still be in the queue. This function is very important, > but not to anyone who runs qstat by hand. Grid Engine had a way to get > completed jobs, but only when asked for. > > Thanks, > Craig > Hi Craig Well, we keep the completed jobs on the queue for a several hours, qmgr -c 'set server keep_completed = ...' Users here never complained, and seem to like to see queued, running, and completed jobs. The old/default time of 1200 seconds was too short. However, our clusters and the number of users are small, nothing like Zeus, so the clutter caused by keeping completed jobs on the queue for hours is not large. Would 'qstat -u username' or some other filtering help the annoyed users? 
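
A minimal sketch of what "some other filtering" could look like, offered as an illustration only. The helper name hide_completed is hypothetical. It assumes the default qstat listing: two header lines, one job per row, no spaces in job names, and the job state in the fifth whitespace-separated column. Other layouts (the alternate format printed by 'qstat -u', for instance) would need a different column number passed as the first argument.

```shell
# hide_completed: read qstat-style text on stdin and drop completed jobs.
# Assumption: the first two lines are headers and the state letter sits in
# column 5 (override by passing another column number as $1).
hide_completed() {
    # Keep the two header lines; keep a job row only if its state is not "C".
    awk -v col="${1:-5}" 'NR <= 2 || $col != "C"'
}

# Intended use (requires a TORQUE client on the PATH):
#   qstat | hide_completed
```

With something like this in a shell profile, keep_completed could stay at several hours for scripts that need exit codes, while interactive users see only queued and running jobs.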
Gus Correa From bdandrus at nps.edu Mon Dec 3 12:22:40 2012 From: bdandrus at nps.edu (Andrus, Brian Contractor) Date: Mon, 3 Dec 2012 19:22:40 +0000 Subject: [torqueusers] checkpoints and walltime Message-ID: All, I am seeing something odd. If someone request enough walltime to land in my short queue rather than my tiny queue (default), /etc/profile is not sourced. The setting that seems to make a difference is checkpoint_defaults. If that is enabled, various env variables are not set (HOSTNAME in particular) If it is disabled, everything seems to be set as expected. Is there something about checkpointing that affects environment settings that I am missing? Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 From craig.tierney at noaa.gov Mon Dec 3 12:27:49 2012 From: craig.tierney at noaa.gov (Craig Tierney - NOAA Affiliate) Date: Mon, 3 Dec 2012 12:27:49 -0700 Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat In-Reply-To: <50BCF997.1000900@ldeo.columbia.edu> References: <50BCF997.1000900@ldeo.columbia.edu> Message-ID: On Mon, Dec 3, 2012 at 12:12 PM, Gus Correa wrote: > On 12/02/2012 01:24 PM, Craig Tierney - NOAA Affiliate wrote: > > Hello all, > > > > I have a question for Torque users regarding the display of completed > > jobs in qstat. Do others find the listing of completed jobs by default > > via qstat makes finding things in the output much more difficult and > > completely unnecessary? Having the completed jobs in qstat can > > significantly slow down qstat if you have a lot (thousands) of completed > > jobs which is another hassle. > > > > I asking this because I need to be able to get error codes from > > completed jobs (for minutes to hours after completion). To do this, > > they have to still be in the queue. This function is very important, > > but not to anyone who runs qstat by hand. 
Grid Engine had a way to get
> > completed jobs, but only when asked for.
> >
> > Thanks,
> > Craig
> >
>
> Hi Craig
>
> Well, we keep the completed jobs on the queue for several hours,
> qmgr -c 'set server keep_completed = ...'
> Users here never complained, and seem to like
> to see queued, running, and completed jobs.
> The old/default time of 1200 seconds was too short.
> However, our clusters and the number of users are small,
> nothing like Zeus, so the clutter caused by keeping completed
> jobs on the queue for hours is not large.
> Would 'qstat -u username' or some other filtering
> help the annoyed users?
>

Gus,

We currently have keep_completed set to only 600 seconds, and that is
too short. We are running about 40k-50k jobs a day. While using
-u username would help, it still seems unnecessary. The jobs are not
evenly distributed between users. Some users will have hundreds in a
single workflow (which would run over a few hours).

I don't mind retraining users (e.g., to use -u), but the first thing I
would do as a user would be to write a wrapper to hide the completed
jobs, so I figure a better solution is in order. But breaking existing
functionality is not usually a good idea, which is why I was looking
for opinions.

I already have a small patch that hides the completed jobs by default
and adds -c to show them in case you care. But if the solution isn't
generally acceptable, I don't want to be patching my code all the time.

Craig

> Gus Correa
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121203/f3538d5c/attachment.html From glen.beane at gmail.com Mon Dec 3 12:55:44 2012 From: glen.beane at gmail.com (Glen Beane) Date: Mon, 3 Dec 2012 14:55:44 -0500 Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat In-Reply-To: References: Message-ID: On Sun, Dec 2, 2012 at 1:24 PM, Craig Tierney - NOAA Affiliate wrote: > Hello all, > > I have a question for Torque users regarding the display of completed jobs > in qstat. Do others find the listing of completed jobs by default via qstat > makes finding things in the output much more difficult and completely > unnecessary? Having the completed jobs in qstat can significantly slow down > qstat if you have a lot (thousands) of completed jobs which is another > hassle. > > I asking this because I need to be able to get error codes from completed > jobs (for minutes to hours after completion). To do this, they have to > still be in the queue. This function is very important, but not to anyone > who runs qstat by hand. Grid Engine had a way to get completed jobs, but > only when asked for. you can use the "tracejob" command to get information, including the exit status, of completed jobs. From craig.tierney at noaa.gov Mon Dec 3 13:00:16 2012 From: craig.tierney at noaa.gov (Craig Tierney - NOAA Affiliate) Date: Mon, 3 Dec 2012 13:00:16 -0700 Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat In-Reply-To: References: Message-ID: On Mon, Dec 3, 2012 at 12:55 PM, Glen Beane wrote: > On Sun, Dec 2, 2012 at 1:24 PM, Craig Tierney - NOAA Affiliate > wrote: > > Hello all, > > > > I have a question for Torque users regarding the display of completed > jobs > > in qstat. Do others find the listing of completed jobs by default via > qstat > > makes finding things in the output much more difficult and completely > > unnecessary? 
Having the completed jobs in qstat can significantly slow > down > > qstat if you have a lot (thousands) of completed jobs which is another > > hassle. > > > > I asking this because I need to be able to get error codes from completed > > jobs (for minutes to hours after completion). To do this, they have to > > still be in the queue. This function is very important, but not to > anyone > > who runs qstat by hand. Grid Engine had a way to get completed jobs, but > > only when asked for. > > > you can use the "tracejob" command to get information, including the > exit status, of completed jobs. > Glen, Doesn't this only work if I am root or I change the permissions on the files AND as a regular user I have access to the Torque server? Craig > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121203/077dfa5c/attachment.html From ezellma at ornl.gov Mon Dec 3 13:09:24 2012 From: ezellma at ornl.gov (Ezell, Matthew A.) Date: Mon, 3 Dec 2012 15:09:24 -0500 Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat In-Reply-To: Message-ID: On 12/2/12 1:24 PM, "Craig Tierney - NOAA Affiliate" > wrote: Hello all, I have a question for Torque users regarding the display of completed jobs in qstat. Do others find the listing of completed jobs by default via qstat makes finding things in the output much more difficult and completely unnecessary? Having the completed jobs in qstat can significantly slow down qstat if you have a lot (thousands) of completed jobs which is another hassle. I asking this because I need to be able to get error codes from completed jobs (for minutes to hours after completion). To do this, they have to still be in the queue. 
This function is very important, but not to anyone who runs qstat by hand. Grid Engine had a way to get completed jobs, but only when asked for. Thanks, Craig Users can run 'qstat -r' to get a list of running jobs or 'qstat -i' to get a list of queued/held/waiting jobs. My understanding is that once a job has been completed for more than keep_completed seconds, pbs_server forgets about it. Then, you have to go look in the logs. Alternatively, you could setup an epilogue to capture the exit code and funnel it to some user-accessible location (the job script, flat-file on a shared FS, database, etc). ~Matt --- Matt Ezell HPC Systems Administrator Oak Ridge National Laboratory From craig.tierney at noaa.gov Mon Dec 3 13:15:55 2012 From: craig.tierney at noaa.gov (Craig Tierney - NOAA Affiliate) Date: Mon, 3 Dec 2012 13:15:55 -0700 Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat In-Reply-To: References: Message-ID: On Mon, Dec 3, 2012 at 1:09 PM, Ezell, Matthew A. wrote: > On 12/2/12 1:24 PM, "Craig Tierney - NOAA Affiliate" < > craig.tierney at noaa.gov> wrote: > > Hello all, > > I have a question for Torque users regarding the display of completed jobs > in qstat. Do others find the listing of completed jobs by default via > qstat makes finding things in the output much more difficult and completely > unnecessary? Having the completed jobs in qstat can significantly slow > down qstat if you have a lot (thousands) of completed jobs which is another > hassle. > > I asking this because I need to be able to get error codes from completed > jobs (for minutes to hours after completion). To do this, they have to > still be in the queue. This function is very important, but not to anyone > who runs qstat by hand. Grid Engine had a way to get completed jobs, but > only when asked for. > > Thanks, > Craig > > Users can run 'qstat -r' to get a list of running jobs or 'qstat -i' to > get a list of queued/held/waiting jobs. 
>

Matt,

The above is true. It would be nice if you could combine these options.

> My understanding is that once a job has been completed for more than
> keep_completed seconds, pbs_server forgets about it. Then, you have to go
> look in the logs.
>

Yes, and I would like to keep the jobs for one day. That would leave
40-50k jobs in the completed state. A qstat with 20k completed jobs
(from a test on a slow server) took about 8 seconds.

> Alternatively, you could setup an epilogue to capture the exit code and
> funnel it to some user-accessible location (the job script, flat-file on a
> shared FS, database, etc).
>

I know I can do that, and I can ask Moab for the numbers as well.
However, the Torque server already has the information and can store it.
So why build some other mechanism to do this?

Thanks,
Craig

> ~Matt
>
> ---
> Matt Ezell
> HPC Systems Administrator
> Oak Ridge National Laboratory
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From mej at lbl.gov Mon Dec 3 14:29:16 2012
From: mej at lbl.gov (Michael Jennings)
Date: Mon, 3 Dec 2012 13:29:16 -0800
Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat
In-Reply-To:
References:
Message-ID: <20121203212844.GS8827@lbl.gov>

On Sunday, 02 December 2012, at 11:24:00 (-0700),
Craig Tierney - NOAA Affiliate wrote:

> I have a question for Torque users regarding the display of
> completed jobs in qstat. Do others find the listing of completed
> jobs by default via qstat makes finding things in the output much
> more difficult and completely unnecessary?
Having the completed > jobs in qstat can significantly slow down qstat if you have a lot > (thousands) of completed jobs which is another hassle. > > I asking this because I need to be able to get error codes from > completed jobs (for minutes to hours after completion). To do this, > they have to still be in the queue. This function is very > important, but not to anyone who runs qstat by hand. Grid Engine > had a way to get completed jobs, but only when asked for. To answer your question directly, yes, it is occasionally a little annoying to have to sift through completed jobs. It would be nice if there were a switch for it. But it sounds like you're really asking if anyone would object to the default changing such that completed jobs only showed up if requested. To that I say we'd have no issue with that at all. AFAIK, none of our scripts depend on completed jobs showing up, and any that did could be easily modified. But we tend to process accounting records rather than qstat for anything to do with completed jobs. HTH, Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From glen.beane at gmail.com Mon Dec 3 14:39:40 2012 From: glen.beane at gmail.com (Glen Beane) Date: Mon, 3 Dec 2012 16:39:40 -0500 Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat In-Reply-To: References: Message-ID: On Mon, Dec 3, 2012 at 3:00 PM, Craig Tierney - NOAA Affiliate wrote: > On Mon, Dec 3, 2012 at 12:55 PM, Glen Beane wrote: >> >> On Sun, Dec 2, 2012 at 1:24 PM, Craig Tierney - NOAA Affiliate >> wrote: >> > Hello all, >> > >> > I have a question for Torque users regarding the display of completed >> > jobs >> > in qstat. Do others find the listing of completed jobs by default via >> > qstat >> > makes finding things in the output much more difficult and completely >> > unnecessary? 
Having the completed jobs in qstat can significantly slow down
>> > qstat if you have a lot (thousands) of completed jobs which is another
>> > hassle.
>> >
>> > I am asking this because I need to be able to get error codes from
>> > completed jobs (for minutes to hours after completion). To do this,
>> > they have to still be in the queue. This function is very important,
>> > but not to anyone who runs qstat by hand. Grid Engine had a way to
>> > get completed jobs, but only when asked for.
>>
>>
>> you can use the "tracejob" command to get information, including the
>> exit status, of completed jobs.
>
>
> Glen,
>
> Doesn't this only work if I am root or I change the permissions on the files
> AND as a regular user I have access to the Torque server?
>
> Craig

Yes, regular users would need read access to the TORQUE logs and access
to the TORQUE server, or the relevant files would have to be mirrored to
your login node(s) and tracejob set up to find them. I don't think there
is really anything in the server logs and the accounting logs that
couldn't be obtained from qstat, etc., so we don't have much of a
problem making them readable.

I agree it might be nice to be able to query pbs_server itself for some
of this information, but I think some of the stability and code quality
issues need to be taken care of first.

From Gareth.Williams at csiro.au Mon Dec 3 15:34:51 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Tue, 4 Dec 2012 09:34:51 +1100
Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat
In-Reply-To:
References:
Message-ID: <007DECE986B47F4EABF823C1FBB19C620123B26F53C3@exvic-mbx04.nexus.csiro.au>

Hi Craig,

I've seen others' responses but no-one has mentioned using showjobs to
return info on completed jobs. It is a potentially useful piece of the
broader puzzle, though it does not answer your specific question.
Gareth From: Craig Tierney - NOAA Affiliate [mailto:craig.tierney at noaa.gov] Sent: Monday, 3 December 2012 5:24 AM To: Torque Users Mailing List Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat Hello all, I have a question for Torque users regarding the display of completed jobs in qstat. Do others find the listing of completed jobs by default via qstat makes finding things in the output much more difficult and completely unnecessary? Having the completed jobs in qstat can significantly slow down qstat if you have a lot (thousands) of completed jobs which is another hassle. I asking this because I need to be able to get error codes from completed jobs (for minutes to hours after completion). To do this, they have to still be in the queue. This function is very important, but not to anyone who runs qstat by hand. Grid Engine had a way to get completed jobs, but only when asked for. Thanks, Craig -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121204/21b6aef6/attachment.html From chenry at ittc.ku.edu Mon Dec 3 16:05:41 2012 From: chenry at ittc.ku.edu (Charles Henry) Date: Mon, 3 Dec 2012 17:05:41 -0600 (CST) Subject: [torqueusers] checkpoints and walltime In-Reply-To: References: Message-ID: <1059946229.39134.1354575941059.JavaMail.root@ittc.ku.edu> ----- Original Message ----- > From: "Brian Contractor Andrus" > To: "Torque Users Mailing List (torqueusers at supercluster.org)" > Sent: Monday, December 3, 2012 1:22:40 PM > Subject: [torqueusers] checkpoints and walltime > > All, > > I am seeing something odd. > If someone request enough walltime to land in my short queue rather > than my tiny queue (default), /etc/profile is not sourced. > The setting that seems to make a difference is checkpoint_defaults. 
> If that is enabled, various env variables are not set (HOSTNAME in > particular) > If it is disabled, everything seems to be set as expected. > > Is there something about checkpointing that affects environment > settings that I am missing? Ah-ha! Now, that makes sense. I observed the same problem on our system (just now--I had not previously tested that checkpoint defaults were related to my non-login shell issue). I had previously seen that cluster jobs would not run in a "login shell", except when running them interactively. You can read my correspondence with (the very helpful) torque-dev list here: http://www.supercluster.org/pipermail/torquedev/2012-October/004266.html That helps a lot to narrow it down perhaps. I will re-file my bug report with the updated symptom: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=224 Chuck > > > Brian Andrus > ITACS/Research Computing > Naval Postgraduate School > Monterey, California > voice: 831-656-6238 > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From levin108 at gmail.com Tue Dec 4 01:29:39 2012 From: levin108 at gmail.com (levin li) Date: Tue, 04 Dec 2012 16:29:39 +0800 Subject: [torqueusers] Question about checkpoint for MPI Message-ID: <50BDB473.5090601@gmail.com> Hi,all We're trying to use torque to checkpoint MPI jobs, but it seems that torque can only handle jobs running on a single node, I checked the code and found that when using qhold to checkpoint a job, qhold sends a PBS_BATCH_HoldJob request to pbs server, then pbs server relays this request to master host, and then master host checkpoints the job processes running on itself with BLCR, but not send the request to its sister nodes, so it seems that MPI jobs can not be checkpointable. 
I'm not sure whether I'm right, is there anybody who can tell me whether it is true with torque, or if I'm right, do you have any plans to make checkpoint for MPI jobs available ? Thanks Levin From knielson at adaptivecomputing.com Tue Dec 4 10:04:42 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Tue, 4 Dec 2012 10:04:42 -0700 Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620123B26F53C3@exvic-mbx04.nexus.csiro.au> References: <007DECE986B47F4EABF823C1FBB19C620123B26F53C3@exvic-mbx04.nexus.csiro.au> Message-ID: On Mon, Dec 3, 2012 at 3:34 PM, wrote: > Hi Craig,**** > > ** ** > > I?ve seen others? responses buy no-one has mentioned using showjobs to > return info on complete jobs. It is a potentially useful piece of the > broader puzzle ? though is not part of answering your specific question.** > ** > > ** ** > > Gareth**** > > ** ** > > *From:* Craig Tierney - NOAA Affiliate [mailto:craig.tierney at noaa.gov] > *Sent:* Monday, 3 December 2012 5:24 AM > *To:* Torque Users Mailing List > *Subject:* [torqueusers] Question to Torque community regarding display > of completed jobs in qstat**** > > ** ** > > Hello all,**** > > ** ** > > I have a question for Torque users regarding the display of completed jobs > in qstat. Do others find the listing of completed jobs by default via > qstat makes finding things in the output much more difficult and completely > unnecessary? Having the completed jobs in qstat can significantly slow > down qstat if you have a lot (thousands) of completed jobs which is another > hassle.**** > > ** ** > > I asking this because I need to be able to get error codes from completed > jobs (for minutes to hours after completion). To do this, they have to > still be in the queue. This function is very important, but not to anyone > who runs qstat by hand. 
Grid Engine had a way to get completed jobs, but > only when asked for.**** > > ** ** > > Thanks,**** > > Craig**** > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > Is there a reason why just adding a command line switch to tell qstat not to display completed jobs is not acceptable? Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121204/b56cbb16/attachment-0001.html From craig.tierney at noaa.gov Tue Dec 4 10:58:20 2012 From: craig.tierney at noaa.gov (Craig Tierney - NOAA Affiliate) Date: Tue, 4 Dec 2012 10:58:20 -0700 Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat In-Reply-To: References: <007DECE986B47F4EABF823C1FBB19C620123B26F53C3@exvic-mbx04.nexus.csiro.au> Message-ID: On Tue, Dec 4, 2012 at 10:04 AM, Ken Nielson wrote: > > On Mon, Dec 3, 2012 at 3:34 PM, wrote: > >> Hi Craig,**** >> >> ** ** >> >> I?ve seen others? responses buy no-one has mentioned using showjobs to >> return info on complete jobs. It is a potentially useful piece of the >> broader puzzle ? though is not part of answering your specific question.* >> *** >> >> ** ** >> >> Gareth**** >> >> ** ** >> >> *From:* Craig Tierney - NOAA Affiliate [mailto:craig.tierney at noaa.gov] >> *Sent:* Monday, 3 December 2012 5:24 AM >> *To:* Torque Users Mailing List >> *Subject:* [torqueusers] Question to Torque community regarding display >> of completed jobs in qstat**** >> >> ** ** >> >> Hello all,**** >> >> ** ** >> >> I have a question for Torque users regarding the display of completed >> jobs in qstat. Do others find the listing of completed jobs by default via >> qstat makes finding things in the output much more difficult and completely >> unnecessary? 
>> Having the completed jobs in qstat can significantly slow
>> down qstat if you have a lot (thousands) of completed jobs which is another
>> hassle.
>>
>> I am asking this because I need to be able to get error codes from completed
>> jobs (for minutes to hours after completion). To do this, they have to
>> still be in the queue. This function is very important, but not to anyone
>> who runs qstat by hand. Grid Engine had a way to get completed jobs, but
>> only when asked for.
>>
>> Thanks,
>>
>> Craig
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>>
> Is there a reason why just adding a command line switch to tell qstat not
> to display completed jobs is not acceptable?
>
> Ken
>

Ken,

This could work. There are lots of things that could work. My point is that the default behavior doesn't have any value (except that it already exists). I want the users (and myself) to do as little as possible. I asked the question in a way I hoped would prompt discussion of whether anyone else is bothered by the default behavior. Maybe I am the only one that cares that "qstat" generates too much information in a way that I think is unnecessary.

Craig

> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121204/7c01104f/attachment.html From dbeer at adaptivecomputing.com Tue Dec 4 11:37:58 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Tue, 4 Dec 2012 11:37:58 -0700 Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat In-Reply-To: References: <007DECE986B47F4EABF823C1FBB19C620123B26F53C3@exvic-mbx04.nexus.csiro.au> Message-ID: > Ken, > > This could work. There are lots of things that could work. My point is > that the default behavior doesn't have any value (except it already > exists). I want the users (and myself) to do as little as possible. I > asked the question in a way I hoped would discuss if anyone else is > bothered by the default behavior. Maybe I am the only one that cares that > "qstat" generates too much information in a way that I think is unnecessary. > > Craig > > One case for not changing the default is that Moab and Maui both depend on completed jobs appearing so that they can harvest appropriate information from them. This doesn't mean we absolutely can't change it - these could be made so that they request appropriately based on TORQUE versions - but it does mean that if we did change it then we'd break backwards compatibility with older versions of Moab/Maui, which is a significant consideration. -- David Beer | Senior Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121204/f17131da/attachment.html From glen.beane at gmail.com Tue Dec 4 11:54:56 2012 From: glen.beane at gmail.com (Glen Beane) Date: Tue, 4 Dec 2012 13:54:56 -0500 Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat In-Reply-To: References: <007DECE986B47F4EABF823C1FBB19C620123B26F53C3@exvic-mbx04.nexus.csiro.au> Message-ID: On Tue, Dec 4, 2012 at 1:37 PM, David Beer wrote: > >> Ken, >> >> This could work. There are lots of things that could work. My point is >> that the default behavior doesn't have any value (except it already exists). >> I want the users (and myself) to do as little as possible. I asked the >> question in a way I hoped would discuss if anyone else is bothered by the >> default behavior. Maybe I am the only one that cares that "qstat" generates >> too much information in a way that I think is unnecessary. >> >> Craig >> > > One case for not changing the default is that Moab and Maui both depend on > completed jobs appearing so that they can harvest appropriate information > from them. This doesn't mean we absolutely can't change it - these could be > made so that they request appropriately based on TORQUE versions - but it > does mean that if we did change it then we'd break backwards compatibility > with older versions of Moab/Maui, which is a significant consideration. > But Maui and Moab don't run the qstat executable. What if the API default were to return all jobs, including complete, but we could pass a flag with the request to the server from qstat so the server knows if the client wants information for completed jobs. We could add a qmgr setting to change the default behavior. qstat would include some extra information that would specify "give me all jobs", "give me everything but complete", or "give me the server default" (which would be the default behavior for qstat). 
I think most of the API calls allow passing "extra" information (that may not be used by many of the calls). We might be able to use this to convey this information. From knielson at adaptivecomputing.com Tue Dec 4 11:59:18 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Tue, 4 Dec 2012 11:59:18 -0700 Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat In-Reply-To: References: <007DECE986B47F4EABF823C1FBB19C620123B26F53C3@exvic-mbx04.nexus.csiro.au> Message-ID: On Tue, Dec 4, 2012 at 11:54 AM, Glen Beane wrote: > On Tue, Dec 4, 2012 at 1:37 PM, David Beer > wrote: > > > >> Ken, > >> > >> This could work. There are lots of things that could work. My point is > >> that the default behavior doesn't have any value (except it already > exists). > >> I want the users (and myself) to do as little as possible. I asked the > >> question in a way I hoped would discuss if anyone else is bothered by > the > >> default behavior. Maybe I am the only one that cares that "qstat" > generates > >> too much information in a way that I think is unnecessary. > >> > >> Craig > >> > > > > One case for not changing the default is that Moab and Maui both depend > on > > completed jobs appearing so that they can harvest appropriate information > > from them. This doesn't mean we absolutely can't change it - these could > be > > made so that they request appropriately based on TORQUE versions - but it > > does mean that if we did change it then we'd break backwards > compatibility > > with older versions of Moab/Maui, which is a significant consideration. > > > > But Maui and Moab don't run the qstat executable. What if the API > default were to return all jobs, including complete, but we could pass > a flag with the request to the server from qstat so the server knows > if the client wants information for completed jobs. We could add a > qmgr setting to change the default behavior. 
> qstat would include some
> extra information that would specify "give me all jobs", "give me
> everything but complete", or "give me the server default" (which would
> be the default behavior for qstat). I think most of the API calls
> allow passing "extra" information (that may not be used by many of the
> calls). We might be able to use this to convey this information.
> _______________________________________________
>

Glen,

Hmm. You are right.

qstat always gets all of the jobs regardless of their state and then formats the output based on the command line switches. Even so, changing default behaviour is almost always problematic. What we fix for one person generally breaks someone else.

Ken
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121204/71737a04/attachment.html

From pankaj.dorlikar at gmail.com Tue Dec 4 12:20:01 2012
From: pankaj.dorlikar at gmail.com (pankaj dorlikar)
Date: Wed, 5 Dec 2012 00:50:01 +0530
Subject: [torqueusers] support of gpus in maui
Message-ID:

Hi,

We have Maui-3.2.6p20 configured with torque-2.5.8 on RHEL 5.2 systems. We have a cluster with nodes having 16 CPUs and 1 GPU per node. We want to know whether the existing Maui version has GPU support. If yes, what should be done to integrate it in Maui? We could do it in torque, and torque detects it.

Thanks in advance

--
Pankaj V. Dorlikar
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121205/cc4f7351/attachment-0001.html From craig.tierney at noaa.gov Tue Dec 4 14:39:05 2012 From: craig.tierney at noaa.gov (Craig Tierney - NOAA Affiliate) Date: Tue, 4 Dec 2012 14:39:05 -0700 Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat In-Reply-To: References: <007DECE986B47F4EABF823C1FBB19C620123B26F53C3@exvic-mbx04.nexus.csiro.au> Message-ID: On Tue, Dec 4, 2012 at 11:59 AM, Ken Nielson wrote: > > > On Tue, Dec 4, 2012 at 11:54 AM, Glen Beane wrote: > >> On Tue, Dec 4, 2012 at 1:37 PM, David Beer >> wrote: >> > >> >> Ken, >> >> >> >> This could work. There are lots of things that could work. My point >> is >> >> that the default behavior doesn't have any value (except it already >> exists). >> >> I want the users (and myself) to do as little as possible. I asked the >> >> question in a way I hoped would discuss if anyone else is bothered by >> the >> >> default behavior. Maybe I am the only one that cares that "qstat" >> generates >> >> too much information in a way that I think is unnecessary. >> >> >> >> Craig >> >> >> > >> > One case for not changing the default is that Moab and Maui both depend >> on >> > completed jobs appearing so that they can harvest appropriate >> information >> > from them. This doesn't mean we absolutely can't change it - these >> could be >> > made so that they request appropriately based on TORQUE versions - but >> it >> > does mean that if we did change it then we'd break backwards >> compatibility >> > with older versions of Moab/Maui, which is a significant consideration. >> > >> >> But Maui and Moab don't run the qstat executable. What if the API >> default were to return all jobs, including complete, but we could pass >> a flag with the request to the server from qstat so the server knows >> if the client wants information for completed jobs. We could add a >> qmgr setting to change the default behavior. 
>> qstat would include some
>> extra information that would specify "give me all jobs", "give me
>> everything but complete", or "give me the server default" (which would
>> be the default behavior for qstat). I think most of the API calls
>> allow passing "extra" information (that may not be used by many of the
>> calls). We might be able to use this to convey this information.
>> _______________________________________________
>>
>
> Glen,
>
> Hmm. You are right.
>
> qstat always gets all of the jobs regardless of their state and then
> formats the output based on the command line switches. Even so, changing
> default behaviour is almost always problematic. What we fix for one person
> generally breaks someone else.
>
>

Ken,

I had figured out most of the behavior when looking at the code. I had a snippet that would ask for all job states except for "C" instead of using the default. Then I added a -c option that would pass back only completed jobs. I didn't go into the server code and change how it worked because I figured that would break Moab.

The reasons for breaking them out are:

1) It causes (IMO) unnecessary clutter.
2) If you (well, I) want the completed jobs to be useful, the keep_completed_jobs needs to be at least an hour, but preferably a day.
2b) When you start having thousands of jobs per hour going through the system, the number of completed jobs goes up drastically and slows down the qstat commands when few people really care (see #1).
3) Unless I rearchitect our Torque servers, users never have any access to the information to get the exit status from the log files. Plus that still requires parsing ASCII log files, which is not efficient (whereas keeping the exit code in memory is efficient).

I know it is changing default behavior and isn't something that can be done overnight. My point was to get others to express opinions on the current functionality and whether it is really the best thing to do.
Maybe the change couldn't be made until 5.0, where you have a chance to break things. Changing it in qstat means it never breaks the server so you don't have compatibility issues there.

Thanks,
Craig

Ken
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121204/c3bb6daa/attachment.html

From dbeer at adaptivecomputing.com Tue Dec 4 14:42:16 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 4 Dec 2012 14:42:16 -0700
Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat
In-Reply-To:
References: <007DECE986B47F4EABF823C1FBB19C620123B26F53C3@exvic-mbx04.nexus.csiro.au>
Message-ID:

>
>
> 3) Unless I rearchitect our Torque servers, users never have any access to
> the information to get the exit status from the log files. Plus that still
> requires parsing ascii log files which is not efficient (where keeping the
> exit code in memory is efficient).
>

The qstat -f output for a job is a way to get the exit status for a job.

$ qstat -f 192 | grep exit
exit_status = 0

--
David Beer | Senior Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121204/9debc976/attachment.html

From dbeer at adaptivecomputing.com Tue Dec 4 14:45:23 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 4 Dec 2012 14:45:23 -0700
Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat
In-Reply-To:
References: <007DECE986B47F4EABF823C1FBB19C620123B26F53C3@exvic-mbx04.nexus.csiro.au>
Message-ID:

I don't think it's hard to either create a qstat option that returns only non-complete jobs or make this the default for those running qstat (with an option to disable). If it is going to be the new default, we just need two things:

1. Community consensus.
2. This could be considered optional, but I don't think we should change defaults in a fix release, so it'd have to wait for the next non-fix release.

--
David Beer | Senior Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121204/0807e920/attachment.html

From craig.tierney at noaa.gov Wed Dec 5 09:57:54 2012
From: craig.tierney at noaa.gov (Craig Tierney - NOAA Affiliate)
Date: Wed, 5 Dec 2012 09:57:54 -0700
Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat
In-Reply-To:
References: <007DECE986B47F4EABF823C1FBB19C620123B26F53C3@exvic-mbx04.nexus.csiro.au>
Message-ID:

On Tue, Dec 4, 2012 at 2:42 PM, David Beer wrote:

>
>> 3) Unless I rearchitect our Torque servers, users never have any access to
>> the information to get the exit status from the log files. Plus that still
>> requires parsing ascii log files which is not efficient (where keeping the
>> exit code in memory is efficient).
>>
>
> The qstat -f output for a job is a way to get the exit status for a job.
>
> $ qstat -f 192 | grep exit
> exit_status = 0
>
>

This is exactly what I want to do.
But that requires setting keep_completed_jobs to 86400. This means I will have 30000+ jobs in completed state, slowing down the response of qstat (and I doubt people want to see the completed jobs most of the time). Hence the point of this discussion. Craig > > -- > David Beer | Senior Software Engineer > Adaptive Computing > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121205/4fa90a35/attachment.html From mej at lbl.gov Wed Dec 5 10:38:01 2012 From: mej at lbl.gov (Michael Jennings) Date: Wed, 5 Dec 2012 09:38:01 -0800 Subject: [torqueusers] Question about checkpoint for MPI In-Reply-To: <50BDB473.5090601@gmail.com> References: <50BDB473.5090601@gmail.com> Message-ID: <20121205173758.GY8827@lbl.gov> On Tuesday, 04 December 2012, at 16:29:39 (+0800), levin li wrote: > We're trying to use torque to checkpoint MPI jobs, but it seems that > torque can only handle jobs running on a single node, I checked the > code and found that when using qhold to checkpoint a job, qhold > sends a PBS_BATCH_HoldJob request to pbs server, then pbs server > relays this request to master host, and then master host checkpoints > the job processes running on itself with BLCR, but not send the > request to its sister nodes, so it seems that MPI jobs can not be > checkpointable. > > I'm not sure whether I'm right, is there anybody who can tell me > whether it is true with torque, or if I'm right, do you have any > plans to make checkpoint for MPI jobs available ? We work with the BLCR team here at LBNL and have collaborated with them extensively on trying to get MPI job checkpointing to work across nodes. 
We were able to get it working fairly reliably on a small set of nodes, but the greater the number of nodes in the job, the greater the chance of failure. (I'll skip the details unless someone's specifically interested.) As the project is no longer being funded (AFAIK), the odds of this being resolved by the BLCR team in the immediate future are slim. While I don't claim to speak for them, I can say that we chose to abandon our plans to deploy BLCR as a job preemption solution for our TORQUE/Moab systems as a direct result of the prognosis given to us by the team. The bottom line is, they're about 90-95% of the way there, but the remaining 5-10% is a significant challenge. Without funding, it doesn't seem likely the work will be completed any time soon. If you are interested in working with them or have any ideas for getting the project going again, I'm sure they'd be happy to hear from you at checkpoint at lbl.gov. Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From bdandrus at nps.edu Wed Dec 5 14:31:19 2012 From: bdandrus at nps.edu (Andrus, Brian Contractor) Date: Wed, 5 Dec 2012 21:31:19 +0000 Subject: [torqueusers] Question about checkpoint for MPI In-Reply-To: <20121205173758.GY8827@lbl.gov> References: <50BDB473.5090601@gmail.com> <20121205173758.GY8827@lbl.gov> Message-ID: > -----Original Message----- > From: torqueusers-bounces at supercluster.org [mailto:torqueusers- > bounces at supercluster.org] On Behalf Of Michael Jennings > Sent: Wednesday, December 05, 2012 9:38 AM > To: torqueusers at supercluster.org > Subject: Re: [torqueusers] Question about checkpoint for MPI > > On Tuesday, 04 December 2012, at 16:29:39 (+0800), levin li wrote: > > > We're trying to use torque to checkpoint MPI jobs, but it seems that > > torque can only handle jobs running on a single node, I checked the > > code and found that when 
using qhold to checkpoint a job, qhold sends > > a PBS_BATCH_HoldJob request to pbs server, then pbs server relays this > > request to master host, and then master host checkpoints the job > > processes running on itself with BLCR, but not send the request to its > > sister nodes, so it seems that MPI jobs can not be checkpointable. > > > > I'm not sure whether I'm right, is there anybody who can tell me > > whether it is true with torque, or if I'm right, do you have any plans > > to make checkpoint for MPI jobs available ? > > We work with the BLCR team here at LBNL and have collaborated with them > extensively on trying to get MPI job checkpointing to work across nodes. > We were able to get it working fairly reliably on a small set of nodes, but the > greater the number of nodes in the job, the greater the chance of failure. > (I'll skip the details unless someone's specifically interested.) > > As the project is no longer being funded (AFAIK), the odds of this being > resolved by the BLCR team in the immediate future are slim. > While I don't claim to speak for them, I can say that we chose to abandon our > plans to deploy BLCR as a job preemption solution for our TORQUE/Moab > systems as a direct result of the prognosis given to us by the team. The > bottom line is, they're about 90-95% of the way there, but the remaining 5- > 10% is a significant challenge. Without funding, it doesn't seem likely the > work will be completed any time soon. > > If you are interested in working with them or have any ideas for getting the > project going again, I'm sure they'd be happy to hear from you at > checkpoint at lbl.gov. 
>
> Michael
>
> --
> Michael Jennings
> Senior HPC Systems Engineer
> High-Performance Computing Services
> Lawrence Berkeley National Laboratory
> Bldg 50B-3209E W: 510-495-2687
> MS 050B-3209 F: 510-486-8615
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

Well, that is sad news.
What are the options out there for checkpoint/restart of a job then?

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

From chenry at ittc.ku.edu Wed Dec 5 14:51:16 2012
From: chenry at ittc.ku.edu (Charles Henry)
Date: Wed, 5 Dec 2012 15:51:16 -0600 (CST)
Subject: [torqueusers] Question about checkpoint for MPI
In-Reply-To:
References: <50BDB473.5090601@gmail.com> <20121205173758.GY8827@lbl.gov>
Message-ID: <1836470402.47115.1354744276574.JavaMail.root@ittc.ku.edu>

----- Original Message -----
> From: "Brian Contractor Andrus"
> To: "Torque Users Mailing List"
> Sent: Wednesday, December 5, 2012 3:31:19 PM
> Subject: Re: [torqueusers] Question about checkpoint for MPI
....
>
> Well, That is sad news.
> What are the options out there for checkpoint/restart of a job then?
>
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238

BLCR still works for many jobs. We are using Torque+Maui+BLCR, but we are not finished with our configurations. We see this as our solution for the throughput -vs- availability -vs- turnaround time dilemma.

We have a mixture of jobs--some researchers need access right away on interactive jobs, some run MPI jobs, and some run lots of small single-core jobs. The solution here is to organize queues (torque) with different QOS definitions (maui). This lets the interactive jobs preempt some other jobs (maui). We lose useful cycles unless we have a working checkpoint/restart scheme (BLCR).
The main caveat I'm seeing here is: We need to create another queue+QOS for large MPI jobs so they cannot be preempted.

I'd still be interested in knowing other alternatives, but at least for now, we are moving forward with this combination.

Chuck

From levin108 at gmail.com Wed Dec 5 19:20:51 2012
From: levin108 at gmail.com (levin li)
Date: Thu, 06 Dec 2012 10:20:51 +0800
Subject: [torqueusers] Question about checkpoint for MPI
In-Reply-To: <20121205173758.GY8827@lbl.gov>
References: <50BDB473.5090601@gmail.com> <20121205173758.GY8827@lbl.gov>
Message-ID: <50C00103.1000608@gmail.com>

On 2012-12-06 01:38, Michael Jennings wrote:
> On Tuesday, 04 December 2012, at 16:29:39 (+0800),
> levin li wrote:
>
>> We're trying to use torque to checkpoint MPI jobs, but it seems that
>> torque can only handle jobs running on a single node, I checked the
>> code and found that when using qhold to checkpoint a job, qhold
>> sends a PBS_BATCH_HoldJob request to pbs server, then pbs server
>> relays this request to master host, and then master host checkpoints
>> the job processes running on itself with BLCR, but not send the
>> request to its sister nodes, so it seems that MPI jobs can not be
>> checkpointable.
>>
>> I'm not sure whether I'm right, is there anybody who can tell me
>> whether it is true with torque, or if I'm right, do you have any
>> plans to make checkpoint for MPI jobs available ?
>
> We work with the BLCR team here at LBNL and have collaborated with
> them extensively on trying to get MPI job checkpointing to work across
> nodes. We were able to get it working fairly reliably on a small set
> of nodes, but the greater the number of nodes in the job, the greater
> the chance of failure. (I'll skip the details unless someone's
> specifically interested.)
>
> As the project is no longer being funded (AFAIK), the odds of this
> being resolved by the BLCR team in the immediate future are slim.
> While I don't claim to speak for them, I can say that we chose to
> abandon our plans to deploy BLCR as a job preemption solution for our
> TORQUE/Moab systems as a direct result of the prognosis given to us by
> the team. The bottom line is, they're about 90-95% of the way there,
> but the remaining 5-10% is a significant challenge. Without funding,
> it doesn't seem likely the work will be completed any time soon.
>
> If you are interested in working with them or have any ideas for
> getting the project going again, I'm sure they'd be happy to hear from
> you at checkpoint at lbl.gov.
>
> Michael
>

That's really bad news. I haven't found a reliable way to checkpoint MPI jobs so far, and I'd like to contribute to any project that aims to make this possible.

As far as I know, BLCR is an open source project, and there are many people who want to contribute their effort to make things better. I'm one of them, but I cannot find where BLCR is hosted; I can only find the source package. I think hosting the project on github may make it easier for us to work on it.

Thanks,
Levin

From samuel at unimelb.edu.au Wed Dec 5 19:37:02 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Thu, 06 Dec 2012 13:37:02 +1100
Subject: [torqueusers] Question about checkpoint for MPI
In-Reply-To:
References: <50BDB473.5090601@gmail.com> <20121205173758.GY8827@lbl.gov>
Message-ID: <50C004CE.7080705@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 06/12/12 08:31, Andrus, Brian Contractor wrote:

> Well, That is sad news.

Indeed.

> What are the options out there for checkpoint/restart of a job
> then?

It's worth noting that the kernel community is following a completely different checkpoint/restart path, that of the OpenVZ developers' "Checkpoint/Restore In Userspace" project (CRIU).
You can read more about it here:

https://lwn.net/Articles/525675/

The CRIU website is here:

http://criu.org/

It will also be up for discussion at LCA2013 in Canberra this year (though I won't be there).

I'd suggest it's worth bringing up on the openmpi-devel list; I might just do that now.

cheers,
Chris
- --
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iEYEARECAAYFAlDABM4ACgkQO2KABBYQAh8QNQCggjPN3aItrtAgukZ2OJE4bSHT
GjMAoIdB8EuOhzAhGMlVk3a4rFesONHO
=o5/N
-----END PGP SIGNATURE-----

From wytsang at clustertech.com Wed Dec 5 20:51:56 2012
From: wytsang at clustertech.com (Clotho Tsang)
Date: Thu, 6 Dec 2012 11:51:56 +0800
Subject: [torqueusers] [Mauiusers] support of gpus in maui
In-Reply-To:
References:
Message-ID:

Maui does not support GPUs; Moab and pbs_sched do. I don't think there is a way to integrate GPU support into Maui.

To enable GPUs in torque, add "--enable-nvidia-gpus" to the configure options.

On 5 December 2012 03:20, pankaj dorlikar wrote:

> Hi,
>
> We have Maui-3.2.6p20 configured with torque-2.5.8 on rhel 5.2 systems.
> We have cluster with nodes having 16 cpus and 1 gpu per node. We want to
> know whether with the existing maui version, gpu support is there? if yes,
> what should be done to integrate it in maui.
> We could do it in torque and torque detects it.
>
>
> Thanks in advance
>
>
> --
> Pankaj V.
Dorlikar > > _______________________________________________ > mauiusers mailing list > mauiusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/mauiusers > > -- > Clotho Tsang > Senior Software Engineer > Cluster Technology Limited > Email: > clotho at clustertech.com > Tel: (852) 2655-6129 > Fax: (852) 2994-2101 > Website: www.clustertech.com > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121206/59c7f0ec/attachment.html From jjc at iastate.edu Fri Dec 7 11:38:08 2012 From: jjc at iastate.edu (Coyle, James J [ITACD]) Date: Fri, 7 Dec 2012 18:38:08 +0000 Subject: [torqueusers] Mix MAUI and MOAB with a single TORQUE pbs_server? Message-ID: <242421BFAF465844BE24EB90BB97E2210FA1A25C@ITSDAG1D.its.iastate.edu> I am working with a set of researchers who would like to have a large cluster with Torque as the scheduler with MAUI for the majority of the nodes, and MOAB for about 32 GPU nodes to make use of the GPU scheduling capabilities of MOAB. 1) Is this necessary? Is MOAB needed for a 2GPU node? 2) Can this be done from a single Torque install (some nodes scheduled under MAUI, some under MOAB) or will this be like two clusters? Thanks, James Coyle, PhD High Performance Computing Group 115 Durham Center Iowa State Univ. phone: (515)-294-2099 Ames, Iowa 50011 web: http://jjc.public.iastate.edu/ -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121207/add82bae/attachment-0001.html From craig.tierney at noaa.gov Fri Dec 7 11:47:55 2012 From: craig.tierney at noaa.gov (Craig Tierney - NOAA Affiliate) Date: Fri, 7 Dec 2012 11:47:55 -0700 Subject: [torqueusers] Mix MAUI and MOAB with a single TORQUE pbs_server? 
In-Reply-To: <242421BFAF465844BE24EB90BB97E2210FA1A25C@ITSDAG1D.its.iastate.edu>
References: <242421BFAF465844BE24EB90BB97E2210FA1A25C@ITSDAG1D.its.iastate.edu>
Message-ID:

On Fri, Dec 7, 2012 at 11:38 AM, Coyle, James J [ITACD] wrote:

> I am working with a set of researchers who would like to have a large
> cluster with Torque as the scheduler with MAUI for the majority of the
> nodes, and MOAB for about 32 GPU nodes to make use of the GPU scheduling
> capabilities of MOAB.
>
> 1) Is this necessary? Is MOAB needed for a 2GPU node?
>

Maui does not support GPUs. If you have Moab, you can get rid of Maui.

> 2) Can this be done from a single Torque install (some nodes
> scheduled under MAUI, some under MOAB) or will this be like two clusters?
>

I have never done this, but I can't see how two different schedulers (Moab and Maui) could support scheduling on the same Torque Instance. Again, if you already are going to buy Moab, you can use that for your whole cluster, and not use Maui.

Craig

> Thanks,
>
> James Coyle, PhD
> High Performance Computing Group
> 115 Durham Center
> Iowa State Univ. phone: (515)-294-2099
> Ames, Iowa 50011 web: http://jjc.public.iastate.edu/
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121207/0dbd2fcd/attachment.html

From chenry at ittc.ku.edu Fri Dec 7 12:48:30 2012
From: chenry at ittc.ku.edu (Charles Henry)
Date: Fri, 7 Dec 2012 13:48:30 -0600 (CST)
Subject: [torqueusers] Mix MAUI and MOAB with a single TORQUE pbs_server?
In-Reply-To: References: <242421BFAF465844BE24EB90BB97E2210FA1A25C@ITSDAG1D.its.iastate.edu> Message-ID: <1910050252.72866.1354909710346.JavaMail.root@ittc.ku.edu> ----- Original Message ----- > From: "Craig Tierney - NOAA Affiliate" > To: "Torque Users Mailing List" > Sent: Friday, December 7, 2012 12:47:55 PM > Subject: Re: [torqueusers] Mix MAUI and MOAB with a single TORQUE pbs_server? > > On Fri, Dec 7, 2012 at 11:38 AM, Coyle, James J [ITACD] < > jjc at iastate.edu > wrote: > > > > > > > > I am working with a set of researchers who would like to have a large > cluster with > > Torque as the scheduler with MAUI for the majority of the nodes, and > MOAB > > for about 32 GPU nodes to make use of the GPU scheduling capabilities > of MOAB. > > > > 1) Is this necessary? Is MOAB needed for a 2GPU node? > > > Maui does not support GPUs. If you have Moab, you can get rid of > Maui. MOAB has a per node licensing cost--I'm guessing that's the reason for not wanting to use it on the whole cluster. Anyway--the fact that maui itself doesn't have GPU support is not a big deal. You just use torque's consumable resources and gpu support. It takes some custom configuration. > 2) Can this be done from a single Torque install (some nodes > scheduled under MAUI, > > some under MOAB) or will this be like two clusters? > > > > I have never done this, but I can't see how two different schedulers > (Moab and Maui) could support scheduling on the same Torque > Instance. Again, if you already are going to buy Moab, you can use > that for your whole cluster, and not use Maui. > > Craig Right--you can't have two schedulers with the same resource manager. So, if you go forward with two schedulers, you will need to have two torque resource managers running. You can put both on the same machine, with different ports and different installation directories. It can be done, but it's a huge headache. If you're committed to using each Maui and MOAB (e.g. 
you've paid for it already), you will essentially have two clusters. > > > Thanks, > > James Coyle, PhD > > High Performance Computing Group > > 115 Durham Center > > Iowa State Univ. phone: (515)-294-2099 > > Ames, Iowa 50011 web: http://jjc.public.iastate.edu/ > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From dbeer at adaptivecomputing.com Fri Dec 7 16:11:39 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Fri, 7 Dec 2012 16:11:39 -0700 Subject: [torqueusers] [torquedev] Migration to GitHub In-Reply-To: References: Message-ID: All, The contributor guidelines can be found here: https://github.com/adaptivecomputing/torque/wiki/How-To-Contribute Cheers, David On Fri, Dec 7, 2012 at 4:01 PM, Daniel Hardman < dhardman at adaptivecomputing.com> wrote: > Fellow Torquers: > > The migration to GitHub is complete. TORQUE's new home is: > https://github.com/adaptivecomputing/torque. This is true for all TORQUE > branches that Adaptive is shepherding. > > Some additional details are on our blog: > http://www.adaptivecomputing.com/blog/torque-now-on-github/ > > If you do not already have a GitHub account, now would be a good time to > create one. > > I believe David Beer will be sharing some contributor guidelines to the > torquedev mailing list shortly. We'll also post these on TORQUE's github > home for easy reference. If you've contributed in the past, these > guidelines are nothing new; they just describe how to submit patches with > maximum chance for adoption. > > On a related note, we have made some great progress in the past week or > two on the TORQUE quality initiative. 
> Some long-standing burrs are out from under the saddle, and more exciting > progress is on the horizon. > > --Daniel > > On Mon, Dec 3, 2012 at 11:18 AM, David Beer wrote: > >> All, >> >> Last Friday we proposed that we migrate TORQUE to GitHub, and so far the >> response has been a resounding yes. Barring major objections between now >> and then, we are planning to flip the switch this Friday (December 7th) in >> the morning. We'll send out a corresponding email. In addition to >> announcing the switch, we'll be publishing: >> >> - Contributor Guidelines >> - Quick Start Guide >> - Links to Tutorials for git and GitHub >> >> Please let us know if there's anything else that you feel would be >> helpful. We are optimistic that this shift will help us better communicate >> and coordinate our efforts. >> >> Cheers, >> >> -- >> David Beer | Senior Software Engineer >> Adaptive Computing >> >> >> _______________________________________________ >> torquedev mailing list >> torquedev at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torquedev >> >> > > _______________________________________________ > torquedev mailing list > torquedev at supercluster.org > http://www.supercluster.org/mailman/listinfo/torquedev > > -- David Beer | Senior Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121207/37dd591d/attachment.html From chris.hunter at yale.edu Fri Dec 7 11:13:40 2012 From: chris.hunter at yale.edu (Chris Hunter) Date: Fri, 07 Dec 2012 13:13:40 -0500 Subject: [torqueusers] ubmod comments? Message-ID: <50C231D4.2000004@yale.edu> Some of our faculty are asking about using ubmod (http://ubmod.sourceforge.net/) to report usage summaries for TORQUE. We have previously used pbstools & pbsacct. We are currently working with gold & mam. 1) Anyone on the list care to share their ubmod experiences ? 
2) What is the scope of your ubmod deployment (eg. for admins only, for management staff, for all users, etc.) ? 3) Are there any "killer" features that makes ubmod superior to similar projects ? 4) Any "best practice" recommendations? Lessons learned? thank-you in advance, chris hunter yale hpc group From jjc at iastate.edu Fri Dec 7 16:46:33 2012 From: jjc at iastate.edu (Coyle, James J [ITACD]) Date: Fri, 7 Dec 2012 23:46:33 +0000 Subject: [torqueusers] Mix MAUI and MOAB with a single TORQUE pbs_server? In-Reply-To: <1910050252.72866.1354909710346.JavaMail.root@ittc.ku.edu> References: <242421BFAF465844BE24EB90BB97E2210FA1A25C@ITSDAG1D.its.iastate.edu> <1910050252.72866.1354909710346.JavaMail.root@ittc.ku.edu> Message-ID: <242421BFAF465844BE24EB90BB97E2210FA1A36D@ITSDAG1D.its.iastate.edu> Craig, Thanks for suggesting consumable resources. I had not known of them. I will ask for a sample configuration on the mauiusers list. >-----Original Message----- >From: torqueusers-bounces at supercluster.org [mailto:torqueusers- >bounces at supercluster.org] On Behalf Of Charles Henry >Sent: Friday, December 07, 2012 1:49 PM >To: Torque Users Mailing List >Subject: Re: [torqueusers] Mix MAUI and MOAB with a single TORQUE >pbs_server? > > > >----- Original Message ----- >> From: "Craig Tierney - NOAA Affiliate" >> To: "Torque Users Mailing List" >> Sent: Friday, December 7, 2012 12:47:55 PM >> Subject: Re: [torqueusers] Mix MAUI and MOAB with a single TORQUE >pbs_server? >> >> On Fri, Dec 7, 2012 at 11:38 AM, Coyle, James J [ITACD] < >> jjc at iastate.edu > wrote: >> >> >> >> >> >> >> >> I am working with a set of researchers who would like to have a >large >> cluster with >> >> Torque as the scheduler with MAUI for the majority of the nodes, >and >> MOAB >> >> for about 32 GPU nodes to make use of the GPU scheduling >capabilities >> of MOAB. >> >> >> >> 1) Is this necessary? Is MOAB needed for a 2GPU node? >> >> >> Maui does not support GPUs. 
If you have Moab, you can get rid of >Maui. > >MOAB has a per node licensing cost--I'm guessing that's the reason >for not wanting to use it on the whole cluster. Anyway--the fact >that maui itself doesn't have GPU support is not a big deal. You >just use torque's consumable resources and gpu support. It takes >some custom configuration. > >> 2) Can this be done from a single Torque install (some nodes >scheduled >> under MAUI, >> >> some under MOAB) or will this be like two clusters? >> >> >> >> I have never done this, but I can't see how two different >schedulers >> (Moab and Maui) could support scheduling on the same Torque >Instance. >> Again, if you already are going to buy Moab, you can use that for >your >> whole cluster, and not use Maui. >> >> Craig > >Right--you can't have two schedulers with the same resource manager. >So, if you go forward with two schedulers, you will need to have two >torque resource managers running. You can put both on the same >machine, with different ports and different installation >directories. It can be done, but it's a huge headache. If you're >committed to using each Maui and MOAB (e.g. you've paid for it >already), you will essentially have two clusters. > > >> >> >> Thanks, >> >> James Coyle, PhD >> >> High Performance Computing Group >> >> 115 Durham Center >> >> Iowa State Univ. 
phone: (515)-294-2099 >> >> Ames, Iowa 50011 web: http://jjc.public.iastate.edu/ >> >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >_______________________________________________ >torqueusers mailing list >torqueusers at supercluster.org >http://www.supercluster.org/mailman/listinfo/torqueusers From samuel at unimelb.edu.au Fri Dec 7 17:07:33 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Sat, 08 Dec 2012 11:07:33 +1100 Subject: [torqueusers] [torquedev] Migration to GitHub In-Reply-To: References: Message-ID: <50C284C5.6070603@unimelb.edu.au> On 08/12/12 10:01, Daniel Hardman wrote: > Fellow Torquers: > > The migration to GitHub is complete. TORQUE's new home is: > https://github.com/adaptivecomputing/torque. This is true for all > TORQUE branches that Adaptive is shepherding. Fantabulous, thanks to all the team for doing this, much appreciated! Thanks also to David for the contributor guidelines - clear, common sense and useful. 
All the best, and have a good weekend all, Chris -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci From dbeer at adaptivecomputing.com Fri Dec 7 17:14:08 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Fri, 7 Dec 2012 17:14:08 -0700 Subject: [torqueusers] [torquedev] Migration to GitHub In-Reply-To: <50C284C5.6070603@unimelb.edu.au> References: <50C284C5.6070603@unimelb.edu.au> Message-ID: On Fri, Dec 7, 2012 at 5:07 PM, Christopher Samuel wrote: > On 08/12/12 10:01, Daniel Hardman wrote: > > > Fellow Torquers: > > > > The migration to GitHub is complete. TORQUE's new home is: > > https://github.com/adaptivecomputing/torque. This is true for all > > TORQUE branches that Adaptive is shepherding. > > Fantabulous, thanks to all the team for doing this, much appreciated! > > Thanks also to David for the contributor guidelines - clear, common > sense and useful. > > I'm glad to hear that. Feedback is welcome if anyone feels something can be made better. Thanks -- David Beer | Senior Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121207/8f7a3835/attachment.html From wytsang at clustertech.com Sun Dec 9 20:06:24 2012 From: wytsang at clustertech.com (Clotho Tsang) Date: Mon, 10 Dec 2012 11:06:24 +0800 Subject: [torqueusers] "qstat -f fail when hostname start with capital letter In-Reply-To: References: Message-ID: More information: this bug is not found at Torque 2.x On 30 November 2012 10:32, Clotho Tsang wrote: > I am using Torque 4.1.3. > > If the master node has hostname started with capital letter, qstat -f ID> fail to locate the job. > > However, "qstat -f" (without JOB ID) and checkjob is success. 
>
> [root at M-30 ~]# qstat
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 10.M-30                   prop2            test3           00:00:00 R batch
> [root at M-30 ~]# qstat 10
> qstat: Unknown Job Id Error 10.m-30.chess.ct
> [root at M-30 ~]# qstat 10.M-30
> qstat: Unknown Job Id Error 10.m-30.chess.ct
> [root at M-30 ~]# qstat 10.m-30.chess.ct
> qstat: Unknown Job Id Error 10.m-30.chess.ct
> [root at M-30 ~]# qstat 10.m-30
> qstat: Unknown Job Id Error 10.m-30.chess.ct
> [root at M-30 ~]# qstat 10.M-30.chess.ct
> qstat: Unknown Job Id Error 10.m-30.chess.ct
> [root at M-30 ~]# checkjob 10
>
> checking job 10
>
> State: Running
> Creds: user:test3 group:test3 class:batch qos:DEFAULT
> WallTime: 00:02:12 of 1:00:00
> SubmitTime: Fri Nov 30 10:13:17
> (Time Queued Total: 00:07:20 Eligible: 00:07:20)
>
> StartTime: Fri Nov 30 10:20:37
> Total Tasks: 1
>
> Req[0] TaskCount: 1 Partition: DEFAULT
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [bigmem]
> Allocated Nodes:
> [M-30.chess.ct:1]
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 1
> PartitionMask: [ALL]
> Flags: RESTARTABLE
>
> Reservation '10' (-00:01:52 -> 00:58:08 Duration: 1:00:00)
> PE: 1.00 StartPriority: 7
> [root at M-30 ~]# qstat -f
> Job Id: 10.M-30.chess.ct
>     Job_Name = j1.sh
>     Job_Owner = test3 at M-30.chess.ct
>     :
>     :
>     :
>
> --
> Clotho Tsang
> Senior Software Engineer
> Cluster Technology Limited
> Email: clotho at clustertech.com
> Tel: (852) 2655-6129
> Fax: (852) 2994-2101
> Website: www.clustertech.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121210/6de2162b/attachment.html
From nieroda.lech at uni-koeln.de Mon Dec 10 02:28:22 2012
From: nieroda.lech at uni-koeln.de (Lech Nieroda)
Date: Mon, 10 Dec 2012 10:28:22 +0100
Subject: [torqueusers] removal of "stray jobs"
Message-ID: <50C5AB36.9050706@uni-koeln.de>

Dear list,

we are currently running Torque 4.1.3 with Maui 3.3.1. The option "mom_job_sync" is on. However, we get "stray" jobs quite often - these are jobs that remain in an "EXITING" state for whatever reason and their .JB files are often left lying around.

Our workaround: at first we've tried to delete the JB files and restart the pbs_mom daemon but it turns out that a simple "momctl -h -c " does the job as well. An appropriate script runs now daily with cron and removes such jobs.

So, when the server discovers a "stray job" he has the means to send a "cleaning" command to the pbs_mom but apparently doesn't do it and we have to do it manually.

Any option to fix that? Is it a bug?

Regards,
Lech Nieroda

--
Dipl.-Wirt.-Inf. Lech Nieroda
Regionales Rechenzentrum der Universität zu Köln (RRZK)
Universität zu Köln
Weyertal 121
Raum 309 (3. Etage)
D-50931 Köln
Deutschland

Tel.: +49 (221) 470-89606
E-Mail: nieroda.lech at uni-koeln.de
From dbeer at adaptivecomputing.com Mon Dec 10 09:26:50 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 10 Dec 2012 09:26:50 -0700
Subject: [torqueusers] removal of "stray jobs"
In-Reply-To: <50C5AB36.9050706@uni-koeln.de>
References: <50C5AB36.9050706@uni-koeln.de>
Message-ID:

There is a fix for this in place that will be released with 4.1.4. I'm not sure exactly how it happens, but we added some functionality that makes the mom retry sending obits for jobs that are stuck in the exiting state on the mom.

David

On Mon, Dec 10, 2012 at 2:28 AM, Lech Nieroda wrote:

> Dear list,
>
> we are currently running Torque 4.1.3 with Maui 3.3.1. The option
> "mom_job_sync" is on. However, we get "stray" jobs quite often - these
> are jobs that remain in an "EXITING" state for whatever reason and their
> .JB files are often left lying around.
>
> Our workaround: at first we've tried to delete the JB files and restart
> the pbs_mom daemon but it turns out that a simple "momctl -h -c
> " does the job as well. An appropriate script runs now daily with
> cron and removes such jobs.
>
> So, when the server discovers a "stray job" he has the means to send a
> "cleaning" command to the pbs_mom but apparently doesn't do it and we
> have to do it manually.
>
> Any option to fix that? Is it a bug?
>
> Regards,
> Lech Nieroda
>
> --
> Dipl.-Wirt.-Inf. Lech Nieroda
> Regionales Rechenzentrum der Universität zu Köln (RRZK)
> Universität zu Köln
> Weyertal 121
> Raum 309 (3. Etage)
> D-50931 Köln
> Deutschland
>
> Tel.: +49 (221) 470-89606
> E-Mail: nieroda.lech at uni-koeln.de
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
David Beer | Senior Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121210/dc2dadc0/attachment.html
From luiz at if.usp.br Mon Dec 10 11:39:21 2012
From: luiz at if.usp.br (Luiz Carlos dos Santos)
Date: Mon, 10 Dec 2012 16:39:21 -0200
Subject: [torqueusers] pbs_server
Message-ID: <011501cdd705$ab3fb5e0$01bf21a0$@if.usp.br>

I use red hat linux. How to boot pbs_server automatically with the operating system?

Thanks,

Luiz Carlos dos Santos
Analista de Sistemas - IFUSP/FMT
Instituto de Física da USP
Departamento de Física dos Materiais e Mecânica
Pça. do Oceanográfico - Trav E, s/nº
Edifício Alessandro Volta, Bloco C - sala 112
CEP 05508-120 - São Paulo SP
Fone: (11) 3091-6784 / Fax: (11) 3091-6831
E-mail: luiz at if.usp.br
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121210/261c9b66/attachment-0001.html
From gas5x at yahoo.com Mon Dec 10 12:15:13 2012
From: gas5x at yahoo.com (Grigory Shamov)
Date: Mon, 10 Dec 2012 11:15:13 -0800 (PST)
Subject: [torqueusers] pbs_server
In-Reply-To: <011501cdd705$ab3fb5e0$01bf21a0$@if.usp.br>
Message-ID: <1355166913.75162.YahooMailClassic@web140905.mail.bf1.yahoo.com>

Hi Luiz,

There are sample pbs_server, pbs_mom scripts in the Torque source's ./contrib directory. One has to edit it according to your settings (Torque path and such) and then to put into /etc/rc.d/init.d/ directory. Then to run chkconfig for the pbs_server script.

--
Grigory Shamov

--- On Mon, 12/10/12, Luiz Carlos dos Santos wrote:

From: Luiz Carlos dos Santos
Subject: [torqueusers] pbs_server
To: torqueusers at supercluster.org
Cc: luiz at if.usp.br
Date: Monday, December 10, 2012, 10:39 AM

I use red hat linux. How to boot pbs_server automatically with the operating system?

Thanks,

Luiz Carlos dos Santos
Analista de Sistemas - IFUSP/FMT
Instituto de Física da USP
Departamento de Física dos Materiais e Mecânica
Pça. do Oceanográfico - Trav E, s/nº
Edifício Alessandro Volta, Bloco C - sala 112
CEP 05508-120 - São Paulo SP
Fone: (11) 3091-6784 / Fax: (11) 3091-6831
E-mail: luiz at if.usp.br

-----Inline Attachment Follows-----

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121210/cf8ef485/attachment.html
From bdandrus at nps.edu Mon Dec 10 12:20:02 2012
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Mon, 10 Dec 2012 19:20:02 +0000
Subject: [torqueusers] hwloc error torque 4.1.3
Message-ID:

This seems to have crept back into torque 4.1.3:

./configure --enable-cpuset produces:
----------snip---------------------
checking for HWLOC... configure: error: cpuset support requires the hwloc package
This can be solved by configuring with --with-hwloc-path=. This path
should be the path to the directory containing the lib/ and include/
directories for your version of hwloc.
Another option is adding the directory containing 'hwloc.pc' to the
PKG_CONFIG_PATH environment variable.
If you have done these and still get this error, try running ./autogen.sh
and then configuring again.

[root at build torque-4.1.3]# ls -l $PKG_CONFIG_PATH/hwloc.pc
-rw-r--r-- 1 root root 257 Jul 19 2011 /usr/lib64/pkgconfig//hwloc.pc
-----------------------

Tried using all the usual stuff, to no avail.
Stock hwloc and hwloc-devel installed on CentOS 6.3

Any ideas out there?

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

From mej at lbl.gov Mon Dec 10 12:25:42 2012
From: mej at lbl.gov (Michael Jennings)
Date: Mon, 10 Dec 2012 11:25:42 -0800
Subject: [torqueusers] hwloc error torque 4.1.3
In-Reply-To:
References:
Message-ID: <20121210192541.GM8827@lbl.gov>

On Monday, 10 December 2012, at 19:20:02 (+0000), Andrus, Brian Contractor wrote:

> This seems to have crept back into torque 4.1.3:
>
> ./configure --enable-cpuset produces:
> ----------snip---------------------
> checking for HWLOC... configure: error: cpuset support requires the hwloc package

Check config.log. There's probably something else going on.

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615
From bdandrus at nps.edu Mon Dec 10 12:37:18 2012
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Mon, 10 Dec 2012 19:37:18 +0000
Subject: [torqueusers] hwloc error torque 4.1.3
In-Reply-To: <20121210192541.GM8827@lbl.gov>
References: <20121210192541.GM8827@lbl.gov>
Message-ID:

Same thing in config.log:
-------------------
configure:22377: checking for HWLOC
configure:22476: error: cpuset support requires the hwloc package
This can be solved by configuring with --with-hwloc-path=. This path
should be the path to the directory containing the lib/ and include/
directories for your version of hwloc.
Another option is adding the directory containing 'hwloc.pc' to the
PKG_CONFIG_PATH environment variable.
If you have done these and still get this error, try running ./autogen.sh
and then configuring again.
--------------------------

I do see just above that:
------------------
configure:21928: checking mach/shared_region.h presence
configure:21938: gcc -E conftest.c
conftest.c:64:32: error: mach/shared_region.h: No such file or directory
-----------------------
But I'm not sure that matters.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Michael Jennings
> Sent: Monday, December 10, 2012 11:26 AM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] hwloc error torque 4.1.3
>
> On Monday, 10 December 2012, at 19:20:02 (+0000), Andrus, Brian Contractor
> wrote:
>
> > This seems to have crept back into torque 4.1.3:
> >
> > ./configure --enable-cpuset produces:
> > ----------snip---------------------
> > checking for HWLOC... configure: error: cpuset support requires the
> > hwloc package
>
> Check config.log. There's probably something else going on.
>
> Michael
>
> --
> Michael Jennings
> Senior HPC Systems Engineer
> High-Performance Computing Services
> Lawrence Berkeley National Laboratory
> Bldg 50B-3209E W: 510-495-2687
> MS 050B-3209 F: 510-486-8615
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
From mej at lbl.gov Mon Dec 10 14:36:14 2012
From: mej at lbl.gov (Michael Jennings)
Date: Mon, 10 Dec 2012 13:36:14 -0800
Subject: [torqueusers] hwloc error torque 4.1.3
In-Reply-To:
References: <20121210192541.GM8827@lbl.gov>
Message-ID: <20121210213613.GN8827@lbl.gov>

On Monday, 10 December 2012, at 19:37:18 (+0000), Andrus, Brian Contractor wrote:

> Same thing in config.log:
> -------------------
> configure:22377: checking for HWLOC
> configure:22476: error: cpuset support requires the hwloc package

It's a pretty straight-forward test; either pkgconfig finds hwloc >= 1.1, or it doesn't.

For me, installing hwloc-devel.x86_64 worked just fine. I, too, am using a rebuild of RHEL 6.3 (Scientific Linux).

What does this command tell you?

pkg-config --atleast-version 1.1 hwloc && echo "Yep." || echo "Nope."

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615
From dbeer at adaptivecomputing.com Mon Dec 10 14:45:31 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 10 Dec 2012 14:45:31 -0700
Subject: [torqueusers] hwloc error torque 4.1.3
In-Reply-To: <20121210213613.GN8827@lbl.gov>
References: <20121210192541.GM8827@lbl.gov> <20121210213613.GN8827@lbl.gov>
Message-ID:

Have you tried setting --with-hwloc-path= appropriately?
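
A minimal sketch of one way to derive that path, assuming hwloc.pc lives under <prefix>/<libdir>/pkgconfig/ as in the listing earlier in this thread. The two function names are hypothetical helpers, not part of TORQUE; only --enable-cpuset and --with-hwloc-path come from the configure output quoted in this thread:

```shell
#!/bin/sh
# Hypothetical helpers (not part of TORQUE): work out the hwloc prefix from
# the hwloc.pc location and print the configure line for review.

# /usr/lib64/pkgconfig/hwloc.pc -> /usr
# Assumes the usual <prefix>/<libdir>/pkgconfig/hwloc.pc layout.
hwloc_prefix_from_pc() {
    pc_dir=$(dirname "$1")        # <prefix>/<libdir>/pkgconfig
    lib_dir=$(dirname "$pc_dir")  # <prefix>/<libdir>
    dirname "$lib_dir"            # <prefix>
}

# Print (do not run) the configure command with both flags from this thread.
hwloc_configure_cmd() {
    printf './configure --enable-cpuset --with-hwloc-path=%s\n' "$1"
}

# Example: build the command from the path shown in the ls output.
prefix=$(hwloc_prefix_from_pc /usr/lib64/pkgconfig/hwloc.pc)
hwloc_configure_cmd "$prefix"
```

If pkg-config can see hwloc at all, `pkg-config --variable=prefix hwloc` should report the same prefix without any path arithmetic.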
On Mon, Dec 10, 2012 at 2:36 PM, Michael Jennings wrote: > On Monday, 10 December 2012, at 19:37:18 (+0000), > Andrus, Brian Contractor wrote: > > > Same thing in config.log: > > ------------------- > > configure:22377: checking for HWLOC > > configure:22476: error: cpuset support requires the hwloc package > > It's a pretty straight-forward test; either pkgconfig finds hwloc >= > 1.1, or it doesn't. > > For me, installing hwloc-devel.x86_64 worked just fine. I, too, am > using a rebuild of RHEL 6.3 (Scientific Linux). > > What does this command tell you? > > pkg-config --atleast-version 1.1 hwloc && echo "Yep." || echo "Nope." > > > Michael > > -- > Michael Jennings > Senior HPC Systems Engineer > High-Performance Computing Services > Lawrence Berkeley National Laboratory > Bldg 50B-3209E W: 510-495-2687 > MS 050B-3209 F: 510-486-8615 > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Senior Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121210/b38e2562/attachment-0001.html From chris.hunter at yale.edu Mon Dec 10 14:47:52 2012 From: chris.hunter at yale.edu (Chris Hunter) Date: Mon, 10 Dec 2012 16:47:52 -0500 Subject: [torqueusers] ubmod accounting log filters ? Message-ID: <50C65888.7070007@yale.edu> Some faculty are asking about using ubmod (http://ubmod.sourceforge.net/) to report usage summaries for TORQUE. We have previously used pbstools & pbsacct. We are currently working with gold & mam. 1) Anyone on the list care to share their ubmod experiences ? 2) Are there any "killer" features that makes ubmod superior to similar projects ? 
thank-you in advance, chris hunter yale hpc group From samuel at unimelb.edu.au Mon Dec 10 16:08:32 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 11 Dec 2012 10:08:32 +1100 Subject: [torqueusers] removal of "stray jobs" In-Reply-To: <50C5AB36.9050706@uni-koeln.de> References: <50C5AB36.9050706@uni-koeln.de> Message-ID: <50C66B70.1050905@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 10/12/12 20:28, Lech Nieroda wrote: > we are currently running Torque 4.1.3 with Maui 3.3.1. The option > "mom_job_sync" is on. However, we get "stray" jobs quite often - > these are jobs that remain in an "EXITING" state for whatever > reason and their .JB files are often left lying around. We see this also on our IBM iDataplex running RHEL 5.8 with Torque 2.4.x, though not on our SGI Altix XE running CentOS 5.8 with the exact same build (install tree rsync'd from SGI -> IBM). I suspect it's something due to the different mix of users on the two systems, but it's proved impossible to pin down, other than to note that it always seems to affect jobs where pbs_server has sent a second message to start a job on a node, resulting in a log message of a successful start followed by a log message saying that it rejected another attempt. 
For example:

11/09/2012 13:36:00;0008; pbs_mom;Job;449269-923.merri-m.pcf.vlsci.unimelb.edu.au;JOIN JOB as node 1
11/09/2012 13:36:00;0008; pbs_mom;Job;449269-923.merri-m.pcf.vlsci.unimelb.edu.au;start_process: task started, tid 2, sid 27290, cmd orted
11/09/2012 13:36:01;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Success (0) in req_quejob, cannot queue new job, job exists and is running
11/09/2012 13:36:01;0080; pbs_mom;Req;req_reject;Reject reply code=15009(Job with requested ID already exists MSG=job is running), aux=0, type=QueueJob, from PBS_Server at merri-m.pcf.vlsci.unimelb.edu.au

I logged it in Bugzilla here:

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=218

cheers,
Chris
- --
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iEYEARECAAYFAlDGa3AACgkQO2KABBYQAh/BnwCfYMzTtrCAuv0EeJypV/AVoCF5
qFwAoIxoVpxgDVqVdk17Lbz0KIynZ5BX
=fbtO
-----END PGP SIGNATURE-----
From samuel at unimelb.edu.au Mon Dec 10 16:18:11 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 11 Dec 2012 10:18:11 +1100
Subject: [torqueusers] Question to Torque community regarding display of completed jobs in qstat
In-Reply-To:
References: <007DECE986B47F4EABF823C1FBB19C620123B26F53C3@exvic-mbx04.nexus.csiro.au>
Message-ID: <50C66DB3.1020904@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 06/12/12 03:57, Craig Tierney - NOAA Affiliate wrote:

> This is exactly what I want to do. But that requires setting
> keep_completed_jobs to 86400. This means I will have 30000+ jobs
> in completed state, slowing down the response of qstat (and I doubt
> people want to see the completed jobs most of the time). Hence the
> point of this discussion.

For what it's worth I think we'd be fine with qstat doing the filtering, but not with altering the API to not return completed jobs (and hence cause issues for Maui/Moab and anything else that talks to the pbs_server that way).

cheers,
Chris
- --
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iEYEARECAAYFAlDGbbIACgkQO2KABBYQAh90VwCfWVFiOuq6cnMS5msYnLBIxp6G
ggAAmwb5xkD4ILLDoFJvX8lwv0p7kqdK
=2YED
-----END PGP SIGNATURE-----
From olivier.lahaye at cea.fr Wed Dec 12 08:03:45 2012
From: olivier.lahaye at cea.fr (LAHAYE Olivier)
Date: Wed, 12 Dec 2012 15:03:45 +0000
Subject: [torqueusers] Wired bug in torque 4.1.3 and 4.1.4 (blcr module not loaded)
Message-ID: <01EDEEAE5DA8584CAAD628AA5B06179A0909488E@EXDAG0-A3.intra.cea.fr>

Hi,

I think I've triggered a really weird bug in torque. I've built torque with blcr support. If the blcr module is not loaded on a node where a job is scheduled to run, the job hangs with various errors ranging from (nothing) to unable to set up IO or the like.

What helped me is that pbsdsh issued a warning in the job error log, and once I fixed the blcr issue (started the blcr service that was responsible for modprobing the module), the whole pbs system was running fine.

I'm not skilled enough to find the exact problem, but if that can help, at least it's better than nothing.
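
A minimal sketch of a pre-flight check for the failure mode described above, assuming the kernel module is named blcr and that /proc/modules lists one loaded module per line with the module name in the first field. The function name is a hypothetical helper, not part of TORQUE or BLCR:

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE or BLCR): succeed if a kernel
# module name appears in a /proc/modules-style listing.
# $1 = module name, $2 = listing file (defaults to /proc/modules)
module_loaded() {
    listing="${2:-/proc/modules}"
    # An unreadable listing counts as "not loaded".
    [ -r "$listing" ] || return 1
    # Match the first field exactly so "blcr" does not match "blcr_imports".
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "$listing"
}

# Example use, e.g. from a node health check; "blcr" is an assumed module name.
if module_loaded blcr; then
    echo "blcr module loaded"
else
    echo "blcr module NOT loaded" >&2
fi
```

Run on each compute node before pbs_mom accepts work, a check like this could turn the silent hang into an explicit message.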
See below the --about (config options) and after that, the log from pbs_mom on the executing node: /opt/pbs/sbin/pbs_server --about package: torque 4.1.4 sourcedir: /root/rpmbuild/BUILD/torque-4.1.4 configure: '--prefix=/opt/pbs' '--mandir=/opt/pbs/man' '--libdir=/opt/pbs/lib64' '--includedir=/opt/pbs/include' '--with-server-home=/var/lib/torque' '--with-pam=/lib64/security' '--with-sendmail=/usr/sbin/sendmail' '--with-default-server=pbs_oscar' '--with-server-name-file=server_name' '--enable-gui' '--enable-syslog' '--with-tcl' '--enable-rpp' '--with-rcp=scp' '--enable-drmaa' '--enable-blcr' '--enable-nvidia-gpus' '--enable-munge-auth' 'CC=' 'CFLAGS=' 'LDFLAGS=' 'PKG_CONFIG_PATH=/usr/lib64/pkgconfig:/usr/share/pkgconfig' buildcflags: -D_LARGEFILE64_SOURCE -DMUNGE_AUTH buildhost: is005045.intra.cea.fr builddate: Tue Dec 11 14:06:01 CET 2012 builddir: /root/rpmbuild/BUILD/torque-4.1.4 builduser: root installdir: /opt/pbs serverhome: /var/lib/torque version: 4.1.4-snap.201211201307 [...] (mom_log on oscarnode49) 12/11/2012 15:06:11;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;req_commit:starting job execution 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;0: oscarnode49/11 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;1: oscarnode49/10 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;2: oscarnode49/9 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;3: oscarnode49/8 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;4: oscarnode49/7 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;5: oscarnode49/6 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;6: oscarnode49/5 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;7: oscarnode49/4 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;8: oscarnode49/3 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;9: oscarnode49/2 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;10: oscarnode49/1 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;11: oscarnode49/0 12/11/2012 15:06:11;0001; 
pbs_mom.3661;Job;job_nodes;12: oscarnode48/11 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;13: oscarnode48/10 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;14: oscarnode48/9 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;15: oscarnode48/8 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;16: oscarnode48/7 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;17: oscarnode48/6 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;18: oscarnode48/5 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;19: oscarnode48/4 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;20: oscarnode48/3 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;21: oscarnode48/2 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;22: oscarnode48/1 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;23: oscarnode48/0 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;24: oscarnode47/5 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;25: oscarnode47/4 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;26: oscarnode47/3 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;27: oscarnode47/2 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;28: oscarnode47/1 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;29: oscarnode47/0 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;job_nodes;job: 15.is003274.intra.cea.fr numnodes=3 numvnod=30 12/11/2012 15:06:11;0001; pbs_mom.3661;Svr;pbs_mom;LOG_DEBUG::init_groups, pre-sigprocmask 12/11/2012 15:06:11;0001; pbs_mom.3661;Svr;pbs_mom;LOG_DEBUG::init_groups, post-initgroups 12/11/2012 15:06:11;0002; pbs_mom.3661;Job;15.is003274.intra.cea.fr;allocate_demux_sockets: stdout: 10:56644 stderr: 11:43813 12/11/2012 15:06:11;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;start_exec: total wire-up time for job 0.2247 12/11/2012 15:06:11;0001; pbs_mom.3661;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;15.is003274.intra.cea.fr;about to fork child which will become job 
12/11/2012 15:06:11;0001; pbs_mom.3661;Job;15.is003274.intra.cea.fr;phase 2 of job launch successfully completed 12/11/2012 15:06:11;0002; pbs_mom.3977;n/a;mom_close_poll;entered 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;15.is003274.intra.cea.fr;task/session info loaded 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;TMomFinalizeJob3;Job 15.is003274.intra.cea.fr read start return code=0 session=3977 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;15.is003274.intra.cea.fr;saving task (TMomFinalizeJob3) 12/11/2012 15:06:11;0008; pbs_mom.3661;Svr;task_save;saving task in /var/lib/torque/mom_priv/jobs/15.is003274.intra.cea.fr.TK 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;TMomFinalizeJob3;job 15.is003274.intra.cea.fr started, pid = 3977 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;15.is003274.intra.cea.fr;exec_job_on_ms:job successfully started 12/11/2012 15:06:11;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;req_commit:job execution started 12/11/2012 15:06:11;0008; pbs_mom.3661;Job;tcp_request;tcp_request: fd 8 addr 127.0.0.1:43387 12/11/2012 15:06:11;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;tm_request: job 15.is003274.intra.cea.fr cookie CAAFC3D6302C31FCF9BD92DE9205655D task 1 com 100 event 1 12/11/2012 15:06:11;0002; pbs_mom.3661;node;close_conn;Connection 8 - func 414387 12/11/2012 15:06:11;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;matching task located, marking interface closed 12/11/2012 15:06:11;0008; pbs_mom.3661;Job;tcp_request;tcp_request: fd 8 addr 127.0.0.1:43388 12/11/2012 15:06:11;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;tm_request: job 15.is003274.intra.cea.fr cookie CAAFC3D6302C31FCF9BD92DE9205655D task 1 com 102 event 2 12/11/2012 15:06:11;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;tm_spawn_request: SPAWN 15.is003274.intra.cea.fr on node 0 12/11/2012 15:06:11;0001; pbs_mom.3661;Job;15.is003274.intra.cea.fr;saving task (TM_SPAWN) 12/11/2012 15:06:11;0008; pbs_mom.3661;Svr;task_save;saving task in 
/var/lib/torque/mom_priv/jobs/15.is003274.intra.cea.fr.TK 12/11/2012 15:06:11;0002; pbs_mom.4000;n/a;mom_close_poll;entered 12/11/2012 15:06:31;0001; pbs_mom.3661;Job;15.is003274.intra.cea.fr;task not started, 'hostname', stdio setup failed (see syslog) 12/11/2012 15:06:31;0008; pbs_mom.3661;Job;scan_for_terminated;entered 12/11/2012 15:06:31;0080; pbs_mom.3661;Svr;mom_get_sample;proc_array load started 12/11/2012 15:06:31;0080; pbs_mom.3661;n/a;mom_get_sample;proc_array loaded - nproc=285 12/11/2012 15:06:31;0080; pbs_mom.3661;n/a;cput_sum;proc_array loop start - jobid = 15.is003274.intra.cea.fr 12/11/2012 15:06:31;0002; pbs_mom.3661;n/a;cput_sum;cput_sum: session=3977 pid=3977 cputime=0 (cputfactor=1.000000) 12/11/2012 15:06:31;0002; pbs_mom.3661;n/a;cput_sum;cput_sum: session=3977 pid=3998 cputime=0 (cputfactor=1.000000) 12/11/2012 15:06:31;0002; pbs_mom.3661;n/a;cput_sum;cput_sum: session=3977 pid=3999 cputime=0 (cputfactor=1.000000) 12/11/2012 15:06:31;0080; pbs_mom.3661;n/a;mem_sum;proc_array loop start - jobid = 15.is003274.intra.cea.fr 12/11/2012 15:06:31;0002; pbs_mom.3661;n/a;mem_sum;mem_sum: session=3977 pid=3977 vsize=16019456 sum=16019456 12/11/2012 15:06:31;0002; pbs_mom.3661;n/a;mem_sum;mem_sum: session=3977 pid=3998 vsize=9412608 sum=25432064 12/11/2012 15:06:31;0002; pbs_mom.3661;n/a;mem_sum;mem_sum: session=3977 pid=3999 vsize=55603200 sum=81035264 12/11/2012 15:06:31;0080; pbs_mom.3661;n/a;resi_sum;proc_array loop start - jobid = 15.is003274.intra.cea.fr 12/11/2012 15:06:31;0002; pbs_mom.3661;n/a;resi_sum;resi_sum: session=3977 pid=3977 rss=1708032 sum=1708032 12/11/2012 15:06:31;0002; pbs_mom.3661;n/a;resi_sum;resi_sum: session=3977 pid=3998 rss=1302528 sum=3010560 12/11/2012 15:06:31;0002; pbs_mom.3661;n/a;resi_sum;resi_sum: session=3977 pid=3999 rss=2523136 sum=5533696 12/11/2012 15:06:31;0008; pbs_mom.3661;Job;scan_for_terminated;pid 4000 not tracked, statloc=65024, exitval=254 12/11/2012 15:06:31;0002; pbs_mom.3661;node;close_conn;Connection 
8 - func 414387 12/11/2012 15:06:31;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;matching task located, marking interface closed 12/11/2012 15:06:31;0008; pbs_mom.3661;Job;tcp_request;tcp_request: fd 8 addr 127.0.0.1:43399 12/11/2012 15:06:31;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;tm_request: job 15.is003274.intra.cea.fr cookie CAAFC3D6302C31FCF9BD92DE9205655D task 1 com 102 event 3 12/11/2012 15:06:31;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;tm_spawn_request: SPAWN 15.is003274.intra.cea.fr on node 1 12/11/2012 15:06:31;0008; pbs_mom.3661;Job;tcp_request;tcp_request: fd 10 addr 10.0.238.149:606 12/11/2012 15:06:31;0002; pbs_mom.3661;Svr;im_request;connect from 10.0.238.149:606 12/11/2012 15:06:31;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;im_request:rec req 'SPAWN_TASK' (3) for job 15.is003274.intra.cea.fr from 10.0.238.149:606 ev 3 task 1 cookie CAAFC3D6302C31FCF9BD92DE9205655D 12/11/2012 15:06:31;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;INFO: received request 'SPAWN_TASK' from 10.0.238.149:606 for job '15.is003274.intra.cea.fr' (spawning task on node '0' with taskid=3, globid='none' 12/11/2012 15:06:31;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;saving task (IM_SPAWN_TASK) 12/11/2012 15:06:31;0008; pbs_mom.3661;Svr;task_save;saving task in /var/lib/torque/mom_priv/jobs/15.is003274.intra.cea.fr.TK 12/11/2012 15:06:31;0002; pbs_mom.4001;n/a;mom_close_poll;entered 12/11/2012 15:06:51;0001; pbs_mom.3661;Job;15.is003274.intra.cea.fr;task not started, 'hostname', stdio setup failed (see syslog) 12/11/2012 15:06:51;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;ERROR: received request 'SPAWN_TASK' from 10.0.238.149:606 for job '15.is003274.intra.cea.fr' (cannot start task) 12/11/2012 15:06:51;0008; pbs_mom.3661;Job;scan_for_terminated;entered 12/11/2012 15:06:51;0080; pbs_mom.3661;Svr;mom_get_sample;proc_array load started 12/11/2012 15:06:51;0080; pbs_mom.3661;n/a;mom_get_sample;proc_array loaded - nproc=285 12/11/2012 15:06:51;0080; 
pbs_mom.3661;n/a;cput_sum;proc_array loop start - jobid = 15.is003274.intra.cea.fr 12/11/2012 15:06:51;0002; pbs_mom.3661;n/a;cput_sum;cput_sum: session=3977 pid=3977 cputime=0 (cputfactor=1.000000) 12/11/2012 15:06:51;0002; pbs_mom.3661;n/a;cput_sum;cput_sum: session=3977 pid=3998 cputime=0 (cputfactor=1.000000) 12/11/2012 15:06:51;0002; pbs_mom.3661;n/a;cput_sum;cput_sum: session=3977 pid=3999 cputime=0 (cputfactor=1.000000) 12/11/2012 15:06:51;0080; pbs_mom.3661;n/a;mem_sum;proc_array loop start - jobid = 15.is003274.intra.cea.fr 12/11/2012 15:06:51;0002; pbs_mom.3661;n/a;mem_sum;mem_sum: session=3977 pid=3977 vsize=16019456 sum=16019456 12/11/2012 15:06:51;0002; pbs_mom.3661;n/a;mem_sum;mem_sum: session=3977 pid=3998 vsize=9412608 sum=25432064 12/11/2012 15:06:51;0002; pbs_mom.3661;n/a;mem_sum;mem_sum: session=3977 pid=3999 vsize=55603200 sum=81035264 12/11/2012 15:06:51;0080; pbs_mom.3661;n/a;resi_sum;proc_array loop start - jobid = 15.is003274.intra.cea.fr 12/11/2012 15:06:51;0002; pbs_mom.3661;n/a;resi_sum;resi_sum: session=3977 pid=3977 rss=1708032 sum=1708032 12/11/2012 15:06:51;0002; pbs_mom.3661;n/a;resi_sum;resi_sum: session=3977 pid=3998 rss=1302528 sum=3010560 12/11/2012 15:06:51;0002; pbs_mom.3661;n/a;resi_sum;resi_sum: session=3977 pid=3999 rss=2543616 sum=5554176 12/11/2012 15:06:51;0008; pbs_mom.3661;Job;scan_for_terminated;pid 4001 not tracked, statloc=65024, exitval=254 12/11/2012 15:06:51;0008; pbs_mom.3661;Job;tcp_request;tcp_request: fd 10 addr 10.0.238.149:692 12/11/2012 15:06:51;0002; pbs_mom.3661;Svr;im_request;connect from 10.0.238.149:692 12/11/2012 15:06:51;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;im_request:rec req 'ERROR' (99) for job 15.is003274.intra.cea.fr from 10.0.238.149:692 ev 3 task 1 cookie CAAFC3D6302C31FCF9BD92DE9205655D 12/11/2012 15:06:51;0001; pbs_mom.3661;Svr;pbs_mom;LOG_ERROR::im_request, Response recieved from client 10.0.238.149:692 (15003) jobid 15.is003274.intra.cea.fr 12/11/2012 15:06:51;0008; 
pbs_mom.3661;Job;15.is003274.intra.cea.fr;im_request: REQUEST 3 15.is003274.intra.cea.fr returned ERROR 17000 12/11/2012 15:06:51;0002; pbs_mom.3661;node;close_conn;Connection 8 - func 414387 12/11/2012 15:06:51;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;matching task located, marking interface closed 12/11/2012 15:06:51;0008; pbs_mom.3661;Job;tcp_request;tcp_request: fd 8 addr 127.0.0.1:43410 12/11/2012 15:06:51;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;tm_request: job 15.is003274.intra.cea.fr cookie CAAFC3D6302C31FCF9BD92DE9205655D task 1 com 102 event 4 12/11/2012 15:06:51;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;tm_spawn_request: SPAWN 15.is003274.intra.cea.fr on node 2 12/11/2012 15:06:51;0008; pbs_mom.3661;Job;tcp_request;tcp_request: fd 10 addr 10.0.238.149:310 12/11/2012 15:06:51;0002; pbs_mom.3661;Svr;im_request;connect from 10.0.238.149:310 12/11/2012 15:06:51;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;im_request:rec req 'SPAWN_TASK' (3) for job 15.is003274.intra.cea.fr from 10.0.238.149:310 ev 4 task 1 cookie CAAFC3D6302C31FCF9BD92DE9205655D 12/11/2012 15:06:51;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;INFO: received request 'SPAWN_TASK' from 10.0.238.149:310 for job '15.is003274.intra.cea.fr' (spawning task on node '0' with taskid=4, globid='none' 12/11/2012 15:06:51;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;saving task (IM_SPAWN_TASK) 12/11/2012 15:06:51;0008; pbs_mom.3661;Svr;task_save;saving task in /var/lib/torque/mom_priv/jobs/15.is003274.intra.cea.fr.TK 12/11/2012 15:06:51;0002; pbs_mom.4002;n/a;mom_close_poll;entered 12/11/2012 15:07:11;0001; pbs_mom.3661;Job;15.is003274.intra.cea.fr;task not started, 'hostname', stdio setup failed (see syslog) 12/11/2012 15:07:11;0008; pbs_mom.3661;Job;15.is003274.intra.cea.fr;ERROR: received request 'SPAWN_TASK' from 10.0.238.149:310 for job '15.is003274.intra.cea.fr' (cannot start task) 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;mom_server_all_update_stat;composing status 
update for server 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;sessions;sessions[0]: pid 2530 sid 2529 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;sessions;sessions[1]: pid 3977 sid 3977 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;sessions;sessions[2]: pid 3998 sid 3977 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;sessions;sessions[2]: pid 3999 sid 3977 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;sessions;nsessions=2 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;sessions;sessions[0]: pid 2530 sid 2529 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;sessions;sessions[1]: pid 3977 sid 3977 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;sessions;sessions[2]: pid 3998 sid 3977 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;sessions;sessions[2]: pid 3999 sid 3977 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;sessions;nsessions=2 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;sessions;sessions[0]: pid 2530 sid 2529 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;sessions;sessions[1]: pid 3977 sid 3977 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;sessions;sessions[2]: pid 3998 sid 3977 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;sessions;sessions[2]: pid 3999 sid 3977 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;sessions;nsessions=2 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;nusers;nusers[0]: pid 2530 uid 496 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;nusers;nusers[1]: pid 3977 uid 1116 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;nusers;nusers[2]: pid 3998 uid 1116 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;nusers;nusers[2]: pid 3999 uid 1116 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;nusers;nusers=2 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;totmem;totmem: total mem=51249725440 12/11/2012 15:07:11;0002; pbs_mom.3661;n/a;availmem;availmem: free mem=50474262528 12/11/2012 15:07:11;0002; pbs_mom.3661;node;ncpus;ncpus=12 12/11/2012 15:07:11;0001; pbs_mom.3661;Svr;pbs_mom;LOG_DEBUG::gpus, gpus: GPU cmd issued: nvidia-smi -q -x 2>&1 12/11/2012 15:07:30;0002; 
pbs_mom.3661;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "opsys=linux" 12/11/2012 15:07:30;0002; pbs_mom.3661;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "uname=Linux oscarnode49 2.6.32-279.14.1.el6.x86_64 #1 SMP Tue Nov 6 23:43:09 UTC 2012 x86_64" 12/11/2012 15:07:30;0002; pbs_mom.3661;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "sessions=2529 3977" 12/11/2012 15:07:30;0002; pbs_mom.3661;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "nsessions=2" 12/11/2012 15:07:30;0002; pbs_mom.3661;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "nusers=2" [...] Olivier. -- Olivier LAHAYE CEA DRT/LIST/DCSI/DIR -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121212/58b852b8/attachment-0001.html From nieroda.lech at uni-koeln.de Wed Dec 12 09:57:55 2012 From: nieroda.lech at uni-koeln.de (Lech Nieroda) Date: Wed, 12 Dec 2012 17:57:55 +0100 Subject: [torqueusers] Running Jobs suddenly "Unknown" and killed on Torque 4.1.3 In-Reply-To: <50B36868.40603@uni-koeln.de> References: <50AE48F0.1070807@uni-koeln.de> <50B36868.40603@uni-koeln.de> Message-ID: <50C8B793.5090107@uni-koeln.de> Dear list, the problem continues to persist: ca. 1 every 1000-2000 jobs is killed. We've noticed that it almost disappears when the pbs_server loglevel is raised to 7 (ca. one occurrence of the error per week). Considering that higher loglevels simply generate more output and may thus prolong the execution of some tasks for a minuscule amount of time, this may turn out to be a race condition issue. We can't permanently set the loglevel to 7 though, because of two problems: the daily logfile size is ca. 30GB and pbs_server crashes every 2-3 days. The error: initially, the server sends an abort for no apparent reason, however it mails the user "Job does not exist on node". 
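
To cross-check this kind of claim against large logfiles, the log lines quoted in this thread can be split mechanically. The sketch below is an assumption based only on the excerpts shown here (semicolon-separated fields: timestamp, event code, daemon, class, object, message, with MM/DD/YYYY dates); it is not taken from the TORQUE sources, and the names `LogRecord`, `parse_line` and `job_reported` are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    when: datetime
    event: str
    daemon: str
    kind: str
    obj: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one pbs_mom/pbs_server log line into its six fields.

    Returns None for lines that do not match the assumed layout.
    """
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None
    stamp, event, daemon, kind, obj, message = (p.strip() for p in parts)
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return LogRecord(when, event, daemon, kind, obj, message)


def job_reported(record: LogRecord, job_id: str) -> bool:
    """True if a mom status update lists job_id in its "jobs=" field."""
    prefix = 'sending to server "jobs='
    start = record.message.find(prefix)
    if start < 0:
        return False
    jobs = record.message[start + len(prefix):].rstrip('"').split()
    return job_id in jobs
```

Run over a mom log, this makes it possible to list the timestamps at which a given job was still reported to the server and compare them with the time of the SignalJob request.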
Even though the pbs_mom sends the correct joblist to the pbs_server, it receives a SIGKILL signal for the job (here with the jobid 721711, no dependencies, etc) without an error message of its own: [pbs_server snip] 12/12/2012 02:49:46;0004;PBS_Server.27962;Svr;svr_connect;attempting connect to host 172.18.1.220 port 15002 12/12/2012 02:49:46;0008;PBS_Server.27962;Job;svr_setjobstate;svr_setjobstate: setting job 721711.cheops10 state from RUNNING-RUNNING to QUEUED-SUBSTATE55 (1-55) [...] 12/12/2012 02:49:46;000d;PBS_Server.27962;Job;721711.cheops10;preparing to send 'a' mail for job 721711.cheops10 to fernandl at cheops0 (Job does not exist on node) [pbs_server snap] [pbs_mom snip] 12/12/2012 02:49:50;0002; pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "jobs=722456.cheops10 721711.cheops10 722457.cheops10 722458.cheops10" 12/12/2012 02:49:50;0002; pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "varattr= " 12/12/2012 02:49:50;0008; pbs_mom.31686;Job;read_tcp_reply;protocol: 4 version: 3 command:4 sock:8 12/12/2012 02:49:50;0002; pbs_mom.31686;n/a;mom_server_update_stat;status update successfully sent to cheops10 12/12/2012 02:49:50;0080; pbs_mom.31686;Req;dis_request_read;decoding command SignalJob from PBS_Server 12/12/2012 02:49:50;0100; pbs_mom.31686;Req;;Type SignalJob request received from PBS_Server at cheops10, sock=8 12/12/2012 02:49:50;0008; pbs_mom.31686;Job;mom_process_request;request type SignalJob from host cheops10 received 12/12/2012 02:49:50;0008; pbs_mom.31686;Job;mom_process_request;request type SignalJob from host cheops10 allowed 12/12/2012 02:49:50;0008; pbs_mom.31686;Job;mom_dispatch_request;dispatching request SignalJob on sd=8 12/12/2012 02:49:50;0008; pbs_mom.31686;Job;721711.cheops10;signaling job with signal SIGKILL 12/12/2012 02:49:50;0008; pbs_mom.31686;Job;kill_job;req_signaljob: sending signal 9, "KILL" to job 721711.cheops10, reason: killing job [...] 
12/12/2012 02:50:35;0080; pbs_mom.31686;Svr;preobit_reply;top of preobit_reply 12/12/2012 02:50:35;0080; pbs_mom.31686;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 12/12/2012 02:50:35;0001; pbs_mom.31686;Job;721711.cheops10;preobit_reply, unknown on server, deleting locally [pbs_mom snap] Ideas? Fixes? Regards, Lech Nieroda In case it helps to analyze the bug On 26.11.2012 14:02, Lech Nieroda wrote: > Dear list, > > I've raised the loglevel on one of the clients to loglevel=7 in order to > collect more information on the event. > > maui just assumes a successful completion of the job. > > pbs_server log regarding the job 686938[4] that suddenly becomes unknown > at 00:15:21: > [snip] > /var/spool/torque/server_logs/20121125:11/25/2012 > 10:52:15;0008;PBS_Server.12742;Job;686938[4].cheops10;Job Run at request > of maui at localhost.localdomain > /var/spool/torque/server_logs/20121125:11/25/2012 > 10:52:15;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate: > setting job 686938[4].cheops10 state from QUEUED-QUEUED to > RUNNING-PRERUN (4-40) > /var/spool/torque/server_logs/20121125:11/25/2012 > 10:52:16;0002;PBS_Server.12742;Job;686938[4].cheops10;child reported > success for job after 1 seconds (dest=???), rc=0 > /var/spool/torque/server_logs/20121125:11/25/2012 > 10:52:16;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate: > setting job 686938[4].cheops10 state from RUNNING-TRNOUTCM to > RUNNING-RUNNING (4-42) > /var/spool/torque/server_logs/20121125:11/25/2012 > 10:52:16;000d;PBS_Server.12742;Job;686938[4].cheops10;Not sending email: > User does not want mail of this type. 
> /var/spool/torque/server_logs/20121125:11/25/2012 > 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate: > setting job 686938[4].cheops10 state from RUNNING-RUNNING to > QUEUED-SUBSTATE55 (1-55) > /var/spool/torque/server_logs/20121126:11/26/2012 > 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate: > setting job 686938[4].cheops10 state from QUEUED-SUBSTATE55 to > EXITING-SUBSTATE55 (5-55) > /var/spool/torque/server_logs/20121126:11/26/2012 > 00:15:21;0100;PBS_Server.12702;Job;686938[4].cheops10;dequeuing from > smp, state EXITING > /var/spool/torque/server_logs/20121126:11/26/2012 > 00:16:02;0001;PBS_Server.12640;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, stray > job 686938[4].cheops10 found on cheops11801 > [snap] > > pbs_mom log at loglevel 7, shortly before the kill: > [snip] > 11/26/2012 00:14:26;0008; pbs_mom.26562;Req;send_sisters;sending > command POLL_JOB for job 686938[4].cheops10 (7) > 11/26/2012 00:15:02;0002; > pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: sending > to server "jobs=686659[1].cheops10 686694[8].cheops10 > 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10 > 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10 > 686941[130].cheops10 686941[132].cheops10 686941[133].cheops10 > 686941[150].cheops10 686941[151].cheops10 686941[153].cheops10 > 686941[186].cheops10 686941[187].cheops10 686941[188].cheops10 > 686941[212].cheops10 687103[1].cheops10 687224.cheops10 > 687252[1].cheops10 687279[1].cheops10 687279[2].cheops10 > 687279[3].cheops10 687334[1].cheops10 687334[2].cheops10 > 687367[1].cheops10 687367[2].cheops10 687366.cheops10" > 11/26/2012 00:15:11;0080; pbs_mom.26562;n/a;cput_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:15:11;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:15:11;0080; pbs_mom.26562;n/a;resi_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 
00:15:11;0008; pbs_mom.26562;Req;send_sisters;sending > command POLL_JOB for job 686938[4].cheops10 (7) > 11/26/2012 00:15:56;0080; pbs_mom.26562;n/a;cput_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:15:56;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:15:56;0080; pbs_mom.26562;n/a;resi_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:15:56;0008; pbs_mom.26562;Req;send_sisters;sending > command POLL_JOB for job 686938[4].cheops10 (7) > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;cput_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;resi_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:16:00;0008; > pbs_mom.26562;Job;686938[4].cheops10;scan_for_exiting:job is in > non-exiting substate RUNNING, no obit sent at this time > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;cput_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;resi_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;cput_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;resi_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;cput_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;resi_sum;proc_array loop > start - jobid = 686938[4].cheops10 > 
11/26/2012 00:16:01;0080; > pbs_mom.26562;Job;686938[4].cheops10;checking job w/subtask pid=0 (child > pid=4214) > 11/26/2012 00:16:02;0002; > pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: sending > to server "jobs=686659[1].cheops10 686694[8].cheops10 > 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10 > 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10 > 686941[130].cheops10 686941[132].cheops10 686941[150].cheops10 > 686941[151].cheops10 686941[153].cheops10 686941[186].cheops10 > 686941[187].cheops10 686941[188].cheops10 686941[212].cheops10 > 687103[1].cheops10 687224.cheops10 687252[1].cheops10 687279[1].cheops10 > 687279[2].cheops10 687279[3].cheops10 687334[1].cheops10 > 687334[2].cheops10 687367[1].cheops10 687367[2].cheops10 687366.cheops10" > 11/26/2012 00:16:02;0008; > pbs_mom.26562;Job;686938[4].cheops10;signaling job with signal SIGKILL > 11/26/2012 00:16:02;0008; pbs_mom.26562;Job;kill_job;req_signaljob: > sending signal 9, "KILL" to job 686938[4].cheops10, reason: killing job > [snap] > > As we can see, the pbs_mom sends complete job lists to the pbs_server > right before (00:15:02) and after (00:16:02) the SUBSTATE55 setting > (00:15:21), where the job 686938[4].cheops10 is included. However, the > server claims that the job is unknown/doesn't exist and sends a kill > command which the pbs_mom then executes (0:16:02). > > I hope this helps in tracking down the bug. > > Regards, > Lech Nieroda > > > On 22.11.2012 16:46, Lech Nieroda wrote: >> Dear list, >> >> we have another serious problem since our upgrade to Torque 4.1.3. We >> are using it with Maui 3.3.1. The problem in a nutshell: some few, >> random jobs are suddenly "unknown" to the server, it changes their >> status to EXITING-SUBSTATE55 and requests a silent kill on the compute >> nodes. The job then dies, the processes are killed on the node, there is >> no "Exit_status" in the server-log, no entry in maui/stats, no >> stdout/stderr files. 
The users are, understandably, not amused. >> >> It doesn't seem to be user or application specific. Even a single >> instance from a job array can get blown away in this way while all other >> instances end normally. >> >> Here are some logs of such a job (681684[35]): >> >> maui just assumes a successful completion: >> [snip] >> 11/21 19:24:49 MPBSJobUpdate(681684[35],681684[35].cheops10,TaskList,0) >> 11/21 19:24:49 INFO: Average nodespeed for Job 681684[35] is 1.000000, >> 1.000000, 1 >> 11/21 19:25:55 INFO: active PBS job 681684[35] has been removed from >> the queue. assuming successful completion >> 11/21 19:25:55 MJobProcessCompleted(681684[35]) >> 11/21 19:25:55 INFO: job '681684[35]' completed X: 0.063356 T: >> 10903 PS: 10903 A: 0.063096 >> 11/21 19:25:55 MJobSendFB(681684[35]) >> 11/21 19:25:55 INFO: job usage sent for job '681684[35]' >> 11/21 19:25:55 MJobRemove(681684[35]) >> 11/21 19:25:55 MJobDestroy(681684[35]) >> [snap] >> >> pbs_server decides at 19:25:11, after 3 hours runtime, that the job is >> unknown (grepped by JobID from the server logs): >> [snip] >> 11/21/2012 >> 16:23:43;0008;PBS_Server.26038;Job;svr_setjobstate;svr_setjobstate: >> setting job 681684[35].cheops10 state from RUNNING-TRNOUTCM to >> RUNNING-RUNNING (4-42) >> 11/21/2012 >> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: >> setting job 681684[35].cheops10 state from RUNNING-RUNNING to >> QUEUED-SUBSTATE55 (1-55) >> 11/21/2012 >> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: >> setting job 681684[35].cheops10 state from QUEUED-SUBSTATE55 to >> EXITING-SUBSTATE55 (5-55) >> 11/21/2012 >> 19:25:11;0100;PBS_Server.26097;Job;681684[35].cheops10;dequeuing from >> smp, state EXITING >> 11/21/2012 >> 19:25:14;0001;PBS_Server.26122;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, stray >> job 681684[35].cheops10 found on cheops21316 >> [snap] >> >> pbs_client just kills the processes: >> [snip] >> 11/21/2012 16:23:43;0001; 
pbs_mom.32254;Job;TMomFinalizeJob3;job >> 681684[35].cheops10 started, pid = 17452 >> 11/21/2012 19:25:14;0008; >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17452 task >> 1 gracefully with sig 15 >> 11/21/2012 19:25:14;0008; >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process >> (pid=17452/state=R) after sig 15 >> 11/21/2012 19:25:14;0008; >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process >> (pid=17452/state=Z) after sig 15 >> 11/21/2012 19:25:14;0008; >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17692 task >> 1 gracefully with sig 15 >> 11/21/2012 19:25:14;0008; >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process >> (pid=17692/state=R) after sig 15 >> 11/21/2012 19:25:14;0008; >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17703 task >> 1 gracefully with sig 15 >> 11/21/2012 19:25:14;0008; >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process >> (pid=17703/state=R) after sig 15 >> 11/21/2012 19:25:14;0008; >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17731 task >> 1 gracefully with sig 15 >> 11/21/2012 19:25:14;0008; >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process >> (pid=17731/state=R) after sig 15 >> 11/21/2012 19:25:15;0080; >> pbs_mom.32254;Job;681684[35].cheops10;scan_for_terminated: job >> 681684[35].cheops10 task 1 terminated, sid=17452 >> 11/21/2012 19:25:15;0008; pbs_mom.32254;Job;681684[35].cheops10;job >> was terminated >> 11/21/2012 19:25:50;0001; >> pbs_mom.32254;Job;681684[35].cheops10;preobit_reply, unknown on server, >> deleting locally >> 11/21/2012 19:25:50;0080; >> pbs_mom.32254;Job;681684[35].cheops10;removed job script >> [snap] >> >> Sometimes, the pbs_mom logs include this message before the killing starts: >> [snip] >> Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, >> type=StatusJob, from PBS_Server at cheops10 >> [snap] >> >> And finally, some job informations given to epilogue: >> [snip] >> Nov 21 
19:25:15 s_sys at cheops21316 epilogue.shared: >> 681684[35].cheops10,hthiele0,cheops21316,Starting shared epilogue >> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared: >> 681684[35].cheops10,hthiele0,cheops21316,Job Information: >> userid=hthiele0, >> resourcelist='mem=5gb,ncpus=1,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=48:00:00', >> resourcesused='cput=03:00:46,mem=945160kb,vmem=1368548kb,walltime=03:01:34', >> queue=smp, account=ccg-ngs, exitcode=271 >> [snap] >> >> This happens rarely (about 1 in 3000). However, silent deletions of >> random jobs aren't exactly a trifling matter. >> I could try to disable the mom_job_sync option, which could perhaps >> prevent the process killing of unknown jobs, but it would also leave >> corrupt/pre-execution jobs alive. >> >> Can this be fixed? >> >> On a side-note, here are some further, minor Bugs I've noticed in the >> Torque 4.1.3. Version: >> - the epilogue script is usually invoked twice and sometimes even >> several times >> - when explicit node lists are used, e.g. nodes=node1:ppn=2+node2:ppn=2, >> then the number of "tasks" as seen by qstat is zero >> - there have been some API changes between Torque 2.x and Torque 4.x, so >> that two maui calls had to be altered in order to build against Torque >> 4.x (get_svrport, openrm). >> >> >> Regards, >> Lech Nieroda >> > > -- Dipl.-Wirt.-Inf. Lech Nieroda Regionales Rechenzentrum der Universität zu Köln (RRZK) Universität zu Köln Weyertal 121 Raum 309 (3.
Etage) D-50931 K?ln Deutschland Tel.: +49 (221) 470-89606 E-Mail: nieroda.lech at uni-koeln.de From dbeer at adaptivecomputing.com Wed Dec 12 10:16:58 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Wed, 12 Dec 2012 10:16:58 -0700 Subject: [torqueusers] Running Jobs suddenly "Unknown" and killed on Torque 4.1.3 In-Reply-To: <50C8B793.5090107@uni-koeln.de> References: <50AE48F0.1070807@uni-koeln.de> <50B36868.40603@uni-koeln.de> <50C8B793.5090107@uni-koeln.de> Message-ID: Lech, >From your log snippet, it appears that the server sent a poll request to the mom (for the job) and the mom replied back that it didn't know about that job, getting the job cancelled on the server's side. I'm going to do a little more investigation to see if there's some way that the poll request to the mom can time out or otherwise fail to erroneously create this condition. If you can provide any more details about creating this scenario, that'd be great. Helpful information would be describing the lengths of jobs that are running, if they are parallel or serial, or estimate the ratio of parallel to serial, things like that might help us figure it out. David On Wed, Dec 12, 2012 at 9:57 AM, Lech Nieroda wrote: > Dear list, > > the problem continues to persist: ca. 1 every 1000-2000 jobs is killed. > We've noticed that it almost disappears when the pbs_server loglevel is > raised to 7 (ca. one occurrence of the error per week). Considering that > higher loglevels simply generate more output and may thus prolong the > execution of some tasks for a minuscule amount of time, this may turn > out to be a race condition issue. > We can't permanently set the loglevel to 7 though, because of two > problems: the daily logfile size is ca. 30GB and pbs_server crashes > every 2-3 days. > > The error: initially, the server sends an abort for no apparent reason, > however it mails the user "Job does not exist on node". 
Even though the > pbs_mom sends the correct joblist to the pbs_server, it receives a > SIGKILL signal for the job (here with the jobid 721711, no dependencies, > etc) without an error message of its own: > > [pbs_server snip] > 12/12/2012 02:49:46;0004;PBS_Server.27962;Svr;svr_connect;attempting > connect to host 172.18.1.220 port 15002 > 12/12/2012 > 02:49:46;0008;PBS_Server.27962;Job;svr_setjobstate;svr_setjobstate: > setting job 721711.cheops10 state from RUNNING-RUNNING to > QUEUED-SUBSTATE55 (1-55) > [...] > 12/12/2012 02:49:46;000d;PBS_Server.27962;Job;721711.cheops10;preparing > to send 'a' mail for job 721711.cheops10 to fernandl at cheops0 (Job does > not exist on node) > [pbs_server snap] > > [pbs_mom snip] > 12/12/2012 02:49:50;0002; > pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: sending > to server "jobs=722456.cheops10 721711.cheops10 722457.cheops10 > 722458.cheops10" > 12/12/2012 02:49:50;0002; > pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: sending > to server "varattr= " > 12/12/2012 02:49:50;0008; pbs_mom.31686;Job;read_tcp_reply;protocol: 4 > version: 3 command:4 sock:8 > 12/12/2012 02:49:50;0002; > pbs_mom.31686;n/a;mom_server_update_stat;status update successfully sent > to cheops10 > 12/12/2012 02:49:50;0080; pbs_mom.31686;Req;dis_request_read;decoding > command SignalJob from PBS_Server > 12/12/2012 02:49:50;0100; pbs_mom.31686;Req;;Type SignalJob request > received from PBS_Server at cheops10, sock=8 > 12/12/2012 02:49:50;0008; > pbs_mom.31686;Job;mom_process_request;request type SignalJob from host > cheops10 received > 12/12/2012 02:49:50;0008; > pbs_mom.31686;Job;mom_process_request;request type SignalJob from host > cheops10 allowed > 12/12/2012 02:49:50;0008; > pbs_mom.31686;Job;mom_dispatch_request;dispatching request SignalJob on > sd=8 > 12/12/2012 02:49:50;0008; pbs_mom.31686;Job;721711.cheops10;signaling > job with signal SIGKILL > 12/12/2012 02:49:50;0008; 
pbs_mom.31686;Job;kill_job;req_signaljob: > sending signal 9, "KILL" to job 721711.cheops10, reason: killing job > [...] > 12/12/2012 02:50:35;0080; pbs_mom.31686;Svr;preobit_reply;top of > preobit_reply > 12/12/2012 02:50:35;0080; > pbs_mom.31686;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr > worked, top of while loop > 12/12/2012 02:50:35;0001; > pbs_mom.31686;Job;721711.cheops10;preobit_reply, unknown on server, > deleting locally > [pbs_mom snap] > > Ideas? Fixes? > > Regards, > Lech Nieroda > > > > In case it helps to analyze the bug > > On 26.11.2012 14:02, Lech Nieroda wrote: > > Dear list, > > > > I've raised the loglevel on one of the clients to loglevel=7 in order to > > collect more information on the event. > > > > maui just assumes a successful completion of the job. > > > > pbs_server log regarding the job 686938[4] that suddenly becomes unknown > > at 00:15:21: > > [snip] > > /var/spool/torque/server_logs/20121125:11/25/2012 > > 10:52:15;0008;PBS_Server.12742;Job;686938[4].cheops10;Job Run at request > > of maui at localhost.localdomain > > /var/spool/torque/server_logs/20121125:11/25/2012 > > 10:52:15;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate: > > setting job 686938[4].cheops10 state from QUEUED-QUEUED to > > RUNNING-PRERUN (4-40) > > /var/spool/torque/server_logs/20121125:11/25/2012 > > 10:52:16;0002;PBS_Server.12742;Job;686938[4].cheops10;child reported > > success for job after 1 seconds (dest=???), rc=0 > > /var/spool/torque/server_logs/20121125:11/25/2012 > > 10:52:16;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate: > > setting job 686938[4].cheops10 state from RUNNING-TRNOUTCM to > > RUNNING-RUNNING (4-42) > > /var/spool/torque/server_logs/20121125:11/25/2012 > > 10:52:16;000d;PBS_Server.12742;Job;686938[4].cheops10;Not sending email: > > User does not want mail of this type. 
> > /var/spool/torque/server_logs/20121125:11/25/2012 > > 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate: > > setting job 686938[4].cheops10 state from RUNNING-RUNNING to > > QUEUED-SUBSTATE55 (1-55) > > /var/spool/torque/server_logs/20121126:11/26/2012 > > 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate: > > setting job 686938[4].cheops10 state from QUEUED-SUBSTATE55 to > > EXITING-SUBSTATE55 (5-55) > > /var/spool/torque/server_logs/20121126:11/26/2012 > > 00:15:21;0100;PBS_Server.12702;Job;686938[4].cheops10;dequeuing from > > smp, state EXITING > > /var/spool/torque/server_logs/20121126:11/26/2012 > > > 00:16:02;0001;PBS_Server.12640;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, > stray > > job 686938[4].cheops10 found on cheops11801 > > [snap] > > > > pbs_mom log at loglevel 7, shortly before the kill: > > [snip] > > 11/26/2012 00:14:26;0008; pbs_mom.26562;Req;send_sisters;sending > > command POLL_JOB for job 686938[4].cheops10 (7) > > 11/26/2012 00:15:02;0002; > > pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: sending > > to server "jobs=686659[1].cheops10 686694[8].cheops10 > > 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10 > > 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10 > > 686941[130].cheops10 686941[132].cheops10 686941[133].cheops10 > > 686941[150].cheops10 686941[151].cheops10 686941[153].cheops10 > > 686941[186].cheops10 686941[187].cheops10 686941[188].cheops10 > > 686941[212].cheops10 687103[1].cheops10 687224.cheops10 > > 687252[1].cheops10 687279[1].cheops10 687279[2].cheops10 > > 687279[3].cheops10 687334[1].cheops10 687334[2].cheops10 > > 687367[1].cheops10 687367[2].cheops10 687366.cheops10" > > 11/26/2012 00:15:11;0080; pbs_mom.26562;n/a;cput_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:15:11;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:15:11;0080; 
pbs_mom.26562;n/a;resi_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:15:11;0008; pbs_mom.26562;Req;send_sisters;sending > > command POLL_JOB for job 686938[4].cheops10 (7) > > 11/26/2012 00:15:56;0080; pbs_mom.26562;n/a;cput_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:15:56;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:15:56;0080; pbs_mom.26562;n/a;resi_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:15:56;0008; pbs_mom.26562;Req;send_sisters;sending > > command POLL_JOB for job 686938[4].cheops10 (7) > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;cput_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;resi_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:00;0008; > > pbs_mom.26562;Job;686938[4].cheops10;scan_for_exiting:job is in > > non-exiting substate RUNNING, no obit sent at this time > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;cput_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;resi_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;cput_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;resi_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;cput_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:01;0080; 
pbs_mom.26562;n/a;mem_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;resi_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:01;0080; > > pbs_mom.26562;Job;686938[4].cheops10;checking job w/subtask pid=0 (child > > pid=4214) > > 11/26/2012 00:16:02;0002; > > pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: sending > > to server "jobs=686659[1].cheops10 686694[8].cheops10 > > 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10 > > 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10 > > 686941[130].cheops10 686941[132].cheops10 686941[150].cheops10 > > 686941[151].cheops10 686941[153].cheops10 686941[186].cheops10 > > 686941[187].cheops10 686941[188].cheops10 686941[212].cheops10 > > 687103[1].cheops10 687224.cheops10 687252[1].cheops10 687279[1].cheops10 > > 687279[2].cheops10 687279[3].cheops10 687334[1].cheops10 > > 687334[2].cheops10 687367[1].cheops10 687367[2].cheops10 687366.cheops10" > > 11/26/2012 00:16:02;0008; > > pbs_mom.26562;Job;686938[4].cheops10;signaling job with signal SIGKILL > > 11/26/2012 00:16:02;0008; pbs_mom.26562;Job;kill_job;req_signaljob: > > sending signal 9, "KILL" to job 686938[4].cheops10, reason: killing job > > [snap] > > > > As we can see, the pbs_mom sends complete job lists to the pbs_server > > right before (00:15:02) and after (00:16:02) the SUBSTATE55 setting > > (00:15:21), where the job 686938[4].cheops10 is included. However, the > > server claims that the job is unknown/doesn't exist and sends a kill > > command which the pbs_mom then executes (0:16:02). > > > > I hope this helps in tracking down the bug. > > > > Regards, > > Lech Nieroda > > > > > > On 22.11.2012 16:46, Lech Nieroda wrote: > >> Dear list, > >> > >> we have another serious problem since our upgrade to Torque 4.1.3. We > >> are using it with Maui 3.3.1. 
The problem in a nutshell: some few, > >> random jobs are suddenly "unknown" to the server, it changes their > >> status to EXITING-SUBSTATE55 and requests a silent kill on the compute > >> nodes. The job then dies, the processes are killed on the node, there is > >> no "Exit_status" in the server-log, no entry in maui/stats, no > >> stdout/stderr files. The users are, understandably, not amused. > >> > >> It doesn't seem to be user or application specific. Even a single > >> instance from a job array can get blown away in this way while all other > >> instances end normally. > >> > >> Here are some logs of such a job (681684[35]): > >> > >> maui just assumes a successful completion: > >> [snip] > >> 11/21 19:24:49 MPBSJobUpdate(681684[35],681684[35].cheops10,TaskList,0) > >> 11/21 19:24:49 INFO: Average nodespeed for Job 681684[35] is 1.000000, > >> 1.000000, 1 > >> 11/21 19:25:55 INFO: active PBS job 681684[35] has been removed from > >> the queue. assuming successful completion > >> 11/21 19:25:55 MJobProcessCompleted(681684[35]) > >> 11/21 19:25:55 INFO: job '681684[35]' completed X: 0.063356 T: > >> 10903 PS: 10903 A: 0.063096 > >> 11/21 19:25:55 MJobSendFB(681684[35]) > >> 11/21 19:25:55 INFO: job usage sent for job '681684[35]' > >> 11/21 19:25:55 MJobRemove(681684[35]) > >> 11/21 19:25:55 MJobDestroy(681684[35]) > >> [snap] > >> > >> pbs_server decides at 19:25:11, after 3 hours runtime, that the job is > >> unknown (grepped by JobID from the server logs): > >> [snip] > >> 11/21/2012 > >> 16:23:43;0008;PBS_Server.26038;Job;svr_setjobstate;svr_setjobstate: > >> setting job 681684[35].cheops10 state from RUNNING-TRNOUTCM to > >> RUNNING-RUNNING (4-42) > >> 11/21/2012 > >> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: > >> setting job 681684[35].cheops10 state from RUNNING-RUNNING to > >> QUEUED-SUBSTATE55 (1-55) > >> 11/21/2012 > >> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: > >> setting job 681684[35].cheops10 
state from QUEUED-SUBSTATE55 to > >> EXITING-SUBSTATE55 (5-55) > >> 11/21/2012 > >> 19:25:11;0100;PBS_Server.26097;Job;681684[35].cheops10;dequeuing from > >> smp, state EXITING > >> 11/21/2012 > >> > 19:25:14;0001;PBS_Server.26122;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, > stray > >> job 681684[35].cheops10 found on cheops21316 > >> [snap] > >> > >> pbs_client just kills the processes: > >> [snip] > >> 11/21/2012 16:23:43;0001; pbs_mom.32254;Job;TMomFinalizeJob3;job > >> 681684[35].cheops10 started, pid = 17452 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17452 task > >> 1 gracefully with sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process > >> (pid=17452/state=R) after sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process > >> (pid=17452/state=Z) after sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17692 task > >> 1 gracefully with sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process > >> (pid=17692/state=R) after sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17703 task > >> 1 gracefully with sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process > >> (pid=17703/state=R) after sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17731 task > >> 1 gracefully with sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process > >> (pid=17731/state=R) after sig 15 > >> 11/21/2012 19:25:15;0080; > >> pbs_mom.32254;Job;681684[35].cheops10;scan_for_terminated: job > >> 681684[35].cheops10 task 1 terminated, sid=17452 > >> 11/21/2012 19:25:15;0008; pbs_mom.32254;Job;681684[35].cheops10;job > >> was terminated > >> 11/21/2012 
19:25:50;0001; > >> pbs_mom.32254;Job;681684[35].cheops10;preobit_reply, unknown on server, > >> deleting locally > >> 11/21/2012 19:25:50;0080; > >> pbs_mom.32254;Job;681684[35].cheops10;removed job script > >> [snap] > >> > >> Sometimes, the pbs_mom logs include this message before the killing > starts: > >> [snip] > >> Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, > >> type=StatusJob, from PBS_Server at cheops10 > >> [snap] > >> > >> And finally, some job informations given to epilogue: > >> [snip] > >> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared: > >> 681684[35].cheops10,hthiele0,cheops21316,Starting shared epilogue > >> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared: > >> 681684[35].cheops10,hthiele0,cheops21316,Job Information: > >> userid=hthiele0, > >> > resourcelist='mem=5gb,ncpus=1,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=48:00:00', > >> > resourcesused='cput=03:00:46,mem=945160kb,vmem=1368548kb,walltime=03:01:34', > >> queue=smp, account=ccg-ngs, exitcode=271 > >> [snap] > >> > >> This happens rarely (about 1 in 3000). However, silent deletions of > >> random jobs aren't exactly a trifling matter. > >> I could try to disable the mom_job_sync option, which could perhaps > >> prevent the process killing of unknown jobs, but it would also leave > >> corrupt/pre-execution jobs alive. > >> > >> Can this be fixed? > >> > >> On a side-note, here are some further, minor Bugs I've noticed in the > >> Torque 4.1.3. Version: > >> - the epilogue script is usually invoked twice and sometimes even > >> several times > >> - when explicit node lists are used, e.g. nodes=node1:ppn=2+node2:ppn=2, > >> then the number of "tasks" as seen by qstat is zero > >> - there have been some API changes between Torque 2.x and Torque 4.x, so > >> that two maui calls had to be altered in order to build against Torque > >> 4.x (get_svrport, openrm). > >> > >> > >> Regards, > >> Lech Nieroda > >> > > > > > > > -- > Dipl.-Wirt.-Inf. 
Lech Nieroda
> Regionales Rechenzentrum der Universität zu Köln (RRZK)
> Universität zu Köln
> Weyertal 121
> Raum 309 (3. Etage)
> D-50931 Köln
> Deutschland
>
> Tel.: +49 (221) 470-89606
> E-Mail: nieroda.lech at uni-koeln.de
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

-- 
David Beer | Senior Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121212/328e74e9/attachment-0001.html

From nieroda.lech at uni-koeln.de Thu Dec 13 06:17:51 2012
From: nieroda.lech at uni-koeln.de (Lech Nieroda)
Date: Thu, 13 Dec 2012 14:17:51 +0100
Subject: [torqueusers] Running Jobs suddenly "Unknown" and killed on Torque 4.1.3
In-Reply-To:
References: <50AE48F0.1070807@uni-koeln.de> <50B36868.40603@uni-koeln.de> <50C8B793.5090107@uni-koeln.de>
Message-ID: <50C9D57F.7010007@uni-koeln.de>

Dear David,

thanks - let's hope that this error will be fixed.

As to the jobs that are getting killed, there doesn't seem to be any
discernible pattern, e.g. take a look at the last 5 killed jobs:

724272: killed after 43min runtime, single node program with 8 threads
(according to cputime, ca 4% serial execution)
724650[13]: killed after 148min runtime, single node jobarray instance
with a single core serial program (40 instances total)
724843[32]: killed after 12min runtime, single node jobarray instance
with a single core serial program (42 instances total)
723747: killed after 909min runtime, single node program with 12 threads
(according to cputime, ca 14% serial execution)
724044: killed after 667min runtime, mpi program with 6 nodes, 12 cores
each (according to cputime, ca 83% serial execution)

There are single node jobs with good/average/minimal thread
parallelization, classic serial jobs, various mpi jobs, instances of
jobarrays.
They can share a node or use it exclusively. They can use different
parallel filesystems, like Panasas or Lustre. They are submitted by
different users. They are killed after varying amounts of elapsed
runtime - it can happen after a few minutes, hours or days.
There doesn't seem to be a common denominator for these aborted jobs.

I've taken a look into the mom_logs on loglevel10, as well as the
server_logs on loglevel7.

On the server side, when looking at the thread which performs the
state-change to "QUEUED-SUBSTATE55", the previous message mentions
"locking complete" on the node on which the job runs, e.g. (the job is
699693[33]):
[snip]
11/30/2012 19:03:24;0080;PBS_Server.28147;node;lock_node;locking complete cheops21314 in method next_node
11/30/2012 19:03:24;0008;PBS_Server.28147;Job;svr_setjobstate;svr_setjobstate: setting job 699693[33].cheops10 state from RUNNING-RUNNING to QUEUED-SUBSTATE55 (1-55)
[snap]

When grepping the server_logs with the executing host
(cheops21314/172.18.3.132), there is a "svr_is_request" message which
comes from the executing host before the state-change and later on a few
"lockings":
[snip]
52243:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking complete cheops21314 in method svr_is_request
52244:11/30/2012 19:03:24;0004;PBS_Server.28116;Svr;svr_is_request;message STATUS (4) received from mom on host cheops21314 (172.18.3.132:273) (sock 13)
52245:11/30/2012 19:03:24;0004;PBS_Server.28116;Svr;svr_is_request;IS_STATUS received from cheops21314
52246:11/30/2012 19:03:24;0040;PBS_Server.28116;Req;is_stat_get;received status from node cheops21314
52247:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking complete cheops21314 in method find_nodebyname
52248:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking complete cheops21314 in method find_nodebyname
52276:11/30/2012 19:03:24;0040;PBS_Server.28116;Req;update_node_state;adjusting state for node cheops21314 - state=0, newstate=0
52277:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking complete cheops21314 in method find_nodebyname
52310:11/30/2012 19:03:24;0080;PBS_Server.28137;node;lock_node;locking complete cheops21314 in method find_nodebyname
52635:11/30/2012 19:03:24;0080;PBS_Server.28096;node;lock_node;locking complete cheops21314 in method next_node
54181:11/30/2012 19:03:24;0080;PBS_Server.28113;node;lock_node;locking complete cheops21314 in method next_node
54596:11/30/2012 19:03:24;0080;PBS_Server.28057;node;lock_node;locking complete cheops21314 in method next_node
54967:11/30/2012 19:03:24;0080;PBS_Server.28031;node;lock_node;locking complete cheops21314 in method next_node
55401:11/30/2012 19:03:24;0080;PBS_Server.28007;node;lock_node;locking complete cheops21314 in method next_node
56029:11/30/2012 19:03:24;0080;PBS_Server.28147;node;lock_node;locking complete cheops21314 in method next_node
56058:11/30/2012 19:03:24;0080;PBS_Server.28035;node;lock_node;locking complete cheops21314 in method next_node
56570:11/30/2012 19:03:24;0080;PBS_Server.28141;node;lock_node;locking complete cheops21314 in method next_node
56760:11/30/2012 19:03:24;0080;PBS_Server.28147;node;lock_node;locking complete cheops21314 in method find_nodebyname
56761:11/30/2012 19:03:24;0040;PBS_Server.28147;Req;remove_job_from_node;freeing node cheops21314/3 from job 699693[33].cheops10 (nsnfree=1)
[snap]

There is no obvious reference to a Job polling request - what should I
be looking for?

The client of this particular job has been redeployed and the logs are
no longer available.
However, here is a loglevel10 output of job 724272, grepped with the
jobid:
[snip]
12/13/2012 00:00:27;0008; pbs_mom.27123;Req;send_sisters;sending command POLL_JOB for job 724272.cheops10 (7)
12/13/2012 00:00:59;0002; pbs_mom.27123;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "jobs=717556.cheops10 724272.cheops10 724285.cheops10"
12/13/2012 00:00:59;0008; pbs_mom.27123;Job;724272.cheops10;signaling job with signal SIGKILL
12/13/2012 00:00:59;0008; pbs_mom.27123;Job;kill_job;req_signaljob: sending signal 9, "KILL" to job 724272.cheops10, reason: killing job
[snap]

There is a POLL_JOB request named "send_sisters" (strange, since it is a
single node job) and the pbs_mom sends a mom_server_update_stat with a
joblist including the job 724272. Right afterwards comes the SignalJob
to Kill, here the entire logsnippet between the mom_server_update_stat
and the kill:
[snip]
12/13/2012 00:00:59;0008; pbs_mom.27123;Job;read_tcp_reply;protocol: 4 version: 3 command:4 sock:8
12/13/2012 00:00:59;0002; pbs_mom.27123;n/a;mom_server_update_stat;status update successfully sent to cheops10
12/13/2012 00:00:59;0080; pbs_mom.27123;Req;dis_request_read;decoding command SignalJob from PBS_Server
12/13/2012 00:00:59;0100; pbs_mom.27123;Req;;Type SignalJob request received from PBS_Server at cheops10, sock=8
12/13/2012 00:00:59;0008; pbs_mom.27123;Job;mom_process_request;request type SignalJob from host cheops10 received
12/13/2012 00:00:59;0008; pbs_mom.27123;Job;mom_process_request;request type SignalJob from host cheops10 allowed
12/13/2012 00:00:59;0008; pbs_mom.27123;Job;mom_dispatch_request;dispatching request SignalJob on sd=8
12/13/2012 00:00:59;0008; pbs_mom.27123;Job;724272.cheops10;signaling job with signal SIGKILL
12/13/2012 00:00:59;0008; pbs_mom.27123;Job;kill_job;req_signaljob: sending signal 9, "KILL" to job 724272.cheops10, reason: killing job
[snap]

What other info do you need? Are there specific messages to look for?
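
For anyone who wants to pull the affected jobs out of a large server log, a small filter along the following lines might help. It is only a sketch: the field layout (timestamp;event code;daemon.pid;class;object;message) is assumed from the snippets quoted in this thread, the names `parse_line` and `find_unexpected_requeues` are made up for illustration and are not part of Torque, and the only message it looks for is the RUNNING-RUNNING to QUEUED-SUBSTATE55 transition shown above.

```python
import re

# Field layout assumed from the log snippets in this thread:
#   MM/DD/YYYY HH:MM:SS;event code;daemon.pid;class;object;message
# search() is used instead of match() so that grep prefixes such as
# "52243:" or a leading file name do not prevent a line from parsing.
LOG_LINE = re.compile(
    r"(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});"
    r"(?P<code>[0-9a-fA-F]{4});\s*"
    r"(?P<daemon>[^;]+);(?P<cls>[^;]*);(?P<obj>[^;]*);(?P<msg>.*)$"
)
STATE_CHANGE = re.compile(
    r"setting job (?P<job>\S+) state from (?P<old>\S+) to (?P<new>\S+)"
)


def parse_line(line):
    """Split one log line into its fields, or return None if it does not fit."""
    m = LOG_LINE.search(line.strip())
    return m.groupdict() if m else None


def find_unexpected_requeues(lines):
    """Return (timestamp, daemon.pid, job id) for every RUNNING -> SUBSTATE55 change."""
    hits = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        m = STATE_CHANGE.search(rec["msg"])
        if (m and m.group("old") == "RUNNING-RUNNING"
                and m.group("new") == "QUEUED-SUBSTATE55"):
            hits.append((rec["ts"], rec["daemon"], m.group("job")))
    return hits


# Two server log lines copied from the message above.
sample = [
    "11/30/2012 19:03:24;0080;PBS_Server.28147;node;lock_node;"
    "locking complete cheops21314 in method next_node",
    "11/30/2012 19:03:24;0008;PBS_Server.28147;Job;svr_setjobstate;"
    "svr_setjobstate: setting job 699693[33].cheops10 state from "
    "RUNNING-RUNNING to QUEUED-SUBSTATE55 (1-55)",
]

for ts, daemon, job in find_unexpected_requeues(sample):
    print(ts, daemon, job)
```

The daemon.pid field returned with each hit is the server thread that made the change, so the same log can then be grepped for that thread to see what it did immediately beforehand.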

Regards,
Lech

On 12.12.2012 18:16, David Beer wrote:
> Lech,
>
> From your log snippet, it appears that the server sent a poll request
> to the mom (for the job) and the mom replied back that it didn't know
> about that job, getting the job cancelled on the server's side. I'm
> going to do a little more investigation to see if there's some way that
> the poll request to the mom can time out or otherwise fail to
> erroneously create this condition. If you can provide any more details
> about creating this scenario, that'd be great. Helpful information would
> be describing the lengths of jobs that are running, if they are parallel
> or serial, or estimate the ratio of parallel to serial, things like that
> might help us figure it out.
>
> David
>
> On Wed, Dec 12, 2012 at 9:57 AM, Lech Nieroda
> wrote:
>
> Dear list,
>
> the problem continues to persist: ca. 1 every 1000-2000 jobs is killed.
> We've noticed that it almost disappears when the pbs_server loglevel is
> raised to 7 (ca. one occurrence of the error per week). Considering that
> higher loglevels simply generate more output and may thus prolong the
> execution of some tasks for a minuscule amount of time, this may turn
> out to be a race condition issue.
> We can't permanently set the loglevel to 7 though, because of two
> problems: the daily logfile size is ca. 30GB and pbs_server crashes
> every 2-3 days.
>
> The error: initially, the server sends an abort for no apparent reason,
> however it mails the user "Job does not exist on node".
Even though the > pbs_mom sends the correct joblist to the pbs_server, it receives a > SIGKILL signal for the job (here with the jobid 721711, no dependencies, > etc) without an error message of its own: > > [pbs_server snip] > 12/12/2012 02:49:46;0004;PBS_Server.27962;Svr;svr_connect;attempting > connect to host 172.18.1.220 port 15002 > 12/12/2012 > 02:49:46;0008;PBS_Server.27962;Job;svr_setjobstate;svr_setjobstate: > setting job 721711.cheops10 state from RUNNING-RUNNING to > QUEUED-SUBSTATE55 (1-55) > [...] > 12/12/2012 02:49:46;000d;PBS_Server.27962;Job;721711.cheops10;preparing > to send 'a' mail for job 721711.cheops10 to fernandl at cheops0 (Job does > not exist on node) > [pbs_server snap] > > [pbs_mom snip] > 12/12/2012 02:49:50;0002; > pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: sending > to server "jobs=722456.cheops10 721711.cheops10 722457.cheops10 > 722458.cheops10" > 12/12/2012 02:49:50;0002; > pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: sending > to server "varattr= " > 12/12/2012 02:49:50;0008; pbs_mom.31686;Job;read_tcp_reply;protocol: 4 > version: 3 command:4 sock:8 > 12/12/2012 02:49:50;0002; > pbs_mom.31686;n/a;mom_server_update_stat;status update successfully sent > to cheops10 > 12/12/2012 02:49:50;0080; pbs_mom.31686;Req;dis_request_read;decoding > command SignalJob from PBS_Server > 12/12/2012 02:49:50;0100; pbs_mom.31686;Req;;Type SignalJob request > received from PBS_Server at cheops10, sock=8 > 12/12/2012 02:49:50;0008; > pbs_mom.31686;Job;mom_process_request;request type SignalJob from host > cheops10 received > 12/12/2012 02:49:50;0008; > pbs_mom.31686;Job;mom_process_request;request type SignalJob from host > cheops10 allowed > 12/12/2012 02:49:50;0008; > pbs_mom.31686;Job;mom_dispatch_request;dispatching request SignalJob > on sd=8 > 12/12/2012 02:49:50;0008; pbs_mom.31686;Job;721711.cheops10;signaling > job with signal SIGKILL > 12/12/2012 02:49:50;0008; 
pbs_mom.31686;Job;kill_job;req_signaljob: > sending signal 9, "KILL" to job 721711.cheops10, reason: killing job > [...] > 12/12/2012 02:50:35;0080; pbs_mom.31686;Svr;preobit_reply;top of > preobit_reply > 12/12/2012 02:50:35;0080; > pbs_mom.31686;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr > worked, top of while loop > 12/12/2012 02:50:35;0001; > pbs_mom.31686;Job;721711.cheops10;preobit_reply, unknown on server, > deleting locally > [pbs_mom snap] > > Ideas? Fixes? > > Regards, > Lech Nieroda > > > > In case it helps to analyze the bug > > On 26.11.2012 14:02, Lech Nieroda wrote: > > Dear list, > > > > I've raised the loglevel on one of the clients to loglevel=7 in > order to > > collect more information on the event. > > > > maui just assumes a successful completion of the job. > > > > pbs_server log regarding the job 686938[4] that suddenly becomes > unknown > > at 00:15:21: > > [snip] > > /var/spool/torque/server_logs/20121125:11/25/2012 > > 10:52:15;0008;PBS_Server.12742;Job;686938[4].cheops10;Job Run at > request > > of maui at localhost.localdomain > > /var/spool/torque/server_logs/20121125:11/25/2012 > > 10:52:15;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate: > > setting job 686938[4].cheops10 state from QUEUED-QUEUED to > > RUNNING-PRERUN (4-40) > > /var/spool/torque/server_logs/20121125:11/25/2012 > > 10:52:16;0002;PBS_Server.12742;Job;686938[4].cheops10;child reported > > success for job after 1 seconds (dest=???), rc=0 > > /var/spool/torque/server_logs/20121125:11/25/2012 > > 10:52:16;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate: > > setting job 686938[4].cheops10 state from RUNNING-TRNOUTCM to > > RUNNING-RUNNING (4-42) > > /var/spool/torque/server_logs/20121125:11/25/2012 > > 10:52:16;000d;PBS_Server.12742;Job;686938[4].cheops10;Not sending > email: > > User does not want mail of this type. 
> > /var/spool/torque/server_logs/20121125:11/25/2012 > > 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate: > > setting job 686938[4].cheops10 state from RUNNING-RUNNING to > > QUEUED-SUBSTATE55 (1-55) > > /var/spool/torque/server_logs/20121126:11/26/2012 > > 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate: > > setting job 686938[4].cheops10 state from QUEUED-SUBSTATE55 to > > EXITING-SUBSTATE55 (5-55) > > /var/spool/torque/server_logs/20121126:11/26/2012 > > 00:15:21;0100;PBS_Server.12702;Job;686938[4].cheops10;dequeuing from > > smp, state EXITING > > /var/spool/torque/server_logs/20121126:11/26/2012 > > > 00:16:02;0001;PBS_Server.12640;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, > stray > > job 686938[4].cheops10 found on cheops11801 > > [snap] > > > > pbs_mom log at loglevel 7, shortly before the kill: > > [snip] > > 11/26/2012 00:14:26;0008; pbs_mom.26562;Req;send_sisters;sending > > command POLL_JOB for job 686938[4].cheops10 (7) > > 11/26/2012 00:15:02;0002; > > pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: > sending > > to server "jobs=686659[1].cheops10 686694[8].cheops10 > > 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10 > > 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10 > > 686941[130].cheops10 686941[132].cheops10 686941[133].cheops10 > > 686941[150].cheops10 686941[151].cheops10 686941[153].cheops10 > > 686941[186].cheops10 686941[187].cheops10 686941[188].cheops10 > > 686941[212].cheops10 687103[1].cheops10 687224.cheops10 > > 687252[1].cheops10 687279[1].cheops10 687279[2].cheops10 > > 687279[3].cheops10 687334[1].cheops10 687334[2].cheops10 > > 687367[1].cheops10 687367[2].cheops10 687366.cheops10" > > 11/26/2012 00:15:11;0080; pbs_mom.26562;n/a;cput_sum;proc_array > loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:15:11;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:15:11;0080; 
pbs_mom.26562;n/a;resi_sum;proc_array > loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:15:11;0008; pbs_mom.26562;Req;send_sisters;sending > > command POLL_JOB for job 686938[4].cheops10 (7) > > 11/26/2012 00:15:56;0080; pbs_mom.26562;n/a;cput_sum;proc_array > loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:15:56;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:15:56;0080; pbs_mom.26562;n/a;resi_sum;proc_array > loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:15:56;0008; pbs_mom.26562;Req;send_sisters;sending > > command POLL_JOB for job 686938[4].cheops10 (7) > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;cput_sum;proc_array > loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;resi_sum;proc_array > loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:00;0008; > > pbs_mom.26562;Job;686938[4].cheops10;scan_for_exiting:job is in > > non-exiting substate RUNNING, no obit sent at this time > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;cput_sum;proc_array > loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;resi_sum;proc_array > loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;cput_sum;proc_array > loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;mem_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;resi_sum;proc_array > loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;cput_sum;proc_array > loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:01;0080; 
pbs_mom.26562;n/a;mem_sum;proc_array loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;resi_sum;proc_array > loop > > start - jobid = 686938[4].cheops10 > > 11/26/2012 00:16:01;0080; > > pbs_mom.26562;Job;686938[4].cheops10;checking job w/subtask pid=0 > (child > > pid=4214) > > 11/26/2012 00:16:02;0002; > > pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: > sending > > to server "jobs=686659[1].cheops10 686694[8].cheops10 > > 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10 > > 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10 > > 686941[130].cheops10 686941[132].cheops10 686941[150].cheops10 > > 686941[151].cheops10 686941[153].cheops10 686941[186].cheops10 > > 686941[187].cheops10 686941[188].cheops10 686941[212].cheops10 > > 687103[1].cheops10 687224.cheops10 687252[1].cheops10 > 687279[1].cheops10 > > 687279[2].cheops10 687279[3].cheops10 687334[1].cheops10 > > 687334[2].cheops10 687367[1].cheops10 687367[2].cheops10 > 687366.cheops10" > > 11/26/2012 00:16:02;0008; > > pbs_mom.26562;Job;686938[4].cheops10;signaling job with signal > SIGKILL > > 11/26/2012 00:16:02;0008; pbs_mom.26562;Job;kill_job;req_signaljob: > > sending signal 9, "KILL" to job 686938[4].cheops10, reason: > killing job > > [snap] > > > > As we can see, the pbs_mom sends complete job lists to the pbs_server > > right before (00:15:02) and after (00:16:02) the SUBSTATE55 setting > > (00:15:21), where the job 686938[4].cheops10 is included. > However, the > > server claims that the job is unknown/doesn't exist and sends a kill > > command which the pbs_mom then executes (0:16:02). > > > > I hope this helps in tracking down the bug. > > > > Regards, > > Lech Nieroda > > > > > > On 22.11.2012 16:46, Lech Nieroda wrote: > >> Dear list, > >> > >> we have another serious problem since our upgrade to Torque > 4.1.3. We > >> are using it with Maui 3.3.1. 
The problem in a nutshell: some few, > >> random jobs are suddenly "unknown" to the server, it changes their > >> status to EXITING-SUBSTATE55 and requests a silent kill on the > compute > >> nodes. The job then dies, the processes are killed on the node, > there is > >> no "Exit_status" in the server-log, no entry in maui/stats, no > >> stdout/stderr files. The users are, understandably, not amused. > >> > >> It doesn't seem to be user or application specific. Even a single > >> instance from a job array can get blown away in this way while > all other > >> instances end normally. > >> > >> Here are some logs of such a job (681684[35]): > >> > >> maui just assumes a successful completion: > >> [snip] > >> 11/21 19:24:49 > MPBSJobUpdate(681684[35],681684[35].cheops10,TaskList,0) > >> 11/21 19:24:49 INFO: Average nodespeed for Job 681684[35] is > 1.000000, > >> 1.000000, 1 > >> 11/21 19:25:55 INFO: active PBS job 681684[35] has been > removed from > >> the queue. assuming successful completion > >> 11/21 19:25:55 MJobProcessCompleted(681684[35]) > >> 11/21 19:25:55 INFO: job '681684[35]' completed X: 0.063356 T: > >> 10903 PS: 10903 A: 0.063096 > >> 11/21 19:25:55 MJobSendFB(681684[35]) > >> 11/21 19:25:55 INFO: job usage sent for job '681684[35]' > >> 11/21 19:25:55 MJobRemove(681684[35]) > >> 11/21 19:25:55 MJobDestroy(681684[35]) > >> [snap] > >> > >> pbs_server decides at 19:25:11, after 3 hours runtime, that the > job is > >> unknown (grepped by JobID from the server logs): > >> [snip] > >> 11/21/2012 > >> 16:23:43;0008;PBS_Server.26038;Job;svr_setjobstate;svr_setjobstate: > >> setting job 681684[35].cheops10 state from RUNNING-TRNOUTCM to > >> RUNNING-RUNNING (4-42) > >> 11/21/2012 > >> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: > >> setting job 681684[35].cheops10 state from RUNNING-RUNNING to > >> QUEUED-SUBSTATE55 (1-55) > >> 11/21/2012 > >> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: > >> setting job 
681684[35].cheops10 state from QUEUED-SUBSTATE55 to > >> EXITING-SUBSTATE55 (5-55) > >> 11/21/2012 > >> 19:25:11;0100;PBS_Server.26097;Job;681684[35].cheops10;dequeuing > from > >> smp, state EXITING > >> 11/21/2012 > >> > 19:25:14;0001;PBS_Server.26122;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, > stray > >> job 681684[35].cheops10 found on cheops21316 > >> [snap] > >> > >> pbs_client just kills the processes: > >> [snip] > >> 11/21/2012 16:23:43;0001; pbs_mom.32254;Job;TMomFinalizeJob3;job > >> 681684[35].cheops10 started, pid = 17452 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid > 17452 task > >> 1 gracefully with sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process > >> (pid=17452/state=R) after sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process > >> (pid=17452/state=Z) after sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid > 17692 task > >> 1 gracefully with sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process > >> (pid=17692/state=R) after sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid > 17703 task > >> 1 gracefully with sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process > >> (pid=17703/state=R) after sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid > 17731 task > >> 1 gracefully with sig 15 > >> 11/21/2012 19:25:14;0008; > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process > >> (pid=17731/state=R) after sig 15 > >> 11/21/2012 19:25:15;0080; > >> pbs_mom.32254;Job;681684[35].cheops10;scan_for_terminated: job > >> 681684[35].cheops10 task 1 terminated, sid=17452 > >> 11/21/2012 19:25:15;0008; > pbs_mom.32254;Job;681684[35].cheops10;job > >> was 
terminated > >> 11/21/2012 19:25:50;0001; > >> pbs_mom.32254;Job;681684[35].cheops10;preobit_reply, unknown on > server, > >> deleting locally > >> 11/21/2012 19:25:50;0080; > >> pbs_mom.32254;Job;681684[35].cheops10;removed job script > >> [snap] > >> > >> Sometimes, the pbs_mom logs include this message before the > killing starts: > >> [snip] > >> Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, > >> type=StatusJob, from PBS_Server at cheops10 > >> [snap] > >> > >> And finally, some job informations given to epilogue: > >> [snip] > >> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared: > >> 681684[35].cheops10,hthiele0,cheops21316,Starting shared epilogue > >> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared: > >> 681684[35].cheops10,hthiele0,cheops21316,Job Information: > >> userid=hthiele0, > >> > resourcelist='mem=5gb,ncpus=1,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=48:00:00', > >> > resourcesused='cput=03:00:46,mem=945160kb,vmem=1368548kb,walltime=03:01:34', > >> queue=smp, account=ccg-ngs, exitcode=271 > >> [snap] > >> > >> This happens rarely (about 1 in 3000). However, silent deletions of > >> random jobs aren't exactly a trifling matter. > >> I could try to disable the mom_job_sync option, which could perhaps > >> prevent the process killing of unknown jobs, but it would also leave > >> corrupt/pre-execution jobs alive. > >> > >> Can this be fixed? > >> > >> On a side-note, here are some further, minor Bugs I've noticed > in the > >> Torque 4.1.3. Version: > >> - the epilogue script is usually invoked twice and sometimes even > >> several times > >> - when explicit node lists are used, e.g. > nodes=node1:ppn=2+node2:ppn=2, > >> then the number of "tasks" as seen by qstat is zero > >> - there have been some API changes between Torque 2.x and Torque > 4.x, so > >> that two maui calls had to be altered in order to build against > Torque > >> 4.x (get_svrport, openrm). 
> >>
> >>
> >> Regards,
> >> Lech Nieroda
> >>
> >
> >
> >
> --
> Dipl.-Wirt.-Inf. Lech Nieroda
> Regionales Rechenzentrum der Universität zu Köln (RRZK)
> Universität zu Köln
> Weyertal 121
> Raum 309 (3. Etage)
> D-50931 Köln
> Deutschland
>
> Tel.: +49 (221) 470-89606
> E-Mail: nieroda.lech at uni-koeln.de
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
Dipl.-Wirt.-Inf. Lech Nieroda
Regionales Rechenzentrum der Universität zu Köln (RRZK)
Universität zu Köln
Weyertal 121
Raum 309 (3. Etage)
D-50931 Köln
Deutschland

Tel.: +49 (221) 470-89606
E-Mail: nieroda.lech at uni-koeln.de

From raphael.leplae at ulb.ac.be Thu Dec 13 02:22:54 2012
From: raphael.leplae at ulb.ac.be (Raphael Leplae)
Date: Thu, 13 Dec 2012 10:22:54 +0100
Subject: [torqueusers] Problems after upgrading Torque
Message-ID: <50C99E6E.2060106@ulb.ac.be>

Dear all,

We have upgraded Torque from 2.5.4 to 4.2.0. The scheduler is Moab 6.0.
Among the various problems we have encountered with the upgrade, some
persist, and we can find neither their cause nor a possible solution.
Among them:

1) Bursts of error messages in the logs

In the Torque log files, we get regular bursts of the following message:

12/13/2012 09:30:07;0080;PBS_Server.18604;Req;req_reject;Reject reply code=15021(Invalid credential), aux=0, type=AuthenticateUser, from root@

being the host name where pbs_server and moab are running.
The message is repeated ~140 times and the burst recurs every 30 min.
It is also systematically present in the new log file started after log
rotation (midnight).
There are occasional similar error messages, but there the 'from' refers to
users on compute/login nodes instead of root on the host running pbs_server.

Note: with the upgrade, we found in the documentation that it was necessary
to run trqauthd. It is running on all the nodes: the control node with
pbs_server, the compute nodes (jobs can submit jobs) and the login nodes.
However, it regularly crashes on the compute nodes (seemingly at random so
far). Is there a way to get a log out of trqauthd?

2) Odd job reporting from the compute nodes

Since the upgrade, we observe the following odd 'qstat -n -1 -r' output (a
sample of the output):

845605.xxx.yyy smp1 my_method_with_c -- 1 1 2gb 3000:00: R -- nic95/0
853677.xxx.yyy mpi I0z028a074 -- 5 20 4gb 48:00:00 R -- nic62/7+nic62/6+nic62/5+nic62/4+nic53/3+nic53/2+nic53/1+nic53/0+nic55/15+nic55/14+nic55/13+nic55/4+nic56/15+nic56/14+nic56/13+nic56/4+nic50/3+nic50/2+nic50/1+nic50/0
853809.xxx.yyy mpi I0z036a094 20579 5 20 4gb 48:00:00 R 00:27:41 nic62/3+nic62/2+nic62/1+nic62/0+nic55/3+nic55/2+nic55/1+nic55/0+nic56/3+nic56/2+nic56/1+nic56/0+nic59/15+nic59/14+nic59/13+nic59/12+nic54/15+nic54/14+nic54/13+nic54/4

Note: the hostname attached to the job ID and the username are masked.

Essentially, we get the following combinations of information for the
running jobs (checked and properly running on the nodes):
- Expected output: SessID, resources asked/used, requested walltime and elapsed time.
- Only SessID is missing; a dash is given.
- SessID and elapsed time are missing; a dash is given.
- SessID, elapsed time and requested/used resources are missing; a dash is given.

There is no correlation with the user or the node. We can even have two jobs
of the same user, on the same compute node, with two different versions of
the odd information described in the list above.

Solving these first 2 (maybe related) issues would be a start.
Any suggestion/help is welcome.
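
To put numbers on the bursts described under 1), the rejects can be tallied per minute. The following is only a minimal sketch: it assumes the "MM/DD/YYYY HH:MM:SS;..." server log format quoted in this message, and count_rejects_per_minute is a hypothetical helper name, not a Torque tool.

```shell
#!/bin/bash
# Hypothetical helper (not part of Torque): read a pbs_server log on stdin
# and print "<count> <MM/DD/YYYY HH:MM>" for every minute that contains
# at least one "Reject reply code=15021" line.
count_rejects_per_minute() {
    awk -F';' '
        /Reject reply code=15021/ {
            # $1 is "MM/DD/YYYY HH:MM:SS"; keep the first 16 characters,
            # i.e. the timestamp truncated to the minute.
            n[substr($1, 1, 16)]++
        }
        END { for (m in n) print n[m], m }
    ' | sort -k2    # order by timestamp (good enough within one log file)
}
```

It would be fed one server log at a time, for example `count_rejects_per_minute < server_logs/20121213` (the file name is an assumption based on the date-named logs seen elsewhere in this thread). Comparing the minutes that show up with the crontab and scheduler intervals on the pbs_server host may help narrow down what triggers each burst.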
Cheers

From brockp at umich.edu Thu Dec 13 14:04:09 2012
From: brockp at umich.edu (Brock Palen)
Date: Thu, 13 Dec 2012 16:04:09 -0500
Subject: [torqueusers] tm aware ssh
Message-ID:

I have a few applications that spawn using ssh and don't support tm.

There was once on the list a 'pbsssh' that wrapped pbsdsh to act like ssh.

It looks like that script no longer works, and I am scratching my head as to
getting it working again (my bash fu is weak).

Does anyone already have a way they do this?

The hope is to get correct process reporting and cleanup for applications
that don't support TM but spawn with ssh/rsh.

Thanks!

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
brockp at umich.edu
(734)936-1985

From bdandrus at nps.edu Thu Dec 13 14:29:41 2012
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Thu, 13 Dec 2012 21:29:41 +0000
Subject: [torqueusers] hwloc error torque 4.1.3
In-Reply-To: <20121210213613.GN8827@lbl.gov>
References: <20121210192541.GM8827@lbl.gov> <20121210213613.GN8827@lbl.gov>
Message-ID:

It seems fine:

[root at build SPECS]# pkg-config --atleast-version 1.1 hwloc && echo "Yep." || echo "Nope."
Yep.

Yet:
./configure --enable-cpuset

Dies with the same message:

checking for HWLOC... configure: error: cpuset support requires the hwloc package

This can be solved by configuring with --with-hwloc-path=. This path
should be the path to the directory containing the lib/ and include/ directories
for your version of hwloc.
Another option is adding the directory containing 'hwloc.pc'
to the PKG_CONFIG_PATH environment variable.

If you have done these and still get this error, try running ./autogen.sh and
then configuring again.

Same if I do:
./configure --enable-cpuset --with-hwloc-path=/usr/


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238


> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Michael Jennings
> Sent: Monday, December 10, 2012 1:36 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] hwloc error torque 4.1.3
>
> On Monday, 10 December 2012, at 19:37:18 (+0000), Andrus, Brian Contractor
> wrote:
>
> > Same thing in config.log:
> > -------------------
> > configure:22377: checking for HWLOC
> > configure:22476: error: cpuset support requires the hwloc package
>
> It's a pretty straight-forward test; either pkgconfig finds hwloc >= 1.1, or it
> doesn't.
>
> For me, installing hwloc-devel.x86_64 worked just fine. I, too, am using a
> rebuild of RHEL 6.3 (Scientific Linux).
>
> What does this command tell you?
>
> pkg-config --atleast-version 1.1 hwloc && echo "Yep." || echo "Nope."
>
>
> Michael
>
> --
> Michael Jennings
> Senior HPC Systems Engineer
> High-Performance Computing Services
> Lawrence Berkeley National Laboratory
> Bldg 50B-3209E W: 510-495-2687
> MS 050B-3209 F: 510-486-8615
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From dbeer at adaptivecomputing.com Thu Dec 13 14:50:33 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Thu, 13 Dec 2012 14:50:33 -0700
Subject: [torqueusers] hwloc error torque 4.1.3
In-Reply-To:
References: <20121210192541.GM8827@lbl.gov> <20121210213613.GN8827@lbl.gov>
Message-ID:

Have you made sure you are specifying --with-hwloc-path correctly? /usr/
seems unlikely to be correct. Note the comment that configure adds:

Specifies the path to the hwloc libraries and include files.
Example: ./configure --with-hwloc-path=/usr/local/hwloc-1.1
Will specify that the include files are in /usr/local/hwloc-1.1/include and
the libraries are in /usr/local/hwloc-1.1/lib],

So specifying /usr would mean that /usr/include contains the hwloc headers
and /usr/lib/ contains the library.

David

On Thu, Dec 13, 2012 at 2:29 PM, Andrus, Brian Contractor wrote:

> It seems fine:
>
> [root at build SPECS]# pkg-config --atleast-version 1.1 hwloc && echo "Yep." || echo "Nope."
> Yep.
>
> Yet:
> ./configure --enable-cpuset
>
> Dies with the same message:
>
> checking for HWLOC... configure: error: cpuset support requires the hwloc package
>
> This can be solved by configuring with --with-hwloc-path=. This path
> should be the path to the directory containing the lib/ and include/ directories
> for your version of hwloc.
> Another option is adding the directory containing 'hwloc.pc'
> to the PKG_CONFIG_PATH environment variable.
>
> If you have done these and still get this error, try running ./autogen.sh and
> then configuring again.
>
> Same if I do:
> ./configure --enable-cpuset --with-hwloc-path=/usr/
>
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
>
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Michael Jennings
> > Sent: Monday, December 10, 2012 1:36 PM
> > To: Torque Users Mailing List
> > Subject: Re: [torqueusers] hwloc error torque 4.1.3
> >
> > On Monday, 10 December 2012, at 19:37:18 (+0000), Andrus, Brian Contractor
> > wrote:
> >
> > > Same thing in config.log:
> > > -------------------
> > > configure:22377: checking for HWLOC
> > > configure:22476: error: cpuset support requires the hwloc package
> >
> > It's a pretty straight-forward test; either pkgconfig finds hwloc >= 1.1, or it
> > doesn't.
> >
> > For me, installing hwloc-devel.x86_64 worked just fine. I, too, am using a
> > rebuild of RHEL 6.3 (Scientific Linux).
> >
> > What does this command tell you?
> >
> > pkg-config --atleast-version 1.1 hwloc && echo "Yep." || echo "Nope."
> >
> >
> > Michael
> >
> > --
> > Michael Jennings
> > Senior HPC Systems Engineer
> > High-Performance Computing Services
> > Lawrence Berkeley National Laboratory
> > Bldg 50B-3209E W: 510-495-2687
> > MS 050B-3209 F: 510-486-8615
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
David Beer | Senior Software Engineer
Adaptive Computing

From mej at lbl.gov Thu Dec 13 15:13:58 2012
From: mej at lbl.gov (Michael Jennings)
Date: Thu, 13 Dec 2012 14:13:58 -0800
Subject: [torqueusers] hwloc error torque 4.1.3
In-Reply-To:
Message-ID: <20121213221357.GF8827@lbl.gov>

On Thursday, 13 December 2012, at 21:29:41 (+0000), Andrus, Brian Contractor wrote:

> It seems fine:
>
> [root at build SPECS]# pkg-config --atleast-version 1.1 hwloc && echo "Yep." || echo "Nope."
> Yep.

Then pkgconfig appears to know you have it, which means the only thing
left I can think of is the macro. Try this:

$ make maintainer-clean
$ rm -f aclocal.m4
$ ./configure --enable-cpuset

See what that does.

On Thursday, 13 December 2012, at 14:50:33 (-0700), David Beer wrote:

> Have you made sure you are specifying --with-hwloc-path correctly? /usr/
> seems unlikely to be correct. Note the comment that configure adds:
>
> Specifies the path to the hwloc libraries and include files.
> Example: ./configure --with-hwloc-path=/usr/local/hwloc-1.1
> Will specify that the include files are in /usr/local/hwloc-1.1/include and
> the libraries are in /usr/local/hwloc-1.1/lib],
>
> So specifying /usr would mean that
>
> /usr/include contains the hwloc headers and /usr/lib/ contains the library.

This is correct for RHEL-based systems with hwloc installed from RPM.
It works fine on my SL6 box with --with-hwloc-path=/usr (though it
also works without that).

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615

From bdandrus at nps.edu Thu Dec 13 15:30:43 2012
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Thu, 13 Dec 2012 22:30:43 +0000
Subject: [torqueusers] hwloc error torque 4.1.3
In-Reply-To:
References: <20121210192541.GM8827@lbl.gov> <20121210213613.GN8827@lbl.gov>
Message-ID:

Yep, the default locations for the hwloc libs are /usr/include and /usr/lib and /usr/lib64.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Thursday, December 13, 2012 1:51 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] hwloc error torque 4.1.3

Have you made sure you are specifying --with-hwloc-path correctly? /usr/
seems unlikely to be correct. Note the comment that configure adds:

Specifies the path to the hwloc libraries and include files.
Example: ./configure --with-hwloc-path=/usr/local/hwloc-1.1
Will specify that the include files are in /usr/local/hwloc-1.1/include and
the libraries are in /usr/local/hwloc-1.1/lib],

So specifying /usr would mean that /usr/include contains the hwloc headers
and /usr/lib/ contains the library.
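
Since the help text speaks of <prefix>/lib while the reply above also mentions /usr/lib64, it may be worth checking what a candidate prefix really contains before passing it to --with-hwloc-path. A minimal sketch follows, assuming the header is installed as hwloc.h and the library files are named libhwloc.*; check_hwloc_prefix is a hypothetical helper, not part of TORQUE or hwloc, and whether configure itself looks in lib64 is not stated in this thread.

```shell
#!/bin/bash
# Hypothetical helper: report whether a candidate --with-hwloc-path prefix
# has the <prefix>/include and <prefix>/lib layout that the configure help
# text describes. lib64 is listed too, purely for information.
check_hwloc_prefix() {
    local prefix=${1%/}    # drop one trailing slash: "/usr/" becomes "/usr"
    local dir

    if [ -f "$prefix/include/hwloc.h" ]; then
        echo "header:  $prefix/include/hwloc.h"
    else
        echo "header:  missing under $prefix/include"
    fi

    for dir in lib lib64; do
        # ls fails when the glob matches nothing or the directory is absent
        if ls "$prefix/$dir"/libhwloc.* >/dev/null 2>&1; then
            echo "library: found in $prefix/$dir"
        else
            echo "library: none in $prefix/$dir"
        fi
    done
}
```

Running `check_hwloc_prefix /usr` on the build host would show whether the library sits only under lib64, which is one possible (unconfirmed) reason for a path-based check to fail even though pkg-config reports success.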
David On Thu, Dec 13, 2012 at 2:29 PM, Andrus, Brian Contractor > wrote: It seems fine: [root at build SPECS]# pkg-config --atleast-version 1.1 hwloc && echo "Yep." || echo "Nope." Yep. Yet: ./configure --enable-cpuset Dies with the same message: checking for HWLOC... configure: error: cpuset support requires the hwloc package This can be solved by configuring with --with-hwloc-path=. This path should be the path to the directory containing the lib/ and include/ directories for your version of hwloc. Another option is adding the directory containing 'hwloc.pc' to the PKG_CONFIG_PATH environment variable. If you have done these and still get this error, try running ./autogen.sh and then configuring again. Same if I do: ./configure --enable-cpuset --with-hwloc-path=/usr/ Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 > -----Original Message----- > From: torqueusers-bounces at supercluster.org [mailto:torqueusers- > bounces at supercluster.org] On Behalf Of Michael Jennings > Sent: Monday, December 10, 2012 1:36 PM > To: Torque Users Mailing List > Subject: Re: [torqueusers] hwloc error torque 4.1.3 > > On Monday, 10 December 2012, at 19:37:18 (+0000), Andrus, Brian Contractor > wrote: > > > Same thing in config.log: > > ------------------- > > configure:22377: checking for HWLOC > > configure:22476: error: cpuset support requires the hwloc package > > It's a pretty straight-forward test; either pkgconfig finds hwloc >= 1.1, or it > doesn't. > > For me, installing hwloc-devel.x86_64 worked just fine. I, too, am using a > rebuild of RHEL 6.3 (Scientific Linux). > > What does this command tell you? > > pkg-config --atleast-version 1.1 hwloc && echo "Yep." || echo "Nope." 
> > > Michael > > -- > Michael Jennings > > Senior HPC Systems Engineer > High-Performance Computing Services > Lawrence Berkeley National Laboratory > Bldg 50B-3209E W: 510-495-2687 > MS 050B-3209 F: 510-486-8615 > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -- David Beer | Senior Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121213/3258e3f5/attachment-0001.html From Gareth.Williams at csiro.au Thu Dec 13 16:18:47 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Fri, 14 Dec 2012 10:18:47 +1100 Subject: [torqueusers] tm aware ssh In-Reply-To: References: Message-ID: <007DECE986B47F4EABF823C1FBB19C620123B93B2121@exvic-mbx04.nexus.csiro.au> > I have a few applications that spawn using ssh and don't support tm, > > There was once on the list a 'pbsssh' that wrapped pbsdsh to act like > ssh, > > It looks like that script no longer works and I am scratching my head > as to getting it working again (my bash fu is weak). > > Does anyone already have a way they do this? > > The hope is to get correct process reporting, and cleanup for > applications that don't support TM but spawn with ssh/rsh. > > > Thanks! > > Brock Palen Hi Brock, We have this (below) in place. You might need to swallow more ssh options if they are present - and there is no checking that "node" actually gets set to a cluster node name. Specifying a user would break this... but you would not expect that in a cluster inside a batch job, right? 

Gareth

wil240 at burnet-login:~> cat /apps/ascutils/bin/pbsssh
#!/bin/bash
# $Id: pbsssh 2236 2012-05-02 03:16:17Z wil240 $
# $HeadURL: svn+ssh://stream/cs/home/svn/sysadmin/ascutils/common/pbsssh $
usage="usage: $0 "
#swallow -x -n and -q (for intel mpi)
while getopts "xqn" opt
do
  :
done
shift $((OPTIND-1))
if [ $# -lt 2 ]
then
  echo $usage
  exit
fi
node=$1
shift
exec pbsdsh -o -h $node "$@"

From dbeer at adaptivecomputing.com Thu Dec 13 17:40:19 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Thu, 13 Dec 2012 17:40:19 -0700
Subject: [torqueusers] Running Jobs suddenly "Unknown" and killed on Torque 4.1.3
In-Reply-To: <50C9D57F.7010007@uni-koeln.de>
References: <50AE48F0.1070807@uni-koeln.de> <50B36868.40603@uni-koeln.de> <50C8B793.5090107@uni-koeln.de> <50C9D57F.7010007@uni-koeln.de>
Message-ID:

Actually Lech, do you think you could just email in the logs for the mother
superior and the server for one of these jobs for the day that it gets
deleted? If it is too big to email, I'm sending you guest credentials for
the scp server separately.

David

On Thu, Dec 13, 2012 at 6:17 AM, Lech Nieroda wrote:

> Dear David,
>
> thanks - let's hope that this error will be fixed.
>
> As to the jobs that are getting killed, there doesn't seem to be any
> discernible pattern, e.g.
take a look at the last 5 killed jobs: > > 724272: killed after 43min runtime, single node program with 8 threads > (according to cputime, ca 4% serial execution) > 724650[13]: killed after 148min runtime, single node jobarray instance > with a single core serial program (40 instances total) > 724843[32]: killed after 12min runtime, single node jobarray instance > with a single core serial program (42 instances total) > 723747: killed after 909min runtime, single node program with 12 threads > (according to cputime, ca 14% serial execution) > 724044: killed after 667min runtime, mpi program with 6 nodes, 12 cores > each (according to cputime, ca 83% serial execution) > > There are single node jobs with good/average/minimal thread > parallelization, classic serial jobs, various mpi jobs, instances of > jobarrays. They can share a node or use it exclusively. They can use > different parallel filesystems, like Panasas or Lustre. They are > submitted by different users. They are killed after varying amounts of > elapsed runtime - it can happen after a few minutes, hours or days. > There doesn't seem to be a common denominator for these aborted jobs. > > I've taken a look into the mom_logs on loglevel10, as well as the > server_logs on loglevel7. > > On the server side, when looking at the thread which performs the > state-change to "QUEUED-SUBSTATE55", the previous message mentions > "locking complete" on the node on which the job runs, e.g. 
(the job is > 699693[33]): > > [snip] > 11/30/2012 19:03:24;0080;PBS_Server.28147;node;lock_node;locking > complete cheops21314 in method next_node > 11/30/2012 > 19:03:24;0008;PBS_Server.28147;Job;svr_setjobstate;svr_setjobstate: > setting job 699693[33].cheops10 state from RUNNING-RUNNING to > QUEUED-SUBSTATE55 (1-55) > [snap] > > When grepping the server_logs with the executing host > (cheops21314/172.18.3.132), there is a "svr_is_request" message which > comes from the executing host before the state-change and later on a few > "lockings": > [snip] > 52243:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking > complete cheops21314 in method svr_is_request > 52244:11/30/2012 > 19:03:24;0004;PBS_Server.28116;Svr;svr_is_request;message STATUS (4) > received from mom on host cheops21314 (172.18.3.132:273) (sock 13) > 52245:11/30/2012 > 19:03:24;0004;PBS_Server.28116;Svr;svr_is_request;IS_STATUS received > from cheops21314 > 52246:11/30/2012 19:03:24;0040;PBS_Server.28116;Req;is_stat_get;received > status from node cheops21314 > 52247:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking > complete cheops21314 in method find_nodebyname > 52248:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking > complete cheops21314 in method find_nodebyname > 52276:11/30/2012 > 19:03:24;0040;PBS_Server.28116;Req;update_node_state;adjusting state for > node cheops21314 - state=0, newstate=0 > 52277:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking > complete cheops21314 in method find_nodebyname > 52310:11/30/2012 19:03:24;0080;PBS_Server.28137;node;lock_node;locking > complete cheops21314 in method find_nodebyname > 52635:11/30/2012 19:03:24;0080;PBS_Server.28096;node;lock_node;locking > complete cheops21314 in method next_node > 54181:11/30/2012 19:03:24;0080;PBS_Server.28113;node;lock_node;locking > complete cheops21314 in method next_node > 54596:11/30/2012 19:03:24;0080;PBS_Server.28057;node;lock_node;locking > complete 
cheops21314 in method next_node > 54967:11/30/2012 19:03:24;0080;PBS_Server.28031;node;lock_node;locking > complete cheops21314 in method next_node > 55401:11/30/2012 19:03:24;0080;PBS_Server.28007;node;lock_node;locking > complete cheops21314 in method next_node > 56029:11/30/2012 19:03:24;0080;PBS_Server.28147;node;lock_node;locking > complete cheops21314 in method next_node > 56058:11/30/2012 19:03:24;0080;PBS_Server.28035;node;lock_node;locking > complete cheops21314 in method next_node > 56570:11/30/2012 19:03:24;0080;PBS_Server.28141;node;lock_node;locking > complete cheops21314 in method next_node > 56760:11/30/2012 19:03:24;0080;PBS_Server.28147;node;lock_node;locking > complete cheops21314 in method find_nodebyname > 56761:11/30/2012 > 19:03:24;0040;PBS_Server.28147;Req;remove_job_from_node;freeing node > cheops21314/3 from job 699693[33].cheops10 (nsnfree=1) > [snap] > > There is no obvious reference to a Job polling request - what should I > be looking for? > > The client of this particular job has been redeployed and the logs are > no longer available. However, here is a loglevel10 output of job 724272, > grepped with the jobid: > > [snip] > 12/13/2012 00:00:27;0008; pbs_mom.27123;Req;send_sisters;sending > command POLL_JOB for job 724272.cheops10 (7) > 12/13/2012 00:00:59;0002; > pbs_mom.27123;n/a;mom_server_update_stat;mom_server_update_stat: sending > to server "jobs=717556.cheops10 724272.cheops10 724285.cheops10" > 12/13/2012 00:00:59;0008; pbs_mom.27123;Job;724272.cheops10;signaling > job with signal SIGKILL > 12/13/2012 00:00:59;0008; pbs_mom.27123;Job;kill_job;req_signaljob: > sending signal 9, "KILL" to job 724272.cheops10, reason: killing job > [snap] > > There is a POLL_JOB request named "send_sisters" (strange, since it is a > single node job) and the pbs_mom sends a mom_server_update_stat with a > joblist including the job 724272. 
Right afterwards comes the SignalJob > to Kill, here the entire logsnippet between the mom_server_update_stat > and the kill: > > [snip] > 12/13/2012 00:00:59;0008; pbs_mom.27123;Job;read_tcp_reply;protocol: 4 > version: 3 command:4 sock:8 > 12/13/2012 00:00:59;0002; > pbs_mom.27123;n/a;mom_server_update_stat;status update successfully sent > to cheops10 > 12/13/2012 00:00:59;0080; pbs_mom.27123;Req;dis_request_read;decoding > command SignalJob from PBS_Server > 12/13/2012 00:00:59;0100; pbs_mom.27123;Req;;Type SignalJob request > received from PBS_Server at cheops10, sock=8 > 12/13/2012 00:00:59;0008; > pbs_mom.27123;Job;mom_process_request;request type SignalJob from host > cheops10 received > 12/13/2012 00:00:59;0008; > pbs_mom.27123;Job;mom_process_request;request type SignalJob from host > cheops10 allowed > 12/13/2012 00:00:59;0008; > pbs_mom.27123;Job;mom_dispatch_request;dispatching request SignalJob on > sd=8 > 12/13/2012 00:00:59;0008; pbs_mom.27123;Job;724272.cheops10;signaling > job with signal SIGKILL > 12/13/2012 00:00:59;0008; pbs_mom.27123;Job;kill_job;req_signaljob: > sending signal 9, "KILL" to job 724272.cheops10, reason: killing job > [snap] > > What other info do you need? Are there specific messages to look for? > > Regards, > Lech > > > > On 12.12.2012 18:16, David Beer wrote: > > Lech, > > > > From your log snippet, it appears that the server sent a poll request > > to the mom (for the job) and the mom replied back that it didn't know > > about that job, getting the job cancelled on the server's side. I'm > > going to do a little more investigation to see if there's some way that > > the poll request to the mom can time out or otherwise fail to > > erroneously create this condition. If you can provide any more details > > about creating this scenario, that'd be great. 
Helpful information would > > be describing the lengths of jobs that are running, if they are parallel > > or serial, or estimate the ratio of parallel to serial, things like that > > might help us figure it out. > > > > David > > > > On Wed, Dec 12, 2012 at 9:57 AM, Lech Nieroda > > wrote: > > > > Dear list, > > > > the problem continues to persist: ca. 1 every 1000-2000 jobs is > killed. > > We've noticed that it almost disappears when the pbs_server loglevel > is > > raised to 7 (ca. one occurrence of the error per week). Considering > that > > higher loglevels simply generate more output and may thus prolong the > > execution of some tasks for a minuscule amount of time, this may turn > > out to be a race condition issue. > > We can't permanently set the loglevel to 7 though, because of two > > problems: the daily logfile size is ca. 30GB and pbs_server crashes > > every 2-3 days. > > > > The error: initially, the server sends an abort for no apparent > reason, > > however it mails the user "Job does not exist on node". Even though > the > > pbs_mom sends the correct joblist to the pbs_server, it receives a > > SIGKILL signal for the job (here with the jobid 721711, no > dependencies, > > etc) without an error message of its own: > > > > [pbs_server snip] > > 12/12/2012 02:49:46;0004;PBS_Server.27962;Svr;svr_connect;attempting > > connect to host 172.18.1.220 port 15002 > > 12/12/2012 > > 02:49:46;0008;PBS_Server.27962;Job;svr_setjobstate;svr_setjobstate: > > setting job 721711.cheops10 state from RUNNING-RUNNING to > > QUEUED-SUBSTATE55 (1-55) > > [...] 
> > 12/12/2012 > 02:49:46;000d;PBS_Server.27962;Job;721711.cheops10;preparing > > to send 'a' mail for job 721711.cheops10 to fernandl at cheops0 (Job > does > > not exist on node) > > [pbs_server snap] > > > > [pbs_mom snip] > > 12/12/2012 02:49:50;0002; > > pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: > sending > > to server "jobs=722456.cheops10 721711.cheops10 722457.cheops10 > > 722458.cheops10" > > 12/12/2012 02:49:50;0002; > > pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: > sending > > to server "varattr= " > > 12/12/2012 02:49:50;0008; > pbs_mom.31686;Job;read_tcp_reply;protocol: 4 > > version: 3 command:4 sock:8 > > 12/12/2012 02:49:50;0002; > > pbs_mom.31686;n/a;mom_server_update_stat;status update successfully > sent > > to cheops10 > > 12/12/2012 02:49:50;0080; > pbs_mom.31686;Req;dis_request_read;decoding > > command SignalJob from PBS_Server > > 12/12/2012 02:49:50;0100; pbs_mom.31686;Req;;Type SignalJob request > > received from PBS_Server at cheops10, sock=8 > > 12/12/2012 02:49:50;0008; > > pbs_mom.31686;Job;mom_process_request;request type SignalJob from > host > > cheops10 received > > 12/12/2012 02:49:50;0008; > > pbs_mom.31686;Job;mom_process_request;request type SignalJob from > host > > cheops10 allowed > > 12/12/2012 02:49:50;0008; > > pbs_mom.31686;Job;mom_dispatch_request;dispatching request SignalJob > > on sd=8 > > 12/12/2012 02:49:50;0008; > pbs_mom.31686;Job;721711.cheops10;signaling > > job with signal SIGKILL > > 12/12/2012 02:49:50;0008; pbs_mom.31686;Job;kill_job;req_signaljob: > > sending signal 9, "KILL" to job 721711.cheops10, reason: killing job > > [...] 
> > 12/12/2012 02:50:35;0080; pbs_mom.31686;Svr;preobit_reply;top of > > preobit_reply > > 12/12/2012 02:50:35;0080; > > pbs_mom.31686;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr > > worked, top of while loop > > 12/12/2012 02:50:35;0001; > > pbs_mom.31686;Job;721711.cheops10;preobit_reply, unknown on server, > > deleting locally > > [pbs_mom snap] > > > > Ideas? Fixes? > > > > Regards, > > Lech Nieroda > > > > > > > > In case it helps to analyze the bug > > > > On 26.11.2012 14:02, Lech Nieroda wrote: > > > Dear list, > > > > > > I've raised the loglevel on one of the clients to loglevel=7 in > > order to > > > collect more information on the event. > > > > > > maui just assumes a successful completion of the job. > > > > > > pbs_server log regarding the job 686938[4] that suddenly becomes > > unknown > > > at 00:15:21: > > > [snip] > > > /var/spool/torque/server_logs/20121125:11/25/2012 > > > 10:52:15;0008;PBS_Server.12742;Job;686938[4].cheops10;Job Run at > > request > > > of maui at localhost.localdomain > > > /var/spool/torque/server_logs/20121125:11/25/2012 > > > > 10:52:15;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate: > > > setting job 686938[4].cheops10 state from QUEUED-QUEUED to > > > RUNNING-PRERUN (4-40) > > > /var/spool/torque/server_logs/20121125:11/25/2012 > > > 10:52:16;0002;PBS_Server.12742;Job;686938[4].cheops10;child > reported > > > success for job after 1 seconds (dest=???), rc=0 > > > /var/spool/torque/server_logs/20121125:11/25/2012 > > > > 10:52:16;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate: > > > setting job 686938[4].cheops10 state from RUNNING-TRNOUTCM to > > > RUNNING-RUNNING (4-42) > > > /var/spool/torque/server_logs/20121125:11/25/2012 > > > 10:52:16;000d;PBS_Server.12742;Job;686938[4].cheops10;Not sending > > email: > > > User does not want mail of this type. 
> > > /var/spool/torque/server_logs/20121125:11/25/2012 > > > > 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate: > > > setting job 686938[4].cheops10 state from RUNNING-RUNNING to > > > QUEUED-SUBSTATE55 (1-55) > > > /var/spool/torque/server_logs/20121126:11/26/2012 > > > > 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate: > > > setting job 686938[4].cheops10 state from QUEUED-SUBSTATE55 to > > > EXITING-SUBSTATE55 (5-55) > > > /var/spool/torque/server_logs/20121126:11/26/2012 > > > 00:15:21;0100;PBS_Server.12702;Job;686938[4].cheops10;dequeuing > from > > > smp, state EXITING > > > /var/spool/torque/server_logs/20121126:11/26/2012 > > > > > > 00:16:02;0001;PBS_Server.12640;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, > > stray > > > job 686938[4].cheops10 found on cheops11801 > > > [snap] > > > > > > pbs_mom log at loglevel 7, shortly before the kill: > > > [snip] > > > 11/26/2012 00:14:26;0008; pbs_mom.26562;Req;send_sisters;sending > > > command POLL_JOB for job 686938[4].cheops10 (7) > > > 11/26/2012 00:15:02;0002; > > > pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: > > sending > > > to server "jobs=686659[1].cheops10 686694[8].cheops10 > > > 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10 > > > 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10 > > > 686941[130].cheops10 686941[132].cheops10 686941[133].cheops10 > > > 686941[150].cheops10 686941[151].cheops10 686941[153].cheops10 > > > 686941[186].cheops10 686941[187].cheops10 686941[188].cheops10 > > > 686941[212].cheops10 687103[1].cheops10 687224.cheops10 > > > 687252[1].cheops10 687279[1].cheops10 687279[2].cheops10 > > > 687279[3].cheops10 687334[1].cheops10 687334[2].cheops10 > > > 687367[1].cheops10 687367[2].cheops10 687366.cheops10" > > > 11/26/2012 00:15:11;0080; pbs_mom.26562;n/a;cput_sum;proc_array > > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:15:11;0080; pbs_mom.26562;n/a;mem_sum;proc_array > loop > > > 
start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:15:11;0080; pbs_mom.26562;n/a;resi_sum;proc_array > > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:15:11;0008; pbs_mom.26562;Req;send_sisters;sending > > > command POLL_JOB for job 686938[4].cheops10 (7) > > > 11/26/2012 00:15:56;0080; pbs_mom.26562;n/a;cput_sum;proc_array > > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:15:56;0080; pbs_mom.26562;n/a;mem_sum;proc_array > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:15:56;0080; pbs_mom.26562;n/a;resi_sum;proc_array > > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:15:56;0008; pbs_mom.26562;Req;send_sisters;sending > > > command POLL_JOB for job 686938[4].cheops10 (7) > > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;cput_sum;proc_array > > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;mem_sum;proc_array > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;resi_sum;proc_array > > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:16:00;0008; > > > pbs_mom.26562;Job;686938[4].cheops10;scan_for_exiting:job is in > > > non-exiting substate RUNNING, no obit sent at this time > > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;cput_sum;proc_array > > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;mem_sum;proc_array > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:16:00;0080; pbs_mom.26562;n/a;resi_sum;proc_array > > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;cput_sum;proc_array > > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;mem_sum;proc_array > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;resi_sum;proc_array > > loop > > > start - jobid = 686938[4].cheops10 > 
> > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;cput_sum;proc_array > > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;mem_sum;proc_array > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:16:01;0080; pbs_mom.26562;n/a;resi_sum;proc_array > > loop > > > start - jobid = 686938[4].cheops10 > > > 11/26/2012 00:16:01;0080; > > > pbs_mom.26562;Job;686938[4].cheops10;checking job w/subtask pid=0 > > (child > > > pid=4214) > > > 11/26/2012 00:16:02;0002; > > > pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: > > sending > > > to server "jobs=686659[1].cheops10 686694[8].cheops10 > > > 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10 > > > 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10 > > > 686941[130].cheops10 686941[132].cheops10 686941[150].cheops10 > > > 686941[151].cheops10 686941[153].cheops10 686941[186].cheops10 > > > 686941[187].cheops10 686941[188].cheops10 686941[212].cheops10 > > > 687103[1].cheops10 687224.cheops10 687252[1].cheops10 > > 687279[1].cheops10 > > > 687279[2].cheops10 687279[3].cheops10 687334[1].cheops10 > > > 687334[2].cheops10 687367[1].cheops10 687367[2].cheops10 > > 687366.cheops10" > > > 11/26/2012 00:16:02;0008; > > > pbs_mom.26562;Job;686938[4].cheops10;signaling job with signal > > SIGKILL > > > 11/26/2012 00:16:02;0008; > pbs_mom.26562;Job;kill_job;req_signaljob: > > > sending signal 9, "KILL" to job 686938[4].cheops10, reason: > > killing job > > > [snap] > > > > > > As we can see, the pbs_mom sends complete job lists to the > pbs_server > > > right before (00:15:02) and after (00:16:02) the SUBSTATE55 > setting > > > (00:15:21), where the job 686938[4].cheops10 is included. > > However, the > > > server claims that the job is unknown/doesn't exist and sends a > kill > > > command which the pbs_mom then executes (0:16:02). > > > > > > I hope this helps in tracking down the bug. 
> > > > > > Regards, > > > Lech Nieroda > > > > > > > > > On 22.11.2012 16:46, Lech Nieroda wrote: > > >> Dear list, > > >> > > >> we have another serious problem since our upgrade to Torque > > 4.1.3. We > > >> are using it with Maui 3.3.1. The problem in a nutshell: some > few, > > >> random jobs are suddenly "unknown" to the server, it changes > their > > >> status to EXITING-SUBSTATE55 and requests a silent kill on the > > compute > > >> nodes. The job then dies, the processes are killed on the node, > > there is > > >> no "Exit_status" in the server-log, no entry in maui/stats, no > > >> stdout/stderr files. The users are, understandably, not amused. > > >> > > >> It doesn't seem to be user or application specific. Even a single > > >> instance from a job array can get blown away in this way while > > all other > > >> instances end normally. > > >> > > >> Here are some logs of such a job (681684[35]): > > >> > > >> maui just assumes a successful completion: > > >> [snip] > > >> 11/21 19:24:49 > > MPBSJobUpdate(681684[35],681684[35].cheops10,TaskList,0) > > >> 11/21 19:24:49 INFO: Average nodespeed for Job 681684[35] is > > 1.000000, > > >> 1.000000, 1 > > >> 11/21 19:25:55 INFO: active PBS job 681684[35] has been > > removed from > > >> the queue. 
assuming successful completion > > >> 11/21 19:25:55 MJobProcessCompleted(681684[35]) > > >> 11/21 19:25:55 INFO: job '681684[35]' completed X: 0.063356 > T: > > >> 10903 PS: 10903 A: 0.063096 > > >> 11/21 19:25:55 MJobSendFB(681684[35]) > > >> 11/21 19:25:55 INFO: job usage sent for job '681684[35]' > > >> 11/21 19:25:55 MJobRemove(681684[35]) > > >> 11/21 19:25:55 MJobDestroy(681684[35]) > > >> [snap] > > >> > > >> pbs_server decides at 19:25:11, after 3 hours runtime, that the > > job is > > >> unknown (grepped by JobID from the server logs): > > >> [snip] > > >> 11/21/2012 > > >> > 16:23:43;0008;PBS_Server.26038;Job;svr_setjobstate;svr_setjobstate: > > >> setting job 681684[35].cheops10 state from RUNNING-TRNOUTCM to > > >> RUNNING-RUNNING (4-42) > > >> 11/21/2012 > > >> > 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: > > >> setting job 681684[35].cheops10 state from RUNNING-RUNNING to > > >> QUEUED-SUBSTATE55 (1-55) > > >> 11/21/2012 > > >> > 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: > > >> setting job 681684[35].cheops10 state from QUEUED-SUBSTATE55 to > > >> EXITING-SUBSTATE55 (5-55) > > >> 11/21/2012 > > >> 19:25:11;0100;PBS_Server.26097;Job;681684[35].cheops10;dequeuing > > from > > >> smp, state EXITING > > >> 11/21/2012 > > >> > > > 19:25:14;0001;PBS_Server.26122;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, > > stray > > >> job 681684[35].cheops10 found on cheops21316 > > >> [snap] > > >> > > >> pbs_client just kills the processes: > > >> [snip] > > >> 11/21/2012 16:23:43;0001; > pbs_mom.32254;Job;TMomFinalizeJob3;job > > >> 681684[35].cheops10 started, pid = 17452 > > >> 11/21/2012 19:25:14;0008; > > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid > > 17452 task > > >> 1 gracefully with sig 15 > > >> 11/21/2012 19:25:14;0008; > > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process > > >> (pid=17452/state=R) after sig 15 > > >> 11/21/2012 19:25:14;0008; > > >> 
pbs_mom.32254;Job;681684[35].cheops10;kill_task: process > > >> (pid=17452/state=Z) after sig 15 > > >> 11/21/2012 19:25:14;0008; > > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid > > 17692 task > > >> 1 gracefully with sig 15 > > >> 11/21/2012 19:25:14;0008; > > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process > > >> (pid=17692/state=R) after sig 15 > > >> 11/21/2012 19:25:14;0008; > > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid > > 17703 task > > >> 1 gracefully with sig 15 > > >> 11/21/2012 19:25:14;0008; > > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process > > >> (pid=17703/state=R) after sig 15 > > >> 11/21/2012 19:25:14;0008; > > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid > > 17731 task > > >> 1 gracefully with sig 15 > > >> 11/21/2012 19:25:14;0008; > > >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process > > >> (pid=17731/state=R) after sig 15 > > >> 11/21/2012 19:25:15;0080; > > >> pbs_mom.32254;Job;681684[35].cheops10;scan_for_terminated: job > > >> 681684[35].cheops10 task 1 terminated, sid=17452 > > >> 11/21/2012 19:25:15;0008; > > pbs_mom.32254;Job;681684[35].cheops10;job > > >> was terminated > > >> 11/21/2012 19:25:50;0001; > > >> pbs_mom.32254;Job;681684[35].cheops10;preobit_reply, unknown on > > server, > > >> deleting locally > > >> 11/21/2012 19:25:50;0080; > > >> pbs_mom.32254;Job;681684[35].cheops10;removed job script > > >> [snap] > > >> > > >> Sometimes, the pbs_mom logs include this message before the > > killing starts: > > >> [snip] > > >> Req;req_reject;Reject reply code=15001(Unknown Job Id Error), > aux=0, > > >> type=StatusJob, from PBS_Server at cheops10 > > >> [snap] > > >> > > >> And finally, some job informations given to epilogue: > > >> [snip] > > >> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared: > > >> 681684[35].cheops10,hthiele0,cheops21316,Starting shared epilogue > > >> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared: > > >> 
681684[35].cheops10,hthiele0,cheops21316,Job Information: > > >> userid=hthiele0, > > >> > > > resourcelist='mem=5gb,ncpus=1,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=48:00:00', > > >> > > > resourcesused='cput=03:00:46,mem=945160kb,vmem=1368548kb,walltime=03:01:34', > > >> queue=smp, account=ccg-ngs, exitcode=271 > > >> [snap] > > >> > > >> This happens rarely (about 1 in 3000). However, silent deletions > of > > >> random jobs aren't exactly a trifling matter. > > >> I could try to disable the mom_job_sync option, which could > perhaps > > >> prevent the process killing of unknown jobs, but it would also > leave > > >> corrupt/pre-execution jobs alive. > > >> > > >> Can this be fixed? > > >> > > >> On a side-note, here are some further, minor Bugs I've noticed > > in the > > >> Torque 4.1.3. Version: > > >> - the epilogue script is usually invoked twice and sometimes even > > >> several times > > >> - when explicit node lists are used, e.g. > > nodes=node1:ppn=2+node2:ppn=2, > > >> then the number of "tasks" as seen by qstat is zero > > >> - there have been some API changes between Torque 2.x and Torque > > 4.x, so > > >> that two maui calls had to be altered in order to build against > > Torque > > >> 4.x (get_svrport, openrm). > > >> > > >> > > >> Regards, > > >> Lech Nieroda > > >> > > > > > > > > > > > > -- > > Dipl.-Wirt.-Inf. Lech Nieroda > > Regionales Rechenzentrum der Universit?t zu K?ln (RRZK) > > Universit?t zu K?ln > > Weyertal 121 > > Raum 309 (3. 
Etage)
> > D-50931 Köln
> > Deutschland
> >
> > Tel.: +49 (221) 470-89606
> > E-Mail: nieroda.lech at uni-koeln.de
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> > --
> > David Beer | Senior Software Engineer
> > Adaptive Computing
>
> --
> Dipl.-Wirt.-Inf. Lech Nieroda
> Regionales Rechenzentrum der Universität zu Köln (RRZK)
> Universität zu Köln
> Weyertal 121
> Raum 309 (3. Etage)
> D-50931 Köln
> Deutschland
>
> Tel.: +49 (221) 470-89606
> E-Mail: nieroda.lech at uni-koeln.de

--
David Beer | Senior Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121213/b1fe0d4f/attachment-0001.html

From danield at igb.uiuc.edu Thu Dec 13 19:25:24 2012
From: danield at igb.uiuc.edu (Daniel Davidson)
Date: Thu, 13 Dec 2012 20:25:24 -0600
Subject: [torqueusers] Trouble with multinode mpi on two new nodes
Message-ID: <50CA8E14.3080803@igb.uiuc.edu>

We recently received four new systems to add to our cluster. They have pure UEFI BIOS and do not PXE boot, so I have to install them manually instead of just using Rocks to build the nodes. We have run into a problem when we try to use more than one of these new nodes to run openmpi jobs. We have compiled our own openmpi with torque support and we have used it on existing hardware for a while.

What does work:
- serial jobs submitted to each node
- parallel jobs that fit into one node
- a parallel program submitted by command line on either node (mpirun -host compute-2-0,compute-2-1 -np 2 hostname)
- parallel jobs run on other nodes

But when a job spanning the new nodes is submitted, it is queued and the output files are never created. FYI we use moab, if that matters.

I have ensured:
- mpi/torque libraries are loaded
- torque versions are the same (3.0.5-1 supplied by rocks installs on other nodes)
- /etc/hosts is the same on all systems
- ldap is working correctly on all systems
- universal file system home directories are mounted and writeable

Please help, I am at a loss as to what is happening.

Dan

Pertinent log information

compute-2-0
12/13/2012 19:46:02;0008; pbs_mom;Job;do_rpp;got an internal task manager request in do_rpp
12/13/2012 19:46:02;0002; pbs_mom;Svr;im_request;connect from 10.1.255.226:1023
12/13/2012 19:46:02;0008; pbs_mom;Job;296009.server.name.edu;im_request:received request 'ABORT_JOB' (10) for job 296009.server.name.edu from 10.1.255.226:1023
12/13/2012 19:46:02;0008; pbs_mom;Job;296009.server.name.edu;ERROR: received request 'ABORT_JOB' from 10.1.255.226:1023 for job '296009.server.name.edu' (job does not exist locally)
12/13/2012 19:46:02;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in do_rpp, cannot get protocol End of File
12/13/2012 19:46:02;0002; pbs_mom;Svr;im_eof;End of File from addr 10.1.255.226:1023

compute-2-1
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in do_rpp, cannot get protocol Premature end of message
12/13/2012 19:48:06;0002; pbs_mom;Svr;im_eof;Premature end of message from addr 10.1.255.227:15003
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::node_bailout, 296009.server.name.edu join_job failed from node compute-2-0 1 - recovery attempted)
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::sister could not communicate (15061) in 296009.server.name.edu, job_start_error from node
compute-2-0 in job_start_error
12/13/2012 19:48:06;0008; pbs_mom;Req;send_sisters;sending command ABORT_JOB for job 296009.server.name.edu (10)
12/13/2012 19:48:06;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters for job 296009.server.name.edu
12/13/2012 19:48:06;0001; pbs_mom;Job;296009.server.name.edu;send_sisters: sister #1 (compute-2-0) is not ok (1099)
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::node_bailout, node_bailout: received KILL/ABORT request for job 296009.server.name.edu from node compute-2-0
12/13/2012 19:48:06;0080; pbs_mom;Svr;scan_for_exiting;searching for exiting jobs
12/13/2012 19:48:06;0008; pbs_mom;Job;kill_job;scan_for_exiting: sending signal 9, "KILL" to job 296009.server.name.edu, reason: local task termination detected
12/13/2012 19:48:06;0008; pbs_mom;Job;296009.server.name.edu;kill_job done (killed 0 processes)
12/13/2012 19:48:06;0080; pbs_mom;Job;296009.server.name.edu;sending preobit jobstat
12/13/2012 19:48:06;0008; pbs_mom;Job;do_rpp;got an internal task manager request in do_rpp
12/13/2012 19:48:06;0002; pbs_mom;Svr;im_request;connect from 10.1.255.227:15003
12/13/2012 19:48:06;0008; pbs_mom;Job;296009.server.name.edu;im_request:received request 'ERROR' (99) for job 296009.server.name.edu from 10.1.255.227:15003
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, event 4 taskid 0 not found
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 99 to job 296009.server.name.edu
12/13/2012 19:48:06;0002; pbs_mom;Svr;im_eof;No error from addr 10.1.255.227:15003
12/13/2012 19:48:06;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
12/13/2012 19:48:06;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
12/13/2012 19:48:06;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
12/13/2012 19:48:06;0080; pbs_mom;Job;296009.server.name.edu;performing job clean-up in preobit_reply()
12/13/2012 19:48:06;0080; pbs_mom;Job;296009.server.name.edu;epilog subtask created with pid 72123 - substate set to JOB_SUBSTATE_OBIT - registered post_epilogue

Head node:
12/13/2012 19:47:22;0008;PBS_Server;Job;296009.server.name.edu;Job Run at request of root at server.name.edu
12/13/2012 19:47:22;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 296009.server.name.edu state from QUEUED-QUEUED to RUNNING-PRERUN (4-40)
12/13/2012 19:47:22;0008;PBS_Server;Job;296009.server.name.edu;forking in send_job
12/13/2012 19:47:22;0004;PBS_Server;Svr;svr_connect;attempting connect to host 10.1.255.226 port 15002
12/13/2012 19:47:22;0008;PBS_Server;Job;296009.server.name.edu;entering post_sendmom
12/13/2012 19:47:22;0002;PBS_Server;Job;296009.server.name.edu;child reported success for job after 0 seconds (dest=compute-2-1), rc=0
12/13/2012 19:47:22;0008;PBS_Server;Job;reply_send;Reply sent for request type RunJob on socket 13
12/13/2012 19:47:22;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 296009.server.name.edu state from RUNNING-PRERUN to RUNNING-RUNNING (4-42)
12/13/2012 19:47:22;0004;PBS_Server;Svr;svr_connect;attempting connect to host 10.1.255.226 port 15002
12/13/2012 19:47:27;0004;PBS_Server;Svr;svr_connect;attempting connect to host 10.1.255.226 port 15002
.......
12/13/2012 19:50:30;0009;PBS_Server;Job;296009.server.name.edu;obit received - updating final job usage info
12/13/2012 19:50:30;0008;PBS_Server;Job;296009.server.name.edu;attr resources_used modified
12/13/2012 19:50:30;0008;PBS_Server;Job;296009.server.name.edu;attr Error_Path modified
12/13/2012 19:50:30;0008;PBS_Server;Job;reply_send;Reply sent for request type JobObituary on socket 14
12/13/2012 19:50:30;0009;PBS_Server;Job;296009.server.name.edu;job exit status -3 handled
12/13/2012 19:50:30;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 296009.server.name.edu state from RUNNING-RERUN1 to EXITING-RERUN1 (5-61)
12/13/2012 19:50:30;0009;PBS_Server;Job;296009.server.name.edu;on_job_rerun task assigned to job
12/13/2012 19:50:30;0009;PBS_Server;Job;296009.server.name.edu;req_jobobit completed
12/13/2012 19:50:30;0004;PBS_Server;Svr;svr_connect;attempting connect to host 10.1.255.226 port 15002
12/13/2012 19:50:30;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 296009.server.name.edu state from EXITING-RERUN1 to EXITING-RERUN2 (5-62)
12/13/2012 19:50:30;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 296009.server.name.edu state from EXITING-RERUN2 to EXITING-RERUN3 (5-63)
.......
and this keeps repeating

We also get logs like:

Unable to copy file /var/spool/torque/spool/296008.biocluster.igb.illinois.edu.OU to /home/a-m/danield/mpitest.sh.o296008
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/296008.biocluster.igb.illinois.edu.OU': No such file or directory
*** end error output

But I am pretty sure that is because the mpi job is not running on the second node, from the first node that receives the job information.
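
The pbs_mom and PBS_Server excerpts above all appear to follow one layout: six semicolon-separated fields (timestamp, event code, daemon, class, object, message). When comparing logs from several nodes it can help to split the lines and keep only the LOG_ERROR entries. What follows is a minimal sketch under that assumption; the six-field layout is inferred from the excerpts in this thread, and the function names torque_log_fields and torque_log_errors are hypothetical helpers, not part of TORQUE.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with TORQUE). Assumed line layout,
# inferred from the log excerpts in this thread:
#   timestamp;event-code;daemon;class;object;message

# Print the six fields of one log line, separated by '|'.
# The body runs in a subshell so IFS and the variables stay local.
torque_log_fields() (
    # The last variable (msg) receives the rest of the line,
    # including any further semicolons inside the message.
    IFS=';' read -r ts code daemon class object msg <<EOF
$1
EOF
    # pbs_mom pads the daemon name with leading blanks; strip them.
    daemon=${daemon#"${daemon%%[![:space:]]*}"}
    printf '%s|%s|%s|%s|%s|%s\n' "$ts" "$code" "$daemon" "$class" "$object" "$msg"
)

# Read a log on stdin and print only the entries marked LOG_ERROR.
torque_log_errors() {
    while IFS= read -r line; do
        case $line in
            *LOG_ERROR::*) torque_log_fields "$line" ;;
        esac
    done
}
```

For example, `torque_log_errors < /var/spool/torque/server_logs/20121213` would list the error entries of that day's server log, assuming the log directory seen earlier in this thread; the same call on each node's pbs_mom log makes the entries from both moms easier to compare side by side.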

From brockp at umich.edu Fri Dec 14 08:02:58 2012
From: brockp at umich.edu (Brock Palen)
Date: Fri, 14 Dec 2012 10:02:58 -0500
Subject: [torqueusers] tm aware ssh
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620123B93B2121@exvic-mbx04.nexus.csiro.au>
References: <007DECE986B47F4EABF823C1FBB19C620123B93B2121@exvic-mbx04.nexus.csiro.au>
Message-ID: <0B6801DB-F9FC-4634-9565-A3A31BD2175A@umich.edu>

Thanks Gareth,

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
brockp at umich.edu
(734)936-1985

On Dec 13, 2012, at 6:18 PM, wrote:

>> I have a few applications that spawn using ssh and don't support tm,
>>
>> There was once on the list a 'pbsssh' that wrapped pbsdsh to act like
>> ssh,
>>
>> It looks like that script no longer works and I am scratching my head
>> as to getting it working again (my bash fu is weak).
>>
>> Does anyone already have a way they do this?
>>
>> The hope is to get correct process reporting, and cleanup for
>> applications that don't support TM but spawn with ssh/rsh.
>>
>>
>> Thanks!
>>
>> Brock Palen
>
> Hi Brock,
>
> We have this (below) in place. You might need to swallow more ssh options if they are present - and there is no checking that "node" actually gets set to a cluster node name.
> Specifying a user would break this... but you would not expect that in a cluster inside a batch job, right?
>
> Gareth
>
>
> wil240 at burnet-login:~> cat /apps/ascutils/bin/pbsssh
> #!/bin/bash
> # $Id: pbsssh 2236 2012-05-02 03:16:17Z wil240 $
> # $HeadURL: svn+ssh://stream/cs/home/svn/sysadmin/ascutils/common/pbsssh $
>
> usage="usage: $0 "
>
> #swallow -x -n and -q (for intel mpi)
> while getopts "xqn" opt
> do
>     :
> done
> shift $((OPTIND-1))
>
> if [ $# -lt 2 ]
> then
>     echo $usage
>     exit
> fi
>
> node=$1
>
> shift
>
> exec pbsdsh -o -h $node "$@"
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From David.Roman at noveltis.fr Fri Dec 14 05:52:50 2012
From: David.Roman at noveltis.fr (David Roman)
Date: Fri, 14 Dec 2012 12:52:50 +0000
Subject: [torqueusers] How to run mpirun of intel on torque
Message-ID:

Hello,

I am sorry, my English is poor; thank you for your patience.

I installed TORQUE 4.1.0 with Maui 3.3.1.
I compiled OPENMPI with the option --with-tm=/usr/local/torque.

I disallowed ssh connections for my users on the execution nodes. In /etc/ssh/sshd_config I set

AllowGroups root admin

This works fine.

But now I have installed the Intel MPI Library and ifort.

When I open an interactive pbs session:

qsub -I -l nodes=2:ppn=8

I am connected to a node. I run an mpi job:

mpirun -genv I_MPI_FABRICS_LIST tmi ./my_program

But it cannot start, because it cannot connect to the other node.

If I add the users' group to AllowGroups in /etc/ssh/sshd_config, it works. But if I do this, all users can connect to the execution nodes without using torque, and this is bad.

How can I either disallow ssh connections that do not go through torque, or make Intel's mpirun work like openmpi, without allowing ssh connections for users?

Thank you

David
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121214/b088d557/attachment.html

From David.Roman at noveltis.fr Fri Dec 14 09:13:26 2012
From: David.Roman at noveltis.fr (David Roman)
Date: Fri, 14 Dec 2012 16:13:26 +0000
Subject: [torqueusers] tm aware ssh
In-Reply-To: <0B6801DB-F9FC-4634-9565-A3A31BD2175A@umich.edu>
References: <007DECE986B47F4EABF823C1FBB19C620123B93B2121@exvic-mbx04.nexus.csiro.au> <0B6801DB-F9FC-4634-9565-A3A31BD2175A@umich.edu>
Message-ID:

Hello,

I tried this script with Intel's mpirun, but it doesn't work for me.

#!/bin/bash
# $Id: pbsssh 2236 2012-05-02 03:16:17Z wil240 $
# $HeadURL: svn+ssh://stream/cs/home/svn/sysadmin/ascutils/common/pbsssh $

usage="usage: $0 "

#swallow -x -n and -q (for intel mpi)
while getopts "xqn" opt
do
    :
done
shift $((OPTIND-1))

if [ $# -lt 2 ]
then
    echo $usage
    exit
fi

node=$1

shift

exec pbsdsh -v -o -h $node "$@"

I open an interactive session:

qsub -I -l nodes=32

I launch my code with (I'm on node12):

mpirun -bootstrap-exec pbsssh -genv I_MPI_FABRICS_LIST tmi ./wrf.exe

And I get these messages:

pbsdsh(): rescinfo from 0: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 16: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 17: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 18: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 19: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 20: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 21: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 22: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 23: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 24: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 25: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 26: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 27: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 28: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 29: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 30: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 31: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 1: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 2: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 3: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 4: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 5: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 6: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 7: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 8: Linux hpc-node12
2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32 pbsdsh(): rescinfo from 9: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32 pbsdsh(): rescinfo from 10: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32 pbsdsh(): rescinfo from 11: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32 pbsdsh(): rescinfo from 12: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32 pbsdsh(): rescinfo from 13: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32 pbsdsh(): rescinfo from 14: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32 pbsdsh(): rescinfo from 15: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32 pbsdsh(): spawned task 16 pbsdsh(): spawn event returned: 16 (1 spawns and 0 obits outstanding) pbsdsh(): sending obit for task 3 pbsdsh(): Event poll failed, error TM_ENOTCONNECTED starting wrf task 13 of 32 starting wrf task 1 of 32 starting wrf task 8 of 32 starting wrf task 0 of 32 starting wrf task 9 of 32 starting wrf task 27 of 32 starting wrf task 16 of 32 starting wrf task 10 of 32 starting wrf task 25 of 32 starting wrf task 29 of 32 starting wrf task 30 of 32 starting wrf task 31 of 32 starting wrf task 21 of 32 starting wrf task 17 of 32 starting wrf task 24 of 32 starting wrf task 18 of 32 starting wrf task 22 of 32 starting wrf task 5 of 32 starting wrf task 15 of 32 starting wrf task 6 of 32 starting wrf task 7 of 32 starting wrf task 2 of 32 starting wrf task 3 of 32 starting wrf task 11 of 32 starting wrf task 14 of 32 starting wrf task 20 of 32 starting wrf task 19 of 32 starting wrf task 12 of 32 starting wrf task 26 of 32 starting wrf task 4 of 32 starting wrf task 28 of 32 starting wrf task 23 of 32 pbsdsh(): reconnected pbsdsh(): skipping obit 
resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): sending obit for task 3 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): Event poll failed, error TM_ENOTCONNECTED pbsdsh(): reconnected pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): sending obit for task 3 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend 
for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 pbsdsh(): skipping obit resend for 0 etc Some body can said me why ? Thank you everybody -----Message d'origine----- De?: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] De la part de Brock Palen Envoy??: vendredi 14 d?cembre 2012 16:03 ??: Torque Users Mailing List Objet?: Re: [torqueusers] tm aware ssh Thanks Gareth, Brock Palen www.umich.edu/~brockp CAEN Advanced Computing brockp at umich.edu (734)936-1985 On Dec 13, 2012, at 6:18 PM, wrote: >> I have a few applications that spawn using ssh and don't support tm, >> >> There was once on the list a 'pbsssh' that wrapped pbsdsh to act >> like ssh, >> >> It looks like that script no longer works and I am scratching my head >> as to getting it working again (my bash fu is weak). >> >> Does anyone already have a way they do this? >> >> The hope is to get correct process reporting, and cleanup for >> applications that don't support TM but spawn with ssh/rsh. >> >> >> Thanks! >> >> Brock Palen > > Hi Brock, > > We have this (below) in place. You might need to swallow more ssh options if they are present - and there is no checking that "node" actually gets set to a cluster node name. > Specifying a user would break this... but you would not expect that in a cluster inside a batch job, right? 
> > Gareth > > > wil240 at burnet-login:~> cat /apps/ascutils/bin/pbsssh #!/bin/bash # > $Id: pbsssh 2236 2012-05-02 03:16:17Z wil240 $ # $HeadURL: > svn+ssh://stream/cs/home/svn/sysadmin/ascutils/common/pbsssh $ > > usage="usage: $0 " > > #swallow -x -n and -q (for intel mpi) > while getopts "xqn" opt > do > : > done > shift $((OPTIND-1)) > > if [ $# -lt 2 ] > then > echo $usage > exit > fi > > node=$1 > > shift > > exec pbsdsh -o -h $node "$@" > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From jjc at iastate.edu Fri Dec 14 10:34:25 2012 From: jjc at iastate.edu (Coyle, James J [ITACD]) Date: Fri, 14 Dec 2012 17:34:25 +0000 Subject: [torqueusers] MAUI / Torque integration problems. Message-ID: <242421BFAF465844BE24EB90BB97E2210FA1C3CE@ITSDAG1D.its.iastate.edu> I'm a long-time pbs_sched user who would like to transition to using MAUI. I seem to have things set up as in the adaptive computing integration guide at: http://www.adaptivecomputing.com/resources/docs/maui/mauistart.php and using info about parameters from: http://www.adaptivecomputing.com/resources/docs/maui/13.2rmconfiguration.php In particular I changed the following settings: SERVERNAME hpcg1.its.iastate.edu SERVERHOST hpcg1.its.iastate.edu # primary admin must be first in list ADMIN1 root # Resource Manager Definition RMCFG[base] TYPE=PBS HOST=hpcg1.its.iastate.edu and I removed SERVERPORT 42559 so that it should use the default port (15003?). I will append the entire maui.cfg file. I can ask the other questions on the mauiusers list I just joined, but this may have something to do with TORQUE, so I am asking here. I appear to have everything running, but jobs just queue. I can get a job to run by using qrun. 
It looks to me that maui and Torque are not talking. The maui log has nothing about any of the jobs that I submitted and deleted, or the ones I forced to run using qrun. I am just getting started, so pbs_server, pbs_mom and maui all run on the same node. (If I substitute pbs_sched for maui, everything runs fine, so other than the scheduler and integration, everything else is working fine.) I am most uncertain about RMCFG: Do I need to specify some PORT on RMCFG[base] ? If so, what should I put? Anything else wrong with maui.cfg or am I looking in the wrong place? Any help would be appreciated. Thanks, - Jim James Coyle, PhD High Performance Computing Group 115 Durham Center Iowa State Univ. phone: (515)-294-2099 Ames, Iowa 50011 web: http://jjc.public.iastate.edu/ -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121214/5af79c85/attachment.html From gus at ldeo.columbia.edu Fri Dec 14 10:57:50 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 14 Dec 2012 12:57:50 -0500 Subject: [torqueusers] How to run mpirun of intel on torque In-Reply-To: References: Message-ID: <50CB689E.4040309@ldeo.columbia.edu> On 12/14/2012 07:52 AM, David Roman wrote: > Hello, > > I am sorry, but my english is really sad. I thank you for your patience > for that. > > I installed TORQUE 4.1.0, with MAU 3.3.1. I compiled OPENMPI with option > --with-tm=/usr/local/torque. > > I disallowed ssh connections for my users on executable nodes. In > /etc/ssh/sshd_config I set > > AllowGroups root admin > > This works fine. > > But now, I install MPI Intel Librarie and Ifort. > > When I open a interactive pbs session: > > qsub -I -l nodes=2:ppn=8 > > I am connected on a node. > > I run mpi job > > mpirun -genv I_MPI_FABRICS_LIST tmi ./my_program > > But it can not start, because it cannot connect on the other node. > > If I append users in AllowGroups of /etc/ssh/sshd_config it works. 
> > But if I do this, all users can connect on executable nodes, without use > torque, and this is bad. > > How can I do to disallow ssh connection without torque or make mpirun of > intel works like openmpi, without ssh connection allowed for users ? > > Thank you > > David > > Hi David For what it is worth, I_MPI_FABRICS_LIST seems to be an Intel-MPI environment variable, not OpenMPI. I am not familiar to Intel-MPI, and I couldn't find out what exactly a "tmi" fabric/network is. ** The corresponding OpenMPI way to set runtime parameters is through the "mca" parameters, and this includes the network fabric to use. Say, if you want (Ethernet) tcp and intranode shared memory, then add: -mca btl tcp,sm,self If you want Infinband and intranode shared memory, add: -mca btl openib,sm,self You can get a lot of information about your OpenMPI installation running: ompi_info ** I wonder if the mpiexec you're using is really part of your OpenMPI, and really the one you built with Torque/tm support, or perhaps pointing inadvertently to your Intel-MPI mpiexec. You cannot mix the various MPI implementations. Would there be a problem with the PATH? For OpenMPI you need to set both PATH and LD_LIBRARY_PATH properly. The OpenMPI FAQ explain both the environment setup and the use of "mca" parameters: http://www.open-mpi.org/faq/ A very simple test of OpenMPI functionality is: mpiexec hostname There are also the connectivity_c.c, ring_c.c, and hello_c.c in the OpenMPI "examples" directory, which you can compile with mpicc and run with mpiexec. 
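
One way to act on the PATH question is to look at the version banner of whichever launcher the shell finds first. The sketch below is only an illustration under assumptions: mpi_flavor is a hypothetical helper name, and the banner strings it matches ("Open MPI" or "OpenRTE" for OpenMPI, "Intel(R) MPI" for Intel MPI) are what these launchers typically print, not something quoted in this thread.

```shell
#!/bin/bash
# Classify an MPI launcher from the text of its version banner (read on stdin).
# ASSUMPTION: the matched strings are typical banner contents; they are not
# taken from this thread and may differ between releases.
mpi_flavor() {
    local banner
    banner=$(cat)
    case "$banner" in
        *"Open MPI"*|*OpenRTE*) echo "openmpi" ;;
        *"Intel(R) MPI"*)       echo "intel" ;;
        *)                      echo "unknown" ;;
    esac
}

# Possible use inside a job (commented out so the sketch has no side effects):
#   command -v mpiexec                      # which launcher comes first in PATH
#   mpiexec --version 2>&1 | mpi_flavor     # which implementation it belongs to
```

If the result is not the implementation the program was compiled against, PATH and LD_LIBRARY_PATH are the first things to check, as suggested above.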
**

Also, to prevent users from connecting directly to the nodes
(or more precisely, to the nodes where they don't have Torque jobs running),
you can configure and install Torque with the pam module:

./configure --with-pam

**

I hope it helps,
Gus Correa

From jjc at iastate.edu Fri Dec 14 11:28:08 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Fri, 14 Dec 2012 18:28:08 +0000
Subject: [torqueusers] How to run mpirun of intel on torque
In-Reply-To:
References:
Message-ID: <242421BFAF465844BE24EB90BB97E2210FA1C3F1@ITSDAG1D.its.iastate.edu>

I have a script that I run from /var/spool/mom_priv/prologue whenever a job requires all the cores on the node. This script kills any user-level processes that do not belong to the job owner or a torque manager. This gets rid of any effect of users who should not be logged on to the compute node, that is, who have just ssh'd in rather than using qsub -I.

I clean out scratch space on exit with epilogue if it is a dedicated job.

You could couple this with techniques to prevent ssh logins during the job. (For jobs which dedicate the node, have prologue disallow logins from other users, and restore normal function on exit.)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121214/ac718833/attachment-0001.html

From brian.haymore at utah.edu Fri Dec 14 11:56:23 2012
From: brian.haymore at utah.edu (Brian Haymore)
Date: Fri, 14 Dec 2012 18:56:23 +0000
Subject: [torqueusers] How to run mpirun of intel on torque
In-Reply-To: <242421BFAF465844BE24EB90BB97E2210FA1C3F1@ITSDAG1D.its.iastate.edu>
References: , <242421BFAF465844BE24EB90BB97E2210FA1C3F1@ITSDAG1D.its.iastate.edu>
Message-ID: <4coxporgpwl5ws616tlw92xp.1355511373856@email.android.com>

There are no quotas present on the scratch file systems. So for you to be getting a quota limited message tells me that somehow the tool you're using is trying to write something to your user home directory. You should look to clean up some of your home directory to bring yourself below your quota limit.

--
Brian D Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East, RM 405
Salt Lake City, Utah 84112
Phone: 801-558-1150, Fax: 801-585-5366
http://www.map.utah.edu/umaplink/0019.html

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121214/f4cb0dc8/attachment.html

From jjc at iastate.edu Fri Dec 14 13:05:24 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Fri, 14 Dec 2012 20:05:24 +0000
Subject: [torqueusers] How to run mpirun of intel on torque
In-Reply-To: <4coxporgpwl5ws616tlw92xp.1355511373856@email.android.com>
References: , <242421BFAF465844BE24EB90BB97E2210FA1C3F1@ITSDAG1D.its.iastate.edu> <4coxporgpwl5ws616tlw92xp.1355511373856@email.android.com>
Message-ID: <242421BFAF465844BE24EB90BB97E2210FA1C42C@ITSDAG1D.its.iastate.edu>

Huh? I assume this response was meant for some other message.

- Jim C.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121214/1179c71d/attachment-0001.html

From brian.haymore at utah.edu Fri Dec 14 13:55:51 2012
From: brian.haymore at utah.edu (Brian Haymore)
Date: Fri, 14 Dec 2012 20:55:51 +0000
Subject: [torqueusers] How to run mpirun of intel on torque
In-Reply-To: <242421BFAF465844BE24EB90BB97E2210FA1C42C@ITSDAG1D.its.iastate.edu>
References: , <242421BFAF465844BE24EB90BB97E2210FA1C3F1@ITSDAG1D.its.iastate.edu> <4coxporgpwl5ws616tlw92xp.1355511373856@email.android.com>, <242421BFAF465844BE24EB90BB97E2210FA1C42C@ITSDAG1D.its.iastate.edu>
Message-ID:

oops ;) yep. sorry about that.

--
Brian D Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East, RM 405
Salt Lake City, Utah 84112
Phone: 801-558-1150, Fax: 801-585-5366
http://www.map.utah.edu/umaplink/0019.html

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121214/6c7eade8/attachment.html

From David.Roman at noveltis.fr Sat Dec 15 02:49:28 2012
From: David.Roman at noveltis.fr (David Roman)
Date: Sat, 15 Dec 2012 09:49:28 +0000
Subject: [torqueusers] RE : How to run mpirun of intel on torque
In-Reply-To: <4coxporgpwl5ws616tlw92xp.1355511373856@email.android.com>
References: , <242421BFAF465844BE24EB90BB97E2210FA1C3F1@ITSDAG1D.its.iastate.edu>, <4coxporgpwl5ws616tlw92xp.1355511373856@email.android.com>
Message-ID:

Thanks for your reply. I think this could be my solution.
Could you show an example script, please?

David

From dong.tian at gmail.com Sat Dec 15 09:32:02 2012
From: dong.tian at gmail.com (Tian, Dong)
Date: Sat, 15 Dec 2012 11:32:02 -0500
Subject: [torqueusers] prologue script
Message-ID:

Dear Experts,

I have a question on how to run a prologue script on our TORQUE system, in order to perform some health checks before a job is executed.

My first test prints environment variables, but the output never appears in the output log file. So I suspect the prologue script never runs.

The script is like this,

#!/bin/bash

echo "Requested Resource: $5"

exit 0

It was placed under /opt/torgue/mom_priv, with file name "prologue".
File permission is set to "-r-x------". The system was not restarted after the changes, for there are ongoing jobs. Could someone give me a hint anything I did was wrong or I missed something? Thanks a lot! Dong -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121215/7b52db1a/attachment.html From dong.tian at gmail.com Sun Dec 16 16:32:18 2012 From: dong.tian at gmail.com (Tian, Dong) Date: Sun, 16 Dec 2012 18:32:18 -0500 Subject: [torqueusers] prologue script In-Reply-To: References: Message-ID: It was solved. The problem is that the script was not copied to the compute nodes. It only resides at the head node. Thanks, Dong On Sat, Dec 15, 2012 at 11:32 AM, Tian, Dong wrote: > Dear Experts, > > I have a question on how to run prologue script on our TORQUE system, in > order to perform some health check before a job is to be executed. > > My first test is print environment variables, but the print never appear > in the output log file. So I suspect the prologue script never runs. > > The script is like this, > > #!/bin/bash > > echo "Requested Resource: $5" > > exit 0 > > It was placed under /opt/torgue/mom_priv, with file name "prologue". File > permission is set to "-r-x------". > > The system was not restarted after the changes, for there are ongoing jobs. > > Could someone give me a hint anything I did was wrong or I missed > something? > > Thanks a lot! > Dong > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121216/044a9a9d/attachment.html From jjc at iastate.edu Mon Dec 17 09:20:21 2012 From: jjc at iastate.edu (Coyle, James J [ITACD]) Date: Mon, 17 Dec 2012 16:20:21 +0000 Subject: [torqueusers] MAUI / Torque integration problems. 
In-Reply-To: <242421BFAF465844BE24EB90BB97E2210FA1C3CE@ITSDAG1D.its.iastate.edu>
References: <242421BFAF465844BE24EB90BB97E2210FA1C3CE@ITSDAG1D.its.iastate.edu>
Message-ID: <242421BFAF465844BE24EB90BB97E2210FA1C94D@ITSDAG1D.its.iastate.edu>

I forgot to append the maui.cfg file, here it is:

# maui.cfg 3.2.6p20

SERVERNAME            hpcg1.its.iastate.edu
SERVERHOST            hpcg1.its.iastate.edu

# primary admin must be first in list
ADMIN1                root

# Resource Manager Definition

RMCFG[base] TYPE=PBS HOST=hpcg1.its.iastate.edu

# Allocation Manager Definition

AMCFG[bank] TYPE=NONE

# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL        00:00:30

SERVERMODE            NORMAL

# Admin: http://supercluster.org/mauidocs/a.esecurity.html

LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT       1

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html

#FSPOLICY              PSDEDICATED
#FSDEPTH               7
#FSINTERVAL            86400
#FSDECAY               0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY  MINRESOURCE

# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test]   17:00:00
# SRDAYS[test]      MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test]   0:30:00

# Cred Component
CREDWEIGHT 0

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html

# USERCFG[DEFAULT]      FSTARGET=25.0
# USERCFG[john]         PRIORITY=100 FSTARGET=10.0-
# GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch]       FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Coyle, James J [ITACD]
Sent: Friday, December 14, 2012 11:34 AM
To: Torque Users Mailing List
Subject: [torqueusers] MAUI / Torque integration problems.

I'm a long-time pbs_sched user who would like to transition to using MAUI.

I seem to have things set up as in the adaptive computing integration guide at:
http://www.adaptivecomputing.com/resources/docs/maui/mauistart.php
and using info about parameters from:
http://www.adaptivecomputing.com/resources/docs/maui/13.2rmconfiguration.php

In particular I changed the following settings:

SERVERNAME            hpcg1.its.iastate.edu
SERVERHOST            hpcg1.its.iastate.edu

# primary admin must be first in list
ADMIN1                root

# Resource Manager Definition

RMCFG[base] TYPE=PBS HOST=hpcg1.its.iastate.edu

and I removed

SERVERPORT 42559

so that it should use the default port (15003?).
I will append the entire maui.cfg file.

I can ask the other questions on the mauiusers list I just joined, but this may have something to do with TORQUE, so I am asking here.

I appear to have everything running, but jobs just queue. I can get a job to run by using qrun. It looks to me that maui and Torque are not talking. The maui log has nothing about any of the jobs that I submitted and deleted, or the ones I forced to run using qrun.

I am just getting started, so pbs_server, pbs_mom and maui all run on the same node. (If I substitute pbs_sched for maui, everything runs fine, so other than the scheduler and integration, everything else is working fine.)

I am most uncertain about RMCFG: Do I need to specify some PORT on RMCFG[base] ? If so, what should I put?

Anything else wrong with maui.cfg or am I looking in the wrong place?

Any help would be appreciated.

Thanks,
- Jim

James Coyle, PhD
High Performance Computing Group
115 Durham Center
Iowa State Univ.
phone: (515)-294-2099
Ames, Iowa 50011 web: http://jjc.public.iastate.edu/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121217/79cabfb6/attachment-0001.html

From k.mohammad1 at physics.ox.ac.uk Tue Dec 18 05:22:39 2012
From: k.mohammad1 at physics.ox.ac.uk (Kashif Mohammad)
Date: Tue, 18 Dec 2012 12:22:39 +0000
Subject: [torqueusers] momctl errors
Message-ID: <88B17E26E0A9F94381C67535AEF2BB5E7982ED01@EXCHNG13.physics.ox.ac.uk>

Hi

I am troubleshooting a problem on our cluster where jobs on compute nodes sometimes lose contact with the Torque server. When this happens, the jobs keep running on the compute node while the Torque server thinks that the job has finished.

There are a lot of these errors in the log file:

    PBS_Server: LOG_ERROR::Cannot assign requested address (99) in send_job, send_job failed to a3010517 port 15002

If I check

    momctl -h node_name -d 2

it gives the output, but if I check

    momctl -p 15002 -h node_name -d 2

it fails with this error:

    ERROR: query[0] 'diag2' failed on node_name (errno=0-Success: 5-Input/output error)

I can see on the compute node that it is listening on port 15002, but requests coming to this port stay in the TIME_WAIT state:

    netstat -an | grep 15002
    tcp 0 0 0.0.0.0:15002 0.0.0.0:* LISTEN
    tcp 0 0 163.1.5.98:15002 163.1.5.44:861 TIME_WAIT
    tcp 0 0 163.1.5.98:15002 163.1.5.44:607 TIME_WAIT
    tcp 0 0 163.1.5.98:15002 163.1.5.44:831 TIME_WAIT
    tcp 0 0 163.1.5.98:15002 163.1.5.44:999 TIME_WAIT
    tcp 0 0 163.1.5.98:15002 163.1.5.44:993 TIME_WAIT
    tcp 0 0 163.1.5.98:15002 163.1.5.44:985 TIME_WAIT
    tcp 0 0 163.1.5.98:15002 163.1.5.44:685 TIME_WAIT
    tcp 0 0 163.1.5.98:15002 163.1.5.44:897 TIME_WAIT

We are running torque-2.5.12 and we have around 1300 job slots in our cluster.

I would appreciate it if someone could give some hints.
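
A minimal sketch, assuming the six-column `netstat -an` layout shown above, of how such a listing could be tallied by TCP state for one local port. The function name `count_states` and the embedded sample are illustrative only; the sample is copied from the netstat output in this message, and nothing here is part of Torque or momctl.

```python
from collections import Counter

# Sample copied from the netstat output quoted in the message above.
NETSTAT_SAMPLE = """\
tcp 0 0 0.0.0.0:15002 0.0.0.0:* LISTEN
tcp 0 0 163.1.5.98:15002 163.1.5.44:861 TIME_WAIT
tcp 0 0 163.1.5.98:15002 163.1.5.44:607 TIME_WAIT
tcp 0 0 163.1.5.98:15002 163.1.5.44:831 TIME_WAIT
tcp 0 0 163.1.5.98:15002 163.1.5.44:999 TIME_WAIT
tcp 0 0 163.1.5.98:15002 163.1.5.44:993 TIME_WAIT
tcp 0 0 163.1.5.98:15002 163.1.5.44:985 TIME_WAIT
tcp 0 0 163.1.5.98:15002 163.1.5.44:685 TIME_WAIT
tcp 0 0 163.1.5.98:15002 163.1.5.44:897 TIME_WAIT
"""


def count_states(netstat_text, port):
    """Count TCP states for sockets whose local address ends in :port."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, foreign, state.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, state = fields[3], fields[5]
        if local.rsplit(":", 1)[-1] == str(port):
            counts[state] += 1
    return counts


for state, n in sorted(count_states(NETSTAT_SAMPLE, 15002).items()):
    print(state, n)
```

Run periodically on a compute node, a tally like this would show whether the number of TIME_WAIT sockets on the MOM port grows over time; it does not by itself explain the send_job failures.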
Thanks and Regards
Kashif

From knielson at adaptivecomputing.com Tue Dec 18 18:17:59 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Tue, 18 Dec 2012 18:17:59 -0700
Subject: [torqueusers] TORQUE 4.1.4 available for general availability
Message-ID:

Hi all,

Adaptive Computing is pleased to announce the general availability of TORQUE 4.1.4. Several fixes and a few enhancements were added to this build. For a complete list of changes see the attached CHANGELOG.

The tarball can be downloaded from http://www.adaptivecomputing.com/support/download-center/torque-download/ torque-4.1.4.tar.gz

We would like to remind everyone that TORQUE is now available on GitHub. Our project can be found at https://github.com/adaptivecomputing/torque. 4.1.4 is a repository in that project.

Thanks to everyone who helped make this possible. We had great participation from the community. The contribution from the community is tremendous and we are very grateful for the feedback and help

Regards

Ken Nielson
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121218/4d772276/attachment.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: torque-4.1.4.CHANGELOG
Type: application/octet-stream
Size: 4788 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20121218/4d772276/attachment.obj

From stijn.deweirdt at ugent.be Wed Dec 19 06:47:05 2012
From: stijn.deweirdt at ugent.be (Stijn De Weirdt)
Date: Wed, 19 Dec 2012 14:47:05 +0100
Subject: [torqueusers] send_sister sister not ok
Message-ID: <50D1C559.50600@ugent.be>

hi all,

we are seeing following errors with launching a 512 node job
anyone any idea how to debug? is this a timeout issue?
nodes seem fine and have been running jobs before

stijn

12/19/2012 13:55:16;0001; pbs_mom.3471;Job;194.master-moab;send_sisters: sister #178 (node1181.muk.os) is not ok (15001)
12/19/2012 13:55:16;0001; pbs_mom.3471;Job;194.master-moab;send_sisters: sister #222 (node1225.muk.os) is not ok (15001)
12/19/2012 13:55:16;0001; pbs_mom.3471;Job;194.master-moab;send_sisters: sister #287 (node1291.muk.os) is not ok (15001)
12/19/2012 13:55:16;0001; pbs_mom.3471;Job;194.master-moab;send_sisters: sister #455 (node1468.muk.os) is not ok (15001)
12/19/2012 13:55:16;0001; pbs_mom.3471;Job;194.master-moab;send_sisters: sister #496 (node1509.muk.os) is not ok (15001)
12/19/2012 13:55:19;0001; pbs_mom.3471;Svr;pbs_mom;LOG_ERROR::im_request, Response recieved from client 10.141.129.181:471 (15003) jobid 194.master-moab
12/19/2012 13:55:19;0001; pbs_mom.3471;Svr;pbs_mom;LOG_ERROR::im_request, Response recieved from client 10.141.129.225:575 (15003) jobid 194.master-moab
12/19/2012 13:55:25;0001; pbs_mom.3471;Svr;pbs_mom;LOG_ERROR::im_request, Response recieved from client 10.141.130.35:603 (15003) jobid 194.master-moab
12/19/2012 13:55:25;0001; pbs_mom.3471;Svr;pbs_mom;LOG_ERROR::im_request, Response recieved from client 10.141.130.212:743 (15003) jobid 194.master-moab
12/19/2012 13:55:37;0001; pbs_mom.3471;Svr;pbs_mom;LOG_ERROR::im_request, Response recieved from client 10.141.130.253:414 (15003) jobid 194.master-moab

From David.Roman at noveltis.fr Wed Dec 19 06:59:01 2012
From: David.Roman at noveltis.fr (David Roman)
Date: Wed, 19 Dec 2012 13:59:01 +0000
Subject: [torqueusers] RE : How to run mpirun of intel on torque
In-Reply-To:
References: , <242421BFAF465844BE24EB90BB97E2210FA1C3F1@ITSDAG1D.its.iastate.edu>, <4coxporgpwl5ws616tlw92xp.1355511373856@email.android.com>
Message-ID:

Hello,

I think I solved my problem.

To prevent users from using more resources than they requested, I use cpusets with Torque.
On each execution node, every process running under PBS has a /proc/<pid>/cpuset file containing its cpuset definition; other processes have only / in this file.

From crontab I check all user processes and kill any that are not running under PBS.

David

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Roman
Sent: Saturday, 15 December 2012 10:49
To: Torque Users Mailing List
Subject: [torqueusers] RE : How to run mpirun of intel on torque

Thank you for your reply. I think this could be my solution. Could you show an example script, please?

David
________________________________________
From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Brian Haymore [brian.haymore at utah.edu]
Sent: Friday, 14 December 2012 19:56
To: Torque Users Mailing List
Subject: Re: [torqueusers] How to run mpirun of intel on torque

There are no quotas present on the scratch file systems. So for you to be getting a quota limited message tells me that somehow the tool you're using is trying to write something to your user home directory. You should look to clean up some of your home directory to bring yourself below your quota limit.

--
Brian D Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East, RM 405
Salt Lake City, Utah 84112
Phone: 801-558-1150, Fax: 801-585-5366
http://www.map.utah.edu/umaplink/0019.html

"Coyle, James J [ITACD]" wrote:

I have a script that I run from /var/spool/mom_priv/prologue whenever a job requires all the cores on the node. This script kills any user-level processes that do not belong to the job owner or a Torque manager. This gets rid of any effect of users who should not be logged on to the compute node, that is, who have just ssh'd in rather than using qsub -I.

I clean out scratch space on exit with epilogue if it is a dedicated job.
You could couple this with techniques to prevent ssh logins during the job. (For jobs which dedicate the node, have prologue disallow logins from other users, and restore normal function on exit.)

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Roman
Sent: Friday, December 14, 2012 6:53 AM
To: 'torqueusers at supercluster.org'
Subject: [torqueusers] How to run mpirun of intel on torque

Hello,

I am sorry, my English is poor; thank you for your patience.

I installed TORQUE 4.1.0 with MAUI 3.3.1. I compiled Open MPI with the option --with-tm=/usr/local/torque.
I disallowed ssh connections for my users on the execution nodes. In /etc/ssh/sshd_config I set

    AllowGroups root admin

This works fine.

But now I have installed the Intel MPI Library and ifort.
When I open an interactive PBS session:

    qsub -I -l nodes=2:ppn=8

I am connected to a node. I run an MPI job:

    mpirun -genv I_MPI_FABRICS_LIST tmi ./my_program

But it cannot start, because it cannot connect to the other node. If I add the users to AllowGroups in /etc/ssh/sshd_config it works. But if I do this, all users can connect to the execution nodes without using Torque, and this is bad.

How can I disallow ssh connections outside Torque, or make Intel's mpirun work like Open MPI, without allowing ssh connections for users?

Thank you

David
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From samuel at unimelb.edu.au Wed Dec 19 15:34:37 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Thu, 20 Dec 2012 09:34:37 +1100
Subject: [torqueusers] How to run mpirun of intel on torque
In-Reply-To:
References:
Message-ID: <50D240FD.9060505@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 14/12/12 23:52, David Roman wrote:

> But now, I install MPI Intel Librarie and Ifort.
You probably want to look at the OSC mpiexec replacement which knows how to speak to Torque via the TM interface and can launch Intel MPI jobs for you.

https://www.osc.edu/~djohnson/mpiexec/

We build ours with:

./configure --with-pbs=/usr/local/torque/latest --with-default-comm=mpich2-pmi --prefix=/usr/local/$(basename $(pwd) | sed 's#-#/#')

You'll need to adjust the various paths in that to get it to work for you of course.

Best of luck,
Chris
- --
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iEYEARECAAYFAlDSQP0ACgkQO2KABBYQAh+RPACeJb6k/8viRujfX0UCgm1ux3ru
hu0An1tjq+RNcn76mQTP0ADg8Z6BXbpE
=krge
-----END PGP SIGNATURE-----

From basv at sara.nl Wed Dec 19 15:43:49 2012
From: basv at sara.nl (Bas van der Vlies)
Date: Wed, 19 Dec 2012 22:43:49 +0000
Subject: [torqueusers] pbsnodes message=EVENT in status field
Message-ID: <74EB35DC444C754DA400390F673C23D608677120@sara-exch-02.ka.sara.nl>

Hello,

I want to know how I can set an EVENT message for a node; e.g.:

pbsnodes n1:
{{{
eg: node: n1
    state = free
    np = 8
    properties = ib,switch1,highmem
    ntype = cluster
    jobs = 0/567403.sara.nl, 1/567403.sara.nl
    status = ..,loadave=0.00,message=EVENT:sample.time=1288864220.003:cputotals.user=0:iconnect.pktout=0,netload=3487600394,state=free,...
}}}

This is an example that I programmed in pbs_python for somebody else, but I cannot remember how to set it.

I can set MESSAGE=ERROR: just print ERROR to stdout in a health script with a description. Or is it the same for an event message?
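
A minimal sketch, assuming the status layout shown in the pbsnodes example above (comma-separated `key=value` items, with the EVENT payload as colon-separated `key=value` pairs), of how such a status string could be taken apart. The helper names `parse_status` and `parse_event` are hypothetical and are not part of pbs_python or Torque; the sketch covers reading the field only, not setting it.

```python
def parse_status(status):
    """Split a pbsnodes status string into a dict of key -> value."""
    fields = {}
    for item in status.split(","):
        key, sep, value = item.partition("=")
        if sep:  # skip fragments without '=', e.g. elided parts
            fields[key] = value
    return fields


def parse_event(message):
    """Split 'EVENT:k=v:k=v' into a dict; return None for other messages."""
    kind, _, rest = message.partition(":")
    if kind != "EVENT":
        return None
    return dict(pair.split("=", 1) for pair in rest.split(":") if "=" in pair)


# Status text taken from the example above, without the elided parts.
STATUS = ("loadave=0.00,message=EVENT:sample.time=1288864220.003:"
          "cputotals.user=0:iconnect.pktout=0,netload=3487600394,state=free")

fields = parse_status(STATUS)
event = parse_event(fields["message"])
print(event)
```

The sketch assumes that neither keys nor values contain commas or colons, which holds for the sample but may not hold for arbitrary health-script output.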
regards

--
Bas van der Vlies
mail: basv at sara.nl
SARA - Academic Computing Services, Amsterdam, The Netherlands

From andre.gemuend at scai.fraunhofer.de Thu Dec 20 02:02:50 2012
From: andre.gemuend at scai.fraunhofer.de (André Gemünd)
Date: Thu, 20 Dec 2012 10:02:50 +0100 (CET)
Subject: [torqueusers] How to run mpirun of intel on torque
In-Reply-To: <50D240FD.9060505@unimelb.edu.au>
Message-ID: <1832273347.433663.1355994170754.JavaMail.root@scai.fraunhofer.de>

Hi,

I hope it's okay if I chime in. Did you try that with Intel MPI 4.x as well? I had no luck with mpiexec 0.84 and IMPI 4.0.3 using mpich2-pmi. Or maybe I'm missing something (didn't have much time to test).

Greetings
André

----- Original Message -----
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 14/12/12 23:52, David Roman wrote:
>
> > But now, I install MPI Intel Librarie and Ifort.
>
> You probably want to look at the OSC mpiexec replacement which knows
> how to speak to Torque via the TM interface and can launch Intel MPI
> jobs for you.
>
> https://www.osc.edu/~djohnson/mpiexec/
>
> We build ours with:
>
> ./configure --with-pbs=/usr/local/torque/latest
> --with-default-comm=mpich2-pmi --prefix=/usr/local/$(basename $(pwd)
> | sed 's#-#/#')
>
> You'll need to adjust the various paths in that to get it to work for
> you of course.
>
> Best of luck,
> Chris
> - --
> Christopher Samuel Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/ http://twitter.com/vlsci
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with undefined - http://www.enigmail.net/
>
> iEYEARECAAYFAlDSQP0ACgkQO2KABBYQAh+RPACeJb6k/8viRujfX0UCgm1ux3ru
> hu0An1tjq+RNcn76mQTP0ADg8Z6BXbpE
> =krge
> -----END PGP SIGNATURE-----
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
André Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend at scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

From David.Roman at noveltis.fr Thu Dec 20 02:52:55 2012
From: David.Roman at noveltis.fr (David Roman)
Date: Thu, 20 Dec 2012 09:52:55 +0000
Subject: [torqueusers] How to run mpirun of intel on torque
In-Reply-To: <50D240FD.9060505@unimelb.edu.au>
References: <50D240FD.9060505@unimelb.edu.au>
Message-ID:

Hello,

Thank you for this post. I tried it, but when I start my code with mpirun ./my_code, mpirun wants to use mpiexec.hydra rather than mpiexec.

I am looking for an mpirun option to use mpiexec instead of mpiexec.hydra.

David

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Christopher Samuel
Sent: Wednesday, 19 December 2012 23:35
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] How to run mpirun of intel on torque

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 14/12/12 23:52, David Roman wrote:

> But now, I install MPI Intel Librarie and Ifort.

You probably want to look at the OSC mpiexec replacement which knows how to speak to Torque via the TM interface and can launch Intel MPI jobs for you.
https://www.osc.edu/~djohnson/mpiexec/

We build ours with:

./configure --with-pbs=/usr/local/torque/latest --with-default-comm=mpich2-pmi --prefix=/usr/local/$(basename $(pwd) | sed 's#-#/#')

You'll need to adjust the various paths in that to get it to work for you of course.

Best of luck,
Chris
- --
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iEYEARECAAYFAlDSQP0ACgkQO2KABBYQAh+RPACeJb6k/8viRujfX0UCgm1ux3ru
hu0An1tjq+RNcn76mQTP0ADg8Z6BXbpE
=krge
-----END PGP SIGNATURE-----
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From samuel at unimelb.edu.au Thu Dec 20 03:10:49 2012
From: samuel at unimelb.edu.au (Chris Samuel)
Date: Thu, 20 Dec 2012 21:10:49 +1100
Subject: [torqueusers] How to run mpirun of intel on torque
In-Reply-To:
References: <50D240FD.9060505@unimelb.edu.au>
Message-ID: <1896339.7huMAhI2zI@quad>

On Thu, 20 Dec 2012 09:52:55 AM David Roman wrote:

> Hello,

Hiya,

> Thank you for this post. I tried it, but when I start my code with mpirun
> ./my_code, mpirun wants to use mpiexec.hydra but no mpiexec.
>
> I look option in mpirun to use mpiexec and not mpiexec.hydra

Ah, don't run your application with mpirun, run it with this mpiexec instead.
:-)

cheers,
Chris
--
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

From samuel at unimelb.edu.au Thu Dec 20 03:12:07 2012
From: samuel at unimelb.edu.au (Chris Samuel)
Date: Thu, 20 Dec 2012 21:12:07 +1100
Subject: [torqueusers] How to run mpirun of intel on torque
In-Reply-To: <1832273347.433663.1355994170754.JavaMail.root@scai.fraunhofer.de>
References: <1832273347.433663.1355994170754.JavaMail.root@scai.fraunhofer.de>
Message-ID: <1644788.mnmqCMGgJu@quad>

On Thu, 20 Dec 2012 10:02:50 AM André Gemünd wrote:

> I hope its okay if I chime in. Did you try that with Intel MPI 4.x as well?

No I haven't I'm afraid, we only set it up for COMSOL, the one application we have that uses Intel MPI.

> I had no luck with mpiexec 0.84 and IMPI 4.0.3 using mpich2-pmi. Or maybe
> I'm missing something (didn't have much time to test).

I'd suggest asking on the mpiexec list (if you've not already).

Best of luck!
Chris
--
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

From David.Roman at noveltis.fr Thu Dec 20 03:11:58 2012
From: David.Roman at noveltis.fr (David Roman)
Date: Thu, 20 Dec 2012 10:11:58 +0000
Subject: [torqueusers] How to run mpirun of intel on torque
In-Reply-To: <1896339.7huMAhI2zI@quad>
References: <50D240FD.9060505@unimelb.edu.au> <1896339.7huMAhI2zI@quad>
Message-ID:

Yes, I did this after my reply.

I ran this test:

    echo 'hpc-node15: hostname' | mpiexec --comm=none -nostdin -config=-

But I get a segmentation fault.

I am reading the documentation to find my mistake.

David

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Chris Samuel
Sent: Thursday, 20 December 2012 11:11
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] How to run mpirun of intel on torque

On Thu, 20 Dec 2012 09:52:55 AM David Roman wrote:

> Hello,

Hiya,

> Thank you for this post. I tried it, but when I start my code with
> mpirun ./my_code, mpirun wants to use mpiexec.hydra but no mpiexec.
>
> I look option in mpirun to use mpiexec and not mpiexec.hydra

Ah, don't run your application with mpirun, run it with this mpiexec instead.
:-)

cheers,
Chris
--
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From samuel at unimelb.edu.au Thu Dec 20 04:54:59 2012
From: samuel at unimelb.edu.au (Chris Samuel)
Date: Thu, 20 Dec 2012 22:54:59 +1100
Subject: [torqueusers] How to run mpirun of intel on torque
In-Reply-To:
References: <1896339.7huMAhI2zI@quad>
Message-ID: <6312126.dbERavkOWt@quad>

On Thu, 20 Dec 2012 10:11:58 AM David Roman wrote:

> Yes, i did this after my reply
>
> Did this test
>
> echo 'hpc-node15: hostname' | mpiexec --comm=none -nostdin -config=-
>
> But I have a segmentation fault

Umm, yes, quite probably. :-)

> I read the documentation to find my mistake

You should just need to do:

    mpiexec program arguments

replacing program and arguments with the executable and any arguments you need to pass to it.

Hope that helps!
Chris
--
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

From David.Roman at noveltis.fr Thu Dec 20 05:00:24 2012
From: David.Roman at noveltis.fr (David Roman)
Date: Thu, 20 Dec 2012 12:00:24 +0000
Subject: [torqueusers] How to run mpirun of intel on torque
In-Reply-To: <6312126.dbERavkOWt@quad>
References: <1896339.7huMAhI2zI@quad> <6312126.dbERavkOWt@quad>
Message-ID:

I was using Torque 4.1.0, and with that version mpiexec failed.
I removed that version and installed version 2.4.8 (with aptitude under Debian).

With this version mpiexec does not fail. I have another problem, but I think it is a bug in my code.

David

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Chris Samuel
Sent: Thursday, 20 December 2012 12:55
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] How to run mpirun of intel on torque

On Thu, 20 Dec 2012 10:11:58 AM David Roman wrote:

> Yes, i did this after my reply
>
> Did this test
>
> echo 'hpc-node15: hostname' | mpiexec --comm=none -nostdin -config=-
>
> But I have a segmentation fault

Umm, yes, quite probably. :-)

> I read the documentation to find my mistake

You should just need to do:

    mpiexec program arguments

replacing program and arguments with the executable and any arguments you need to pass to it.

Hope that helps!
Chris
--
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From brockp at umich.edu Thu Dec 20 07:49:11 2012
From: brockp at umich.edu (Brock Palen)
Date: Thu, 20 Dec 2012 09:49:11 -0500
Subject: [torqueusers] How to run mpirun of intel on torque
In-Reply-To: <6312126.dbERavkOWt@quad>
References: <1896339.7huMAhI2zI@quad> <6312126.dbERavkOWt@quad>
Message-ID: <8E198263-DC41-4896-8BFD-238DCBBA2DF7@umich.edu>

Stock mpiexec from OSC is broken with Torque 4:

/home/software/rhel6/mpiexec/04292011/bin/mpiexec uptime
*** Hang ***

This broke our MATLAB PCT install.

If you install your own build of Hydra from MPICH, though, you can build it with TM support:

Then tell this new mpiexec.hydra to use pbs:

export HYDRA_LAUNCHER=pbs
export HYDRA_RMK=pbs
./mpiexec.hydra uptime
*** no hang ***

It does print errors, though; these are related to the Torque 4 bug that causes OSC mpiexec to hang.

[mpiexec at nyx5354.engin.umich.edu] HYDT_bscd_pbs_wait_for_completion (./tools/bootstrap/external/pbs_wait.c:68): tm_poll(obit_event) failed with TM error 17002
[mpiexec at nyx5354.engin.umich.edu] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at nyx5354.engin.umich.edu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:216): launcher returned error waiting for completion
[mpiexec at nyx5354.engin.umich.edu] main (./ui/mpich/mpiexec.c:325): process manager error waiting for completion

But it should work. Just an alternative.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
brockp at umich.edu
(734)936-1985

On Dec 20, 2012, at 6:54 AM, Chris Samuel wrote:

> On Thu, 20 Dec 2012 10:11:58 AM David Roman wrote:
>
>> Yes, i did this after my reply
>>
>> Did this test
>>
>> echo 'hpc-node15: hostname' | mpiexec --comm=none -nostdin -config=-
>>
>> But I have a segmentation fault
>
> Umm, yes, quite probably. :-)
>
>> I read the documentation to find my mistake
>
> You should just need to do:
>
> mpiexec program arguments
>
> replacing program and arguments with the executable and any arguments you need
> to pass to it.
>
> Hope that helps!
> Chris
> --
> Christopher Samuel - Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.unimelb.edu.au/
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From David.Roman at noveltis.fr Thu Dec 20 09:04:31 2012
From: David.Roman at noveltis.fr (David Roman)
Date: Thu, 20 Dec 2012 16:04:31 +0000
Subject: [torqueusers] How to run mpirun of intel on torque
In-Reply-To:
References: <1896339.7huMAhI2zI@quad> <6312126.dbERavkOWt@quad>
Message-ID:

In fact I still have some problems with mpiexec.

I have this short code:

      program simple4
      implicit none
      integer ierr,my_rank,size,partner
      CHARACTER*50 greeting
      include 'mpif.h'
      integer status(MPI_STATUS_SIZE)
      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD,my_rank,ierr)
      call mpi_comm_size(MPI_COMM_WORLD,size,ierr)
! Every rank formats its own greeting.
      write(greeting,100) my_rank, size
! Rank 0 prints its greeting, then one received from each other rank.
      if(my_rank.eq.0) then
         write(6,*) greeting
         do partner=1,size-1
            call mpi_recv(greeting, 50, MPI_CHARACTER, partner, 1,
     &           MPI_COMM_WORLD, status, ierr)
            write(6,*) greeting
         end do
      else
         call mpi_send(greeting, 50, MPI_CHARACTER, 0, 1,
     &        MPI_COMM_WORLD, ierr)
      end if
      if(my_rank.eq.0) then
         write(6,*) 'That is all for now!'
      end if
      call mpi_finalize(ierr)
 100  format('Hello World: processor ', I2, ' of ', I2)
      end

If I use Intel's mpirun I get this result:

roman at hpc-node11:~/test$ mpirun -genv I_MPI_FABRICS_LIST tmi ./a.out
Hello World: processor 0 of 8
Hello World: processor 1 of 8
Hello World: processor 2 of 8
Hello World: processor 3 of 8
Hello World: processor 4 of 8
Hello World: processor 5 of 8
Hello World: processor 6 of 8
Hello World: processor 7 of 8
That is all for now!

But now if I use mpiexec:

roman at hpc-node11:~/test$ /NOVELTIS/roman/bin/mpiexec/bin/mpiexec -v ./a.out
mpiexec: resolve_exe: using absolute path "./a.out".
node 0: name hpc-node11, cpu avail 4
node 1: name hpc-node10, cpu avail 4
Hello World: processor 0 of 1
That is all for now!
Hello World: processor 0 of 1
That is all for now!
Hello World: processor 0 of 1
That is all for now!
Hello World: processor 0 of 1
That is all for now!
Hello World: processor 0 of 1
That is all for now!
Hello World: processor 0 of 1
That is all for now!
Hello World: processor 0 of 1
That is all for now!
Hello World: processor 0 of 1
That is all for now!
mpiexec: process_start_event: evt 2 task 0 on hpc-node11.
mpiexec: process_start_event: evt 3 task 1 on hpc-node11.
mpiexec: process_start_event: evt 6 task 4 on hpc-node10.
mpiexec: process_start_event: evt 4 task 2 on hpc-node11.
mpiexec: process_start_event: evt 7 task 5 on hpc-node10.
mpiexec: process_start_event: evt 5 task 3 on hpc-node11.
mpiexec: process_start_event: evt 8 task 6 on hpc-node10.
mpiexec: process_start_event: evt 9 task 7 on hpc-node10.
mpiexec: All 8 tasks (spawn 0) started.
mpiexec: wait_tasks: waiting for hpc-node11 hpc-node11 and 6 others.
mpiexec: process_obit_event: evt 10 task 0 on hpc-node11 stat 0.
mpiexec: process_obit_event: evt 12 task 4 on hpc-node10 stat 0.
mpiexec: process_obit_event: evt 14 task 5 on hpc-node10 stat 0.
mpiexec: process_obit_event: evt 16 task 6 on hpc-node10 stat 0.
mpiexec: process_obit_event: evt 17 task 7 on hpc-node10 stat 0.
mpiexec: process_obit_event: evt 11 task 1 on hpc-node11 stat 0.
mpiexec: process_obit_event: evt 13 task 2 on hpc-node11 stat 0.
mpiexec: process_obit_event: evt 15 task 3 on hpc-node11 stat 0.

All processes are launched, but the number of CPUs and the rank are not read correctly: each of the 8 processes reports itself as processor 0 of 1.

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Roman
Sent: Thursday, 20 December 2012 13:00
To: 'Torque Users Mailing List'
Subject: Re: [torqueusers] How to run mpirun of intel on torque

I was using Torque 4.1.0, and with that version mpiexec failed.
I removed that version and installed version 2.4.8 (with aptitude under Debian).

With this version mpiexec does not fail. I have another problem, but I think it is a bug in my code.

David

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Chris Samuel
Sent: Thursday, 20 December 2012 12:55
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] How to run mpirun of intel on torque

On Thu, 20 Dec 2012 10:11:58 AM David Roman wrote:

> Yes, i did this after my reply
>
> Did this test
>
> echo 'hpc-node15: hostname' | mpiexec --comm=none -nostdin -config=-
>
> But I have a segmentation fault

Umm, yes, quite probably. :-)

> I read the documentation to find my mistake

You should just need to do:

    mpiexec program arguments

replacing program and arguments with the executable and any arguments you need to pass to it.

Hope that helps!
Chris
--
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From diego.bacchin at bmr-genomics.it Thu Dec 20 09:47:55 2012
From: diego.bacchin at bmr-genomics.it (diego bacchin)
Date: Thu, 20 Dec 2012 17:47:55 +0100
Subject: [torqueusers] Jobs in Q status
Message-ID: <50D3413B.9000809@bmr-genomics.it>

Hi to All!

I have a cluster with 32 nodes, b001 to b032.

When a user launches 25 to 30 jobs together, sometimes one job remains in Q status and never starts, although there are free nodes.

I have tried moving the job to another queue, but the problem remains.
This is the log:

12/19/2012 15:24:41;0100;PBS_Server;Job;54061.master.nfs;enqueuing into queuename, state 1 hop 1
12/19/2012 15:24:41;0008;PBS_Server;Job;54061.master.nfs;Job Queued at request of user at b032.nfs, owner = user at b032.nfs, job name = 4_2_SM_var, queue = queuename
12/19/2012 15:24:52;0008;PBS_Server;Job;54061.master.nfs;could not locate requested resources 'b013:ppn=12' (node_spec failed) cannot allocate node 'b013' to job - node not currently available (nps needed/free: 12/0, gpus needed/free: 0/0, joblist: 54060.master.nfs:0,54060.master.nfs:1,54060.master.nfs:2,54060.master.nfs:3,54060.master.nfs:4,54060.master.nfs:5,54060.master.nfs:6,54060.master.nfs:7,54060.master.nfs:8,54060.master.nfs:9,54060.master.nfs:10,54060.master.nfs:11)

# I have tried to requeue the job in the same queue
12/19/2012 15:51:20;0100;PBS_Server;Job;54061.master.nfs;dequeuing from queuename, state QUEUED
12/19/2012 15:51:20;0100;PBS_Server;Job;54061.master.nfs;enqueuing into queuename, state 1 hop 1
12/19/2012 15:51:20;0008;PBS_Server;Job;54061.master.nfs;Job moved to queuename at request of root at master.nfs
12/19/2012 15:51:31;0100;PBS_Server;Job;54061.master.nfs;dequeuing from queuename, state QUEUED

# I have tried to requeue the job in another queue with the same nodes but different policies
12/19/2012 15:51:31;0100;PBS_Server;Job;54061.master.nfs;enqueuing into queuename2, state 1 hop 1
12/19/2012 15:51:31;0008;PBS_Server;Job;54061.master.nfs;Job moved to queuename2 at request of root at master.nfs

# I have tried to requeue the job in a third queue with different nodes
12/20/2012 12:21:20;0100;PBS_Server;Job;54061.master.nfs;dequeuing from queuename2, state QUEUED
12/20/2012 12:21:20;0100;PBS_Server;Job;54061.master.nfs;enqueuing into fatnode, state 1 hop 1
12/20/2012 12:21:20;0008;PBS_Server;Job;54061.master.nfs;Job moved to fatnode at request of root at master.nfs

qstat -f 54061.master.nfs
Job Id: 54061.master.nfs
    Job_Name = 4_2_SM_var
    Job_Owner = user at b032.nfs
    job_state = Q
    queue = fatnode
    server = master.nfs
    Checkpoint = u
    ctime = Wed Dec 19 15:24:41 2012
    Error_Path = b032.nfs:/lustre/projects/USER_data/4517695bf8edbc2
        e54ce6d59ab799f27/vcf/2_SM/4_2_SM_var.e54061
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Wed Dec 19 15:24:41 2012
    Output_Path = b032.nfs:/lustre/projects/USER_data/4517695bf8edbc
        2e54ce6d59ab799f27/vcf/2_SM/4_2_SM_var.o54061
    Priority = 0
    qtime = Thu Dec 20 12:21:20 2012
    Rerunable = True
    Resource_List.neednodes = 1:ppn=12
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=12
    Resource_List.walltime = 01:00:00
    substate = 10
    Variable_List = PBS_O_QUEUE=queuename,PBS_O_HOME=/u/user,
        PBS_O_LANG=C,PBS_O_LOGNAME=user,
        PBS_O_PATH=/opt/mysql/client:/opt/python/2.6.7/bin:/opt/torque//bin:/opt/maui/bin:/bin:/usr/bin
        :/usr/local/sbin:/usr/sbin:/sbin:/u/user/bin,
        PBS_O_MAIL=/var/spool/mail/user,PBS_O_SHELL=/bin/bash,
        PBS_O_HOST=b032.nfs,PBS_SERVER=master.nfs,
        PBS_O_WORKDIR=/lustre/projects/USER_data/4517695bf8edbc2e54
        ce6d59ab799f27/vcf/2_SM
    euser = user
    egroup = group
    queue_rank = 2531
    queue_type = E
    etime = Wed Dec 19 15:24:41 2012
    submit_args = 4.variation.job
    fault_tolerant = False
    submit_host = b032.nfs
    init_work_dir = /lustre/projects/USER_data/4517695bf8edbc2e54ce6
        d59ab799f27/vcf/2_SM

Qmgr: list queue queuename
    queue_type = Execution
    total_jobs = 7
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:7 Exiting:0
    resources_max.nodect = 32
    resources_max.walltime = 96:00:00
    resources_min.nodect = 1
    resources_default.neednodes = bladenoht
    resources_default.walltime = 01:00:00
    acl_group_enable = True
    acl_groups = cribi
    mtime = Wed Sep 26 19:23:20 2012
    resources_assigned.nodect = 7
    enabled = True
    started = True

Qmgr: list server
Server master.nfs
    server_state = Active
    scheduling = True
    total_jobs = 18
    state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:16 Exiting:0
    acl_hosts = master.nfs,master,master.sp,localhost.localdomain
    acl_roots = root at master.nfs,root at master,root at master.sp
    managers = maui at master.nfs,maui at localhost.localdomain,root at master.nfs,
        root at localhost.localdomain
    default_queue = queuename
    log_events = 511
    mail_from = adm
    query_other_jobs = True
    resources_default.walltime = 00:01:00
    resources_assigned.ncpus = 1
    resources_assigned.nodect = 16
    scheduler_iteration = 600
    node_check_rate = 150
    tcp_timeout = 6
    node_pack = True
    mom_job_sync = True
    pbs_version = 3.0.2
    kill_delay = 300
    keep_completed = 600
    submit_hosts = master2.nfs
    allow_node_submit = True
    next_job_number = 54163
    net_counter = 93 48 24

Any suggestions?

Thanks in advance

--
Diego Bacchin
IT System Administrator at
BMR Genomics srl - Via Redipuglia, 19 - PADOVA (PD) - Italy
CRIBI - University of Padova - Via U. Bassi, 58 - PADOVA (PD) - Italy
diego at bmr-genomics.it - diego.bacchin at cribi.unipd.it
366 72 97 232

From jwilkinson at stoneeagle.com Thu Dec 20 16:02:53 2012
From: jwilkinson at stoneeagle.com (Jack Wilkinson)
Date: Thu, 20 Dec 2012 23:02:53 +0000
Subject: [torqueusers] multiple jobs: yes, no, maybe
Message-ID: <00051F5C670B8444B35CB2B31B9B1D090C5B3EFB@se-ex2.stoneeagle.com>

We have a -small- cluster configured: four dedicated nodes and one node
that shares the function of the headbox.

I had a user ask a question today that I wasn't sure how to answer. In our
current configuration, if we drop, for example, 12 jobs into the queue, the
system runs the first five; as a node becomes available, the next job in
the queue starts running, until all 12 jobs have been run. There is only
one job on any one node at a time.

The user wanted to know if it was possible to have more than one job
running on a node at a time. I honestly didn't have an answer for him.

My feeling is that the whole purpose of the cluster is to give the power of
a whole node to process a job and not have to share it.

Might I get some input from the crowd?

Happy holidays to all!
jack

Jack Wilkinson, Programmer
Services | VPay(r)
P: 972.367-6622
jwilkinson at stoneeagle.com
www.stoneeagle.com
www.vpayusa.com

111 W. Spring Valley Rd., #100
Richardson, TX 75081

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121220/5c44064f/attachment.html

From dong.tian at gmail.com Thu Dec 20 16:35:59 2012
From: dong.tian at gmail.com (Tian, Dong)
Date: Thu, 20 Dec 2012 18:35:59 -0500
Subject: [torqueusers] Short of physical memory, crash?
Message-ID:

Dear Experts,

I have the following question as a cluster user. My job is to submit jobs
to the cluster to do simulations. Forgive me if my question sounds simple.
:-)

In one example, a compute node has 48 GB of RAM and 12 cores/CPUs. If each
job takes <4GB of RAM, there should be no issue running 12 jobs on one
node.

Now the problem is that one job takes 4.5 GB of physical RAM at peak, as
reported by, say, qstat -f. If 12 such jobs are submitted and run on one
compute node, is there any risk of crashing the compute node? Let us
assume the job program is written in a safe manner.

My understanding is that the compute node may crash from the shortage of
memory, but I would like confirmation from you guys.

Appreciate your time!

Thanks,
Dong
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121220/39117f76/attachment.html

From diego.bacchin at bmr-genomics.it Thu Dec 20 17:52:05 2012
From: diego.bacchin at bmr-genomics.it (Diego Bacchin)
Date: Fri, 21 Dec 2012 01:52:05 +0100
Subject: [torqueusers] Short of physical memory, crash?
In-Reply-To:
References:
Message-ID: <70D07B35-1679-47DE-999C-75A140529217@bmr-genomics.it>

Hi,

In my experience the node will start to use the swap partition. The jobs
will still work if you have enough swap, but performance will be very, very
slow. For example, a simple ssh will take 5 seconds.

In my opinion the best solution is to run the jobs in two sets; that way
you will not use the swap.

Bye
--
Diego Bacchin

From andre.gemuend at scai.fraunhofer.de Fri Dec 21 00:25:30 2012
From: andre.gemuend at scai.fraunhofer.de (André Gemünd)
Date: Fri, 21 Dec 2012 08:25:30 +0100 (CET)
Subject: [torqueusers] multiple jobs: yes, no, maybe
In-Reply-To: <00051F5C670B8444B35CB2B31B9B1D090C5B3EFB@se-ex2.stoneeagle.com>
Message-ID: <1029176132.464271.1356074730552.JavaMail.root@scai.fraunhofer.de>

Hi Jack,

----- Original Message -----
> The user wanted to know if it was possible to have more than one job
> running on a node at a time? I honestly didn't have an answer for
> him.

It is possible. If you use Maui, you can set NODEACCESSPOLICY to shared
(allow multiple jobs) or singleuser (allow multiple jobs from the same
user). That way you can allow one job per core.

The problem is usually misbehaving software. A job can request only one
core but still use the whole node; e.g. codes that use OpenMP will usually
use the whole machine. The first step against that would be to set the
node availability computation to combined (respecting utilized resources
instead of only "reserved" resources):
http://www.adaptivecomputing.com/resources/docs/maui/5.4nodeavailability.php

Then you can restrict the available resources of a job using CPUsets
(http://www.adaptivecomputing.com/resources/docs/torque/2-5-12/help.htm#topics/3-nodes/linuxCpusetSupport.htm).

> My feeling is that the whole purpose of the cluster is to give the
> power of a whole node to process a job and not have to share it.

It really depends on the workload. If you have many independent
single-core jobs (more of an HTC nature), you'll want to allow one job per
core. You could set your queue resource default to allocate a whole node
but let users specifically request single cores, or even use a dedicated
queue for that.

Greetings
André
--
André Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend at scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

From basv at sara.nl Fri Dec 21 00:28:48 2012
From: basv at sara.nl (Bas van der Vlies)
Date: Fri, 21 Dec 2012 07:28:48 +0000
Subject: [torqueusers] pbsnodes message=EVENT in status field
Message-ID: <74EB35DC444C754DA400390F673C23D60867C392@sara-exch-02.ka.sara.nl>

> Hello,
>
> I want to know how I can set an EVENT message for a node; e.g.: pbsnodes n1:
> {{{
> eg: node: n1
>   state = free
>   np = 8
>   properties = ib,switch1,highmem
>   ntype = cluster
>   jobs = 0/567403.sara.nl, 1/567403.sara.nl
>   status = ..,loadave=0.00,message=EVENT:sample.time=1288864220.003:cputotals.user=0:iconnect.pktout=0,netload=3487600394,state=free,...
> }}}
>
> This is an example that I programmed in pbs_python for somebody else, but
> I cannot remember how to set it. I can set MESSAGE=ERROR: just print
> ERROR to stdout in a health script with a description. Or is it the same
> for an event message?
>

I have found the answer in the source, mom_server.c:
{{{
  if (!strncasecmp(tmpPBSNodeMsgBuf, "ERROR", strlen("ERROR")))
    {
    IsError = 1;
    }
  else if (!strncasecmp(tmpPBSNodeMsgBuf, "EVENT:", strlen("EVENT:")))
    {
    /* pass event directly to scheduler for processing */

    /* NO-OP */
    }
}}}

Here are some examples of how I use it; I could not find any info on the
web. In your health script, print EVENT: to stdout. You can set multiple
events, with or without values:
 * echo EVENT:reason1=kernel_upgrade
 * echo EVENT:reason2=reinstall_node
 * echo EVENT:firmware_update

I have also patched PBSQuery.py in the pbs_python package so it can handle
multiple EVENTS.
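
A health script along these lines could look like the following minimal
sketch. It is written in Python only for illustration (any executable that
prints to stdout would do). The format_event helper and the list of pending
events are hypothetical; only the "EVENT:" prefix and the name=value shape
are taken from the echo examples above.

```python
#!/usr/bin/env python3
"""Sketch of a node health-check script that reports EVENT messages."""


def format_event(name, value=None):
    """Return one EVENT line, with or without a value."""
    if value is None:
        return "EVENT:%s" % name
    return "EVENT:%s=%s" % (name, value)


def main():
    # Hypothetical maintenance events for this node; a real script would
    # derive them from local checks.
    pending = [("reason1", "kernel_upgrade"), ("firmware_update", None)]
    for name, value in pending:
        print(format_event(name, value))


if __name__ == "__main__":
    main()
```

Keeping the formatting in one small function makes it easy to emit events
with and without values in the same way.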
regards

--
Bas van der Vlies
mail: basv at sara.nl
SARA - Academic Computing Services, Amsterdam, The Netherlands

From stucki at mi.fu-berlin.de Fri Dec 21 06:01:58 2012
From: stucki at mi.fu-berlin.de (Christoph (Stucki) von Stuckrad)
Date: Fri, 21 Dec 2012 14:01:58 +0100
Subject: [torqueusers] Short of physical memory, crash?
In-Reply-To: <70D07B35-1679-47DE-999C-75A140529217@bmr-genomics.it>
References: <70D07B35-1679-47DE-999C-75A140529217@bmr-genomics.it>
Message-ID: <20121221130157.GS20653@localhost.mi.fu-berlin.de>

On Fri, 21 Dec 2012, Diego Bacchin wrote:

> In my experience the node will start use the swap partition. the jobs
> will work if you have enough swap but the performance will be very
> very slow.

This is correct if the memory limit is not enforced.
I think there are three typical cases.

Given a large number of jobs 'supposed to be <4G', and SETTING the limit
to 4G via qsub (or jobscripts, defaults), Torque/Maui will make sure to put
no more jobs onto a machine than there is memory for. (In my test case only
three in a 32G host! Seemingly this uses/counts the really free memory, so
the system itself uses too much to allow the fourth job in my case.)

Now, what happens if the job starts to allocate more will depend on the
settings of Torque (I believe).

1) In our case here, the job will be killed shortly after starting to
grow. To 'survive this time of growth till killing it' the system uses the
swap space if necessary, and will definitely slow down to a crawl as long
as it is swapping in AND out, but only for a relatively short time.

I have never tried it, but from reading the manuals I think one can define
alternatives to killing the job, so you might

2) simply let them run (but slowly), if you're sure they each 'overbook'
the memory only for a short time, and NOT ALL AT ONCE. If memory AND swap
are BOTH exhausted, the kernel will randomly kill 'programs which request
more memory' and the system will be unstable or die horribly (or e.g.
only torque_mom dies first).

3) There also seem to be settings to allow SLIGHT overbooking for the sum
of all jobs, to 'fill' the host completely. (In my case I'd have to allow
nearly 2G to crowd in another 4G job, which would result in nearly 2G of
memory being swapped out, but there's a good chance those 2G are not really
needed all the time, so this might NOT slow down overall use.)

I have not found out whether such a 'soft limit' versus 'hard/kill limit'
solution exists for the jobs themselves. Maybe somebody can point me (us?)
in the right direction on how to calculate/install the correct settings
for such 'overbooking of memory to fully use the nodes'?

Yours, Stucki (starting cluster admin)

--
Christoph von Stuckrad      * * |nickname |Mail \
Freie Universitaet Berlin   |/_*|'stucki' |Tel(Mo.,Mi.):+49 30 838-75 459|
Mathematik & Informatik EDV |\ *|if online| (Di,Do,Fr):+49 30 77 39 6600|
Takustr. 9 / 14195 Berlin   * * |on IRCnet|Fax(home): +49 30 77 39 6601/

From jjc at iastate.edu Fri Dec 21 08:22:55 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Fri, 21 Dec 2012 15:22:55 +0000
Subject: [torqueusers] Short of physical memory, crash?
In-Reply-To:
References:
Message-ID: <242421BFAF465844BE24EB90BB97E2210FA1D5EC@ITSDAG1D.its.iastate.edu>

The crash will happen only if all physical memory + swap space is
exceeded, and the out-of-memory (OOM) process killer (see
http://linux-mm.org/OOM_Killer) may save the node by killing exceptionally
huge processes. You can check the amount of swap space on a node via the

    swapon -s

Linux command. If there is sufficient swap space + physical memory for
your program, the pbs_mom and other system processes, then there should be
no crash, but things may slow down quite a bit.

If your processes really need 4.5GB, then use vmem=4608MB,pmem=4608MB.
This should allow 10 on a node.

If you submit with a reservation less than what will be used, expect
problems (slowness probably).
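
The figure of 10 jobs per node is plain integer division. A small sketch
in Python, assuming the 48 GB node from the original question and ignoring
the memory that the operating system and pbs_mom need for themselves (the
function name is made up for illustration):

```python
def jobs_per_node(node_mem_mb, job_mem_mb):
    """Number of jobs of job_mem_mb that fit into node_mem_mb without overcommitting."""
    return node_mem_mb // job_mem_mb


node_mb = 48 * 1024                    # 48 GB node = 49152 MB
print(jobs_per_node(node_mb, 4608))    # reservation of 4.5 GB per job
print(jobs_per_node(node_mb, 4096))    # reservation of 4 GB per job
```

With a 4 GB reservation 12 jobs fit on paper, but if each one really peaks
at 4.5 GB the node is overcommitted by 12 x 4608 - 49152 = 6144 MB, which
has to come from swap. That is the case of submitting with a reservation
smaller than what the jobs actually use.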
If you do submit with a smaller reservation than will be used, run at
least two of the commands with a nice parameter of 18 or 19. This will
allow the OS and paging system to get more CPU cycles, and hence be able to
respond marginally better.

E.g.

    #!/bin/csh

    #PBS -l nodes=1:ppn=1,vmem=4GB,pmem=4GB,mem=4GB,walltime=48:00:00

    cd ${PBS_O_WORKDIR}
    nice +19 ./a.out

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121221/280a158e/attachment.html

From craig.tierney at noaa.gov Fri Dec 21 10:39:37 2012
From: craig.tierney at noaa.gov (Craig Tierney - NOAA Affiliate)
Date: Fri, 21 Dec 2012 10:39:37 -0700
Subject: [torqueusers] multiple jobs: yes, no, maybe
In-Reply-To: <00051F5C670B8444B35CB2B31B9B1D090C5B3EFB@se-ex2.stoneeagle.com>
References: <00051F5C670B8444B35CB2B31B9B1D090C5B3EFB@se-ex2.stoneeagle.com>
Message-ID:

Jack,

There is an obvious answer to this: "It depends".
Many HPC people think that serial jobs are not HPC and have no business on
their system. We take a different approach. Most (95%) of our cycles go
to large parallel jobs. However, we have to pre- and post-process data to
get it into a form that the big parallel model can use. Most of these jobs
are serial. These serial jobs comprise about 70% of the jobs but only 5%
of the CPU time.

Now that nodes have 12, 16 or more cores, letting a single serial process
occupy a node is a waste. So we pack serial jobs onto nodes, but if the
job is parallel we dedicate nodes. This works well for us.

The trick with the serial jobs is to not let one user affect another. We
have no swap on our nodes, so it is easy to blow out memory. To help guard
against this we do a couple of things. First, we require that users
specify -lvmem and set how much memory their jobs use. Second, we use
cgroups directly on the nodes to set a maximum amount of RAM that can be
used, so that system processes are not affected when users allocate too
much memory.

The more nodes you have, the easier it is to be wasteful. If you only
have 5 nodes, it would make sense to try and pack jobs together.

Craig

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121221/74dcff0a/attachment-0001.html

From dong.tian at gmail.com Fri Dec 21 11:59:56 2012
From: dong.tian at gmail.com (Tian, Dong)
Date: Fri, 21 Dec 2012 13:59:56 -0500
Subject: [torqueusers] Short of physical memory, crash?
In-Reply-To: <242421BFAF465844BE24EB90BB97E2210FA1D5EC@ITSDAG1D.its.iastate.edu>
References: <242421BFAF465844BE24EB90BB97E2210FA1D5EC@ITSDAG1D.its.iastate.edu>
Message-ID:

Dear James, Chris, Bacchin and all,

Thanks for explaining in such detail. You helped me to better understand
the memory management issues on a compute node.

Though using a nice parameter of 18 or 19 is interesting, I conclude that
mandating a memory parameter for all job submissions would be good
practice; otherwise jobs without a memory parameter may still land on a
node and exceed the memory + swap space, which would cause problems.

Is it common to mandate a memory parameter for all job submissions?
I do not want to ask other users to do any extra work, even if it is just
typing a few more words.

Thanks and Happy Holidays to you,
Dong

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121221/79e5d4ae/attachment.html