[torqueusers] torque launches more jobs than number of virtual procs per node

garrick at speculation.org
Sat Jun 17 23:22:30 MDT 2006


On Fri, Jun 16, 2006 at 06:23:58PM -0700, Sam Rash alleged:
> I looked at the scheduling logs for pbs_sched and have another nugget of
> info if *anyone* has any ideas...:
> 
> 06/16/2006 17:42:26;0040;pbs_sched;Job;254.mediagg32.data.corp.sc5.yahoo.com;Internal Scheduling Error
> 
> 
> Not all that informative...but at least something was logged.
> 
> I seem to see pbs_sched crash (or, with maui, become unresponsive) when I
> have a 'task' submit 512 jobs in rapid succession (say, in 30-60 seconds).
> 
> I'm kind of at a loss as to where to go to 'fix' things.  Is this a bug in
> the latest pbs_server that both pbs_sched and maui have issues with?  (And
> is it local to our FreeBSD setup?)

It is a bug in the PBS client libs, so it affects both maui and
pbs_sched.  I believe it is a mishandled timeout from pbs_server.

I put a change into 2.1.0 that I thought would fix it, but I'm
working blind since I can't reproduce it.  I'll go back and look at that
code again.
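
In the meantime, one thing that may be worth trying as a workaround (a
guess on my part, not a confirmed fix): raise the server's tcp_timeout
so a slow response from pbs_server is less likely to surface as a
client-side failure.  For example:

    # tcp_timeout is 6 in the config quoted below; try something larger
    qmgr -c "set server tcp_timeout = 30"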



> 
> 
> Sam Rash
> srash at yahoo-inc.com
> 408-349-7312
> vertigosr37
> 
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Sam Rash
> Sent: Friday, June 16, 2006 12:38 PM
> To: 'Adrian Wu'; torqueusers at supercluster.org
> Subject: RE: [torqueusers] torque launches more jobs than number of
> virtual procs per node
> 
> I don't know if this applies here, but I found that when I wanted my jobs
> to properly inherit a ppn value (e.g., with a host set to np=8 and each
> job from a queue getting ppn=2), I needed to set both nodes and
> neednodes...
> 
> set queue batch resources_default.nodes=1:ppn=2
> set queue batch resources_default.neednodes=1:ppn=2
> 
> ...
> set node <hostname> np=8
> 
> would limit each node to 4 jobs.  (Of course you could instead use ppn=1
> with np=4, unless you have other queues where you want to use ppn=3, etc.)
> 
> Using just resources_default.nodes resulted in a single host getting an
> unlimited number of jobs...
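
For anyone trying Sam's approach, a minimal sketch of the whole sequence
(the queue name 'batch' and hostname 'node1' are placeholders for your
own):

    # queue defaults: each job inherits 1 node with 2 procs per node
    qmgr -c "set queue batch resources_default.nodes = 1:ppn=2"
    qmgr -c "set queue batch resources_default.neednodes = 1:ppn=2"

    # the node declares 8 virtual processors, so at most 8/2 = 4 such
    # jobs should land on it at once
    qmgr -c "set node node1 np = 8"

    # verify what the queue will hand out
    qmgr -c "print queue batch"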
> 
> Hope this helps,
> 
> Regards,
> Sam Rash
> srash at yahoo-inc.com
> 408-349-7312
> vertigosr37
> 
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Adrian Wu
> Sent: Friday, June 16, 2006 12:28 PM
> To: jscoggins at lbl.gov
> Cc: torqueusers at supercluster.org
> Subject: RE: [torqueusers] torque launches more jobs than number of
> virtual procs per node
> 
> Hi Jackie,
> 
> here is my qmgr -c 'p s':
> 
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch max_running = 80
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server operators = root at mumag2.com
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server resources_default.nodes = 1
> set server scheduler_iteration = 60
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server node_pack = False
> set server pbs_version = 2.1.0p0
> 
> My setup is simple, as you can see: one queue, 20 nodes with 4 VPs per
> node.  All I want is for no more than 4 jobs to be launched per node at
> any given time.
> 
> I just reset my database and applied the above settings; it does not seem
> to help or change the behavior.
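
A quick way to watch whether the per-node limit is actually being
respected (assuming the standard TORQUE client tools are on your path):

    # list every node with its state, np, and currently assigned jobs;
    # no node should ever show more jobs assigned than its np value
    pbsnodes -a

    # and just the nodes the server currently considers down or offline
    pbsnodes -l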
> 
> How did you fix the problem that you feel is similar to mine?
> 
> thanks!
> adrian
> 
> -----Original Message-----
> From: Jacqueline Scoggins [mailto:jscoggins at lbl.gov]
> Sent: Friday, June 16, 2006 11:04 AM
> To: Adrian Wu
> Cc: torqueusers at supercluster.org
> Subject: Re: [torqueusers] torque launches more jobs than number of
> virtual procs per node
> 
> 
> What does your qmgr output look like?  Provide the output of:
> 
> qmgr -c 'p s'
> 
> and then we can determine why it behaves like this.
> 
> I have a system with both dual-core and dual-processor nodes, and my
> nodes file is similar, except I created 2 classes - shared and dualcore -
> so the users would have to specify which type of node to run on.  But I
> found that the parameters in the database for the scheduler were causing
> me problems similar to this.  So send that information and maybe the
> answer will pop out.
> 
> Jackie
> 
> On Thu, 2006-06-15 at 08:36, Adrian Wu wrote:
> > Hi all,
> > 
> > I have installed torque 2.1.0p0 on 20 dual-socket dual-core nodes, and am
> > using pbs_sched.  In my nodes file I have specified:
> > 
> > node1 np=4
> > node2 np=4
> > .
> > .
> > node20 np=4
> > 
> > All my jobs are single-process jobs that need to run on one core/virtual
> > processor, and they tend to finish at about the same time.  I can't get
> > torque to stop launching more than 4 jobs per node.  If my queue is not
> > full, this seems to work; but if I have, say, 300 jobs in the queue, with
> > the majority of the jobs queued up behind the first "wave" of jobs, the
> > 2nd "wave" would put as many as 8 jobs on a single node, substantially
> > slowing down all the jobs on that node.  When I try to set $max_load in
> > the mom_priv/config (I tried setting it to 3.5), the nodes get the
> > job-exclusive,busy state but still continue to take on jobs.  It seems
> > like, once there are jobs queued up, torque no longer checks each node's
> > state before launching more jobs on it...
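
For reference, $max_load is usually paired with $ideal_load in
mom_priv/config; the values below are only an illustration, not a
recommendation for this setup:

    # mom_priv/config -- load-based busy marking
    $max_load   4.0    # mark the node busy when load average exceeds this
    $ideal_load 3.5    # mark the node free again once load drops below this

Note this only changes the state the mom reports; whether a scheduler
stops sending work to a busy node is up to the scheduler itself.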
> > 
> > I've read posts about behavior similar (though not identical) to this
> > where a recompile of torque without optimization helped.  I just ran
> > ./configure and make - where should I take out the optimization?
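
On the optimization question: with an autoconf-based build you normally
pass the flags to configure rather than editing Makefiles by hand.  A
sketch, using the standard autoconf convention (nothing torque-specific):

    # rebuild with optimization off and debug symbols on
    make distclean
    ./configure CFLAGS="-g -O0"
    make
    make install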
> > 
> > Would using the maui scheduler (instead of pbs_sched) help?
> > 
> > Any suggestions from the list would be helpful.  Thanks in advance!
> > 
> > adrian