[torqueusers] pbs_sched problem in 4.2.5

David Beer dbeer at adaptivecomputing.com
Thu Sep 26 15:52:06 MDT 2013


Thanks for the info, Matt! That will make solving this problem easy. I have
recorded a GitHub issue along with a proposed solution:
https://github.com/adaptivecomputing/torque/issues/188

All interested parties, feel free to view and critique the proposed
solution if you like.
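
For anyone following along, here is a toy sketch of the failure mode Matt
describes in the quoted message below. It is not Torque code and does not use
Torque's wire protocol: the "STATUS?" request, the reply text, and the
run_cycle() helper are made-up names for illustration, and a local socket pair
stands in for the connection that pbs_server opens to pbs_sched. It only shows
why a peer that closes the socket right after sending its command leaves the
other side unable to ask for status on that same connection.

```python
import socket


def run_cycle(server_closes_immediately: bool) -> str:
    """Simulate one scheduling cycle over a server-initiated connection.

    Hypothetical model only: 'server' plays pbs_server, 'sched' plays
    pbs_sched, and the message strings are placeholders.
    """
    server, sched = socket.socketpair()
    try:
        # The server tells the scheduler to run a cycle.
        server.sendall(b"SCH_SCHEDULE_TIME\n")
        if server_closes_immediately:
            # Behaviour described in the thread: the socket is closed
            # right after the command is sent.
            server.close()

        # The scheduler still receives the command that was already sent.
        sched.recv(64)

        try:
            # The scheduler now asks for status on the same connection.
            sched.sendall(b"STATUS?\n")
            if not server_closes_immediately:
                server.recv(64)                  # server reads the request
                server.sendall(b"STATUS: ok\n")  # and answers it
            reply = sched.recv(64)
        except OSError:
            # Broken pipe or connection reset: the peer is already gone.
            reply = b""

        return "status received" if reply else "status request failed"
    finally:
        sched.close()
        server.close()


if __name__ == "__main__":
    print("server keeps socket open:  ", run_cycle(False))
    print("server closes immediately: ", run_cycle(True))
```

With the socket kept open the status exchange completes; with the immediate
close the scheduler either gets an error on send or reads end-of-file, which
is the general shape of the problem reported in this thread.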


On Thu, Sep 26, 2013 at 3:05 PM, Ezell, Matthew A. <ezellma at ornl.gov> wrote:

> I think it was broken by commit 062443f9b826bce01c400acd72c779c806764198.
> It appears that pbs_sched works differently than Moab/Maui.  Moab and Maui
> actively connect to the pbs_server and ask it for status, but pbs_sched
> appears to communicate across the connection that the pbs_server initiates
> for the SCH_SCHEDULE_TIME command.  Now, the server immediately closes
> the socket, so pbs_sched doesn't have a chance to ask it for status.
>
> I reverted the commit and pbs_sched appeared to start working again.  I'm
> not sure if it has bad implications for Moab/Maui, as I don't have either
> set up on my development platform.
>
> ~Matt
>
> ---
> Matt Ezell
> HPC Systems Administrator
> Oak Ridge National Laboratory
>
>
>
>
> On 9/17/13 12:09 PM, "Ken Nielson" <knielson at adaptivecomputing.com> wrote:
>
> >Josh,
> >
> >
> >You are right. We need to fix pbs_sched
> >
> >
> >ken
> >
> >
> >
> >On Tue, Sep 17, 2013 at 9:41 AM, Trutwin, Joshua
> ><JTRUTWIN at csbsju.edu> wrote:
> >
> >Yes it is running.
> >
> >
> ># qmgr -c 'p s'
> >#
> ># Create queues and set their attributes.
> >#
> >#
> ># Create and define queue batch
> >#
> >create queue batch
> >set queue batch queue_type = Execution
> >set queue batch resources_default.nodes = 1
> >set queue batch resources_default.walltime = 01:00:00
> >set queue batch enabled = True
> >set queue batch started = True
> >#
> ># Set server attributes.
> >#
> >set server scheduling = True
> >set server acl_hosts = torque.csbsju.edu
> >set server managers = root@torque.csbsju.edu
> >set server operators = root@torque.csbsju.edu
> >set server default_queue = batch
> >set server log_events = 511
> >set server mail_from = adm
> >set server scheduler_iteration = 600
> >set server node_check_rate = 150
> >set server tcp_timeout = 300
> >set server job_stat_rate = 45
> >set server poll_jobs = True
> >set server log_level = 4
> >set server disable_server_id_check = True
> >set server mom_job_sync = True
> >set server mail_domain = csbsju.edu
> >set server keep_completed = 300
> >set server submit_hosts = lincl[1-17]
> >set server submit_hosts += lin[1-24]
> >set server submit_hosts += lincsb[1-3]
> >set server submit_hosts += linhab[1-2]
> >set server submit_hosts += linfac[1-6]
> >set server submit_hosts += linmath[1-4]
> >set server submit_hosts += linphys[1-9]
> >set server submit_hosts += linphysfac[1-4]
> >set server submit_hosts += nx
> >set server allow_node_submit = True
> >set server allow_proxy_user = True
> >set server auto_node_np = True
> >set server next_job_number = 16
> >set server record_job_info = True
> >set server record_job_script = True
> >set server moab_array_compatible = True
> >
> >
> >I installed maui and things are working well for me now, but it would be
> >nice if pbs_sched worked as well.
> >
> >Thanks,
> >
> >Josh
> >
> >
> >From: torqueusers-bounces at supercluster.org
> >[mailto:torqueusers-bounces at supercluster.org]
> >On Behalf Of Ken Nielson
> >Sent: Friday, September 13, 2013 11:30 AM
> >To: Torque Users Mailing List
> >Subject: Re: [torqueusers] pbs_sched problem in 4.2.5
> >
> >do you have trqauthd running?
> >
> >What does your qmgr -c 'p s' output look like?
> >
> >Thanks
> >
> >
> >On Thu, Sep 12, 2013 at 6:19 PM, Trutwin, Joshua <JTRUTWIN at csbsju.edu>
> >wrote:
> >Hi,
> >
> >I think I'm running into a known issue but wanted to confirm.
> >
> >I set up a simple torque environment using 4.2.5 - I have a single compute
> >node and when I try to submit a test job it winds up getting stuck in the
> >queue until I run qrun to force it.  I ran the scheduler like so:
> >
> >export PBSDEBUG=1
> >export PBSLOGLEVEL=3
> >/opt/torque-4.2.5/sbin/pbs_sched
> >
> >When I submit the job this shows up in the console:
> >
> >pbs_statserver failed: 15033
> >Problem with creating server data structure
> >
> >Looking up this error I see these two posts about it:
> >
> >http://comments.gmane.org/gmane.comp.clustering.torque.user/13273
> >http://comments.gmane.org/gmane.comp.clustering.torque.user/13058
> >
> >Is there a fix or do I have to switch to Maui?
> >
> >Thanks,
> >
> >Josh
> >
> >
> >
> >
> >
> >_______________________________________________
> >torqueusers mailing list
> >torqueusers at supercluster.org
> >http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> >
> >
> >
> >--
> >Ken Nielson
> >+1 801.717.3700 office
> >+1 801.717.3738 fax
> >1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
> >www.adaptivecomputing.com



-- 
David Beer | Senior Software Engineer
Adaptive Computing