[torqueusers] pbs_sched problem in 4.2.5

Ezell, Matthew A. ezellma at ornl.gov
Thu Sep 26 15:05:27 MDT 2013


I think it was broken by commit 062443f9b826bce01c400acd72c779c806764198.
It appears that pbs_sched works differently than Moab/Maui.  Moab and Maui
actively connect to the pbs_server and ask it for status, but pbs_sched
appears to communicate across the connection that the pbs_server initiates
for the SCH_SCHEDULE_TIME command.  Now the server closes the socket
immediately, so pbs_sched never gets a chance to ask it for status.
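
To make the failure mode concrete, here is a minimal sketch of the two
behaviors. It is not TORQUE code: the message strings and the run_cycle
helper are hypothetical stand-ins, and a local socket pair takes the place
of the real pbs_server/pbs_sched connection. It only illustrates why a
scheduler that reuses the server-initiated connection gets nothing back
once the server closes its end right after sending the command.

```python
import socket

# Hypothetical stand-ins; these are NOT the real TORQUE wire messages.
SCHEDULE_CMD = b"SCHEDULE\n"    # what the server sends to start a cycle
STATUS_REQ = b"STATUS?\n"       # what the scheduler asks in return
STATUS_REPLY = b"STATUS-OK\n"   # what a still-listening server answers


def run_cycle(server_closes_immediately):
    """Simulate one scheduling cycle over a server-initiated connection.

    Returns the status reply seen by the scheduler, or None if the
    scheduler could not get a status over that connection.
    """
    server, sched = socket.socketpair()
    try:
        # The server starts the cycle by sending the scheduling command.
        server.sendall(SCHEDULE_CMD)
        if server_closes_immediately:
            server.close()

        # The scheduler still receives the command that was already sent.
        if sched.recv(64) != SCHEDULE_CMD:
            return None

        # The scheduler tries to ask for status on the same connection.
        try:
            sched.sendall(STATUS_REQ)
        except OSError:
            return None  # peer already closed the socket

        # A server that kept the connection open answers the request.
        if not server_closes_immediately:
            if server.recv(64) == STATUS_REQ:
                server.sendall(STATUS_REPLY)

        try:
            reply = sched.recv(64)
        except OSError:
            return None
        return reply or None  # empty bytes means the peer closed
    finally:
        server.close()
        sched.close()
```

With the server end left open, run_cycle(False) returns the status reply;
with the server closing immediately, run_cycle(True) returns None, which is
analogous to the situation described above.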

I reverted the commit and pbs_sched appeared to start working again.  I'm
not sure whether the revert has bad implications for Moab/Maui, as I don't
have either set up on my development platform.

~Matt

---
Matt Ezell
HPC Systems Administrator
Oak Ridge National Laboratory




On 9/17/13 12:09 PM, "Ken Nielson" <knielson at adaptivecomputing.com> wrote:

>Josh,
>
>
>You are right. We need to fix pbs_sched.
>
>
>ken
>
>
>
>On Tue, Sep 17, 2013 at 9:41 AM, Trutwin, Joshua
><JTRUTWIN at csbsju.edu> wrote:
>
>Yes it is running.
>
> 
># qmgr -c 'p s'
>#
># Create queues and set their attributes.
>#
>#
># Create and define queue batch
>#
>create queue batch
>set queue batch queue_type = Execution
>set queue batch resources_default.nodes = 1
>set queue batch resources_default.walltime = 01:00:00
>set queue batch enabled = True
>set queue batch started = True
>#
># Set server attributes.
>#
>set server scheduling = True
>set server acl_hosts = torque.csbsju.edu
>set server managers = root@torque.csbsju.edu
>set server operators = root@torque.csbsju.edu
>set server default_queue = batch
>set server log_events = 511
>set server mail_from = adm
>set server scheduler_iteration = 600
>set server node_check_rate = 150
>set server tcp_timeout = 300
>set server job_stat_rate = 45
>set server poll_jobs = True
>set server log_level = 4
>set server disable_server_id_check = True
>set server mom_job_sync = True
>set server mail_domain = csbsju.edu
>set server keep_completed = 300
>set server submit_hosts = lincl[1-17]
>set server submit_hosts += lin[1-24]
>set server submit_hosts += lincsb[1-3]
>set server submit_hosts += linhab[1-2]
>set server submit_hosts += linfac[1-6]
>set server submit_hosts += linmath[1-4]
>set server submit_hosts += linphys[1-9]
>set server submit_hosts += linphysfac[1-4]
>set server submit_hosts += nx
>set server allow_node_submit = True
>set server allow_proxy_user = True
>set server auto_node_np = True
>set server next_job_number = 16
>set server record_job_info = True
>set server record_job_script = True
>set server moab_array_compatible = True
> 
> 
>I installed Maui and things are working well for me now, but it would be
>nice if pbs_sched worked as well.
> 
>Thanks,
> 
>Josh
> 
> 
>From: torqueusers-bounces at supercluster.org
>[mailto:torqueusers-bounces at supercluster.org]
>On Behalf Of Ken Nielson
>Sent: Friday, September 13, 2013 11:30 AM
>To: Torque Users Mailing List
>Subject: Re: [torqueusers] pbs_sched problem in 4.2.5
> 
>Do you have trqauthd running?
>
>What does your qmgr -c 'p s' output look like?
>
>Thanks
>
> 
>On Thu, Sep 12, 2013 at 6:19 PM, Trutwin, Joshua <JTRUTWIN at csbsju.edu>
>wrote:
>Hi,
> 
>I think I'm running into a known issue but wanted to confirm.
> 
>I set up a simple Torque environment using 4.2.5. I have a single compute
>node, and when I submit a test job it gets stuck in the queue until I run
>qrun to force it.  I ran the scheduler like so:
> 
>export PBSDEBUG=1
>export PBSLOGLEVEL=3
>/opt/torque-4.2.5/sbin/pbs_sched
> 
>When I submit the job this shows up in the console:
> 
>pbs_statserver failed: 15033
>Problem with creating server data structure
> 
>Looking up this error I see these two posts about it:
> 
>http://comments.gmane.org/gmane.comp.clustering.torque.user/13273
>http://comments.gmane.org/gmane.comp.clustering.torque.user/13058
> 
>Is there a fix or do I have to switch to Maui?
> 
>Thanks,
> 
>Josh
> 
> 
>
>
>
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
>-- 
>Ken Nielson
>+1 801.717.3700 office
>+1 801.717.3738 fax
>1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>www.adaptivecomputing.com


