Materials Studio Bugs with PBS (was Re: [torqueusers] Node free in pbsnodes but queues won't use it)

Dave Jackson jacksond at clusterresources.com
Tue Jul 5 09:50:43 MDT 2005


Chris,

  Can you organize this into an 'application integration guide' that we
can put into the TORQUE admin manual?  In the upcoming 'TORQUE 2.0 Admin
Manual', we are beginning to add a full application integration
appendix.  Also, if there are others out there with other useful
integration tips, please let us know.

Thanks,
Dave

On Tue, 2005-07-05 at 13:42 +1000, Chris Samuel wrote:
> On Wed, 11 May 2005 07:09 am, Munson,Jennifer N. wrote:
> 
> > I have a Rocks cluster which was running Maui and torque but in order to
> > troubleshoot and get jobs to run from MS Modeling we stopped Maui and
> > enabled the pbs_sched.
> [...]
> > And the related question would be... Has anyone out there enabled the
> > Accelrys application MS Modeling 3.2 to work correctly with Maui and
> > Torque using hpmpi? Alternatively, is there anyone out there who is
> > itching to know how and would like to help me? ;>
> 
> On Thu, 12 May 2005 04:40 am, Roy Dragseth wrote:
> 
> > I'm the maintainer of the PBS/Maui roll in Rocks and we have also been 
> > struggling to make MS Modeling work on our cluster.
> 
> 
> Hi Jennifer, Roy,
> 
> Ahh, so we're not the only ones struggling with that "interesting" piece of 
> software to get it to work with PBS.
> 
> They make a lot of *VERY* bad assumptions about the layout of a PBS cluster 
> and I am forever having to hand patch dsd_pbs.pm to *FIX* their software when 
> our users need it upgrading. :-(
> 
> For instance their way of working out if you've got a queueing system is to
> look for a pbs_server process!  This of course fails if it's not running on
> your management node (which we don't allow).  The fix is to go and edit:
> 
>  ${where MS is}/Gateway/root_default/dsd/commands/queues/PBS/dsd_pbs.pm
> 
> search for pbs_serv and change their code that "detects" PBS to always return 
> 1 and then restart the gatekeeper.  You should then be able to edit the 
> "Gateway Data" and select Open PBS as the queueing system.
> 
> A more portable solution would be to check that "qstat -q" returns 0 (to show 
> it's all OK).
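
A minimal sketch of that check in shell, assuming qstat is on the PATH of the host running the gateway. The function name pbs_available is hypothetical; it is not part of Materials Studio or TORQUE.

```shell
# Hypothetical helper: treat PBS as present if "qstat -q" exits with 0.
# Unlike a process check, this works from any host that can reach the
# server, wherever pbs_server itself is running.
pbs_available() {
    qstat -q >/dev/null 2>&1
}

if pbs_available; then
    echo "PBS detected"
else
    echo "PBS not detected"
fi
```

The same test could be made from Perl in dsd_pbs.pm by running the command and checking that its exit status is 0.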
> 
> You will probably also want to stop its very anti-social pounding of your
> pbs_server: it defaults to sleeping for 1 second between qstat calls for the
> first 30 seconds of the job's life, and then backs off to hitting it every 5
> seconds!
> 
> To fix this edit:
> 
>  ${where MS is}/Gateway/root_default/dsd/commands/DSD_serverutils.pm
> 
> and search for the Perl variable $delay and fix the couple of lines there.
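
As an illustration only, a gentler polling schedule might look like the sketch below. The function name poll_delay and every threshold and interval in it are assumptions chosen for the example; none of them come from DSD_serverutils.pm, where the real $delay logic is written in Perl.

```shell
# Hypothetical schedule: print the number of seconds to sleep before the
# next qstat, given the age of the job in seconds.
# All values are illustrative assumptions; tune them for your site.
poll_delay() {
    age=$1
    if [ "$age" -lt 60 ]; then
        echo 10     # first minute: poll every 10 seconds
    elif [ "$age" -lt 600 ]; then
        echo 30     # first ten minutes: poll every 30 seconds
    else
        echo 60     # long-running jobs: poll once a minute
    fi
}
```

A polling loop would then call something like sleep "$(poll_delay "$age")" between status checks.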
> 
> It's even worse when it comes to checking for the status of jobs it has queued 
> to find out if they've finished yet - we were forever having the Gateway 
> declare that MS jobs had finished or crashed when they were still running on 
> the system, or in some cases were still queued waiting to run!
> 
> This is because they wrongly assume that if qstat returns *any* error then the 
> job has died, which is of course badly incorrect.
> 
> The *only* time you can assume a job is finished is when qstat of the job ID 
> returns with the exit code 153, i.e. in dsd_pbs.pm the qstat check should do:
> 
>  if (($?>>8) == 153)
> 
> Any other error code indicates a problem somewhere else that is unlikely to 
> have affected the running of a job, and once fixed by the admins you will 
> either find that the job is still there *or* get the 153 code to show it's 
> finished.
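
The rule above can be sketched in shell as a three-way classification. The function names and the state labels are hypothetical; only the meaning of exit code 153 is taken from the message.

```shell
# Hypothetical helper: map a qstat exit status to a job state.
#   0     -> the server still knows the job (queued or running)
#   153   -> unknown job ID, so the job has finished
#   other -> qstat itself failed; the job state is unknown, so retry later
classify_qstat_status() {
    case "$1" in
        0)   echo "present" ;;
        153) echo "finished" ;;
        *)   echo "unknown" ;;
    esac
}

# Hypothetical wrapper: run qstat on one job ID and classify the result.
job_state() {
    status=0
    qstat "$1" >/dev/null 2>&1 || status=$?
    classify_qstat_status "$status"
}
```

The point of the third branch is that a caller seeing "unknown" should keep waiting and poll again, not mark the job as finished or crashed.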
> 
> I've just done the upgrade to 3.2 here and I'm in the process of patching 
> dsd_pbs.pm yet *AGAIN* to fix it.
> 
> It's trivial to prove to yourself by doing:
> 
>  qstat 001
>  echo $?
> 
> and comparing it with:
> 
>  qstat nobody
>  echo $?
> 
> The former tells you "qstat: Unknown Job Id 001.${PBSSERVER}" and returns 153,
> whilst the latter tells you "qstat: Unknown queue destination nobody" and
> returns 170.
> 
> In September *2003* I provided this particular fix to Accelrys and it's still 
> not appeared in their code, so I'm BCC'ing this to the people at Accelrys I 
> had contact with then in the hope that I may get some response to this, and 
> so that they can see that this is a real problem that lots of sites have with 
> their software.
> 
> When I brought this up on the old ScalablePBS Users list back in 2003 I got a 
> response from some folks at a very large company who were having exactly 
> these problems with Materials Studio, and this fix (amongst others) helped 
> them too.
> 
> At least the broken parts are all written in Perl, so I can fix their broken 
> code for them!
> 
> cheers,
> Chris
> -- 
>  Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
>  Victorian Partnership for Advanced Computing http://www.vpac.org/
>  Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


