Materials Studio Bugs with PBS (was Re: [torqueusers] Node free in
pbsnodes but queues won't use it)
Chris Samuel
csamuel at vpac.org
Mon Jul 4 21:42:24 MDT 2005
On Wed, 11 May 2005 07:09 am, Munson,Jennifer N. wrote:
> I have a Rocks cluster which was running Maui and torque but in order to
> troubleshoot and get jobs to run from MS Modeling we stopped Maui and
> enabled the pbs_sched.
[...]
> And the related question would be... Has anyone out there enabled the
> Accelrys application MS Modeling 3.2 to work correctly with Maui and
> Torque using hpmpi? Alternatively, is there anyone out there who is
> itching to know how and would like to help me? ;>
On Thu, 12 May 2005 04:40 am, Roy Dragseth wrote:
> I'm the maintainer of the PBS/Maui roll in Rocks and we have also been
> struggling to make MS Modeling work on our cluster.
Hi Jennifer, Roy,
Ahh, so we're not the only ones struggling with that "interesting" piece of
software to get it to work with PBS.
They make a lot of *VERY* bad assumptions about the layout of a PBS cluster
and I am forever having to hand patch dsd_pbs.pm to *FIX* their software when
our users need it upgrading. :-(
For instance, their way of working out whether you've got a queueing system is
to look for a pbs_server process! This of course fails if it's not running on
your management node (which we don't allow). The fix is to go and edit:
    ${where MS is}/Gateway/root_default/dsd/commands/queues/PBS/dsd_pbs.pm
search for pbs_serv and change their code that "detects" PBS to always return
1 and then restart the gatekeeper. You should then be able to edit the
"Gateway Data" and select Open PBS as the queueing system.
A more portable solution would be to check that "qstat -q" returns 0 (to show
it's all OK).
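As a rough illustration of that more portable check, here is a minimal shell
sketch. It is not the Accelrys code (which is Perl, in dsd_pbs.pm); the
function name pbs_available and the QSTAT override variable are made up for
the example, and the only thing it relies on is that "qstat -q" exits with 0
when the server answers.

```shell
# Hypothetical helper (not part of Materials Studio): succeeds only if
# "qstat -q" can reach a PBS server, wherever pbs_server actually runs.
# QSTAT is an assumed override so that the check can be tried with a stub.
pbs_available() {
    "${QSTAT:-qstat}" -q >/dev/null 2>&1
}

if pbs_available; then
    echo "PBS detected"
else
    echo "PBS not detected"
fi
```

The point of asking the server rather than looking for a local process is
that the answer no longer depends on which node pbs_server lives on.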
You will probably also want to stop its very anti-social pounding of your
pbs_server: it defaults to sleeping for 1 second between qstat calls for the
first 30 seconds of the job's life, and then backs off to hitting it every 5
seconds!
To fix this edit:
    ${where MS is}/Gateway/root_default/dsd/commands/DSD_serverutils.pm
and search for the Perl variable $delay and fix the couple of lines there.
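What "fixed" should look like is a site decision. Purely as a sketch of a
gentler polling schedule, and written in shell rather than as the Perl in
DSD_serverutils.pm, something like the following doubles the wait after each
poll up to a ceiling. The function name next_delay, the 15 second starting
value and the 120 second cap are all assumptions chosen for illustration, not
values from the Accelrys code.

```shell
# Hypothetical backoff helper: given the current delay in seconds, print the
# next one, doubling each time but never exceeding the cap (default 120 s).
next_delay() {
    current=$1
    max=${2:-120}
    next=$((current * 2))
    if [ "$next" -gt "$max" ]; then
        next=$max
    fi
    echo "$next"
}

# Assumed starting point; a polling loop would sleep "$delay" between qstat
# calls and then do: delay=$(next_delay "$delay")
delay=15
echo "first poll after ${delay}s, next after $(next_delay "$delay")s"
```

With those example values the server sees polls at 15, 30, 60 and then every
120 seconds, rather than once a second.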
It's even worse when it comes to checking for the status of jobs it has queued
to find out if they've finished yet - we were forever having the Gateway
declare that MS jobs had finished or crashed when they were still running on
the system, or in some cases were still queued waiting to run!
This is because they assume that if qstat returns *any* error then the job has
died, which is of course simply wrong.
The *only* time you can assume a job is finished is when qstat of the job ID
returns with the exit code 153, i.e. in dsd_pbs.pm the qstat check should do:
    if (($?>>8) == 153)
Any other error code indicates a problem somewhere else that is unlikely to
have affected the running of a job, and once fixed by the admins you will
either find that the job is still there *or* get the 153 code to show it's
finished.
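The same three-way decision can be sketched in shell (again, an illustration
rather than the actual patch to dsd_pbs.pm). The function name job_state and
the QSTAT override are invented for the example; the 153 comes from the
behaviour described above, and any other non-zero code is deliberately
treated as "don't know yet" rather than as a dead job.

```shell
# Hypothetical helper: print "active", "finished" or "unknown" for a job ID.
# QSTAT is an assumed override so the logic can be exercised with a stub.
job_state() {
    rc=0
    "${QSTAT:-qstat}" "$1" >/dev/null 2>&1 || rc=$?
    case $rc in
        0)   echo "active" ;;    # qstat knows the job: queued or running
        153) echo "finished" ;;  # Unknown Job Id: the job has left the server
        *)   echo "unknown" ;;   # some other failure: say nothing, retry later
    esac
}
```

A caller would only mark the job as done on "finished", and would simply
poll again later on "unknown".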
I've just done the upgrade to 3.2 here and I'm in the process of patching
dsd_pbs.pm yet *AGAIN* to fix it.
It's trivial to prove this to yourself by doing:
    qstat 001
    echo $?
and comparing it with:
    qstat nobody
    echo $?
The first tells you "qstat: Unknown Job Id 001.${PBSSERVER}" and returns 153
whilst the second tells you "qstat: Unknown queue destination nobody" and
returns 170.
In September *2003* I provided this particular fix to Accelrys and it's still
not appeared in their code, so I'm BCC'ing this to the people at Accelrys I
had contact with then in the hope that I may get some response to this, and
so that they can see that this is a real problem that lots of sites have with
their software.
When I brought this up on the old ScalablePBS Users list back in 2003 I got a
response from some folks at a very large company who were having exactly
these problems with Materials Studio, and this fix (amongst others) helped
them too.
At least the broken parts are all written in Perl, so I can fix their broken
code for them!
cheers,
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia