Bugzilla – Bug 142
pbs_server hangs trying to check spurious pbs server from depend=afterok line
Last modified: 2011-06-30 14:48:31 MDT
We had a user cut and paste an example from our documentation without changing
the jobid and hostname from the example.
As it turns out, the hostname in the example, meant to be a generic unused
host, is actually a machine on campus, not running pbs_server. pbs_server on
the cluster froze while trying to reach the spurious cluster head, and would
not respond until I killed the server and deleted the offending job.
The same user has now found a similar, but new way to crash the pbs_server, by
specifying a legitimate job id other than their own to the afterok field.
(In reply to comment #1)
> The same user has now found a similar, but new way to crash the pbs_server, by
> specifying a legitimate job id other than their own to the afterok field.
It turns out that the second crash is the same issue (the job id was mangled
by the user, and it was trying a different spurious host, again not a pbs
Would you paste the user command on this bug?
qsub -W depend=afterok:15606 /users/user1/MEME/wz16_6-30-2011.pbs
This failed normally, then the user got creative, and changed it to:
qsub -W depend=afterok:16432.user2.rice.edu
Note that our cluster head is biouman.rcsg.rice.edu, and that user2.rice.edu,
by some cosmic joke/coincidence, happens to be a resolvable hostname on our
campus. It is named after the user whose job user1 was trying to piggyback on,
in a misguided attempt to make his job run sooner.
Our original example used 81223.cluster.rice.edu, which user1 had cut and
pasted, resulting in the first crash (cluster.rice.edu was meant as a <insert
cluster name here>.rice.edu but, by the same cosmic coincidence, is a
resolvable host name on campus).
user1 and user2 uids have been changed to protect the innocent in case this
bugzilla is googled.
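
As a stopgap on the submission side, something like the following could flag
depend= values that name a server other than our own before they ever reach
pbs_server. This is only a sketch and not part of TORQUE: the function name
foreign_depend_ids and the LOCAL_SERVER variable are made up for illustration,
and it assumes job ids of the form seq.server, with bare sequence numbers
defaulting to the local server. It does not address the hang in pbs_server
itself.

```shell
# Hypothetical pre-submission check; not part of TORQUE.
# Assumption: the local server is the cluster head named in this report.
LOCAL_SERVER="biouman.rcsg.rice.edu"

# Print each job id in a depend= value whose server part is not LOCAL_SERVER.
# Returns 0 if no foreign ids were found, 1 otherwise.
# Note: plain POSIX sh has no local variables, so these names are global.
foreign_depend_ids() {
    status=0
    old_ifs=$IFS
    IFS=',:'   # split "type:id:id,type:id" into individual tokens
    set -f     # disable globbing while the unquoted value is split
    for tok in $1; do
        case "$tok" in
            *.*)
                # Everything after the first dot is taken as the server name.
                host=${tok#*.}
                if [ "$host" != "$LOCAL_SERVER" ]; then
                    printf '%s\n' "$tok"
                    status=1
                fi
                ;;
            # Tokens without a dot are dependency types (afterok, ...) or
            # bare sequence numbers; neither names a remote server.
        esac
    done
    set +f
    IFS=$old_ifs
    return "$status"
}
```

A wrapper around qsub could call foreign_depend_ids "afterok:16432.user2.rice.edu"
and refuse the submission when the return status is non-zero. Ids that use the
seq.server@server form would also be reported as foreign by this sketch, since
it compares everything after the first dot.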