Bugzilla – Bug 142
pbs_server hangs trying to check spurious pbs server from depend=afterok line
Last modified: 2011-06-30 14:48:31 MDT
We had a user cut and paste an example from our documentation without changing the jobid and hostname from the example. As it turns out, the hostname in the example, meant to be a generic unused host, is actually a machine on campus, not running pbs_server. pbs_server on the cluster froze while trying to reach the spurious cluster head, and would not respond until I killed the server and deleted the offending job.
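The report does not say which call inside pbs_server blocks. As a hedged illustration of the usual mitigation, the sketch below bounds the time spent contacting a host that may not be running pbs_server at all. An unreachable or silent host then costs the caller a bounded wait instead of freezing it. The helper name `server_reachable` and the 5-second default are hypothetical, and this is not Torque code.

```python
import socket


def server_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper, not part of Torque. The `timeout` bounds each
    connect attempt, so a host that silently drops packets makes this
    return False after a bounded wait instead of blocking indefinitely.
    The DNS lookup done first by create_connection is NOT covered by it.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks,
        # and names that fail to resolve (socket.gaierror).
        return False
```

For example, probing `cluster.rice.edu` on the site's pbs_server port (often 15001 in Torque installations; treat that as an assumption) should give up after a bounded time rather than hang, since that host does not run pbs_server.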
The same user has now found a similar, but new way to crash the pbs_server, by specifying a legitimate job id other than their own to the afterok field.
(In reply to comment #1)
> The same user has now found a similar, but new way to crash the pbs_server, by
> specifying a legitimate job id other than their own to the afterok field.

It turns out that the second crash is the same issue: the job id was mangled by the user, and pbs_server was trying to reach a different spurious host, again not a pbs server.
Chandler, would you paste the user's command into this bug?
qsub -W depend=afterok:15606 /users/user1/MEME/wz16_6-30-2011.pbs

This failed normally. The user then got creative and changed it to:

qsub -W depend=afterok:16432.user2.rice.edu /users/user1/MEME/user1_6-30-2011.pbs

Note that our cluster head is biouman.rcsg.rice.edu, and that user2.rice.edu, by some cosmic joke or coincidence, happens to be a resolvable hostname on our campus, named after the user whose job user1 was trying to piggyback on in a misguided attempt to make his job run sooner. Our original example used 81223.cluster.rice.edu, which user1 had cut and pasted, resulting in the first crash. (cluster.rice.edu was meant as <insert cluster name here>.rice.edu, but by the same cosmic coincidence it is a resolvable hostname on campus.) The user1 and user2 uids have been changed to protect the innocent in case this bug report is googled.
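Both crashes came from the same reading of the dependency string: a job id of the form sequence.hostname names the server that owns the job, so 16432.user2.rice.edu sent pbs_server looking for user2.rice.edu. A site could screen for this before a job reaches the server. The sketch below is a hypothetical pre-submission check, not part of Torque. The function names, the assumed type:jobid[:jobid][,type:jobid] grammar, and the rule that a bare sequence number (or a short name matching the local server) counts as local are all assumptions; job array ids and other edge cases are not handled.

```python
def parse_depend(depend: str, local_server: str) -> list[tuple[str, str, str]]:
    """Split a -W depend= value into (type, sequence, server) triples.

    Hypothetical parser: a job id without a ".hostname" suffix is
    assumed to belong to `local_server`.
    """
    triples = []
    for clause in depend.split(","):
        dep_type, _, job_list = clause.partition(":")
        for job_id in job_list.split(":"):
            if not job_id:
                continue
            seq, dot, server = job_id.partition(".")
            triples.append((dep_type, seq, server if dot else local_server))
    return triples


def remote_servers(depend: str, local_server: str) -> list[str]:
    """Return the distinct non-local server names a dependency refers to."""
    remote = []
    for _, _, server in parse_depend(depend, local_server):
        # Treat the full name, or its short form (e.g. "biouman"), as local.
        is_local = server == local_server or local_server.startswith(server + ".")
        if not is_local and server not in remote:
            remote.append(server)
    return remote
```

With local_server set to biouman.rcsg.rice.edu, the first command's afterok:15606 yields no remote servers, while afterok:16432.user2.rice.edu and the documentation's afterok:81223.cluster.rice.edu each yield one, which a wrapper could reject before pbs_server ever tries to contact that host.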