Bug 142 - pbs_server hangs trying to check spurious pbs server from depend=afterok line
Status: NEW
Product: TORQUE
Component: pbs_server
Version: 2.5.x
Platform: Other Linux
Importance: P3 major
Assigned To: David Beer
Reported: 2011-06-30 09:56 MDT by Chandler Wilkerson
Modified: 2011-06-30 14:48 MDT (History)
Description Chandler Wilkerson 2011-06-30 09:56:36 MDT
We had a user cut and paste an example from our documentation without changing
the jobid and hostname from the example.

As it turns out, the hostname in the example, meant to be a generic unused
host, is actually a machine on campus, not running pbs_server. pbs_server on
the cluster froze while trying to reach the spurious cluster head, and would
not respond until I killed the server and deleted the offending job.
Comment 1 Chandler Wilkerson 2011-06-30 13:11:45 MDT
The same user has now found a similar, but new way to crash the pbs_server, by
specifying a legitimate job id other than their own to the afterok field.
Comment 2 Chandler Wilkerson 2011-06-30 13:16:54 MDT
(In reply to comment #1)
> The same user has now found a similar, but new way to crash the pbs_server, by
> specifying a legitimate job id other than their own to the afterok field.

It turns out that the second crash is the same issue: the user had mangled the
job id, so pbs_server was trying to contact a different spurious host, again
not a pbs server.
Comment 3 Ken Nielson 2011-06-30 14:13:47 MDT
Chandler,

Would you paste the user command on this bug?
Comment 4 Chandler Wilkerson 2011-06-30 14:48:31 MDT
qsub -W depend=afterok:15606 /users/user1/MEME/wz16_6-30-2011.pbs

This failed normally. The user then got creative and changed it to:

qsub -W depend=afterok:16432.user2.rice.edu /users/user1/MEME/user1_6-30-2011.pbs

Note that our cluster head is biouman.rcsg.rice.edu and that user2.rice.edu,
by some cosmic joke/coincidence, happens to be a resolvable hostname on our
campus. It is named after the user whose job user1 was trying to piggyback on
in a misguided attempt to make his job run sooner.

Our original example used 81223.cluster.rice.edu, which user1 had cut and
pasted, resulting in the first crash (cluster.rice.edu was meant as a <insert
cluster name here>.rice.edu placeholder but, by the same cosmic coincidence,
is a resolvable host name on campus).

The user1 and user2 uids have been changed to protect the innocent in case this
Bugzilla entry is googled.
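
Until the server itself handles this case, the pattern in the commands above (a dependency job id whose server suffix names a host other than the cluster head) could be caught on the submit side. The following is only a minimal sketch of such a check, not TORQUE code and not a fix for the hang. The function name foreign_job_ids is hypothetical, and the parsing assumes the simple type:jobid[:jobid...] form seen in this report, with no port numbers or @server parts in the job ids.

```python
# Hypothetical submit-side check; not part of TORQUE.
# Assumption: the only legitimate server suffix is the cluster head
# named in this report.
EXPECTED_SERVER = "biouman.rcsg.rice.edu"


def foreign_job_ids(depend_spec, expected_server=EXPECTED_SERVER):
    """Return the job ids in depend_spec whose server suffix differs
    from expected_server. Job ids without a suffix are treated as local."""
    suspicious = []
    for clause in depend_spec.split(","):
        # Each clause is assumed to look like "afterok:JOBID[:JOBID...]".
        _dep_type, _, job_list = clause.partition(":")
        for job_id in filter(None, job_list.split(":")):
            _seq, dot, server = job_id.partition(".")
            if dot and server.lower() != expected_server.lower():
                suspicious.append(job_id)
    return suspicious


if __name__ == "__main__":
    for spec in (
        "afterok:15606",
        "afterok:16432.user2.rice.edu",
        "afterok:81223.cluster.rice.edu",
    ):
        print(spec, "->", foreign_job_ids(spec))
```

A wrapper around qsub could refuse or warn on any submission for which this returns a non-empty list, so that a cut-and-pasted example hostname never reaches pbs_server.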