[torqueusers] What causes "DIS reply failure" messages?

Kevin Murphy murphy at genome.chop.edu
Mon Jul 14 12:00:27 MDT 2008

Glen Beane wrote:
> On Fri, Jul 11, 2008 at 10:54 PM, Kevin Murphy
> <murphy at genome.chop.edu> wrote:
>     I have almost got Torque (2.3.1) working on my Mac laptop, using
>     default pbs_sched.  I applied the OS X patch mentioned earlier on
>     the list.
> Is this Leopard?
> What patch did you apply?  Was it disabling WORDEXP, which has been 
> causing issues on Leopard (haven't had time to debug it yet)?
Yes, Leopard; I disabled WORDEXP in pbs_config.h.
> Do you happen to have a subversion client installed on your laptop?  
> Can you try checking out this and giving it a shot:
> svn://clusterresources.com/torque/branches/2.3-fixes 
I checked out 2.3-fixes, and it's not behaving any better in my case.

1) My test pipeline of jobs pauses periodically when no jobs are 
actually running.  During the hangs, pbs_server periodically shows 
messages like:

07/14/2008 11:37:54;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, 9
[ ... or -1, which is more common ...]

As I mentioned before, the pauses can be quite long; in my tests 
yesterday, I saw pauses up to 8 minutes.  Just to reiterate, no jobs run 
during these pauses, and qstat doesn't think any jobs are running (or 
exiting), either, although many jobs are in the Q state.

2) Processing always resumes in these circumstances when the mom sends a 
hello to the server:

07/14/2008 11:38:24;0002;   pbs_mom;n/a;mom_server_check_connection;Sending hello to server localhost
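
One way to quantify how long the server waits between a DIS reply failure and the next mom hello is to pair up the timestamps from the two logs. The following is a minimal Python sketch, not a tool used for these tests. It assumes every log line has the semicolon-separated layout seen in the excerpts above (timestamp first, message last); the function names parse_line and seconds_until_hello are hypothetical.

```python
from datetime import datetime


def parse_line(line):
    """Return (timestamp, message) for a log line, or None if it does not match.

    Assumes six semicolon-separated fields with the timestamp first and the
    free-text message last, as in the excerpts quoted in this post.
    """
    parts = line.strip().split(";", 5)
    if len(parts) < 6:
        return None
    try:
        stamp = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return stamp, parts[5]


def seconds_until_hello(server_lines, mom_lines):
    """For each DIS reply failure, seconds until the next mom hello (if any)."""
    failures = [p[0] for p in map(parse_line, server_lines)
                if p and "DIS reply failure" in p[1]]
    hellos = sorted(p[0] for p in map(parse_line, mom_lines)
                    if p and "Sending hello" in p[1])
    gaps = []
    for failed_at in failures:
        later = [h for h in hellos if h >= failed_at]
        if later:
            gaps.append((later[0] - failed_at).total_seconds())
    return gaps
```

Fed the server line from item 1 and the mom line from item 2, this yields a single gap of 30 seconds.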

3) I attempted to speed up my testing by decreasing possibly relevant 
intervals in the mom config:

$ sudo cat /var/spool/torque/mom_priv/config
$check_poll_time 1
$status_update_time 1

The DIS reply failures still occur, but it seems that the duration of 
the pauses decreases considerably.

4) I've made a set of logs for one of my tests available at 
Included are the server, mom, and sched logs, plus a 'watcher' log made 
by a Perl script that polled qstat and ps once a second and output ps 
job listings and qstat summaries whenever anything changed.  The 
watcher log makes the pauses easy 
to see: look for a 'NO_JOB_RUNNING' entry; often it is associated with a 
30-second pause in activity.  These tests were made with 2.3-fixes.
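
For anyone who wants to reproduce this kind of watcher, here is a rough Python sketch of the polling idea. It is not the Perl script that produced the log above. It polls qstat only (not ps), assumes the default qstat listing with two header lines and the job state in the second-to-last column, and the condition used for the NO_JOB_RUNNING tag (jobs queued, none in state R) is a guess at the intent rather than the original logic. The function names are hypothetical.

```python
import subprocess
import time
from collections import Counter


def summarize_qstat(text):
    """Count jobs per state letter (Q, R, E, ...) in a default qstat listing."""
    counts = Counter()
    for line in text.splitlines()[2:]:  # skip the two header lines (assumed)
        fields = line.split()
        if len(fields) >= 6:
            counts[fields[-2]] += 1  # state column is second to last (assumed)
    return dict(counts)


def describe(counts):
    """Render the counts as one line, tagging intervals with nothing running."""
    summary = " ".join(f"{state}={n}" for state, n in sorted(counts.items()))
    if counts.get("R", 0) == 0 and counts.get("Q", 0) > 0:
        summary += " NO_JOB_RUNNING"
    return summary


def watch(interval=1.0):
    """Poll qstat every `interval` seconds and print only when the summary changes."""
    last = None
    while True:
        out = subprocess.run(["qstat"], capture_output=True, text=True).stdout
        current = describe(summarize_qstat(out))
        if current != last:
            print(time.strftime("%m/%d/%Y %H:%M:%S"), current, flush=True)
            last = current
        time.sleep(interval)
```

Calling watch() in a terminal while the test pipeline runs prints a timestamped line each time the job-state counts change, so long stretches tagged NO_JOB_RUNNING stand out.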

5) The server, mom, and sched are all running on the same computer, 
OS X 10.5.4.

-Kevin Murphy

