[torqueusers] What causes "DIS reply failure" messages?
murphy at genome.chop.edu
Mon Jul 14 12:00:27 MDT 2008
Glen Beane wrote:
> On Fri, Jul 11, 2008 at 10:54 PM, Kevin Murphy <murphy at genome.chop.edu
> <mailto:murphy at genome.chop.edu>> wrote:
> > I have almost got Torque (2.3.1) working on my Mac laptop, using
> > default pbs_sched. I applied the OS X patch mentioned earlier on
> > the list.
> Is this Leopard?
> What patch did you apply? Was it disabling WORDEXP, which has been
> causing issues on Leopard (haven't had time to debug it yet)?
Yes, Leopard; I disabled WORDEXP in pbs_config.h.
> Do you happen to have a subversion client installed on your laptop?
> Can you try checking out this and giving it a shot:
I checked out 2.3-fixes, and it's not behaving any better in my case.
1) My test pipeline of jobs periodically stalls, with no jobs actually
running. During these hangs, the pbs_server log periodically shows:
07/14/2008 11:37:54;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, 9
[ ... or -1, which is more common ...]
As I mentioned before, the pauses can be quite long; in my tests
yesterday, I saw pauses up to 8 minutes. Just to reiterate, no jobs run
during these pauses, and qstat doesn't think any jobs are running (or
exiting), either, although many jobs are in the Q state.
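To line the failures up against the pauses, the server log can be scanned for these entries. Here is a minimal sketch in Python, assuming every log record is a single line of six semicolon-separated fields laid out like the sample above (timestamp, event code, daemon, object type, object name, message). The function names are my own and are not part of Torque.

```python
from datetime import datetime

def parse_record(line):
    """Split one log line into (timestamp, fields); return None if it does not fit."""
    # Assumed layout: timestamp;code;daemon;type;name;message
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    try:
        stamp = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return stamp, parts[1:]

def dis_failures(lines):
    """Return (timestamp, message) for each 'DIS reply failure' record."""
    hits = []
    for line in lines:
        rec = parse_record(line)
        # rec[1][4] is the free-text message field
        if rec and "DIS reply failure" in rec[1][4]:
            hits.append((rec[0], rec[1][4]))
    return hits

# The one record quoted above, used as sample input.
sample = [
    "07/14/2008 11:37:54;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, 9",
]
for stamp, message in dis_failures(sample):
    print(stamp.isoformat(), message)
```

Subtracting consecutive timestamps from the returned list would give the spacing between failures, which could then be compared with the pause lengths.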
2) Processing always resumes in these circumstances when the mom sends a
hello to the server:
pbs_mom;n/a;mom_server_check_connection;Sending hello to server localhost
3) I attempted to speed up my testing by decreasing possibly relevant
settings in the mom config:
$ sudo cat /var/spool/torque/mom_priv/config
The DIS reply failures still occur, but the pauses seem to be
considerably shorter.
4) I've made a set of logs for one of my tests available at
Included are the server, mom, and sched logs, plus a 'watcher' log made
by a Perl script that polled qstat and ps once a second and output ps
job listings and qstat summaries whenever anything changed. The watcher
log makes the pauses easy to see: look for a 'NO_JOB_RUNNING' entry;
often it is associated with a 30-second pause in activity. These tests
were made with 2.3-fixes.
5) The server, mom, and sched are all running on the same computer,
OS X 10.5.4.
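For anyone who wants to reproduce the watcher log from item 4, the poll-and-report-on-change idea can be sketched as follows. This is not the Perl script itself, just a rough Python equivalent under my own assumptions: the function names, the `polls` limit, and the output format are all made up for illustration, and it does not emit the 'NO_JOB_RUNNING' entries.

```python
import subprocess
import time

def run(cmd):
    """Run a command and return its stdout as text ('' if it cannot be started)."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True).stdout
    except OSError:
        return ""

def watch(commands, polls, interval=1.0, runner=run, out=print, sleep=time.sleep):
    """Poll each command; report its output only when it differs from the last poll."""
    last = {}
    for i in range(polls):
        for name, cmd in commands.items():
            text = runner(cmd)
            # The first poll always reports, since nothing has been seen yet.
            if last.get(name) != text:
                out(f"{time.strftime('%H:%M:%S')} {name} changed:\n{text}")
                last[name] = text
        if i < polls - 1:
            sleep(interval)
```

A call such as `watch({"qstat": ["qstat"], "ps": ["ps", "-ax"]}, polls=600)` would poll both commands once a second for about ten minutes. The `runner`, `out`, and `sleep` parameters exist only so the loop can be exercised without a running Torque installation.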