[torqueusers] What causes "DIS reply failure" messages?

Kevin Murphy murphy at genome.chop.edu
Fri Jul 11 20:54:52 MDT 2008

I have almost got Torque (2.3.1) working on my Mac laptop, using the default 
pbs_sched.  I applied the OS X patch mentioned earlier on the list.

I'm debugging a pipeline of a few hundred jobs linked by dependency 
relationships.  All the jobs are qsubbed at once, with the first job in 
a manual hold state until all are qsubbed, after which the first job is 
released.  The pipeline hangs unpredictably: for many minutes (I've 
measured up to 15), no jobs are in the R state, while one or more jobs 
remain in the Q state and the rest are in the H state.  While this is 
going on, the server logs the following messages every 30 seconds:

07/01/2008 10:02:35;0040;PBS_Server;Svr;localhost;Scheduler sent command time
07/01/2008 10:02:35;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1

(The very first such message says "sent command new" instead of "sent 
command time".)  Note that in response to the problem, I changed some of 
the server's timing parameters in the vain hope of getting the problem 
to resolve itself more quickly; that is why the message occurs so 
frequently.
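
For anyone who wants to pull these messages out of a server log, here is a minimal Python sketch that splits each line on its semicolon-delimited fields and keeps the ones mentioning "DIS reply failure". The field names (when, event, daemon, obj_class, obj_name, message) are my own labels, not Torque's, and parsing the second field as hexadecimal is an assumption based on how the codes look in the lines quoted above.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional


class LogRecord(NamedTuple):
    when: datetime
    event: int        # event code, parsed as hex (assumption)
    daemon: str       # e.g. "PBS_Server"
    obj_class: str    # e.g. "Svr" or "Req"
    obj_name: str     # e.g. "localhost" or "dis_reply_write"
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one server log line into fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    try:
        when = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
        event = int(parts[1], 16)
    except ValueError:
        return None
    return LogRecord(when, event, parts[2], parts[3], parts[4], parts[5])


def dis_failures(lines: Iterable[str]) -> List[LogRecord]:
    """Return the parsed records whose message mentions a DIS reply failure."""
    records = (parse_line(line) for line in lines)
    return [r for r in records if r is not None and "DIS reply failure" in r.message]


# The two messages quoted in this post, each on a single line.
SAMPLE = [
    "07/01/2008 10:02:35;0040;PBS_Server;Svr;localhost;Scheduler sent command time",
    "07/01/2008 10:02:35;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1",
]

if __name__ == "__main__":
    for rec in dis_failures(SAMPLE):
        print(rec.when.isoformat(), rec.obj_name, rec.message)
```

Lines that do not have six fields or a parsable timestamp (for example, a wrapped continuation line) are skipped rather than guessed at.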

It seems to be more common for the hang to occur before the first job 
has run (all others are dependent on it) or after the first job has 
finished running.
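
To make the submission pattern concrete, here is a rough Python sketch of it: submit the first job with a hold, submit every later job with a dependency on its predecessor, then release the first job. It is simplified to a linear chain, whereas the real pipeline has a more general dependency graph. It assumes the hold is a user hold set with "qsub -h" and released with "qrls", and that dependencies use "-W depend=afterok:<jobid>"; the helper names and script names are hypothetical.

```python
import subprocess
from typing import Callable, List, Sequence


def run_command(argv: Sequence[str]) -> str:
    """Run a Torque client command and return its stdout, stripped."""
    result = subprocess.run(list(argv), check=True, capture_output=True, text=True)
    return result.stdout.strip()


def submit_chain(
    scripts: Sequence[str],
    run: Callable[[Sequence[str]], str] = run_command,
) -> List[str]:
    """Submit scripts as a linear chain; the first is held until all are queued."""
    if not scripts:
        return []
    job_ids: List[str] = []
    # Hold the first job so nothing can start while the rest are submitted.
    job_ids.append(run(["qsub", "-h", scripts[0]]))
    for script in scripts[1:]:
        # Each later job waits for its predecessor to finish successfully.
        dependency = "depend=afterok:" + job_ids[-1]
        job_ids.append(run(["qsub", "-W", dependency, script]))
    # Everything is queued; release the hold on the first job.
    run(["qrls", job_ids[0]])
    return job_ids


if __name__ == "__main__":
    # Hypothetical script names, for illustration only.
    print(submit_chain(["step1.sh", "step2.sh", "step3.sh"]))
```

The run parameter exists only so the command sequence can be checked without a live server by passing in a stand-in function.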

Probably unrelated: if I qdel the jobs en masse in their "stuck" 
condition, I get lots of errors like this:

qdel: Unknown Job Id 149.localhost
qdel(44261) malloc: *** error for object 0x100940: double free
*** set a breakpoint in malloc_error_break to debug

Kevin Murphy

APPENDIX A:  server configuration

$ qmgr -c 'p s'

create queue defqueue
set queue defqueue queue_type = Execution
set queue defqueue enabled = True
set queue defqueue started = True
# Set server attributes.
set server scheduling = True
set server acl_hosts = kevinland.local
set server acl_hosts += localhost
set server managers = murphy@localhost
set server default_queue = defqueue
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 30
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 953

APPENDIX B: /etc/services

pbs           15001/tcp           # pbs server (pbs_server)
pbs           15001/udp           # pbs server (pbs_server)
pbs_mom       15002/tcp           # mom to/from server
pbs_mom       15002/udp           # mom to/from server
pbs_resmom    15003/tcp           # mom resource management requests
pbs_resmom    15003/udp           # mom resource management requests
pbs_sched     15004/tcp           # scheduler
pbs_sched     15004/udp           # scheduler

APPENDIX C: lsof -i

pbs_mom   13487           root    5u  IPv4  0xd6cce64      0t0    TCP *:pbs_mom (LISTEN)
pbs_mom   13487           root    6u  IPv4 0x12b0c66c      0t0    TCP *:pbs_resmom (LISTEN)
pbs_mom   13487           root    7u  IPv4  0x6842870      0t0    UDP 
pbs_mom   13487           root    8u  IPv4  0x6842948      0t0    UDP *:1023
pbs_serve 13489           root    6u  IPv4 0x12ba6e64      0t0    TCP *:pbs (LISTEN)
pbs_serve 13489           root    8u  IPv4  0x68437a0      0t0    UDP *:pbs
pbs_serve 13489           root    9u  IPv4  0x6842000      0t0    UDP *:exp2
pbs_sched 13491           root    4u  IPv4  0x9f44a68      0t0    TCP localhost:pbs_sched (LISTEN)
pbs_sched 13491           root    9u  IPv4  0x6844e68      0t0    UDP *:exp1

APPENDIX D: /var/spool/torque/server_priv/nodes

localhost np=2

APPENDIX E: /var/spool/torque/server_name


APPENDIX F: sched_config (default)

round_robin: False      all
by_queue: True          prime
by_queue: True          non_prime
strict_fifo: false      ALL
fair_share: false       ALL
help_starving_jobs      true    ALL
sort_queues     true    ALL
load_balancing: false   ALL
sort_by: shortest_job_first     ALL
log_filter: 256
dedicated_prefix: ded
max_starve: 24:00:00
half_life: 24:00:00
unknown_shares: 10
sync_time: 1:00:00
