[torqueusers] What causes "DIS reply failure" messages?
Kevin Murphy
murphy at genome.chop.edu
Fri Jul 11 20:54:52 MDT 2008
I have almost got Torque (2.3.1) working on my Mac laptop, using the default
pbs_sched. I applied the OS X patch mentioned earlier on the list.
I'm debugging a pipeline of a few hundred jobs linked by dependency
relationships. All the jobs are qsubbed at once, with the first job in
a manual hold state until all are qsubbed, after which the first job is
released. The pipeline hangs unpredictably, in the sense that for many
minutes (I've measured up to 15), no jobs are in the R state, while one
or more jobs remain in the Q state, and the rest in the H state. While
this is going on, the server logs the following message every 30 seconds:
07/01/2008 10:02:35;0040;PBS_Server;Svr;localhost;Scheduler sent command time
07/01/2008 10:02:35;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
(The very first such message says "sent command new" instead of
"sent command time".) Note that in response to the problem, I changed
some of the server's timing parameters in the vain hope of getting the
problem to resolve itself more quickly; that is why the message
occurs so frequently.
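
For anyone wanting to measure how often the failure recurs, the log lines quoted above can be split on semicolons. The following is only a minimal sketch: the six-field layout is inferred from the two lines shown here, the field names (event, daemon, kind, obj) are made up for illustration, and the third sample line is hypothetical, added only so there is an interval to measure.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class LogRecord(NamedTuple):
    timestamp: datetime
    event: str    # e.g. "0002" (field name is an assumption)
    daemon: str   # e.g. "PBS_Server"
    kind: str     # e.g. "Req"
    obj: str      # e.g. "dis_reply_write"
    message: str


def parse_line(line: str) -> LogRecord:
    """Parse one log line; raises ValueError if it does not have six fields."""
    # Split at most five times so semicolons inside the message survive.
    stamp, event, daemon, kind, obj, message = line.rstrip("\n").split(";", 5)
    when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    return LogRecord(when, event, daemon, kind, obj, message)


def dis_failures(lines: Iterable[str]) -> List[LogRecord]:
    """Return the records that report a DIS reply failure."""
    found = []
    for line in lines:
        try:
            rec = parse_line(line)
        except ValueError:
            continue  # skip lines that do not match the assumed layout
        if rec.obj == "dis_reply_write" and "DIS reply failure" in rec.message:
            found.append(rec)
    return found


def gaps_seconds(records: List[LogRecord]) -> List[float]:
    """Seconds between consecutive records."""
    return [(b.timestamp - a.timestamp).total_seconds()
            for a, b in zip(records, records[1:])]


SAMPLE = [
    "07/01/2008 10:02:35;0040;PBS_Server;Svr;localhost;Scheduler sent command time",
    "07/01/2008 10:02:35;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1",
    # Hypothetical follow-up line, not taken from the real log:
    "07/01/2008 10:03:05;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1",
]

failures = dis_failures(SAMPLE)
intervals = gaps_seconds(failures)
```

Feeding the lines of a real server log file to dis_failures() in place of SAMPLE would show whether the failures track the scheduler_iteration setting.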
It seems to be more common for the hang to occur before the first job
has run (all others are dependent on it) or after the first job has
finished running.
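
To make the submission pattern concrete, here is a rough sketch of how such a pipeline might be submitted. It is not the actual submission script: it assumes a simple linear chain using afterok dependencies (the real pipeline's dependency types and shape are not described above), and the helper names and script names are invented for illustration. Only the qsub -h, qsub -W depend=afterok:<jobid> and qrls invocations are standard Torque usage.

```python
import subprocess
from typing import Callable, List, Optional


def qsub_argv(script: str, after: Optional[str] = None,
              hold: bool = False) -> List[str]:
    """Build a qsub command line for one job."""
    argv = ["qsub"]
    if hold:
        argv.append("-h")  # apply a user hold at submission time
    if after is not None:
        # Assumption: each job runs only if its predecessor exits successfully.
        argv += ["-W", f"depend=afterok:{after}"]
    argv.append(script)
    return argv


def qrls_argv(job_id: str) -> List[str]:
    """Build the command that releases the hold on a job."""
    return ["qrls", job_id]


def run_and_capture(argv: List[str]) -> str:
    """Run a command and return its stdout (qsub prints the new job id)."""
    done = subprocess.run(argv, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def submit_chain(scripts: List[str],
                 run: Callable[[List[str]], str] = run_and_capture) -> List[str]:
    """Submit scripts as a linear chain, then release the held first job."""
    job_ids: List[str] = []
    for index, script in enumerate(scripts):
        previous = job_ids[-1] if job_ids else None
        # Only the first job is held; the rest wait on their dependency.
        job_ids.append(run(qsub_argv(script, after=previous, hold=(index == 0))))
    if job_ids:
        run(qrls_argv(job_ids[0]))
    return job_ids
```

Passing a stand-in for run makes it possible to inspect the generated commands without a working Torque installation.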
Probably unrelated: if I qdel the jobs en masse in their "stuck"
condition, I get lots of errors like this:
qdel: Unknown Job Id 149.localhost
qdel(44261) malloc: *** error for object 0x100940: double free
*** set a breakpoint in malloc_error_break to debug
Thanks,
Kevin Murphy
APPENDIX A: server configuration
$ qmgr -c 'p s'
create queue defqueue
set queue defqueue queue_type = Execution
set queue defqueue enabled = True
set queue defqueue started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = kevinland.local
set server acl_hosts += localhost
set server managers = murphy@localhost
set server default_queue = defqueue
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 30
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 953
APPENDIX B: /etc/services
pbs 15001/tcp # pbs server (pbs_server)
pbs 15001/udp # pbs server (pbs_server)
pbs_mom 15002/tcp # mom to/from server
pbs_mom 15002/udp # mom to/from server
pbs_resmom 15003/tcp # mom resource management requests
pbs_resmom 15003/udp # mom resource management requests
pbs_sched 15004/tcp # scheduler
pbs_sched 15004/udp # scheduler
APPENDIX C: lsof -i
pbs_mom 13487 root 5u IPv4 0xd6cce64 0t0 TCP *:pbs_mom (LISTEN)
pbs_mom 13487 root 6u IPv4 0x12b0c66c 0t0 TCP *:pbs_resmom (LISTEN)
pbs_mom 13487 root 7u IPv4 0x6842870 0t0 UDP *:pbs_resmom
pbs_mom 13487 root 8u IPv4 0x6842948 0t0 UDP *:1023
pbs_serve 13489 root 6u IPv4 0x12ba6e64 0t0 TCP *:pbs (LISTEN)
pbs_serve 13489 root 8u IPv4 0x68437a0 0t0 UDP *:pbs
pbs_serve 13489 root 9u IPv4 0x6842000 0t0 UDP *:exp2
pbs_sched 13491 root 4u IPv4 0x9f44a68 0t0 TCP localhost:pbs_sched (LISTEN)
pbs_sched 13491 root 9u IPv4 0x6844e68 0t0 UDP *:exp1
APPENDIX D: /var/spool/torque/server_priv/nodes
localhost np=2
APPENDIX E: /var/spool/torque/server_name
localhost
APPENDIX F: sched_config (default)
round_robin: False all
by_queue: True prime
by_queue: True non_prime
strict_fifo: false ALL
fair_share: false ALL
help_starving_jobs true ALL
sort_queues true ALL
load_balancing: false ALL
sort_by: shortest_job_first ALL
log_filter: 256
dedicated_prefix: ded
max_starve: 24:00:00
half_life: 24:00:00
unknown_shares: 10
sync_time: 1:00:00