[torqueusers] Segfaults and communication problems in torque 4.1.2
Johannes Zarl
johannes.zarl at jku.at
Mon Sep 24 02:28:11 MDT 2012
(resent because first mail didn't get to the list)
Hi,
I'm currently evaluating torque 4.1.2 on a small (headnode + 8 nodes) SLES
11.1 cluster. I built torque with the cpuset, gui, and nvidia-gpu options set.
I have 2 problems:
1) Communication between pbs_server and pbs_moms:
When I submit a job using qsub, it is sent to a mom and I see it as "running"
in qstat. However, the job is never started on the compute node. Instead, I
get repeated communication errors in the syslog:
Sep 17 11:53:05 n001 pbs_mom: LOG_ERROR::rm_request, unknown command 5
2) The pbs_server process segfaults on qsig:
I wanted to clean up the jobs created so far, so I tried to use "qsig -s 0 0"
(for job id 0). This led pbs_server to crash immediately:
Sep 19 13:19:04 headnode kernel: [2924757.796745] pbs_server[2270]: segfault at 88 ip 000000000044b1f8 sp 00007f1c17d30fa0 error 4 in pbs_server[400000+89000]
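For reference, the fields in that kernel line can be decoded mechanically. The following is a minimal sketch in Python, not part of torque. The function name `decode_segfault` and the regular expression are illustrative assumptions, and the error-code bit meanings are those of the x86 page-fault error code (bit 0: page present, bit 1: write access, bit 2: user mode).

```python
import re

# Kernel segfault message from the log above, joined onto one line.
LINE = ("pbs_server[2270]: segfault at 88 ip 000000000044b1f8 "
        "sp 00007f1c17d30fa0 error 4 in pbs_server[400000+89000]")

# All numeric fields in this message are hexadecimal.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Hypothetical helper: split a kernel segfault line into its fields."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "fault_address": int(m.group("addr"), 16),
        "object": m.group("obj"),
        "offset_in_object": ip - base,       # ip relative to the mapping base
        "page_present": bool(err & 1),       # 0: no page mapped at the address
        "access": "write" if err & 2 else "read",
        "user_mode": bool(err & 4),
    }


info = decode_segfault(LINE)
print(hex(info["fault_address"]), hex(info["offset_in_object"]), info["access"])
# prints: 0x88 0x4b1f8 read
```

Read this way, the crash is a user-mode read of an unmapped page at address 0x88. That would be consistent with reading a structure member through a NULL pointer, although the log alone does not prove it. If the binary was built with debug symbols, running `addr2line -e` on the installed pbs_server binary with the address 0x44b1f8 should point to the source line involved.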
Do you have any ideas/suggestions on how I should proceed?
Cheers,
Johannes