[torqueusers] Segfaults and communication problems in torque 4.1.2

Johannes Zarl johannes.zarl at jku.at
Wed Sep 19 05:34:15 MDT 2012


I'm currently evaluating torque 4.1.2 on a small (headnode + 8 nodes) SLES 
11.1 cluster. I built torque with the cpuset, gui, and nvidia-gpu options set.

I have 2 problems:

1) Communication between pbs_server and pbs_moms:

When I submit a job using qsub, it is sent to a mom and I see it as "running" 
in qstat. However, the job is never started on the compute node. Instead, I 
get repeated communication errors in the syslog:

Sep 17 11:53:05 n001 pbs_mom: LOG_ERROR::rm_request, unknown command 5

2) The pbs_server process segfaults on qsig:

I wanted to clean up the jobs created so far, so I tried to use "qsig -s 0 0" 
(for job id 0). This caused pbs_server to crash immediately:

Sep 19 13:19:04 headnode kernel: [2924757.796745] pbs_server[2270]: segfault 
at 88 ip 000000000044b1f8 sp 00007f1c17d30fa0 error 4 in 
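
For reference, here is a minimal sketch of how the fields in that kernel line 
can be read. It assumes the usual x86 page-fault error-code bits (bit 0: page 
was present, bit 1: write access, bit 2: fault came from user mode) and that 
the kernel prints all of these values in hex. The function name 
decode_segfault is my own and not part of any torque tool.

```python
import re

# Kernel line from above, joined onto one line. The text after "in" is
# missing from my log excerpt, so it is left out here as well.
LINE = ("pbs_server[2270]: segfault at 88 ip 000000000044b1f8 "
        "sp 00007f1c17d30fa0 error 4 in")

# Field layout of the kernel's "segfault at ..." message (values are hex).
PATTERN = re.compile(
    r"segfault at ([0-9a-f]+) ip ([0-9a-f]+) sp ([0-9a-f]+) error ([0-9a-f]+)"
)


def decode_segfault(line):
    """Split a kernel segfault line into its numeric fields and error bits."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a kernel segfault line")
    addr, ip, sp, err = (int(group, 16) for group in match.groups())
    return {
        "fault_address": addr,           # address the process tried to access
        "instruction_pointer": ip,       # instruction that faulted
        "stack_pointer": sp,
        "page_present": bool(err & 1),   # False: no page mapped at that address
        "write_access": bool(err & 2),   # False: the access was a read
        "user_mode": bool(err & 4),      # True: fault happened in user space
    }


if __name__ == "__main__":
    for key, value in decode_segfault(LINE).items():
        print(key, hex(value) if isinstance(value, int) and not isinstance(value, bool) else value)
```

Read this way, "error 4" is a user-mode read of an unmapped page, and the 
fault address 0x88 is very low. That would be consistent with pbs_server 
reading a struct member through a NULL pointer, but this is only my guess.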

Do you have any ideas/suggestions on how I should proceed?


