[torqueusers] Segfaults and communication problems in torque 4.1.2

Johannes Zarl johannes.zarl at jku.at
Mon Sep 24 02:28:11 MDT 2012


(Resent because the first mail didn't reach the list.)
Hi,

I'm currently evaluating torque 4.1.2 on a small (head node + 8 nodes) SLES 
11.1 cluster. I built torque with the cpuset, gui, and nvidia-gpu options set.

I have 2 problems:

1) Communication between pbs_server and pbs_moms:

When I submit a job using qsub, it is sent to a mom and I see it as "running" 
in qstat. However, the job is never started on the compute node. Instead, I 
get repeated communication errors in the syslog:

Sep 17 11:53:05 n001 pbs_mom: LOG_ERROR::rm_request, unknown command 5


2) The pbs_server process segfaults on qsig:

I wanted to clean up the jobs created so far, so I tried to use "qsig -s 0 0" 
(for job id 0). This led pbs_server to crash immediately:

Sep 19 13:19:04 headnode kernel: [2924757.796745] pbs_server[2270]: segfault at 88 ip 000000000044b1f8 sp 00007f1c17d30fa0 error 4 in pbs_server[400000+89000]
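
For reference, the fields of the kernel line above can be decoded 
mechanically. The following is a minimal Python sketch, not part of torque: 
the names decode_segfault, PATTERN and LINE are made up for illustration, and 
the bit meanings assume the usual x86 page-fault error code layout (bit 0: 
page was present, bit 1: write access, bit 2: user mode).

```python
import re

# Kernel log line from the report above, joined onto one line.
LINE = ("pbs_server[2270]: segfault at 88 ip 000000000044b1f8 "
        "sp 00007f1c17d30fa0 error 4 in pbs_server[400000+89000]")

# All numeric fields in this message are printed in hexadecimal.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Split a kernel segfault line into its numeric fields."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    err = int(m.group("err"), 16)
    return {
        "object": m.group("obj"),
        "fault_addr": int(m.group("addr"), 16),  # address that was accessed
        "ip": ip,                                # faulting instruction
        "offset": ip - base,                     # ip relative to mapping start
        "page_present": bool(err & 1),           # assumed x86 bit layout
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


if __name__ == "__main__":
    info = decode_segfault(LINE)
    print(info["object"], hex(info["fault_addr"]), hex(info["offset"]))
    print("user mode:", info["user_mode"], "write:", info["write"],
          "page present:", info["page_present"])
```

Under these assumptions, "error 4" reads as a user-mode read of an unmapped 
page, and the faulting address 0x88 is low enough to be consistent with a 
member access through a NULL pointer, although only a backtrace would confirm 
that. If pbs_server was built with debug symbols, something like 
"addr2line -e pbs_server 0x44b1f8" may map the instruction pointer to a source 
line.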


Do you have any ideas/suggestions on how I should proceed?

Cheers,
  Johannes

