[torqueusers] pbs_mom HPUX 11.11 wait_request failed

Wilhelm Eger wilhelm.eger at gmail.com
Tue Feb 27 02:07:57 MST 2007


Hi there,

My setup is:

pbs_server: Itanium with Debian Linux
pbs_moms: Itanium with HPUX 11.11

One of the MOM systems is more or less up to date; the other two are behind.
I managed to compile torque on all of them with gcc 3.3.3. The up-to-date
MOM runs fine, whereas the other two do not accept any jobs. In particular,
qsub -I .. doesn't give me a shell.

mom_logs give me this:

02/27/2007 09:53:20;0008;   pbs_mom;Job;process_request;request type QueueJob from host xxx received
02/27/2007 09:53:20;0008;   pbs_mom;Job;process_request;request type QueueJob from host xxx allowed
02/27/2007 09:53:20;0008;   pbs_mom;Job;dispatch_request;dispatching request QueueJob on sd=9
02/27/2007 09:53:20;0008;   pbs_mom;Job;process_request;request type ReadyToCommit from host xxx received
02/27/2007 09:53:20;0008;   pbs_mom;Job;process_request;request type ReadyToCommit from host xxx allowed
02/27/2007 09:53:20;0008;   pbs_mom;Job;dispatch_request;dispatching request ReadyToCommit on sd=9
02/27/2007 09:53:20;0008;   pbs_mom;Job;167.xxx;ready to commit job
02/27/2007 09:53:20;0008;   pbs_mom;Job;167.xxx;ready to commit job completed
02/27/2007 09:53:20;0008;   pbs_mom;Job;process_request;request type Commit from host xxx received
02/27/2007 09:53:20;0008;   pbs_mom;Job;process_request;request type Commit from host xxx allowed
02/27/2007 09:53:20;0008;   pbs_mom;Job;dispatch_request;dispatching request Commit on sd=9
02/27/2007 09:53:20;0008;   pbs_mom;Job;167.xxx;committing job
02/27/2007 09:53:20;0008;   pbs_mom;Job;167.xxx;starting job execution
02/27/2007 09:53:20;0001;   pbs_mom;Job;job_nodes;0: aaa1/0
02/27/2007 09:53:20;0001;   pbs_mom;Job;job_nodes;job: 167.xxx numnodes=1 numvnod=1
02/27/2007 09:53:20;0001;   pbs_mom;Job;167.xxx;phase 2 of job launch successfully completed
02/27/2007 09:53:25;0001;   pbs_mom;Job;167.xxx;job not ready after 5 second timeout, MOM will recheck
02/27/2007 09:53:25;0008;   pbs_mom;Job;167.xxx;job execution started
02/27/2007 09:53:26;0001;   pbs_mom;Job;167.xxx;job 167.xxx child not started, will check later
02/27/2007 09:53:26;0001;   pbs_mom;Svr;pbs_mom;pbs_mom, wait_request failed
02/27/2007 09:53:27;0001;   pbs_mom;Job;167.xxx;job 167.xxx child not started, will check later

The last message is repeated several times.
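
To see at a glance which messages repeat in a log like the one above, a small script can help. The following is only a rough sketch in Python: it assumes each record has six semicolon-separated fields, and the field names used here (timestamp, code, daemon, kind, name, message) are labels chosen for illustration, not official TORQUE terms. It also assumes that any line not starting with a timestamp is a wrapped continuation of the previous record.

```python
import re
from collections import Counter

# A record starts with "MM/DD/YYYY HH:MM:SS;". Any other non-empty line is
# treated as a continuation of the previous record (assumption: the mail
# client wrapped long lines).
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")

# Hypothetical field names, chosen for this sketch only.
KEYS = ("timestamp", "code", "daemon", "kind", "name", "message")


def parse_mom_log(text):
    """Return a list of dicts, one per log record."""
    joined = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not joined:
            joined.append(line)
        else:
            joined[-1] += " " + line  # re-attach the wrapped tail

    records = []
    for entry in joined:
        parts = [p.strip() for p in entry.split(";", 5)]
        if len(parts) != 6:
            continue  # skip anything not in the assumed six-field layout
        records.append(dict(zip(KEYS, parts)))
    return records


# Excerpt copied from the log in this post, including the wrapped lines.
SAMPLE = """\
02/27/2007 09:53:25;0008;   pbs_mom;Job;167.xxx;job execution started
02/27/2007 09:53:26;0001;   pbs_mom;Job;167.xxx;job 167.xxx child not
started, will check later
02/27/2007 09:53:26;0001;   pbs_mom;Svr;pbs_mom;pbs_mom, wait_request failed
02/27/2007 09:53:27;0001;   pbs_mom;Job;167.xxx;job 167.xxx child not
started, will check later
"""

records = parse_mom_log(SAMPLE)
counts = Counter(r["message"] for r in records)
for message, n in counts.most_common():
    print(n, message)
```

Run against a full mom_logs file (read with open(...).read() in place of SAMPLE), this lists each distinct message with its count, which makes it easier to see where the "child not started" loop begins relative to the wait_request failure.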

gdb pbs_mom gives this:

HP gdb 2.1
Copyright 1986 - 1999 Free Software Foundation, Inc.
Hewlett-Packard Wildebeest 2.1 (based on GDB 5.0-hpwdb-20000630)
Wildebeest is free software, covered by the GNU General Public License, and
you are welcome to change it and/or distribute copies of it under certain
conditions.  Type "show copying" to see the conditions.  There is
absolutely no warranty for Wildebeest.  Type "show warranty" for details.
Wildebeest was built for PA-RISC 1.1 or 2.0 (narrow), HP-UX 11.00.
..
(gdb) run
Starting program: /opt/torque/sbin/pbs_mom
MOM is up
do_rpp: got a resource monitor request
do_rpp: got a resource monitor request
saving extra job info stdout=0 stderr=0 taskid=1 nodeid=0
===== MD5 FFB6F44242AB30CD43FA2A743616A6B4
mom_do_poll: entered
warning: reading `r3' register: No data          <----- This seems buggy
Detaching after fork from process 16158
mom_close_poll: entered
saving extra job info stdout=-1 stderr=-1 taskid=2 nodeid=0
pbs_mom: pbs_mom, wait_request failed          <--- this is bad as well
mom_get_sample: entered
sessions[0]: pid 878 sid 1528
sessions[1]: pid 872 sid 1528
sessions[1]: pid 871 sid 1528
sessions[1]: pid 879 sid 1528
sessions[1]: pid 880 sid 1528
sessions[1]: pid 876 sid 1528
sessions[1]: pid 875 sid 1528
sessions[1]: pid 877 sid 1528
sessions[1]: pid 873 sid 1528
sessions[1]: pid 874 sid 1528
sessions[0]: pid 878 sid 1528
sessions[1]: pid 872 sid 1528
sessions[1]: pid 871 sid 1528
sessions[1]: pid 879 sid 1528
sessions[1]: pid 880 sid 1528
sessions[1]: pid 876 sid 1528
sessions[1]: pid 875 sid 1528
sessions[1]: pid 877 sid 1528
sessions[1]: pid 873 sid 1528
sessions[1]: pid 874 sid 1528
nusers[0]: pid 878 uid 30
nusers[1]: pid 872 uid 30
nusers[1]: pid 871 uid 30
nusers[1]: pid 879 uid 30
nusers[1]: pid 880 uid 30
nusers[1]: pid 876 uid 30
nusers[1]: pid 875 uid 30
nusers[1]: pid 877 uid 30
nusers[1]: pid 873 uid 30
nusers[1]: pid 874 uid 30
mom_get_sample: entered

pbs_mom seems to be unable to start a child process. What could be the cause?

Wilhelm