[torqueusers] trouble running a job with two nodes in a multi mom.
burcarjo at ono.com
Thu Aug 18 07:06:04 MDT 2011
Hello all,
I have a multi-mom configuration on a single host. There are 4 pbs_mom daemons running on the same machine (I'm simulating 4 nodes), and they all listen on different ports, as the admin guide describes.
When I launch a job on a single node, everything goes fine and the job finishes correctly: echo "hostname" | qsub -l nodes=1.
The problem appears when I try to launch a job over more than one node (echo "hostname" | qsub -l nodes=2): the job remains in the running state and never finishes.
I think there is a communication problem between the pbs_mom daemons running on the same machine, but I'm not sure. Reviewing the logs, I only find a little information about the problem:
08/18/2011 14:29:35;0008; pbs_mom;Job;69.localhost;JOIN JOB as node 1
08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
I would like to know if anybody has had this problem with a multi-mom configuration.
Thanks in advance.
David.
[root at localhost ~]# more /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost localhost.localdomain hosta-0 hosta-1 hosta-2 hosta-3 hosta-4
#192.168.1.100
::1 localhost6.localdomain6 localhost6
----------------
[root at localhost ~]# more /var/spool/torque/server_priv/nodes
hosta-0 np=1
hosta-1 np=1 mom_service_port=30001 mom_manager_port=30002
hosta-2 np=1 mom_service_port=31001 mom_manager_port=31002
hosta-3 np=1 mom_service_port=32001 mom_manager_port=32002
hosta-4 np=1 mom_service_port=33001 mom_manager_port=33002
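With several MOMs on one host, each entry in the nodes file needs its own pair of ports. Here is a minimal Python sketch for checking that, assuming each node entry sits on a single line (name, np, then the two port settings). parse_nodes and duplicate_ports are hypothetical helper names, not part of TORQUE, and the sample data is copied from this post:

```python
def parse_nodes(text):
    """Parse nodes-file lines into {node_name: {attribute: value}}."""
    nodes = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        name, *attrs = line.split()
        entry = {}
        for attr in attrs:
            key, sep, value = attr.partition("=")
            if sep:  # keep only key=value attributes
                entry[key] = int(value) if value.isdigit() else value
        nodes[name] = entry
    return nodes


def duplicate_ports(nodes):
    """Return {port: [(node, setting), ...]} for ports configured more than once."""
    seen = {}
    for name, entry in nodes.items():
        for key in ("mom_service_port", "mom_manager_port"):
            if key in entry:
                seen.setdefault(entry[key], []).append((name, key))
    return {port: users for port, users in seen.items() if len(users) > 1}


# Sample data copied from the nodes file in this post.
NODES = """\
hosta-0 np=1
hosta-1 np=1 mom_service_port=30001 mom_manager_port=30002
hosta-2 np=1 mom_service_port=31001 mom_manager_port=31002
hosta-3 np=1 mom_service_port=32001 mom_manager_port=32002
hosta-4 np=1 mom_service_port=33001 mom_manager_port=33002
"""

nodes = parse_nodes(NODES)
clashes = duplicate_ports(nodes)
print(clashes if clashes else "no port is configured twice")
```

hosta-0 sets no ports in the file, so the check ignores it. The sketch only looks at the file itself and says nothing about which ports the running daemons actually bound.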
------------------
[root at localhost ~]# more /var/spool/torque/mom_logs/20110818.32001
08/18/2011 14:25:26;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
08/18/2011 14:25:26;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost.localdomain
08/18/2011 14:25:28;0002; pbs_mom;Svr;im_eof;End of File from addr 127.0.0.1:15001
08/18/2011 14:25:28;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost.localdomain
08/18/2011 14:29:35;0080; pbs_mom;Job;69.localhost;removed job script
08/18/2011 14:29:35;0008; pbs_mom;Job;69.localhost;JOIN JOB as node 1
08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
08/18/2011 14:29:35;0002; pbs_mom;Svr;im_eof;No error from addr 127.0.0.1:32002
08/18/2011 14:29:35;0002; pbs_mom;Svr;im_eof;End of File from addr 127.0.0.1:1019
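To pull only the error lines out of a MOM log like this one, a small filter can help. This is a sketch in Python, assuming each record is one line with six semicolon-separated fields as in the excerpt. LogRecord and parse_mom_log are illustrative names, not TORQUE tools, and the field names are my own labels:

```python
from collections import namedtuple

# Field names are labels chosen for this sketch, not official TORQUE terms.
LogRecord = namedtuple("LogRecord", "timestamp code daemon kind name message")


def parse_mom_log(text):
    """Split each log line into six fields; lines with fewer fields are skipped."""
    records = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(";", 5)]
        if len(parts) == 6:
            records.append(LogRecord(*parts))
    return records


# Three records copied from the log in this post.
SAMPLE = """\
08/18/2011 14:29:35;0008; pbs_mom;Job;69.localhost;JOIN JOB as node 1
08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
"""

records = parse_mom_log(SAMPLE)
errors = [r for r in records if "LOG_ERROR" in r.message]
for r in errors:
    print(r.timestamp, r.message)
```

Filtering on the LOG_ERROR text rather than on the numeric code keeps the sketch independent of how the event codes are defined.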
------------------
[usr1 at localhost torque]$ tracejob 68
/var/spool/torque/server_priv/accounting/20110818: Permission denied
/var/spool/torque/mom_logs/20110818.33001: No matching job records located
/var/spool/torque/mom_logs/20110818.32001: No matching job records located
/var/spool/torque/mom_logs/20110818.31001: No matching job records located
/var/spool/torque/mom_logs/20110818: No matching job records located
Job: 68.localhost
08/18/2011 14:27:40 M JOIN JOB as node 1
08/18/2011 14:27:40 S ready to commit job
08/18/2011 14:27:40 S ready to commit job completed
08/18/2011 14:27:40 S committing job
08/18/2011 14:27:40 S enqueuing into batch, state 1 hop 1
08/18/2011 14:27:40 S Reply sent for request type Commit on socket 13
08/18/2011 14:27:40 S attr comment modified
08/18/2011 14:27:40 S Job Modified at request of Scheduler at localhost
08/18/2011 14:27:40 L Job Run
08/18/2011 14:27:40 S Job Run at request of Scheduler at localhost
08/18/2011 14:27:40 S forking in send_job
08/18/2011 14:27:40 S entering post_sendmom
08/18/2011 14:27:40 S child reported success for job after 0 seconds (dest=hosta-0), rc=0
08/18/2011 14:27:40 M removed job script
------------------------------
[root at localhost ~]# qstat -n
localhost:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
68.localhost         user1    batch    STDIN            --     2     2   --     01:00 R --
   hosta-1/0+hosta-0/0
69.localhost         user1    batch    STDIN            --     2     2   --     01:00 R --
   hosta-3/0+hosta-2/0