[torqueusers] trouble running a job with two nodes in a multi mom.

burcarjo at ono.com burcarjo at ono.com
Thu Aug 18 07:06:04 MDT 2011


Hello all,
I have a multi-mom configuration on a single host. There are 4 pbs_mom
daemons running on the same machine (I'm simulating 4 nodes), and they
all listen on different ports, as the admin guide describes.

When I launch a job on a single node, everything goes fine and the job
finishes correctly: echo "hostname" | qsub -l nodes=1.
The problem appears when I launch a job over more than one node
(echo "hostname" | qsub -l nodes=2): the job remains in the running
state and never finishes.

I think there is a communication problem between the pbs_mom daemons
running on the same machine, but I'm not sure. Reviewing the logs, I only
find a little information about the problem:

08/18/2011 14:29:35;0008;   pbs_mom;Job;69.localhost;JOIN JOB as node 1
08/18/2011 14:29:35;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
08/18/2011 14:29:35;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost

 
I would like to know whether anybody else has had this problem with a
multi-mom configuration.

Thanks in advance.

David.

[root at localhost ~]# more /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost localhost.localdomain hosta-0 hosta-1 hosta-2 hosta-3 hosta-4
#192.168.1.100
::1             localhost6.localdomain6 localhost6
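
Since every simulated node name has to resolve on this host, a quick cross-check of the node names against the hosts file can rule out a missing alias. The following is only a sketch: the function name missing_hosts is made up for illustration, and it reads hosts-format text on stdin rather than querying the resolver.

```shell
#!/bin/sh
# missing_hosts (hypothetical helper): print each name given as an argument
# that does not appear as a hostname or alias in hosts-format text on stdin.
missing_hosts() {
    awk -v names="$*" '
    BEGIN { n = split(names, want, " ") }
    /^[ \t]*#/ { next }                                  # skip comment lines
    { for (i = 2; i <= NF; i++) seen[$i] = 1 }           # field 1 is the address
    END { for (j = 1; j <= n; j++) if (!(want[j] in seen)) print want[j] }'
}

# Usage: no output means every name has an entry.
missing_hosts hosta-0 hosta-1 hosta-2 hosta-3 hosta-4 < /etc/hosts
```

With the hosts file shown above this prints nothing, so name resolution alone would not explain the hang.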


----------------
[root at localhost ~]# more /var/spool/torque/server_priv/nodes
hosta-0 np=1
hosta-1 np=1 mom_service_port=30001 mom_manager_port=30002
hosta-2 np=1 mom_service_port=31001 mom_manager_port=31002
hosta-3 np=1 mom_service_port=32001 mom_manager_port=32002
hosta-4 np=1 mom_service_port=33001 mom_manager_port=33002
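
For reference, each entry with explicit ports should correspond to one pbs_mom started with matching options. The sketch below only prints the commands that a nodes file implies; it does not start anything. It assumes the multi-mom flags as I understand them from the admin guide (-m for multi-mom mode, -M for the service port, -R for the manager port, -A for the alias), so please check them against your pbs_mom version. The function name mom_commands is made up for illustration.

```shell
#!/bin/sh
# mom_commands (hypothetical helper): read nodes-file text on stdin and print,
# without running it, one pbs_mom command per entry that sets both ports.
# Assumption: -m/-M/-R/-A are the multi-mom flags from the TORQUE admin guide.
mom_commands() {
    awk '
    {
        service = ""; manager = ""
        for (i = 2; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "mom_service_port") service = kv[2]
            if (kv[1] == "mom_manager_port") manager = kv[2]
        }
        # Entries without explicit ports (default mom) are skipped.
        if (service != "" && manager != "")
            printf "pbs_mom -m -M %s -R %s -A %s\n", service, manager, $1
    }'
}

# Usage with two of the entries above:
mom_commands <<'EOF'
hosta-0 np=1
hosta-1 np=1 mom_service_port=30001 mom_manager_port=30002
EOF
```

Comparing this output with the options the running daemons were actually started with (for example from ps) shows quickly whether a port or alias is mismatched.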

------------------


[root at localhost ~]# more /var/spool/torque/mom_logs/20110818.32001 

08/18/2011 14:25:26;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
08/18/2011 14:25:26;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost.localdomain
08/18/2011 14:25:28;0002;   pbs_mom;Svr;im_eof;End of File from addr 127.0.0.1:15001
08/18/2011 14:25:28;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost.localdomain
08/18/2011 14:29:35;0080;   pbs_mom;Job;69.localhost;removed job script
08/18/2011 14:29:35;0008;   pbs_mom;Job;69.localhost;JOIN JOB as node 1
08/18/2011 14:29:35;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
08/18/2011 14:29:35;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
08/18/2011 14:29:35;0002;   pbs_mom;Svr;im_eof;No error from addr 127.0.0.1:32002
08/18/2011 14:29:35;0002;   pbs_mom;Svr;im_eof;End of File from addr 127.0.0.1:1019
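
With several mom logs on one host, it helps to pull out only the error records. The log fields are separated by semicolons, and in the excerpts here the LOG_ERROR lines carry 0001 in the second field, so a small filter on that field is enough. This is a sketch; the function name mom_errors is made up for illustration.

```shell
#!/bin/sh
# mom_errors (hypothetical helper): print only the records whose second
# semicolon-separated field is 0001, which is the value on the LOG_ERROR
# lines in the excerpts above.
mom_errors() {
    awk -F';' '$2 == "0001"'
}

# Usage with two records from the log above:
mom_errors <<'EOF'
08/18/2011 14:29:35;0008;   pbs_mom;Job;69.localhost;JOIN JOB as node 1
08/18/2011 14:29:35;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
EOF
```

To scan a real log, redirect the file instead, e.g. mom_errors < /var/spool/torque/mom_logs/20110818.32001.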

------------------

[usr1 at localhost torque]$ tracejob 68

/var/spool/torque/server_priv/accounting/20110818: Permission denied
/var/spool/torque/mom_logs/20110818.33001: No matching job records located
/var/spool/torque/mom_logs/20110818.32001: No matching job records located
/var/spool/torque/mom_logs/20110818.31001: No matching job records located
/var/spool/torque/mom_logs/20110818: No matching job records located

Job: 68.localhost

08/18/2011 14:27:40  M    JOIN JOB as node 1
08/18/2011 14:27:40  S    ready to commit job
08/18/2011 14:27:40  S    ready to commit job completed
08/18/2011 14:27:40  S    committing job
08/18/2011 14:27:40  S    enqueuing into batch, state 1 hop 1
08/18/2011 14:27:40  S    Reply sent for request type Commit on socket 13
08/18/2011 14:27:40  S    attr comment modified
08/18/2011 14:27:40  S    Job Modified at request of Scheduler at localhost
08/18/2011 14:27:40  L    Job Run
08/18/2011 14:27:40  S    Job Run at request of Scheduler at localhost
08/18/2011 14:27:40  S    forking in send_job
08/18/2011 14:27:40  S    entering post_sendmom
08/18/2011 14:27:40  S    child reported success for job after 0 seconds (dest=hosta-0), rc=0
08/18/2011 14:27:40  M    removed job script


------------------------------

[root at localhost ~]# qstat -n

localhost:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
68.localhost         user1    batch    STDIN               --      2   2    --  01:00 R   --
   hosta-1/0+hosta-0/0
69.localhost         user1    batch    STDIN               --      2   2    --  01:00 R   --
   hosta-3/0+hosta-2/0


