[torqueusers] trouble running a job with two nodes in a multi mom.

burcarjo at ono.com burcarjo at ono.com
Thu Aug 18 08:54:23 MDT 2011


My /etc/hosts contains:
127.0.0.1 localhost localhost.localdomain hosta-0 hosta-1 hosta-2 hosta-3 hosta-4

I can log in to hosta-X without an ssh password.
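
A minimal Python sketch of what the hosts entry above implies: every simulated node name is an alias for the same loopback address. The helper name `names_by_address` and the `HOSTS_TEXT` constant are illustrative only; the text is copied from the entry quoted in this message rather than read from a live system.

```python
from collections import defaultdict

# Copied from the /etc/hosts entry quoted in this thread.
HOSTS_TEXT = """\
127.0.0.1 localhost localhost.localdomain hosta-0 hosta-1 hosta-2 hosta-3 hosta-4
"""


def names_by_address(text):
    """Group every hostname or alias in hosts-file text by its address."""
    groups = defaultdict(list)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        groups[address].extend(names)
    return dict(groups)


groups = names_by_address(HOSTS_TEXT)
for address, names in groups.items():
    print(address, "->", ", ".join(names))
```

With this entry, all five hosta-N names end up under the single address 127.0.0.1, so the moms can only be told apart by their ports.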

----Original message----
From: knielson at adaptivecomputing.com
Date: 18/08/2011 16:49
To: <burcarjo at ono.com>, "Torque Users Mailing List" <torqueusers at supercluster.org>
Subject: Re: [torqueusers] trouble running a job with two nodes in a multi mom.

What does your /etc/hosts file look like?

Ken 

----- Original Message -----
> From: burcarjo at ono.com
> To: torqueusers at supercluster.org
> Sent: Thursday, August 18, 2011 7:06:04 AM
> Subject: [torqueusers] trouble running a job with two nodes in a multi mom.
> Hello all,
>
> I have a multi-mom configuration on a single host. There are 4 pbs_mom
> daemons running on the same machine (I'm simulating 4 nodes), and they
> all listen on different ports, as the admin guide describes.
>
> When I launch a job on a single node, everything works and the job
> finishes correctly: echo "hostname" | qsub -l nodes=1.
> The problem appears when I launch a job across more than one node
> (echo "hostname" | qsub -l nodes=2): the job remains in the running
> state and never finishes.
>
> I think there is a communication problem between the pbs_mom daemons
> running on the same machine, but I'm not sure. Reviewing the logs, I
> only find a little information about the problem:
> 
> 08/18/2011 14:29:35;0008; pbs_mom;Job;69.localhost;JOIN JOB as node 1
> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
> 
> 
> I would like to know whether anybody has had this problem with a
> multi-mom configuration.
>
> Thanks in advance.
>
> David.
> 
> 
> [root at localhost ~]# more /etc/hosts
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> 127.0.0.1 localhost localhost.localdomain hosta-0 hosta-1 hosta-2 hosta-3 hosta-4
> #192.168.1.100
> ::1 localhost6.localdomain6 localhost6
> 
> 
> ----------------
> [root at localhost ~]# more /var/spool/torque/server_priv/nodes
> hosta-0 np=1
> hosta-1 np=1 mom_service_port=30001 mom_manager_port=30002
> hosta-2 np=1 mom_service_port=31001 mom_manager_port=31002
> hosta-3 np=1 mom_service_port=32001 mom_manager_port=32002
> hosta-4 np=1 mom_service_port=33001 mom_manager_port=33002
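
As a sanity check on a nodes file like the one above, the following Python sketch parses each entry and reports any port that is assigned to more than one mom. It is only an illustration: the function names `parse_nodes` and `duplicate_ports` are made up for this example, the input is the text quoted in this message, and entries without explicit ports (such as hosta-0) are simply skipped rather than filled in with default values.

```python
# Copied from the server_priv/nodes listing quoted in this thread.
NODES_TEXT = """\
hosta-0 np=1
hosta-1 np=1 mom_service_port=30001 mom_manager_port=30002
hosta-2 np=1 mom_service_port=31001 mom_manager_port=31002
hosta-3 np=1 mom_service_port=32001 mom_manager_port=32002
hosta-4 np=1 mom_service_port=33001 mom_manager_port=33002
"""


def parse_nodes(text):
    """Return {node_name: {attribute: value}} for each non-comment line."""
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, *fields = line.split()
        # Keep only key=value fields; values stay as strings.
        nodes[name] = dict(f.split("=", 1) for f in fields if "=" in f)
    return nodes


def duplicate_ports(nodes):
    """List (port, first_node, second_node) for every port used twice."""
    seen = {}
    clashes = []
    for name, attrs in nodes.items():
        for key in ("mom_service_port", "mom_manager_port"):
            if key not in attrs:
                continue  # no explicit port on this entry
            port = int(attrs[key])
            if port in seen:
                clashes.append((port, seen[port], name))
            else:
                seen[port] = name
    return clashes


nodes = parse_nodes(NODES_TEXT)
print("port clashes:", duplicate_ports(nodes))
```

For the listing in this message the explicit ports are all distinct, so a clash in the nodes file itself does not appear to be the cause of the hang.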
> 
> ------------------

> 
> 
> [root at localhost ~]# more /var/spool/torque/mom_logs/20110818.32001
>
> 08/18/2011 14:25:26;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/18/2011 14:25:26;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost.localdomain
> 08/18/2011 14:25:28;0002; pbs_mom;Svr;im_eof;End of File from addr 127.0.0.1:15001
> 08/18/2011 14:25:28;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost.localdomain
> 08/18/2011 14:29:35;0080; pbs_mom;Job;69.localhost;removed job script
> 08/18/2011 14:29:35;0008; pbs_mom;Job;69.localhost;JOIN JOB as node 1
> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
> 08/18/2011 14:29:35;0002; pbs_mom;Svr;im_eof;No error from addr 127.0.0.1:32002
> 08/18/2011 14:29:35;0002; pbs_mom;Svr;im_eof;End of File from addr 127.0.0.1:1019
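
To pull the error entries out of a mom log like the one above, a small Python sketch can split each line on its semicolons. The field names in `LogRecord` (timestamp, event, daemon, kind, name, message) are my own labels for the six semicolon-separated parts visible in the excerpt, not names taken from the TORQUE documentation, and the sample lines are copied from this message.

```python
from typing import NamedTuple, Optional

# Sample lines copied from the mom log quoted in this thread.
LOG_TEXT = """\
08/18/2011 14:29:35;0008; pbs_mom;Job;69.localhost;JOIN JOB as node 1
08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
08/18/2011 14:29:35;0002; pbs_mom;Svr;im_eof;No error from addr 127.0.0.1:32002
"""


class LogRecord(NamedTuple):
    # Hypothetical labels for the six semicolon-separated fields.
    timestamp: str
    event: str
    daemon: str
    kind: str
    name: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one log line into six fields; return None if it does not fit."""
    parts = [p.strip() for p in line.split(";", 5)]
    if len(parts) != 6:
        return None
    return LogRecord(*parts)


records = [r for r in map(parse_line, LOG_TEXT.splitlines()) if r]
errors = [r for r in records if r.message.startswith("LOG_ERROR")]
for r in errors:
    print(r.timestamp, r.message)
```

Limiting the split to five separators keeps any semicolons inside the message text intact. Running the same filter over each per-port mom log would show which of the daemons logged the im_request errors.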
> 
> ------------------
> 
> [usr1 at localhost torque]$ tracejob 68
>
> /var/spool/torque/server_priv/accounting/20110818: Permission denied
> /var/spool/torque/mom_logs/20110818.33001: No matching job records located
> /var/spool/torque/mom_logs/20110818.32001: No matching job records located
> /var/spool/torque/mom_logs/20110818.31001: No matching job records located
> /var/spool/torque/mom_logs/20110818: No matching job records located
>
> Job: 68.localhost
>
> 08/18/2011 14:27:40 M JOIN JOB as node 1
> 08/18/2011 14:27:40 S ready to commit job
> 08/18/2011 14:27:40 S ready to commit job completed
> 08/18/2011 14:27:40 S committing job
> 08/18/2011 14:27:40 S enqueuing into batch, state 1 hop 1
> 08/18/2011 14:27:40 S Reply sent for request type Commit on socket 13
> 08/18/2011 14:27:40 S attr comment modified
> 08/18/2011 14:27:40 S Job Modified at request of Scheduler at localhost
> 08/18/2011 14:27:40 L Job Run
> 08/18/2011 14:27:40 S Job Run at request of Scheduler at localhost
> 08/18/2011 14:27:40 S forking in send_job
> 08/18/2011 14:27:40 S entering post_sendmom
> 08/18/2011 14:27:40 S child reported success for job after 0 seconds (dest=hosta-0), rc=0
> 08/18/2011 14:27:40 M removed job script
> 
> 
> ------------------------------
> 
> 
> [root at localhost ~]# qstat -n
>
> localhost:
>                                                                          Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> 68.localhost         user1    batch    STDIN                --     2   2     -- 01:00 R    --
>    hosta-1/0+hosta-0/0
> 69.localhost         user1    batch    STDIN                --     2   2     -- 01:00 R    --
>    hosta-3/0+hosta-2/0
> 
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers




