[torqueusers] trouble running a job with two nodes in a multi mom.

Ken Nielson knielson at adaptivecomputing.com
Thu Aug 18 08:49:10 MDT 2011


What does your /etc/hosts file look like?

Ken 

----- Original Message -----
> From: burcarjo at ono.com
> To: torqueusers at supercluster.org
> Sent: Thursday, August 18, 2011 7:06:04 AM
> Subject: [torqueusers] trouble running a job with two nodes in a multi mom.
> Hello all,
> I have a multi-MOM configuration on a single host. There are 4 pbs_mom
> daemons running on the same machine (I'm simulating 4 nodes), and they
> all listen on different ports, as the admin guide describes.
> 
> When I launch a job on a single node, everything works and the job
> finishes correctly: echo "hostname" | qsub -l nodes=1.
> The problem appears when I launch a job across more than one node
> (echo "hostname" | qsub -l nodes=2): the job stays in the running
> state and never finishes.
> 
> I think there is a communication problem between the pbs_mom daemons
> running on the same machine, but I'm not sure. Reviewing the logs, I
> only find a little information about the problem:
> 
> 08/18/2011 14:29:35;0008; pbs_mom;Job;69.localhost;JOIN JOB as node 1
> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
> 
> 
> I would like to know if anybody else has had this problem with a
> multi-MOM configuration.
> 
> Thanks in advance.
> 
> David.
> 
> [root at localhost ~]# more /etc/hosts
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> 127.0.0.1 localhost localhost.localdomain hosta-0 hosta-1 hosta-2 hosta-3 hosta-4
> #192.168.1.100
> ::1 localhost6.localdomain6 localhost6
> 
> 
> ----------------
> [root at localhost ~]# more /var/spool/torque/server_priv/nodes
> hosta-0 np=1
> hosta-1 np=1 mom_service_port=30001 mom_manager_port=30002
> hosta-2 np=1 mom_service_port=31001 mom_manager_port=31002
> hosta-3 np=1 mom_service_port=32001 mom_manager_port=32002
> hosta-4 np=1 mom_service_port=33001 mom_manager_port=33002
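
A quick way to sanity-check a nodes file like the one quoted above is to parse it and confirm that no two MOM entries claim the same port. The following is only a minimal sketch: the helper names (parse_nodes, port_conflicts) are hypothetical, and it assumes that an entry without explicit ports (hosta-0 here) falls back to the stock pbs_mom defaults of 15002 (service) and 15003 (manager).

```python
# Sketch: parse a server_priv/nodes file and flag ports used more than once.
# Assumption: entries with no explicit ports use the stock pbs_mom defaults.
from collections import defaultdict

DEFAULT_SERVICE_PORT = 15002   # assumed default pbs_mom service port
DEFAULT_MANAGER_PORT = 15003   # assumed default pbs_mom manager port

# Nodes file contents as quoted in the post.
NODES_TEXT = """\
hosta-0 np=1
hosta-1 np=1 mom_service_port=30001 mom_manager_port=30002
hosta-2 np=1 mom_service_port=31001 mom_manager_port=31002
hosta-3 np=1 mom_service_port=32001 mom_manager_port=32002
hosta-4 np=1 mom_service_port=33001 mom_manager_port=33002
"""

def parse_nodes(text):
    """Return {node_name: (service_port, manager_port)}."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, attrs = fields[0], fields[1:]
        # Keep only key=value attributes such as np=1 or mom_service_port=30001.
        opts = dict(a.split("=", 1) for a in attrs if "=" in a)
        nodes[name] = (
            int(opts.get("mom_service_port", DEFAULT_SERVICE_PORT)),
            int(opts.get("mom_manager_port", DEFAULT_MANAGER_PORT)),
        )
    return nodes

def port_conflicts(nodes):
    """Return {port: [node names]} for every port claimed more than once."""
    users = defaultdict(list)
    for name, ports in nodes.items():
        for port in ports:
            users[port].append(name)
    return {port: names for port, names in users.items() if len(names) > 1}

nodes = parse_nodes(NODES_TEXT)
conflicts = port_conflicts(nodes)
for name, (service, manager) in nodes.items():
    print(f"{name}: service={service} manager={manager}")
print("conflicts:", conflicts)
```

An empty conflicts dictionary only shows that the nodes file is internally consistent; it does not prove that each pbs_mom daemon was actually started on the ports listed for it.
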
> 
> ------------------
> 
> 
> [root at localhost ~]# more /var/spool/torque/mom_logs/20110818.32001
> 
> 08/18/2011 14:25:26;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/18/2011 14:25:26;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost.localdomain
> 08/18/2011 14:25:28;0002; pbs_mom;Svr;im_eof;End of File from addr 127.0.0.1:15001
> 08/18/2011 14:25:28;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost.localdomain
> 08/18/2011 14:29:35;0080; pbs_mom;Job;69.localhost;removed job script
> 08/18/2011 14:29:35;0008; pbs_mom;Job;69.localhost;JOIN JOB as node 1
> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
> 08/18/2011 14:29:35;0002; pbs_mom;Svr;im_eof;No error from addr 127.0.0.1:32002
> 08/18/2011 14:29:35;0002; pbs_mom;Svr;im_eof;End of File from addr 127.0.0.1:1019
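
With several MOM logs to compare (one per service port), it can help to pull out just the error records. Below is a minimal sketch that splits the semicolon-separated records shown above and keeps those whose message contains LOG_ERROR. The field names (event, daemon, obj_class, obj_name) are labels chosen here for illustration, inferred from the layout of the quoted lines rather than taken from TORQUE documentation.

```python
# Sketch: split MOM log records on ';' and keep the LOG_ERROR entries.
# Field names are illustrative guesses based on the quoted log layout.
from datetime import datetime
from typing import NamedTuple, Optional

class MomRecord(NamedTuple):
    when: datetime
    event: str       # e.g. "0001", "0008"
    daemon: str      # e.g. "pbs_mom"
    obj_class: str   # e.g. "Svr", "Job"
    obj_name: str    # e.g. "69.localhost"
    message: str

def parse_record(line: str) -> Optional[MomRecord]:
    """Parse one log line; return None if it does not look like a record."""
    # maxsplit=5 keeps any ';' inside the message text intact.
    parts = [p.strip() for p in line.split(";", 5)]
    if len(parts) != 6:
        return None
    try:
        when = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return MomRecord(when, *parts[1:])

# Records copied from the 20110818.32001 log quoted in the post.
SAMPLE = """\
08/18/2011 14:29:35;0080; pbs_mom;Job;69.localhost;removed job script
08/18/2011 14:29:35;0008; pbs_mom;Job;69.localhost;JOIN JOB as node 1
08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
"""

records = [r for r in map(parse_record, SAMPLE.splitlines()) if r is not None]
errors = [r for r in records if "LOG_ERROR" in r.message]
for r in errors:
    print(r.when.isoformat(), r.message)
```

Running the same filter over each file under mom_logs would show which daemons report the im_request errors and at what times.
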
> 
> ------------------
> 
> [usr1 at localhost torque]$ tracejob 68
> 
> /var/spool/torque/server_priv/accounting/20110818: Permission denied
> /var/spool/torque/mom_logs/20110818.33001: No matching job records located
> /var/spool/torque/mom_logs/20110818.32001: No matching job records located
> /var/spool/torque/mom_logs/20110818.31001: No matching job records located
> /var/spool/torque/mom_logs/20110818: No matching job records located
> 
> Job: 68.localhost
> 
> 08/18/2011 14:27:40 M JOIN JOB as node 1
> 08/18/2011 14:27:40 S ready to commit job
> 08/18/2011 14:27:40 S ready to commit job completed
> 08/18/2011 14:27:40 S committing job
> 08/18/2011 14:27:40 S enqueuing into batch, state 1 hop 1
> 08/18/2011 14:27:40 S Reply sent for request type Commit on socket 13
> 08/18/2011 14:27:40 S attr comment modified
> 08/18/2011 14:27:40 S Job Modified at request of Scheduler at localhost
> 08/18/2011 14:27:40 L Job Run
> 08/18/2011 14:27:40 S Job Run at request of Scheduler at localhost
> 08/18/2011 14:27:40 S forking in send_job
> 08/18/2011 14:27:40 S entering post_sendmom
> 08/18/2011 14:27:40 S child reported success for job after 0 seconds (dest=hosta-0), rc=0
> 08/18/2011 14:27:40 M removed job script
> 
> 
> ------------------------------
> 
> [root at localhost ~]# qstat -n
> 
> 
> localhost:
>                                                                          Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> 68.localhost         user1    batch    STDIN                --     2   2     -- 01:00 R    --
>    hosta-1/0+hosta-0/0
> 69.localhost         user1    batch    STDIN                --     2   2     -- 01:00 R    --
>    hosta-3/0+hosta-2/0
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

