[torqueusers] trouble running a job with two nodes in a multi mom.

Janusz Mordarski janusz.mordarski at uj.edu.pl
Thu Aug 18 10:57:42 MDT 2011


I think you should change it: 127.0.0.1 should point only to localhost and
localhost.localdomain, and the IP address of the machine should then point
to hosta-0 ...
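
To make the suggestion concrete, here is a small Python sketch. It is only an illustration: loopback_aliases is a helper written for this message, not part of TORQUE. It lists the names in a hosts file that are mapped to an IPv4 loopback address but are not plain localhost names. The address 192.168.1.100 is an assumption taken from the commented-out line in the quoted /etc/hosts; use the machine's actual address instead.

```python
import ipaddress

def loopback_aliases(hosts_text, allowed=("localhost", "localhost.localdomain")):
    """Return names mapped to an IPv4 loopback address that are not in `allowed`."""
    offenders = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        addr, *names = line.split()
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            continue  # skip lines that do not start with an IP address
        if ip.version == 4 and ip.is_loopback:
            offenders += [n for n in names if n not in allowed]
    return offenders

# The hosts line as posted in the quoted message.
ORIGINAL_HOSTS = (
    "127.0.0.1 localhost localhost.localdomain "
    "hosta-0 hosta-1 hosta-2 hosta-3 hosta-4\n"
)

# Proposed layout; 192.168.1.100 is an assumed address (see above).
PROPOSED_HOSTS = (
    "127.0.0.1 localhost localhost.localdomain\n"
    "192.168.1.100 hosta-0 hosta-1 hosta-2 hosta-3 hosta-4\n"
)

print(loopback_aliases(ORIGINAL_HOSTS))  # ['hosta-0', 'hosta-1', 'hosta-2', 'hosta-3', 'hosta-4']
print(loopback_aliases(PROPOSED_HOSTS))  # []
```

An empty list for the proposed layout means that no mom hostname resolves to the loopback address any more.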

On 2011-08-18 16:54, burcarjo at ono.com wrote:
> My /etc/hosts contains:
> 127.0.0.1 localhost localhost.localdomain hosta-0 hosta-1 hosta-2 hosta-3 hosta-4
>
> I can log in to hosta-X without an ssh password.
>
> ----Original message----
> From: knielson at adaptivecomputing.com
> Date: 18/08/2011 16:49
> To: <burcarjo at ono.com>, "Torque Users Mailing List" <torqueusers at supercluster.org>
> Subject: Re: [torqueusers] trouble running a job with two nodes in a multi mom.
>
> What does your /etc/hosts file look like?
>
> Ken
>
> ----- Original Message -----
>> From: burcarjo at ono.com
>> To: torqueusers at supercluster.org
>> Sent: Thursday, August 18, 2011 7:06:04 AM
>> Subject: [torqueusers] trouble running a job with two nodes in a multi mom.
>> Hello all,
>> I have a multi-mom configuration on a single host. There are 4 pbs_mom
>> daemons running on the same machine (I'm simulating 4 nodes), and they all
>> listen on different ports, as the admin guide says.
>> When I launch a job on only one node, everything goes fine and my job
>> finishes correctly: echo "hostname" | qsub -l nodes=1.
>> The problem is when I try to launch a job over more than one node
>> (echo "hostname" | qsub -l nodes=2): the job then remains in the running
>> state and never finishes.
>>
>> I think there is a communication problem between the pbs_mom daemons
>> running on the same machine, but I'm not sure. Reviewing the logs, I
>> only find a little information about the problem:
>>
>> 08/18/2011 14:29:35;0008; pbs_mom;Job;69.localhost;JOIN JOB as node 1
>> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
>> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
>>
>>
>> I would like to know if somebody has had this problem with a multi-mom
>> configuration.
>> Thanks in advance.
>>
>> David.
>>
>>
>> [root at localhost ~]# more /etc/hosts
>> # Do not remove the following line, or various programs
>> # that require network functionality will fail.
>> 127.0.0.1 localhost localhost.localdomain hosta-0 hosta-1 hosta-2 hosta-3 hosta-4
>> #192.168.1.100
>> ::1 localhost6.localdomain6 localhost6
>>
>> ----------------
>> [root at localhost ~]# more /var/spool/torque/server_priv/nodes
>> hosta-0 np=1
>> hosta-1 np=1 mom_service_port=30001 mom_manager_port=30002
>> hosta-2 np=1 mom_service_port=31001 mom_manager_port=31002
>> hosta-3 np=1 mom_service_port=32001 mom_manager_port=32002
>> hosta-4 np=1 mom_service_port=33001 mom_manager_port=33002
>> ------------------
>>
>> [root at localhost ~]# more /var/spool/torque/mom_logs/20110818.32001
>> 08/18/2011 14:25:26;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
>> 08/18/2011 14:25:26;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost.localdomain
>> 08/18/2011 14:25:28;0002; pbs_mom;Svr;im_eof;End of File from addr 127.0.0.1:15001
>> 08/18/2011 14:25:28;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost.localdomain
>> 08/18/2011 14:29:35;0080; pbs_mom;Job;69.localhost;removed job script
>> 08/18/2011 14:29:35;0008; pbs_mom;Job;69.localhost;JOIN JOB as node 1
>> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
>> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
>> 08/18/2011 14:29:35;0002; pbs_mom;Svr;im_eof;No error from addr 127.0.0.1:32002
>> 08/18/2011 14:29:35;0002; pbs_mom;Svr;im_eof;End of File from addr 127.0.0.1:1019
>>
>> ------------------
>> [usr1 at localhost torque]$ tracejob 68
>>
>> /var/spool/torque/server_priv/accounting/20110818: Permission denied
>> /var/spool/torque/mom_logs/20110818.33001: No matching job records located
>> /var/spool/torque/mom_logs/20110818.32001: No matching job records located
>> /var/spool/torque/mom_logs/20110818.31001: No matching job records located
>> /var/spool/torque/mom_logs/20110818: No matching job records located
>>
>> Job: 68.localhost
>>
>> 08/18/2011 14:27:40 M JOIN JOB as node 1
>> 08/18/2011 14:27:40 S ready to commit job
>> 08/18/2011 14:27:40 S ready to commit job completed
>> 08/18/2011 14:27:40 S committing job
>> 08/18/2011 14:27:40 S enqueuing into batch, state 1 hop 1
>> 08/18/2011 14:27:40 S Reply sent for request type Commit on socket 13
>> 08/18/2011 14:27:40 S attr comment modified
>> 08/18/2011 14:27:40 S Job Modified at request of Scheduler at localhost
>> 08/18/2011 14:27:40 L Job Run
>> 08/18/2011 14:27:40 S Job Run at request of Scheduler at localhost
>> 08/18/2011 14:27:40 S forking in send_job
>> 08/18/2011 14:27:40 S entering post_sendmom
>> 08/18/2011 14:27:40 S child reported success for job after 0 seconds (dest=hosta-0), rc=0
>> 08/18/2011 14:27:40 M removed job script
>>
>> ------------------------------
>>
>>
>> [root at localhost ~]# qstat -n
>>
>> localhost:
>>                                                                          Req'd  Req'd   Elap
>> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>> 68.localhost         user1    batch    STDIN                --     2   2     -- 01:00 R    --
>>    hosta-1/0+hosta-0/0
>> 69.localhost         user1    batch    STDIN                --     2   2     -- 01:00 R    --
>>    hosta-3/0+hosta-2/0
>>
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers


-- 
Dept of Computational Biophysics and Bioinformatics,
Faculty of Biochemistry, Biophysics and Biotechnology,
Jagiellonian University,
ul. Gronostajowa 7,
30-387 Krakow, Poland.
Tel: (+48-12)-664-6380


