[torqueusers] trouble running a job with two nodes in a multi mom.

burcarjo at ono.com burcarjo at ono.com
Fri Aug 19 03:46:31 MDT 2011


Hi,

I have changed my /etc/hosts configuration, but the problem continues.
I have reviewed everything and cannot find the solution. When I submit a
job with 2 nodes, Torque assigns the nodes to the job and starts the job
on the first compute node, but the job never finishes.

When a pbs_mom daemon needs to communicate with another one, is it
possible that it does not know the port number of the second pbs_mom, so
the daemon keeps waiting?

I don't know whether anyone has tried a multi-mom configuration
successfully.
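
One way to rule out a port mix-up is to read the ports back out of server_priv/nodes and compare them with what each pbs_mom was started with. The following is a minimal Python sketch, not part of TORQUE: parse_nodes and port_conflicts are hypothetical helper names, the sample text mirrors the nodes file quoted later in this thread, and the defaults 15002/15003 are assumed for entries that set no ports.

```python
# Sketch only: parse a TORQUE server_priv/nodes file and list each MOM's ports.
# Assumption: entries without explicit ports use the defaults 15002 / 15003.
DEFAULT_SERVICE_PORT = 15002
DEFAULT_MANAGER_PORT = 15003

SAMPLE_NODES = """\
hosta-0 np=1
hosta-1 np=1 mom_service_port=30001 mom_manager_port=30002
hosta-2 np=1 mom_service_port=31001 mom_manager_port=31002
hosta-3 np=1 mom_service_port=32001 mom_manager_port=32002
hosta-4 np=1 mom_service_port=33001 mom_manager_port=33002
"""


def parse_nodes(text):
    """Return {node_name: {"mom_service_port": int, "mom_manager_port": int}}."""
    nodes = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, *attrs = line.split()
        entry = {
            "mom_service_port": DEFAULT_SERVICE_PORT,
            "mom_manager_port": DEFAULT_MANAGER_PORT,
        }
        for attr in attrs:
            key, sep, value = attr.partition("=")
            if sep and key in entry:
                entry[key] = int(value)
        nodes[name] = entry
    return nodes


def port_conflicts(nodes):
    """Return (port, first_node, second_node) for every port used twice.

    Only meaningful when all MOMs run on the same machine, as in this thread.
    """
    seen = {}
    conflicts = []
    for name, entry in nodes.items():
        for key in ("mom_service_port", "mom_manager_port"):
            port = entry[key]
            if port in seen:
                conflicts.append((port, seen[port], name))
            else:
                seen[port] = name
    return conflicts


for node, ports in parse_nodes(SAMPLE_NODES).items():
    print(node, ports["mom_service_port"], ports["mom_manager_port"])
```

If port_conflicts returns an empty list, the nodes file itself is at least self-consistent, and the next thing to compare is the port each pbs_mom was actually started on.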

thanks all.
David K.

My logs with the PBSDEBUG variable activated are:

[root at host0 torque]$ more /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost localhost.localdomain
192.168.1.100  host0 hosta hostb hostc hostd
::1             localhost6.localdomain6 localhost6

pbs_server:
job allocation debug(2): 2 requested, 4 svr_numnodes
Counted 0 gpus free on node hosta
starting eval gpus on node hosta need 0 free 0
Counted 0 gpus free on node hosta
adequate virtual nodes and gpus available - node is ok
Counted 0 gpus free on node hostb
starting eval gpus on node hostb need 0 free 0
Counted 0 gpus free on node hostb
adequate virtual nodes and gpus available - node is ok
job allocation debug(3): returning 2 requested
allocated node hosta/0 to job 3.host0 (nsnfree=1)
allocated node hostb/0 to job 3.host0 (nsnfree=1)
catch_child caught pid 11726
catch_child found work task found for pid 11726

pbs_mom hostb:
MOM is up
saving extra job info stdout=0 stderr=0 taskid=1 nodeid=0
===== MD5 07928297C747E1EDDD04CA29E25B4FF6
pbs_mom: LOG_DEBUG::init_groups, pre-sigprocmask
pbs_mom: LOG_DEBUG::init_groups, post-initgroups
im_request:received request 'JOIN_JOB' (1) for job 3.host0 from 192.168.1.100:1021
pbs_mom: LOG_DEBUG::init_groups, pre-sigprocmask
pbs_mom: LOG_DEBUG::init_groups, post-initgroups
saving extra job info stdout=59565 stderr=37405 taskid=1 nodeid=1
im_request:received request 'ALL_OKAY' (0) for job 3.host0 from 192.168.1.100:30002
pbs_mom: LOG_ERROR::im_request, stream 1 not found
pbs_mom: LOG_ERROR::im_request, error sending command 0 to job 3.host0
do_rpp: cannot get protocol End of File
pbs_mom: LOG_ERROR::Success (0) in do_rpp, cannot get protocol End of File
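
To see which peer addresses the inter-MOM requests arrive from, the im_request lines can be pulled out of the debug output. This is a small Python sketch under assumptions: the regular expression is fitted only to the two im_request lines in the pbs_mom output above, and im_requests is a hypothetical helper name. In this log JOIN_JOB arrives from port 1021 and ALL_OKAY from port 30002; whether that difference is related to the "stream 1 not found" error is not established here.

```python
import re

# Pattern fitted to the im_request lines in this thread only (an assumption,
# not a documented TORQUE log format).
IM_REQUEST = re.compile(
    r"im_request:\s*received request '(?P<cmd>\w+)' \((?P<code>\d+)\) "
    r"for job (?P<job>\S+) from (?P<ip>[\d.]+):\s*(?P<port>\d+)"
)

SAMPLE_LOG = """\
im_request:received request 'JOIN_JOB' (1) for job 3.host0 from 192.168.1.100:1021
saving extra job info stdout=59565 stderr=37405 taskid=1 nodeid=1
im_request:received request 'ALL_OKAY' (0) for job 3.host0 from 192.168.1.100:30002
pbs_mom: LOG_ERROR::im_request, stream 1 not found
"""


def im_requests(log_text):
    """Return (command, job_id, ip, port) for each received im_request."""
    # Mail clients may wrap a record across lines, so collapse whitespace first.
    flat = " ".join(log_text.split())
    return [
        (m["cmd"], m["job"], m["ip"], int(m["port"]))
        for m in IM_REQUEST.finditer(flat)
    ]


for cmd, job, ip, port in im_requests(SAMPLE_LOG):
    print(f"{cmd:10s} job={job} from {ip}:{port}")
```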






----Original message----

From: janusz.mordarski at uj.edu.pl
Date: 18/08/2011 18:57
To: <torqueusers at supercluster.org>
Subject: Re: [torqueusers] trouble running a job with two nodes in a multi mom.

I think you should change it: 127.0.0.1 should point only to localhost
and localhost.localdomain, and the machine's real IP address should then
point to hosta-0 ...
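
A quick check for this can be scripted. The sketch below (Python; names_on_loopback and LOCAL_NAMES are hypothetical names, and the sample text is the /etc/hosts quoted below) lists every host name that is mapped to a loopback address but is not one of the usual localhost aliases.

```python
import ipaddress

# Names that are expected to live on a loopback address (an assumption based
# on the stock /etc/hosts shown in this thread).
LOCAL_NAMES = {
    "localhost",
    "localhost.localdomain",
    "localhost6",
    "localhost6.localdomain6",
}

ORIGINAL_HOSTS = """\
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost localhost.localdomain hosta-0 hosta-1 hosta-2 hosta-3 hosta-4
#192.168.1.100
::1 localhost6.localdomain6 localhost6
"""


def names_on_loopback(hosts_text):
    """Return host names bound to 127.0.0.0/8 or ::1 that are not localhost aliases."""
    offenders = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            continue  # not an IP address; skip malformed lines
        if ip.is_loopback:
            offenders.extend(n for n in names if n not in LOCAL_NAMES)
    return offenders


print(names_on_loopback(ORIGINAL_HOSTS))
```

An empty result means no node name is bound to loopback, which is the state suggested above.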

On 2011-08-18 16:54, burcarjo at ono.com wrote:
> My /etc/hosts contains:
> 127.0.0.1 localhost localhost.localdomain hosta-0 hosta-1 hosta-2 hosta-3 hosta-4
>
> I can log in to hosta-X without an ssh password.
>
> ----Original message----
> From: knielson at adaptivecomputing.com
> Date: 18/08/2011 16:49
> To: <burcarjo at ono.com>, "Torque Users Mailing List" <torqueusers at supercluster.org>
> Subject: Re: [torqueusers] trouble running a job with two nodes in a multi mom.
>
> What does your /etc/hosts file look like?
>
> Ken
>
> ----- Original Message -----
>> From: burcarjo at ono.com
>> To: torqueusers at supercluster.org
>> Sent: Thursday, August 18, 2011 7:06:04 AM
>> Subject: [torqueusers] trouble running a job with two nodes in a multi mom.
>> Hello all,
>>
>> I have a multi-mom configuration on a single host. There are 4 pbs_mom
>> daemons running on the same machine (I'm simulating 4 nodes), and they
>> all listen on different ports, as the admin guide says.
>>
>> When I launch a job with only one node, everything goes fine and my job
>> finishes correctly: echo "hostname" | qsub -l nodes=1.
>>
>> The problem is when I try to launch a job over more than one node
>> (echo "hostname" | qsub -l nodes=2): the job remains in the running
>> state and never finishes.
>>
>> I think there is a communication problem between the pbs_mom daemons
>> running on the same machine, but I'm not sure. Reviewing the logs, I
>> only find a little information about the problem:
>>
>> 08/18/2011 14:29:35;0008; pbs_mom;Job;69.localhost;JOIN JOB as node 1
>> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
>> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
>>
>> I would like to know if somebody has had this problem with a multi-mom
>> configuration.
>> Thanks in advance.
>>
>> David.
>>
>>
>> [root at localhost ~]# more /etc/hosts
>> # Do not remove the following line, or various programs
>> # that require network functionality will fail.
>> 127.0.0.1 localhost localhost.localdomain hosta-0 hosta-1 hosta-2 hosta-3 hosta-4
>> #192.168.1.100
>> ::1 localhost6.localdomain6 localhost6
>>
>> ----------------
>> [root at localhost ~]# more /var/spool/torque/server_priv/nodes
>> hosta-0 np=1
>> hosta-1 np=1 mom_service_port=30001 mom_manager_port=30002
>> hosta-2 np=1 mom_service_port=31001 mom_manager_port=31002
>> hosta-3 np=1 mom_service_port=32001 mom_manager_port=32002
>> hosta-4 np=1 mom_service_port=33001 mom_manager_port=33002
>> ------------------
>>
>> [root at localhost ~]# more /var/spool/torque/mom_logs/20110818.32001
>> 08/18/2011 14:25:26;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
>> 08/18/2011 14:25:26;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost.localdomain
>> 08/18/2011 14:25:28;0002; pbs_mom;Svr;im_eof;End of File from addr 127.0.0.1:15001
>> 08/18/2011 14:25:28;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost.localdomain
>> 08/18/2011 14:29:35;0080; pbs_mom;Job;69.localhost;removed job script
>> 08/18/2011 14:29:35;0008; pbs_mom;Job;69.localhost;JOIN JOB as node 1
>> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
>> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
>> 08/18/2011 14:29:35;0002; pbs_mom;Svr;im_eof;No error from addr 127.0.0.1:32002
>> 08/18/2011 14:29:35;0002; pbs_mom;Svr;im_eof;End of File from addr 127.0.0.1:1019
>>
>> ------------------
>> [usr1 at localhost torque]$ tracejob 68
>>
>> /var/spool/torque/server_priv/accounting/20110818: Permission denied
>> /var/spool/torque/mom_logs/20110818.33001: No matching job records located
>> /var/spool/torque/mom_logs/20110818.32001: No matching job records located
>> /var/spool/torque/mom_logs/20110818.31001: No matching job records located
>> /var/spool/torque/mom_logs/20110818: No matching job records located
>>
>> Job: 68.localhost
>>
>> 08/18/2011 14:27:40 M JOIN JOB as node 1
>> 08/18/2011 14:27:40 S ready to commit job
>> 08/18/2011 14:27:40 S ready to commit job completed
>> 08/18/2011 14:27:40 S committing job
>> 08/18/2011 14:27:40 S enqueuing into batch, state 1 hop 1
>> 08/18/2011 14:27:40 S Reply sent for request type Commit on socket 13
>> 08/18/2011 14:27:40 S attr comment modified
>> 08/18/2011 14:27:40 S Job Modified at request of Scheduler at localhost
>> 08/18/2011 14:27:40 L Job Run
>> 08/18/2011 14:27:40 S Job Run at request of Scheduler at localhost
>> 08/18/2011 14:27:40 S forking in send_job
>> 08/18/2011 14:27:40 S entering post_sendmom
>> 08/18/2011 14:27:40 S child reported success for job after 0 seconds (dest=hosta-0), rc=0
>> 08/18/2011 14:27:40 M removed job script
>>
>> ------------------------------
>>
>>
>> [root at localhost ~]# qstat -n
>>
>> localhost:
>>                                                                          Req'd  Req'd   Elap
>> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>> 68.localhost         user1    batch    STDIN            --     2     2   --     01:00 R --
>>    hosta-1/0+hosta-0/0
>> 69.localhost         user1    batch    STDIN            --     2     2   --     01:00 R --
>>    hosta-3/0+hosta-2/0
>>
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>


-- 
Dept of Computational Biophysics and Bioinformatics,
Faculty of Biochemistry, Biophysics and Biotechnology,
Jagiellonian University,
ul. Gronostajowa 7,
30-387 Krakow, Poland.
Tel: (+48-12)-664-6380

