[torqueusers] trouble running a job with two nodes in a multi mom.
burcarjo at ono.com
Fri Aug 19 03:46:31 MDT 2011
Hi,
I have changed my /etc/hosts configuration, but the problem continues.
I have reviewed everything and still cannot find the solution. When I
submit a job with 2 nodes, Torque assigns the nodes to the job and
starts the job on the first compute node, but the job never finishes.
When one pbs_mom daemon needs to communicate with another, is it
possible that it doesn't know the port number of the second one, so the
daemon keeps waiting?
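One way to rule out a simple port mix-up is to check that every mom in the nodes file has its own pair of ports. The following is a minimal sketch, assuming the nodes entries quoted later in this message; `parse_nodes` and `duplicate_ports` are hypothetical helper names, not part of Torque, and the check says nothing about whether the moms actually exchange those ports at run time.

```python
# Hypothetical helpers (not part of Torque): read server_priv/nodes style
# entries and report any port number that is given to more than one mom.

def parse_nodes(text):
    """Return {node_name: {attribute: value}} for each non-blank line."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, attrs = fields[0], fields[1:]
        props = {}
        for attr in attrs:
            # Only key=value tokens are kept; other tokens are ignored.
            if "=" in attr:
                key, value = attr.split("=", 1)
                props[key] = int(value) if value.isdigit() else value
        nodes[name] = props
    return nodes


def duplicate_ports(nodes):
    """Return {port: [(node, attribute), ...]} for ports used more than once."""
    seen = {}
    for name, props in nodes.items():
        for key in ("mom_service_port", "mom_manager_port"):
            if key in props:
                seen.setdefault(props[key], []).append((name, key))
    return {port: users for port, users in seen.items() if len(users) > 1}


# Entries copied from the nodes file quoted in this thread.
NODES = """\
hosta-0 np=1
hosta-1 np=1 mom_service_port=30001 mom_manager_port=30002
hosta-2 np=1 mom_service_port=31001 mom_manager_port=31002
hosta-3 np=1 mom_service_port=32001 mom_manager_port=32002
hosta-4 np=1 mom_service_port=33001 mom_manager_port=33002
"""

nodes = parse_nodes(NODES)
for name, props in nodes.items():
    print(name, props.get("mom_service_port"), props.get("mom_manager_port"))
print("duplicates:", duplicate_ports(nodes))
```

With the entries above no port is reused, so a clash in the nodes file itself does not look like the cause.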
I don't know whether anyone has successfully tested a multi-mom
configuration.
Thanks, all.
David K.
My logs with the PBSDEBUG variable set are:
[root at host0 torque]$ more /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost localhost.localdomain
192.168.1.100 host0 hosta hostb hostc hostd
::1 localhost6.localdomain6 localhost6
pbs_server:
job allocation debug(2): 2 requested, 4 svr_numnodes
Counted 0 gpus free on node hosta
starting eval gpus on node hosta need 0 free 0
Counted 0 gpus free on node hosta
adequate virtual nodes and gpus available - node is ok
Counted 0 gpus free on node hostb
starting eval gpus on node hostb need 0 free 0
Counted 0 gpus free on node hostb
adequate virtual nodes and gpus available - node is ok
job allocation debug(3): returning 2 requested
allocated node hosta/0 to job 3.host0 (nsnfree=1)
allocated node hostb/0 to job 3.host0 (nsnfree=1)
catch_child caught pid 11726
catch_child found work task found for pid 11726
pbs_mom hostb:
MOM is up
saving extra job info stdout=0 stderr=0 taskid=1 nodeid=0
===== MD5 07928297C747E1EDDD04CA29E25B4FF6
pbs_mom: LOG_DEBUG::init_groups, pre-sigprocmask
pbs_mom: LOG_DEBUG::init_groups, post-initgroups
im_request:received request 'JOIN_JOB' (1) for job 3.host0 from 192.168.1.100:1021
pbs_mom: LOG_DEBUG::init_groups, pre-sigprocmask
pbs_mom: LOG_DEBUG::init_groups, post-initgroups
saving extra job info stdout=59565 stderr=37405 taskid=1 nodeid=1
im_request:received request 'ALL_OKAY' (0) for job 3.host0 from 192.168.1.100:30002
pbs_mom: LOG_ERROR::im_request, stream 1 not found
pbs_mom: LOG_ERROR::im_request, error sending command 0 to job 3.host0
do_rpp: cannot get protocol End of File
pbs_mom: LOG_ERROR::Success (0) in do_rpp, cannot get protocol End of File
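To make the PBSDEBUG output above easier to review, the sketch below pulls out each im_request message with its peer address and port, and lists the LOG_ERROR lines. It is only an illustration: the regular expression is written against the few lines shown in this message, not against any documented Torque log format, and `im_requests` and `error_lines` are hypothetical names.

```python
import re

# Pattern assumed from the sample lines in this message only.
IM_REQUEST = re.compile(
    r"im_request:\s*received request '(?P<cmd>\w+)' \((?P<code>\d+)\) "
    r"for job (?P<job>\S+) from (?P<addr>[\d.]+):\s*(?P<port>\d+)"
)


def im_requests(text):
    """Return (command, job, peer address, peer port) for each im_request line."""
    return [
        (m["cmd"], m["job"], m["addr"], int(m["port"]))
        for m in IM_REQUEST.finditer(text)
    ]


def error_lines(text):
    """Return every line that carries a LOG_ERROR marker."""
    return [line.strip() for line in text.splitlines() if "LOG_ERROR" in line]


# Lines copied from the pbs_mom output in this message.
SAMPLE = """\
im_request:received request 'JOIN_JOB' (1) for job 3.host0 from 192.168.1.100:1021
im_request:received request 'ALL_OKAY' (0) for job 3.host0 from 192.168.1.100:30002
pbs_mom: LOG_ERROR::im_request, stream 1 not found
pbs_mom: LOG_ERROR::im_request, error sending command 0 to job 3.host0
"""

for cmd, job, addr, port in im_requests(SAMPLE):
    print(f"{cmd:10s} job={job} peer={addr}:{port}")
for line in error_lines(SAMPLE):
    print("ERROR:", line)
```

Both requests arrive from the same address, 192.168.1.100, and differ only in the source port (1021 and 30002), which is expected when all moms share one host.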
----Original message----
From: janusz.mordarski at uj.edu.pl
Date: 18/08/2011 18:57
To: <torqueusers at supercluster.org>
Subject: Re: [torqueusers] trouble running a job with two nodes in a multi mom.
I think you should change it: 127.0.0.1 should point only to
localhost localhost.localdomain
and then the IP address should point to hosta-0 ...
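The suggestion above can be checked mechanically. This is a minimal sketch, assuming the two /etc/hosts versions posted in this thread; `names_on_loopback` is a hypothetical helper that lists every name, other than the usual localhost aliases, that a hosts file maps to a loopback address.

```python
import ipaddress

# Names that are expected to sit on the loopback address.
LOCAL_NAMES = {
    "localhost", "localhost.localdomain",
    "localhost6", "localhost6.localdomain6",
}


def names_on_loopback(hosts_text):
    """Return non-localhost names that the hosts text maps to a loopback IP."""
    offenders = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            continue  # skip lines that do not start with an IP address
        if ip.is_loopback:
            offenders.extend(n for n in names if n not in LOCAL_NAMES)
    return offenders


# The first /etc/hosts posted in this thread.
BEFORE = """\
127.0.0.1 localhost localhost.localdomain hosta-0 hosta-1 hosta-2 hosta-3 hosta-4
#192.168.1.100
::1 localhost6.localdomain6 localhost6
"""

# The /etc/hosts after the change.
AFTER = """\
127.0.0.1 localhost localhost.localdomain
192.168.1.100 host0 hosta hostb hostc hostd
::1 localhost6.localdomain6 localhost6
"""

print("before:", names_on_loopback(BEFORE))
print("after: ", names_on_loopback(AFTER))
```

The first file puts all five node names on 127.0.0.1, while the changed file leaves only the localhost aliases there.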
On 2011-08-18 16:54, burcarjo at ono.com wrote:
> My /etc/hosts contains:
> 127.0.0.1 localhost localhost.localdomain hosta-0 hosta-1 hosta-2 hosta-3 hosta-4
>
> I can log in to hosta-X without an ssh password.
>
> ----Original message----
> From: knielson at adaptivecomputing.com
> Date: 18/08/2011 16:49
> To: <burcarjo at ono.com>, "Torque Users Mailing List" <torqueusers at supercluster.org>
> Subject: Re: [torqueusers] trouble running a job with two nodes in a multi mom.
>
> What does your /etc/hosts file look like?
>
> Ken
>
> ----- Original Message -----
>> From: burcarjo at ono.com
>> To: torqueusers at supercluster.org
>> Sent: Thursday, August 18, 2011 7:06:04 AM
>> Subject: [torqueusers] trouble running a job with two nodes in a multi mom.
>> Hello all,
>> I have a multi-mom configuration on a single host. There are 4 pbs_mom
>> daemons running on the same machine (I'm simulating 4 nodes), and they
>> all listen on different ports, as the admin guide says.
>> When I launch a job with only one node, everything goes fine and my job
>> finishes correctly: echo "hostname" | qsub -l nodes=1.
>> The problem is when I try to launch a job over more than one node
>> (echo "hostname" | qsub -l nodes=2): the job remains in the running
>> state and never finishes.
>>
>> I think there is a communication problem between the pbs_mom daemons
>> running on the same machine, but I'm not sure. Reviewing the logs, I
>> only find a little information about the problem:
>>
>> 08/18/2011 14:29:35;0008; pbs_mom;Job;69.localhost;JOIN JOB as node 1
>> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
>> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
>>
>>
>> I would like to know if anybody has had this problem with a multi-mom
>> configuration.
>> Thanks in advance.
>>
>> David.
>>
>>
>> [root at localhost ~]# more /etc/hosts
>> # Do not remove the following line, or various programs
>> # that require network functionality will fail.
>> 127.0.0.1 localhost localhost.localdomain hosta-0 hosta-1 hosta-2 hosta-3 hosta-4
>> #192.168.1.100
>> ::1 localhost6.localdomain6 localhost6
>>
>> ----------------
>> [root at localhost ~]# more /var/spool/torque/server_priv/nodes
>> hosta-0 np=1
>> hosta-1 np=1 mom_service_port=30001 mom_manager_port=30002
>> hosta-2 np=1 mom_service_port=31001 mom_manager_port=31002
>> hosta-3 np=1 mom_service_port=32001 mom_manager_port=32002
>> hosta-4 np=1 mom_service_port=33001 mom_manager_port=33002
>> ------------------
>>
>> [root at localhost ~]# more /var/spool/torque/mom_logs/20110818.32001
>> 08/18/2011 14:25:26;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
>> 08/18/2011 14:25:26;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost.localdomain
>> 08/18/2011 14:25:28;0002; pbs_mom;Svr;im_eof;End of File from addr 127.0.0.1:15001
>> 08/18/2011 14:25:28;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost.localdomain
>> 08/18/2011 14:29:35;0080; pbs_mom;Job;69.localhost;removed job script
>> 08/18/2011 14:29:35;0008; pbs_mom;Job;69.localhost;JOIN JOB as node 1
>> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, stream 1 not found
>> 08/18/2011 14:29:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, error sending command 0 to job 69.localhost
>> 08/18/2011 14:29:35;0002; pbs_mom;Svr;im_eof;No error from addr 127.0.0.1:32002
>> 08/18/2011 14:29:35;0002; pbs_mom;Svr;im_eof;End of File from addr 127.0.0.1:1019
>>
>> ------------------
>>
>> [usr1 at localhost torque]$ tracejob 68
>>
>> /var/spool/torque/server_priv/accounting/20110818: Permission denied
>> /var/spool/torque/mom_logs/20110818.33001: No matching job records located
>> /var/spool/torque/mom_logs/20110818.32001: No matching job records located
>> /var/spool/torque/mom_logs/20110818.31001: No matching job records located
>> /var/spool/torque/mom_logs/20110818: No matching job records located
>>
>> Job: 68.localhost
>>
>> 08/18/2011 14:27:40 M JOIN JOB as node 1
>> 08/18/2011 14:27:40 S ready to commit job
>> 08/18/2011 14:27:40 S ready to commit job completed
>> 08/18/2011 14:27:40 S committing job
>> 08/18/2011 14:27:40 S enqueuing into batch, state 1 hop 1
>> 08/18/2011 14:27:40 S Reply sent for request type Commit on socket 13
>> 08/18/2011 14:27:40 S attr comment modified
>> 08/18/2011 14:27:40 S Job Modified at request of Scheduler at localhost
>> 08/18/2011 14:27:40 L Job Run
>> 08/18/2011 14:27:40 S Job Run at request of Scheduler at localhost
>> 08/18/2011 14:27:40 S forking in send_job
>> 08/18/2011 14:27:40 S entering post_sendmom
>> 08/18/2011 14:27:40 S child reported success for job after 0 seconds (dest=hosta-0), rc=0
>> 08/18/2011 14:27:40 M removed job script
>>
>> ------------------------------
>>
>> [root at localhost ~]# qstat -n
>>
>> localhost:
>>
>>                                                                          Req'd  Req'd   Elap
>> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>> 68.localhost         user1    batch    STDIN                --     2   2     -- 01:00 R    --
>>    hosta-1/0+hosta-0/0
>> 69.localhost         user1    batch    STDIN                --     2   2     -- 01:00 R    --
>>    hosta-3/0+hosta-2/0
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
--
Dept of Computational Biophysics and Bioinformatics,
Faculty of Biochemistry, Biophysics and Biotechnology,
Jagiellonian University,
ul. Gronostajowa 7,
30-387 Krakow, Poland.
Tel: (+48-12)-664-6380