[torqueusers] Problem with one node : " pbs_mom; Job; 46.master; task not started, '/bin/sh', stdio setup failed (see syslog) "

Abraham Zamudio abraham.zamudio at gmail.com
Tue Sep 28 09:34:28 MDT 2010


The output of the 46.master job

mpiexec: Warning: poll_or_block_event: evt 10 task 8 on quad2: remote system error.
mpiexec: Warning: poll_or_block_event: evt 11 task 9 on quad2: remote system error.
mpiexec: Warning: poll_or_block_event: evt 12 task 10 on quad2: remote system error.
mpiexec: Warning: poll_or_block_event: evt 13 task 11 on quad2: remote system error.
=>> PBS: job killed: walltime 3633 exceeded limit 3600
mpiexec: Warning: task 0 exited before completing MPI startup.

The problem apparently lies with mpich2 when the job spans two nodes.
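
The syslog quoted below reports errno 113 (No route to host) when pbs_mom on quad2 connects back to 10.10.10.3, so it may be worth checking plain TCP reachability between the nodes before concluding that mpich2 is at fault. Below is a minimal sketch in Python, meant to be run on quad2. The function name `probe` is hypothetical and is not part of TORQUE or mpiexec. The port that mpiexec listens on changes with every job, so the port passed in is a placeholder for one that is known to be listening on 10.10.10.3.

```python
import errno
import socket


def probe(host, port, timeout=3.0):
    """Attempt a TCP connection; return 'ok' or a symbolic error name."""
    try:
        # create_connection resolves the host and connects with a timeout
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        # no answer at all within the timeout (e.g. packets silently dropped)
        return "TIMEOUT"
    except OSError as exc:
        # map the numeric errno (113 on Linux is EHOSTUNREACH) to its name
        return errno.errorcode.get(exc.errno, str(exc))
```

A result of `EHOSTUNREACH` from `probe("10.10.10.3", <port>)` on quad2 would match the syslog message and point to routing or packet filtering between the nodes rather than to mpich2. That cause is only a possibility and is not confirmed anywhere in this thread.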

On Tue, Sep 28, 2010 at 10:24 AM, Abraham Zamudio <abraham.zamudio at gmail.com> wrote:

> Syslog:
>
> [root at quad2 mpiX]# cat /var/log/messages | grep pbs
> Sep 28 08:48:29 quad2 pbs_mom: LOG_ERROR::No route to host (113) in
> open_demux, open_demux: connect 10.10.10.3:51586
> Sep 28 08:48:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in search_env_and_open, failed connect to stdio on
> MPIEXEC_STDOUT_PORT=51586:51586
> Sep 28 08:48:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in start_process, cannot locate MPIEXEC_STDOUT_PORT
> Sep 28 08:48:29 quad2 pbs_mom: LOG_ERROR::No route to host (113) in
> open_demux, open_demux: connect 10.10.10.3:51586
> Sep 28 08:48:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in search_env_and_open, failed connect to stdio on
> MPIEXEC_STDOUT_PORT=51586:51586
> Sep 28 08:48:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in start_process, cannot locate MPIEXEC_STDOUT_PORT
> Sep 28 08:48:29 quad2 pbs_mom: LOG_ERROR::No route to host (113) in
> open_demux, open_demux: connect 10.10.10.3:51586
> Sep 28 08:48:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in search_env_and_open, failed connect to stdio on
> MPIEXEC_STDOUT_PORT=51586:51586
> Sep 28 08:48:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in start_process, cannot locate MPIEXEC_STDOUT_PORT
> Sep 28 08:48:29 quad2 pbs_mom: LOG_ERROR::No route to host (113) in
> open_demux, open_demux: connect 10.10.10.3:51586
> Sep 28 08:48:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in search_env_and_open, failed connect to stdio on
> MPIEXEC_STDOUT_PORT=51586:51586
> Sep 28 08:48:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in start_process, cannot locate MPIEXEC_STDOUT_PORT
> Sep 28 09:12:07 quad2 pbs_mom: LOG_ERROR::No route to host (113) in
> open_demux, open_demux: connect 10.10.10.3:34675
> Sep 28 09:12:07 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in search_env_and_open, failed connect to stdio on
> MPIEXEC_STDOUT_PORT=34675:34675
> Sep 28 09:12:07 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in start_process, cannot locate MPIEXEC_STDOUT_PORT
> Sep 28 09:12:07 quad2 pbs_mom: LOG_ERROR::No route to host (113) in
> open_demux, open_demux: connect 10.10.10.3:34675
> Sep 28 09:12:07 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in search_env_and_open, failed connect to stdio on
> MPIEXEC_STDOUT_PORT=34675:34675
> Sep 28 09:12:07 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in start_process, cannot locate MPIEXEC_STDOUT_PORT
> Sep 28 09:12:07 quad2 pbs_mom: LOG_ERROR::No route to host (113) in
> open_demux, open_demux: connect 10.10.10.3:34675
> Sep 28 09:12:07 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in search_env_and_open, failed connect to stdio on
> MPIEXEC_STDOUT_PORT=34675:34675
> Sep 28 09:12:07 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in start_process, cannot locate MPIEXEC_STDOUT_PORT
> Sep 28 09:12:07 quad2 pbs_mom: LOG_ERROR::No route to host (113) in
> open_demux, open_demux: connect 10.10.10.3:34675
> Sep 28 09:12:07 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in search_env_and_open, failed connect to stdio on
> MPIEXEC_STDOUT_PORT=34675:34675
> Sep 28 09:12:07 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in start_process, cannot locate MPIEXEC_STDOUT_PORT
> Sep 28 09:29:29 quad2 pbs_mom: LOG_ERROR::No route to host (113) in
> open_demux, open_demux: connect 10.10.10.3:48625
> Sep 28 09:29:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in search_env_and_open, failed connect to stdio on
> MPIEXEC_STDOUT_PORT=48625:48625
> Sep 28 09:29:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in start_process, cannot locate MPIEXEC_STDOUT_PORT
> Sep 28 09:29:29 quad2 pbs_mom: LOG_ERROR::No route to host (113) in
> open_demux, open_demux: connect 10.10.10.3:48625
> Sep 28 09:29:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in search_env_and_open, failed connect to stdio on
> MPIEXEC_STDOUT_PORT=48625:48625
> Sep 28 09:29:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in start_process, cannot locate MPIEXEC_STDOUT_PORT
> Sep 28 09:29:29 quad2 pbs_mom: LOG_ERROR::No route to host (113) in
> open_demux, open_demux: connect 10.10.10.3:48625
> Sep 28 09:29:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in search_env_and_open, failed connect to stdio on
> MPIEXEC_STDOUT_PORT=48625:48625
> Sep 28 09:29:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in start_process, cannot locate MPIEXEC_STDOUT_PORT
> Sep 28 09:29:29 quad2 pbs_mom: LOG_ERROR::No route to host (113) in
> open_demux, open_demux: connect 10.10.10.3:48625
> Sep 28 09:29:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in search_env_and_open, failed connect to stdio on
> MPIEXEC_STDOUT_PORT=48625:48625
> Sep 28 09:29:29 quad2 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in start_process, cannot locate MPIEXEC_STDOUT_PORT
>
>
>
> My /etc/hosts:
>
> [root at quad2 mpiX]# cat /etc/hosts
> #127.0.0.1   jro-operations localhost localhost.localdomain localhost4 localhost4.localdomain4
> #::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
> 127.0.0.1 localhost
>
> 10.10.10.241 master
>
>
> ############# NODOS MPICH V2 ########################
> 10.10.10.3 quad4
> 10.10.10.4 quad2 jro-operations localhost localhost.localdomain
> 10.10.10.236 gauss
> ############# NODOS MPICH V2 ########################
>
>
>
> On Tue, Sep 28, 2010 at 10:17 AM, Abraham Zamudio <
> abraham.zamudio at gmail.com> wrote:
>
>> The output of qstat:
>>
>> [mpiX at master mpi_fitting]$ qstat
>> Job id                    Name             User            Time Use S Queue
>> ------------------------- ---------------- --------------- -------- - -----
>> 46.master                 mpi_fitting      mpiX            00:00:00 R batch
>>
>>
>> I will ask the administrator for permission to view the syslog
>> (/var/log/messages).
>>
>>
>> On Tue, Sep 28, 2010 at 10:04 AM, Ken Nielson <
>> knielson at adaptivecomputing.com> wrote:
>>
>>>  On 09/28/2010 08:57 AM, Abraham Zamudio wrote:
>>>
>>> Hi everybody,
>>>
>>> I have a problem with one of my nodes:
>>>
>>> [mpiX at quad2 ~]$ cat /var/spool/torque/mom_logs/20100928 | grep 46.master
>>> 09/28/2010 09:29:29;0008;   pbs_mom;Job;46.master;JOIN JOB as node 1
>>> 09/28/2010 09:29:29;0001;   pbs_mom;Job;46.master;task not started, '/bin/sh', stdio setup failed (see syslog)
>>> 09/28/2010 09:29:29;0008;   pbs_mom;Job;46.master;ERROR:    received request 'SPAWN_TASK' from 10.10.10.3:1023 for job '46.master' (cannot start task)
>>> 09/28/2010 09:29:29;0001;   pbs_mom;Job;46.master;task not started, '/bin/sh', stdio setup failed (see syslog)
>>> 09/28/2010 09:29:29;0008;   pbs_mom;Job;46.master;ERROR:    received request 'SPAWN_TASK' from 10.10.10.3:1023 for job '46.master' (cannot start task)
>>> 09/28/2010 09:29:29;0001;   pbs_mom;Job;46.master;task not started, '/bin/sh', stdio setup failed (see syslog)
>>> 09/28/2010 09:29:29;0008;   pbs_mom;Job;46.master;ERROR:    received request 'SPAWN_TASK' from 10.10.10.3:1023 for job '46.master' (cannot start task)
>>> 09/28/2010 09:29:29;0001;   pbs_mom;Job;46.master;task not started, '/bin/sh', stdio setup failed (see syslog)
>>> 09/28/2010 09:29:29;0008;   pbs_mom;Job;46.master;ERROR:    received request 'SPAWN_TASK' from 10.10.10.3:1023 for job '46.master' (cannot start task)
>>>
>>> The job's status is active:
>>>
>>> [mpiX at master mpi_fitting]$ showq
>>> ACTIVE JOBS--------------------
>>> JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
>>>
>>> 46                     mpiX    Running    12    00:35:52  Tue Sep 28 09:32:56
>>>
>>>      1 Active Job       12 of   12 Processors Active (100.00%)
>>>                          2 of    2 Nodes Active      (100.00%)
>>>
>>> IDLE JOBS----------------------
>>> JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
>>>
>>>
>>> 0 Idle Jobs
>>>
>>> BLOCKED JOBS----------------
>>> JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
>>>
>>>
>>> Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0
>>>
>>> The same software (mpich2 + gsl) runs on a single node with 8 cores. The
>>> problem occurs only when two nodes are used.
>>>
>>>
>>>
>>> --
>>> Abraham Zamudio Ch.
>>>
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>>  What does qstat show? Did you look at syslog?
>>>
>>> Ken Nielson
>>> Adaptive Computing
>>>
>>>
>>>
>
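
In the /etc/hosts listing quoted above, the name `localhost` is attached both to 127.0.0.1 and to the 10.10.10.4 line for quad2. Whether this contributes to the failure is not established in this thread, but duplicated names are easy to detect. Below is a minimal sketch in Python. The helper name `duplicate_names` is hypothetical, and the sample text reproduces only the uncommented entries from the listing.

```python
from collections import defaultdict


def duplicate_names(hosts_text):
    """Return {name: [addresses]} for names mapped to more than one address."""
    seen = defaultdict(list)
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        addr, *names = line.split()
        for name in names:
            if addr not in seen[name]:
                seen[name].append(addr)
    return {name: addrs for name, addrs in seen.items() if len(addrs) > 1}


# Uncommented entries copied from the /etc/hosts listing on quad2
SAMPLE = """\
127.0.0.1 localhost
10.10.10.241 master
10.10.10.3 quad4
10.10.10.4 quad2 jro-operations localhost localhost.localdomain
10.10.10.236 gauss
"""

print(duplicate_names(SAMPLE))
# → {'localhost': ['127.0.0.1', '10.10.10.4']}
```

On a real node the same function could be given the contents of /etc/hosts instead of the sample string.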


-- 
Abraham Zamudio Ch.