[torqueusers] pbs_mom crashes
James J Coyle
jjc at iastate.edu
Tue Feb 26 11:45:56 MST 2008
Jozef,
I'd check the /etc/passwd file on that node and compare it with the one on
another node in the cluster. The fields in each line are separated by colons (:).
It looks like some user's entry has /bin/sh in the home directory field.
The next-to-last field in a line of /etc/passwd should be the home directory,
and the last field should be the user's login shell.
If the files are supposed to be identical, you might just need to copy
/etc/passwd over from a node that works.
See http://www.cyberciti.biz/faq/understanding-etcpasswd-file-format/
for an explanation of the format of /etc/passwd.
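
The check described above could be scripted roughly as follows. This is only a
sketch in Python, not anything that ships with Torque: the function names
find_suspect_entries and differing_users are made up for illustration, and it
assumes the standard seven-field /etc/passwd layout with the home directory in
the sixth field.

```python
import os


def _entries(lines):
    """Yield stripped, non-empty, non-comment lines of a passwd-style file."""
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() and not line.startswith("#"):
            yield line


def find_suspect_entries(lines, isdir=os.path.isdir):
    """Return (username, reason) pairs for entries that look wrong.

    An entry is flagged when it does not have exactly seven
    colon-separated fields, or when its home directory field
    (the next-to-last one) is not an existing directory.
    """
    suspects = []
    for line in _entries(lines):
        fields = line.split(":")
        if len(fields) != 7:
            suspects.append(
                (fields[0], "expected 7 fields, found %d" % len(fields)))
            continue
        home = fields[5]
        if not isdir(home):
            suspects.append((fields[0], "home %r is not a directory" % home))
    return suspects


def differing_users(lines_a, lines_b):
    """Return the sorted usernames whose entries differ between two files."""
    def by_name(lines):
        return {line.split(":", 1)[0]: line for line in _entries(lines)}

    a, b = by_name(lines_a), by_name(lines_b)
    return sorted(name for name in a.keys() | b.keys()
                  if a.get(name) != b.get(name))
```

On the bad node you would call find_suspect_entries(open("/etc/passwd")), and
to compare against a copy fetched from a good node, pass both open files to
differing_users. Some system accounts legitimately list a home directory that
does not exist, so a hit for those is not necessarily a problem; the entry to
look for is the one belonging to the user who submits the jobs.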
- Jim C.
--
James Coyle, PhD
High Performance Computing Group
235 Durham Center
Iowa State Univ. phone: (515)-294-2099
Ames, Iowa 50011 web: http://www.public.iastate.edu/~jjc
> Hello,
>
> I have a problem with one node. pbs_mom crashes every time a job is run on
> that node (not physically, just torque decides to run it there).
> There's also a problem when it comes to running a process on that node. I found
> out that this machine causes the job to stay in state 'running'.
>
> I searched mom_logs and I'm curious about this line:
> invalid home directory '/bin/sh' specified, not a directory
> What does it mean?
> I have a working torque/maui environment with NFS enabled. I'm running the
> compiled program, which you can find in my post to the list
> with subject "Torque with Open MPI". Every node has been configured the same
> way so I don't understand why this is happening.
>
> Thank you for a reply.
> Jozef
>
> 02/26/2008 16:27:38;0100; pbs_mom;Req;;Type QueueJob request received from
> PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
> 02/26/2008 16:27:38;0100; pbs_mom;Req;;Type JobScript request received
> from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
> 02/26/2008 16:27:38;0100; pbs_mom;Req;;Type ReadyToCommit request received
> from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
> 02/26/2008 16:27:38;0100; pbs_mom;Req;;Type Commit request received from
> PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
> 02/26/2008 16:27:38;0100; pbs_mom;Req;;Type StatusJob request received
> from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
> 02/26/2008 16:27:38;0100; pbs_mom;Req;;Type ModifyJob request received
> from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=14
> 02/26/2008 16:27:38;0008; pbs_mom;Job;164.f135-3;Job Modified at request
> of PBS_Server at f135-3.informatika.fpv.umb.sk
> 02/26/2008 16:27:38;0001; pbs_mom;Job;TMomFinalizeJob3;job not started,
> Failure job exec failure, after files staged, no retry (see syslog for more
> information)
> 02/26/2008 16:27:38;0001; pbs_mom;Job;164.f135-3;ALERT: job failed phase
> 3 start
> 02/26/2008 16:27:38;0008; pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 02/26/2008 16:27:38;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 02/26/2008 16:27:38;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of
> while loop
> 02/26/2008 16:27:38;0080; pbs_mom;Svr;preobit_reply;in while loop, no
> error from job stat
> 02/26/2008 16:27:38;0008; pbs_mom;Job;scan_for_terminated;checking job
> post-processing routine
> 02/26/2008 16:27:38;0080; pbs_mom;Job;164.f135-3;obit sent to server
> 02/26/2008 16:27:38;0100; pbs_mom;Req;;Type CopyFiles request received
> from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
> 02/26/2008 16:27:38;0001; pbs_mom;Svr;pbs_mom;Unknown resource type
> (15035) in fork_to_user, invalid home directory '/bin/sh' specified,
> not a directory
> 02/26/2008 16:27:38;0080; pbs_mom;Req;req_reject;Reject reply
> code=15035(Unknown resource type REJHOST=f135-13.informatika.fpv.umb.sk
> MSG=invalid home directory '/bin/sh' specified, not a directory), aux=0,
> type=CopyFiles, from PBS_Server at f135-3.informatika.fpv.umb.sk
> 02/26/2008 16:27:38;0001; pbs_mom;Svr;pbs_mom;Inappropriate ioctl for
> device (25) in req_cpyfile, fork_to_user failed with rc=-15035 'invalid
> home directory '/bin/sh' specified, not a directory' - exiting
>