[torqueusers] pbs_mom crashes

James J Coyle jjc at iastate.edu
Tue Feb 26 11:45:56 MST 2008


Jozef,

  I'd check the /etc/passwd file on that node and compare it to the one on
some other node in the cluster.  Note that the fields are separated by
colons (:).

  It looks like some user has a home directory of /bin/sh, perhaps because
a field is missing from that user's line.  The next-to-last entry in a line
of /etc/passwd should be the home directory, and the last entry should be
the login shell for the user.
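
  As a rough way to find the offending line, something like the Python
sketch below could be run on the bad node.  It is only an illustration, not
part of Torque: the function names are made up here, and it assumes the
usual seven colon-separated fields per line.

```python
import os


def parse_passwd_line(line):
    """Split one /etc/passwd line into its seven colon-separated fields."""
    fields = line.rstrip("\n").split(":")
    if len(fields) != 7:
        return None  # malformed line: wrong number of fields
    name, _password, uid, gid, gecos, home, shell = fields
    return {"name": name, "uid": uid, "gid": gid,
            "gecos": gecos, "home": home, "shell": shell}


def find_bad_homes(lines, is_dir=os.path.isdir):
    """Return (line number, subject, reason) for each suspicious entry.

    is_dir is a parameter only so the check can be tried out without
    touching the real filesystem; by default it is os.path.isdir.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        entry = parse_passwd_line(line)
        if entry is None:
            problems.append((number, line.rstrip("\n"), "expected 7 fields"))
        elif not is_dir(entry["home"]):
            reason = "home %r is not a directory" % entry["home"]
            problems.append((number, entry["name"], reason))
    return problems
```

  Calling find_bad_homes(open("/etc/passwd")) on the node lists the
suspicious entries.  Some system accounts may legitimately show up if their
home directory does not exist, so the user who owns the failing job is the
entry to look at first.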

  You might just need to copy the /etc/passwd from one node to another
if they should be the same.
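
  Before overwriting anything, it may be worth seeing which entries actually
differ between a good node and the bad one.  The sketch below is one way to
do that in Python; the function names are hypothetical, and it assumes you
have already copied both files somewhere you can read them.

```python
def index_by_name(lines):
    """Map each username (first field) to its full passwd line."""
    entries = {}
    for line in lines:
        line = line.rstrip("\n")
        if line and not line.startswith("#"):
            entries[line.split(":", 1)[0]] = line
    return entries


def diff_passwd(good_lines, suspect_lines):
    """Return (name, good entry, suspect entry) for every user that differs.

    An entry of None means the user is missing from that file.
    """
    good = index_by_name(good_lines)
    suspect = index_by_name(suspect_lines)
    differences = []
    for name in sorted(set(good) | set(suspect)):
        if good.get(name) != suspect.get(name):
            differences.append((name, good.get(name), suspect.get(name)))
    return differences
```

  Each tuple returned names a user whose line is different or missing on one
of the two nodes, which should point straight at the entry to fix.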


See http://www.cyberciti.biz/faq/understanding-etcpasswd-file-format/
for an explanation of the format of /etc/passwd.

 - Jim C.


-- 
 James Coyle, PhD
 High Performance Computing Group     
 235 Durham Center            
 Iowa State Univ.           phone: (515)-294-2099
 Ames, Iowa 50011           web: http://www.public.iastate.edu/~jjc


> Hello,
> 
> I have a problem with one node. pbs_mom crashes every time a job is run on
> that node (not submitted from it; Torque just decides to run it there).
> There's also a problem when a process is run on that node: I found
> that this machine causes the job to stay in state 'running'.
> 
> I searched mom_logs and I'm curious about this line:
> invalid home directory '/bin/sh' specified, not a directory
> What does it mean?
> I have a working torque/maui environment with NFS enabled. I'm running the
> compiled program, which you can find in my post to the list
> with subject "Torque with Open MPI". Every node has been configured the same
> way so I don't understand why this is happening.
> 
> Thank you for a reply.
> Jozef
> 
> 02/26/2008 16:27:38;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
> 02/26/2008 16:27:38;0100;   pbs_mom;Req;;Type JobScript request received from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
> 02/26/2008 16:27:38;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
> 02/26/2008 16:27:38;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
> 02/26/2008 16:27:38;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
> 02/26/2008 16:27:38;0100;   pbs_mom;Req;;Type ModifyJob request received from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=14
> 02/26/2008 16:27:38;0008;   pbs_mom;Job;164.f135-3;Job Modified at request of PBS_Server at f135-3.informatika.fpv.umb.sk
> 02/26/2008 16:27:38;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry (see syslog for more information)
> 02/26/2008 16:27:38;0001;   pbs_mom;Job;164.f135-3;ALERT:  job failed phase 3 start
> 02/26/2008 16:27:38;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> 02/26/2008 16:27:38;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 02/26/2008 16:27:38;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 02/26/2008 16:27:38;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 02/26/2008 16:27:38;0008;   pbs_mom;Job;scan_for_terminated;checking job post-processing routine
> 02/26/2008 16:27:38;0080;   pbs_mom;Job;164.f135-3;obit sent to server
> 02/26/2008 16:27:38;0100;   pbs_mom;Req;;Type CopyFiles request received from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
> 02/26/2008 16:27:38;0001;   pbs_mom;Svr;pbs_mom;Unknown resource type (15035) in fork_to_user, invalid home directory '/bin/sh' specified, not a directory
> 02/26/2008 16:27:38;0080;   pbs_mom;Req;req_reject;Reject reply code=15035(Unknown resource type  REJHOST=f135-13.informatika.fpv.umb.sk MSG=invalid home directory '/bin/sh' specified, not a directory), aux=0, type=CopyFiles, from PBS_Server at f135-3.informatika.fpv.umb.sk
> 02/26/2008 16:27:38;0001;   pbs_mom;Svr;pbs_mom;Inappropriate ioctl for device (25) in req_cpyfile, fork_to_user failed with rc=-15035 'invalid home directory '/bin/sh' specified, not a directory' - exiting
> 