[torqueusers] pbs_mom crashes

Jozef Káčer quickparser at gmail.com
Tue Feb 26 08:29:43 MST 2008


Hello,

I have a problem with one node. pbs_mom crashes every time a job is run on
that node (I don't submit from it physically; torque just decides to run the
job there). Processes that should run on that node are affected as well: I
found out that this machine causes the job to stay in the 'running' state.

I searched mom_logs and I'm curious about this line:
invalid home directory '/bin/sh' specified, not a directory
What does it mean?
I have a working torque/maui environment with NFS enabled. I'm running the
compiled program that you can find in my post to the list with the subject
"Torque with Open MPI". Every node has been configured the same way, so I
don't understand why this is happening.
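
Going only by the wording of the message, pbs_mom seems to be treating
'/bin/sh' as the job owner's home directory on that node. That is an
assumption drawn from the log text, not something confirmed. A minimal sketch
of how the home-directory field of the local passwd database could be checked
on the failing node follows; the helper names find_bad_home_dirs and
local_passwd_entries are made up for illustration and are not part of torque.

```python
import os
import pwd
from typing import Iterable, List, Tuple


def find_bad_home_dirs(entries: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the (user, home) pairs whose home field is not an existing directory."""
    return [(user, home) for user, home in entries if not os.path.isdir(home)]


def local_passwd_entries() -> List[Tuple[str, str]]:
    """Read (user, home) pairs from the passwd database as seen on this node."""
    return [(p.pw_name, p.pw_dir) for p in pwd.getpwall()]


if __name__ == "__main__":
    # Hypothetical usage: run on the failing node and on a healthy node,
    # then compare the two outputs.
    for user, home in find_bad_home_dirs(local_passwd_entries()):
        print(f"{user}: home '{home}' is not a directory")
```

System accounts often show up in such a list for harmless reasons, so only
the entry of the user who submits the jobs is of interest here.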

Thank you for any reply.
Jozef

02/26/2008 16:27:38;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
02/26/2008 16:27:38;0100;   pbs_mom;Req;;Type JobScript request received from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
02/26/2008 16:27:38;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
02/26/2008 16:27:38;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
02/26/2008 16:27:38;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
02/26/2008 16:27:38;0100;   pbs_mom;Req;;Type ModifyJob request received from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=14
02/26/2008 16:27:38;0008;   pbs_mom;Job;164.f135-3;Job Modified at request of PBS_Server at f135-3.informatika.fpv.umb.sk
02/26/2008 16:27:38;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry (see syslog for more information)
02/26/2008 16:27:38;0001;   pbs_mom;Job;164.f135-3;ALERT:  job failed phase 3 start
02/26/2008 16:27:38;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
02/26/2008 16:27:38;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
02/26/2008 16:27:38;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
02/26/2008 16:27:38;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
02/26/2008 16:27:38;0008;   pbs_mom;Job;scan_for_terminated;checking job post-processing routine
02/26/2008 16:27:38;0080;   pbs_mom;Job;164.f135-3;obit sent to server
02/26/2008 16:27:38;0100;   pbs_mom;Req;;Type CopyFiles request received from PBS_Server at f135-3.informatika.fpv.umb.sk, sock=10
02/26/2008 16:27:38;0001;   pbs_mom;Svr;pbs_mom;Unknown resource type (15035) in fork_to_user, invalid home directory '/bin/sh' specified, not a directory
02/26/2008 16:27:38;0080;   pbs_mom;Req;req_reject;Reject reply code=15035(Unknown resource type  REJHOST=f135-13.informatika.fpv.umb.sk MSG=invalid home directory '/bin/sh' specified, not a directory), aux=0, type=CopyFiles, from PBS_Server at f135-3.informatika.fpv.umb.sk
02/26/2008 16:27:38;0001;   pbs_mom;Svr;pbs_mom;Inappropriate ioctl for device (25) in req_cpyfile, fork_to_user failed with rc=-15035 'invalid home directory '/bin/sh' specified, not a directory' - exiting