[torqueusers] transient errors - no password entry for user someuser

Alexander Piavlo lolitushka at gmail.com
Thu Jan 17 03:02:36 MST 2008


 Hi, recently any node can sporadically fail to run a single job with
the error "start_exec, no password entry for user someuser", while the
same node runs the next job of the same user without problems.
The pbs_mom nodes use NIS with the nscd caching daemon for user auth.
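
One way to check whether the password lookup itself fails intermittently on
a node, independent of pbs_mom, is to repeat the lookup and count the misses.
The following is only a minimal sketch, assuming Python is available on the
node. The function name lookup_with_retry and its attempts/delay parameters
are hypothetical and not part of TORQUE; pwd.getpwnam is the standard library
call, which goes through the same name service (NIS/nscd here) as other
user-space lookups.

```python
import pwd
import time


def lookup_with_retry(user, lookup=pwd.getpwnam, attempts=3, delay=0.5):
    """Return (entry, failures) for a passwd lookup of `user`.

    `entry` is the passwd entry, or None if every attempt failed.
    `failures` is the number of attempts that raised KeyError, which is
    how pwd.getpwnam reports "no password entry" for a name.
    """
    failures = 0
    for _ in range(attempts):
        try:
            return lookup(user), failures
        except KeyError:
            # No entry was returned this time; wait briefly and try again.
            failures += 1
            time.sleep(delay)
    return None, failures
```

If a call such as lookup_with_retry("amilev") returns an entry with a nonzero
failure count, the lookup failed at least once and then succeeded, which would
point at the NIS/nscd layer rather than at pbs_mom.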

A couple of sample pbs_mom logs:
------------------------------------------------------------------------
Jan 16 15:27:43 localhost pbs_mom: start_exec, no password entry for user amilev
Jan 16 15:27:43 localhost pbs_mom: open_std_file, cannot determine filename
Jan 16 15:27:43 localhost pbs_mom: open_std_file, cannot determine filename
Jan 16 15:27:43 localhost pbs_mom: No such file or directory (2) in
fork_to_user, cannot find user 'amilev' in password file
Jan 16 15:27:43 localhost pbs_mom: Inappropriate ioctl for device (25)
in req_cpyfile, fork_to_user failed with rc=-15023 'cannot find user
'amilev' in password file' - returning failure
------------------------------------------------------------------------
Jan 16 12:42:25 localhost pbs_mom: start_exec, no password entry for user amilev
Jan 16 12:42:25 localhost pbs_mom: open_std_file, cannot determine filename
Jan 16 12:42:25 localhost pbs_mom: open_std_file, cannot determine filename
Jan 16 12:42:29 localhost pbs_mom: sys_copy, command '/bin/cp -rp
/var/spool/pbs/spool/4071.clustr.OU
/users/studs/phd/amilev/freespace/meshi/current/tests/HydrogenTest/1WAP/RunOneTimeClustrix.csh.o4071'
failed with status=1, giving up after 4 attempts
Jan 16 12:42:29 localhost pbs_mom: req_cpyfile, Unable to copy file
/var/spool/pbs/spool/4071.clustr.OU to
/users/studs/phd/amilev/freespace/meshi/current/tests/HydrogenTest/1WAP/RunOneTimeClustrix.csh.o4071
Jan 16 12:42:29 localhost pbs_mom: No such file or directory (2) in
req_cpyfile, Unable to rename /var/spool/pbs/spool/4071.clustr.OU to
/var/spool/pbs/undelivered/4071.clustr.OU
Jan 16 12:42:33 localhost pbs_mom: sys_copy, command '/bin/cp -rp
/var/spool/pbs/spool/4071.clustr.ER
/users/studs/phd/amilev/freespace/meshi/current/tests/HydrogenTest/1WAP/RunOneTimeClustrix.csh.e4071'
failed with status=1, giving up after 4 attempts
Jan 16 12:42:33 localhost pbs_mom: req_cpyfile, Unable to copy file
/var/spool/pbs/spool/4071.clustr.ER to
/users/studs/phd/amilev/freespace/meshi/current/tests/HydrogenTest/1WAP/RunOneTimeClustrix.csh.e4071
Jan 16 12:42:33 localhost pbs_mom: No such file or directory (2) in
req_cpyfile, Unable to rename /var/spool/pbs/spool/4071.clustr.ER to
/var/spool/pbs/undelivered/4071.clustr.ER
------------------------------------------------------------------------

 Any ideas?

 Thanks
 Alex

