[torqueusers] What's the cause of toolong execv?
Ronny T. Lampert
telecaadmin at gmail.com
Tue Jul 31 02:25:07 MDT 2007
> My /var/spool/torque/sched_logs/ has the following messages, and then the
> scheduler died. What causes this? The message is unclear to me, and even
> Google does not help much.
> 07/28/2007 19:12:17;0002; pbs_sched;Svr;Log;Log opened
> 07/28/2007 19:12:17;0002; pbs_sched;Svr;toolong;alarm call
> 07/28/2007 19:12:17;0002; pbs_sched;Svr;Log;Log closed
> 07/28/2007 19:12:17;0002; pbs_sched;Svr;toolong;restart dir /root object
> 07/28/2007 19:12:17;0001; pbs_sched;Svr;pbs_sched;No such file or
> directory (2) in toolong, execv
A grep through the sources would help you here.
"toolong" means a scheduling iteration took too long to complete, so
pbs_sched decides it is best to restart itself (the pbs_sched sources
contain a function with that name, toolong()).
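To illustrate the mechanism, here is a minimal sketch (not the actual
pbs_sched code): arm alarm() before a scheduling cycle, and if SIGALRM
arrives before the cycle finishes, the handler takes over. The names
run_with_alarm(), quick_cycle() and slow_cycle() are made up for this
example, and where the real toolong() logs and re-executes the scheduler,
the sketch simply returns 1.

```c
#define _POSIX_C_SOURCE 200809L  /* POSIX declarations under strict -std= modes */
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static sigjmp_buf timed_out;

/* SIGALRM handler: plays the role of toolong(), but only jumps back. */
static void on_alarm(int sig)
{
    (void)sig;
    siglongjmp(timed_out, 1);
}

/* Hypothetical stand-ins for one scheduling iteration. */
static void quick_cycle(void) { }
static void slow_cycle(void)  { sleep(3); }

/* Run work() with a time limit in seconds.
 * Returns 0 if it finished in time, 1 if the alarm fired, -1 on error. */
static int run_with_alarm(unsigned limit, void (*work)(void))
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    /* savemask=1 so SIGALRM is unblocked again after the jump */
    if (sigsetjmp(timed_out, 1) != 0)
        return 1;       /* the real scheduler would restart itself here */

    alarm(limit);       /* start the watchdog */
    work();
    alarm(0);           /* finished in time: cancel the watchdog */
    return 0;
}
```

With these stand-ins, run_with_alarm(2, quick_cycle) should return 0 and
run_with_alarm(1, slow_cycle) should return 1 after about one second.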
As you notice, execv() (man execv) does NOT search your $PATH; it uses
the path exactly as given, and a relative path is resolved against the
current working directory. Your log shows the restart dir is /root, so
for you this means it tries to execute /root/pbs_sched, which, of
course, isn't there.
To remedy this:
1) Increase the time pbs_sched allows for a scheduling iteration, via the
-a command-line option. I think I had it running quite well with -a 600.
2) Use Maui or Moab instead of pbs_sched.
3) You could also patch pbs_sched to use execvp() instead of execv(), so
it searches your $PATH in case of another toolong!
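The difference between the two calls can be seen in a small standalone
sketch. This is not pbs_sched code: try_exec() is a hypothetical helper,
and "true" stands in for pbs_sched as a program that lives somewhere in
$PATH but not in the directory the process is running in.

```c
#define _POSIX_C_SOURCE 200809L  /* POSIX declarations under strict -std= modes */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, chdir to `dir`, then try to exec `name` with execv()
 * (search_path == 0) or execvp() (search_path != 0).
 * Returns the child's exit status: whatever the program returned if the
 * exec worked, 127 if the exec call itself failed, 126 if chdir failed,
 * and -1 on fork/wait errors. */
static int try_exec(const char *dir, const char *name, int search_path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        char *argv[] = { (char *)name, NULL };

        if (chdir(dir) != 0)
            _exit(126);
        if (search_path)
            execvp(name, argv);   /* bare name: looked up in $PATH */
        else
            execv(name, argv);    /* bare name: relative to the cwd only */

        /* only reached when the exec call failed */
        fprintf(stderr, "%s(\"%s\"): %s (%d)\n",
                search_path ? "execvp" : "execv",
                name, strerror(errno), errno);
        _exit(127);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Assuming there is no file called /true, try_exec("/", "true", 0) should
fail with "No such file or directory (2)", the same error as in the
scheduler log, and return 127, while try_exec("/", "true", 1) should find
true via $PATH and return 0.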