[torqueusers] What's the cause of toolong execv?

Ronny T. Lampert telecaadmin at gmail.com
Tue Jul 31 02:25:07 MDT 2007


> My /var/spool/torque/sched_logs/ has the following message, and then the
> scheduler died. What causes this? The message is unclear to me and even
> Google does not help much.
> 
> 
> 07/28/2007 19:12:17;0002; pbs_sched;Svr;Log;Log opened
> 07/28/2007 19:12:17;0002; pbs_sched;Svr;toolong;alarm call
> 07/28/2007 19:12:17;0002; pbs_sched;Svr;Log;Log closed
> 07/28/2007 19:12:17;0002; pbs_sched;Svr;toolong;restart dir /root object
> pbs_sched
> 07/28/2007 19:12:17;0001; pbs_sched;Svr;pbs_sched;No such file or
> directory (2) in toolong, execv

A quick grep through the sources would help you here.
"toolong" means a scheduling iteration took too long to complete, so 
pbs_sched decides it is best to restart itself (there is a function with 
that name, toolong(), in the pbs_sched sources).
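
The mechanism can be sketched roughly like this. It is NOT the pbs_sched 
source, just a minimal illustration of an alarm()-based watchdog around one 
iteration. All names (run_with_alarm, on_alarm, fast_iteration, 
slow_iteration) are made up, and where the real toolong() restarts the 
scheduler, this sketch only sets a flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Set by the SIGALRM handler when an iteration overruns its time budget. */
volatile sig_atomic_t alarm_fired = 0;

/* Hypothetical stand-in for toolong(): the real function restarts the
 * scheduler; here we only record that the alarm went off. */
void on_alarm(int sig)
{
    (void)sig;
    alarm_fired = 1;
}

/* Run one "scheduling iteration" with a time limit in seconds.
 * Returns 1 if the alarm fired before the iteration returned,
 * 0 if it finished in time, -1 if the handler could not be installed. */
int run_with_alarm(unsigned seconds, void (*iteration)(void))
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    alarm_fired = 0;
    alarm(seconds);   /* arm the watchdog */
    iteration();
    alarm(0);         /* iteration returned: disarm the watchdog */
    return alarm_fired;
}

/* Example iterations, for illustration only. */
void fast_iteration(void)
{
    /* returns immediately */
}

void slow_iteration(void)
{
    struct timespec ts = { 3, 0 };
    nanosleep(&ts, NULL);   /* interrupted early when SIGALRM arrives */
}
```

With a limit of 1 second, run_with_alarm(1, fast_iteration) should report 
that no alarm fired, while run_with_alarm(1, slow_iteration) should report 
that it did.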

Note that execv() (man execv) does NOT search your $PATH; it executes 
exactly the path it is given. Going by your log ("restart dir /root object 
pbs_sched"), it tries to execute pbs_sched relative to /root, i.e. 
/root/pbs_sched, which, of course, isn't there.
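
To see the difference between the exec variants for yourself, here is a small 
C sketch. try_exec is a hypothetical helper of mine, not anything from the 
TORQUE sources, and "true" is only a stand-in for a program that lives 
somewhere in $PATH but not in the current directory.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child, change into `dir`, and try to exec
 * `file` with execvp() (searches $PATH) or execv() (does not).
 * Returns the child's exit status: 127 if the exec failed with ENOENT,
 * 126 on any other failure in the child, otherwise whatever the executed
 * program returned. Returns -1 if fork/waitpid failed or the child was
 * killed by a signal. */
int try_exec(const char *dir, const char *file, int search_path)
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;

    if (pid == 0) {
        char *const argv[] = { (char *)file, NULL };

        if (chdir(dir) != 0)
            _exit(126);
        if (search_path)
            execvp(file, argv);   /* bare name is looked up in $PATH */
        else
            execv(file, argv);    /* bare name is relative to the cwd */

        /* Only reached if the exec call failed. */
        _exit(errno == ENOENT ? 127 : 126);
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

On a typical system, try_exec("/", "true", 0) should fail with "No such file 
or directory" (status 127 here), just like the scheduler restart in the log, 
while try_exec("/", "true", 1) should find the binary through $PATH and 
return 0.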

To remedy this:

1) increase the time pbs_sched allows for a scheduling iteration, via the 
-a command line option. I think I had it running quite well with -a 600.

2) use maui / moab.

3) you could also patch pbs_sched to use execvp() instead of execv(), so 
it searches your $PATH in case of another toolong!

Cheers,
Ronny

