[torqueusers] Scheduler dying when processing jobs
Piotr Siwczak
psiwczak at man.poznan.pl
Wed Oct 19 01:04:23 MDT 2005
Hi,
We run a TORQUE cluster based on the Itanium 2 architecture. We use the
standard TORQUE scheduler (FIFO). The scheduler had been running fine for about
half a year, and this week it crashed while processing a job. Here's
the log produced by the tracejob command for one of the affected jobs:
----------
Job: 24086.sherwood
10/03/2005 16:12:08 S enqueuing into big_mem, state 1 hop 1
10/03/2005 16:12:08 S Requeueing job, substate: 10 Requeued in queue: big_mem
10/04/2005 09:46:08 S enqueuing into big_mem, state 1 hop 1
10/04/2005 09:46:08 S Requeueing job, substate: 10 Requeued in queue: big_mem
10/05/2005 14:15:17 S enqueuing into big_mem, state 1 hop 1
10/05/2005 14:15:17 S Requeueing job, substate: 10 Requeued in queue: big_mem
10/06/2005 08:58:22 S enqueuing into big_mem, state 1 hop 1
10/06/2005 08:58:22 S Requeueing job, substate: 10 Requeued in queue: big_mem
10/14/2005 04:31:32 L pbs_resquery error: 15031
10/14/2005 04:31:33 L Internal Scheduling Error
10/17/2005 08:55:16 S enqueuing into big_mem, state 1 hop 1
10/17/2005 08:55:16 S Requeueing job, substate: 10 Requeued in queue: big_mem
10/17/2005 09:06:31 S Job Modified at request of Scheduler at sherwood
10/17/2005 09:06:31 S Job Run at request of Scheduler at sherwood
10/17/2005 09:06:32 L Job Run
----------
In the log above you can see the "Internal Scheduling Error" message, logged
one second after a pbs_resquery error (15031). This is when the scheduler died.
Does anybody have an idea what is going on? Is this somehow connected to
this particular job, or is it a failure of the scheduler itself?
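
To isolate the scheduler-side entries from a trace like the one above, a
minimal Python sketch follows. It assumes that each tracejob entry has the
form "date time source message", that the source letter L marks scheduler
log entries and S marks server entries (as the trace above suggests), and
that lines without a timestamp are wrapped continuations. The names
parse_tracejob and scheduler_errors are hypothetical helpers, not part of
TORQUE.

```python
import re

# One tracejob entry: "MM/DD/YYYY HH:MM:SS <source> <message>".
# Assumption: the one-letter source is S for the server log and
# L for the scheduler log, as the entries in the trace suggest.
ENTRY = re.compile(r"^(\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}:\d{2}) ([A-Z]) (.*)$")


def parse_tracejob(text):
    """Return a list of (date, time, source, message) tuples.

    Lines that do not start with a timestamp are treated as
    continuations of the previous entry (wrapped log lines).
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        match = ENTRY.match(line)
        if match:
            entries.append(list(match.groups()))
        elif entries and line and not line.startswith(("Job:", "-")):
            # Wrapped remainder of the previous message.
            entries[-1][3] += " " + line
    return [tuple(entry) for entry in entries]


def scheduler_errors(entries):
    """Keep scheduler (L) entries whose message mentions an error."""
    return [e for e in entries if e[2] == "L" and "error" in e[3].lower()]


sample = """\
10/06/2005 08:58:22 S Requeueing job, substate: 10 Requeued in queue:
big_mem
10/14/2005 04:31:32 L pbs_resquery error: 15031
10/14/2005 04:31:33 L Internal Scheduling Error
10/17/2005 09:06:32 L Job Run
"""

for date, time, source, message in scheduler_errors(parse_tracejob(sample)):
    print(date, time, message)
```

Run on the sample, this prints only the two entries from 10/14/2005, which
makes it easier to compare their timestamps against the scheduler's own log
file for the same day.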
Best regards,
Piotr
--
Piotr Siwczak <psiwczak at man.poznan.pl>
System Administrator
Poznan Supercomputing and Networking Center
Supercomputing Department
(www.eu-egee.org <piotr.siwczak at cern.ch>)