[torqueusers] Scheduler dying when processing jobs

Piotr Siwczak psiwczak at man.poznan.pl
Wed Oct 19 01:04:23 MDT 2005


Hi,

We run a torque cluster based on the Itanium 2 architecture. We use the 
standard torque scheduler (FIFO). The scheduler had been running fine for 
about half a year, and this week it crashed while processing a job. Here is 
the log produced by the tracejob command for that job:

----------
Job: 24086.sherwood

10/03/2005 16:12:08  S    enqueuing into big_mem, state 1 hop 1
10/03/2005 16:12:08  S    Requeueing job, substate: 10 Requeued in queue: big_mem
10/04/2005 09:46:08  S    enqueuing into big_mem, state 1 hop 1
10/04/2005 09:46:08  S    Requeueing job, substate: 10 Requeued in queue: big_mem
10/05/2005 14:15:17  S    enqueuing into big_mem, state 1 hop 1
10/05/2005 14:15:17  S    Requeueing job, substate: 10 Requeued in queue: big_mem
10/06/2005 08:58:22  S    enqueuing into big_mem, state 1 hop 1
10/06/2005 08:58:22  S    Requeueing job, substate: 10 Requeued in queue: big_mem
10/14/2005 04:31:32  L    pbs_resquery error: 15031
10/14/2005 04:31:33  L    Internal Scheduling Error
10/17/2005 08:55:16  S    enqueuing into big_mem, state 1 hop 1
10/17/2005 08:55:16  S    Requeueing job, substate: 10 Requeued in queue: big_mem
10/17/2005 09:06:31  S    Job Modified at request of Scheduler at sherwood
10/17/2005 09:06:31  S    Job Run at request of Scheduler at sherwood
10/17/2005 09:06:32  L    Job Run
----------

The log above shows an "Internal Scheduling Error" message at 10/14/2005 
04:31:33, one second after the "pbs_resquery error: 15031" entry. This is 
when the scheduler died.
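
In case it helps anyone looking at similar output, here is a rough Python sketch for pulling only the scheduler-side entries out of tracejob output. It assumes that the single-letter column marks the log source, with L meaning the scheduler log and S the server log, and that a line without a timestamp is just the wrapped tail of the previous entry. The names parse_tracejob, scheduler_entries and SAMPLE are made up for illustration, and the sample text is copied from the log above.

```python
import re

# One tracejob entry: "MM/DD/YYYY HH:MM:SS  <source letter>  <message>".
ENTRY = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+([A-Z])\s+(.*)$")


def parse_tracejob(text):
    """Return (timestamp, source, message) tuples, rejoining wrapped lines."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            entries.append(list(match.groups()))
        elif entries and line.strip():
            # No timestamp: treat it as the wrapped tail of the previous
            # message (assumption: the wrapping came from the mail client).
            entries[-1][2] = entries[-1][2].rstrip() + " " + line.strip()
    return [tuple(entry) for entry in entries]


def scheduler_entries(entries):
    """Keep only entries whose source letter is L (assumed: scheduler log)."""
    return [entry for entry in entries if entry[1] == "L"]


# Excerpt copied from the tracejob output for job 24086.sherwood.
SAMPLE = """\
10/06/2005 08:58:22  S    Requeueing job, substate: 10 Requeued in queue:
big_mem
10/14/2005 04:31:32  L    pbs_resquery error: 15031
10/14/2005 04:31:33  L    Internal Scheduling Error
10/17/2005 09:06:32  L    Job Run
"""

if __name__ == "__main__":
    for stamp, source, message in scheduler_entries(parse_tracejob(SAMPLE)):
        print(stamp, message)
```

Run on the excerpt, this lists the pbs_resquery error, the Internal Scheduling Error and the later Job Run entry, which makes it easier to see what the scheduler itself logged around the time it died.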

Does anybody have an idea what is going on? Is this somehow connected to 
this particular job, or is it a failure of the scheduler itself?

Best regards,
Piotr

  --
  Piotr Siwczak <psiwczak at man.poznan.pl>
  System Administrator

  Poznan Supercomputing and Networking Center
  Supercomputing Department

  (www.eu-egee.org <piotr.siwczak at cern.ch>)
  --

