[torqueusers] Torque queue stalls, strange server log messages

Motin mailinglists at demomusic.nu
Sat Mar 3 05:35:01 MST 2007


My queue of around 1400 jobs is totally stalled. In the full job listing 
all jobs are listed with status "Q"...

I have tried qrun, but it lacks an option to "force-run the next job
waiting in the queue", which would make it easier to use.
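
A minimal sketch of such a "run the next queued job" helper, built only on qstat and qrun <jobid>. It assumes the default qstat column layout (job id, name, user, time used, state, queue) and that the calling user is allowed to use qrun. The function names first_queued_job and run_next_queued are made up for illustration and are not part of Torque.

```python
import subprocess


def first_queued_job(qstat_output):
    """Return the ID of the first job whose state column is 'Q', or None."""
    for line in qstat_output.splitlines():
        fields = line.split()
        # Assumed layout: job id, name, user, time used, state, queue.
        # Header and separator lines never carry 'Q' in the fifth column.
        if len(fields) >= 6 and fields[4] == "Q":
            return fields[0]
    return None


def run_next_queued():
    """Ask the server to start the first queued job and return its ID."""
    listing = subprocess.run(
        ["qstat"], capture_output=True, text=True, check=True
    ).stdout
    job_id = first_queued_job(listing)
    if job_id is not None:
        # qrun <jobid> requests that this specific job be started.
        subprocess.run(["qrun", job_id], check=True)
    return job_id
```

Calling run_next_queued() once starts at most one job. Like qrun itself, it only works around the stall and does not explain why the scheduler leaves the jobs queued.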

Still, qrun only cures the symptoms, not the problem. The queue
runs great when there are at most approximately 300 queue items. After
that, it borks and refuses to run jobs most of the time. Sometimes,
however, it sets off jobs, though very seldom.

Here are my logs. They look rather strange to me; can you make any
sense of them?

I added items until around 03/02/2007 09:24, then paused, then
resumed adding at 03/02/2007 09:59.

03/02/2007 09:23:37;0040; pbs_sched;Job;5431.tiger001;Not enough cpus available
03/02/2007 09:23:37;0080; pbs_sched;Svr;main;brk point 167481344
03/02/2007 09:23:38;0040; pbs_sched;Job;5432.tiger001;Not enough cpus available
03/02/2007 09:23:38;0040; pbs_sched;Job;5433.tiger001;Not enough cpus available
03/02/2007 09:23:38;0040; pbs_sched;Job;5434.tiger001;Not enough cpus available
03/02/2007 09:23:39;0040; pbs_sched;Job;5435.tiger001;Not enough cpus available
03/02/2007 09:23:39;0040; pbs_sched;Job;5436.tiger001;Not enough cpus available
03/02/2007 09:23:39;0040; pbs_sched;Job;5437.tiger001;Not enough cpus available
03/02/2007 09:23:39;0080; pbs_sched;Svr;main;brk point 167485440
03/02/2007 09:23:41;0040; pbs_sched;Job;5438.tiger001;Not enough cpus available
03/02/2007 09:23:41;0040; pbs_sched;Job;5439.tiger001;Not enough cpus available
03/02/2007 09:23:41;0040; pbs_sched;Job;5440.tiger001;Not enough cpus available
03/02/2007 09:23:41;0040; pbs_sched;Job;5441.tiger001;Not enough cpus available
03/02/2007 09:23:41;0040; pbs_sched;Job;5442.tiger001;Not enough cpus available
03/02/2007 09:23:41;0080; pbs_sched;Svr;main;brk point 167862272
03/02/2007 09:23:41;0040; pbs_sched;Job;5443.tiger001;Not enough cpus available
03/02/2007 09:23:42;0040; pbs_sched;Job;5444.tiger001;Not enough cpus available
03/02/2007 09:23:42;0040; pbs_sched;Job;5445.tiger001;Not enough cpus available
03/02/2007 09:23:42;0080; pbs_sched;Svr;main;brk point 167923712
03/02/2007 09:23:43;0040; pbs_sched;Job;5446.tiger001;Not enough cpus available
03/02/2007 09:23:43;0040; pbs_sched;Job;5447.tiger001;Not enough cpus available
03/02/2007 09:23:43;0040; pbs_sched;Job;5448.tiger001;Not enough cpus available
03/02/2007 09:23:46;0040; pbs_sched;Job;5449.tiger001;Not enough cpus available
03/02/2007 09:23:46;0040; pbs_sched;Job;5450.tiger001;Not enough cpus available
03/02/2007 09:23:46;0040; pbs_sched;Job;5451.tiger001;Not enough cpus available
03/02/2007 09:23:46;0040; pbs_sched;Job;5452.tiger001;Not enough cpus available
03/02/2007 09:23:46;0080; pbs_sched;Svr;main;brk point 167927808
03/02/2007 09:24:06;0040; pbs_sched;Job;3895.tiger001;Job Run
03/02/2007 09:24:06;0040; pbs_sched;Job;5453.tiger001;Not enough cpus available
03/02/2007 09:34:06;0040; pbs_sched;Job;3896.tiger001;Job Run
03/02/2007 09:34:07;0040; pbs_sched;Job;3897.tiger001;Job Run
03/02/2007 09:34:07;0080; pbs_sched;Svr;main;brk point 167931904
03/02/2007 09:44:07;0040; pbs_sched;Job;3898.tiger001;Job Run
03/02/2007 09:54:07;0040; pbs_sched;Job;3899.tiger001;Job Run
03/02/2007 09:54:08;0040; pbs_sched;Job;3900.tiger001;Job Run
03/02/2007 09:54:08;0080; pbs_sched;Svr;main;brk point 167936000
03/02/2007 09:54:38;0040; pbs_sched;Job;3901.tiger001;Job Run
03/02/2007 09:59:46;0040; pbs_sched;Job;3902.tiger001;Job Run
03/02/2007 09:59:46;0040; pbs_sched;Job;5454.tiger001;Not enough cpus available
03/02/2007 09:59:46;0040; pbs_sched;Job;3903.tiger001;Job Run
03/02/2007 09:59:46;0040; pbs_sched;Job;5455.tiger001;Not enough cpus available
03/02/2007 09:59:46;0040; pbs_sched;Job;5456.tiger001;Not enough cpus available
03/02/2007 09:59:46;0040; pbs_sched;Job;5457.tiger001;Not enough cpus available
03/02/2007 09:59:46;0040; pbs_sched;Job;5458.tiger001;Not enough cpus available
03/02/2007 09:59:46;0040; pbs_sched;Job;5459.tiger001;Not enough cpus available
03/02/2007 09:59:46;0080; pbs_sched;Svr;main;brk point 168206336
03/02/2007 09:59:47;0040; pbs_sched;Job;3904.tiger001;Job Run
03/02/2007 09:59:47;0040; pbs_sched;Job;5460.tiger001;Not enough cpus available
03/02/2007 09:59:47;0040; pbs_sched;Job;5461.tiger001;Not enough cpus available
03/02/2007 09:59:47;0040; pbs_sched;Job;5462.tiger001;Not enough cpus available
03/02/2007 09:59:47;0040; pbs_sched;Job;5463.tiger001;Not enough cpus available
03/02/2007 09:59:47;0040; pbs_sched;Job;5464.tiger001;Not enough cpus available
03/02/2007 09:59:47;0040; pbs_sched;Job;5465.tiger001;Not enough cpus available
03/02/2007 09:59:47;0080; pbs_sched;Svr;main;brk point 168222720
03/02/2007 09:59:47;0040; pbs_sched;Job;3905.tiger001;Job Run
03/02/2007 09:59:47;0040; pbs_sched;Job;5466.tiger001;Not enough cpus available
03/02/2007 09:59:47;0040; pbs_sched;Job;5467.tiger001;Not enough cpus available
03/02/2007 09:59:47;0040; pbs_sched;Job;5468.tiger001;Not enough cpus available
03/02/2007 09:59:47;0040; pbs_sched;Job;5469.tiger001;Not enough cpus available
03/02/2007 09:59:47;0040; pbs_sched;Job;5470.tiger001;Not enough cpus available
03/02/2007 09:59:47;0040; pbs_sched;Job;5471.tiger001;Not enough cpus available
03/02/2007 09:59:47;0080; pbs_sched;Svr;main;brk point 168312832
03/02/2007 09:59:48;0040; pbs_sched;Job;3906.tiger001;Job Run
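
To make a log like this easier to read, the records can be tallied by message and the spacing between "Job Run" events measured. The following is only a rough sketch: it assumes the semicolon-separated layout seen above (timestamp;code;daemon;type;id;message), mm/dd/yyyy dates, and that a line without a semicolon is the wrapped tail of the previous record. The helper names are made up for illustration.

```python
from datetime import datetime


def join_wrapped(lines):
    """Re-attach continuation lines (no ';') to the record before them."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ";" in line or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def parse_sched_line(record):
    """Return (timestamp, job_id, message) for a Job record, else None."""
    parts = [p.strip() for p in record.split(";")]
    if len(parts) < 6 or parts[3] != "Job":
        return None
    stamp = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    return stamp, parts[4], ";".join(parts[5:])


def summarize(lines):
    """Count Job messages and list the seconds between consecutive 'Job Run' records."""
    counts = {}
    run_times = []
    for record in join_wrapped(lines):
        parsed = parse_sched_line(record)
        if parsed is None:
            continue  # server records such as "brk point" are skipped
        stamp, _job_id, message = parsed
        counts[message] = counts.get(message, 0) + 1
        if message == "Job Run":
            run_times.append(stamp)
    gaps = [int((b - a).total_seconds()) for a, b in zip(run_times, run_times[1:])]
    return counts, gaps
```

Fed the excerpt above, summarize() would report several gaps of about 600 seconds between consecutive "Job Run" records from 09:24 to 09:54, with the remaining starts clustered within seconds of each other.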

First queue items are:
3899.tiger001     flv_info6360     www-data               0 Q batch
3900.tiger001     flv_info6357     www-data               0 Q batch
3901.tiger001     flv_info6357     www-data               0 Q batch

Last items are:
5442.tiger001     flv_info41       www-data               0 Q batch
5443.tiger001     flv_info41       www-data               0 Q batch
5444.tiger001     flv_info10       www-data               0 Q batch
5445.tiger001     flv_info10       www-data               0 Q batch
5446.tiger001     flv_info10       www-data               0 Q batch
5447.tiger001     flv_info10       www-data               0 Q batch
5448.tiger001     flv_info10       www-data               0 Q batch
5449.tiger001     hiflv_in7        www-data               0 Q batch
5450.tiger001     hiflv_in7        www-data               0 Q batch
5451.tiger001     hiflv_in7        www-data               0 Q batch
5452.tiger001     hiflv_in7        www-data               0 Q batch
5453.tiger001     hiflv_in7        www-data               0 Q batch

Tim Miller wrote:

> One can always do qrun <jobid> (at least with pbs_sched -- I've never
> used Maui). Have you looked at the full job listing and scheduler logs
> to determine why the jobs aren't running?
>
> Best,
> Tim
>
> Motin wrote:
>
>> Sometimes the queue just sits there, without running any jobs. The
>> machine is by no means overloaded, only sparsely used. How can one force
>> the machine to run the available jobs in the queue?
