[torqueusers] Torque queue stalls, strange server log messages

Motin mailinglists at demomusic.nu
Sat Mar 3 06:53:24 MST 2007


I think I have a good overview of the problem now. The scheduler seems
to die when I add hundreds of jobs at once. I had to restart the
scheduler and then the server to get things running again.

I guess Torque isn't meant to receive hundreds of job submissions at a time.
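
For the record, this is roughly how I bounced things. The exact init
scripts and paths vary between installs, so take it as an outline rather
than exact commands:

  # restart the default FIFO scheduler
  pkill pbs_sched
  pbs_sched

  # then restart the server; -t quick leaves any running jobs alone
  qterm -t quick
  pbs_server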

Motin wrote:
> My queue of around 1400 jobs is totally stalled. In the full job listing 
> all jobs are listed with status "Q"...
>
> I have tried qrun, but it lacks a "force-run the next job waiting in
> the queue" option, which would make it easier to use.
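
As a stopgap, something along these lines should pick out the first job
still in state "Q" and hand it to qrun. It assumes the default qstat
column layout (job state in column 5); adjust the awk field if your
output differs:

  qrun $(qstat | awk '$5 == "Q" {print $1; exit}')
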
>
> Still, qrun only treats the symptoms, not the problem. The queue runs
> great when there are at most approximately 300 queued items. Beyond
> that it refuses to run jobs most of the time. Occasionally it does set
> off a job, but very seldom.
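
One thing I plan to double-check when it refuses to run anything is
whether the server still has scheduling switched on and how often it
asks for a scheduling cycle; these are read-only and should be safe to
run anywhere:

  qmgr -c 'print server' | grep -Ei 'scheduling|scheduler_iteration'
  qstat -B    # server state plus queued/running job totals
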
>
> Here are my logs. They look rather strange to me; can you make any
> sense of them?
>
> I added items up until around 03/02/2007 09:24, then paused, and then
> started adding again at 03/02/2007 09:59.
>
> 03/02/2007 09:23:37;0040; pbs_sched;Job;5431.tiger001;Not enough cpus available
> 03/02/2007 09:23:37;0080; pbs_sched;Svr;main;brk point 167481344
> 03/02/2007 09:23:38;0040; pbs_sched;Job;5432.tiger001;Not enough cpus available
> 03/02/2007 09:23:38;0040; pbs_sched;Job;5433.tiger001;Not enough cpus available
> 03/02/2007 09:23:38;0040; pbs_sched;Job;5434.tiger001;Not enough cpus available
> 03/02/2007 09:23:39;0040; pbs_sched;Job;5435.tiger001;Not enough cpus available
> 03/02/2007 09:23:39;0040; pbs_sched;Job;5436.tiger001;Not enough cpus available
> 03/02/2007 09:23:39;0040; pbs_sched;Job;5437.tiger001;Not enough cpus available
> 03/02/2007 09:23:39;0080; pbs_sched;Svr;main;brk point 167485440
> 03/02/2007 09:23:41;0040; pbs_sched;Job;5438.tiger001;Not enough cpus available
> 03/02/2007 09:23:41;0040; pbs_sched;Job;5439.tiger001;Not enough cpus available
> 03/02/2007 09:23:41;0040; pbs_sched;Job;5440.tiger001;Not enough cpus available
> 03/02/2007 09:23:41;0040; pbs_sched;Job;5441.tiger001;Not enough cpus available
> 03/02/2007 09:23:41;0040; pbs_sched;Job;5442.tiger001;Not enough cpus available
> 03/02/2007 09:23:41;0080; pbs_sched;Svr;main;brk point 167862272
> 03/02/2007 09:23:41;0040; pbs_sched;Job;5443.tiger001;Not enough cpus available
> 03/02/2007 09:23:42;0040; pbs_sched;Job;5444.tiger001;Not enough cpus available
> 03/02/2007 09:23:42;0040; pbs_sched;Job;5445.tiger001;Not enough cpus available
> 03/02/2007 09:23:42;0080; pbs_sched;Svr;main;brk point 167923712
> 03/02/2007 09:23:43;0040; pbs_sched;Job;5446.tiger001;Not enough cpus available
> 03/02/2007 09:23:43;0040; pbs_sched;Job;5447.tiger001;Not enough cpus available
> 03/02/2007 09:23:43;0040; pbs_sched;Job;5448.tiger001;Not enough cpus available
> 03/02/2007 09:23:46;0040; pbs_sched;Job;5449.tiger001;Not enough cpus available
> 03/02/2007 09:23:46;0040; pbs_sched;Job;5450.tiger001;Not enough cpus available
> 03/02/2007 09:23:46;0040; pbs_sched;Job;5451.tiger001;Not enough cpus available
> 03/02/2007 09:23:46;0040; pbs_sched;Job;5452.tiger001;Not enough cpus available
> 03/02/2007 09:23:46;0080; pbs_sched;Svr;main;brk point 167927808
> 03/02/2007 09:24:06;0040; pbs_sched;Job;3895.tiger001;Job Run
> 03/02/2007 09:24:06;0040; pbs_sched;Job;5453.tiger001;Not enough cpus available
> 03/02/2007 09:34:06;0040; pbs_sched;Job;3896.tiger001;Job Run
> 03/02/2007 09:34:07;0040; pbs_sched;Job;3897.tiger001;Job Run
> 03/02/2007 09:34:07;0080; pbs_sched;Svr;main;brk point 167931904
> 03/02/2007 09:44:07;0040; pbs_sched;Job;3898.tiger001;Job Run
> 03/02/2007 09:54:07;0040; pbs_sched;Job;3899.tiger001;Job Run
> 03/02/2007 09:54:08;0040; pbs_sched;Job;3900.tiger001;Job Run
> 03/02/2007 09:54:08;0080; pbs_sched;Svr;main;brk point 167936000
> 03/02/2007 09:54:38;0040; pbs_sched;Job;3901.tiger001;Job Run
> 03/02/2007 09:59:46;0040; pbs_sched;Job;3902.tiger001;Job Run
> 03/02/2007 09:59:46;0040; pbs_sched;Job;5454.tiger001;Not enough cpus available
> 03/02/2007 09:59:46;0040; pbs_sched;Job;3903.tiger001;Job Run
> 03/02/2007 09:59:46;0040; pbs_sched;Job;5455.tiger001;Not enough cpus available
> 03/02/2007 09:59:46;0040; pbs_sched;Job;5456.tiger001;Not enough cpus available
> 03/02/2007 09:59:46;0040; pbs_sched;Job;5457.tiger001;Not enough cpus available
> 03/02/2007 09:59:46;0040; pbs_sched;Job;5458.tiger001;Not enough cpus available
> 03/02/2007 09:59:46;0040; pbs_sched;Job;5459.tiger001;Not enough cpus available
> 03/02/2007 09:59:46;0080; pbs_sched;Svr;main;brk point 168206336
> 03/02/2007 09:59:47;0040; pbs_sched;Job;3904.tiger001;Job Run
> 03/02/2007 09:59:47;0040; pbs_sched;Job;5460.tiger001;Not enough cpus available
> 03/02/2007 09:59:47;0040; pbs_sched;Job;5461.tiger001;Not enough cpus available
> 03/02/2007 09:59:47;0040; pbs_sched;Job;5462.tiger001;Not enough cpus available
> 03/02/2007 09:59:47;0040; pbs_sched;Job;5463.tiger001;Not enough cpus available
> 03/02/2007 09:59:47;0040; pbs_sched;Job;5464.tiger001;Not enough cpus available
> 03/02/2007 09:59:47;0040; pbs_sched;Job;5465.tiger001;Not enough cpus available
> 03/02/2007 09:59:47;0080; pbs_sched;Svr;main;brk point 168222720
> 03/02/2007 09:59:47;0040; pbs_sched;Job;3905.tiger001;Job Run
> 03/02/2007 09:59:47;0040; pbs_sched;Job;5466.tiger001;Not enough cpus available
> 03/02/2007 09:59:47;0040; pbs_sched;Job;5467.tiger001;Not enough cpus available
> 03/02/2007 09:59:47;0040; pbs_sched;Job;5468.tiger001;Not enough cpus available
> 03/02/2007 09:59:47;0040; pbs_sched;Job;5469.tiger001;Not enough cpus available
> 03/02/2007 09:59:47;0040; pbs_sched;Job;5470.tiger001;Not enough cpus available
> 03/02/2007 09:59:47;0040; pbs_sched;Job;5471.tiger001;Not enough cpus available
> 03/02/2007 09:59:47;0080; pbs_sched;Svr;main;brk point 168312832
> 03/02/2007 09:59:48;0040; pbs_sched;Job;3906.tiger001;Job Run
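
Given all the "Not enough cpus available" lines above, I also want to
compare what the scheduler thinks is free with what the nodes actually
report. Roughly (the output details differ a bit between Torque
versions):

  pbsnodes -l    # nodes currently marked down or offline
  pbsnodes -a    # full listing: per-node state, np and assigned jobs
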
>
> First queue items are:
> 3899.tiger001     flv_info6360     www-data               0 Q batch
> 3900.tiger001     flv_info6357     www-data               0 Q batch
> 3901.tiger001     flv_info6357     www-data               0 Q batch
>
> Last items are:
> 5442.tiger001     flv_info41       www-data               0 Q batch
> 5443.tiger001     flv_info41       www-data               0 Q batch
> 5444.tiger001     flv_info10       www-data               0 Q batch
> 5445.tiger001     flv_info10       www-data               0 Q batch
> 5446.tiger001     flv_info10       www-data               0 Q batch
> 5447.tiger001     flv_info10       www-data               0 Q batch
> 5448.tiger001     flv_info10       www-data               0 Q batch
> 5449.tiger001     hiflv_in7        www-data               0 Q batch
> 5450.tiger001     hiflv_in7        www-data               0 Q batch
> 5451.tiger001     hiflv_in7        www-data               0 Q batch
> 5452.tiger001     hiflv_in7        www-data               0 Q batch
> 5453.tiger001     hiflv_in7        www-data               0 Q batch
>
> Tim Miller wrote:
>
>> One can always do qrun <jobid> (at least with pbs_sched -- I've never
>> used Maui). Have you looked at the full job listing and scheduler logs
>> to determine why the jobs aren't running?
>>
>> Best,
>> Tim
>>
>> Motin wrote:
>>
>>> Sometimes the queue just sits there, without running any jobs. The
>>> machine is by no means overloaded, only sparsely used. How can one
>>> force the machine to run the available jobs in the queue?
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
