[torqueusers] Help! One Puzzle At a Time...

Gus Correa gus at ldeo.columbia.edu
Tue Sep 6 08:25:43 MDT 2011


Regarding the long time the job spends in Q state after leaving H state:
if you are using the Maui scheduler, this may be due to the default
DEFERTIME of 1 hour. A job that Maui has deferred is not considered
for scheduling again until that time has elapsed.
In this case, try setting it to a smaller value.
For instance, if you want it to be one minute, add this line:
DEFERTIME 00:01:00
to your ${MAUI}/maui.cfg file and restart maui.
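
To double-check which value is actually in effect, something like the
rough Python sketch below could be used. It assumes the
[[[DD:]HH:]MM:]SS time format and the 1 hour default given in the Maui
parameter documentation; the function names are made up for
illustration and are not part of Maui.

```python
DEFAULT_DEFERTIME = 3600  # assumed Maui default of one hour, in seconds


def parse_maui_time(value):
    """Convert a [[[DD:]HH:]MM:]SS string to seconds."""
    parts = [int(p) for p in value.split(":")]
    if not 1 <= len(parts) <= 4:
        raise ValueError("unexpected time format: %r" % value)
    # Pair fields from the right: seconds, minutes, hours, days.
    multipliers = [1, 60, 3600, 86400]
    return sum(n * m for n, m in zip(reversed(parts), multipliers))


def effective_defertime(cfg_text):
    """Return the DEFERTIME in seconds found in maui.cfg text.

    Falls back to the assumed default when the parameter is absent.
    If the parameter appears more than once, the last one wins here;
    that is a choice of this sketch, not documented Maui behavior.
    """
    seconds = DEFAULT_DEFERTIME
    for line in cfg_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments
        if len(fields) >= 2 and fields[0].upper() == "DEFERTIME":
            seconds = parse_maui_time(fields[1])
    return seconds
```

Feeding it the contents of maui.cfg should report 60 once the line
above has been added.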

See also:
http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php

I am not sure I understood it right, but
as for the 'Resource temporarily unavailable' problem,
qnodes is reporting the 'naboo' node as 'down', hence unavailable.
That would explain why qrun sees 0 available nodes.
The node may need a reboot.
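
If it helps to spot this condition automatically, here is a minimal
Python sketch that scans qnodes output for nodes whose top-level state
is not usable. It assumes the layout shown in the quoted output below
(node name at the start of a line, indented "key = value" attributes);
the function name and the set of states treated as bad are my own
choices, not anything provided by TORQUE.

```python
BAD_STATES = {"down", "offline", "state-unknown"}


def unavailable_nodes(qnodes_text):
    """Return the names of nodes whose 'state' attribute is bad."""
    bad = []
    current = None
    for line in qnodes_text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new node record.
            current = line.strip()
            continue
        key, _, value = line.strip().partition("=")
        # Only the node-level 'state' attribute counts; the 'state=free'
        # buried inside the 'status' attribute is what the mom last
        # reported and is ignored here.
        if current and key.strip() == "state":
            states = {s.strip() for s in value.split(",")}
            if states & BAD_STATES:
                bad.append(current)
    return bad
```

The text can come from wherever is convenient, for example by piping
qnodes into a small script that reads sys.stdin and passes the result
to this function.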

I hope this helps,
Gus Correa

sam oubari wrote:
> Hello,
>  
> I am no expert at TORQUE and one key puzzle for us is why, on occasion, 
> a waiting job moves from H to Q but not R when its scheduled time 
> comes?  When I attempt to force it with qrun I get:
>  
> qrun: Resource temporarily unavailable MSG=job allocation request 
> exceeds currently available cluster nodes, 1 requested, 0 available 
> 3030.naboo.linnbenton.edu
> Below is the output of 'printserverdb' and 'qnodes' during the 
> "freeze".  To fix, I had to kill mom, restart it, then qrun the first Q job.
>  
> Any hints would be greatly appreciated.  Thx! Sam.
>  
> PS. I've provided more details on 8/28/11.
>  
>  
> ------
> Sam Oubari, Manager of Systems & Application Programming
> Linn-Benton Community College -- Information Services
> 6500 Pacific Blvd SW, Room# CC 110E -- Albany OR 97321
> Tel: 541-917-4355/Fax: 541-917-4379
>  
> ======
>  
> # printserverdb
> ---------------------------------------------------
> numjobs:                26
> numque:         5
> jobidnumber:            3575
> savetm:         1314100391
> --attributes--
> scheduling = True
> max_running = 23
> total_jobs = 22
> state_count = Transit:0 Queued:0 Held:2 Waiting:17 Running:0 Exiting:0
> default_queue = sys_tst
> log_events = 511
> mail_from = adm
> query_other_jobs = False
> resources_assigned.nodect = 0
> scheduler_iteration = 600
> node_check_rate = 150
> tcp_timeout = 6
> mom_job_sync = False
> pbs_version = 2.5.6
> keep_completed = 600
> allow_node_submit = True
> next_job_number = 1
> net_counter = 7 1 0
>  
> # qnodes
> naboo
>      state = down
>      np = 40
>      ntype = cluster
>      status = rectime=1315288785,varattr=,jobs=3448.naboo.linnbenton.edu 3449.naboo.linnbenton.edu 3450.naboo.linnbenton.edu,state=free,netload=1345146873471,gres=,loadave=0.08,ncpus=4,physmem=17040092kb,availmem=23485296kb,totmem=29739432kb,idletime=459327,nusers=5,nsessions=115,sessions=361 363 365 367 369 371 373 375 377 379 381 383 385 387 389 391 393 395 397 399 401 407 409 413 422 424 426 428 430 432 434 436 438 440 442 444 446 448 450 452 454 456 460 462 466 471 474 476 479 481 483 485 487 489 491 493 495 497 499 501 503 505 507 518 520 522 527 529 531 533 535 537 539 546 548 550 552 554 556 558 560 562 564 567 578 585 587 589 660 662 956 960 1637 1648 1657 1863 1891 5763 5839 5875 13067 18926 18986 19028 24492 24541 24588 24639 24684 24740 24787 29226 29631 30517 30521,uname=Linux naboo.linnbenton.edu 2.6.18-238.12.1.0.1.el5 #1 SMP Tue May 31 14:51:07 EDT 2011 x86_64,opsys=linux
>      gpus = 0
>  