[torqueusers] Help! One Puzzle At a Time...
Gus Correa
gus at ldeo.columbia.edu
Tue Sep 6 08:25:43 MDT 2011
Regarding the long time the job spends in the Q state after leaving the H state:
if you are using the Maui scheduler, this may be due to the default
DEFERTIME of 1 hour.
In that case, try setting it to a shorter value.
For instance, if you want it to be one minute, add this line:
DEFERTIME 00:01:00
to your ${MAUI}/maui.cfg file and restart maui.
See also:
http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php
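
If you would rather script the DEFERTIME change than edit the file by hand, here is a minimal sketch in Python. It only rewrites the text of a config file: the function name set_maui_param is my own invention, the match on the parameter name is case-sensitive, and reading and writing the real ${MAUI}/maui.cfg and restarting maui are left to you, since paths and restart commands differ between installations.

```python
import re


def set_maui_param(cfg_text: str, name: str, value: str) -> str:
    """Return cfg_text with the line `name value` set.

    An existing uncommented entry for `name` is replaced in place;
    otherwise the new line is appended at the end.
    (Hypothetical helper, not part of Maui itself.)
    """
    new_line = f"{name} {value}"
    # Match an uncommented line that starts with the parameter name.
    pattern = re.compile(rf"^[ \t]*{re.escape(name)}[ \t]+.*$", re.MULTILINE)
    if pattern.search(cfg_text):
        # A function replacement avoids backslash escapes being interpreted.
        return pattern.sub(lambda _match: new_line, cfg_text)
    # Make sure the appended line starts on its own line.
    separator = "" if cfg_text == "" or cfg_text.endswith("\n") else "\n"
    return f"{cfg_text}{separator}{new_line}\n"


# Example: add the one-minute DEFERTIME suggested above to some config text.
updated = set_maui_param("", "DEFERTIME", "00:01:00")
```

Running it twice with different values leaves a single DEFERTIME line, so it should be safe to re-run.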
I am not sure I understood it correctly, but
for the 'Resource temporarily unavailable' problem,
qnodes is reporting the 'naboo' node as 'down', hence unavailable to the
server, which matches the '1 requested, 0 available' in the qrun message.
It may need a reboot.
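
To catch this condition before jobs pile up, you could check the qnodes output for down nodes automatically. Below is a rough Python sketch that parses text in the layout shown in your message (a node name on its own line, followed by "key = value" attribute lines). The function names and the sample text are mine, and I am assuming the state field may hold a comma-separated list such as "down,offline"; feeding it the real output of qnodes is up to you.

```python
def node_states(qnodes_text: str) -> dict:
    """Map each node name to its 'state' value from qnodes-style text."""
    states = {}
    node = None
    for raw in qnodes_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " = " in line:
            # Attribute line belonging to the most recent node name.
            key, _, value = line.partition(" = ")
            if node is not None and key == "state":
                states[node] = value.strip()
        else:
            # Any other non-blank line is taken to be a node name.
            node = line
    return states


def down_nodes(qnodes_text: str) -> list:
    """Return the names of nodes whose state includes 'down'."""
    return [
        name
        for name, state in node_states(qnodes_text).items()
        if "down" in state.split(",")
    ]


# Sample text shaped like the qnodes output quoted below (status line omitted).
SAMPLE = """naboo
     state = down
     np = 40
     ntype = cluster
     gpus = 0
"""

print(down_nodes(SAMPLE))  # prints ['naboo']
```

A non-empty list here would explain why qrun finds 0 available nodes.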
I hope this helps,
Gus Correa
sam oubari wrote:
> Hello,
>
> I am no expert at TORQUE and one key puzzle for us is why, on occasion,
> a waiting job moves from H to Q but not to R when its scheduled time
> comes? When I attempt to force it with qrun I get:
>
> qrun: Resource temporarily unavailable MSG=job allocation request
> exceeds currently available cluster nodes, 1 requested, 0 available
> 3030.naboo.linnbenton.edu
>
> Below is the output of 'printserverdb' and 'qnodes' during the
> "freeze". To fix, I had to kill mom, restart it, then qrun the first Q job.
>
> Any hints would be greatly appreciated. Thx! Sam.
>
> PS. I've provided more details on 8/28/11.
>
>
> ------
> Sam Oubari, Manager of Systems & Application Programming
> Linn-Benton Community College -- Information Services
> 6500 Pacific Blvd SW, Room# CC 110E -- Albany OR 97321
> Tel: 541-917-4355/Fax: 541-917-4379
>
> ======
>
> # printserverdb
> ---------------------------------------------------
> numjobs: 26
> numque: 5
> jobidnumber: 3575
> savetm: 1314100391
> --attributes--
> scheduling = True
> max_running = 23
> total_jobs = 22
> state_count = Transit:0 Queued:0 Held:2 Waiting:17 Running:0 Exiting:0
> default_queue = sys_tst
> log_events = 511
> mail_from = adm
> query_other_jobs = False
> resources_assigned.nodect = 0
> scheduler_iteration = 600
> node_check_rate = 150
> tcp_timeout = 6
> mom_job_sync = False
> pbs_version = 2.5.6
> keep_completed = 600
> allow_node_submit = True
> next_job_number = 1
> net_counter = 7 1 0
>
> # qnodes
> naboo
> state = down
> np = 40
> ntype = cluster
> status = rectime=1315288785,varattr=,
> jobs=3448.naboo.linnbenton.edu 3449.naboo.linnbenton.edu
> 3450.naboo.linnbenton.edu,state=free,netload=1345146873471,gres=,
> loadave=0.08,ncpus=4,physmem=17040092kb,availmem=23485296kb,
> totmem=29739432kb,idletime=459327,nusers=5,nsessions=115,
> sessions=361 363 365 367 369 371 373 375 377 379 381 383 385 387 389
> 391 393 395 397 399 401 407 409 413 422 424 426 428 430 432 434 436
> 438 440 442 444 446 448 450 452 454 456 460 462 466 471 474 476 479
> 481 483 485 487 489 491 493 495 497 499 501 503 505 507 518 520 522
> 527 529 531 533 535 537 539 546 548 550 552 554 556 558 560 562 564
> 567 578 585 587 589 660 662 956 960 1637 1648 1657 1863 1891 5763
> 5839 5875 13067 18926 18986 19028 24492 24541 24588 24639 24684
> 24740 24787 29226 29631 30517 30521,
> uname=Linux naboo.linnbenton.edu 2.6.18-238.12.1.0.1.el5 #1 SMP
> Tue May 31 14:51:07 EDT 2011 x86_64,opsys=linux
> gpus = 0
>