[torqueusers] Hanging jobs (queued forever)

Micha onsager at gmx.net
Fri Apr 11 01:58:08 MDT 2008


Hello list,

Systems:
openSUSE 10.1 (mixed 32/64-bit)

TORQUE installed from contributed RPMs:
torque-2.3.0-17.1.x86_64.rpm
plus the client, server, etc. packages.

The problem: I cannot get even a simple job to run, not even on a node
that is the same machine as the server (which should rule out ssh, rsh,
and access-rights issues). The job sits in the queue forever. I have read
that wrong resource requests can cause this and have tried many of the
qmgr options, without success. A typical example:

---snip---

m7:/home/micha/programming/frame/bin # qstat -f
Job Id: 341.m7.mbee.net
    Job_Name = test.sh
    Job_Owner = micha@m7.mbee.net
    job_state = Q
    queue = batch
    server = m7.mbee.net
    Checkpoint = u
    ctime = Fri Apr 11 08:27:32 2008
    Error_Path = m7.mbee.net:/home/micha/programming/frame/bin/test.sh.e341
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Fri Apr 11 08:27:32 2008
    Output_Path = m7.mbee.net:/home/micha/programming/frame/bin/test.sh.o341
    Priority = 0
    qtime = Fri Apr 11 08:31:02 2008
    Rerunable = True
    Resource_List.cput = 00:00:30
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 00:01:00
    substate = 10
    Variable_List = PBS_O_HOME=/home/micha,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=micha,
        PBS_O_PATH=/home/micha/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bi
        n:/usr/games:/opt/gnome/bin:/opt/kde3/bin:/usr/lib/jvm/jre/bin:/usr/li
        b/mit/bin:/usr/lib/mit/sbin:/usr/lib/qt3/bin:/home/micha/programming/f
        rame/bin:/home/micha/programming/frame/bin,
        PBS_O_MAIL=/var/spool/mail/micha,PBS_O_SHELL=/bin/bash,
        PBS_SERVER=m7.mbee.net,PBS_O_HOST=m7.mbee.net,
        PBS_O_WORKDIR=/home/micha/programming/frame/bin,PBS_O_QUEUE=batch
    euser = micha
    egroup = users
    queue_rank = 13
    queue_type = E
    etime = Fri Apr 11 08:27:32 2008
    submit_args = -l nodes=1,walltime=00:01:00,cput=00:00:30 test.sh

m7:/home/micha/programming/frame/bin # qmgr -c "list queue batch"
Queue batch
        queue_type = Execution
        total_jobs = 1
        state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
        resources_default.nodes = 1
        mtime = Mon Apr  7 14:39:24 2008
        enabled = True
        started = True

m7:/home/micha/programming/frame/bin # qmgr -c "list server"
Server m7.mbee.net
        server_state = Idle
        total_jobs = 1
        state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
        acl_hosts = m7
        default_queue = batch
        log_events = 511
        mail_from = adm
        scheduler_iteration = 600
        node_check_rate = 150
        tcp_timeout = 6
        pbs_version = 2.3.0
        next_job_number = 342
        net_counter = 1 0 0

m7:/home/micha/programming/frame/bin # qmgr -c "list node m7"
Node m7
        state = free
        np = 2
        ntype = cluster
        status = opsys=linux,
                 uname=Linux m7 2.6.16.21-0.25-smp #1 SMP Tue Sep 19 07:26:15 UTC 2006 x86_64,
                 sessions=4271 4319 3063 11722 19799,nsessions=5,nusers=2,
                 idletime=1,totmem=961176kb,availmem=793120kb,physmem=382876kb,
                 ncpus=? 0,loadave=0.00,netload=1583429,state=free,jobs=,
                 varattr=,rectime=1207898179

m7:/home/micha/programming/frame/bin #            

---snap---

The result is the same with other submit arguments, even quite 'standard'
ones, in place of those shown in the qstat output above.
BTW, everything works on Gentoo systems with an older TORQUE version (2.1.6).

Apart from this particular problem, what can I do in general to track
down this kind of issue? I raised the debug levels (as advised in the
FAQ) to the highest possible setting, but I still cannot get any further
information.
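
One general check that can be made against the listings above, given here
only as a minimal sketch and not as a confirmed diagnosis: the
"list server" output shows server_state = Idle and no "scheduling"
attribute. As far as the PBS/TORQUE server attribute documentation goes,
Idle means the server is running but will not invoke a scheduler. The
helper below (check_scheduling is a hypothetical name, not part of TORQUE)
reads such a listing on stdin and summarizes the two attributes.

```shell
#!/bin/sh
# Sketch: read `qmgr -c "list server"` output on stdin and report whether
# the server is configured to hand jobs to a scheduler.
# check_scheduling is a hypothetical helper name, not a TORQUE command.
check_scheduling() {
    awk '
        # Attribute lines look like "        name = value".
        $1 == "scheduling"   && $2 == "=" { sched = $3 }
        $1 == "server_state" && $2 == "=" { state = $3 }
        END {
            # qmgr does not print attributes that were never set.
            if (sched == "") sched = "unset"
            if (state == "") state = "unknown"
            printf "server_state=%s scheduling=%s\n", state, sched
            # Exit status 0 only when scheduling is explicitly True.
            exit (sched == "True" ? 0 : 1)
        }
    '
}
```

Usage would be: qmgr -c "list server" | check_scheduling
For the listing quoted above it prints "server_state=Idle scheduling=unset"
and returns a non-zero status. If that is the case on a given system, the
usual things to look at are whether a scheduler (pbs_sched or an external
one) is running and whether qmgr -c "set server scheduling = true" has
been issued; qrun <jobid> can also be used to start a queued job by hand,
which separates scheduler problems from execution problems. Whether this
is the actual cause here is an assumption.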

Micha