[torqueusers] qsub ... queue hang

Zhiliang Hu zhu at iastate.edu
Mon Dec 3 18:47:01 MST 2007


For some reason, every job I submit with "qsub" hangs in the queue on my Linux cluster, and I can't see why. Here are some details:

I have a file named "run" containing a single line that runs a "hello" program:
"/opt/openmpi.gcc/bin/mpirun -np 6 -machinefile machines ./hello"

It runs fine from the command line:

> sh run 
Comm_size is 6 with return value 0
Received Hello from process 1 from process 1
Received Hello from process 2 from process 2
Received Hello from process 3 from process 3
Received Hello from process 4 from process 4
Received Hello from process 5 from process 5
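For reference, qsub treats whatever it is given (a script file argument, or text on standard input) as the job script to run on a compute node. So "sh run | qsub" first runs the program locally, then submits its printed output rather than the mpirun command itself. The job name "STDIN" in the qstat listing reflects this stdin submission. A minimal Torque job script wrapping the same command might look like the sketch below. The file name hello.pbs, the nodes=1:ppn=6 request, and the use of $PBS_NODEFILE in place of the "machines" file are assumptions to adapt to the cluster.

```shell
#!/bin/sh
# Hypothetical Torque job script (hello.pbs); the resource request is an assumption.
#PBS -N hello
#PBS -l nodes=1:ppn=6
#PBS -j oe

# Torque starts jobs in the home directory; move to the directory qsub was run from.
cd "$PBS_O_WORKDIR"

# Use the node list Torque allocated instead of a hand-written machines file.
/opt/openmpi.gcc/bin/mpirun -np 6 -machinefile "$PBS_NODEFILE" ./hello
```

Such a script would be submitted with "qsub hello.pbs" rather than piped in. With -j oe, stderr is merged into the job's stdout file.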

However, when I submit it to "qsub":

> sh run  | qsub
49.cluster2.xxxx.xxxxxxx.xxx

it stays queued forever:

> qstat
Job id          Name          User         Time Use S Queue
--------------- ------------- ------------ -------- - -----
49.cluster2     STDIN         cuser               0 Q default   

When I check the queue and the server, both appear to be up and running:

> qstat -q
server: cluster2
Queue       Memory CPU Time Walltime Node  Run Que Lm  State
----------- ------ -------- -------- ----  --- --- --  -----
default       --      --       --      --    0   0 --   E R
                                          ----- -----
                                              0     0
> qstat -fB  
Server: cluster2.xxxx.xxxxxxx.xxx
    server_state = Active
    scheduling = True
    total_jobs = 0
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 
    default_queue = default
    log_events = 511
    mail_from = adm
    query_other_jobs = True
    resources_assigned.nodect = 0
    scheduler_iteration = 600
    node_check_rate = 150
    tcp_timeout = 6
    pbs_version = 2.1.8

Could someone tell me why the jobs are hanging? (Before each retry, I cleared everything from the queue with "qdel".)

Thanks in advance.

Zhiliang
