[torqueusers] qsub ... queue hang
Zhiliang Hu
zhu at iastate.edu
Mon Dec 3 18:47:01 MST 2007
Somehow all jobs submitted via "qsub" hang in the queue on my Linux cluster, and I can't see why. Here are some details:
I have a "run" file containing one line to run a "hello" program:
"/opt/openmpi.gcc/bin/mpirun -np 6 -machinefile machines ./hello"
It runs fine on the command line:
> sh run
Comm_size is 6 with return value 0
Received Hello from process 1 from process 1
Received Hello from process 2 from process 2
Received Hello from process 3 from process 3
Received Hello from process 4 from process 4
Received Hello from process 5 from process 5
However, when submitted to "qsub":
> sh run | qsub
49.cluster2.xxxx.xxxxxxx.xxx
-- it stays queued forever:
> qstat
Job id          Name          User         Time Use S Queue
--------------- ------------- ------------ -------- - -----
49.cluster2     STDIN         cuser               0 Q default
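A note on what the pipeline above actually hands to qsub: when no script file is given, qsub reads the job script from standard input (hence the job name STDIN), so "sh run | qsub" submits the text printed by the program rather than the contents of "run". This may or may not be related to the job staying in state Q. A minimal sketch of the difference, using "cat" as a stand-in for qsub and a hypothetical one-line "run" script (both are assumptions, chosen so the example works without a batch system):

```shell
# Stand-in demonstration: `cat` plays the role of qsub reading from stdin.
# The one-line script below is hypothetical, not the real mpirun command.
workdir=$(mktemp -d)
printf 'echo hello from job\n' > "$workdir/run"

# Analogue of `sh run | qsub`: the script runs first, its OUTPUT is piped on.
piped_output=$(sh "$workdir/run" | cat)

# Analogue of `qsub < run`: the script TEXT itself is piped on.
piped_script=$(cat < "$workdir/run")

echo "sh run | ... delivers: $piped_output"
echo "... < run    delivers: $piped_script"

rm -rf "$workdir"
```

If the intent is to have the batch system execute the script, "qsub run" or "qsub < run" would submit the script text itself.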
When I check the queue server, it appears to be running:
> qstat -q
server: cluster2
Queue       Memory CPU Time Walltime Node Run Que Lm State
----------- ------ -------- -------- ---- --- --- -- -----
default     --     --       --       --     0   0 --   E R
                                        ----- -----
                                            0     0
> qstat -fB
Server: cluster2.xxxx.xxxxxxx.xxx
    server_state = Active
    scheduling = True
    total_jobs = 0
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
    default_queue = default
    log_events = 511
    mail_from = adm
    query_other_jobs = True
    resources_assigned.nodect = 0
    scheduler_iteration = 600
    node_check_rate = 150
    tcp_timeout = 6
    pbs_version = 2.1.8
Could someone tell me why the jobs are hanging there? (Each time I retry, I first clear everything from the queue with "qdel".)
Thanks in advance.
Zhiliang