[torqueusers] Torque + Maui Configuration
tomas at lsd.ufcg.edu.br
Thu Oct 22 08:23:43 MDT 2009
Arnau Bria wrote:
> On Thu, 22 Oct 2009 10:24:00 -0300
> Tomás Soares wrote:
>> Hello All,
>> I have 9 nodes with torque and maui installed, and I'm using these
>> versions: torque-2.3.0 and maui-3.2.6p20.
>> I was trying to submit a simple job using "qsub -q prod script.sh",
>> but when I run qstat the job's status keeps switching between Queued, E
>> and Running; it never returns any output and never leaves the
>> qstat listing. Does anyone have an idea what's wrong?
> status goes from Q -> E -> R?
Yes, the status goes from Q -> E -> R -> Q. It then waits for a short
sleep interval and repeats the same cycle.
> what does a qstat -f show? (on R status)
The state keeps switching quickly, but this is what I captured:
Job Id: 6647.server1.my.host.name
Job_Name = script.sh
Job_Owner = prod000 at server1.my.host.name
job_state = E
queue = prod
server = server1.my.host.name
Checkpoint = u
ctime = Thu Oct 22 10:11:10 2009
Error_Path = server1.my.host.name:/home/prod000/script.sh.e6647
exec_host = WN9.my.host.name/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Thu Oct 22 11:12:16 2009
Output_Path = server1.my.host.name:/home/prod000/script.sh.o6647
Priority = 0
qtime = Thu Oct 22 10:11:10 2009
Rerunable = True
Resource_List.cput = 48:00:00
Resource_List.neednodes = WN9.my.host.name
Resource_List.walltime = 72:00:00
session_id = 19632
substate = 60
Variable_List = PBS_O_HOME=/home/prod000,PBS_O_LANG=en_US.UTF-8,
euser = prod000
egroup = prod
hashname = 6647.ce
queue_rank = 49
queue_type = E
etime = Thu Oct 22 10:11:10 2009
exit_status = -4
submit_args = -q prod script.sh
start_time = Thu Oct 22 10:17:55 2009
start_count = 525
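The two fields that stand out in this output are start_count = 525 and the negative exit_status. As far as I know, negative exit_status values in TORQUE are reported by pbs_mom rather than by the job script itself. A minimal Python sketch for pulling those fields out of `qstat -f` output follows. The function names and the restart threshold are hypothetical, chosen only for illustration, and the parser assumes the "key = value" layout shown above, skipping wrapped continuation lines.

```python
import re

def parse_qstat_f(text):
    """Collect 'key = value' lines from `qstat -f` style output into a dict.

    Lines without ' = ' (the 'Job Id:' header, wrapped continuation
    lines) are skipped in this sketch.
    """
    attrs = {}
    for line in text.splitlines():
        m = re.match(r"^\s*([\w.]+) = (.*)$", line)
        if m:
            attrs[m.group(1)] = m.group(2).strip()
    return attrs

def looks_like_restart_loop(attrs, max_starts=5):
    """Heuristic only: many starts plus a negative exit status.

    max_starts is an arbitrary, hypothetical threshold.
    """
    starts = int(attrs.get("start_count", 0))
    exit_status = attrs.get("exit_status")
    return (starts > max_starts
            and exit_status is not None
            and int(exit_status) < 0)

# A few lines copied from the output above.
SAMPLE = """\
Job Id: 6647.server1.my.host.name
job_state = E
exec_host = WN9.my.host.name/0
exit_status = -4
start_count = 525
"""

attrs = parse_qstat_f(SAMPLE)
print(attrs["job_state"], attrs["start_count"], looks_like_restart_loop(attrs))
```

In practice the text would come from running `qstat -f <jobid>` and capturing its output, rather than from a hard-coded sample.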
> have you checked the undelivered dir on the WN for stderr/stdout?
In /var/spool/pbs/mom_logs/ I found:
10/22/2009 11:12:13;0080; pbs_mom;Req;scan_for_exiting;no contact with server at hostaddr 96a50f32, port 15001, jobid 6647.server1.my.host.name errno 111
10/22/2009 11:12:16;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
10/22/2009 11:12:16;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
10/22/2009 11:12:16;0008; pbs_mom;Job;scan_for_terminated;checking job
10/22/2009 11:12:16;0080; pbs_mom;Job;6647.server1.my.host.name;obit sent to server
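The first log line carries two opaque values, the hex hostaddr and the errno. On Linux, errno 111 is ECONNREFUSED, meaning something actively refused the connection on port 15001. A small Python sketch for decoding both values and probing the port follows. It assumes the hostaddr is an IPv4 address printed as hex in network byte order; the helper names are hypothetical, and the errno text depends on the platform the script runs on.

```python
import errno
import os
import socket

def decode_hostaddr(hex_addr):
    """Turn 8 hex digits into a dotted-quad IPv4 address.

    Assumes network byte order, i.e. the first byte is the first octet.
    """
    return socket.inet_ntoa(bytes.fromhex(hex_addr))

def describe_errno(code):
    """Return the symbolic name and message for an errno on this platform."""
    name = errno.errorcode.get(code, "unknown")
    return f"{name}: {os.strerror(code)}"

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(decode_hostaddr("96a50f32"))
print(describe_errno(111))
```

If the decoded address is not the address pbs_server actually listens on, or if `port_open("server1.my.host.name", 15001)` run from the worker node returns False, then the MOM cannot report the job's exit to the server, which would be consistent with the job being requeued over and over.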
> does the same happen if you only submit a dummy command outside of the script?
> echo sleep 1000|qsub -q prod
> if the answer is no:
Yes, the same problem occurs with this command.
> is passwordless copy allowed between hosts?
Yes, passwordless copy is allowed.
>> Thanks a lot!!!