[torqueusers] torque intermittent failure (read error)

hgc-01134@hkedcity.net wong emeplease at gmail.com
Thu Aug 19 03:52:36 MDT 2010


Hi,

I am using torque-2.3.10-1 on CentOS 5.5 i386. My platform consists of
an execution node, scheduler, and server (running inside a virtual
machine with 2 cores).

To enter interactive mode and execute a start-up script immediately,
I use "qsub -I -l nodes=1:ppn=2 -v startup=myscript", and inside
.bashrc I have "exec $startup".
This usually works fine, but sometimes the job is terminated
immediately, or I get an error such as "connection reset by peer, read
error".
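For reference, the .bashrc hook described above could be written more defensively. This is only a minimal sketch and not a diagnosis of the read error: the function name should_run_startup is hypothetical, and it assumes that TORQUE sets PBS_ENVIRONMENT to PBS_INTERACTIVE inside a "qsub -I" job, which should be checked on the installation in question.

```shell
# Hypothetical helper for ~/.bashrc: decide whether to hand the shell
# over to the command named by $startup (passed via `qsub -v startup=...`).
should_run_startup() {
    # Assumption: inside an interactive TORQUE job, PBS_ENVIRONMENT is
    # set to PBS_INTERACTIVE. Ordinary logins skip the hook.
    [ "${PBS_ENVIRONMENT:-}" = "PBS_INTERACTIVE" ] || return 1
    # The variable must be set and non-empty.
    [ -n "${startup:-}" ] || return 1
    # The value must resolve to something runnable (path or name on PATH).
    command -v "$startup" >/dev/null 2>&1 || return 1
    return 0
}

if should_run_startup; then
    # Replace the shell with the start-up script, as in the original setup.
    exec "$startup"
fi
```

With this guard, a shell started outside an interactive job, or with an empty or unresolvable $startup, falls through to a normal prompt instead of running a bare "exec".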


tracejob result:

08/19/2010 17:10:49  S    enqueuing into batch, state 1 hop 1
08/19/2010 17:10:49  S    Job Queued at request of cyw at cdevm2centos55x86,
                          owner = cyw at cdevm2centos55x86, job name = STDIN,
                          queue = batch
08/19/2010 17:10:49  S    Job Modified at request of Scheduler at cdevm2centos55x86
08/19/2010 17:10:49  S    Job Run at request of Scheduler at cdevm2centos55x86
08/19/2010 17:10:49  A    queue=batch
08/19/2010 17:10:54  S    Exit_status=-1 resources_used.cput=00:00:00
                          resources_used.mem=0kb resources_used.vmem=0kb
                          resources_used.walltime=00:00:05 Error_Path=/dev/pts/4
                          Output_Path=/dev/pts/4
08/19/2010 17:10:54  L    Job Run
08/19/2010 17:10:54  M    checking job post-processing routine
08/19/2010 17:10:54  S    dequeuing from batch, state COMPLETE
08/19/2010 17:10:54  M    obit sent to server
08/19/2010 17:10:54  A    user=cywong group=cywong jobname=STDIN queue=batch
                          ctime=1282209049 qtime=1282209049 etime=1282209049
                          start=1282209054 owner=cyw at cdevm2centos55x86
                          exec_host=cdevm2centos55x86/1+cdevm2centos55x86/0
                          Resource_List.neednodes=1:ppn=2 Resource_List.nodect=1
                          Resource_List.nodes=1:ppn=2
                          Resource_List.walltime=01:00:00
08/19/2010 17:10:54  A    user=cywong group=cywong jobname=STDIN queue=batch
                          ctime=1282209049 qtime=1282209049 etime=1282209049
                          start=1282209054 owner=cyw at cdevm2centos55x86
                          exec_host=cdevm2centos55x86/1+cdevm2centos55x86/0
                          Resource_List.neednodes=1:ppn=2 Resource_List.nodect=1
                          Resource_List.nodes=1:ppn=2
                          Resource_List.walltime=01:00:00 session=0
                          end=1282209054 Exit_status=-1
                          resources_used.cput=00:00:00 resources_used.mem=0kb
                          resources_used.vmem=0kb
                          resources_used.walltime=00:00:05 Error_Path=/dev/pts/4
                          Output_Path=/dev/pts/4
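
When comparing good and bad runs, it may help to pull just the exit status out of output like the above. The following is a small sketch under assumptions: exit_statuses is a hypothetical helper name, and it only relies on the "Exit_status=<number>" text visible in the log above.

```shell
# Hypothetical helper: read tracejob output on stdin and print each
# distinct Exit_status value found, one per line.
exit_statuses() {
    # Keep only the (possibly negative) number after "Exit_status=",
    # then collapse repeated values.
    sed -n 's/.*Exit_status=\(-\{0,1\}[0-9]\{1,\}\).*/\1/p' | sort -u
}
```

It would be used as "tracejob <jobid> | exit_statuses", with the job id filled in for the run being examined.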



Thank you in advance.

