[torqueusers] Submitting jobs from a 32-bit OS to a 64-bit Torque server
Wayne Mallett
wayne.mallett at jcu.edu.au
Thu Jan 29 17:06:57 MST 2009
G'day all,
Ever since upgrading Torque to 2.3.6, I find that servers running 32-bit
OSes can no longer submit jobs successfully. qsub reports:
qsub: read error: connection reset by peer
The mom logs show:
01/30/2009 10:01:07;0008; pbs_mom;Job;95037.head3.cluster;ready to commit job
01/30/2009 10:01:07;0008; pbs_mom;Job;95037.head3.cluster;ready to commit job completed
01/30/2009 10:01:07;0008; pbs_mom;Job;95037.head3.cluster;committing job
01/30/2009 10:01:07;0008; pbs_mom;Job;95037.head3.cluster;starting job execution
01/30/2009 10:01:07;0001; pbs_mom;Job;job_nodes;job: 95037.head3.cluster numnodes=1 numvnod=1
01/30/2009 10:01:07;0008; pbs_mom;Job;95037.head3.cluster;evaluating limits for job
01/30/2009 10:01:07;0001; pbs_mom;Job;95037.head3.cluster;about to fork child which will become job
01/30/2009 10:01:07;0001; pbs_mom;Job;TMomFinalizeJob2;job: 95037.head3.cluster numnodes=1 numvnod=1
01/30/2009 10:01:07;0001; pbs_mom;Job;95037.head3.cluster;phase 2 of job launch successfully completed
01/30/2009 10:01:12;0001; pbs_mom;Job;95037.head3.cluster;job not ready after 5 second timeout, MOM will recheck
01/30/2009 10:01:12;0008; pbs_mom;Job;95037.head3.cluster;job execution started
01/30/2009 10:01:12;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "jobs=95037.head3.cluster"
01/30/2009 10:01:12;0008; pbs_mom;Job;95037.head3.cluster;checking job start in TMOMScanForStarting - examining pipe from child
01/30/2009 10:01:12;0001; pbs_mom;Job;95037.head3.cluster;task/session info loaded
01/30/2009 10:01:12;0008; pbs_mom;Req;send_sisters;sending command ABORT_JOB for job 95037.head3.cluster (10)
01/30/2009 10:01:12;0008; pbs_mom;Job;kill_job;scan_for_exiting: sending signal 9, "KILL" to job 95037.head3.cluster, reason: local task termination detected
01/30/2009 10:01:12;0008; pbs_mom;Job;95037.head3.cluster;kill_job done (killed 0 processes)
01/30/2009 10:01:12;0080; pbs_mom;Job;95037.head3.cluster;sending preobit jobstat
01/30/2009 10:01:12;0080; pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 95037.head3.cluster
01/30/2009 10:01:12;0080; pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 95037.head3.cluster
01/30/2009 10:01:12;0080; pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 95037.head3.cluster
01/30/2009 10:01:12;0080; pbs_mom;Job;95037.head3.cluster;checking job w/subtask pid=0 (child pid=9992)
01/30/2009 10:01:12;0080; pbs_mom;Job;95037.head3.cluster;performing job clean-up
01/30/2009 10:01:12;0080; pbs_mom;Job;95037.head3.cluster;epilog subtask created with pid 9993 - substate set to JOB_SUBSTATE_OBIT - registered post_epilogue
01/30/2009 10:01:12;0080; pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 95037.head3.cluster
01/30/2009 10:01:12;0080; pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 95037.head3.cluster
01/30/2009 10:01:12;0080; pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 95037.head3.cluster
01/30/2009 10:01:12;0080; pbs_mom;Job;95037.head3.cluster;checking job w/subtask pid=9993 (child pid=9993)
01/30/2009 10:01:12;0008; pbs_mom;Job;95037.head3.cluster;checking job post-processing routine
01/30/2009 10:01:12;0080; pbs_mom;Req;post_epilogue;preparing obit message for job 95037.head3.cluster
01/30/2009 10:01:12;0080; pbs_mom;Job;95037.head3.cluster;encoding "send flagged" attr: Error_Path
01/30/2009 10:01:12;0080; pbs_mom;Job;95037.head3.cluster;encoding "send flagged" attr: Output_Path
01/30/2009 10:01:12;0080; pbs_mom;Job;95037.head3.cluster;obit sent to server
01/30/2009 10:01:12;0001; pbs_mom;Job;95037.head3.cluster;setting job substate to EXITED
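
Each mom log record above is a semicolon-delimited line that starts with a
timestamp. In case it helps anyone sifting through a longer log, here is a
minimal Python sketch that splits such records into fields and also rejoins
records that a mail client has wrapped across lines. The function names
(join_wrapped, parse_record) and the dictionary keys are illustrative labels
of mine, not Torque terminology, and the six-field layout is inferred only
from the excerpt above.

```python
import re

# Assumption: every record starts with "MM/DD/YYYY HH:MM:SS;" as in the
# excerpt; any line that does not is treated as a wrapped continuation.
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")


def join_wrapped(lines):
    """Rejoin log records that were wrapped across several lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # Continuation of the previous record.
            records[-1] += " " + line
    return records


def parse_record(record):
    """Split one record into six fields; return None if it does not fit."""
    # maxsplit=5 keeps any ';' inside the free-text message intact.
    parts = [part.strip() for part in record.split(";", 5)]
    if len(parts) != 6:
        return None
    keys = ("timestamp", "event", "daemon", "class", "id", "message")
    return dict(zip(keys, parts))


def records_mentioning(lines, needle):
    """Return parsed records whose id or message contains `needle`."""
    parsed = (parse_record(r) for r in join_wrapped(lines))
    return [p for p in parsed
            if p and (needle in p["id"] or needle in p["message"])]
```

For example, records_mentioning(open(path), "kill_job") would pull out the
two kill_job records from the excerpt, where path is whatever mom log file
is being inspected.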
Thanks in advance,
Wayne
--
Dr. Wayne Mallett
High Performance & Research Computing Support
Phone: 0747815084
Email: Wayne.Mallett at jcu.edu.au
Smail: James Cook University
Townsville Qld 4811
Australia