[torqueusers] Post job file processing error

Steve Young chemadm at hamilton.edu
Fri Mar 13 06:27:24 MDT 2009


Hi,
You need to make sure that you can ssh/scp without a password between
the server and the nodes. Depending on how you have things configured,
you may need to make sure it works with the short name and/or the FQDN.
You should be able to try a copy by hand using the error you posted:
go to ciarlab14.cluster.net and try to copy the file
/var/spool/torque/spool/32.ciarlab11.cluster.net.OU to
ciarlab11.cluster.net:/usr/local/out. I'm also guessing that you're using
NFS partitions, so be sure that you can write to those partitions on the
nodes. You might need to use the mom directive usecp. From the Torque
admin guide:

$usecp <HOST>:<SRCDIR> <DSTDIR>
    Specifies which directories should be staged (see TORQUE Data Management).
    Example: $usecp *.fte.com:/data /usr/local/data
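
Adapted to the hosts in this thread, a usecp entry might look like the sketch below. It assumes /usr/local is the same NFS-mounted directory on the server and on every node, and that the mom config file is /var/spool/torque/mom_priv/config (inferred from the spool path in the error message). Neither assumption is confirmed by the original post, and add_usecp is a hypothetical helper name.

```shell
# add_usecp CONFIG_FILE: append a $usecp line to a pbs_mom config file,
# unless that exact line is already present.
# Assumption: /usr/local is NFS-mounted at the same path on all hosts.
add_usecp() {
    config_file="$1"
    # Single quotes keep the shell from expanding $usecp.
    line='$usecp *.cluster.net:/usr/local /usr/local'
    grep -qxF "$line" "$config_file" 2>/dev/null \
        || printf '%s\n' "$line" >> "$config_file"
}

# Usage on each compute node (as root), then restart pbs_mom:
#   add_usecp /var/spool/torque/mom_priv/config
```

With a line like this in place, pbs_mom should use a local copy instead of scp for output files whose destination matches the host pattern and directory.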

Also, you state your nodes are node0-node5 but the error message says  
ciarlab11.cluster.net and ciarlab14.cluster.net so that is a little  
bit confusing. I know this has been covered on the list before so  
searching the archives might give you some more answers to this type  
of problem. I hope this helps.
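
The by-hand copy test can be scripted roughly as follows, so it can be repeated for both the short name and the FQDN of the server. This is only a sketch: check_copy is a hypothetical helper, and the short name ciarlab11 is an assumption derived from the FQDN in the error message.

```shell
# check_copy HOST DEST_DIR: confirm that passwordless ssh and scp to HOST work.
# BatchMode=yes makes ssh/scp fail instead of prompting for a password.
check_copy() {
    host="$1"
    dest_dir="$2"
    probe=$(mktemp) || return 1
    remote="$dest_dir/$(basename "$probe")"
    rc=0
    if ! ssh -o BatchMode=yes "$host" true; then
        echo "passwordless ssh to $host failed" >&2
        rc=1
    elif ! scp -q -o BatchMode=yes "$probe" "$host:$remote"; then
        echo "scp to $host:$dest_dir failed" >&2
        rc=1
    else
        # Clean up the probe file on the remote side.
        ssh -o BatchMode=yes "$host" rm -f "$remote"
    fi
    rm -f "$probe"
    return $rc
}

# Run on each compute node as the user who submits the jobs:
#   check_copy ciarlab11 /usr/local
#   check_copy ciarlab11.cluster.net /usr/local
```

If one form of the name works and the other does not, that points at known_hosts or name resolution rather than at Torque itself.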

-Steve

On Mar 12, 2009, at 9:55 PM, tracy_luofengji wrote:

> Dear all,
> Hello, I did a fresh installation of torque 2.3.0 on my cluster, and
> I ran into a strange post job file processing problem. I followed the
> same installation procedure on all 5 compute nodes (node1, node2,
> node3, node4, node5), and node0 acts as the master. On the compute
> nodes, I just installed the packages:
>
> /usr/local/torque-package-mom-linux-i686.sh --install
> /usr/local/torque-package-clients-linux-i686.sh --install
>
> and then, on the compute nodes, I ran: pbs_mom
>
> The problem is, when I submit test jobs, only node1 can send
> the output file back to the master node. The other 4 compute nodes
> cannot send the output file back. I ran the command qstat -f and
> saw the following lines:
> ......
> sched_hint:Post job file processing error;job32.ciarlab11.cluster.net on host ciarlab14.cluster.net/0
> Unable to copy file /var/spool/torque/spool/32.ciarlab11.cluster.net.OU to ciarlab11.cluster.net:/usr/local/out
> Unable to copy file /var/spool/torque/spool/32.ciarlab11.cluster.net.ER to ciarlab11.cluster.net:/usr/local/err
> comment=Job started on Thu Mar 12 at 21:09
> etime=Thu Mar 12 21:09:18 2009
> exit_status = -1
> submit_args=pbsjob
> start_time=Thu Mar 12 21:09:18 2007
> start_count=1
>
> And my job script is:
> #!/bin/sh
> #PBS -N exampleJob
> #PBS -o /usr/local/out
> #PBS -e /usr/local/err
> #PBS -V
> echo 'helloworld'
>
> I have spent 2 days on this issue, and I hope I can get some support
> from this mailing list.
> Any help will be appreciated.
>
> Thanks!
> Regards,
> Tracy
