[torqueusers] Strange problem when submitting job: cannot stat .OU and .ER files
Constantin CHARISSIS
cch at dataswift.fr
Thu Feb 16 06:41:27 MST 2006
Hi,
I'm experiencing a strange problem with Torque (torque-2.0.0p7-1, bundled
with Rocks 4.1.0 64-bit).
On site, we get the following error:
----- Forwarded message from root <adm at master.cluster.org> -----
X-Original-To: test at master.cluster.org
Delivered-To: test at master.cluster.org
To: test at master.cluster.org
Subject: PBS JOB 8.master.cluster.org
Precedence: bulk
Date: Tue, 14 Feb 2006 16:24:06 +0100 (CET)
From: adm at master.cluster.org (root)
PBS Job Id: 8.master.cluster.org
Job Name: sub.sh
An error has occurred processing your job, see below.
Post job file processing error; job 8.master.cluster.org on host
amd-0-6.local/1+amd-0-6.local/0+amd-0-5.local/1+amd-0-5.local/0+amd-0-4.local/1+amd-0-4.local/0+amd-0-3.local/1+amd-0-3.local/0
Unable to copy file /opt/torque/spool/8.master..OU to
/home/test/rocks_test/output.txt
>>> error from copy
/bin/cp: cannot stat `/opt/torque/spool/8.master..OU': No such file or directory
>>> end error output
Unable to copy file /opt/torque/spool/8.master..ER to
/home/test/rocks_test/error.txt
>>> error from copy
/bin/cp: cannot stat `/opt/torque/spool/8.master..ER': No such file or directory
>>> end error output
----- End forwarded message -----
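The host list in the error mail is TORQUE's exec_host string: `host/slot` entries joined by `+`. The first entry is the mother superior, which is normally the node whose pbs_mom writes the job's .OU/.ER spool files, so its /opt/torque/spool is the one worth inspecting. A minimal Python sketch that splits the string (the helper names `parse_exec_host` and `mother_superior` are hypothetical):

```python
def parse_exec_host(exec_host):
    """Split a TORQUE exec_host string ("host/slot+host/slot+...") into
    a dict mapping host -> list of slot indices, in order of appearance."""
    hosts = {}
    for entry in exec_host.split("+"):
        host, _, slot = entry.partition("/")
        hosts.setdefault(host, []).append(int(slot))
    return hosts


def mother_superior(exec_host):
    """Return the first host in the list (the mother superior)."""
    return exec_host.split("+")[0].partition("/")[0]


# exec_host value copied from the error mail above
exec_host = ("amd-0-6.local/1+amd-0-6.local/0+amd-0-5.local/1+amd-0-5.local/0+"
             "amd-0-4.local/1+amd-0-4.local/0+amd-0-3.local/1+amd-0-3.local/0")

print(mother_superior(exec_host))  # amd-0-6.local
print(parse_exec_host(exec_host))
```

For job 8 this points at amd-0-6.local rather than the node the directory listing below was taken on.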
I use the default Rocks configuration, which is basic: one queue with no
resource restrictions:
create queue default
set queue default queue_type = Execution
set queue default enabled = True
set queue default started = True
The spool directory has the following permissions:
[root at amd-0-0 torque]# pwd
/opt/torque
drwxrwxrwt 2 root root 4096 Feb 16 14:20 spool
pbs_mom is running as root.
There is 1 GB of free space on the /opt/torque/spool partition on the node.
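To compare nodes side by side, the permission and free-space checks above can be scripted. A minimal Python sketch, assuming Python 3 is available on the nodes; the function name `check_spool` is hypothetical and the default path is the one from this post. `drwxrwxrwt` corresponds to mode 1777 (world-writable with the sticky bit):

```python
import os
import shutil
import stat
import sys


def check_spool(path="/opt/torque/spool"):
    """Report whether `path` is a directory with mode 1777 (drwxrwxrwt)
    and how many bytes are free on the filesystem holding it."""
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)  # permission bits incl. sticky bit
    return {
        "is_dir": stat.S_ISDIR(st.st_mode),
        "mode": oct(mode),
        "sticky_world_writable": mode == 0o1777,
        "free_bytes": shutil.disk_usage(path).free,
    }


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "/opt/torque/spool"
    try:
        print(check_spool(target))
    except FileNotFoundError:
        print(f"{target}: no such directory on this node")
```

Running it on every host of a failing job, in particular the first host in its exec_host list, shows whether the spool directories actually differ between nodes.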
I have searched the mailing list archives and Google, but I can't find another
example where the SOURCE file cannot be stat'ed; I only found permission problems.
The strange thing is that I cannot reproduce this on our dev/test cluster,
which has exactly the same network, partitioning, naming, queue and user
configuration.
I have also copied the "faulty" torque directories from the master and compute
nodes onto my dev cluster, replacing the working ones, and it works fine there.
Any idea why MOM is not creating the .OU and .ER files?
Thanks for your help,
Constantin Charissis