[torqueusers] Strange problem when submitting job: cannot stat .OU and .ER files

Constantin CHARISSIS cch at dataswift.fr
Thu Feb 16 06:41:27 MST 2006


Hi,

I'm experiencing a strange problem with Torque (torque-2.0.0p7-1, bundled
with Rocks 4.1.0 64-bit).

On our production site, jobs end with the following error mail:

----- Forwarded message from root <adm at master.cluster.org> -----

X-Original-To: test at master.cluster.org
Delivered-To: test at master.cluster.org
To: test at master.cluster.org
Subject: PBS JOB 8.master.cluster.org
Precedence: bulk
Date: Tue, 14 Feb 2006 16:24:06 +0100 (CET)
From: adm at master.cluster.org (root)

PBS Job Id: 8.master.cluster.org
Job Name:   sub.sh
An error has occurred processing your job, see below.
Post job file processing error; job 8.master.cluster.org on host
amd-0-6.local/1+amd-0-6.local/0+amd-0-5.local/1+amd-0-5.local/0+amd-0-4.local/1+amd-0-4.local/0+amd-0-3.local/1+amd-0-3.local/0

Unable to copy file /opt/torque/spool/8.master..OU to
/home/test/rocks_test/output.txt
>>> error from copy
/bin/cp: cannot stat `/opt/torque/spool/8.master..OU': No such file or directory
>>> end error output

Unable to copy file /opt/torque/spool/8.master..ER to
/home/test/rocks_test/error.txt
>>> error from copy
/bin/cp: cannot stat `/opt/torque/spool/8.master..ER': No such file or directory
>>> end error output

----- End forwarded message -----
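
The host string in the error mail has one entry per allocated CPU, so the same node appears several times. To see which nodes are involved (and therefore whose spool directories are worth inspecting), the string can be reduced to unique node names. This is only a minimal sketch: the function name `unique_hosts` is hypothetical, and it assumes the `host/index+host/index` format shown in the mail above.

```python
def unique_hosts(exec_host):
    """Return node names in first-seen order, dropping the /<cpu index> suffix."""
    hosts = []
    for entry in exec_host.split("+"):
        # Each entry looks like "amd-0-6.local/1"; keep only the host part.
        name = entry.split("/", 1)[0].strip()
        if name and name not in hosts:
            hosts.append(name)
    return hosts


# Host string copied from the error mail.
exec_host = (
    "amd-0-6.local/1+amd-0-6.local/0+amd-0-5.local/1+amd-0-5.local/0+"
    "amd-0-4.local/1+amd-0-4.local/0+amd-0-3.local/1+amd-0-3.local/0"
)

for host in unique_hosts(exec_host):
    print(host)
```

The first name printed is the first host of the job, which is typically the node whose mom spools the job's stdout and stderr.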

I use the default Rocks configuration, which is basic: one queue with no
resource restrictions:

create queue default
set queue default queue_type = Execution
set queue default enabled = True
set queue default started = True

The spool directory has the following permissions:

[root at amd-0-0 torque]# pwd
/opt/torque

drwxrwxrwt  2 root root 4096 Feb 16 14:20 spool

Mom is running as root.

There is 1 GB of free space on the partition holding /opt/torque/spool on the node.
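
The permission and free-space checks above can be repeated on every node with a small script. This is a sketch under assumptions, not part of Torque: the function name `check_spool` is hypothetical, and the path is the one from the error mail.

```python
import os
import shutil
import stat


def check_spool(path):
    """Collect the facts checked by hand above: mode, owner, free space."""
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)
    return {
        "mode": stat.filemode(st.st_mode),                  # e.g. "drwxrwxrwt"
        "is_dir": stat.S_ISDIR(st.st_mode),
        "sticky_world_writable": mode & 0o1777 == 0o1777,   # rwxrwxrwt
        "owner_uid": st.st_uid,                             # 0 means root
        "free_bytes": shutil.disk_usage(path).free,
    }


if __name__ == "__main__":
    spool = "/opt/torque/spool"  # path taken from the error mail
    if os.path.isdir(spool):
        print(check_spool(spool))
    else:
        print(spool, "does not exist on this host")
```

Running it on each compute node of the job, rather than on a single node, makes it easier to spot a node whose spool directory differs from the others.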

I have searched the mailing list and Google but cannot find another
example where the SOURCE file cannot be stat'ed; I only found permission
problems.

The strange thing is that I cannot reproduce this on our dev/test cluster,
which has exactly the same network, partition, naming, queue, and user
configuration.

I have also copied the "faulty" Torque directories from the master and
compute nodes onto my dev cluster, replacing the working ones there, and
jobs still run fine.

Any idea why mom is not creating the .OU and .ER files?
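
One way to confirm whether the files are ever created is to list the spool directory on the executing node while a test job is still running. The sketch below is only an illustration: the function name `spooled_output_files` is hypothetical, and it assumes the `.OU` / `.ER` suffixes seen in the error mail.

```python
import os


def spooled_output_files(spool_dir):
    """Return the sorted names in spool_dir that end in .OU or .ER."""
    return sorted(
        name for name in os.listdir(spool_dir)
        if name.endswith((".OU", ".ER"))
    )


if __name__ == "__main__":
    spool = "/opt/torque/spool"  # path taken from the error mail
    if os.path.isdir(spool):
        print(spooled_output_files(spool))
    else:
        print(spool, "does not exist on this host")
```

An empty list during a running job would suggest the files are never written there; a non-empty list would suggest they disappear before the copy step.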

Thanks for your help,

Constantin Charissis


