[torqueusers] torque tmpdir on Lustre filesystem
l.flis at cyf-kr.edu.pl
Fri Feb 24 03:24:30 MST 2012
Hello Christopher, Hi *
> We don't use Lustre (we have Panasas and GPFS), but just wondering
> does this happen all the time, or only occasionally ?
It happens occasionally. But as I said, this looks like a bug in the Lustre
FS, and it has nothing to do with the Torque code. Torque uses an unlucky
sequence of stat/mkdir calls which exposes the Lustre misbehaviour.
> If occasionally, then if the job fails once, will it always fail, or
> will it work if you try again?
Another call to the mkdirtree() function should succeed after a few
seconds of sleep.
I believe this behaviour in the Lustre client appeared in the 1.8.x line and
remains in 2.1.x. HP SFS, IIRC, is based on 1.4 and 1.6, so it's not affected.
We observed the bug on Lustre 1.8.(4,5,6) infrastructure. Then we
moved to the 2.1 line, replacing all the components (servers, arrays, fabric),
and the bug remained.
The problem with Lustre is that a mkdir() call on an EXISTING directory
returns an EPERM error instead of EEXIST once in a while, usually when
stat() is called before mkdir().
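
To make the failure mode and a possible workaround concrete, here is a
minimal sketch in Python (not Torque's C code, and not a patch anyone has
applied). The helper name mkdir_tolerant and its retries/delay parameters
are made up for illustration. It assumes that treating EPERM like EEXIST
is acceptable whenever the path turns out to be an existing directory, and
that calling mkdir first and checking the path only on failure avoids the
stat-then-mkdir ordering described above.

```python
import errno
import os
import time


def mkdir_tolerant(path, mode=0o700, retries=3, delay=2.0):
    """Create `path`, accepting an already existing directory.

    Hypothetical helper for illustration only. A spurious EPERM on an
    existing directory is handled the same way as EEXIST.
    """
    last_error = None
    attempts = max(1, retries)
    for attempt in range(attempts):
        try:
            # mkdir goes first; the path is only inspected if it fails.
            os.mkdir(path, mode)
            return
        except OSError as exc:
            if exc.errno not in (errno.EEXIST, errno.EPERM):
                raise
            if os.path.isdir(path):
                # The directory is there, so the error is harmless.
                return
            if exc.errno == errno.EEXIST:
                # Something that is not a directory is in the way.
                raise
            last_error = exc
        if attempt < attempts - 1:
            # Assumption: a short sleep lets the client state settle.
            time.sleep(delay)
    raise last_error
```

A genuine permission problem still surfaces: if the path never shows up as
a directory, the last EPERM is re-raised after the final attempt.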
I believe doing mkdir on an existing path is not a very common practice, and
that's the reason the bug went unnoticed for a long time.