[torqueusers] Server crashes because of a wrong group

Danny Sternkopf dsternkopf at hpce.nec.com
Fri Nov 9 11:27:14 MST 2007


We are experiencing a PBS server crash when a user specifies a wrong
group during job submission.

Example: qsub -lnodes=1,walltime=00:01:00 -W group_list=asl

The user is not in the asl group and gets:

$ qsub -lnodes=1,walltime=00:01:00 -W group_list=asl
echo blubb
qsub: End of File
$ echo $?

The pbs_server crashes right after that, and I can't start it again
until I remove the job files it created:
/var/spool/pbs/server_priv/jobs/*.SC and *.JB.
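
Before removing anything, it may help to see which job files are actually
present and which are newest. The following is only a sketch for inspecting
the directory named above; the helper name leftover_job_files is
hypothetical and not part of Torque, and it deliberately deletes nothing.

```python
import glob
import os


def leftover_job_files(jobs_dir):
    """Return the *.SC and *.JB files in jobs_dir, newest first.

    Hypothetical helper for inspection only; it does not remove files.
    """
    paths = []
    for pattern in ("*.SC", "*.JB"):
        paths.extend(glob.glob(os.path.join(jobs_dir, pattern)))
    # Newest files first, since the offending job is usually the latest one.
    return sorted(paths, key=os.path.getmtime, reverse=True)


# Example (needs read access to the server's private directory):
#   for path in leftover_job_files("/var/spool/pbs/server_priv/jobs"):
#       print(path)
```

Listing by modification time is a guess at what is useful here: the files
belonging to the rejected submission should be the most recent ones.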

The same crash also happens if the group doesn't exist at all.
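
As a stopgap until the cause is known, a submission wrapper could check the
group before qsub is ever called. This is a minimal sketch, assuming Python 3
on the submit host; the function name user_in_group is hypothetical and not
part of Torque, and the check reflects only the submit host's view of users
and groups, which may differ from the server's.

```python
import grp
import os
import pwd


def user_in_group(user, group):
    """Return True if `user` belongs to `group` on this host.

    Returns False when either the user or the group does not exist,
    which covers both failing cases described above.
    """
    try:
        pw = pwd.getpwnam(user)
        gid = grp.getgrnam(group).gr_gid
    except KeyError:
        return False
    # getgrouplist includes the primary group and all supplementary groups.
    return gid in os.getgrouplist(user, pw.pw_gid)


# Example: refuse to submit when the check fails.
#   if not user_in_group("someuser", "asl"):
#       raise SystemExit("not a member of group asl; job not submitted")
```

A wrapper like this only avoids triggering the crash; it does not explain
why the server fails on the bad group in the first place.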

I've been aware of this for a couple of years, but I never really
followed up on it because it happened very rarely.

The affected platform is EM64T (x86_64) running RHEL4 (Scientific
Linux 4.1).

I did some tests on other systems:

1. On ia64 running RHEL3 (Whitebox Linux)
- The job runs fine, and in the accounting you can see 'group=<null>'.

2. On x86_64 running RHEL3 (Fedora 3)
- The job is 'rejected by all possible destinations'.

All three systems are running Torque version 2.1.6.

The different behaviors are very strange. It might be that configuration
also plays a role.

Any ideas what could be going wrong? Is this a known issue?

Best regards,

Danny Sternkopf                         dsternkopf at hpce.nec.com
NEC HPC Europe GmbH,           http://www.teraflop-workbench.de
Stuttgart, Germany phone: +49-711-68770-35 fax: +49-711-6877145
