[torqueusers] Torque 4.1.2 pbs_server crashes when running jobs with files to stage-in

andrew.lahiff at stfc.ac.uk
Fri Sep 28 00:08:24 MDT 2012


I've set up a small test batch system using Torque 4.1.2. If I run only very simple test jobs, e.g.

qsub -q gridS sleep.sh

where the script sleep.sh is shown below (*), everything is fine. However, whenever I submit a job that includes a stage-in, e.g.

qsub -q gridS -W stagein="hosts@lcgvm17:/etc/hosts" sleep.sh

then pbs_server crashes. The last few lines of the pbs_server log file look like this:

09/27/2012 22:41:37;0080;PBS_Server.29253;Req;dis_request_read;decoding command AlternateUserAuthentication from dteam087
09/27/2012 22:41:37;0100;PBS_Server.29253;Req;;Type AlternateUserAuthentication request received from dteam087@lcgvm17, sock=10
09/27/2012 22:41:37;0001;PBS_Server.29254;Svr;PBS_Server;svr_setjobstate: setting job 79509.cloud041 state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
09/27/2012 22:41:37;0008;PBS_Server.29254;Job;reply_send_svr;Reply sent for request type RunJob on socket 9
09/27/2012 22:41:37;0001;PBS_Server.29255;Svr;PBS_Server;svr_setjobstate: setting job 79509.cloud041 state from RUNNING-STAGEGO to RUNNING-PRERUN (4-40)

When running pbs_server with gdb and submitting the same type of job, I see this:

allocated node cloud126/0 to job 79509.cloud041 (nsnfree=24)

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffee1fc700 (LWP 29255)]
0x000000000044c97b in send_job_to_mom (pjob_ptr=0x7fffee1fbc08, preq=0x0, parent_job=0x0) at req_runjob.c:1110
1110      if (preq->rq_reply.brp_un.brp_txt.brp_str != NULL)
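
To spell out what the gdb output suggests: send_job_to_mom is entered with preq=0x0, and line 1110 then dereferences preq, which would explain the SIGSEGV. Below is a minimal sketch of the kind of guard that would avoid that dereference. The struct definitions are hypothetical, heavily simplified stand-ins that model only the field chain from line 1110 (they are not the real Torque definitions), and reply_text_or_null is a made-up helper name. It is not a proposed patch, and it says nothing about why preq is NULL in the stage-in path.

```c
#include <stddef.h>

/* Hypothetical stand-ins: only the members named on req_runjob.c:1110
 * are modelled; the real Torque structures are much larger. */
struct brp_txt_sketch {
  char *brp_str;
};

struct batch_reply_sketch {
  union {
    struct brp_txt_sketch brp_txt;
  } brp_un;
};

struct batch_request_sketch {
  struct batch_reply_sketch rq_reply;
};

/* Hypothetical helper: returns the reply text, or NULL when no request
 * was passed in (the preq=0x0 case seen in gdb). */
static const char *reply_text_or_null(const struct batch_request_sketch *preq)
{
  if (preq == NULL)
    return NULL;                 /* check the pointer before following it */

  return preq->rq_reply.brp_un.brp_txt.brp_str;
}
```

With a guard like this, a caller that has no request to hand gets NULL back instead of a crash; whether skipping the reply text is the right behaviour there is a separate question.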

With Torque 4.1.0 everything was fine and I didn't see this problem, but pbs_server crashes in the same way with both Torque 4.1.1 and 4.1.2. I'm using Linux 2.6.32-220.17.1.el6.x86_64.

Has anyone else experienced this issue, or does anyone know what could be causing it?

Many Thanks,

(*) sleep.sh:

sleep 10