[torqueusers] Job Checkpoint and Restart with torque 2.4.6 & BLCR

Rajiv Rajaian rajiv.care at gmail.com
Fri Mar 26 07:42:11 MDT 2010


Hi Jazcek Braden,

I've modified blcr_checkpoint_script and removed the "depth" variable, then
submitted the test job again, but I still can't hold the job: it remains in
the running state. The steps I followed are below. Please help me solve this
issue.

[guser02 at gcluster ~]$ qsub -c enabled,periodic,shutdown,interval=1 test.sh

[guser02 at gcluster ~]$ qhold 8

[guser02 at gcluster ~]$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
8.gcluster                test.sh          guser02                0 R workq


[guser02 at gcluster ~]$ qstat -f
Job Id: 8.gcluster.grid
    Job_Name = test.sh
    Job_Owner = guser02 at gcluster.grid
    job_state = R
    queue = workq
    server = gcluster.grid
    Checkpoint = enabled,periodic,shutdown,interval=1
    ctime = Fri Mar 26 19:07:07 2010
    Error_Path = gcluster.grid:/home/guser02/test.sh.e8
    exec_host = gcluster.grid/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Fri Mar 26 19:07:12 2010
    Output_Path = gcluster.grid:/home/guser02/test.sh.o8
    Priority = 0
    qtime = Fri Mar 26 19:07:07 2010
    Rerunable = True
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    session_id = 20882
    Variable_List = PBS_O_HOME=/home/guser02,PBS_O_LOGNAME=guser02,
        PBS_O_PATH=/usr/local/firefox/:/opt/mpich-1.2.6/bin:/usr/local/jdk1.5
        .0_03/bin/:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/local
        /tomcat-5.0.27/bin:/usr/local/ant-1.6.4/bin:/usr/local/globus-4.0.3/bi
        n:/usr/local/globus-4.0.3/sbin:/bin:/usr/local/maui/bin:/usr/local/gw/
        bin:/usr/local/rrdtool/bin:/opt/ganglia/bin:/usr/local/sbin:/usr/local
        /bin:/usr/local/pdftk-1.41/pdftk:/home/guser02/bin,
        PBS_O_MAIL=/var/spool/mail/guser02,PBS_O_SHELL=/bin/bash,
        PBS_O_HOST=gcluster.grid,PBS_SERVER=gcluster.grid,
        PBS_O_WORKDIR=/home/guser02,PBS_O_QUEUE=workq
    comment = Usage: /var/spool/PBS/mom_priv/blcr_checkpoint_script

    etime = Fri Mar 26 19:07:07 2010
    submit_args = -c enabled,periodic,shutdown,interval=1 test.sh
    start_time = Fri Mar 26 19:07:07 2010
    start_count = 1
    fault_tolerant = False
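
In both runs, the only visible sign of what the checkpoint script did is the
comment attribute in the qstat -f output, and that value is wrapped across
several lines. A minimal sketch in Python (not part of TORQUE;
parse_qstat_full is a hypothetical helper name) of unfolding the raw text for
one job so that wrapped values come out whole. It assumes attribute lines have
the form "name = value" and that any other non-blank line continues the
previous value.

```python
import re

# Matches "    name = value" attribute lines in `qstat -f` output (assumed layout).
ATTR_RE = re.compile(r"^\s*([A-Za-z_][\w.]*) = (.*)$")


def parse_qstat_full(text):
    """Parse the `qstat -f` text for a single job into a dict of attributes.

    Non-blank lines that are not "name = value" are treated as wrapped
    continuations and appended to the previous attribute without a separator.
    """
    attrs = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no information
        if stripped.startswith("Job Id:"):
            attrs["Job Id"] = stripped.split(":", 1)[1].strip()
            current = None
            continue
        match = ATTR_RE.match(line)
        if match:
            current = match.group(1)
            attrs[current] = match.group(2).strip()
        elif current is not None:
            attrs[current] += stripped
    return attrs


# Shortened sample modelled on the output above.
sample = """Job Id: 8.gcluster.grid
    job_state = R
    Hold_Types = n
    Variable_List = PBS_O_HOME=/home/guser02,PBS_O_LOGNAME=guser02,
        PBS_O_QUEUE=workq
    comment = Usage: /var/spool/PBS/mom_priv/blcr_checkpoint_script
"""

job = parse_qstat_full(sample)
print(job["job_state"], job["Hold_Types"])
print(job["comment"])
```

Here the comment is the script's own usage message rather than a Perl
compilation error. That suggests the script now compiles but exits through
its argument check; this is an inference from the output above, not something
confirmed in this thread.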


On Fri, Mar 26, 2010 at 5:15 PM, Jazcek Braden <jazcek at gmail.com> wrote:

> There is a typo in the script in the documentation: it uses a variable
> called depth without declaring it in the "my" statement a few lines up.
>
> -- Jazcek
>
> On Fri, Mar 26, 2010 at 6:59 AM, Rajiv Rajaian <rajiv.care at gmail.com>
> wrote:
> > Hi all,
> > I've installed torque 2.4.6 and enabled BLCR with the following options
> > while installing:
> >
> > ./configure --disable-gui --with-server-home=/var/spool/PBS \
> >     --with-default-server=gcluster.grid --enable-unixsockets=no \
> >     --enable-blcr --disable-gcc-warnings
> >
> > Also, my mom_priv/config looks like this:
> >
> > /var/spool/PBS/mom_priv/config
> > $checkpoint_script  /var/spool/PBS/mom_priv/blcr_checkpoint_script
> > $restart_script  /var/spool/PBS/mom_priv/blcr_restart_script
> > $checkpoint_run_exe /usr/local/bin/cr_run
> > $pbsserver gcluster.grid
> > $loglevel 7
> >
> >
> > I've also created the blcr_checkpoint_script and blcr_restart_script
> > scripts.
> >
> > When I submit a job, I get the following error. Please help me solve it.
> > Is there anything else that needs to be configured for this?
> >
> > [guser02 at gcluster ~]$ qsub -c enabled,periodic,shutdown,interval=1 test.sh
> > 1.gcluster.grid
> >
> > [guser02 at gcluster ~]$ qhold 1
> >
> > [guser02 at gcluster ~]$ qstat
> > Job id                    Name             User            Time Use S Queue
> > ------------------------- ---------------- --------------- -------- - -----
> > 1.gcluster                test.sh          guser02                0 R workq
> >
> > [guser02 at gcluster ~]$ qstat -f
> > Job Id: 1.gcluster.grid
> >     Job_Name = test.sh
> >     Job_Owner = guser02 at gcluster.grid
> >     job_state = R
> >     queue = workq
> >     server = gcluster.grid
> >     Checkpoint = enabled,periodic,shutdown,interval=1
> >     ctime = Fri Mar 26 17:20:03 2010
> >     Error_Path = gcluster.grid:/home/guser02/test.sh.e1
> >     exec_host = gcluster.grid/0
> >     Hold_Types = n
> >     Join_Path = n
> >     Keep_Files = n
> >     Mail_Points = a
> >     mtime = Fri Mar 26 17:20:05 2010
> >     Output_Path = gcluster.grid:/home/guser02/test.sh.o1
> >     Priority = 0
> >     qtime = Fri Mar 26 17:20:03 2010
> >     Rerunable = True
> >     Resource_List.nodect = 1
> >     Resource_List.nodes = 1
> >     session_id = 19993
> >     Variable_List = PBS_O_HOME=/home/guser02,PBS_O_LOGNAME=guser02,
> >         PBS_O_PATH=/usr/local/firefox/:/opt/mpich-1.2.6/bin:/usr/local/jdk1.5
> >         .0_03/bin/:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/local
> >         /tomcat-5.0.27/bin:/usr/local/ant-1.6.4/bin:/usr/local/globus-4.0.3/bi
> >         n:/usr/local/globus-4.0.3/sbin:/bin:/usr/local/maui/bin:/usr/local/gw/
> >         bin:/usr/local/rrdtool/bin:/opt/ganglia/bin:/usr/local/sbin:/usr/local
> >         /bin:/usr/local/pdftk-1.41/pdftk:/home/guser02/bin,
> >         PBS_O_MAIL=/var/spool/mail/guser02,PBS_O_SHELL=/bin/bash,
> >         PBS_O_HOST=gcluster.grid,PBS_SERVER=gcluster.grid,
> >         PBS_O_WORKDIR=/home/guser02,PBS_O_QUEUE=workq
> >     comment = Scalar found where operator expected at
> >         /var/spool/PBS/mom_priv/blcr_checkpoint_script line 31,
> >         near "$signalNum $depth"
> >         (Missing operator before $depth?)
> >         syntax error at /var/spool/PBS/mom_priv/blcr_checkpoint_script
> >         line 31, near "$signalNum $depth"
> >         Global symbol "$depth" requires explicit package name at
> >         /var/spool/PBS/mom_priv/blcr_checkpoint_script line 31.
> >         Execution of /var/spool/PBS/mom_priv/blcr_checkpoint_script
> >         aborted due to compilation errors.
> >
> >     etime = Fri Mar 26 17:20:03 2010
> >     submit_args = -c enabled,periodic,shutdown,interval=1 test.sh
> >     start_time = Fri Mar 26 17:20:03 2010
> >     start_count = 1
> >     fault_tolerant = False
> >
> >
> > Regards
> > Rajiv R
> > Project Associate,
> > CARE, MIT,
> > Anna University, Chennai
> >
> --
> Jazcek Braden
>