[torqueusers] Checkpoint script failed with return value of 13

Sreedhar Manchu sm4082 at nyu.edu
Tue Jan 31 09:24:23 MST 2012


Adding no_root_squash to the export options in /etc/exports (on the NFS
server) fixed the issue.
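
For anyone who hits the same error: pbs_mom runs as root and tries to change the owner of the checkpoint directory. With NFS root squashing in effect, root on the compute node is mapped to an unprivileged user, which is presumably why the chown failed with "Operation not permitted". Below is a rough sketch of the change and a quick check, assuming the checkpoint directory sits on an NFS export. The path /scratch/checkpoint, the client range, and the helper name export_allows_root are made up for illustration and are not from our setup.

```shell
# Example /etc/exports entry on the NFS server (path and client range are
# hypothetical placeholders):
#   /scratch/checkpoint  10.1.0.0/255.255.0.0(rw,sync,no_root_squash)
# After editing the file, re-export as root on the server:
#   exportfs -ra

# Succeeds (exit status 0) if the entry for directory $2 in exports file $1
# carries the no_root_squash option; fails otherwise.
export_allows_root() {
    awk -v dir="$2" '$1 == dir && /no_root_squash/ { found = 1 } END { exit !found }' "$1"
}
```

Usage would be something like: export_allows_root /etc/exports /scratch/checkpoint && echo "root is not squashed". Since no_root_squash lets root on the clients act as root on the export, it is probably best limited to the export that holds the checkpoint directory and to trusted compute nodes.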

Sreedhar.

On Jan 31, 2012, at 10:32 AM, Sreedhar Manchu wrote:

> Hi,
> 
> When I try to checkpoint a simple job I see the error 
> 
> Checkpoint script failed with return value of 13
> 
> in qstat -f output. 
> 
> I see this in system messages
> 
> Jan 31 10:09:04 compute-4-14 pbs_mom: LOG_ERROR::Operation not permitted (1) in blcr_checkpoint_job, cannot change checkpoint directory owner
> Jan 31 10:09:04 compute-4-14 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 13 
> Jan 31 10:09:37 compute-4-14 pbs_mom: LOG_ERROR::Operation not permitted (1) in blcr_checkpoint_job, cannot change checkpoint directory owner
> Jan 31 10:09:37 compute-4-14 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 13
> 
> I found this in checkpoint_script.
> 
> # Note also that a request was made to identify whether this script was invoked
> # by the job's owner or by a system administrator.  While this information is
> # known to pbs_server, it is not propagated to pbs_mom and thus it is not
> # possible to pass this to the script.  Therefore, a workaround is to invoke
> # qmgr and attempt to set a trivial variable. This will fail if the invoker is
> # not a manager.
> 
> Does anyone know what exactly I need to do here? I am not sure which trivial variable I need to set with qmgr.
> 
> Our Server Attributes:
> 
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_host_enable = False
> set server acl_hosts = crunch.its.nyu.edu
> set server acl_hosts += crunch.local
> set server managers = root@crunch.local
> set server operators = root@crunch.local
> set server default_queue = route
> set server log_events = 511
> set server mail_from = adm
> set server query_other_jobs = True
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server mom_job_sync = True
> set server submit_hosts = login-0-1
> set server submit_hosts += login-0-0
> set server submit_hosts += login-0-3
> set server submit_hosts += login-0-2
> set server allow_node_submit = False
> set server next_job_number = 139165
> 
> If anyone knows how to get around this error, please let me know. I'd appreciate your help.
> 
> Thanks,
> Sreedhar.
> 
> ---
> Sreedhar Manchu
> HPC Support Specialist
> New York University
> 251 Mercer Street
> New York, NY 10012-1110
> 
> 

---
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110



