[Cesm2control] Fwd: no log files ???

Jim Edwards jedwards at ucar.edu
Thu Feb 15 07:33:46 MST 2018


Hi all,

Due to problems in the PBS system on Cheyenne, it is possible for the
archiving script to start before the model run completes - this will cause
the model to stop and the incomplete logs to be moved to the archive
directory. I recommend running with DOUT_S=FALSE and then running
case.st_archive by hand until this problem is resolved.
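
In a standard CIME case directory the workaround looks roughly like the
sketch below (a sketch only; adjust paths and job sequencing to your own
case):

    # turn off automatic short-term archiving for subsequent submissions
    ./xmlchange DOUT_S=FALSE
    ./case.submit

    # once a model run has finished, archive its output by hand
    ./case.st_archive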

Jim

---------- Forwarded message ----------
From: Richard Valent <valent at ucar.edu>
Date: Wed, Feb 14, 2018 at 4:28 PM
Subject: Re: no log files ???
To: Mick Coady <mickc at ucar.edu>
Cc: Jim Edwards <jedwards at ucar.edu>, Mariana Vertenstein <mvertens at ucar.edu>,
Cecile Hannay <hannay at ucar.edu>, Gokhan Danabasoglu <gokhan at ucar.edu>,
Siddhartha Ghosh <sghosh at ucar.edu>


Hi Jim, Mariana, Cecile, Gokhan, Siddhartha and Mick;

My understanding of the job-dependency problem that bit you and others
today is that it is a consequence of a PBS error. CISL has requested a
fix for this particular problem, but it will require a PBS update, and we
are not sure when the vendor will provide it. While we are waiting for the
fix, the strategy suggested in today's meeting with the systems
administrators is for you to monitor your jobs and resubmit them when you
see that they have run incorrectly, as was obvious from today's failures.

I am sorry I cannot offer anything more positive. Know that getting this
fix is high on our to-do list.  Mick, if you have anything to add to this
email from today's meeting, please jump in.  --Dick Valent



On Wed, Feb 14, 2018 at 11:51 AM, Mick Coady <mickc at ucar.edu> wrote:

> Colin reported what sounds like the exact same problem.  Dick has created
> high-priority tickets for SSG on this, and we'll keep you posted as best we
> can.
>
> On Wed, Feb 14, 2018 at 11:39 AM, Jim Edwards <jedwards at ucar.edu> wrote:
>
>> I just went and spoke with Dick and Pat in user services - we think that
>> it may be a PBS problem.
>>
>> Cecile, I suggest that you qdel these jobs:
>>
>> 4658078
>>
>> 4658135
>>
>> 4658121
>>
>> These are your short-term archiver jobs, and it appears that the problem
>> is that they are starting before the model run completes. So we want you
>> to run short-term archiving by hand until we figure this out. Does that
>> make sense?
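>>
>> For reference, that amounts to roughly the following (a sketch; the qdel
>> IDs are the archiver jobs listed above, and case.st_archive is run from
>> the affected case directory):
>>
>>     # remove the already-queued short-term archiver jobs
>>     qdel 4658078 4658135 4658121
>>
>>     # after the model run has completed, archive by hand
>>     ./case.st_archive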
>>
>>
>> On Wed, Feb 14, 2018 at 11:32 AM, Mariana Vertenstein <mvertens at ucar.edu>
>> wrote:
>>
>>> Since this has not happened before, I am concluding that this must be a
>>> Cheyenne issue. Does everyone agree?
>>>
>>> On Wed, Feb 14, 2018 at 11:25 AM, Cecile Hannay <hannay at ucar.edu> wrote:
>>>
>>>> Jim,
>>>> I had another case. It is even worse than that.
>>>> I had a CESM run going, and the archiving started running before the job
>>>> completed.
>>>> Then the archiving job completed and resubmitted the CESM run (while the
>>>> previous run was still running).
>>>> So I had two instances of CESM running, basically overwriting each other.
>>>> Cecile
>>>>
>>>>
>>>>
>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>> Cecile Hannay
>>>> National Center for Atmospheric Research
>>>> email: hannay at ucar.edu
>>>> phone: 303-497-1327
>>>> webpage: http://www.cgd.ucar.edu/staff/hannay/
>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>>
>>>> On Wed, Feb 14, 2018 at 11:18 AM, Jim Edwards <jedwards at ucar.edu>
>>>> wrote:
>>>>
>>>>>
>>>>> We've had a couple of jobs exhibit some strange behavior on Cheyenne
>>>>> this morning; both stopped running without any indication in the logs
>>>>> as to why. Could you please check and see what the issue might have
>>>>> been?
>>>>>
>>>>> An interesting piece of this is that both had dependent jobs that should
>>>>> not have run because the model job did not complete; however, in both of
>>>>> these cases the dependent jobs did run.
>>>>>
>>>>>
>>>>> The job IDs were:
>>>>>
>>>>> 4651153 and 4644799
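>>>>>
>>>>> (For context, the dependent archive job is submitted with a PBS job
>>>>> dependency, presumably something along the lines of the sketch below,
>>>>> so it should not start until the model job has exited successfully:
>>>>>
>>>>>     # hypothetical illustration of the intended dependency
>>>>>     qsub -W depend=afterok:<model_jobid> case.st_archive
>>>>>
>>>>> In these two cases that dependency evidently did not hold.)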
>>>>>
>>>>> On Wed, Feb 14, 2018 at 11:09 AM, Cecile Hannay <hannay at ucar.edu>
>>>>> wrote:
>>>>>
>>>>>> It did happen again in 2 runs:
>>>>>> /glade/p/cesmdata/cseg/runs/cesm2_0/b.e20.B1850.f09_g17.pi_control.all.274
>>>>>> /glade/p/cesmdata/cseg/runs/cesm2_0/b.e20.B1850.f09_g17.pi_control.all.265_fix
>>>>>>
>>>>>> As far as I can tell, it is not a quota problem.
>>>>>>
>>>>>>   Space                             Used       Quota    % Full
>>>>>> ------------------------------ ----------- ----------- ---------
>>>>>> /glade/p/cesm                    198.49 TB   200.22 TB   99.14 %
>>>>>> /glade/p/cesm/amwg_dev            26.00 TB    42.22 TB   61.58 %
>>>>>> /glade/p/cesm/bgcwg_dev           57.60 TB    58.60 TB   98.29 %
>>>>>> /glade/p/cesm/chwg_dev            73.85 TB    78.02 TB   94.66 %
>>>>>> /glade/p/cesm/liwg_dev             0.86 TB     5.37 TB   16.01 %
>>>>>> /glade/p/cesm/lmwg_dev             6.59 TB     6.60 TB   99.85 %
>>>>>> /glade/p/cesm/omwg_dev            83.65 TB    87.56 TB   95.53 %
>>>>>> /glade/p/cesm/palwg_dev          119.39 TB   122.56 TB   97.41 %
>>>>>> /glade/p/cesm/pcwg_dev            39.13 TB    41.00 TB   95.44 %
>>>>>> /glade/p/cesm/sdwg_dev            44.61 TB    45.79 TB   97.42 %
>>>>>> /glade/p/cesm/wawg_dev            81.26 TB    83.00 TB   97.90 %
>>>>>> /glade/u/cesm-scripts            300.16 GB  1024.00 GB   29.31 %
>>>>>> /glade/p/cesmLE                  124.38 TB   124.38 TB  100.00 %
>>>>>> /glade/p/cesmLME                  99.47 TB   105.50 TB   94.28 %
>>>>>> /glade/p/cesm0005                879.63 TB   900.00 TB   97.74 %
>>>>>> /glade/p/cgd                     255.98 TB   265.00 TB   96.60 %
>>>>>> /glade/p/cesmdata                 47.56 TB    57.07 TB   83.34 %
>>>>>> /glade/p/cwmk0001                  5.93 TB     6.11 TB   97.05 %
>>>>>> /glade/p/p05010048                14.25 TB    14.74 TB   96.68 %
>>>>>> /glade/scratch/hannay              8.30 TB    50.00 TB   16.60 %
>>>>>> /glade/p/work/hannay             507.27 GB   512.00 GB   99.08 %
>>>>>> /glade/u/home/hannay              31.15 GB    50.00 GB   62.30 %
>>>>>>
>>>>>>
>>>>>>
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Cecile Hannay
>>>>>> National Center for Atmospheric Research
>>>>>> email: hannay at ucar.edu
>>>>>> phone: 303-497-1327
>>>>>> webpage: http://www.cgd.ucar.edu/staff/hannay/
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 14, 2018 at 9:54 AM, Jim Edwards <jedwards at ucar.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> The only thing I can think of is that you ran out of disk space
>>>>>>> somewhere; I'm not sure why else it would behave this way.
>>>>>>>
>>>>>>> On Wed, Feb 14, 2018 at 9:52 AM, Cecile Hannay <hannay at ucar.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks. It is strange that these files were moved to the archive
>>>>>>>> directory when the run failed. I guess the archiving script got a
>>>>>>>> message saying the job completed successfully.
>>>>>>>> I am going to restart it.
>>>>>>>>
>>>>>>>>
>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>> Cecile Hannay
>>>>>>>> National Center for Atmospheric Research
>>>>>>>> email: hannay at ucar.edu
>>>>>>>> phone: 303-497-1327
>>>>>>>> webpage: http://www.cgd.ucar.edu/staff/hannay/
>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Feb 14, 2018 at 9:47 AM, Mariana Vertenstein <
>>>>>>>> mvertens at ucar.edu> wrote:
>>>>>>>>
>>>>>>>>> Thanks so much for your quick response to this!!!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Feb 14, 2018 at 9:45 AM, Jim Edwards <jedwards at ucar.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> It seems that your log files were moved to the archive directory;
>>>>>>>>>> I'm not sure why...
>>>>>>>>>>
>>>>>>>>>> /glade/p/cesm0005/archive/b.e20.B1850.f09_g17.pi_control.all.265_fix/logs/cesm.log.4644799.chadmin1.180214-060212
>>>>>>>>>>
>>>>>>>>>> I can't find any indication in any of the logs as to why it
>>>>>>>>>> failed.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 14, 2018 at 9:34 AM, Cecile Hannay <hannay at ucar.edu>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I have the strangest thing happening.
>>>>>>>>>>> I had a run that crashed this morning. I am looking in the run
>>>>>>>>>>> directory and I don't see any log files. I don't understand how
>>>>>>>>>>> that is possible:
>>>>>>>>>>> /glade2/scratch2/hannay/b.e20.B1850.f09_g17.pi_control.all.265_fix/run
>>>>>>>>>>>
>>>>>>>>>>> The case directory is in:
>>>>>>>>>>> /glade/p/cesmdata/cseg/runs/cesm2_0/b.e20.B1850.f09_g17.pi_control.all.265_fix
>>>>>>>>>>>
>>>>>>>>>>> Could you have a look? I want to restart this run, but I would
>>>>>>>>>>> like you to look first because it seems strange to me.
>>>>>>>>>>> Am I missing something?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I didn't erase anything by mistake. I was in the seminar room when
>>>>>>>>>>> I received an email that the run had crashed. Here are the commands
>>>>>>>>>>> I issued; this is all I did.
>>>>>>>>>>>
>>>>>>>>>>>   1003  9:26    cd /glade/scratch/hannay/
>>>>>>>>>>>   1004  9:27    cd *265_fix
>>>>>>>>>>>   1005  9:27    cd run/
>>>>>>>>>>>   1006  9:27    ls -all -crt
>>>>>>>>>>>   1007  9:27    ls -all -crt *log*
>>>>>>>>>>>   1008  9:27    ls -all -crt
>>>>>>>>>>>   1009  9:27    ls
>>>>>>>>>>>   1010  9:27    ls -all -crt
>>>>>>>>>>>   1011  9:28    pwd
>>>>>>>>>>>   1012  9:30    history
>>>>>>>>>>>
>>>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>> Cecile Hannay
>>>>>>>>>>> National Center for Atmospheric Research
>>>>>>>>>>> email: hannay at ucar.edu
>>>>>>>>>>> phone: 303-497-1327
>>>>>>>>>>> webpage: http://www.cgd.ucar.edu/staff/hannay/
>>>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jim Edwards
>>>>>>>>>>
>>>>>>>>>> CESM Software Engineer
>>>>>>>>>> National Center for Atmospheric Research
>>>>>>>>>> Boulder, CO
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jim Edwards
>>>>>>>
>>>>>>> CESM Software Engineer
>>>>>>> National Center for Atmospheric Research
>>>>>>> Boulder, CO
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jim Edwards
>>>>>
>>>>> CESM Software Engineer
>>>>> National Center for Atmospheric Research
>>>>> Boulder, CO
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Jim Edwards
>>
>> CESM Software Engineer
>> National Center for Atmospheric Research
>> Boulder, CO
>>
>
>
>
> --
> Mick Coady
> NCAR Computational & Information Services Laboratory
> Consulting Services Group Head
> mickc at ucar.edu
> 303.497.1828
>




-- 
Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO