[CF-metadata] high sample rate (seismic) data conventions

Seth McGinnis mcginnis at ucar.edu
Mon Apr 10 11:54:18 MDT 2017


Hi Jonathan,

Oh, climate model outputs are also supposed to have a uniform sample
rate for the whole time series -- emphasis on *SUPPOSED TO*.  To my
dismay, I have encountered multiple cases where something went wrong
in generating the data files, resulting in missing, repeated, or
weirdly spaced timesteps; sorting out the resulting problems is how I
came to appreciate the value of an explicit coordinate...

As far as I know, you are correct that CF does not have a standardized
way to represent a coordinate solely in terms of a formula without
reference to a corresponding coordinate variable.

However, that doesn't mean you couldn't do it and still have the file be
CF-compliant.  As far as I am aware (and somebody correct me if I'm
wrong), coordinate variables are not actually mandatory.

So if, for reasons of feasibility, you found it necessary to do
something like the following, I believe that strictly speaking it would
be not just allowed but fully CF-compliant:

dimensions:
  time = UNLIMITED ; // (1892160000 currently)
variables:
  double acceleration(time) ;
    acceleration:long_name = "ground acceleration" ;
    acceleration:units = "m s-2" ;
    acceleration:start_time = "2017-01-01 00:00:00.01667" ;
    acceleration:sampling_rate = "60 Hz" ;
data:
    acceleration = 1.324145e-6, ...

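For what it's worth, here's a rough netCDF4-python sketch of writing a
file laid out that way.  The file name, the sample values, and of
course the start_time/sampling_rate attribute names are just made up
for illustration -- they're not defined by CF:

    # rough sketch: write samples only, no time coordinate variable
    import numpy as np
    from netCDF4 import Dataset

    samples = np.random.normal(scale=1e-6, size=60 * 3600)  # one hour at 60 Hz

    with Dataset("ground_motion.nc", "w") as nc:
        nc.createDimension("time", None)                  # UNLIMITED
        acc = nc.createVariable("acceleration", "f8", ("time",))
        acc.long_name = "ground acceleration"
        acc.units = "m s-2"
        acc.start_time = "2017-01-01 00:00:00.01667"
        acc.sampling_rate = "60 Hz"
        acc[0:samples.size] = samples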

I actually have some files without any coordinate variables sitting
around from an intermediate stage of some processing I did; I checked
one with Rosalind Hatcher's cf-checker, and it didn't complain, so I
think it is technically legal.  It's a letter-of-the-law rather than
spirit-of-the-law thing, but it's at least theoretically compliant.
Up to you whether that counts as suitable for your use case.
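
And if a downstream tool insists on an explicit time axis, it's cheap
to rebuild one on the fly from those two attributes.  Something along
these lines (same made-up attribute names and file name as above):

    # rough sketch: reconstruct per-sample times from start_time + sampling_rate
    import numpy as np
    from netCDF4 import Dataset
    from datetime import datetime, timedelta

    with Dataset("ground_motion.nc") as nc:
        acc = nc.variables["acceleration"]
        start = datetime.strptime(acc.start_time, "%Y-%m-%d %H:%M:%S.%f")
        rate_hz = float(acc.sampling_rate.split()[0])      # "60 Hz" -> 60.0

        # seconds since the first sample, one value per sample
        seconds = np.arange(acc.shape[0], dtype="float64") / rate_hz
        print("last sample at", start + timedelta(seconds=float(seconds[-1])))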

Cheers,

--Seth



On 4/10/17 10:54 AM, Maccarthy, Jonathan K wrote:
> Hi Seth,
> 
> Thanks for the very helpful response.  I can understand the argument for
> explicit coordinates, as opposed to using formulae; I think it solves
> several problems.  The assumption of a uniform sample rate for the
> length of a continuous time series is deeply engrained in most seismic
> software, however.  Changing that assumption may lead to other problems
> (but maybe not!).  Data volumes for a single channel can be 40-100
> 4-byte samples per second, which is something like 5-12 GB per channel
> per year uncompressed.  Commonly, dozens of channels are used at once,
> though some of them may share time coordinates.  It sounds like this
> use-case is similar in volume to what you've used, and may be worth
> trying out.
> 
> Just to be clear, however, would I be correct in saying that CF has no
> accepted way of representing the data as I've described?
> 
> Thanks again,
> Jonathan
> 
>> On Apr 7, 2017, at 4:43 PM, Seth McGinnis <mcginnis at ucar.edu> wrote:
>>
>> Hi Jonathan,
>>
>> I would interpret the CF stance as being that the value in having
>> explicit coordinate variables and other ancillary data to accompany the
>> data outweighs the cost of increased storage.
>>
>> There are some cases where CF bends away from that for the sake of
>> practicality (see, e.g., the discussion about external file references
>> for cell_bounds in CMIP5), but overall, my sense is that the community
>> feels that it's better to have things explicitly written out in the file
>> than it is to provide them implicitly via a formula to calculate them.
>>
>> Based on my personal experiences, I think this is the right approach.
>> (In fact, I take it even further: I prefer to avoid data compression
>> entirely and to keep like data with like as much as possible, rather
>> than splitting big files into smaller pieces.)
>>
>> I have endured far, far more suffering and toil from (a) trying to
>> figure out what's wrong with a file that violates some implicit
>> assumption (like "there are never gaps in the time coordinate") and (b)
>> dealing with the complications of various tactics for keeping file sizes
>> small than I ever have from storing and working with very large files.
>>
>> YMMV, of course.  What are your data volumes like?  I'm working at the
>> terabyte scale, and as long as my file sizes stay under a few dozen GB,
>> I don't really even bother thinking about anything that affects the file
>> size by less than an order of magnitude.
>>
>> Cheers,
>>
>> Seth McGinnis
>>
>> ----
>> NARCCAP / NA-CORDEX Data Manager
>> RISC - IMAGe - CISL - NCAR
>> ----
>>
>>
>> On 4/7/17 9:55 AM, Maccarthy, Jonathan K wrote:
>>> Hi all,
>>>
>>> I’m curious about the suitability of CF metadata conventions for
>>> seismic sensor data.  I’ve done a bit of searching, but can’t find
>>> any mention of how CF conventions would store high sample-rate
>>> sensor data.  I do see descriptions of time series conventions, where
>>> hourly or daily sensor data samples are stored along with their
>>> timestamps, but storing individual timestamps for each sample of a
>>> high sample rate sensor would unnecessarily double the storage.
>>> Seismic formats typically don’t store time vectors, but instead just
>>> store vectors of samples with an associated start time and sampling
>>> rate.
>>>
>>> Could someone please point me towards a discussion or existing
>>> conventions on this topic?  Any help or suggestion is appreciated.
>>>
>>> Best,
>>> Jon
>>>
> 


