[CF-metadata] a different (but perhaps unoriginal) approach to standard name construction

Karl Taylor taylor13 at llnl.gov
Mon Oct 27 19:05:21 MDT 2008


Dear all,

It seems to me that the issue of possibly wanting to store several 
different chemical species in a single array (with a coordinate variable 
identifying the species) is only one limitation of the current 
constraints placed by standard names.  We've also run up against the 
following difficulties:

1)  Currently it is impossible to identify, with a single standard name,
closely related variables that one might want to store in a single
array.  For example, such quantities as:

     *  temperatures measured with several different instruments.

     *  precipitation separated into categories of snow and ice

     *  concentrations of a molecule (e.g., CO2) separated into 
components defined by its source (fossil fuel combustion, volcanic 
emissions, respiration, decay, etc.) [some have talked about 
distinguishing these contributions by an artificial "color" label]

     *  the contributions of various "processes" to a particular 
quantity (e.g., temperature tendency due to advection, deep_convection, 
short_wave_radiation, etc.)

     *  a variable, as simulated by several different models

     * and the like

2) thresholds and similar things

3) various combinations of variables or operations/transformations, such as:

     * anomalies and more generally differences

     * products (e.g., transports, correlations, etc.)

     * and the like

I think it is perhaps time to consider devising an alternative way of 
providing the information that is currently in the standard name 
(instead of forcing all the information into a single attribute).  As 
you will see, this will eliminate the above limitations, but perhaps 
more importantly it provides a way of more quickly converging on new 
standard names.

The idea, which I'm sure must have been discussed at length already (but 
I've forgotten by now or I've missed it entirely), is to parse the 
quantity identification information into separate elements (or 
"categories" or "components").  We already do this to a certain extent 
by providing some information in the cell_methods attribute.  I would 
build on the bits of independent information already listed in the 
Guidelines for Construction of CF Standard Names 
(http://cf-pcmdi.llnl.gov/documents/cf-standard-names/guidelines).
We might, for example, parse air_temperature and sea_water_temperature
into two independent attributes: medium="sea_water" or "air", and
quantity="temperature".
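As a rough illustration of that split, here is a minimal Python sketch.
The list of media and the function name split_name are hypothetical,
invented for this example; nothing here is an agreed CF rule.

```python
# Hypothetical illustration only: the medium list and the function name
# are assumptions made for this sketch, not part of the CF conventions.
KNOWN_MEDIA = ("sea_water", "air")


def split_name(standard_name):
    """Split a name such as 'sea_water_temperature' into components."""
    for medium in KNOWN_MEDIA:
        prefix = medium + "_"
        if standard_name.startswith(prefix):
            return {"medium": medium,
                    "quantity": standard_name[len(prefix):]}
    # No recognised medium: treat the whole name as the quantity.
    return {"quantity": standard_name}


print(split_name("air_temperature"))
print(split_name("sea_water_temperature"))
```

Both names end up sharing quantity="temperature" and differ only in
the medium, which is the point of the decomposition.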

These independent bits of information could be automatically assembled 
together to create the "standard name".  The current standard names 
would in some cases be identical to the names created from the elements, 
and in other cases we could establish aliases. This would make it 
obvious in many cases how to construct new standard names, and in any 
case would impose a structure on the standard names.
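The assembly step might look something like the following sketch.  The
component order, the function name assemble, and the single alias entry
are assumptions chosen for illustration; the actual construction rules
and alias table would have to be agreed.

```python
# Hypothetical sketch: the ordering of components and the alias table
# are assumptions for illustration, not agreed construction rules.
COMPONENT_ORDER = ("surface", "medium", "constituent", "quantity")

# Maps an automatically assembled name to an existing standard name in
# the cases where the two differ.
ALIASES = {
    "top_of_atmosphere_outgoing_longwave_flux": "toa_outgoing_longwave_flux",
}


def assemble(components):
    """Join the supplied components in a fixed order, then apply aliases."""
    parts = [components[key] for key in COMPONENT_ORDER if key in components]
    name = "_".join(parts)
    return ALIASES.get(name, name)


print(assemble({"medium": "air", "quantity": "temperature"}))
print(assemble({"surface": "top_of_atmosphere",
                "quantity": "outgoing_longwave_flux"}))
```

In the first call the assembled name is already the existing standard
name; in the second the alias table maps the assembled name back to
the existing one.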

The main job of the standard name committee would be to agree on when a 
new *component* should be added and agree on the list of acceptable 
values for each component.  This would force everyone to think about 
whether a new variable can be distinguished from others simply by adding 
a new value to one of the components, or if an entirely new category 
(i.e., component) is needed.

As a first step, it might be useful to consider the following components 
(many of which appear in the "guidelines" document referred to above):

1. quantity:  the fundamental quantity (e.g., temperature, pressure, 
geopotential_height, precipitation_rate, concentration)

2. medium:  where the quantity is "measured" (e.g., sea, atmosphere (or
air?), sea_ice, troposphere, lake, stream, land_ice, cloud,
ocean_surface_mixed_layer)

3. constituent: e.g., hydrometeor, ice, snow, rain, CO2, SO4, ozone, 
aerosol, sulfate_aerosol, soot.

4. specie_color:  for when we want to distinguish constituents by what 
produced them (e.g. the sulfate aerosol in the atmosphere that comes 
from different sources: anthropogenic, natural, fossil_fuel, etc.)

5. surface: a quasi-horizontal surface that cannot easily be described 
by a vertical coordinate (e.g., sea_floor, top_of_atmosphere, 
tropopause, adiabatic_condensation_level, surface)

6. process: identifying what process is responsible for the quantity 
(e.g., for temperature tendencies: radiation, convection, 
latent_heating, etc.) [I wonder if specie_color might be combined with 
"process" into a single category?]

7. vector_component: indicating the component of a vector and its 
positive direction (e.g., eastward, northward, upward)

8. radiative_flux_component: indicating whether only the downwelling 
(incoming) or upwelling (outgoing) or net radiative flux is stored

9. tensor_component: ????

10. assumption: indicating that the quantity has been calculated under 
some assumption (e.g., assuming_clear_sky, assuming_no_snow)

11. threshold: indicating that the quantity has been calculated only
when certain conditions are satisfied.  The form of this attribute would
have to be worked out, but presumably it would identify both the
condition(s) and the values (or variables containing the values) of the
thresholds.

The remaining 6 categories might not be considered part of the 
"standard_name" information, but might better be defined as new variable 
attributes:

12. formula (or transformation?): indicating that in some sense the
quantity is a "compound" quantity derivable from more fundamental
quantities.  For example, surface_net_downward_radiative_flux would have
a formula="sw + lw", and the data writer would also store in the file a
dummy variable for each term (i.e., each would be either a scalar or an
array with possibly only one element, which would be set to
missing_value).  The attributes associated with these two variables
would define the quantity stored (e.g., in this example, "sw" would have
a standard name of surface_net_downward_shortwave_radiative_flux, and
similarly for "lw").  As another example, a temporal correlation of
quantity "a" and quantity "b" could be indicated by
formula="correlation(a,b)".  As a third example, an "anomaly" could be
represented as the difference between two variables, and the attributes
associated with the variable representing the "base" state could
explicitly indicate how it was calculated (e.g., for a climatology, the
climatological period).  For the formula attribute, we might consider
adopting the syntax from something like MATLAB, I guess.  Note that the
formula attribute makes it possible to express many different quantities
without agreeing explicitly on their standard names (just the
standard_names of their formula terms).  Note also that it is possible
that the threshold information (#11 above) might be represented instead
by an appropriate formula.
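To make the formula idea concrete, here is a small sketch of how a
reader of the file might resolve the terms of a formula to the dummy
variables that define them.  It uses Python's own expression parser as
a stand-in for whatever formula syntax is eventually chosen; the
variable name "rnet", the dictionary layout, and the function names are
all hypothetical.

```python
import ast

# Hypothetical stand-in for the variables in a file.  "rnet" is an
# invented variable name; the standard names follow the wording of the
# example in the text.
file_variables = {
    "rnet": {"formula": "sw + lw"},
    "sw": {"standard_name": "surface_net_downward_shortwave_radiative_flux"},
    "lw": {"standard_name": "surface_net_downward_longwave_radiative_flux"},
}


def formula_terms(formula):
    """Return the sorted variable names that a formula refers to."""
    tree = ast.parse(formula, mode="eval")
    # Names used as functions, e.g. "correlation", are not formula terms.
    called = {node.func.id for node in ast.walk(tree)
              if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(names - called)


def describe(variable_name, variables):
    """Map each term of a variable's formula to its standard name."""
    formula = variables[variable_name]["formula"]
    return {term: variables[term]["standard_name"]
            for term in formula_terms(formula)}


print(formula_terms("correlation(a,b)"))
print(describe("rnet", file_variables))
```

Only the terms need agreed standard names here; the compound quantity
is described entirely by the formula plus the attributes of its terms.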

13. measurement_method: indicating what type of sensor was used to
measure the quantity (e.g., for sea surface temperature observations,
bucket or ship_intake_temperature) or, for models where there are
multiple methods of defining cloud radiative forcing, specifying which
of two well-known procedures, known as "method 1" and "method 2", is
used.

14. area_type: indicating that instead of applying to the whole grid 
cell (which would be the default), the quantity applies only to a 
certain portion, as in the current "where_type" construction  (e.g., 
where_land would be indicated by "land", and where_sea_ice would be 
indicated by "sea_ice")

15. region: specifying the geographic region from which the quantity is 
extracted (e.g., asia, africa, australia)

16. experiment: containing the name of the experiment that produced the 
output.

17. source:  containing some indication of the source of the data, 
whether it be from observations (e.g., ERBE) or from a model (e.g., 
CCSM3). A variable containing output from a multi-model ensemble 
(regridded to a common grid) could be stored with "source" as a 
dimension and the names of the models recorded as coordinate labels.


Any of these components of the standard_name might be omitted if either 
unnecessary, *or* if they themselves appeared as standard_names attached 
to one of the coordinate variables of the quantity.  Thus, for example, 
if the "process" were left unspecified for a variable containing
"tendency_of_air_temperature", but one of the coordinates of that
variable had the standard_name "process", then one would find stored in 
this variable all the different processes identified by the coordinate 
labels for that coordinate.  This allows us to store many different 
tendencies in a single variable, but allows us to identify each of them 
through the "process" dimension of that variable.
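A sketch of what this might look like to someone reading the data is
given below, using a plain Python dictionary as a stand-in for a
variable in a file.  The layout, the function name select_process, and
the numbers are made up for the example.

```python
# Hypothetical in-memory stand-in for a variable with a "process"
# dimension.  The data values are placeholders, not real tendencies.
variable = {
    "standard_name": "tendency_of_air_temperature",
    "dimensions": ("process", "time"),
    "coordinates": {
        "process": ["advection", "deep_convection", "short_wave_radiation"],
    },
    "data": [
        [0.1, 0.2],   # advection
        [0.3, 0.1],   # deep_convection
        [-0.2, 0.0],  # short_wave_radiation
    ],
}


def select_process(var, label):
    """Return the slice of the data belonging to one process label."""
    index = var["coordinates"]["process"].index(label)
    return var["data"][index]


print(select_process(variable, "deep_convection"))
```

The variable itself carries no "process" component; each tendency is
identified only by its label along the "process" coordinate.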

Turning now to the procedure for constructing new standard names:

When constructing a new name, one would fill in the appropriate 
information for each of the components listed above (omitting those that 
are not needed).  If information seeming to lie outside the categories 
already listed in the table were necessary to fully define the quantity, 
then the requester would propose that a new category be adopted.  Within 
each category there would be a limited set of accepted designations 
(i.e. values), and again if none of the current acceptable values was 
appropriate, the requester would suggest a new one.
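The check described above could be sketched roughly as follows.  The
table holds only a few of the example values mentioned earlier, and the
function name review and the form of its result are hypothetical.

```python
# Hypothetical controlled vocabulary: a small subset of the example
# values listed earlier, not an agreed table.
ACCEPTED = {
    "quantity": {"temperature", "pressure", "geopotential_height"},
    "medium": {"sea", "atmosphere", "sea_ice", "lake"},
    "process": {"radiation", "convection", "latent_heating"},
}


def review(request):
    """Return what the requester would need to propose, if anything."""
    proposals = []
    for category, value in request.items():
        if category not in ACCEPTED:
            proposals.append(("new category", category))
        elif value not in ACCEPTED[category]:
            proposals.append(("new value", category, value))
    return proposals


print(review({"quantity": "temperature", "medium": "lake"}))
print(review({"quantity": "temperature", "medium": "river"}))
```

An empty result means the request can be expressed entirely with
existing categories and values; anything else is what the discussion
would need to settle.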

The standard_name discussion would focus on 1) whether a new category 
was indeed needed and what the new category should be called, 2) whether 
a new value under a given category was needed and what that value should 
be, and 3) in many cases simply whether the user had correctly filled 
out the table.  [Alison could make a decision about 3) on her own in 
most cases, I suspect.]

The second step would be to form a standard_name from the information in 
the table, but this should be nearly automatic, following some simple 
construction rules.

If someone outside the current focus of CF wanted to use CF to store 
data (say someone from the biological community), they might begin by 
augmenting the components of the standard name with additional ones 
needed by their community.  They would be required to adopt existing 
"categories" when applicable to their discipline.

Sorry for the length of this and sorry if it duplicates material in 
related discussions, but I think this standard_name business seems a bit 
out of control and perhaps there are alternatives out there that might 
make it more straightforward to propose and adopt new names.

Anyway, I hope someone out there cares enough to comment on or improve 
on or suggest alternatives to this proposal.  Whatever we do, we should 
be mindful that we must be able to determine when existing standard 
names are equivalent to any future representation of the standard name 
information.

Best regards,
Karl



