[CF-metadata] Pre-proposal for "charset"

Jonathan Gregory j.m.gregory at reading.ac.uk
Fri Feb 17 10:46:45 MST 2017


Dear Bob

I agree that sometimes char data is characters and sometimes strings, and one
can't tell which it is without knowing the intended use of the array concerned.
When you do know the role of this array e.g. as a quality flag data variable,
or a string-valued auxiliary coordinate variable, then you know also whether
it's a string or an array of characters. Can you give an example where one
needs to know how a char array should be interpreted but you *don't* know what
its purpose is within the CF-netCDF file?

Best wishes

Jonathan

----- Forwarded message from Bob Simons - NOAA Federal <bob.simons at noaa.gov> -----

> Date: Wed, 8 Feb 2017 10:00:32 -0800
> From: Bob Simons - NOAA Federal <bob.simons at noaa.gov>
> To: CF Metadata <CF-metadata at cgd.ucar.edu>
> Subject: Re: [CF-metadata] Pre-proposal for "charset"
> 
> I think my original pre-proposal has a significant flaw and needs to be
> revised.
> The problem is: charset needs to be specifiable for all char arrays,
> regardless of whether the values should be interpreted as Strings or
> individual chars.
> 
> I see two basic solutions:
> 
> 1) Two attributes, but a given variable would only use one of them. The
> first part of the attribute name specifies the data type:
>   char_charset = "ISO-8859-1";   //identifies a char variable using ISO-8859-1
> or
>   string_charset = "ISO-8859-1"; //identifies a String variable using ISO-8859-1
> 
> 2) Two attributes that would both be specified for every char/String
> variable, e.g.,
>   charset = "ISO-8859-1";
>   data_type = "String";             //or "char"
> 
> In either case, the charsets allowed for char (not String) data must be
> restricted to single-byte code pages (e.g., "ISO-8859-1") because other
> encodings (e.g., "UTF-8") need multiple bytes for some characters.
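
The single-byte restriction above can be checked mechanically. The following is only a sketch in Python, not part of the proposal: the helper name is_one_byte_per_char is hypothetical, and it relies on Python's built-in codecs accepting the charset name given.

```python
def is_one_byte_per_char(text, charset):
    """Return True if every character of `text` encodes to exactly one byte.

    Hypothetical helper for illustration only; `charset` must be a name
    that Python's codec registry recognises (e.g. "ISO-8859-1", "UTF-8").
    """
    return all(len(ch.encode(charset)) == 1 for ch in text)


sample = "caf\u00e9"  # "café": every character exists in ISO-8859-1

# In ISO-8859-1 each character occupies one byte, so one char == one element.
latin1_ok = is_one_byte_per_char(sample, "ISO-8859-1")

# In UTF-8 the "é" needs two bytes, so it cannot sit in a single char element.
utf8_ok = is_one_byte_per_char(sample, "UTF-8")

print(latin1_ok, utf8_ok)
```

A check like this only says something about the sample text supplied; pure ASCII text passes under UTF-8 as well, which is why the restriction has to be on the charset rather than on the data.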
> 
> ---
> I have a slight preference for (2), because it is cleaner and might be better
> in the future (I don't know the implications for nc4 and CF2).
> 
> Thoughts? Votes?
> 
> 
> 
> 
> On Mon, Feb 6, 2017 at 3:08 PM, Bob Simons - NOAA Federal <
> bob.simons at noaa.gov> wrote:
> 
> > Before I make a formal CF proposal for a "charset" attribute, I would like
> > to get comments and suggestions from all of you.
> >
> > This is a proposal to solve the problem of distinguishing strings from
> > arrays of characters and the problem of identifying the string's character
> > encoding. Presumably, it would be appended to section 2.2.
> >
> > An example of actual need is: Many/most current uses of multidimensional
> > char arrays are intended to be interpreted as Strings. But some files,
> > e.g., Argo profile float profiles, have single char data that are stored in
> > char arrays.
> >
> > Another example: while most nc files just use 7-bit ASCII characters in
> > strings, some use 8-bit characters. Some such files appear to use
> > charset=Windows-1252, others use Mac OS Roman, others use ISO-8859-1, but
> > the charset is not specified and there is currently no official CF way
> > to specify it.
> >
> > Another advantage of this proposal is that it provides a way to support
> > Unicode (and thus all of the world's languages) via the UTF-8 encoding,
> > which is useful as we increasingly work with people from non-US,
> > non-European countries.
> >
> > A possible extension of this is to allow a few special additional
> > pseudo-charset names:
> > * "HTML" - the chars are to be interpreted as an array of Strings with
> > HTML content, using the ISO-8859-1 charset. Non-ISO-8859-1 characters must
> > be encoded using the &#d; format where d is the decimal number of a Unicode
> > character.
> > * "XML" - the chars are to be interpreted as an array of Strings with
> > XML content, using the ISO-8859-1 charset. Non-ISO-8859-1 characters must
> > be encoded using the &#d; format where d is the decimal number of a Unicode
> > character.
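
The &#d; convention described for the "HTML" and "XML" pseudo-charsets can be sketched in Python as below. This is an illustration under assumptions, not a reference implementation: the two function names are hypothetical, and html.unescape follows HTML5 rules, so it also expands named entities such as &amp;, which may be more than an XML reader would want.

```python
import html


def to_latin1_with_char_refs(text):
    """Encode `text` as ISO-8859-1 bytes (hypothetical helper).

    Characters outside ISO-8859-1 are written as decimal character
    references of the form &#d; by the standard "xmlcharrefreplace"
    error handler.
    """
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


def from_latin1_with_char_refs(raw):
    """Decode ISO-8859-1 bytes and expand character references.

    Hypothetical helper; uses HTML5 unescaping rules.
    """
    return html.unescape(raw.decode("iso-8859-1"))


stored = to_latin1_with_char_refs("5 \u20ac")   # Euro sign is U+20AC = 8364
restored = from_latin1_with_char_refs(stored)
print(stored, restored)
```

Note that the encoder here does not escape a literal "&" already present in the text, so whether existing markup is passed through or escaped would still need to be decided.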
> >
> > Thank you for considering this.
> >
> >
> > --- The Actual Pre-Proposal
> > Use the "charset" attribute to indicate that a multidimensional
> > char array should be interpreted as an array of Strings,
> > not an array of individual characters.
> > The value of "charset" also serves to specify the character set
> > used to encode the strings
> > and must be the name of one of the 8-bit encodings
> > (since CF chars are 8-bits) listed at
> > http://www.iana.org/assignments/character-sets/character-sets.xhtml .
> > Charset names are case-insensitive.
> > The only charsets which are recommended are "ISO-8859-1" and "UTF-8".
> > For backwards compatibility, if "charset" is not defined,
> > it remains ambiguous whether a char array should be interpreted as
> > holding an array of individual characters or an array of Strings.
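
How a reader might apply this rule can be sketched in Python as follows. The function interpret_char_rows is hypothetical, rows are assumed to arrive as fixed-length byte strings padded with NUL bytes, and Python's codec names stand in for the IANA names.

```python
def interpret_char_rows(rows, charset=None):
    """Interpret the rows of a 2-D char array (hypothetical helper).

    rows    -- list of fixed-length bytes objects, one per row
    charset -- value of the "charset" attribute, or None if it is absent
    """
    if charset is None:
        # No attribute: the intent is ambiguous, so hand back the raw bytes
        # and leave the decision to the caller.
        return list(rows)
    # Attribute present: each row is one String. Drop the trailing NUL
    # padding, then decode with the named charset.
    return [row.rstrip(b"\x00").decode(charset) for row in rows]


rows = [b"It\x00\x00\x00", b"Book\x00", b"5 \xe2\x82\xac"]
print(interpret_char_rows(rows, "UTF-8"))
print(interpret_char_rows(rows))
```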
> >
> >
> > --- An Example: Encoding three Strings: "It", "Book", and "5 €".
> > The Unicode code point for the Euro symbol is 20AC (in hexadecimal),
> > which is 8364 (in decimal).
> > The Euro symbol is encoded in UTF-8 as 3 bytes: E2 82 AC (in hexadecimal).
> > So a file would store these strings in a char array as:
> >   dimensions
> >     words = 3;
> >     strLen = 5;
> >   char myWords[words][strLen] = "It[0][0][0]", "Book[0]", "5 [E2][82][AC]";
> >     charset = "UTF-8";
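
The byte layout in this example can be reproduced with a few lines of Python. This is a sketch only: pack_strings is a hypothetical helper, and padding with NUL bytes to the longest encoded length is an assumption taken from the [0] markers in the example.

```python
def pack_strings(strings, charset="UTF-8"):
    """Encode strings and pad them to a common length (hypothetical helper).

    Returns (str_len, rows), where str_len is the byte length of the
    longest encoded string and every row is NUL-padded to that length.
    """
    encoded = [s.encode(charset) for s in strings]
    str_len = max(len(b) for b in encoded)
    return str_len, [b.ljust(str_len, b"\x00") for b in encoded]


str_len, rows = pack_strings(["It", "Book", "5 \u20ac"])
print(str_len)
for row in rows:
    # Show each row as space-separated hexadecimal byte values.
    print(row.hex(" "))
```

The point of the exercise is that strLen counts bytes, not characters: "5 €" is three characters but five bytes in UTF-8.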
> >
> >
> > --
> > Sincerely,
> >
> > Bob Simons
> > IT Specialist
> > Environmental Research Division
> > NOAA Southwest Fisheries Science Center
> > 99 Pacific St., Suite 255A      (New!)
> > Monterey, CA 93940               (New!)
> > Phone: (831)333-9878            (New!)
> > Fax:   (831)648-8440
> > Email: bob.simons at noaa.gov
> >
> > The contents of this message are mine personally and
> > do not necessarily reflect any position of the
> > Government or the National Oceanic and Atmospheric Administration.
> > <>< <>< <>< <>< <>< <>< <>< <>< <><
> >
> >
> 
> 



----- End forwarded message -----


