[CF-metadata] Pre-proposal for "charset"
Bob Simons - NOAA Federal
bob.simons at noaa.gov
Wed Feb 8 11:00:32 MST 2017
I think my original pre-proposal has a significant flaw and needs to be
revised.
The problem is: charset needs to be specifiable for all char arrays,
regardless of whether the values should be interpreted as Strings or
as individual chars.
I see two basic solutions:
1) Two attributes, but a given variable would only use one of them. The
first part of the attribute name specifies the data type:
char_charset = "ISO-8859-1"; //identifies a char variable using ISO-8859-1
string_charset = "ISO-8859-1"; //identifies a String variable using ISO-8859-1
2) Two attributes that would both be specified for every char/String
variable:
charset = "ISO-8859-1";
data_type = "String"; //or "char"
In either case, the charsets allowed for char (not String) data must be
restricted to single-byte code pages (e.g., "ISO-8859-1"), because other
encodings (e.g., "UTF-8") need multiple bytes for some characters.
I have a slight preference for (2), because it is cleaner and might be better
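To make the single-byte restriction concrete, here is a minimal Python
sketch. The helper name bytes_per_char is hypothetical (it is not part of CF
or of any netCDF library); it only reports how many bytes each character
occupies in a given charset.

```python
def bytes_per_char(text, charset):
    """Return the encoded byte length of each character in text."""
    return [len(ch.encode(charset)) for ch in text]


# "\u00e9" is e-acute, which exists in ISO-8859-1 but is not 7-bit ASCII.
sample = "5 \u00e9"

# In a single-byte code page every character fits in one char element.
latin1_lengths = bytes_per_char(sample, "ISO-8859-1")

# In UTF-8 the same character needs two bytes, so it cannot be stored
# in a single char element.
utf8_lengths = bytes_per_char(sample, "UTF-8")

print(latin1_lengths, utf8_lengths)
```

With a single-byte charset, one char element always holds one character,
which is what individual-char data needs.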
in the future (I don't know the implications for nc4 and CF2).
On Mon, Feb 6, 2017 at 3:08 PM, Bob Simons - NOAA Federal <
bob.simons at noaa.gov> wrote:
> Before I make a formal CF proposal for a "charset" attribute, I would like
> to get comments and suggestions from all of you.
> This is a proposal to solve the problem of distinguishing strings from
> arrays of characters and the problem of identifying the string's character
> encoding. Presumably, it would be appended to section 2.2.
> An example of actual need is: Many/most current uses of multidimensional
> char arrays are intended to be interpreted as Strings. But some files,
> e.g., Argo profile float profiles, have single char data that are stored in
> char arrays.
> Another example, while most nc files just use 7-bit ASCII characters in
> strings, some use 8-bit characters. Some such files appear to use
> charset=Windows-1252, others use Mac OS Roman, others use ISO-8859-1, but
> the charset is not specified and there is currently no official CF way
> to specify it.
> Another advantage of this proposal is that it provides a way to support
> Unicode (and thus all of the world's languages) via the UTF-8 encoding
> which is useful as we increasingly work with people from non-US,
> non-European countries.
> A possible extension of this is to allow a few special additional
> pseudo-charset names:
> * "HTML" - the chars are to be interpreted as an array of Strings with
> HTML content, using the ISO-8859-1 charset. Non-ISO-8859-1 characters must
> be encoded using the &#d; format where d is the decimal number of a Unicode
> character.
> * "XML" - the chars are to be interpreted as an array of Strings with
> XML content, using the ISO-8859-1 charset. Non-ISO-8859-1 characters must
> be encoded using the &#d; format where d is the decimal number of a Unicode
> character.
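The &#d; convention described for the pseudo-charsets can be sketched in
Python with standard-library calls only. The function names
to_pseudo_charset and from_pseudo_charset are hypothetical, chosen for this
illustration; this is one possible reading of the idea, not a specification.

```python
import html


def to_pseudo_charset(text):
    """Encode text as ISO-8859-1 bytes; anything outside that charset
    is written as a decimal numeric character reference (&#d;)."""
    return text.encode("ISO-8859-1", errors="xmlcharrefreplace")


def from_pseudo_charset(raw):
    """Decode ISO-8859-1 bytes and expand character references.
    Note: html.unescape also expands named references such as &amp;."""
    return html.unescape(raw.decode("ISO-8859-1"))


encoded = to_pseudo_charset("5 \u20ac")   # the Euro sign is not in ISO-8859-1
decoded = from_pseudo_charset(encoded)
print(encoded, decoded == "5 \u20ac")
```

A reader that wants to keep the markup itself intact would expand only the
numeric references rather than calling html.unescape on the whole string.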
> Thank you for considering this.
> --- The Actual Pre-Proposal
> Use the "charset" attribute to indicate that a multidimensional
> char array should be interpreted as an array of Strings,
> not an array of individual characters.
> The value of "charset" also serves to specify the character set
> used to encode the strings
> and must be the name of one of the 8-bit encodings
> (since CF chars are 8 bits) listed at
> http://www.iana.org/assignments/character-sets/character-sets.xhtml .
> Charset names are case-insensitive.
> The only charsets which are recommended are "ISO-8859-1" and "UTF-8".
> For backwards compatibility, if "charset" is not defined,
> it remains ambiguous whether a char array should be interpreted as
> holding an array of individual characters or an array of Strings.
> --- An Example: Encoding three Strings: "It", "Book", and "5 €".
> The Unicode code point for the Euro symbol is 20AC (in hexadecimal),
> which is 8364 (in decimal).
> The Euro symbol is encoded in UTF-8 as 3 bytes: E2 82 AC (in hexadecimal).
> So a file would store these strings in a char array as:
> words = 3;
> strLen = 5;
> char myWords[words][strLen] = "It", "Book", "5 [E2][82][AC]";
> charset = "UTF-8";
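The example above can be reproduced with a short Python sketch. The helper
names pack_strings and unpack_strings are hypothetical, and padding short
rows with NUL bytes is an assumption made here for illustration; the
pre-proposal does not say how unused char elements are filled.

```python
def pack_strings(strings, charset="UTF-8", pad=b"\x00"):
    """Encode each string and pad it to a common byte length (strLen).
    NUL padding is an assumption, not something the proposal specifies."""
    encoded = [s.encode(charset) for s in strings]
    str_len = max(len(b) for b in encoded)
    return [b.ljust(str_len, pad) for b in encoded], str_len


def unpack_strings(rows, charset="UTF-8", pad=b"\x00"):
    """Strip the padding and decode each row using the stated charset."""
    return [row.rstrip(pad).decode(charset) for row in rows]


words = ["It", "Book", "5 \u20ac"]        # "\u20ac" is the Euro sign
rows, str_len = pack_strings(words)
print(str_len, rows)
print(unpack_strings(rows) == words)
```

Note that strLen counts bytes, not characters: "5 €" is three characters
but needs five char elements in UTF-8.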
> Bob Simons
> IT Specialist
> Environmental Research Division
> NOAA Southwest Fisheries Science Center
> 99 Pacific St., Suite 255A (New!)
> Monterey, CA 93940 (New!)
> Phone: (831)333-9878 (New!)
> Fax: (831)648-8440
> Email: bob.simons at noaa.gov
> The contents of this message are mine personally and
> do not necessarily reflect any position of the
> Government or the National Oceanic and Atmospheric Administration.
> <>< <>< <>< <>< <>< <>< <>< <>< <><