[CF-metadata] Pre-proposal for "charset"
chris.barker at noaa.gov
Wed Feb 22 11:03:48 MST 2017
NOTE: this looks like it got tacked on to another thread -- please start a
new thread for a new topic. (Or gmail messed up...)
Sorry for being dense here, but I'm confused. I see in the netCDF(4) spec:
The atomic external types supported by the netCDF interface are:
NC_CHAR 8-bit character byte
NC_STRING variable length character string *
So shouldn't one use a 2-D (or higher dim) array of NC_CHAR type if that's
indeed what you have?
Or is this about supporting netcdf3, which doesn't (I don't think) have a
It does have a BYTE type, which I would be inclined to use for a CHAR. But
then I suppose you'd need to tell readers that it was intended to be a
Do folks want/need to support full Unicode characters? If so I think you'd
need a 4 byte type -- call it NC_UCHAR? -- and anything else would be
variable-length, which would kind of kill the whole point of a character
Small note: I'd prefer "encoding" to "charset" -- at least if you want to
support "full" unicode, rather than only one-byte-per-char encodings.
> > The only charsets which are recommended are "ISO-8859-1" and "UTF-8".
UTF-8 is problematic because it uses a variable number of bytes per
If we want to support proper Unicode, then we need to either:
use a variable-length string type (the netcdf 4 NC_STRING type?)
Use 4 bytes per char.
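
To make the byte-count difference concrete, here is a minimal Python sketch. It uses only the standard library, not any netCDF API, and the helper name byte_lengths is made up for illustration. The sample strings are the ones from the example quoted further down.

```python
# Compare how many bytes each sample string needs under three encodings.
samples = ["It", "Book", "5 \u20ac"]  # "\u20ac" is the Euro sign


def byte_lengths(text):
    """Return (ISO-8859-1, UTF-8, UTF-32) byte counts; None if not encodable."""
    try:
        latin1 = len(text.encode("iso-8859-1"))
    except UnicodeEncodeError:
        latin1 = None  # the character has no ISO-8859-1 code
    utf8 = len(text.encode("utf-8"))       # 1 to 4 bytes per character
    utf32 = len(text.encode("utf-32-le"))  # always 4 bytes; "-le" skips the BOM
    return latin1, utf8, utf32


for s in samples:
    print(repr(s), byte_lengths(s))
```

The UTF-32 column is always four times the character count, which is what a fixed 4-byte character type would give you. The UTF-8 column depends on which characters are present.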
Since UTF-* is a superset of ascii, it can be dangerous -- folks can say
"this is UTF-*", and if they only happen to use the ASCII subset, all works
fine, and then someone goes and tries to put a weird high-codepoint
character in there, and all goes to heck.
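
A small Python sketch of that failure mode, assuming nothing beyond the standard library (the function names and the sample city name are invented for illustration): ASCII-only text produces identical bytes under UTF-8 and ISO-8859-1, so a wrong label goes unnoticed until a non-ASCII character shows up.

```python
def encodings_agree(text):
    """True if UTF-8 and ISO-8859-1 produce identical bytes for text."""
    try:
        return text.encode("utf-8") == text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False  # not representable in ISO-8859-1 at all


def misread_as_latin1(text):
    """Encode as UTF-8, then decode as if the bytes were ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


plain = "Zurich"          # ASCII only
accented = "Z\u00fcrich"  # same name with u-umlaut

print(encodings_agree(plain))       # the label does not matter here
print(encodings_agree(accented))    # now the label matters
print(misread_as_latin1(accented))  # what a mislabeled reader would see
```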
I see that netcdf4 supports UTF-8 for names within the file (variable
names, dimension names, etc), but that works because the number of bytes is
known and constant once created.
Again, I'm maybe speaking from ignorance, I haven't dug into Unicode and CF
and netcdf in any depth at all.
> > --- An Example: Encoding three Strings: "It", "Book", and "5 €".
> > > The Unicode code point for the Euro symbol is 20AC (in hexadecimal),
>> > > which is 8364 (in decimal).
>> > > The Euro symbol is encoded in UTF-8 as 3 bytes: E2 82 AC (in
>> > > So a file would store these strings in a char array as:
>> > > dimensions
>> > > words = 3;
>> > > strLen = 5;
>> > > char myWords[words][strLen] = "It", "Book", "5
>> > > charset = "UTF-8";
this is tough -- how do you know what strLen should be? You could get UTF-8
characters chopped off if it was too short.
Though I suppose that's a problem for the file writer to figure out.
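
For what it's worth, here is roughly what the writer would have to do, as a Python sketch. The helpers required_strlen and pack are hypothetical, not part of any netCDF library, and NUL padding is just an assumption about how short strings would be filled out.

```python
words = ["It", "Book", "5 \u20ac"]  # "\u20ac" is the Euro sign


def required_strlen(strings, encoding="utf-8"):
    """Smallest strLen (in bytes) that holds every string once encoded."""
    return max(len(s.encode(encoding)) for s in strings)


def pack(text, strlen, encoding="utf-8"):
    """Encode text, cut it to strlen bytes, and pad with NUL bytes."""
    raw = text.encode(encoding)[:strlen]
    return raw.ljust(strlen, b"\x00")


strlen = required_strlen(words)
print(strlen)  # byte count, not character count

# With strLen one byte too short, the Euro sign loses its last byte
# and the stored row is no longer valid UTF-8.
chopped = pack("5 \u20ac", strlen - 1)
try:
    chopped.decode("utf-8")
except UnicodeDecodeError as err:
    print("cannot decode:", err.reason)
```

So strLen has to be computed from the encoded bytes rather than from the number of characters, and a naive byte-wise truncation can leave a partial character behind.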
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov