[CF-metadata] Pre-proposal for "charset"

Bob Simons - NOAA Federal bob.simons at noaa.gov
Wed Feb 22 12:38:48 MST 2017


On Wed, Feb 22, 2017 at 10:56 AM, Chris Barker <chris.barker at noaa.gov>
wrote:

>
> Another note:
>
> On Mon, Feb 6, 2017 at 3:08 PM, Bob Simons - NOAA Federal <
> bob.simons at noaa.gov> wrote:
>
>> * "HTML" - the chars are to be interpreted as an array of Strings with
>> HTML content, using the ISO-8859-1 charset. Non-ISO-8859-1 characters must
>> be encoded using the &#d; format where d is the decimal number of a Unicode
>> character.
>> * "XML" - the chars are to be interpreted as an array of Strings with
>> XML content, using the ISO-8859-1 charset. Non-ISO-8859-1 characters must
>> be encoded using the &#d; format where d is the decimal number of a Unicode
>> character.
>>
>
> Don't HTML and XML both use an ASCII-compatible header that specifies the
> encoding?
>

HTTP (which is used to transmit HTML and other documents) includes
information in the header, notably the Content-type, e.g.,
Content-type: application/json; charset=utf-8

Yes, XML documents have a "prolog", e.g.,
<?xml version="1.0" encoding="UTF-8"?>
which uses the word "encoding".

I'm proposing that we add something like that to CF so that the charset is
known.
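
As a rough sketch of the &#d; encoding in the "HTML" and "XML" options quoted
above (Python, for illustration only, not part of the proposal): the helper
names are hypothetical, and the sketch assumes that only characters outside
ISO-8859-1 need to be replaced. It does not escape markup characters such as
& or <, which real HTML/XML content would also need.

```python
import html


def to_latin1_with_char_refs(text: str) -> bytes:
    # "xmlcharrefreplace" is a standard Python codec error handler: any
    # character that ISO-8859-1 cannot represent is written as &#d;,
    # where d is the decimal Unicode code point.
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


def from_latin1_with_char_refs(raw: bytes) -> str:
    # Decode the bytes as ISO-8859-1, then resolve character references.
    # Note: html.unescape also resolves named references such as &amp;.
    return html.unescape(raw.decode("iso-8859-1"))


# The degree sign fits in ISO-8859-1; the Greek capital omega does not.
encoded = to_latin1_with_char_refs("25 °C, Ω")
decoded = from_latin1_with_char_refs(encoded)
```

Here the degree sign is stored as the single byte 0xB0, while the omega
(U+03A9, decimal 937) is stored as the ASCII text &#937;.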


> (and XML uses "encoding", rather than "charset"):
>
> <?xml version="1.0" encoding="UTF-8"?>
>
> and "the default character encoding was changed to UTF-8 in HTML5."
>
> So if there is going to be a default, it should probably be UTF-8
>

I am not suggesting a default charset. For all the existing CF files, the
charset is unknown, and it would be dangerous to apply any specific charset
retroactively (other than assuming that the lower 7 bits are compatible with
7-bit ASCII, which is true of ISO-8859-1, UTF-8, and many other charsets).
I am suggesting that new files could be written and include a charset
attribute to specify the charset in use.
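
A small illustration of why guessing is dangerous (a Python sketch; the sample
strings are made up for the example): bytes in the 7-bit range decode
identically under ASCII, ISO-8859-1, and UTF-8, but anything above that range
depends on knowing the charset.

```python
# Pure 7-bit content: every ASCII-compatible charset gives the same result.
ascii_bytes = b"sea_water_temperature"
decoded = {cs: ascii_bytes.decode(cs) for cs in ("ascii", "iso-8859-1", "utf-8")}
all_agree = len(set(decoded.values())) == 1

# Content with a non-ASCII character: the stored bytes differ by charset.
text = "°C"
latin1_bytes = text.encode("iso-8859-1")  # one byte for the degree sign
utf8_bytes = text.encode("utf-8")         # two bytes for the degree sign

# Reading UTF-8 bytes as if they were ISO-8859-1 silently produces mojibake.
misread = utf8_bytes.decode("iso-8859-1")

# Reading ISO-8859-1 bytes as UTF-8 fails outright for this input.
try:
    latin1_bytes.decode("utf-8")
    latin1_as_utf8_ok = True
except UnicodeDecodeError:
    latin1_as_utf8_ok = False
```

With an explicit charset attribute in the file, a reader would not have to
choose between these two failure modes.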


>
> We need to either specify the "string" dimension, or have a consistent
> convention:
>
> A 10x8 CHAR array could be either 10 8 character strings or 8 ten
> character strings. And it gets more confusing with higher dimensions.
>

There is no standard naming system in CF to denote a String dimension (i.e.,
the number of chars, vs a char array). That is a different approach to
solving the problem. I don't like that approach as much because so many
people have written so much software that writes and reads files using
dimension names of their choice. I don't want to tell everyone to rewrite
all their existing files and software/scripts to read/write those files in
order to comply with new CF rules.

Instead, I'm proposing a separate, new attribute (data_type=string|char),
partly because it doesn't interfere with existing dimension names or
attribute names.
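
To make the idea concrete, here is a hypothetical sketch (Python) of how a
reader might act on such an attribute. Only the attribute values "string" and
"char" come from the proposal; the function name, the assumption that the last
dimension is the string length, the stripping of trailing NUL padding, and the
ISO-8859-1 default are all assumptions made for the example.

```python
def interpret_chars(rows, data_type, charset="iso-8859-1"):
    """Interpret a 2-D char array given as a list of equal-length bytes rows.

    Hypothetical reader logic, not defined by CF.
    """
    if data_type == "string":
        # Assumption: the last dimension is the string length, so each row
        # is one string; trailing NUL padding is dropped.
        return [row.rstrip(b"\x00").decode(charset) for row in rows]
    if data_type == "char":
        # Keep every element as an individual character value.
        # Assumption: a single-byte charset, so one byte is one character.
        return [[bytes([b]).decode(charset) for b in row] for row in rows]
    raise ValueError(f"unknown data_type: {data_type!r}")


# A 2x4 char array, padded with NULs.
rows = [b"abc\x00", b"de\x00\x00"]
as_strings = interpret_chars(rows, "string")
as_chars = interpret_chars(rows, "char")
```

With data_type=string the array reads as two strings; with data_type=char it
stays a 2x4 grid of individual characters, regardless of what the dimensions
happen to be named.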


>
> -CHB
>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov
>



-- 
Sincerely,

Bob Simons
IT Specialist
Environmental Research Division
NOAA Southwest Fisheries Science Center
99 Pacific St., Suite 255A      (New!)
Monterey, CA 93940               (New!)
Phone: (831)333-9878            (New!)
Fax:   (831)648-8440
Email: bob.simons at noaa.gov

The contents of this message are mine personally and
do not necessarily reflect any position of the
Government or the National Oceanic and Atmospheric Administration.
<>< <>< <>< <>< <>< <>< <>< <>< <><