The presence of non-XML characters, escaped, or not escaped in an OOXML document, is contrary to interoperability of XML and XML-based tools. The W3C’s Internationalization Activity confirms this interpretation, saying “Control codes should be replaced with appropriate markup. Since XML provides a standard way of encoding structured data, representing control codes other than as markup would undo the actual advantages of using XML. Use of control codes in HTML and XHTML is never appropriate, since these markup languages are for representing text, not data.”
Remove the bstr type from OOXML
- Part 4, Section 7.4.2.4
te
Proposed Disposition of DIS 29500 Comment US-0161 (Modified: 2008-01-04)
We agree that control codes should not be stored within the text of an element value. However, these characters do not represent control codes: this property is used solely to store user-defined data stored within the legacy document format; as such, we believe that it would be inappropriate to remove this datatype from the specification and lose this information. As suggested by the Canadian National Body, we believe some clarification would be useful; as a result, the following change will be made in Part 4, §7.4.2.4, page 5,122, line 26:
This element defines a binary basic string variant type, which can store any valid Unicode character. For all Unicode characters that cannot be directly represented in XML, as defined by the XML 1.0 specification, the characters shall be escaped using the Unicode numerical character representation escape character format _xHHHH_, where H represents a hexadecimal character in the character’s value. [Example: The Unicode character 8 is not permitted in an XML 1.0 document, so it shall be escaped as _x0008_. end example]
Similar Comments: BR-0059, CA-0064, CO-0232, FR-0378, GB-0591, GR-0010,

The problem with the comment is that “bstr” is designed to encode strings that might contain binary data that can not be represented in XML or represented with “an appropriate markup” in XML.
This is why people encode blobs using base64 encoding. bstr just happens to be an encoding for strings that might contain binary data, and it happens to be more compact for that particular case than something like base64.
The spec actually states:
“This element defines a binary basic string variant type. For all characters that cannot be represented in XML as
defined by the XML 1.0 specification, the characters are escaped using the Unicode numerical character
representation escape character format _xHHHH_, where H represents a hexadecimal character in the
character’s value. [Example: The Unicode character 8 is not permitted in an XML 1.0 document, so it shall be
escaped as _x0008_. end example]”
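To make the quoted rule concrete, here is a minimal Python sketch of what it seems to ask for. The function names (is_xml10_char, bstr_escape, bstr_unescape) are invented for illustration and do not come from the spec; the allowed ranges follow the Char production of XML 1.0. The quoted text does not say what happens when the data itself contains a literal sequence such as _x0041_, so this sketch does not handle that case.

```python
import re

# Matches one escape of the form _xHHHH_ (four hex digits).
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def is_xml10_char(cp: int) -> bool:
    """Return True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def bstr_escape(text: str) -> str:
    """Replace every character XML 1.0 forbids with _xHHHH_."""
    return "".join(
        ch if is_xml10_char(ord(ch)) else "_x%04X_" % ord(ch) for ch in text
    )


def bstr_unescape(text: str) -> str:
    """Turn every _xHHHH_ sequence back into the character it names.

    Assumption: any text that looks like an escape is an escape.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    print(bstr_escape("a\x08b"))              # the spec's own example character
    print(repr(bstr_unescape("a_x0008_b")))
```

Tab, line feed and carriage return pass through unchanged, because XML 1.0 allows them.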
Wow. There just has to be a better way to solve that problem than making up their own encoding system to embed in XML. I am not sure which parts of a document might get the bstr treatment, but it is nasty to have to first parse the XML using a standard library, then parse it again to decode the bstr elements.
Also, the current ISO 26300 standard contains a lot of places where base64-encoded binary data is used.
In ODF you can actually still embed things like pictures, any kind of media, or even an MS Office file as base64-encoded binary data in the XML data.
Look for office binary data in the ODF standard.
If these comments were blocking, they would also require major changes to OpenDocument.
Embedding binary stuff as base64 is standard XML; this is a bit different. It seems to be for stuff that is mostly expected to be text, so it would be best to have it human readable, but sometimes they might throw in a character that isn’t readable and doesn’t have a standard XML escape sequence (like &quot; translates to " etc.). Unicode 8 is a control character of some sort; I think it might be backspace. Quite why that might be in a string is a bit of a puzzle. Anyhow, they had the option to base64 encode it, or read the technical report on Unicode in XML
http://www.w3.org/TR/unicode-xml/#Charlist
and basically drop the deprecated characters or think of something else. What they chose to do was invent an alternative character encoding method which won’t be decoded by existing XML parsers. I can take a regular DOM or SAX parser and expect &quot; to be decoded by the parser back to ". A standard parser won’t unescape _x0008_ for me; I would have to re-parse the string to decode that. I do have some sympathy for the predicament they were in, I just don’t like their choice of solution much.
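The double parsing complained about here can be shown with a few lines of Python and the standard xml.etree.ElementTree parser. This is only a sketch: the element name value is a placeholder (the real OOXML element lives in a namespace), and the regular expression is an assumed reading of the _xHHHH_ format.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: a standard entity plus a bstr-style escape.
doc = "<value>say &quot;hi&quot;_x0008_</value>"

# First pass: the XML parser decodes &quot; but knows nothing about _x0008_.
text = ET.fromstring(doc).text

# Second pass: application code has to decode the _xHHHH_ escapes itself.
decoded = re.sub(
    r"_x([0-9A-Fa-f]{4})_",
    lambda m: chr(int(m.group(1), 16)),
    text,
)

if __name__ == "__main__":
    print(repr(text))
    print(repr(decoded))
```

After the first pass the quotes are real quote characters, while the backspace is still seven ordinary characters of text.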
[...] has been mentioned here before although it was intended to remain secret. Here is Miguel trying to resolve comments in Microsoft’s favour. Why on earth does a Novell employee, who is being paid by Novell for his work, virtually aid [...]
There is some prior history to use more “visually friendly” encodings for things that are mostly text, but could contain data that can not be described by XML.
Consider email over 7-bit transports: instead of encoding subjects and senders with base64, which would render the From and Subject lines (or the content) indecipherable by default, an encoding called Quoted-Printable was created.
Quoted-Printable allowed names like “México” (accent on the ‘e’ letter) to be encoded as “M=82xico” (in a DOS code page where é is byte 0x82; in ISO-8859-1 it would be “M=E9xico”), which can still be easily read on old email clients or inspected visually.
If they had used base64 for emails, the same word (in UTF-8) would always be rendered as “TcOpeGljbw==”, which is not as useful.
So base64 is useful in places where you know in advance that most of the content will be binary.
bstr is similar in spirit to Quoted-Printable: you know that most of the text will be readable (and in this case, this extends to most of UTF-8), but for the few characters that cannot be represented in XML you use this escape mechanism.
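The readability trade-off is easy to see with Python's standard quopri and base64 modules. In this sketch the helper name encode_both is made up, and the choice of charsets is an assumption: the exact Quoted-Printable bytes depend on which charset the mail client uses for é.

```python
import base64
import quopri


def encode_both(word: str, charset: str) -> tuple:
    """Return (quoted-printable, base64) forms of word in the given charset."""
    raw = word.encode(charset)
    qp = quopri.encodestring(raw).decode("ascii")
    b64 = base64.b64encode(raw).decode("ascii")
    return qp, b64


if __name__ == "__main__":
    for charset in ("cp437", "latin-1", "utf-8"):
        qp, b64 = encode_both("México", charset)
        print("%-8s %-14s %s" % (charset, qp, b64))
```

In every charset the Quoted-Printable form keeps the ASCII letters in place, while the base64 form gives no visual hint of the original word.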
Whether the designers made the right choice in terms of balancing the various needs (single encoding: base64 vs multiple encodings to improve particular use cases) is open to discussion.
But the original comment is an opinion, not really a technical comment on the requirements of XML. It is misrepresenting the XML standard by inventing a restriction that does not exist.
Furthermore, the same comment as brought up by Norway is actually pretty incomplete (no-8)
Miguel
I understand their motivations for doing it, but I still don’t like the double parsing. It seems to me that they should use markup so instead of _x0008_ they could use or something like that (actually I am still struggling to figure out a sensible use case for a backspace in a string, creating weird doublestrike glyphs perhaps?). Inventing a new quoted printable type encoding in the middle of this spec seems like a bad thing and one that is worth challenging.
This is pretty messy. XML, be it 1.0 or 1.1 forbids all chars
I must have used an angle bracket and had the comment cut.
- XML forbids all chars of ascii value 0-31 except CR, TAB and LF.
- base64 is unreadable and very inefficient in UTF-16
- if you escape stuff, then you have to deal with double escaping, and remember to unescape stuff at the same rate.
From an XML purity perspective, base-64 would be better, known defects notwithstanding. But it is not as good as inline text with no disallowed chars. What is being proposed here is not really XML.
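The double-escaping worry can be illustrated with a short Python sketch. Everything here is hypothetical: the quoted spec text does not say how a literal _x0041_ in the data is protected, so escape_guarded shows just one possible convention (escaping the leading underscore as _x005F_), assumed for illustration only.

```python
import re

_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def _allowed(ch: str) -> bool:
    """XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or cp >= 0x10000
    )


def escape_naive(s: str) -> str:
    """Escape forbidden characters only, as the quoted text describes."""
    return "".join(ch if _allowed(ch) else "_x%04X_" % ord(ch) for ch in s)


def escape_guarded(s: str) -> str:
    """Assumed fix: first escape any underscore that starts an escape-like run."""
    guarded = re.sub(r"_(?=x[0-9A-Fa-f]{4}_)", "_x005F_", s)
    return escape_naive(guarded)


def unescape(s: str) -> str:
    """Decode every _xHHHH_ sequence."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)


if __name__ == "__main__":
    literal = "_x0041_"  # data that merely looks like an escape
    print(repr(unescape(escape_naive(literal))))    # not the original text
    print(repr(unescape(escape_guarded(literal))))  # round-trips
```

With the naive version, data that happens to look like an escape is silently changed on the way back; any real scheme needs some rule of this kind, and both sides have to apply it exactly once.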
I admit I haven’t read that section of the spec, but the places I’ve actually seen this encoding used in the real world are in XML tag and attribute names, which can’t be base-64 encoded, and which also don’t accept certain characters. For instance, is (afaik) not permitted in XML, so it would be encoded as . Again, I don’t know if this really explains what they’re trying to do because I haven’t read the OOXML spec, but it might be a clue.
Grr, I don’t see why we can’t type XML in a discussion about an XML format!
I’ll replace them with square brackets in my example: “For instance, [tag"tag] is (afaik) not permitted in XML, so it would be encoded as [tag_x0022_tag].
Err, replace angle brackets with square brackets.
I get it. Wordpress supports some subset of HTML. I’ll try my example one last time: “For instance, <tag"tag> is (afaik) not permitted in XML, so it would be encoded as <tag_x0022_tag>.
I do wish Wordpress had a Preview button.
You can embed angle brackets in text in an element body, but this is best done in CDATA sections. CDATA sections still forbid low-ASCII chars, though.
Thinking about this more, what I don’t like about this escaping is that it is dangerously easy to create XPath queries that appear to work but, once an escaped char sneaks in, accidentally propagate the escaping. This BSTR stuff is not going to be good for XML-level work. But you know what, neither is base-64, which is why the SOAP people moved to MTOM to tack binary stuff at the end of the XML payload and pretend it is a base-64 thing later.
There are some WordPress preview plugins, I quite like this one http://dev.wp-plugins.org/wiki/LiveCommentPreview but it needs a bit of tweaking. I will probably add it tomorrow.
Can someone tell me why they don’t use W3C standard for this?
There is 1 byte difference in length between:
_x0008_
and
&#x0008;
or even,
&#8;
The difference is that the second will be parsed correctly.
http://www.w3.org/TR/unicode-xml/
“Can someone tell me why they don’t use W3C standard for this?”
If you refer to #8 via a numeric reference as you suggest then it is a fatal error in XML 1.0, making the document not well formed. It is allowed in XML 1.1 but 1.1 is (to put it mildly) not as well deployed as XML 1.0.
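This is easy to check against an XML 1.0 parser. A small sketch using Python's standard xml.etree.ElementTree (which is backed by expat); the element name v and the helper is_well_formed are placeholders chosen for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """Return True if the XML 1.0 parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # A numeric reference to character 8 is rejected outright.
    print(is_well_formed("<v>&#8;</v>"))
    # The bstr-style escape is plain text as far as the parser is concerned.
    print(is_well_formed("<v>_x0008_</v>"))
    # A reference to TAB (character 9) is fine, since XML 1.0 allows it.
    print(is_well_formed("<v>&#9;</v>"))
```

So the ampersand form does not get "parsed correctly" under XML 1.0; it stops the parse.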
The fundamental problem is that there are some characters that are allowed in a BSTR (probably even ASCII NUL) that XML flatly forbids. XML is meant to be a text format, after all. Escaping chars using ampersands isn’t enough, because the next stage in the processing will reject the illegal character.
In COM, BSTR is a length-delimited string (a Basic Str) that can contain anything, more even than a nul-terminated CString. Trying to put it inside XML is not something XML supports.