<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
On 09/28/2010 04:06 PM, Stefan Berger wrote:
<blockquote cite="mid:OF6D73EE01.7B8FACFA-ON852577AC.006DE338-852577AC.006E5F71@us.ibm.com" type="cite">
<tt>Eric Blake <a class="moz-txt-link-rfc2396E" href="mailto:eblake@redhat.com">&lt;eblake@redhat.com&gt;</a> wrote on 09/28/2010 03:26:48 PM:</tt><br>
<br>
<tt>> Re: [libvirt] [PATCH v2 3/5] Extend nwfilter schema to accept comment attributes</tt><br>
<tt>> Eric Blake </tt><br>
<tt>> to:</tt><br>
<tt>> Stefan Berger</tt><br>
<tt>> 09/28/2010 03:27 PM</tt><br>
<tt>> Cc:</tt><br>
<tt>> libvir-list</tt><br>
<tt>><br>
> On 09/28/2010 04:28 AM, Stefan Berger wrote:<br>
> >> okay. It also leaves out 8-bit bytes - could that be a problem for i18n<br>
> >> where people want comments with native-language accented characters?<br>
> >> That is, are we being too strict here? Maybe a better pattern would be<br>
> >> to reject specific non-printing ASCII bytes we want to avoid, assuming<br>
> >> you can use escape sequences like [^\001]?<br>
> ><br>
> > Looking at<br>
> ><br>
> > </tt><a moz-do-not-send="true" href="http://www.asciitable.com/"><tt>http://www.asciitable.com/</tt></a><tt><br>
> ><br>
> > I should probably include 0x20-0x7E and 128-175, 224-238 - maybe even<br>
> > more? So the regex then becomes<br>
> ><br>
> > [ -~-¯à-î]{0,256}<br>
><br>
> True ASCII is strictly 7-bit; any locale where isprint() returns true on<br>
> 8-bit bytes is a superset single-byte encoding, such as ISO-8859-1, or<br>
> 'extended ascii' from the URL you posted above. But I'm also thinking<br>
> about multi-byte encodings, like UTF-8, where we cannot a priori write a<br>
> regex that will accept all valid Unicode printable characters, in part<br>
> because you have to look at more than one byte at a time to determine if<br>
> you have a printable character.
Which goes back to my suggestion of an<br>
> inverse charset - rejecting bytes that are known to be non-printable<br>
> ASCII, and letting everything else through whether or not it is a printable<br>
> byte sequence in the current locale. So what about this idea: exclude<br>
> control characters except for tab, and let space and everything after<br>
> through (I don't know if it needs to be adjusted to also reject &amp;#x00):<br>
><br>
> [^&amp;#x01;-&amp;#x08&amp;#x0A-&amp;#x1F]{0,256}</tt><br>
<br>
<tt>Fine by me. We may just give the impression of accepting Unicode while the code does not handle it.</tt>
</blockquote>
... except that xmllint does not like &amp;#x01 with or without a preceding ^ (among other things):<br>
<br>
<tt>xmllint --relaxng ./docs/schemas/nwfilter.rng tests/nwfilterxml2xmlout/comment-test.xml<br>
./docs/schemas/nwfilter.rng:862: parser error : xmlParseCharRef: invalid xmlChar value 1<br>
&lt;param name="pattern"&gt;[^&amp;#x01;-&amp;#x08&amp;#x0A-&amp;#x1F]{0,256}&lt;/param&gt;<br>
^<br>
./docs/schemas/nwfilter.rng:862: parser error : CharRef: invalid hexadecimal value<br>
&lt;param name="pattern"&gt;[^&amp;#x01;-&amp;#x08&amp;#x0A-&amp;#x1F]{0,256}&lt;/param&gt;<br>
^<br>
./docs/schemas/nwfilter.rng:862: parser error : xmlParseCharRef: invalid xmlChar value 0<br>
&lt;param name="pattern"&gt;[^&amp;#x01;-&amp;#x08&amp;#x0A-&amp;#x1F]{0,256}&lt;/param&gt;<br>
^<br>
./docs/schemas/nwfilter.rng:862: parser error : CharRef: invalid hexadecimal value<br>
&lt;param name="pattern"&gt;[^&amp;#x01;-&amp;#x08&amp;#x0A-&amp;#x1F]{0,256}&lt;/param&gt;<br>
^<br>
./docs/schemas/nwfilter.rng:862: parser error : xmlParseCharRef: invalid xmlChar value 0<br>
&lt;param name="pattern"&gt;[^&amp;#x01;-&amp;#x08&amp;#x0A-&amp;#x1F]{0,256}&lt;/param&gt;<br>
^<br>
./docs/schemas/nwfilter.rng:862: parser error : CharRef: invalid hexadecimal value<br>
&lt;param name="pattern"&gt;[^&amp;#x01;-&amp;#x08&amp;#x0A-&amp;#x1F]{0,256}&lt;/param&gt;<br>
^<br>
./docs/schemas/nwfilter.rng:862: parser error : xmlParseCharRef: invalid xmlChar value 0<br>
&lt;param name="pattern"&gt;[^&amp;#x01;-&amp;#x08&amp;#x0A-&amp;#x1F]{0,256}&lt;/param&gt;<br>
</tt><br>
Stefan
</body>
</html>