Re: About charset setting and replacing




Chris Morris wrote:

gmclee@xxxxxxxx writes:
I am writing a program that loads HTML from a file and sends it to IE
directly. I've run into a problem with the charset setting. Most of the
HTML files declare charset "us-ascii", but for some reason some Unicode
text will be inserted into the HTML before it is sent to IE. My questions are:

1) Can I specify a different charset for a single component, e.g.
<span charset="UTF-8"> SOME UNICODE HERE</span>

No.

2) If the answer to 1) is "no", is there any way to change the charset of
the original HTML? Because I have no HTML parser handy, I can only search
and replace the charset programmatically. I've checked several HTML files
and found the charset declared like this:

<META http-equiv=Content-Type content="text/html; charset=us-ascii">

The usual best solution is to set the real HTTP content type header.
Content-type: text/html; charset=UTF-8
This will override the <meta> element if there is one, so you don't need
to worry about the format.

Since any valid us-ascii character is also (the same) valid UTF-8
character you might as well do this all the time.
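
The claim that us-ascii is a subset of UTF-8 can be checked by comparing the encoded bytes. This is only a small Python sketch; the variable names are illustrative and do not come from the original posts.

```python
# Every ASCII character (code points 0..127) encodes to the same single
# byte in UTF-8, so a us-ascii file is already a valid UTF-8 file.
ascii_text = "".join(chr(code) for code in range(128))
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A us-ascii document decodes to the same text under either label.
html_bytes = b"<p>plain ASCII markup</p>"
same_text = html_bytes.decode("ascii") == html_bytes.decode("utf-8")

# The two only differ for non-ASCII text: UTF-8 can encode it, ASCII cannot.
snowman = "\u2603"
utf8_bytes = snowman.encode("utf-8")
try:
    snowman.encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False
```

So relabelling a pure us-ascii page as UTF-8 changes nothing for the existing content, and only the inserted Unicode text needs the wider encoding.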

However, from the description you gave, it doesn't sound like you're
using HTTP.
I am writing a client that changes HTML dynamically. All the HTML files are
saved on the local hard disk, so it has nothing to do with any network protocol.

So, to make the program replace the right part, I search for the
keyword "charset=" and get its position, then search for the position
of the closing double quotation mark, and finally replace the substring
with UTF-8. Everything seems fine. However, I am worried that there may be
some exceptions. Could these, for example, occur?

<META http-equiv=Content-Type content="text/html;" charset="us-ascii">
<META http-equiv=Content-Type content='text/html;' charset='us-ascii'>
No.

<META http-equiv=Content-Type content='text/html; charset=us-ascii'>
Might happen.

Additionally, the attribute names and tag name may or may not be
(partially) capitalised, as may the charset value, and possibly other
bits. There may be a slash immediately before the end of the tag (if
it's an XHTML document rather than an HTML document). The order of the
attributes may be reversed, so:
<MeTA ConTenT='text/html; charset=US-ascII'
htTp-EQUiv="Content-Type" />
is an unusual combination of the above, but still perfectly legal...
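
To illustrate how a parser absorbs these variations, here is a minimal sketch using Python's standard `html.parser` module, which lowercases tag and attribute names and accepts either quote style. The names `MetaCharsetFinder` and `find_declared_charset` are hypothetical, and the thread does not say which language the original program is written in.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Collects the charset declared in <meta http-equiv=Content-Type>."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips the
        # quotes from attribute values. A self-closing "<meta ... />" also
        # arrives here, because the default handle_startendtag calls
        # handle_starttag.
        if tag != "meta" or self.charset is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "content-type":
            return
        # The content value looks like "text/html; charset=us-ascii".
        for part in attributes.get("content", "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.strip().lower() == "charset":
                self.charset = value.strip().lower()


def find_declared_charset(html_text):
    """Return the lowercased charset from the first Content-Type meta, or None."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    finder.close()
    return finder.charset
```

Under these assumptions, the plain form and the mixed-case, reversed-order, self-closing form both yield "us-ascii", while a stray `charset="..."` in body text is ignored because it is not an attribute of a `<meta>` element.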
I am not very familiar with HTML. Given what you mention above, is the
following valid for both HTML and XHTML?

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

You might also get cases which have nothing to do with a <meta>
element, but trigger your pattern matching anyway.

Any better approach for my problem?

Setting the HTTP headers is the best solution. If you can't do that
then using a real HTML parser is likely to be more reliable than any
search-and-replace you put together.
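
For the local-file case, a parser-based rewrite might look like the sketch below. It uses Python's standard `html.parser` to locate each Content-Type `<meta>` tag and swaps the whole tag for a UTF-8 one. The names `ContentTypeMetaLocator`, `force_utf8_meta` and `UTF8_META` are hypothetical, and the sketch assumes the document is fed to the parser in a single call so that the reported positions refer to the full text.

```python
from html.parser import HTMLParser

UTF8_META = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'


class ContentTypeMetaLocator(HTMLParser):
    """Records where each <meta http-equiv=Content-Type> tag sits in the source."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of ((line, column), raw_tag_text)

    def handle_starttag(self, tag, attrs):
        attributes = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attributes.get("http-equiv", "").lower() == "content-type":
            # getpos() is (1-based line, 0-based column) of the tag start;
            # get_starttag_text() is the tag exactly as written in the source.
            self.found.append((self.getpos(), self.get_starttag_text()))


def force_utf8_meta(html_text):
    """Return html_text with its Content-Type <meta> tags declaring UTF-8."""
    locator = ContentTypeMetaLocator()
    locator.feed(html_text)
    locator.close()

    # Start index of every line, so (line, column) can become a string index.
    line_starts = [0]
    for index, char in enumerate(html_text):
        if char == "\n":
            line_starts.append(index + 1)

    # Replace from the end so that earlier indexes stay valid.
    for (line, column), raw_tag in reversed(locator.found):
        start = line_starts[line - 1] + column
        # Keep the XHTML-style trailing slash if the original tag had one.
        if raw_tag.rstrip().endswith("/>"):
            replacement = UTF8_META[:-1] + " />"
        else:
            replacement = UTF8_META
        html_text = html_text[:start] + replacement + html_text[start + len(raw_tag):]
    return html_text
```

Only the declaration is rewritten here. Whatever is handed to the browser would also need to be encoded as UTF-8 bytes for the new label to be true.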

Thanks. I see.
