Re: About charset setting and replacing
- From: gmclee@xxxxxxxx
- Date: 14 Jul 2006 05:12:36 -0700
Chris Morris wrote:
> gmclee@xxxxxxxx writes:
>> I am writing a program to load HTML from a file and send it to IE
>> directly. I've run into a problem with the charset setting. Most of the
>> HTML files declare charset "us-ascii", but for various reasons some
>> Unicode text will be inserted into the HTML before it is sent to IE.
>> My questions are:
>> 1) Can I specify a separate charset for a single element, e.g.
>> <span charset="UTF-8"> SOME UNICODE HERE</span>

> No.

>> 2) If the answer to 1) is no, is there any way to change the charset of
>> the original HTML? Because I have no HTML parser handy, I can only
>> search and replace the charset programmatically. I've checked several
>> HTML files and found the charset declared like this:
>> <META http-equiv=Content-Type content="text/html; charset=us-ascii">

> The usual best solution is to set the real HTTP content type header.
> Content-type: text/html; charset=UTF-8
> This will override the <meta> element if there is one, so you don't need
> to worry about the format.
> Since any valid us-ascii character is also (the same) valid UTF-8
> character you might as well do this all the time.
> However, from the description you gave, it doesn't sound like you're
> using HTTP.

I am writing a client that changes HTML dynamically. All the HTML is saved on
the local hard disk; it has nothing to do with any network protocol.
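
For the point made earlier in the thread, that any valid us-ascii text is also the same valid UTF-8 text, here is a minimal sketch in Python (a language neither poster mentions; it is only an assumption for illustration). The function name and sample strings are hypothetical. It checks that a pure-ASCII byte string decodes to identical text under both labels, which is why relabelling an ASCII file as UTF-8 before inserting Unicode text should be harmless.

```python
def same_text_in_both(data: bytes) -> bool:
    """Return True if data is pure ASCII and decodes identically as UTF-8."""
    try:
        as_ascii = data.decode("ascii")
    except UnicodeDecodeError:
        # Contains bytes outside the ASCII range, so it is not us-ascii at all.
        return False
    return as_ascii == data.decode("utf-8")


# Hypothetical file content that was declared as us-ascii.
original = b"<p>plain ASCII markup</p>"

# Hypothetical Unicode text inserted later, encoded as UTF-8.
inserted = "\u4e2d\u6587".encode("utf-8")
combined = original + inserted

print(same_text_in_both(original))   # the ASCII part is unchanged under UTF-8
print(same_text_in_both(combined))   # the combined bytes are no longer ASCII
print(combined.decode("utf-8"))      # but they are still valid UTF-8
```

Once the inserted text is present, the document is only valid under the UTF-8 label, so the declared charset has to change along with it.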
>> So, to make the program replace the correct one, I search for the
>> keyword "charset=" and get its position, then search for the position
>> of the double quotation mark, and finally replace the substring with UTF8.
>> Everything seems fine. However, I am worried that there may be some
>> exceptions. Could these, for example, occur?
>> <META http-equiv=Content-Type content="text/html;" charset="us-ascii">
>> <META http-equiv=Content-Type content='text/html;' charset='us-ascii'>

> No.

>> <META http-equiv=Content-Type content='text/html; charset=us-ascii'>

> Might happen.
> Additionally, the attribute names and tag name may or may not be
> (partially) capitalised, as may the charset value, and possibly other
> bits. There may be a slash immediately before the end of the tag (if
> it's an XHTML document rather than an HTML document). The order of the
> attributes may be reversed, so:
> <MeTA ConTenT='text/html; charset=US-ascII'
> htTp-EQUiv="Content-Type" />
> is an unusual combination of the above, but still perfectly legal...

I am not very familiar with HTML. As you mention above, for both HTML
and XHTML, is the following valid?
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> You might also get cases which have nothing to do with a <meta>
> element, but trigger your pattern matching anyway.

Thanks. I see.
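
To make the two warnings above concrete, here is a rough Python sketch of a search-and-replace on "charset=" (Python, the regular expression, and the function name are assumptions for illustration, not the original poster's code). It copes with mixed case and either quote style, but it also rewrites text that has nothing to do with a <meta> element.

```python
import re

# "charset", optional spaces, "=", an optional quote, then the charset name.
# IGNORECASE covers spellings such as "CharSet" or "CHARSET".
CHARSET_RE = re.compile(r"""(charset\s*=\s*["']?)([A-Za-z0-9._:-]+)""", re.IGNORECASE)


def naive_set_charset(html, new_charset="UTF-8"):
    """Replace every charset=... value in the text, wherever it appears."""
    return CHARSET_RE.sub(lambda m: m.group(1) + new_charset, html)


# The unusual but legal tag quoted in the thread.
legal_but_odd = "<MeTA ConTenT='text/html; charset=US-ascII'\nhtTp-EQUiv=\"Content-Type\" />"

# Ordinary body text (hypothetical) that merely mentions a charset.
unrelated = "<p>Save the file with charset=us-ascii in your editor.</p>"

print(naive_set_charset(legal_but_odd))
print(naive_set_charset(unrelated))  # body text is changed too: a false positive
```

The second call shows the failure mode: the pattern cannot tell a declaration from a sentence that happens to contain the same characters.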
>> Any better approach for my problem?

> Setting the HTTP headers is the best solution. If you can't do that
> then using a real HTML parser is likely to be more reliable than any
> search-and-replace you put together.
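
A minimal sketch of the parser-based route, assuming Python and its standard-library html.parser module (neither is mentioned in the thread; the class and function names are hypothetical). The parser lowercases tag and attribute names and accepts either quote style, so the variants listed above are all recognised, while body text that merely contains "charset=" is left alone.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Collect the raw text of <meta> tags that declare a charset."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (raw_tag_text, declared_charset)

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "content-type":
            return
        # content looks like "text/html; charset=us-ascii"
        for part in attr.get("content", "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.strip().lower() == "charset" and value.strip():
                self.found.append((self.get_starttag_text(), value.strip()))


def force_utf8(html):
    """Rewrite each charset-declaring <meta> tag to declare UTF-8."""
    finder = MetaCharsetFinder()
    finder.feed(html)
    finder.close()
    for raw_tag, _old_charset in finder.found:
        # Keep the trailing slash if the original tag was XHTML-style.
        end = " />" if raw_tag.rstrip().endswith("/>") else ">"
        new_tag = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"' + end
        # Simplification: replaces the first occurrence of the raw tag text.
        html = html.replace(raw_tag, new_tag, 1)
    return html


page = (
    "<html><head><MeTA ConTenT='text/html; charset=US-ascII'\n"
    'htTp-EQUiv="Content-Type" /></head>'
    "<body><p>charset=us-ascii</p></body></html>"
)
print(force_utf8(page))
```

This only handles the http-equiv form discussed in the thread. A document with no such tag is returned unchanged, so a real program would still need to decide where to add a declaration in that case.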