How do I use BeautifulSoup to correctly parse UTF-8 encoded HTML into Unicode strings? (html utf8)

25-03-01 13

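This article explains in detail how to use BeautifulSoup to correctly parse UTF-8 encoded HTML into Unicode strings and answers common questions about html utf8. It also collects useful information on the following topics: BeautifulSoup and Unicode problems; the HTML parsing library BeautifulSoup; Learn Beautiful Soup (5): Modifying web page content with BeautifulSoup; and Learn Beautiful Soup (6): Encoding support in BeautifulSoup.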

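Contents of this article:

How do I use BeautifulSoup to correctly parse UTF-8 encoded HTML into Unicode strings? (html utf8)
BeautifulSoup and Unicode problems
The HTML parsing library BeautifulSoup
Learn Beautiful Soup (5): Modifying web page content with BeautifulSoup
Learn Beautiful Soup (6): Encoding support in BeautifulSoup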

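How do I use BeautifulSoup to correctly parse UTF-8 encoded HTML into Unicode strings? (html utf8)

I'm running a Python program which fetches a UTF-8 encoded web page and uses BeautifulSoup to extract some text from the HTML.

However, when I write this text to a file (or print it on the console), it is written in an unexpected encoding.

Sample program: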

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch URL
url = 'http://www.voxnow.de/'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')

# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)

# Parse with BeautifulSoup
soup = BeautifulSoup(response)

# Print title attribute of a <div> which uses umlauts (e.g. können)
print repr(soup.find('div', id='navbutton_account')['title'])

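Running this gives the result:

# u'Hier k\u0102\u015bnnen Sie sich kostenlos registrieren und / oder einloggen!'

But I would expect a Python Unicode string to render the ö in the word können as \xf6:

# u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'

I have tried passing the 'fromEncoding' parameter to BeautifulSoup, and I have tried calling read() and decode() on the response object, but either it makes no difference or it raises an error.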

With the command curl www.voxnow.de | hexdump -C, I can see that the web page really is UTF-8 encoded (that is, it contains 0xc3 0xb6) for the ö character:

      20 74 69 74 6c 65 3d 22  48 69 65 72 20 6b c3 b6  | title="Hier k..|
      6e 6e 65 6e 20 53 69 65  20 73 69 63 68 20 6b 6f  |nnen Sie sich ko|
      73 74 65 6e 6c 6f 73 20  72 65 67 69 73 74 72 69  |stenlos registri|

I've reached the limit of my Python abilities, so I'm at a loss as to how to debug this further. Any suggestions?

Answer 1

小编典典

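The HTML content reports itself as UTF-8 encoded and, for the most part, it is, except for one or two rogue invalid UTF-8 characters.

This apparently leaves BeautifulSoup unsure about which encoding is in use. When I tried to decode the content as UTF-8 first, before passing it to BeautifulSoup, like this:

soup = BeautifulSoup(response.read().decode('utf-8'))

I got the error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813: invalid continuation byte

Looking more closely at the output, there was one instance of the character Ü that had been wrongly encoded as the invalid byte sequence 0xe3 0x9c instead of the correct 0xc3 0x9c.

As the currently highest-rated answer to that question suggests, the invalid UTF-8 characters can be dropped while decoding, so that only valid data is passed to BeautifulSoup:

soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))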
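
To see what the error handler does without fetching the page, here is a minimal Python 3 sketch that uses only the standard library. The byte string is invented for illustration (it is not the actual voxnow.de content); it simply contains the invalid sequence 0xe3 0x9c described above. The last lines also show that decoding UTF-8 bytes as ISO-8859-2 reproduces the k\u0102\u015bnnen from the question, which suggests (but does not prove) that this was the encoding that had been guessed.

```python
# Hypothetical sample: valid UTF-8 text followed by a corrupted "Über"
# (0xe3 0x9c instead of the correct 0xc3 0x9c).
data = "Hier können Sie ".encode("utf-8") + b"\xe3\x9cber"

# Strict decoding (the default) raises on the invalid sequence.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    print("strict decoding failed:", exc.reason)

# "ignore" silently drops the undecodable bytes.
ignored = data.decode("utf-8", "ignore")
# "replace" marks the damaged spot with U+FFFD instead of hiding it.
replaced = data.decode("utf-8", "replace")
print(repr(ignored))
print(repr(replaced))

# Decoding correct UTF-8 bytes with the wrong codec gives the mojibake
# shown in the question.
mojibake = "können".encode("utf-8").decode("iso-8859-2")
print(ascii(mojibake))
```

Whether to use "ignore" or "replace" is a judgment call: "ignore" yields cleaner text, while "replace" keeps a visible trace of where data was lost.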

BeautifulSoup and Unicode problems

I am using BeautifulSoup to parse some web pages.

Occasionally I run into a "unicode hell" error like the following.

Looking at the source of this article on TheAtlantic.com [ http://www.theatlantic.com/education/archive/2013/10/why-are-hundreds-of-harvard-students-studying-ancient-chinese-philosophy/280356/ ]

We see this in the og:description meta property:

<meta property="og:description" content="The professor who teaches&nbsp;Classical Chinese Ethical and Political Theory claims, &quot;This course will change your life.&quot;" />

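When BeautifulSoup parses it, I see this:

>>> print repr(description)
u'The professor who teaches\xa0Classical Chinese Ethical and Political Theory claims, "This course will change your life."'

If I try encoding it to UTF-8, as a comment suggested:

>>> print repr(description.encode('utf8'))
'The professor who teaches\xc2\xa0Classical Chinese Ethical and Political Theory claims, "This course will change your life."'

Just when I thought I had all my unicode issues under control, it turns out I still don't quite understand what is going on, so I'd like to ask a few questions:

1- Why would BeautifulSoup convert the &nbsp; to \xa0 [a Latin charset space character]? The charset and headers on this page are UTF-8, and I thought BeautifulSoup used that data for the encoding. Why wasn't it replaced with a <space>?

2- Is there a common way to normalize whitespace for conversion?

3- When I encoded to UTF-8, where did \xa0 turn into the sequence \xc2\xa0?

I can pipe everything through unicodedata.normalize('NFKD', string) to help get me where I want to be, but I'd love to understand what is wrong and avoid problems like this in the future.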

Answer 1

小编典典

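You aren't encountering a problem. Everything is behaving as intended.

&nbsp; denotes a non-breaking space character. It isn't replaced with a space because it doesn't represent a space; it represents a non-breaking space. Replacing it with a space would lose information: wherever that space occurs, a text rendering engine should not place a line break.

The Unicode code point for the non-breaking space is U+00A0, which is written in a Python Unicode string as \xa0.

The UTF-8 encoding of U+00A0 is, in hexadecimal, the two-byte sequence C2 A0, or \xc2\xa0 in Python string notation. In UTF-8, anything beyond the 7-bit ASCII set needs two or more bytes to represent it. In this case, the highest bit set is the eighth bit, which means the code point can be represented by the two-byte sequence (in binary) 110xxxxx 10xxxxxx, where the x's are the bits of the binary representation of the code point. In the case of A0, that is 10100000, or, when encoded in UTF-8, 11000010 10100000, which is C2 A0.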
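
The bit arithmetic above can be checked with a short Python 3 sketch. The helper name encode_two_byte is made up for this illustration; it only handles code points in the two-byte range (U+0080 to U+07FF) and is compared against Python's own UTF-8 encoder.

```python
def encode_two_byte(code_point):
    """Encode a code point in U+0080..U+07FF as 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    lead = 0b11000000 | (code_point >> 6)                 # top 5 payload bits
    continuation = 0b10000000 | (code_point & 0b00111111)  # low 6 payload bits
    return bytes([lead, continuation])


nbsp = encode_two_byte(0xA0)
print(format(0xA0, "08b"))                     # bits of the code point
print(" ".join(format(b, "08b") for b in nbsp))  # bits of the two UTF-8 bytes
print(nbsp == "\xa0".encode("utf-8"))          # agrees with the built-in codec
```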

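Many people use &nbsp; in HTML to get spaces that are not collapsed by the usual HTML whitespace-collapsing rules (in HTML, all runs of consecutive spaces, tabs, and newlines are interpreted as a single space unless one of the CSS white-space rules is applied), but that is not really what it is for. It is meant for things like names, such as "Mr. Miyagi", where you don't want a line break between "Mr." and "Miyagi". I'm not sure why it was used in this particular case. It seems out of place here, but that is more a problem with the source than with the code that interprets it.

Now, if you don't really care about layout, so that you don't mind whether text layout algorithms choose that spot as a place to wrap, and you just want to treat it as a regular space, normalizing with NFKD is a perfectly reasonable answer (or NFKC if you prefer precomposed accents to decomposed ones). The NFKC and NFKD normalizations map characters so that most characters that represent essentially the same semantic value in most contexts are expanded. For example, ligatures are expanded (ﬃ -> ffi), the archaic long s is converted to s (ſ -> s), Roman numeral characters are expanded into their individual letters (Ⅳ -> IV), and the non-breaking space is converted into a normal space. For some characters, NFKC or NFKD normalization may lose information that matters in some contexts: ℌ and ℍ both normalize to H, but in mathematical texts they can be used to refer to different things.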
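
The mappings listed above can be reproduced with the standard library's unicodedata module. This is a small Python 3 sketch; the description string is a shortened copy of the og:description text from the question.

```python
import unicodedata

description = "The professor who teaches\xa0Classical Chinese Ethical and Political Theory"

# NFKD turns the non-breaking space (U+00A0) into a plain space (U+0020).
normalized = unicodedata.normalize("NFKD", description)
print(repr(normalized))

# Compatibility mappings mentioned in the answer.
for original in ["\ufb03", "\u017f", "\u2163", "\u210c", "\u210d"]:
    print(ascii(original), "->", unicodedata.normalize("NFKD", original))

# NFKC recomposes accents, NFKD leaves them decomposed.
composed = unicodedata.normalize("NFKC", "e\u0301")
decomposed = unicodedata.normalize("NFKD", "\xe9")
print(ascii(composed), ascii(decomposed))
```

If only the non-breaking space matters and the other compatibility mappings are unwanted, a plain text.replace("\xa0", " ") is a narrower alternative.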

The HTML parsing library BeautifulSoup

Installation:

apt install python-bs4

pip install beautifulsoup4

Alternatively, download the source from https://pypi.python.org/pypi/beautifulsoup4/ and install it with python setup.py install

apt install python-lxml

easy_install lxml

pip install lxml

apt install python-html5lib

easy_install html5lib

pip install html5lib

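Parser comparison

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires a C library to be installed
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires a C library to be installed
html5lib	BeautifulSoup(markup, "html5lib")	Best tolerance of malformed documents; parses documents the way a browser does; produces HTML5-style documents	Slow; pure Python, so no C extension is needed, but html5lib must be installed separately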
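
Because lxml and html5lib are optional installs, code that should run everywhere may want to fall back to the built-in parser. The following Python 3 sketch assumes the bs4 package is installed; the helper name make_soup and its preference order are made up for this illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser_name in preferred:
        try:
            return BeautifulSoup(markup, parser_name), parser_name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used_parser = make_soup("<p>hello</p>")
print(used_parser, soup.p.string)
```

Note that the parsers can build different trees for the same malformed markup, so pinning one parser explicitly is usually the more predictable choice.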

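Learn Beautiful Soup (5): Modifying web page content with BeautifulSoup

Besides searching and navigating web page content, BeautifulSoup can also modify a page. Modifying means adding or removing tags, renaming tags, changing tag attributes, changing text content, and so on.

Modifying tags with BeautifulSoup

Every tag is represented in BeautifulSoup as a Tag object, and this object supports the following operations:

Changing the tag name
Changing tag attributes
Adding new tags
Deleting existing tags
Changing the text content of a tag

Changing the tag name

To rename a tag, simply assign a new value to its .name attribute.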

producer_entries.name = "div"


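Changing tag attributes

This covers attributes such as class, id, and style. Because attributes are stored as a dictionary, changing a tag's attributes is just ordinary Python dictionary handling.

Updating an attribute that already exists

See the following code:

producer_entries['id'] = "producers_new_value"

Adding a new attribute to a tag

If a tag has no class attribute, for example, one can be added as follows:

producer_entries['class'] = 'newclass'

Deleting a tag attribute

Use the del operator, for example:

del producer_entries['class']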
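
The renaming and attribute operations above can be tried together in one short Python 3 sketch. It assumes the bs4 package is installed; the html_markup string is a made-up stand-in for the tutorial's producers list, and html.parser is used so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the tutorial's html_markup.
html_markup = (
    '<ul id="producers">'
    '<li><div class="name">plants</div><div class="number">100000</div></li>'
    "</ul>"
)
soup = BeautifulSoup(html_markup, "html.parser")
producer_entries = soup.ul

producer_entries.name = "div"                    # rename <ul> to <div>
producer_entries["id"] = "producers_new_value"   # update an existing attribute
producer_entries["class"] = "newclass"           # add a new attribute
del producer_entries["class"]                    # and remove it again

print(producer_entries.attrs)
print(soup)
```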

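Adding a new tag

BeautifulSoup provides the new_tag() method for creating a new tag. The new tag can then be inserted with append(), insert(), insert_after(), or insert_before().

Adding a new producer with new_tag() and then append()

Continuing the earlier example, in addition to the producers plants and algae, we now add phytoplankton. First we need to create an li tag.

Creating a new tag with new_tag()

new_tag() can only be called on a BeautifulSoup object. Now create an li tag.

soup = BeautifulSoup(html_markup,"lxml")
new_li_tag = soup.new_tag("li")

The only required argument of new_tag() is the tag name; attribute arguments and any other arguments are optional. For example:

new_atag = soup.new_tag("a", href="www.example.com")

new_li_tag.attrs = {'class': 'producerlist'}

Adding the new tag with append()

append() adds the new tag at the end of .contents, just like the append() method of a Python list.

producer_entries = soup.ul
producer_entries.append(new_li_tag)

The li tag is now a child of the ul tag. The output after adding the new tag: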

<ul id="producers">
<li>
<div>
plants
</div>
<div>
100000
</div>
</li>
<li>
<div>
algae
</div>
<div>
100000
</div>
</li>
<li>
</li>
</ul>

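Adding new div tags to the li tag with insert()

append() adds a new tag at the end of .contents, whereas insert() does not: we have to specify the position at which to insert, just like the insert() method of a Python list.

new_div_name_tag=soup.new_tag("div")
new_div_name_tag["class"]="name"
new_div_number_tag=soup.new_tag("div")
new_div_number_tag["class"]="number"

The code above first creates the two div tags.

new_li_tag.insert(0,new_div_name_tag)
new_li_tag.insert(1,new_div_number_tag)
print(new_li_tag.prettify())

This then inserts them, and the printed li tag now contains the two empty div tags.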
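
Putting the new_tag(), append(), and insert() steps together, a self-contained Python 3 sketch might look like this. It assumes the bs4 package is installed; the html_markup string is a made-up reconstruction of the producers list, and html.parser replaces lxml so that no extra dependency is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the tutorial's html_markup.
html_markup = """<ul id="producers">
<li><div class="name">plants</div><div class="number">100000</div></li>
<li><div class="name">algae</div><div class="number">100000</div></li>
</ul>"""
soup = BeautifulSoup(html_markup, "html.parser")

# Create an empty <li> and append it as the last child of <ul>.
new_li_tag = soup.new_tag("li")
new_li_tag.attrs = {"class": "producerlist"}
soup.ul.append(new_li_tag)

# Create two <div> tags and insert them at explicit positions.
new_div_name_tag = soup.new_tag("div")
new_div_name_tag["class"] = "name"
new_div_number_tag = soup.new_tag("div")
new_div_number_tag["class"] = "number"
new_li_tag.insert(0, new_div_name_tag)
new_li_tag.insert(1, new_div_number_tag)

print(new_li_tag.prettify())
```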

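Changing string content

In the example above we only added tags, and the tags have no content yet. BeautifulSoup can add content as well.

Changing string content with .string

For example:

new_div_name_tag.string="phytoplankton"
print(producer_entries.prettify())

Output: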

<ul id="producers">
<li>
<div>
plants
</div>
<div>
100000
</div>
</li>
<li>
<div>
algae
</div>
<div>
100000
</div>
</li>
<li>
<div>
phytoplankton
</div>
<div>
</div>
</li>
</ul>

Adding strings with append(), insert(), and new_string()

append() and insert() work for strings the same way they do for new tags. For example:

new_div_name_tag.append("producer")
print(soup.prettify())

Output:

<html>
<body>
<div>
<ul id="producers">
<li>
<div>
plants
</div>
<div>
100000
</div>
</li>
<li>
<div>
algae
</div>
<div>
100000
</div>
</li>
<li>
<div>
phytoplankton
producer
</div>
<div>
</div>
</li>
</ul>
</div>
</body>
</html>

There is also a new_string() method:

new_string_toappend = soup.new_string("producer")
new_div_name_tag.append(new_string_toappend)
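
The three ways of adding text can be compared in one small Python 3 sketch. It assumes the bs4 package is installed; the one-line markup and the exact strings are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<li><div class="name"></div></li>', "html.parser")
name_div = soup.div

# .string replaces whatever the tag contained with a single string.
name_div.string = "phytoplankton"

# append() with a plain str adds another string after the existing one.
name_div.append(" producer")

# new_string() builds the NavigableString explicitly before appending it.
name_div.append(soup.new_string(" (new)"))

print(name_div.get_text())
print(soup)
```

Assigning to .string discards existing children, so append() or insert() is the safer choice when the tag already holds content that should be kept.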

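Removing a tag from the page

Tags can be removed with the decompose() and extract() methods.

Removing a producer with decompose()

We now remove the div tag with the class="name" attribute, using the decompose() method.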

third_producer = soup.find_all("li")[2]
div_name = third_producer.div
div_name.decompose()
print(third_producer.prettify())

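The printed li tag no longer contains that div. decompose() removes the tag together with all of its descendants.

Removing a producer with extract()

extract() removes a tag or string from an HTML document and, in addition, returns a handle to the removed tag or string. Unlike decompose(), extract() can also be used on strings.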

third_producer_removed=third_producer.extract()
print(soup.prettify())
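
The difference between the two removal methods can be seen in a short Python 3 sketch. It assumes the bs4 package is installed; the markup and the id values are made up for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<ul><li id="a"><div class="name">plants</div>'
    '<div class="number">1</div></li><li id="b">algae</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")
first_li = soup.find("li", id="a")

# decompose() destroys the tag and its children; nothing useful is returned.
first_li.div.decompose()

# extract() also works on strings and hands back what it removed.
number_text = first_li.div.string.extract()

# extract() on a tag detaches it from the tree but keeps it usable.
removed_li = soup.find("li", id="b").extract()

print(repr(number_text))
print(removed_li)
print(soup)
```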

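Removing the contents of a tag with BeautifulSoup

A tag can have NavigableString objects or Tag objects as children. These children can be removed with clear().

For example, we can remove the div tag containing plants and the corresponding div tag with class="number".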

li_plants=soup.li

li_plants.clear()

After this call, everything inside the li tag has been removed, and only the empty li tag itself remains.
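
A minimal Python 3 sketch of clear(), assuming the bs4 package is installed and using made-up markup, shows that the tag itself survives while its children are gone:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<ul><li><div>plants</div><div>100000</div></li><li>algae</li></ul>",
    "html.parser",
)

li_plants = soup.li   # the first <li> in the document
li_plants.clear()     # drop all children, keep the <li> itself

print(soup)
```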

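Special functions for modifying content

In addition to the methods we have seen so far, BeautifulSoup has other methods for modifying content.

insert_after() and insert_before():

These two methods insert a tag or string immediately after or before another tag or string. They accept only NavigableString and Tag objects as arguments.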

soup = BeautifulSoup(html_markup,"lxml")
div_number = soup.find("div",class_="number")
div_ecosystem = soup.new_tag("div")
div_ecosystem['class'] = "ecosystem"
div_ecosystem.append("soil")
div_number.insert_after(div_ecosystem)
print(soup.prettify())

Output:

<html>
<body>
<div>
<ul id="producers">
<li>
<div>
plants
</div>
<div>
100000
</div>
<div>
soil
</div>
</li>
<li>
<div>
algae
</div>

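replace_with():

This method replaces an existing tag or string with a new tag or string. It takes a tag object or a string object as input, and it returns a handle to the tag or string that was replaced.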

soup = BeautifulSoup(html_markup,"lxml")
div_name =soup.div
div_name.string.replace_with("phytoplankton")
print(soup.prettify())

replace_with() can likewise be used to replace a tag entirely.
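
Both uses of replace_with(), on a string and on a whole tag, are shown in this Python 3 sketch. It assumes the bs4 package is installed; the markup and the replacement span tag are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<ul><li><div class="name">plants</div></li></ul>', "html.parser"
)
div_name = soup.div

# Replace only the string inside the div; the old string is returned.
old_string = div_name.string.replace_with("phytoplankton")

# Replace the whole div with a new tag; the old div is returned.
new_span = soup.new_tag("span")
new_span.string = "algae"
old_div = div_name.replace_with(new_span)

print(repr(old_string))
print(old_div)
print(soup)
```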

wrap() and unwrap():

wrap() wraps a tag or string inside another tag. For example, we can wrap each li tag, together with all of its contents, in a div tag.

li_tags = soup.find_all("li")
for li in li_tags:
    new_divtag = soup.new_tag("div")
    li.wrap(new_divtag)
print(soup.prettify())

unwrap() does the opposite of wrap(). Like replace_with(), unwrap() returns a handle to the tag that was replaced.
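
A round trip through wrap() and unwrap() makes the symmetry visible. This Python 3 sketch assumes the bs4 package is installed and uses made-up markup.

```python
from bs4 import BeautifulSoup

original = "<ul><li>plants</li><li>algae</li></ul>"
soup = BeautifulSoup(original, "html.parser")

# Wrap every <li> in its own new <div>.
for li in soup.find_all("li"):
    li.wrap(soup.new_tag("div"))
wrapped = str(soup)
print(wrapped)

# unwrap() removes each <div> again but keeps what was inside it.
for div in soup.find_all("div"):
    div.unwrap()
print(soup)
```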

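Learn Beautiful Soup (6): Encoding support in BeautifulSoup

Every web page has its own encoding. UTF-8 is currently the standard encoding for websites. When crawling pages, a crawler therefore has to understand their encoding. Otherwise, you may well see correct characters in the browser and still end up with garbled text in the crawled result. BeautifulSoup handles these encodings capably.

Encodings in BeautifulSoup

Typically, the encoding of a page can be seen in its charset attribute:

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

BeautifulSoup uses the UnicodeDammit library to detect a document's encoding automatically. When the soup object is created, BeautifulSoup automatically converts the content to Unicode.

To find out the original encoding of an HTML document, use soup.original_encoding, which reports the encoding the document was in.

Specifying the encoding of an HTML document: UnicodeDammit searches the whole document to detect which encoding it uses. This takes time, and UnicodeDammit may also guess wrong. If you know the document's encoding, you can pass it as from_encoding when you first create the BeautifulSoup object.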

soup = BeautifulSoup(html_markup,"lxml",from_encoding="utf-8")
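
A small Python 3 sketch of from_encoding and original_encoding, assuming the bs4 package is installed. The raw byte string is made up for illustration, and html.parser is used so that lxml is not required. Note that from_encoding only makes sense when the markup is passed as bytes.

```python
from bs4 import BeautifulSoup

# Hypothetical UTF-8 encoded page, passed to BeautifulSoup as bytes.
raw = (
    "<html><head><meta charset='utf-8'></head>"
    "<body><p>können</p></body></html>"
).encode("utf-8")

soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

print(soup.original_encoding)   # the encoding that was used for decoding
print(ascii(soup.p.string))     # the text is now a Unicode string
```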

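Encoding the output

BeautifulSoup also has methods for outputting text. prettify(), for example, outputs UTF-8 by default, even if the document was originally in a different encoding.

prettify() can, however, be told to output a different encoding:

print(soup.prettify("ISO8859-2"))

We can also use encode() to produce encoded output. encode() likewise defaults to UTF-8, and it too accepts a different encoding.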
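
The output options can be compared in a short Python 3 sketch, assuming the bs4 package is installed and using made-up markup. Under Python 3, prettify() without an argument returns a str, while passing an encoding to prettify() or calling encode() returns bytes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>können</p>", "html.parser")

as_text = soup.prettify()               # str, not yet encoded
as_latin2 = soup.prettify("ISO8859-2")  # bytes in ISO-8859-2
as_utf8 = soup.encode()                 # bytes, UTF-8 by default
as_latin1 = soup.encode("latin-1")      # bytes in Latin-1

print(type(as_text).__name__, type(as_latin2).__name__)
print(as_utf8)
print(as_latin1)
```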

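That concludes this article on how to use BeautifulSoup to correctly parse UTF-8 encoded HTML into Unicode strings and on html utf8. We hope you found it useful. To learn more about BeautifulSoup and Unicode problems, the HTML parsing library BeautifulSoup, Learn Beautiful Soup (5): Modifying web page content with BeautifulSoup, and Learn Beautiful Soup (6): Encoding support in BeautifulSoup, you can search this site.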
