BeautifulSoup 'has no attribute' HTML_ENTITIES

25-04-03 8


This page covers the BeautifulSoup 'has no attribute' HTML_ENTITIES error and collects several related questions: Beautiful Soup 4 find_all not finding links that Beautiful Soup 3 finds, getting src links from HTML with BeautifulSoup, using BeautifulSoup with recursion to get the tags along the longest path in an HTML document, and BeautifulSoup being installed but still raising ImportError: No module named BeautifulSoup.

Contents:

BeautifulSoup 'has no attribute' HTML_ENTITIES
Beautiful Soup 4 find_all does not find links that Beautiful Soup 3 finds
BeautifulSoup: getting src links from HTML
BeautifulSoup with Recursion: getting the HTML tags along the child/longest path
BeautifulSoup is installed but still raises ImportError: No module named BeautifulSoup

BeautifulSoup 'has no attribute' HTML_ENTITIES

I recently upgraded BeautifulSoup on my Windows machine from version 3.0 to version 4.1.

I am now getting a strange error:

  File "C:\path\to\myscript.py", line 230, in soupify
    return BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
AttributeError: type object 'BeautifulSoup' has no attribute 'HTML_ENTITIES'

This is the snippet that raises the exception:

def soupify(html):
    return BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

The BS documentation does not mention how the constructor signature changed from v3 to v4. How do I fix this?

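Answer 1

An incoming HTML or XML entity is always converted into the corresponding Unicode character. Beautiful Soup 3 had a number of overlapping ways of dealing with entities, which have been removed.
The BeautifulSoup constructor no longer recognizes the smartQuotesTo or convertEntities arguments.
(Unicode, Dammit still has smart_quotes_to, but its default is now to turn smart quotes into Unicode.)
If you want to turn those Unicode characters back into HTML entities on output, rather than turning them into UTF-8 characters, you need to use an output formatter.

Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#entities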
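As a minimal sketch of what the quoted documentation implies, the soupify function from the question could simply drop the convertEntities argument. The sample markup and the explicit "html.parser" choice are assumptions made for illustration; formatter="html" is the output formatter that writes named HTML entities back out.

```python
from bs4 import BeautifulSoup


def soupify(html):
    # Beautiful Soup 4 always converts entities to Unicode while parsing,
    # so the BS3-only convertEntities argument is dropped.
    # "html.parser" is an assumed parser choice for this sketch.
    return BeautifulSoup(html, "html.parser")


# Hypothetical sample markup containing named entities.
soup = soupify("<p>caf&eacute; &amp; cr&egrave;me</p>")

# Entities arrive as plain Unicode characters in the tree.
text = soup.p.get_text()

# To get entities back on output, pass an output formatter.
as_entities = soup.decode(formatter="html")

print(text)
print(as_entities)
```

If only the parsed text is needed, the formatter step can be left out entirely.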

Beautiful Soup 4 find_all does not find links that Beautiful Soup 3 finds

I noticed a very annoying bug: BeautifulSoup 4 (package: bs4) often finds fewer tags than the previous version (package: BeautifulSoup).

Here is a reproducible example of the problem:

import requests
import bs4
import BeautifulSoup

r = requests.get('http://wordpress.org/download/release-archive/')
s4 = bs4.BeautifulSoup(r.text)
s3 = BeautifulSoup.BeautifulSoup(r.text)
print('With BeautifulSoup 4 : {}'.format(len(s4.findAll('a'))))
print('With BeautifulSoup 3 : {}'.format(len(s3.findAll('a'))))

Output:

With BeautifulSoup 4 : 557
With BeautifulSoup 3 : 1701

As you can see, the difference is not small.

In case anyone is wondering, here are the exact versions of the modules:

In [20]: bs4.__version__
Out[20]: '4.2.1'

In [21]: BeautifulSoup.__version__
Out[21]: '3.2.1'

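Answer 1

You have lxml installed, which means that BeautifulSoup 4 will use that parser in preference to the standard-library html.parser option.

You can upgrade lxml to 3.2.1 (which, for me, returns 1701 results for your test page). lxml itself uses libxml2 and libxslt, which may also be to blame here, so you may have to upgrade those as well. See the lxml requirements page; libxml2 2.7.8 or newer is currently recommended.

Or explicitly specify a different parser when parsing the soup:

s4 = bs4.BeautifulSoup(r.text, 'html.parser')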
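To see whether the parser is responsible for a difference like this, one option is to parse the same markup with each parser by name and compare the counts. The following is only a sketch: count_links and SAMPLE_HTML are hypothetical names, and the tiny inline document stands in for the real page. Parsers that are not installed are recorded as None.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for the downloaded page.
SAMPLE_HTML = (
    "<html><body><a href='/a'>A</a>"
    "<p><a href='/b'>B</a></p></body></html>"
)


def count_links(html, parser):
    # Name the parser explicitly so the result does not depend on
    # which optional parsers happen to be installed.
    return len(BeautifulSoup(html, parser).find_all("a"))


counts = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        counts[parser] = count_links(SAMPLE_HTML, parser)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        counts[parser] = None

print(counts)
```

On well-formed markup the parsers should agree; large differences usually show up only on broken or very large pages.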

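BeautifulSoup: getting src links from HTML

I am using Python 3.5.1 and the requests module to build a small web crawler that downloads all the comics from a particular website. I am experimenting with a single page. I parse the page with BeautifulSoup4 like this:

import webbrowser
import sys
import requests
import re
import bs4

res = requests.get('http://mangapark.me/manga/berserk/s5/c342')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all("a", class_="img-link"):
    if link:
        print(link)
    else:
        print('ERROR')

When I do this, print(link) shows the part of the HTML I am interested in, but when I try to get only the link in src with link.get('src'), it just prints None.

I tried getting the link with:

img = soup.find("img")["src"]

That works, but I want all of the src links, not just the first one. I have very little experience with BeautifulSoup. Please point out what is going on. Thanks.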

A sample of the HTML I am interested in from the site:

<a class="img-link" href="#img2">
    <img id="img-1" rel="1" i="1" e="0" z="1"
         title="Berserk ch.342 page 1" src="http://2.p.mpcdn.net/352582/687224/1.jpg"
         width="960" _width="818" _heighth="1189"/>
</a>

Answer 1

I would do this in one go with a CSS selector:

for img in soup.select("a.img-link img[src]"):
    print(img["src"])

Here we get all img elements that have a src attribute and sit under an a element with the img-link class. It prints:

http://2.p.mpcdn.net/352582/687224/1.jpg
http://2.p.mpcdn.net/352582/687224/2.jpg
http://2.p.mpcdn.net/352582/687224/3.jpg
http://2.p.mpcdn.net/352582/687224/4.jpg
...
http://2.p.mpcdn.net/352582/687224/20.jpg

If you still want to use find_all(), you have to nest it:

for link in soup.find_all("a", class_="img-link"):
    for img in link.find_all("img", src=True):  # searching for img with src attribute
        print(img["src"])
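The reason link.get('src') returns None is that src belongs to the nested img element, not to the a element. A small offline sketch, using made-up example.com URLs and a shortened version of the markup as assumptions, shows both approaches side by side:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup modelled on the sample in the question.
SAMPLE_HTML = """
<a class="img-link" href="#img2">
    <img id="img-1" title="page 1" src="http://example.com/1.jpg"/>
</a>
<a class="img-link" href="#img3">
    <img id="img-2" title="page 2" src="http://example.com/2.jpg"/>
</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The <a> tag has href and class, but no src attribute.
first_link = soup.find("a", class_="img-link")
src_on_anchor = first_link.get("src")

# CSS selector: img elements with a src attribute under a.img-link.
srcs_css = [img["src"] for img in soup.select("a.img-link img[src]")]

# Nested find_all: look for img tags inside each matching link.
srcs_nested = [
    img["src"]
    for link in soup.find_all("a", class_="img-link")
    for img in link.find_all("img", src=True)
]

print(src_on_anchor, srcs_css, srcs_nested)
```

Both lists contain the same URLs in document order, so the choice between select() and nested find_all() is mostly a matter of taste.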

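BeautifulSoup with Recursion: getting the HTML tags along the child/longest path

You can use recursion with a generator. The HTML can be traversed by iterating over soup.contents and incrementing a counter at each level: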

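from bs4 import BeautifulSoup as soup, NavigableString as ns

# HTML is assumed to hold the markup to analyse, as a string.
def get_paths(d, p=(), c=0):
    # Child tags only; text nodes (NavigableString) are skipped.
    k = [i for i in getattr(d, 'contents', []) if not isinstance(i, ns)]
    if not k:
        # Leaf tag: yield its depth and the path leading to it.
        yield (c, ' > '.join([*p, d.name]))
    else:
        for i in k:
            yield from get_paths(i, p=(*p, d.name), c=c + 1)

# The deepest leaf gives the longest path.
_, path = max(get_paths(soup(HTML, 'html.parser').html), key=lambda x: x[0])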

Output:

'html > body > div > p > span'
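For a version that can be run as-is, here is a sketch of the same recursive-generator idea applied to a small inline document. SAMPLE_HTML and the name iter_leaf_paths are made up for this illustration; the document is chosen so that its deepest branch matches the path shown above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample document with one deep and one shallow branch.
SAMPLE_HTML = (
    "<html><body>"
    "<div><p><span>deep</span></p></div>"
    "<p>shallow</p>"
    "</body></html>"
)


def iter_leaf_paths(tag, path=(), depth=0):
    """Yield (depth, 'a > b > c') for every tag that has no child tags."""
    current = path + (tag.name,)
    children = [child for child in tag.contents if isinstance(child, Tag)]
    if not children:
        yield depth, " > ".join(current)
    else:
        for child in children:
            yield from iter_leaf_paths(child, current, depth + 1)


root = BeautifulSoup(SAMPLE_HTML, "html.parser").html
depth, longest_path = max(iter_leaf_paths(root), key=lambda item: item[0])
print(depth, longest_path)
```

When several leaves share the maximum depth, max() returns the first one it encounters in document order.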

BeautifulSoup is installed but still raises ImportError: No module named BeautifulSoup

How do I fix BeautifulSoup being installed but still raising ImportError: No module named BeautifulSoup?

I installed BeautifulSoup successfully, and it is the latest version. But I still get "ImportError: No module named BeautifulSoup" when I run my code. Need help!

Solution

Try:

from bs4 import BeautifulSoup
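The error usually comes from the rename between major versions: Beautiful Soup 4 is imported from the bs4 module, while Beautiful Soup 3 used a module named BeautifulSoup. As a rough diagnostic sketch (the function name is made up for this example), the standard library can report which of the two module names is importable in the current environment:

```python
import importlib.util


def installed_beautifulsoup_modules():
    # "bs4" is the Beautiful Soup 4 module; "BeautifulSoup" is the
    # legacy Beautiful Soup 3 module name.
    names = ("bs4", "BeautifulSoup")
    return {name: importlib.util.find_spec(name) is not None for name in names}


available = installed_beautifulsoup_modules()
print(available)
```

If only bs4 is reported as available, every "import BeautifulSoup" or "from BeautifulSoup import ..." line in the code needs to be changed to import from bs4.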
