BeautifulSoup: the difference between .find() and .select() (beautifulsoup select vs. find)

25-03-02 11

This article looks at how .find() and .select() differ in BeautifulSoup. It also covers the difference between $.ajax(), $.get() and $.load(); the difference between .string and .text in BeautifulSoup; the difference between Thread.currentThread().getId() and Process.myTid() on Android; and a simple scraping walkthrough of find_all() and select() in BeautifulSoup4.

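Contents:

BeautifulSoup: the difference between .find() and .select() (beautifulsoup select vs. find)
The difference between $.ajax(), $.get() and $.load()
The difference between .string and .text in BeautifulSoup
The difference between Thread.currentThread().getId() and Process.myTid() on Android
BeautifulSoup4's find_all() and select(): a simple scraping walkthrough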

BeautifulSoup: the difference between .find() and .select() (beautifulsoup select vs. find)

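When you use BeautifulSoup to scrape a specific part of a website, you can use soup.find() and soup.findAll(), or soup.select().

Is there a difference between the .find() and .select() methods (for example in performance or flexibility), or are they the same?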

Answer 1

To summarise the comments:

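select finds multiple instances and returns a list, while find returns only the first instance, so they do not do the same thing. select_one is the equivalent of find.
I almost always use CSS selectors when chaining tags or matching tag.classname; when looking for a single element without a class I use find. Essentially it comes down to the use case and personal preference.
As far as flexibility goes, I think you know the answer: soup.select("div[id=foo] > div > div > div[class=fee] > span > span > a") would look pretty ugly written as a chain of find / find_all calls.
The one drawback of CSS selectors in bs4, at the time this answer was written, was their very limited support: nth-of-type was the only pseudo-class implemented, and chained attributes such as a[href][src] were not supported, along with many other parts of CSS selectors. Beautiful Soup 4.7.0 and later hand select() to the Soup Sieve package, which lifts most of these limits. Either way, things like a[href*=..], a[href^=] and a[href$=] are, I think, much nicer than find("a", href=re.compile(....)), but again that is personal preference.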
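
The points above can be tried on a small document. What follows is only a minimal sketch: the markup is invented for illustration, and it assumes the built-in html.parser rather than lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
HTML = """
<div id="foo">
  <div class="fee"><span><a href="/first">first</a></span></div>
  <div class="fee"><span><a href="/second">second</a></span></div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find() returns the first match (or None); select_one() is its CSS counterpart.
first_by_find = soup.find("a")
first_by_select = soup.select_one("a")

# find_all() and select() both return every match.
all_by_find = soup.find_all("a")
all_by_select = soup.select("a")

# The same nested lookup written as chained find() calls and as one selector.
chained = soup.find("div", id="foo").find("div", class_="fee").find("span").find("a")
selected = soup.select_one("div#foo > div.fee > span > a")

print(first_by_find["href"], first_by_select["href"])
print([a["href"] for a in all_by_select])
print(chained["href"], selected["href"])
```

Note that a chained find() raises AttributeError as soon as one step returns None, whereas select_one() simply returns None for the whole path, which may matter when the page layout is not guaranteed.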

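As for performance, we can run some tests. I adapted the code from another answer and ran it over 800+ HTML files. It is not exhaustive, but it should give a clue about both the readability of some of the options and their performance.

The modified functions are: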

from bs4 import BeautifulSoup
from glob import iglob


def parse_find(soup):
    # Same lookups as parse_select, written with find / find_all.
    author = soup.find("h4", class_="h12 talk-link__speaker").text
    title = soup.find("h4", class_="h9 m5").text
    date = soup.find("span", class_="meta__val").text.strip()
    soup.find("footer", class_="footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text.split(":")
    soup.find_all("span", class_="talk-transcript__fragment")


def parse_select(soup):
    # Same lookups as parse_find, written with CSS selectors.
    author = soup.select_one("h4.h12.talk-link__speaker").text
    title = soup.select_one("h4.h9.m5").text
    date = soup.select_one("span.meta__val").text.strip()
    soup.select_one("footer.footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text
    soup.select("span.talk-transcript__fragment")


def test(patt, func):
    # Parse every file matching the glob pattern and run func on it.
    for html in iglob(patt):
        with open(html) as f:
            func(BeautifulSoup(f, "lxml"))

Now for the timings:

In [7]: from testing import test, parse_find, parse_select

In [8]: timeit test("./talks/*.html",parse_find)
1 loops, best of 3: 51.9 s per loop

In [9]: timeit test("./talks/*.html",parse_select)
1 loops, best of 3: 32.7 s per loop

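Like I said, this is not exhaustive, but I think we can safely say that the CSS selectors are definitely efficient: in this run they finished in 32.7 s against 51.9 s for find.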
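
For readers without the original files, a rough comparison can be set up in memory. This is only a sketch under several assumptions: the page is generated on the fly, html.parser stands in for lxml, and only the lookups are timed (the test above also timed parsing), so the numbers will not match those reported.

```python
import timeit

from bs4 import BeautifulSoup

# Hypothetical in-memory page; the original test used 800+ files on disk.
ROWS = "".join(
    f'<span class="talk-transcript__fragment">line {i}</span>' for i in range(200)
)
HTML = f'<html><body><h4 class="h9 m5">Title</h4>{ROWS}</body></html>'
soup = BeautifulSoup(HTML, "html.parser")


def with_find(doc):
    title = doc.find("h4", class_="h9 m5").text
    fragments = doc.find_all("span", class_="talk-transcript__fragment")
    return title, len(fragments)


def with_select(doc):
    title = doc.select_one("h4.h9.m5").text
    fragments = doc.select("span.talk-transcript__fragment")
    return title, len(fragments)


# Both variants must agree before their timings are worth comparing.
assert with_find(soup) == with_select(soup)

for func in (with_find, with_select):
    seconds = timeit.timeit(lambda: func(soup), number=50)
    print(f"{func.__name__}: {seconds:.3f} s for 50 runs")
```

Results are likely to vary with the Beautiful Soup version, the parser and the shape of the document, so it is worth measuring on your own pages before drawing conclusions.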

The difference between $.ajax(), $.get() and $.load()

What is the difference between $.ajax(), $.get() and $.load()?

Under which conditions is each one the better choice?

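The difference between .string and .text in BeautifulSoup

I noticed something odd while working with BeautifulSoup and could not find any documentation to back it up, so I wanted to ask here.

Say we have tags like these, already parsed with BS: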

<td>Some Table Data</td>
<td></td>

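The officially documented way to extract the data is soup.string. However, that returned a NoneType for the second <td> tag. So I tried soup.text (because why not?), and it returned an empty string, exactly as I wanted.

However, I could not find any reference to this in the documentation, and I am worried that something is being lost. Can anyone tell me whether this is acceptable to use, or whether it will cause problems later?

By the way, I am scraping table data from a web page and intend to create a CSV from it, so I really do need empty strings rather than NoneTypes.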
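
The behaviour described in the question can be reproduced with a short sketch. The two cells are the ones from the question; wrapping them in a table row and using html.parser are assumptions made so the snippet is self-contained.

```python
from bs4 import BeautifulSoup

HTML = "<table><tr><td>Some Table Data</td><td></td></tr></table>"
soup = BeautifulSoup(HTML, "html.parser")
cells = soup.find_all("td")

# .string is None unless the tag has exactly one child, so an empty <td> gives None.
strings = [cell.string for cell in cells]

# .text (the same as get_text()) joins all descendant strings, so an empty tag gives "".
texts = [cell.text for cell in cells]

print(strings)
print(texts)
```

For a CSV export, .text or get_text() therefore seems the more convenient choice, since every cell yields a string.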

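The difference between Thread.currentThread().getId() and Process.myTid() on Android

As the documentation says:

myTid() - Returns the identifier of the calling thread, which can be used with setThreadPriority(int, int).

But I found that Thread.currentThread().getId() is not equal to Process.myTid(). So I guess the former is the JVM's version of the thread ID and the latter is the Linux version of the thread ID.

Am I right? If so, why does Java create its own thread ID instead of using the Linux thread ID?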

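Update:

After further research and reading the Android source code, I have a new understanding:

Process.myTid() is a platform (OS) dependent operation, and so is Process.setThreadPriority(); in Android's native-level source both of them make system calls to do their work.

But Java is a platform-independent language, and it does not require the host OS to have a "tid" or a getTid() method, because another OS could identify its threads by, say, a string key (just as an example :)). So Java identifies threads in its own way, assigning each one a unique thread ID within the Java scope. If Java offered a static API like Process.setThreadPriority(), the Java-scope ID would surely be one of its parameters, but there is no need for that, because we can call the thread object's method setPriority(int priority) instead.

Any comments are welcome.

Update:

The answers are all correct, but fadden's comment made it clearer to me. Thank you all.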

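BeautifulSoup4's find_all() and select(): a simple scraping walkthrough

Combining regular expressions with BeautifulSoup makes scraping web pages far less work.

Let's practise on the Baidu Tieba home page: https://tieba.baidu.com/index.html

1. find_all(): searches all descendants of the current node (children, grandchildren, and so on).

The example below uses find_all() to match the links in the Tieba category module whose href contains "娱乐" (entertainment).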

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

f = urlopen('https://tieba.baidu.com/index.html').read()
soup = BeautifulSoup(f, 'html.parser')

# Keep only <a> tags whose href matches the regular expression.
for link in soup.find_all('a', href=re.compile('娱乐')):
    # An f-string avoids a TypeError when a link has no title attribute.
    print(f"{link.get('title')}:{link.get('href')}")

Result:
娱乐明星:/f/index/forumpark?pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
港台东南亚明星:/f/index/forumpark?cn=港台东南亚明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
内地明星:/f/index/forumpark?cn=内地明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
韩国明星:/f/index/forumpark?cn=韩国明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
日本明星:/f/index/forumpark?cn=日本明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
时尚人物:/f/index/forumpark?cn=时尚人物&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
欧美明星:/f/index/forumpark?cn=欧美明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
主持人:/f/index/forumpark?cn=主持人&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
其他娱乐明星:/f/index/forumpark?cn=其他娱乐明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1

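soup.find_all('a', href=re.compile('娱乐')) is equivalent to soup('a', href=re.compile('娱乐')).
Calling the soup object directly is shorthand for find_all(), so the example above can be written that way too.

** If no built-in filter fits, you can define a function that takes a single argument. When a function is used to filter on a tag attribute, its argument is the value of that attribute, not the tag itself.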

import re

def abc(href):
    # href is the attribute value; the "href and" guard skips tags without one.
    return href and not re.compile('娱乐明星').search(href)

print(soup.find_all(href=abc))
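
Because the live page may change, the same function filter can be tried on a fixed snippet. This is a minimal sketch: the markup and the name not_entertainment are invented for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on the category links above.
HTML = """
<a href="/f/index/forumpark?pcn=娱乐明星" title="娱乐明星">娱乐明星</a>
<a href="/f/index/forumpark?pcn=电视节目" title="爱综艺">爱综艺</a>
<a title="no link">no href here</a>
"""
soup = BeautifulSoup(HTML, "html.parser")

PATTERN = re.compile("娱乐明星")


def not_entertainment(href):
    # Receives the href value; tags without an href are rejected by the None check.
    return href is not None and not PATTERN.search(href)


matches = soup.find_all(href=not_entertainment)
print([tag["title"] for tag in matches])
```

Compiling the pattern once outside the function is a small design choice that avoids rebuilding it for every tag in the document.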

Parameters of find_all(): find_all(name, attrs, recursive, string, limit, **kwargs)

<a href="/f/index/forumpark?pcn=电视节目&amp;pci=0&amp;ct=1&amp;rn=20&amp;pn=1" rel="noopener" target="_blank" title="爱综艺">爱综艺</a>

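find_all('a'): finds all <a> tags

find_all(title='爱综艺'): finds all tags whose title attribute equals "爱综艺"

find(string=re.compile('贴吧')): returns the first string in the document that contains "贴吧"

find_all(href=re.compile('娱乐明星'), title='娱乐明星'): several keyword arguments filter on several attributes of a tag at once

find_all(attrs={"title": "娱乐明星"}): attrs takes the attributes as a dictionary, which is the way to match attribute names that cannot be written as keyword arguments (for example data-* attributes)

find_all(href=re.compile('娱乐明星'), limit=3): the limit parameter caps the number of results returned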
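
The calls listed above can be checked against a small fixed document. In this sketch only the first link is based on the sample tag shown earlier; the rest of the markup, including the data-kind attribute, is hypothetical.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup; only the first <a> is based on the sample tag above.
HTML = """
<div>
  <a href="/f/index/forumpark?pcn=电视节目&amp;pci=0" title="爱综艺">爱综艺</a>
  <a href="/f/index/forumpark?pcn=娱乐明星&amp;pci=0" title="娱乐明星">娱乐明星</a>
  <a href="/f/index/forumpark?cn=内地明星&amp;pcn=娱乐明星" title="内地明星">内地明星</a>
  <a href="/f/index/forumpark?cn=韩国明星&amp;pcn=娱乐明星" data-kind="star" title="韩国明星">韩国明星</a>
  <p>欢迎来到贴吧</p>
</div>
"""
soup = BeautifulSoup(HTML, "html.parser")

all_links = soup.find_all("a")                       # every <a> tag
by_title = soup.find_all(title="爱综艺")               # exact attribute value
first_text = soup.find(string=re.compile("贴吧"))      # a string, not a tag
both = soup.find_all(href=re.compile("娱乐明星"), title="娱乐明星")
# data-* names are not valid Python keyword arguments, so pass them through attrs.
by_attrs = soup.find_all(attrs={"data-kind": "star"})
limited = soup.find_all(href=re.compile("娱乐明星"), limit=2)

print(len(all_links), len(by_title), first_text)
print([tag["title"] for tag in both])
print([tag["title"] for tag in by_attrs])
print([tag["title"] for tag in limited])
```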

2. Finding tags with CSS selectors: loop over the results of select() to get the content you need:

** Find <a> tags in the page whose href starts with "/f/index":

for link2 in soup.select('a[href^="/f/index"]'):
    print(f"{link2.get('title')}:{link2.get('href')}")


** Find <a> tags in the page whose href ends with "&pn=1":

for link2 in soup.select('a[href$="&pn=1"]'):
    print(f"{link2.get('title')}:{link2.get('href')}")


** Find <a> tags in the page whose href contains "娱乐":

for link3 in soup.select('a[href*="娱乐"]'):
    print(f"{link3.get('title')}:{link3.get('href')}")

soup.select('meta'): finds tags by name

soup.select('html meta link'): finds tags by descending through nested tag names

soup.select('meta > link:nth-of-type(3)'): finds the third link child of a meta tag

soup.select('div > #head'): finds the direct child of a div tag that has id=head

soup.select('div > a'): finds all a tags that are direct children of a div tag

soup.select("#searchtb ~ .authortb"): finds the siblings with class=authortb that come after the tag with id=searchtb

soup.select("[class~=m_pic]") and soup.select(".m_pic"): find tags whose class list includes m_pic

soup.select(".tag-name,.post_author"): matches several CSS selectors in one query
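
The selector forms above can be exercised without a network request. The following is only a sketch: the markup is invented so that each selector has something to match, and it assumes Beautiful Soup 4.7.0 or later with html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented so each selector has something to match.
HTML = """
<div>
  <p id="head">heading</p>
  <a href="/f/index/a?pn=1" title="first">first</a>
  <a href="/other/b?pn=2" title="second">second</a>
  <span id="searchtb">search</span>
  <span class="authortb">author</span>
  <img class="m_pic large">
</div>
"""
soup = BeautifulSoup(HTML, "html.parser")

starts = soup.select('a[href^="/f/index"]')      # href starts with /f/index
ends = soup.select('a[href$="pn=2"]')            # href ends with pn=2
contains = soup.select('a[href*="other"]')       # href contains "other"
child = soup.select("div > #head")               # direct child with id="head"
siblings = soup.select("#searchtb ~ .authortb")  # later siblings with class authortb
by_class = soup.select(".m_pic")                 # class list includes m_pic
by_class_attr = soup.select("[class~=m_pic]")    # same match, attribute syntax
grouped = soup.select("#head, .authortb")        # tags matching either selector

print([a["title"] for a in starts], [a["title"] for a in ends])
print([tag.name for tag in grouped])
```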
