Python: which is better for scraping, Selenium alone or BeautifulSoup with Selenium?

25-03-29 5

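This article covers the question of which is better for scraping, Selenium alone or BeautifulSoup with Selenium. It also collects related material: waiting for tweets to load before scraping a site with BeautifulSoup and Selenium; clicking an svg path to reach the next page and getting data from that page; whether ISelenium / DefaultSelenium can be used in C# without installing Selenium Server; and a day 03 tutorial on the principles and usage of selenium and Beautifulsoup4.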

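Contents:

Python: which is better for scraping, Selenium alone or BeautifulSoup with Selenium?
BeautifulSoup Python Selenium - waiting for tweets to load before scraping a website
Beautifulsoup and selenium: clicking an svg path to go to the next page and getting data from that page
c# – Is it possible to use ISelenium / DefaultSelenium without installing Selenium Server?
day 03: principles and usage of selenium and Beautifulsoup4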

Python: which is better for scraping, Selenium alone or BeautifulSoup with Selenium?

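This question applies to Python 3.6.3, bs4 and Selenium 3.8 on Win10.

I am trying to scrape pages with dynamic content. What I want to scrape is numbers and text (from http://www.oddsportal.com, for example). As I understand it, requests + BeautifulSoup will not do the job, because the dynamic content is not present in the HTML that requests receives. So I have to use another tool, such as Selenium WebDriver.

Given that I will be using Selenium WebDriver anyway, would you recommend dropping BeautifulSoup and sticking to Selenium WebDriver functions such as

elem = driver.find_element_by_name("q")

Or is using Selenium + BeautifulSoup considered better practice?

Which of the two routes do you think would give me more convenient functions to work with?

Thank you.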

Answer 1

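Beautifulsoup

Beautifulsoup is a powerful tool for web scraping. It is a parser and does not download pages itself, so it is usually paired with an HTTP client such as the urllib.request Python library. That combination is very effective for extracting data from static pages.

Selenium

Selenium is currently one of the most widely accepted and effective web automation tools. Selenium supports interacting with dynamic pages, content and elements.

Conclusion

To build a robust and efficient framework for scraping pages with dynamic content, integrate both Selenium and Beautifulsoup: use Selenium to navigate and interact with the dynamic elements, and Beautifulsoup to extract the content efficiently.

An example

The answer points to an example of using Selenium and Beautifulsoup together for scraping.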
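The example itself did not survive in this text. As a stand-in, here is a minimal sketch of the division of labour the answer describes: Selenium renders the page, Beautifulsoup parses the resulting HTML. The function names, the Chrome driver, and the wait_css parameter are assumptions made for illustration and are not taken from the answer. The By / WebDriverWait / expected_conditions calls are available in both Selenium 3 and Selenium 4.

```python
from urllib.parse import urljoin


def clean_text(text):
    """Collapse runs of whitespace, which scraped text is usually full of."""
    return " ".join(text.split())


def fetch_rendered_html(url, wait_css, timeout=15):
    """Open url in Chrome, wait for wait_css to appear, return the rendered HTML."""
    # Imported lazily so the parsing helpers can be used without a browser driver.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: a Chrome driver is available
    try:
        driver.get(url)
        # Block until JavaScript has inserted at least one matching element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_links(html, base_url):
    """Parse HTML with Beautifulsoup and return (text, absolute_url) pairs."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [
        (clean_text(a.get_text()), urljoin(base_url, a["href"]))
        for a in soup.find_all("a", href=True)
    ]
```

A call would look like extract_links(fetch_rendered_html(url, "a"), url), where "a" is only a placeholder selector; in practice you would wait for an element that appears once the dynamic content you need has loaded. Keeping the browser step and the parsing step separate also means the parsing can be tested against saved HTML without starting a browser.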

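BeautifulSoup Python Selenium - waiting for tweets to load before scraping a website

I am trying to scrape a website to extract tweet links (DW in this case), but I cannot get any data because the tweets do not load immediately, so the request finishes before they have had time to load. I tried a request timeout as well as time.sleep(), with no luck. After those two options I tried using Selenium to load the page locally and give it time to load, but I cannot get it to work. I believe this can be done with Selenium. Here is what I have tried so far: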

        links = 'https://www.dw.com/en/vaccines-appear-effective-against-india-covid-variant/a-57344037'
        driver.get(links)
        delay = 30  # seconds
        try:
            WebDriverWait(driver, delay).until(EC.visibility_of_all_elements_located((By.ID, "twitter-widget-0")))
        except:
            pass
        tweetSource = driver.page_source
        tweetSoup = BeautifulSoup(tweetSource, features='html.parser')
        linkTweets = tweetSoup.find_all('a')
        for linkTweet in linkTweets:
            try:
                tweetURL = linkTweet.attrs['href']
            except KeyError:  # anchor without an href: skip it
                continue
            if "twitter.com" in tweetURL and "status" in tweetURL:
                # Run getTweetID function
                tweetID = getTweetID(tweetURL)
                newdata = [tweetID, date_tag, "DW", links, title_tag, "News", ""]
                # Write to dataframe
                df.loc[len(df)] = newdata
                print("working on tweetID: " + str(tweetID))

It would be great if someone could get Selenium to find the tweet!

Solution

The tweet is inside an iframe, so you first need to switch to that iframe:

iframe = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "twitter-widget-0"))
)
driver.switch_to.frame(iframe)
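Once the driver is inside the frame, driver.page_source returns the frame's HTML, so the Beautifulsoup loop from the question can run on it. The following is a sketch under assumptions rather than the answerer's code: the function names are made up for illustration, and it assumes the widget renders at least one anchor tag inside the frame. It uses expected_conditions.frame_to_be_available_and_switch_to_it, which waits for the frame and switches to it in one step.

```python
def is_tweet_url(url):
    """Same test as the question's loop: a twitter.com link that points at a status."""
    return bool(url) and "twitter.com" in url and "status" in url


def tweet_links_in_frame(driver, frame_id="twitter-widget-0", timeout=10):
    """Return the tweet URLs found inside the iframe with the given id."""
    # Imported lazily so is_tweet_url() can be used without Selenium installed.
    from bs4 import BeautifulSoup
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # Wait for the iframe to exist and switch into it.
    wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, frame_id)))
    try:
        # Inside the frame, wait until the widget has rendered at least one link.
        wait.until(EC.presence_of_element_located((By.TAG_NAME, "a")))
        soup = BeautifulSoup(driver.page_source, "html.parser")
        hrefs = [a.get("href") for a in soup.find_all("a")]
    finally:
        # Always return to the top-level document, even if a wait timed out.
        driver.switch_to.default_content()
    return [href for href in hrefs if is_tweet_url(href)]
```

Switching back with driver.switch_to.default_content() matters if the script goes on to read anything from the surrounding article page, because element lookups only see the frame the driver is currently in.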

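Beautifulsoup and selenium: clicking an svg path to go to the next page and getting data from that page

I am working on a project where a website has a table filled with data, and the table is 7 pages long. This is the table on the site: https://nonfungible.com/market/history. You go to the next page through an svg path. I have to get the data from all 7 pages, and I do not know how to click this svg path. If you know how to click the path, please tell me, even though the svg has no aria-label or class.

Here is a screenshot of the source code.

I tried many different things, including:

    driver.find_element_by_xpath('//div[@id="icon-chevron-right"]/*[name()="svg"]/*[name()="path"]').click()

This is the error I get: raise exception_class(message, screen, stacktrace) selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//div[@id="icon-chevron-right"]/*[name()="svg"]/*[name()="path"]"} (Session info: chrome=92.0.4515.107)

Thank you for your help. Please help me solve this problem.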

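Solution

You can try using the browser's developer tools to look at the requests made to the site, and then make the same requests from your script. That removes the need for selenium and should give you a more scalable bot.

This is a slightly different approach from driving the GUI, but take a look at the following. The data available this way is far richer than what the front end displays.

It looks like the /market/history page ships with the JSON data embedded in it (it is not a separate call that you can spot in the dev tools). You can still get at it if you:

Fetch the page with the python requests library
Parse the html and find the json data object with @id="__NEXT_DATA__"
Take the part of the json that contains the table data
Filter the objects to drop some stray entries (keeping those where name != None)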

from lxml import html
import requests
import json

url = "https://nonfungible.com/market/history"

# Get the page and parse it
response = requests.get(url)
page = html.fromstring(response.content)

# Get the embedded data and convert it from JSON
datastring = page.xpath('//script[@id="__NEXT_DATA__"]/text()')
data = json.loads(datastring[0])
# print(json.dumps(data, indent=4))  # this prints everything

# Get the relevant part of the JSON (there is a lot of other material in there; it took some effort to find this)
tabledata = data['props']['pageProps']['currentTotals']
# Filter out the entries that have no name
all_items = [item for item in tabledata if item['name'] is not None]

# Print each item, which corresponds to one row in the table
for item in all_items:
    print(item['name'])
    print(item['totals']['alltime']['usd'])
    print(json.dumps(item, indent=4))

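What you need to do from here is pull what you want out of the json.

To get you started, the first 2 print outputs in the loop are:

meebits

157251919.08

which match the item shown on the website.

The last print is everything for that item. It lets you see the structure and helps you get at the data. It looks like this: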

{
    "name": "meebits","totals": {
        "alltime": {
            "count": 15622,"traders": 5023,"usd": 157251919.08,"average": 10066.06,"transfer_count": 29826,"transfer_unique_assets": 19981,"asset_unique_owners": 4812,"asset_usd": 95541331.68,"asset_average": 10566.39
        },"oneday": {
            "count": 0,"traders": 0,"usd": 0,"average": 0,"transfer_count": 0,"transfer_unique_assets": 0,"asset_unique_owners": 0,"asset_usd": 0,"asset_average": 0
        },"twodayago": {
            "count": 0,"average": 0
        },"sevenday": {
            "count": 144,"traders": 165,"usd": 703913.21,"average": 4888.29,"transfer_count": 265,"transfer_unique_assets": 204,"asset_unique_owners": 125,"asset_usd": 611620.92,"asset_average": 5412.57
        },"thirtyday": {
            "count": 1663,"traders": 1167,"usd": 12662841.8,"average": 7614.46,"transfer_count": 2551,"transfer_unique_assets": 1704,"asset_unique_owners": 781,"asset_usd": 9908945.2,"asset_average": 9107.49
        }
    }
}
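Given that structure, pulling out what you want can be sketched as flattening each item into one table row. This is a minimal illustration using only the standard library. The names item_to_row, rows_to_csv and FIELDS are hypothetical, and sample_item is a trimmed copy of the output above. Because some periods (twodayago here) lack some keys, the sketch uses .get() so that missing values become None instead of raising KeyError.

```python
import csv
import io

# Columns to keep for each row; the last four are keys seen in the sample output.
FIELDS = ["name", "period", "count", "traders", "usd", "average"]


def item_to_row(item, period="alltime"):
    """Flatten one item's totals for a single period into a flat dict."""
    totals = item["totals"].get(period, {})
    row = {"name": item["name"], "period": period}
    for key in FIELDS[2:]:
        row[key] = totals.get(key)  # None when the period lacks this key
    return row


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Trimmed copy of the item printed above.
sample_item = {
    "name": "meebits",
    "totals": {
        "alltime": {"count": 15622, "traders": 5023, "usd": 157251919.08, "average": 10066.06},
        "twodayago": {"count": 0, "average": 0},
    },
}

print(rows_to_csv([item_to_row(sample_item), item_to_row(sample_item, "twodayago")]))
```

In the script above, the same function could be applied to every entry of the filtered list, once per period of interest.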

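c# – Is it possible to use ISelenium / DefaultSelenium without installing Selenium Server?

I used to drive IE for testing through IWebDriver, but the methods that IWebDriver and IWebElement support are very limited. I found ISelenium / DefaultSelenium, which belong to the Selenium namespace, very useful. How can I use them to control IE without installing Selenium Server?

This is the constructor of DefaultSelenium:

ISelenium sele = new DefaultSelenium(serveraddr, serverport, browser, url2test);
sele.Start();
sele.Open();
...

It seems that I have to install Selenium Server before creating the ISelenium object.

My situation is that I am building an .exe application with C# and Selenium that can run on different PCs, and it is not possible to install Selenium Server on all of them (you never know which PC will run the application next).

Does anyone know how to use ISelenium / DefaultSelenium without installing the server?
Thanks!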

Solution

There are a few solutions in Java that do not use the RC Server:

1) For launching the browser through Selenium:

DesiredCapabilities capabilities = new DesiredCapabilities();
capabilities.setBrowserName("safari");
CommandExecutor executor = new SeleneseCommandExecutor(new URL("http://localhost:4444/"), new URL("http://www.google.com/"), capabilities);
WebDriver driver = new RemoteWebDriver(executor, capabilities);

2) For Selenium commands:

// You may use any WebDriver implementation. Firefox is used here as an example
WebDriver driver = new FirefoxDriver();

// A "base url", used by selenium to resolve relative URLs
String baseUrl = "http://www.google.com";

// Create the Selenium implementation
Selenium selenium = new WebDriverBackedSelenium(driver,baseUrl);

// Perform actions with selenium
selenium.open("http://www.google.com");
selenium.type("name=q","cheese");
selenium.click("name=btnG");

// Get the underlying WebDriver implementation back. This will refer to the
// same WebDriver instance as the "driver" variable above.
WebDriver driverInstance = ((WebDriverBackedSelenium) selenium).getWrappedDriver();

//Finally,close the browser. Call stop on the WebDriverBackedSelenium instance
//instead of calling driver.quit(). Otherwise,the JVM will continue running after
//the browser has been closed.
selenium.stop();

This is described here: http://seleniumhq.org/docs/03_webdriver.html

Search for the equivalent in C#. There is no other way to achieve this.

day 03: principles and usage of selenium and Beautifulsoup4

# Scrape product data from JD.com
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys


def get_good(driver):
    # Walk through the result pages until there is no usable "next page" link
    while True:
        time.sleep(5)
        # Scroll down 5000px so that lazily loaded products are rendered
        js_code = '''
            window.scrollTo(0,5000)
        '''
        driver.execute_script(js_code)
        # Wait 5 seconds for the product data to load
        time.sleep(5)
        good_list = driver.find_elements(By.CLASS_NAME, 'gl-item')
        with open('jd.txt', 'a', encoding='utf-8') as f:
            for good in good_list:
                # Product name
                good_name = good.find_element(By.CSS_SELECTOR, '.p-name em').text
                # Product link
                good_url = good.find_element(By.CSS_SELECTOR, '.p-name a').get_attribute('href')
                # Product price
                good_price = good.find_element(By.CLASS_NAME, 'p-price').text
                # Product reviews
                good_commit = good.find_element(By.CLASS_NAME, 'p-commit').text
                good_content = f'''
                   Product name: {good_name}
                   Product link: {good_url}
                   Product price: {good_price}
                   Product reviews: {good_commit}
                   \n
                   '''
                print(good_content)
                f.write(good_content)
        print('Product information written successfully!')
        # Find the "next page" link and click it; stop when there is none
        try:
            next_tag = driver.find_element(By.CLASS_NAME, 'pn-next')
        except NoSuchElementException:
            break
        # Assumption: on the last page the link carries a "disabled" class
        if 'disabled' in (next_tag.get_attribute('class') or ''):
            break
        next_tag.click()


if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        driver.implicitly_wait(10)
        # Send a request to JD.com
        driver.get('http://www.jd.com/')
        # Type the search term 墨菲定律 ("Murphy's Law") into the search box and press Enter
        input_tag = driver.find_element(By.ID, 'key')
        input_tag.send_keys('墨菲定律')
        input_tag.send_keys(Keys.ENTER)

        # Call the product scraping function
        get_good(driver)
    finally:
        # Close the browser exactly once, here
        driver.quit()

Principles and usage of Beautifulsoup4

html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p><b>$37</b></p>

<p id="p">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p>...</p>
'''
from bs4 import BeautifulSoup
# Parser that ships with Python
# soup = BeautifulSoup(html_doc, 'html.parser')

# Use bs4 to build a soup object (the lxml parser must be installed)
soup = BeautifulSoup(html_doc, 'lxml')
# The bs4 object
# print(soup)
# Its type
# print(type(soup))
# Pretty-printing
# html = soup.prettify()
# print(html)


# 1. Select a tag directly (returns an object)   *****
print(soup.a)  # get the first a tag
print(soup.p)  # get the first p tag
print(type(soup.a))  # <class 'bs4.element.Tag'>

# 2. Get the name of a tag
print(soup.a.name)  # get the name of the a tag

# 3. Get the attributes of a tag     *****
print(soup.a.attrs)  # get all attributes of the a tag
print(soup.a.attrs['href'])  # get the href attribute of the a tag

# 4. Get the text content of a tag   *****
print(soup.p.text)      #  $37
# 5. Nested tag selection
print(soup.p.b)  # get the b tag inside the first p tag
print(soup.p.b.text)  # print the text inside the b tag

# 6. Children and descendants
# Get the child nodes
print(soup.p.children)  # all child nodes of the first p tag, returned as an iterator
print(list(soup.p.children))  # convert it to a list
# 7. Parent and ancestors
print(soup.b.parent)
print(soup.b.parents)
print(list(soup.b.parents))
