
BeautifulSoup: getting the content of a specific table (and reading attribute values)



This article covers how to get the content of a specific table with BeautifulSoup, along with how to read attribute values. It also collects four related questions: Beautiful Soup 4 find_all not finding links that Beautiful Soup 3 finds; ImportError: No module named BeautifulSoup even though BeautifulSoup is installed; extracting the content inside a specific tag with BeautifulSoup; and getting the href of the a tag inside an h2.

Contents:

BeautifulSoup: getting the content of a specific table

My local airport shamefully blocks users without IE, and it looks terrible besides. I'd like to write a Python script that fetches the arrivals and departures pages every few minutes and displays them in a more readable way.

My tools of choice are mechanize, to make the site believe I'm using IE, and BeautifulSoup, to parse the page and pull out the table of flight data.

Honestly, I got lost in the BeautifulSoup documentation and can't work out how to extract the table (whose title I know) from the whole document, nor how to get a list of rows from that table.

Any ideas?

Answer 1


This isn't the exact code you need, just a demonstration of how to use BeautifulSoup: it finds the table whose id is "Table1" and collects all of its tr elements.

html = urllib2.urlopen(url).read()
bs = BeautifulSoup(html)
table = bs.find(lambda tag: tag.name == 'table' and tag.has_attr('id') and tag['id'] == "Table1")
rows = table.findAll(lambda tag: tag.name == 'tr')
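For reference, a minimal bs4 (Python 3) version of the same idea, shown here against a hypothetical stand-in snippet since the airport page isn't available:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the airport page's markup.
html = """
<table id="Table1">
  <tr><th>Flight</th><th>Time</th></tr>
  <tr><td>LH123</td><td>12:30</td></tr>
  <tr><td>BA456</td><td>14:05</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="Table1")   # the id= keyword replaces the lambda
rows = table.find_all("tr")               # find_all is the bs4 spelling of findAll
for row in rows[1:]:                      # skip the header row
    print([td.get_text(strip=True) for td in row.find_all("td")])
```

In bs4, `find("table", id="Table1")` does the same filtering as the lambda above, with less ceremony.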

Beautiful Soup 4 find_all doesn't find links that Beautiful Soup 3 finds

I've noticed a really annoying bug: BeautifulSoup 4 (package: bs4) often finds fewer tags than the previous version (package: BeautifulSoup) did.

Here's a reproducible instance of the problem:

import requests
import bs4
import BeautifulSoup

r = requests.get('http://wordpress.org/download/release-archive/')
s4 = bs4.BeautifulSoup(r.text)
s3 = BeautifulSoup.BeautifulSoup(r.text)

print 'With BeautifulSoup 4 : {}'.format(len(s4.findAll('a')))
print 'With BeautifulSoup 3 : {}'.format(len(s3.findAll('a')))

Output:

With BeautifulSoup 4 : 557
With BeautifulSoup 3 : 1701

As you can see, the difference is not small.

In case anyone is wondering, here are the exact versions of the modules:

In [20]: bs4.__version__
Out[20]: '4.2.1'

In [21]: BeautifulSoup.__version__
Out[21]: '3.2.1'
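The usual cause of this kind of discrepancy is not find_all itself but the parser: BS3 shipped its own lenient SGML parser, while bs4 delegates to whichever parser library happens to be installed, and stricter parsers give up earlier on malformed markup. Naming the parser explicitly makes the behaviour predictable, and if html5lib is installed, `bs4.BeautifulSoup(r.text, "html5lib")` is the most lenient choice and typically recovers links that BS3 found. A sketch, with a toy snippet standing in for the real page:

```python
import bs4

html = "<p><a href='/one'>one</a><a href='/two'>two</a></p>"

# Name the parser explicitly; otherwise bs4 silently picks whichever
# parser library is installed, and results can differ between machines.
soup = bs4.BeautifulSoup(html, "html.parser")
print(len(soup.find_all("a")))
```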

BeautifulSoup is installed but I still get ImportError: No module named BeautifulSoup

How do I fix this: BeautifulSoup is installed but I still get ImportError: No module named BeautifulSoup?

I installed BeautifulSoup successfully, and it's the latest version, but I still get "ImportError: No module named BeautifulSoup" when I run my code. Help needed!

Solution

Try:

from bs4 import BeautifulSoup
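The underlying cause: BS3 and BS4 install under different module names, so installing beautifulsoup4 never creates a module called BeautifulSoup. A quick check:

```python
# pip install BeautifulSoup   -> import BeautifulSoup            (BS3, Python 2 only)
# pip install beautifulsoup4  -> from bs4 import BeautifulSoup   (BS4)
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.get_text())
```

If this import works, the BS4 install is fine; the old `import BeautifulSoup` spelling will only ever work with the legacy BS3 package.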

BeautifulSoup: extracting the content inside a tag

I'm using BeautifulSoup 4, and it really is nicer to work with than re: you don't have to write a regular expression for every little thing, which is convenient.

For example, I want to grab the IPs and ports from a high-anonymity proxy listing page. (The original post linked the page and included a screenshot of its markup, omitted here.) The IP and port sit in td cells inside tr rows, and each tr carries a class attribute whose value takes one of two forms, which we can match with a regular expression. To pull the content out of a single tag, use BeautifulSoup's .string attribute. Note: .string only returns text when the tag has a single child node, as the td tags here do. The code follows.

import requests
from bs4 import BeautifulSoup
import re
import os.path

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)'
headers = {'User-Agent': user_agent}

session = requests.session()
page = session.get("http://www.xicidaili.com/nn/1",headers=headers)
soup = BeautifulSoup(page.text, 'lxml')  # if lxml isn't installed, drop it to use the default parser

# match the tr tags that carry a class attribute
taglist = soup.find_all('tr', attrs={'class': re.compile("(odd)|()")})
for trtag in taglist:
    tdlist = trtag.find_all('td')  # within each tr, find all the td tags
    print tdlist[1].string   # the IP
    print tdlist[2].string   # the port

The output:

124.88.67.24
80
61.224.239.71
8080
113.3.78.124
8118
61.227.228.141
8080
222.130.171.58
8118
123.57.190.51
7777
183.61.71.112
8888
120.25.171.183
8080
1.164.146.91
8080
101.201.235.141
8000
121.193.143.249
80
118.180.15.152
8102
124.88.67.19
80
...
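The .string caveat above (it only yields text when the tag has exactly one child) can be seen in a tiny self-contained example:

```python
from bs4 import BeautifulSoup

row = BeautifulSoup(
    "<tr><td>1.2.3.4</td><td>80</td><td>ip: <b>80</b></td></tr>",
    "html.parser")
tds = row.find_all("td")

print(tds[0].string)     # single text child -> .string works
print(tds[2].string)     # mixed children -> .string is None
print(tds[2].get_text()) # get_text() concatenates all descendants instead
```

When a tag holds mixed content, fall back to get_text() rather than .string.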

BeautifulSoup: getting the href of the a tag inside an h2

How do I get the href of the a tag inside an h2?

I'm trying to get the link from the "a" tag inside the h2 tag, but the problem I'm running into is that there are two such links, each in a separate "parent" tag.

I'm looking at this link: https://emerging-europe.com/tag/poland/

Here's the code I have so far.

from bs4 import BeautifulSoup
import requests

url = 'https://emerging-europe.com/tag/poland/'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'lxml')

for item in soup.select('.col-lg-6'):
    try:
        headline = item.find('h2', {'class': 'entry-title'}).get_text()
        link = item.find('h2', {'class': 'entry-title'})['href']
    except:
        continue

The HTML I'm referring to is the one below.

<div>
        <div>
            <span class="meta-category"><a href="https://emerging-europe.com/category/news/">News &amp; Analysis</a></span>

            <h2><a href="https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/">Montenegro leads CEE on ILGA-Europe’s new Rainbow Map</a></h2>
            <div class="meta"><div class="meta-item herald-date"><span>May 17, 2021</span></div><div class="meta-item herald-author"><span><span><a href="https://emerging-europe.com/author/marekgrzegorczyk/">Marek Grzegorczyk</a></span></span></div></div>
        </div>

        <div>
            <p>Montenegro is Central and Eastern Europe’s best performer on the latest edition of the ILGA-Europe Rainbow Europe Map and Index, which monitors LGBTI rights across...</p>
        </div>

        <a href="https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/" title="Montenegro leads CEE on ILGA-Europe’s new Rainbow Map">Read More</a>
</div>

I want to get the "https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/" link, but what I'm getting is the "https://emerging-europe.com/category/news/" one. How do I reference the second one?

Thanks for your help!

Solution

Try this to get all the article URLs:

import requests
from bs4 import BeautifulSoup

url = "https://emerging-europe.com/tag/poland/"
css = ".entry-header .entry-title,.entry-header .entry-title a,.post-author-list .categoriesarticle .title a"

soup = BeautifulSoup(requests.get(url).text,"lxml").select(css)
article_links = [a.find("a")["href"] for a in soup if a.find("a") is not None]
print("\n".join(article_links))

Output:

https://emerging-europe.com/voices/the-zangezur-corridor-is-a-geo-economic-revolution/
https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/
https://emerging-europe.com/business/made-in-emerging-europe-vinted-up-catalyst-propergate/
https://emerging-europe.com/news/polish-government-shifts-left-on-economy/
https://emerging-europe.com/news/georgias-modern-parliament-building-faces-uncertain-future-elsewhere-in-emerging-europe/
https://emerging-europe.com/after-hours/mixed-feelings-as-libeskind-reimagines-lodz/
https://emerging-europe.com/news/hungarys-united-opposition-emerging-europe-this-week/
https://emerging-europe.com/business/small-local-market-think-international-from-the-start/
https://emerging-europe.com/business/new-esg-guidelines-can-strengthen-polish-capital-market/
https://emerging-europe.com/news/why-is-the-left-propping-up-polands-right-wing-government/
https://emerging-europe.com/news/cee-should-redouble-efforts-to-end-violence-against-women/
https://emerging-europe.com/after-hours/a-century-on-the-silesian-uprisings-remains-complicated/
https://emerging-europe.com/voices/the-zangezur-corridor-is-a-geo-economic-revolution/
https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/
https://emerging-europe.com/business/made-in-emerging-europe-vinted-up-catalyst-propergate/
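For the original question specifically, the smallest fix is to drill into the h2 first and only then take its a tag's href; the category link lives outside the h2, so it never gets picked up. A sketch against a stand-in for one result card (markup per the question):

```python
from bs4 import BeautifulSoup

# Stand-in for one result card from the page.
html = """
<div class="col-lg-6">
  <span class="meta-category"><a href="https://emerging-europe.com/category/news/">News</a></span>
  <h2 class="entry-title"><a href="https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/">Montenegro leads CEE</a></h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
links = []
for item in soup.select(".col-lg-6"):
    h2 = item.find("h2", {"class": "entry-title"})
    if h2 is None or h2.a is None:
        continue
    links.append(h2.a["href"])  # the <a> *inside* the h2, not the card's first <a>

print(links)
```

Note that the asker's `item.find('h2', ...)['href']` asked the h2 itself for an href it doesn't have; the bare `except: continue` then silently swallowed the resulting KeyError, which is why nothing useful came back.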

