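This article covers how to get the contents of a specific table with BeautifulSoup and how to read attribute contents. It also collects practical notes on four related questions: Beautiful Soup 4 find_all not finding links that Beautiful Soup 3 finds; BeautifulSoup being installed while ImportError: No module named BeautifulSoup still appears; extracting the contents of a tag with BeautifulSoup; and getting the href of the a tag inside an h2 tag.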
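Contents:
- BeautifulSoup: getting the contents of a specific table (getting attribute contents with BeautifulSoup)
- Beautiful Soup 4 find_all doesn't find links that Beautiful Soup 3 finds
- BeautifulSoup is installed but ImportError: No module named BeautifulSoup still appears
- BeautifulSoup: extracting the contents of a tag
- BeautifulSoup: href of the a tag inside an h2 tag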
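BeautifulSoup: getting the contents of a specific table (getting attribute contents with BeautifulSoup)
My local airport disgracefully blocks users who are not on IE, and its site looks awful. I want to write a Python script that fetches the contents of the "Arrivals and Departures" pages every few minutes and shows them in a more readable way.
My tools of choice are mechanize, to make the site believe I am using IE, and BeautifulSoup, to parse the page and get the flight data table.
Quite honestly, I got lost in the BeautifulSoup documentation. I can't work out how to get the table (whose title I know) out of the whole document, or how to get a list of rows from that table.
Any ideas?
Answer 1
This isn't the specific code you need, just a demonstration of how to work with BeautifulSoup. It finds the table whose id is "Table1" and gets all of its tr elements.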
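from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen
from bs4 import BeautifulSoup

html = urlopen(url).read()
bs = BeautifulSoup(html, "html.parser")
# Find the table whose id attribute is "Table1"
table = bs.find(lambda tag: tag.name == "table" and tag.has_attr("id") and tag["id"] == "Table1")
rows = table.find_all(lambda tag: tag.name == "tr")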
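For a self-contained variant of the same idea, the sketch below parses an inline HTML string and looks the table up by its id with find("table", id=...), which for this case should match the same table as the lambda in the answer. The sample markup and flight rows are made up for illustration, and the helper name table_rows is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the airport page.
SAMPLE_HTML = """
<html><body>
<table id="Other"><tr><td>ignore me</td></tr></table>
<table id="Table1">
  <tr><th>Flight</th><th>Status</th></tr>
  <tr><td>AB123</td><td>Landed</td></tr>
  <tr><td>CD456</td><td>Delayed</td></tr>
</table>
</body></html>
"""


def table_rows(html, table_id):
    """Return the rows of the table with the given id as lists of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []  # no table with that id in the document
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


rows = table_rows(SAMPLE_HTML, "Table1")
for row in rows:
    print(" | ".join(row))
```

Returning an empty list when the id is missing avoids the AttributeError that calling find_all on None would raise.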
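Beautiful Soup 4 find_all doesn't find links that Beautiful Soup 3 finds
I noticed a really annoying bug: BeautifulSoup 4 (package: bs4) often finds fewer tags than the previous version (package: BeautifulSoup).
Here is a reproducible instance of the problem: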
import requests
import bs4
import BeautifulSoup
r = requests.get('http://wordpress.org/download/release-archive/')
s4 = bs4.BeautifulSoup(r.text)
s3 = BeautifulSoup.BeautifulSoup(r.text)
print 'With BeautifulSoup 4 : {}'.format(len(s4.findAll('a')))
print 'With BeautifulSoup 3 : {}'.format(len(s3.findAll('a')))
Output:
With BeautifulSoup 4 : 557
With BeautifulSoup 3 : 1701
As you can see, the difference is not small.
In case anyone is wondering, here are the exact versions of the modules:
In [20]: bs4.__version__
Out[20]: '4.2.1'
In [21]: BeautifulSoup.__version__
Out[21]: '3.2.1'
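The post does not include an answer. One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis, is the parser: Beautiful Soup 4 hands parsing to whichever parser library is available (html.parser, lxml, html5lib), and different parsers can build different trees from malformed markup. The sketch below names each parser explicitly and counts the a tags it finds. The helper count_links and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, well-formed sample; real pages may be malformed.
SAMPLE_HTML = '<p><a href="/one">one</a> and <a href="/two">two</a></p>'


def count_links(html, parsers=("html.parser", "lxml", "html5lib")):
    """Count <a> tags once per parser, skipping parsers that are not installed."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # this parser library is not available here
        counts[name] = len(soup.find_all("a"))
    return counts


counts = count_links(SAMPLE_HTML)
for name, total in counts.items():
    print("{}: {} links".format(name, total))
```

Running count_links on the text of the page from the question would show whether the number of links depends on the parser in that environment.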
BeautifulSoup is installed but ImportError: No module named BeautifulSoup still appears
How do I fix ImportError: No module named BeautifulSoup when BeautifulSoup is already installed?
I installed BeautifulSoup successfully, and it is the latest version. But I still get "ImportError: No module named BeautifulSoup" when I run my code. I need help!
Solution
Try:
from bs4 import BeautifulSoup
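The import above works because Beautiful Soup 4 is installed under the module name bs4, while BeautifulSoup was the module name used by Beautiful Soup 3. A minimal sketch, using only the standard library on Python 3, for checking which of the two names is importable in the current environment (the helper is_importable is a hypothetical name):

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# "BeautifulSoup" is the Beautiful Soup 3 module name, "bs4" is Beautiful Soup 4.
for name in ("BeautifulSoup", "bs4"):
    status = "found" if is_importable(name) else "not found"
    print("{}: {}".format(name, status))
```

If only bs4 is reported as found, the code needs the from bs4 import BeautifulSoup form.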
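BeautifulSoup: extracting the contents of a tag
I am using BeautifulSoup 4, and it really is nicer to work with than re: there is no need to write regular expressions one by one, which is quite convenient.
For example, suppose I want to get the IPs and ports listed on a high-anonymity proxy IP page.
The IP and port sit in td tags under tr tags. Each tr has a class attribute whose value takes one of two forms, which we can match with a regular expression. To extract the content of a specific tag, use the .string attribute. Note that .string only returns the content when the tag has a single child node, as the td tags here do. Here is the code.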
import re

import requests
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)'
headers = {'User-Agent': user_agent}
session = requests.session()
page = session.get("http://www.xicidaili.com/nn/1", headers=headers)
soup = BeautifulSoup(page.text, 'lxml')  # if lxml is not installed, drop this argument and use the default parser
# Match tr tags that have a class attribute
taglist = soup.find_all('tr', attrs={'class': re.compile("(odd)|()")})
for trtag in taglist:
    tdlist = trtag.find_all('td')  # find all td tags under each tr tag
    print(tdlist[1].string)  # extract the IP value
    print(tdlist[2].string)  # extract the port value
The results are as follows:
124.88.67.24
80
61.224.239.71
8080
113.3.78.124
8118
61.227.228.141
8080
222.130.171.58
8118
123.57.190.51
7777
183.61.71.112
8888
120.25.171.183
8080
1.164.146.91
8080
101.201.235.141
8000
121.193.143.249
80
118.180.15.152
8102
124.88.67.19
80
...
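The caveat about .string can be seen on a small example. The sketch below uses a made-up table row (not taken from the proxy page) in which the first cell holds a single text node and the second mixes a tag with text. For the second cell .string is None, and get_text() is one way to get the combined text instead.

```python
from bs4 import BeautifulSoup

# Hypothetical row: first cell has one text child, second cell has a tag plus text.
ROW_HTML = "<table><tr><td>124.88.67.24</td><td><b>80</b> (http)</td></tr></table>"

soup = BeautifulSoup(ROW_HTML, "html.parser")
simple_td, mixed_td = soup.find_all("td")

simple_string = simple_td.string  # the cell's only child is a text node
mixed_string = mixed_td.string    # None, because the cell has two children
mixed_text = mixed_td.get_text()  # concatenates all text inside the cell

print(simple_string)
print(mixed_string)
print(mixed_text)
```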
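BeautifulSoup: href of the a tag inside an h2 tag
How do I get the href of the a tag inside an h2 tag with BeautifulSoup?
I am trying to get the link in the "a" tag inside the h2 tag, but the problem is that there are two "a" tags, each under a separate "parent" tag.
I am looking at this link: https://emerging-europe.com/tag/poland/
Here is the code I have so far.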
from bs4 import BeautifulSoup
import requests
url = 'https://emerging-europe.com/tag/poland/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
for item in soup.select('.col-lg-6'):
    try:
        headline = item.find('h2', {'class': 'entry-title'}).get_text()
        link = item.find('h2', {'class': 'entry-title'})['href']
    except:
        continue
The HTML I am referring to is shown below.
<div>
  <div>
    <span class="meta-category"><a href="https://emerging-europe.com/category/news/">News & Analysis</a></span>
    <h2><a href="https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/">Montenegro leads CEE on ILGA-Europe’s new Rainbow Map</a></h2>
    <div class="meta"><div class="meta-item herald-date"><span>May 17, 2021</span></div><div class="meta-item herald-author"><span><span><a href="https://emerging-europe.com/author/marekgrzegorczyk/">Marek Grzegorczyk</a></span></span></div></div>
  </div>
  <div>
    <p>Montenegro is Central and Eastern Europe’s best performer on the latest edition of the ILGA-Europe Rainbow Europe Map and Index, which monitors LGBTI rights across...</p>
  </div>
  <a href="https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/" title="Montenegro leads CEE on ILGA-Europe’s new Rainbow Map">Read More</a>
</div>
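I want to get the "https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/" link, but what I get is the "https://emerging-europe.com/category/news/" one. How do I reference the second one?
Thanks for your help!
Solution
Try this to get all the article URLs: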
import requests
from bs4 import BeautifulSoup
url = "https://emerging-europe.com/tag/poland/"
css = ".entry-header .entry-title,.entry-header .entry-title a,.post-author-list .categoriesarticle .title a"
soup = BeautifulSoup(requests.get(url).text,"lxml").select(css)
article_links = [a.find("a")["href"] for a in soup if a.find("a") is not None]
print("\n".join(article_links))
Output:
https://emerging-europe.com/voices/the-zangezur-corridor-is-a-geo-economic-revolution/
https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/
https://emerging-europe.com/business/made-in-emerging-europe-vinted-up-catalyst-propergate/
https://emerging-europe.com/news/polish-government-shifts-left-on-economy/
https://emerging-europe.com/news/georgias-modern-parliament-building-faces-uncertain-future-elsewhere-in-emerging-europe/
https://emerging-europe.com/after-hours/mixed-feelings-as-libeskind-reimagines-lodz/
https://emerging-europe.com/news/hungarys-united-opposition-emerging-europe-this-week/
https://emerging-europe.com/business/small-local-market-think-international-from-the-start/
https://emerging-europe.com/business/new-esg-guidelines-can-strengthen-polish-capital-market/
https://emerging-europe.com/news/why-is-the-left-propping-up-polands-right-wing-government/
https://emerging-europe.com/news/cee-should-redouble-efforts-to-end-violence-against-women/
https://emerging-europe.com/after-hours/a-century-on-the-silesian-uprisings-remains-complicated/
https://emerging-europe.com/voices/the-zangezur-corridor-is-a-geo-economic-revolution/
https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/
https://emerging-europe.com/business/made-in-emerging-europe-vinted-up-catalyst-propergate/
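The href attribute belongs to the a tag, not to the h2 that wraps it, so indexing the h2 with ['href'] cannot return the link. A minimal sketch of scoping the lookup to the a tag inside the h2, run on a simplified, hypothetical article card rather than the live page (the class names come from the question's code and markup, the article URL is invented for the example):

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical version of one article card from the question.
CARD_HTML = """
<div class="col-lg-6">
  <span class="meta-category"><a href="https://emerging-europe.com/category/news/">News</a></span>
  <h2 class="entry-title"><a href="https://emerging-europe.com/news/example-article/">Example article</a></h2>
</div>
"""

soup = BeautifulSoup(CARD_HTML, "html.parser")

links = []
for item in soup.select(".col-lg-6"):
    # Only the a tag nested inside the h2, not the category link in the span.
    anchor = item.select_one("h2.entry-title a")
    if anchor is not None:
        links.append((anchor.get_text(strip=True), anchor["href"]))

print(links)
```

Checking for None replaces the bare try/except in the question, so a card without a headline link is skipped without hiding unrelated errors.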