Python Regular Expressions for HTML Parsing (BeautifulSoup) (Matching HTML Tags with Python Regex)

25-02-10 16

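This article answers two questions about using Python regular expressions and BeautifulSoup to parse HTML. It also covers related topics: Python regular expressions with Beautiful Soup, using the re module in BeautifulSoup, how to use find_all in Python (bs4, BeautifulSoup), and the HTML parsing library BeautifulSoup.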

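Contents:

Python Regular Expressions for HTML Parsing (BeautifulSoup) (Matching HTML Tags with Python Regex)
Python Regular Expressions with Beautiful Soup
Using the re Module in BeautifulSoup
How to Use find_all in Python (bs4, BeautifulSoup)
The HTML Parsing Library BeautifulSoup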

Python Regular Expressions for HTML Parsing (BeautifulSoup) (Matching HTML Tags with Python Regex)

I want to get the value of a hidden input field in HTML.

<input type="hidden" name="fooId" value="12-3456789-1111111111" />

I want to write a regular expression in Python that returns the value of fooId, given that I know the line in the HTML follows this format:

<input type="hidden" name="fooId" value="**[id is here]**" />

Can someone provide a Python example that parses the HTML for this value?

Answer 1

小编典典

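For this particular case, BeautifulSoup is harder to write than a regular expression, but it is much more robust. I am only contributing the BeautifulSoup example, since you already know which regular expression to use :-)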

# Beautiful Soup 4 import (the original answer used the older BeautifulSoup 3 package)
from bs4 import BeautifulSoup

# Or retrieve it from the web, etc.
with open('/yourwebsite/page.html', 'r') as f:
    html_data = f.read()

# Create the soup object from the HTML data
soup = BeautifulSoup(html_data, 'html.parser')

# Find the proper tag. The attribute filters go in attrs because
# find() already uses the keyword "name" for the tag name.
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})

# Read the attribute directly from the tag
value = fooId['value']
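For comparison, the regular-expression approach the question asks about could look like the sketch below. It assumes the attribute order and double quoting shown in the question; the pattern and the variable names are illustrative and are not part of the original answer.

```python
import re

# Sample line taken from the question
html_data = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'

# Assumes the exact attribute order (type, name, value) and double quotes
# shown in the question; a change in the markup would break this pattern.
pattern = re.compile(r'<input\s+type="hidden"\s+name="fooId"\s+value="([^"]*)"')

match = pattern.search(html_data)
foo_id = match.group(1) if match else None
print(foo_id)
```

This is shorter than the BeautifulSoup version, but it depends on the markup never changing, which is the robustness trade-off the answer describes.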

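Python Regular Expressions with Beautiful Soup

I am using Beautiful Soup to extract specific div tags, and it seems I can't use simple string matching.

The page has some tags of the form

<div class="comment form new"...>

which I want to ignore, and some tags of the form

<div class="comment comment-xxxx...">

where x denotes an integer of arbitrary length, and the ellipsis denotes an arbitrary number of other values separated by spaces (which I don't care about). I can't work out the correct regular expression, especially since I have never used Python's re module.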

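Using

soup.find_all(class_="comment")

finds all the tags starting with the word comment. I tried

soup.find_all(class_=re.compile(r'(comment)( )(comment)'))
soup.find_all(class_=re.compile(r'comment comment.*'))

and many other variations, but I think I am missing something obvious about how regular expressions or match() work here. Can anyone help?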

Answer 1

小编典典

I think I've got it:

>>> [div['class'] for div in soup.find_all('div')]
[['comment', 'form', 'new'], ['comment', 'comment-xxxx...']]

Note that, unlike the equivalent in BS3, it is not this:

['comment form new', 'comment comment-xxxx...']

That is why your regular expression doesn't match.

But you can match, for example:

>>> soup.find_all('div', class_=re.compile('comment-'))
[<div class="comment comment-xxxx..."></div>]

Note that BS does the equivalent of re.search, not re.match, so you don't need 'comment-.*'. Of course, if you want to match 'comment-12345' but not 'comment-of-another-kind', you want something like 'comment-\d+'.
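The behaviour described in this answer can be reproduced with a small self-contained sketch. The HTML snippet and the class values below are made up for illustration; only the class_ filter and the 'comment-\d+' pattern come from the answer.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the tags described in the question
html = """
<div class="comment form new">ignore me</div>
<div class="comment comment-12345">keep me</div>
<div class="comment comment-of-another-kind">ignore me too</div>
"""
soup = BeautifulSoup(html, "html.parser")

# bs4 exposes the class attribute as a list of separate tokens
class_lists = [div["class"] for div in soup.find_all("div")]

# The pattern is applied with search semantics, so no trailing .* is needed
numbered = soup.find_all("div", class_=re.compile(r"comment-\d+"))
texts = [div.get_text() for div in numbered]

print(class_lists)
print(texts)
```

Only the div carrying a numeric comment id is returned, because neither of the other two class values contains "comment-" followed by digits.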

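Using the re Module in BeautifulSoup

While learning BeautifulSoup, most of us have run into the problem of finding tags that follow some particular format. For example, suppose I want to crawl the links on the www.baidu.com home page and print them. While crawling, you will find that the results include every <a> tag, including JavaScript redirects and bare "/" paths. So when calling BeautifulSoup's search functions it is well worth filtering on tag attributes, which is the subject of this section. It is short, but I find it quite handy.

Take a simple link example:

Our task is to collect the links whose a tag has an href attribute starting with the string "http:".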

import re
import urllib.request  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup

url='http://www.baidu.com'
ii=0  # counts the links printed at the deepest level
op=urllib.request.urlopen(url)
soup=BeautifulSoup(op,'html.parser')
# Keep only <a> tags whose href starts with "http:"
a=soup.findAll(name='a',attrs={"href":re.compile(r'^http:')})
try:
    for i in a:
        op1=urllib.request.urlopen(i['href'])
        soup1=BeautifulSoup(op1,'html.parser')
        b=soup1.findAll(name='a',attrs={"href":re.compile(r'^http:')})
        for j in b:
            op2=urllib.request.urlopen(j['href'])
            soup2=BeautifulSoup(op2,'html.parser')
            c=soup2.findAll(name='a',attrs={"href":re.compile(r'^http:')})
            for m in c:
                op3=urllib.request.urlopen(m['href'])
                soup3=BeautifulSoup(op3,'html.parser')
                d=soup3.findAll(name='a',attrs={"href":re.compile(r'^http:')})
                for h in d:
                    ii=ii+1
                    print(h['href'])
                    print('\n')
except Exception as e:
    # Any failed request or parse error ends the crawl
    print("\n", e)

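This is an example I wrote. Starting from Baidu as the root node, it crawls links depth-first to a depth of 3. It crawled 8123 links.

Walkthrough:

First, import the regular expression module re, the BeautifulSoup module, and the urllib module used to fetch pages (urllib2 in Python 2, urllib.request in Python 3).

The urlopen function opens the link, and BeautifulSoup then turns the page content op into a BeautifulSoup object. The findAll function of the soup object finds the a tags that match the regular expression. The depth-first traversal is not discussed further here.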

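The key point is the regular expression passed to findAll:

soup1.findAll(name='a',attrs={"href":re.compile(r'^http:')})

Here name specifies the tag name, and attrs specifies conditions on the tag's attributes. Only the href attribute is used here, and re.compile(r'^http:') is matched against the href string; only tags whose href matches are returned. (In bs4, findAll is the older spelling of find_all.)

The regular expression uses the r prefix, which makes it a raw string. If this is unfamiliar, or you want to see other uses of regular expressions, see the blog.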

r'^http:'
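To see what this pattern accepts, it can be tried against a few href values with the re module alone. The sample values are made up for illustration. One thing the sketch makes visible is that r'^http:' does not match https: links; the broader pattern r'^https?:' is an assumption added here and is not in the original code.

```python
import re

# Hypothetical href values of the kinds mentioned in the text
hrefs = ["http://www.baidu.com", "https://example.com/", "/", "javascript:;", "#top"]

http_only = re.compile(r'^http:')        # pattern used in the article
http_or_https = re.compile(r'^https?:')  # assumed variant that also accepts https

kept = [h for h in hrefs if http_only.search(h)]
kept_both = [h for h in hrefs if http_or_https.search(h)]

print(kept)
print(kept_both)
```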

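BeautifulSoup's findAll function is not explained in detail here; if you are interested, see the blog.

Character-encoding problems (garbled text) are not handled here; if you are interested, see the blog.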
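The nested loops in the example can also be written as a recursive depth-first function with a depth limit. The following is only a sketch under stated assumptions, not the method used above: the function name crawl, the fetch parameter, and the in-memory test pages are all hypothetical, and unlike the original it collects every link it finds instead of printing only those at the deepest level.

```python
import re
from bs4 import BeautifulSoup

HTTP_LINK = re.compile(r'^http:')  # same filter as in the article


def crawl(url, depth, fetch, seen=None):
    """Depth-first crawl; returns the links found on up to `depth` page levels.

    `fetch` is a hypothetical callable that takes a URL and returns its HTML.
    """
    if seen is None:
        seen = set()
    found = []
    if depth == 0 or url in seen:
        return found
    seen.add(url)
    try:
        html = fetch(url)
    except Exception:
        return found  # skip pages that cannot be fetched
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", attrs={"href": HTTP_LINK}):
        link = a["href"]
        found.append(link)
        # Go deeper before moving on to the next link (depth-first)
        found.extend(crawl(link, depth - 1, fetch, seen))
    return found


# Made-up pages so the sketch runs without network access
pages = {
    "http://a.test/": '<a href="http://b.test/">b</a><a href="/local">x</a>',
    "http://b.test/": '<a href="http://c.test/">c</a>',
    "http://c.test/": '<a href="http://a.test/">back</a>',
}
links = crawl("http://a.test/", 3, lambda u: pages[u])
print(links)
```

For a real crawl, fetch could be something like lambda u: urllib.request.urlopen(u).read(). The seen set keeps the crawler from fetching the same page twice.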

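How to Use find_all in Python (bs4, BeautifulSoup)

A brief description of find_all():

find_all()

The find_all() method searches all descendant tags of the current tag and checks whether each one matches the filter conditions.

Usage 1:

rs=soup.find_all('a')

This returns all the hyperlinks (a tags) in soup.

Similarly: soup.find_all('span'), soup.find_all('title'), soup.find_all('h1')

You can also add search conditions, e.g.:

soup.find_all('img',{'class':'news-img'})

This returns all img tags whose class attribute is news-img.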
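A small runnable sketch of these two calls follows. The HTML snippet is made up for illustration; only the find_all calls themselves come from the text.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment
html = """
<h1>News</h1>
<a href="/a">first</a>
<img class="news-img" src="1.png">
<img class="logo" src="logo.png">
<img class="news-img" src="2.png">
"""
soup = BeautifulSoup(html, "html.parser")

# Search by tag name only
links = soup.find_all('a')

# Search by tag name plus an attribute condition
news_imgs = soup.find_all('img', {'class': 'news-img'})
srcs = [img['src'] for img in news_imgs]

print(len(links), srcs)
```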

Usage 2:

Here True selects every tag that has an id attribute, whatever its value:

soup.find_all(id=True)

Result:

[<a href="http://example.com/elsie" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" id="link2">Lacie</a>,
 <a href="http://example.com/tillie" id="link3">Tillie</a>]
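The result above can be reproduced with a sketch like the following. The surrounding markup is an assumption, chosen so that one tag has no id and is therefore skipped.

```python
from bs4 import BeautifulSoup

# Assumed markup: three links with an id and one paragraph without
html = """
<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# id=True keeps every tag that has an id attribute, regardless of its value
with_id = soup.find_all(id=True)
ids = [tag['id'] for tag in with_id]

print(ids)
```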

Usage 3:

soup.find_all("a", string="Elsie")

The string argument searches the text content of the document. Like the name argument, string accepts a string, a regular expression, a list, or True.
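The different kinds of value that string accepts can be tried side by side. This is a minimal sketch on assumed markup; the regular expression and the list are illustrative choices.

```python
import re
from bs4 import BeautifulSoup

# Assumed markup for illustration
html = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string
by_text = [a.get_text() for a in soup.find_all("a", string="Elsie")]

# Regular expression: link text starting with "L"
by_regex = [a.get_text() for a in soup.find_all("a", string=re.compile("^L"))]

# List: link text equal to any of the items
by_list = [a.get_text() for a in soup.find_all("a", string=["Lacie", "Tillie"])]

print(by_text, by_regex, by_list)
```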

Usage 4:

soup.find_all("a", limit=2)

limit is the maximum number of results to return; here the search stops after two matches.
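The effect of limit is easy to check on a document with more than two matches. The markup below is assumed for illustration.

```python
from bs4 import BeautifulSoup

# Assumed markup with three matching tags
html = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")
# Only the first two matches, in document order, are returned
first_two = soup.find_all("a", limit=2)
first_two_ids = [a["id"] for a in first_two]

print(len(all_links), first_two_ids)
```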

The HTML Parsing Library BeautifulSoup

Installation:

apt install python-bs4

pip install beautifulsoup4

Alternatively, download the source from https://pypi.python.org/pypi/beautifulsoup4/ and install it with python setup.py install

apt install python-lxml

easy_install lxml

pip install lxml

apt install python-html5lib

easy_install html5lib

pip install html5lib

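Parser comparison

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; lenient with malformed documents	Not very lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; lenient with malformed documents	Requires the C library to be installed
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires the C library to be installed
html5lib	BeautifulSoup(markup, "html5lib")	Best error tolerance; parses documents the way a browser does; produces HTML5-format documents	Slow; requires an external Python package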
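The differences between parsers show up most clearly on invalid markup. The sketch below feeds the same broken fragment to each parser and records what comes back; lxml and html5lib are optional installs, so the code skips any parser that is not available. The fragment and the variable names are illustrative.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invalid fragment: a closing </p> with no matching opening tag
markup = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # The parser library is not installed in this environment
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output)
```

The built-in html.parser simply drops the stray closing tag, while the other parsers, when installed, wrap the fragment in a fuller document structure.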
