Populating an XML template file from XPath expressions? (How to use XPath)

25-02-25 12

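This article covers populating an XML template file from XPath expressions and how to use XPath, together with related material: 2-Python crawler - regular expressions/XML/XPath/CSS; c# – an XPath expression with conditions on multiple ancestors; c# – Exception: the XPath expression evaluated to unexpected type System.Xml.Linq.XAttribute; and html – XPath expression: selecting the elements between A HREF="expr" tags.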

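Contents:

Populating an XML template file from XPath expressions? (How to use XPath)
2-Python crawler - regular expressions/XML/XPath/CSS
c# – An XPath expression with conditions on multiple ancestors
c# – Exception: the XPath expression evaluated to unexpected type System.Xml.Linq.XAttribute
html – XPath expression: selecting the elements between A HREF="expr" tags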

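Populating an XML template file from XPath expressions? (How to use XPath)

What is the best way to populate (or generate) an XML template file from a map of XPath expressions?

The requirement is that we start from a template, because it may contain information that is not captured in the XPath expressions.

For example, the starting template might be: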

<s11:Envelope xmlns:s11='http://schemas.xmlsoap.org/soap/envelope/'>
  <s11:Body>
    <ns1:create xmlns:ns1='http://predic8.com/wsdl/material/ArticleService/1/'>
      <article xmlns:ns1='http://predic8.com/material/1/'>
        <name>?XXX?</name>
        <description>?XXX?</description>
        <price xmlns:ns1='http://predic8.com/common/1/'>
          <amount>?999.99?</amount>
          <currency xmlns:ns1='http://predic8.com/common/1/'>???</currency>
        </price>
        <id xmlns:ns1='http://predic8.com/material/1/'>???</id>
      </article>
    </ns1:create>
  </s11:Body>
</s11:Envelope>

We are then supplied with something like:

expression: /create/article[1]/id                => 1
expression: /create/article[1]/description       => bar
expression: /create/article[1]/name[1]           => foo
expression: /create/article[1]/price[1]/amount   => 00.00
expression: /create/article[1]/price[1]/currency => USD
expression: /create/article[2]/id                => 2
expression: /create/article[2]/description       => some name
expression: /create/article[2]/name[1]           => some description
expression: /create/article[2]/price[1]/amount   => 00.01
expression: /create/article[2]/price[1]/currency => USD

From that, we should generate:

<ns1:create xmlns:ns1='http://predic8.com/wsdl/material/ArticleService/1/'>
    <article xmlns:ns1='http://predic8.com/material/1/'>
        <name xmlns:ns1='http://predic8.com/material/1/'>foo</name>
        <description>bar</description>
        <price xmlns:ns1='http://predic8.com/common/1/'>
            <amount>00.00</amount>
            <currency xmlns:ns1='http://predic8.com/common/1/'>USD</currency>
        </price>
        <id xmlns:ns1='http://predic8.com/material/1/'>1</id>
    </article>
    <article xmlns:ns1='http://predic8.com/material/2/'>
        <name>some name</name>
        <description>some description</description>
        <price xmlns:ns1='http://predic8.com/common/2/'>
            <amount>00.01</amount>
            <currency xmlns:ns1='http://predic8.com/common/2/'>USD</currency>
        </price>
        <id xmlns:ns1='http://predic8.com/material/2/'>2</id>
    </article>
</ns1:create>

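I am implementing this in Java, although I would prefer an XSLT-based solution if possible.

Answer 1

This transformation creates, from the "expressions", an XML document that has the structure of the wanted result; that document still has to be transformed into the final result: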

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:my="my:my">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:variable name="vPop" as="element()*">
    <item path="/create/article[1]/id">1</item>
    <item path="/create/article[1]/description">bar</item>
    <item path="/create/article[1]/name[1]">foo</item>
    <item path="/create/article[1]/price[1]/amount">00.00</item>
    <item path="/create/article[1]/price[1]/currency">USD</item>
    <item path="/create/article[1]/price[2]/amount">11.11</item>
    <item path="/create/article[1]/price[2]/currency">AUD</item>
    <item path="/create/article[2]/id">2</item>
    <item path="/create/article[2]/description">some name</item>
    <item path="/create/article[2]/name[1]">some description</item>
    <item path="/create/article[2]/price[1]/amount">00.01</item>
    <item path="/create/article[2]/price[1]/currency">USD</item>
 </xsl:variable>

 <xsl:template match="/">
  <xsl:sequence select="my:subTree($vPop/@path/concat(.,'/',string(..)))"/>
 </xsl:template>

 <xsl:function name="my:subTree" as="node()*">
  <xsl:param name="pPaths" as="xs:string*"/>

  <xsl:for-each-group select="$pPaths"
    group-adjacent=
        "substring-before(substring-after(concat(., '/'), '/'), '/')">
    <xsl:if test="current-grouping-key()">
     <xsl:choose>
       <xsl:when test=
          "substring-after(current-group()[1], current-grouping-key())">
         <xsl:element name=
           "{substring-before(concat(current-grouping-key(), '['), '[')}">
          <xsl:sequence select=
            "my:subTree(for $s in current-group()
                         return
                            concat('/',substring-after(substring($s, 2),'/'))
                             )
            "/>
        </xsl:element>
       </xsl:when>
       <xsl:otherwise>
        <xsl:value-of select="current-grouping-key()"/>
       </xsl:otherwise>
     </xsl:choose>
     </xsl:if>
  </xsl:for-each-group>
 </xsl:function>
</xsl:stylesheet>

When this transformation is applied to any XML document (the input is not used), the result is:

<create>
   <article>
      <id>1</id>
      <description>bar</description>
      <name>foo</name>
      <price>
         <amount>00.00</amount>
         <currency>USD</currency>
      </price>
      <price>
         <amount>11.11</amount>
         <currency>AUD</currency>
      </price>
   </article>
   <article>
      <id>2</id>
      <description>some name</description>
      <name>some description</name>
      <price>
         <amount>00.01</amount>
         <currency>USD</currency>
      </price>
   </article>
</create>

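Notes:

You need to convert the "expressions" you are given into the format used in this transformation; this is simple and straightforward.
In the final transformation you need to copy every node "as is" (using the identity rule), except that the top-level node should be generated in the "http://predic8.com/wsdl/material/ArticleService/1/" namespace. Note that the other namespaces present in the "template" are not used and can be safely omitted.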
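
The grouping idea in the stylesheet above can also be sketched procedurally. The following is a minimal Python sketch, not the answer's method: it assumes every path consists only of element names with optional positional predicates such as article[2], it ignores namespaces, and it does not merge with the template. The names STEP, build_tree and assignments are hypothetical and exist only for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# One path step: an element name with an optional positional predicate,
# for example "article[2]". A missing predicate is treated as [1].
STEP = re.compile(r"([^\[\]/]+)(?:\[(\d+)\])?")


def build_tree(assignments):
    """Build an element tree from (path, value) pairs (hypothetical helper)."""
    root = None
    for path, value in assignments:
        steps = []
        for raw in path.strip("/").split("/"):
            m = STEP.fullmatch(raw)
            if m is None:
                raise ValueError(f"unsupported step: {raw!r}")
            steps.append((m.group(1), int(m.group(2) or 1)))
        root_name, _ = steps[0]
        if root is None:
            root = ET.Element(root_name)
        elif root.tag != root_name:
            raise ValueError("all paths must share one root element")
        node = root
        for name, position in steps[1:]:
            siblings = node.findall(name)
            # Create missing siblings until the requested position exists.
            while len(siblings) < position:
                siblings.append(ET.SubElement(node, name))
            node = siblings[position - 1]
        node.text = value
    return root


assignments = [
    ("/create/article[1]/id", "1"),
    ("/create/article[1]/description", "bar"),
    ("/create/article[1]/name[1]", "foo"),
    ("/create/article[1]/price[1]/amount", "00.00"),
    ("/create/article[1]/price[1]/currency", "USD"),
    ("/create/article[2]/id", "2"),
    ("/create/article[2]/description", "some name"),
    ("/create/article[2]/name[1]", "some description"),
    ("/create/article[2]/price[1]/amount", "00.01"),
    ("/create/article[2]/price[1]/currency", "USD"),
]
tree = build_tree(assignments)
print(ET.tostring(tree, encoding="unicode"))
```

Children appear in the order in which their paths are first seen, which matches the behaviour of the stylesheet output shown above.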

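2-Python crawler - regular expressions/XML/XPath/CSS

Page parsing and data extraction

Structured data: the structure comes first, then the data
- JSON files
  - JSON Path
  - Convert to Python types and operate on them (json module)
- XML files
  - Convert to Python types (xmltodict)
  - XPath
  - CSS selectors
  - Regular expressions
Unstructured data: the data comes first, then the structure
- Text
- Phone numbers
- Email addresses
  - Data of this kind is usually handled with regular expressions
- HTML files
  - Regular expressions
  - XPath
  - CSS selectors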
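Regular expressions

A set of rules for searching, replacing, and similar operations on string text
Example v23: basic workflow of the re module
Example v24: basic use of match
Common regex methods:
- match: searches from the start of the string, one match
- search: searches from any position, one match; example v25
- findall: all matches, returned as a list; example v26
- finditer: all matches, returned as an iterator; example v26
- split: splits the string, returns a list
- sub: substitution
Matching Chinese
- The main Unicode range for Chinese characters is [\u4e00-\u9fa5]
- Example v27
Greedy and non-greedy modes
- Greedy mode: provided the whole expression still matches, match as much as possible
- Non-greedy mode: provided the whole expression still matches, match as little as possible
- Quantifiers in Python are greedy by default
- For example:
  - Text to search: abbbbbbccc
  - The pattern is ab*
  - Greedy mode: the result is abbbbbb
  - Non-greedy (ab*?): the result is a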
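
The methods and the greedy versus non-greedy behaviour listed above can be tried with the standard re module. This is a small sketch rather than one of the numbered examples (v23 to v27), which are not part of these notes; the sample strings are made up for illustration.

```python
import re

text = "abbbbbbccc"

# Greedy quantifier: b* takes as many b's as it can.
greedy = re.match(r"ab*", text).group()
# Non-greedy quantifier: b*? takes as few b's as it can (here, none).
lazy = re.match(r"ab*?", text).group()
print(greedy, lazy)

# match only tries at the start of the string; search scans forward.
no_match = re.match(r"c+", text)
found = re.search(r"c+", text).group()

# findall returns a list, finditer an iterator of match objects.
runs = re.findall(r"b+|c+", text)
spans = [m.span() for m in re.finditer(r"b+|c+", text)]

# split and sub.
parts = re.split(r"b+", text)
replaced = re.sub(r"b+", "-", text)

# Matching Chinese characters with the range from the notes.
chinese = re.findall(r"[\u4e00-\u9fa5]+", "hello 你好 world 世界")
print(chinese)
```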

XML

XML (eXtensible Markup Language)
http://www.w3school.com.cn/xml/index.asp
Example v28.xml
Concepts: parent node, child node, ancestor node, sibling node, descendant node

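XPath

XPath (XML Path Language) is a language for finding information in XML documents.
Reference: http://www.w3school.com.cn/xpath/index.asp
XPath development tools
- Open-source XPath expression tool: XMLQuire
- Chrome extension: XPath Helper
- Firefox extension: XPath Checker
Common path expressions:
- nodename: selects the child elements named nodename of the current node
- /: selects starting from the root node
- //: selects matching nodes anywhere in the document, regardless of their position
- .: the current node
- ..: the parent node
- @: selects attributes
- Examples:
  - bookstore: selects the bookstore child elements of the current node
  - /bookstore: selects the root element bookstore
  - bookstore/book: selects all book elements that are children of bookstore
  - //book: selects all book elements, wherever they are in the document
  - //@lang: selects all attributes named lang
Predicates
- A predicate is used to find a specific node and is written inside square brackets
- /bookstore/book[1]: selects the first book element under bookstore
- /bookstore/book[last()]: selects the last book element under bookstore
- /bookstore/book[last()-1]: selects the second-to-last book element under bookstore
- /bookstore/book[position()<3]: selects the first two book elements under bookstore
- /bookstore/book[@lang]: selects the book elements under bookstore that have a lang attribute
- /bookstore/book[@lang="cn"]: selects the book elements under bookstore whose lang attribute has the value cn
- /bookstore/book[@price < 90]: selects the book elements under bookstore whose price attribute has a value less than 90
- /bookstore/book[@price < 90]/title: selects the title children of the book elements under bookstore whose price attribute has a value less than 90
Wildcards
- *: matches any element node
- @*: matches any attribute node
- node(): matches a node of any type
Selecting several paths
- //book/title | //book/author: selects the title and author elements of all book elements
- //title | //price: selects all title and price elements in the document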
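
Several of the expressions above can be tried without installing anything, using the limited XPath subset of the standard xml.etree.ElementTree module. This is a sketch under assumptions: the bookstore document is invented for illustration, and ElementTree supports neither comparison predicates such as [@price < 90] nor the | operator, so those two cases are emulated in Python here (a full XPath engine such as lxml evaluates them directly).

```python
import xml.etree.ElementTree as ET

# A made-up bookstore document; the parsed root element is <bookstore>,
# so the paths below are written relative to it.
doc = ET.fromstring("""
<bookstore>
  <book lang="cn" price="80"><title>A</title><author>X</author></book>
  <book lang="en" price="95"><title>B</title><author>Y</author></book>
  <book price="60"><title>C</title><author>Z</author></book>
</bookstore>
""")

first = doc.find("book[1]/title").text               # /bookstore/book[1]
last = doc.find("book[last()]/title").text           # /bookstore/book[last()]
second_last = doc.find("book[last()-1]/title").text  # /bookstore/book[last()-1]

# /bookstore/book[@lang] and /bookstore/book[@lang="cn"]
with_lang = [b.find("title").text for b in doc.findall("book[@lang]")]
cn_only = [b.find("title").text for b in doc.findall("book[@lang='cn']")]

# //title: every title element below the root.
all_titles = [t.text for t in doc.findall(".//title")]

# [@price < 90] is not supported by ElementTree; filter in Python instead.
cheap = [b.find("title").text
         for b in doc.findall("book")
         if float(b.get("price")) < 90]

# //title | //author emulated by concatenating two result lists
# (unlike a real union, the result is not in document order).
titles_and_authors = doc.findall(".//title") + doc.findall(".//author")

print(first, last, second_last, with_lang, cn_only, cheap)
```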

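The lxml library

An HTML/XML parser for Python
Official documentation: http://lxml.de/index.html
Features:
- Parsing HTML: example v29.py
- Reading from files: examples v30.html, v31.py
- Using etree together with XPath: example v32.py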

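CSS selectors: BeautifulSoup4

BeautifulSoup4 is the version in current use
http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
Comparison of some common extraction tools:
- Regular expressions: very fast, hard to use, nothing to install
- beautifulsoup: slow, easy to use, easy to install
- lxml: fairly fast, easy to use, moderately easy to install
Example v33.py
The four main objects
- Tag
- NavigableString
- BeautifulSoup
- Comment
Tag
- Corresponds to a tag in the HTML
- Can be accessed as soup.tag_name
- Two important attributes of a tag
  - name
  - attrs
- Example a34
NavigableString
- Corresponds to the text content
BeautifulSoup
- Represents the content of a whole document; for most purposes it can be treated as a Tag object
- It is usually referred to as soup
Comment
- A special kind of NavigableString object
- When printed, its content does not include the comment markers
Traversing the document
- contents: the child nodes of a tag, given as a list
- children: the child nodes, returned as an iterator
- descendants: all descendant nodes
- string
- Example 34
Searching the document
- find_all(name, attrs, recursive, text, **kwargs)
  - name: the tag name to search for; it can be given as
    - a string
    - a regular expression
    - a list
  - keyword arguments: can be used to filter on attributes
  - text: matches the text value of a tag
  - Example 34
CSS selectors
- Use soup.select, which returns a list
- By tag name: soup.select("title")
- By class name: soup.select(".content")
- By id: soup.select("#name_id")
- Combined lookup: soup.select("div #input_content")
- By attribute: soup.select("img[photo]")
- Getting the text of a tag: tag.get_text()
- Example 35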

c# – An XPath expression with conditions on multiple ancestors

The application I am developing receives an XML structure similar to the following:

<Root>
    <Valid>
        <Child name="Child1" />
        <Container>
            <Child name="Child2" />
        </Container>
        <Container>
            <Container>
                <Child name="Child3"/>
                <Child name="Child4"/>
            </Container>
        </Container>
        <Wrapper>
            <Child name="Child5" />
        </Wrapper>
        <Wrapper>
            <Container>
                <Child name="Child19" />
            </Container>
        </Wrapper>
        <Container>
            <Wrapper>
                <Child name="Child6" />
            </Wrapper>
        </Container>
        <Container>
            <Wrapper>
                <Container>
                    <Child name="Child20" />
                </Container>
            </Wrapper>
        </Container>
    </Valid>
    <Invalid>
        <Child name="Child7" />
        <Container>
            <Child name="Child8" />
        </Container>
        <Container>
            <Container>
                <Child name="Child9"/>
                <Child name="Child10"/>
            </Container>
        </Container>
        <Wrapper>
            <Child name="Child11" />
        </Wrapper>
        <Container>
            <Wrapper>
                <Child name="Child12" />
            </Wrapper>
        </Container>
    </Invalid>
</Root>

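I need to get the list of Child elements under the following conditions:

> Child is an n-th generation descendant of a Valid ancestor.
> Child may be an m-th generation descendant of a Container ancestor that is itself an o-th generation descendant of the Valid ancestor.
> The valid ancestors of a Child element are Container elements as m-th generation ancestors and a Valid element as the first-generation ancestor.

where m, n, o are natural numbers.

I need to write the following XPath expressions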

Valid/Child
Valid/Container/Child
Valid/Container/Container/Child
Valid/Container/Container/Container/Child
...

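as a single XPath expression.

For the provided example, the XPath expression should return only the Child elements whose name attribute equals Child1, Child2, Child3, and Child4.

The closest I have come to a solution is the following expression.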

Valid/Child | Valid//*[self::Container]/Child

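However, this also selects the Child elements whose name attribute equals Child19 and Child20.

Does XPath syntax support optional occurrences of an element, or a way to set a condition similar to self on all the ancestors between self and the Valid element in the previous example?

Solution

Use: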

//Child[ancestor::*
          [not(self::Container)][1]
                            [self::Valid]
       ]

When this XPath expression is evaluated against the provided XML document:

<Root>
    <Valid>
        <Child name="Child1" />
        <Container>
            <Child name="Child2" />
        </Container>
        <Container>
            <Container>
                <Child name="Child3"/>
                <Child name="Child4"/>
            </Container>
        </Container>
        <Wrapper>
            <Child name="Child5" />
        </Wrapper>
        <Wrapper>
            <Container>
                <Child name="Child19" />
            </Container>
        </Wrapper>
        <Container>
            <Wrapper>
                <Child name="Child6" />
            </Wrapper>
        </Container>
        <Container>
            <Wrapper>
                <Container>
                    <Child name="Child20" />
                </Container>
            </Wrapper>
        </Container>
    </Valid>
    <Invalid>
        <Child name="Child7" />
        <Container>
            <Child name="Child8" />
        </Container>
        <Container>
            <Container>
                <Child name="Child9"/>
                <Child name="Child10"/>
            </Container>
        </Container>
        <Wrapper>
            <Child name="Child11" />
        </Wrapper>
        <Container>
            <Wrapper>
                <Child name="Child12" />
            </Wrapper>
        </Container>
    </Invalid>
</Root>

exactly the wanted nodes are selected:

<Child name="Child1"/>
<Child name="Child2"/>
<Child name="Child3"/>
<Child name="Child4"/>

Explanation:

The expression:

//Child[ancestor::*
          [not(self::Container)][1]
                            [self::Valid]
       ]

means:

From all the Child elements in the document, select only those whose first ancestor that is not a Container is a Valid element.
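
The same rule can be spelled out as an explicit walk up the tree, which may make the predicate easier to read. This is a Python sketch with the standard xml.etree.ElementTree module, not the C# from the question; the XML is a shortened version of the sample document, and parent_of and first_non_container_ancestor are hypothetical helper names. ElementTree keeps no parent pointers, so a child-to-parent map is built first.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample document from the question.
XML = """<Root>
  <Valid>
    <Child name="Child1"/>
    <Container><Child name="Child2"/></Container>
    <Container><Container>
      <Child name="Child3"/><Child name="Child4"/>
    </Container></Container>
    <Wrapper><Child name="Child5"/></Wrapper>
    <Wrapper><Container><Child name="Child19"/></Container></Wrapper>
    <Container><Wrapper><Child name="Child6"/></Wrapper></Container>
  </Valid>
  <Invalid>
    <Child name="Child7"/>
    <Container><Child name="Child8"/></Container>
  </Invalid>
</Root>"""

root = ET.fromstring(XML)

# ElementTree has no parent axis, so record each element's parent.
parent_of = {child: parent for parent in root.iter() for child in parent}


def first_non_container_ancestor(node):
    """Nearest ancestor that is not a Container: ancestor::*[not(self::Container)][1]."""
    current = parent_of.get(node)
    while current is not None and current.tag == "Container":
        current = parent_of.get(current)
    return current


selected = []
for child in root.iter("Child"):
    ancestor = first_non_container_ancestor(child)
    # Keep the Child only if that ancestor is Valid: [self::Valid]
    if ancestor is not None and ancestor.tag == "Valid":
        selected.append(child.get("name"))

print(selected)
```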

c# – Exception: the XPath expression evaluated to unexpected type System.Xml.Linq.XAttribute

I have an XML file like the following:

<Employees>
  <Employee Id="ABC001">
    <Name>Prasad 1</Name>
    <Mobile>9986730630</Mobile>
    <Address Type="Perminant">
      <City>City1</City>
      <Country>India</Country>
    </Address>
    <Address Type="Temporary">
      <City>City2</City>
      <Country>India</Country>
    </Address>
  </Employee>
</Employees>

Now I want to get all the address types.

I tried the XPath below, and I get an exception.

var xPathString = @"//Employee/Address/@Type";
doc.XPathSelectElements(xPathString); // doc is XDocument.Load("xml file Path")

Exception: The XPath expression evaluated to unexpected type
System.Xml.Linq.XAttribute.

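Is there anything wrong with my XPath?

Solution

Your XPath is fine (although you may want it to be more selective), but you have to change the way you evaluate it...

As its name suggests, XPathSelectElement() should only be used to select elements.

XPathEvaluate() is more general and can be used for attributes. You can enumerate the results, or grab the first one: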

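// First() takes the first attribute; Single() would throw here because
// the sample document contains more than one Address element.
var type = ((IEnumerable<object>)doc.XPathEvaluate("//Employee/Address/@Type"))
                                    .OfType<XAttribute>()
                                    .First()
                                    .Value;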
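
For comparison, the same distinction between selecting elements and reading attributes shows up in Python's standard xml.etree.ElementTree module, whose XPath subset only returns elements. The sketch below is an illustration under that assumption, not part of the C# answer: it selects the Address elements and then reads the Type attribute from each one.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, with the root element closed.
doc = ET.fromstring("""<Employees>
  <Employee Id="ABC001">
    <Name>Prasad 1</Name>
    <Mobile>9986730630</Mobile>
    <Address Type="Perminant">
      <City>City1</City>
      <Country>India</Country>
    </Address>
    <Address Type="Temporary">
      <City>City2</City>
      <Country>India</Country>
    </Address>
  </Employee>
</Employees>""")

# The path selects elements; the attribute value is read from each element.
address_types = [address.get("Type")
                 for address in doc.iterfind("Employee/Address")]
print(address_types)
```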

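html – XPath expression: selecting the elements between A HREF="expr" tags

I have not found a clear way to select all the nodes that exist between two anchors (<a></a> tag pairs) in an HTML file.

The first anchor has the following format: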

<a href="file://START..."></a>

The second anchor:

<a href="file://END..."></a>

I have verified that both can be selected using starts-with (note that I am using the HTML Agility Pack):

HtmlNode n0 = html.DocumentNode.SelectSingleNode("//a[starts-with(@href,'file://START')]");
HtmlNode n1 = html.DocumentNode.SelectSingleNode("//a[starts-with(@href,'file://END')]");

With this in mind, and with my amateur XPath skills, I wrote the following expression to get all the tags between the two anchors:

html.DocumentNode.SelectNodes("//*[not(following-sibling::a[starts-with(@href,'file://START0')]) and not (preceding-sibling::a[starts-with(@href,'file://END0')])]");

This seems to work, but it selects the whole HTML document!

I need, for example, for the following HTML fragment:

<html>
...

<a href="file://START0"></a>
<p>First nodes</p>
<p>First nodes
    <span>X</span>
</p>
<p>First nodes</p>
<a href="file://END0"></a>

...
</html>

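to strip the two anchors and get the three P elements (including, of course, the inner SPAN).

Is there any way to do this?

I don't know whether XPath 2.0 offers a better way to achieve this.

*Edit (special case!)*

I should also handle the following case:

"Select the tags between X and X', where X is <p><a href="file://..."></a></p>"

So instead of: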

<a href="file://START..."></a>
<!-- xhtml to be extracted -->
<a href="file://END..."></a>

I should also handle:

<p>
  <a href="file://START..."></a>
</p>
<!-- xhtml to be extracted -->

<p>
  <a href="file://END..."></a>
</p>

Thank you very much once again.

Solution

Use this XPath 1.0 expression:

//a[starts-with(@href,'file://START')]/following-sibling::node()
     [count(.| //a[starts-with(@href,'file://END')]/preceding-sibling::node())
     =
      count(//a[starts-with(@href,'file://END')]/preceding-sibling::node())
     ]

Or, use this XPath 2.0 expression:

//a[starts-with(@href,'file://START')]/following-sibling::node()
  intersect
    //a[starts-with(@href,'file://END')]/preceding-sibling::node()

The XPath 2.0 expression uses the XPath 2.0 intersect operator.

The XPath 1.0 expression uses the Kayessian formula (named after @Michael Kay) for the intersection of two node-sets:

$ns1[count(.|$ns2) = count($ns2)]
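
The formula keeps a node of $ns1 exactly when adding it to $ns2 does not make $ns2 any larger, that is, when the node is already in $ns2. A Python sketch of the same counting argument follows; it uses the standard xml.etree.ElementTree module and a made-up fragment, and the names is_anchor, following_start and preceding_end are hypothetical. ElementTree models element children only, so text nodes between the anchors are not part of the sets here.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment in the shape of the question's HTML.
body = ET.fromstring(
    '<body><h1>before</h1>'
    '<a href="file://START0"></a>'
    '<p>First nodes</p><p>First nodes <span>X</span></p><p>First nodes</p>'
    '<a href="file://END0"></a>'
    '<p>after</p></body>'
)


def is_anchor(node, prefix):
    """True for <a> elements whose href starts with the given prefix."""
    return node.tag == "a" and node.get("href", "").startswith(prefix)


siblings = list(body)
start = next(i for i, n in enumerate(siblings) if is_anchor(n, "file://START"))
end = next(i for i, n in enumerate(siblings) if is_anchor(n, "file://END"))

following_start = set(siblings[start + 1:])  # plays the role of $ns1
preceding_end = set(siblings[:end])          # plays the role of $ns2

# $ns1[count(.|$ns2) = count($ns2)]: the union with $ns2 has the same size
# as $ns2 only when the node already belongs to $ns2.
between = [n for n in siblings
           if n in following_start
           and len({n} | preceding_end) == len(preceding_end)]

print([ET.tostring(n, encoding="unicode") for n in between])
```

In Python the same result is simply the set intersection of the two sibling sets; the counting form is shown only because XPath 1.0 has no intersect operator.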

Verification with XSLT:

This XSLT 1.0 transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "    //a[starts-with(@href,'file://START')]/following-sibling::node()
         [count(.| //a[starts-with(@href,'file://END')]/preceding-sibling::node())
         =
          count(//a[starts-with(@href,'file://END')]/preceding-sibling::node())
         ]
  "/>
 </xsl:template>
</xsl:stylesheet>

when applied to the provided XML document:

<html>...
    <a href="file://START0"></a>
    <p>First nodes</p>
    <p>First nodes    
        <span>X</span>
    </p>
    <p>First nodes</p>
    <a href="file://END0"></a>...
</html>

produces the wanted, correct result:

<p>First nodes</p>
<p>First nodes    
        <span>X</span>
</p>
<p>First nodes</p>

This XSLT 2.0 transformation:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  " //a[starts-with(@href,'file://START')]/following-sibling::node()
   intersect
    //a[starts-with(@href,'file://END')]/preceding-sibling::node()
  "/>
 </xsl:template>
</xsl:stylesheet>

when applied to the same XML document (above), again produces exactly the wanted result.
