A custom method for parsing simple XML-format files in Python

Updated: 2015-05-11 15:51:26 Author: 像风一样的自由

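This article introduces a custom method for parsing simple XML-format files in Python and covers some related techniques for handling XML in Python. Readers who need this may find it a useful reference.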

This article walks through an example of custom-parsing simple XML-format files in Python, shared here for reference. The details are as follows:

Our company's internal API returns strings in two forms: PHP arrays and XML. Python cannot use the PHP array form directly, and the XML strings are not in standard format, so the standard modules cannot parse them either. (The non-standard part is that some node names begin with a digit.) So I wrote a simple script to parse the responses, for use in API testing.
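To make the problem concrete, here is a minimal sketch (Python 3) of how a standard-library parser reacts to a node name that begins with a digit. The node names `1st_item` and `item_1` are hypothetical, chosen only for illustration; the actual API responses are not shown in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: "1st_item" stands in for a node name that starts
# with a digit, which XML naming rules do not allow.
bad_xml = "<resultObject><1st_item>aaaaa</1st_item></resultObject>"
good_xml = "<resultObject><item_1>aaaaa</item_1></resultObject>"

def can_parse(text):
    """Return True if xml.etree.ElementTree accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(can_parse(good_xml))  # True
print(can_parse(bad_xml))   # False
```

Because the parser rejects the whole document rather than skipping the offending node, a regex-based lookup is one pragmatic way around it for test tooling.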

#!/usr/bin/env python
#encoding: utf-8
# Written for Python 2 (print statements, byte-string decode/encode)
import re
class xmlparse:
  def __init__(self, xmlstr):
    self.xmlstr = xmlstr
    self.xmldom = self.__convet2utf8()
    self.xmlnodelist = []
    self.xpath = ''
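  def __convet2utf8(self):
    # Strip the <?xml ... ?> declaration, then re-encode the body as UTF-8
    # when the declaration names GBK or GB2312 (Python 2 byte strings)
    headstr = self.__get_head()
    xmldomstr = self.xmlstr.replace(headstr, '')
    if 'gbk' in headstr:
      xmldomstr = xmldomstr.decode('gbk').encode('utf-8')
    elif 'gb2312' in headstr:
      # Decode the header-stripped body, not self.xmlstr, so the declaration stays removed
      xmldomstr = xmldomstr.decode('gb2312').encode('utf-8')
    return xmldomstr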
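  def __get_head(self):
    # Return the leading XML declaration, or '' if the string has none.
    # The match is non-greedy so it stops at the first '?>'.
    headpat = r'<\?xml.*?\?>'
    headpatobj = re.compile(headpat)
    headregobj = headpatobj.match(self.xmlstr)
    if headregobj:
      headstr = headregobj.group()
      return headstr
    else:
      return ''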
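  def parse(self, xpath):
    # Walk a simplified path such as '/a/product_id[1]'; [n] is a 0-based index
    self.xpath = xpath
    xpatlist = []
    xpatharr = self.xpath.split('/')
    for xnode in xpatharr:
      if xnode:
        spcindex = xnode.find('[')
        if spcindex > -1:
          index = int(xnode[spcindex+1:-1])
          xnode = xnode[:spcindex]
        else:
          index = 0
        # One (regex, index) pair per path step; the match is non-greedy
        temppat = ('<%s>(.*?)</%s>' % (xnode, xnode),index)
        xpatlist.append(temppat)
    xmlnodestr = self.xmldom
    for xpat,index in xpatlist:
      # Search inside the text selected by the previous step
      # (raises IndexError if the step matches nothing)
      xmlnodelist = re.findall(xpat,xmlnodestr)
      xmlnodestr = xmlnodelist[index]
      # Unwrap a CDATA section: drop '<![CDATA[' and the trailing ']]>'
      if xmlnodestr.startswith(r'<![CDATA['):
        xmlnodestr = xmlnodestr.replace(r'<![CDATA[','')[:-3]
    self.xmlnodelist = xmlnodelist
    return xmlnodestr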
if '__main__' == __name__:
  xmlstr = '<?xml version="1.0" encoding="utf-8" standalone="yes" ?><resultObject><a><product_id>aaaaa</product_id><product_name><![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name><![CDATA[bbbbb]]></b></product_name></resultObject>'
  xpath1 = '/product_id'
  xpath2 = '/product_id[1]'
  xpath3 = '/a/product_id'
  xp = xmlparse(xmlstr)
  print 'xmlstr:',xp.xmlstr
  print 'xmldom:',xp.xmldom
  print '------------------------------'
  getstr = xp.parse(xpath1)
  print 'xpath:',xp.xpath
  print 'get list:',xp.xmlnodelist
  print 'get string:', getstr
  print '------------------------------'
  getstr = xp.parse(xpath2)
  print 'xpath:',xp.xpath
  print 'get list:',xp.xmlnodelist
  print 'get string:', getstr
  print '------------------------------'
  getstr = xp.parse(xpath3)
  print 'xpath:',xp.xpath
  print 'get list:',xp.xmlnodelist
  print 'get string:', getstr

Output:

xmlstr: <?xml version="1.0" encoding="utf-8" standalone="yes" ?><resultObject><a><product_id>aaaaa</product_id><product_name><![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name><![CDATA[bbbbb]]></b></product_name></resultObject>
xmldom: <resultObject><a><product_id>aaaaa</product_id><product_name><![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name><![CDATA[bbbbb]]></b></product_name></resultObject>
------------------------------
xpath: /product_id
get list: ['aaaaa', 'bbbbb']
get string: aaaaa
------------------------------
xpath: /product_id[1] 
get list: ['aaaaa', 'bbbbb']
get string: bbbbb
------------------------------
xpath: /a/product_id
get list: ['aaaaa']
get string: aaaaa
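
The listing and output above are from Python 2. For readers on Python 3, the path lookup can be sketched as a standalone function. This is an illustrative port, not the author's code: the name `find_node` is hypothetical, the sample string is a well-formed variant of the one in the listing, and `re.escape` and `re.DOTALL` are additions that the original does not use.

```python
import re

def find_node(xml_text, path):
    """Walk a path such as '/a/product_id[1]' and return (matches, value).

    Each step is matched non-greedily inside the text chosen by the
    previous step; [n] is a 0-based index into that step's matches.
    Raises IndexError when a step matches nothing, like the original.
    """
    text = xml_text
    matches = []
    for step in path.split('/'):
        if not step:
            continue
        name, index = step, 0
        if '[' in step and step.endswith(']'):
            name, _, rest = step.partition('[')
            index = int(rest[:-1])
        pattern = '<%s>(.*?)</%s>' % (re.escape(name), re.escape(name))
        # re.DOTALL (an addition) lets node content span several lines
        matches = re.findall(pattern, text, re.DOTALL)
        text = matches[index]
        # Unwrap a CDATA section if the node content is one
        if text.startswith('<![CDATA[') and text.endswith(']]>'):
            text = text[len('<![CDATA['):-3]
    return matches, text

# Well-formed variant of the article's sample (an assumption for this sketch)
sample = (
    '<resultObject>'
    '<a><product_id>aaaaa</product_id>'
    '<product_name><![CDATA[bbbbb]]></product_name></a>'
    '<b><product_id>bbbbb</product_id>'
    '<product_name><![CDATA[bbbbb]]></product_name></b>'
    '</resultObject>'
)

print(find_node(sample, '/product_id'))      # (['aaaaa', 'bbbbb'], 'aaaaa')
print(find_node(sample, '/product_id[1]'))   # (['aaaaa', 'bbbbb'], 'bbbbb')
print(find_node(sample, '/a/product_id'))    # (['aaaaa'], 'aaaaa')
```

The three calls return the same lists and strings as the Python 2 run shown above.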

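Because the returned XML is fairly simple and has no nodes with attributes, it is easy to handle. Testing still turned up one bug, though: when a node is nested inside another node of the same name, the regular-expression match goes wrong, because the non-greedy pattern stops at the first closing tag it meets. This can be worked around by keeping the names of such nested nodes out of the xpath; otherwise the only option is to rewrite the matching with a more complex mechanism.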
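
The nesting problem, and one possible shape for a "more complex mechanism", can be sketched as follows (Python 3). The helper `find_balanced` is hypothetical and not part of the original script; it counts opening and closing tags instead of relying on a non-greedy regex, and it assumes attribute-free tags and no tag-like text inside CDATA sections.

```python
import re

nested = '<item><item>inner</item>tail</item>'

# The non-greedy pattern stops at the first closing tag,
# so the outer node is cut short.
naive = re.findall('<item>(.*?)</item>', nested)
print(naive)  # ['<item>inner']

def find_balanced(text, name):
    """Return the content of each outermost <name>...</name> pair."""
    open_tag = '<%s>' % name
    close_tag = '</%s>' % name
    token = re.compile('%s|%s' % (re.escape(open_tag), re.escape(close_tag)))
    results, depth, start = [], 0, 0
    for m in token.finditer(text):
        if m.group() == open_tag:
            if depth == 0:
                start = m.end()   # content begins after the outermost open tag
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:        # matching close of the outermost open tag
                results.append(text[start:m.start()])
    return results

print(find_balanced(nested, 'item'))  # ['<item>inner</item>tail']
```

Swapping something like this in for `re.findall` would keep the rest of the path walk unchanged, at the cost of more code than the simple interface tests described here may need.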

I hope this article is helpful for your Python programming.
