Parsing an HTML page with Scala XML support: filtering out data

Last posted: 2013-11-06 12:57:01


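Question

I scrape a website and receive an HTML page.

The page contains some tables whose rows have the form

(actor -> role)

For example:

(actor = Jason Priestley -> role = Brandon Walsh)

Sometimes a row is missing either the "actor" or the "role"

(the row has 1 column where 2 are expected)

Sample document: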

<div id="90210">
      <h2 style="margin:0 0 2px 0">beverly hills 90210</h2>
      <table class="actors">
        <tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
        <tr><td class="actor">Shannen Doherty</td></tr>
        <tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
      </table>
</div>

I am having trouble filtering out the rows that have only 1 column:

My code:

  def beverlyHillsParser(page: xml.NodeSeq) : Map[String, String] = {
    // Attributes are selected with the "@" prefix; a bare name selects child elements.
    val beverlyHillsData = page \\ "div" find ((node: xml.Node) => (node \ "@id").text == "90210")
    beverlyHillsData match {
      case Some(data) => {
        // Keeps rows whose serialized markup mentions both "actor" and "role" (the part in question).
        val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )
        val actors = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "actor") map { _.text }
        val roles  = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "role")  map {_.text}
        (actors zip roles).toMap
      }
      case None => Map()
    }
  }

The main point of concern is:

val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )

How can I filter out the bad rows more precisely (without _.toString())?

Any suggestions?

xml scala xml-parsing pattern-matching web-scraping
Answer

You could use

def actorWithRole(n: Node) = n \\ "@class" xml_sameElements(List("actor", "role"))

val goodRows = data \\ "tr" filter actorWithRole
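
To make the idea concrete, here is a minimal sketch of a row filter that inspects the class attributes of a row's cells instead of its serialized text. It is a variant of the filter above, not the answer's exact code: the object name RowFilterSketch and the helper cellClasses are hypothetical, and it assumes the scala-xml library is on the classpath (it shipped with the standard distribution at the time of the post and is a separate module in later Scala versions).

```scala
import scala.xml.{Elem, Node, NodeSeq}

object RowFilterSketch {
  // Table taken from the question; the middle row has no "role" cell.
  val table: Elem =
    <table class="actors">
      <tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
      <tr><td class="actor">Shannen Doherty</td></tr>
      <tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
    </table>

  // Hypothetical helper: class attribute values of the row's direct <td>
  // children, in document order ("" for a cell without a class attribute).
  def cellClasses(row: Node): Seq[String] =
    (row \ "td").map(td => (td \ "@class").text)

  // A row is kept only when its cells are exactly an actor cell followed by a role cell.
  def actorWithRole(row: Node): Boolean =
    cellClasses(row) == Seq("actor", "role")

  val goodRows: NodeSeq = (table \ "tr").filter(actorWithRole)
}
```

Like the filter above, this check is order-sensitive: a row whose cells appear as role then actor would be rejected.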

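I would also change the data extraction so that each actor/role pair is kept intact. I need a bit more time to work out a solution.

My suggestion is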

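import scala.xml.Node

def beverlyHillsParser(page: xml.NodeSeq) : Map[String, String] = {

  // whereId is not defined in the post; this definition is an assumed one.
  def whereId(id: String): Node => Boolean = n => (n \ "@id").text == id

  // True when the class attributes found in the row are exactly "actor" then "role".
  def actorWithRole(n: Node) = n \\ "@class" xml_sameElements(List("actor", "role"))

  // Only called on rows that passed actorWithRole, so two cells are expected.
  def rowToEntry(r: Node) =
    r \ "td" map (_.text) match {
      case Seq(actor, role) => (actor -> role)
    }

  val beverlyHillsData = page \\ "div" find whereId("90210")

  beverlyHillsData match {
    case Some(data) => {
      val goodRows = data \\ "tr" filter actorWithRole
      val entries = goodRows map (row => rowToEntry(row))
      entries.toMap
    }
    case None => Map()
  }
}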
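
As an alternative to filtering first and extracting afterwards, the sketch below pairs each actor with its role in a single pass by looking up both cells as Option values, so incomplete rows simply drop out. This is an illustration under assumptions, not the answer's method: the names BeverlyHillsSketch, hasId and cell are hypothetical, and the scala-xml library is assumed to be on the classpath.

```scala
import scala.xml.{Elem, Node, NodeSeq}

object BeverlyHillsSketch {
  // Hypothetical helper: predicate that matches nodes by their id attribute.
  def hasId(id: String): Node => Boolean = n => (n \ "@id").text == id

  // Hypothetical helper: text of the first direct <td> child with the given class, if any.
  def cell(row: Node, cls: String): Option[String] =
    (row \ "td").find(td => (td \ "@class").text == cls).map(_.text)

  def beverlyHillsParser(page: NodeSeq): Map[String, String] = {
    // All rows under the div with id "90210" (empty when no such div exists).
    val rows = (page \\ "div").filter(hasId("90210")) \\ "tr"
    rows.toList.flatMap { row =>
      // Yields a pair only when both cells are present in the row.
      for {
        actor <- cell(row, "actor")
        role  <- cell(row, "role")
      } yield actor -> role
    }.toMap
  }

  // Sample document from the question.
  val sample: Elem =
    <div id="90210">
      <h2 style="margin:0 0 2px 0">beverly hills 90210</h2>
      <table class="actors">
        <tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
        <tr><td class="actor">Shannen Doherty</td></tr>
        <tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
      </table>
    </div>
}
```

Calling BeverlyHillsSketch.beverlyHillsParser(BeverlyHillsSketch.sample) returns the two complete pairs and skips the Shannen Doherty row. Unlike the order-sensitive class comparison, this lookup does not depend on the actor cell preceding the role cell.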