onlyforEcho posted on 2013-2-7 15:12:57

webharvest & xpath tips

webharvest:

1. Get a web page's source converted to XML
<html-to-xml><http url="${sys.fullUrl(rooturl,nexturl)}" charset="ISO-8859-1"/>
</html-to-xml>

or just get the raw HTML

<http url="${sys.fullUrl(rooturl,nexturl)}" charset="ISO-8859-1"/>

2. SimpleDateFormat patterns (HH = 24-hour clock, hh = 12-hour clock used together with the AM/PM marker a)
EEE, dd MMM yyyy HH:mm:ss Z

dd-MM-yyyy hh:mm a

3.<template>${sys.fullUrl(rooturl,commenter_name)}</template>
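
Tips 2 and 3 can be tried outside Web-Harvest with plain Java. This is only a rough sketch: the class name HarvestTipsDemo, the helper names, and the example.com URLs are made up for illustration, and URI.resolve is used as a stand-in for what sys.fullUrl appears to do (turn a page URL plus a link into a full URL), not as its actual implementation. The date helper uses HH for the timestamp without an AM/PM marker and hh for the one with it.

```java
import java.net.URI;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

public class HarvestTipsDemo {

    // Hypothetical stand-in for sys.fullUrl(pageUrl, link):
    // resolve a (possibly relative) link against the page it was found on.
    static String fullUrl(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    // Parse text with a SimpleDateFormat pattern and return epoch milliseconds.
    // zoneId is only used when the text itself carries no zone offset.
    static long toEpochMillis(String pattern, String text, String zoneId) {
        // Locale.ENGLISH so that day/month names and AM/PM parse the same on any machine.
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ENGLISH);
        fmt.setTimeZone(TimeZone.getTimeZone(zoneId));
        try {
            return fmt.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fullUrl("http://example.com/forum/index.php", "thread.php?id=1"));
        System.out.println(toEpochMillis("EEE, dd MMM yyyy HH:mm:ss Z",
                "Thu, 07 Feb 2013 15:12:57 +0800", "UTC"));
        System.out.println(toEpochMillis("dd-MM-yyyy hh:mm a",
                "07-02-2013 03:12 PM", "GMT+08:00"));
    }
}
```

Passing an explicit Locale matters in practice: on a non-English default locale, "Thu" and "Feb" would otherwise fail to parse.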

XPATH

1.data((//font[@class='subject']))

2.//td[@class='tablerow' and @valign='top' and @style='height: 80px; width: 82%']/font

3.a[contains(., '1')]
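
A quick way to check expressions like these is the JDK's built-in javax.xml.xpath. Treat this as a sketch under assumptions: the class XPathDemo and the SAMPLE markup are invented for the test, the JDK engine only speaks XPath 1.0 (so data() is not available and the string value of the first matching node is returned instead), and the third query assumes the tip is meant as a contains() text match on a link.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathDemo {

    // Made-up, already well-formed markup, similar to what html-to-xml would produce.
    static final String SAMPLE =
        "<table>"
      + "<tr><td class='tablerow' valign='top'><font class='subject'>Hello</font></td>"
      + "<td class='tablerow' valign='bottom'><font>Other</font></td></tr>"
      + "<tr><td><a href='p1'>1</a><a href='p2'>2</a></td></tr>"
      + "</table>";

    // Evaluate an XPath 1.0 expression and return the string value
    // of the first matching node (empty string if nothing matches).
    static String eval(String xml, String expr) {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            return xp.evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        // Select by a single attribute.
        System.out.println(eval(SAMPLE, "//font[@class='subject']"));
        // Combine several attribute tests with 'and'.
        System.out.println(eval(SAMPLE, "//td[@class='tablerow' and @valign='top']/font"));
        // Pick the link whose text contains '1' and read its href.
        System.out.println(eval(SAMPLE, "//a[contains(., '1')]/@href"));
    }
}
```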


Regular Expression

<content>([\\w\\W]*?)</content>

<post>(.*?)</post>

/\\d{4}/\\d{1,2}/\\d{1,2}/ <!-- such as /2009/12/3/ -->
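
The patterns above are written with doubled backslashes, i.e. the way they look inside a Java string literal. Here is a small sketch with java.util.regex to show the difference between [\\w\\W]*? (crosses line breaks) and .*? (stops at them). The class name RegexDemo, the helper names, and the sample inputs are made up, and the first pattern is assumed to end with a closing </content> tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {

    // Return capture group 1 of the first match, or null if there is no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Return the whole text of the first match, or null if there is no match.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // [\w\W] matches any character including newlines; *? keeps the match
        // lazy so it stops at the first closing tag.
        System.out.println(firstGroup("<content>([\\w\\W]*?)</content>",
                "<content>line1\nline2</content><content>x</content>"));

        // '.' does not match a newline by default, so this only works
        // when the element sits on a single line.
        System.out.println(firstGroup("<post>(.*?)</post>",
                "<post>first</post><post>second</post>"));

        // Date-like path segment such as /2009/12/3/
        System.out.println(firstMatch("/\\d{4}/\\d{1,2}/\\d{1,2}/",
                "http://example.com/2009/12/3/post.html"));
    }
}
```

If the <post> body can span several lines, either use the [\\w\\W] form or compile the pattern with Pattern.DOTALL.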