Best way to screen-scrape a web site.

1. Use wget to get the HTML file. See examples of using wget:

2. Convert the HTML to XML (well-formed XHTML) using "tidy":

tidy -asxhtml -numeric < oldpage.html > newpage.xml

3. Use XPath / XSLT to interpret the content of the XML, and recursively invoke wget again as needed.
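
Step 3 could be sketched roughly as follows. This is only an illustration, not the one true way: it swaps in Python's standard-library ElementTree (which supports a limited XPath subset) in place of a full XPath/XSLT processor, and the names extract_links and SAMPLE and the example.com URLs are made up for the example. It assumes the input is well-formed XHTML such as tidy produces; the -numeric option helps here because numeric entities can be read by a plain XML parser, while named ones like &nbsp; cannot without the DTD.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# XHTML elements live in this namespace, so XPath queries must use a prefix for it.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_links(xhtml_text, base_url):
    """Return absolute URLs for every <a href="..."> in an XHTML document.

    Hypothetical helper for illustration; base_url is the address the page
    was fetched from, used to resolve relative links.
    """
    root = ET.fromstring(xhtml_text)
    links = []
    # ".//x:a[@href]" selects all anchor elements that carry an href attribute.
    for anchor in root.iterfind(".//x:a[@href]", XHTML_NS):
        links.append(urljoin(base_url, anchor.get("href")))
    return links


# Stand-in for the contents of newpage.xml (assumed sample data).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample</title></head>
  <body>
    <p><a href="page2.html">Next</a> and <a href="/about">About</a></p>
    <a name="top">no href here</a>
  </body>
</html>"""

if __name__ == "__main__":
    for url in extract_links(SAMPLE, "http://example.com/dir/index.html"):
        # Each URL printed here is a candidate for the next wget call.
        print(url)
```

Each URL returned can be passed back to wget (step 1) and the result run through tidy again (step 2), which gives the recursion described in step 3. Keeping a set of URLs already visited would stop the loop from fetching the same page twice.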