I have a web page with a bunch of links. I want to write a script that dumps all the data contained in those links into a local file.
Has anybody done that with PHP? General guidelines & gotchas would suffice as an answer.
Meh. Don't parse HTML with regexes.
Here's a DOM version inspired by Tatu's:
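
    <?php
    function crawl_page($url, $depth = 5)
    {
        // Remember every URL already visited so no page is fetched twice.
        static $seen = array();
        if (isset($seen[$url]) || $depth === 0) {
            return;
        }
        $seen[$url] = true;

        $dom = new DOMDocument('1.0');
        // Suppress the warnings libxml emits for malformed real-world HTML.
        @$dom->loadHTMLFile($url);

        $anchors = $dom->getElementsByTagName('a');
        foreach ($anchors as $element) {
            $href = $element->getAttribute('href');
            if (0 !== strpos($href, 'http')) {
                // Relative link: rebuild an absolute URL from the current page's URL.
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    // The base URL may have no path at all; avoid a doubled slash.
                    $href .= rtrim(dirname($parts['path'] ?? '/'), '/\\') . $path;
                }
            }
            crawl_page($href, $depth - 1);
        }
        echo "URL:", $url, PHP_EOL, "CONTENT:", PHP_EOL, $dom->saveHTML(), PHP_EOL, PHP_EOL;
    }
    crawl_page("http://hobodave.com", 2);

Edit: I fixed some bugs from Tatu's version (works with relative URLs now).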
Edit: I added a new bit of functionality that prevents it from following the same URL twice.
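One common way to avoid following the same URL twice is a static lookup array keyed by URL. The snippet below is only a minimal sketch of that idea; `visit_once` is a hypothetical helper name used for illustration, not something from the answer itself.

```php
<?php
// Hypothetical helper: returns true the first time a URL is offered,
// false on every later call with the same URL.
function visit_once(string $url): bool
{
    static $seen = [];          // persists across calls to this function
    if (isset($seen[$url])) {
        return false;           // already crawled, caller should skip it
    }
    $seen[$url] = true;         // mark as visited
    return true;
}

var_dump(visit_once('http://example.com/a')); // first visit
var_dump(visit_once('http://example.com/a')); // repeat visit
```

Keying the array by URL makes the check a constant-time `isset` lookup rather than a scan through a list.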
Edit: Echoing output to STDOUT now so you can redirect it to whatever file you want.
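If you would rather write the file from inside PHP than redirect STDOUT in the shell, output buffering is one option. This is a rough sketch under assumptions: `emit_page` is a hypothetical stand-in for whatever echoes the page records, and the file name `crawl_dump.txt` in the system temp directory is an arbitrary choice.

```php
<?php
// Hypothetical stand-in for the crawler's output step: echoes one page record.
function emit_page(string $url, string $html): void
{
    echo 'URL:', $url, PHP_EOL, 'CONTENT:', PHP_EOL, $html, PHP_EOL, PHP_EOL;
}

// Assumed output location; any writable path would do.
$file = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'crawl_dump.txt';

ob_start();                                    // capture everything echoed from here on
emit_page('http://example.com/', '<p>hi</p>');
file_put_contents($file, ob_get_clean());      // stop capturing and write the text to the file
```

For a large crawl, buffering everything in memory may not be ideal; appending each record with `file_put_contents($file, $record, FILE_APPEND)` would be an alternative.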
Edit: Fixed a bug pointed out by George in his answer. Relative URLs will no longer append to the end of the URL path, but overwrite it. Thanks to George for this. Note that George's answer doesn't account for any of: https, user, pass, or port. If you have the http PECL extension loaded, this is quite simply done using http_build_url. Otherwise, I have to manually glue the URL together using parse_url. Thanks again, George.
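The manual gluing with parse_url can be sketched as a standalone function. `build_absolute_url` is a hypothetical name, and the example URLs are made up. It keeps the scheme, user, pass, and port of the base URL and replaces the last path segment with the relative link. It is a simplification, not full RFC 3986 resolution: it ignores `../` segments, query strings, and fragments, and it treats a leading `/` in the link the same as no slash.

```php
<?php
// Hypothetical helper: rebuild an absolute URL from a base page URL and a
// relative href, using only parse_url().
function build_absolute_url(string $base, string $href): string
{
    $parts = parse_url($base);

    $url = $parts['scheme'] . '://';
    if (isset($parts['user'], $parts['pass'])) {
        $url .= $parts['user'] . ':' . $parts['pass'] . '@';
    }
    $url .= $parts['host'];
    if (isset($parts['port'])) {
        $url .= ':' . $parts['port'];
    }

    // Directory of the base path; the base URL may have no path at all.
    $dir = rtrim(dirname($parts['path'] ?? '/'), '/\\');

    // Overwrite the last path segment with the relative href.
    return $url . $dir . '/' . ltrim($href, '/');
}

echo build_absolute_url('https://user:secret@example.com:8080/docs/index.html', 'page2.html'), PHP_EOL;
// prints https://user:secret@example.com:8080/docs/page2.html
```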