Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I have a web page with a bunch of links. I want to write a script that dumps all the data contained in the pages behind those links into a local file.

What are the best ways to crawl a website with PHP?

Has anybody done that with PHP? General guidelines and gotchas would suffice as an answer.

Meh. Don't parse HTML with regexes.

Here's a DOM version inspired by Tatu's:

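```php
<?php
function crawl_page($url, $depth)
{
    // Remember visited URLs so the same page is never fetched twice.
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }
    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    // Suppress the warnings that malformed real-world HTML triggers.
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            // Relative link: rebuild an absolute URL from the current page's URL.
            $path = '/' . ltrim($href, '/');
            // Check for the function itself, since not every version of the
            // http extension provides http_build_url().
            if (function_exists('http_build_url')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                // parse_url() omits 'path' for URLs such as http://example.com,
                // and dirname('/') is '/', so trim to avoid a doubled slash.
                $dir = isset($parts['path']) ? rtrim(dirname($parts['path']), '/\\') : '';
                $href .= $dir . $path;
            }
        }
        crawl_page($href, $depth - 1);
    }
    echo "URL:", $url, PHP_EOL, "CONTENT:", PHP_EOL, $dom->saveHTML(), PHP_EOL, PHP_EOL;
}

crawl_page("http://hobodave.com", 2);
```

Edit: I fixed some bugs from Tatu's version (works with relative URLs now).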

Edit: I added a new bit of functionality that prevents it from following the same URL twice.
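The duplicate check can also be looked at in isolation. Below is a minimal sketch, not the code from the answer: `normalize_url()` and `mark_seen()` are hypothetical helper names, and treating a trailing slash as insignificant is an assumption that some servers will not honour.

```php
<?php
// Hypothetical helper: reduce a URL to a canonical key so that
// "http://example.com/a#top" and "http://example.com/a" count as one page.
function normalize_url(string $url): string
{
    // The fragment never reaches the server, so drop it.
    $url = explode('#', $url, 2)[0];
    // Assumption: a trailing slash does not change which page is served.
    return rtrim($url, '/');
}

// Hypothetical helper: returns true the first time a URL is offered,
// false on every repeat. $seen is the caller's visited set.
function mark_seen(string $url, array &$seen): bool
{
    $key = normalize_url($url);
    if (isset($seen[$key])) {
        return false;
    }
    $seen[$key] = true;
    return true;
}

$seen = array();
var_dump(mark_seen('http://example.com/a', $seen));     // bool(true)
var_dump(mark_seen('http://example.com/a#top', $seen)); // bool(false)
```

Keying the array by URL keeps the lookup a cheap `isset()` instead of a linear `in_array()` scan, which matters once a crawl has seen a few thousand pages.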


Edit: echoing output to STDOUT now, so you can redirect it to whatever file you want.

Edit: Fixed a bug pointed out by George in his answer. Relative URLs will no longer append to the end of the URL path, but overwrite it. Thanks to George for this. Note that George's answer doesn't account for any of: https, user, pass, or port. If you have the http PECL extension loaded, this is quite simply done using http_build_url. Otherwise, I have to manually glue the URL together using parse_url. Thanks again, George.
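For anyone who wants the manual gluing as a standalone piece, here is a rough sketch using only `parse_url`. The helper names `url_origin()` and `resolve_href()` are hypothetical and not part of the answer above. It is deliberately simplified: it assumes the base URL is absolute, and it does not collapse `../` segments or handle protocol-relative (`//host/...`) or fragment-only links.

```php
<?php
// Hypothetical helper: rebuild "scheme://user:pass@host:port" from
// the array returned by parse_url().
function url_origin(array $parts): string
{
    $origin = $parts['scheme'] . '://';
    if (isset($parts['user'], $parts['pass'])) {
        $origin .= $parts['user'] . ':' . $parts['pass'] . '@';
    }
    $origin .= $parts['host'];
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }
    return $origin;
}

// Hypothetical helper: resolve $href against the page URL $base.
function resolve_href(string $base, string $href): string
{
    // Anything with its own scheme (http:, https:, mailto:, ...) is left alone.
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href;
    }
    $parts = parse_url($base);
    if (strpos($href, '/') === 0) {
        // Root-relative link: overwrites the whole path.
        return url_origin($parts) . $href;
    }
    // Document-relative link: overwrites only the last path segment.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return url_origin($parts) . $dir . $href;
}

echo resolve_href('http://user:pw@example.com:8080/a/b.html', 'c.html'), PHP_EOL;
// http://user:pw@example.com:8080/a/c.html
echo resolve_href('https://example.com', '/about'), PHP_EOL;
// https://example.com/about
```

Splitting the origin from the path makes the "overwrite, don't append" rule explicit: the origin is always kept, and only the path portion is replaced.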
