Please refer to W3Schools Tutorials if you want khổng lồ know more about HTML tags, id and class.
Bạn đang xem: Github
What will you learn in this tutorial?How to install SimpleHTMLDOMHow lớn Scrape data from trang web using SimpleHTMLDOMHow to store data khổng lồ xml fileHow khổng lồ automate script using crontab
1. How khổng lồ install Simple HTML Dom Parser:
To start with, download Simple HTML Dom Parser from this LINK.
Next, extract zip file Simplehtmldom_1_5.zip & what you will have is a thư mục called “simple_dom”.
2. How lớn Scrape data from trang web using PHP with Simple HTML DOM
Now we come khổng lồ the application part of the process. Let’s get down to scraping the IMDB website to lớn extract the đánh giá of the movie “Avengers: Infinity War”. You can get it here.
Step 1: Create a new PHP tệp tin called scraper.php và include the library mentioned below:
To create a new PHP file, create a new thư mục called “simple_dom” & include “simple_html_dom.php” file at the top.
Why movie đánh giá and rating matter is because these can be used khổng lồ create the necessary database for sentiment analysis, text classification etc.
Since there are countless đánh giá in a website like IMDB, it is not possible lớn get all the review by mere copy-paste.
With the help of website scraping, you can get all the nhận xét in an automatic fashion and save it in xml file.Rating stars – The users’ rating stars of the film.Title of nhận xét – The title of the users’ review.Review – The nội dung of the review.
Here’s how all these fields are arranged. Take a look at the screenshot:
Step 2: Extract the html returned content from the website.
What you need to vì is use file_get_html function lớn get HTML page of the URL.
URL = https://www.imdb.com/title/tt4154756/reviews?ref_=tt_ov_rt .
require_once ‘simple_html_dom.php’;//get html nội dung from the site.$dom = file_get_html(‘https://www.imdb.com/title/tt4154756/reviews?ref_=tt_ql_3‘, false);
Step 3: Scrape the fields of the reviews
Now the fun starts. We will make use of the HTML tag & scrape the data items mentioned earlier, lượt thích rating stars, title of the review and nhận xét with the help of Inspect element.
This is how you can find out the class of the tag with the help of following step:
Go to chrome browser => open this url => vày right click => inspect element
NOTE: If you don’t use chrome browser, go through this article
Next, we will scrape the requisite information from HTML based on css selectors like class, id etc. Now let’s get the css class for title, review and rating stars. All you got to vày is right click on title and select “Inspect” or “Inspect Element”.
As you can see, the css class “review-container” is applied to lớn all
tags which contain titles, rating stars and reviews of users. This will be useful in the process of filtering the field from the rest of the other nội dung in the response object:
Next, we will scrape all those fields with the help of that class & a for each loop, as is shown below:
//collect all user’s review into an array$answer = array();if(!empty($dom)) $divClass = $title = ”;$i = 0;foreach($dom->find(“.review-container”) as $divClass) //titleforeach($divClass->find(“.title”) as $title ) $answer<$i><‘title’> = $title->plaintext;//ipl-ratings-barforeach($divClass->find(“.ipl-ratings-bar”) as $ipl_ratings_bar ) $answer<$i><‘rate’> = trim($ipl_ratings_bar->plaintext);//contentforeach($divClass->find(‘div
Output:As you can observe in the screenshot, we could scrape the title (title of review), rate (rating stars) and nội dung (reviews) in array.
Step 4: Store data into xml tệp tin using “SimpleXMLElement”The next step is khổng lồ store the output in an xml file. So all we need to bởi is to convert “$answer” array into xml element.In order to bởi vì that, we will make use of “SimpleXMLElement” built-in class khổng lồ convert PHP array into xml element.
//function definition to convert array to xmlfunction array_to_xml($array, &$xml_user_info) foreach($array as $key => $value) if(is_array($value)) $subnode = $xml_user_info->addChild(“Review$key”);foreach ($value as $k=>$v) $xml_user_info->addChild(“$k”, $v);else $xml_user_info->addChild(“$key”,htmlspecialchars(“$value”));return $xml_user_info->asXML();//creating object of SimpleXMLElement$xml_user_info = new SimpleXMLElement(“”);//function call to convert array lớn xml và return whole xml content with tag$xmlContent = array_to_xml($answer,$xml_user_info);⦁ We created an object of SimpleXMLElement and then placed that object into a user defined function called “array_to_xml”.“$xml_user_info” = It is an object of SimpleXMLElement“array_to_xml” = It is a user defined function“$xmlContent” = It is a variable where the data is stored in array format
Step 5: Create an xml file và write xml nội dung to xml fileNext I created a tệp tin called “AvengersMovieReview.xml” và stored “$xmlContent” into this file.
// Create a xml file$my_file = ‘AvengersMovieReview.xml’;$handle = fopen($my_file, ‘w’) or die(‘Cannot open file: ‘.$my_file);//success và error message based on xml creationif(fwrite($handle, $xmlContent)) echo ‘XML tệp tin have been generated successfully.’;elseecho ‘XML tệp tin generation error.’;?>At the end of it all, run the whole code and reviews the output và created xml file AvengersMovieReview.xml.See the screenshot of the output and that is the file.
And we completed scraping the data that we needed. Wasn’t it easy to scrape the web data using PHP?
The last bit that you should know: here’s the explanation for Linux basis regarding how to schedule & run this task in the background at regular breaks và in an automatic fashion with the help of Crontab command.
Automating Script Using Crontab
As you would know, Linux vps can help you in automatize certain functions and completing the tasks which otherwise require human intervention. As far as Linux servers are concerned, cron utility is something that people prefer in order khổng lồ automate the way scripts run. For your needs of large data on a daily basis, it can be useful.
Cron is something works well on Linux & Unix environments that take care of scheduled commands which are also called cron jobs configured by the crontab command.
As regards a Linux pc, you can use this script khổng lồ run it at a specified time of the day with the help of the command “crontab-e”. If you wish khổng lồ access more information on crontab, read it here: https://www.tutorialspoint.com/unix_commands/crontab.htm
Web scraping has turned into a compulsion for businesses. If you want lớn carry out market research, you need data. If you want lớn devise your sales strategy, you need data. If you want to generate leads for your business, you need data. In all possible crucial aspects of business strategy và operation, web scraping can enormously contribute by automating extraction of data.
If you want to scrape large amounts of data for your specific needs, you may encounter the following challenges:You may get blocked.You may find it difficult lớn scrape data from a dynamic websiteYou may be stuck up dealing with pages scrolling on & on.
Thank heavens, there is a highly efficient và reliable web scraping service lượt thích darkedeneurope.com to tackle all these challenges and provide you the data you want.