Scraping Huge Data on Electronic Tools
This job involved extracting a huge amount of data from ToolsTop and saving it to MySQL, CSV and a spreadsheet. Although this website did not have much security, for better performance of my scraping script I still applied an anti-blocking strategy: I simply routed the scraping, data extraction, data mining and data harvesting through a proxy.
Here I will show a partial sample report of that data mining.
I scraped the following data fields:
-product_name
-list_price_exclude_vat
-list_price_include_vat
-our_price_exclude_vat
-our_price_include_vat
-your_savings
-product_code
-product_images (image source links and the downloaded images)
This partial report has been generated based on the following link:
http://www.toolstop.co.uk/cordless-drills-b210
After that I generated a CSV file from this product list.
The script also downloaded the product images.
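The CSV step can be sketched roughly as follows. This is a minimal sketch, not the original script: the helper name write_products_csv(), the output path and the sample row are assumptions made up for illustration, and only the column names come from the field list above.

```php
<?php
// Column order follows the field list above.
$columns = array(
    'product_name',
    'list_price_exclude_vat',
    'list_price_include_vat',
    'our_price_exclude_vat',
    'our_price_include_vat',
    'your_savings',
    'product_code',
    'product_images',
);

/**
 * Write scraped product rows to a CSV file (hypothetical helper).
 * Each row is an associative array keyed by column name;
 * missing keys are written as empty cells.
 */
function write_products_csv($path, array $columns, array $rows)
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        return false;
    }
    fputcsv($handle, $columns, ',', '"', '\\'); // header line
    foreach ($rows as $row) {
        $line = array();
        foreach ($columns as $column) {
            $line[] = isset($row[$column]) ? $row[$column] : '';
        }
        fputcsv($handle, $line, ',', '"', '\\');
    }
    fclose($handle);
    return true;
}

// Made-up sample row, for illustration only (not real ToolsTop data).
$rows = array(
    array(
        'product_name'           => 'Example Cordless Drill',
        'list_price_exclude_vat' => '100.00',
        'list_price_include_vat' => '120.00',
        'product_code'           => 'EX-001',
    ),
);
$csv_path = sys_get_temp_dir() . '/products_sample.csv';
write_products_csv($csv_path, $columns, $rows);
```

Keeping the column list in one array means the header line and every data row always share the same order, even when a product is missing some fields.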
Parse Data Using Proxy to Avoid Blockage
Some websites have advanced security that monitors abnormal traffic or hits. Basically it tracks nonhuman behavior, such as rapid clicks or crawling, on their network or website. Eventually they block the IP address or network that the suspicious hits come from.
As a professional website scraper you should be able to adapt your scraping technique and parse the data without getting your own IP or network blocked.
To be continued … sometime this week … I promise!
Okay! I am back.
I am exploring this using the PHP Simple HTML DOM Parser and its library.
So at the beginning, include this library:
/* PHP script: scrape / parse using a proxy */
include('simple_html_dom.php');
$url = 'the url you want to parse';
/* Connecting via proxy */
$via_proxy = array
(
    'http' => array
    (
        /* replace with your own proxy address and port */
        'proxy' => 'tcp://proxy_address:proxy_port',
        'request_fulluri' => true,
    ),
);
$via_proxy = stream_context_create($via_proxy);
$html = file_get_html($url, false, $via_proxy);
/* EO Proxy */
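If you call several proxies from one script, the context setup can be wrapped in a small helper. This is only a sketch: proxy_context() is a hypothetical name, and the timeout value is my own assumption, not something the original script sets.

```php
<?php
/**
 * Build a stream context that routes HTTP requests through a proxy.
 * $proxy is "host:port"; proxy_context() is a hypothetical helper name.
 */
function proxy_context($proxy, $timeout = 15)
{
    return stream_context_create(array(
        'http' => array(
            'proxy'           => 'tcp://' . $proxy,
            'request_fulluri' => true,     // send the full URL in the request line
            'timeout'         => $timeout, // seconds (assumed value)
        ),
    ));
}

$context = proxy_context('95.65.100.24:80');
$options = stream_context_get_options($context);
echo $options['http']['proxy'], "\n"; // prints tcp://95.65.100.24:80
```

The returned context can then be passed as the third argument of file_get_html(), exactly like $via_proxy above.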
Now we need to consider two vital issues in this scraping / parsing technique:
#1. Valid, good proxy addresses and ports.
#2. Is the scrape script really working via the proxy?
There are thousands of websites that publish proxy addresses, but I preferred to use XROXY.COM and port 80; you may have different sources and preferences.
I took mine from http://www.xroxy.com/proxy-port-80.htm
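For issue #1, a quick format check can weed out obviously broken entries before you try them. This is a rough sketch assuming the list is a plain array of "IPv4:port" strings copied by hand; is_valid_proxy_address() is a hypothetical helper, and it checks the format only, not whether the proxy is alive.

```php
<?php
/**
 * Return true when $proxy looks like "IPv4:port" with a port in 1..65535.
 * Format check only; it does not contact the proxy.
 */
function is_valid_proxy_address($proxy)
{
    $parts = explode(':', $proxy);
    if (count($parts) !== 2) {
        return false;
    }
    list($ip, $port) = $parts;
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return false;
    }
    if (preg_match('/^[0-9]{1,5}$/', $port) !== 1) {
        return false;
    }
    return (int) $port >= 1 && (int) $port <= 65535;
}

// Sample candidates: only the first one is well formed.
$candidates = array(
    '95.65.100.24:80',
    '300.1.1.1:80',       // invalid IPv4 octet
    '10.0.0.1',           // port missing
    '95.65.100.24:99999', // port out of range
);
$usable = array_values(array_filter($candidates, 'is_valid_proxy_address'));
print_r($usable);
```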
Let's check now whether this proxy is really working:
<?php
/* PHP script: validate that scraping / parsing really goes through the proxy */
include('simple_html_dom.php');
error_reporting(0);
$url = 'http://www.find-ip-address.org/'; /* This website reports your own IP or your gateway IP */
/* Connecting via proxy */
$via_proxy = array
(
    'http' => array
    (
        'proxy' => 'tcp://95.65.100.24:80',
        'request_fulluri' => true,
    ),
);
$via_proxy = stream_context_create($via_proxy);
$html = file_get_html($url, false, $via_proxy);
/* EO Proxy */
/* errors are silenced above, so stop with a message if the request failed */
if (!$html) {
    die('Request via proxy failed');
}
echo $html->outertext;
?>
On the output page you will see: My Ip Address: 95.65.100.24
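Instead of reading the output page by eye, the same check can be done in code. This is a minimal sketch under my own assumptions: proxy_ip_reported() is a hypothetical helper, and the sample string stands in for the page text (for example $html->outertext) so the snippet runs without a network connection.

```php
<?php
/**
 * Check whether the page text reports the proxy's IP address.
 * $pageText is the HTML or plain text returned by the IP-check site;
 * $proxy is "host:port". Hypothetical helper.
 */
function proxy_ip_reported($pageText, $proxy)
{
    $parts = explode(':', $proxy);
    $proxyIp = $parts[0];
    // Collect every IPv4-looking token on the page.
    preg_match_all('/\b(?:\d{1,3}\.){3}\d{1,3}\b/', $pageText, $matches);
    return in_array($proxyIp, $matches[0], true);
}

// Stand-in for the real page text.
$sample = '<p>My Ip Address: 95.65.100.24</p>';
var_dump(proxy_ip_reported($sample, '95.65.100.24:80')); // bool(true)
var_dump(proxy_ip_reported($sample, '10.0.0.1:80'));     // bool(false)
```

Matching whole IP tokens rather than using a plain substring search avoids a false positive when one address is a prefix of another, such as 95.65.100.2 and 95.65.100.24.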
Note: this is just a very simple first exploration, as I promised. No worries guys, I will post a more complex approach when I get some free hours 😉 …