
Posts Tagged ‘rotate proxies’

Scraping a Huge Data Set of Electronic Tools


This job was a huge data extraction from ToolsTop, saving the results into MySQL, CSV and a spreadsheet. Although this website did not have much security, I did apply an anti-blocking strategy for better performance of my scraping script: I simply routed the scraper's requests through a proxy.

Here I will show a partial sample report from that data mining job.

I scraped the following data fields:

-product_name
-list_price_exclude_vat
-list_price_include_vat
-our_price_exclude_vat
-our_price_include_vat
-your_savings
-product_code
-product_images (images source link and images downloaded)

This partial report has been generated based on the following link:

http://www.toolstop.co.uk/cordless-drills-b210

Electronics product scraped and data mining

Website Page

 

After that I generated a CSV file from this product list.

Report in CSV
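For readers who want to reproduce the CSV step, here is a minimal sketch using PHP's built-in fputcsv(). The column names come from the field list above; the function name write_products_csv() and the sample row are my own placeholders, not the original script or real scraped data.

```php
<?php
// Column names taken from the field list in this post.
$columns = array(
    'product_name', 'list_price_exclude_vat', 'list_price_include_vat',
    'our_price_exclude_vat', 'our_price_include_vat', 'your_savings',
    'product_code', 'product_images',
);

/* Hypothetical helper: write a header line plus one line per product. */
function write_products_csv($path, array $columns, array $products)
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        return false;
    }
    fputcsv($handle, $columns, ',', '"', '\\');
    foreach ($products as $product) {
        $row = array();
        foreach ($columns as $column) {
            // Fields missing from a product become empty cells.
            $row[] = isset($product[$column]) ? $product[$column] : '';
        }
        fputcsv($handle, $row, ',', '"', '\\');
    }
    fclose($handle);
    return true;
}

// Placeholder row for illustration only.
$sample = array(array('product_name' => 'Sample Drill', 'product_code' => 'X1'));
$csv_path = tempnam(sys_get_temp_dir(), 'products');
write_products_csv($csv_path, $columns, $sample);
echo file_get_contents($csv_path);
```

Keeping the column order in one array means the header and every row stay aligned even when a product page lacks a field.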


Product images were also downloaded by the script:

Downloaded Automatically Product Images by Script
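The image download can be sketched roughly like this. It is only an illustration under my own assumptions: local_image_name() and download_image() are hypothetical helper names, and the original script may well have done it differently.

```php
<?php
/* Hypothetical helper: derive a local file name from an image URL. */
function local_image_name($image_url)
{
    $path = parse_url($image_url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        return null;
    }
    $name = basename($path);
    return $name === '' ? null : $name;
}

/* Hypothetical helper: fetch one image and save it into $target_dir. */
function download_image($image_url, $target_dir)
{
    $name = local_image_name($image_url);
    if ($name === null) {
        return false;
    }
    $data = @file_get_contents($image_url);
    if ($data === false) {
        return false; // request failed, skip this image
    }
    $target = rtrim($target_dir, '/') . '/' . $name;
    return file_put_contents($target, $data) !== false ? $target : false;
}

// The query string is ignored when building the file name.
echo local_image_name('http://example.com/images/drill-01.jpg?size=large'), "\n";
```

The same image source link that goes into the product_images field can be passed to download_image(), so the report and the downloaded files stay in step.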


Parse Data Using Proxy to Avoid Blockage


Some websites have advanced security that monitors abnormal traffic and hits. Basically, it tracks non-human behavior such as rapid repeated clicks or crawling on their network/website. Eventually they block the IP address or network that the suspicious hits come from.

As a professional website scraper, you should be able to adapt your scraping technique and parse the data without getting your own IP or network blocked.

To be continued... sometime this week, I promise!

Okay, I am back.

I am exploring this with PHP Simple HTML DOM Parser, using its library.

So at the beginning, include this library:

<?php
/* PHP script: scrape and parse using a proxy */

include('simple_html_dom.php');

$url = 'the url you wanted to parse'; // placeholder: the page to scrape

/* Connecting via proxy */
$via_proxy = array(
    'http' => array(
        // placeholder: replace with a real proxy host and port
        'proxy' => 'tcp://proxy_address:proxy_port',
        // send the full URL in the request line, as HTTP proxies expect
        'request_fulluri' => true,
    ),
);

$via_proxy = stream_context_create($via_proxy);

$html = file_get_html($url, false, $via_proxy);
/* EO Proxy */
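Since this post is filed under rotating proxies, here is a small sketch of how the same context could be built for a different proxy on each request. The helper names proxy_context_options() and pick_proxy(), the 10 second timeout and the 192.0.2.x addresses (a documentation-only range) are all my assumptions for illustration; the tcp:// prefix follows the form the PHP manual shows for the http context proxy option.

```php
<?php
/* Hypothetical helper: stream context options for one "host:port" proxy. */
function proxy_context_options($proxy)
{
    return array(
        'http' => array(
            'proxy' => 'tcp://' . $proxy,
            'request_fulluri' => true,
            'timeout' => 10, // assumed value: give up on slow proxies
        ),
    );
}

/* Round-robin: request number $n uses proxy $n modulo the pool size. */
function pick_proxy(array $proxies, $request_number)
{
    return $proxies[$request_number % count($proxies)];
}

// Placeholder pool, not real proxies.
$proxies = array('192.0.2.10:80', '192.0.2.11:80', '192.0.2.12:80');

for ($i = 0; $i < 4; $i++) {
    $options = proxy_context_options(pick_proxy($proxies, $i));
    $context = stream_context_create($options);
    // $html = file_get_html($url, false, $context); // same call as in the snippet above
    echo $options['http']['proxy'], "\n";
}
```

With three proxies in the pool, the fourth request wraps around to the first proxy again, so no single address carries all the hits.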

Now we need to consider two vital issues in this scraping/parsing technique:

#1. Valid, good proxy addresses and ports

#2. Is the scrape script really working via the proxy?

There are thousands of websites that publish proxy addresses, but I prefer to use XROXY.COM and port 80; you may have different sources and preferences.

I took mine from http://www.xroxy.com/proxy-port-80.htm
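Entries copied from a public proxy list are often malformed, so a quick format check before use can save some failed requests. The sketch below assumes entries look like "ip:port" with an IPv4 address; parse_proxy_entry() is a hypothetical helper, and passing this check says nothing about whether the proxy is actually alive.

```php
<?php
/* Hypothetical helper: parse "ip:port" and return null if it is malformed. */
function parse_proxy_entry($entry)
{
    $parts = explode(':', trim($entry));
    if (count($parts) !== 2) {
        return null;
    }
    list($ip, $port) = $parts;
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return null;
    }
    if (preg_match('/^[0-9]{1,5}$/', $port) !== 1) {
        return null;
    }
    $port = (int) $port;
    if ($port < 1 || $port > 65535) {
        return null;
    }
    return array('ip' => $ip, 'port' => $port);
}

// Keep only the well-formed entries from a copied list.
$raw = array('95.65.100.24:80', 'not-a-proxy', '95.65.100.24:99999');
$valid = array_values(array_filter(array_map('parse_proxy_entry', $raw)));
echo count($valid), "\n";
```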

Let’s check now – is this proxy really working:

<?php

/* PHP script: validate that scraping and parsing really go through the proxy */

include('simple_html_dom.php');

error_reporting(0);

$url = 'http://www.find-ip-address.org/'; /* This website reports back your own IP or your gateway IP */

/* Connecting via proxy */
$via_proxy = array(
    'http' => array(
        'proxy' => 'tcp://95.65.100.24:80',
        'request_fulluri' => true,
    ),
);

$via_proxy = stream_context_create($via_proxy);

$html = file_get_html($url, false, $via_proxy);
/* EO Proxy */

/* Recent versions of the library return false when the request fails, e.g. with a dead proxy */
if (!$html) {
    die('Could not load the page through the proxy.');
}

echo $html->outertext;

?>

On the output page you will see: My Ip Address: 95.65.100.24

Note: this is just a very simple way and an initial exploration, as I promised. No worries guys, I will post a more complex way when I get some free hours 😉
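Instead of reading the output page by eye, the script could check for the proxy IP itself. A minimal sketch, assuming the page text contains the visible IP somewhere; response_shows_ip() is a hypothetical helper of mine, not part of Simple HTML DOM.

```php
<?php
/* Hypothetical helper: true if $page_text contains exactly $expected_ip. */
function response_shows_ip($page_text, $expected_ip)
{
    // The lookarounds stop 95.65.100.24 from matching inside 95.65.100.245.
    $pattern = '/(?<![0-9.])' . preg_quote($expected_ip, '/') . '(?![0-9])/';
    return preg_match($pattern, $page_text) === 1;
}

// In the real script the text would come from the parsed page, e.g. $html->plaintext.
$page_text = 'My Ip Address: 95.65.100.24';
echo response_shows_ip($page_text, '95.65.100.24') ? "proxy in use\n" : "proxy NOT in use\n";
```

If the check fails, the request most likely went out from your own IP, and that proxy should be dropped from the list.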