Web scraping has been widely adopted in data science. Combined with other data mining techniques, it plays an important role in areas such as weather reporting, market pricing, and sentiment analysis.
Compared with other ways of acquiring data, web scraping has several advantages from a data science perspective. First, it allows data scientists and researchers to process and use the unstructured data on the web; most web data is organized in a way that is not naturally suited to automated processing by computers. Second, by using web scraping to extract and pre-process web data, data scientists can harness the wisdom of the crowd, feeding that data into later-stage model building and other data mining techniques. Third, not all websites provide a well-designed API for fetching their data. With web scraping, data scientists can collect data according to their own requirements without depending on API support from the website. Last but not least, web pages are designed to be read by people, so data scientists usually have a good understanding of the data before they start crawling it.
Web scraping also comes with many challenges. One technical challenge arises from the nature of HTML: because HTML can represent the same content in an essentially unlimited number of ways, we often have to write a separate script for each page we want to extract data from. A second problem is that when a web page changes frequently, the script that scrapes it has to be updated to match. These two issues make web scraping hard to scale. In addition, because the data is scraped from multiple sources, different websites may represent the same data differently, which makes it hard to transform and normalize. There is also a performance challenge: the performance of a web crawler depends not only on the scraping program itself but also on factors such as network speed and limits imposed by the hosting servers. Finally, non-technical issues such as legal and privacy concerns can restrict how web scraping is applied.
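To make the normalization problem concrete, here is a minimal sketch in Python. The price strings are hypothetical examples of how different sites might display the same value, and the helper `normalize_price` is an illustrative name, not part of any library. It assumes US-style formatting (comma as thousands separator, dot as decimal point) and ignores currency conversion entirely.

```python
import re
from decimal import Decimal

# Hypothetical raw strings: the same price as three different sites might show it.
RAW_PRICES = ["$1,299.00", "1299 USD", "USD 1,299"]


def normalize_price(text):
    """Return the first number found in text as a Decimal.

    Assumes commas are thousands separators and a dot marks the decimals.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return Decimal(match.group(0).replace(",", ""))


# All three representations collapse to the same numeric value.
normalized = [normalize_price(p) for p in RAW_PRICES]
```

Even this small case needs an explicit assumption about number formatting, which hints at why normalizing data across many sites takes real effort.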
To address these challenges, we need to explore current solutions as well as sound methodologies for web scraping. Most web scraping programs are written in a scripting language such as Python, Ruby, or Perl, so mastering one of these languages is the first step. There are also packages and libraries written specifically for the problems mentioned above. For example, Beautiful Soup in Python and Nokogiri in Ruby are good tools for parsing web data, and Mechanize provides a way to simulate a browser from within an application.
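As a rough illustration of what parsing web data involves, the sketch below pulls links out of an HTML fragment using only `html.parser` from Python's standard library. It is not Beautiful Soup, which offers a higher-level interface for the same kind of task. The sample HTML and the names `LinkExtractor` and `extract_links` are made up for this example, and a real scraper would first download the page.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would fetch this over HTTP.
SAMPLE_HTML = """
<ul>
  <li><a href="/forecast/monday">Monday: 18 C</a></li>
  <li><a href="/forecast/tuesday">Tuesday: 21 C</a></li>
</ul>
"""


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from anchor tags that have an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently being read, if any
        self._text = []     # text pieces seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return the links it contains."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_links(SAMPLE_HTML)
```

The extractor is tied to one particular markup pattern (data inside `<a>` tags), so a page that presents the same forecast in a table would need different handling. That is the scaling problem described earlier.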
By exploring different tools and methods, we can unlock the value of the massive amount of data on the web.