Based on the discussions of web scraping in our previous blogs, we believe that the future of web scraping is very promising, for the following reasons. First, the spread of the Internet and web technology has made massive amounts of data accessible on the web. Since one characteristic of “Big Data” is the use of both structured and, perhaps more importantly, unstructured data [1], web scraping is a popular tool for capturing unstructured data on the Internet. In addition, a wide range of web scraping vendors exists, such as Mozenda and import.io. Individuals can also create customized scraping programs using open source programming languages such as Python, R and Ruby. Start-up companies can use cheap data from web crawling without a significant investment in purchasing external data [2].
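To give a rough sense of how small a customized scraping program can be, here is a minimal Python sketch that uses only the standard library to pull link targets out of an HTML page. The `LinkCollector` class, the `extract_links` helper, and the sample page are hypothetical names made up for this illustration. A real scraper would download the page over HTTP and would probably rely on a dedicated parsing library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page, inlined so the sketch runs without network access.
SAMPLE_PAGE = """
<html><body>
  <h1>Listings</h1>
  <a href="/item/1">First item</a>
  <a href="/item/2">Second item</a>
  <a name="anchor-only">No link here</a>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_PAGE))
```

The same pattern extends to other tags or attributes by adding branches to `handle_starttag`.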
Yet numerous challenges exist in web scraping technology. Since anyone equipped with web scraping skills can crawl information from the Internet, the same pages end up being scraped many times over, which leads to significant redundancy in scraping effort and an unnecessary waste of manpower. Also, in order to retrieve the information they need from the web, people often have to crawl extensively but use only a small subset of the scraped content. As such, collaborative scraping could be a future trend: one web crawler would be engaged in broad scraping, and other parties would then retrieve the crawled data via an API. Furthermore, although web scraping has the potential to harvest unstructured data, most techniques still focus on text retrieval rather than multimedia file processing. In addition, from an infrastructure perspective, as websites become increasingly complex, web scraping places more demands on computers’ processing capacity [3].
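The collaborative scraping idea can be sketched roughly as follows, under some loud assumptions: the shared service is reduced to an in-memory Python object, and `SharedCrawlStore`, its `crawl` and `get` methods, and `fake_fetch` are all hypothetical names invented for this illustration. The point is only that each page is downloaded once by the crawler, while any number of consumers read the stored copy.

```python
class SharedCrawlStore:
    """In-memory stand-in for a shared crawl service (illustrative only)."""

    def __init__(self, fetch):
        self._fetch = fetch     # function that downloads one page
        self._pages = {}        # url -> content that was already crawled
        self.fetch_count = 0    # number of real downloads performed

    def crawl(self, urls):
        """Run by the single crawler: download each page at most once."""
        for url in urls:
            if url not in self._pages:
                self._pages[url] = self._fetch(url)
                self.fetch_count += 1

    def get(self, url):
        """Run by consumers: read stored content, or None if never crawled."""
        return self._pages.get(url)


def fake_fetch(url):
    # Hypothetical stand-in for an HTTP download.
    return f"<html>content of {url}</html>"


if __name__ == "__main__":
    store = SharedCrawlStore(fake_fetch)
    store.crawl(["page-a", "page-b", "page-c"])

    # Three consumers read overlapping pages without triggering new downloads.
    for wanted in (["page-a", "page-b"], ["page-b", "page-c"], ["page-a"]):
        for url in wanted:
            store.get(url)

    print(store.fetch_count)
```

In this sketch five consumer reads are served from three downloads. In a real deployment the store would sit behind a network API and persist its data.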
Web scraping has also led to privacy and legal concerns. For instance, in 2000 eBay filed a lawsuit against a bidding system that crawled data from the eBay website to gain bidding insights. Similarly, a major US airline sued a travel agency for scraping its flight pricing data and providing it to that airline’s competitors. Both examples demonstrate how unintended users are able to take advantage of a company’s business via web scraping and disrupt its normal business operations. However, most web scraping related legal cases indicate that although companies can include a “do not scrape” clause in the terms of use for their websites, this clause alone is not legally effective unless users explicitly agree to the terms [2]. If companies find a way to restrict users from web scraping, most likely through a technological solution rather than a legal one, commercial use of web scraping may decline.
[1] United Nations Global Pulse. “Big Data for Development: Opportunities & Challenges.” United Nations Global Pulse, May 2012. Web. 18 April 2014. <http://www.unglobalpulse.org/sites/default/files/BigDataforDevelopment-UNGlobalPulseJune2012.pdf>.
[2] Distil. “Is web scraping illegal? Depends on what the meaning of the word is is.” Distil, June 2013. Web. 18 April 2014.
[3] Shestakov, Denis. “Current challenges in web crawling.” 13th International Conference on Web Engineering. August 2013. Web. 18 April 2014. <http://www.slideshare.net/denshe/icwe13-tutorial-webcrawling>.