Web scraping: future

18 Friday Apr 2014

Posted by weiweiliu23 in 02_Web scraping, Uncategorized

Tags

collaborative web crawling, privacy, unstructured data

Based on the discussions of web scraping in our previous posts, we believe that the future of web scraping is very promising, for the following reasons. First, the spread of the Internet and web technology has made a massive amount of data accessible on the web. Since one of the characteristics of “Big Data” is the use of both structured and, perhaps more importantly, unstructured data [1], web scraping is a popular tool for capturing unstructured data on the Internet. In addition, a wide range of web scraping vendors exists, such as Mozenda and import.io. Individuals can also create customized scraping programs using open source programming languages such as Python, R and Ruby. Start-up companies can use cheap data from web crawling without making a significant investment in purchasing external data [2].

Yet numerous challenges exist in web scraping technology. Since anyone equipped with web scraping skills can crawl information from the Internet, scraping effort is duplicated across parties, which wastes manpower. Also, in order to retrieve the necessary information from the web, people often need to crawl extensively but use only a small subset of the scraped content. As such, collaborative scraping could be a future trend: one web crawler would carry out the broad scraping, and other parties would retrieve the crawled data via an API. Furthermore, although web scraping has the potential to harvest unstructured data, most techniques still focus on text retrieval rather than multimedia file processing. In addition, from an infrastructure perspective, as websites become increasingly complex, web scraping places more demands on computers’ processing capacity [3].

Web scraping has also led to privacy concerns. For instance, in 2000 eBay filed a lawsuit against a bidding system that crawled data from the eBay website to gain bidding insights. Similarly, a major US airline sued a travel agency for scraping its flight pricing data and providing it to that airline’s competitors. Both examples demonstrate how unintended users can take advantage of a company’s business via web scraping and disrupt its normal operations. However, most web scraping related legal cases indicate that although companies can include a “do not scrape” clause in the terms of use for their websites, this clause alone is not legally effective unless users explicitly agree to the terms [2]. If companies find a way to restrict users from web scraping, most likely via a technological solution rather than a legal one, commercial use of web scraping might decline.

[1] United Nations Global Pulse. “Big Data for Development: Opportunities & Challenges.” United Nations Global Pulse, May 2012. Web. 18 April 2014. <http://www.unglobalpulse.org/sites/default/files/BigDataforDevelopment-UNGlobalPulseJune2012.pdf>.

[2] Distil. “Is web scraping illegal? Depends on what the meaning of the word is is.” Distil, June 2013. Web. 18 April 2014. <http://www.distilnetworks.com/is-web-scraping-illegal-depends-on-what-the-meaning-of-the-word-is-is/>.

[3] Shestakov, Denis. “Current challenges in web crawling.” 13th International Conference on Web Engineering. August 2013. Web. 18 April 2014. <http://www.slideshare.net/denshe/icwe13-tutorial-webcrawling>.

Web scraping: Technique

11 Friday Apr 2014

Posted by weiweiliu23 in 02_Web scraping, Uncategorized

Tags

api, html

Web scraping has been widely adopted in the field of data science. Combined with other data mining techniques, it plays an important role in areas such as weather reporting, market pricing and sentiment analysis.

Compared with other ways of acquiring data, web scraping has many advantages from the perspective of data science. First, web scraping allows data scientists and researchers to process and use the unstructured data on the web. Most data on the web is organized in an unstructured way that is not naturally suited to automated processing by computers. Second, by using web scraping to extract and pre-process web data, data scientists can harness the wisdom of the crowd, incorporating the web data into later-stage model building and other data mining techniques. Third, not all websites provide a well-designed API for end users to fetch data. With web scraping, data scientists can get data tailored to their own requirements without worrying about missing API support from the website. Last but not least, web pages are designed to be read by people, so data scientists already understand the data before they crawl it.

There are also many challenges related to web scraping. One technical challenge arises from the nature of HTML. Since web pages are written in HTML, and HTML can represent the same content in an unlimited number of ways, we have to write a separate script for each page to extract its data. Another problem is that when a web page is updated frequently, the script that scrapes it has to be updated correspondingly. These two issues make web scraping techniques hard to scale. Besides that, since the data is scraped from multiple sources, different websites may represent the same data differently, so it is not easy to transform and normalize the data. There is also a performance challenge. The performance of a web crawler is affected not only by the scraping program itself, but also by factors such as network speed and limits imposed by the web hosting servers. There are also non-technical aspects, such as legal and privacy issues, that can influence the application of web scraping.
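To make the HTML and normalization challenges concrete, here is a minimal sketch using only Python's standard library. The two pages, the site names and the helper names (`TextByAttr`, `extract`, `normalize_price`, `SITE_RULES`) are all made up for illustration; the point is only that each site needs its own extraction rule, while a shared normalizer brings the results into one format.

```python
import re
from html.parser import HTMLParser

# Two hypothetical pages showing the same price with different markup.
PAGE_A = '<div class="price">$1,299.00</div>'
PAGE_B = '<table><tr><td id="cost">1299 USD</td></tr></table>'


class TextByAttr(HTMLParser):
    """Collect the text inside the first tag whose attribute equals a value."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr, self.value = attr, value
        self.capturing = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        if self.text is None and dict(attrs).get(self.attr) == self.value:
            self.capturing = True

    def handle_data(self, data):
        if self.capturing:
            self.text = data.strip()
            self.capturing = False


def extract(html, attr, value):
    """Run the site-specific rule (attribute name and value) over one page."""
    parser = TextByAttr(attr, value)
    parser.feed(html)
    return parser.text


def normalize_price(raw):
    """Turn strings such as '$1,299.00' or '1299 USD' into a float."""
    return float(re.sub(r"[^\d.]", "", raw))


# One extraction rule per site (assumed values), one normalizer for all sites.
SITE_RULES = {"site_a": ("class", "price"), "site_b": ("id", "cost")}
PAGES = {"site_a": PAGE_A, "site_b": PAGE_B}

prices = {
    site: normalize_price(extract(PAGES[site], *rule))
    for site, rule in SITE_RULES.items()
}
print(prices)
```

If either site changes its markup, only its entry in `SITE_RULES` has to change, but someone still has to notice the change and update it, which is the scaling problem described above.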

To address these challenges, we need to explore cutting-edge solutions as well as the right methodologies for web scraping. Most web scraping programs are written in scripting languages such as Python, Ruby or Perl. Mastering one of these scripting languages is the first step in exploring web scraping. There are also packages and libraries written specifically for the problems mentioned above. For example, Beautiful Soup in Python and Nokogiri in Ruby are good ways to parse web data, and Mechanize provides a way for applications to simulate a browser.

Through exploration of different tools and methods, we can leverage the value of the massive amount of data on the web.

“Current challenges in web crawling.” Current challenges in web crawling. N.p., n.d.Web. 11 Apr. 2014. <http://www.slideshare.net/denshe/icwe13-tutorial-webcrawling>.

“Fetch is now a part of Connotate.” Connotate. N.p., n.d. Web. 11 Apr. 2014. <http://www.connotate.com/top-pitfalls-web-scraping-6502>.

“Hartley Brody.” Hartley Brody I Dont Need No Stinking API Web Scraping For Fun and Profit Comments. N.p., n.d. Web. 11 Apr. 2014. <http://blog.hartleybrody.com/web-scraping/>.

“Web Scraping: How advanced can web scraping or data mining be?.” Web Scraping: How advanced can web scraping or data mining be?. N.p., n.d. Web. 10 Apr. 2014. <http://www.quora.com/Web-Scraping/How-advanced-can-web-scraping-or-data-mining-be>.

“Web scraping.” Wikipedia. Wikimedia Foundation, 4 Aug. 2014. Web. 11 Apr. 2014. <http://en.wikipedia.org/wiki/Web_scraping>.

Web scraping: technology

04 Friday Apr 2014

Posted by weiweiliu23 in 02_Web scraping

Tags

Google, Mozenda, web scraping

Web scraping is a technique for fetching data from the web for users, usually on a routine basis. Any business that consumes data can potentially be an active user of web scraping, significantly reducing the manual labor of web browsing. Government agencies, educational institutions, industry players and entrepreneurs can all make use of web scraping. Google is an exemplar of incorporating web scraping into a business model: Google constantly scrapes information from other websites and stores the relevant information in its database. When end users type a particular search string into Google, the search engine returns a list of websites based on Google’s prior scraping effort. Likewise, Kayak returns the lowest flight fares by scraping ticket information from a number of online flight ticket vendors. [1]

[2]

Web scraping can be used to solve a wide range of problems, as long as users are able to derive structured information from websites. [3] For instance, real estate companies can apply this technology to collect lists of property information from other real estate websites. Websites that serve price-minded customers can gather current deals and coupon codes from other online stores. University career offices can automatically collect tailored lists of internships and job openings from websites such as Glassdoor.com, Indeed.com and company career websites. [4] Web scraping is therefore a flexible tool that can tackle various tasks for different users.

The most rudimentary technique in web scraping is simple copy and paste. However, web scraping can also be achieved through an automated process written in a programming language such as Excel VBA, R, Python or Perl. Common methods of web scraping include DOM parsing and HTML parsers. Instead of hiring programmers to write code that “hacks” other organizations’ websites, organizations can also use web scraping software. A number of web scraping tools are available, including commercial tools such as Mozenda and Import.io, and open source tools such as DEiXTo and Scrapy. [5] Users can choose their web scraping methods based on the problems they need to tackle, as well as organizational factors such as budget constraints and security policies.
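As a rough illustration of the HTML parser approach, the sketch below uses Python's built-in `html.parser` module to pull every link and its text out of a page. The sample page and the `LinkCollector` class are invented for this example, and a real scraper would first download the page rather than read it from a string.

```python
from html.parser import HTMLParser

# A made-up page standing in for a real estate listing site.
SAMPLE_HTML = """
<html><body>
  <h1>Listings</h1>
  <ul>
    <li><a href="/homes/1">Two-bedroom flat</a></li>
    <li><a href="/homes/2">Studio near campus</a></li>
  </ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Record (href, link text) for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a href=...> and </a>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
for href, text in collector.links:
    print(href, "->", text)
```

Libraries and tools such as those listed above wrap this kind of event-driven parsing in a friendlier interface, which is one reason many users prefer them to hand-written parsers.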

[6]

[1] https://www.udemy.com/learn-web-scraping-in-minutes/

[2] http://promptcloud.com/web-scraping-companies.php

[3] http://www.scrapegoat.com/faqs.php

[4] http://deixto.blogspot.com/2012/03/uses-and-applications-of-web-scraping.html

[5] http://www.kdnuggets.com/software/web-content-mining.html

[6] http://online.wsj.com/news/articles/SB10001424052748703358504575544381288117888

Web scraping: introduction

28 Friday Mar 2014

Posted by weiweiliu23 in 02_Web scraping

Tags

Introduction

Web scraping is a technique for extracting information from websites. It is often implemented in software applications that simulate human exploration of the World Wide Web. [1] The technique is widely adopted. For example, most search engines use web scraping to build a web index and to decide which pages are the most important to show. Web scraping also facilitates the transformation of unstructured data into a more structured format for further analysis.

We are interested in combining web scraping with data science for the following reasons:

First, data science helps extract meaningful information, in a particular business context, from a massive number of web pages to support decision-making. For instance, we can extract the prices of products to make purchasing decisions. Various techniques, such as the DOM tree model and regular expressions, can be used to extract the data.
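The price example can be sketched with a regular expression. The HTML snippet and the `price` class name below are assumptions made for illustration, and a pattern like this only works as long as the page keeps exactly this markup.

```python
import re

# Made-up product listing; real pages will use different markup.
SAMPLE_HTML = """
<li class="item">Laptop <span class="price">$899.99</span></li>
<li class="item">Mouse <span class="price">$19.50</span></li>
<li class="item">Monitor <span class="price">$249</span></li>
"""

# Capture the number after the dollar sign, with optional cents.
PRICE_PATTERN = re.compile(r'<span class="price">\$(\d+(?:\.\d{1,2})?)</span>')

prices = [float(p) for p in PRICE_PATTERN.findall(SAMPLE_HTML)]
cheapest = min(prices)

print(prices)
print("cheapest:", cheapest)
```

Regular expressions are quick to write for small, regular fragments like this one. For pages with nested or irregular markup, walking the DOM tree is usually the more robust choice.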

Second, data visualization helps reveal the linkages among pieces of information on web pages. Tools such as Tableau and the R package ‘shiny’ convey complex ideas to non-technical people. Scraping results are often converted to graphs, which generally present business insights in a more intuitive fashion.

Third, data science may fundamentally improve current scraping results. For example, we can use data mining to detect the best factors for deciding which URLs to scrape. The prioritization can be based on patterns in the URLs and on image sizes on web pages. In short, we look forward to the revolutionary change that data science can bring to the field of web scraping.
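A minimal sketch of URL prioritization might look like the following. The patterns and weights are hand-picked placeholders, not learned values; in the setting described above, a data mining step would estimate them from past crawls. Image size is left out to keep the example short.

```python
import heapq
import re

# Hypothetical URL patterns and weights, chosen by hand for illustration.
PATTERN_WEIGHTS = [
    (re.compile(r"/product/"), 3.0),
    (re.compile(r"/category/"), 1.5),
    (re.compile(r"\.(jpg|png|gif)$"), -2.0),
]


def score(url):
    """Sum the weights of every pattern that matches the URL."""
    return sum(w for pattern, w in PATTERN_WEIGHTS if pattern.search(url))


def prioritize(urls):
    """Return the URLs ordered from highest to lowest score."""
    # heapq is a min-heap, so store negative scores to pop the best first.
    # The index breaks ties in favor of the URL that was seen earlier.
    heap = [(-score(u), i, u) for i, u in enumerate(urls)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


urls = [
    "http://example.com/about",
    "http://example.com/product/42",
    "http://example.com/category/laptops",
    "http://example.com/product/42/photo.jpg",
]
for url in prioritize(urls):
    print(score(url), url)
```

A crawler would pop URLs from such a queue one at a time and push newly discovered links back in, so that the pages most likely to be useful are scraped first.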

[1] “Site Scrape Review.” Site Scrape Review. N.p., n.d. Web. 27 Mar. 2014. <https://sites.google.com/site/sitescraperev>.

datascienceCMU

~ Learn,Explore,Network on Data Science

Category Archives: 02_Web scraping
