Data Stream Analytics: Future Scope and Problems

18 Friday Apr 2014

Posted by richac in 18_Data stream analytics

Google Chairman Eric Schmidt has said that every two days the world produces five exabytes of data, as much as it produced from the start of humanity until 2003 ^[1]. The volume of streaming data is growing rapidly, fed by an expanding variety of interconnected machines, devices, sensors, and consumer content. Even a single source can produce a surprising amount of data. For example, Virgin Atlantic has a new fleet of highly connected planes that generate half a terabyte of data on every flight ^[2]. These volumes are expected to grow even further, producing the enormous data sets termed “Big Data”. According to Gartner’s predictions for business intelligence, by 2017 more than 50 percent of analytics implementations will make use of event data streams generated from instrumented machines, applications and/or individuals ^[3]. Managing data at this scale will pose a big challenge to current stream mining techniques, and the way results are handled today will need to evolve.

The healthcare industry is expected to see a drastic change in both its data input and the requirements placed on the decision support systems that use this data. Moving forward, expectations will rise from assisting doctors in diagnosis to tracking a patient’s state in real time. Physicians should be able to analyze a patient’s information throughout the treatment and decide the next steps accordingly. At UCLA’s school of medicine, Dr. Paul Vespa is leading research in which real-time signal streams from the brain are analyzed using IBM’s Watson Foundation to give physicians decision support for brain analysis ^[4]. This kind of real-time analysis will be expected from stream processing in healthcare.

FIGURE 1 – Healthcare Decision Support Systems ^[7]

Streaming data has the potential to radically reduce the time needed to provide business-critical information, if it can be captured, processed and analyzed in real time. To make effective use of this vast information, visual data processing needs to be seamlessly integrated with traditional streaming analysis. Stream analytics needs to shift from static, report-based solutions to user-driven, interactive visualization of information. Users should be able to access and combine information from multiple data streams and view recent results from all business functions. Currently, because of the size and recency of the data, user-friendly visualizations are hard to find. Going forward, new techniques need to be developed that incorporate easy data visualization into all stream mining software.

Our research ^{[5], [6]} suggests that the following are major areas where we can see potential problems with the current data stream mining techniques and implementations.

1. Converging real time and historical data – If the data streams from various sources grow as expected, making sense of the incoming Big Data together with the existing historical data will be a challenge. Architectures that handle both real-time and historical data will need to be developed to make efficient use of all the information at hand.

2. Situation aware data stream mining – Real-time data stream mining that can provide results in seconds using the most recent data will be in demand. Present data stream techniques modify models based on the characteristics of the incoming streams. Such systems will eventually become computationally slow as the amount of data pouring in every second increases. New methodologies should be devised that recall models built in similar situations, rather than building a new one for every scenario, in order to provide faster analysis of the current state.

3. Mobile data stream mining – Given that the number of mobile technology users has grown dramatically in the past decade, data stream mining will need to be performed on remote mobile devices. For example, we need systems that can analyze a driver’s behavior and a vehicle’s health from the incoming streams and so prevent accidents and adverse events. Such systems will face computational and connectivity challenges. Techniques that can yield accurate, fast results in such scenarios using minimal resources need to be researched.

References

[1] “http://techcrunch.com/2010/08/04/schmidt-data/”, Aug 2010, Tech Crunch

[2] “http://sandbox.macworld.com.au/news/boeing-787s-to-create-half-a-terabyte-of-data-per-flight-says-virgin-atlantic-88897/#.UzWkA8eDpEg”, March 2013, MacWorld

[3] Gartner predicts Business Intelligence, “http://www.gartner.com/newsroom/id/2637615”, Dec 2013, Gartner

[4] Using data stream analysis in brain research, “http://www.ibmbigdatahub.com/blog/using-data-stream-analysis-brain-research-ucla%E2%80%99s-school-medicine”, March 2014, IBM

[5] Advances in Data Stream Mining, “http://www.immagic.com/eLibrary/ARCHIVES/GENERAL/JOURNALS/W120101G.pdf”, Mohamed Medhat Gaber, Feb 2012

[6] Big Data Mining future challenges, “http://albertbifet.com/big-data-mining-future-challenges/”, April 2013, Albert Bifet

[7] Clinical Decision Support Systems, “http://www.philblock.info/hitkb/c/clinical_decision_support_systems_part1.html”, philblock.info

Data Stream Mining: Techniques and Challenges

11 Friday Apr 2014

Posted by richac in 18_Data stream analytics

Tags

Technique

A growing number of applications, such as real-time surveillance, sensor networks, social data and telecommunication data, are producing massive streams of data. This data needs intelligent processing and online analysis to yield useful information. The critical need to capture important statistics from such data drives the development of systems, algorithms and techniques that address the challenges of streaming data. This post discusses two major techniques used for clustering and classification of data streams, and lists some general challenges that researchers face in developing efficient systems for real-time data stream analytics.

Data Stream Clustering

The nature of data streams calls for three basic requirements in clustering algorithms: compactness of representation; fast, incremental processing of new data points; and clear and fast identification of “outliers”. One widely used clustering approach for data streams is described below.

D-Stream Clustering ^{[1] [2]} – D-Stream is a density-based clustering approach that works in two phases: it has an online component and an offline component. The overall approach is shown in figure 1. The online phase reads the multi-dimensional data from each record in the arriving data stream and places it in a density grid. The algorithm then computes the cell counts, and cells that are above a given threshold form a cluster. The offline component adjusts the clusters at regular time intervals to accommodate new data points. Grids can be divided into dense, sparse and transitional grids, which can be eliminated or updated as new information comes in.

FIGURE 1 – D-Stream Clustering ^[2]

D-Stream adopts a decaying mechanism to capture dynamic changes in the arriving streams. It makes a single pass over the data, and the impact of every data point on its cluster decreases exponentially with time. The decaying method allows the clustering to be updated efficiently. D-Stream works in a lazy fashion: only the cell that the current data point falls into is updated, and the rest are left unaltered until new data hits them.

The decay factor, together with the clustering threshold, helps identify cells that are hit only occasionally and can therefore be eliminated. D-Stream is very memory efficient because it deals with data only as it arrives and works incrementally.
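The online phase described above can be sketched in a few lines of Python. This is only a toy illustration, not the D-Stream implementation from the cited papers: the class name, the parameter names (`cell_size`, `decay`, `dense_threshold`) and their default values are all assumptions made for the example, and the offline step that groups neighboring dense cells into clusters is left out.

```python
import math
from collections import defaultdict


class DensityGrid:
    """Toy density grid with lazy exponential decay (illustrative names only)."""

    def __init__(self, cell_size=1.0, decay=0.998, dense_threshold=3.0):
        self.cell_size = cell_size              # width of a grid cell (assumed)
        self.decay = decay                      # decay factor in (0, 1) (assumed)
        self.dense_threshold = dense_threshold  # density needed to count as dense
        self.density = defaultdict(float)       # cell -> density at its last update
        self.last_update = {}                   # cell -> time of its last update

    def _cell(self, point):
        # Map a multi-dimensional point to the grid cell that contains it.
        return tuple(math.floor(x / self.cell_size) for x in point)

    def insert(self, point, t):
        # Lazy update: only the cell hit by this point is decayed and incremented.
        cell = self._cell(point)
        elapsed = t - self.last_update.get(cell, t)
        self.density[cell] = self.density[cell] * self.decay ** elapsed + 1.0
        self.last_update[cell] = t

    def current_density(self, cell, t):
        # Apply the decay that has accumulated since the cell was last hit.
        if cell not in self.last_update:
            return 0.0
        return self.density[cell] * self.decay ** (t - self.last_update[cell])

    def dense_cells(self, t):
        # Cells whose decayed density is still above the threshold at time t.
        return {c for c in self.last_update
                if self.current_density(c, t) >= self.dense_threshold}
```

Each arriving point touches exactly one cell, which is what keeps the per-record cost constant. Cells that are hit only occasionally decay below the threshold on their own and could be dropped from the dictionaries by a periodic cleanup pass.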

Data Stream Classification

Four major requirements for classification techniques on massive incoming data streams can be identified: process one example at a time and inspect it only once (at most); use a limited amount of memory; work in a limited amount of time; and be ready to predict at any point. The typical cycle of data stream classification algorithms can be seen in figure 2. One of the most widely used methods for classification of data streams is the Hoeffding tree.

FIGURE 2 – Data Stream Classification Cycle ^[5]

Hoeffding Trees, using VFDT (Very Fast Decision Tree) ^{[3] [4]} – This is a stream-based classification approach for incoming data streams. The decision tree is built incrementally by splitting nodes based on small amounts of the arriving stream. A leaf node is split once enough observations have been made to indicate that the tree needs to be expanded further to represent the arriving data. How many observations are enough is determined by the Hoeffding bound, which gives the number of instances needed to achieve a certain level of confidence (i.e. that the attribute chosen using the bound is close to the attribute that would be chosen if infinitely many examples were presented to the classifier).

A node is split only when the incoming data implies that a new conditional node is required, so the tree rules represent the current conditions of the data stream. The VFDT approach trains and tests simultaneously, so it learns from the data and makes predictions on it at the same time.

The tree is built from scratch and updated continuously as data arrives. Traditional decision tree approaches need repeated scans of the data, whereas this approach does not, which makes it suitable for real-time classification of incoming streams. The biggest advantage of using the Hoeffding bound to build decision trees is that trees built from a small amount of data are, with high probability, almost identical to the trees built using traditional approaches.
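To make the split test concrete, here is a minimal sketch of the Hoeffding bound. The bound states that after n observations of a quantity with range R, the observed mean is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean with probability 1 - delta. The function names and the default values for `value_range` and `delta` below are assumptions chosen for the example, not taken from the VFDT sources.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the observed mean of n samples is within epsilon
    of the true mean with probability 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_best_gain, n, value_range=1.0, delta=1e-7):
    """Split a leaf when the best attribute beats the runner-up by more than
    epsilon, i.e. when more data is unlikely to change the ranking."""
    epsilon = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_best_gain) > epsilon


# The same gap between two attributes may justify a split only after the
# leaf has seen enough examples, because epsilon shrinks as n grows.
for n in (100, 1000, 10000):
    print(n, round(hoeffding_bound(1.0, 1e-7, n), 4), should_split(0.30, 0.15, n))
```

For information gain with two classes the range R is 1 (log2 of the number of classes), which is why `value_range` defaults to 1.0 here. A leaf simply keeps counting until the gap between its two best attributes exceeds epsilon.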

Challenges and Limitations

With the above techniques and many more, we can identify useful patterns in extremely large incoming data streams. However, the various methodologies still have major limitations that need to be addressed, and many algorithms are being researched to overcome them. A few of the challenges for data stream analytics are listed below.

1. Concept drift – Concept drift is a phenomenon that appears in almost all kinds of data streams. It means that the statistical properties or attributes of the incoming data that a given model is trying to analyze can change over time in unpredictable ways. This is one of the biggest challenges for data stream mining, as the data is dynamic and depends on several factors that can change quickly. To overcome this problem, techniques that process the data dynamically and update the model incrementally can be most helpful.

2. Too much data – High data volume and a high rate of arrival make such streams difficult to manage. Large data streams coming from ATM transactions, call logs, Twitter tweet streams or Facebook status update streams are costly to store in memory and computationally expensive as well. Sampling and filtering such data effectively, so that all the relevant information is captured, is another daunting task for data stream mining techniques.

3. Cost of learning – The dynamic/incremental models that are used extensively for real-time data stream mining can be very costly. With so much data pouring in every second, updating the model to represent the current data can be very expensive. Analytical models can afford only constant time per data sample.

4. Requires preprocessing – The massive amount of data that arrives in dynamic streams needs heavy preprocessing to make sure the techniques/algorithms can be used efficiently to understand and draw conclusions from the data. Often the data that arrives is incomplete (with such large volumes it is fairly easy to lose a considerable amount of data to transmission issues) or imprecise (due to added noise or irrelevant details in the streams). Such preprocessing needs to be highly efficient for the mining techniques to yield good results.
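As an illustration of the sampling problem in challenge 2, one standard way to keep a uniform random sample of a stream in fixed memory is reservoir sampling. The post does not name this technique, so it is offered here only as an example. The function name and the `rng` parameter are assumptions made for the sketch.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown
    length, using O(k) memory and constant work per item."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint includes both endpoints
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 5 readings from a simulated stream of 10,000 events.
sample = reservoir_sample(range(10_000), 5, rng=random.Random(42))
print(sample)
```

Each item is inspected once and then either stored or discarded, so the approach also fits the constant-time-per-sample constraint mentioned in challenge 3.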

References –

[1] Stream Data Clustering Based on Grid Density and Attraction, “http://www.cse.wustl.edu/~ychen/public/cluster.pdf”, Dec 2008, ACM Transactions on computational logic

[2] D-Stream: Density Based Clustering for Real time Stream Data, “https://files.ifi.uzh.ch/boehlen/dis/teaching/DWDM08/proposals/d-stream/dstream.html”

[3] Hoeffding Tree for Streaming Classification, “http://www.otnira.com/2013/03/28/hoeffding-tree-for-streaming-classification/”, March 2013, Otnira Blog

[4] Very fast Decision Tree algorithm for Real-Time Data Mining, “http://www.hindawi.com/journals/ijdsn/2012/863545/”, Oct 2012, Hindawi

[5] Data Stream Mining- A Practical approach, “http://sourceforge.net/projects/moa-datastream/files/documentation/StreamMining.pdf/download?use_mirror=garr”, May 2011, COSI

Organizations and Technologies in Streaming Analytics

04 Friday Apr 2014

Posted by richac in 18_Data stream analytics

Tags

02_Technology

Companies in many industries have put streaming analytics into practice to better classify data, identify anomalies, and predict and analyze trends. Examples from two such industries are described below.

Healthcare – Health data stream analytics can play a crucial role in clinical decision support, patient safety improvement and early detection of adverse patient outcomes. New data – structured or unstructured – are pouring into the healthcare world from fitness devices, genetics and genomics, social media research and other sources. A major challenge in applying data stream analytics to this massive volume of healthcare data is tailoring query support to the clinical context. Analyzing trends in the streams can be particularly helpful in predicting extreme events such as cardiac arrest. ^[1],[2]

Telecom – There are now 6 billion mobile phone subscribers globally (United Nations Telecom Agency). Fierce competition among telecom providers forces companies to improve the overall customer experience and their operational efficiency, which increases the pressure to analyze the massive streams coming from their networks in order to detect fraud, reduce customer churn and improve service. Sprint is currently using IBM analytics to capture and interpret all network data (e.g. location data, dropped calls, service interruption, network performance, etc.). ^[3]

Given the vast range of analytics that can be executed on streaming data, companies face various challenges in trying to implement such solutions. Filtering data prior to storage and analysis to reduce its volume is not efficient enough. Companies gather information every second using sensors, mobile technology, etc. to predict weather, find optimal travel routes and perform various other functions. ^[1]

This drives the need for on-line or real time data stream analytics not just for simple predictions but other complex analytics like classification, forecasting, risk assessments, alerts, etc. Major drivers for real-time streaming analytics are excessive information that goes beyond computational storage, growing demand for fast and intelligent decision support systems and lack of available for critical decision making. ^[1]

Digital information such as photographs, music, video, emails, and blogs has increased dramatically thanks to the internet, which connects people to the social world through services from Google, Facebook, YouTube, Twitter, etc.

Twitter – Twitter is tuned towards fast communication. More than 600 million active users publish over 58 million 140-character “Tweets” every day ^[4]. Twitter plays an important role in political, social and socio-economic events. Twitter analyzes these “Tweet” streams in large volumes to identify trends and provide the most up-to-date results to its users. Twitter’s Storm (now open source) is an example of a distributed real-time computation system that Twitter uses to achieve these objectives. Storm can be used for continuous computation: running a continuous query on data streams and streaming the results out to users as they are computed ^[5]. Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm can process over a million tuples per second per node and is scalable and fault-tolerant. ^[6]

Twitter Stream Analysis ^[7]

Netflix – Netflix is another very good example of enormous data streams that need real-time processing. Netflix collects 1.5 million events per second during peak hours, or around 80 billion events per day. These events include log messages, user activity records, system operational data, or any arbitrary data that systems need to collect for business, product, and operational analysis. More recent trends in real-time stream processing pushed Netflix towards instant feedback, exploratory analysis and operational insights. Netflix adopted Suro as part of the NetflixOSS family to process such varied streams in real time and provide customers with instant recommendations. ^[8]

IBM – Streaming analytics

Collaboration between academics and industry practitioners is required, and data processing and analytical research need to be combined into a single platform. IBM is making effective contributions towards this collaboration through stream computing platforms such as IBM InfoSphere Streams. These stream processing applications identify new information by building incrementally on their models, and detect patterns that deviate from normal behavior and are therefore interesting in some way.

Distributed Stream Computation ^[10]

InfoSphere offers a complete stream computing solution to enable real-time analytic processing of data in motion. InfoSphere Streams provides a platform on which user-developed applications can quickly consume and analyze information from thousands of real-time sources by building incremental models. This IBM solution has a high flow rate and can handle millions of events/messages per second. ^[9]

Apart from the InfoSphere platform, a few of IBM’s extensive research areas are as follows. ^[10]

Systems: Design for runtime distributed stream processing for parallelization, fault-tolerance, load-shedding, etc.
Distributed Data Management: Shared state and storage support for streaming systems, graph-analytics and graph-processing systems.
Languages & Compilers: Language and compiler design for stream processing systems, including streaming languages, intermediate representations, optimizations, language translation, macro systems, etc.
Analytics: Time series analysis, signal processing, and data mining for streaming systems, with a focus on adaptive and proactive analytics. Integration of offline and online analytics.
Knowledge Representation: Automated composition of flows, goal-oriented planning in streaming systems.

References

[1] Analytics in healthcare: promise and potential, http://www.hissjournal.com/content/2/1/3, 2014, Health Information Science and Systems

[2] Towards Health Data Stream Analytics, ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5558827&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D5558827, 2014, IEEE

[3] IBM Press Release, http://www-03.ibm.com/press/us/en/pressrelease/39181.wss, Oct 2012, IBM

[4] Twitter Statistics, http://www.statisticbrain.com/twitter-statistics/, April 2014, Twitter

[5] Twitter Storm, http://www.infoq.com/news/2011/09/twitter-storm-real-time-hadoop, Sept 2011, InfoQ

[6] Storm, http://storm.incubator.apache.org/, Apache

[7] New Twitter Search Results, https://blog.twitter.com/2013/new-twitter-search-results, Feb 2013, Twitter

[8] Announcing SURO, http://techblog.netflix.com/2013/12/announcing-suro-backbone-of-netflixs.html, Dec 2013, Netflix

[9] IBM InfoSphere, http://www-01.ibm.com/software/data/infosphere/stream-computing/, IBM

[10] Stream Computing Platforms, Applications, and Analytics, http://researcher.watson.ibm.com/researcher/view_project.php?id=2531, IBM

Introduction: Data Stream Analytics

28 Friday Mar 2014

Posted by richac in 18_Data stream analytics

Tags

Introduction

Between Oscars at the 2014 Academy Awards, the event host, Ellen DeGeneres, cracked jokes and had pizza delivered to the biggest stars in Hollywood. At one point, she pulled out her smartphone and asked a few celebrities nearby, including Brad Pitt, Meryl Streep, and others, to join in taking a “selfie” that she would then share on Twitter. A few moments later, with 10+ people crowding together, the picture that would garner the most retweets ever was snapped into history. Within moments, fans and trend watchers worldwide were re-sharing the photo and commenting, so much so that Twitter went down for nearly 25 minutes amid the flurry of retweeting. A few taps on a smartphone triggered events around the world, and a deluge of data streams overwhelmed the helpless engineers at Twitter.

“If only Bradley’s arm was longer” – Ellen DeGeneres ^[1]

Traditional data analytics techniques are based on the concept of consistent, static data that is stored in reliable storage systems to be recalled later for information. In this day and age, however, data for high-profile applications is more and more dynamic and arrives in large, streaming quantities, as happened in the situation above. These large streams of data flood in from all over the world and pose big new problems for modeling and analysis. Tackling the data as a whole would require large chunks of memory for storage and processing. Hence the need for “online” solutions for processing, cleaning, crunching, and modeling these large data streams. Data streams are ordered sequences of instances that can be read only once, or a small number of times, so analytical techniques must be applied with constrained computing and storage capabilities.

Sensor data is one example of such a data stream. Suppose a sensor in the ocean uses its GPS unit to send data to its base station about the surface height of the ocean. This height varies rapidly, so the sensor might need to send back a reading every tenth of a second. If we use multiple sensors in the ocean, we will receive a large amount of streaming data every second, and it is impossible to analyze all of it at once. The continuous arrival of data also requires real-time analysis to predict patterns and other important aspects of the data. Other examples of data streams are computer network traffic, web searches, ATM transactions, etc. ^[2]. Data streams demonstrate several unique properties: infinite length, concept-drift, concept-evolution, feature-evolution and limited labeled data ^[3].

Mining data streams is concerned with extracting knowledge structures, represented as models and patterns, from continuous streams of information. Streaming analytics typically operate over fixed time windows. The goal is to use this information to predict the class or value of new instances in the data stream and to identify interesting behavior in the rapidly incoming streams. As mentioned earlier, it is difficult to deal with the huge amounts of data coming in. To work with such streams, they are usually archived in a large data store (Figure 1). It is often difficult to answer queries from the archival store; the data can be examined only under special circumstances, using time-consuming retrieval processes. Usually, there are also working stores that hold summaries or parts of the incoming streams, which can be used for real-time analysis and to produce approximate answers that are very close to the true results ^[2]. Various machine learning techniques can be used to analyze such data in an automated fashion. There are different stream processing models, such as Sliding Window, Exponential and other decay, Duplicate sensitivity, Random order streams, etc., for analyzing this data in small pieces ^[4].

FIGURE 1 ^[2]
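Two of the processing models mentioned above, the sliding window and exponential decay, are easy to sketch as small "working store" summaries. The code below is only an illustration: the class names and the parameters `size` and `alpha` are assumptions made for the example, not taken from the cited sources.

```python
from collections import deque


class SlidingWindowMean:
    """Mean of the most recent `size` readings, updated in O(1) per reading."""

    def __init__(self, size):
        if size < 1:
            raise ValueError("size must be at least 1")
        self.window = deque(maxlen=size)
        self.total = 0.0

    def update(self, x):
        if len(self.window) == self.window.maxlen:
            # The oldest reading is about to fall out of the window.
            self.total -= self.window[0]
        self.window.append(x)
        self.total += x
        return self.total / len(self.window)


class ExponentialMean:
    """Exponentially decayed mean: older readings count for less and less."""

    def __init__(self, alpha):
        self.alpha = alpha  # weight given to the newest reading, in (0, 1]
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = x
        else:
            self.value = self.alpha * x + (1 - self.alpha) * self.value
        return self.value


# Usage: summarize a short stream of ocean-height readings (made-up values).
window, decayed = SlidingWindowMean(3), ExponentialMean(0.5)
for reading in [1.0, 2.0, 3.0, 4.0]:
    print(window.update(reading), decayed.update(reading))
```

The sliding window stores only the last few readings and the decayed mean stores a single number, so both can keep up with a stream that is far too large to archive and query in full.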

The need for continuous data stream analytics is greater than ever. In the era of Twitter, Facebook, and live TV tweeting, executives and audiences want to see the immediate impact of decisions and events. Data stream analytics is a new, blossoming field with much to be figured out and learned as it evolves, particularly if celebrities continue trying to break Twitter records!

References

[1] Twitter Post, “https://twitter.com/TheEllenShow/status/440322224407314432/photo/1”, March 2014, Twitter

[2] Mining Data Streams, “http://infolab.stanford.edu/~ullman/mmds/ch4.pdf”, Stanford

[3] Data Stream Mining and its applications, “http://link.springer.com/chapter/10.1007%2F978-3-642-29035-0_33”, Latifur Khan and Wei Fan, Springer, Part of Springer Science and Business Media

[4] Fundamentals of analyzing and Mining Data Streams, “http://dimacs.rutgers.edu/~graham/pubs/slides/streammining.pdf”, Graham Cormode, Rutgers

[5] Mining Data Streams, “http://dl.acm.org/citation.cfm?doid=1083784.1083789”, Mohamed, Arkady, Shonali, 2014, ACM, Inc.

Hello from Team 10

28 Friday Mar 2014

Posted by richac in 00_Say hello, 18_Data stream analytics

Hello People

This is Team 10 – Abhinav Batra, Jacob Quinn and Richa Choudhary. We will be blogging about Data Stream Analytics through this mini.

Excited to share our findings and to learn about various other fields of Data Science. 🙂

Regards,

Team 10

datascienceCMU

~ Learn,Explore,Network on Data Science

Category Archives: 18_Data stream analytics
