A growing number of applications produce massive streams of data: real-time surveillance, sensor networks, social data, telecommunication data, and so on. This data needs intelligent processing and online analysis to extract useful information. The critical need to capture important statistics from such data drives the development of systems, algorithms and techniques that address the challenges of streaming data. This post discusses two major techniques used for clustering and classification of data streams, and also lists some generic challenges researchers face in building efficient systems for real-time data stream analytics.
Data Stream Clustering
The nature of data streams imposes three basic requirements on clustering algorithms: a compact representation, fast incremental processing of new data points, and clear, fast identification of "outliers". One extensively used clustering approach for data streams is described below.
D-Stream Clustering [1] [2] – D-Stream is a density-based clustering approach with two phases: an online component and an offline component. The overall approach is shown in figure 1. The online phase reads the multi-dimensional values from each record in the arriving data stream and maps them into a density grid. The algorithm then computes the cell counts, and cells above a given threshold form a cluster. The offline component adjusts the clusters every certain number of time steps to accommodate new data points coming through. Grid cells can be classified as dense, sparse or transitional, and are eliminated or updated as new information comes in.
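The grid mapping in the online phase can be sketched as follows. This is a minimal illustration, not the paper's implementation; the cell size and density threshold are illustrative values, and the offline merging of neighbouring dense cells is omitted.

```python
import math
from collections import defaultdict

# Illustrative parameters (not values from the D-Stream paper).
CELL_SIZE = 0.5
DENSITY_THRESHOLD = 3

grid = defaultdict(int)  # cell coordinates -> count of points that landed there

def cell_of(point):
    """Map a multi-dimensional point to its density-grid cell."""
    return tuple(math.floor(x / CELL_SIZE) for x in point)

def insert(point):
    """Online phase: place each arriving record into its grid cell."""
    grid[cell_of(point)] += 1

def dense_cells():
    """Cells above the threshold; neighbouring dense cells would be merged
    into clusters by the offline component."""
    return {c for c, count in grid.items() if count >= DENSITY_THRESHOLD}

for p in [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (5.0, 5.0)]:
    insert(p)
print(dense_cells())  # the three nearby points share one dense cell
```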
FIGURE 1 – D-Stream Clustering [2]
D-Stream adopts a decaying mechanism to capture dynamic changes in the arriving streams. It passes over each data point only once, and the impact of every data point on the clustering decreases exponentially with time. The decaying method allows the clustering to be updated efficiently. D-Stream works in a lazy fashion: only the cell relevant to the data point that just arrived is updated, and the rest are not altered until new information hits them.
The decay factor, together with the clustering threshold, helps identify cells that are hit only occasionally and can therefore be eliminated. This makes D-Stream a very memory-efficient technique, since it deals with data in real time and works incrementally.
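The lazy decay update described above can be sketched as follows, assuming the update rule from the D-Stream paper takes the form D(t) = D(t_l) · λ^(t − t_l) + 1 on a hit, where t_l is the time of the last hit. The decay factor and pruning threshold here are illustrative.

```python
DECAY = 0.98            # decay factor lambda, 0 < lambda < 1 (illustrative)
SPARSE_THRESHOLD = 0.8  # cells decayed below this can be pruned (illustrative)

class Cell:
    def __init__(self):
        self.density = 0.0
        self.last_update = 0  # time step of the last hit

    def hit(self, t):
        """Lazy update: decay the stored density up to time t, then add
        this point's contribution. Cells not hit are never touched."""
        self.density = self.density * DECAY ** (t - self.last_update) + 1.0
        self.last_update = t

    def density_at(self, t):
        """Current density without recording a hit, as used when scanning
        for sparse cells to eliminate."""
        return self.density * DECAY ** (t - self.last_update)

c = Cell()
c.hit(0)
c.hit(1)                  # density = 1 * 0.98 + 1 = 1.98
print(c.density_at(100))  # a long-unhit cell fades toward zero, prunable
```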
Data Stream Classification
Three major requirements for classifying massive incoming data streams can be identified: process one example at a time and inspect it only once (at most); use a limited amount of memory; and work in a limited amount of time while being ready to predict at any point. The typical cycle of a data stream classification algorithm can be seen in figure 2. One of the most widely used methods for classifying data streams is the Hoeffding tree.
FIGURE 2 – Data Stream Classification Cycle [5]
Hoeffding Trees, using VFDT (Very Fast Decision Tree) [3] [4] – This is a stream-based classification approach for incoming data streams. The decision tree is built incrementally by splitting nodes based on small amounts of arriving data. A leaf node of the tree is split when enough observations have been made to imply that the tree needs to expand further to represent the arriving data. How many observations are "enough" is determined by the Hoeffding bound: it decides how many instances are needed to achieve a certain level of confidence that the attribute chosen using the bound is the same attribute that would be chosen if infinitely many examples were presented to the classifier.
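The bound itself is simple to state: with probability 1 − δ, the true mean of a random variable with range R is within ε = sqrt(R² ln(1/δ) / 2n) of its mean observed over n samples. A sketch of how a VFDT-style learner would use it (the δ and R values below are illustrative, not prescribed):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the observed mean of
    n samples is within epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

R = math.log2(2)  # range of information gain for a two-class problem
delta = 1e-7      # confidence parameter (illustrative value)

# VFDT splits a leaf once the observed gain gap between the best and
# second-best attribute exceeds epsilon; the bound shrinks as examples
# accumulate, so decisions made on small samples agree (with high
# probability) with the infinite-sample choice.
for n in (100, 1000, 10000):
    print(n, hoeffding_bound(R, delta, n))
```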
A node is split only when the data streams coming through imply that a new conditional node is required, so the tree's rules represent the current conditions of the data streams. The VFDT approach runs the test and train steps simultaneously, and hence engages in classification and prediction on the data at the same time.
The tree is built from scratch and updated as data arrives continuously, unlike traditional decision tree approaches, which need repeated scans of the data; this makes it suitable for real-time classification of incoming streams. The biggest advantage of using the Hoeffding bound to build decision trees is that trees built from a small amount of data are almost identical to trees built using traditional batch approaches.
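The test-then-train cycle can be sketched as below. To keep the example self-contained, a trivial majority-class learner stands in for a real Hoeffding tree; the point is the loop structure, in which each example is used once for prediction and once for a model update, then discarded.

```python
import random
from collections import Counter

class MajorityClass:
    """Stand-in model: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1

random.seed(0)
# Synthetic stream: a feature plus a label that is "a" about 80% of the time.
stream = ((random.random(), "a" if random.random() < 0.8 else "b")
          for _ in range(1000))

model, correct, seen = MajorityClass(), 0, 0
for x, y in stream:
    if model.predict(x) == y:  # test first, so accuracy is tracked online...
        correct += 1
    model.learn(x, y)          # ...then train, touching the example only once
    seen += 1
print(correct / seen)
```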
Challenges and Limitations
Given the above techniques and many more, we can identify useful patterns in extremely large incoming data streams. However, there are still major limitations to the various methodologies, and many algorithms are being researched to overcome them. A few challenges for data stream analytics can be listed as follows.
1. Concept drift – Concept drift is a phenomenon present in almost all kinds of data streams. It means that the statistical properties or attributes of the incoming data that the model is trying to analyze can change over time in unpredictable ways. This is one of the biggest challenges for data stream mining, as the data is dynamic and depends on several factors that can change very quickly. To overcome this problem, techniques that process the data dynamically and update the model incrementally are the most helpful.
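A toy illustration of detecting such a change: compare the mean of a recent window against an older reference window and raise an alarm when they diverge by more than a margin. The window size and margin are hand-picked assumptions here; real detectors (e.g. DDM or ADWIN) replace the fixed margin with a statistical test.

```python
from collections import deque

WINDOW = 50   # illustrative window size
MARGIN = 0.3  # illustrative divergence margin

reference = deque(maxlen=WINDOW)  # frozen baseline from the start of the stream
recent = deque(maxlen=WINDOW)     # sliding window over the latest values

def update(x):
    """Feed one value; return True if drift is flagged at this point."""
    if len(reference) < WINDOW:
        reference.append(x)       # still filling the baseline
        return False
    recent.append(x)
    if len(recent) < WINDOW:
        return False
    return abs(sum(recent) / WINDOW - sum(reference) / WINDOW) > MARGIN

# The stream's mean shifts halfway through: a simple concept drift.
stream = [0.2] * 100 + [0.9] * 100
alarms = [i for i, x in enumerate(stream) if update(x)]
print(alarms[0])  # drift is flagged shortly after the shift at index 100
```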
2. Too much data – High data volume and rate of arrival make such streams difficult to manage. Large data streams from ATM transactions, call logs, Twitter tweet streams or Facebook status updates are costly to store in memory and computationally expensive to process as well. Sampling and filtering such data with effective techniques, such that all the relevant information is still captured, is another daunting task for data stream mining.
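One standard answer to the sampling problem is reservoir sampling, sketched below: it keeps a fixed-size uniform random sample of a stream of unknown length in one pass and O(k) memory. The integer stream stands in for a feed of, say, transaction records.

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Return k items drawn uniformly at random from the stream, one pass."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)     # fill the reservoir first
        else:
            j = rng.randint(0, i)      # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 10)
print(sample)  # 10 items, uniform over the million-item stream
```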
3. Cost of learning – The dynamic/incremental models extensively used for real-time data stream mining can be very costly. With an extensive amount of data pouring in every second, updating the model to represent the current data can be very expensive; analytical models can afford only constant time per data sample.
4. Requires preprocessing – The massive amount of data arriving in dynamic streams needs heavy preprocessing to make sure the techniques/algorithms are utilized efficiently to understand and draw conclusions from the data. A lot of the time the arriving data is incomplete (given the volume, it is fairly easy to lose a considerable amount of data to transmission issues) or imprecise (due to added noise or irrelevant details in the streams). Such preprocessing needs to be highly efficient for the mining techniques to yield good results.
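A minimal sketch of lightweight online preprocessing for the two assumed failure modes just mentioned: incomplete records (missing fields imputed with a running mean, so only O(1) state is kept) and imprecise ones (values clipped to a plausible range to blunt noise spikes). The value range is an illustrative assumption.

```python
LOW, HIGH = 0.0, 100.0  # assumed plausible range for the field (illustrative)

count, total = 0, 0.0   # running statistics over the clean values seen so far

def preprocess(value):
    """Clean one stream value in constant time and memory."""
    global count, total
    if value is None:                   # incomplete: impute with running mean
        return total / count if count else 0.0
    value = min(max(value, LOW), HIGH)  # imprecise: clip obvious outliers
    count += 1
    total += value
    return value

raw = [10.0, 12.0, None, 999.0, 11.0]
cleaned = [preprocess(v) for v in raw]
print(cleaned)  # the None is imputed, the 999.0 noise spike is clipped
```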
References –
[1] "Stream Data Clustering Based on Grid Density and Attraction", http://www.cse.wustl.edu/~ychen/public/cluster.pdf, Dec 2008, ACM Transactions on Computational Logic
[2] "D-Stream: Density-Based Clustering for Real-Time Stream Data", https://files.ifi.uzh.ch/boehlen/dis/teaching/DWDM08/proposals/d-stream/dstream.html
[3] "Hoeffding Tree for Streaming Classification", http://www.otnira.com/2013/03/28/hoeffding-tree-for-streaming-classification/, March 2013, Otnira Blog
[4] "Very Fast Decision Tree Algorithm for Real-Time Data Mining", http://www.hindawi.com/journals/ijdsn/2012/863545/, Oct 2012, Hindawi
[5] "Data Stream Mining: A Practical Approach", http://sourceforge.net/projects/moa-datastream/files/documentation/StreamMining.pdf/download?use_mirror=garr, May 2011, COSI