A growing number of applications produce massive streams of data: real-time surveillance, sensor networks, social data, telecommunication data, and so on. This data needs intelligent processing and online analysis to extract useful information. The critical need to capture important statistics from such data drives the development of systems, algorithms and techniques that address the challenges of streaming data. This post discusses two major techniques used for clustering and classification of data streams, and also lists some generic challenges researchers face in building efficient systems for real-time data stream analytics.
Data Stream Clustering
The nature of data streams imposes three basic requirements on clustering algorithms: a compact representation, fast incremental processing of new data points, and clear, fast identification of "outliers". One extensively used clustering approach for data streams is described below.
D-Stream Clustering [1] [2] – D-Stream is a density-based clustering approach with two phases: an online component and an offline component. The overall approach is shown in figure 1. The online phase reads the multi-dimensional values from each record in the arriving data stream and maps them into a density grid. The algorithm then computes the cell counts, and cells above a given threshold form a cluster. The offline component adjusts the clusters every certain number of time steps to accommodate new data points coming through. Grid cells can be classified as dense, sparse or transitional, and are eliminated or updated as new information comes in.
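The grid mapping in the online phase can be sketched as follows. This is a minimal illustration, not the paper's implementation; the cell size and density threshold are illustrative values, and the offline merging of neighbouring dense cells is omitted.

```python
import math
from collections import defaultdict

# Illustrative parameters (not values from the D-Stream paper).
CELL_SIZE = 0.5
DENSITY_THRESHOLD = 3

grid = defaultdict(int)  # cell coordinates -> count of points that landed there

def cell_of(point):
    """Map a multi-dimensional point to its density-grid cell."""
    return tuple(math.floor(x / CELL_SIZE) for x in point)

def insert(point):
    """Online phase: place each arriving record into its grid cell."""
    grid[cell_of(point)] += 1

def dense_cells():
    """Cells above the threshold; neighbouring dense cells would be merged
    into clusters by the offline component."""
    return {c for c, count in grid.items() if count >= DENSITY_THRESHOLD}

for p in [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (5.0, 5.0)]:
    insert(p)
print(dense_cells())  # the three nearby points share one dense cell
```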
FIGURE 1 – D-Stream Clustering [2]
D-Stream adopts a decaying mechanism to capture dynamic changes in the arriving streams. It passes over each data point only once, and the impact of every data point on the clustering decreases exponentially with time. The decaying method allows the clustering to be updated efficiently. D-Stream works in a lazy fashion: only the cell relevant to the data point that just arrived is updated, and the rest are not altered until new information hits them.
The decay factor, together with the clustering threshold, helps identify cells that are hit only occasionally and can therefore be eliminated. This makes D-Stream a very memory-efficient technique, since it deals with data in real time and works incrementally.
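The lazy decay update described above can be sketched as follows, assuming the update rule from the D-Stream paper takes the form D(t) = D(t_l) · λ^(t − t_l) + 1 on a hit, where t_l is the time of the last hit. The decay factor and pruning threshold here are illustrative.

```python
DECAY = 0.98            # decay factor lambda, 0 < lambda < 1 (illustrative)
SPARSE_THRESHOLD = 0.8  # cells decayed below this can be pruned (illustrative)

class Cell:
    def __init__(self):
        self.density = 0.0
        self.last_update = 0  # time step of the last hit

    def hit(self, t):
        """Lazy update: decay the stored density up to time t, then add
        this point's contribution. Cells not hit are never touched."""
        self.density = self.density * DECAY ** (t - self.last_update) + 1.0
        self.last_update = t

    def density_at(self, t):
        """Current density without recording a hit, as used when scanning
        for sparse cells to eliminate."""
        return self.density * DECAY ** (t - self.last_update)

c = Cell()
c.hit(0)
c.hit(1)                  # density = 1 * 0.98 + 1 = 1.98
print(c.density_at(100))  # a long-unhit cell fades toward zero, prunable
```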
Data Stream Classification
Three major requirements for classifying massive incoming data streams can be identified: process one example at a time and inspect it only once (at most); use a limited amount of memory; and work in a limited amount of time while being ready to predict at any point. The typical cycle of a data stream classification algorithm can be seen in figure 2. One of the most widely used methods for classifying data streams is the Hoeffding tree.
FIGURE 2 – Data Stream Classification Cycle [5]
Hoeffding Trees, using VFDT (Very Fast Decision Tree) [3] [4] – This is a stream-based classification approach for incoming data streams. The decision tree is built incrementally by splitting nodes based on small amounts of arriving data. A leaf node of the tree is split when enough observations have been made to imply that the tree needs to expand further to represent the arriving data. How many observations are "enough" is determined by the Hoeffding bound: it decides how many instances are needed to achieve a certain level of confidence that the attribute chosen using the bound is the same attribute that would be chosen if infinitely many examples were presented to the classifier.
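The bound itself is simple to state: with probability 1 − δ, the true mean of a random variable with range R is within ε = sqrt(R² ln(1/δ) / 2n) of its mean observed over n samples. A sketch of how a VFDT-style learner would use it (the δ and R values below are illustrative, not prescribed):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the observed mean of
    n samples is within epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

R = math.log2(2)  # range of information gain for a two-class problem
delta = 1e-7      # confidence parameter (illustrative value)

# VFDT splits a leaf once the observed gain gap between the best and
# second-best attribute exceeds epsilon; the bound shrinks as examples
# accumulate, so decisions made on small samples agree (with high
# probability) with the infinite-sample choice.
for n in (100, 1000, 10000):
    print(n, hoeffding_bound(R, delta, n))
```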
A node is split only when the data streams coming through imply that a new conditional node is required, so the tree's rules represent the current conditions of the data streams. The VFDT approach runs the test and train steps simultaneously, and hence engages in classification and prediction on the data at the same time.
The tree is built from scratch and updated as data arrives continuously, unlike traditional decision tree approaches, which need repeated scans of the data; this makes it suitable for real-time classification of incoming streams. The biggest advantage of using the Hoeffding bound to build decision trees is that trees built from a small amount of data are almost identical to trees built using traditional batch approaches.
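The test-then-train cycle can be sketched as below. To keep the example self-contained, a trivial majority-class learner stands in for a real Hoeffding tree; the point is the loop structure, in which each example is used once for prediction and once for a model update, then discarded.

```python
import random
from collections import Counter

class MajorityClass:
    """Stand-in model: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1

random.seed(0)
# Synthetic stream: a feature plus a label that is "a" about 80% of the time.
stream = ((random.random(), "a" if random.random() < 0.8 else "b")
          for _ in range(1000))

model, correct, seen = MajorityClass(), 0, 0
for x, y in stream:
    if model.predict(x) == y:  # test first, so accuracy is tracked online...
        correct += 1
    model.learn(x, y)          # ...then train, touching the example only once
    seen += 1
print(correct / seen)
```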
Challenges and Limitations
Given the above techniques and many more, we can identify useful patterns in extremely large incoming data streams. However, there are still major limitations to the various methodologies, and many algorithms are being researched to overcome them. A few challenges for data stream analytics can be listed as follows.
1. Concept drift – Concept drift is a phenomenon present in almost all kinds of data streams. It means that the statistical properties or attributes of the incoming data that the model is trying to analyze can change over time in unpredictable ways. This is one of the biggest challenges for data stream mining, as the data is dynamic and depends on several factors that can change very quickly. To overcome this problem, techniques that process the data dynamically and update the model incrementally are the most helpful.
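A toy illustration of detecting such a change: compare the mean of a recent window against an older reference window and raise an alarm when they diverge by more than a margin. The window size and margin are hand-picked assumptions here; real detectors (e.g. DDM or ADWIN) replace the fixed margin with a statistical test.

```python
from collections import deque

WINDOW = 50   # illustrative window size
MARGIN = 0.3  # illustrative divergence margin

reference = deque(maxlen=WINDOW)  # frozen baseline from the start of the stream
recent = deque(maxlen=WINDOW)     # sliding window over the latest values

def update(x):
    """Feed one value; return True if drift is flagged at this point."""
    if len(reference) < WINDOW:
        reference.append(x)       # still filling the baseline
        return False
    recent.append(x)
    if len(recent) < WINDOW:
        return False
    return abs(sum(recent) / WINDOW - sum(reference) / WINDOW) > MARGIN

# The stream's mean shifts halfway through: a simple concept drift.
stream = [0.2] * 100 + [0.9] * 100
alarms = [i for i, x in enumerate(stream) if update(x)]
print(alarms[0])  # drift is flagged shortly after the shift at index 100
```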
2. Too much data – High data volume and rate of arrival make such streams difficult to manage. Large data streams from ATM transactions, call logs, Twitter tweet streams or Facebook status updates are costly to store in memory and computationally expensive to process as well. Sampling and filtering such data with effective techniques, such that all the relevant information is still captured, is another daunting task for data stream mining.
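One standard answer to the sampling problem is reservoir sampling, sketched below: it keeps a fixed-size uniform random sample of a stream of unknown length in one pass and O(k) memory. The integer stream stands in for a feed of, say, transaction records.

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Return k items drawn uniformly at random from the stream, one pass."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)     # fill the reservoir first
        else:
            j = rng.randint(0, i)      # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 10)
print(sample)  # 10 items, uniform over the million-item stream
```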
3. Cost of learning – The dynamic/incremental models extensively used for real-time data stream mining can be very costly. With an extensive amount of data pouring in every second, updating the model to represent the current data can be very expensive; analytical models can afford only constant time per data sample.
4. Requires preprocessing – The massive amount of data arriving in dynamic streams needs heavy preprocessing to make sure the techniques/algorithms are utilized efficiently to understand and draw conclusions from the data. A lot of the time the arriving data is incomplete (given the volume, it is fairly easy to lose a considerable amount of data to transmission issues) or imprecise (due to added noise or irrelevant details in the streams). Such preprocessing needs to be highly efficient for the mining techniques to yield good results.
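A minimal sketch of lightweight online preprocessing for the two assumed failure modes just mentioned: incomplete records (missing fields imputed with a running mean, so only O(1) state is kept) and imprecise ones (values clipped to a plausible range to blunt noise spikes). The value range is an illustrative assumption.

```python
LOW, HIGH = 0.0, 100.0  # assumed plausible range for the field (illustrative)

count, total = 0, 0.0   # running statistics over the clean values seen so far

def preprocess(value):
    """Clean one stream value in constant time and memory."""
    global count, total
    if value is None:                   # incomplete: impute with running mean
        return total / count if count else 0.0
    value = min(max(value, LOW), HIGH)  # imprecise: clip obvious outliers
    count += 1
    total += value
    return value

raw = [10.0, 12.0, None, 999.0, 11.0]
cleaned = [preprocess(v) for v in raw]
print(cleaned)  # the None is imputed, the 999.0 noise spike is clipped
```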
References –
[1] "Stream Data Clustering Based on Grid Density and Attraction", http://www.cse.wustl.edu/~ychen/public/cluster.pdf, Dec 2008, ACM Transactions on Computational Logic
[2] "D-Stream: Density-Based Clustering for Real-Time Stream Data", https://files.ifi.uzh.ch/boehlen/dis/teaching/DWDM08/proposals/d-stream/dstream.html
[3] "Hoeffding Tree for Streaming Classification", http://www.otnira.com/2013/03/28/hoeffding-tree-for-streaming-classification/, March 2013, Otnira Blog
[4] "Very Fast Decision Tree Algorithm for Real-Time Data Mining", http://www.hindawi.com/journals/ijdsn/2012/863545/, Oct 2012, Hindawi
[5] "Data Stream Mining: A Practical Approach", http://sourceforge.net/projects/moa-datastream/files/documentation/StreamMining.pdf/download?use_mirror=garr, May 2011, COSI