The Future of Multi-Source Data Analysis
We believe that multi-source data analysis will become increasingly pervasive in the future, due to:
- the expansion of networking and connectivity – allowing data collection from more sources
- increasing ease and lower cost of data collection, storage and manipulation
- the increased role of machine learning and automation in decision-making. This is particularly relevant for multi-source data: in many industries, complex decisions that rely on a subjective weighing of evidence from multiple sources are still made by humans rather than computers. For instance, a UX designer weighs digital analytics, market research, and A/B testing results while creating a user-centered design. Increasingly, such arrays of multi-source data, with diverse dimensions and differing analytical requirements, will be combined into a single, easy-to-use decision-making model.
- growing reliance on simulation rather than physical experimentation; simulation frequently draws on multi-source data
- the expansion of Geographic Information Systems, which rely on the analysis of spatial data from multiple sources
- increased use of sensors and sensor algorithms to track multi-source data, generate alarm signals, and make predictions
- across the board, a higher bar for accuracy and a lower tolerance for risk as data analysis capabilities mature, especially in more established data analysis applications
- increased investment in data technologies and methods
As reliance on data grows across activities and users in both business and personal spheres, technology and analytical methods will increasingly be tailored to the rapid, continuous combination and analysis of multi-source data. In particular, we believe that entrepreneurs and new technologies will expand upon the following offerings:
- Multi-source data visualization
- Enterprise-level automation solutions for decision-making based on multi-source data
- Information systems focused on multi-source data integration
- Information systems focused on multi-source information sharing among actors within sectors such as private business, health, research, and government
- A more established set of standardized algorithms for cleaning, integrating, and fusing multi-source data
- Continued and expanded offerings of platforms for large-scale multi-source data integration and storage (e.g., Hadoop)
Throughout the literature on multi-source data analysis, we have encountered common challenges, regardless of analytical application. Some of these challenges are:
- Data integration: combining heterogeneous, dynamic, distributed, large-scale data. Heterogeneity can take many forms: different dimensions, file formats, schemas, and so on.
- Understanding large multi-source data sets: collecting dynamic data from many sensors and satellites can produce an enormous, complex collection that is difficult to understand, and running a model over it can be computationally demanding. Decisions about pruning the data sets or reducing their dimensionality require further investigation.
- Fusion: empirically determining when “early” fusion (combining raw data or features before modeling) or “late” fusion (combining the outputs of per-source models) is appropriate
- Measurement of model performance: models do not always benefit from the addition of data from new sources, so the contribution of each source must be measured
- Data visualization: Difficulty of visualizing heterogeneous, massive data.
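The integration challenge above can be illustrated with a toy sketch. In this hypothetical example (all field names and values are invented for illustration), two sources describe the same entities under different field names, formats, and units, and each is mapped into a common record layout before analysis:

```python
# Hypothetical integration sketch: map two heterogeneous sources
# (CSV-like rows and JSON records) into one shared schema.
import json

# Source 1: CSV-like rows -> (id, temperature in Fahrenheit)
csv_rows = ["s1,98.6", "s2,100.4"]

# Source 2: JSON records -> (sensor_id, temp_c in Celsius)
json_blob = '[{"sensor_id": "s1", "temp_c": 37.0}, {"sensor_id": "s3", "temp_c": 39.0}]'

def from_csv(row):
    """Parse one CSV row and convert Fahrenheit to Celsius."""
    sid, temp_f = row.split(",")
    return {"id": sid, "temp_c": round((float(temp_f) - 32) * 5 / 9, 1)}

def from_json(rec):
    """Rename JSON fields into the common schema."""
    return {"id": rec["sensor_id"], "temp_c": rec["temp_c"]}

# Integrate: group all normalized readings by entity id.
merged = {}
for rec in [from_csv(r) for r in csv_rows] + [from_json(r) for r in json.loads(json_blob)]:
    merged.setdefault(rec["id"], []).append(rec["temp_c"])

print(merged)  # {'s1': [37.0, 37.0], 's2': [38.0], 's3': [39.0]}
```

Even this tiny case shows the recurring work items: field renaming, unit conversion, and entity resolution across sources.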
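The distinction between early and late fusion can also be sketched concretely. The following minimal example (the data, the nearest-centroid model, and the scoring rule are all illustrative assumptions, not a method from the literature) classifies samples described by two sources, once by concatenating features before modeling (early fusion) and once by modeling each source separately and summing the scores (late fusion):

```python
# Hypothetical fusion sketch with a toy nearest-centroid classifier.

def centroid(rows):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def score(x, c):
    """Negative squared distance to a class centroid (higher = closer)."""
    return -sum((a - b) ** 2 for a, b in zip(x, c))

# Two sources describing the same samples, keyed by class label.
src_a = {0: [[1.0, 1.1], [0.9, 1.0]], 1: [[3.0, 3.1], [3.2, 2.9]]}
src_b = {0: [[10.0], [9.5]],          1: [[20.0], [21.0]]}

# --- Early fusion: concatenate raw features, then model once. ---
early = {k: [a + b for a, b in zip(src_a[k], src_b[k])] for k in (0, 1)}
early_centroids = {k: centroid(v) for k, v in early.items()}

def predict_early(xa, xb):
    x = xa + xb
    return max(early_centroids, key=lambda k: score(x, early_centroids[k]))

# --- Late fusion: model each source, then combine the scores. ---
cent_a = {k: centroid(v) for k, v in src_a.items()}
cent_b = {k: centroid(v) for k, v in src_b.items()}

def predict_late(xa, xb):
    return max(cent_a, key=lambda k: score(xa, cent_a[k]) + score(xb, cent_b[k]))

print(predict_early([1.0, 1.0], [10.0]))  # 0
print(predict_late([3.1, 3.0], [20.5]))   # 1
```

Which strategy wins is exactly the empirical question the challenge names: early fusion can capture cross-source interactions, while late fusion tolerates sources with very different scales and reliabilities.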
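The model-performance challenge can likewise be made concrete with a small ablation sketch. In this invented example (the labels, the per-source predictions, and the majority-vote combiner are all assumptions for illustration), adding a noisy third source lowers accuracy, showing why each source's contribution should be measured:

```python
# Hypothetical ablation sketch: does adding a new source help?

def accuracy(preds, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

labels = [0, 1, 0, 1, 0, 1]
preds_by_source = {
    "a": [0, 1, 0, 1, 1, 1],  # fairly reliable source
    "b": [0, 1, 0, 0, 0, 1],  # fairly reliable source
    "c": [1, 0, 1, 0, 1, 0],  # noisy source, anti-correlated with labels
}

def vote(sources):
    """Majority vote of per-source binary predictions (ties -> class 0)."""
    cols = zip(*(preds_by_source[s] for s in sources))
    return [1 if sum(col) * 2 > len(sources) else 0 for col in cols]

base = accuracy(vote(["a", "b"]), labels)
with_c = accuracy(vote(["a", "b", "c"]), labels)
print(f"a+b: {base:.2f}, a+b+c: {with_c:.2f}")  # a+b: 0.83, a+b+c: 0.67
```

The same compare-with-and-without pattern scales to real pipelines: evaluate the model on a held-out set before and after integrating each candidate source, and keep the source only if performance improves.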
Conclusions:
While reviewing the research literature, we have observed that commonly used data mining methods are applied to multi-source data analysis in novel, experimental ways. We believe that, beyond prompting advances in technology, growing interest in multi-source data analysis will lead to a generation of new or adapted data mining algorithms.