A great debate is raging in the industry, and it is being fanned by the adoption of "big data." The question is simple: Do we create better search techniques, or do we go all the way to text analysis for integrating unstructured data? The simple answer is “yes” to both, but there are hidden layers of complexity in that answer, which this article will attempt to explain.
Search vs. Analysis
At a fundamental level, both search and analysis engines operate on text data. Here is where the similarity ends. With search, you typically look for patterns and present the findings to the user in short order. There is no further transformation to the text. Analysis deals with the discovery of the pattern (akin to search); but, more importantly, transformations are applied to the text to create a meaningful outcome. Analysis assumes that text must be integrated and transformed before it can be analyzed. This advanced treatment of text in terms of analysis is where complexities arise, and the field – though decades rich in terms of algorithms, research and development, and published theses – continues to be nascent and niche.
The fundamental characteristic of text is best captured by one adjective: “erose” (not to be confused with “verbose”). The word, from the Latin “erosus,” means “irregularly notched, toothed, or indented” (from dictionary.com) and is used mostly in botany to describe the leaves of a plant. The reason for this attribution is that text is long, complex and unpredictable. It is a combination of words and phrases that form contextual statements, which may contain repeatable patterns (and this repeatability can itself differ based on context within a single document or text). When discussing “unstructured” data, we use this lack of repeatability and the associated ambiguity to distinguish text data analysis and its outcomes from structured data, where there is great repeatability of data and a structured, formatted storage architecture that lends itself well to integration and analytics.
Applying Search for Unstructured Integration
With the available search infrastructure and algorithms, one can reasonably ask: to integrate “unstructured data,” why not just extend search outputs? Why create a separate text analysis platform? There have been attempts at doing that, but including integration and transformation as part of search is not a good approach, for the following reasons.
- Search engines or enterprise appliances will become slow once integration and transformation are added to the normal workload. For example, let us assume that 10,000 searches have to be run against a contract database on a content management platform for every user query. Every search transaction creates operational structures and returns quick hits on a set of patterns as its output. Adding analysis-style transformation introduces great inefficiency to this exercise. The critical reason is that analysis requires creating clarity and context around the unstructured information, and both operations are complex and processing-intensive. The additional work will slow search down immensely.
- Search engines do a lot of pattern matching, metadata (taxonomy and ontology) based indexing and large-scale distributed data processing. Metadata and patterns are definitely nimble and agile techniques for transforming the minimal data required for search processing, but the same will not scale to support the complex nature of unstructured data analysis.
- Searches are designed to process patterns for every user query and are inconsistent by design. No two users will search for the same pattern at a given time. Thus, the same algorithms are replayed over and over for many types of data patterns, in operations that have a short life cycle and remain efficient despite the inconsistency of what they process.
While these are the key reasons why applying search to analyze unstructured data is not the best option, they are not the only reasons. Analysis of text requires a lot of additional processing, including spelling correction, handling of alternate spellings and synonyms, user-defined rules and much deeper processing besides.
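The additional processing mentioned above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the vocabulary, the synonym table and the `normalize` function are hypothetical, and spelling correction is approximated with the standard library's `difflib.get_close_matches` rather than a real spelling engine.

```python
import difflib
import re

# Hypothetical controlled vocabulary and synonym table (assumptions for illustration).
VOCABULARY = ["contract", "liability", "supplier", "termination", "payment"]
SYNONYMS = {"vendor": "supplier", "agreement": "contract"}


def normalize(text: str) -> list[str]:
    """Lowercase, tokenize, map synonyms, and correct near-miss spellings."""
    tokens = re.findall(r"[a-z]+", text.lower())
    normalized = []
    for token in tokens:
        # Synonyms and alternate terms map to one canonical term.
        token = SYNONYMS.get(token, token)
        if token not in VOCABULARY:
            # Spelling correction: use the closest vocabulary term, if one is close enough.
            close = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
            if close:
                token = close[0]
        normalized.append(token)
    return normalized


print(normalize("The vendor agreement covers liabilty"))
```

Even this toy version does work on every token that a search index would normally skip, which hints at why such processing does not belong inside the search path.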
Let’s look at how analysis will be different from search:
- Text analysis advances the integration of unstructured data beyond the light indexing and pattern matching of search.
- Analysis consists of multiple transformation steps, each of which needs to be run once per set of patterns, metadata terms or context.
- Analysis creates multiple iterations of metadata output, which build a powerful set of indexes within the text and its context, as opposed to simple result sets of entire pages.
- Analysis always processes data in a consistent manner, as opposed to search.
Here is a popular example found in Wikipedia under Natural Language Processing:
The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it.
- "I never said she stole my money" (stress on "I") – Someone else said it, but I didn't.
- "I never said she stole my money" (stress on "never") – I simply didn't ever say it.
- "I never said she stole my money" (stress on "said") – I might have implied it in some way, but I never explicitly said it.
- "I never said she stole my money" (stress on "she") – I said someone took it; I didn't say it was she.
- "I never said she stole my money" (stress on "stole") – I just said she probably borrowed it.
- "I never said she stole my money" (stress on "my") – I said she stole someone else's money.
- "I never said she stole my money" (stress on "money") – I said she stole something, but not my money.
Depending on which word the speaker stresses, you can see how this sentence could have several different meanings.
If you search for this pattern, you will get every matching statement, and you then have to dig for the intended meaning and interpret it yourself. If you process the text through a text analysis platform, you can create a context-oriented result set that provides not only the result but also the associated context, which is far more useful.
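To make the contrast concrete, here is a small Python sketch of the two kinds of output. The `search` and `analyze` functions, the record layout and the sample text are all assumptions made for illustration; a real platform would derive context far more carefully than by taking neighboring sentences.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def search(text: str, pattern: str) -> list[str]:
    """Search-style output: only the sentences that contain the pattern."""
    return [s for s in split_sentences(text) if pattern.lower() in s.lower()]


def analyze(text: str, pattern: str) -> list[dict]:
    """Analysis-style output: each hit plus the sentences around it as context."""
    sentences = split_sentences(text)
    records = []
    for i, sentence in enumerate(sentences):
        if pattern.lower() in sentence.lower():
            records.append({
                "hit": sentence,
                "before": sentences[i - 1] if i > 0 else None,
                "after": sentences[i + 1] if i + 1 < len(sentences) else None,
            })
    return records


# Made-up sample text, used only to exercise the two functions.
sample_text = (
    "The audit found a shortfall. I never said she stole my money. "
    "I said the records were incomplete."
)
print(search(sample_text, "stole my money"))
print(analyze(sample_text, "stole my money"))
```

The search function hands back the bare hit, while the analysis function returns a record that carries the surrounding statements along with it.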
The need to transform data before it becomes useful for analytics and reporting is not a new thought. We have always designed the data warehouse to process data in this fashion, through what we call ETL (extract, transform, load). Extending this approach to text creates a powerful concept: textual ETL.
This need for transformation and integration of text has some interesting challenges. One challenge is the size of the data to be transformed. Let us assume that you intend to take the Internet as your data set. Is it possible to transform and analyze all the text found on the Internet? In a nutshell, it is not practical or feasible. In such a situation, you primarily rely on search and can use a subset of data from the result set for deeper analysis.
But there are other data sets, such as enterprise data, that are large in volume, complex in format and spread across multiple contexts, yet lend themselves to the rigors of text analysis and processing. A simple example is the contracts that exist across different business divisions such as purchasing, supply chain, inventory management, logistics, transportation and human resources. Each of these contracts has a different purpose, and there may be many contracts of a given type that can provide insights beyond just start and end dates. Those insights include legal terms and conditions with applied context, liabilities and obligations, and much more. After analysis, such text yields rich metadata output with context that can be integrated directly into a decision-support ecosystem.
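As a rough illustration of pulling such insights out of contract text, the Python sketch below applies a few extraction rules and tags every hit with its business division. The rules, the `ContractFact` record and the sample contract wording are hypothetical; real contracts would need far richer patterns and proper context handling.

```python
import re
from dataclasses import dataclass


@dataclass
class ContractFact:
    division: str  # business context, e.g. "purchasing"
    term: str      # metadata term that was found
    value: str     # the text captured for that term


# Hypothetical extraction rules (assumptions for illustration only).
RULES = {
    "start_date": re.compile(r"effective (?:as of|from) (\d{4}-\d{2}-\d{2})", re.I),
    "end_date": re.compile(r"(?:expires|terminates) on (\d{4}-\d{2}-\d{2})", re.I),
    "liability_cap": re.compile(r"liability (?:is|shall be) limited to ([^.;]+)", re.I),
}


def extract_facts(division: str, contract_text: str) -> list[ContractFact]:
    """Apply every rule to the contract and tag each hit with its division."""
    facts = []
    for term, pattern in RULES.items():
        for match in pattern.finditer(contract_text):
            facts.append(ContractFact(division, term, match.group(1).strip()))
    return facts


# Made-up contract wording, used only to exercise the rules.
sample_contract = (
    "This agreement is effective as of 2023-01-01 and expires on 2025-12-31. "
    "Supplier liability shall be limited to the fees paid in the prior year."
)
for fact in extract_facts("purchasing", sample_contract):
    print(fact)
```

Each extracted fact is a small piece of metadata with its context attached, which is the shape of output that can later be joined to other enterprise data.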
Other challenges include the variety of formats, the volumes of text, the ambiguous nature of the data itself and the lack of formal documentation, to name a few. But once the challenges are addressed, the output of such an analysis is powerful enough to support a large visualization platform for looking into text and unstructured data within the enterprise. This is where you can leverage the data that has been stored on content management platforms for years to produce useful output on trends and behaviors.
The major differences between a result set produced by a search and one produced by a text analytics system are as follows:
Search:
- Search is oriented to process the informational needs of a single user query
- The search result set is proprietary to that user and cannot be shared
- Result set is temporary (under normal circumstances)
- Transformation rules are repeated with every query and are minimalistic
- Result set cannot be integrated with a DBMS
- Search processing cannot scale for large and complex operations – context-based search has always added significant overhead
Text analytics:
- Can be defined by users for processing with business rules, like an ETL tool
- Produces a result set that is a key-value column pair often stored in an RDBMS
- Result set can be used for further analytical processing
- Result sets can be stored as snapshots for repeated processing
- Transformation of data and associated context is repeatable in multiple passes of processing cycles
- Text of different languages for global organizations can be stored in the same result database based on metadata integrations and rules
- Text analysis can scale easily based on the infrastructure capabilities
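The idea of a key-value result set kept in a relational database, with snapshots for repeated processing, might look something like the following Python sketch. The `text_result` table, its columns and the sample rows are assumptions for illustration, and an in-memory SQLite database stands in for whatever RDBMS is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE text_result (
        snapshot_id INTEGER,  -- which processing pass produced the row
        document_id TEXT,
        term        TEXT,     -- the key, e.g. a metadata term
        value       TEXT      -- the value extracted for that term
    )
    """
)

# Hypothetical output of two analysis passes over the same document.
rows = [
    (1, "contract-001", "end_date", "2025-12-31"),
    (1, "contract-001", "division", "purchasing"),
    (2, "contract-001", "end_date", "2026-06-30"),
    (2, "contract-001", "division", "purchasing"),
]
conn.executemany("INSERT INTO text_result VALUES (?, ?, ?, ?)", rows)
conn.commit()


def values_by_snapshot(connection: sqlite3.Connection, document_id: str, term: str):
    """Return (snapshot_id, value) pairs for one term, oldest snapshot first."""
    cursor = connection.execute(
        "SELECT snapshot_id, value FROM text_result "
        "WHERE document_id = ? AND term = ? ORDER BY snapshot_id",
        (document_id, term),
    )
    return cursor.fetchall()


print(values_by_snapshot(conn, "contract-001", "end_date"))
```

Because the rows live in an ordinary table, they can be shared, joined and queried with plain SQL, which a transient per-user search result does not allow.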
Based on the discussion here, you can discern that search is good for finding things on an ad hoc basis in a large set of data. Analysis is good for creating a platform that can be used repeatedly against a large but finite amount of textual data as related to a corporation.
In order to perform text analysis and deep text mining, you need to process the text rather than extend a search engine or appliance. A robust text analysis system will provide for the following:
- Spelling correction
- Synonyms, antonyms and homonyms
- Integration with taxonomies
- Business rules integration
- Reprocessing capabilities
- Document fracturing and processing
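Two of the capabilities listed above, document fracturing and taxonomy integration, can be sketched briefly in Python. This is a simplified reading of those terms: here "fracturing" just means splitting on blank lines, and the taxonomy, `fracture`, `classify` and `process` are hypothetical names invented for the example.

```python
import re

# Hypothetical taxonomy: category -> terms that signal it (an assumption).
TAXONOMY = {
    "legal": {"liability", "indemnify", "warranty"},
    "logistics": {"shipment", "delivery", "freight"},
}


def fracture(document: str) -> list[str]:
    """Split a document into paragraph-level fragments on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def classify(fragment: str) -> list[str]:
    """Return the sorted taxonomy categories whose terms appear in the fragment."""
    words = set(re.findall(r"[a-z]+", fragment.lower()))
    return sorted(cat for cat, terms in TAXONOMY.items() if words & terms)


def process(document: str) -> list[dict]:
    """Fracture the document, then tag each fragment with taxonomy categories."""
    return [
        {"fragment_no": i, "categories": classify(frag), "text": frag}
        for i, frag in enumerate(fracture(document), start=1)
    ]


# Made-up document, used only to exercise the pipeline.
sample_document = (
    "The supplier accepts liability for defects.\n\n"
    "Delivery is due within 30 days of shipment.\n\n"
    "Both parties sign below."
)
for record in process(sample_document):
    print(record["fragment_no"], record["categories"])
```

Working fragment by fragment keeps each piece of context small enough to tag reliably, and fragments with no matching category simply come back with an empty list.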
Each of these steps allows you to process large text and create the result database for processing. This database can be used with search to create guided search and navigation, and can be extended to machine learning using a search and analysis combination platform.
The major advantage of text analysis is the ability to track changes as they occurred or occur within the text environment, in a manner similar to tracking changes in a dimension. This capability, called document mid-point reprocessing, is the most powerful output of analysis and what makes it a much better proposition than search. You can extend this concept to emails, Excel spreadsheets and other document types very easily.
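One way to picture tracking changes "as in a dimension" is the versioned-row pattern that data warehouses use for slowly changing dimensions. The Python sketch below is an assumption about how that could look for values extracted from text; it is not a description of document mid-point reprocessing itself or of any particular product, and `record_value` and the row layout are hypothetical.

```python
from datetime import date


def record_value(history: list[dict], term: str, value: str, seen_on: date) -> None:
    """Append a new version of `term` only if its value changed.

    The current row has valid_to=None; when a change arrives, that row is
    closed with the change date and a new current row is opened.
    """
    current = next(
        (row for row in history if row["term"] == term and row["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["value"] == value:
            return  # nothing changed, keep the current version
        current["valid_to"] = seen_on
    history.append(
        {"term": term, "value": value, "valid_from": seen_on, "valid_to": None}
    )


# Made-up sequence of extraction passes over the same contract.
history: list[dict] = []
record_value(history, "end_date", "2025-12-31", date(2023, 1, 1))
record_value(history, "end_date", "2025-12-31", date(2023, 6, 1))  # unchanged
record_value(history, "end_date", "2026-06-30", date(2024, 1, 1))  # amended
for row in history:
    print(row)
```

Keeping the closed rows rather than overwriting them is what lets later analysis ask what a document said at a given point in time.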
In conclusion, search and text analysis serve different purposes in processing unstructured data, and both can be leveraged effectively. Search can be used for early-stage data discovery, and text analysis can be used for detailed analysis and downstream analytical processing. But remember this: do not treat search as a substitute for text analytics.
SOURCE: Enterprise Search or Text Analysis: Approaches for Unstructured Data Integration
Recent articles by Krish Krishnan