For every complex situation, there are many perspectives.
Consider the subjects of big data, unstructured data, text and structured text. You could describe these environments in many ways. In fact, entire books have been written on these subjects. But a Venn diagram may be a useful way to achieve a high-level perspective of what is going on and how these different environments relate to each other. Such a Venn diagram is shown in Figure 1.
Figure 1: Relationship of Big Data to Other Forms of Data
Figure 1 shows four categories: unstructured data, text, big data, and structured text. (Note: There are many other ways to construct this Venn diagram.)
So let’s examine this Venn diagram.
The large outer circle contains all unstructured data. The middle circle contains all textual data. This shows that all text is a form of unstructured data (in its most general sense). It also implies that some unstructured data is not textual, such as images and voice recordings.
Inside the textual circle, textual data can be either unstructured or structured. Structured textual data is a proper subset of all forms of text: every piece of structured text is text, but not all text is structured. (My apologies to my professor in college who tried to teach me formal logic. I am sure he could explain this much better than I can.)
Overlaying the different circles is big data. Big data may contain non-textual unstructured data, unstructured textual data and structured textual data.
This Venn diagram then shows how these different worlds intersect with each other.
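The relationships in the diagram can be sketched with ordinary sets. This is only an illustration: the item names are hypothetical, and placing big data entirely inside the unstructured circle is an assumption made to keep the example small.

```python
# Hypothetical sample items; the names are illustrative only.
unstructured = {
    "photo.jpg", "call.wav", "video.mp4",   # non-textual data
    "email body", "contract prose",         # unstructured text
    "csv export", "xml invoice",            # structured text
}
text = {"email body", "contract prose", "csv export", "xml invoice"}
structured_text = {"csv export", "xml invoice"}

# Assumption: big data is drawn here as a region overlapping all three circles.
big_data = {"video.mp4", "email body", "xml invoice"}

# All text is unstructured data in the most general sense, and
# structured text is a proper subset of text ("<" tests proper subset).
assert text < unstructured
assert structured_text < text

# The parts of the diagram that lie outside the inner circles.
non_textual = unstructured - text
unstructured_text = text - structured_text

# The three intersections of big data with the other regions.
regions = {
    "big data, non-textual": big_data & non_textual,
    "big data, unstructured text": big_data & unstructured_text,
    "big data, structured text": big_data & structured_text,
}

for name, members in regions.items():
    print(name, sorted(members))
```

Each entry in regions corresponds to one of the intersections of big data with the other forms of data.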
The Intersections of Big Data and Text
Now let’s look at the different intersections of big data with the other forms of data. Let’s start with non-textual big data. The technology to process queries against non-textual data is still in its infancy. For example, querying for a particular frame of a video is still very difficult. Thus, analytical processing against non-textual data is still a fantasy, or at best something on the far horizon.
Now let’s consider the structured text component of big data. It is entirely possible to take structured text and place it in big data. That is done on a regular basis. The problem is that big data only holds unstructured data. So much (if not all) of the metadata that makes data (or text) structured is lost (or at least compromised) when structured text is placed into big data. For example, if you were to put data from IMS into big data, the pointers and other forms of metadata that make IMS what it is would be lost. It is then up to the analyst to take the data that once was in IMS and reconstruct the IMS structure in order to understand the data that has been placed into big data. The same is true not just for data that came from IMS but for ANY form of structured data that has been placed into an unstructured format.
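A small sketch may make the loss of structure concrete. It does not use IMS or any big data product; the nested record and the flatten helper are hypothetical stand-ins for hierarchical data and for storing only the raw values.

```python
# Hypothetical hierarchical record: a parent with child records beneath it.
record = {
    "customer": "C001",
    "orders": [
        {"order": "O-17", "items": ["widget", "gear"]},
        {"order": "O-18", "items": ["bolt"]},
    ],
}


def flatten(node):
    """Return only the leaf values, discarding keys and nesting."""
    if isinstance(node, dict):
        return [leaf for child in node.values() for leaf in flatten(child)]
    if isinstance(node, list):
        return [leaf for child in node for leaf in flatten(child)]
    return [node]


flat = flatten(record)
print(flat)

# A differently structured record produces exactly the same flat values,
# so the original parent-child relationships cannot be recovered from
# the flat data alone.
other = {
    "customer": "C001",
    "orders": [
        {"order": "O-17", "items": ["widget", "gear", "O-18", "bolt"]},
    ],
}
assert flatten(other) == flat
```

Because two different structures collapse to the same flat values, the analyst needs outside knowledge of the original layout to rebuild it.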
Then there is the intersection of unstructured text with big data. This is the “sweet spot.” Classical unstructured text fits well with big data.
The question then becomes: How do we do analytic processing against big data?
An interesting question. Part 2 of this article will address that question.
SOURCE: Big Data: A Venn Diagram
Recent articles by Bill Inmon