Venn diagrams are familiar to most data management professionals. They are a symbolic notation that describes how things can belong (or not belong) to two different classes. Many books on mathematics and modern logic present Venn diagrams as if they apply to all of reality. I have never seen data discussed in any of these publications, and I suspect that the authors simply assumed that data has no choice but to fall into the order prescribed by Venn diagrams. Yet is this really the case? Or is there something special about data such that its characteristics cannot be fully captured by Venn diagrams? After many years of reconciling datasets, I have begun to think that data really is different.
Mr. John Venn (1834-1923) was a Cambridge (England) man who lectured there in moral sciences. In 1881 he wrote a book called Symbolic Logic, in which he introduced his diagrammatic technique. Venn was chiefly interested in propositional logic, but today everyone seems more familiar with the extended use of Venn diagrams in set theory.
The Basic Situation for Data
Sets of things matter a great deal in data. Much work in data management consists of comparing the content of two sets of data. The example of a source of data that is copied to a target is familiar to everyone. Suppose we have a source set of Customer records that we copy to a target. After the data movement has occurred, we can compare the datasets. We expect to find that the Customer records that exist in the source are now in the target – an overlap. But perhaps, due to errors in our logic, there are some Customer records in the source that are not in the target. Perhaps also, because the target receives records from another source, there are some Customer records that exist in the target but not in the source.
Let us illustrate this basic case. Suppose we have a source and target with the Customer records shown in Figure 1.
Figure 1: Sample Source and Target Customer Tables
From this, we can see that Aristotle, Diogenes, and Plato are in both the source and the target, but Socrates and Thales are only in the source. Likewise, Marx and Engels exist only in the target. We can use a Venn diagram to illustrate this situation, showing the number of records in the overlap, in the source only, and in the target only, as in Figure 2.
Figure 2: Venn Diagram with Record Counts for Tables in Figure 1
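The three regions of this basic case map directly onto set operations. The following is a minimal sketch in Python, using the customer names listed above; the variable names are illustrative only.

```python
# Customer names as listed in the text for the source and target tables.
source = {"Aristotle", "Diogenes", "Plato", "Socrates", "Thales"}
target = {"Aristotle", "Diogenes", "Plato", "Marx", "Engels"}

overlap = source & target        # in both the source and the target
source_only = source - target    # in the source but missing from the target
target_only = target - source    # in the target but not from this source

print(len(source_only), len(overlap), len(target_only))  # prints: 2 3 2
```

This works only because every customer appears exactly once on each side and every record has a name.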
The Problems for Data
Figure 2 is what we would expect. Now let us consider a more realistic situation for data. Suppose we have a source that is not at the Customer level but at the Account level, and suppose we want to populate a target that is truly at the Customer level. This is not uncommon in dealing with master data management (MDM) applications. Suppose also that our users "abuse" the source table by recording Accounts before they really know the true name of the Customer. The users would, of course, not say they are abusing anything, but are constrained by the situation in which they find themselves. Figure 3 is an example of such a situation.
Figure 3: Source at Account Level and Target at Customer Level
Now everything is more complex. Five records in the source overlap with three records in the target. Two records in the target still do not overlap with anything in the source, and five records in the source do not overlap with any record in the target. Of the latter, two are duplicates for the same customer and two are nulls. Let us try to put all this into a Venn diagram. It is not easy, and the best I can do is shown in Figure 4.
Figure 4: Attempt at Venn Diagram for Records in Figure 3
Figure 4 is not a Venn diagram in any real sense. It violates the principles of a Venn diagram for two main reasons: the existence of duplicates, and the existence of nulls. Neither of these occurs in reality – which is what the Venn diagram is based on – but both occur in data.
The Problem of Nulls
If you ask a data professional what "null" means, you may well get into an argument that will last a long time. Null is not a value, but the absence of one. Whether the absence of a value is itself a value (since it represents some kind of knowledge) is difficult to think about, and I would like to avoid getting drawn into a debate from which I may never escape. However, I am sure of one thing, and that is that a null belongs neither in the portion of the Venn diagram that represents the source, nor in the portion that represents the target. Hence we see the value "2" outside the mangled Venn diagram in Figure 4. And here I will call to my defense none other than Lewis Carroll, the author of "Alice in Wonderland."
Carroll (his real name was the Rev. Charles Dodgson) wrote a book in 1896, also called Symbolic Logic, and wrote letters to Mr. Venn about how Venn would use his diagrams to represent combinations of various propositions. One can only imagine how mortified Venn must have been to be asked to represent premises like "Some mermaids smoke cigars" and to deduce conclusions such as "Some persons who are not gamblers are not philosophers." However, Venn did reply to Carroll and produced some remarkable variants of his diagrams that Carroll included in his book. Even so, Carroll did identify that there are things that are in neither of the two classes (the source and target in our case) that the Venn diagrams cover. Carroll created an alternative diagrammatic method that has never been popularized, but did cope with nulls. As Carroll put it:
My method...differs from his [Venn's] method in assigning a closed area to the Universe of Discourse, so that the Class [the nulls in Fig. 4] which, under Mr. Venn's liberal sway has been ranging at will through Infinite Space, is suddenly dismayed to find itself "cabin'd, cribb'd, confined" in a limited Cell like any other Class!
Carroll taught logic and mathematics at Oxford University, and knew what he was doing. Therefore, we can conclude that Venn diagrams do not support nulls, and since nulls are a persistent feature of data, Venn diagrams do not support data.
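One way to picture Carroll's "limited Cell" in code is to give nulls a compartment of their own instead of letting them drift into one of the circles. The sketch below is only an illustration under stated assumptions: the records are hypothetical (Figure 3's actual rows are not reproduced here), `None` stands in for a null customer name, and `partition_with_null_cell` is a made-up helper name.

```python
# Hypothetical customer names, one per record; None marks a record whose
# customer is not yet known. These are NOT the actual rows of Figure 3.
source_records = ["Aristotle", "Plato", None, "Socrates", None]
target_records = ["Aristotle", "Plato", "Marx"]


def partition_with_null_cell(source, target):
    """Split names into the three Venn regions plus a separate cell for nulls."""
    source_known = {name for name in source if name is not None}
    target_known = {name for name in target if name is not None}
    null_count = sum(name is None for name in source) + sum(
        name is None for name in target
    )
    return {
        "overlap": source_known & target_known,
        "source_only": source_known - target_known,
        "target_only": target_known - source_known,
        # Nulls are counted as records and kept outside all three regions.
        "null_records": null_count,
    }


cells = partition_with_null_cell(source_records, target_records)

# A naive set difference treats None as just another customer name,
# so the nulls end up inside the "source only" region.
naive_source_only = set(source_records) - set(target_records)

print(cells)
print(naive_source_only)
```

The naive version places the null in the source-only region, which is exactly where the argument above says it does not belong. The helper keeps it apart, at the cost of no longer being a two-circle Venn diagram.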
The Problem of Duplicates
Duplicates do not exist in reality, yet they do in data, as we can see in Figure 3 where the Customer Aristotle has two accounts, and thus two records exist for Aristotle. It is no use complaining that there is only one Aristotle in reality because we are not dealing with the real Aristotle but the two records in which he is represented as a Customer in our Account table in Figure 3. We are not trying to use Venn diagrams to describe reality, but to describe the representations of reality in data.
It could also be objected that we should only consider the distinct values in the source, just as distinct values are populated into the target. But, again, why should we? We are looking at records, which are a fact, rather than distinct values, which are an abstraction from this fact. We have to reconcile records between the source and target. This is not at all like the basic idea that underlies a Venn diagram where a single instance of a thing can belong to two or more distinct classes. In data, a single instance of a thing can be represented many times, so there really is nothing corresponding to a single instance of a thing as conceived for a Venn diagram. Once again, Venn diagrams do not work for data.
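The gap between counting records and counting distinct values can be sketched with a multiset. The example below is a hedged illustration, not a reconstruction of Figure 3: the rows are hypothetical, chosen only so that the totals agree with the counts quoted earlier (five source records overlapping three target records, two target-only records, and five non-overlapping source records, of which two are nulls). Only Aristotle's two accounts come from the text; the other duplicates are assumptions.

```python
from collections import Counter

# Hypothetical account-level source: one customer name per account record,
# with None for accounts whose customer is unknown. Assumed data, chosen to
# match the record counts quoted in the text.
source_records = [
    "Aristotle", "Aristotle", "Diogenes", "Diogenes", "Plato",
    "Socrates", "Socrates", "Thales", None, None,
]
# Customer-level target: one record per customer.
target_records = ["Aristotle", "Diogenes", "Plato", "Marx", "Engels"]

# A Counter keeps the number of records per customer; a set would discard it.
source_counts = Counter(name for name in source_records if name is not None)

matched_source_records = sum(
    count for name, count in source_counts.items() if name in target_records
)
matched_target_records = sum(1 for name in target_records if name in source_counts)
unmatched_target_records = len(target_records) - matched_target_records
null_source_records = sum(name is None for name in source_records)
# Everything in the source that matched nothing in the target, nulls included.
unmatched_source_records = len(source_records) - matched_source_records

# The distinct-value view that a Venn diagram would show for the overlap.
distinct_overlap = set(source_counts) & set(target_records)

print(matched_source_records, matched_target_records, len(distinct_overlap))
print(unmatched_source_records, null_source_records, unmatched_target_records)
```

Under these assumptions the overlap is five records on the source side, three on the target side, and three distinct customers, so no single number can be written in the middle of the diagram.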
Data can be a frustrating field to deal with. It often seems that we do not have the tools we need to deal with it. Venn diagrams have been one of the tools that many data professionals have tried to use to understand data. Yet, we have seen that Venn diagrams really cannot be expected to work for data. We need to find techniques that are oriented specifically to data, rather than borrowed from other areas of human experience. The lack of usefulness of Venn diagrams should not be a disappointment, but a spur to the maturity of data management.
SOURCE: Why Venn Diagrams Don't Work for Data
Recent articles by Malcolm Chisholm