Oops! The input is malformed!
Originally published 16 July 2008
Data modeling using E-R models is used to model the structural characteristics of data, though using constraints, a subset of behavioral characteristics can also be modeled. With introduction of object-oriented concepts, we see that databases are now capable of handling more of the behavioral capabilities of the data using methods. Though there can be a debate on what portion of the business rules (behavioral characteristics) of the data should be stored in the database, data (especially structured data) is generally stored in a database.
In this article, we will discuss some structural problems faced during logical data modeling and the solution thereof. Most of the examples used here are from the capital markets. The listing of these patterns and anti-patterns is not exhaustive, but the intention of the article is to bring out the concept of patterns for data models.
Listed in this section is a set of structural patterns which we have used in some data modeling projects.
Multiple inheritance is a common problem, which has been dealt with in the object oriented world, where some languages support multiple inheritance and some don’t. The issue of dealing with multiple inheritances in the E-R modeling world is the same as that in the object oriented world. The issue is in identifying each instance of the child entity which is inherited from multiple other entities.
The difference between the object oriented world and the E-R modeling world is with respect to methods (if user-defined data types are left out of discussion), and this gives one the flexibility of redefining an inheritance as an association.
Let’s take an example from the mutual fund industry. A mutual fund in its tangible form is an instrument like any other equity or bond and behaves like one as it can be openly traded (with some caveats based on the type of the mutual fund). A mutual fund can also be viewed as an abstract concept of a portfolio – a collection of instruments, and all the characteristics of the mutual fund would then depend on the characteristics of the individual instruments in the collection. [Since there are other entities in a mutual fund industry, which can be treated as a portfolio (example account) it may be beneficial to abstract out the portfolio related attributes from a fund entity into another entity].
We could have modeled the relationship between a mutual fund and an instrument based on the following premise shown in Figure 1.
Figure 1: Handling Multiple Inheritance
Under such circumstances how would the fund_instrument_id (which now is inherited from the instrument entity and the portfolio entity) be created? The solution to this problem was achieved by removing one of the inheritances as shown in Figure 2 (keeping the tangible one instrument inheritance as that would be used for most of the queries and converting the portfolio inheritance into an association).
Figure 2: Design of Funds after Removing the Second Inheritance Relationship
The concept of a composite pattern is borrowed from the object-oriented world where a composite pattern is defined as:
Compose objects into tree structures to represent part-whole hieracrchies.
Composite lets clients treat indivdual objects and compositions of objects uniformly.
In the relational world, too, the definition would hold true for any hierarchical data where the whole is to be treated similarly as the part. By “treated similarly,” we mean that the relationship between a child entity to any other entity in the model is the same as that between the parent entity and the other entity in the model. A tree structure is sometimes used to denote a hierarchical relationship such as manager and employees or a composition like a team and employees. For a composite pattern to work, the parent node in the tree structure should be a generalization of the child nodes, and hence the manager/employee situation is not a good example for the composition.
In most organizations a certain kind of task can either be done by a single person or a team or may be outsourced to an external entity. For example, in an organization a team of people and an individual both may behave in the same way as in a job could be assigned to an individual or a team based on the size of the job. Further, based on the complexity of the job, a team could be formed of cross-functional expert subteams. To form a comprehensive view of the roles played, one has to generalize the team and the individual as a party.
Figure 3: Design of Organization Structure
Here the whole (Team) behaves like a unit (Person) when a job assignment is to be defined, which can now be done to a party.The Team behaves like a collection (distinct from a person behaving as a unit) when the composition of the team is to be defined. The previous example of the portfolio fund and instrument is also a good example of composite pattern.
All data warehouses get data from multiple sources and integrate the data. During the integration, the data goes through validation, transformation and enhancement process. The data validation process may mark data as invalid as per the business rules but, in most cases, the data is not thrown away. Instead it is kept in the database, and based on a set of business rules, the validation issue is overruled and data is loaded into the database as clean data.
It is of great value to the business end user to be aware of the source of the data and the quality level of the data present in the data warehouse.
In cases of data enhancement, data across multiple loads may be used in composite data (specially scores/averages, etc.). During customer complaints which result from using this composite data, it is very beneficial for the end user to know which data points were considered while arriving at the composite data.
There are multiple approaches to store this information or retrieve the information if not stored explicitly. In this article, we deal with the options of storing and retrieving the information only for the data source or data quality-level identification. Identification of input data points for composite data requires a separate discussion on reverse functions.
All the options presented have one thing in common and that is to tag the row with an additional data element, one for providing information on the data source and the other for providing information on the data quality of each relevant data element in the row.
Since we have one data element for a type of information on a row, the data will have to be decrypted into a column_name and either source_id information or column_name and data_quality_id information. This would require helper tables to provide the information. Figure 4 shows a model that provides one possible representation of the lineage and quality tracking which uses a number based on bit pattern for storing the information.
Figure 4: Lineage Tracking Model
Let’s say there are 5 sources for the data in table “Data Table” – column 1 to column N. If the sources were numbered 1 to 5 we would need 3 bits to represent the sources distinctly, and we would need to string of 3*N bits to represent the source for each of the column (column 1 to column N). The helper table would contain the start and the end positions in the bit string for the source information of each column.
Using this approach allows one to point multiple columns to the same position in the bit string formed for lineage_id, which can be used for a set of related data elements for example, if we decide to take in all the address components from one of the sources completely.
There is a minor variation to the approach for data quality Id. Each error possible on a column would have an error_id associated with it. During validation the process may check for all possible errors on the data and we may want to store the complete set of results from all validations done on the data.
This means that in the data quality ID bit string, the position specified for a particular column (in the helper table) should contain information on each of the validations applied on the column. This will require a bit position for each validation, which can be switched on or off.
Figure 5: Data Quality Tracking Model
In most of the presentation and documentation on data modeling one comes across a set of problems with data modeling.
A fan trap is defined as a situation where two/more 1:n relationships fan out from a single entity, and a relationship between the outlier entities, which should have existed, is missing. Going through the central entity would result in ambiguous results.
The resolution of a fan trap is dependent on the scenario, but essentially it is all about having the complete set of relationships. Example: A financial services company may have multiple products such as savings banks, loan products, etc. Each product would have large number of accounts. Let’s take the loan product there can be large number of loans and some of the products could be sold/serviced over various channels.
Figure 6: Fan Trap – Which Account is Serviced by Which Channel?
A chasm trap occurs when a relationship is non-identifying and non-mandatory. Under such circumstances any third entity related to the parent entity may not have a path existing to the child entity.
This can be resolved by adding another relationship between the third entity and the child entity, which will also be another non-identifying and non-mandatory relationship. This leads to a cyclical relationship. Let’s take the example of a set of auto ancillary suppliers providing parts to a car manufacturer. The list of such suppliers can include a single location supplier or a multi-location supplier (having a headquarter and a set of branches).
Figure 8: Chasm Trap – Parts from a single location supplier cannot be listed in this model.
Figure 9: Chasm Trap – Parts resolved by extra relationship between organization and parts.
A cyclical loop can be a redundant relationship (one that describes the same business context as some other relationship) or as described earlier used to resolve an unmet requirement chasm trap in the data model design now has resulted in overlapping business contexts. The issue with a cyclical relationship is that it needs denormalization of some data whereas the logical model would seek to keep the model in a third normal form. Also the user now has more than one way to arrive at the same data.
In the previous examples, additional relationship between organization and parts would lead to denormalization of the org_id and branch_id relationship, which is now stored in two places branch entity and the part entity.
Over a period of time with changes and mergers and acquisitions, these relationships will have to be maintained in two places. If this data is not managed consistently across time data quality issues are sure to occur. Also for a user trying to answer which parts are provided by which organization, there are two options:
1. Join parts and organization
2. Join parts, branches and organization
Results from the two queries will not match as option 2 will return parts from only those organizations that have branches. Figure 10 shows the resolution of this problem.
Figure 10: Cyclical Relationship Resolved Using Context
In the diagram in Figure 10, making branch_id unique across org_id ensures that the org_id information in parts table comes only from the relationship between organization and parts, whereas in the
relationship path organization->branches->parts the org_id information is embedded in the branch_id data. This way we have made the two contexts non-overlapping. Now the query joining parts
to organization would return only parts from single location organizations and the query having organization branches and parts would return details for organizations having branches.