But before jumping in to fix your data issues, it's important to establish a framework that ensures the data will be usable in the long run, not only immediately after a big cleanse, which is often time-consuming and expensive. This five-part framework provides a comprehensive approach to addressing existing data quality issues and preventing new ones from arising.
Let’s explore each step in detail.
1. Identify the data that drives business outcomes.
Data is the foundation of your organization's ability to make informed decisions that help realize its main business objectives, key goals, and vision. The first step is to identify and prioritize all data used by internal processes, business units, and projects. What information do they depend on to make informed decisions and realize objectives? Some of this data may not be obvious at first blush.
EXAMPLE: Time tracking is essential for consulting companies, and accurately reporting on billable time is usually not a challenge for a “time & materials” project. But what data is used to ensure optimal efficiency and productivity for work under fixed-price contracts or complex engagements with large project teams and parallel workstreams? Management needs to go beyond time tracking to understand how work is executed, because of the potential impact on customer satisfaction, profit margins, and overall project performance. In this situation, the organization depends on timesheet data as well as other KPIs that provide further insight into project execution.
2. Establish a single source of truth.
In theory, every data element should be stored exactly once. Linkages to those elements should be by reference only, and any updates to the data element in the primary location will propagate to the entire system without the possibility of a duplicate value somewhere being forgotten.
In reality, most organizations rely on multiple, disconnected systems and applications that independently store the same types of information, so the same data ends up duplicated in many places. The copies are often updated independently, with little clarity as to which one is the most current.
Businesses that depend on aggregating data from multiple sources typically deploy a data warehouse, a master data management (MDM) system, a SaaS platform, or another cloud-based solution specifically for that purpose. Once the key pieces of data that drive business outcomes have been identified, organizations should determine the single source that will maintain each.
EXAMPLE: Healthcare organizations that perform clinical management (oversight of specific cohorts of patients based on conditions, cost, risk, etc.) routinely aggregate data from multiple sources: clinical data comes from various EHR systems that don't "talk" to each other, while financial and claims data comes from insurance carriers. In such cases, a data warehouse is deployed to support the other systems and applications, providing a single source of truth for data aggregation, deduplication, creation of a single patient record, and reporting.
3. Implement a data governance policy.
Data governance applies to more than regulatory compliance situations and covers more than cybersecurity. Data governance defines and controls:
- the processes by which data is collected, stored, tracked, and used;
- which data is available and for what purpose;
- ownership at each step;
- how decisions are made that impact the state and use of data;
- and finally, how the integrity and security of data are maintained.
Once it has been established which data drives business outcomes and serves specific strategic goals, a comprehensive approach to managing that data and underlying infrastructure is required. Root causes of data issues should be determined and addressed at the points of data entry.
EXAMPLE: An organization that is trying to improve project performance and execution effectiveness may choose to track ETC (estimate to complete, i.e. remaining hours in the project) and EAC (estimate at completion, i.e. total project hours), compare them to baseline estimates, and have a formal change control process for deviations beyond a specific threshold. In this case, data governance would define how the information is collected, stored, and reported, who is responsible for data entry, how the decisions are made when planned hours are not met, and how the overages are mitigated. A data governance policy will also help establish or adjust specific workflows related to entering data into source systems.
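As a rough sketch of the threshold check described above (the field names and the 10% threshold are illustrative assumptions, not prescribed by this framework):

```python
def needs_change_control(actual_hours, etc_hours, baseline_hours,
                         threshold=0.10):
    """Flag a project for formal change control when the estimate at
    completion (EAC = actual hours to date + estimate to complete)
    deviates from the baseline by more than the threshold.
    The 10% default threshold is an assumption for illustration."""
    eac = actual_hours + etc_hours
    deviation = (eac - baseline_hours) / baseline_hours
    return deviation > threshold

# A project with 600 hours logged and an ETC of 550 hours against a
# 1,000-hour baseline has an EAC of 1,150 -- a 15% overage.
needs_change_control(600, 550, 1000)  # → True
```

Governance then defines who reviews the flagged projects and how overages are mitigated; the code only surfaces the deviation.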
4. Determine your data quality tolerance.
Completely clean and reliable data is a myth and is not an attainable goal in most cases. Just like in any statistical analysis, there is a margin of error. Data quality will also have a certain tolerance level that’s acceptable in any given situation. There is always a point when the quality of data is sufficient for a specific purpose, and businesses should determine this threshold. Of course, this assumes that there is adequate insight into what the data issues are, and the specific impact of those issues.
EXAMPLE: A social security number may be an effective unique identifier when merging multiple customer records, but it may be useless if source systems don’t routinely capture this field. Deduplication algorithms can use a combination of lookup factors like name, alias, address, and other unique identifiers which, in concert, are used for merging disparate records. In this case, the acceptable tolerance may be very low for null values in last name and address fields, but high for missing or incomplete social security numbers.
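A minimal sketch of such a tolerance rule in a matching key (field names and fallback logic are assumptions for illustration, not a production matching algorithm):

```python
def merge_key(record):
    """Build a matching key for record deduplication.
    SSN is preferred when present; otherwise fall back to a composite
    of last name and address. Returns None when the record fails the
    tolerance rules: a missing last name or address is unacceptable,
    but a missing SSN is tolerated."""
    if not record.get("last_name") or not record.get("address"):
        return None  # below tolerance: cannot be safely matched
    if record.get("ssn"):
        return ("ssn", record["ssn"])
    return ("name_addr", record["last_name"].lower(), record["address"].lower())

a = {"last_name": "Smith", "address": "1 Main St", "ssn": None}
b = {"last_name": "SMITH", "address": "1 main st"}
merge_key(a) == merge_key(b)  # → True: merged despite the missing SSNs
```

Records that return `None` would be routed to a stewardship queue rather than merged automatically.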
5. Implement a continuous improvement process to address data quality issues.
In most practical applications, an acceptable data quality level is not a moment frozen in time. Organizations need a sustained effort to continuously monitor data quality and address issues in an iterative fashion. Like any continuous improvement process, one focused on data quality must continuously monitor Key Quality Indicators (KQIs), adjust data governance, and address issues based on priority and impact.
A continuous improvement process specific to data quality monitors the following KQIs on an ongoing basis:
- Data which should not change (longitudinal)
- Data which should change based on other changes
- Data which should not be missing
- Datasets which should correlate with each other (data integrity)
- Data which should be within the defined value ranges
- Data which has time or duration constraints
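Several of these KQIs can be expressed as automated checks that run on every record. A minimal sketch, in which the field names, value ranges, and rules are all illustrative assumptions:

```python
def check_kqis(record, reference_ids):
    """Run a few of the KQI checks listed above against a single record.
    Returns a list of violation messages; an empty list means the
    record passes. All rules here are illustrative assumptions."""
    violations = []
    # Data which should not be missing
    if not record.get("patient_id"):
        violations.append("missing patient_id")
    # Data which should be within the defined value ranges
    if not (0 <= record.get("age", -1) <= 120):
        violations.append("age out of range")
    # Datasets which should correlate with each other (data integrity)
    if record.get("patient_id") not in reference_ids:
        violations.append("patient_id not found in master index")
    # Data which has time or duration constraints
    if (record.get("discharge_date") and record.get("admit_date")
            and record["discharge_date"] < record["admit_date"]):
        violations.append("discharge precedes admission")
    return violations

check_kqis({"patient_id": "P1", "age": 47, "admit_date": "2024-01-02",
            "discharge_date": "2024-01-01"}, {"P1"})
# → ["discharge precedes admission"]
```

The resulting violation counts, tracked over time, are what the improvement cycle prioritizes by impact.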
A data quality stewardship role is appropriate for these efforts. This person is responsible for performing regular data audits on business data, metadata, and data models, and contributing to data reconciliation efforts by helping to identify and resolve the root causes of data quality issues. The findings of the audits and reconciliation efforts feed the continuous data quality improvement cycle.
The steps above must be addressed before any data is cleaned. Once established, the governance and continuous improvement processes will ensure the organization is positioned to effectively deal with an influx of additional data, and leverage the data effectively to achieve strategic objectives.
Next, it’s time to clean the data.