Dimensions of Data Quality
Data quality is assessed along a number of dimensions, each representing a different way to manage and understand the quality of data. The dimensions covered here include:
- Integrity
- Validity
- Uniqueness / Duplication
- Completeness
- Accuracy
- Timeliness / Currency
- Consistency
- Precision
- Accessibility
- Representation
- Lineage
Data integrity is the most fundamental dimension and the one on which all other dimensions are based. Data integrity is the pragmatic determinant of whether the data “makes sense” given known business requirements.
Data integrity practices include profiling to identify unusual or outlying values, understanding expected distributions, and establishing and enforcing domains of value (i.e., what are the valid names assigned as an account owner or perhaps valid partner business names for a particular discount).
Predictable Data’s PDQ makes this easy by validating that all entries in a selected column are members of a custom domain list provided by you.
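As a minimal illustration of domain enforcement (a sketch, not PDQ itself), a domain check reduces to a membership test against the agreed list; the owner names below are invented:

```python
# Sketch of a "domain of value" check: flag column entries that are
# not members of an agreed list. The names here are illustrative.
VALID_OWNERS = {"alice", "bob", "carol"}

def out_of_domain(column, domain):
    """Return the values in `column` that fall outside `domain`."""
    return [value for value in column if value not in domain]

owners = ["alice", "bob", "mallory", "carol", "bob"]
violations = out_of_domain(owners, VALID_OWNERS)
# violations == ["mallory"]
```

Anything the check returns is either a data error or a signal that the domain list itself needs updating.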
Data validity means data conforms to the syntax (format, type, range, etc.) of its definition or a predetermined standard. A new user claims the email they provided is valid, but is it? Is the data you received from another department, from a client, or that you purchased as valid as its providers believe?
Predictable Data’s PDQ allows you to easily validate emails, hostnames, IP addresses, numbers, industry specific codes (SIC, NAICS), geolocation data and so much more.
- Values in specified range of valid values
- Values conform to business rules
- Values conform to other attribute types & format
- Abbreviations are standardized
- Typos and misspellings are corrected
- Punctuation is standardized
- Entities are represented one way ("gold standard") wherever used
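In code, syntax checks like those above reduce to pattern and range tests. A rough Python sketch follows; the email pattern is a simplified stand-in, not a full RFC 5322 validator:

```python
import re

# Simplified syntax checks in the spirit of the validity rules above.
# This email pattern is a rough approximation, not a full RFC 5322 parser.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

def is_valid_email(value):
    """Does the value look like an email address?"""
    return bool(EMAIL_RE.match(value))

def in_range(value, low, high):
    """Is the value inside the agreed range of valid values?"""
    return low <= value <= high
```

Industry-specific codes (SIC, NAICS), hostnames, and IP addresses follow the same pattern: each has a published syntax you can test against.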
Ways to represent the University of Texas:
- U of Texas
- University of Texas
- University of Texas, Austin
- University of Texas at Austin
Data Uniqueness / Duplication
Data uniqueness / duplication is the flip side of completeness. A customer with two accounts may be able to receive multiple free introductory offers or exceed a credit limit. Often, abbreviations, typos, and misspellings are the culprit.
Perhaps you have users entering the university they attended in an online form (see the list above). All of these entries are non-identical duplicates of a single entity, which makes accurately reporting the number of users per school unnecessarily difficult.
Predictable Data’s PDQ allows powerful column-based search and replace features and even more powerful algorithms via List Curator to find duplicates based on fuzzy matches and map them to your desired “gold standard” value. Either way you can standardize your data in minutes.
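You can get a rough feel for fuzzy matching to a gold standard with Python's standard difflib; tools like List Curator use more robust algorithms, and the cutoff below is an arbitrary illustrative choice:

```python
from difflib import get_close_matches

# Map free-text variants onto a single "gold standard" entity name.
GOLD = ["University of Texas at Austin", "Texas A&M University"]

def standardize(value, gold=GOLD, cutoff=0.5):
    """Return the closest gold-standard name, or the value unchanged."""
    matches = get_close_matches(value, gold, n=1, cutoff=cutoff)
    return matches[0] if matches else value

variants = ["U of Texas", "University of Texas", "University of Texas, Austin"]
standardized = [standardize(v) for v in variants]
# all three map to "University of Texas at Austin"
```

Once every variant maps to one gold-standard value, a count of users per school becomes a simple group-by.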
Data completeness is self-explanatory: is all the expected data there? Did your marketing team accurately code your entries? Are all fields populated with values: phone numbers, emails, job titles, and addresses for people? What about company name, address, and URL? Do you know what industry each company is in?
For tiny data sets, the eyeball test can help, but you still need some tools to help you find missing values in larger lists. Predictable Data’s PDQ identifies missing values simply by sorting your column data, but also has tools to set default values, edit values in-place, and validate column values are members of the gold-standard list.
- Row population
- Column population
- Find or derive missing data
- How to treat missing values (delete record, ignore, count as zero, etc.)
- Agree with real-world
- Match to agreed source
- Represented in unambiguous form
- Missing values represent a single truth
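The completeness checks above can be sketched as a per-column missing-value count; this hypothetical example works over row-oriented records and treats both None and empty strings as missing:

```python
# Count missing values per column in a row-oriented record set.
# These records are illustrative; None and "" both count as missing.
records = [
    {"name": "Acme Corp", "url": "acme.example", "industry": ""},
    {"name": "Globex", "url": None, "industry": "Energy"},
]

def missing_by_column(rows):
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if value is None or value == "":
                counts[column] += 1
    return counts

# missing_by_column(records) == {"name": 0, "url": 1, "industry": 1}
```

How to treat each gap (delete the record, ignore it, count it as zero, or derive a default) is still a business decision; the count just tells you where to look.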
Data accuracy is a different question from data integrity. There are two characteristics of accuracy: form and content. Form is important because it eliminates ambiguities about the content.
Does 1/4/2000 represent January 4th or April Fool’s Day? It depends on the expected form. Data profiling and exception reporting won’t uncover this error; you have to audit the data against reality. Format validation is one of PDQ’s core strengths.
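Python's standard datetime module makes the ambiguity concrete: the same string yields two different dates depending on the form you expect.

```python
from datetime import datetime

raw = "1/4/2000"

# US convention, month first: January 4, 2000
us_date = datetime.strptime(raw, "%m/%d/%Y").date()

# European convention, day first: April 1, 2000
eu_date = datetime.strptime(raw, "%d/%m/%Y").date()
```

Nothing about the string itself tells you which parse is right, which is exactly why form has to be agreed upon out of band.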
Data Timeliness and Currency
Data timeliness refers to the time expectation for accessibility and availability of information. Timeliness can be measured as the time between when information is expected and when it is readily available for use. Data currency describes the processes necessary to keep the data current. The data may have once been correct, but not anymore: a person moved or cancelled their order.
Data profiling over time can identify the “data decay” rate, that is, the expected half-life of a data set, and how often the processes will need to run to maintain data currency. Sometimes just re-validating and cleaning your data can be enough, but for larger systems you may want automated systems in place. Predictable Data’s consulting team can help you design and deploy these systems.
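If you assume records decay exponentially (a common but simplifying assumption), one validity measurement yields a back-of-the-envelope half-life estimate; the 80%-after-90-days figure below is invented for illustration:

```python
import math

def half_life(days, still_valid):
    """Estimate a data set's half-life from one validity observation,
    assuming records decay exponentially over time."""
    decay_rate = -math.log(still_valid) / days
    return math.log(2) / decay_rate

# If 80% of contact records are still valid after 90 days,
# the estimated half-life is roughly 280 days.
estimate = half_life(90, 0.80)
```

The half-life then suggests how often a re-validation pass needs to run to keep currency above whatever threshold the business requires.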
- Data Re-Validation
- Data Refreshes
- Data Decay Rates
- Time expectation for availability
- Concurrence of distributed data
- Free of conflict with other source
- Consistency in representation
- Don't replicate data errors
- Push corrections back to source data
- Data-driven processes and applications
Data consistency addresses the reality that data is replicated for good, or at least unavoidable, reasons. Are you storing data in Excel or your email instead of the CRM app? Are you exporting data for marketing, data science projects, or data warehouses? Make sure you fully sanitize your data before replicating it, so you do not distribute errors. Conversely, make sure data corrections work their way back to the original source.
Keeping data in sync can seem like a difficult challenge for small businesses and startups, but developing data-driven guidelines, processes and applications that provide a single and consistent data source is a core competency for Predictable Data consultants.
Data Precision is the depth of knowledge encoded by the data. Precision comes in many forms such as number of decimal places, smallest unit of time (minutes, milliseconds, etc.) recorded, and the resolution of images.
Greater precision breaks composite fields into subcomponents. Data scientists, marketers, analysts, and report writers often spend several hours per project just parsing composite data to do their jobs. That’s why they love Predictable Data. For instance, with a click or two we parse a full name field into proper subcomponents like Prefix (Mr, Mrs, Dr), First, Middle, Last, Suffix (Sr, Jr, III), and Post-Nominal (MD, CPA, ESQ). We can do the same for URLs, dates, times, phone numbers, and more, making your data more precise.
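A toy version of name parsing shows the idea; the prefix and suffix vocabularies below are deliberately tiny, and a production parser handles far more variation:

```python
# Toy full-name parser illustrating the subcomponents above.
# A real parser handles far more vocabulary and edge cases.
PREFIXES = {"Mr", "Mrs", "Ms", "Dr"}
SUFFIXES = {"Sr", "Jr", "II", "III"}

def parse_name(full_name):
    parts = [p.strip(".,") for p in full_name.split()]
    result = {"prefix": "", "first": "", "middle": "", "last": "", "suffix": ""}
    if parts and parts[0] in PREFIXES:
        result["prefix"] = parts.pop(0)
    if parts and parts[-1] in SUFFIXES:
        result["suffix"] = parts.pop()
    if parts:
        result["first"] = parts.pop(0)
    if parts:
        result["last"] = parts.pop()
    result["middle"] = " ".join(parts)
    return result

# parse_name("Dr. John Q. Public Jr.") ->
# {"prefix": "Dr", "first": "John", "middle": "Q",
#  "last": "Public", "suffix": "Jr"}
```

Once the field is split, each subcomponent can be sorted, deduplicated, and validated on its own.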
- Precision of data value
- Number of decimal places
- Composite field parsing into subcomponents
- Ease of attainability of data
- Need for access control
- Data APIs (REST, XML, JSON, CSV)
- Authentication (OAuth, Basic, Two-Factor)
- Monetizing Data
Data accessibility refers to the ease with which potential users are aware of and have access to the data. Is all that data you’ve worked so hard on easily available and usable to those who need it? Do they understand how to access the data? Do they understand the value of using it?
For example, sales reps are notorious for concealing customer and prospect data from marketing and CRM systems. Sometimes they don’t want to take the time, or, perhaps justifiably, do not understand the value of updating the CRM. In many businesses, the “CRM” system is sticky notes and spreadsheets, but even these need to be accessible.
Other examples include marketing keeping leads from sales, data engineers and scientists correcting data errors only in their copy of the data rather than updating the source, and so on.
From showing customers how to use a spreadsheet or a CRM to designing and developing data-driven warehouses, services and monetized products, Predictable Data consultants help businesses of all sizes.
Representation measures ease of understanding data, consistency of presentation, appropriate media choice, and availability of documentation (metadata).
The format, media type, and documentation of the data you store is incredibly important. For example, what does 1577836800 mean to you? This timestamp format is stored in countless databases and happens to represent Wednesday, January 1, 2020 12:00:00 AM (GMT). In the raw form, you can subtract one timestamp from another to calculate a duration. Parsed into the expanded, readable form, you can analyze by month, day, day-of-month, year, plus the time components.
Formatting values and splitting them into subcomponents is super easy with Predictable Data’s PDQ. With only a few clicks, PDQ splits fields like full name, timestamps, URLs, numbers, email addresses and more into their subcomponents.
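The opaque integer from the example above converts cleanly with Python's standard library; 1577836800 is a Unix timestamp, counting seconds since 1970-01-01 00:00:00 UTC:

```python
from datetime import datetime, timezone

# 1577836800 is a Unix timestamp: seconds since 1970-01-01 00:00:00 UTC.
ts = 1577836800
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
# dt is Wednesday, January 1, 2020 00:00:00 UTC

# Raw timestamps subtract directly to give a duration in seconds...
one_hour_later = ts + 3600
duration = one_hour_later - ts

# ...while the parsed form exposes the subcomponents for analysis.
components = (dt.year, dt.month, dt.day, dt.hour, dt.minute)
```

Both representations are useful; the point is to document which one a field holds so consumers don't have to guess.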
- Ease to read & interpret
- Presentation language
- Media appropriate
- Complete & available metadata
- Includes measurement units
- Source documentation
- Segment documentation
- Target documentation
- End-to-End documentation
Data Lineage includes the data origin, what happens to it over time, and who and what systems have access to read, write, modify, and delete the data. How does data get from your website into your CRM? How does it travel to a report you may be creating or reading to make critical business decisions?
Knowing your data’s lineage greatly simplifies the ability to trace errors back to a root cause. Predictable Data helps decision makers and IT departments document and understand their total data lineage.