Bridging the Gap: Comparing Claims Data and Electronic Health Records in Real-World Healthcare Research

Real-world data has become a cornerstone of modern healthcare research, providing invaluable insights into disease trends, treatment outcomes, and patient populations outside controlled clinical trials. Among the primary sources of such data are claims data and electronic health records (EHRs). Although these data types serve different purposes and have unique features, they are increasingly being integrated to generate comprehensive evidence that informs clinical and policy decisions. This article explores the fundamental differences between claims data and EHRs, their respective strengths and limitations, and how their combined use can enhance real-world research.

What Is Claims Data?

Claims data originates from the billing process: it consists of the information healthcare providers submit to payers, such as insurance companies or government programs, to obtain reimbursement for services provided. It covers demographic details such as age, gender, and location, as well as diagnosis codes, procedure codes, prescription information, and associated costs. Because a claim is submitted each time a billable service occurs, claims offer a longitudinal record of patient interactions across healthcare settings (Pivovarov et al., 2019).

Large-scale claims databases, such as IBM’s MarketScan, aggregate data from millions of patients across numerous payers, creating extensive repositories of real-world evidence. For instance, the MarketScan databases contain claims information for over 240 million US individuals, collected from employers, health plans, and government programs (IBM Watson Health, 2022). Other notable aggregators include Optum Clinformatics, Premier Healthcare Database, and PharMetrics Plus.

Key Data Elements in Claims Data

  • Patient demographics: age, gender, geographic location, insurance details
  • Diagnoses: International Classification of Diseases (ICD) codes
  • Procedures: Current Procedural Terminology (CPT) and Healthcare Common Procedure Coding System (HCPCS) codes
  • Medications: National Drug Codes (NDC), dosage, refill history
  • Costs: total billed amounts, payer and patient responsibilities

Claims data is especially valuable for studying disease incidence, treatment pathways, medication adherence, and healthcare utilization patterns across large populations. Its extensive sample sizes make it a powerful tool for comparative effectiveness research, health economics, and outcomes analyses (Berger et al., 2017).
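As an illustration of how medication adherence is commonly approximated from pharmacy claims, the sketch below computes a proportion of days covered (PDC) from fill dates. It is a minimal example under stated assumptions, not a validated measure: the `days_supply` field and the sample fills are hypothetical, and the function does not shift overlapping early refills forward as some PDC definitions do.

```python
from datetime import date, timedelta


def proportion_of_days_covered(fills, period_start, period_end):
    """Share of days in [period_start, period_end] with medication on hand.

    fills: list of (fill_date, days_supply) tuples taken from pharmacy claims.
    days_supply is an assumed field name; real claims layouts vary.
    """
    covered = set()
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            # Only count days that fall inside the observation window
            if period_start <= day <= period_end:
                covered.add(day)
    total_days = (period_end - period_start).days + 1
    return len(covered) / total_days


# Hypothetical fills for one patient and one drug
fills = [
    (date(2023, 1, 1), 30),
    (date(2023, 2, 5), 30),
    (date(2023, 3, 20), 30),
]
pdc = proportion_of_days_covered(fills, date(2023, 1, 1), date(2023, 3, 31))
print(f"PDC: {pdc:.2f}")
```

A fill shows that the medication was dispensed, not that it was taken, so a figure like this is best read as an upper-bound proxy for adherence.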

Limitations of Claims Data

Despite its breadth, claims data has notable constraints:

  • Diagnosis codes often reflect billing considerations rather than confirmed clinical diagnoses; providers may submit provisional codes during diagnostic workups.
  • Coding inaccuracies or omissions can occur due to improper documentation or billing errors, leading to missing clinical details.
  • Pharmacy claims show that a prescription was filled, not that the patient actually took the medication as directed.
  • Data elements are primarily limited to what is necessary for reimbursement, lacking detailed clinical context such as disease severity or lifestyle factors.
  • Linking data across family members or tracking patients who change insurance plans over time is challenging.
  • Variability in data quality and completeness exists across different claims sources.
  • The dataset may be biased, reflecting specific payer populations rather than the general public.

While these limitations restrict certain types of analyses, claims data remains instrumental for epidemiology, population health management, and health economics when its nuances are carefully considered.

What Are Electronic Health Records?

Electronic health records are digital records of individual patient care, generated during clinical encounters. Maintained by healthcare providers and institutions, EHRs include detailed, structured information as well as unstructured narratives. They cover a broad range of data points (Hersh et al., 2013), including:

  • Demographic details: age, gender, ethnicity, language
  • Medical history: diagnoses, allergies, immunizations, procedures
  • Medications: prescriptions, dosing instructions, administration records
  • Vital signs: blood pressure, heart rate, weight, height
  • Laboratory and imaging results: blood tests, imaging reports
  • Clinical notes: physician observations, discharge summaries

Unlike claims data, EHRs provide rich clinical context, including detailed narratives, images, and quantitative measurements, which enables a nuanced understanding of disease progression, treatment responses, and patient complexity.

Large EHR-derived research databases, such as Vanderbilt University’s Synthetic Derivative, compile records from millions of patients. The BioVU database, for example, links de-identified medical records with genetic data, facilitating genomics research (Roden et al., 2008).

Benefits of EHR Data

EHRs offer several advantages for research:

  • Access to detailed clinical information beyond billing codes, including physician notes and lab results
  • Ability to analyze unstructured data through natural language processing techniques
  • Identification of patient cohorts with complex inclusion and exclusion criteria
  • Longitudinal tracking of disease trajectories and treatment pathways
  • Linking diverse data types such as imaging, pathology, and genetics for comprehensive phenotyping

These benefits enable researchers to explore disease mechanisms and treatment effects with a depth that claims data cannot provide.
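To make the idea of mining unstructured notes concrete, here is a deliberately simple, rule-based sketch that pulls blood pressure readings out of free text. It stands in for natural language processing only loosely: the pattern, the function name, and the sample note are all hypothetical, and a production clinical NLP pipeline would also need to handle negation, history versus current findings, and many more phrasings.

```python
import re

# Matches phrasings such as "BP 148/92", "BP: 120/80", "blood pressure of 136/84"
BP_PATTERN = re.compile(
    r"\b(?:BP|blood pressure)\s*(?:of|is|:)?\s*(\d{2,3})\s*/\s*(\d{2,3})\b",
    re.IGNORECASE,
)


def extract_blood_pressures(note):
    """Return (systolic, diastolic) pairs found in a free-text note."""
    return [(int(sys), int(dia)) for sys, dia in BP_PATTERN.findall(note)]


# Hypothetical clinical note
note = (
    "Pt seen for follow-up. BP 148/92 on arrival, "
    "repeat blood pressure of 136/84 after rest."
)
readings = extract_blood_pressures(note)
print(readings)
```

Values like these exist in EHR notes and structured vitals but not in billing codes, which is the gap this kind of extraction is meant to fill.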

Challenges of EHR Data

However, working with EHRs also presents challenges:

  • Data may be incomplete or inconsistent if providers fail to document encounters properly
  • Records become fragmented when patients receive care at multiple, unconnected health systems
  • Variability exists in how data is captured (structured fields versus free-text notes), making standardization difficult
  • Lack of uniform data standards complicates cross-system normalization
  • Extracting and analyzing unstructured notes require advanced data science expertise and resources
  • Interpretation of clinical narratives demands domain knowledge
  • Data extraction and processing are more resource-intensive than with claims datasets

Despite these hurdles, EHRs are invaluable for capturing the clinical nuances necessary for precise patient stratification and outcome assessment.
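A small example of the standardization problem noted in the list above: the same measurement may arrive in different units depending on the source system. The sketch below normalizes body weight to kilograms. The conversion table, function name, and sample records are hypothetical, and unknown units are returned as `None` so they can be reviewed rather than guessed.

```python
# Hypothetical lookup: unit label as recorded -> multiplication factor to kilograms
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237, "lbs": 0.45359237}


def weight_in_kg(value, unit):
    """Convert a recorded weight to kilograms, or None if the unit is unknown."""
    factor = TO_KG.get(unit.strip().lower())
    if factor is None:
        return None  # flag for manual review instead of guessing
    return round(value * factor, 2)


# Hypothetical weights pulled from different source systems
records = [(70.0, "kg"), (154.0, "LB"), (68500.0, "g"), (11.0, "stone")]
normalized = [weight_in_kg(value, unit) for value, unit in records]
print(normalized)
```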

Integrating Claims and EHR Datasets

Recognizing the complementary strengths of claims data and EHRs, many research initiatives aim to integrate these sources to conduct more comprehensive studies (Maro et al., 2019). Effective integration strategies include:

  • Linking datasets at the individual level via unique identifiers
  • Building cohorts based on claims diagnosis codes and subsequently reviewing detailed clinical data in EHRs
  • Applying natural language processing to extract additional insights from clinical notes
  • Using algorithms to identify lapses in care, adverse events, and other clinical phenomena in both datasets
  • Combining pharmacy fill data from claims with medication orders in EHRs to evaluate adherence
  • Merging cost information with clinical details for health economic analyses
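The first two strategies in this list can be sketched in a few lines: build a cohort from claims diagnosis codes, then pull clinical detail for those patients from the EHR. This is an illustrative sketch only. It assumes both sources already share a patient identifier (in practice linkage often relies on tokenized or probabilistic matching), and all rows, field layouts, and function names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical claims rows: (patient_id, ICD-10-CM diagnosis code)
claims = [
    ("P001", "E11.9"),
    ("P002", "I10"),
    ("P003", "E11.65"),
    ("P003", "I10"),
]

# Hypothetical EHR lab rows: (patient_id, test name, value)
ehr_labs = [
    ("P001", "HbA1c", 7.2),
    ("P001", "HbA1c", 6.8),
    ("P002", "HbA1c", 5.4),
    ("P004", "HbA1c", 9.1),
]


def build_cohort(claims, code_prefix):
    """Patients with at least one claim whose code starts with code_prefix."""
    return {pid for pid, code in claims if code.startswith(code_prefix)}


def attach_labs(cohort, labs, test_name):
    """Map each cohort member to their EHR values for one test.

    Patients with no linked result get an empty list, which makes
    gaps in linkage visible instead of silently dropping them.
    """
    values = defaultdict(list)
    for pid, name, value in labs:
        if pid in cohort and name == test_name:
            values[pid].append(value)
    return {pid: values.get(pid, []) for pid in sorted(cohort)}


cohort = build_cohort(claims, "E11")  # E11.* = type 2 diabetes mellitus
linked = attach_labs(cohort, ehr_labs, "HbA1c")
unlinked = [pid for pid, vals in linked.items() if not vals]
print(linked)
print("No EHR labs found for:", unlinked)
```

Keeping unlinked patients in the output is a deliberate choice here: a cohort member with a claims diagnosis but no EHR results may reflect care received in another health system rather than absence of testing.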

Networks such as PCORnet have developed infrastructure to facilitate such data integration, enabling large-scale, patient-centered outcomes research. When combined thoughtfully, these sources provide a multifaceted view of patient health that surpasses the capabilities of either dataset alone.

Claims data and EHRs, while distinct, are both critical in generating real-world evidence. Claims data excels in longitudinal, population-wide analyses but lacks clinical depth. Conversely, EHRs offer detailed clinical narratives and measurements but face standardization and completeness challenges. Their integration allows researchers to leverage the advantages of both, producing more accurate and comprehensive insights to guide healthcare decisions and improve patient outcomes. At NashBio, we primarily utilize EHR data due to its depth and clinical richness, which helps us build high-fidelity study populations for our clients.

By understanding and harnessing the strengths and limitations of each data source, researchers and clinicians can develop more robust, real-world evidence to improve patient care and health outcomes.