Understanding the Data

Hagimo's teams have been working with healthcare data companies for over 20 years. Our experience and relationships can be an invaluable resource for your organization when it comes to identifying the types of data you need to enable your business processes and locating the vendors who have it. The healthcare data we work with is generally divided into three types:
  • Health Histories. Data directly associated with the diagnosis and procedure history of an individual or group of individuals: claims data, EMR / EHR data, pharmacy data, lab data, imaging data, and other clinical data. This is the richest, most tightly regulated, and most valuable class of healthcare data, and it is the focus of the de-identified claims data section below.
  • Industry & Public Data. Government, registry, and institutional resources — CMS, CDC, NIH, WHO, HHS, provider directories, and registries — that bring scale and context to analytics built on patient data.
  • Support & Reference Data. The coding systems and controlled vocabularies (ICD, CPT, HCPCS, NDC, SNOMED, LOINC, and more) required to normalize, link, and interpret everything else.
To build a complete health history for a group of individuals, data from multiple vendors is almost always required, and each vendor carries its own use-case requirements that define what data you can acquire and what you can do with it. Aligning those vendors, their licenses, and their data versions is exactly the kind of challenge Hagimo exists to solve.

De-Identified Claims Data

De-identified medical and pharmacy claims data is the backbone of real-world evidence, health economics, market access, and litigation analytics. Hagimo sources data from effectively every major licensor in this market — and we know how each one prices, packages, restricts, and approves access to it. Becoming an approved data buyer is genuinely difficult: every source requires a negotiated data license agreement, a defensible use-case attestation, and ongoing compliance monitoring. Without the use-case expertise to get through that door, simply trying to license this data directly is, in practice, close to futile. That is where we come in.

Two structural categories that shape every dataset

Open Claims

Originated at the clearinghouse

Captured as the claim is submitted by a provider, before the payer's final adjudication. Very large patient counts (300M+ lives) and near-real-time timing, but limited to the providers and pharmacies that feed a given clearinghouse, and without confirmed payment.

Closed Claims

Sourced from the payer

Reflect full adjudication and payment across a member's enrolled period. Complete and payment-confirmed, but limited to enrolled lives and typically lagging by three to six months. The right fit wherever confirmed payer payment matters.

Linkage & Tokenization

Blending the full patient journey

The most complete views blend open and closed claims through privacy-preserving tokenization, with Datavant serving as the dominant linkage layer across the industry. Standardizing on it lowers integration friction across sources.
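
As a rough illustration of how tokenized linkage works in principle, the sketch below derives a keyed hash from normalized identifiers and groups open and closed claims under the resulting token. It is not Datavant's algorithm or any vendor's actual method. The key, the field names, and the `make_token` and `blend` helpers are all hypothetical, and production tokenization adds key management, site-specific encryption, and several token types.

```python
from __future__ import annotations

import hashlib
import hmac

# Hypothetical secret for illustration; real deployments keep keys in a managed vault.
SECRET_KEY = b"example-key-not-for-production"


def make_token(first_name: str, last_name: str, dob: str, sex: str) -> str:
    """Derive a deterministic keyed hash from normalized identifiers."""
    normalized = "|".join(
        part.strip().upper() for part in (first_name, last_name, dob, sex)
    )
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def blend(open_claims: list[dict], closed_claims: list[dict]) -> dict[str, dict]:
    """Group claim IDs from both sources under the shared patient token."""
    journeys: dict[str, dict] = {}
    for source, claims in (("open", open_claims), ("closed", closed_claims)):
        for claim in claims:
            journey = journeys.setdefault(claim["token"], {"open": [], "closed": []})
            journey[source].append(claim["claim_id"])
    return journeys


# The same (made-up) person entered two ways yields the same token.
token_a = make_token("Ana", "Lopez", "1980-02-14", "F")
token_b = make_token(" ana ", "LOPEZ", "1980-02-14", "f")

open_claims = [{"token": token_a, "claim_id": "O-1"}]
closed_claims = [{"token": token_b, "claim_id": "C-7"}]
journeys = blend(open_claims, closed_claims)
print(journeys[token_a])
```

Because both sources carry the same token for the same person, the open and closed records fall into one journey without either side exchanging names or birth dates.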

De-Identification

Safe Harbor or Expert Determination

De-identification is performed under HIPAA Safe Harbor or Expert Determination, with patient identifiers tokenized for longitudinal linkage. The method chosen affects the linkage and re-identification-risk controls you inherit — a key diligence item.
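
To make the Safe Harbor side of this concrete, here is a minimal sketch of three of its better-known generalization rules: dates reduced to the year, ZIP codes truncated to three digits (or set to 000 for sparsely populated prefixes), and ages over 89 grouped as 90+. The record layout, the `safe_harbor_view` helper, and the short list of low-population ZIP prefixes are assumptions for illustration only. Safe Harbor covers 18 identifier categories in total, and Expert Determination follows a different, statistical process that is not shown here.

```python
from __future__ import annotations

# Hypothetical field names for direct identifiers that are dropped outright.
DIRECT_IDENTIFIERS = {"name", "ssn", "phone", "email", "street_address"}

# Placeholder set of 3-digit ZIP prefixes covering 20,000 or fewer people;
# a real list must be derived from current Census data.
SMALL_POPULATION_ZIP3 = {"036", "059"}


def safe_harbor_view(record: dict) -> dict:
    """Return a copy of `record` with a few Safe Harbor generalizations applied."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    # Dates: keep only the year.
    out["service_year"] = out.pop("service_date")[:4]

    # Geography: keep the first three ZIP digits unless the prefix is too small.
    zip3 = out.pop("zip")[:3]
    out["zip3"] = "000" if zip3 in SMALL_POPULATION_ZIP3 else zip3

    # Age: everything over 89 is collapsed into a single bucket.
    age = out.pop("age")
    out["age"] = "90+" if age >= 90 else str(age)
    return out


record = {
    "name": "Ana Lopez",
    "ssn": "000-00-0000",
    "zip": "03601",
    "age": 93,
    "service_date": "2023-05-17",
    "dx": "E11.9",
}
deidentified = safe_harbor_view(record)
print(deidentified)
```

The clinical content (here the diagnosis code) passes through untouched, while the quasi-identifiers are coarsened, which is why the method chosen shapes what can still be linked downstream.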

Major De-Identified Claims Data Licensors

Hagimo sources de-identified claims data from each of the vendors below on behalf of our clients — the larger players that license this data to commercial buyers under a data license agreement with use-case attestation. We know how each one packages, restricts, and approves access, and we secure it for you.

Coverage and data types — the companies we source from
Vendor | Data Type / Coverage
IQVIA | Closed and open claims, Rx (LRx ~4B scripts/yr, ~92% coverage), EMR, and remittance / 835 data. Broadest patient base, including Medicare. Regulatory-grade; the E360 platform spans 1B+ records.
Komodo Health | All-payer (open + closed) claims; 330M+ de-identified patient journeys. Strong for rare disease, RWE, and HEOR. MapLab / MapAI platform. Holds a CMS Innovator's License.
Merative (MarketScan) | Closed commercial, Medicare Supplemental, and Medicaid claims since the early 1990s; 135M+ unique individuals and 200M+ lives across years. Tokenized via Datavant.
HealthVerity | Marketplace aggregator. Among the most extensive open claims from the largest US clearinghouses, plus closed claims from 150+ payers. Lab (Labcorp / Quest), chargemaster, and SDOH. Inovalon preferred partner.
Inovalon | Largest closed-claims source in the US; all-payer dataset (Medicare Advantage, 100% Medicare FFS, commercial); 454M+ unique lives and 97B+ medical events. Also operates an EDI clearinghouse.
Datavant | The linkage / tokenization layer plus the Switchboard marketplace; 500+ real-world data partners and 60M+ records moving across the network. The de facto connective tissue for blended assets.
Clarivate (DRG) | Open and closed claims, blended and patient-mastered across sources into analytic-ready repositories. Strong QA and normalization.
FAIR Health | Independent nonprofit; the FH NPIC private-claims database drawn from payors nationwide, plus Medicare. Licenses de-identified aggregated datasets for commercial, policy, and academic research; among the more accessible licensors.
Veradigm | NLP-enriched claims plus EMR cohorts; license by therapeutic-area cohort, custom cohort definition, or full network EHR dataset. Refreshed nightly / weekly / monthly. Often cited as regulatory-grade.
Symphony Health (ICON) | Integrated Dataverse: medical and Rx claims plus prescriber-level data. A common IQVIA alternative.
Truveta | Health-system-sourced EHR plus linked claims, mortality, and SDOH; emerging for regulatory use, with site-level feasibility strengths. Consortium model.

Clearinghouse-Native Open Claims Sources

These entities originate open claims as a byproduct of transaction processing. In practice their data most often reaches buyers through the aggregators above, but several license or contribute directly.

Where open claims originate
Clearinghouse / Source | Notes
Optum / Change Healthcare | One of the largest open-claims contributors in the US (Change now under Optum / UnitedHealth Group). Optum also licenses its own de-identified clinical and claims assets directly.
Waystar | Major clearinghouse contributing open-claims volume to the market; data typically surfaced through aggregators.
Inovalon | Operates an EDI clearinghouse and is also the largest direct closed-claims licensor (see above). It sits in both categories.
Availity | Large provider-owned clearinghouse with a significant open-claims footprint.
Top US clearinghouses | Aggregated marketplace open-claims coverage represents three of the top four US clearinghouses, totaling 300M+ patients. This is the practical route to aggregated clearinghouse data.

One relationship, the whole market. Vendors in this space do not publish rate cards — pricing, coverage, refresh cadence, and permitted use are all negotiated per engagement and driven by therapeutic scope, history depth, record volume, and whether tokenized linkage is included. We know these vendors, their priorities, and what they will and won't approve, so you don't have to learn it deal by deal.

How Hagimo turns this landscape into your advantage

Broker the deal

We leverage relationships built since 2006 to get your data ask in front of the right people at the right vendors — often where others simply can't.

Build the use case

The specificity and accuracy of your use-case attestation decide whether you are approved. We craft the winning case and go acquire the data on your behalf.

Curate the data

We match sources, manage collision rates and column densities, blend open and closed claims through tokenized linkage, and normalize it into analytic-ready form.
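
Two of the checks named above can be sketched in a few lines. The definitions used here are assumptions rather than Hagimo's actual metrics: column density is taken as the share of rows in which a field is populated, and collision rate as the share of patient tokens whose rows disagree on a field that should never change (birth year in this example). The row layout and the sample values are made up.

```python
from __future__ import annotations

from collections import defaultdict


def column_density(rows: list[dict], column: str) -> float:
    """Share of rows in which `column` is populated (not None or empty)."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


def collision_rate(rows: list[dict], stable_field: str = "birth_year") -> float:
    """Share of tokens whose rows disagree on a field that should be constant."""
    values_by_token: dict[str, set] = defaultdict(set)
    for row in rows:
        if row.get(stable_field) not in (None, ""):
            values_by_token[row["token"]].add(row[stable_field])
    if not values_by_token:
        return 0.0
    collided = sum(1 for values in values_by_token.values() if len(values) > 1)
    return collided / len(values_by_token)


# Made-up rows: token t2 carries two different birth years, a likely collision.
rows = [
    {"token": "t1", "birth_year": 1980, "ndc": "12345-6789-01"},
    {"token": "t1", "birth_year": 1980, "ndc": ""},
    {"token": "t2", "birth_year": 1975, "ndc": "12345-6789-01"},
    {"token": "t2", "birth_year": 1991, "ndc": None},
]

print(column_density(rows, "ndc"))
print(collision_rate(rows))
```

Running checks like these per source before blending shows which feeds are too sparse for a given analysis and which tokens need review before records are merged.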

Run the analytics

From mapping the data to your existing models to delivering finished business intelligence, we can take it all the way to insight.

Industry & Public Data

In addition to health histories, many other data resources are directly applicable to building analytics and business intelligence across medical data sets. Some of these include:
  • CMS Databases
    • Medicare Provider Utilization and Payment Data
    • CMS Medicare and Medicaid Statistical Supplement
    • Facility Comparison Data
  • CDC Data
    • National Health and Nutrition Examination Survey (NHANES)
    • Behavioral Risk Factor Surveillance System (BRFSS)
    • National Notifiable Diseases Surveillance System (NNDSS)
  • ClinicalTrials.gov (National Library of Medicine)
  • HealthData.gov
  • NIH Data
    • NIH Clinical Center Data
    • National Database for Autism Research (NDAR)
  • World Health Organization
    • Global Health Observatory data repository
    • WHO Global Health Expenditure Database
  • HHS
    • Health Indicators Warehouse
    • National Database of Child Care Licensing Regulations
  • FAIR Health healthcare claims database
  • OptumLabs Data Warehouse
  • American Hospital Directory
  • National Plan & Provider Enumeration System (NPPES)
While rich and relatively accessible, nearly all of these data sets exist 'on their own' and can be challenging to aggregate and integrate into a business model. Curation methods, de-duplication strategies, general accuracy, allowed uses, licensing, and other factors all add complexity to an already complicated landscape. Hagimo can help you navigate these challenges.

Support & Reference Data

The processes and systems that make up the U.S. (and global) health system are necessarily complex, with many different entities curating different portions of the whole. Producing actionable business intelligence from these resources requires numerous external reference data sets that bring context to the analytics. Some of this data includes:
  • International Classification of Diseases (ICD): Used worldwide for morbidity and mortality statistics, insurance billing, and health management. ICD codes represent diagnoses and health conditions.
  • Current Procedural Terminology (CPT): A set of codes, descriptors, and guidelines developed by the American Medical Association. CPT codes describe medical, surgical, and diagnostic services and are used for billing and documentation.
  • Healthcare Common Procedure Coding System (HCPCS): Includes Level I (CPT codes) and Level II codes, which cover healthcare services and products not included in CPT, such as ambulance services, durable medical equipment, and certain drugs and medicines.
  • National Drug Codes (NDC): A unique, three-segment number for each medication listed under the Drug Listing Act of 1972. It identifies the labeler, product, and trade package size.
  • Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT): A comprehensive, multilingual clinical healthcare terminology. It provides a standard way to represent clinical phrases captured by healthcare professionals.
  • Logical Observation Identifiers Names and Codes (LOINC): Used to identify health measurements, observations, and documents. LOINC codes standardize the identification of medical test observations.
  • RxNorm: Provides normalized names for clinical drugs and links its names to many of the drug vocabularies commonly used in pharmacy management and drug interaction software.
  • Unified Medical Language System (UMLS): A set of files and software that brings together many health and biomedical vocabularies and standards to enable interoperability between computer systems.
  • Medical Subject Headings (MeSH): A comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences. MeSH provides a consistent way to retrieve information that may use different terminology for the same concepts.
  • Diagnostic and Statistical Manual of Mental Disorders (DSM): Used for psychiatric diagnosis, it classifies mental disorders with associated criteria designed to facilitate more reliable diagnoses of these disorders.
  • Global Medical Device Nomenclature (GMDN): A standardized naming system for medical devices used for patient diagnosis, prevention, monitoring, treatment, or alleviation of disease or injury.
  • Orphanet Rare Disease Ontology (ORDO): A structured vocabulary for rare diseases, capturing information including synonyms, definitions, and relationships to other diseases, genes, and proteins.
These data sets are critical for building deep analytics and accurate business intelligence. They generally have licensing and use restrictions, in addition to variable pricing and availability. In many cases, different versions of these data sets must be properly matched with the versions and time periods associated with medical history data in order to render accurate analytics. Hagimo can help you select the right data sets to work with your health history data and other information assets.
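
Two small examples of what matching reference data to claims involves are sketched below. The first picks the diagnosis code set by service date, using the US switch from ICD-9-CM to ICD-10-CM for services on or after October 1, 2015. The second pads a hyphenated 10-digit NDC into the 11-digit 5-4-2 layout commonly used on claims. The function names are hypothetical, the NDC values are made up, and a real pipeline would also track annual code-set releases and vendor-specific formats.

```python
from __future__ import annotations

from datetime import date

# US claims moved from ICD-9-CM to ICD-10-CM for services on or after this date.
ICD10_START = date(2015, 10, 1)

# The three segment layouts a 10-digit NDC can take (labeler-product-package).
VALID_NDC_LAYOUTS = {(4, 4, 2), (5, 3, 2), (5, 4, 1)}


def icd_version_for(service_date: date) -> str:
    """Return the diagnosis code set in force for a given date of service."""
    return "ICD-10-CM" if service_date >= ICD10_START else "ICD-9-CM"


def ndc_to_11(ndc: str) -> str:
    """Zero-pad a hyphenated 10-digit NDC to the 11-digit 5-4-2 form."""
    parts = ndc.split("-")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected three numeric segments: {ndc!r}")
    if tuple(len(part) for part in parts) not in VALID_NDC_LAYOUTS:
        raise ValueError(f"unexpected NDC layout: {ndc!r}")
    labeler, product, package = parts
    return labeler.zfill(5) + product.zfill(4) + package.zfill(2)


print(icd_version_for(date(2015, 9, 30)))
print(icd_version_for(date(2015, 10, 1)))
print(ndc_to_11("1234-5678-90"))
```

Looking up a 2014 claim against an ICD-10-CM table, or joining a 10-digit NDC to an 11-digit drug file without padding, produces silent mismatches, which is why version and format alignment comes before any analytics.
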
Hagimo has been working exclusively with healthcare data for over 20 years. It's a complex landscape, but the power and insights to be gained from it are immense, and the hardest part is rarely the analytics. It's getting approved, getting the data, and getting it into a form you can use. We broker the deal, build the use case, curate the data, and run the analytics. Let us put this wealth of information to work for your company.

Contact Hagimo for a consultation on the data sets that fit your model: inquiries@hagimo.com  |  (844) 247-6973.