
Mapping Semantic Dependencies in PHI Dataflow: A Practical Graph Approach


Introduction: Why Graph Models for PHI Dataflow?

In healthcare IT, Protected Health Information (PHI) rarely stays still. It flows across systems, applications, and cloud services, often through chains of transformations, copies, and accesses. Traditional data lineage approaches—spreadsheets, static diagrams, or relational audit tables—quickly break down when you need to answer nuanced questions like: Which downstream systems depend on this particular patient record for analytics? Or, if we revoke a data-sharing agreement, which reports will be affected? This guide argues that a graph-based model of semantic dependencies is the most practical way to answer such questions at scale. We will define what we mean by semantic dependencies (as opposed to syntactic or lineage-only connections), explain why graph structures naturally capture the complexity of PHI flow, and lay out a concrete approach to building and maintaining such a graph. This overview reflects widely shared professional practices as of April 2026; verify critical details against current official guidance where applicable.

Core Pain Points Addressed

Teams managing PHI dataflows often face three persistent challenges: (1) understanding the blast radius of a change, (2) proving compliance with regulations like HIPAA or GDPR through accurate data maps, and (3) troubleshooting data quality issues that originate from upstream sources. Graph modeling addresses all three by creating a machine-readable representation of who uses what data, for what purpose, and under what constraints.

What This Guide Covers

We will start with the fundamental concepts of semantic dependency graphs for PHI, then compare three implementation approaches with a detailed table. Next, we provide a step-by-step guide to building your own graph from common data sources like audit logs and access control lists. Two composite scenarios illustrate practical trade-offs, and we close with an FAQ and a conclusion that summarizes key decision criteria.

Core Concepts: Semantic Dependencies vs. Data Lineage

To design a useful graph, we must first distinguish between data lineage and semantic dependency. Data lineage answers where data came from and how it transformed—think version control for data. Semantic dependency goes further: it captures why a downstream system needs that data and what constraints apply (like masking or retention policies). For example, a lineage trace might show that a patient's diagnosis code moved from an EHR to a data warehouse to a research report. A semantic dependency graph would also record that the research report has a purpose of quality improvement, that it must not contain direct identifiers, and that it expires after 90 days. This richer model enables automated impact analysis and policy enforcement.

Entities and Relationships in PHI Graphs

The core entities in a PHI dependency graph are: data subjects (patients), data assets (tables, files, API endpoints), processes (ETL jobs, queries, analytics), and policies (consent, retention, purpose). Relationships might include 'produces', 'consumes', 'requires consent for', and 'masked by'. Each entity and relationship carries attributes that capture semantic context—for instance, a data asset might have a PHI classification level (direct identifier, quasi-identifier, de-identified) and a retention period.
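These entities and relationships can be sketched as plain Python dataclasses. This is a minimal illustrative model, not a schema prescription; all names (the classification values, the example ETL job) are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataAsset:
    name: str
    phi_class: str      # "direct", "quasi", or "deidentified" (assumed scale)
    retention_days: int

@dataclass(frozen=True)
class Process:
    name: str
    purpose: str        # e.g. "quality_improvement"

@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    rel: str            # "produces", "consumes", "governed_by", "masked_by"

# Example: an ETL job consumes an EHR table and produces a warehouse table.
ehr = DataAsset("ehr.diagnoses", "direct", 3650)
etl = Process("nightly_etl", "quality_improvement")
wh = DataAsset("warehouse.diagnoses", "quasi", 365)
edges = [
    Edge(etl.name, ehr.name, "consumes"),
    Edge(etl.name, wh.name, "produces"),
]
```

In a real deployment these would map directly onto graph-database nodes and relationships; the point is that every entity and edge carries typed semantic attributes rather than a bare link.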

Why Graph Structures Fit PHI Dependencies

Graphs excel at representing many-to-many relationships, which are common in PHI flows: one patient record can be used by dozens of processes, and one process can consume records from many patients. Relational databases require cumbersome join tables for such patterns, while graph databases allow direct traversal. This makes queries like 'find all reports that consume data from patients in clinical trial X' efficient and intuitive.

Common Mistakes in Semantic Modeling

Teams often over-constrain or under-constrain their graph. Over-constraining means creating too many entity types and relationships, making the graph brittle and hard to maintain. Under-constraining means treating all dependencies as generic 'uses' with no semantic detail, losing the ability to differentiate between, say, a direct copy versus an aggregate statistic. A useful heuristic is to start with three relationship types: 'directly uses', 'derives from', and 'governed by', then expand only when you encounter a concrete need that cannot be expressed.

Comparing Approaches: Property Graph, RDF Triplestore, and Hybrid

When implementing a semantic dependency graph, you need to choose a storage model. We compare three common approaches: property graph databases (e.g., Neo4j), RDF triplestores (e.g., Apache Jena), and a hybrid approach that uses a property graph for query performance with an RDF layer for inference. The following table summarizes key differences:

| Feature | Property Graph | RDF Triplestore | Hybrid |
| --- | --- | --- | --- |
| Query language | Cypher (declarative, pattern-matching) | SPARQL (W3C standard) | Both via connectors |
| Schema flexibility | Schema-optional, easy to evolve | Strict schema via ontologies | Depends on layer; often schema-on-read |
| Reasoning support | Limited (custom code) | Built-in OWL/RL reasoning | SPARQL inference over RDF views |
| Performance on traversal | Excellent for deep paths | Moderate, optimized for joins | Good with caching |
| Ease of integration with existing tools | Many connectors (SQL, Spark) | RDF libraries, often Java-centric | Higher complexity |
| Best for | Operational impact analysis | Interoperability and inference | Large ecosystems needing both |

When to Choose Property Graph

If your primary use case is real-time impact analysis—finding all downstream consumers of a given data asset—a property graph offers the fastest traversal and simplest development. Many teams start here because Cypher is easy to learn and the graph can be prototyped quickly. However, you lose the ability to perform standard inference (like transitive closure over 'part-of' relationships) without custom code.

When to Choose RDF Triplestore

If you need to interoperate with other organizations using standard ontologies (like FHIR or DCAT), or if you require automated reasoning (e.g., infer that a report is research-related if it uses data from a research-specific process), an RDF triplestore is more appropriate. The trade-off is that SPARQL queries can be slower on deep traversals, and the upfront schema design is more involved.

Hybrid: Best of Both Worlds with Added Complexity

A hybrid approach uses a property graph for operational queries and synchronizes a subset of entities to an RDF store for inference. This is popular in large healthcare systems where different teams need different query capabilities. The downside is maintaining two stores and ensuring consistency. Many teams find that starting with a property graph and later adding an RDF view for specific reasoning tasks is a pragmatic path.

Step-by-Step Guide: Building a Semantic Dependency Graph from Audit Logs

This walkthrough assumes you have access to system audit logs that record who accessed what data, when, and from which application. We will transform these logs into a semantic dependency graph in four phases: extraction, entity resolution, relationship modeling, and validation. The process is iterative—you will likely refine entity types as you discover new patterns.

Phase 1: Extract Entities from Audit Logs

Parse audit logs to identify unique entities: data assets (file paths, database tables, API endpoints), processes (application names, job IDs), and actors (users, service accounts). For each entity, collect attributes like PHI classification (from a separate metadata catalog) and purpose code (from a business glossary). Aim for a normalized set of entity types—for example, 'DataAsset' with properties: name, type, phiClass, retentionDays.
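A minimal extraction pass might look like the following. The log format here is hypothetical (a CSV of timestamp, actor, process, action, asset); real audit logs vary widely, so treat this as a sketch of the pattern rather than a parser for any specific system:

```python
import csv
import io

# Hypothetical audit log: timestamp, actor, process, action, asset
RAW_LOG = """\
2026-03-01T02:00:01,svc_etl,nightly_etl,READ,ehr.patient_demographics
2026-03-01T02:00:05,svc_etl,nightly_etl,WRITE,warehouse.patient_demographics
2026-03-01T09:12:44,alice,qi_report,READ,warehouse.patient_demographics
"""

def extract_entities(raw: str):
    """Collect the unique entity names of each type from raw log rows."""
    assets, processes, actors = set(), set(), set()
    for ts, actor, process, action, asset in csv.reader(io.StringIO(raw)):
        actors.add(actor)
        processes.add(process)
        assets.add(asset)
    return assets, processes, actors

assets, processes, actors = extract_entities(RAW_LOG)
print(sorted(processes))  # ['nightly_etl', 'qi_report']
```

Attributes like PHI classification would then be joined in from the metadata catalog in a second pass, keyed on the asset name.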

Phase 2: Resolve and Deduplicate Entities

The same entity may appear with different names across logs (e.g., 'patient_demographics' vs. 'PATIENT_DEMOGRAPHICS'). Use a simple canonicalization rule: lower-case and strip underscores. For more complex cases, such as a file that is copied under a new name, you need to link the two representations with 'same_as' relationships. This step is critical because unresolved duplicates break dependency chains.
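The canonicalization rule plus an explicit alias table can be expressed in a few lines. The alias pair below is a made-up example of a copied file; in practice this table would be curated manually or populated from copy events in the logs:

```python
def canonicalize(name: str) -> str:
    """Lower-case and strip underscores so naming variants collapse."""
    return name.lower().replace("_", "")

# Cross-system aliases that canonicalization alone cannot catch are
# linked explicitly as 'same_as' pairs (hypothetical example).
SAME_AS = {canonicalize("demographics_v2_copy"): canonicalize("patient_demographics")}

def resolve(name: str) -> str:
    """Map any raw name to its canonical graph node identifier."""
    canon = canonicalize(name)
    return SAME_AS.get(canon, canon)

assert resolve("PATIENT_DEMOGRAPHICS") == resolve("patient_demographics")
assert resolve("demographics_v2_copy") == resolve("patient_demographics")
```

Applying `resolve` to every entity name before node creation guarantees that one real-world asset maps to exactly one graph node.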

Phase 3: Model Relationships with Semantic Labels

From log entries, create relationships: 'Process A accessed DataAsset B at time T' becomes a 'consumes' edge with a timestamp. Additionally, if a process writes to another data asset, create a 'produces' edge. For policy relationships, you may need to import from a separate consent management system—for example, 'Patient P has consented to Process A for purpose Q' becomes a 'governed_by' edge. Label each edge with a type and, optionally, a purpose code.
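The mapping from log actions to labeled edges is mechanical once the vocabulary is fixed. A sketch, assuming READ/WRITE action codes and a dict-based edge representation:

```python
# Turn audit actions into labeled edges (READ -> consumes, WRITE -> produces).
ACTION_TO_REL = {"READ": "consumes", "WRITE": "produces"}

def log_row_to_edge(ts: str, process: str, action: str, asset: str) -> dict:
    """Build one semantic edge from one audit log row."""
    return {"src": process, "dst": asset,
            "rel": ACTION_TO_REL[action], "at": ts}

edge = log_row_to_edge("2026-03-01T02:00:01", "nightly_etl", "READ",
                       "ehr.patient_demographics")

# Policy edges come from a separate consent system, not the audit log.
# Hypothetical example of the imported shape:
consent_edge = {"src": "patient_123", "dst": "nightly_etl",
                "rel": "governed_by", "purpose": "quality_improvement"}
```

Keeping the timestamp and purpose code on the edge, rather than the node, is what later enables time-scoped and purpose-scoped queries.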

Phase 4: Validate and Refine

Run a set of validation queries: for example, list all data assets that have no outgoing 'governed_by' edges (orphan assets). Check that every 'consumes' edge connects to a known data asset. Fix any gaps by revisiting the source logs or adding manual annotations. After validation, the graph can be used for impact analysis: given a data asset, traverse all paths to find dependent processes and downstream assets.
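The two validation queries described above can be written as simple set checks over the edge list. Node and policy names here are illustrative:

```python
def orphan_assets(assets: set, edges: list) -> list:
    """Data assets with no outgoing 'governed_by' edge."""
    governed = {e["src"] for e in edges if e["rel"] == "governed_by"}
    return sorted(a for a in assets if a not in governed)

def dangling_consumes(assets: set, edges: list) -> list:
    """'consumes' edges whose target is not a known data asset."""
    return [e for e in edges
            if e["rel"] == "consumes" and e["dst"] not in assets]

assets = {"ehr.diagnoses", "warehouse.diagnoses"}
edges = [
    {"src": "ehr.diagnoses", "dst": "hipaa_policy", "rel": "governed_by"},
    {"src": "nightly_etl", "dst": "warehouse.diagnoses", "rel": "consumes"},
]
print(orphan_assets(assets, edges))   # ['warehouse.diagnoses']
print(dangling_consumes(assets, edges))  # []
```

In a graph database these would be expressed as pattern-matching queries, but the logic is the same: assert structural invariants, then fix gaps at the source.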

Real-World Scenario 1: Hospital-Wide ETL Failure

Consider a mid-sized hospital system that runs nightly ETL jobs to load patient data from an EHR into a data warehouse for reporting. One night, a schema change in the EHR caused a column to be renamed, breaking the ETL. The data team spent three days tracing all downstream reports to understand the impact. With a semantic dependency graph, they could have run a single query: find all processes that consume the affected table, then find all reports those processes produce, and prioritize fixes.

How the Graph Helps

The graph would contain nodes for the EHR table, the ETL job, the warehouse table, and each report. Edges would be labeled 'consumes' (ETL consumes EHR table), 'produces' (ETL produces warehouse table), and 'consumes' again (report consumes warehouse table). A simple traversal from the EHR table node along 'consumes' and 'produces' edges reveals the full dependency chain in milliseconds. Furthermore, if the graph stores purpose codes, you can see that the reports are for quality reporting (required by regulators) versus ad-hoc analysis (less urgent).
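The traversal itself is an ordinary breadth-first search over 'consumes' and 'produces' edges. A plain-Python sketch with illustrative node names (a graph database would do this natively with a single query):

```python
from collections import defaultdict, deque

def downstream(start: str, edges: list) -> set:
    """BFS to find every process and asset that depends on `start`."""
    adj = defaultdict(list)
    for e in edges:
        if e["rel"] == "consumes":
            adj[e["dst"]].append(e["src"])   # asset feeds the consuming process
        elif e["rel"] == "produces":
            adj[e["src"]].append(e["dst"])   # process feeds the produced asset
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

edges = [
    {"src": "nightly_etl", "dst": "ehr.diagnoses", "rel": "consumes"},
    {"src": "nightly_etl", "dst": "warehouse.diagnoses", "rel": "produces"},
    {"src": "qi_report", "dst": "warehouse.diagnoses", "rel": "consumes"},
]
impacted = downstream("ehr.diagnoses", edges)
# impacted == {'nightly_etl', 'warehouse.diagnoses', 'qi_report'}
```

Filtering `impacted` by the purpose codes stored on the edges then gives the prioritized fix list described above.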

Implementation Lessons

The hospital team later added attributes to the 'consumes' edges to record the specific columns used, enabling column-level impact analysis. They also integrated a consent management system to mark which patients' data could not be used for research, automatically blocking certain ETL paths. This case shows that the graph's value grows as you enrich edges with semantic detail like column names and consent flags.

Real-World Scenario 2: Multi-Cloud Data Sharing Agreement

A health-tech startup shares de-identified patient data with a research partner through a cloud-to-cloud pipeline. The data sharing agreement specifies that only aggregated statistics (no row-level data) may be shared, and that the data must be deleted after 90 days. Without a dependency graph, compliance audits required manually checking each data transfer and retention configuration. A semantic graph automates this check.

Modeling the Agreement

The graph includes nodes for the source database, the aggregation process, the shared dataset, and the partner's systems. Edges include 'produces' (aggregation produces aggregated data), 'governed_by' (aggregated data is governed by the agreement), and 'consumes' (partner consumes aggregated data). The agreement node itself carries attributes: purpose (research), retention (90 days), and allowed operations (read-only, aggregated). Regular graph queries can verify that no 'consumes' edge bypasses the aggregation process, and that all partner 'consumes' edges have a corresponding 'governed_by' edge with a retention policy.
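The compliance check reduces to one invariant: every partner 'consumes' edge must target an asset that carries a 'governed_by' edge. A sketch with hypothetical node names, including the bypassing direct copy discussed below:

```python
def partner_edges_missing_policy(edges: list, partner_nodes: set) -> list:
    """Partner 'consumes' edges whose target has no 'governed_by' edge."""
    governed = {e["src"] for e in edges if e["rel"] == "governed_by"}
    return [e for e in edges
            if e["rel"] == "consumes"
            and e["src"] in partner_nodes
            and e["dst"] not in governed]

edges = [
    {"src": "agg_job", "dst": "source_db", "rel": "consumes"},
    {"src": "agg_job", "dst": "shared_stats", "rel": "produces"},
    {"src": "shared_stats", "dst": "sharing_agreement", "rel": "governed_by"},
    {"src": "partner_system", "dst": "shared_stats", "rel": "consumes"},
    # A direct copy bypassing aggregation -- the violation scenario:
    {"src": "partner_system", "dst": "source_db", "rel": "consumes"},
]
violations = partner_edges_missing_policy(edges, {"partner_system"})
# violations flags only the direct source_db copy
```

Running this check on a schedule (or on every graph update) turns the manual audit into an automated invariant.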

Benefits and Limitations

This approach caught one violation early: a developer had set up a direct copy of the source database to the partner's cloud bucket, bypassing the aggregation step. The graph flagged this because there was no 'governed_by' edge for that direct path. However, the graph cannot automatically detect whether the aggregation process actually produces aggregated data—it only models the intended dataflow. Validation still requires periodic testing of the aggregation logic.

Common Questions and Troubleshooting

Practitioners often ask about scalability, integration with existing metadata catalogs, and handling of PHI classification. Below we address the most frequent concerns.

How large can a PHI dependency graph grow?

In practice, a mid-sized hospital system with 1000 data assets and 500 processes may generate around 10,000 edges (including policy relationships). Property graphs handle millions of nodes and edges on a single server, so scale is rarely a bottleneck. The bigger challenge is keeping the graph up to date as new data sources and processes are added. Automating the ingestion from audit logs and change management systems is essential.

How do I integrate with existing data catalogs?

Most data catalogs (like Apache Atlas or Collibra) already maintain metadata about tables and columns. You can export this metadata as CSV and import it into your graph as nodes. The key is to map catalog classifications (e.g., 'PHI', 'PII') to your graph's PHI classification property. Often, you can write a script that reads the catalog API and upserts nodes and edges on a daily basis.
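The import script is usually a small mapping-and-upsert loop. The CSV columns and classification labels below are hypothetical stand-ins for whatever your catalog actually exports:

```python
import csv
import io

# Hypothetical catalog export: asset name and its classification label.
CATALOG_CSV = """\
name,classification
ehr.patient_demographics,PHI
warehouse.daily_counts,De-identified
"""

# Map catalog labels onto the graph's phi_class property (assumed mapping).
LABEL_MAP = {"PHI": "direct", "PII": "quasi", "De-identified": "deidentified"}

def upsert_nodes(nodes: dict, raw_csv: str) -> dict:
    """Create or update a graph node per catalog row."""
    for row in csv.DictReader(io.StringIO(raw_csv)):
        node = nodes.setdefault(row["name"], {})
        node["phi_class"] = LABEL_MAP.get(row["classification"], "unknown")
    return nodes

nodes = upsert_nodes({}, CATALOG_CSV)
```

Scheduling this daily (reading from the catalog API instead of a static CSV) keeps the graph's classifications aligned with the catalog of record.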

What about PHI classification changes over time?

PHI classification is not static—a data asset may start as de-identified but later receive direct identifiers through a join. Your graph should store the classification as a property on the node, and you should update it whenever a process changes the asset's content. One approach is to attach a 'trust' property to each 'produces' edge indicating the level of de-identification applied, and let the downstream node's classification be derived from the most upstream source's classification and the transformations applied.
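One way to derive a downstream classification from the upstream level and the transformation recorded on the 'produces' edge is shown below. The three-level scale and transform names are assumptions for this sketch, not a standard:

```python
# Ordered from most to least sensitive (assumed scale for this sketch).
LEVELS = ["direct", "quasi", "deidentified"]

def derived_class(upstream_class: str, transform: str) -> str:
    """Downstream classification = upstream level, relaxed by the
    de-identification step recorded on the 'produces' edge."""
    if transform == "none":
        return upstream_class
    if transform == "mask_direct":
        # Masking direct identifiers leaves at most quasi-identifiers.
        return LEVELS[max(LEVELS.index(upstream_class), LEVELS.index("quasi"))]
    if transform == "aggregate":
        return "deidentified"
    raise ValueError(f"unknown transform: {transform}")

print(derived_class("direct", "mask_direct"))  # quasi
print(derived_class("quasi", "aggregate"))     # deidentified
```

Note the conservative direction: a transform can only relax the classification, never make it more sensitive than the upstream source, and a join that reintroduces identifiers must be modeled as a new 'produces' edge with `transform="none"` from the identifying source.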

Can I use this approach for real-time compliance monitoring?

Yes, if your graph database supports streaming updates (many do via change data capture). You can set up triggers that evaluate policy rules whenever a new edge is added. For example, if a new 'consumes' edge is created for a data asset classified as 'direct identifier', and the consuming process's purpose is 'research', the graph can fire an alert because the purpose does not match the patient's consent. However, real-time monitoring requires careful tuning to avoid false positives.
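The rule evaluation fired on each new edge is itself simple; the engineering work is wiring it into your database's change stream. A sketch of the consent rule described above, with hypothetical node names:

```python
def check_new_edge(edge: dict, asset_class: dict,
                   process_purpose: dict, consented_purposes: dict) -> list:
    """Policy rule evaluated when a 'consumes' edge is added:
    direct identifiers may only flow to purposes with matching consent."""
    alerts = []
    if edge["rel"] == "consumes" and asset_class.get(edge["dst"]) == "direct":
        purpose = process_purpose.get(edge["src"])
        if purpose not in consented_purposes.get(edge["dst"], set()):
            alerts.append(f"{edge['src']} ({purpose}) reads {edge['dst']} "
                          f"without matching consent")
    return alerts

alerts = check_new_edge(
    {"src": "research_job", "dst": "ehr.notes", "rel": "consumes"},
    asset_class={"ehr.notes": "direct"},
    process_purpose={"research_job": "research"},
    consented_purposes={"ehr.notes": {"treatment"}},
)
# alerts contains one violation: research purpose lacks consent
```

Tuning for false positives then mostly means refining the lookup tables (classifications, purposes, consent scopes) rather than the rule itself.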

Conclusion: Practical Steps to Get Started

Mapping semantic dependencies in PHI dataflow using a graph approach is not a one-time project but an ongoing practice. The most important step is to start small: pick a single data domain (say, patient demographics), build a prototype graph from audit logs covering that domain, and run a few impact analysis queries. This will surface gaps in your entity definitions and relationship modeling that you can fix before expanding to other domains.

Key Takeaways

First, distinguish between data lineage and semantic dependency—the latter includes purpose, policy, and classification context. Second, choose a graph model that balances query performance with reasoning needs; for most teams, a property graph is the best starting point. Third, automate ingestion from existing logs and catalogs to keep the graph current. Fourth, validate your graph with both automated checks and manual spot-checking against known dependency chains. Finally, remember that the graph is a model of your intended dataflows, not a perfect mirror of reality—it will always need periodic review and correction.

Call to Action

If you are responsible for PHI data governance in your organization, consider running a pilot project using the step-by-step guide in this article. Start with a subset of your data assets and a single use case (like impact analysis for schema changes). Measure the time saved in answering dependency questions compared to your current manual process. The results will likely justify expanding the graph to cover more of your PHI landscape.

About the Author

This article was prepared by the editorial team for this publication. We focus on practical explanations and update articles when major practices change.

Last reviewed: April 2026
