
Data Processing Agreement (DPA) Enforcement: Privacy by Design for Third-Party Integrations

Third-party app integrations have become foundational to modern software development. From streamlining workflows to accelerating feature deployment, integrations help organizations build more robust, feature-rich applications while focusing on their core value propositions.

The Benefits of Third-Party App Integrations

Common use cases

Advantages

The Dangers of Third-Party App Integrations

While third-party services unlock massive benefits, they also introduce risks, especially when privacy is not embedded by design.

SDKs full of security risks

Most integrations rely on SDKs that introduce:

How Data Processing Agreement Violations Happen

Despite the benefits, third-party integrations often become privacy minefields. Developers, and increasingly AI code assistants, can unintentionally introduce risks by oversharing sensitive data with third-party services, bypassing established data processing agreements (DPAs).

The gap: tracking data flows to third-party integrations is crucial for ensuring that Data Processing Agreements (DPAs) are upheld.

Rigorous vendor onboarding, but no continuous monitoring

Assume a company has developed a customer-facing application that integrates with Datadog for continuous monitoring, Google Analytics for tracking user sessions, Salesforce for updating customer data, and OpenAI to enable personalization. The appendix of most DPAs documents the categories of data subjects, categories of personal information, sensitive data processed, and the nature and purpose of processing.

In this scenario, the agreed-upon categories of personal information allowed for each vendor are as follows:

| Platform | Categories of Personal Information Allowed in the DPA |
| --- | --- |
| Datadog | hostname, ipAddress, deviceType |
| OpenAI | role, industry, companySize, age, gender |
| Google Analytics | ipAddress, deviceType, browserUsed |
| Salesforce | firstName, lastName, email, phoneNumber, role, companyName, industry |

Security, privacy, and third-party risk management teams often spend significant time during vendor onboarding ensuring that vendors meet compliance requirements and agree to DPA terms. Unfortunately, many companies stop there. Once a vendor is onboarded, few controls are put in place to continuously monitor adherence to the agreed-upon data flows.

This is a critical gap. It is not just the vendor's responsibility to uphold the DPA. Your own developers play a major role. If an engineer mistakenly sends unauthorized fields (such as email or SSN) to a vendor like Datadog or OpenAI, the breach originates from your side, even if the vendor's own systems are secure and compliant.

Once that sensitive data reaches a third-party system, you are at the mercy of their internal data handling practices. In many cases, the data becomes deeply embedded within their ecosystem, replicated across logs, caches, dashboards, backups, and internal analytics tools. Deleting or correcting that data after the fact can be operationally complex and legally uncertain.

Strong vendor onboarding is not enough. Without continuous controls that keep data sharing in code aligned with what was contractually agreed, data overexposure is not just theoretical. It is inevitable.
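
One way to picture such a continuous control is a small runtime guard that encodes the DPA table above as data and rejects any payload containing a field outside the vendor's allowlist. The sketch below is illustrative only: `DPA_ALLOWLIST`, `findDpaViolations`, and `assertDpaCompliant` are hypothetical names, not part of any vendor SDK, and the check assumes flat payloads, so it inspects top-level keys only.

```typescript
// Hypothetical allowlist mirroring the DPA table above (top-level field names only).
type Vendor = "Datadog" | "OpenAI" | "Google Analytics" | "Salesforce";

const DPA_ALLOWLIST: Record<Vendor, readonly string[]> = {
  Datadog: ["hostname", "ipAddress", "deviceType"],
  OpenAI: ["role", "industry", "companySize", "age", "gender"],
  "Google Analytics": ["ipAddress", "deviceType", "browserUsed"],
  Salesforce: [
    "firstName", "lastName", "email", "phoneNumber",
    "role", "companyName", "industry",
  ],
};

// Returns the payload keys that the vendor's DPA does not permit.
function findDpaViolations(
  vendor: Vendor,
  payload: Record<string, unknown>,
): string[] {
  const allowed = new Set(DPA_ALLOWLIST[vendor]);
  return Object.keys(payload).filter((key) => !allowed.has(key));
}

// Throws before the payload can leave the application.
function assertDpaCompliant(
  vendor: Vendor,
  payload: Record<string, unknown>,
): void {
  const violations = findDpaViolations(vendor, payload);
  if (violations.length > 0) {
    throw new Error(`DPA violation for ${vendor}: ${violations.join(", ")}`);
  }
}
```

A guard like this only helps at runtime and only where every call site uses it, so it complements checks that run before deployment rather than replacing them.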

Real examples: accidental sharing of entire user objects

As developers build integrations with analytics, observability, or CRM tools, it is common to pass contextual data to these platforms for better insights. Without clear guardrails, developers or AI coding assistants may accidentally transmit full user objects, exposing PII such as names, emails, phone numbers, and even Social Security Numbers. This often happens when objects are spread into function parameters or logged without filtering.

All examples below use this shared User object:

interface User {
  id: string;
  email: string;
  ssn: string;
  firstName: string;
  lastName: string;
  phoneNumber: string;
  role: string;
  companyName: string;
  industry: string;
}

Example 1: Datadog

function handleLogin(user: User) {
  // BAD: full user object, violates the DPA
  datadogLogger.info("User logged in", { user });

  // GOOD: only metadata permitted by the DPA
  const { deviceType, ipAddress, hostname } = getSystemInfo();
  datadogLogger.info("User logged in", { deviceType, ipAddress, hostname });
}

Why it is risky: in this scenario, the DPA with Datadog allows only metadata such as hostname, ipAddress, and deviceType. Logging the full user object violates this agreement and may expose sensitive data in Datadog logs, which are hard to scrub after ingestion.

Example 2: Google Analytics

function trackUserSignup(user: User) {
  // BAD: object spread leaks every field
  gtag("event", "user_signup", { ...user });

  // GOOD: only permitted fields
  const { deviceType, browserUsed, ipAddress } = getDeviceInfo();
  gtag("event", "user_signup", { deviceType, browserUsed, ipAddress });
}

Why it is risky: under the DPA in this scenario, Google Analytics is not permitted to receive PII like names, emails, or SSNs. Sending the full user object, especially via object spread, can leak sensitive information that is then stored and processed in violation of the DPA terms.

Example 3: Salesforce

function syncUserToSalesforce(user: User) {
  // BAD: spreads the SSN and internal IDs into Salesforce
  sendToSalesforce("lead_create", { ...user });

  // GOOD: explicit, DPA-permitted fields only
  sendToSalesforce("lead_create", {
    firstName: user.firstName,
    lastName: user.lastName,
    email: user.email,
    phoneNumber: user.phoneNumber,
    companyName: user.companyName,
    role: user.role,
    industry: user.industry,
  });
}

Why it is risky: although Salesforce may allow many fields under the DPA (name, contact info, and so on), PII like SSNs and user IDs are typically out of scope. Spreading the full object risks violating these agreements, especially if data visibility in Salesforce is not tightly controlled.
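
The explicit field list in the GOOD case can also be factored into a small helper, so the permitted fields live in one place instead of being retyped at each call site. This is a minimal sketch: `pickFields`, `SALESFORCE_FIELDS`, and `toSalesforceLead` are hypothetical names, and the `User` interface is repeated here only so the block stands alone.

```typescript
interface User {
  id: string;
  email: string;
  ssn: string;
  firstName: string;
  lastName: string;
  phoneNumber: string;
  role: string;
  companyName: string;
  industry: string;
}

// Copies only the listed keys; anything not named is never included in the result.
function pickFields<T extends object, K extends keyof T>(
  source: T,
  keys: readonly K[],
): Pick<T, K> {
  const result = {} as Pick<T, K>;
  for (const key of keys) {
    result[key] = source[key];
  }
  return result;
}

// Hypothetical single source of truth for the fields the Salesforce DPA permits.
const SALESFORCE_FIELDS = [
  "firstName", "lastName", "email", "phoneNumber",
  "companyName", "role", "industry",
] as const;

function toSalesforceLead(user: User) {
  return pickFields(user, SALESFORCE_FIELDS);
}
```

Because the return type is `Pick<User, ...>`, code that later reads `lead.ssn` fails to compile, and the SSN and internal ID are absent from the object at runtime as well.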

Example 4: OpenAI, tainted variables in prompts

Variables that begin clean may become tainted with PII. Developers and AI assistants often fail to catch this, especially when constructing dynamic prompts for AI models.

// `user` is an instance of the shared User object defined above.
const promptContext = {
  audience: "Customer",
  notes: "Welcome to our platform.",
};

// BAD: the variable becomes tainted with PII
promptContext.audience = `${user.firstName} ${user.lastName}`;
promptContext.notes = `Welcome ${user.email} to the ${user.industry} platform.`;

// GOOD: use only permitted metadata in the prompt
const prompt = `Generate a welcome message for a ${user.role} in the ${user.industry} sector.`;
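
A lightweight runtime check can catch some of this tainting before a prompt is sent. The sketch below builds the prompt from permitted fields only, then verifies that no value of a disallowed field appears in the final string. `PromptUser`, `buildWelcomePrompt`, and `findLeakedValues` are hypothetical names, and plain substring matching is an assumption that will miss transformed, truncated, or re-cased values.

```typescript
// Hypothetical subset of the User object relevant to this example.
interface PromptUser {
  firstName: string;
  lastName: string;
  email: string;
  role: string;
  industry: string;
}

// Fields the OpenAI DPA in this scenario does not allow in a prompt.
const DISALLOWED_IN_PROMPT = ["firstName", "lastName", "email"] as const;

// Builds the prompt from DPA-permitted metadata only.
function buildWelcomePrompt(user: PromptUser): string {
  return `Generate a welcome message for a ${user.role} in the ${user.industry} sector.`;
}

// Returns the names of disallowed fields whose values appear verbatim in the text.
function findLeakedValues(text: string, user: PromptUser): string[] {
  return DISALLOWED_IN_PROMPT.filter((field) => {
    const value = user[field];
    return value.length > 0 && text.includes(value);
  });
}
```

Calling `findLeakedValues` on the string assigned to `promptContext.notes` in the BAD case would report `email`, while the prompt from `buildWelcomePrompt` reports nothing.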

DPA breach summary

| Scenario | Platform | Breach (Not Allowed by DPA) | Allowed by DPA |
| --- | --- | --- | --- |
| Full user object to Datadog | Datadog | email, ssn, firstName, lastName, phoneNumber, role, companyName, industry | hostname, ipAddress, deviceType |
| Tainted variables in OpenAI prompt | OpenAI | email, firstName, lastName | role, industry, companySize, age, gender |
| Full user object to Google Analytics | Google Analytics | email, ssn, firstName, lastName, phoneNumber, role, companyName, industry | ipAddress, deviceType, browserUsed |
| Full user object to Salesforce | Salesforce | ssn | firstName, lastName, email, phoneNumber, role, companyName, industry |

Policy violations by framework

When sensitive data is shared with third-party integrations beyond the scope of an established DPA, it constitutes a clear violation of applicable regulations, including:

Best practices

Methods of Tracking Third-Party Data Flows and Enforcing Data Minimization

| Method | Layer | Pros | Cons |
| --- | --- | --- | --- |
| Static Code Analysis | Code | Early detection pre-deployment, scales across repos, enforces privacy by design, works for developer and AI-generated code | May miss runtime-generated data |
| Manual Code Reviews | Code | Human judgment, can catch complex context-based issues | Time-consuming, not scalable, prone to human error |
| API Gateway Monitoring | API | Centralized control over API traffic, can log, redact, or block | Requires all traffic to pass through the gateway, misses traffic that bypasses it such as SDKs and internal services |
| Network Proxy | Network | No need to modify app code | Hard to scale across microservices, lacks understanding of data context or meaning |
| Data Loss Prevention (DLP) | Network / Storage | Detects sensitive data in transit or at rest, integrates with the broader security stack | Reactive rather than preventative, lacks visibility into app-layer data flows and third-party SDKs |

While API and network-level tools provide valuable safeguards, they are fundamentally reactive. These solutions sanitize data in transit but do not prevent the collection of unnecessary data, falling short of enforcing true data minimization, a cornerstone of privacy by design.

DIY PII detection in code scanning does not scale

Hardcoded regex rules are brittle, difficult to maintain, and often limited to basic log detection. Most DIY efforts stall before scaling meaningfully, especially when it comes to tracking data flows through third-party SDKs. These efforts lack context around data sensitivity, awareness of sanitization or transformations, and visibility into where data ends up. The rule set balloons when it has to account for every regex variation per sensitive data type, variations in field names and object nesting, and all SDK invocations scattered across large codebases. As codebases evolve, accurate coverage becomes nearly impossible to maintain, making DIY approaches unsustainable for privacy and compliance at scale.
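
To make the brittleness concrete, here is a minimal sketch of the kind of field-name rule a DIY scanner might start with. The pattern, the `flaggedLines` helper, and the sample source lines are all hypothetical. The point is only that a rule keyed to one naming convention misses trivial variations and cannot follow a value once it has been renamed or spread into another object.

```typescript
// A naive rule: flag any line that mentions an `ssn` identifier.
const NAIVE_SSN_RULE = /\bssn\b/i;

// Hypothetical lines of application code; each one sends an SSN to a logger.
const sourceLines: string[] = [
  'logger.info("signup", { ssn: user.ssn });',
  'logger.info("signup", { taxId: user.socialSecurityNumber });',
  'const ctx = { ...user }; logger.info("signup", ctx);',
];

// Returns the lines that the rule matches.
function flaggedLines(rule: RegExp, lines: string[]): string[] {
  return lines.filter((line) => rule.test(line));
}

const flagged = flaggedLines(NAIVE_SSN_RULE, sourceLines);
```

Only the first line is flagged. The renamed field and the spread object pass unnoticed even though they carry the same data, and each new rule added to cover them brings its own misses and false positives.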

HoundDog.ai: The Privacy by Design Code Scanner Purpose-Built for PII Detection and Data Mapping

HoundDog.ai empowers security, privacy, and engineering teams to catch sensitive data leaks and privacy risks before code is deployed. Built from the ground up to enforce privacy by design, the static code scanner enforces data minimization and maps sensitive data flows across all storage mediums and third-party integrations, all directly within your source code.

Proactive by design: data flow mapping across all third-party integrations catches DPA violations before code is pushed to production.

Blazing fast, built in Rust for scale

The scanner is written entirely in Rust, making it extremely fast and lightweight. It can scan millions of lines of code in under a minute, with virtually no impact on developer velocity. It is built for large monolithic or microservices codebases, high-frequency CI/CD pipelines, and multi-language repositories.

Unmatched detection accuracy across the full data lifecycle

HoundDog.ai goes far beyond regular expressions, delivering precise, context-aware detection across three areas. Sensitive data elements: PII, PHI, PIFI, CHD, and other regulated identifiers. Risky data sinks: hundreds of third-party tools and SDKs across observability, analytics, sales, marketing, and AI. Sanitization gaps: data is flagged only when it is unsanitized, which reduces noise and surfaces real risks.

Endlessly flexible and built for compliance

Tailor detection logic to your unique tech stack and regulatory requirements: define custom data element types based on internal policies or legal obligations, apply granular allowlists to enforce which data elements are permitted per data sink or third-party integration, and add custom sanitization functions to meet your internal security standards. Whether you are aligning with GDPR, HIPAA, PCI DSS, or internal policies, HoundDog.ai adapts to your needs.

Enterprise ready, developer first, CI integrated

HoundDog.ai fits directly into existing engineering workflows: connect to GitHub, GitLab, or Bitbucket to scan pull requests, block risky changes, and leave actionable code comments. Use Managed Scans to offload scan execution for continuous, hands-off coverage across all repositories with compliance-grade reporting. Or inject scans into pipelines via GitHub Actions, GitLab CI, Jenkins, and more.

Privacy by design for AI applications

AI applications introduce a unique set of risks, and HoundDog.ai is purpose-built to address them. The scanner detects sensitive data leaks in AI-specific mediums including prompt logs, embedding stores, and temporary files, and flags unsanitized inputs passed into LLMs. This ensures AI features comply with your privacy standards before anything reaches production.

Map every third-party data flow in your code

Try the free Privacy Code Scanner and see exactly which data elements reach each SDK, API, and AI integration, before your DPAs are violated.