How People Data Labs Sources Data
March 11, 2021
PDL offers unique access to over 2.5 billion B2B records, providing our clients with the freshest and most accurate data available. Our mission is to empower engineers, product developers, and data scientists with the data they need to power innovation and provide better and more comprehensive results for their end-users, and transparency is a huge part of that mission. In this multi-part series, we’ll unpack how we source, build, standardize, index, and QA our data.
People Data Labs draws from thousands of individual data sources. A source may be a single company, a single website, or a broad public crawl of the open web. We cast a wide net to collect as much data as possible, but all of our sources fit into two distinct categories.
Public Sources
The People Data Labs dataset includes data from ~100 public sources. These are data sources available to anyone in the world with an internet connection. We crawl the web to extract this information in much the same way that commercial search engines like Google and Bing do. Public data sources provide us with information on companies, schools, and locations, as well as individuals' work history, educational background, and more. Instead of indexing this data like a search engine, we aggregate it and resolve it to our various entities: person, school, company, and so on.
Data Union Sources
This is data shared with us by our customers or through strategic partnerships. Each of these individual sources has different data in it. For example, we may get business card data from one partner, education history from another, contact information from a third, and so on. We then merge data from several different sources and combine them into a single, unified dataset. This allows us to anonymize the data we’ve sourced from our Data Union and account for the specifics of individual customer data using source-specific tactics before it’s subjected to the more generalized merging process with our production data set.
The end result of this process is an increase in high-quality linkages within our production data set, allowing our customers to develop better insights. We currently utilize thousands of these aggregated linkage sources created from PDL customer data. Data Union sources supply the majority of emails, phone numbers, birth dates, street addresses, and other personally identifiable information in our dataset. You can learn more about our Data Union below.
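To make the idea of merging several sources into one unified dataset more concrete, here is a minimal sketch in Python. It is not PDL's actual pipeline: the function names, the field names, the partner labels, and the choice of a normalized email as the linking key are all assumptions made for illustration. It only shows the general shape of combining partial records from different partners into a single record that no longer carries its source.

```python
from collections import defaultdict


def normalize_email(email: str) -> str:
    """Lowercase and trim an email so the same address matches across sources."""
    return email.strip().lower()


def merge_sources(sources: dict[str, list[dict]]) -> dict[str, dict]:
    """Merge records from several sources into one record per normalized email.

    Hypothetical helper: the source name is used only to iterate in a stable
    order and is never stored on the merged record, a crude stand-in for
    removing provenance before a more general merge.
    """
    merged: dict[str, dict] = defaultdict(dict)
    for source_name in sorted(sources):
        for record in sources[source_name]:
            email = record.get("email")
            if not email:
                continue  # no linking key, so this record cannot be merged here
            key = normalize_email(email)
            for field, value in record.items():
                if field == "email" or value in (None, ""):
                    continue
                # First non-empty value wins; later sources only fill gaps.
                merged[key].setdefault(field, value)
    return dict(merged)


# Illustrative input: three made-up partners, each holding a different slice.
sources = {
    "partner_a": [{"email": "Ada@Example.com ", "job_title": "Engineer"}],
    "partner_b": [{"email": "ada@example.com", "school": "Example University"}],
    "partner_c": [{"email": "ada@example.com", "phone": "+1-555-0100", "job_title": ""}],
}
unified = merge_sources(sources)
```

In this sketch the three partial records collapse into one entry keyed by the normalized email. A real entity-resolution process would need fuzzier matching and conflict handling than "first non-empty value wins".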
What is our Data Union?
In simplest terms, the PDL Data Union is all the data shared with us by customers who choose to opt in. Many of our customers share their data with us, allowing us to enhance and expand the data and unique linkages we offer to all our customers. We have no real-time reliance on any data union source, so we aren’t dependent on any single source to keep specific fields in our dataset up to date. This means that the amount of data available to the PDL Data Union can only grow as new customers join, increasing the value every customer can derive from our dataset.
A few key stats from our Data Union:
We ingest 45 million new records from our data union every month.
We've ingested 1,000 or more records from over 1,300 data union customers.
The size and structure of data union members vary, but in general, they are companies with fewer than 1,000 employees that occupy one of the following key verticals:
Sales & Marketing Tech
Real Estate Tech
Do we pay for data?
Currently, People Data Labs does not simply purchase data from any external sources or vendors; all our partnerships are mutualistic. We make two key guarantees:
Reputable Sources: We receive data only from reputable vendors who have sourced it compliantly.
Mutualistic and Perpetual: We provide mutual value to our data union partners so that our relationships continue. We own the data from our data union outright. This allows us to guarantee that our data does not degrade.
How do we handle privacy?
Data privacy is an increasingly hot-button issue across the digital landscape, and PDL has always put customer privacy at the forefront of our thinking about how we handle, manage, and use both open-source data and data acquired through our data union. We also expect all of our data union customers to be in full compliance with current data privacy laws and regulations, including the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), ensuring that any customers who utilize our production dataset can be confident that the data they access is likewise compliant.
All data union members must certify that:
They've provided any necessary notices to the individuals whose data they're sharing with PDL.
They've obtained any required consents concerning the collection, use, processing, transfer, and disclosure from the individuals whose data they're sharing with PDL.
PDL regularly conducts comprehensive source audits to ensure that data sources are compliant in addition to their existing certifications. We review each source's compliance on a continual basis. We also honor opt-outs and information requests globally, and these requests are built into our engineering workflows. We work with many public enterprise privacy teams, and are constantly updating our policies to adhere to the latest data privacy policies.
Our goal is to arm our clients with the fresh, reliable, high-quality data they need to innovate, build, and deliver better results for their end-users. To learn more about how we think about data, you can continue to part 2 of this series, which provides more detail about our data build process, or set up a demo.