Recruiting Part 1 [In-Depth Tutorial]
June 8, 2021
Introduction
Welcome to the third installment in our Use Case Series! In this tutorial, we’ll take a look at a recruiting use case. We are very familiar with this use case at PDL, and the processes we'll explore make up a key component of our own internal hiring processes. Specifically, in this tutorial we'll see how we can use the Person Search API to source and rank candidates for a particular job role.
This tutorial will be broken up into 2 parts. In part 1, we'll provide some background on the Person Search API, explore a framework for building your own queries, and take a look at the mechanics of using the API. In part 2, we'll take this knowledge and use it to build and optimize a query for our recruiting use case.
This tutorial will be a little different from our previous ones in that we will dive deep into the Person Search API, focusing specifically on the process of constructing and refining a search query. In the previous tutorials, we simply gave you the final query and explained why it was structured that way; this tutorial will pull back the curtain to show you the thought process for coming up with these queries in the first place.
Prerequisites
If you'd like to follow along with this tutorial and run the code yourself, then the only thing you will need is a Free PDL API Key and our Recruiting Tutorial Colab Notebook. Please check out our Self-Signup API Quickstart Guide for instructions on how to sign up for your own API key.
This tutorial will also cover some relatively advanced features of our Person Search API, so it will help to read through some of the previous tutorials for context (such as our B2B Audience Generation and Investment Research), though it is not strictly required. As always, we'll include links to relevant content wherever possible.
Let's get started!
The Recruiting Scenario
Today, we'll imagine that we are an up-and-coming tech company in the Virtual/Augmented Reality (VR/AR) space: Virtually Reality, Inc. We recently raised a large round of funding and one of our first goals is to build out our engineering department. Suppose we have the following job role that we would like to hire for:
Software Engineer, Full Stack Web (Remote - US)
Department: Engineering
Location: San Francisco, CA (Remote)
Virtually Reality's Engineering Team is seeking a Web Front-End Engineer to build intuitive, scalable, and easy-to-use web applications to support the development of our Virtually Real (TM) platform used to deploy our massive open-world VR social experiences.
Responsibilities
Design and implement full-stack web features in Java and JavaScript.
Qualifications
You have a BS, MS, or PhD in Computer Science or a related technical field, or equivalent experience.
You have 3+ years of experience working as a software engineer building web applications with JavaScript and web frameworks in a professional setting, particularly using React.
You have experience with server-side technologies, such as Java.
Bonus Points if:
You have built web applications involving manipulation of geospatial, customer management, or analytics data.
You have worked on websites that are cross-device and browser compatible.
You have experience in the VR / AR space.
In this scenario, we have several openings for this position, so we want to find multiple highly-qualified candidates to fill this role. We’re aiming to run an automated email campaign to ~2,000 candidates.
Now that we have our task defined, let's take a look at how we can use the PDL tools and APIs to build out a pool of candidates.
Setup
To start, let's set up our API key, notebook, and python environment.
Getting a Free PDL API Key
If you haven't already done so, sign up for a Free PDL API Key by filling out the form on the signup page. For a detailed walkthrough of this process, take a look at our Self-Signup API Quickstart Guide. Once you have your API key, you will enter it into the cell below in the next steps.
Setting up Colab
Next, save a copy of the Colab notebook by clicking "File > Save a copy in Drive" from the toolbar at the top (you'll need to sign in to a Google account). This will create a copy of this notebook for your own use and let you run and customize the code as well.
Setting up our Python environment
For the last piece of setup, plug your API key into the form below and run the next 3 cells. Doing this will store your API key, import some python libraries, and define some custom helper functions for us to use as well.
Reminder: you can run each cell by hitting "Shift + Enter" or clicking the play icon in the top left corner of the cell when hovering with your mouse.
Part 1 - Warmup Query
Now that we've gotten everything set up, let's write our first query! To get us started, we'll build a simple warmup query and use it to send and receive a basic Person Search API request. While a warmup like this can sometimes feel like overkill, it is generally a good first step to ensure that everything is set up correctly before building more complex queries. In this warmup exercise, we will:
Build and send a simple query
Check the total number of available matches
Pull the top 10 match results
For our scenario, we'll just search for software engineers living in the US. While this is probably too simple to be useful for sourcing candidates for our job description, we can use it to illustrate the key concepts of the Person Search API.
Quick Background on the Person Search API
At this point, it would be good to quickly skim through the Person Search API documentation. In particular, look at the high-level descriptions and some of the examples.
A short summary of the basics is that the Person Search API is PDL's tool for extracting targeted segments of person profiles from our full person profile dataset. The process for using the Person Search API is:
Build a search query
Send the query
Retrieve the matched profile results
Query Building Framework
The first step of using the Person Search API is to build our query. There are many ways of doing this, but for anyone less familiar with this process, here is a helpful framework for building simple queries:
Definition: Specify which person profiles you want to find as precisely as possible
Fields: Look at the person schema to see what profile fields are available to build queries with
Formatting: Check the person manual to see how to format the values for each field
Criteria: Define query criteria for the fields you've selected
Logic: Construct the logic to join the criteria together into a full query
Syntax: Write the query using one of the supported Person Search API syntaxes
Let's work through these steps to see how we can construct our first query for the Person Search API.
Step 1: Definition
Specify which person profiles you want to find as precisely as possible.
The goal of this first step is to come up with a precise definition (using plain English) of the profiles we are interested in finding.
For this warmup query, we know we have the goal of finding software engineers who live in the US. To be even more precise, however, we'll say we are looking for people who have the job title "software engineer" and live in the US.
Step 2: Fields (Person Schema)
Look at the person schema documentation to see what profile fields are available to build queries with.
Our next step is to reference the person schema, which is a comprehensive listing of all the fields that are available in a person profile (along with a brief description and example for each field). This document is important because we can only use the fields shown in the Person Schema to construct our queries.
Based on our query definition in step 1, the two fields that seem most relevant from the person schema are:
job_title - "a person's current job title"
location_country - "the current country of the person"
Step 3: Formatting (Person Manual)
Check the person manual documentation to see how to format the values for each field.
The third step in our query building framework is to check the person manual. This document is very similar to the person schema. However, it describes the exact formatting requirements of the value for each field. Just as importantly, it also specifies which fields are canonicalized along with the appropriate link to the list of canonical values.
Canonical Fields: Canonicalized fields are those for which PDL has explicitly listed out all the possible values the field can take in our dataset. Not all fields are canonicalized, but several common ones are. One example is the industry field, which can only take on the specific values listed in the canonical industries list.
Checking the person manual for each of our selected fields in step 2, we find:
job_title
The person manual indicates that this field is based on the experience.title.name field, which is a free text field. This means that the only formatting is the standardized format applied to all data in the PDL dataset (i.e. lowercase with no leading/trailing whitespace).
location_country
The person manual indicates that this field has been canonicalized, meaning that PDL has specifically listed out all the possible values for this field. The person manual provides a link to the list of canonical countries here: List of countries
Step 4: Criteria (Field Mapping)
Define query criteria for the fields you've selected.
Now that we know what fields we are interested in (Step 2) and how to format the values for them (Step 3), we can build the criteria for each field. For example, "industry must exactly be biotechnology" or "education.gpa must be greater than 3.7". This step is essentially about relating each field to a specific set of values.
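To make these criteria concrete, here is how the two examples above could be written as Elasticsearch clauses (we'll cover query syntax properly in Step 6, so treat this as a preview sketch):

# Preview of Step 6 syntax: the two example criteria above as Elasticsearch
# clauses. industry is a canonical keyword field, so a term (exact) clause
# fits; education.gpa is numeric, so a range clause fits.
industry_criterion = {"term": {"industry": "biotechnology"}}
gpa_criterion = {"range": {"education.gpa": {"gt": 3.7}}}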
Let's look at each field one at a time:
job_title
We are looking for people with the exact job title "software engineer" so our criteria for this field would be:
"job_title.keyword must exactly match software engineer"
This translates to: a matching person must be currently employed with the job title "software engineer" (based on the field definition for job_title from Step 2).
Note:
We have appended .keyword to the field job_title because we are doing an exact match for a text-based field. See the next section Criteria for Text Fields - Term vs Full-Text Matches below for details on this.
We wrote software engineer using the standardized formatting for free text fields.
location_country
We only want profiles for people living in the US, so we will let our criteria be:
location_country must exactly match united states
This translates to: a matching person must be currently living in the US (based on the field definition for location_country from Step 2).
Criteria for Text Fields - Term vs Full-Text Matches
While most query criteria are straightforward, special consideration must be made when specifying criteria for text-based fields (i.e. fields with any field type related to "String" in the person schema). The reason is that for text-based fields we have the option of specifying criteria as either an exact string (term-based) match or a less-strict (full-text) string match. The definitions of "term-based" and "full-text" matching come from Elasticsearch and are characterized as follows:
Term-Based Matching:
Text is not analyzed, so the character sequence must match exactly (including whitespaces, capitalization, etc...)
For example: "hello world" will not match any of "hello World", "helloworld", "world hello", "hello world!" when using term-based matching
Full-Text Matching:
Text is analyzed, meaning that things like dates/numbers are parsed, individual words are split out and some amount of spelling variance is tolerated
For example: "hello world" will match "Hello World", "oh hello world", "hello world!" using full-text matching
However, full-text matching would not match "hello world" to strings where the terms are out of order such as "world hello" or "hello oh world"
For more details see: Term-Based vs Full-Text Matching
This distinction between Term-based and Full-Text matching is important because text-based fields come in one of two types: keyword and text.
Apart from very specific cases, keyword type fields are used for Term-based (exact) matching and text type fields are used for Full-Text matching.
In our two criteria above, we require exact matches (so term-based queries), which means that our fields must be the keyword type.
We can check the exact type for each field by looking at the Full Field Mapping for the Person Schema:
"job_title": {
"type": "text",
"index": true,
"doc_values": false,
"fields": {
"keyword": {
"type": "keyword",
"doc_values": true,
"ignore_above": 256
}
}
},
In this excerpt from the full field mapping, we see that job_title has the type text (which supports full-text matching), but also that it has a subfield keyword with the type keyword, which should be used for term matching.
So in other words, if we wanted to do an exact term match, our criteria would be: job_title.keyword must exactly match software engineer
But if we wanted a full-text match (which is less-strict), our criteria would be: job_title should be similar to software engineer
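Expressed as Elasticsearch clauses (the syntax we'll introduce in Step 6), these two flavors of criteria would look something like this sketch:

# Term-based (exact) match: query the keyword subfield with a term clause
term_criterion = {"term": {"job_title.keyword": "software engineer"}}

# Full-text (flexible) match: query the text field with a match clause
fulltext_criterion = {"match": {"job_title": "software engineer"}}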
As a counter-example, let's look at the location_country field again in the Full Field Mapping for the Person Schema:
"location_country": {
"type": "keyword",
"index": true,
"doc_values": true
},
Here we see that location_country is the keyword type and doesn't contain any subfields. This means that we can only do term-based matches to match exact character sequences. This makes sense because location_country is a canonicalized field meaning all the possible values for this field have been predetermined and listed out.
So to summarize this section:
Criteria for text-based fields can be either Term-based (exact) or Full-Text (flexible) matching
Text fields come in two types: keyword and text, where keyword is suited for Term-based matching and text is suited for Full-Text matching
We can check the full field mapping to see what the exact types are for fields in the person schema
Step 5: Logic
Construct the logic to join the criteria together into a full query.
Now that we have criteria defined for each of the fields, we can combine them together to fully represent our query. This is quite straightforward for our simple query since we want both criteria to be true.
Using basic boolean logic, our query becomes:
job_title.keyword must exactly match software engineer AND location_country must exactly match united states
Step 6: Syntax
Write the query using one of the supported Person Search API syntaxes.
The last step in building our query is to write it in the correct syntax. The Person Search API supports two syntaxes: SQL and Elasticsearch.
SQL is generally easier to get started with since we can more intuitively translate our logic into the syntax, whereas Elasticsearch syntax generally has a bit of a learning curve. That being said, Elasticsearch is the native query language for the PDL database, so it is both more flexible and better supported. Because of these reasons, we recommend using Elasticsearch syntax whenever possible.
For this warmup, we'll demonstrate both syntaxes (however, we will focus on Elasticsearch in the remaining sections of this tutorial).
SQL Syntax
SQL syntax is based on tabular data (like a spreadsheet with multiple sheets) where you query against columns from a table. The Person Search API supports a subset of full SQL syntax (i.e. anything supported by the Elasticsearch SQL Translate API). For the Person Search API, the table is always person and the columns are the field names. Here is what our warmup query looks like in SQL:
query_sql = f"SELECT * FROM person "\
f"WHERE job_title.keyword = 'software engineer' "\
f"AND location_country = 'united states' "
The first line of the query uses the SELECT...FROM clause to specify what table we want to query and what information we want returned. In this case, we are asking for all the fields (given by the wildcard * token) from the table person.
The last two lines of this query use the WHERE clause to impose our criteria, which are joined by the AND keyword. Hopefully, it is clear how easily our logic from Step 5 translates into SQL syntax. The only other point to note in this example is that SQL syntax requires strings to be written using single quotes (e.g. 'software engineer' or 'united states').
Elasticsearch Query Syntax
Elasticsearch query syntax is slightly more complex than SQL. Queries are defined using nested JSON-like objects (i.e. dicts in python). The top level object will always be query and the next level object will almost always be bool. The deepest nested objects will be query criteria (aka what we defined in Step 4 above).
Within a bool object, there can be the following subquery sections (though all are optional):
must: all the query clauses in this section must match (i.e. performs an AND operation)
must_not: all of these query clauses must not match (i.e. performs an AND NOT operation)
should: at least one of these query clauses must match (i.e. performs an OR operation)
filter: all clauses must match, similar to must (this has limited use within our Person Search API)
Each subquery section can contain a list of query criteria (the Person Search API supports up to 100 per section). Each criterion is built using a particular type of query keyword, the most commonly used of which are:
term: used for exactly matching a field to a single keyword value
terms: used for exactly matching a field to a list of keywords (satisfied if any element in the list matches)
match: used for full-text string matching
exists: used for requiring a field to be present (i.e. not null or empty)
range: used for matching a field against a range of numeric values
For the full list of query types supported by the Person Search API, see the Person Search API documentation.
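To see how these sections and query types fit together, here is a hypothetical sketch (unrelated to our warmup query) combining must and must_not sections with terms, exists, and range clauses; inferred_years_experience is a numeric field we're assuming from the person schema for illustration:

# Hypothetical sketch only: US or Canadian profiles with at least one email
# on file and 3+ inferred years of experience, excluding current recruiters.
# inferred_years_experience is an assumed person schema field.
example_query = {
    "query": {
        "bool": {
            "must": [
                {"terms": {"location_country": ["united states", "canada"]}},
                {"exists": {"field": "emails"}},
                {"range": {"inferred_years_experience": {"gte": 3}}}
            ],
            "must_not": [
                {"term": {"job_title.keyword": "recruiter"}}
            ]
        }
    }
}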
Using these rules, here is what our query from Step 5 looks like in Elasticsearch syntax:
query_es = {
    "query": {
        "bool": {
            "must": [
                {"term": {"job_title.keyword": "software engineer"}},
                {"term": {"location_country": "united states"}}
            ]
        }
    }
}
Here we can see the various syntax rules in action:
The query is a collection of nested dicts, with the topmost object being the query object followed by a bool object.
The bool object contains a must section and the other 3 optional sections (should, must_not, filter) are left out
The must section contains a list of query criteria which joins these criteria together using the logical AND operation
We use the term query on the job_title.keyword field to specify a term-based search for the exact character sequence software engineer
Similarly, we use a term query on the location_country field to specify a search for the exact character sequence united states (which is a canonicalized field value)
One way of reading this query is as follows: "job_title must exactly match the term software engineer AND location_country must exactly match the term united states" (which is what we found in Step 5 as well).
Query Sending and Retrieving Results
After building our warmup query using the Query Building Framework, our remaining steps are to send our query to the Person Search API, count the total number of matches, and retrieve the top 10 profile matches.
Sending a Request to the Person Search API
Since the Person Search API is a standard web-based API, it is quite straightforward to send and receive data from the API endpoint.
An API request for the Person Search API will consist of a header and a body. The header contains information about the format of the body, while the body contains the following parameters:
scroll_token: an offset key used for paginating through batches of results (a token value is returned by the Search API in every response).
size: the number of profiles to return in a single response (i.e. the batch size), which can be at most 100
pretty: whether the response object should be formatted to be human-readable or not
titlecase: whether profiles should be formatted using titlecase or lowercase
either query or sql: the actual search query, where the query keyword is used if the query is specified using Elasticsearch syntax and sql is used when using SQL syntax
Additionally, the API key can be passed in either the header (using the X-api-key parameter) or the body (using the api_key parameter).
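As a quick illustration of the body option, here is a minimal sketch of a raw request that passes the API key via the api_key parameter (the helper functions below use the header approach instead):

import requests

# Minimal sketch: API key passed in the request body via api_key rather
# than in the X-api-key header
params = {
    "api_key": API_KEY,
    "size": 1,
    "sql": "SELECT * FROM person WHERE location_country = 'united states'"
}
response = requests.get("https://api.peopledatalabs.com/v5/person/search",
                        params=params)
print(response.json().get("total"))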
The Person Search API documentation contains numerous examples for how to send queries to the Person Search API. In the Colab notebook, we defined some helper functions based on these examples to streamline this process (see the Helper Function Definitions [code] section). The main helper function we will use in this tutorial for sending requests is send_person_search_request() which is defined as follows:
def send_person_search_request(query, use_sql, size=1, scroll_token=None):
    PDL_URL = "https://api.peopledatalabs.com/v5/person/search"
    REQUEST_HEADER = {
        'Content-Type': "application/json",
        'X-api-key': API_KEY
    }
    REQUEST_PARAMS = {
        "scroll_token": scroll_token,
        "size": size,
        "pretty": True
    }
    if not use_sql:
        REQUEST_PARAMS['query'] = json.dumps(query)
    else:
        REQUEST_PARAMS['sql'] = query

    def send_search_request():
        return requests.get(PDL_URL, params=REQUEST_PARAMS, headers=REQUEST_HEADER)

    response = send_request_handling_rate_limiting(send_search_request)
    success = check_errors(response)
    return success, response.json()
Here we can see that this function simply constructs a header (containing the API key) and a body, and sends them to the correct endpoint (i.e. PDL_URL). It also uses a couple of other helper functions: one to ensure requests are sent within the rate limits for our API key (send_request_handling_rate_limiting()), and another that does some error checking on the response sent back by the Person Search API endpoint (check_errors()).
Retrieving Search API Results
If there are no errors in the search request and at least one profile match is successfully found, the response sent back by the Person Search API will take the form:
{
    "status": 200,
    "data": [
        {
            "id": "qEnOZ5Oh0poWnQ1luFBfVw_0000",
            "full_name": "sean thorne",
            ...
        },
        ...
    ],
    "scroll_token": "1117$12.176522",
    "total": 99
}
As seen, the successful response object contains the following fields:
status: this will be 200 for a successful response
data: this is a list of profile matches (the number of profiles is determined by the size parameter in the API request)
scroll_token: a token returned by the Search API used for scrolling through the full set of match results (this value can be passed into subsequent API requests to continue retrieving subsequent results)
total: this indicates the total number of matches found in the PDL database for the query in the API request
In order to retrieve all the results for a query, we have to retrieve the results batch-by-batch. This is done by sending sequential API requests with the same query in the body, but updating the scroll_token field passed with each subsequent request. The size parameter determines the number of matches returned with each response (i.e. the batch size). This is a relatively straightforward process, which is illustrated in the helper function retrieve_search_api_matches():
def retrieve_search_api_matches(send_request_func, num_desired_matches):
    # Initializations
    all_successful = False
    total_available_matches = 0
    matches = []
    max_batch_size = 100
    batch_size = min(max_batch_size, num_desired_matches)

    # Send the initial request
    success, response = send_request_func(size=batch_size, scroll_token=None)
    all_successful = success
    if not success:
        print(f"Error from Search API: \n{response}\n\nCould not find matches")
        return all_successful, total_available_matches, matches
    matches += response['data']
    total_available_matches = response['total']
    scroll_token = response['scroll_token']

    # Check total number of available matches
    print(f"Total Number of Matches: {total_available_matches}")
    if total_available_matches < num_desired_matches:
        print(f"Note: Person Search API found "
              f"[{total_available_matches}] "
              f"total matches, which is less than the desired number of matches "
              f"[{num_desired_matches}]")
        num_desired_matches = total_available_matches

    # Retrieve all matches batch-by-batch
    while scroll_token and len(matches) < num_desired_matches:
        batch_size = min(max_batch_size, num_desired_matches - len(matches))
        success, response = send_request_func(size=batch_size,
                                              scroll_token=scroll_token)
        all_successful = all_successful and success
        matches += response['data']
        scroll_token = response['scroll_token']
    return all_successful, total_available_matches, matches
The key point to note in this function is the while loop at the end, which repeatedly sends the same API request while updating the scroll_token using the value returned from each API response.
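Although we won't need it for the warmup, here is a sketch of how this helper could be invoked to pull a larger result set (functools.partial binds our query and syntax flag into the request function, leaving only size and scroll_token for retrieve_search_api_matches() to supply):

from functools import partial

# Bind the query and syntax flag, leaving size and scroll_token free
send_request_func = partial(send_person_search_request, query_es, False)

# Pull up to 250 matches, at most 100 per batch
success, total, matches = retrieve_search_api_matches(send_request_func, 250)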
For our warmup exercise, we are only concerned with retrieving the top 10 results, which can be done in a single batch. As a result, we'll just use the send_person_search_request() function directly to retrieve a single batch of results:
# Send search request
# NOTE: make sure you have run the cells in Step 6 above!
use_sql = False # Use this to switch between using sql and es syntax
query = query_es if not use_sql else query_sql # select the desired query syntax
size = 10 # number of results to pull down (up to 100)
success, results = send_person_search_request(query, use_sql, size)
print(f"Person Search Request Successful: {success}")
Person Search Request Successful: True
We can see the total number of matches found in the PDL database for our simple warmup query as well:
# Count total number of results:
print(f"Total number of matches in PDL database: {results['total']}")
Total number of matches in PDL database: 290346
Because we specified size = 10 in our API request, the response contains the top 10 profile matches for our query. Let's print out some information to see how these profiles relate to our search query:
# Preview results:
# Number of results pulled down:
print(f"Number of records retrieved: {len(results['data'])}")
# Print the job title and locations of matches:
for idx, profile in enumerate(results['data']):
    print(f"{idx+1}) "
          f"Job Title: {profile['job_title']} - "
          f"Location: {profile['location_country']}")
Number of records retrieved: 10
1) Job Title: software engineer - Location: united states
2) Job Title: software engineer - Location: united states
3) Job Title: software engineer - Location: united states
4) Job Title: software engineer - Location: united states
5) Job Title: software engineer - Location: united states
6) Job Title: software engineer - Location: united states
7) Job Title: software engineer - Location: united states
8) Job Title: software engineer - Location: united states
9) Job Title: software engineer - Location: united states
10) Job Title: software engineer - Location: united states
Our simple warmup query was to find profiles of people currently employed with the title software engineer living in the united states. As we can see above, our results correctly match our query.
The last thing we'll do for our warmup query is save the profile matches to a csv, which is one way we can export these results to run an automated email campaign for recruiting.
# Save profiles to csv (and download):
# Note: You may need to enable browser permissions to download files
filename = 'candidate_profiles_warmup.csv'
save_profiles_to_csv(results['data'], filename)
files.download(filename)
Wrote 10 lines to: 'candidate_profiles_warmup.csv'
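In case you're wondering how a helper like save_profiles_to_csv() might work under the hood, here is a rough sketch (the notebook's actual helper may differ, and the selected fields here are just an assumption for illustration):

import csv

def save_profiles_to_csv(profiles, filename,
                         fields=("id", "full_name", "job_title", "location_country")):
    # Write a header row plus one row per profile, keeping only the
    # selected fields (chosen here for illustration)
    with open(filename, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(fields))
        writer.writeheader()
        for profile in profiles:
            writer.writerow({field: profile.get(field) for field in fields})
    print(f"Wrote {len(profiles)} lines to: '{filename}'")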
With that, we have successfully built and run a query using the Person Search API, and have the top 10 results exported to csv!
Recap of Part 1 - Warmup Query
Part 1 of this recruiting tutorial was focused on building up the necessary background to understand the process of building and optimizing queries for the Person Search API. We first familiarized ourselves with the Person Search API at a high level and identified the 3 steps for using the API (building a query, sending the query and retrieving the results). Next, we looked at a useful framework we could use for building out search queries and explored the necessary concepts and documentation along the way. Finally, we took our query and looked at the mechanics of sending and receiving data using the Person Search API. We successfully used this information to complete a basic warmup exercise where we found 10 software engineers currently living in the US and exported their profiles to csv.
We are now ready to take a look specifically at how we can use the Person Search API to source candidates for our target job description as part of our recruiting scenario. At the end, you will have truly expert-level knowledge on how to use the Person Search API, so please join us for Part 2 of the Recruiting Tutorial!