Macros for managing messy data: Handling duplicate study participants, and making fuzzy matches across multiple data sets

Publication Type:

Conference Paper


2011 SAS Global Forum, Las Vegas, NV (2011)


<p>The CSULB Center for Behavioral Research &amp; Services (CBRS) conducts behavioral research among hard-to-reach populations in Southern California. During data collection, CBRS staff members attempt to identify each research participant and link the participant to a unified participant record that details all prior contact with the agency. However, incomplete or inconsistent personal information can cause staff members to establish a new ID rather than amend an existing record. These errors are often identified only after substantial research data has been collected. CBRS uses a SAS® macro and hash object to manage participants with multiple IDs without revising raw data sets or requiring analysts to know anything about the IDs established for an individual. Changes can be rolled back in the rare case of a misidentified duplicate (e.g., twins or a Sr./Jr. relationship misidentified as the same individual). CBRS researchers also use a macro to fuzzy-match one data set against any number of additional data sets by ID and date.</p>
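<p>The two ideas in the abstract can be illustrated with a minimal sketch. It is written in Python rather than SAS, with a dictionary standing in for the SAS hash object, and it is not the authors' macro. The participant IDs, field names, the 30-day window, and the helper names (canonical_id, nearest_within, fuzzy_match) are all assumptions made for illustration.</p>

```python
from datetime import date

# Hypothetical crosswalk: duplicate ID -> canonical ID (all IDs are made up).
# The raw data sets are never edited; only this lookup changes.
crosswalk = {"P0417": "P0102", "P0533": "P0102"}


def canonical_id(participant_id, crosswalk):
    """Return the canonical ID, or the ID itself when no duplicate is recorded."""
    return crosswalk.get(participant_id, participant_id)


def nearest_within(target, candidates, max_days):
    """Return the candidate row closest in date to target within max_days, else None."""
    best = None
    for row in candidates:
        gap = abs((row["date"] - target).days)
        if gap <= max_days and (best is None or gap < best[0]):
            best = (gap, row)
    return best[1] if best else None


def fuzzy_match(base_rows, other_rows, crosswalk, max_days=30):
    """Pair each base row with the nearest-dated other row sharing its canonical ID."""
    by_id = {}
    for row in other_rows:
        by_id.setdefault(canonical_id(row["id"], crosswalk), []).append(row)
    matched = []
    for row in base_rows:
        cid = canonical_id(row["id"], crosswalk)
        hit = nearest_within(row["date"], by_id.get(cid, []), max_days)
        matched.append((cid, row, hit))
    return matched


# Made-up example data: one interview was recorded under a duplicate ID.
interviews = [
    {"id": "P0417", "date": date(2010, 3, 1)},
    {"id": "P0900", "date": date(2010, 5, 1)},
]
lab_results = [
    {"id": "P0102", "date": date(2010, 3, 9), "result": "neg"},
    {"id": "P0102", "date": date(2010, 9, 1), "result": "pos"},
]

matches = fuzzy_match(interviews, lab_results, crosswalk)

# Rolling back a misidentified duplicate means dropping its crosswalk entry.
rolled_back = {k: v for k, v in crosswalk.items() if k != "P0417"}
matches_after_rollback = fuzzy_match(interviews, lab_results, rolled_back)
```

<p>Because the mapping lives in a separate lookup, removing one entry is enough to undo a merge, and analysts only ever see the canonical ID.</p>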