Advances in Ex-Post Harmonisation using Graph Representations of Cross-Taxonomy Transformations

Date

October 12, 2023

Format

25min talk + 5min Q&A

Presenter

Cynthia A Huang

Venue

NUMBAT group meeting, Dept. Econometrics and Business Statistics, Monash University

Abstract

Ex-post harmonisation refers to the transformation and merging of related but distinct datasets into a single analysis-ready dataset. The transformation of data between taxonomies, i.e. cross-taxonomy transformation, requires both prior domain knowledge to choose or design appropriate transformation mappings and careful data manipulation. The provenance information associated with each task is often recorded separately or hidden in the idiosyncrasies of coding scripts. I illustrate how graph-based representations of the transformation can unify design, implementation, and provenance documentation considerations and challenges. I show that numeric data aggregation, redistribution and recoding operations can be expressed as edge-weighted bipartite graphs, which I refer to as crossmaps. I discuss the advantages of the crossmap approach over matrix-operation and imperative for-loop implementations of cross-taxonomy transformations. These include standard methods for validating both transformation logic and transformed data using graph properties rather than ad-hoc inspection of the data pipeline, and direct conversion of the transformation (i.e. the crossmap) into multiple provenance documentation formats (e.g. summary tables and multi-layer graph visualisations). In this talk, I will also discuss implementing the crossmap structure in the R package xmap, and ongoing work on defining imputation metrics, analogous to missing data counts, for ex-post harmonised data.
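As a rough illustration of the idea in the abstract, the sketch below stores a cross-taxonomy mapping as a list of weighted edges between source and target codes, then uses it to recode, aggregate and redistribute numeric values. It is written in plain Python and is not the xmap package's API. The category codes, the function names `is_valid_crossmap` and `apply_crossmap`, and the validity condition (outgoing weights from each source code sum to one, so totals are preserved) are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical crossmap: each edge is (source_code, target_code, weight).
crossmap = [
    ("A1", "X", 1.0),   # recoding: A1 is simply renamed to X
    ("A2", "Y", 1.0),   # aggregation: A2 and A3 both flow into Y
    ("A3", "Y", 1.0),
    ("A4", "Y", 0.25),  # redistribution: A4 is split between Y and Z
    ("A4", "Z", 0.75),
]


def is_valid_crossmap(edges, tol=1e-9):
    """Check an assumed graph property: the weights on the edges leaving
    each source node sum to 1, so no source value is lost or duplicated."""
    totals = defaultdict(float)
    for source, _target, weight in edges:
        totals[source] += weight
    return all(abs(total - 1.0) <= tol for total in totals.values())


def apply_crossmap(edges, source_values):
    """Send each source value along its outgoing edges, scaled by the
    edge weight, and sum whatever arrives at each target node."""
    transformed = defaultdict(float)
    for source, target, weight in edges:
        transformed[target] += weight * source_values[source]
    return dict(transformed)


# Illustrative values indexed by the source taxonomy.
source_values = {"A1": 10.0, "A2": 20.0, "A3": 30.0, "A4": 40.0}

if is_valid_crossmap(crossmap):
    transformed = apply_crossmap(crossmap, source_values)
    print(transformed)
```

Keeping the mapping as data rather than as hard-coded loop logic means the same edge list can be validated before use and later exported as documentation, for example as a summary table of source, target and weight.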

Speaker Bio

Cynthia Huang is a PhD Candidate in the Department of Econometrics and Business Statistics at Monash University. She completed her undergraduate and honours degrees in Economics at the University of Melbourne. Her research focuses on principles and methods for complex data preparation in the social sciences.

Slides