Advances in Ex-Post Harmonisation using Graph Representations of Cross-Taxonomy Transformations

Date

October 10, 2023

Format

15min talk + 5min Q&A (virtual)

Presenter

Cynthia A Huang

Venue

Submitted Talk, Monash EBS PhD Contest, IDWSDS

Abstract

Ex-post harmonisation refers to the transformation and merging of related but distinct datasets into a single analysis-ready dataset. The transformation of categorised numeric data between taxonomies, i.e. cross-taxonomy transformation, requires both prior domain knowledge to choose or design appropriate transformation mappings and careful data manipulation. The provenance information associated with each task is often recorded separately or hidden in the idiosyncrasies of coding scripts. I illustrate how graph-based representations of the transformation can unify the considerations and challenges of design, implementation, and provenance documentation. I show that aggregation, redistribution and recoding operations can be expressed as edge-weighted bipartite graphs, which I refer to as Crossmaps. I discuss the advantages of the Crossmap approach over matrix-operation and imperative for-loop implementations of cross-taxonomy transformations. Crossmaps enable the validation of both transformation logic and transformed data using graph properties rather than ad-hoc inspection of the data pipeline. They can also be converted from a transformation object (i.e. the graph edge list) into multiple provenance documentation formats (e.g. summary tables and multi-layer graph visualisations). Finally, I discuss how the crossmap format enables novel exploration of the statistical properties of ex-post harmonisation.
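To make the idea of an edge-weighted bipartite graph concrete, here is a minimal Python sketch of a transformation stored as an edge list. It is an illustration only, not the implementation presented in the talk: the category labels, the function names `is_valid_crossmap` and `apply_crossmap`, and the validity condition (non-negative weights that sum to one for each source category, so that totals are preserved) are all assumptions made for this example.

```python
from collections import defaultdict
import math

# Hypothetical edge list: (source_category, target_category, weight).
# The weight is the share of the source value sent to the target.
crossmap = [
    ("A1", "B1", 1.0),   # recoding: A1 is simply relabelled as B1
    ("A2", "B2", 1.0),   # aggregation: A2 and A3 both flow wholly into B2
    ("A3", "B2", 1.0),
    ("A4", "B3", 0.25),  # redistribution: A4 is split between B3 and B4
    ("A4", "B4", 0.75),
]


def is_valid_crossmap(edges, tol=1e-9):
    """Check a graph property instead of inspecting transformed data.

    Assumed condition: weights are non-negative and the outgoing
    weights of every source category sum to one.
    """
    totals = defaultdict(float)
    for source, _target, weight in edges:
        if weight < 0:
            return False
        totals[source] += weight
    return all(math.isclose(total, 1.0, abs_tol=tol) for total in totals.values())


def apply_crossmap(edges, source_values):
    """Transform values indexed by source category into target categories."""
    known_sources = {source for source, _target, _weight in edges}
    missing = set(source_values) - known_sources
    if missing:
        raise KeyError(f"no outgoing edges for source categories: {sorted(missing)}")
    target_values = defaultdict(float)
    for source, target, weight in edges:
        target_values[target] += weight * source_values.get(source, 0.0)
    return dict(target_values)


source_values = {"A1": 10.0, "A2": 20.0, "A3": 30.0, "A4": 40.0}
target_values = apply_crossmap(crossmap, source_values)
```

Under these assumptions, the same edge list serves as the transformation logic, the object that is validated, and the raw material for a summary table, which is the kind of unification the abstract describes. A set of edges whose weights for one source do not sum to one would fail `is_valid_crossmap` before any data is touched.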

 

About

Presented for and awarded first place at

2023 Monash EBS Contest, organised jointly with the Caucus for Women in Statistics on the occasion of the International Day of Women in Statistics and Data Science.

Speaker Bio

Cynthia Huang is a PhD Candidate in the Department of Econometrics and Business Statistics at Monash University. She completed her undergraduate and honours degrees in Economics at the University of Melbourne. Her research focuses on principles and methods for complex data preparation in the social sciences.

Slides