Webscraping in FinEco Research: Risks and Opportunities

Date

April 5, 2024

Format

1 hour

Presenter

Cynthia A Huang

Venue

Invited Workshop, Dept. Banking & Finance, Monash University

Workshop Details

Presented at the Banking & Finance Department Retreat, RACV Inverloch, Victoria

Extended Abstract

drafted by Patrick Healy

Cynthia and I will both focus on a high-level discussion of how to use web scraping and text as data in research. The framing is around best practices, potential risks, and what you need to know to manage a research assistant using these methods. Cynthia will present the first hour, focusing on the collection of alternative data via web scraping, and Patrick will present the second hour, focusing on the analysis of unstructured text and the safe use of large language models.

Cynthia is focusing on web scraping.

  • How Cynthia is using web scraping in her own research.
  • An overview of the most important details and ‘best practices’ of using web scraping in research.
  • What is involved in collecting and preparing web-scraped data from different sources and time periods.
  • What missing, biased and corrupted data mean in the context of web scraping, the problems they create for academic research, and how to mitigate these risks.
  • How to check that the data you collect are valid and reliable for your research question before making significant investments, and how to manage a research assistant doing this work.
  • Highlights of some important recently published papers using web-scraped data.

Patrick is focusing on working with unstructured text and large language models.

  • How Patrick uses these tools in his own research, for example for text summarisation, data mining and text classification, survey analysis, producing structured output from unstructured text, and helping to write code.
  • The basics of how these models work and a general overview of the most important details of using text as data in research.
  • Discussion of the difficulty of determining ‘best practices’ for research given how fast the technology is developing, focusing on the practices that have emerged as reasonably safe for lower-level research tasks and how you can practically implement them in your own research.
  • Which tasks you can’t reliably do yet.
  • Framing what bias means in the context of using large language models, both in terms of the bias within the model itself and the bias that can be introduced by choices made by the researcher.
  • How to safely manage a research assistant using these tools.
  • Highlights of important recent research relevant to finance researchers.

Slides