Enriching large-scale Twitter data: a case study

Ms Kim Doyle1, Dr Daniel Russo-Batterham2

1The University of Melbourne, Melbourne, Australia, 2The University of Melbourne, Melbourne, Australia


With around 140 million daily users and 500 million tweets per day, Twitter has amassed a hugely significant cultural archive since its creation in 2006. For researchers, this wealth of social data can be used to study many important topics, from predicting elections to tracking the spread of illness or even making mental health interventions. In recent years, a rich interdisciplinary literature has flourished by applying computational methods to this resource. Yet as the archive expands, so do the technical challenges of effectively navigating and manipulating it.

This presentation relates the experiences of two Research Data Specialists collaborating with researchers in psychological sciences on large-scale capture and analysis of Twitter data. The collaboration focuses on creating a processing pipeline that can curate and geolocate tweets from a public dataset of around 6 billion tweets collected over eight years. Enriched with location information, the processed data will be housed in a new database that supports advanced queries. By facilitating access to the data, the database will permit natural language processing on a scale that has previously been difficult. Demographic and socio-economic data will be matched with the Twitter data to open original avenues of research.


From 2011 to 2013, Dr Daniel Russo-Batterham worked as a researcher at the Centre d’Études Supérieures de la Renaissance in Tours, France, while completing a Master of Music. Since completing his PhD in 2018, Daniel has worked on Digital Humanities projects across Australia and abroad. He has a background in Python, data wrangling, relational database design, web scraping, quantitative methods, natural language processing, and a broad range of approaches to visualisation. He currently works at the Melbourne Data Analytics Platform.

Kim Doyle is a Research Data Specialist at the Melbourne Data Analytics Platform (MDAP) and a PhD in Media and Communications at the University of Melbourne. Previously, she taught natural language processing and data mining to researchers at the University of Melbourne’s Research Platform Services for a number of years. Her research interests include political communication, social media and computational social science.

