Text and data mining premodern Chinese texts with ctext.org
“Chinese Classic Text Mining and Processing” Series
Dr. Donald Sturgeon
Creator and administrator of the Chinese Text Project (https://ctext.org)
Assistant Professor, Department of Computer Science, Durham University
Venue: Digital Scholarship Lab, G/F, University Library
This hands-on workshop introduces participants to complete text and data mining workflows for materials written in classical Chinese, from digital transcription and annotation of premodern works through to computer-assisted extraction of data from their contents. It consists of four parts:
- Getting started: using the Chinese Text Project (https://ctext.org) crowdsourced editing platform to create and obtain accurate, linked digital transcriptions of premodern Chinese texts.
- Interactive text mining: extracting and visualizing statistical properties and relationships from transcribed texts. Types of analysis to be introduced include pattern matching of words and phrases, identification of text reuse, and identification of patterns of vocabulary usage; visualizations include summarization via interactive networks, charts, and textual heatmaps. Techniques will be demonstrated using classical Chinese materials from ctext.org, but they can all be applied equally to materials from other sources and in other languages.
- Semantic annotation: disambiguating references in texts and explicitly linking them to entities (such as names of people, places, and eras), connecting these references to authority databases, extracting knowledge claims about these entities (such as dates of birth, death, or appointment to a particular bureaucratic office), and contributing them to a crowdsourced knowledge base.
- Interactive data mining: extracting and visualizing data from annotated texts and extracted knowledge claims. This includes querying a knowledge base for particular types of information and summarizing the results. This section will make use of the Chinese Text Project’s Linked Open Data knowledge graph, which contains data on people, places, dates, and many other historical entities covering a period of over 3000 years. A brief introduction to querying using the industry-standard SPARQL language will be provided; this language is also used by many other systems containing relevant data, in particular those of a wide variety of institutions in the GLAM [Galleries, Libraries, Archives, and Museums] sphere, as well as Wikidata and (via Shanghai Library) the China Biographical Database (CBDB).
This workshop does not assume any prior background in digital methods, and requires only a computer with a web browser. Participants are encouraged to create a free account on ctext.org prior to the workshop: https://ctext.org/account.pl.
About the speaker:
Donald Sturgeon (德龍) is Assistant Professor in Computer Science at Durham University in the UK. He holds a doctorate in Philosophy from the University of Hong Kong, and has held postdoctoral fellowships at City University of Hong Kong and Harvard University. Since 2005, he has developed and maintained the Chinese Text Project (中國哲學書電子化計劃) https://ctext.org, an online digital library of premodern Chinese writing which serves as a platform for exploring new ways of interacting with premodern Chinese texts made possible by the digital medium. His research focuses on the application of digital methods to the study of classical Chinese language, literature, and history. Current projects include developing a framework for crowdsourced annotation and knowledge base construction of premodern Chinese texts, and the application of machine learning to the dating of premodern Chinese writing.