As I noted in December, I’ve been able to work this semester with Thomas Posillico ’20 analyzing Roman imperial coin legends in the invaluable OCRE data set, which you can download in RDF format from nomisma.org.
We’ve made enough progress that I want to present some of our ongoing work in a series of blog posts. To begin with, I want to identify requirements we defined for our project.
- We have converted the RDF data to a citable corpus of texts in the OHCO2 model.
- We are creating a parallel aligned corpus with all abbreviations expanded.
- Using Tabulae, we have built a morphological parser for the expanded corpus.
- All our work is publicly visible on GitHub.
- Beginning from the raw RDF, every step of our work will be reversible. We are currently summarizing the stages of our work as a series of scripts that start from the RDF source and generate successive expansions of the texts’ abbreviations. The source files for building the morphological parser are included in the project’s GitHub repository.
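
To make the idea of a parallel aligned corpus concrete, here is a minimal Python sketch. It is not the project's code: the URN namespace, the `CitableNode` class, the `EXPANSIONS` table, and the helper functions are all hypothetical names chosen for illustration, and the expansion is a naive token lookup that ignores grammatical case and context. It assumes only that each citable node pairs a CTS URN with a text, and that the two versions of a legend share the same passage reference so they stay aligned.

```python
from dataclasses import dataclass

# Hypothetical, deliberately tiny expansion table. Real legends need
# context-sensitive handling (case endings, multi-token abbreviations).
EXPANSIONS = {
    "IMP": "IMPERATOR",
    "CAES": "CAESAR",
    "AVG": "AVGVSTVS",
    "COS": "CONSVL",
}


@dataclass(frozen=True)
class CitableNode:
    """One citable passage: a CTS URN paired with its text."""
    urn: str
    text: str


def with_version(urn: str, version: str) -> str:
    """Return the same URN with its version identifier replaced.

    Assumes the form urn:cts:NAMESPACE:GROUP.WORK.VERSION:PASSAGE.
    """
    prefix, cts, namespace, work, passage = urn.split(":")
    group_and_work = work.rsplit(".", 1)[0]
    return ":".join([prefix, cts, namespace, f"{group_and_work}.{version}", passage])


def expand(node: CitableNode, version: str = "expanded") -> CitableNode:
    """Expand known abbreviations token by token, keeping the passage reference."""
    tokens = node.text.split()
    expanded_text = " ".join(EXPANSIONS.get(token, token) for token in tokens)
    return CitableNode(with_version(node.urn, version), expanded_text)


# Hypothetical URN and legend, for illustration only.
raw = CitableNode("urn:cts:example:ocre.legends.raw:1", "IMP CAES VESPASIANVS AVG")
expanded = expand(raw)
print(expanded.urn)
print(expanded.text)
```

Because the expanded node differs from the raw one only in its version identifier, the two corpora can be joined on the passage reference, and running the same function again over the raw source regenerates the expanded text.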
I’ll elaborate on each of these points in subsequent posts. To address some questions that are of immediate interest to us, our current focus is on completing the expanded corpus for:
- all OCRE entries from RIC 2.1
- all entries from RIC 1-9 that include a small set of themes we are studying