Corpona – The Pythonic Way of Processing Corpora

Every NLP researcher has to work with different XML- or JSON-encoded files. This often involves writing code that serves a very specific purpose. Corpona is meant to streamline any workflow that involves XML- and JSON-based corpora by offering easy and reusable functionalities. The current functionalities cover parsing of and access to XML files, access to sub-items in a nested JSON structure, and visualization of a complex data structure. Corpona is fully open-source and is available on GitHub and Zenodo.


Introduction
In the era of machine learning, corpora have become one of the most important resources for NLP research. However, there is no single standard for annotating data or representing existing linguistic data. Instead, there are several: Giella's XMLs [5], TEI (see [3]), CoNLL-U [17], ELAN XML [16], EXMARaLDA XML [12] and NewsML XML [1], to name a few. With such a variety of ways to represent data, an NLP researcher is bound to spend a great deal of time converting data from one format to another. For this purpose, we have implemented Corpona. While several other corpus processing tools exist [13,14,6], we aim for simplicity and reusability with Corpona.
Corpona is an open source Python library licensed under the Apache-2.0 License. Each release version is uploaded and permanently archived automatically to Zenodo. The library is easy to install through pip (pip install corpona).
While working with XML data in various projects such as Ve'rdd [2] and neologism retrieval [11], we have found ourselves writing similar parsing code for a variety of different tasks. This called for a more centralized approach where code reuse can be maximized. This need gave the initial idea for Corpona, a one-stop tool for XML and JSON dataset processing. We needed a fast way of getting things done with as little new code as possible. As some of the XML structures we have worked with, such as Giella XML, are in use in multiple tools such as click-in-text dictionaries [7] and online learning tools [15], the features implemented in Corpona are potentially useful for a wider audience.

The Current Architecture
At the current stage, Corpona consists of three main modules: xml, summary and explorer. The main functionalities of these individual modules are easily accessible from the main corpona module. The modular structure is seen in Figure 1. This structure has been crafted keeping in mind the future development directions of the library. Only the xml module has classes. The class diagram is shown in Figure 2.

Functionalities
In this section, we describe the main features of Corpona. We go through every module and show usage examples that illustrate their use with real-world data.
Fig. 3. An example of the find child method in the explorer module
Figure 3 shows the main functionality of the explorer module. The find child method is useful for getting the data recorded in a sub-dictionary. The method takes in a dictionary and a path consisting of keys. It automatically descends into lists and loops through their sub-elements, finding dictionaries that have the keys indicated in the path. We find this feature useful because we usually want certain data out of a JSON or an XML file based on dictionary keys, regardless of whether the data were inside a list or not. The method also takes an optional default value parameter. This value is returned in case no item matches the query. As the method automatically loops through lists, multiple dictionaries may meet the criteria set in the path. For this reason, the method always outputs a list.
Figure 4 shows how to use the summarize method from the summary module. The method takes in a complex dictionary structure loaded from a JSON file or parsed from an XML file using Corpona. It produces a quick overview of the structure of the data, as seen in the example output. It is a fast way of seeing which keys are in the dictionary and which data types are stored under each key. This is very useful for understanding the structure of a new dataset.
Figure 5 shows the XML parser in action. The parser takes in a path to a file and parses it into a manageable Item structure. Corpona makes it easy to loop through the different parts of the XML. The sub-elements can be looped over by getting an item by tag name from the Item class. The XML attributes can be accessed directly, e.g. d.href would return the href attribute of the Item object d.
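The list-aware lookup described for the explorer module can be illustrated with a minimal standalone sketch. It is not Corpona's implementation: the function name find_child_sketch, its signature, and the sample data are hypothetical, and the sketch assumes that the default value is returned unchanged when nothing matches.

```python
from typing import Any, Iterator, List, Sequence


def _iter_nodes(node: Any) -> Iterator[Any]:
    """Yield node itself, or every non-list element of a (nested) list."""
    if isinstance(node, list):
        for item in node:
            yield from _iter_nodes(item)
    else:
        yield node


def find_child_sketch(data: Any, path: Sequence[str], default: Any = None) -> Any:
    """Follow `path` key by key through nested dictionaries.

    Lists met along the way are looped through automatically, so several
    dictionaries can match; all matching values are collected into a list.
    Assumption of this sketch: `default` is returned as-is if nothing matches.
    """
    current: List[Any] = [data]
    for key in path:
        next_level: List[Any] = []
        for node in current:
            for item in _iter_nodes(node):
                # Only dictionaries that actually contain the key are followed.
                if isinstance(item, dict) and key in item:
                    next_level.append(item[key])
        current = next_level
    return current if current else default


# Hypothetical sample data: a list of entries nested inside dictionaries.
entry = {
    "dictionary": {
        "entries": [
            {"lemma": "kissa", "pos": "N"},
            {"lemma": "juosta", "pos": "V"},
            {"pos": "A"},  # no "lemma" key, so it is skipped
        ]
    }
}

lemmas = find_child_sketch(entry, ["dictionary", "entries", "lemma"])
missing = find_child_sketch(entry, ["dictionary", "missing"], default="n/a")
print(lemmas)
print(missing)
```

Because the path never mentions list indices, the same call works whether "entries" holds a single dictionary or a list of them, which is the convenience the explorer module aims for.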
Conversion between different formats is on the long-term roadmap of this library. Some existing approaches, such as converting Giella XMLs to TEI format [8], could be incorporated in the future. Such conversion would not only make the data more accessible in Python but also facilitate its reuse on different platforms that operate on different XML structures.
We are also thinking of different ways of visualizing data. The biggest challenge when given a new dataset is to know exactly what it contains: what the structure is and which elements can contain lists, strings, numbers, null types, etc. Having a simple way of visualizing the structure helps in understanding how to approach the data. The current implementation in the summary module is a good start, but it might still produce an overly complex output for large and inconsistent dictionaries.
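One possible way of getting such an overview is to record, for every key path, the set of data types observed under it. The following is a minimal sketch of that idea and not the summarize method of Corpona or its output format: the function name summarize_sketch, the "[]" notation for list items, and the sample data are all hypothetical.

```python
from typing import Any, Dict, Optional, Set


def summarize_sketch(
    data: Any, prefix: str = "", types: Optional[Dict[str, Set[str]]] = None
) -> Dict[str, Set[str]]:
    """Map each key path in a nested structure to the type names seen there.

    Dictionary keys are joined with "." and list items are marked with "[]",
    so inconsistent data shows up as a path with more than one type.
    """
    if types is None:
        types = {}
    if isinstance(data, dict):
        for key, value in data.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            types.setdefault(path, set()).add(type(value).__name__)
            summarize_sketch(value, path, types)
    elif isinstance(data, list):
        item_path = prefix + "[]"
        for item in data:
            types.setdefault(item_path, set()).add(type(item).__name__)
            summarize_sketch(item, item_path, types)
    return types


# Hypothetical sample data with one inconsistently typed field ("freq").
sample = {
    "title": "demo",
    "entries": [
        {"lemma": "kissa", "freq": 3},
        {"lemma": "koira", "freq": None},
    ],
}

overview = summarize_sketch(sample)
for path in sorted(overview):
    print(path, sorted(overview[path]))
```

Collapsing all list items under a single "[]" path keeps the output proportional to the number of distinct key paths rather than to the size of the dataset, which is one way of limiting the complexity of the overview for large dictionaries.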
Despite the growing number of Universal Dependencies annotated corpora for endangered Uralic languages [10,9], we do not currently have any plans to incorporate a CoNLL-U parser in Corpona, as this feature is available in our other library called UralicNLP [4]. However, UralicNLP only provides rudimentary access to Giella XML dictionaries. In the future, Corpona should be included in UralicNLP as a dependency for better parsing of these files.