Plant Health Bulletin Corpus
This repository contains the resources and scripts used to collect, transform and annotate the French Plant Health Bulletins (PHB). Its content is organized as follows:
- corpus_test/ : provides metadata on the various corpora available.
- src/ : provides the code used to collect, transform and annotate the bulletins. This code is organized in 4 subfolders:
  - src/collecte/ : the workflow for collecting the bulletins from the relevant webpages (hosted by the French regional directorates responsible for food, agriculture and forests).
  - src/alvsiNLP/ : contains the plans run by AlvisNLP to produce automatic annotations of domain-specific mentions (crop usages, cultivars, development stages, harmful organisms, diseases and their vectors) and general-domain mentions (dates and localities).
  - src/xR2RML/ : the template and configuration to be used in conjunction with xR2RML to transform the data into the appropriate RDF.
  - src/workflow/ : the workflow for updating the knowledge graph with the annotations and provenance data.
- sample/ : provides sample queries that can be executed to retrieve information stored in the knowledge graph pertaining to PHBs.
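
As a rough illustration of how a query from sample/ might be submitted, the sketch below builds a standard SPARQL Protocol GET request and flattens the JSON results. It assumes the knowledge graph is exposed through a SPARQL endpoint; the endpoint URL, the placeholder query and the helper names (`build_sparql_request`, `bindings`, `run_query`) are hypothetical and not part of this repository.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint: this README does not say where the graph is served.
ENDPOINT = "http://localhost:3030/phb/sparql"

# Generic placeholder query; the real queries live in sample/.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def bindings(results: dict) -> list:
    """Flatten a SPARQL JSON results document into {variable: value} rows."""
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in results["results"]["bindings"]
    ]


def run_query(endpoint: str = ENDPOINT, query: str = QUERY) -> list:
    """Send the query over HTTP and return the flattened rows."""
    with urlopen(build_sparql_request(endpoint, query)) as response:
        return bindings(json.load(response))
```

To try it, replace `ENDPOINT` with the address of an actual endpoint, load the text of one of the sample queries into `QUERY`, and call `run_query()`.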
The transformation of the PDF bulletins to HTML is done using pdf2blocs.
A Unified Approach to Publish Semantic Annotations of Agricultural Documents as Knowledge Graphs
The state of the Git repository corresponding to what is described in this paper can be found on the SAAD (Semantic Annotations of Agricultural Documents) branch.