Referent chain annotation
Syntax and information structure
in the first 16th century Portuguese
narrative about Brazil
Outline on January 6th, 2016
These are the initial results of an experimental annotation of referent chains, carried out as part of the project Syntax and information structure in the first 16th century Portuguese narrative about Brazil. The broader aim of this procedure is outlined in the project Histories of Brazil: A linked-data repository of three 16th century Portuguese chronicles.
This annotation codes noun phrases in a syntactically annotated text according to their referents: each noun phrase in the text (argumental and non-argumental, projected by lexical or null constituents) was marked for its referent identity and for its position in the chain formed by the other occurrences of the same referent.
As an experiment, a preliminary annotation was applied to the text “Historia da Provincia Sancta Cruz”, by P.M. Gandavo (1576), previously annotated for syntax by the Tycho Brahe Corpus team (see the syntax annotation here).
The text contains 22,944 words, of which 4,165 are nouns, and 1,265 were considered as individual referents (listed in the List of Referents).
The current format of the proposal is mature as regards the basic rationale of the annotation, but still in an early stage of development as regards the markup technique.
The markup was applied to the text over the raw code of the parsed file on a trusted text-processor (Emacs), rather than on any dedicated application. However, the idea is that on the basis of this first annotated text, a better technique for marking-up may be developed, preferably with a more user-friendly interface – but, fundamentally, a technique that would allow for most of the stages to be applied automatically.
The general rationale behind the annotation was conceived in order to make this future automatic processing as easy as possible. The annotation scheme was therefore conceived to serve two aims:
- The annotation tries to capture all referential relations deemed relevant for syntactic research (in other words, it does not aim at constituting a sophisticated semantic annotation);
- The annotation codes those referential relations in a way that will allow most of the markup to be reproduced by automatic programming in future applications, and that will allow the marked relations to be studied via automatic searches with the tool Corpus Search.
Balancing the first aim against the second meant that some compromises had to be made at some stages: markups that are adequate for automatic scripting were preferred over more sophisticated linguistic indications that would be unlikely to be processed automatically.
The point where this guideline towards automation shows most clearly in the proposed markup is its dependency on the morphosyntactic tags of the previous syntactic annotation. Fundamentally, all the underlying annotation is based on the part-of-speech tags for nouns (N, N-P, NPR, NPR-P), and all the subsequent annotation is guided by the aim of building patterns of combination between such tags and the phrases they appear in, in such a way that those patterns may be captured by a logical formula in a later implementation.
With regard to this idea of a more sophisticated technical implementation for the markup, the present stage is of general proposals only. As regards the possibility of automatic searches over the results with Corpus Search, initial tests were conducted showing that, at least for the objectives of this project, the annotation is a useful tool of linguistic research.
1. Basic idea
In this annotation, each referential noun phrase in a text receives an ID related to the referent it presents or repeats, and is numbered according to its position in the sequence of mentions of that referent.
This is represented schematically in the example below, where there are three referential nouns: ‘monkeys’, ‘land’ and ‘trees’; we could code them, respectively, as 1, 2, 3, after a symbol ‘=’:
(1) Monkeys are everywhere in (2) this land. They live up in (3) the trees, which look heavy with them.
There are also other phrases that mention the same referents named by those three nouns: ‘they’ and ‘them’ for ‘monkeys’, and ‘which’ for ‘trees’. Let’s then schematically code each of the phrases that either contain or make reference to ‘monkeys’ by the tag 1, each phrase that contains or makes reference to ‘land’ by the tag 2, and each phrase that contains or makes reference to ‘trees’ by the tag 3:
(1) Monkeys are everywhere in (2) this land. (1) They live up in (3) the trees, (3) which look heavy with (1) them.
Let’s then further code each of these phrases according to the position they occupy in their respective chain of references, with a further tag =i, =ii, =iii:
(1=i) Monkeys are everywhere in (2=i) this land. (1=ii) They live up in (3=i) the trees, (3=ii) which look heavy with (1=iii) them.
The fundamental point is that the code relative to each referent (1, 2, 3 above) is repeated if this referent is interpreted in different constituents in different parts of the text; and that a second code (i, ii, iii above) indicates the position of each element within a potential sequence of mentions of that referent.
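The two-part coding just described can be sketched in a few lines of Python. This is only an illustration of the schematic example, not the procedure used for the actual annotation: the function names `to_roman` and `code_chain` are hypothetical, and the referent of each phrase is assumed to have been decided by hand beforehand.

```python
from collections import defaultdict

def to_roman(n):
    """Lower-case roman numeral; adequate for short chains (assumed n < 40)."""
    pairs = [(10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = ""
    for value, numeral in pairs:
        while n >= value:
            out += numeral
            n -= value
    return out

def code_chain(mentions):
    """mentions: (phrase, referent) pairs in text order.
    Returns (code, phrase) pairs, where code is 'referent=position'."""
    seen = defaultdict(int)          # referent -> mentions counted so far
    coded = []
    for phrase, referent in mentions:
        seen[referent] += 1
        coded.append((f"{referent}={to_roman(seen[referent])}", phrase))
    return coded

# The schematic sentence pair, with referents assigned by hand.
mentions = [("Monkeys", 1), ("this land", 2), ("They", 1),
            ("the trees", 3), ("which", 3), ("them", 1)]
coded = code_chain(mentions)
for code, phrase in coded:
    print(f"({code}) {phrase}")
```

The only information the position code needs is the order of the phrases and their referents, which is what makes this part of the markup a candidate for automation.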
These of course are schematic examples only. The actual annotation, because it is based on a previous syntactic annotation, may be a little more useful for linguistic research, as it may target phrases marked for their syntactic functions and include empty categories. The annotation works in fact as a ‘sub-annotation’ of the syntactic markings in the Tycho Brahe Corpus (i.e., Penn-Helsinki) system.
This is shown (a little less schematically) below, where (N-P …) and the other syntactic labels represent the previous syntactic annotation in the Tycho Brahe/Penn-Helsinki system, and the CODE nodes represent codes such as the ones added by the present system:
Schematic chain annotation of the referent 'Monkeys':
- 1st occurrence, nominal subject: (NP-SBJ (CODE 1=i) (N-P Monkeys))
- 2nd occurrence, pronominal subject: (NP-SBJ (CODE 1=ii) (PRO They))
- 3rd occurrence, pronominal object: (NP-ACC (CODE 1=iii) (PRO them))
Schematic annotation of the referent 'land':
- only occurrence, complement of PP: (NP (CODE 2=i) (D this) (N land))
Schematic annotation of the referent 'trees':
- 1st occurrence, complement of PP: (NP (CODE 3=i) (D the) (N-P trees))
- 2nd occurrence, relative pronoun subject: (WNP (CODE 3=ii) (WPRO which))
The biggest advantage is that, because there is a wealth of research on the syntactic annotation used here as a base, the results of this ‘sub-annotation’ may be subjected to automatic searches with known and tested programs, such as Corpus Search (corpussearch.sourceforge.net). This makes the referential dependencies among the constituents systematically observable, through automatic searches that may show how the patterns of the different phrases in the text are connected to their referential dependencies.
2. Present format of the annotation
2.1 Preparatory indexing of nouns
Indexes in sub-tags are appended to all POS tags N*
(N, N-P, NPR, NPR-P) in the corpus,
in the format of a four-digit number:
    (N*=xxxx ...)
    (N=0001 noun)        N, common noun, singular
    (N-P=0002 nouns)     N-P, common noun, plural
    (NPR=0003 Noun)      NPR, proper noun, singular
    (NPR-P=0004 Nouns)   NPR-P, proper noun, plural
Each and every N* tag in the text is given its own number,
in the order in which the tags appear in the text.
This applies also to repeated occurrences of
the same ‘noun’ throughout the text:
    (N=0001 noun)   This is N* number 0001, containing 'noun'
    (N=1001 noun)   This is N* number 1001, also containing 'noun'
    (N=1002 noun)   This is N* number 1002, also containing 'noun'
    (N=1003 noun)   This is N* number 1003, also containing 'noun'
As explained below, the numbers on the POS N* tags
constitute the basis for the marking of Referent IDs at the phrase level.
In order to help the coding of Referent IDs in the phrases,
two further elements are added to the simple number sub-tagged in each N* tag,
each after a symbol ‘/’: an occurrence counter for the contained noun
(0 for its first occurrence), and a ‘clone’ of the number received by the N* tag
in which the contained noun appeared for the first time:
    (N*=yyyy/a/xxxx ...)
    (N=0001/0/0001 noun)   This is N* 0001, first occurrence of 'noun'
    (N=1001/1/0001 noun)   This is N* 1001, second occurrence of 'noun'
    (N=1002/2/0001 noun)   This is N* 1002, third occurrence of 'noun'
    (N=1003/3/0001 noun)   This is N* 1003, fourth occurrence of 'noun'
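As a rough illustration of how this preparatory indexing could be automated, the Python sketch below numbers every N* tag in text order and appends the occurrence counter and the ‘clone’ of the first number. The function name `index_nouns` and the token format are hypothetical, and the sketch assumes that two nouns count as "the same" when their word forms are identical; the project may use a different criterion.

```python
NOUN_TAGS = {"N", "N-P", "NPR", "NPR-P"}

def index_nouns(tokens):
    """tokens: (pos_tag, word) pairs in text order.
    Returns bracketed strings; N* tags become TAG=yyyy/a/xxxx."""
    first_number = {}   # word form -> number of the N* tag where it first appeared
    occurrences = {}    # word form -> how many times it has been seen so far
    number = 0
    out = []
    for tag, word in tokens:
        if tag not in NOUN_TAGS:
            out.append(f"({tag} {word})")      # other tags are left untouched
            continue
        number += 1
        count = occurrences.get(word, 0)       # 0 on the first occurrence
        first = first_number.setdefault(word, number)
        occurrences[word] = count + 1
        out.append(f"({tag}={number:04d}/{count}/{first:04d} {word})")
    return out

tokens = [("D", "a"), ("N", "noun"), ("N", "thing"),
          ("N", "noun"), ("N-P", "nouns"), ("N", "noun")]
indexed = index_nouns(tokens)
print("\n".join(indexed))
```

In this toy input the second and third occurrences of 'noun' carry the clone 0001, while 'nouns' starts an index of its own because its form differs.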
2.2 Main annotation
The annotation is applied to the corpus in the format of a node CODE added to all NPs in the corpus, and containing the relevant markup. This is a new feature added in December 2015, and still at a very experimental stage.
“Code” nodes are added to all NPs in the text (high or low):
    (NP* (CODE xxxx) (...))
    (NP-SBJ (CODE xxxx) (...))   High NP, Subject
    (NP-ACC (CODE xxxx) (...))   High NP, Object
    (NP-LFD (CODE xxxx) (...))   High NP, Left-dislocated
    (NP (CODE xxxx) (...))       Low NP (e.g., complement of PP)
    (WNP (CODE xxxx) (...))      WNP
The general format of the annotation inside the Code node is as follows:
    (NP (CODE AA=BBB=0000=000) (...))
    CODE:
        AA   = Head/Mention tags
        BBB  = Construction type tags
        0000 = Referent ID tags
        000  = Number of occurrence tags
The categories in the sequence of codes (Head and Mention tags; Construction type tags; Referent ID tag; Number of occurrence tag) are detailed below.
The code includes tags that make explicit each NP’s status as a head or a mention, and the relevant aspects of its internal construction. These are letter tags, and they correspond to features that can be deduced from the combination of the Referent ID and number of occurrence tags (numeric index) with the syntactic markup of each NP. They were added, however, to facilitate immediate identification of each case prior to searches in Corpus Search (CS), and also to allow for some ‘fast track’ searches (hence the temporary name ‘mnemonic tags’).
These tags are also a new feature added in December 2015, and still at a very experimental stage.
Heads and Mentions Tags
Each NP is marked with a tag indicating its status as a “head” or a “mention”, with sub-types. There are five head tags and three mention tags:
Heads:
    HA (contains a noun in its 1st occurrence and no modifier or complement)
    HH (contains a noun in its 1st occurrence and a modifier or complement)
    HE (contains a noun not in 1st occurrence and no modifier or complement)
    HD (contains a noun not in 1st occurrence and a modifier or complement)
    HR (is related to a previously occurring phrase; any internal structure)
Mentions:
    MH (contains a noun in its 1st occurrence and a demonstrative)
    MR (does not mention one specific previous phrase; any internal structure)
    MM (mentions one specific previous phrase; any internal structure)
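Since these tags are meant to be deducible from the rest of the markup, the four noun-based head tags can be read off mechanically. The Python sketch below is a hedged illustration only: it covers HA, HH, HE and HD, assumes the indexed N* format TAG=yyyy/a/xxxx with a counter of 0 for first occurrences, and takes the presence of a modifier or complement as a given yes/no value. How HR and the mention tags take precedence is not specified here, so they are left out. The function names are hypothetical.

```python
import re

# Indexed N* tag such as 'N=1002/2/0001' (number / occurrence counter / clone).
N_TAG = re.compile(r"(N|N-P|NPR|NPR-P)=(\d{4})/(\d+)/(\d{4})")

def noun_occurrence(n_tag):
    """Return the occurrence counter of an indexed N* tag (0 = first occurrence)."""
    match = N_TAG.fullmatch(n_tag)
    if match is None:
        raise ValueError(f"not an indexed N* tag: {n_tag!r}")
    return int(match.group(3))

def noun_head_tag(n_tag, has_modifier_or_complement):
    """Choose among HA, HH, HE, HD for an NP headed by the given noun."""
    if noun_occurrence(n_tag) == 0:
        return "HH" if has_modifier_or_complement else "HA"
    return "HD" if has_modifier_or_complement else "HE"

print(noun_head_tag("N=0001/0/0001", False))
print(noun_head_tag("N-P=1002/2/0001", True))
```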
Construction type tags
Each NP is also marked with a tag indicating the relevant aspects of its internal construction. There are 25 construction type tags:
    NNN (contains a noun)
    NNC (contains a noun and a modifier or complement)
    DNN (contains a noun and a definite determiner)
    DNC (contains a noun, a definite determiner, and a modifier or complement)
    DCC (contains a definite determiner and a modifier or complement)
    ENN (contains a noun and a demonstrative)
    ENC (contains a noun, a demonstrative and a modifier or complement)
    ECC (contains a demonstrative and a modifier or complement)
    EEE (contains a demonstrative)
    UNN (contains a noun and an indefinite determiner)
    UNC (contains a noun, an indefinite determiner and a modifier or complement)
    UCC (contains an indefinite determiner and a modifier or complement)
    UUU (contains an indefinite determiner)
    DON (contains a noun, a definite determiner and OUTRO)
    DOO (contains a definite determiner and OUTRO)
    OON (contains a noun and OUTRO)
    OOO (contains OUTRO)
    QQN (contains a noun and Q or NUM)
    QQQ (contains Q or NUM)
    PPS (contains a possessive pronoun PRO$, in any configuration)
    PPD (contains a definite determiner and a pronoun WPRO)
    PPP (contains a pronoun PRO, *pro*, CL, SE, WPRO)
    NAN (contains more than one noun, in any configuration)
    RCL (contains a free relative clause)
    CCC (contains no noun, no determiner, and none of: OUTRO, Q, NUM, PRO$, relative clause)
Annotation for Referent ID
Referent ID annotation is in the format of
a number with four digits (0000), to be understood as the unique index for each “referent” in the text. This number is initially generated on the basis of the numbers applied to the nouns.
(NP* (CODE AA=BBB=xxxx) (...))
– Additional annotation
Additional annotation may be appended after the target Referent ID
in cases where a new referent forms a relation with a previously mentioned referent (for instance, by being a particular case of a wider group or class),
and this is explicit in the internal structure of the noun phrase
(for instance, in the case of noun phrases with ‘outro’, ‘other’).
There are two formats for the additional annotation:
(i) the related index is added with the symbol ‘&’ for relations
established with a previously occurring noun phrase:
    (NP (CODE AA=BBB=yyyy&xxxx) (...)):
    (NP (CODE AA=BBB=0001) (D a) (N=0001/0/0001 noun))
    (NP (CODE AA=BBB=5001&0001) (D an) (OUTRO other) (N=1002/3/0001 noun))
    → the NP 'another noun' is related to NP 'a noun'
(ii) the related index is added with the symbol ‘#’ for relations
established with a previously occurring noun:
    (NP (CODE AA=BBB=yyyy#xxxx) (...)):
    (NP (CODE AA=BBB=0001) (D a) (N=0001/0/0001 noun))
    (NP (CODE AA=BBB=2002#0001) (D those) (N=2002/0/2002 nouns))
    → the NP 'those nouns' is related to N 'noun'
Annotation for number of occurrence
Number of occurrence annotation is in the format of a number
with three digits (000), appended with the symbol ‘=’,
after the Referent ID (i.e., after the additional annotation too,
where this applies):
    (NP (CODE AA=BBB=xxxx=aaa) (...))
    (NP (CODE AA=BBB=yyyy&xxxx=aaa) (...))

    (NP (CODE AA=BBB=0001=001) (...))        → first occurrence of ID 0001
    (NP (CODE AA=BBB=5000&0001=001) (...))   → first occurrence of ID 5000&0001
Each combination of number of occurrence and Referent ID is unique.
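To make the layout of the CODE value concrete, here is a minimal Python sketch that splits a value into its four fields and checks the uniqueness condition just stated. It assumes the order used in the examples of this section (head/mention tag, construction type tag, Referent ID with optional ‘&’ or ‘#’ relation, number of occurrence). The names `Code`, `parse_code` and `duplicates` are hypothetical, and the sample values are invented for illustration, not taken from the annotated text.

```python
import re
from dataclasses import dataclass
from typing import Optional

CODE_RE = re.compile(
    r"(?P<status>[A-Z]{2})="                       # head/mention tag, e.g. HA
    r"(?P<construction>[A-Z]{3})="                 # construction type tag, e.g. DNN
    r"(?P<referent>\d{4})"                         # Referent ID
    r"(?:(?P<relation>[&#])(?P<related>\d{4}))?"   # optional additional annotation
    r"=(?P<occurrence>\d{3})"                      # number of occurrence
)

@dataclass(frozen=True)
class Code:
    status: str
    construction: str
    referent: str
    relation: Optional[str]
    related: Optional[str]
    occurrence: int

    @property
    def referent_key(self):
        """Full Referent ID, including any '&xxxx' or '#xxxx' part."""
        return self.referent + (self.relation or "") + (self.related or "")

def parse_code(value):
    match = CODE_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"unrecognised CODE value: {value!r}")
    return Code(match["status"], match["construction"], match["referent"],
                match["relation"], match["related"], int(match["occurrence"]))

def duplicates(codes):
    """Return (Referent ID, occurrence) pairs that appear more than once."""
    seen, repeated = set(), []
    for code in codes:
        key = (code.referent_key, code.occurrence)
        if key in seen:
            repeated.append(key)
        seen.add(key)
    return repeated

values = ["HA=DNN=0001=001", "HE=ENN=0001=002", "HR=DON=5000&0001=001"]
parsed = [parse_code(value) for value in values]
print(parsed[2].referent_key, parsed[2].occurrence)
print(duplicates(parsed))
```

Keeping the relation symbol as part of the Referent ID key means that 5000&0001 and 0001 are counted as separate chains, which is how the examples in this section treat them.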
2.3 Application of the indexes
The origin of the Referent ID index for each phrase will depend on
whether the phrase constitutes the first mention of a referent in the text,
or a further mention of a previously occurring referent.
Application of indexes for first mentions
Phrases that constitute the first mention of a referent will receive
their own index.
This will work in one of two ways, depending on the simple or complex
nature of the new reference:
(i) In simple referent relations, the index for new referents is
the number given to the contained noun in its first occurrence.
Such cases will be marked =001 for number of occurrence:
    (NP (CODE AA=BBB=0001=001) (D a) (N=0001/0/0001 noun))
(ii) In complex referent relations, the index for new referents
will be a new number, independent of the inventory of nouns.
Such cases will also be marked =001 for number of occurrence:
    (NP (CODE AA=BBB=5000=001) (D an) (OUTRO other) (N=1002/3/0001 noun))
For complex referent relations, the number of a related phrase
or the number of a related noun is linked as additional annotation,
as mentioned above (after ‘&’ or after ‘#’, respectively). So,
    (NP (CODE AA=BBB=0001=001) (D a) (N=0001/0/0001 noun))
    (NP (CODE AA=BBB=5000&0001=001) (D an) (OUTRO other) (N=1002/3/0001 noun))
Application of indexes for further mentions
Phrases that constitute further mentions of a previously occurring referent
will inherit the index of the first occurrence of that previously mentioned referent (be they simple indexes, =xxxx, or complex indexes, =xxxx&yyyy; =xxxx#yyyy).
Such cases will be marked =002, =003, etc. for number of occurrence,
according to their position in the chain formed by all mentions of that referent:
    (NP (CODE AA=BBB=0001=001) (D a) (N=0001/0/0001 noun))
    (NP (CODE AA=BBB=0001=002) (D This) (N=1001/1/0001 noun))
    (NP (CODE AA=BBB=0001=003) (D This) (ADJ mentioned) (N=1002/3/0001 noun))
Schematic example of a complete chain
To illustrate with a complete schematic chain for a Referent ID 0001
and a related Referent ID 5000&0001:
    (NP (CODE AA=BBB=0001=001) (D a) (N=0001/0/0001 noun))
    (NP (CODE AA=BBB=0001=002) (D This) (N=1001/1/0001 noun))
    (NP (CODE AA=BBB=0001=003) (D This) (ADJ mentioned) (N=1002/3/0001 noun))
    (NP (CODE AA=BBB=5000&0001=001) (D an) (OUTRO other) (N=1003/4/0001 noun))
    (NP (CODE AA=BBB=5000&0001=002) (D This) (ADJ mentioned) (N=1004/5/0001 noun))
    (NP (CODE AA=BBB=2000#0001=001) (D those) (N=2000/0/2000 nouns))
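A chain like this one can be recovered from the bracketed text with a short script, which may be useful as a consistency check before running searches. The Python sketch below is a hypothetical helper, not part of the project's tooling: it collects the CODE nodes in text order, groups them by Referent ID, and reports any chain whose numbers of occurrence do not run consecutively. It assumes that first mentions are numbered 001, following the rule stated in the prose; the `first` parameter can be changed if a different convention is in use.

```python
import re
from collections import defaultdict

# Matches the value of a CODE node, e.g. '(CODE AA=BBB=5000&0001=001)'.
CODE_NODE = re.compile(r"\(CODE\s+([^\s()]+)\)")

def collect_chains(parsed_text):
    """Map each Referent ID to its numbers of occurrence, in text order."""
    chains = defaultdict(list)
    for value in CODE_NODE.findall(parsed_text):
        *_, referent, occurrence = value.split("=")
        chains[referent].append(int(occurrence))
    return dict(chains)

def broken_chains(chains, first=1):
    """Referent IDs whose occurrences are not first, first+1, first+2, ..."""
    return [referent for referent, occurrences in chains.items()
            if occurrences != list(range(first, first + len(occurrences)))]

sample = """
(NP (CODE AA=BBB=0001=001) (...))
(NP (CODE AA=BBB=0001=002) (...))
(NP (CODE AA=BBB=0001=003) (...))
(NP (CODE AA=BBB=5000&0001=001) (...))
(NP (CODE AA=BBB=5000&0001=002) (...))
(NP (CODE AA=BBB=2000#0001=001) (...))
"""
chains = collect_chains(sample)
print(chains)
print(broken_chains(chains))
```

Because the Referent ID is taken as the whole field before the last ‘=’, related referents such as 5000&0001 and 2000#0001 form chains of their own, separate from the chain of 0001.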
M.C. Paixão de Sousa