Guidelines and issues

Referent chain annotation

Partial results, 19/11/2015

Presentation


Below are the guidelines for an experimental annotation of chains of referents, as part of the Project Syntax and information structure in the first 16th century Portuguese narrative about Brazil. The broader aim of this procedure is outlined in the project Histories of Brazil: A linked-data repository of three 16th century Portuguese chronicles.

The annotation marked the referents of noun phrases in a syntactically annotated text, Historia da Provincia Sancta Cruz, by P.M. Gandavo (1576), previously annotated for syntax by the Tycho Brahe Corpus team (see the syntax annotation  here).

In what follows, we present an outline of the format of the annotation (1), a summary of the involved procedures (2), an assessment of the issues that remain to be solved and the first results obtained in the light of the aims of the broader project (3), to conclude with a more detailed exposition of the technicalities of the annotation (4).


1. Outline


1.1 General remarks

This annotation codes noun phrases in a syntactically annotated text according to their referents: each noun phrase in a text (argumental and non-argumental, projected by lexical or null constituents) was marked for its referent identity and for its position in the chain formed by the other occurrences of the same referent.

The present guidelines were developed after experiments applied to the text Historia da Provincia Sancta Cruz, with 22,944 words – of which 4,165 are nouns, and 1,265 were considered as individual referents (listed in the List of Referents).

The current format of the proposal is mature as regards the basic rationale of the annotation, but still in an early stage of development as regards the markup technique. 

The markup was applied to the first text over the raw code of the parsed file on a trusted text-processor (Emacs), rather than on any sophisticated application. However, the idea is that on the basis of this first annotated text, a better technique for marking-up may be developed, preferably with a more user-friendly interface – but, fundamentally, a technique that would allow for most of the stages to be applied automatically.

The general rationale behind the annotation was conceived in order to make this future automatic processing as easy as possible. Therefore, it can be said that the annotation scheme was conceived to serve two aims:

  1. The annotation tries to capture all referential relations deemed relevant for syntactic research (in other words, it does not aim at constituting a sophisticated semantic annotation); 
  2. The annotation codes those referential relations in a way that will allow most of the markup to be reproduced by automatic programming in future applications, and that will allow the marked relations to be studied via automatic searches with the tool Corpus Search.

The balance between the first and the second aim meant some compromises had to be made at some stages, with markups that would be adequate for automatic scripting preferred over more sophisticated linguistic indications that would not be likely to be automatically processed.

The main point in which this guideline towards automation is present in the proposed markup is its dependency on the morphosyntactic tags of the previous syntactic annotation. Fundamentally, all the underlying annotation is based on the part-of-speech tags for nouns (N, N-P, NPR, NPR-P), and all the subsequent annotation is guided by the aim of building patterns of combination between such tags and the phrases they appear on, in such a way that those patterns may be captured by a logical formula in a later implementation.

With regard to this idea of a more sophisticated technical implementation for the markup, the present stage is of general proposals only. As regards the possibility of automatic searches over the results with Corpus Search, initial tests were conducted showing that, at least for the objectives of the project Syntax and information structure in the first 16th century Portuguese narrative about Brazil, the annotation is a useful tool of linguistic research.


1.1.1 The expected annotation and its goals

In this annotation, each referential noun phrase in a text receives an ID related to the referent it mentions or repeats, and is numbered according to its position in the sequence of mentions of that referent.

This is represented schematically in the example below, where there are three referential nouns: ‘monkeys’, ‘land’ and ‘trees’; we could code them, respectively, as 1, 2, 3, after a symbol ‘=’: 

Monkeys=1 are everywhere in this land=2. 
They live up in the trees=3, 
which look heavy with them.

There are also other phrases that mention the same referents named by those three nouns: ‘they’ and ‘them’ for ‘Monkeys’, and ‘which’ for ‘trees’; let’s then schematically code each of the phrases that either contain or make reference to ‘monkeys’ by the tag 1, each phrase that contains or makes reference to ‘land’ by a tag 2, and each phrase that contains or makes reference to ‘trees’ by a tag 3;

Monkeys=1 are everywhere in this land=2. 
They=1 live up in the trees=3, 
which=3 look heavy with them=1.

Let’s then further code each of these phrases according to the position they occupy in their respective chain of references, with a further tag =i, =ii, =iii:

Monkeys=1=i are everywhere in this land=2=i 
They=1=ii live up in the trees=3=i, 
which=3=ii look heavy with them=1=iii ...

That is it, basically.

The fundamental point is that the code relative to each referent (1, 2, 3 above) is repeated if this referent is interpreted in different constituents in different parts of the text; and that a second code (i, ii, iii above) indicates the position of each element within a potential sequence of mentions to that referent.

These of course are schematic examples only. The actual annotation, because it is made based on a previous syntactic annotation, may be a little more useful for linguistic research, as it may target not “words”, but actual phrases marked for their syntactic functions. The annotation works, in fact, as a ‘sub-annotation’ of the syntactic markings in the Tycho Brahe Corpus (i.e., Penn-Helsinki) system.

This is shown (a little less schematically) below, where (N-P …) and other markings in blue represent the previous syntactic annotation in the Tycho Brahe/Penn-Helsinki system, and the markings in red, codes such as the ones added by the present system:

Schematic chain annotation of the referent 'Monkeys':

- 1st occurrence, nominal subject:    (NP-SBJ=1=i (N-P Monkeys))
- 2nd occurrence, pronominal subject: (NP-SBJ=1=ii (PRO They))
- 3rd occurrence, pronominal object:  (NP-ACC=1=iii (PRO them))

Schematic annotation of the referent 'land'
- only occurrence, complement of PP:  (NP=2=i (D this) (N land))

Schematic annotation of the referent 'trees'
- 1st occurrence, complement of PP:   (NP=3=i (D the) (N-P trees))
- 2nd occurrence, nominal subject:    (WNP=3=ii (WPRO which))

The biggest advantage is that, because there is a wealth of research over the syntactic annotation used here as base, the results from this ‘sub-annotation’ may be subjected to automatic searches with known and tested programs, such as Corpus Search (corpussearch.sourceforge.net). This makes the referential dependencies among the constituents systematically observable, by conducting automatic searches that may show how the patterns of the different phrases in the text are connected to their referential dependencies. A few test searches have been conducted and are described in 3.2 further below.


1.1.2 Note on the term ‘reference’

In the guidelines for this initial annotation we shall use some terms in a very blunt way. This is particularly the case of ‘reference‘. By this we mean that one constituent mentions, by repetition or by using an explicit morphosyntactic equivalent, a previously occurring constituent.

More specifically, we mean that a noun phrase (NP) mentions, by repetition or by using an explicit morphosyntactic equivalent, a previously occurring constituent expressed by a noun (N or NPR). By ‘morphosyntactic equivalent’ we mean, here, word classes such as pronouns (lexical or null); phrasal combinations such as demonstratives plus nouns (DEM + N); and ‘abstract’ combinations involving ellipses:

Monkeys are everywhere in this land. 
They live up in the trees, which look heavy with them. 
Those monkeys are sometimes funny. 
Those are really the funniest animals. 
In fact, everybody loves apes. 

Examples of elements marked as sharing a noun's Reference ID:
(categories taken as morphosyntactic indicatives of co-reference) 

they, them    → (NP=1=ii  (PRO them))
Those monkeys → (NP=1=iii (DEM Those)(N-P monkeys)) 
Those         → (NP=1=iv  (DEM Those))

In a nutshell: in this system, each noun in the text corresponds to a ‘referent’, and the noun phrases will mention these referents, either explicitly (when the noun itself is included in the phrase’s projection) or indirectly (by morphosyntactic indication, such as the use of a pronoun) – and this mentioning is in fact the ‘reference‘ to be marked.


1.1.3 Note on the term ‘referents in relation’

The annotation tries to code relations between constituents that share a referential relation to one another, such as being a part, or a group, of another mentioned referent – however, this is done in a very specific frame.

The aim of coding those relations is to allow them to be retrieved by future searches, for reasons explained in 3.1.1 further below. Therefore, ‘relations’ between referents are taken in a strict way, pertaining to explicit relations expressed in morphosyntactic forms or syntactic constructions. ‘Relations’ are marked only when they are formally explicit: either by virtue of the fact that the relating constituent is a pronoun, as said above, or by the fact that there are explicit marks in a relating NP, such as quantifiers (‘Some’, ‘Each’); ‘Others’; etc.

In the absence of those markers, even if there is an abstract interpretation of a logical relation, it is left uncoded (more technical details on this annotation are in 2.3 further below). Note that this means the annotation does not target synonyms or any form of ‘shared reference’ that is not morphosyntactically explicit. It could be argued, for instance, that ‘apes’ in the example further above is a synonym, or at any rate somehow refers to ‘Monkeys’; however, in the present system, this relation would not be marked. There could be a further implementation with a provision for more delicate relations such as this; however, at the present stage the priority is to make a system that is as closely linked to the morphosyntactic categories as we can, for the reasons exposed above.

So, to reinforce: this, clearly, is not a semantic annotation; not a marking for logical relations among referents; not an abstract ontology – but, rather, a marking for relations that are morphosyntactically explicit in the text.


1.2 Format of the annotation


1.2.1 Referent ID

Referent ID annotation is in the format of a number with four digits (0000), appended to the syntactic tags for Noun Phrases (NP*) with the symbol ‘=’ :

 =ReferentID
 =0001
  • Note: The symbol ‘=’ is used rather than ‘-‘, as the latter is already present in the basis syntactic annotation (this makes the Referent ID markings entirely independent from the syntactic annotation).


1.2.2 Number of occurrence

Number of occurrence annotation is in the format of a number with three digits (000), appended with the symbol ‘=’, after the Referent ID:

 =ReferentID=NumberOfOccurrence
 =0001=001 → this is the first occurrence of Referent ID 0001

Each combination of number of occurrence and Referent ID is unique.

The example below applies this format to the schematic example given further above, targeting the referent ‘monkeys‘:

Monkeys are everywhere in this land. 
They live up in the trees, which are heavy with them.
The referent 'Monkeys': =0001

 - 1st occurrence, as a nominal subject: (NP-SBJ=0001=001 (N-P=0001 Monkeys))
 - 2nd occurrence, as pronominal subject:(NP-SBJ=0001=002 (PRO they))
 - 3rd occurrence, as pronominal object: (NP-ACC=0001=003 (PRO them))
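
The format above (four-digit Referent ID, three-digit number of occurrence, both appended with ‘=’) can be sketched in Python as follows. This is an illustration only, not part of the annotation toolchain: the names TAG, make_label and split_label are hypothetical, and the sketch assumes labels without any additional annotation.

```python
import re

# Hypothetical pattern for a phrase label carrying both sub-tags,
# e.g. "NP-SBJ=0001=001" (syntactic label, Referent ID, number of occurrence).
TAG = re.compile(r"^(?P<label>[^=\s]+)=(?P<ref>\d{4})=(?P<occ>\d{3})$")


def make_label(label, referent, occurrence):
    """Append =ReferentID=NumberOfOccurrence to a syntactic label."""
    return f"{label}={referent:04d}={occurrence:03d}"


def split_label(tagged):
    """Return (label, referent, occurrence), or None if the label is untagged."""
    m = TAG.match(tagged)
    if m is None:
        return None
    return m.group("label"), int(m.group("ref")), int(m.group("occ"))


print(make_label("NP-SBJ", 1, 1))
print(split_label("NP-ACC=0001=003"))
```

Because ‘=’ does not occur in the base syntactic labels, the original label can always be recovered by cutting at the first ‘=’.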


1.2.3 Additional annotation

Additional annotation for dependent reference relations may be appended after the target Referent ID in special cases (see 2.3 further below):

 =ReferentID/AdditionalAnnotation=NumberOfOccurrence
 =0002&0001=001 → 0002 is part of a class expressed by 0001
 =0002#0001=001 → 0002 is a quality of 0001
 =0002$0001=001 → 0002 is possessed by 0001
 =0002+=001     → 0002 refers to a previously mentioned event
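
As a rough illustration of how these sub-tags could be read back automatically, the sketch below parses the four forms listed above. The names RELATIONS, SUBTAG and parse_subtag are hypothetical, and the sketch assumes that the related referent (where there is one) is always a four-digit ID.

```python
import re

# Symbols and glosses as listed in 1.2.3; the dictionary itself is illustrative.
RELATIONS = {
    "&": "part of a class expressed by",
    "#": "quality of",
    "$": "possessed by",
    "+": "refers to a previously mentioned event",
}

# =ReferentID, optionally followed by <symbol><RelatedID> or a bare "+",
# then =NumberOfOccurrence.
SUBTAG = re.compile(r"^=(\d{4})(?:([&#$])(\d{4})|(\+))?=(\d{3})$")


def parse_subtag(subtag):
    """Split a sub-tag into its parts; return None if it is malformed."""
    m = SUBTAG.match(subtag)
    if m is None:
        return None
    referent, symbol, related, event, occurrence = m.groups()
    return {
        "referent": referent,
        "relation": RELATIONS.get(symbol or event),
        "related_to": related,
        "occurrence": occurrence,
    }


print(parse_subtag("=0002&0001=001"))
print(parse_subtag("=0002+=001"))
```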


1.3 Marked categories


1.3.1 On the part-of-speech level: N, NPR

Numbers in sub-tags are appended to all nouns, N, N-P, NPR, NPR-P, as a basis for the marking of reference IDs in the phrase level:

 (N=0001 noun)     N, common noun, singular
 (N-P=0002 noun)   N-P, common noun, plural
 (NPR=0003 noun)   NPR, proper noun, singular
 (NPR-P=0004 noun) NPR-P, proper noun, plural
  • Note: Each and every noun in the text is given their own number, even if the ‘same’ noun is repeated throughout the text. This number is not a reference ID at this point, just the ID of each instance of each noun; it is the basis of the referent IDs for the phrases (details in 2.1 below).


1.3.2 On the phrase level: NP, WNP

1.3.2.1 NPs

Reference ID and Number of Occurrence sub-tags are appended to all NP nodes:

 (NP-SBJ=0001=001 (...)) High NP, Subject (* see 1.3.2.1.1)
 (NP-ACC=0001=002 (...)) High NP, Object
 (NP-LFD=0001=003 (...)) High NP, Left-dislocated
 (NP=0001=004     (...)) Low NP (e.g., below PPs).

1.3.2.1.1 Special cases

NPs containing expletive subjects, (NP-SBJ *exp*). Such NPs are not tagged for reference, for obvious reasons.

NPs projected by the clitic SE, (NP-SE …). All instances of the clitic SE in the syntactic annotation project an NP (NP-SE …); but some NP-SEs are co-indexed with expletive NPs. Following the logic of what is proposed above for expletives, NP-SEs co-indexed with expletive NPs do not get marked for referent ID. More on this in 3.1.3 further below.

NPs containing traces. For NPs containing traces, the Number of Occurrence tag will be the same as that of the non-trace category to which they are co-indexed. There will be a difference regarding the code for number of occurrence, to be dealt with in 2.2.2.1 further below.

1.3.2.2 WNPs

Reference ID and Number of Occurrence sub-tags are appended to all WNPs:

(WNP=0001=005 (WPRO ... ))
  • Note: The Number of Occurrence tag in WNPs will be repeated on the trace to which it is co-indexed (see 1.3.2.1.1 above).


2. Procedures


2.1 Counting of nouns

The first procedure in the annotation is the counting and number-labeling of all nouns in the file. This stage can be performed semi-automatically, and is described in detail in 4.1 further below.

It basically consists in the numbering of each word tagged as N, N-P, NPR or NPR-P in the previously syntactically annotated file with a sub-label =0000. The following example (which is the first sentence of the test-text) shows this, first as a list of the counted nouns, then as the complete tree:  

(1)
Reinando aquele muito católico e sereníssimo príncipe el-rei Dom MANUEL, 
fez-se uma frota para a Índia 
de que ia por capitão mór Pedro Álvares Cabral...
'On the kingdom of that very catholic and serene Prince the King Dom Manuel, 
a fleet to India was made 
of which went as captain general Pedro Álvares Cabral...'

List of Nouns: 

 (NPR=0001 Príncipe)
 (NPR=0002 el-Rei)
 (NPR=0003 Dom)
 (NPR=0004 MANUEL)
 (N=0005 frota)
 (NPR=0006 Índia)
 (N=0007 capitão)
 (NPR=0008 Pedro) 
 (NPR=0009 Álvares) 
 (NPR=0010 Cabral)
In the tree:

 ( (IP-MAT (IP-GER (VB-G REINANDO)
		  (NP-SBJ (D aquele)
			  (ADJP (Q muito)
				(ADJ católico)
				(CONJP (CONJ e)
				(ADJX (ADJ-S sereníssimo))))
			   (NPR=0001 Príncipe)
			   (NP-PRN (NPR=0002 el-Rei) (NPR=0003 Dom) (NPR=0004 MANUEL))))
	  (, ,)
	  (VB-D fez-)
	  (NP-SE-4 (CL -se))
	  (NP-SBJ-4 (D-UM-F uma)
		    (N=0005 frota)
		    (PP (P para)
			(NP (D-F a) (NPR=0006 Índia)))
		    (CP-REL (WPP-2 (P de)
		                   (NP (WPRO que)))
			    (IP-SUB (VB-D ia)
			    (PP (P por)
				(NP (N=0007 capitão)
				    (ADJ mór)
				    (NP-GEN *T*-2)))
			    (NP-SBJ (NPR=0008 Pedro) (NPR=0009 Álvares) (NPR=0010 Cabral))))) ...))

The semi-automatic steps to achieve this may include (apart from the simplest form shown above) a few further mnemonic aids, as also described in 4.1 further below.

The important part here is that all the N(-P) and NPR(-P) in the file will have a correspondent number (from 0001 to 4164 in the test file). Notice that this includes, at this stage, even the ‘repetitions’ of a noun; i.e., repeated instances of the ‘same’ noun. It is to be understood that this number is the number of each instance of a N, N-P, NPR or NPR-P tag, not the actual referential ID yet.

These are the numbers that will serve as the basis for the marking of the noun phrases – both noun phrases that directly include the numbered nouns, and noun phrases that in some form ‘refer’ to a previously occurring numbered noun, by means of morphosyntactic devices such as the use of a pronoun or certain reference structures, as will be described right below.
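
The numbering step described in this section can be sketched as a single regular-expression pass over the parsed file. This is only an illustration under stated assumptions (it is not the semi-automatic procedure of 4.1): the function name number_nouns is hypothetical, and each noun leaf is assumed to be written as an opening bracket, the tag, one space, and the word.

```python
import re

# Matches the opening of a noun leaf: "(N ", "(N-P ", "(NPR " or "(NPR-P ".
# The trailing space keeps phrase labels such as "(NP-SBJ " from matching.
NOUN = re.compile(r"\((NPR-P|NPR|N-P|N) ")


def number_nouns(tree_text, start=1):
    """Append =0000-style running numbers to every noun tag.

    Returns the numbered text and the last number used.
    """
    counter = start - 1

    def add_number(match):
        nonlocal counter
        counter += 1
        return f"({match.group(1)}={counter:04d} "

    numbered = NOUN.sub(add_number, tree_text)
    return numbered, counter


fragment = "(NP-SBJ (D-UM-F uma) (N frota) (PP (P para) (NP (D-F a) (NPR Índia))))"
numbered, last = number_nouns(fragment, start=5)
print(numbered)
```

Starting the count at 5 reproduces the numbers of ‘frota’ and ‘Índia’ in example (1); on a whole file the count would start at 1.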


2.2 Identification of referents


2.2.1 Marking reference IDs

The most important stage is the identification of referents in the phrases and application of IDs and number of occurrence to each, accordingly. Here are a few schematic guidelines on how to manually mark NPs for Reference IDs (based on a text prepared according to 2.1 above); in 2.2.2 are the general guidelines on numbering the marked NPs.

2.2.1.1  Noun Phrases containing nouns

i.e., (NP (N ...))
      (NP (D ...)(N ...))
      (NP (NPR ...)(NPR ...))
      (NP (N ...) (CONJ ...) (N ...))

The referent ID to be given to a Noun Phrase projected by a noun – (NP (N …)) – depends on the position of this noun in its own chain of reference:

In the first occurrence of a noun projected as a phrase, the NP is coded with the number of the contained noun, i.e., the number automatically attributed to that noun after the preparation stage.

The example below shows the first occurrence of a noun ‘Bugios’, ‘monkeys’, in the test-file; this noun’s N-P tag was numbered as 1904, and so this first NP that refers to it (and contains it) will receive this number as its “Reference ID”:


(2)
Bugios há na terra muitos e de muitas castas...
'Monkeys there are in the land many and of many castes...'
(NP-SBJ      (N-P=1904 Bugios)) >
(NP-SBJ=1904 (N-P=1904 Bugios)) 

All subsequent occurrences of a given noun in other projections will be marked with the number of the noun in its first reference.

In the case of the noun above, all further occurrences of ‘Bugios’ will be tagged =1904, irrespective of the number previously tagged to the actual noun ‘bugios’ in the place of repetition. In the test-file, the next time this noun appears is as N-P number 3773; the NP containing it, however, will be given the same ID as the one above, i.e., 1904:

(3)
e quando se querem ajuntar assobiam como pássaros, ou como bugios
'and when they want to reunite they whistle like birds, or like monkeys'
 where (N-P=3773 bugios)
previous mention: 
Bugios há na terra muitos e de muitas castas 
where (N-P=1904 Bugios) 

(NP-SBJ      (N-P=3773 bugios)) > 
(NP-SBJ=1904 (N-P=3773 bugios))
  • Note:  Because in such cases, the Reference ID on the phrase level will not match the number of the noun that it contains, a slightly more complex format of the numbering of nouns would come in handy. If the noun ‘bugios’, in the example above (‘como bugios’), had the number 1904 repeated in all its instances, the equivalence ‘bugios’ = 1904 would be immediately visible at the moment of annotation (more will be said on this in 4.1.2.3):
(NP-SBJ=1904 (N-P=3773/1904 bugios))

The two examples above, for first mention and for further mention, are quite simple cases in terms of the internal structure of the NP – i.e., in both instances, we have an NP formed solely by a noun, (NP (N-P Bugios)) and (NP (N-P bugios)). For noun phrases containing determiners and nouns – (NP (D …) (N …)) – the rationale is exactly the same as for those simpler cases: the number of reference will depend on whether the phrase represents the first mention of a referent, or a subsequent mention. The three further examples below show this:

(4) 
As fontes que há na terra, são infinitas ... 
'The springs that are in the land, are infinite...' 

'as fontes', first mention of N 'fontes', =0333: 
(NP-SBJ      (D-F-P As) (N-P=0333 fontes)) -> 
(NP-SBJ=0333 (D-F-P As) (N-P=0333 fontes)) 
(5) 
fez-se uma frota para a Índia ... 
'a fleet to Índia was made ...'

'uma frota', first mention of N 'frota', =0005: 
(NP (D-UM-F uma) (N=0005 frota)) -> 
(NP=0005 (D-UM-F uma) (N=0005 frota))
(6)
E depois de haver bonança, junta outra vez a frota ...
'And after there was good weather, gathered again the fleet...'

'a frota', 8th mention of N 'frota', =0005:
(NP      (D-F a) (N=0029 frota)) ->
(NP=0005 (D-F a) (N=0029 frota))

Beware, only, that not all structures which contain a determiner and a noun represent a simple reference to the contained noun. Instances where this is not the case will be dealt with separately in 2.3 below, as will the cases of NPs containing combinations of nouns with certain determiners, quantifiers, etc.

Here, the only ‘special’ case we will tackle as regards NPs containing Nouns are the cases where more than one noun is involved:

2.2.1.1.1 Noun phrases containing more than one noun
i.e. (NP (N ...) (CONJ ...) (NX (N ...)))
     (NP (NPR ...)(NPR...))

Noun phrases that contain more than one noun include the ones formed by coordinated nouns and NPs formed by ‘combined’ proper nouns. In such cases, there comes the problem of which of the noun’s ID would be used as the Referent ID for the projected NP.

One option would be to choose one of the nouns as a ‘core’ referent, but this is not adequate, as explained right below; the general solution, then, is to use an entirely new number for the Referent ID, quite independent from any of the contained nouns. Technically, this means that for those cases, extra ID numbers must be generated (i.e., numbers that are not part of the initial inventory of numbered nouns). This could be done more obviously with the extra numbers starting from the last noun number, i.e., if the last noun was number 4164 as in the test file, the first extra-number would be 4165, and so on. Another system is to start the new numbers in a new thousands-series, e.g., 5000, 6000, 7000 etc., if the last noun was 4164. This is the system currently in use for this annotation, with a different series attributed for each different case that needs a new number (making it easier to spot and if necessary change the numbering later):

NPs formed by multiple NPRs. Some NPs are formed by more than one proper noun, clearly forming one single referent, like Pedro Álvares Cabral. The reason why it is not a good idea to choose one of the contained nouns as the ‘core’ (and mark its number on the NP) is that such ‘combined’ nouns may be recurrent, and the recurrences do not always contain all the nouns used in the first occurrence (i.e., in this case, the referent Pedro Álvares Cabral may alternatively appear as Pedro Álvares or Cabral).

Currently, all cases of NPs formed by ‘combined’ NPRs receive a Referent ID entirely independent from the contained NPRs, in the 9000 series:

(7)
(NP-SBJ      (NPR=0008_Pedro Pedro) 
             (NPR=0009_Álvares Álvares) 
             (NPR=0010_Cabral Cabral))  →
 
(NP-SBJ=9001 (NPR=0008_Pedro Pedro) 
             (NPR=0009_Álvares Álvares) 
             (NPR=0010_Cabral Cabral))

NPs formed by coordinated nouns. Another case is that of noun phrases formed by nouns in conjunction. The same logic applies here as above: rather than choosing one of the contained nouns as the ‘core’, the NP is marked with an additional, separate number. This is particularly important since one of the nouns in coordination may well appear in other, uncoordinated NPs.

At present, the cases of NPs containing coordinated nouns are being numbered with sequences in the 6000’s (note that the NXs contained in the complex coordinated NPs each get their own appropriate reference ID code, in blue below):

(8)
...grandes haveres de ouro e pedraria 
'...great possessions of gold and precious stones'
 
(NP=6015 (NX=0461=001 (N=0461 ouro))
                      (CONJ e)
                      (NX=0462=001 (N=0462 pedraria)))

2.2.1.2 Noun phrases containing pronouns

i.e.,  (NP (PRO ...))
       (NP (*pro*)) 
       (NP (WPRO ...)) 
       (NP (CL ...))

Phrases that are projections of (lexical or empty) pronouns are coded with the same reference ID as their interpreted referent nouns, i.e., with the number of these nouns’ first reference (much like NPs formed by recurrent nouns).

In the example below, the phrases containing the pronoun ‘eles’, ‘they’; the null pronoun *pro*, and the clitic pronoun ‘os’, ‘them’, are marked with the same reference ID as the previously mentioned noun ‘Bugios’, ‘monkeys‘ (1904), interpreted as co-referent:

previous mention:
Bugios há na terra muitos e de muitas castas ... 
where (N-P=1904 Bugios)

(9)
e por serem tão conhecidos em diversas partes ...
'and because [*pro*] are so well known everywhere' ...

 (NP-SBJ      *pro*) →  
 (NP-SBJ=1904 *pro*)

(10)
a todas as pessoas que a eles se chegam
'to all the people who get close to them'

 (PP (P a) (NP      (PRO eles))) → 
 (PP (P a) (NP=1904 (PRO eles)))

(11)
e se os tratam com as mãos...
'and if one handles them...'
 (NP-ACC      (CL os)) →
 (NP-ACC=1904 (CL os))
(12)
 (WNP-2      (WPRO que))  →
 (WNP-2=1904 (WPRO que))

2.2.1.3 Noun phrases not containing nouns or pronouns

There are two situations in which a noun phrase may not contain either a noun or a pronoun: phrases that contain traces, and phrases that are formed only by determiners, demonstratives, adjectives, etc. Each is treated with a different rationale, as shown below.

2.2.1.3.1 Noun phrases containing ‘elided’ nouns 

i.e., (NP (D-UM ...))
      (NP (D ...) (ADJ ...))
      (NP (DEM ...))
      (NP (Q ...))

We will call the phrases formed exclusively by determiners, demonstratives, adjectives, etc., phrases containing  ‘elided‘ nouns, in the sense that some noun is clearly referenced, although it is not mentioned. 

Here we deal only with the most frequent case of such phrases with elisions, which is also the most straightforward as regards referent ID annotation: the cases where the phrase does not contain a noun, but makes clear reference to a noun previously mentioned in other configurations. In such cases, the NP is marked with the same Referent ID as the noun taken as ‘elided’.

In the example below, ‘Estes’, ‘Those‘, is interpreted as meaning ‘estes Açores’, with reference to ‘Açores’, ‘Sea-hawks‘, previously mentioned; so the phrase (NP-SBJ (D-P Estes)) receives the Referent ID corresponding to ‘Açores’ (much as if the phrase were ‘Estes Açores‘, in fact):

(13)
Há nesta província muitas aves de rapina muito formosas e de várias castas, convém a saber , Águias , Açores, e Gaviões (...).  
Os Açores são como os de cá (...) 
Estes são muito ligeiros e de maravilha lhe escapa ave (...)
'There are in this province many birds of prey very handsome and of many castes, e.g., Eagles, Sea-hawks, and Hawks (...).
The Sea-hawks are like those from here (...)
Those are very swift and seldom does a bird escape them (...)'

where: 
Açores:    (NP=2045     (NPR=2045 Açores)) 
Os Açores: (NP-SBJ=2045 (NPR=2067 Açores)) 
 
Estes:     (NP-SBJ=2045 (D-P Estes)) 

Beware, however, that other constructions with ‘elided nouns’, in the exact format as above, do not simply share the referent with a previously mentioned noun – rather, they form a new referent, not mentioned before. Mostly, this new referent will correspond to a group or sub-group of an already mentioned referent. Those cases are part of the more complex ‘referents in relation’ category, that will be dealt with in detail in 2.3. Here it suffices to point out that such phrases will be coded with a new ID number, additionally coded as ‘related’ with the previously mentioned, related referent. 

2.2.1.3.2 NPs containing traces

NPs containing a trace receive the same reference ID as the lexical category to which they are co-indexed, as mentioned in 1.3.2.1 (see 2.2.2.1 below however for how this affects the number of occurrence label):

(14)
previous mention: 
 (NP-SBJ=1904 (N-P=1904 Bugios))
 (WNP-2      (WPRO que)) (NP-SBJ      *T*-2) →
 (WNP-2=1904 (WPRO que)) (NP-SBJ=1904 *T*-2)

2.2.1.8 Special Cases: fixed labels

First-person subjects (singular or plural) are given the fixed ID ‘0000’:

 (NP-SBJ=0000 (PRO eu)),
 (NP-SBJ=0000 *pro*)

Indeterminate subjects (in Portuguese, a *pro*) are given the fixed ID ‘XXXX’:

 (NP-SBJ=XXXX *pro*)


2.2.2 Numbering NPs marked for reference

The number of occurrence sub-label refers to the position of each phrase in each referent’s ‘chain of reference‘, i.e., in the (potential) sequence of mentions to this same referent.

The rule is quite simple: each time a referent is annotated in a Noun Phrase, the occurrence is counted, and so subsequently:

(15)
Bugios há na terra muitos e de muitas castas ...
e por _ serem tão conhecidos em diversas partes ...
e se os tratam com as mãos...
a todas as pessoas que a eles se chegam
e quando se querem ajuntar assobiam como pássaros, ou como bugios

'Monkeys there are in this land many of many castes...'
'and because [they] are so well known everywhere...'
'and if one handles them...'
'to all the people who get close to them...'
'and when they want to reunite they whistle like birds, or like monkeys'
 
(NP-SBJ=1904=001 (N-P=1904 Bugios)) 
(NP-SBJ=1904=002 *pro*)
(NP=1904=003 (PRO eles))
(NP-ACC=1904=004 (CL os))
(NP-SBJ=1904=005 (N-P=3773 bugios))
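
The counting rule can be sketched as a running tally per Referent ID. This assumes the Referent IDs have already been assigned by hand as described in 2.2.1, and that the phrases are given in text order; the function name number_occurrences is hypothetical, and traces and NP-SE (the two special cases) are not handled here.

```python
from collections import defaultdict


def number_occurrences(mentions):
    """Append the number of occurrence to (label, referent_id) pairs in text order."""
    seen = defaultdict(int)  # referent ID -> mentions counted so far
    tagged = []
    for label, referent in mentions:
        seen[referent] += 1
        tagged.append(f"{label}={referent}={seen[referent]:03d}")
    return tagged


chain = number_occurrences([
    ("NP-SBJ", "1904"),  # Bugios
    ("NP", "0333"),      # a different referent, counted separately
    ("NP-SBJ", "1904"),  # *pro*
    ("NP", "1904"),      # eles
])
print(chain)
```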

There are two special cases, both pertaining to NPs in which the number of occurrence is not exclusive:

2.2.2.1 NPs containing Traces

NPs containing traces (*T*) receive the same ‘number of occurrence’ as the lexical category to which they co-index (but with the addition of an indicator, T_):

(16)
(WNP-2=1904=005  (WPRO que))  ->
(NP-SBJ=1904=T_005 *T*-2)
  • Note: Of course, (WNP-2=1904=005 (WPRO que)) and (NP-SBJ=1904=T_005 *T*-2) above don’t actually ‘share the same referent’ – rather, they constitute the same occurrence of a referent. Or better: the lexical phrase and its trace aren’t actually two separate entities in a semantic perspective, as the annotation of this ‘subject’ as a separate constituent containing a ‘trace’ is an option for better syntactic description only. It would be adequate, therefore, to simply repeat the annotation =1904=005 for both ‘entities’, down to the number of occurrence code. This, however, causes disturbances in quantifying references and correcting potential mistakes; so the option here was to mark the trace as a special case in the annotation of number of occurrence. This is not a very elegant solution and should be revised in the future.
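
One way to quantify references without counting a phrase and its trace twice is to count distinct pairs of Referent ID and number of occurrence, ignoring the T_ indicator. The sketch below is an assumption about how such a count could be done, not the procedure used in the project; the names MENTION and count_occurrences are hypothetical, and phrases with fixed labels or additional annotation are simply not matched by this pattern.

```python
import re
from collections import Counter

# =ReferentID=NumberOfOccurrence, with an optional T_ marking a trace.
MENTION = re.compile(r"=(\d{4})=(?:T_)?(\d{3})")


def count_occurrences(text):
    """Count distinct occurrences per Referent ID in annotated text."""
    pairs = set(MENTION.findall(text))  # a trace collapses with its antecedent
    return Counter(referent for referent, _ in pairs)


sample = (
    "(WNP-2=1904=005 (WPRO que)) (NP-SBJ=1904=T_005 *T*-2) "
    "(NP-ACC=1904=004 (CL os)) (NP=0333=001 (N-P=0333 fontes))"
)
counts = count_occurrences(sample)
print(counts)
```

Noun-level numbers such as (N-P=0333 fontes) are not counted, since they carry no number of occurrence.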

2.2.2.2 NP-SE in ‘passive’ constructions

In the syntactic annotation, NP-SEs in ‘passive’ constructions are marked as co-indexed with the (referential) NP-SBJ of the clauses they appear in. In the present annotation, such NP-SEs are marked with the same Referent ID and the same number of occurrence as that NP-SBJ:

fez-se uma frota para a Índia
'a fleet was made to Índia':
 
(NP-SE-4=0005=001 (CL -se))
(NP-SBJ-4=0005=001 uma frota para a Índia)
  • Note: It is not clear if this is the best option for those cases; more on this problem, and on the annotation of other types of NP-SE, in 3.1.3 further below.


2.2.3 Step-by-step example

Let us take an example sentence and mark the reference IDs for all its NPs; we shall use the opening in the test text, already shown above as regards the numbers of nouns:

(17)
Reinando aquele muito católico e sereníssimo príncipe el-rei Dom Manuel, 
fez-se uma frota para a Índia 
de que ia por capitão mór Pedro Álvares Cabral...
'In the reign of that very Catholic and most serene Prince the King Dom Manuel, 
a fleet to India was made 
of which went as captain general Pedro Álvares Cabral...'

Combining the previous syntactic annotation and the numbering of all words tagged as nouns, we would have the tree shown in 2.1 above; we shall now add to this a ‘stub’ after each numbered noun, with a copy of the numbered noun (following a procedure explained in 4.1.2 below), as a mnemonic device. Noun numbers and ‘stubs’ are marked in blue; finally, the targets of the annotation – i.e., the NPs – are marked in red below:

(18)
 ( (IP-MAT (IP-GER (VB-G REINANDO)
		  (NP-SBJ (D aquele)
			  (ADJP (Q muito)
				(ADJ católico)
				(CONJP (CONJ e)
				(ADJX (ADJ-S sereníssimo))))
			   (NPR=0001_Príncipe Príncipe)
			   (NP-PRN (NPR=0002_el-Rei el-Rei) (NPR=0003_Dom Dom) (NPR=0004_MANUEL MANUEL))))
	  (, ,)
	  (VB-D fez-)
	  (NP-SE-4 (CL -se))
	  (NP-SBJ-4 (D-UM-F uma)
		    (N=0005_frota frota)
		    (PP (P para)
			(NP (D-F a) (NPR=0006_Índia Índia)))
		    (CP-REL (WPP-2 (P de)
		                   (NP (WPRO que)))
			    (IP-SUB (VB-D ia)
			    (PP (P por)
				(NP (N=0007_capitão capitão)
				    (ADJ mór)
				    (NP-GEN *T*-2)))
			    (NP-SBJ (NPR=0008_Pedro Pedro) (NPR=0009_Álvares Álvares) (NPR=0010_Cabral Cabral))))) ...))

2.2.3.1 Reference IDs

The first step is to mark each of the target NPs with a reference ID; this will work differently according to which type of NP we are looking at, as seen above shortly – NPs projected by nouns (simple or complex), by pronouns, etc.:

NPs containing a noun. Let us mark the NPs containing ‘Príncipe’, ‘frota’, ‘Índia’ and ‘capitão’ (for the moment ignoring the second NP contained in the first and in the last): each will receive the ID of the noun it contains, and will be numbered as the first occurrence of that referent:

(19)
(NP-SBJ=0001 aquele muito católico e sereníssimo (NPR=0001_Príncipe Príncipe) el-Rei Dom Manuel)
(NP-SBJ-4=0005 (D-UM-F uma)(N=0005_frota frota)) 
(NP=0006 (D-F a) (NPR=0006_Índia Índia)) 
(NP=0007 (N=0007_capitão capitão) (ADJ mór) (NP-GEN *T*-2))

NPs containing more than one noun. In this example there are two NPs containing more than one noun: ‘el-Rei Dom Manuel’ and ‘Pedro Álvares Cabral’; as mentioned above in 2.2.1.1.1, each of these will receive its own Referent ID, independent of any of the contained nouns (and because both are cases of ‘combined’ NPRs, this number will be in the 9000 series):

(20)
(NP-PRN=9000 (NPR=0002_el-Rei el-Rei) (NPR=0003_Dom Dom) (NPR=0004_MANUEL MANUEL))
(NP-SBJ=9001 (NPR=0008_Pedro Pedro) (NPR=0009_Álvares Álvares) (NPR=0010_Cabral Cabral))

NP containing WPRO. Let’s now mark the WNP, (WNP (WPRO que)), in ‘de que ia por capitão mór…‘ – this is co-referent with ‘frota‘, so it will be marked as occurrence number 2 of 0005:

(21)
(WPP-2 (P de) (NP=0005 (WPRO que)))

NP containing trace. Now the null (NP-GEN *T*-2), which contains a trace co-indexed with (WNP=0005 (WPRO que)) – i.e., the second NP referring to ‘frota’ – will receive the same Reference ID as this WNP (but see further below for what will happen to its number of occurrence):

(22)
(NP-GEN=0005 *T*-2)

NP containing the clitic pronoun SE. The phrase (NP-SE-4 (CL -se))  is co-referenced with (NP-SBJ-4 frota ) in the syntactic annotation, and so will receive the same Reference ID as this subject (see below for its number of occurrence):

(23)
(NP-SE-4=0005 (CL -se))

Up to this point, then, we would have the following annotation of all noun phrases:

(23)
 ( (IP-MAT (IP-GER (VB-G REINANDO)
		  (NP-SBJ=0001_Príncipe (D aquele)
			       (ADJP (Q muito)
			       (ADJ católico)
			       (CONJP (CONJ e)
			       (ADJX (ADJ-S sereníssimo))))
			   (NPR=0001 Príncipe)
			   (NP-PRN=9000 (NPR=0002 el-Rei) (NPR=0003 Dom) (NPR=0004 MANUEL))))
	  (, ,)
	  (VB-D fez-)
	  (NP-SE-4=0005_frota (CL -se))
	  (NP-SBJ-4=0005_frota (D-UM-F uma)
		    (N=0005 frota)
		    (PP (P para)
			(NP=0006_Índia (D-F a) (NPR=0006 Índia)))
		    (CP-REL (WPP-2 (P de)
		                   (NP=0005_frota (WPRO que)))
			    (IP-SUB (VB-D ia)
			    (PP (P por)
				(NP=0007_capitão (N=0007 capitão)
				             (ADJ mór)
				             (NP-GEN=0005_frota *T*-2)))
			    (NP-SBJ=9001 (NPR=0008 Pedro) (NPR=0009 Álvares) (NPR=0010 Cabral)))))...))

2.2.3.2 Numbers of occurrence

Now, all the phrases must be marked for number of occurrence, with those that share the same Referent ID numbered in sequence.

In the sentence above a good example is ‘frota’, ‘fleet’, with multiple occurrences (notice that on numbering, the auxiliary stubs, ‘_frota’,  may be removed):

(24)
 (NP-SBJ-4=0005_frota (D-UM-F uma) (N=0005_frota frota) ...)
 (NP-SE-4=0005_frota (CL -se))
 (NP=0005_frota (WPRO que))
 (NP-GEN=0005_frota *T*-2) 
 ->
 (NP-SBJ-4=0005=001 (D-UM-F uma) (N=0005 frota) ...)
 (NP-SE-4=0005=001 (CL -se))
 (NP=0005=002 (WPRO que))
 (NP-GEN=0005=T_002 *T*-2)

Observe two special cases:

NP containing trace. The null (NP-GEN *T*-2) had received the same Reference ID as the WNP it was co-indexed with, (WNP=0005 (WPRO que)); now it also receives the same number of occurrence as this WNP, plus the indicator T_:

(25)
(NP-GEN=0005=T_002 *T*-2)

NP containing the clitic pronoun SE. The (NP-SE-4 (CL -se)), co-referenced with (NP-SBJ-4 frota), received this subject’s Reference ID as we saw; now it will receive also the same number of occurrence (see 3.1.3):

(26)
(NP-SE-4=0005=001 (CL -se))

In conclusion, for this clause we would have the following annotation of all noun phrases, complete with Reference IDs and numbers of occurrence (0000=000):

(27)
 ( (IP-MAT (IP-GER (VB-G REINANDO)
		  (NP-SBJ=0001=001 (D aquele)
			  (ADJP (Q muito)
				(ADJ católico)
				(CONJP (CONJ e)
				(ADJX (ADJ-S sereníssimo))))
			   (NPR=0001 Príncipe)
			   (NP-PRN=9000=001 (NPR=0002 el-Rei) (NPR=0003 Dom) (NPR=0004 MANUEL))))
	  (, ,)
	  (VB-D fez-)
	  (NP-SE-4=0005=001 (CL -se))
	  (NP-SBJ-4=0005=001 (D-UM-F uma)
		    (N=0005 frota)
		    (PP (P para)
			(NP=0006=001 (D-F a) (NPR=0006 Índia)))
		    (CP-REL (WPP-2 (P de)
		                   (NP=0005=002 (WPRO que)))
			    (IP-SUB (VB-D ia)
			    (PP (P por)
				(NP=0007=001 (N=0007 capitão)
				             (ADJ mór)
				             (NP-GEN=0005=T_002 *T*-2)))
			    (NP-SBJ=9001=001 (NPR=0008 Pedro) (NPR=0009 Álvares) (NPR=0010 Cabral)))))...))
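
The numbering step walked through above can be sketched as a small two-pass procedure. This is only an illustration under stated assumptions, not the technique actually used (the markup was applied by hand): the `Mention` class, its `copies` field and the function `number_occurrences` are hypothetical names, the mentions are assumed to be listed in text order with their Referent IDs already assigned, and `copies` is assumed to point only at a mention that is numbered independently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mention:
    label: str                    # syntactic label, e.g. "NP-SBJ-4"
    referent: str                 # Referent ID, e.g. "0005"
    copies: Optional[int] = None  # index of the mention whose number is reused
    is_trace: bool = False        # traces get the T_ indicator


def number_occurrences(mentions):
    """Return full labels (LABEL=REFERENT=OCCURRENCE) for all mentions."""
    counters = defaultdict(int)
    numbers = {}
    # Pass 1: independent mentions are numbered in sequence per Referent ID.
    for i, m in enumerate(mentions):
        if m.copies is None:
            counters[m.referent] += 1
            numbers[i] = counters[m.referent]
    # Pass 2: traces and 'passive' NP-SEs reuse the number of their antecedent,
    # which may come later in the text (as with NP-SE before NP-SBJ).
    labels = []
    for i, m in enumerate(mentions):
        n = numbers[i] if m.copies is None else numbers[m.copies]
        prefix = "T_" if m.is_trace else ""
        labels.append(f"{m.label}={m.referent}={prefix}{n:03d}")
    return labels


# The NPs of the example sentence, in text order.
sentence = [
    Mention("NP-SBJ", "0001"),
    Mention("NP-PRN", "9000"),
    Mention("NP-SE-4", "0005", copies=3),    # shares with NP-SBJ-4
    Mention("NP-SBJ-4", "0005"),
    Mention("NP", "0006"),
    Mention("NP", "0005"),                    # 'que'
    Mention("NP", "0007"),
    Mention("NP-GEN", "0005", copies=5, is_trace=True),
    Mention("NP-SBJ", "9001"),
]
labels = number_occurrences(sentence)
```

Two passes are needed because the NP-SE precedes the subject whose number it shares; a single left-to-right pass would not yet know that number.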


2.3 More complex cases: Referents in relation


2.3.1 General remarks

In what was described up to this point, we have considered as ‘new referents’ the noun phrases formed by nouns mentioned for the first time, with examples like ‘Bugios’, ‘Açores’ (i.e., (NP (N …))), or ‘uma frota’, ‘a Índia’ (i.e., (NP (D …) (N …))). We did mention some special cases such as NPs containing more than one noun, such as ‘Pedro Álvares Cabral’, which involve special provisions for the indication of Referent ID – but still, those are cases where the ‘novelty’ of the referent is given by the appearance of some noun for the first time.

However, noun phrases may introduce new referents (or in fact, build new referents) even when they do not include a noun mentioned for the first time at all. These cases are more complex for the annotation, and are described in detail now.

Such new referents, ‘built’ by particular morphosyntactic combinations, normally constitute entities or events that are related to previously mentioned entities or events – but should not be conflated with them.

One example would be the following sequence of clauses. This is the full sequence of the 19 paragraphs of chapter 6 in the text (which can be read in full here), and it is representative of other passages in this text as regards the way in which ‘new’ referents are introduced. The chapter is about ‘the animals in this land’; the first paragraph introduces the topic ‘animals’, and all the subsequent paragraphs present and describe one kind of animal at a time. Now: each time a ‘new’ animal is introduced, this is done either directly by naming it (‘veados’, ‘deer’; ‘coelhos’, ‘rabbits’; ‘Bugios’, ‘monkeys’ etc.) or by the use of certain constructions such as ‘Outros animais’, ‘Other animals’, etc. What we see, then, is a sequence of ‘new’ referents, all of which, at the same time, constitute a sub-group of a previously mentioned referent, ‘animals’. In the snippet below, the first constituent for the referent ‘animal’, and all the subsequent constituents that either directly name or refer to a kind of animal, are highlighted in blue:

(28)
História da Província Santa Cruz - Chapter 6
Title and initial snippets of the 19 paragraphs

Capítulo 6: Dos animais e bichos venenosos que há nesta província .

§ Como esta província seja tão grande, e a maior parte dela inabitada e 
  cheia de altíssimos arvoredos e espessos matos, não é de espantar que haja nela  muita diversidade de animais ...
§ Há muitos veados; e muita soma de porcos de diversas castas ...
§ Também há uns animais na terra, a que chamam Antas ....
§ Outros animais há a que chamam Cotias ...
§ Há também outros maiores , a que chamam Pacas ...
§ Outros há também nestas partes muito para notar ...
§ Há também coelhos como os de cá da nossa pátria ...
§ Finalmente que desta e de toda a mais caça de que acima tratei, participam (como digo) todos os moradores  ....
§ Outros animais há nesta província muito feros ...
§ Outro gênero de animais há na terra, a que chamam Cerigoês ...
§ Um certo animal se acha também nestas partes, a que chamam Preguiça ...
§ Outro gênero de animais há na terra a que chamam Tamanduás ...
§ Bugios há na terra muitos e de muitas castas como já se sabe ...
§ Há uns ruivos não muito grandes ....
§ Outros há perto maiores que estes ...
§ Há também uns pequeninos pela costa  ...
§ Há também pelo mato dentro cobras muito grandes ...
§ Outras há de outra casta diferente, não tão grandes como estas ...
§ Também há lagartos muito grandes ...
§ Outros muitos animais e bichos venenosos há nesta província ...

To annotate this sequence for referents, in theory we could simply say each of the constituents highlighted in blue in the clauses above (‘veados’, ‘porcos’, ‘coelhos’, ‘Bugios’, ‘lagartos’; ‘uns animais’, ‘outros animais’, ‘outros’, etc.) is the first occurrence of a ‘new’ referent. However, this would not be entirely precise – or rather, this would erase an important detail: each of those ‘new’ referents is, in fact, a new example within a class. Some of them, furthermore, appear in constructions whose purpose is, precisely, to show this – such as ‘uns animais’, ‘outros animais’, ‘outros’, etc.

Now, because we believe this referential relatedness is relevant for the linguistic study of the texts, we would like to indicate it. On the other hand, because this annotation is strictly based on morphosyntactic criteria, we opt to annotate this only in the cases where the referential relation is morphosyntactically explicit – i.e., we opt to annotate the relation between ‘outros animais’ and ‘animais’, but not the relation between ‘veados’, ‘porcos’, ‘coelhos’, ‘Bugios’, ‘lagartos’, and ‘animais’ (more on that restriction to follow). 

Therefore, this annotation indicates that some new referents have a relation to previously mentioned referents, with a simple additional markup: in short, such referents are marked as ‘related to’ a previously mentioned referent by the use of a symbol ‘&’, linking  the new Referent ID to the Referent ID of the previously mentioned relevant category (i.e., here, as …&1473, since 1473 = ‘animals’) :

(29)
(NP-ACC=5810&1473=001 Outros animais) há a que chamam Cotias ...
(NP-ACC=5820&1473=001 Outros) há também nestas partes muito para notar ...
(NP-ACC=5833&1473=001 Outros animais) há nesta província muito feros ...
etc.

As mentioned, the relations are marked only when they are formally explicit in the NPs; here is a complete list of constructions considered for this kind of markup. Notice that it is not the case that these constructions are always involved in the building (and markup) of ‘new’ referents related to previously mentioned referents, only that they may be so (in many cases, of course, constructions such as the ones below will simply be referring to a previously mentioned referent):

Constructions that may be involved in 
'referents in relation' markup:
   (N.B. (i): some could include (ADJ-P ...) projections)
   (N.B. (ii): some could include agreement sub-taggings, i.e., *-P, *-F, *-F-P)

1. NPs containing determiners, D (including D-UM):
   (NP (D ...) (N ...)),
   (NP (D ...))

2. NPs containing 'outro', OUTRO:
   (NP (OUTRO ...)(N ...)),
   (NP (OUTRO ...))

3. NPs containing quantifiers, Q (including Q-G): 
   (NP (Q ...) (N ...))
   (NP (Q ...))

4. NPs containing numerals, NUM:
   (NP (NUM ...) (N ...))
   (NP (NUM ...))

5. NPs containing demonstratives, DEM:
   (NP (DEM ...))

6. NPs containing possessive pronouns, PRO$:
   (NP (PRO$ ...) (N ...)),
   (NP (PRO$ ...))
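
As a rough illustration of the list above, candidate NPs could be flagged by looking at the tags of their immediate children. The sketch below is an assumption-laden aid, not the project's procedure: the names `MARKERS`, `base_tag` and `relation_markers` are hypothetical, and the base category is taken to be whatever precedes the first hyphen in a tag (so that D-UM-F counts as D, Q-G as Q, OUTRO-P as OUTRO). It only finds candidates; whether a candidate really builds a ‘new’ related referent remains an annotator's decision, as stressed above.

```python
# Base categories of the six construction types listed above.
MARKERS = {"D", "OUTRO", "Q", "NUM", "DEM", "PRO$"}


def base_tag(tag: str) -> str:
    """Strip sub-taggings: 'D-UM-F' -> 'D', 'OUTRO-P' -> 'OUTRO'."""
    return tag.split("-")[0]


def relation_markers(child_tags):
    """Return the relation-marking categories among an NP's immediate children."""
    return [base_tag(t) for t in child_tags if base_tag(t) in MARKERS]


def has_noun(child_tags) -> bool:
    """True when the NP directly contains a common or proper noun."""
    return any(base_tag(t) in {"N", "NPR"} for t in child_tags)


# (NP (OUTRO-P Outros) (N-P animais) (CP-REL ...)) -> candidate, with a noun
outros_animais = ["OUTRO-P", "N-P", "CP-REL"]
# (NP (D-UM-F uma)) -> candidate, without a noun
uma = ["D-UM-F"]
# (NP (NPR Pedro) (NPR Álvares) (NPR Cabral)) -> not a candidate
cabral = ["NPR", "NPR", "NPR"]
```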

Sections 2.3.2 and 2.3.3 below explain the markup format and give examples of its application.


2.3.2 Format

The property of ‘reference in relation’ will be codified as an additional code ‘&’ after the new referent number:

 =NewReferentID&PreviouslyMentionedReferentID 

 examples:
 =5001&0001 (5001 is a class of 0001; eg. 'The other monkeys', 'Others')
 =5002&0001 (5002 is an amount of 0001; eg. 'five monkeys', 'five')
 =5003&0001 (5003 is possessed by 0001; eg. 'his tail', 'his') 
 =5004&     (5004 refers to a group of referents, or events; eg. 'All I said above')

To facilitate future revisions, such new Reference ID numbers are currently being given within the 5000 series. 
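
The ‘&’ format can be decomposed with a single regular expression. The following Python sketch is illustrative only: `parse_relation` is a hypothetical name, and the pattern assumes four-digit Referent IDs, an optional three-digit number of occurrence after the relation, and an empty right-hand side for cases such as =5004&.

```python
import re

# =NEW&PREVIOUS with PREVIOUS optional, then an optional =OCCURRENCE
RELATION_RE = re.compile(
    r"=(?P<new>\d{4})&(?P<previous>\d{4})?(?:=(?P<occurrence>\d{3}))?$"
)


def parse_relation(label: str):
    """Return the parts of a 'referent in relation' code, or None if absent."""
    m = RELATION_RE.search(label)
    if m is None:
        return None
    occ = m.group("occurrence")
    return {
        "new": m.group("new"),
        # None when the related referent is left unspecified ("=5004&")
        "related_to": m.group("previous"),
        "occurrence": int(occ) if occ is not None else None,
    }


outros = parse_relation("NP-ACC=5810&1473=001")
unspecified = parse_relation("NP-SBJ=5184&")
plain = parse_relation("NP-SBJ-4=0005=001")  # no '&': not a relation code
```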


2.3.3 Application

To illustrate each application, the examples below show the marking of each of the constructions 1 to 6 listed above, trying to give a wide enough context of occurrence to justify why each case was considered a ‘new’ referent, albeit related to a previous referent. Where relevant, we show a case of new reference with and without a noun in the NP.

The examples include the syntactic trees of each case – stripped, however, of any other Reference ID markings, to focus on the referents in relation. 

2.3.3.1 In NPs containing determiners

– With the structure (NP (D …))

(30) uns pequenos, NP=5203&2046

in

Os Gaviões também são muito destros e forçosos.
Especialmente uns pequenos como esmerilhões o são muito (...)	
 'The Hawks are also very able and strong.
  Particularly ones as small as merlins are quite so (...)'

where:
'Gaviões' =2046 

( (IP-MAT-PRN (ADVP (ADV especialmente))
	      (NP-SBJ=5203&2046 (D-UM-P uns)
				(ADJP (ADJ-P pequenos)
				      (CP-CMP (C como)
					      (IP-SUB (NP-SBJ (N-P esmerilhões))))
	      (PP (P em)
		  (NP (PRO$ sua) (N quantidade)))))
	      (NP-ACC (CL o))
	      (SR-P são)
	      (ADVP (ADV-R tanto)
		    (, ,)
		    (CP-DEG (C que) ...))))

(31) 'uma', NP-SBJ-2=5020&1193 

in

Há todavia farinha de duas maneiras: uma se chama de guerra, e outra fresca.
'There is however flour of two kinds: one is called of war, the other fresh'

where 'farinha', N=1193

( (IP-MAT (IP-MAT-1 (NP-SBJ-2=5020&1193 (D-UM-F uma))
		    (NP-2=5020&1193 (CL se))
		    (VB-P chama)
		    (IP-SMC (NP-SBJ=5020&1193 *-2)
			    (PP (P de)
				(NP (N guerra)))))
	  (, ,)
	  (CONJP (CONJ e)
		 (IP-MAT-1 (NP-SBJ (OUTRO-F outra))
			   (ADJP (ADJ-F fresca))))
	  (. .))
  (ID G_008,16.185))

– With the structure (NP (D …))

(32) 'O da banda do norte', NP-SBJ=5016&0681 
'o da banda do sul', NP-SBJ=5017&0681 

in

Mas porque de umas a outras há muita distância, e a gente vai em muito crescimento, repartiu-se agora em duas governações, convém a saber, da capitania de Porto Seguro para o Norte fica uma, e da do Espírito Santo para o Sul fica outra: e em cada uma delas assiste seu governador com a mesma alçada. O da banda do norte reside na Bahia de todos os Santos, e o da banda do sul no Rio de Janeiro.

 O da banda do norte reside na Bahia de todos os Santos,
 e o da banda do sul no Rio de Janeiro
 'The one from the part of the north resides in Bahia de todos os Santos,
 and the one from the part of the south in Rio de Janeiro'

where  governador =0681

( (IP-MAT (IP-MAT-1 (NP-SBJ=5016&0681 (D O)
				      (PP (P d@)
					  (NP (D-F @a)
					      (N banda)
					      (PP (P d@)
					          (NP (D @o) (N norte))))))
		    (VB-P reside)
		    (PP (P n@)
			(NP (D-F @a)
			    (NPR Bahia)
			    (PP (P de)
				(NP (Q-P todos) (D-P os) (NPR-P Santos))))))
	  (, ,)
	  (CONJP (CONJ e)
		 (IP-MAT-1 (NP-SBJ=5017&0681 (D o)
					     (PP (P d@)
						 (NP (D-F @a)
						     (N banda)
						     (PP (P d@)
							 (NP (D @o) (NPR Sul))))))
			   (PP (P n@)
			       (NP (D @o)
				   (NPR Rio)
				   (PP (P de)
				       (NP (NPR Janeiro)))))))
	  (. .))
  (ID G_008,15.148))

2.3.3.2 In NPs containing ‘outro’

 – With the structure (NP (OUTRO …) (N …))

(33) 'Outros animais', NP-ACC=5833&1473

in

Outros animais há nesta província muito feros, e prejudiciais a toda esta caça, e ao gado dos moradores; aos quais chamam Tigres, ainda que na terra a mais da gente os nomeia por Onças

  Outros animais há nesta província muito feros
 'Other animals there are in this province very fierce'

where 'animais' =1473 

( (IP-MAT (NP-SBJ *exp*)
	  (NP-ACC=5833&1473 (OUTRO-P Outros)
		            (N-P=1734/10/1473 animais)
			    (CP-REL *ICH*-1))
	  (HV-P há)
	  (PP (P n@)
	      (NP (D-F @esta) (N província)))
	  (ADJP (Q muito)
		(ADJ-P feros)...)))

– With the structure (NP (OUTRO …))

(34) 'outros', NP-SBJ=5058&0079

in

Junto delas havia muitos Índios , quando os Portugueses começaram de as povoar: mas porque os mesmos  Índios se levantavam contra eles e faziam-lhes muitas traições , os governadores e capitães da terra destruíram-nos pouco a pouco e mataram muitos deles: outros fugiram para o sertão, e assim ficou a terra desocupada de gentio ao longo das povoações.
 outros fugiram para o sertão...
 'others ran away towards the inland...'

where 'Índios', =0079 

( (IP-MAT (NP-SBJ=5058&0079 (OUTRO-P outros))
	  (VB-D fugiram)
	  (PP (P para)
	      (NP (D o) (N sertão)))
	  (, ,))
  (ID G_008,10.74))

2.3.3.3 In NPs containing quantifiers

a) With the structure (NP (Q…) (N …))

(35) alguns Portugueses, NP-SBJ=5004&0012 

in

Aqui se metem dois rios nele que vem do sertão, por um dos quais entraram alguns Portugueses quando foi do descobrimento que foram fazer no ano de 35 e navegaram por ele acima duzentas e cinqüenta léguas, até que não puderam ir mais por diante por causa da água ser pouca e o rio se ir estreitando de maneira, que não podiam já por ele caber as embarcações.

 por um dos quais entraram alguns Portugueses
 'through one of which some Portuguese entered'

where Portugueses =0012 (in previous, distant context)


(CP-REL-4 (WPP-3 (P por)
	         (NP (D-UM um)
		     (PP (P d@)
			 (NP (D-P @os) (WPRO-P quais)))))
		    (IP-SUB (PP *T*-3)
			    (VB-D entraram)
			    (NP-SBJ=5004&0012 (Q-P alguns) (NPR-P=0012 Portugueses)) ...))

b) With the structure (NP (Q …))

(36) algumas em particular,  NP=5019&6042 

in

SÃO tantas e tão diversas as plantas, frutas e ervas que há nesta província , de que se podiam notar muitas particularidades, que seria coisa infinita escrevê-las aqui todas e dar notícia dos efeitos de cada uma miudamente. E por isso não farei agora menção, senão de algumas em particular, principalmente daquelas, de cuja virtude e fruto participam os Portugueses.

 E por isso não farei agora menção senão de algumas em particular
 'And because of that I will not give mention now but of some in particular' 

where 'plantas, frutas e ervas' =6042 

( (IP-MAT (CONJ E)
	  (NP-SBJ *pro*)
	  (PP (P por)
	      (NP (DEM isso)))
	  (NEG não)
	  (VB-R farei)
	  (ADVP (ADV agora))
	  (NP-ACC (N menção)
	  (, ,)
		   (PP (SENAO senão)
			(P de)
			(NP=5019&6042 (Q-F-P algumas)
				      (PP (P em)
				      (ADVP (ADJ-G particular))))))
	  (, ,) ...))

(37) 'alguns', NP-SBJ=5799&1473

in

  Os outros animais que na terra se acharam, todos são bravos de natureza, 
  e alguns estranhos nunca vistos em outras partes.
 'The other animals that were found in the land, all are fierce of nature, 
  and some strange never seen elsewhere'.

where animais =1473 

(IP-MAT (NP-SBJ=5799&1473 (Q-P alguns)
					          (CP-REL *ICH*-1))
			 (ADJP (ADJP (ADJ-P estranhos))
			       (CONJP (ADJP (ADV-NEG nunca)
					    (VB-AN-P vistos)
					    (PP (P em)
						(NP (OUTRO-F-P outras) (N-P partes))))))...)

2.3.3.4 In NPs containing numerals

a) With the structure (NP (NUM …) (N …))

(38) 'duas barras', NP-SBJ=7020&0594 

in
  
Esta ilha em que os moradores habitam divide da terra firme um braço de mar que a rodeia onde também se ajuntam alguns rios que vem do sertão. E assim ficam duas barras lançadas cada uma para sua banda, e a ilha no meio.
 
  E assim ficam duas barras lançadas cada uma para sua banda, e a ilha em meio.
 'And thus lay two sandbanks thrown each to its side, and the island in the middle'

where barras =0594

( (IP-MAT-3 (CONJ E)
	    (ADVP (ADV assim))
	    (VB-P ficam)
	    (NP-SBJ=7020&0594 (NUM-F duas)
			          (N-P=0594/0/0594 barras)
			          (CP-REL *ICH*-4))
	    (ADJP (VB-AN-F-P lançadas)
		  (QP (Q-G cada) (D-UM-F uma))
		  (PP (P para)
		      (NP (PRO$ sua) (N banda))))
	    (, ,)
	    (IP-MAT-PRN-3 (CONJ e)
			  (NP-SBJ (D-F a) (N ilha))
			  (PP (P em)
			      (NP (N meio))))...))

2.3.3.5 In NPs containing demonstratives

Note: Constructions with demonstratives sometimes involve cases when the demonstrative is linked to complex groups of previously mentioned entities or events. Because of this complexity in co-reference, the co-referent target is not indicated; the code ‘&’ simply states that the marked constituent refers to ‘something’ mentioned before, and this ‘something’ is not specified.

With the structure (NP (DEM …))

(39) 'Isto', NP-SBJ=5184& 

in

  Isto geralmente se costuma nestas partes
 'That generally is the used way in these parts' 
 
where 'that' refers to a sequence of different situations described in the previous context
( (IP-MAT (NP-SBJ-1=5184& (DEM Isto))
	  (ADVP (ADV geralmente))
	  (NP-SE-1=5184& (CL se))
	  (VB-P costuma)
	  (PP (P n@)
	      (NP (D-F-P @estas) (N-P partes)))
	  (, ,))
  (ID G_008,15.161))

2.3.3.6 In NPs containing possessive pronouns

a) With the structure (NP (PRO$ …) (N …))

(40) 'O seu mantimento', NP-SBJ=1192&1182

in

Um certo animal se acha também nestas partes, a que chamam Preguiça (que é pouco mais, ou menos do tamanho destes) o qual tem um rosto feio , e umas unhas muito compridas quase como dedos. Tem uma gadelha grande no toutiço que lhe cobre o pescoço , e anda sempre com a barriga lançada pelo chão , sem nunca se levantar em pé como os outros animais : e assim se move com passos tão vagarosos , que ainda que ande quinze dias aturado, não vencerá distância de um tiro de pedra. O seu mantimento, é folhas de árvores, e em cima delas anda o mais do tempo.

 O seu mantimento é folhas de árvores ...
 'Its keeping are tree leaves'

 where 'preguiça' =1182
 and 'seu mantimento' is read as genitive to 'preguiça'

( (IP-MAT (NP-SBJ=1192&1182 (D O) (PRO$ seu) (N=1848/3/1192 mantimento))
	  (, ,)
	  (SR-P é)
	  (NP-ACC (N-P folhas)
		  (PP (P de)
		      (NP (N-P árvores)))))
  (ID G_008,22.342))

2.3.3.7 Important notes

(1) Note on the alternation of nouns within complex relation constructions

In some cases, the constructions that mark complex referent relations include an alternation of nouns sharing the same referent. Nouns that are interpreted as related in reference will be annotated with a common indicator, one linking to the other: the Referent ID of each noun is added to the other, joined by the symbol ‘#’.

Below is an example with ‘óleo’, ‘oil’, ‘bálsamo’, ‘balm’, and ‘licor’, ‘liquor’. The noun ‘bálsamo’ appears in the first NP, (NP bálsamo muito salutífero e proveitoso em extremo…), and the NPs that follow, in the structure (NP (D..) (N ..)), contain the names ‘óleo’ and then ‘licor’ – but are interpreted as referring to the first NP. The annotation of the three NPs will mark, primarily, the number of the first noun involved in the reference, and secondarily, the numbers of the nouns that alternate with it, linked by the symbol # (thus forming the Referent ID 1439#1245#1463):

Um certo gênero de árvores há também pelo mato dentro na capitania de Paranambuco a que chamam Copahíbas de que se tira 
bálsamo muito salutífero e proveitoso em extremo... 
Este óleo não se acha todo ano perfeitamente nestas árvores ...
E quando querem tirá-lo, dão certos golpes ou furos no tronco delas , pelos quais pouco a pouco estão estilando do âmago 
este licor precioso...

'A certain genre of trees are also in the bush of the capitany of Paranambuco which is called Copahíbas, of which it is extracted 
very wholesome and extremely useful balm ...'
'This oil is not found all year round perfectly on these trees....' 
'And when they want to take it, they give certain blows or holes into their branches, of the heart of which they little by little extract 
this precious liquor...'

where 
bálsamo, N=1439
óleo,    N=1245
licor,   N=1463

>

(NP-SBJ-6=1439#1245#1463 bálsamo muito salutífero e proveitoso em extremo...)
(NP-SBJ-1=1439#1245#1463 Este óleo)
(NP-ACC=1439#1245#1463   este licor precioso)

There are very few cases of this kind of alternation in the text; however, their annotation is complex, and so this and other examples will be discussed in further detail in 3.1.2 below.
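
Because every noun involved is written into the complex Referent ID, the whole chain can be recovered by searching for any one of the alternating nouns. The Python sketch below illustrates this under simple assumptions: `referent_parts` and `mentions_of` are hypothetical names, labels are taken to have the shape LABEL=REFERENT[=OCCURRENCE], and ‘&’ relation codes are not handled.

```python
def referent_parts(label: str):
    """Return the noun IDs joined by '#' in a label's Referent ID field."""
    fields = label.split("=")
    if len(fields) < 2:
        return []  # unannotated node
    return fields[1].split("#")


def mentions_of(labels, noun_id: str):
    """Select the labels whose Referent ID involves the given noun ID."""
    return [label for label in labels if noun_id in referent_parts(label)]


labels = [
    "NP-SBJ-6=1439#1245#1463",  # bálsamo ...
    "NP-SBJ-1=1439#1245#1463",  # Este óleo
    "NP-ACC=1439#1245#1463",    # este licor precioso
    "NP=0005=002",              # unrelated: 'que' (frota)
]
oil_chain = mentions_of(labels, "1245")  # search by 'óleo'
```

Matching on the split parts, rather than on a substring of the label, avoids accidental hits where a noun ID happens to occur inside another number.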

(2) Note on constructions with determiners that do not form a new referent

As mentioned above, not all structures dealt with in this section will always represent ‘new referents’. We shall reinforce this here with three examples – with regular (NP (D …) (N..)) constructions; with (NP (D…)) constructions; and with (NP (OUTRO …)) constructions.

Structures in the form (NP (D …) (N..)) that make reference to previously mentioned names (i.e., that do not ‘form’ new referents) have already been shown further above, but are repeated here just as a reminder (in the structures (NP (D …)(N …)) and (NP (D …)), which we called ‘elisions’ further above):

(41) E depois de haver bonança, junta outra vez a frota
'And after there was good weather, gathered again the fleet'

'a frota', 8th mention of N 'frota', =0005:
(NP      (D-F a) (N=0029 frota)) ->
(NP=0005 (D-F a) (N=0029 frota))
 

(42)
Os Açores são como os de cá (...) 
Estes são muito ligeiros e de maravilha lhe escapa ave (...)
'The Sea-hawks are like those from here (...)
Those are very swift and seldom does a bird escape them (...)'

where: 
Açores:    (NP=2045     (NPR=2045 Açores)) 
Os Açores: (NP-SBJ=2045 (NPR=2067 Açores))  
Estes:     (NP-SBJ=2045 (D-P Estes)) 

Below is a new example, in a sentence with one (NP (D-UM …)) and one (NP (OUTRO …)), both clearly making simple reference to a previously mentioned element – each is thus marked simply with that element’s Referent ID:

(43) 'uns', NX=5715=007 and
'outros', NP-SBJ=5716=003 

in

Há também uns pequeninos pela costa de duas castas pouco maiores que doninhas, a que comumente chamam  Sagüis , convém a saber , 
há uns louros, e outros pardos. Os louros têm um cabelo muito fino, e na  semelhança do vulto e feição do corpo quase se querem parecer com leão : são muito formosos , e não os há senão no Rio de Janeiro . Os pardos se acham daí para o Norte em todas as mais capitanias. Também são muito aprazíveis : mas não tão alegres à vista como estes . 
E assim uns como outros, são tão mimosos e delicados de sua natureza, que como os tiram da pátria e os embarcam para este Reino, tanto que chegam a outros ares mais frios quase todos morrem no mar, e não escapa senão algum de grande maravilha . 

 E assim uns como outros
 'And so ones as others'

where

 a) uns louros, =5715
    and 'uns' above is interpreted as the seventh mention of 'uns louros'. 
 
 b) outros pardos, =5716
    and 'outros' above is interpreted as the third mention of 'outros pardos' 
>

(NX=5715=007 uns) 
(NP-SBJ=5716=003 outros)

( (IP-MAT (CONJ E)
	  (ADVP (ADV assim))
	  (NP-SBJ=6098 (NX=5715=007 (D-UM-P uns))
				   (CP-CMP (C como)
					   (IP-SUB (NP-SBJ=5716=003 (OUTRO-P outros)))))
	  (, ,)
	  (SR-P são)
	  (ADJP (ADV tão)
		(ADJ-P mimosos)
		(CONJP (CONJ e)
		       (ADJX (ADJ-P delicados)
			     (PP (P de)
				 (NP (PRO$ sua) (N natureza)))))
		(, ,)
		(CP-DEG (C que) ...))))


3. Remaining issues
and first results


3.1 Issues

Below are some of the most serious issues to be covered in the development of the annotation.


3.1.1 The marking of  ‘referents in relation’

At present, it is not clear whether the additional annotation for previous reference is adequate for different research purposes. 

The aim of marking those referential relations is strictly connected to the research proposal in the project, as mentioned above. It is our hypothesis that the fact that a referent establishes a relation of this kind with a previously mentioned constituent is an important factor governing its positioning in the clause.

However, each of the constructions listed in 2.3.3 above will of course indicate a different kind of relation: referents that are a class or group within previously mentioned referents; referents that are an amount of a previously mentioned referent; referents that are possessed by previously mentioned referents; referents that refer to a combination of previously mentioned referents or events. In a previous version of this markup, each of these relations was indicated by a special symbol (with the categories ‘Class-of’, ‘Quality-of’, ‘Possessed-by’ etc.). In the present version, this was abolished, as it turned out to be difficult to apply, technically clumsy, and (most of all) unnecessary. In fact, by combining the simple annotation of an “&” relation with the classes of constituents above, a good systematic observation of those constructions can already be made, with no need for special symbols in the markup. Some examples of query conditions that may single out the different types of relation coded above with & are the following:

To find referents in relation in "OUTRO" phrases:

// query: (NP*=*&* exists)
   AND (NP*=*&* iDominates OUTRO-*)
//

To find referents in relation in phrases with numerals:

// query: (NP*=*&* exists)
   AND (NP*=*&* iDominates NUM)
//
etc.

In a different perspective, we can target, with the query, all the phrases that relate to a certain referent (this would, in fact, pretty much amount to mapping the complete ‘net’ of references to a ‘topic’ in a text):

To find all elements related to the referent 'animals', 1473:

// query: (NP*=1473 exists)
   OR (NP*=*&*1473 exists)
//
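
Outside the query tool, the same selection can be approximated over a list of node labels with a regular expression. This is only a sketch of the idea behind the query above, not a replacement for it: `related_pattern` and `find_related` are hypothetical names, and the pattern assumes four-digit Referent IDs and the LABEL=REFERENT[&PREVIOUS][=OCCURRENCE] shape used in these guidelines.

```python
import re


def related_pattern(referent: str):
    """Match labels that mention the referent directly or after an '&'."""
    # "=1473" or "=NNNN&1473", followed by "=" (occurrence) or end of label
    return re.compile(rf"=(?:\d{{4}}&)?{re.escape(referent)}(?:=|$)")


def find_related(labels, referent: str):
    """Return the labels in the 'net' of references to the given referent."""
    pattern = related_pattern(referent)
    return [label for label in labels if pattern.search(label)]


labels = [
    "NP=1473=001",            # 'animais' itself
    "NP-ACC=5810&1473=001",   # Outros animais ... Cotias
    "NP-ACC=5833&1473=001",   # Outros animais ... muito feros
    "NP-SBJ=5058&0079",       # outros (Índios): a different referent
    "N-P=1734/10/1473",       # noun-level code, not an NP referent label
]
animal_net = find_related(labels, "1473")
```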

In short, we feel that the simpler marking contributes to the speed of markup and does not hinder linguistic research, provided the researcher is able to use Corpus Search to make findings more sophisticated.

Apart from that, for the specific aims of this project, the annotation is already enough to allow for interesting linguistic findings. Not only that, but also some initial tests with searches seem to show that this annotation, as it is, is pretty much essential for the analysis that we propose in the project; see 3.2.1 below for those specific results involving referents in relation.


3.1.2 The marking of alternating nouns

As mentioned briefly above, this annotation includes a very rudimentary marking of nouns that appear in alternation in a chain related to the same referent. There are two main cases: a chain of noun phrases referring to the same entity, but using different nouns at some points; and a chain of phrases in which at some points plural and singular forms of the same noun are used.

The problem in those cases is how to preserve the indication that there is a consistent chain of reference, when this chain is (apparently) ‘interrupted’ by another noun, or another form of the same noun (singular/plural); and, at the same time, not erase the information that there are different nouns involved in the chain. As mentioned briefly above, the solution was to add the numbers of all involved nouns as additional information (with ‘#’) into the relevant Referent IDs. This is not a very elegant solution, and must certainly be revised in future versions of this system. For the moment, at least those cases are easily recoverable.

It is important to observe that this does not consist of an annotation of ‘synonyms’ in a general, abstract  sense; it is only applied to noun phrases where there is an explicit relation marker (i.e., the same types of phrases included in 2.3 above, with determiners, etc.), forming a ‘chain’ in which the noun involved is not the same in all cases.

3.1.2.1 Alternating independent nouns (‘synonyms’)

An example with alternation of different nouns has been mentioned above for ‘óleo’ (‘oil’), ‘bálsamo’ (‘balm’) and ‘licor’ (‘liquor’); here is a large portion of text that shows how those three nouns alternate while referring to the same entity. Because the relevant portion of text is so large, the tree is not given here; below the snippet is the simplified chain of annotation of all NPs interpreted as referencing ‘bálsamo muito salutífero e proveitoso em extremo para enfermidades de muitas maneiras’, from the first to the 14th reference. As can be observed, all are marked, primarily, as 1439 (corresponding to ‘bálsamo’), and secondarily, as 1245 (for ‘óleo’) and 1463 (for ‘licor’), in order to link the IDs of the other two nouns into the chain, forming the complex ID 1439#1245#1463:

Um certo gênero de árvores há também pelo mato dentro na capitania de Paranambuco a que chamam Copahíbas de que se tira 
bálsamo muito salutífero e proveitoso em extremo para enfermidades de muitas maneiras, principalmente nas que procedem de frialdade causa grandes efeitos e tira todas as dores por graves que sejam em muito breve espaço . Para feridas ou quaisquer outras chagas, tem a mesma virtude: as quais tanto que com ele lhe acodem, saram muito depressa , e tira os sinais de maneira, que de maravilha se enxerga onde estiveram, e nisto faz vantagem a todas as outras medicinas. 
Este óleo não se acha todo ano perfeitamente nestas árvores , nem procuram ir buscá-lo , senão no estio, que é o tempo em que assinaladamente o criam . E quando querem tirá-lo , dão certos golpes ou furos no tronco delas , 
pelos quais pouco a pouco estão estilando do âmago este licor precioso . Porém não se acha em todas estas árvores , senão em algumas a que por este respeito dão nome de fêmeas : e as outras que carecem dele chamam machos, e nisto somente se conhece a diferença destes dois gêneros : que na proporção e semelhança não diferem nada umas das outras. As mais delas se acham roçadas dos animais que por instinto natural quando se sentem feridos, ou mordidos de alguma fera, as vão buscar para remédio de suas enfermidades. 
Outras árvores diferentes destas , há na capitania dos Ilhéus , e na do Espírito Santo a que chamam Caborahíbas, 
de que também se tira outro bálsamo : o qual sai da casca da mesma árvore , e cheira suavíssimamente....
... de que se tira bálsamo muito salutífero e proveitoso em extremo... 
Este óleo não se acha todo ano perfeitamente nestas árvores ...
...pelos quais pouco a pouco estão estilando do âmago este licor precioso...

'of which it is extracted very wholesome and extremely useful balm ...'
'This oil is not found all year round perfectly on these trees....' 
'of which they little by little extract this precious liquor...'
where  
 bálsamo, N=1439 
 óleo,    N=1245 
 licor,   N=1463 

 
Um certo gênero de árvores há também pelo mato dentro na capitania de Paranambuco a que chamam Copahíbas de que se tira 
(NP-SBJ-6=1439#1245#1463=001 bálsamo muito salutífero e proveitoso em extremo para enfermidades de muitas maneiras), principalmente nas que procedem de frialdade 
(NP-SBJ=1439#1245#1463=002 *pro*) causa grandes efeitos e 
(NP-SBJ=1439#1245#1463=003 *pro*) tira todas as dores por graves que sejam. Para feridas ou quaisquer outras chagas, 
(NP-SBJ=1439#1245#1463=004 *pro*) tem a mesma virtude: as quais tanto que com 
(NP=1439#1245#1463=005 ele) lhe acodem, saram muito depressa, e 
(NP-SBJ=1439#1245#1463=006 *pro*) tira os sinais de maneira, que de maravilha se enxerga onde estiveram, e nisto 
(NP-SBJ=1439#1245#1463=007 *pro*) faz vantagem a todas as outras medicinas. 
(NP-SBJ-1=1439#1245#1463=008 Este óleo) não se acha todo ano perfeitamente nestas árvores, nem procuram ir buscá 
(NP-ACC=1439#1245#1463=009 -lo), senão no estio, que é o tempo em que assinaladamente 
(NP-ACC=1439#1245#1463=010 o) criam. E quando querem tirá 
(NP-ACC=1439#1245#1463=011 -lo) , dão certos golpes ou furos no tronco delas, pelos quais pouco a pouco estão estilando do âmago 
(NP-ACC=1439#1245#1463=012 este licor precioso). Porém não 
(NP-SBJ-2=1439#1245#1463=013 *pro*) 
(NP-SE-2=1439#1245#1463=013 se) acha em todas estas árvores, senão em algumas a que por este respeito dão nome de fêmeas: e as outras que carecem d@ 
(NP=1439#1245#1463=014 @ele) chamam machos, e nisto somente se conhece a diferença destes dois gêneros : que na proporção e semelhança não diferem nada umas das outras. As mais delas se acham roçadas dos animais que por instinto natural quando se sentem feridos, ou mordidos de alguma fera, as vão buscar para remédio de suas enfermidades.

What this basically means is that the chain of reference is not ‘interrupted’ by the appearance of ‘óleo’ and ‘licor’; but at the same time, the fact that those three nouns alternate to make reference to ‘bálsamo muito salutífero e proveitoso em extremo para enfermidades de muitas maneiras’ is acknowledged, and may be recovered in later searches. Notice (incidentally) that the next appearance of the noun ‘bálsamo’ in this part of the text will be in the NP ‘outro bálsamo’, which will receive its own Referent ID, related to, but independent from, the first NP containing ‘bálsamo’:

Outras árvores diferentes destas, há na capitania dos Ilhéus, 
e na do Espírito Santo, a que chamam Caborahíbas, de que também se tira
(NP-SBJ-6=5267&1439=001 outro bálsamo) : o qual sai da casca da mesma árvore, 
e cheira suavíssimamente. 
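
To make the reading of such composite IDs concrete, here is a minimal Python sketch that splits an ID field into its parts. The function name split_referent_id and the dictionary keys are hypothetical; the sketch also assumes, as in the examples above, that any ‘&’ relation comes before any ‘#’ alternation within the ID.

```python
def split_referent_id(id_field):
    """Split an ID field such as '1439#1245#1463' or '5267&1439' into its parts.

    Assumption: '&' (relation to a previous referent) precedes '#'
    (alternating nouns), as in the examples of these guidelines.
    """
    head, *alternates = id_field.split("#")
    core, *related = head.split("&")
    return {"core": core, "related": related, "alternates": alternates}

# The chain for 'bálsamo' / 'óleo' / 'licor':
print(split_referent_id("1439#1245#1463"))
# The separate referent 'outro bálsamo', related to noun 1439:
print(split_referent_id("5267&1439"))
```

With the parts separated this way, a search can follow the whole chain through the core ID while still recovering which alternating nouns are involved.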

3.1.2.2 Alternating singular and plural forms

A similar problem occurs with the alternation of singular and plural versions of nouns. At some points in the text, it is clear that the plural and the singular forms of a word simply alternate in reference to the same entity, independently even of agreement marks in the NPs and verbs. However, plural and singular forms obviously receive different numbers in the basic annotation.

There are not many cases; however, some of these recur very often, like ‘rio’/‘rios’ (‘river’, ‘rivers’) and ‘peixe’/‘peixes’ (‘fish’, ‘fishes’). It therefore seemed desirable to have a way to ‘link’ singular and plural forms of related referents when they appear in alternation, at least to check the annotation later. For the moment this is being done by marking a # relation, just as in the cases described above.

In the example below, the plural noun ‘peixes’ (‘fishes’) has the number 2379; the singular noun ‘peixe’ (‘fish’) has the number 2389. The NP ‘Outros peixes’ (‘Other fishes’), opening the paragraph (and the topic ‘the fish called Camboropins’), is numbered as 5358&2379. The NP ‘Este’, following a few sentences later, contains a singular determiner; therefore, the immediate hypothesis is that this determiner is in agreement with ‘peixe’, singular, not plural. However, the expression is clearly related to what was mentioned before as ‘Outros peixes’, in the plural. This ‘duplicity’ is indicated with the sub-mark …2379#2389, i.e., with the number for the singular noun ‘added’ to the referent ID of the chain, which includes the singular NP ‘Este’ (understood as ‘Este peixe’). The wider context is given above the tree to justify the interpretation that ‘Este’ actually refers to what was mentioned before as ‘Outros peixes’:

(44)

Outros peixes há , a que chamam Camboropíns, 
que são quase tamanhos como Atuns. 
Estes tem umas escamas muito duras, e maiores que os outros peixes: também se matam com arpões, 
e quando querem pescá-los , põem-se em alguma ponta ou pedra, ou em outro qualquer posto acomodado a esta pescaria. E o que é bom pescador (para que não faça tiro em vão) 
quando os vê vir deixa-os primeiro passar , e espera até que _ fiquem a jeito 
que possa arpoá-los por detrás de maneira, que o arpão entre no peixe sem as escamas o impedirem, porque são (como digo) tão duras que se acerta de dar nelas de maravilha as pode penetrar. 
Este é um dos melhores peixes que há nestas partes, porque além de ser muito gostoso , é também muito sadio , e mais enxuto de sua propriedade que outro algum que na terra se coma.

Este é um dos melhores peixes que há nestas partes
'This is one of the best fishes that there is in these parts'

where

(NP=5358&2379#2389=001 Outros peixes) há, a que chamam Camboropíns,
(WNP-1=5358&2379#2389=002 que) são quase tamanhos como Atuns.
(NP=5358&2379#2389=003 Estes) tem umas escamas muito duras, e maiores que os outros peixes: também 
(NP-SE-2=5358&2379#2389=004 se)
(NP-SBJ-2=5358&2379#2389=004 *pro*) matam com arpões, e quando querem pescá-l@ 
(NP=5358&2379#2389=005 os) põem-se em alguma ponta ou pedra, ou em outro qualquer posto acomodado a esta pescaria. E o que é bom pescador (para que não faça tiro em vão) quando
(NP=5358&2379#2389=006 os) vê vir deixa- 
(NP=5358&2379#2389=007 os) primeiro passar, e espera até que 
(NP=5358&2379#2389=008 *pro*) fiquem a jeito que possa arpoá-l@ 
(NP=5358&2379#2389=009 os) por detrás de maneira, que o arpão entre no peixe sem as escamas o impedirem, porque são (como digo) tão duras que se acerta de dar nelas de maravilha as pode penetrar. 
(NP-SBJ=5358&2379#2389=010 Este) é um dos melhores peixes que há nestas partes, porque além de 
(NP-SBJ=5358&2379#2389=011 *pro*) ser muito gostoso, 
(NP-SBJ=5358&2379#2389=012 *pro*) é também muito sadio , e mais enxuto de sua propriedade que outro algum que na terra se coma.

 

As with the annotation for alternating words, this markup is a little clumsy. In this particular case an option would be to mark singular and plurals of each noun with the same ID in the base annotation. This and other options will be considered in a final format of this system.


3.1.3 The marking of NP-SE

The clitic pronoun ‘se’ presents quite a complex syntax in Portuguese, and this complexity is connected, precisely, to its argumental and referential nature; this poses a challenge to the marking of noun phrases projected by the clitic ‘se’ (i.e., ‘NP-SE’ phrases) in this annotation system.

Some cases are not problematic – namely, where ‘se’ is a reflexive (i.e., simply the third-person reflexive clitic); in such instances, the NP projected by ‘se’ will simply bear the same Referent ID as the NP to which it refers reflexively, followed by its own number of occurrence:

(45)

- Reflexive NP-SE

E acima desta cachoeira se mete o mesmo rio debaixo da terra
'and above this waterfall the same river puts itself under the earth' 

(NP-SE=5042&0362=006 (CL se))
(NP-SBJ=5042&0362=007 (D o) (ADJ mesmo) (N=0362_rio rio))


( (IP-MAT (CONJ E)
	  (ADVP (ADV acima)
		(PP (P d@)
		    (NP=0443=002 (D-F @esta) (N=0447/1/0443_cachoeira cachoeira))))
	  (NP-SE=5042&0362=006 (CL se))
	  (VB-P mete)
	  (NP-SBJ=5042&0362=007 (D o) (ADJ mesmo) (N=0448/6/0362_rio rio))
	  (ADVP (ADV debaixo)
		(PP (P d@)
		    (NP=024=0046 (D-F @a) (N=0449/15/0046_terra terra)))))
  (ID G_008,9.57))

The challenge pertains to cases of NP-SEs projected by passive and indeterminate ‘se’. In the present system, the Referent ID annotation of SE for those cases follows the guidelines of the syntactic annotation very strictly (since one of the aims of that previous annotation is, precisely, to separate argumental, referential ‘se’ (reflexive or passive) from non-argumental, non-referential ‘se’).

In the syntactic annotation, both passive and indeterminate NP-SEs are marked as co-indexed with the NP-SBJ of the clauses they appear in. However, in potentially passive NP-SE, this subject is referential (lexical or null), and in potentially indeterminate NP-SE, this subject is non-referential or expletive (see the Syntactic Annotation Manual’s notes for Se-Constructions here).  

In the present annotation, NP-SEs co-indexed with a referential NP-SBJ in the syntactic annotation are marked with the same Referent ID and the same number of occurrence as that NP-SBJ, and NP-SEs coindexed with expletive NP-SBJs in the syntactic annotation are not marked for referent ID:

– NP-SE coindexed with a referential NP-SBJ:

(46) 
(NP-SE-4=0005 (CL -se)) , in 
fez-se uma frota para a Índia
'a fleet was made to Índia':
 
(VB-D fez-)
(NP-SE-4=0005=001 (CL -se))
(NP-SBJ-4=0005=002 (D-UM-F uma)
		   (N=0005_frota frota)
		   (PP (P para)
		       (NP (D-F a) (NPR Índia)))

– NP-SE coindexed with an expletive NP-SBJ
(not marked for referent):

(47)
 até onde se pode navegar sem nenhum impedimento
'up to where one may navigate with no impediment':

(PP (P até)
    (ADVP (CP-FRL (WADVP-2 (WADV onde))
    (IP-SUB (ADVP *T*-2)
	    (NP-SBJ-3 *exp*)
	    (NP-SE-3 (CL se))
            (VB-P pode)	
            (IP-INF (VB navegar)
                    (PP (P por)
                        (PP (P entre)
                            (NP (D-F-P as) (N-P ilhas))))
                    (PP (P sem)
                        (NP (Q-NEG nenhum) (N impedimento))))))))

This option is not without drawbacks, however. First, it is not obvious that the ‘passive’ SE and its coindexed subject should really get the same number of occurrence. Second, it is not obvious that SE coindexed with an expletive subject in the syntactic annotation does not, in fact, have a referential nature (in my interpretation, it does: it reads as an indefinite subject). Such doubts will be dealt with in the final version of this annotation. For the moment, the only concern is that any markings with SE are consistent in each of the three cases above, allowing for simple re-annotation when a better solution is found.


3.2 First results

Some searches were devised and conducted with Corpus Search over the test file annotated according to the guidelines above, aiming chiefly at testing if this annotation might actually allow for the investigation of the main hypothesis in the Project Syntax and information structure in the first 16th century Portuguese narrative about Brazil. In broad terms, the hypothesis is that in Classical Portuguese, the first syntactic position of the clause is reserved for discursively salient constituents, and this salience may be explained in terms of how often, if at all, a certain constituent has been mentioned in the previous context in a text (as exposed here). Accordingly, the test searches targeted mostly the referential patterns of fronted arguments (subjects and objects).

The devised queries used the main labels for Referent ID (=0000), the sub-labels for position of reference in a chain of occurrences (=000) and the additional sub-labels for possible relations to previously mentioned referents (…&0000). The complete test searches are shown in 5 further below as examples, and the results are commented below briefly.

Given the small size of the test file, the results so far may not be significant from the linguistic point of view. However, they show that the proposed annotation is a step in the right direction for finding patterns of relations between the referential nature of arguments and their order in the sentence in larger portions of text.


3.2.1 Queries focused on constructions

3.2.1.1 Using the “number of occurrence” label (=000=)

The sub-labeling =001=, =002= etc. already allows for simple targeting in searches with relevant results. A simple condition targeting the number of occurrence can be used in any query for any construction involving NPs – subjects in a certain order, objects in a certain order, etc. In the tests, we particularly targeted first occurrences versus “not-first” occurrences (complete corresponding query scripts are in 5 below, numbers 1 to 3):

Sample conditions for queries using number of occurrence label :
 
 - To target subjects marked as first mentions: 
   (IP-MAT* iDominates NP-SBJ*=001)
 
 - To target subjects marked as “non-first” mentions: 
   (IP-MAT* iDominates !NP-SBJ*=001)

By combining the conditions above with queries for particularly positioned subjects (pre-verbal, post-verbal) and for particular forms of subjects (nominal, pronominal, null), relevant patterns may be revealed. In the tests, this already made it possible to find two patterns that are relevant for the purposes of this project. First and foremost, the tests show a difference between pre-verbal and post-verbal subjects: 66% of pre-verbal subjects in the test file corresponded to referents mentioned for the first time, whereas only 20% of post-verbal subjects were first mentions. We return to these results below when discussing queries for additional annotation.

By modifying the conditions above to target objects (simply changing NP-SBJ* to NP-ACC*), and then comparing the results to the patterns found for subjects, further relevant patterns can be found. In the tests, we found that 90% of pre-verbal objects correspond to referents mentioned for the first time (whereas the same was true for 66% of the pre-verbal subjects, as mentioned).

The numbers are very small, but we think they point to interesting searches to follow. The important point at this stage is that the annotation seems to make such searches possible.

3.2.1.2 Using the additional annotation (&)

The additional sub-labeling for referents in relation allows searches such as the above to be refined in promising directions. The most important point is that, thanks to it, we can separate ‘absolute’ first mentions from first mentions of referents that have some kind of relation to a previous referent, using the following conditions (complete corresponding query scripts are in 5 below, numbers 4 to 6):

Sample conditions for queries using the additional annotation:

 - To target subjects marked with additional annotation: 
   (IP-MAT* iDominates NP-SBJ*=*&*)

 - To target subjects not marked with additional annotation:
   (IP-MAT* iDominates !NP-SBJ*=*&*)

This of course yields an output with all first mention subjects that are entirely new mentions.

In one test, 20% of first mentions resulted from this search for pre-verbal subjects; for post-verbal subjects, however, there were zero occurrences (i.e., although both pre-verbal and post-verbal subjects may correspond to referents mentioned for the first time, so far in our tests no post-verbal subject corresponded to an “absolute” first-time mention).

Again the numbers are slight; but the tests seem to indicate that, for first-position constituents, the fact of being or not related to previously mentioned referents is quite relevant, and should be further investigated.

They also show that the additional annotation may allow these patterns to be observed. As said before, this shows that the additional annotation for referents in relation is certainly worthwhile for the purposes of this project.
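
The three-way distinction used in this subsection (non-first mention, first mention related to a previous referent, ‘absolute’ first mention) can be sketched as follows. This is only a minimal Python illustration: the function name classify_subject_mentions and the category names are hypothetical, and the sketch reads the labels alone, without the IP-MAT dominance condition or the pre-verbal/post-verbal distinction that the Corpus Search queries apply.

```python
import re
from collections import Counter

# Subject labels, assumed shape: (NP-SBJ[-index]=ID=occurrence
SUBJECT = re.compile(r"\(NP-SBJ[^\s=()]*=([0-9&#]+)=(\d{3})")

def classify_subject_mentions(text):
    """Count subject NPs by kind of mention, reading only their labels."""
    counts = Counter()
    for id_field, occurrence in SUBJECT.findall(text):
        if occurrence != "001":
            counts["non-first"] += 1
        elif "&" in id_field:
            # First mention, but related to a previously mentioned referent.
            counts["first, related"] += 1
        else:
            counts["first, absolute"] += 1
    return counts

sample = """
(NP-SBJ-6=1439#1245#1463=001 bálsamo)
(NP-SBJ=1439#1245#1463=002 *pro*)
(NP-SBJ-6=5267&1439=001 outro bálsamo)
"""
print(classify_subject_mentions(sample))
```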


3.2.2 Queries focused on referents

The queries shown above focus on constructions, ‘filtered’ by the nature of their referent ID (‘subjects corresponding to referents mentioned for the first time’, ‘objects corresponding to referents mentioned for the first time’, etc.), and result in lists of constructions that recur in the text. A second way to use the referent ID annotation would follow the opposite logic, focusing on referents, ‘filtered’ by the nature of the construction they appear in: ‘referent X as it is projected as a subject’, ‘referent X as it is projected as an object’, ‘referent X and all the positions it may occur in’, etc.

Simple conditions like the following may allow for this search (substituting the targeted number for IdREf); obviously, they may be combined (with OR) to return both subject and object NPs marked with the specified ID (a complete corresponding query script is in 5 below, number 7):

Condition for queries targeting specific referents
in specific projections:

- To target subjects marked with a specific ID:   
  (NP-SBJ*=*IdREf* exists)

- To target objects marked with a specific ID: 
  (NP-ACC*=*IdREf* exists)

Queries such as this are relevant because, in order to verify the main hypothesis of this project, we must be able to compare a sequence of first-position constituents and check whether their referents alternate (the reasons for this are exposed here).

For instance, in the sentences below, the referent =5481&1473 is shared by six constituents in a sequence of four IP-MATs, and it is important for the research to be able to observe how the chain of referents in this sequence is organized as regards the first-position ones. The example shows this highlighting the fronted arguments of the IP-MATs:

(47) 
Sequence of related first-position arguments in 4 IP-MATs:
Outros animais / *pro* / Estas Cotias / *pro* 

Outros animais há a que chamam Cotias, que são do tamanho de lebres: 
e quase tem a mesma semelhança, e sabor. 
Estas Cotias são ruivas, 
e tem as orelhas pequenas, e o rabo tão curto que quase não se enxerga. 
'Other animals there are called Cotias, that are the size of hares: 
and almost have the same look and taste. 
Those Cotias are red-haired, 
and have small ears, and the tail so short that it's almost not seen.'


 (IP-MAT   (NP-ACC=5481&1473=001 Outros animais) há a (WNP=0059=002 que) chamam Cotias (WNP=5481&1473=003 que) são do tamanho de lebres: (ID G_008,21.292))
 (IP-MAT e (NP-SBJ=5481&1473=004 *pro*) quase tem a mesma semelhança e sabor. (ID G_008,21.293))
 (IP-MAT   (NP-SBJ=5481&1473=005 Estas Cotias) são ruivas, (ID G_008,21.294))
 (IP-MAT e (NP-SBJ=5481&1473=006 *pro*) tem as orelhas pequenas, e o rabo tão curto que quase não se enxerga.  (ID G_008,21.295))

Some test queries were made to probe into the systematic observation of sequences such as ‘Outros animais / *pro* / Estas Cotias / *pro*’ above. For instance, we could build a query to show all constituents marked with the core ID 5481, in sequence. But a more interesting query would list only ‘high’ argumental NPs (matrix subjects and objects) identified as 5481 (thus not listing the two WNPs above, which are not the focus); the query condition would be as follows (complete query in 5, item 9):

// Query snippet

  query: (NP-SBJ*=*5481* exists)
  OR     (NP-ACC*=*5481* exists)

//

For this referent, using the query shown above (for subject ‘or’ object positions), the output would look like this (this is the full result, as this referent for ‘Cotias’ is not repeated often in the text):

// Output snippet:

/~*
Outros animais há a que chamam Cotias, que são d@ @o tamanho de lebres:
(G_008,21.292)
*~/
/*
4 NP-ACC=5481&1473=001:  
4 NP-ACC=5481&1473=001
*/
( (4 NP-ACC=5481&1473=001 (5 OUTRO-P Outros)
			  (7 N-P=1667/7/1473_animais animais)
			  (9 CP-REL *ICH*-1)
			  (11 CP-REL *ICH*-4))
  (60 ID G_008,21.292)) 

/~*
e quase tem a mesma semelhança, e sabor.
(G_008,21.293)
*~/
/*
4 NP-SBJ=5481&1473=004:  
4 NP-SBJ=5481&1473=004
*/

( (4 NP-SBJ=5481&1473=004 *pro*)(29 ID G_008,21.293) )

/~*
Estas Cotias são ruivas,
(G_008,21.294)
*~/
/*
2 NP-SBJ=1668$5481=005&1473:  2 NP-SBJ=1668$5481=005&1473
*/
( (2 NP-SBJ=1668$5481=005&1473 (3 D-F-P Estas) (5 N-P=1672/1/1668 Cotias))
  (14 ID G_008,21.294)) 


/~*
e tem as orelhas pequenas, e o rabo tão curto que quase se não enxerga.
(G_008,21.295)
*~/
/*
4 NP-SBJ=5481&1473=006:  4 NP-SBJ=5481&1473=006
*/
( (4 NP-SBJ=5481&1473=006 *pro*)(52 ID G_008,21.295) )

//

In conclusion, we believe the tests have shown that making reference chains explicit with the present annotation allows the main aspects of the hypothesis to be systematically investigated. 


4. Technical aspects

As mentioned above, this annotation is performed manually, in the raw code of a syntactically annotated file, using a trusted text-processor such as Emacs.

For some of the stages a few ‘semi-automatic’ techniques were devised, as aids to the annotation (with regular expressions and Corpus Search Queries). In the future, ideally, a good part of those steps could be fully automated in a user-friendly interface.

In fact, in the whole process described below, the only part that needs to be performed exclusively with human intervention is the identification of referents mentioned by phrases that do not contain nouns; all else could be scripted and machine-performed, as tackled in more detail in 4.3 further below.

Meanwhile, 4.1 right below details the current technique for marking nouns, and 4.2 after that describes the current technique for marking noun phrases; both include devices for checking and correcting different stages of the markup.


4.1 Marking nouns


4.1.1 Preliminaries

As mentioned further above, each and every noun in the file receives a number, including repeated instances of the ‘same’ noun. It is to be understood that this number is the number of each instance of a N, N-P, NPR or NPR-P tag, not the actual referential ID yet. Some of the numbers given to the nouns will be part of referential IDs later, some not. 

This stage can be performed semi-automatically, as described here. First of all, however, to make sure everything adds up in the end, we run a search in the non-annotated file before starting, to see how many nouns there are:

// Query: #lex_n_npr.q

 begin_remark: 
  Make a lexicon of all N, N-P, NPR, NPR-P
 end_remark

 define: port.def
 print_indices: t
 node: IP*
 make_lexicon: t
 pos_labels: N|N-P|NPR|NPR-P
//
// Sample output of query: #lex_n_npr.q 
  (snippet)
 /*
 SUMMARY: 
 source files, hits/tokens/total
 parsed.psd 4164/832/909
 whole search, hits/tokens/total
  4164/832/909
 */
//


4.1.2 Marking

The idea then is to mark all the nouns in the file with the corresponding number, up to 4164 in this example; below are the steps followed in the test file to achieve this.

4.1.2.1 Add ‘=’

The first procedure is to transform all N, N-P, NPR, NPR-P tags in the file into N=, N-P=, NPR=, NPR-P=, as all annotations will be appended to the = sign. This can be done with a simple query-replace in Emacs. The search string must include the space that follows the tag, otherwise (N would also match the beginning of (NP, (N-P and (NPR:

 (N<space> → (N=<space>
 (N-P<space> → (N-P=<space>
 (NPR<space> → (NPR=<space>
 (NPR-P<space> → (NPR-P=<space>

4.1.2.2 Number ‘=’

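The next step is to number all the nouns, or rather, now, all the instances of ‘=’. This can be done with the following regular expression in Emacs (in a replacement string, \# stands for the number of replacements already made, starting from 0, so 1 is added to make the numbering start at 1):

RegEx
 to number all “=”
 \(=\) → =\,(1+ \#)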
Sample results:
(N= frota)     → 
(N=1 frota)
(NPR= Capitão) → 
(NPR=2 Capitão)

Next, pad all numbers to four digits (=0000). This is done to make lists and other output files more organized (i.e., so that all number IDs occupy the same space, the same column width). This can be done with the following sequence of RegExs in Emacs:

RegEx
 to substitute number of digits
 target: =111
 \(=\)\([0-9]\)\([0-9]\)\([0-9]\)\([^0-9]\) → \10\2\3\4\5

RegEx
 to substitute number of digits
 target: =11
 \(=\)\([0-9]\)\([0-9]\)\([^0-9]\) → \100\2\3\4

RegEx
 to substitute number of digits
 target: =1
 \(=\)\([0-9]\)\([^0-9]\) → \1000\2\3

Note: the final \([^0-9]\) group makes each expression match only numbers of exactly the targeted length; without it, the shorter patterns would also match the beginning of longer numbers. Run the three expressions in this order.

Sample Results:
=111 →  =0111
=11  →  =0011
=1   →  =0001
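
For comparison, the numbering and the padding can be collapsed into a single pass outside Emacs. The following is a minimal Python sketch, not the procedure used in the project: the function name number_nouns is hypothetical, and it assumes the ‘=’ has already been appended to the noun tags as in 4.1.2.1.

```python
import re
from itertools import count

def number_nouns(text, width=4):
    """Number every N=, N-P=, NPR=, NPR-P= tag in order, zero-padded to `width`."""
    counter = count(1)  # numbering starts at 1

    def add_number(match):
        # Re-emit the tag followed by the next padded number.
        return f"({match.group(1)}={next(counter):0{width}d}"

    # Only '=' directly after a noun tag is numbered, so labels on
    # other nodes (e.g. NP=...) are left untouched.
    return re.sub(r"\((N|N-P|NPR|NPR-P)=", add_number, text)

print(number_nouns("(N= frota) (NPR= Capitão)"))
```

Doing both steps at once avoids the intermediate state in which numbers of different lengths coexist in the file.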

4.1.2.3 Add mnemonic aids

Because at present the process of marking the ‘reappearances’ of each referent is done chiefly manually, the large quantity of numbers may become hard to keep in memory. In further developments this has to be solved with a user-friendly interface. For the moment, two techniques are being used as workarounds for this mnemonic difficulty:

a. ‘Stubs’

One first technique is to use ‘stubs’ – i.e., a copy of the noun as a sub-tag (that can be removed later) – in each noun, as a mnemonic aid.

This can be done over the numbered items, with the following regular expression in Emacs (Note: don’t put [a-z] instead of [[:alpha:]], or the RegEx won’t copy the accented characters):

RegEx
 to copy 'words' as stubs 
 target: =0000<space>word
  \(=\)\([0-9]\)\([0-9]\)\([0-9]\)\([0-9]\)\( \)\([[:alpha:]]+\) 
→ \1\2\3\4\5_\7\6\7
Sample results:
(N=0005 frota) → 
(N=0005_frota frota)
(NPR=0007 Capitão) → 
(NPR=0007_Capitão Capitão)

However, the easiest way to do this is before numbering the nouns, with the following regular expression:

RegEx
 to copy 'words' as stubs 
 target: (TAG=<space> word)
  \(=\)\( \)\([[:alpha:]]+\) 
→ \1\3\2\3
(N= frota) → 
(N=frota frota)

(NPR= Capitão) → 
(NPR=Capitão Capitão)

(then run the numbering in 4.1.2.2 above, putting a '_' after the numbers)
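
The stub step can also be sketched in Python, for files whose nouns are already numbered with four digits. This is an illustration under assumptions, not the project's procedure: the function name add_stubs is hypothetical, and the pattern only handles noun nodes of the simple form (TAG=0000 word), where the word consists of letters only.

```python
import re

def add_stubs(text):
    """Copy each noun into its tag as a stub: (N=0005 frota) -> (N=0005_frota frota)."""
    # In Python 3, \w matches accented letters in str patterns,
    # so words like 'Capitão' are copied whole.
    pattern = r"\(((?:N|NPR)(?:-P)?=\d{4}) (\w+)\)"
    return re.sub(pattern, r"(\1_\2 \2)", text)

print(add_stubs("(N=0005 frota) (NPR=0007 Capitão)"))
```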

b. Further numbering

A second strategy that may be used as a mnemonic aid in the annotation is the reproduction of the original number of a given noun in all its subsequent occurrences, on top of the number labels described above.

This is particularly useful when a noun is repeated vastly over the text. It is however a particularly time-consuming step, as it must be done for every noun. 

First, we can add, to the number of each instance, the number of the first noun in the series. The example below shows this for ‘admiração’ (‘admiration’), with 3 occurrences:

N=1545_admiração
N=2733_admiração
N=3373_admiração

RegEx 
to copy original ID
target: =xxxx_stub

\(=\)\([0-9]\)\([0-9]\)\([0-9]\)\([0-9]\)\(_admiração\)  → 
\1\2\3\4\5/1545\6


N=1545_admiração → N=1545/1545_admiração
N=2733_admiração → N=2733/1545_admiração
N=3373_admiração → N=3373/1545_admiração

The resulting annotation is to be interpreted like this: 

N=1545/1545_admiração → This is noun 1545 
N=2733/1545_admiração → This is noun 2733 and it's a repetition of noun 1545
N=3373/1545_admiração → This is noun 3373 and it's a repetition of noun 1545

Even better, with the following RegEx, we can number all occurrences in the noun tag:

RegEx 
to copy original ID 
and number subsequent IDs
target: =xxxx_admiração

\(=\)\([0-9]\)\([0-9]\)\([0-9]\)\([0-9]\)\(_admiração\)  → 
\1\2\3\4\5/\,(1+ \#)/1545\6

N=1545_admiração → N=1545/1/1545_admiração
N=2733_admiração → N=2733/2/1545_admiração
N=3373_admiração → N=3373/3/1545_admiração

Now the resulting annotation is to be interpreted like this: 

N=1545/1/1545_admiração → This is noun 1545, it's the 1st time it occurs
N=2733/2/1545_admiração → This is noun 2733, it's a repetition of noun 1545,
                          for the 2nd time
N=3373/3/1545_admiração → This is noun 3373, it's a repetition of noun 1545,
                          for the 3rd time

This process of further annotation is quite time-consuming, but represents an important help when a noun is repeated many times along the text. Ideally, in future there would be a user-friendly interface that would script this preparation and ‘hide’ this code during the annotation; for the moment, it is a good aid.
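
As an indication of how this preparation might be scripted, the following minimal Python sketch applies the further numbering to all nouns in one pass, instead of one RegEx per noun. It is not part of the current procedure: the function name add_series_numbers is hypothetical, it assumes nouns already carry four-digit numbers and stubs (=0000_stub), and it treats stubs that differ only in capitalization as the same noun, which is an assumption.

```python
import re

NOUN = re.compile(r"=(\d{4})_(\w+)")

def add_series_numbers(text):
    """Rewrite =NNNN_stub as =NNNN/occurrence/first_stub (further numbering)."""
    first_id = {}  # lowercased noun -> number of its first instance
    seen = {}      # lowercased noun -> occurrences counted so far

    def renumber(match):
        number, stub = match.group(1), match.group(2)
        key = stub.lower()
        first_id.setdefault(key, number)
        seen[key] = seen.get(key, 0) + 1
        return f"={number}/{seen[key]}/{first_id[key]}_{stub}"

    return NOUN.sub(renumber, text)

sample = "(N=1545_admiração admiração) (N=1600_abundância abundância) (N=2733_admiração admiração)"
print(add_series_numbers(sample))
```

Cases that need human judgment, such as linking singular and plural forms, are outside the scope of this sketch.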


4.1.3 Checking the noun count

It is essential that the noun count is precise, as it represents the basis of the whole annotation. In order to ensure this there are two helpful steps.

4.1.3.1 Recount

The first is to check the numbering immediately after step 4.1.2.2 above (‘Number =’). One preliminary measure is to repeat the query given above, to count how many N, N-P, NPR or NPR-P tags are in the file, just to make sure it finds exactly the same number of items as before starting. An even better idea is to run a similar search, but for N=, N-P=, NPR= and NPR-P=. This should also result in exactly the same number of hits as before starting (4164 in this case). It also has the advantage of listing a complete, organized lexicon of all the tagged words, to be used as reference later (as described in 4.1.3.2):

// Query: #lex_n=_npr=.q

 begin_remark: 
  Make a lexicon of all N, N-P, NPR, NPR-P
  marked with a = sign
 end_remark

 define: port.def
 print_indices: t
 node: IP*
 make_lexicon: t
 pos_labels: N=|N-P=|NPR=|NPR-P=
//
// Sample output of query: #lex_n=_npr=.q 
  (snippet)
 /*
 SUMMARY: 
 source files, hits/tokens/total
 parsed.psd 4164/832/909
 whole search, hits/tokens/total
  4164/832/909
 */
//
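
A complementary check, sketched here in Python under assumptions, is to verify that the numbers themselves run consecutively from 1 with no gaps or repetitions. The function name check_noun_numbers is hypothetical, and the sketch assumes four-digit numbers placed directly after the ‘=’ of each noun tag.

```python
import re

def check_noun_numbers(text):
    """Return (noun count, True if numbers run 1..count in order of appearance)."""
    numbers = [int(n) for n in re.findall(r"\((?:N|NPR)(?:-P)?=(\d{4})", text)]
    expected = list(range(1, len(numbers) + 1))
    return len(numbers), numbers == expected

print(check_noun_numbers("(N=0001 frota) (NPR=0002 Capitão) (N-P=0003 ilhas)"))
```

The count returned should equal the number of hits reported by the queries above.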

4.1.3.2 Lexicon companion

A fundamental companion to the process of annotation in the present manual format is a lexicon of all nouns in the file, kept as a reference during the annotation, to check if a noun has appeared before or not, and how many times, etc. This is of course produced by the query above. 

Below is a part of the output of the query above, in a file marked only with numbers (no stubs and no further numbers):

// Sample output of query #lex_n=_npr=.q (1)

 /* ~A~ */
 abertura 1: [N=1812] 
 abismo 1: [N=2730] 
 abril 1: [NPR=0319] 
 abundância 4: [1 N=0399] [1 N=1382] [1 N=1600] [1 N=2006] 
 acrescentamento 1: [1 N=0703] 
 adens 1: [1 N-P=2109] 
 admiração 3: [1 N=1545] [1 N=2733] [1 N=3373] 
 adversários 1: [1 N-P=3136] 
 afonso 1: [1 NPR=0917] 
 afrontas 1: [1 N-P=3463] 
 agouros 1: [1 N=3232] 
 agravo 1: [1 N=1539] 
 agravos 2: [1 N-P=3484] [1 N-P=3916] 
 aguada 2: [1 N=0022] [1 N=0118] 
 aimorés Aimorés 4: [1 NPR-P=3698] [1 N-P=3714] [1 N-P=3730] [1 N-P=3804] 
 aipim Aipim 2: [1 NPR=1226] [1 NPR=3503] 
 ajuntamento 1: [1 N=2791] 
 alagadiços 1: [1 N-P=0261] 
 alcachofres 1: [1 N-P=1319] 
 aldeia 18: [1 N=2849] [1 N=3211] [1 N=3213] [1 N=3221] [1 N=3248] [1 N=3272] [1 N=3277] [1 N=3302] [1 N=3308] [1 N=3347] [1 N=3351] [1 N=3386] [1 N=3457] [1 N=3470] [1 N=3489] [1 N=3515] [1 N=3610] [1 N=3679]
 
… etc
//

If the annotation has made use of ‘stubs’, the output would look like this:

// Sample output of query #lex_n=_npr=.q (2)

 /* ~A~ */
 abertura 1: [N=1812_abertura] 
 abismo 1: [N=2730_abismo] 
 abril 1: [NPR=0319_Abril] 
 abundância 4: [1 N=0399_abundância] [1 N=1382_abundância] [1 N=1600_abundância] [1 N=2006_abundância] 
 acrescentamento 1: [1 N=0703_acrescentamento] 
 adens 1: [1 N-P=2109_adens] 
 admiração 3: [1 N=1545_admiração] [1 N=2733_admiração] [1 N=3373_admiração] 
 adversários 1: [1 N-P=3136_adversários] 
 afonso 1: [1 NPR=0917_Afonso] 
 afrontas 1: [1 N-P=3463_afrontas] 
 agouros 1: [1 N=3232_agouros] 
 agravo 1: [1 N=1539_agravo] 
 agravos 2: [1 N-P=3484_agravos] [1 N-P=3916_agravos] 
 aguada 2: [1 N=0022_aguada] [1 N=0118_aguada] 
 aimorés Aimorés 4: [1 NPR-P=3698_Aimorés] [1 N-P=3714_Aimorés] [1 N-P=3730_Aimorés] [1 N-P=3804_Aimorés] 
 aipim Aipim 2: [1 NPR=1226_Aipim] [1 NPR=3503_Aipim] 
 ajuntamento 1: [1 N=2791_ajuntamento] 
 alagadiços 1: [1 N-P=0261_alagadiços] 
 alcachofres 1: [1 N-P=1319_alcachofres] 
 aldeia 18: [1 N=2849_aldeia] [1 N=3211_aldeia] [1 N=3213_aldeia] [1 N=3221_aldeia] [1 N=3248_aldeia] [1 N=3272_aldeia] [1 N=3277_aldeia] [1 N=3302_aldeia] [1 N=3308_aldeia] [1 N=3347_aldeia] [1 N=3351_aldeia] [1 N=3386_aldeia] [1 N=3457_aldeia] [1 N=3470_aldeia] [1 N=3489_aldeia] [1 N=3515_aldeia] [1 N=3610_aldeia] [1 N=3679_aldeia]

 … etc
//

If the annotation has made use of stubs and further numbering, it would look like this (notice the added advantage of this output in allowing us to check the further numbering as well); shown up to “admiração” only:

// Sample output of query #lex_n=_npr=.q (3)

 /* ~A~ */
 abertura 1: [N=1812/1/1812_abertura] 
 abismo 1: [N=2730/1/2730_abismo] 
 abril 1: [NPR=0319/1/0319_Abril] 
 abundância 4: [1 N=0399/1/0399_abundância] [1 N=1382/2/0399_abundância] [1 N=1600/3/0399_abundância] [1 N=2006/4/0399_abundância] 
 acrescentamento 1: [1 N=0703/1/0703_acrescentamento] 
 adens 1: [1 N-P=2109/1/2109_adens]
 admiração 3: [1 N=1545/1/1545_admiração] [1 N=2733/2/1545_admiração] [1 N=3373/3/1545_admiração] 

… etc
//
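
The check of the further numbering that this output makes possible can also be sketched in Python. This is only an illustration under stated assumptions: the function name is hypothetical, the tags are assumed to follow the N=xxxx/k/zzzz pattern shown above, and occurrences are assumed to appear in text order.

```python
import re
from collections import defaultdict

# Noun tags with further numbering: N=xxxx/k/zzzz (noun number / occurrence / first mention).
FURTHER = re.compile(r"\((?:N|N-P|NPR|NPR-P)=(\d+)/(\d+)/(\d+)")


def check_further_numbering(text):
    """Return (noun number, message) pairs for inconsistent further numbering."""
    seen = defaultdict(int)
    errors = []
    for m in FURTHER.finditer(text):
        noun, occurrence, first = m.group(1), int(m.group(2)), m.group(3)
        seen[first] += 1
        if occurrence != seen[first]:
            errors.append((noun, f"expected occurrence {seen[first]}, found {occurrence}"))
        if occurrence == 1 and noun != first:
            errors.append((noun, f"first occurrence should point to itself, not to {first}"))
    return errors


sample = (
    "(N=1545/1/1545_admiração admiração) "
    "(N=2733/2/1545_admiração admiração) "
    "(N=3373/2/1545_admiração admiração)"  # should be /3/
)
print(check_further_numbering(sample))
```

For the sample, the only error reported concerns noun 3373, whose occurrence number should be 3.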

 


4.2 Marking noun phrases


4.2.1 Marking

Starting from the file with all nouns counted as described above, we come to the central part of this annotation: the identification and marking of referent IDs in the phrases.

As mentioned above, this is the stage that depends exclusively on human interpretation of the text, and there are not many technical details to list for it. Even numbering the occurrences is currently done ‘manually’. It is in fact possible to number the occurrences of each particular referent semi-automatically, provided all mentions of this referent are identified first (thus splitting the process in two: identifying, and only then numbering). The numbering could be done over the reference IDs, with a RegEx in Emacs, as follows (note that this must be done for one referent at a time; replace the “0000” below with the number you are targeting; later, all the one-digit numbers generated must be replaced by three-digit numbers, as shown further above):

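RegEx 
to number specific referents
target: =0000
 \(=\)\(0000\) -> 
 \1\2=\,(1+ \#)
 NP=0000 → 
 NP=0000=1
 (N.B.: in Emacs, \# alone counts the replacements already made, starting from 0; hence the 1+)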

However, this automatic counting is not worthwhile. We feel that automatically numbering previously applied IDs makes the annotator miss the vital part of the annotation, which is precisely interpreting which referent is being repeated where. The best idea, then, is to number the referents manually, one by one, and then run automatic numberings over the manual numbering, as a checking device (as described in 4.2.2 below).

So, in technical terms, the most convenient way to proceed to numbering is in two stages, clause by clause:

Mark Referent IDs: First, read the clause and use only the “stubs” and ID references, copying and pasting them after each NP, WNP etc. that is interpreted as related to them. This step, described above, is in fact equivalent to simply marking the intuitive interpretation of the text (i.e., “que”, ‘which’, is actually “a frota”, ‘the fleet’; etc.):

 (NP-SBJ-4=0005_frota (D-UM-F uma) (N=0005_frota frota) ...)
 (NP-SE-4=0005_frota (CL -se))
 (NP=0005_frota      (WPRO que)
 (NP-GEN=0005_frota  *T*-2)

Number Referent IDs: Following this, all the occurrences of each referent in the clause may be numbered (and corrected, if necessary); the stubs (‘_frota’ in the examples below) may later be cleaned up (see 4.2.3.1 below):

(NP-SBJ-4=0005_frota=001     (D-UM-F uma) (N=0005_frota frota) ...)
(NP-SE-4=0005_frota=002      (CL -se))
(NP=0005_frota=003           (WPRO que)
(NP-GEN=0005_frota=TRACE_003 *T*-2)

Alternatively, the IDs could be left unnumbered, and at the very end of the annotation process, an automatic script may be run to number all instances of each Referent ID, from the beginning to the end of the text. As mentioned above, however, this would be a very time-consuming process.
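
This alternative (numbering all instances of each Referent ID at the very end) could be sketched in Python as follows. It is only an illustration, under assumptions that are not part of the guidelines: the function name is hypothetical, phrase labels are assumed to have the shape NP...=0000 or NP...=0000_stub, and traces are numbered like any other phrase instead of receiving the TRACE_ code of their antecedent.

```python
import re
from collections import defaultdict

# An NP or WNP label with a four-digit Referent ID, an optional stub,
# and no number of occurrence yet. (?!R) keeps the noun tags NPR and NPR-P out.
UNNUMBERED = re.compile(r"\(W?NP(?!R)[^\s()=]*=(\d{4})(?:_[^\s()=]*)?(?=\s)")


def number_occurrences(text):
    """Append =001, =002, ... to each unnumbered phrase label, per Referent ID."""
    counts = defaultdict(int)

    def add_number(match):
        ref_id = match.group(1)
        counts[ref_id] += 1
        return f"{match.group(0)}={counts[ref_id]:03d}"

    return UNNUMBERED.sub(add_number, text)


annotated = (
    "(NP-SBJ-4=0005_frota (D-UM-F uma) (N=0005_frota frota))\n"
    "(NP-SE-4=0005_frota (CL -se))\n"
    "(NP=0006 (D-F a) (NPR=0006 Índia))"
)
print(number_occurrences(annotated))
```

Labels that already carry a number of occurrence are left untouched, so the function can be applied again to a partially numbered file without doubling the codes; note, however, that it would then restart the count from 001 for the remaining unnumbered labels.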


 4.2.2 Checking and correcting

The most common mistakes in the annotation, as the initial tests found, were:

  • Skipping a noun phrase altogether
    (N.B.: particularly common when there is no N in an NP projection);
  • Skipping a number of occurrence;
  • Doubling a number of occurrence.

After the process described above, potential mistakes may be checked semi-automatically – i.e., it is possible to automatically search for the mistakes with Corpus Search, and then correct them manually – as follows.

4.2.2.1 Altogether unmarked phrases

To check if all relevant phrases have received a tag for referencing, run the following command; it should result in zero cases (i.e., zero unmarked NP*s):

// Query: #xp.q

 begin_remark: 
  List (W)NPs missing Number of Occurrence and Referent ID 
 end_remark

 define: port.def
 print_indices: t
 
 node: NP|WNP
 nodes_only: t
 remove_nodes: f
 query: (NP exists)
 OR (NP-[123456789] exists)
 OR (WNP exists)
 OR (WNP-[123456789] exists)
//

Example of mistake spotted with this search (a non-marked NP):

// Sample output of query #xp.q

  /~*
  E isto causa não haver lá frios, nem ruínas de inverno que ofendam a suas plantas, como cá ofendem a@ @as nossas. Enfim que assim se houve a Natureza com todas as coisas d@ @esta província, e de tal maneira se comediu n@ @a temperança d@ @os ares, que nunca n@ @ela se sente frio nem quentura excessiva.
 (G_008,8.37)
  *~/
  /*
  70 NP: 70 NP
  */
  ( (70 NP (71 D-F-P @as) (73 PRO$ nossas))
  (187 ID G_008,8.37))
//
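
The same check can be approximated outside Corpus Search. The Python sketch below is an assumption-laden illustration rather than the procedure actually used: the function name is hypothetical, and, unlike the query above, it reports any NP or WNP label without an ‘=’ code, whatever its dash tags.

```python
import re

# NP or WNP labels (with any dash tags) that carry no '=' code.
# (?!R) keeps the noun tags NPR and NPR-P out of the match.
UNMARKED = re.compile(r"\((W?NP(?!R)[^\s()=]*)(?=\s)")


def unmarked_phrases(text):
    """Return the labels of all (W)NPs that have no Referent ID."""
    return [m.group(1) for m in UNMARKED.finditer(text)]


sample = (
    "(PP (P a@) (NP (D-F-P @as) (PRO$ nossas))) "
    "(NP-ACC=0039=003 (D-F a) (N=0062/3/0039_costa costa))"
)
print(unmarked_phrases(sample))  # ['NP']
```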

4.2.2.2 Wrong numbers of occurrence

The most obvious way to check whether all numbers of occurrence of a Referent ID are in the right sequence is to recount them all automatically. This is very time-consuming, as it must be done for every Referent ID separately. However, for referents that are repeated many times, it is worthwhile.

In order to ‘recount’, simply apply this regular expression in Emacs (one Referent ID at a time, of course, i.e., replace =0000 below with the desired Referent ID): 

RegEx 
to re-number specific referents
target: =0000=000

  \(=\)\(0000\)\(=\)\([0-9]\)\([0-9]\)\([0-9]\)
-> \1\2=\,(1+ \#) 

  NP=0000=000 
→ NP=0000=1

For most Referents, however, it is better to use the queries that follow. In fact, for frequently repeated referents we do both: recount, and apply the queries that follow.

4.2.2.3 Wrong IDs or wrong numbers of occurrence

To find codes that may have been wrongly applied, by repeating or skipping items, etc., run the following searches and then correct the identified mistakes manually:

To find all individual referent IDs: To list all phrases marked with a referent ID in a file, and then spot doubled or skipped referent IDs, run this query (which actually finds all ‘first mentions’ of an ID, i.e., all that correspond to a =001 sub-label): 

// Query: =001.q

 begin_remark: 
  List all phrases marked =001 
 end_remark
 
 define: port.def
 print_indices: t
 
 node: NP*|WNP*
 nodes_only: t
 remove_nodes: f

 query: (NP*=*=001 exists) 
 OR (WNP*=*=001 exists)
//

The expected result of this search is the grouping of all top-phrases marked with 001, in sequence in each phrase, as below; if a sequence has a missing or an extra number, this is easier to spot here than in the complete annotated text:

// Sample output of query =001.q (i)
 
 /~*
 Estando assim surtos n@ @esta parte que digo, saltou aquela noite com eles tanto tempo, que lhes foi forçado levarem as âncoras,
 (G_008,6.6)
 *~/
 /*
 14 NP=0056=001: 14 NP=0056=001, 19 CP-REL
 34 NP-SBJ=0057=001: 34 NP-SBJ=0057=001, 37 N=0057_noite
 45 NP-ADV=0058=001: 45 NP-ADV=0058=001, 52 CP-DEG
 71 NP-ACC=0059=001: 71 NP-ACC=0059=001, 74 N-P=0059_âncoras
 */
//

Spotting duplicated number of occurrences in this output: Below is an example of a doubling mistake spotted with the search above – an NP-ACC and a WNP with the same ID, NP-ACC=0065=001 and WNP=0065=001; going back to the annotation, it can be seen that the WNP should have got another number:

// Sample output of query =001.q (ii)
 /~*
 e com aquele vento que lhes era largo por aquele rumo, foram correndo a costa até chegarem a um porto limpo e de bom surgidouro onde entraram: a@ @o qual puseram então este nome, que hoje em dia tem de Porto Seguro, por lhes dar colheita e os assegurar d@ @o perigo d@ @a tempestade que levavam.
 (G_008,6.7)
 *~/
 /*
 9 NP=0060=001: 9 NP=0060=001, 14 CP-REL
 32 NP=0061=001: 32 NP=0061=001, 35 N=0061/1/0061_rumo
 59 NP=0063=001: 59 NP=0063=001, 92 CP-REL
 74 NP=0064=001: 74 NP=0064=001, 77 N=0064/1/0064_surgidouro
 111 NP-ACC=0065=001: 111 NP-ACC=0065=001  
 131 NP=0066=001: 131 NP=0066=001, 132 N=0066/1/0066_dia
 139 NP=0067=001: 139 NP=0067=001, 142 VB-P
 156 NP-ACC=0068=001: 156 NP-ACC=0068=001, 157 N=0068/1/0068_colheita
 171 NP=0069=001: 171 NP=0069=001, 184 CP-REL
 119 WNP-1=0065=001: 119 WNP-1=0065=001, 120 WPRO <- something wrong!
 43 NP-ACC=0039=003: 43 NP-ACC=0039=003, 46 N=0062/3/0039_costa 
 179 NP=0023=004: 179 NP=0023=004, 182 N=0070/4/0023_tempestade 
 ... (five other phrases) */
//

Spotting skipped IDs in this output:  This is a more difficult case, because the fact that an ID does not appear in the sequence of the output does not necessarily mean there was a mistake – some numbers in the sequence will simply never be used as referent IDs (i.e., when they correspond to nouns that have appeared in repetition). 

The output snippet above is a good example: notice that the sequence of NPs is 0060 to 0070, but no NP 0062 appears. This is not a mistake, however: noun 0062 (‘costa’) had been mentioned twice before in the text, and this was correctly indicated in the annotation. The noun phrase it appears in is correctly marked with the number of its first occurrence (0039), and there really is no ID 0062 to be marked in an NP:

43 NP-ACC=0039=003: 43 NP-ACC=0039=003, 46 N=0062/3/0039_costa
 (VB-D foram)
 (VB-G correndo)
 (NP-ACC=0039=003 (D-F a) (N=0062/3/0039_costa costa))

So, when a number is missing from the list, one should check whether it coincides with the number of a repeated noun of this kind, shown at the end of the list. This task is made easier by the further annotation of N-, NPR-, explained in 4.1.2.4 above (in this case, notice how the annotation of the N ‘costa’ shows this clearly: N=0062/3/0039_costa means ‘this is noun 0062 and it corresponds to the third occurrence of the referent of noun 0039’).
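
The reasoning above (a gap in the sequence of first mentions is harmless when it coincides with a repeated noun) can be sketched in Python. The function name is hypothetical and the sketch rests on assumptions: nouns carry the further numbering N=xxxx/k/zzzz, and first mentions are phrase labels ending in =001. Numbers it returns are only candidates for manual inspection, since a gap may have other legitimate causes.

```python
import re

# First mentions: (W)NP labels whose number of occurrence is 001.
FIRST_MENTION = re.compile(r"\(W?NP(?!R)[^\s()=]*=(\d{4})[^\s()=]*=001(?=[\s_])")
# Noun tags with further numbering: N=xxxx/k/zzzz.
NOUN = re.compile(r"\((?:N|N-P|NPR|NPR-P)=(\d{4})/(\d+)/(\d{4})")


def unexplained_gaps(text):
    """Numbers missing from the first-mention sequence and not explained by a repeated noun."""
    first_ids = {int(m.group(1)) for m in FIRST_MENTION.finditer(text)}
    repeated = {int(m.group(1)) for m in NOUN.finditer(text) if int(m.group(2)) > 1}
    if not first_ids:
        return []
    return [
        n
        for n in range(min(first_ids), max(first_ids) + 1)
        if n not in first_ids and n not in repeated
    ]


sample = (
    "(NP=0060=001 x) (NP=0061=001 (N=0061/1/0061_rumo rumo)) "
    "(NP-ACC=0039=003 (D-F a) (N=0062/3/0039_costa costa)) "
    "(NP=0063=001 x)"
)
print(unexplained_gaps(sample))  # []
```

Here 0062 is absent from the first mentions, but the noun tag N=0062/3/0039_costa explains the gap, so nothing is reported.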

To find all phrases marked for referent IDs: To list absolutely all phrases marked for Referent IDs in a file (not only first mentions) and then spot wrong number of occurrence IDs, run:

// Query:  #xp=.q

 begin_remark: 
  List all (W)NPs marked =0000=000 
 end_remark
 define: port.def
 print_indices: t
 
 node: NP*|WNP*
 nodes_only: t
 remove_nodes: t

 query: (NP*=*=* exists)
 OR (WNP*=*=* exists)
//

This will of course result in a much larger list than the one above; so, it is a good idea to run and correct first IDs first, and only later run all IDs, so that many of the potential mistakes will already have been weeded out.

To examine selected referent IDs: For referents repeated profusely along the text, it is advisable to run a dedicated search, such as the following:

// Query: #thisref.q

 begin_remark:
  List all occurrences of a specific reference. 
  For each application, =0000 must be changed in the query.
 end_remark

 define: port.def
 print_indices: t

 node:  NP*|WNP*
 nodes_only: t
 remove_nodes: t

 query: (NP*=0000* exists)
 OR (WNP*=0000* exists)
//

In the output, no number of occurrence may be repeated, and the number of hits must equal the number of the last occurrence. After checking the file against this output, the references may be renumbered if there are mistakes.
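
The two conditions just stated (no repeated number of occurrence, and as many hits as the number of the last occurrence) can be tested for all Referent IDs at once. The Python lines below are a minimal sketch under assumptions: the function name is hypothetical, labels are assumed to follow the =0000=000 pattern (optionally with stubs), phrases are read in text order, and traces coded =TRACE_000 are skipped, since they repeat the number of their antecedent.

```python
import re
from collections import defaultdict

# (W)NP labels carrying a Referent ID and a three-digit number of occurrence.
MARKED = re.compile(r"\(W?NP(?!R)[^\s()=]*=(\d{4})[^\s()=]*=(\d{3})(?=[\s_])")


def check_occurrences(text):
    """Map each Referent ID whose numbers of occurrence are not 1, 2, 3, ... to the numbers found."""
    found = defaultdict(list)
    for m in MARKED.finditer(text):
        found[m.group(1)].append(int(m.group(2)))
    return {
        ref_id: numbers
        for ref_id, numbers in found.items()
        if numbers != list(range(1, len(numbers) + 1))
    }


sample = (
    "(NP=0063=001 x) (NP-ACC=0065=001 y) (WNP-1=0065=001 (WPRO que)) "
    "(NP=0063=002 x) (NP-GEN=0063=TRACE_002 *T*-2)"
)
print(check_occurrences(sample))  # {'0065': [1, 1]}
```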


 4.2.3 Post-Processing

When the processes described from 4.1 above have finished for the whole text, a few post-processing scripts may be run to reorganize the file.

First of all, some of the steps in the markup are relevant for the annotation process only and are not needed for the final format of the file, namely the further numbering of nouns and the stubs. Two scripts may be run after the whole file has been annotated, to eliminate these auxiliary markups.

4.2.3.1 To crop off stubs from nouns and noun phrases

After the annotation process, the stubs will no longer be necessary. To target and eliminate only the stubs (but keep all the numbers), run the following query:

// Query: #crop_stubs.q

 begin_remark:
  To crop tags after the symbol '_' 
 end_remark

 define: port.def
 print_indices: t
 node: IP*
 copy_corpus: t

 query: (IP* dominates {1}N*_*)
 OR (IP* dominates {2}N-P*_*)
 OR (IP* dominates {3}NPR*_*)
 OR (IP* dominates {4}NPR-P*_*)
 OR (IP* dominates {5}NP*_*)
 OR (IP* dominates {6}WNP*_*)

 post_crop_label{1, 2, 3, 4, 5, 6}:_
//
// Sample result of query #crop_stubs.q:
  (NP-SBJ-1=1545=002_admiração (N=3373_admiração admiração)) → 
  (NP-SBJ-1=1545=002 (N=3373 admiração))

or

  (NP-SBJ-1=1545=002_admiração (N=3373/3/1545_admiração admiração)) → 
  (NP-SBJ-1=1545=002 (N=3373/3/1545 admiração))
//

4.2.3.2 To crop off extra numbering from nouns

As with the case of the stubs, the code with the number of the first occurrence of a noun added to all subsequent occurrences is to be used only in the process of marking, and may be removed in the end with the following Query:

// Query: #crop_extranumbers.q

 begin_remark: 
  To crop tags after the symbol '/' 
 end_remark
 
 define: port.def
 print_indices: t
 node: IP*
 copy_corpus: t

 query: (IP* dominates {1}N*/*)
 OR (IP* dominates {2}N-P*/*)
 OR (IP* dominates {3}NPR*/*)
 OR (IP* dominates {4}NPR-P*/*)

 post_crop_label{1, 2, 3, 4}:/
//
// Sample result of query #crop_extranumbers.q:
   (NP-SBJ-1=1545=002 (N=3373/3/1545 admiração)) →
   (NP-SBJ-1=1545=002 (N=3373 admiração))
//

4.2.3.3 Reformatting

The marking of numbers on the NP* labels will create a mess in the indentation of the brackets in the source file. To correct this, at the end of the process (or even between stages), run the following make_corpus commands in order, and make a new file (the searches merely append an extra tag and then take it out, only to reformat the text). Of course, there is no need to do this if the queries in 4.2.3.2 above have been run, as they will also reorganize the indentation as a by-product:

1) Query for the first output:

// Query: #indent_1.q

 define: port.def
 print_indices: t
 node: IP*
 copy_corpus: t
 query: (IP* dominates {1}N=*) 
 append_label{1}: %
//

2) Query to be run over the first output:

// Query:  #indent_2.q

 define: port.def
 print_indices: t
 node: IP*
 copy_corpus: t
 query: (IP* dominates {1}N=*) 
 post_crop_label{1}: %
//

4.2.3.4 Eliminating all reference markers 

Finally, it is worth noticing at this point that, in fact, all the markup used in this annotation can be easily removed if there is the need for the file to revert to the original, syntactically annotated format. As the original syntactic annotation does not make use of ‘=’, all the referent annotation used in this system may be easily removed if necessary, by this simple query with Corpus Search:

// Query: #crop=.q

  begin_remark: 
   To crop tags after the symbol '=' 
  end_remark

  define: port.def
  print_indices: t
  node: IP*
  copy_corpus: t

  query: (IP* dominates {1}N=*)
  OR (IP* dominates {2}N-P=*)
  OR (IP* dominates {3}NPR=*)
  OR (IP* dominates {4}NPR-P=*)
 
  OR (IP* dominates {5}NP*=*) 
  OR (IP* dominates {6}WNP*=*)

  post_crop_label{1, 2, 3, 4, 5, 6}:=
//
// Sample result of query #crop=.q:
   (NP-SBJ-1=1545=001_admiração (N=3373/3/1545_admiração admiração)) → 
   (NP-SBJ-1 (N admiração)) 
//
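
For readers working outside Corpus Search, the three cropping steps of 4.2.3.1, 4.2.3.2 and the present section can be approximated with regular expressions. The Python sketch below is only an illustration: the function names are hypothetical, and it assumes that every label follows an opening bracket directly and that the original labels contain no ‘=’, ‘_’ or ‘/’ of their own.

```python
import re


def crop_stubs(text):
    """Drop everything from '_' to the end of any label that carries an '=' code."""
    return re.sub(r"\(([^\s()_]*=[^\s()_]*)_[^\s()]*(?=\s)", r"(\1", text)


def crop_extra_numbers(text):
    """Drop everything from '/' to the end of a numbered noun tag."""
    return re.sub(r"\(((?:N|N-P|NPR|NPR-P)=\d+)/[^\s()]*(?=\s)", r"(\1", text)


def crop_all_codes(text):
    """Drop everything from the first '=' to the end of any label."""
    return re.sub(r"\(([^\s()=]+)=[^\s()]*(?=\s)", r"(\1", text)


phrase = "(NP-ACC=0039=003_costa (D-F a) (N=0062/3/0039_costa costa))"
print(crop_stubs(phrase))
print(crop_extra_numbers(crop_stubs(phrase)))
print(crop_all_codes(phrase))
```

Unlike the queries, these substitutions leave the indentation of the file untouched, so the reformatting step of 4.2.3.3 may still be needed.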

 


4.3 Remaining technical issues

There are important technical difficulties pertaining to the proposed annotation, as the marking technique described so far is rudimentary at best. 

The most important consequence of this rudimentary technique is the difficulty in maintaining consistency in the sequence of numbers of occurrence and in the unique ID list. The main problem is a problem of memory; there are, in this test-file of 22,944 words, 4,165 pre-numbered referents (see List of Referents), many of them with many occurrences in different parts of the text (the most acute case being ‘terra’, ‘land’, with 149 occurrences), and at a certain point it becomes hard to remember the numbers of each occurrence, even if a reference list is kept.

In order to make sure the sequence was not broken, some ‘mnemonic devices’ (i.e., intermediate markings to be used only in the annotation process) were devised as regards the numbering of nouns, such as the ones seen in 4.1.1 above (the use of copies of the noun in each number tag, ‘stubs’; the copy of the number of the first occurrence of each noun in all subsequent occurrences). These were helpful, and with them the task is much faster and less error-prone. A few checking queries were devised too, as described in 4.2.2, which raised the level of confidence in the resulting annotation of noun phrases.

However, the fact is that for the moment, what we have is a semi-automatic technique, with more than a few clumsy procedures involved. The markup is still a very time-consuming and error-prone process, and it would be important to make it much less so if this annotation is to be carried over to other texts in the corpus.

The solution, clearly, is to build a more automated system, by which all repetitive and consistent steps could be machine performed.

The rudimentary annotation was devised with this potential in mind. As mentioned above, in the present annotation, the only part that needs to be performed exclusively with human intervention is the identification of referents mentioned by phrases that do not contain nouns – i.e., phrases formed by pronouns (lexical or null) and by other complex phrases used to establish relations (such as noun phrases with determiners, demonstratives, quantifiers, and ‘elided’ nouns). All else could be scripted and machine-performed; more specifically:

Annotation stages and their automation potential:

  1. Numbering nouns                            < could be automated
  2. Marking noun phrases containing nouns      < could be automated
  3. Marking noun phrases not containing nouns  < could not be automated
  4. Counting reference ID occurrences          < could be automated

The automation of the processes could be implemented in different ways. In a bold scenario, we can envisage an interface in which stages 1, 2 and 4 above are ‘given’ to the annotator in a friendly environment, and the annotator’s task would be to perform stage 3 only, i.e., to mark the noun phrases that do not contain nouns, relying on their interpretation of the text. This would be something along the lines of the illustration below, for the opening clause of the test-text, ‘Reinando aquele muito católico…’:

[Illustration: Screenshot 2015-11-27 18.02.07]

In such an interface, the nouns and heads of noun phrases would be ‘pre-marked’ (blue above), and by clicking in each the annotator would open an options window where each of the possible antecedents would be pre-listed. Even empty categories could be tackled by such a system (provided, of course, the code behind it included the complete syntactic annotation); the illustration below shows this for the pair of clauses ‘Bugios há nesta terra muitos… e por serem tão conhecidos…‘:

[Illustration: Screenshot 2015-11-27 18.01.40]

If such an environment can be designed, this annotation would, in fact, constitute a particularly easy task, requiring of the annotator only the capacity to understand the referential relations established by pronouns, determiners, demonstratives, etc., in a given text – in other words, all it would require is a human reader with command of the language in question.

A second, less ambitious project would dispense with ‘user-friendly’ interfaces, but make use of what can be automated to facilitate markup in an ordinary text processor. This might be achieved simply by running scripts over the syntactic annotation that would mark all constituents that can be marked automatically, and leave only the difficult ones to human interpretation.

For instance, again in the clause ‘Reinando aquele muito católico…’, the tree (as follows) would have most NPs marked for reference easily by derivation from the annotation of the noun tags (marked in blue), leaving only the NPs in red to be marked by a human annotator. In this case, only the relation between (WNP (WPRO que)) and (N frota) would have to be ‘hand-coded’:

 ( (IP-MAT (IP-GER (VB-G REINANDO)
		  (NP-SBJ=0001=001 (D aquele)
			  (ADJP (Q muito)
				(ADJ católico)
				(CONJP (CONJ e)
				(ADJX (ADJ-S sereníssimo))))
			   (NPR=0001/1/0001 Príncipe)
			   (NP-PRN=9000=001 (NPR=0002/1/0002 el-Rei) (NPR=0003/1/0003 Dom) (NPR=0004/1/0004 MANUEL))))
	  (, ,)
	  (VB-D fez-)
	  (NP-SE-4=0005=002 (CL -se))
	  (NP-SBJ-4=0005=001 (D-UM-F uma)
		    (N=0005/1/0005 frota)
		    (PP (P para)
			(NP=0006=001 (D-F a) (NPR=0006/1/0006 Índia)))
		    (CP-REL (WPP-2 (P de)
		                   (NP (WPRO que)))
			    (IP-SUB (VB-D ia)
			    (PP (P por)
				(NP=0007=001 (N=0007/1/0007 capitão)
				             (ADJ mór)
				             (NP-GEN *T*-2)))
			    (NP-SBJ=9001=001 (NPR=0008/1/0008 Pedro) (NPR=0009/1/0009 Álvares) (NPR=0010/1/0010 Cabral)))))...))

Ideas for scripts to achieve the markings in blue above would be:

1. Mark NPs containing (N=xxxx/y+/zzzz ...) with =zzzz
2. Mark NPs co-indexed to NPs containing (N=xxxx/y+/zzzz ...) with =zzzz
3. Mark NPs containing more than one (NPR ...) with a new number =9000

NPs covered in each case:

1. Mark NPs containing (N=xxxx/y+/zzzz ...) with =zzzz 
   (NP-SBJ-4=0005 (D-UM-F uma) (N=0005/1/0005 frota))
   (NP=0006 (D-F a) (NPR=0006/1/0006 Índia))
   (NP=0007 (N=0007/1/0007 capitão))

2. Mark NPs co-indexed to NPs containing (N=xxxx/y+/zzzz ...) with =zzzz 
   (NP-SBJ-4=0005 (D-UM-F uma) (N=0005/1/0005 frota)) >
   (NP-SE-4=0005 (CL -se)) 

3. Mark NPs containing more than one (NPR ...) with a new number =9000  
   (NP-PRN=9000 (NPR=0002/1/0002 el-Rei) (NPR=0003/1/0003 Dom) (NPR=0004/1/0004 MANUEL))
   (NP-SBJ=9001 (NPR=0008/1/0008 Pedro) (NPR=0009/1/0009 Álvares) (NPR=0010/1/0010 Cabral))

Notice, for rules 1 and 2 above, that the further numbering of N, N-P, NPR and NPR-P tags with ‘clones’ of their first mention (i.e., =0005/1/0005 over =0005) is paramount for the script to work: it is actually the ‘clone’ that the instruction will tag into the NPs.

This is not immediately clear in those examples, as they are all first occurrences of the nouns in question. It is more clearly visible in the following case, where ‘terra’, ‘land’, appears as noun 1905, but is the 82nd repetition of the original appearance of ‘terra’, first marked as 0046; with instructions such as (1) above, the NP containing ‘terra’ would easily be marked as =0046 (not =1905):

 Bugios há na terra muitos de muitas castas,
 where (N=1905/82/0046 terra)

1. Mark NPs containing (N=xxxx/y+/zzzz ...) with =zzzz
  (NP=0046 (D-F @a) (N=1905/82/0046 terra))

The tags for numbers of occurrence would be applied over this initial marking, in a second stage, also entirely automated with scripts focusing on each different Referent ID code.

Notice that the number of occurrence cannot be based on the noun tags, as the number there does not include the times the referent may have been referenced without being explicitly mentioned (again, this is the case of ‘terra’ above: although it is the 82nd occurrence of the noun, it corresponds to the 91st occurrence of the referent). This is easily solved if the scripts that number references ignore the numbers between /../ on the nouns, and just count the Referent IDs in NPs.

The only special provision in this case would be an instruction to number NPs containing traces with the same Number of Occurrence tag as the NP with which the trace is coindexed.
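
To make rule 1 above more concrete, here is a minimal Python sketch. It is not an existing script of the project: all function names are hypothetical, the small bracket parser is an assumption about how the parsed file could be read, and only rule 1 is covered (an unmarked NP that directly contains exactly one numbered noun takes the ‘clone’ zzzz of that noun). NPs with several NPRs are left untouched, as they fall under rule 3.

```python
import re

NP_LABEL = re.compile(r"W?NP(?!R)")
NOUN_TAG = re.compile(r"(?:N|N-P|NPR|NPR-P)=\d+/\d+/(\d+)")


def parse(text):
    """Parse one bracketed tree into nested lists: [label, child, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip "("
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return items

    return node()


def mark_rule_1(tree):
    """Rule 1: mark an unmarked NP with the first-mention number of its only noun."""
    if isinstance(tree, str):
        return
    for child in tree:
        mark_rule_1(child)
    if not tree or not isinstance(tree[0], str):
        return
    label = tree[0]
    if NP_LABEL.match(label) and "=" not in label:
        firsts = []
        for child in tree[1:]:
            if isinstance(child, list) and child and isinstance(child[0], str):
                m = NOUN_TAG.match(child[0])
                if m:
                    firsts.append(m.group(1))
        if len(firsts) == 1:
            tree[0] = label + "=" + firsts[0]


def unparse(tree):
    """Write the tree back on a single line (the original indentation is not kept)."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(unparse(child) for child in tree) + ")"


tree = parse("(NP (D-F @a) (N=1905/82/0046 terra))")
mark_rule_1(tree)
print(unparse(tree))  # (NP=0046 (D-F @a) (N=1905/82/0046 terra))
```

Numbers of occurrence would then be added in a second pass over the marked labels, counting Referent IDs and ignoring the numbers between slashes on the nouns.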


5. Bank of queries

Below are the full queries used to achieve the partial results described in 3.2 above. Note that all searches below exclude sentences with ‘SE’ pronouns as high nodes, for reasons related to their particular syntax in Portuguese and to issues with the annotation (see 3.1.3 above).

5.1 Queries targeting constructions and their referents

(1) Searching for pre-verbal lexical subjects that correspond to the first mention of any referent in a chain

// Query: SV=001.q

 begin_remark: 
  Find matrix clauses with pre-verbal lexical subjects
  whose referent corresponds to a first-time mention
 end_remark

 define: port.def
 print_indices: t

 node: IP*
 query: (IP-MAT* iDominates NP-SBJ=*=001) 
 AND (NP-SBJ=*=001 iDominates !*pro*|*exp*) 
 AND (IP-MAT* iDominates !NP-SE*) 
 AND (NP-SBJ=*=001 HasSister VB-*|SR-*|ET-*|HV-*|TR-*) 
 AND (NP-SBJ=*=001 precedes VB-*|SR-*|ET-*|HV-*|TR-*) 
 //

Comparative search:

// Query: SV.q

 begin_remark:
   Find matrix clauses with pre-verbal lexical subjects
 end_remark

 define: port.def
 print_indices: t

 node: IP*
 query: (IP-MAT* iDominates NP-SBJ*)
 AND (NP-SBJ* iDominates !*pro*|*exp*)
 AND (IP-MAT* iDominates !NP-SE*)
 AND (NP-SBJ* HasSister VB-*|SR-*|ET-*|HV-*|TR-*)
 AND (NP-SBJ* precedes VB-*|SR-*|ET-*|HV-*|TR-*)
//

(2) Searching for pre-verbal objects that correspond to the first mention of any referent in a chain

// Query: OV=001.q

 begin_remark:
  Find matrix clauses with pre-verbal nominal accusative objects 
  whose referent corresponds to a first-time mention
 end_remark

 define: port.def
 print_indices: t

 node: IP*
 query: (IP-MAT* iDominates NP-ACC=*=001) 
 AND (IP-MAT* iDominates !NP-SE*) 
 AND (NP-ACC=*=001 iDominates !CL*) 
 AND (NP-ACC=*=001 HasSister VB-*|SR-*|ET-*|HV-*|TR-*) 
 AND (NP-ACC=*=001 precedes VB-*|SR-*|ET-*|HV-*|TR-*)
 //

Comparative search:

// Query: OV.q
 begin_remark:
  Find matrix clauses with pre-verbal nominal accusative objects
 end_remark

 define: port.def
 print_indices: t

 node: IP*
 query: (IP-MAT* iDominates NP-ACC*) 
 AND (IP-MAT* iDominates !NP-SE*) 
 AND (NP-ACC* iDominates !CL*) 
 AND (NP-ACC* HasSister VB-*|SR-*|ET-*|HV-*|TR-*) 
 AND (NP-ACC* precedes VB-*|SR-*|ET-*|HV-*|TR-*)
//

(3) Searching for null subjects that correspond to the first mention
of any referent in a chain

(N.B. This is a checking search; it is expected that no null subject is a first-time mention).

// Query: pro=001.q

 begin_remark:
  Find matrix clauses with null subjects
  whose referent corresponds to a first-time mention
 end_remark

 define: port.def
 print_indices: t

 node: IP*
 query: (IP-MAT* iDominates NP-SBJ=*=001) 
 AND (NP-SBJ=*=001 iDominates *pro*) 
 AND (IP-MAT* iDominates !NP-SE*)
 //

Comparative search:

// Query: pro.q

 begin_remark:
  Find matrix clauses with null subjects
 end_remark

 define: port.def
 print_indices: t

 node: IP*
 query: (IP-MAT* iDominates NP-SBJ*) 
 AND (NP-SBJ* iDominates *pro*) 
 AND (IP-MAT* iDominates !NP-SE*)
//

(4) Searching for pre-verbal lexical subjects that correspond to the first mention of a referent indirectly related to other referents in a chain

// Query: SV&=001.q

 begin_remark: 
  Find matrix clauses with pre-verbal lexical subjects
  whose referent corresponds to a first-time mention
  of a referent with additional & annotation
 end_remark

 define: port.def
 print_indices: t

 node: IP*
 query: (IP-MAT* iDominates NP-SBJ=*&*=001) 
 AND (NP-SBJ=*&*=001 iDominates !*pro*|*exp*) 
 AND (IP-MAT* iDominates !NP-SE*) 
 AND (NP-SBJ=*&*=001 HasSister VB-*|SR-*|ET-*|HV-*|TR-*) 
 AND (NP-SBJ=*&*=001 precedes VB-*|SR-*|ET-*|HV-*|TR-*) 
//

(the comparative search may be the same as for (1) above)

(5) Searching for pre-verbal objects that correspond to the first mention of a referent indirectly related to other referents in a chain

// Query: OV&=001.q

 begin_remark:
  Find matrix clauses with pre-verbal nominal accusative objects 
  whose referent corresponds to a first-time mention
  of a referent with additional & annotation
 end_remark

 define: port.def
 print_indices: t

 node: IP*
 query: (IP-MAT* iDominates NP-ACC=*&*=001) 
 AND (IP-MAT* iDominates !NP-SE*) 
 AND (NP-ACC=*&*=001 iDominates !CL*) 
 AND (NP-ACC=*&*=001 HasSister VB-*|SR-*|ET-*|HV-*|TR-*) 
 AND (NP-ACC=*&*=001 precedes VB-*|SR-*|ET-*|HV-*|TR-*)
//

(the comparative search may be the same as for (2) above)

(6) Searching for null subjects in first-time mentions of a referent indirectly related to other referents in a chain

(N.B. Again, this is a checking search; it is expected that no null subject is a first-time mention at all).

// Query: pro&=001.q

 begin_remark:
  Find matrix clauses with null subjects
  whose referent corresponds to a first-time mention
  of a referent with additional & annotation
 end_remark

 define: port.def
 print_indices: t

 node: IP*
 query: (IP-MAT* iDominates NP-SBJ=*&*=001) 
 AND (NP-SBJ=*&*=001 iDominates *pro*) 
 AND (IP-MAT* iDominates !NP-SE*)
//

(the comparative search may be the same as for (3) above)

(7) Searching for subject projections of a specific referent

// Query: SBJ_thisid.q

 begin_remark:
  Find subject projections of a specific referent in a chain.
  For each application, change 0000 to the number of the targeted ID.
 end_remark

 define: port.def
 print_indices: t

 node: NP*
 nodes_only: t
 remove_nodes: t

 query: (NP-SBJ*=*0000* exists)
 AND (IP-MAT* iDominates !NP-SE*)
//

(8) Searching for object projections of a specific referent

// Query: ACC_thisid.q

 begin_remark:
  Find object projections of a specific referent in a chain.
  For each application, change 0000 to the number of the targeted ID.
 end_remark

 define: port.def
 print_indices: t

 node: NP*
 nodes_only: t
 remove_nodes: t

 query: (NP-ACC*=*0000* exists)
 AND (IP-MAT* iDominates !NP-SE*)
//

(9) Searching for subject and object projections of a specific referent

// Query: SBJ_ACC_thisid.q

 begin_remark:
  Find subject and object projections of a specific referent in a chain.
  For each application, change 0000 to the number of the targeted ID.
 end_remark

 define: port.def
 print_indices: t

 node: NP*
 nodes_only: t
 remove_nodes: t

 query: (NP-SBJ*=*0000* exists)
 OR (NP-ACC*=*0000* exists)
 AND (IP-MAT* iDominates !NP-SE*)
//

(10) Searching for any NP projection of a specific referent

// Query: NP_thisid.q

 begin_remark:
  Find NP projections of a specific referent in a chain.
  For each application, change 0000 to the number of the targeted ID.
 end_remark

 define: port.def
 print_indices: t

 node: NP*
 nodes_only: t
 remove_nodes: t

 query: (NP*=*0000* exists)
 AND (IP-MAT* iDominates !NP-SE*)
//

5.2 Queries targeting referents

(11) Searching all projections of a specific referent

// Query: #thisref.q

 begin_remark:
  List all occurrences of a specific reference. 
  For each application, =0000 must be changed in the query.
 end_remark

 define: port.def
 print_indices: t

 node:  NP*|WNP*
 nodes_only: t
 remove_nodes: t

 query: (NP*=0000* exists)
 OR (WNP*=0000* exists)
//


