3.3 Linguistic annotation

Preliminary Version,
December 2014 

Plan for Year 1


Syntax and information structure
in the first 16th century Portuguese
narrative about Brazil

University of York,
United Kingdom

Host supervisor:
Prof. Susan Pintzuk

Grant for research abroad,
FAPESP – São Paulo Research Foundation
(see further data at Fapesp’s repository)

The methodology for the computational treatment of this Project's corpus is well developed as regards the philological edition and syntactic annotation systems. The aim of the first year of research is to develop an annotation system for information structure, in a framework compatible with the other annotation modules.

This goal will be pursued during an academic visit to the Department of Linguistics of the University of York, from April 2015 to February 2016.

In this section, I briefly outline the directions to be taken in developing this information structure annotation, and in particular justify the decision to build it on top of the already mature annotations.


3.3.1 Syntax and information structure: rationale behind an overlaying approach

The most evident advantage of building a system in which information structure annotation overlays the existing syntactic annotation concerns the main linguistic question that the Project aims to explore: the relation between syntax and information structure. As described in the Overview, the main hypothesis of this Project is that in Classical Portuguese, syntactic constituents are fronted in the clause when they are discursively prominent, and that this can be analyzed by observing the alternation of referents in narrative and descriptive sequences in written texts. From this perspective, the advantages of grounding the analysis of the relations between syntax and information structure on the syntactic annotation already available for these texts are very clear.

This can be shown by recalling some of the intuitive analyses that underlie the Project's hypothesis, as regards both the interactions among constituents in first position and the interaction between first-position constituents and null subjects. Essentially, I have argued that in Classical Portuguese texts the alternation of referents follows a pattern in which newly mentioned referents appear as first-position constituents, independently of their syntactic role. Example (6) in part 2 illustrated this with the sequence “Essa cobra he muito formosa, a cabeça tem vermelha, branca e preta, e assi todo o corpo. Esta he a mais peçonhenta de todas, anda de vagar, e vive em as brenhas da terra” (loosely, ‘This snake is very pretty, its head is red, white and black, and thus the whole body. This is the most venomous of all, it moves slowly, and it lives in deep holes in the ground’), where the first-position constituents [esta cobra]/[a cabeça]/[esta] correspond to a clear alternation of referents.

At that point, however, I did not address the syntactic constituents in this sequence that are also involved in the alternation of reference but do not correspond to lexical items, i.e., null subjects. Example (16) below gives a more complete view of the referents in this same sequence, now including null subjects, marked as [ø], with indices (i) and (ii) for the two basic referents, “snake” and “snake's head”:


(16)

[Esta cobra]-i          he        muito formosa,
This snake              be-3PS    very pretty,
[a cabeça]-ii   [ø]-i   tem       vermelha, branca e preta, e assi todo o corpo.
the head                have-3PS  red, white and black, and thus whole the body.
[Esta]-i                he        a mais peçonhenta de todas,
This                    be-3PS    the most venomous of all,
                [ø]-i   anda      de vagar, e
                        move-3PS  of slow, and
                [ø]-i   vive      em as gretas da terra
                        live-3PS  in the hollows of-the earth

There is, therefore, the following “chain of referents” in the sentences above (including lexical items and null subjects): [esta cobra]-i/[a cabeça]-ii/[ø]-i/[esta]-i/[ø]-i/[ø]-i. We can now notice that while lexical constituents in first position correspond to alternating referents, the referents of the null subjects form a continuous “chain”. This is clear in the second sentence, “a cabeça [ø] tem vermelha, branca e preta”, where the subject [ø] has the same referent as the subjects of all the other sentences, i.e., ‘snake’, while the constituent in first position, [a cabeça] (‘the head’), has a different referent from the first-position constituents of both the preceding and the following sentences.

Describing the complete chain of referents in the constructions, including null subjects, has the added advantage of allowing us to analyze not only relevant aspects of Classical Portuguese grammar, but also their contrasts with Brazilian Portuguese. This is particularly relevant given that the literature on the syntax of Brazilian Portuguese has shown that null subjects are severely restricted to specific syntactic configurations in this grammar (see Modesto, 2000, among others), and that, as suggested in part 2, our hypothesis is that this restriction is not active in the licensing of null subjects in Classical Portuguese. An annotation that makes explicit the chain of references for subjects in the texts, combined with an annotation in which all null subjects have been marked, would make it possible to investigate the prediction that, in contrast to Brazilian Portuguese, the referential interpretation of Classical Portuguese null subjects does not depend at all on the previous sentential configuration.
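The observation about the chain in (16) can be restated as a small check. The following is only an illustrative sketch in Python: the `Mention` record and the helper names are hypothetical and are not part of the Project's annotation tools, and the indices are transcribed by hand from example (16).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    form: str      # surface form, or "ø" for a null subject
    referent: str  # referential index: "i" = 'snake', "ii" = "snake's head"
    null: bool     # True for null subjects


# Chain of referents of example (16), in textual order (hand-transcribed).
chain = [
    Mention("esta cobra", "i", False),
    Mention("a cabeça", "ii", False),
    Mention("ø", "i", True),
    Mention("esta", "i", False),
    Mention("ø", "i", True),
    Mention("ø", "i", True),
]


def referents(mentions, null):
    """Referential indices of the lexical (null=False) or null (null=True) mentions."""
    return [m.referent for m in mentions if m.null == null]


def alternates(indices):
    """True if no two consecutive indices are identical."""
    return all(a != b for a, b in zip(indices, indices[1:]))


def continuous(indices):
    """True if all indices are identical."""
    return len(set(indices)) <= 1


lexical = referents(chain, null=False)  # first-position lexical constituents
nulls = referents(chain, null=True)     # null subjects

print("lexical constituents alternate:", alternates(lexical))
print("null subjects form a continuous chain:", continuous(nulls))
```

Keeping lexical and null mentions in a single ordered list mirrors the chain as written above, so that either subset can be inspected without losing the textual order.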

3.3.2 A draft of the proposed system

Taking these briefly discussed aspects into account, the starting point for the annotation system to be developed in this Project will be a technique aimed, fundamentally, at making explicit the chain of referents formed by arguments in the texts.

This will build on the syntactic annotation already developed for these texts in the context of the Tycho Brahe Parsed Corpus of Historical Portuguese, in which the syntactic functions of all constituents (lexical and null) are indicated. The syntactic annotation system used in the Tycho Brahe Corpus is an adaptation, for Portuguese, of the system devised for the Penn-Helsinki Parsed Corpus of Middle English (Kroch & Taylor, 2000), based on the parser developed by Bikel (2004). The annotation is described in some detail in Paixão de Sousa (2014); here I only sketch its main features as regards the annotation of subjects.

The first point to notice is that the automatic procedures at present include neither the identification of the function “subject” (or any other syntactic function) nor the identification of empty categories: both are identified by a human researcher and manually coded into the annotation. To show what this coding looks like, I give a short example of an annotated sequence from the text by Magalhães de Gandavo (Gandavo, 1576), part of the paragraph shown as example (11) in part 2, concerning the “Tatu” (‘armadillo’): “Tem um rabo comprido todo coberto do mesmo casco : o focinho é como de leitão, ainda que mais delgado algum tanto, e não bota mais fora do casco que a cabeça” (‘It has a long tail all covered in the same shell: its snout is like a piglet's, albeit somewhat thinner, and it does not put out from the shell anything but its head’), glossed in (17a) below:

(17) (a)

Tem       um  rabo  comprido  todo  coberto  do      mesmo  casco  :
Have-3PS  a   tail  long      all   covered  of-the  same   shell  :

o    focinho  é       como  de  leitão,  ainda   que   mais  delgado  algum  tanto,
the  snout    be-3PS  like  of  piglet,  albeit  that  more  thin     some   much,

e    não  bota     mais  fora  do      casco  que   a    cabeça
and  not  put-3PS  more  out   of-the  shell  than  the  head

In (17b) below, where the three matrix clauses of this sequence are annotated as IP-MAT, notice the annotation of the three matrix subjects: the first subject, null, is annotated as (NP-SBJ *pro*), where *pro* stands for a null referential pronoun; the second subject, lexical (o focinho, ‘the snout’), is annotated as (NP-SBJ (D o) (N focinho)); the third subject, again null, is annotated as (NP-SBJ *pro*):

(17) (b)
[Image 833: parsed annotation of the sequence in (17a)]

To this annotation, indications of the referents for each of the subjects could be added – for instance by simply marking each of the categories (NP-SBJ …) with an index (i), (ii), etc., as roughly shown in (18):

[Image 834: the annotation in (17b) with a referential index added to each NP-SBJ node]
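As a rough illustration of the indexing step, the sketch below adds one index to each NP-SBJ label in a bracketed annotation. It rests on several assumptions: the function `index_subjects` is hypothetical, the suffix notation `NP-SBJ-i` is only one possible way of writing the index and is not necessarily the notation used in (18), and the input reproduces only the subject nodes quoted above, with `...` standing in for the rest of each clause.

```python
import re

# Simplified input: only the NP-SBJ nodes come from the annotation in (17b);
# "..." stands in for the remainder of each clause, which is not reproduced here.
annotation = "\n".join([
    "(IP-MAT (NP-SBJ *pro*) ...)",
    "(IP-MAT (NP-SBJ (D o) (N focinho)) ...)",
    "(IP-MAT (NP-SBJ *pro*) ...)",
])

SUBJECT_LABEL = re.compile(r"\(NP-SBJ\b")


def index_subjects(text, indices):
    """Append one referential index to each NP-SBJ label, in textual order.

    The suffix format (NP-SBJ-i) is an assumption made for illustration.
    """
    found = len(SUBJECT_LABEL.findall(text))
    if found != len(indices):
        raise ValueError(f"{found} subjects found, {len(indices)} indices given")
    remaining = iter(indices)
    return SUBJECT_LABEL.sub(lambda m: f"{m.group(0)}-{next(remaining)}", text)


# Indices assigned by the human annotator: 'armadillo' = i, 'the snout' = ii.
indexed = index_subjects(annotation, ["i", "ii", "i"])
print(indexed)
```

The indices themselves are supplied by hand, in line with the manual identification of subjects and empty categories described above; the function only records them in the annotation.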

This very simple annotation would add two important pieces of information to what had been codified in the original annotation in (17).

The first aspect concerns the referents of null subjects. The annotation now shows that the referent of the first subject, *pro* (i.e., ‘armadillo’), is different from the referent of the second subject, o focinho, ‘the snout’; and that the referent of the third subject, *pro*, is the same as that of the first but (more importantly) different from that of the second. This would be important for comparative studies since, in the light of what the literature and our own intuition show, such a configuration would be ungrammatical in Brazilian Portuguese: in this grammar, the third subject would only be licensed as *pro* if it were co-indexed with the previous subject (o focinho). This one piece of information, therefore, would already constitute very good grounds for a comparative analysis of the two grammars. By extending this coding of referents to the whole text, we could run searches for “all constructions with pre-verbal lexical subjects” combined with “all constructions with null referential subjects”, and check the hypothesis, so far based on a merely intuitive reading of the text, that null subjects are consistently licensed in this text with no configurational restrictions at all.

The second aspect that this simple annotation makes explicit concerns the referents of lexical pre-verbal constituents (subjects and others). As we saw, the referent of the pre-verbal constituent o focinho (‘the snout’) contrasts with both the preceding and the following referent in the chain, i.e., with the referent of the two *pro* subjects, namely ‘armadillo’. In other words, the referent of the second clause interrupts, so to speak, a harmonious sequence of references: ‘armadillo’ / ‘the snout’ / ‘armadillo’. According to our main hypothesis, this is the reason why it appears in pre-verbal position. If we annotate the whole text following this basic idea, and then search for sequences of “constructions with pre-verbal lexical subjects or complements”, we would be able to examine whether pre-verbal constituents in Classical Portuguese do in fact correspond, always and consistently, to referents that differ from the last mentioned referent in a chain, as we propose on the basis of intuitive reading. This simple annotation of referents for main-clause arguments would therefore be a useful basis for verifying the hypothesis of ‘left prominence’, the core idea in this Project as regards the grammar of Classical Portuguese.
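The kind of search described here can be sketched as follows. This is a minimal illustration only: the `Subject` record and the function name are hypothetical, the data are the three matrix subjects of (17) transcribed by hand, and an actual search would run over the indexed corpus annotation rather than over a hand-built list.

```python
from typing import NamedTuple


class Subject(NamedTuple):
    form: str      # "*pro*" for a null subject, otherwise the lexical form
    referent: str  # referential index
    null: bool     # True for null subjects


# Matrix subjects of example (17), in textual order (hand-transcribed).
subjects_17 = [
    Subject("*pro*", "i", True),        # 'armadillo'
    Subject("o focinho", "ii", False),  # 'the snout'
    Subject("*pro*", "i", True),        # 'armadillo' again
]


def null_subjects_switching_reference(subjects):
    """Positions of null subjects whose referent differs from the previous subject's.

    The first subject is skipped because it has no predecessor in the excerpt.
    """
    hits = []
    for position in range(1, len(subjects)):
        previous, current = subjects[position - 1], subjects[position]
        if current.null and current.referent != previous.referent:
            hits.append(position)
    return hits


switching = null_subjects_switching_reference(subjects_17)
print(switching)  # zero-based positions in subjects_17
```

Each position returned marks a null subject that is not co-indexed with the immediately preceding subject, which is the configuration described above as unavailable in Brazilian Portuguese.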

3.3.3 Final remarks

Helpful as it would be for the two goals highlighted thus far, however, the annotation outlined here is not yet an annotation of information structure: it is, at best, an annotation of referential chains. It does not, for instance, encode the status of left-fronted constituents as foci or topics, let alone specific sub-types of foci or topics (contrastive, familiar, etc.) according to any typology.

We take this simple annotation of referents as a basis on which a more sophisticated and theoretically oriented annotation for information structure may be developed. The idea is to use a first version of the annotation, in the simple model shown in (18) above, to run searches and initial analyses that would provide the empirical grounds for developing a complete annotation of information structure categories.

We consider this an appropriate sequence of procedures, in particular given that we know of no other techniques for encoding information structure based on the Tycho Brahe/Penn-Helsinki framework for syntactic annotation. The experimental character of this development thus seems to call for a trial-and-error approach.

February, 2015
