This document describes PC-PATR, an implementation of the PATR-II computational linguistic formalism (plus a few enhancements) for personal computers. It is available for MS-DOS, Microsoft Windows, Macintosh, and Unix.(1)
PC-PATR uses a left corner chart parser with these characteristics:
PC-PATR is still under development. The author would appreciate feedback directed to the following address:
Stephen McConnel                 (972)708-7361 (office)
Language Software Development    (972)708-7561 (fax)
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236                 steve@acadcomp.sil.org
U.S.A.                        or Stephen_McConnel@sil.org
The PATR-II formalism can be viewed as a computer language for encoding linguistic information. It does not presuppose any particular theory of syntax. It was originally developed by Stuart M. Shieber at Stanford University in the early 1980's (Shieber 1984, Shieber 1986). A PATR-II grammar consists of a set of rules and a lexicon. Each rule consists of a context-free phrase structure rule and a set of feature constraints, that is, unifications on the feature structures associated with the constituents of the phrase structure rules. The lexicon provides the items that can replace the terminal symbols of the phrase structure rules, that is, the words of the language together with their relevant features.
Context-free phrase structure rules should be familiar to anyone who has studied either linguistic theory or computer science. They look like this:
LHS -> RHS_1 RHS_2 ...
`LHS' (the symbol to the left of the arrow) is a nonterminal symbol for the type of phrase that is being described. To the right of the arrow is an ordered list of the constituents of the phrase. These constituents are either nonterminal symbols, appearing on the left hand side of some rule in the grammar, or terminal symbols, representing basic classes of elements from the lexicon. These basic classes usually correspond to what are commonly called parts of speech. In PATR-II, the terminal and nonterminal symbols are both referred to as categories.
Figure 1. Context-free phrase structure grammar

Rule S       -> NP VP (SubCl)
Rule NP      -> {(Det) (AdjP) N (PrepP)} / PR
Rule Det     -> DT / PR
Rule VP      -> VerbalP (NP / AdjP) (AdvP)
Rule VerbalP -> V
Rule VerbalP -> AuxP V
Rule AuxP    -> AUX (AuxP_1)
Rule PrepP   -> PP NP
Rule AdjP    -> (AV) AJ (AdjP_1)
Rule AdvP    -> {AV / PrepP} (AdvP_1)
Rule SubCl   -> CJ S
Consider the PC-PATR style context-free phrase structure grammar in figure 1. It has ten nonterminal symbols (S, NP, Det, VP, VerbalP, AuxP, PrepP, AdjP, AdvP, and SubCl), and nine terminal symbols (N, PR, DT, V, AUX, PP, AV, AJ, and CJ). This grammar describes a small subset of English sentences. Several aspects of this grammar are worth mentioning.
_
) character.
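The distinction between nonterminal and terminal symbols can be checked mechanically. The following Python sketch is only an illustration and is not part of PC-PATR: it lists the symbols of the figure 1 grammar (with the optional and alternative markers and the index suffixes such as _1 left out, and with the names RULES and classify invented for the example) and treats every symbol that never appears on a left hand side as a terminal.

```python
# Each rule is (LHS, symbols on the RHS). Parentheses, braces, slashes,
# and index suffixes from figure 1 are dropped; only symbol names matter.
RULES = [
    ("S", ["NP", "VP", "SubCl"]),
    ("NP", ["Det", "AdjP", "N", "PrepP", "PR"]),
    ("Det", ["DT", "PR"]),
    ("VP", ["VerbalP", "NP", "AdjP", "AdvP"]),
    ("VerbalP", ["V"]),
    ("VerbalP", ["AuxP", "V"]),
    ("AuxP", ["AUX", "AuxP"]),
    ("PrepP", ["PP", "NP"]),
    ("AdjP", ["AV", "AJ", "AdjP"]),
    ("AdvP", ["AV", "PrepP", "AdvP"]),
    ("SubCl", ["CJ", "S"]),
]


def classify(rules):
    """Split the symbols of a grammar into nonterminals and terminals."""
    nonterminals = {lhs for lhs, _ in rules}
    used = {symbol for _, rhs in rules for symbol in rhs}
    terminals = used - nonterminals
    return nonterminals, terminals


nonterminals, terminals = classify(RULES)
print(len(nonterminals), sorted(nonterminals))
print(len(terminals), sorted(terminals))
```

Run on this grammar, the sketch reports ten nonterminal symbols and nine terminal symbols, in agreement with the counts given above.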
Figure 2. Parse of sample English sentence

               S
               /\
              /  \
             /    \
            /      \
           /        \
          /          \
         /            \
       NP              VP
       /\              /|\
      /  \            / | \
     /    \          /  |  \
   Det     N  VerbalP  NP   AdvP
    |      |     |      |     |
   DT     man    V     PR   PrepP
    |            |      |     /\
   the          sees   us    /  \
                            /    \
                          PP      NP
                           |      /\
                         with    /  \
                                /    \
                             Det      N
                              |       |
                             DT   telescope
                              |
                              a
Figure 3. Parse of sample sentence (PC-PATR output)

                S
      __________|__________
      NP                  VP
   ___|____      _________|__________
  Det     N   VerbalP     NP      AdvP
   |     man     |        |         |
  DT             V        PR      PrepP
  the           sees      us   _____|______
                              PP          NP
                             with     ____|_____
                                     Det       N
                                      |    telescope
                                     DT
                                      a
A significant amount of grammar development can be done just with context-free phrase structure rules such as these. For example, parsing the sentence "the man sees us with a telescope" with this simple grammar produces a parse tree like that shown in figure 2. (In order to minimize the height of parse trees without needing to use a graphical interface, PC-PATR actually draws parse trees like the one shown in figure 3.) Parsing the similar sentence "we see the man with a telescope" produces two different parses as shown in figure 4, correctly showing the ambiguity between whether we used a telescope to see the man, or the man had a telescope when we saw him.
Figure 4. Parses of an ambiguous English sentence

           S_1
  __________|__________
NP_2+                VP_4
  |       _____________|_____________
PR_3+ VerbalP_5+     NP_7        AdvP_11
 we       |         ___|____        |
         V_6+    Det_8+  N_10+  PrepP_12+
         see        |     man  _____|______
                  DT_9+     PP_13+     NP_14+
                   the       with     ____|_____
                                   Det_15+   N_17+
                                      |    telescope
                                   DT_16+
                                      a

        S_18
  _______|________
NP_2+           VP_19
  |       ________|________
PR_3+ VerbalP_5+        NP_20
 we       |      _________|__________
         V_6+ Det_8+    N_10+   PrepP_12+
         see     |       man   _____|______
               DT_9+        PP_13+     NP_14+
                the          with     ____|_____
                                   Det_15+   N_17+
                                      |    telescope
                                   DT_16+
                                      a
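The ambiguity described above can be reproduced with a very small parse counter. The Python sketch below is not how PC-PATR works internally (it is a plain memoized top-down counter, not a left corner chart parser), and the RULES table is a hand-simplified fragment of figure 1 with the optional elements written out as separate rules. The LEXICON table and the function name count_parses are likewise assumptions made for the illustration.

```python
from functools import lru_cache

# Hand-expanded fragment of the figure 1 grammar (an assumption of this
# sketch): optional elements are written out as separate alternatives.
RULES = {
    "S": (("NP", "VP"),),
    "NP": (("PR",), ("Det", "N"), ("Det", "N", "PrepP")),
    "Det": (("DT",),),
    "VP": (("VerbalP", "NP"), ("VerbalP", "NP", "AdvP")),
    "VerbalP": (("V",),),
    "AdvP": (("PrepP",),),
    "PrepP": (("PP", "NP"),),
}

# Hypothetical word list giving the category of each word.
LEXICON = {
    "we": ("PR",), "us": ("PR",), "he": ("PR",),
    "see": ("V",), "sees": ("V",),
    "the": ("DT",), "a": ("DT",),
    "man": ("N",), "telescope": ("N",),
    "with": ("PP",),
}


def count_parses(words, start="S"):
    """Count the distinct parse trees for a list of words."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        # Number of ways symbol can cover words[i:j].
        total = 0
        if j - i == 1 and symbol in LEXICON.get(words[i], ()):
            total += 1
        for rhs in RULES.get(symbol, ()):
            total += count_seq(rhs, i, j)
        return total

    @lru_cache(maxsize=None)
    def count_seq(rhs, i, j):
        # Number of ways the symbol sequence rhs can cover words[i:j].
        if not rhs:
            return 1 if i == j else 0
        first, rest = rhs[0], rhs[1:]
        total = 0
        # Every constituent must cover at least one word.
        for k in range(i + 1, j - len(rest) + 1):
            total += count(first, i, k) * count_seq(rest, k, j)
        return total

    return count(start, 0, len(words))


print(count_parses("the man sees us with a telescope".split()))  # 1
print(count_parses("we see the man with a telescope".split()))   # 2
print(count_parses("he see the man with a telescope".split()))   # 2
```

The last line shows the weakness of categories alone: the ungrammatical sentence receives the same two analyses as the grammatical one, because nothing in these rules mentions agreement.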
A fundamental problem with context-free phrase structure grammars is that they tend to grossly overgenerate. For example, the sample grammar would incorrectly recognize the sentence "*he see the man with a telescope", assigning it tree structures similar to those shown in figure 4. With only the simple categories used by context-free phrase structure rules, a very large number of rules are required to accurately handle even a small subset of a language's grammar. This is the primary motivation behind feature structures, the basic enhancement of PATR-II over context-free phrase structure grammars.(2)
The basic data structure of the PATR-II formalism is called a feature structure. A feature structure contains one or more features. A feature consists of an attribute name and a value. Feature structures are commonly written as attribute-value matrices like this (example 1):
(1) [ lex: telescope
      cat: N ]
where lex and cat are attribute names, and telescope and N are the values for those attributes. Note that the feature structure is enclosed in brackets. Each feature occurs on a separate line, with the name coming first, followed by a colon and then its value. Feature names and (simple) values are single words consisting of alphanumeric characters.
Feature structures can have either simple values, such as the example above, or complex values, such as this (example 2):
(2) [ lex:      telescope
      cat:      N
      gloss:    `telescope
      head:     [ agr:    [ 3sg: + ]
                  number: SG
                  pos:    N
                  proper: -
                  verbal: - ]
      root_pos: N ]
where the value of the head feature is another feature structure, that also contains an embedded feature structure. Feature structures can be arbitrarily nested in this manner.
Portions of a feature structure can be referred to using the path notation. A path is a sequence of one or more feature names enclosed in angle brackets (<>). For instance, examples 3-5 would all be valid feature paths based on the feature structure of example 2:
(3) <head>
(4) <head number>
(5) <head agr 3sg>
Paths are used in feature templates and feature constraints, described below.
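As an illustration of how a path picks out part of a feature structure, the following Python sketch stores example 2 as nested dictionaries and follows a path one feature name at a time. The dictionary representation and the function name follow are assumptions made for this sketch; they say nothing about how PC-PATR stores feature structures.

```python
# Example 2 written as nested Python dictionaries.
TELESCOPE = {
    "lex": "telescope",
    "cat": "N",
    "gloss": "`telescope",
    "head": {
        "agr": {"3sg": "+"},
        "number": "SG",
        "pos": "N",
        "proper": "-",
        "verbal": "-",
    },
    "root_pos": "N",
}


def follow(fs, path):
    """Return the value found at a path such as '<head agr 3sg>'.

    Returns None if the path does not exist in the feature structure.
    """
    value = fs
    for name in path.strip("<>").split():
        if not isinstance(value, dict) or name not in value:
            return None
        value = value[name]
    return value


print(follow(TELESCOPE, "<head number>"))   # SG
print(follow(TELESCOPE, "<head agr 3sg>"))  # +
print(follow(TELESCOPE, "<head case>"))     # None
```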
Different features within a feature structure can share values. This is not the same thing as two features having identical values. In example 6 below, the <pred head agr> and <subj head agr> features have identical values, but in example 7, they share the same value:
(6) [ cat:  S
      pred: [ cat:  VP
              head: [ agr:    [ 3sg: + ]
                      finite: +
                      pos:    V
                      tense:  PAST
                      vform:  ED ] ]
      subj: [ cat:  NP
              head: [ agr:    [ 3sg: + ]
                      case:   NOM
                      number: SG
                      pos:    N
                      proper: -
                      verbal: - ] ] ]

(7) [ cat:  S
      pred: [ cat:  VP
              head: [ agr:    $1[ 3sg: + ]
                      finite: +
                      pos:    V
                      tense:  PAST
                      vform:  ED ] ]
      subj: [ cat:  NP
              head: [ agr:    $1[ 3sg: + ]
                      case:   NOM
                      number: SG
                      pos:    N
                      proper: -
                      verbal: - ] ] ]
Shared values are indicated by the coindexing markers $1, $2, and so on.
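The practical difference between identical and shared values shows up as soon as more information is added. In the Python sketch below (an analogy only, using nested dictionaries and invented variable names), two equal but separate dictionaries play the role of example 6, and a single dictionary reachable by two paths plays the role of the $1 value in example 7.

```python
# Example 6: two separate values that happen to be equal.
agr_pred = {"3sg": "+"}
agr_subj = {"3sg": "+"}
identical = {
    "pred": {"head": {"agr": agr_pred}},
    "subj": {"head": {"agr": agr_subj}},
}

# Example 7: one value reachable by two paths (the $1 marker).
agr_shared = {"3sg": "+"}
shared = {
    "pred": {"head": {"agr": agr_shared}},
    "subj": {"head": {"agr": agr_shared}},
}

# Add information through the <subj head agr> path only.
identical["subj"]["head"]["agr"]["person"] = "third"
shared["subj"]["head"]["agr"]["person"] = "third"

# With identical values, <pred head agr> is unaffected.
print("person" in identical["pred"]["head"]["agr"])  # False
# With a shared value, <pred head agr> sees the new feature too.
print("person" in shared["pred"]["head"]["agr"])     # True
```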
Note that upper and lower case letters used in feature names and values are distinctive. For example, NUMBER is not the same as Number or number. (This is also true of the symbols used in the context-free phrase structure rules.)
Unification is the basic operation applied to feature structures in PC-PATR. It consists of the merging of the information from two feature structures. Two feature structures can unify if their common features have the same values, but do not unify if any feature values conflict.
Consider the following feature structures:
(8)  [ agreement: [ number: singular
                    person: first ] ]

(9)  [ agreement: [ number: singular ]
       case:      nominative ]

(10) [ agreement: [ number: singular
                    person: third ] ]

(11) [ agreement: [ number: singular
                    person: first ]
       case:      nominative ]

(12) [ agreement: [ number: singular
                    person: third ]
       case:      nominative ]
Feature structure 9 can unify with either feature structure 8 (producing feature structure 11) or feature structure 10 (producing feature structure 12). However, feature structure 8 cannot unify with feature structure 10 due to the conflict in the values of their <agreement person> features.
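The behavior just described can be sketched in a few lines of Python. This is a simplified illustration, not PC-PATR's implementation: feature structures are nested dictionaries, atomic values are strings, shared (coindexed) values are ignored, and failure is reported by returning None. The function name unify and the variable names are chosen for the example.

```python
def unify(a, b):
    """Return the unification of two feature structures, or None on conflict.

    Nested dictionaries stand for feature structures and strings for
    atomic values. The arguments are left unmodified.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for name, b_value in b.items():
            if name in result:
                merged = unify(result[name], b_value)
                if merged is None:
                    return None  # conflicting values somewhere below
                result[name] = merged
            else:
                result[name] = b_value  # information only b supplies
        return result
    # Atomic values (or an atom against a structure) must match exactly.
    return a if a == b else None


fs8 = {"agreement": {"number": "singular", "person": "first"}}
fs9 = {"agreement": {"number": "singular"}, "case": "nominative"}
fs10 = {"agreement": {"number": "singular", "person": "third"}}

print(unify(fs9, fs8))   # same content as feature structure 11
print(unify(fs9, fs10))  # same content as feature structure 12
print(unify(fs8, fs10))  # None: <agreement person> conflicts
```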
The feature constraints associated with phrase structure rules in PATR-II consist of a set of unification expressions (the unification constraints). Each unification expression has three parts, in this order:
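1. a feature path, the first element of which is one of the symbols from the phrase structure rule
2. an equal sign (=)
3. either a simple value, or another feature path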
As an example, consider the following PC-PATR rules:
(13) Rule S -> NP VP (SubCl)
          <NP head agr> = <VP head agr>
          <NP head case> = NOM
          <S subj> = <NP>
          <S head> = <VP head>

(14) Rule NP -> {(Det) (AJ) N (PrepP)} / PR
          <Det head number> = <N head number>
          <NP head> = <N head>
          <NP head> = <PR head>
Rule 13 has two feature constraints that limit the co-occurrence of NP and VP, and two feature constraints that build the feature structures for S. This highlights the dual purpose of feature constraints in PC-PATR: limiting the co-occurrence of phrase structure elements and constructing the feature structure for the element defined by a rule. The first constraint states that the NP and VP <head agr> features must unify successfully, and also modifies both of those features if they do unify. The second constraint states that NP's <head case> feature must either be equal to NOM or else be undefined. In the latter case, it is set equal to NOM. The last two constraints create a new feature structure for S from the feature structures for NP and VP.
Rule 14 illustrates another important point about feature unification constraints: they are applied only if they involve the phrase structure constituents actually found for the rule.
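One way to picture this, offered only as a sketch, is to filter the constraints of rule 14 by the constituents that were actually found. The Python below assumes that a constraint "involves" a constituent when one of its paths begins with that constituent's symbol; the function name applicable and this reading of the rule are assumptions of the example, not a description of PC-PATR's code.

```python
import re

RULE_14_CONSTRAINTS = [
    "<Det head number> = <N head number>",
    "<NP head> = <N head>",
    "<NP head> = <PR head>",
]


def applicable(constraints, lhs, found):
    """Keep the constraints whose paths all start with the rule's left
    hand side symbol or with a constituent that was actually found."""
    present = {lhs} | set(found)
    kept = []
    for constraint in constraints:
        # The first name inside each pair of angle brackets is the symbol.
        roots = re.findall(r"<\s*([^\s>]+)", constraint)
        if all(root in present for root in roots):
            kept.append(constraint)
    return kept


# NP realized as "Det N": the first two constraints apply.
print(applicable(RULE_14_CONSTRAINTS, "NP", ["Det", "N"]))
# NP realized as a bare N: only <NP head> = <N head> applies.
print(applicable(RULE_14_CONSTRAINTS, "NP", ["N"]))
# NP realized as a pronoun: only <NP head> = <PR head> applies.
print(applicable(RULE_14_CONSTRAINTS, "NP", ["PR"]))
```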
Figure 5. PC-PATR grammar of English subset

Rule S -> NP VP (SubCl)
        <NP head agr> = <VP head agr>
        <NP head case> = NOM
        <S subj> = <NP>
        <S pred> = <VP>
Rule NP -> {(Det) (AdjP) N (PrepP)} / PR
        <Det head number> = <N head number>
        <NP head> = <N head>
        <NP head> = <PR head>
Rule Det -> DT / PR
        <PR head case> = GEN
        <Det head> = <DT head>
        <Det head> = <PR head>
Rule VP -> VerbalP (NP / AdjP) (AdvP)
        <NP head case> = ACC
        <NP head verbal> = -
        <VP head> = <VerbalP head>
Rule VerbalP -> V
        <V head finite> = +
        <VerbalP head> = <V head>
Rule VerbalP -> AuxP V
        <V head finite> = -
        <VerbalP head> = <AuxP head>
Rule AuxP -> AUX (AuxP_1)
        <AuxP head> = <AUX head>
Rule PrepP -> PP NP
        <NP head case> = ACC
        <PrepP head> = <PP head>
Rule AdjP -> (AV) AJ (AdjP_1)
Rule AdvP -> {AV / PrepP} (AdvP_1)
Rule SubCl -> CJ S
Figure 6. PC-PATR output with feature structure

1:
                S
      __________|__________
      NP                  VP
   ___|____      _________|__________
  Det     N   VerbalP     NP      AdvP
   |     man     |        |         |
  DT             V        PR      PrepP
  the           saw       us   _____|______
                              PP          NP
                             with     ____|_____
                                     Det       N
                                      |    telescope
                                     DT
                                      a

S:
[ cat:   S
  pred:  [ cat:   VP
           head:  [ agr:   $1[ 3sg:   + ]
                    finite:+
                    pos:   V
                    tense: PAST
                    vform: ED ] ]
  subj:  [ cat:   NP
           head:  [ agr:   $1[ 3sg:   + ]
                    case:  NOM
                    number:SG
                    pos:   N
                    proper:-
                    verbal:- ] ] ]

1 parse found
Figure 5 shows the grammar of figure 1 augmented with a number of feature constraints. With this grammar (and a suitable lexicon), the parse output shown in figure 2 would include the sentence feature structure, as shown in figure 6. Note that the <subj head agr> and <pred head agr> features share a common value as a result of the feature constraint unifications associated with the rule S -> NP VP (SubCl).
PC-PATR allows disjunctive feature unification constraints with its phrase structure rules. Consider rules 15 and 16 below. These two rules have the same phrase structure rule part. They can therefore be collapsed into the single rule 17, which has a disjunction in its unification constraints.
(15) Rule CP -> NP C'            ; for wh questions with NP fronted
          <NP type wh> = +
          <C' moved A-bar> = <NP>
          <CP type wh> = <NP type wh>
          <CP type> = <C' type>
          <CP moved A-bar> = none
          <CP type root> = +     ; root clauses
          <CP type q> = +
          <CP type fin> = +
          <CP moved A> = none
          <CP moved head> = none

(16) Rule CP -> NP C'            ; for wh questions with NP fronted
          <NP type wh> = +
          <C' moved A-bar> = <NP>
          <CP type wh> = <NP type wh>
          <CP type> = <C' type>
          <CP moved A-bar> = none
          <CP type root> = -     ; non-root clauses

(17) Rule CP -> NP C'            ; for wh questions with NP fronted
          <NP type wh> = +
          <C' moved A-bar> = <NP>
          <CP type wh> = <NP type wh>
          <CP type> = <C' type>
          <CP moved A-bar> = none
          { <CP type root> = +   ; root clauses
            <CP type q> = +
            <CP type fin> = +
            <CP moved A> = none
            <CP moved head> = none
          / <CP type root> = -   ; non-root clauses
          }
Not only does PC-PATR allow disjunctive unification constraints, but it also allows disjunctive phrase structure rules. Consider rule 18: it is very similar to rule 17. These two rules can be further combined to form rule 19, which has disjunctions in both its phrase structure rule and its unification constraints.
(18) Rule CP -> PP C'            ; for wh questions with PP fronted
          <PP type wh> = +
          <C' moved A-bar> = <PP>
          <CP type wh> = <PP type wh>
          <CP type> = <C' type>
          <CP moved A-bar> = none
          { <CP type root> = +   ; root clauses
            <CP type q> = +
            <CP type fin> = +
            <CP moved A> = none
            <CP moved head> = none
          / <CP type root> = -   ; non-root clauses
          }

(19) ; for wh questions with NP or PP fronted
     Rule CP -> { NP / PP } C'
          <NP type wh> = +
          <C' moved A-bar> = <NP>
          <CP type wh> = <NP type wh>
          <PP type wh> = +
          <C' moved A-bar> = <PP>
          <CP type wh> = <PP type wh>
          <CP type> = <C' type>
          <CP moved A-bar> = none
          { <CP type root> = +   ; root clauses
            <CP type q> = +
            <CP type fin> = +
            <CP moved A> = none
            <CP moved head> = none
          / <CP type root> = -   ; non-root clauses
          }
Since the open brace ({) introduces disjunctions both in the phrase structure rule and in the unification constraints, care must be taken to avoid confusing PC-PATR when it is loading the grammar file. The end of the phrase structure rule, and the beginning of the unification constraints, is signaled either by the first constraint beginning with an open angle bracket (<) or by a colon (:). If the first constraint is part of a disjunction, then the phrase structure rule must end with a colon. Otherwise, PC-PATR will treat the unification constraint as part of the phrase structure rule, and will shortly complain about syntax errors in the grammar file.
Note that disjunctions in phrase structure rules or unification constraints are expanded when the grammar file is read. They serve only as a convenience for the person writing the rules.
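The effect of such an expansion can be sketched as follows. The Python below is an illustration under stated assumptions, not PC-PATR's loader: a right hand side is given as a list of slots, each slot lists its alternatives, and an empty tuple stands for "nothing", which is how an optional element is encoded here. The function name expand is invented for the example.

```python
from itertools import product


def expand(lhs, slots):
    """Expand a rule with alternatives and optional elements into plain rules.

    Each slot is a list of alternatives; each alternative is a tuple of
    symbols. An empty tuple means the slot may be left out.
    """
    rules = []
    for choice in product(*slots):
        rhs = [symbol for alternative in choice for symbol in alternative]
        rules.append((lhs, rhs))
    return rules


# Rule CP -> { NP / PP } C'
for rule in expand("CP", [[("NP",), ("PP",)], [("C'",)]]):
    print(rule)

# Rule S -> NP VP (SubCl)
for rule in expand("S", [[("NP",)], [("VP",)], [("SubCl",), ()]]):
    print(rule)
```

Each disjunction or optional element multiplies the number of plain rules, so a compact rule such as rule 19 stands for several ordinary rules once both its phrase structure disjunction and its constraint disjunction are expanded.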
The lexicon provides the basic elements (atoms) of the grammar, which are usually words. Information like that shown in example 2 is provided for each lexicon entry. Unlike the original implementation of PATR-II, PC-PATR stores the lexicon in a separate file from the grammar rules. See section 6, The PC-PATR Lexicon File, below for details.
PC-PATR is an interactive program. It has a few command line options, but it is controlled primarily by commands typed at the keyboard (or loaded from a file previously prepared).
The PC-PATR program uses an old-fashioned command line interface following the convention of options starting with a dash character (`-'). The available options are listed below in alphabetical order. Those options which require an argument have the argument type following the option letter.
-a filename
-g filename
-l filename
-t filename
The following options exist only in beta-test versions of the program, since they are used only for debugging.
-/
-z filename
-Z address,count
address is allocated or freed for the count'th time.
Each of the commands available in PC-PATR is described below. Each command consists of one or more keywords followed by zero or more arguments. Keywords may be abbreviated to the minimum length necessary to prevent ambiguity.
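Abbreviation by unique prefix can be sketched as below. This Python fragment is a simplified illustration, not PC-PATR's command parser: the COMMANDS list contains only the first keywords described in this section, and the function name resolve is invented. Note that the real program evidently applies additional rules, since forms such as l g are documented as synonyms for load grammar even though both load and log begin with l.

```python
# First keywords of the commands described in this section (the real
# program may accept others).
COMMANDS = ["cd", "clear", "close", "directory", "edit", "exit", "file",
            "help", "load", "log", "parse", "quit", "save", "set"]


def resolve(keyword, candidates=COMMANDS):
    """Return the single command that keyword abbreviates.

    An exact match always wins. ValueError is raised when the
    abbreviation matches no command or more than one.
    """
    if keyword in candidates:
        return keyword
    matches = [name for name in candidates if name.startswith(keyword)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown command: {keyword}")
    raise ValueError(f"ambiguous command: {keyword} ({', '.join(matches)})")


print(resolve("q"))    # quit
print(resolve("cle"))  # clear
print(resolve("clo"))  # close
```

Under strict prefix matching, cl would be rejected as ambiguous between clear and close.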
cd directory
changes the current directory to the one specified. Spaces in the directory pathname are not permitted.
For MS-DOS or Windows, you can give a full path starting with the disk letter and a colon (for example, a:); a path starting with `\', which indicates a directory at the top level of the current disk; a path starting with `..', which indicates the directory above the current one; and so on. Directories are separated by the `\' character. (The forward slash `/' works just as well as the backslash `\' for MS-DOS or Windows.)
For the Macintosh, you can give a full path starting with the name of a hard disk, a path starting with `:', which means the current folder, or one starting with `::', which means the folder containing the current one (and so on).
For Unix, you can give a full path starting with a `/' (for example, /usr/pcpatr); a path starting with `..', which indicates the directory above the current one; and so on. Directories are separated by the `/' character.
clear
erases all existing grammar and lexicon information, allowing the user to prepare to load information for a new language. Strictly speaking, it is not needed since the load grammar command erases the previously existing grammar, and the load lexicon and load analysis commands erase any previously existing lexicon.
close
closes the current log file opened by a previous log command.
directory
lists the contents of the current directory. This command is available
only for the MS-DOS and Unix implementations. It does not exist for
Microsoft Windows or the Macintosh.
edit filename
attempts to edit the specified file using the program indicated by the environment variable EDITOR. If this environment variable is not defined, then edlin is used to edit the file on MS-DOS, and vi is used to edit the file on Unix. (These defaults should convince you to set this variable!) This command is not available for Microsoft Windows or the Macintosh.
exit
stops PC-PATR, returning control to the operating system. This is the same as quit.
The file commands process data from a file, optionally writing the parse results to another file. Each of these commands is described below.
file disambiguate input.ana [out.ana]
reads sentences from the specified AMPLE analysis file and writes the
corresponding parse trees and feature structures either to the screen
or to the optionally specified output file. If the output file is
written, ambiguous word parses are eliminated as much as possible as a
result of the sentence parsing. When finished, a statistical report of
successful (sentence) parses is displayed on the screen.
file parse input-file [output-file]
reads sentences from the specified input file, one per line, and writes the corresponding parse trees and feature structures to the screen or to the optionally specified output file. The comment character is in effect while reading this file. PC-PATR currently makes no attempt to handle either capitalization or punctuation. (Some capability for handling punctuation will probably be added at some point.)
This command behaves the same as parse except that input comes from a file rather than the keyboard, and output may go to a file rather than the screen. When finished, a statistical report of successful parses is displayed on the screen.
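The input side of this command can be pictured with the following Python sketch. It is only an approximation built on assumptions that the text above does not spell out: that a comment runs from the comment character to the end of the line, and that lines left empty are skipped. The function name read_sentences is invented for the example.

```python
def read_sentences(lines, comment=";"):
    """Yield one sentence per non-empty line, with comments removed.

    Assumes a comment runs from the comment character to the end of
    the line. Passing an empty comment string disables comment handling.
    """
    for line in lines:
        if comment:
            line = line.split(comment, 1)[0]
        line = line.strip()
        if line:
            yield line


sample = [
    "; sample sentences\n",
    "the man sees us with a telescope\n",
    "\n",
    "we see the man with a telescope ; ambiguous\n",
]
for sentence in read_sentences(sample):
    print(sentence)
```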
help command
displays a description of the specified command. If help is typed by itself, PC-PATR displays a list of commands with short descriptions of each command.
The load commands all load information stored in specially formatted files. The load ample and load kimmo commands activate morphological parsers, and serve as alternatives to load lexicon (or load analysis) for obtaining the category and other feature information for words. Each of the load commands is described below.
load ample control xxad01.ctl xxancd.tab [xxordc.tab]
erases any existing AMPLE information (including dictionaries) and reads control information from the specified files. This also erases any stored PC-Kimmo information.
At least two and possibly three files are loaded by this command. The first file is the AMPLE analysis data file. It has a default filetype extension of .ctl but no default filename. The second file is the AMPLE dictionary code table file. It has a default filetype extension of .tab but no default filename. The third file is an optional dictionary orthography change table. It has a default filetype extension of .tab and no default filename.
l am c is a synonym for load ample control.
load ample dictionary [prefix.dic] [infix.dic] [suffix.dic] root1.dic [...]
or
load ample dictionary file01.dic [file02.dic ...]
erases any existing AMPLE dictionary information and reads the specified files. This also erases any stored PC-Kimmo information.
The first form of the command is for using a dictionary whose files are divided according to morpheme type (set ample-dictionary split). The different types of dictionary files must be loaded in the order shown, with any unneeded affix dictionaries omitted.
The second form of the command is for using a dictionary whose entries contain the type of morpheme (set ample-dictionary unified).(3)
l am d is a synonym for load ample dictionary.
load ample text-control xxintx.ctl
erases any existing AMPLE text input control information and reads the specified file. This also erases any stored PC-Kimmo information.
The text input control file has a default filetype extension of .ctl but no default filename.
l am t is a synonym for load ample text-control.
load analysis file1.ana [file2.ana ...]
erases any existing lexicon and reads a new lexicon from the specified AMPLE analysis file(s). Note that more than one file may be loaded with the single load analysis command: duplicate entries are not stored in the lexicon.
The default filetype extension for load analysis is .ana, and the default filename is ample.ana.
l a is a synonym for load analysis.
load grammar file.grm
erases any existing grammar and reads a new grammar from the specified file.
The default filetype extension for load grammar is .grm, and the default filename is grammar.grm.
l g is a synonym for load grammar.
load kimmo grammar file.grm
erases any existing PC-Kimmo (word) grammar and reads a new word grammar from the specified file.
The default filetype extension for load kimmo grammar is .grm, and the default filename is grammar.grm.
l k g is a synonym for load kimmo grammar.
load kimmo lexicon file.lex
erases any existing PC-Kimmo lexicon information and reads a new morpheme lexicon from the specified file. A PC-Kimmo rules file must be loaded before a PC-Kimmo lexicon file can be loaded.
The default filetype extension for load kimmo lexicon is .lex, and the default filename is lexicon.lex.
l k l is a synonym for load kimmo lexicon.
load kimmo rules file.rul
erases any existing PC-Kimmo rules and reads a new set of rules from the specified file. This also erases any stored AMPLE information.
The default filetype extension for load kimmo rules is .rul, and the default filename is rules.rul.
l k r is a synonym for load kimmo rules.
load lexicon file1.lex [file2.lex ...]
erases any existing lexicon and reads a new lexicon from the specified file(s). Note that more than one file may be loaded with a single load lexicon command.
The default filetype extension for load lexicon is .lex, and the default filename is lexicon.lex.
l l is a synonym for load lexicon.
log [file.log]
opens a log file. Each item processed by a parse command is stored to the log file as well as being displayed on the screen.
If a filename is given on the same line as the log command, then that file is used for the log file. Any previously existing file with the same name will be overwritten. If no filename is provided, then the file pcpatr.log in the current directory is used for the log file.
Use close to stop recording in a log file. If a log command is given when a log file is already open, then the earlier log file is closed before the new log file is opened.
parse [sentence or phrase]
attempts to parse the input sentence according to the loaded grammar. If a sentence is typed on the same line as the command, then that sentence is parsed. If the parse command is given by itself, then the user is prompted repeatedly for sentences to parse. This cycle of typing and parsing is terminated by typing an empty "sentence" (that is, nothing but the Enter or Return key).
Both the grammar and the lexicon must be loaded before using this command.
quit
stops PC-PATR, returning control to the operating system. This is the same as exit.
The save commands write information stored in memory to a file suitable for reloading into PC-PATR later. Each of these commands is described below.
save lexicon file.lex
writes the current lexicon contents to the designated file. The output lexicon file must be specified. This can be useful if you are using a morphological parser to populate the lexicon.
save status [file.tak]
writes the current settings to the designated file in the form of PC-PATR commands. If the file is not specified, the settings are written to pcpatr.tak in the current directory.
The set commands control program behavior by setting internal program variables. Each of these commands (and variables) is described below.
set ambiguities number
limits the number of analyses printed to the given number. The default
value is 10. Note that this does not limit the number of analyses
produced, just the number printed.
set ample-dictionary value
determines whether or not the AMPLE dictionary files are divided according to morpheme type.
set ample-dictionary split declares that the AMPLE dictionary is divided into a prefix dictionary file, an infix dictionary file, a suffix dictionary file, and one or more root dictionary files. The existence of the three affix dictionaries depends on settings in the AMPLE analysis data file. If they exist, the load ample dictionary command requires that they be given in this relative order: prefix, infix, suffix, root(s).
set ample-dictionary unified declares that any of the AMPLE dictionary files may contain any type of morpheme. This implies that each dictionary entry may contain a field specifying the type of morpheme (the default is root), and that the dictionary code table contains a \unified field. One of the changes listed under \unified must convert a backslash code to T.
The default is for the AMPLE dictionary to be split.(4)
set check-cycles value
enables or disables a check to prevent cycles in the parse chart. set check-cycles on turns on this check, and set check-cycles off turns it off. This check slows down the parsing of a sentence, but it makes the parser less vulnerable to hanging on perverse grammars. The default setting is on.
set comment character
sets the comment character to the indicated value. If character is missing (or equal to the current comment character), then comment handling is disabled. The default comment character is `;' (semicolon).
set failures value
enables or disables grammar failure mode. set failures on turns on grammar failure mode, and set failures off turns it off. When grammar failure mode is on, the partial results of forms that fail the grammar module are displayed. A form may fail the grammar either by failing the feature constraints or by failing the constituent structure rules. In the latter case, a partial tree (bush) will be returned. The default setting is off.
Be careful with this option. Setting failures to on can cause PC-PATR to go into an infinite loop for certain recursive grammars and certain input sentences. (We may try to do something to detect this type of behavior, at least partially.)
set features value
determines how features will be displayed.
set features all enables the display of the features for all nodes of the parse tree.
set features top enables the display of the feature structure for only the top node of the parse tree. This is the default setting.
set features flat causes features to be displayed in a flat, linear string that uses less space on the screen.
set features full causes features to be displayed in an indented form that makes the embedded structure of the feature set clear. This is the default setting.
set features on turns on features display mode, allowing features to be shown. This is the default setting.
set features off turns off features display mode, preventing features from being shown.
set final-punctuation value
defines the set of characters used to mark the ends of sentences. The individual characters must be separated by spaces so that digraphs and trigraphs can be used, not just single character units. The default is `. ! ? : ;'.
This variable setting affects only the file disambiguate command.
set gloss value
enables the display of glosses in the parse tree output if value is on, and disables the display of glosses if value is off. If any glosses exist in the lexicon file, then gloss is automatically turned on when the lexicon is loaded. If no glosses exist in the lexicon, then this flag is ignored.
set kimmo check-cycles value
enables or disables a check to prevent cycles in a word parse chart created by the embedded PC-Kimmo morphological parser. set kimmo check-cycles on turns on this check, and set kimmo check-cycles off turns it off. This check slows down the parsing of a sentence, but it makes the parser less vulnerable to hanging on perverse grammars. The default setting is on.
set kimmo promote-defaults value
controls whether default atomic values in the feature structures loaded from the lexicon are "promoted" to ordinary atomic values before parsing a word with the embedded PC-Kimmo morphological parser. set kimmo promote-defaults on turns on this behavior, and set kimmo promote-defaults off turns it off. The default setting is on. (It is arguable that this is the wrong choice for the default, but this has been the behavior since the program was first written.)
set kimmo top-down-filter value
enables or disables top-down filtering in the embedded PC-Kimmo morphological parser, based on the morpheme categories. set kimmo top-down-filter on turns on this filtering, and set kimmo top-down-filter off turns it off. The top-down filter speeds up the parsing of a sentence, but might cause the parser to miss some valid parses. The default setting is on.
This should not be required in the final version of PC-PATR.
set limit number
sets the time limit (in seconds) for parsing a sentence. Its argument is a number greater than or equal to zero, which is the maximum number of seconds that a parse is allowed before being cancelled. The default value is 0, which has the special meaning that no time limit is imposed.
NOTE: this feature is new and still somewhat experimental. It may not be fully debugged, and may cause unforeseen side effects such as program crashes some time after one or more parses are cancelled due to exceeding the set time limit.
set marker category marker
establishes the marker for the field containing the category (part of speech) feature. The default is \c.
set marker features marker
establishes the marker for the field containing miscellaneous features. (This field is not needed for many words.) The default is \f.
set marker gloss marker
establishes the marker for the field containing the word gloss. The default is \g.
set marker record marker
establishes the field marker that begins a new record in the lexicon file. This may or may not be the same as the word marker. The default is \w.
set marker rootgloss marker
establishes the marker for the field containing the word rootgloss. The default is \r. The word's root gloss may be useful for handling syntactic constructions such as verb reduplication. One can write a unification constraint that ensures that the rootgloss unifies between two successive lexical items/terminal symbols. Note that this does not work when using Kimmo to parse words.
set marker word marker
establishes the marker for the word field. The default is \w.
set promote-defaults value
controls whether default atomic values in the feature structures loaded from the lexicon are "promoted" to ordinary atomic values before parsing a sentence. set promote-defaults on turns on this behavior, and set promote-defaults off turns it off. (This can affect feature unification since a conflicting default value does not cause a failure: the default value merely disappears.) The default setting is on. (It is arguable that this is the wrong choice for the default, but this has been the behavior since the program was first written.)
set property-is-feature
value controls whether the
values in the AMPLE analysis \p
(property) field are to be
interpreted as feature template names, the same as the values in the AMPLE
analysis \fd
(feature descriptor) field.
set property-is-feature on
turns on this behavior, and
set property-is-feature off
turns it off.
The default setting is off
. (It is arguable that this is the
wrong choice for the default, but this has been the behavior since the
program was first written.)
set recognize-only
value controls whether the parser
acts as a recognizer or as a real parser and thus produces all
possible parses.
set recognize-only on
causes the first successful parse to terminate the parsing process.
set recognize-only off
allows all possible parses to be checked and returned by the parsing process.
The default setting is off
.
set rootgloss
value specifies if root glosses should be
treated as a lexical feature and, if so, which root(s) in compound roots
are used. The word's root gloss may be useful for handling syntactic
constructions such as verb reduplication. Note that this does not work
when using Kimmo to parse words.
set rootgloss off
turns off the use of the root gloss feature.
This is the default setting.
set rootgloss on
turns on the use of the root gloss feature.
This value should be used when using a word lexicon (i.e. when using the
load lexicon file
command). N.B. that it must be set before one
loads the lexicon file (otherwise, no root glosses will be loaded).
set rootgloss leftheaded
turns on the use of the root gloss
feature and, if one is either disambiguating an ANA file or using AMPLE to
parse the words in a sentence, only the leftmost root in compound roots
will be used as the root gloss feature value.
set rootgloss rightheaded
turns on the use of the root gloss
feature and, if one is either disambiguating an ANA file or using AMPLE to
parse the words in a sentence, only the rightmost root in compound roots
will be used as the root gloss feature value.
set rootgloss all
turns on the use of the root gloss
feature and, if one is either disambiguating an ANA file or using AMPLE to
parse the words in a sentence, every root gloss in compound roots
will be used as the root gloss feature value.
set timing
value
enables timing mode if value is on
, and disables timing
mode if value is off
. If timing mode is on
, then
the elapsed time required to process a command is displayed when the
command finishes. If timing mode is off
, then the elapsed time
is not shown. The default is off
. (This option is useful only
to satisfy idle curiosity.)
set top-down-filter
value
enables or disables top-down filtering based on the categories.
set top-down-filter on
turns on this filtering,
and set top-down-filter off
turns it off. The
top-down filter speeds up the parsing of a sentence, but might cause
the parser to miss some valid parses. The default setting is
on
.
This should not be required in the final version of PC-PATR.
set tree
value
specifies how parse trees should be displayed.
set tree full
turns on the parse tree display, displaying the
result of the parse as a full tree. This is the default setting.
A short sentence would look something like this:
          Sentence_1
              |
        Declarative_2
         _____|_____
       NP_3       VP_5
        |       ___|____
       N_4     V_6   COMP_7
       cows    eat      |
                      NP_8
                        |
                       N_9
                      grass
set tree flat
turns on the parse tree display, displaying the
result of the parse as a flat tree structure in the form of a bracketed
string. The same short sentence would look something like this:
(Sentence_1 (Declarative_2 (NP_3 (N_4 cows))(VP_5 (V_6 eat)(COMP_7 (NP_8 (N_9 grass))))))
set tree indented
turns on the parse tree display, displaying
the result of the parse in an indented format sometimes called a
northwest tree. The same short sentence would look like this:
Sentence_1
    Declarative_2
        NP_3
            N_4 cows
        VP_5
            V_6 eat
            COMP_7
                NP_8
                    N_9 grass
set tree xml
turns on the parse tree display, displaying the
result of the parse in an XML format. The same short sentence would look
like this:
<Analysis count="1">
 <Parse>
  <Node cat="Sentence" id="_1._1">
   <Fs>
    <F name="cat"><str>Sentence</str></F>
   </Fs>
   <Node cat="Declarative" id="_1._2">
    <Fs>
     <F name="cat"><str>Declarative</str></F>
    </Fs>
    <Node cat="NP" id="_1._3">
     <Fs>
      <F name="cat"><str>NP</str></F>
     </Fs>
     <Leaf cat="N" id="_1._4">
      <Fs>
       <F name="cat"><str>N</str></F>
       <F name="lex"><str>cows</str></F>
      </Fs>
      <Lexfs>
       <F name="cat"><str>N</str></F>
       <F name="lex"><str>cows</str></F>
      </Lexfs>
      <Str>cows</Str>
     </Leaf>
    </Node>
    <Node cat="VP" id="_1._5">
     ... (35 lines omitted)
    </Node>
   </Node>
  </Node>
 </Parse>
</Analysis>
set tree off
disables the display of parse trees altogether.
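The flat and indented displays described under set tree can be pictured as two traversals of the same tree. The following Python sketch is only an illustration, not PC-PATR's own code: the tuple representation of a node and the function names flat and indented are assumptions made for this example.

```python
# A node is (label, body): body is a list of child nodes, or the word itself
# for a leaf. This representation is hypothetical, chosen for illustration.

def flat(node):
    """Render a tree as a bracketed string, as in `set tree flat`."""
    label, body = node
    if isinstance(body, str):
        return f"({label} {body})"
    return f"({label} " + "".join(flat(child) for child in body) + ")"

def indented(node, depth=0, step=4):
    """Render a tree as a list of indented lines, as in `set tree indented`."""
    label, body = node
    pad = " " * (depth * step)
    if isinstance(body, str):
        return [f"{pad}{label} {body}"]
    lines = [f"{pad}{label}"]
    for child in body:
        lines.extend(indented(child, depth + 1, step))
    return lines

tree = ("Sentence_1", [
    ("Declarative_2", [
        ("NP_3", [("N_4", "cows")]),
        ("VP_5", [
            ("V_6", "eat"),
            ("COMP_7", [("NP_8", [("N_9", "grass")])]),
        ]),
    ]),
])

print(flat(tree))
print("\n".join(indented(tree)))
```

Both functions visit the nodes in the same order; only the way nesting is marked (parentheses versus indentation) differs.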
set trim-empty-features
value
disables the display of empty feature values if value is
on
, and enables the display of empty feature values if
value is off
. The default is not to display empty feature
values.
set unification
value
enables or disables feature unification.
set unification on
turns on unification mode. This is the
default setting.
set unification off
turns off feature unification in the
grammar. Only the context-free phrase structure rules are used to
guide the parse; the feature constraints are ignored. This can be
dangerous, as it is easy to introduce infinite cycles in recursive
phrase structure rules.
set verbose
value
enables or disables the screen display of parse trees in the
file parse
command. set verbose on
enables the screen display of parse
trees, and set verbose off
disables such display. The default
setting is off
.
set warnings
value
enables warning mode if value is on
, and disables
warning mode if value is off
. If warning mode is
enabled, then warning messages are displayed on the output. If warning
mode is disabled, then no warning messages are displayed. The default
setting is on
.
set write-ample-parses
value
enables writing \parse
and \features
fields at the end of
each sentence in the disambiguated analysis file if value is
on
, and disables writing these fields if value is
off
. The default setting is off
.
This variable setting affects only the file disambiguate
command.
The show
commands display internal settings on the screen. Each
of these commands is described below.
show lexicon
prints the contents of the lexicon stored in memory on the standard
output. THIS IS NOT VERY USEFUL, AND MAY BE REMOVED.
show status
displays the names of the current grammar, sentences, and log files,
and the values of the switches established by the set
command.
show
(by itself) and status
are synonyms for
show status
.
status
displays the names of the current grammar, sentences, and log files,
and the values of the switches established by the set
command.
system
[command]
allows the user to execute an operating system command (such as
checking the available space on a disk) from within PC-PATR. This is
available only for MS-DOS and Unix, not for Microsoft Windows or the
Macintosh.
If no system-level command is given on the line with the system
command, then PC-PATR is pushed into the background and a new system
command processor (shell) is started. Control is usually returned to
PC-PATR in this case by typing exit
as the operating system
command.
!
(exclamation point) is a synonym for system
.
take
[file.tak]
redirects command input to the specified file.
The default filetype extension for take
is .tak
, and the default
filename is pcpatr.tak
.
take
files can be nested three deep. That is, the user types
take file1
, file1
contains the command take file2
,
and file2
has the command take file3
. It would be an
error for file3
to contain a take
command. This should
not prove to be a serious limitation.
A take
file can also be specified by using the -t
command
line option when starting PC-PATR. When started, PC-PATR looks for a
take
file named `pcpatr.tak' in the current directory to
initialize itself with.
The following specifications apply generally to the grammar file:
The comment character declared by the set comment command
(see section 3.2.14.4 set comment)
is operative in the grammar file. The default
comment character is the semicolon (;). Comments may be placed
anywhere in the grammar file. Everything following a comment character
to the end of the line is ignored.
Rule
starts a context-free phrase structure rule with its
set of feature constraints. These rules define how words join together
to form phrases, clauses, or sentences. The lexicon and grammar are
tied together by using the lexical categories as the terminal symbols
of the phrase structure rules and by using the other lexical features
in the feature constraints.
Let
starts a feature template definition. Feature
templates are used as macros (abbreviations) in the lexicon. They may
also be used to assign default feature structures to the categories.
Parameter
starts a program parameter definition. These
parameters control various aspects of the program.
Define
starts a lexical rule definition. As noted in Shieber
(1985), something more powerful than just abbreviations for common
feature elements is sometimes needed to represent systematic
relationships among the elements of a lexicon. This need is met by
lexical rules, which express transformations rather than mere
abbreviations. Lexical rules serve two primary purposes in PC-PATR:
modifying the feature structures associated with lexicon entries to
produce additional lexicon entries, and modifying the feature structures
produced by a morphological parser to fit the syntactic grammar
description.
Constraint
starts a constraint template definition. Constraint
templates are used as macros (abbreviations) in the grammar file.
Lexicon
starts a lexicon section. This is only for
compatibility with the original PATR-II. The section name is
skipped over properly, but nothing is done with it.
Word
starts an entry in the lexicon. This is only for
compatibility with the original PATR-II. The entry is skipped
over properly, but nothing is done with it.(5)
End
effectively terminates the file. Anything following this
keyword is ignored.
Comment
starts a comment field. The rest of the line following
the keyword is skipped over, and everything in following lines until the
next keyword is also ignored. If you must use a keyword (other than
comment) verbatim in one of the extra lines of a comment, put a
comment character at the beginning of the line containing the keyword.
These keywords are not case sensitive: RULE is the
same as rule, and both are the same as Rule. Also, in
order to facilitate interaction with the `Shoebox' program, any
of the keywords may begin with a backslash (\) character. For
example, \Rule and \rule are both acceptable alternatives
to RULE or rule. The abbreviated form \co is a
special synonym for comment or \comment. Note that there
is no requirement that these keywords appear at the beginning of a line.
Except for comment, each of the fields in the grammar file may
optionally end with a period. If there is no period, the next keyword
(in an appropriate slot) marks the end of one field and the beginning of
the next.
A PC-PATR grammar rule has these parts, in the order listed:
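the keyword Rule
an optional rule identifier enclosed in braces ({})
the nonterminal symbol to be expanded
an arrow (->) or equal sign (=)
zero or more terminal or nonterminal symbols, possibly marked for alternation or optionality
an optional colon (:)
zero or more unification constraints
zero or more priority union operations
zero or more logical constraint operations
an optional period (.)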
The optional rule identifier consists of one or more words enclosed in braces. Its current utility is only as a special form of comment describing the intent of the rule. (Eventually it may be used as a tag for interactively adding and removing rules.) The only limits on the rule identifier are that it not contain the comment character and that it all appears on the same line in the grammar file.
The terminal and nonterminal symbols in the rule have the following characteristics:
Upper and lower case letters are significant in symbols: NOUN
is not the same as Noun, and neither is
the same as noun.
The symbol X (capital letter x) may be used to stand for any
terminal or nonterminal. For example, this rule says that any category
in the grammar rules can be replaced by two copies of the same category
separated by a CJ.
Rule X -> X_1 CJ X_2
        <X cat> = <X_1 cat>
        <X cat> = <X_2 cat>
        <X arg1> = <X_1 arg1>
        <X arg1> = <X_2 arg1>
The symbol X can be useful for capturing generalities. Care must be taken, since it can be replaced by anything.
Repeated occurrences of a symbol in a rule are distinguished by an index
number attached to the symbol with an underscore character (_). This is
illustrated in the rule for X above.
The characters (){}[]<>=:/ cannot be used in terminal or
nonterminal symbols since they are used for special purposes in the
grammar file. The character _ can be used only for
attaching an index number to a symbol.
The symbols on the right hand side of a phrase structure rule may be marked or grouped in various ways:
Rule S -> NP {TVP / IV}
Rule S -> NP TVP / IV
The phrase structure rule can be followed by zero or more unification constraints that refer to symbols used in the rule. A unification constraint has these parts, in the order listed:
A unification constraint that refers only to symbols on the right hand side of the rule constrains their co-occurrence. In the following rule and constraint, the values of the agr features for the NP and VP nodes of the parse tree must unify:
Rule S -> NP VP <NP agr> = <VP agr>
If a unification constraint refers to a symbol on the right hand side of the rule, and has an atomic value on its right hand side, then the designated feature must not have a different value. In the following rule and constraint, the head case feature for the NP node of the parse tree must either be originally undefined or equal to NOM:
Rule S -> NP VP <NP head case> = NOM
(After unification succeeds, the head case feature for the NP node of the parse tree will be equal to NOM.)
A unification constraint that refers to the symbol on the left hand side of the rule passes information up the parse tree. In the following rule and constraint, the value of the tense feature is passed from the VP node up to the S node:
Rule S -> NP VP <S tense> = <VP tense>
See section 2.4 Feature constraints, for more details about unification constraints.
The phrase structure rule can also be followed by zero or more priority union operations that refer to symbols used in the rule. A priority union operation has these parts, in the order listed:
a feature path
a priority union sign (<=)
either another feature path or an atomic value
Although priority union operations may be intermingled with unification constraints following the phrase structure rule, they are applied only after all unification constraints have succeeded. Therefore, it makes more sense to place them after all of the unification constraints as a reminder of the order of application.
Priority union operations may not appear inside a disjunction: if two rules logically differ only in the application of one priority union or another, both rules must be written out in full.
The phrase structure rule can also be followed by zero or more logical constraint operations that refer to symbols used in the rule. A logical constraint operation has these parts, in the order listed:
a feature path
a logical constraint sign (==)
a logical constraint expression
Although logical constraint operations may be intermingled with unification constraints or priority union operations following the phrase structure rule, they are applied only after all unification constraints have succeeded and all priority union operations have been applied. Therefore, it makes more sense to place them after all of the unification constraints, and after any priority union operations, as a reminder of the order of application.
Logical constraint operations may not appear inside a disjunction: if two rules logically differ only in the application of one logical constraint or another, both rules must be written out in full.
These last two elements of a PC-PATR rule are enhancements to the original PATR-II formalism. For this reason, they are discussed in more detail in the following two sections.
Unification is the only mechanism implemented in the original PATR-II formalism for merging two feature structures. There are situations where the desired percolation of information is not easily expressed in terms of unification. For example, consider the following rule (where ms stands for morphosyntactic features):
Stem -> Root Deriv:
        <Root ms> = <Deriv msFrom>
        <Stem ms> = <Root ms>
        <Stem ms> = <Deriv msTo>
The first unification expression above imposes the agreement constraints
for this rule. The second and third unification expressions attempt to
provide the percolation of information up to the Stem. However,
it is quite possible for there to be a conflict between <Root ms>
and <Deriv msTo>. Any such conflict would cause the third
unification expression to fail, causing the rule as a whole to fail. The
only way around this at present is to provide a large number of
unification expressions that go into greater depth in the feature
structures. Even then it may not be possible to always avoid conflicts.
An additional mechanism for merging feature structures is provided to properly handle percolation of information: overwriting via priority union. The notation of the previous example changes slightly to the following:
Stem -> Root Deriv:
        <Root ms> = <Deriv msFrom>
        <Stem ms> = <Root ms>
        <Stem ms> <= <Deriv msTo>
The only change is in the third expression under the rule: the
unification operator = has been changed to a priority union
operator <=. This new operator is the same as unification except
for handling conflicts and storing results. In unification, a conflict
causes the operation to fail. In priority union, a conflict is resolved
by taking the value in the right hand feature structure. In unification,
both the left hand feature structure and the right hand feature structure
are replaced by the unified result. In priority union, only the left
hand feature structure is replaced by the result.
There is one other significant difference between unification and priority union. Unification is logically an unordered process; it makes no difference what order the unification expressions are written. Priority union, on the other hand, is inherently ordered; a priority union operation always overrides any earlier priority union (or unification) result. For this reason, all unification expressions are evaluated before any priority union expressions, and the ordering of the priority union expressions is significant.
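The contrast between the two operators can be sketched in Python over nested dictionaries. This is a simplified illustration under stated assumptions, not the PC-PATR implementation: feature structures are plain dicts with string atoms, shared (reentrant) values are not modelled, and the names unify, priority_union, and UnificationFailure, as well as the feature values, are hypothetical.

```python
import copy

class UnificationFailure(Exception):
    """Raised when two atomic values conflict during unification."""

def unify(left, right):
    """Return the merged structure; any conflict makes the operation fail."""
    if isinstance(left, dict) and isinstance(right, dict):
        result = copy.deepcopy(left)
        for key, value in right.items():
            if key in result:
                result[key] = unify(result[key], value)
            else:
                result[key] = copy.deepcopy(value)
        return result
    if left == right:
        return left
    raise UnificationFailure(f"{left!r} conflicts with {right!r}")

def priority_union(left, right):
    """Return the merged structure; on a conflict the right hand value wins."""
    if isinstance(left, dict) and isinstance(right, dict):
        result = copy.deepcopy(left)
        for key, value in right.items():
            if key in result:
                result[key] = priority_union(result[key], value)
            else:
                result[key] = copy.deepcopy(value)
        return result
    return copy.deepcopy(right)

# Hypothetical values for <Root ms> and <Deriv msTo>.
root_ms = {"cat": "N", "number": "SG"}
deriv_ms_to = {"cat": "V", "tense": "PRES"}

# <Stem ms> <= <Deriv msTo>: only the left hand side receives the result.
stem_ms = priority_union(root_ms, deriv_ms_to)
print(stem_ms)
```

With these values, unify(root_ms, deriv_ms_to) raises UnificationFailure because of the cat conflict, while priority_union keeps number from the root and takes cat and tense from the suffix.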
A BNF grammar for PC-PATR priority union operations follows.
<priority-union> ::= <feature-path> '<=' <feature-path>
                   | <feature-path> '<=' <ATOM>
<feature-path>   ::= '<' <label-list> '>'
<label-list>     ::= <LABEL>
                   | <LABEL> <label-list>
Note that both <LABEL>
and <ATOM>
refer to a single string
token of contiguous characters.
Unification is the only mechanism implemented in the original PATR-II formalism for imposing constraints on feature structures. There are situations where the desired constraint is not easily expressed in terms of unification. For example, consider the following rule:
Stem -> Root Deriv:
        <Root ms> = <Deriv msFrom>
        <Stem ms> = <Root ms>
        <Stem ms> <= <Deriv msTo>
where <Root ms> and <Deriv msFrom> have
the following feature structures:
[Root: [ms: [finite: - ...]]]
[Deriv: [msFrom: [tense: past ...]]]
Assume that from our knowledge of verb morphology, we would like to rule out
this analysis because only finite verb roots ([finite: +]
) are
marked for tense. The only way to do this with unification is to add
[finite: +]
to the msFrom
feature of all the
tense bearing derivational suffixes. This would work, but it adds
information to suffixes that properly belongs only to roots. A better
approach would be some way to express the desired constraint more directly.
Consider the following rule:
Stem -> Root Deriv:
        <Root ms> = <Deriv msFrom>
        <Stem ms> = <Root ms>
        <Stem ms> <= <Deriv msTo>
        <Stem ms> == [finite: +] <-> [tense: []]
The fourth feature expression under the rule is a new operation called a
constraint. This particular constraint is interpreted as follows: if the
feature structure [finite: +]
subsumes the feature
structure that is the value of <Stem ms>
, then the feature
structure [tense: []]
must also subsume the feature structure
that is the value of <Stem ms>
, and if the feature
structure [finite: +]
does not subsume the feature structure
that is the value of <Stem ms>
, then the feature structure
[tense: []]
must not subsume the feature structure that is the
value of <Stem ms>
. (A feature structure F1
subsumes another feature structure F2 if F1 contains a
subset of the information contained by F2. The empty feature
structure []
subsumes all other feature structures. Subsumption
is a partial ordering: not every two feature structures are in a subsumption
relation to each other.)
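Subsumption and the biconditional constraint above can be sketched in a few lines of Python. This is an illustration only, assuming feature structures are nested dicts with string atoms and that the empty dict stands for []; the function names subsumes and biconditional are hypothetical and do not come from PC-PATR.

```python
def subsumes(general, specific):
    """True if all the information in `general` is also found in `specific`."""
    if general == {}:
        return True                      # [] subsumes every feature structure
    if isinstance(general, dict):
        if not isinstance(specific, dict):
            return False
        return all(key in specific and subsumes(value, specific[key])
                   for key, value in general.items())
    return general == specific           # atoms must match exactly

def biconditional(path_value, left, right):
    """P == L <-> R: both L and R subsume *P, or neither does."""
    return subsumes(left, path_value) == subsumes(right, path_value)

FINITE = {"finite": "+"}
TENSED = {"tense": {}}

# Non-finite root combined with a tense suffix: the constraint rejects it.
print(biconditional({"finite": "-", "tense": "past"}, FINITE, TENSED))
# Finite and tensed, or neither finite nor tensed: the constraint is satisfied.
print(biconditional({"finite": "+", "tense": "past"}, FINITE, TENSED))
print(biconditional({"finite": "-"}, FINITE, TENSED))
```

The structures {"a": "1"} and {"b": "2"} illustrate the partial ordering: neither subsumes the other.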
A constraint is much different both syntactically and semantically from either unification or priority union. The first difference is that a constraint does not modify any feature structures; it merely compares the content of two feature structures. The second difference is that the right hand side of a constraint expression is a logical expression involving one or more feature structures rather than a feature path.
Constraints support two unary and four binary logical operations:
existence, negation, logical and, logical or, conditional, and
biconditional. The following tables summarize these logical operations.
($
is used for the subsumption operation. *P
represents
the feature structure pointed to by the feature path associated with the
logical constraint. F
, L
, and R
represent a
feature structure associated with the logical constraint.)
| | existence | negation |
|---|---|---|
| F $ *P | P == F | P == ~F |
| true | true | false |
| false | false | true |

| | | logical and | logical or | conditional | biconditional |
|---|---|---|---|---|---|
| L $ *P | R $ *P | P == L & R | P == L / R | P == L -> R | P == L <-> R |
| true | true | true | true | true | true |
| true | false | false | true | false | false |
| false | true | false | true | true | false |
| false | false | false | false | true | true |
Since they apply to the final feature structure, constraint expressions are evaluated after all of the unification and priority union expressions. Like unification and unlike priority union, the relative order of constraints is not (logically) important.
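The binary operations in the second table depend only on the two subsumption results L $ *P and R $ *P. As a cross-check, the rows can be regenerated with a short Python sketch; the dictionary OPERATIONS and the function truth_table are names invented for this illustration.

```python
# Each operation maps the two subsumption results (L $ *P, R $ *P) to the
# truth value of the whole constraint.
OPERATIONS = {
    "&":   lambda l, r: l and r,        # logical and
    "/":   lambda l, r: l or r,         # logical or
    "->":  lambda l, r: (not l) or r,   # conditional
    "<->": lambda l, r: l == r,         # biconditional
}

def truth_table():
    """Return rows of (L $ *P, R $ *P, and, or, conditional, biconditional)."""
    rows = []
    for l in (True, False):
        for r in (True, False):
            rows.append((l, r) + tuple(op(l, r) for op in OPERATIONS.values()))
    return rows

for row in truth_table():
    print(row)
```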
A BNF grammar for PC-PATR logical constraint operations follows.
<logical-constraint> ::= <feature-path> '==' <expression>
<feature-path>       ::= '<' <label-list> '>'
<label-list>         ::= <LABEL>
                       | <LABEL> <label-list>
<expression>         ::= <factor>
                       | '~' <factor>
                       | <factor> <binop> <factor>
                       | '~' <factor> <binop> <factor>
                       | <factor> <binop> '~' <factor>
                       | '~' <factor> <binop> '~' <factor>
<factor>             ::= <feature>
                       | '(' <expression> ')'
<binop>              ::= '&' | '/' | '->' | '<->'
<feature>            ::= '[' <attribute-list> ']'
                       | '[]'
<attribute-list>     ::= <attribute>
                       | <attribute> <attribute-list>
<attribute>          ::= <LABEL> ':' <ATOM>
                       | <LABEL> ':' <feature>
                       | <LABEL> ':' <indexedvariable>
<indexedvariable>    ::= '^1' | '^2' | '^3' | '^4' | '^5' | '^6' | '^7' | '^8' | '^9'
Note that both <LABEL>
and <ATOM>
refer to a
single string token of contiguous characters.
An <indexedvariable>
is interpreted as a variable for the
atomic value at that place in the feature structure. The
first such variable is instantiated by the atomic value of
the feature at that place in the feature-path. All
subsequent instances of the variable are compared for
equality with the first instantiated one.
Why might one need such an indexed variable? In some SOV
languages with pro-drop and noun-verb compounding, a clause
consisting just of a Noun Verb
sequence is
potentially at least three ways ambiguous:
Subject Verb
Object Verb
Noun-Verb-compound
In at least one of these languages, it is the case that when a noun-verb compound is possible, it is the only valid reading. Therefore, the correct thing to do is to ensure that none of the other possible readings are allowed by the grammar.
Here's a (simplified) example of how one can use indexed
variables to rule out the Subject Verb
case. (The
Noun
is realized as the DP
node and the
Verb
is realized as a VP
which is a daughter
of the I'
node in the following rule.)
rule {IP option 2cI - subject initial, required, root clause}
IP = DP I'
    <IP head> = <I' head>
    <IP head type root> = +
    <IP head type pro-drop> = -
    ...
    <DP head case nominative> = +
    ...
    <IP head> == [rootgloss:^1] ->
                 ~ ( [type:[no_intervening:+]] &
                     (( [subject:[head:[type:[compounds_with1:^1]]]] /
                        [subject:[head:[type:[compounds_with2:^1]]]]) /
                      ([subject:[head:[type:[compounds_with3:^1]]]] /
                       [subject:[head:[type:[compounds_with4:^1]]]])
                     )
                   )
    ...
In the final logical constraint above (the one using the == operator),
the atomic value of the rootgloss
feature is
stored in variable ^1
in the antecedent (the "if"
part) of the conditional. This atomic value is then
compared with the values of the various
compounds_with
features. The idea is that the value
of the rootgloss
feature should not be any of the
values of the various compounds_with
features (there
are more than one of these because a given noun may compound
with more than one verb).
A PC-PATR feature template has these parts, in the order listed:
the keyword Let
the template name
the keyword be
a feature definition
an optional period (.)
If the template name is a terminal category (a terminal symbol in one of the phrase structure rules), the template defines the default features for that category. Otherwise the template name serves as an abbreviation for the associated feature structure.
The characters (){}[]<>=:
cannot be used in template names
since they are used for special purposes in the grammar file. The
characters /_
can be freely used in template names. The
character \
should not be used as the first character of a
template name because that is how fields are marked in the lexicon
file.
The abbreviations defined by templates are usually used in the feature field of entries in the lexicon file. For example, the lexical entry for the irregular plural form feet may have the abbreviation pl in its features field. The grammar file would define this abbreviation with a template like this:
Let pl be [number: PL]
The path notation may also be used:
Let pl be <number> = PL
More complicated feature structures may be defined in templates. For example,
Let 3sg be [tense: PRES agr: 3SG finite: + vform: S]
which is equivalent to:
Let 3sg be <tense> = PRES
           <agr> = 3SG
           <finite> = +
           <vform> = S
In the following example, the abbreviation irreg is defined using another abbreviation:
Let irreg be <reg> = - pl
The abbreviation pl must be defined previously in the grammar file or an error will result. A subsequent template could also use the abbreviation irreg in its definition. In this way, an inheritance hierarchy of features may be constructed.
Feature templates permit disjunctive definitions. For example, the lexical entry for the word deer may specify the feature abbreviation sg/pl. The grammar file would define this as a disjunction of feature structures reflecting the fact that the word can be either singular or plural:
Let sg/pl be {[number:SG] [number:PL]}
This has the effect of creating two entries for deer, one with
singular number and another with plural. Note that there is no limit
to the number of disjunct structures listed between the braces. Also,
there is no slash (/
) between the elements of the disjunction as
there is between the elements of a disjunction in the rules.
A shorter version of the above template using the path notation looks
like this:
Let sg/pl be <number> = {SG PL}
Abbreviations can also be used in disjunctions, provided that they have previously been defined:
Let sg be <number> = SG
Let pl be <number> = PL
Let sg/pl be {[sg] [pl]}
Note the square brackets around the abbreviations sg and pl; without square brackets they would be interpreted as simple values instead.
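The way a disjunctive template multiplies lexical entries can be pictured with a small Python sketch. It is a simplification under stated assumptions: entries are flat dicts, and the helper names merge and expand are invented for this example. Dropping a disjunct that conflicts with the entry is also an assumption, not documented PC-PATR behavior.

```python
def merge(base, extra):
    """Unify two flat attribute dicts; return None if an atomic value conflicts."""
    result = dict(base)
    for key, value in extra.items():
        if key in result and result[key] != value:
            return None
        result[key] = value
    return result

def expand(entry, disjuncts):
    """Return one feature structure per disjunct that is compatible with `entry`."""
    merged = (merge(entry, disjunct) for disjunct in disjuncts)
    return [structure for structure in merged if structure is not None]

# Corresponds to: Let sg/pl be {[number:SG] [number:PL]}
sg_pl = [{"number": "SG"}, {"number": "PL"}]
deer = {"lex": "deer", "cat": "N"}

for structure in expand(deer, sg_pl):
    print(structure)
```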
Feature templates can assign default atomic feature values, indicated by prefixing an exclamation point (!). A default value can be overridden by an explicit feature assignment. This template says that all members of category N have singular number as a default value:
Let N be <number> = !SG
The effect of this template is to make all nouns singular unless they
are explicitly marked as plural. For example, regular nouns such as
book do not need any feature in their lexical entries to signal
that they are singular; but an irregular noun such as feet would
have a feature abbreviation such as pl in its lexical entry.
This would be defined in the grammar as [number: PL]
, and would
override the default value for the feature number specified by the
template above. If the N template above used SG
instead of
!SG
, then the word feet would fail to parse, since its
number feature would have an internal conflict between SG
and PL
.
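The interaction between default and ordinary values can be sketched as follows. This Python fragment is a minimal illustration assuming flat feature dicts; the Default wrapper and the functions add_feature and promote are hypothetical names, with promote standing in for what set promote-defaults on is described as doing.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Default:
    """Marks an atomic value as a default, like !SG in a template."""
    value: str

def add_feature(features, name, value):
    """Add one atomic feature; return the new dict, or None on a real conflict."""
    result = dict(features)
    current = result.get(name)
    if current is None or isinstance(current, Default):
        # Nothing stored yet, or only a default: an ordinary value replaces it.
        if current is None or not isinstance(value, Default):
            result[name] = value
        return result
    if isinstance(value, Default) or value == current:
        return result                    # a default never overrides an ordinary value
    return None                          # two different ordinary values conflict

def promote(features):
    """Turn remaining default values into ordinary atomic values."""
    return {name: value.value if isinstance(value, Default) else value
            for name, value in features.items()}

noun_defaults = {"number": Default("SG")}          # Let N be <number> = !SG

book = promote(noun_defaults)                      # no explicit number feature
feet = add_feature(noun_defaults, "number", "PL")  # pl overrides the default
clash = add_feature({"number": "SG"}, "number", "PL")
print(book, feet, clash)
```

The last call models the case where the N template uses SG rather than !SG: the explicit PL then conflicts and the entry fails.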
A PC-PATR parameter setting has these parts, in the order listed:
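the keyword Parameter
an optional colon (:)
one or more keywords identifying the parameter
the keyword is
the parameter value
an optional period (.)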
PC-PATR recognizes the following parameters:
Start symbol
Parameter Start symbol is S
declares that the parse goal of the grammar is the nonterminal category S. The default start symbol is the left hand symbol of the first phrase structure rule in the grammar file.
Restrictor
Parameter Restrictor is <cat> <head form>
declares that the cat and head form features should be used to screen rules before adding them to the parse chart. The default is not to use any features for such filtering. This filtering, named restriction in Shieber (1985), is performed in addition to the normal top-down filtering based on categories alone. RESTRICTION IS NOT YET IMPLEMENTED. SHOULD IT BE INSTEAD OF NORMAL FILTERING RATHER THAN IN ADDITION TO?
Attribute order
Parameter Attribute order is cat lex sense head first rest agreement
declares that the cat attribute should be the first one shown in any output from PC-PATR, and that the other attributes should be shown in the relative order shown, with the agreement attribute shown last among those listed, but ahead of any attributes that are not listed above. Attributes that are not listed are ordered according to their character code sort order. If the attribute order is not specified, then the category feature cat is shown first, with all other attributes sorted according to their character codes.
Category feature
Parameter Category feature is Categ
declares that Categ is the name of the category attribute. The default name for this attribute is cat.
Lexical feature
Parameter Lexical feature is Lex
declares that Lex is the name of the lexical attribute. The default name for this attribute is lex.
Gloss feature
Parameter Gloss feature is Gloss
declares that Gloss is the name of the gloss attribute. The default name for this attribute is gloss.
RootGloss feature
Parameter RootGloss feature is RootGloss
declares that RootGloss is the name of the root gloss attribute. The default name for this attribute is rootgloss. Note that this does not work when using Kimmo to parse words.
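The ordering described for the Attribute order parameter amounts to a two-level sort: listed attributes first, in the listed order, then all others by character code. A Python sketch of that ordering follows; the function name order_attributes and the sample attribute names Zeta and subcat are chosen only for illustration.

```python
def order_attributes(names, listed=("cat",)):
    """Sort attribute names: listed ones first in listed order, the rest by code point."""
    rank = {name: position for position, name in enumerate(listed)}

    def key(name):
        if name in rank:
            return (0, rank[name], "")   # listed: keep the declared order
        return (1, 0, name)              # unlisted: character code order

    return sorted(names, key=key)

listed = ("cat", "lex", "sense", "head", "first", "rest", "agreement")
print(order_attributes(["subcat", "agreement", "head", "cat", "Zeta", "lex"], listed))
print(order_attributes(["lex", "head", "cat"]))   # no order declared: cat first
```

Python compares strings by code point, so an unlisted name beginning with an upper case letter sorts ahead of one beginning with a lower case letter.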
Lexical rules serve two purposes: providing a flexible means of creating multiple related lexicon entries, and converting morphological parser output into a form suitable for syntactic parser input.
Figure 7. PC-PATR lexical rule example
; lexicon entry
\w stormed
\c V
\f Transitive AgentlessPassive
   <head trans pred> = storm

; definitions from the grammar file
Let Transitive be
    <subcat first cat> = NP
    <subcat rest first cat> = NP
    <subcat rest rest> = end
    <head trans arg1> = <subcat first head trans>
    <head trans arg2> = <subcat rest first head trans>.

Define AgentlessPassive as
    <out cat> = <in cat>
    <out subcat> = <in subcat rest>
    <out lex> = <in lex>            ; added for PC-PATR
    <out head> = <in head>
    <out head form> => passiveparticiple.
Figure 8. Feature structure before lexical rule
[ lex:    stormed
  cat:    V
  head:   [ trans: [ arg1: $1 []
                     arg2: $2 []
                     pred: storm ] ]
  subcat: [ first: [ cat:  NP
                     head: [ trans: $1 [] ] ]
            rest:  [ first: [ cat:  NP
                              head: [ trans: $2 [] ] ]
                     rest:  end ] ] ]
Figure 9. Feature structures after lexical rule
[ lex:    stormed
  cat:    V
  head:   [ trans: [ arg1: $1 []
                     arg2: $2 []
                     pred: storm ] ]
  subcat: [ first: [ cat:  NP
                     head: [ trans: $1 [] ] ]
            rest:  [ first: [ cat:  NP
                              head: [ trans: $2 [] ] ]
                     rest:  end ] ] ]

[ lex:    stormed
  cat:    V
  head:   [ trans: [ arg1: []
                     arg2: $1 []
                     pred: storm ]
            form:  passiveparticiple ]
  subcat: [ first: [ cat:  NP
                     head: [ trans: $1 [] ] ]
            rest:  end ] ]
A PC-PATR lexical rule has these parts, in the order listed:
the keyword Define
the name of the lexical rule
the keyword as
the rule definition
an optional period (.)
The rule definition consists of one or more mappings. Each mapping has three parts: an output feature path, an assignment operator, and the value assigned, either an input feature path or an atomic value. Every output path begins with the feature name out and every input path begins with the feature name in. The assignment operator is either an equal sign (=) or an equal sign followed by a "greater than" sign (=>).(6)
Consider the information shown in figure 7. When the lexicon entry is loaded, it is initially assigned the feature structure shown in figure 8, which is the unification of the information given in the various fields of the lexicon entry. Since one of the labels stored in the \f (feature) field is actually the name of a lexical rule, the named lexical rule is applied after the complete feature structure has been built. After the rule has been applied, the original single feature structure has been changed to the two feature structures shown in figure 9. Note that not all of the input feature information is found in both of the output feature structures.
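The behavior described above can be sketched in Python. This is only an illustration under stated assumptions, not PC-PATR's implementation: feature structures are modeled as nested dicts, feature paths as tuples with the leading in and out names dropped, and the helper names (get_path, set_path, apply_lexical_rule) are hypothetical. Following the note that a lexical rule is treated as an ordered list of assignments, the mappings are applied one after another.

```python
import copy

def get_path(fs, path):
    """Follow a feature path (a tuple of names) through nested dicts."""
    for name in path:
        fs = fs[name]
    return fs

def set_path(fs, path, value):
    """Store value at a feature path, creating intermediate structures."""
    for name in path[:-1]:
        fs = fs.setdefault(name, {})
    fs[path[-1]] = value

def apply_lexical_rule(rule, entry):
    """Apply the mappings in order; return the original and the new structure."""
    source = copy.deepcopy(entry)  # deepcopy keeps shared substructures shared
    out = {}
    for out_path, value in rule:
        if isinstance(value, tuple):  # a tuple is an input feature path
            value = get_path(source, value)
        set_path(out, out_path, value)  # a string is an atomic value
    return [entry, out]

# The AgentlessPassive rule of figure 7, without the in/out prefixes.
AGENTLESS_PASSIVE = [
    (("cat",), ("cat",)),
    (("subcat",), ("subcat", "rest")),
    (("lex",), ("lex",)),
    (("head",), ("head",)),
    (("head", "form"), "passiveparticiple"),
]

# The feature structure of figure 8; arg1 and arg2 play the role of $1 and $2.
arg1, arg2 = {}, {}
stormed = {
    "lex": "stormed",
    "cat": "V",
    "head": {"trans": {"arg1": arg1, "arg2": arg2, "pred": "storm"}},
    "subcat": {
        "first": {"cat": "NP", "head": {"trans": arg1}},
        "rest": {"first": {"cat": "NP", "head": {"trans": arg2}}, "rest": "end"},
    },
}

original, passive = apply_lexical_rule(AGENTLESS_PASSIVE, stormed)
```

In this sketch the second structure keeps the link between arg2 and the one remaining subcategorized NP, while arg1 is no longer tied to anything in subcat, which matches figure 9.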
Figure 10. PC-PATR lexical rule for using PC-Kimmo

    Define MapKimmoFeatures as
        <out cat> = <in head pos>
        <out head> = <in head>
        <out gloss> = <in root>
        <out root_pos> = <in root_pos>

Figure 11. Feature structure received from PC-Kimmo

    [ cat:      Word
      clitic:   -
      drvstem:  -
      head:     [ agr:    [ 3sg: + ]
                  finite: +
                  pos:    V
                  tense:  PRES
                  vform:  S ]
      root:     `sleep
      root_pos: V ]

Figure 12. Feature structure sent to PC-PATR

    [ cat:      V
      gloss:    `sleep
      head:     [ agr:    [ 3sg: + ]
                  finite: +
                  pos:    V
                  tense:  PRES
                  vform:  S ]
      lex:      sleeps
      root_pos: V ]
Using a lexical rule in conjunction with the PC-Kimmo morphological parser within PC-PATR is illustrated in figures 10-12. Figure 10 shows the lexical rule for mapping from the top-level feature structure produced by the morphological parser to the bottom-level feature structure used by the sentence parser. Note that this rule must be named MapKimmoFeatures (unorthodox capitalization and all). Figure 11 shows the feature structure created by the PC-Kimmo parser. After the lexical rule shown in figure 10 has been applied (and after some additional automatic processing), the feature structure shown in figure 12 is passed to the PC-PATR parser. Note that only a single feature structure results from this operation, unlike the result of a lexical rule applied to a lexicon entry.
Note that the feature structure passed to the PC-PATR parser always has both a lex feature and a gloss feature, even if the MapKimmoFeatures lexical rule does not create them. The default value for the lex feature is the original word from the sentence being parsed. The default value for the gloss feature is the concatenation of the glosses of the individual morphemes in the word.
In contrast to the lex and gloss features, which are provided automatically by default, the cat feature must be provided by the MapKimmoFeatures lexical rule. There is no way to provide this feature automatically, and it is required for the phrase structure rule portion of PC-PATR.
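The defaulting behavior described above can be summarized in a few lines of Python. This is a hedged sketch rather than the program's actual logic: the function name finish_kimmo_features is hypothetical, the mapped feature structure is a plain dict, and the separator used to concatenate morpheme glosses is an assumption, since the text does not say how the glosses are joined.

```python
def finish_kimmo_features(mapped, word, morpheme_glosses, separator=""):
    """Fill in lex and gloss defaults after MapKimmoFeatures has been applied.

    mapped           -- dict produced by the lexical rule
    word             -- the original word from the sentence
    morpheme_glosses -- glosses of the individual morphemes, in order
    separator        -- assumed joiner for the default gloss (not documented)
    """
    result = dict(mapped)
    # lex and gloss are supplied automatically when the rule did not set them.
    result.setdefault("lex", word)
    result.setdefault("gloss", separator.join(morpheme_glosses))
    # cat cannot be defaulted, so the rule has to provide it.
    if "cat" not in result:
        raise ValueError("MapKimmoFeatures must supply the cat feature")
    return result

# Mirrors figures 10-12: the rule set cat and gloss but not lex.
sleeps = finish_kimmo_features(
    {"cat": "V", "gloss": "`sleep", "root_pos": "V"}, "sleeps", ["`sleep"]
)
```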
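A PC-PATR constraint template has these parts, in the order listed:
the keyword Constraint
the name of the constraint template
the keyword is
a logical constraint expression
an optional period (.)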
The characters (){}[]<>=:/ cannot be used in constraint template names since they are used for special purposes in the grammar file. The characters _\ can be freely used in constraint template names.
The abbreviations defined by constraint templates are used in the logical constraint operations that are part of the rules defined in the grammar file. A constraint template must be defined in the grammar file before it can be used in a rule.
Consider the following rules in a grammar file:
    RULE Word -> Stem
        <Word ms> = <Stem ms>
        <Stem ms> == [finite: +] <-> [tense: []]

    RULE Word -> Stem Infl
        <Word ms> = <Stem ms>
        <Word ms> = <Infl ms>
        <Stem ms> == [finite: +] <-> [tense: []]

    RULE Stem -> Root Deriv
        <Root ms> = <Deriv msFrom>
        <Stem ms> = <Root ms>
        <Stem ms> <= <Deriv msTo>
        <Stem ms> == [finite: +] <-> [tense: []]

    RULE Stem -> Root
        <Stem ms> = <Root ms>
        <Stem ms> == [finite: +] <-> [tense: []]
These rules can be simplified by defining a constraint template:
    CONSTRAINT ValidVerb is [finite: +] <-> [tense: []]

    RULE Word -> Stem
        <Word ms> = <Stem ms>
        <Stem ms> == ValidVerb

    RULE Word -> Stem Infl
        <Word ms> = <Stem ms>
        <Word ms> = <Infl ms>
        <Stem ms> == ValidVerb

    RULE Stem -> Root Deriv
        <Root ms> = <Deriv msFrom>
        <Stem ms> = <Root ms>
        <Stem ms> <= <Deriv msTo>
        <Stem ms> == ValidVerb

    RULE Stem -> Root
        <Stem ms> = <Root ms>
        <Stem ms> == ValidVerb
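To make the effect of the ValidVerb constraint concrete, here is a small Python sketch. It rests on two assumptions that should be checked against the description of logical constraints elsewhere in this manual: that <-> is read as a biconditional, and that an empty structure [] in a constraint matches any value. The function names matches and valid_verb are hypothetical.

```python
def matches(pattern, fs):
    """True if fs has every feature named in pattern.

    An empty dict in the pattern stands for [] and accepts any value.
    """
    if isinstance(pattern, dict):
        if not pattern:
            return True
        return isinstance(fs, dict) and all(
            name in fs and matches(sub, fs[name])
            for name, sub in pattern.items()
        )
    return pattern == fs

def valid_verb(ms):
    """[finite: +] <-> [tense: []], read as: both sides hold or neither does."""
    return matches({"finite": "+"}, ms) == matches({"tense": {}}, ms)
```

Under this reading, a structure with finite: + and some tense value passes, as does a structure with neither, while finite: + without tense, or tense without finite: +, fails.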
Some of the input control files that PC-PATR reads are standard format files. This means that the files are divided into records and fields. A standard format file contains at least one record, and some files may contain a large number of records. Each record contains one or more fields. Each field occupies at least one line, and is marked by a field code at the beginning of the line. A field code begins with a backslash character (\), and contains one or more printing characters (usually alphabetic) in addition.
If the file is designed to have multiple records, then one of the field codes must be designated to be the record marker, and every record begins with that field, even if it is empty apart from the field code. If the file contains only one record, then the relative order of the fields is constrained only by their semantics.
It is worth emphasizing that field codes must be at the beginning of a line. Even a single space before the backslash character prevents it from being recognized as a field code.
It is also worth emphasizing that record markers must be present even if that field has no information for that record. Omitting the record marker causes two records to be merged into a single record, with unpredictable results.
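The rules above are enough to write a small reader for standard format files. The following Python sketch is illustrative only and is not the parser PC-PATR uses; the function name parse_standard_format and the list-based result are assumptions. It shows both points emphasized above: a backslash that does not start the line is not a field code, and a new record starts only where the record marker appears.

```python
def parse_standard_format(text, record_marker):
    """Split standard format text into records, each a list of [code, value]."""
    records = []
    fields = None
    for line in text.splitlines():
        # A field code must start in the first column and have at least one
        # printing character after the backslash.
        is_field = (
            line.startswith("\\") and len(line) > 1 and not line[1].isspace()
        )
        if is_field:
            parts = line.split(None, 1)
            code = parts[0]
            value = parts[1].strip() if len(parts) > 1 else ""
            if fields is None or code == record_marker:
                fields = []  # the record marker opens a new record
                records.append(fields)
            fields.append([code, value])
        elif fields:
            # Anything else continues the previous field.
            fields[-1][1] = (fields[-1][1] + "\n" + line).strip()
    return records

sample = (
    "\\w fox\n"
    "\\c N\n"
    "\\g canine\n"
    "\\w foxes\n"
    "\\c N\n"
    " \\g not a field code\n"  # leading space: treated as continuation text
)
records = parse_standard_format(sample, "\\w")
```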
The lexicon file is a standard format database file consisting of any number of records, each of which represents one word. These records are divided into fields, each of which begins with a standard format marker at the beginning of a line. These markers begin with the \ (backslash) character followed by one or more alphanumeric characters. Each record begins with a designated field. PC-PATR recognizes four different fields, with these default field markers:
\w  the lexical form of the word
\c  the category of the word
\g  the gloss of the word
\f  the features of the word
Note that the fields containing the lexical form of the word and its category must be present for each word (record) in the lexicon. The other two fields (glosses and features) are optional, as are additional fields that may be present for other purposes.
Each word loaded from the lexicon file is assigned certain features based on the fields described above.
These feature names should be treated as reserved names and not used for other purposes.
For example, consider these entries for the words fox and foxes:
    \w fox
    \c N
    \g canine
    \f <number> = singular

    \w foxes
    \c N
    \g canine+PL
    \f <number> = plural
When these entries are used by the grammar, they are represented by these feature structures:
    [cat: N  gloss: canine     lex: fox    number: singular]
    [cat: N  gloss: canine+PL  lex: foxes  number: plural]
The lexicon entries can be simplified by defining feature templates in the grammar file. Consider the following templates:
    Let PL be <number> = plural
    Let N be <number> = !singular
With these two templates, defining an abbreviation for "plural" and defining a default feature for category N (noun), the lexicon entries can be rewritten as follows:
    \w fox
    \c N
    \g canine
    \f

    \w foxes
    \c N
    \g canine+PL
    \f PL
Note that the feature (\f) field of the first entry could be omitted altogether since it is now empty.
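The interaction between the PL template and the default attached to category N can be sketched as follows. This is a simplified model, not PC-PATR's code: feature structures are flat dicts, the names TEMPLATES, CATEGORY_DEFAULTS and build_entry are hypothetical, and the sketch assumes that a template named after a category is applied to every word of that category, with default values (marked by ! in the grammar file) giving way to values set explicitly.

```python
# "Let PL be <number> = plural"
TEMPLATES = {"PL": {"number": "plural"}}

# "Let N be <number> = !singular" -- a default for every word of category N
CATEGORY_DEFAULTS = {"N": {"number": "singular"}}

def build_entry(word, cat, gloss, feature_names=()):
    """Build the feature structure for one lexicon record."""
    fs = {"cat": cat, "lex": word, "gloss": gloss}
    # Templates named in the \f field are applied first.
    for name in feature_names:
        fs.update(TEMPLATES[name])
    # Category defaults only fill in features that are still missing.
    for feature, value in CATEGORY_DEFAULTS.get(cat, {}).items():
        fs.setdefault(feature, value)
    return fs

fox = build_entry("fox", "N", "canine")
foxes = build_entry("foxes", "N", "canine+PL", ["PL"])
```

With these definitions, fox receives number: singular from the category default, while foxes keeps number: plural from the PL template.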
Rather than using a dedicated lexicon file, PC-PATR can load its internal lexicon from one or more analysis files produced by the AMPLE morphological analysis program. AMPLE writes a standard format database for its output, each record of which corresponds to a word of the source text. The first field of each entry contains the analysis. Other fields, which may or may not occur, contain additional information.
The utility of this command has been greatly reduced by the availability of the load ample and load kimmo commands, which allow morphological analysis on demand to populate PC-PATR's word lexicon. However, the file disambiguate command also operates on AMPLE analysis files, so this information is still of interest.
This section describes the fields that AMPLE writes to the output analysis file. The only field that is guaranteed to exist is the analysis (\a) field. All other fields are either data dependent or optional.
The analysis field (\a) starts each record of the output analysis file. It has the following form:
\a PFX IFX PFX < CAT root CAT root > SFX IFX SFX
where PFX is a prefix morphname, IFX is an infix morphname, SFX is a suffix morphname, CAT is a root category, and root is a root gloss or etymology. In the simplest case, an analysis field would look like this:
\a < CAT root >
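The layout of the analysis field lends itself to a simple tokenizing reader. The Python sketch below is an illustration, not part of AMPLE or PC-PATR: the function name parse_analysis is hypothetical, the root delimiters are assumed to be the default < and >, and root glosses are assumed to contain no spaces. The value passed in is the field contents without the \a field code.

```python
def parse_analysis(value):
    """Split an unambiguous analysis into affix morphnames and roots."""
    tokens = value.split()
    start = tokens.index("<")
    end = tokens.index(">")
    inner = tokens[start + 1:end]
    # Roots come as (category, gloss) pairs; compounds have several pairs.
    roots = [(inner[i], inner[i + 1]) for i in range(0, len(inner) - 1, 2)]
    return {
        "before_root": tokens[:start],    # prefix and infix morphnames
        "roots": roots,
        "after_root": tokens[end + 1:],   # suffix and infix morphnames
    }

full = parse_analysis("PFX IFX PFX < CAT root CAT root > SFX IFX SFX")
simple = parse_analysis("< CAT root >")
```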
The \rd field in the analysis data file can replace the characters used to bracket the root category and gloss/etymology; see section `Root Delimiter Characters: \rd' in AMPLE Reference Manual.
The dictionary field code mapped to M in the dictionary codes file controls the affix and default root morphnames; see section `Morphname (internal code M)' in AMPLE Reference Manual.
If the AMPLE `-g' command line option was given, the output analysis file contains glosses from the root dictionary marked by the field code mapped to G in the dictionary codes file; see section `AMPLE Command Options' in AMPLE Reference Manual, and section `Root Gloss (internal code G)' in AMPLE Reference Manual.
The morpheme decomposition field (\d) follows the analysis field. It has the following form:
\d anti-dis-establish-ment-arian-ism-s
where the hyphens separate the individual morphemes in the surface form of the word.
The \dsc field in the text input control file can replace the hyphen with another character for separating the morphemes; see section `Decomposition Separation Character: \dsc' in AMPLE Reference Manual.
The morpheme decomposition field is optional. It is enabled either by an AMPLE `-w d' command line option (see section `AMPLE Command Options' in AMPLE Reference Manual), or by an interactive query.
The category field (\cat) provides rudimentary category information. It has the following form:
\cat CAT
where CAT is the proposed word category. A more complex example is
\cat C0 C1/C0=C2=C2/C1=C1/C1
where C0 is the proposed word category, C1/C0 is a prefix category pair, C2 is a root category, and C2/C1 and C1/C1 are suffix category pairs. The equal signs (=) serve to separate the category information of the individual morphemes.
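As a rough illustration of how the category field breaks down, the following Python sketch splits a single (unambiguous) value into the proposed word category and the per-morpheme category information. The function name parse_category_field is hypothetical, and the value is given without the \cat field code.

```python
def parse_category_field(value):
    """Return (word_category, per_morpheme_categories).

    Each morpheme is a tuple: one element for a root category,
    two elements for an affix category pair written as X/Y.
    """
    word_category, _, rest = value.partition(" ")
    morphemes = [tuple(item.split("/")) for item in rest.split("=")] if rest else []
    return word_category, morphemes

word_cat, morphemes = parse_category_field("C0 C1/C0=C2=C2/C1=C1/C1")
```

For the complex example above this yields C0 as the word category, followed by one prefix pair, one root category, and two suffix pairs.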
The \cat field of the analysis data file controls whether the category field is written to the output analysis file; see section `Category output control: \cat' in AMPLE Reference Manual.
The properties field (\p) contains the names of any allomorph or morpheme properties found in the analysis of the word. It has the form:
\p ==prop1 prop2=prop3=
where prop1, prop2, and prop3 are property names. The equal signs (=) serve to separate the property information of the individual morphemes. Note that morphemes may have more than one property, with the names separated by spaces, or no properties at all.
By default, the properties field is written to the output analysis file. The `-w 0' command option, or any `-w' option that does not include `p' in its argument disables the properties field.
The feature descriptor field (\fd) contains the feature names associated with each morpheme in the analysis. It has the following form:
\fd ==feat1 feat2=feat3=
where feat1, feat2, and feat3 are feature descriptors. The equal signs (=) serve to separate the feature descriptors of the individual morphemes. Note that morphemes may have more than one feature descriptor, with the names separated by spaces, or no feature descriptors at all.
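Because the properties field and the feature descriptor field share the same layout, one small helper can read either of them. The Python sketch below is only an illustration; the function name parse_morpheme_lists is hypothetical, and the value is passed without its field code.

```python
def parse_morpheme_lists(value):
    """Return one list of names per morpheme; equal signs separate morphemes."""
    return [part.split() for part in value.split("=")]

features = parse_morpheme_lists("==feat1 feat2=feat3=")
```

Four equal signs separate five morphemes here: the first two and the last have no feature descriptors, the third has two, and the fourth has one.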
The dictionary field code mapped to F in the dictionary code table file controls whether feature descriptors are written to the output analysis file; if this mapping is not defined, then the \fd field is not written. See section `Feature Descriptor (internal code F)' in AMPLE Reference Manual.
The underlying form field (\u) is similar to the decomposition field except that it shows underlying forms instead of surface forms. It looks like this:
\u a-para-a-i-ri-me
where the hyphens separate the individual morphemes.
The \dsc field in the text input control file can replace the hyphen with another character for separating the morphemes; see section `Decomposition Separation Character: \dsc' in AMPLE Reference Manual.
The dictionary field code mapped to U in the dictionary code table file controls whether underlying forms are written to the output analysis file; if this mapping is not defined, then the \u field is not written. See section `Underlying Form (internal code U)' in AMPLE Reference Manual.
The original word field (\w) contains the original input word as it looks before decapitalization and orthography changes. It looks like this:
\w The
Note that this is a gratuitous change from earlier versions of AMPLE, which wrote the decapitalized form.
The original word field is optional. It is enabled either by an AMPLE `-w w' command line option (see section `AMPLE Command Options' in AMPLE Reference Manual), or by an interactive query.
The format information field (\f) records any formatting codes or punctuation that appeared in the input text file before the word. It looks like this:
\f \\id MAT 5 HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n \\c 5\n\n \\s
where backslashes (\) in the input text are doubled, newlines are represented by \n, and additional lines in the field start with a tab character.
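The escaping convention just described can be undone with a single left-to-right pass. The Python sketch below is illustrative and not taken from AMPLE or PC-PATR; the function name decode_format_field is hypothetical, and it assumes the continuation lines of the field have already been joined and their leading tabs removed.

```python
def decode_format_field(text):
    """Turn doubled backslashes into one and the two characters \\n into a newline."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt == "\\":
                out.append("\\")      # doubled backslash
            elif nxt == "n":
                out.append("\n")      # encoded newline
            else:
                out.append(ch + nxt)  # unknown escape: keep unchanged
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

decoded = decode_format_field("\\\\c 5\\n\\n")
```

Scanning one escape at a time matters: a doubled backslash followed by the letter n decodes to a backslash and an n, not to a newline.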
The format information field is written to the output analysis file whenever it is needed, that is, whenever formatting codes or punctuation exist before words.
The capitalization field (\c) records any capitalization of the input word. It looks like this:
\c 1
where the number following the field code has one of these values:
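1  the first letter of the word is capitalized
2  all letters of the word are capitalized
4-32767  some letters of the word are capitalized and some are not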
Note that the third form is of limited utility, but still exists because of the author's last name.
The capitalization field is written to the output analysis file whenever any of the letters in the word are capitalized; see section `Prevent Any Decapitalization: \nocap' in AMPLE Reference Manual, and section `Prevent Decapitalization of Individual Characters: \noincap' in AMPLE Reference Manual.
The nonalphabetic field (\n) records any trailing punctuation, bar code (see section `Bar Code Format Code Characters: \barcodes' in AMPLE Reference Manual), or whitespace characters. It looks like this:
\n |r.\n
where newlines are represented by \n. The nonalphabetic field ends with the last whitespace character immediately following the word.
The nonalphabetic field is written to the output analysis file whenever the word is followed by anything other than a single space character. This includes the case when a word ends a file with nothing following it.
The previous section assumed that AMPLE produced only one analysis for a word. This is not always possible since words in isolation are frequently ambiguous. AMPLE handles multiple analyses by writing each analysis field in parallel, with the number of analyses at the beginning of each output field. For example,
    \a %2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%
    \d %2%imaika-Npa-ni%imaika-Npani%
    \cat %2%A0 A0=A0/A0=A0/A0%A0 A0=A0/A0%
    \p %2%==%=%
    \fd %2%==%=%
    \u %2%imaika-Npa-ni%imaika-Npani%
    \w Imaicampani
    \f \\v124
    \c 1
    \n \n
where the percent sign (%) separates the different analyses in each field. Note that only those fields which contain analysis information are marked for ambiguity. The other fields (\w, \f, \c, and \n) are the same regardless of the number of analyses that AMPLE discovers.
The \ambig field in the text input control file can replace the percent sign with another character for separating the analyses; see section `Ambiguity Marker Character: \ambig' in AMPLE Reference Manual, for details.
The previous sections assumed that AMPLE successfully analyzed a word. This does not always happen. AMPLE marks analysis failures the same way it marks multiple analyses, but with zero (0) for the ambiguity count. For example,
    \a %0%ta%
    \d %0%ta%
    \cat %0%%
    \p %0%%
    \fd %0%%
    \u %0%%
    \w TA
    \f \\v 12 |b
    \c 2
    \n |r\n
Note that only the \a and \d fields contain any analysis information, and those both have the decapitalized word as a place holder.
The \ambig field in the text input control file can replace the percent sign with another character for marking analysis failures and ambiguities; see section `Ambiguity Marker Character: \ambig' in AMPLE Reference Manual, for details.
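The marking of ambiguities and failures described in the last two sections can be read back with a short helper. The Python sketch below is an illustration under stated assumptions, not AMPLE's or PC-PATR's code: the function name split_analyses is hypothetical, the field value is passed without its field code, and a marked value is assumed to be well formed, that is, to start and end with the ambiguity marker.

```python
def split_analyses(value, marker="%"):
    """Return (count, alternatives) for one analysis-bearing field value.

    An unmarked value is a single successful analysis. A marked value has
    the form %N%alt1%alt2%...%, where N is 0 for an analysis failure.
    """
    if not value.startswith(marker):
        return 1, [value]
    parts = value.split(marker)
    # parts[0] is empty, parts[1] is the count, and the final element is the
    # empty string after the trailing marker.
    return int(parts[1]), parts[2:-1]

ambiguous = split_analyses("%2%imaika-Npa-ni%imaika-Npani%")
failed = split_analyses("%0%ta%")
```

For a failure, the single alternative returned is only the place holder (the decapitalized word, or an empty string in fields such as \cat), so callers should check the count before using it.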
Normally, PC-PATR requires the linguist to develop a full-fledged lexicon of words with their features. This may be unnecessary if a morphological analysis, and a comprehensive lexicon of morphemes, has already been developed using either PC-Kimmo (version 2) or AMPLE (version 3). These morphological parsing programs are also available from SIL.
Version 2 of PC-Kimmo supports a PC-PATR style grammar for defining word structure in terms of morphemes. This provides a straightforward way to obtain word features as a result of the morphological analysis process. For best results, the (PC-Kimmo) word grammar and the (PC-PATR) sentence or phrase grammar should be developed together.
When using the PC-Kimmo morphological parser, PC-PATR requires a special lexical rule in the (sentence level) grammar file. This rule is named MapKimmoFeatures and is used automatically to map from the features produced by the word parse to the features needed by the sentence parse. For example, consider the following definition:
    Define MapKimmoFeatures as
        <out cat> = <in head pos>
        <out lex> = <in lex>
        <out head> = <in head>
This lexical rule uses the <head pos> feature produced by the PC-Kimmo parser as the <cat> feature for the PC-PATR parser, and passes the <lex> and <head> features from the morphological parser to the sentence parser unchanged.
The only thing necessary to use the AMPLE morphological parser inside PC-PATR is to load the appropriate control files and dictionaries. This will not be too useful, however, unless the AMPLE dictionaries contain feature descriptors to pass through to PC-PATR. It is also required for the AMPLE data to define the word category. (Either the word-final suffix category or the word-initial prefix category can be designated in the analysis data file). Consult the AMPLE documentation for more details on either of these issues.
The Microsoft Windows implementation uses the Microsoft C QuickWin function, and the Macintosh implementation uses the Metrowerks C SIOUX function.
Gazdar and Mellish (1989, pages 142-147) discuss why context-free phrase structure grammars are inadequate to model some human languages. The PATR-II formalism (unification of feature structures added to the context-free phrase structure rules) is shown to be adequate for those cases.
This is a new feature of AMPLE version 3.
The unified dictionary is a new feature of AMPLE version 3.
Would this be a useful enhancement to PC-PATR?
These two operators are equivalent in PC-PATR, since the implementation treats each lexical rule as an ordered list of assignments rather than using unification for the mappings that have an equal sign operator.
By default, \w also marks the initial field of each word's record.
This document was generated on 28 November 2006 using the texi2html translator version 1.52.