Architecture of SILKin, Part 2

The Contexts Folder

During Library operations, especially Machine Learning, SILKin often needs to read in all or a portion of a Context with a known kinship system. Although each context can be found in the Domain Theory Files folder, those files are in human-readable PROLOG format. For speed of access, I store each context in a Java proprietary format with file extension ".ctxt" in this folder.

These files are written and read only by Java utilities. Once read into memory, they are accessed as instances of the Context class. No file structure spec is therefore needed.

The Context Stubs File

A SILK file contains all the data, preferences, and decisions re: a particular project. The stubs file, by contrast, is a much smaller record of the state of the SILKin Library (a 'stub' of information about each context) and a few other state variables that do not change with each project. It is a flat file, not in XML or any other format. It is read in one line at a time in this format:

Line Meaning
0 number of context stubs in this stub file
1 code of CURRENT ACTIVITY: DATA_GATHERING = 0; BROWSING = 1; SUGGESTIONS = 2; ADMIN = 3;
2 name of last context browsed
3 'menuLang=' plus 2-letter code of language to be used in all menus and screen text. Currently limited to en = English or fr = French.
NOTE: This optional line may not be present.
4 'userDir=' plus the file name of User's last access. (It may not be just the name of the context if user is making alternate versions.)
5 name of context last used
6 'editDir=' plus the full pathname to the directory where User is keeping SILK files.
7 serial number of test (no longer used)
Lines 8-12 are optional, and may not be present.
8 full pathname of most recent file touched
9 'null' or full pathname of 2nd most recent file touched
10 'null' or full pathname of 3rd most recent file touched
11 'null' or full pathname of 4th most recent file touched
12 'null' or full pathname of 5th most recent file touched
13 true if Help Screen should pop up at start-up. Else false.
Lines 14 and beyond contain as many context stubs as were declared in line 0. Each stub is a quadruple:
context_name,
true if Terms-of-Address theory file exists,
true if Terms-of-Reference theory file exists,
true if Census file exists.
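
As an illustration only (this is not SILKin's actual reader), the sketch below shows how the activity code and the three prefixed lines described above might be picked out of a stubs file. The class name StubsLineSketch and both helper methods are hypothetical. The sketch assumes that the prefixed lines can be recognised by their 'menuLang=', 'userDir=' and 'editDir=' prefixes regardless of position, which sidesteps the optional lines.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StubsLineSketch {
    // Activity codes in the order listed for line 1 of the stubs file,
    // so that each constant's ordinal equals its code (0..3).
    enum Activity { DATA_GATHERING, BROWSING, SUGGESTIONS, ADMIN }

    // Hypothetical helper: map the integer on line 1 to an Activity.
    static Activity activityFromCode(int code) {
        return Activity.values()[code];
    }

    // Hypothetical helper: collect the values of the prefixed lines.
    // Lines without one of the three known prefixes are ignored here.
    static Map<String, String> taggedValues(List<String> lines) {
        Map<String, String> found = new LinkedHashMap<>();
        for (String line : lines) {
            for (String key : new String[] {"menuLang", "userDir", "editDir"}) {
                String prefix = key + "=";
                if (line.startsWith(prefix)) {
                    found.put(key, line.substring(prefix.length()));
                }
            }
        }
        return found;
    }
}
```

Keying on the prefix rather than the line number is a design choice of this sketch: it keeps the lookup correct whether or not the optional 'menuLang=' line is present.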

The Domain Theory Files Folder

If the context.stubs file declares that a Terms-of-Reference file exists for a context, then that file (with extension '.thy') will be in this folder. I must distinguish here between a domain theory submitted to the Library by a human versus a .thy file in this folder.

Anyone working in kinship analysis may submit a domain theory (a set of definitions for kin terms plus certain other data) to the SILKin Administrator. That theory must be expressed in (or converted to) PROLOG syntax in some arbitrary file. Only the Administrator may accept that submission and add it to the Library. When it is added, SILKin will convert the PROLOG file into an 'expanded PROLOG' format, compute all the expansions of definitions into expandedDefs, and add other data to aid later analysis. The original file submitted to the Library will not be retained, only this .thy file.

Domain Theory Files Prepared by Humans

Domain theory files in SILKin are allowed to have a 'header' section that declares facts about the context and language to which the theory applies. After this header has been read, the remainder of the file follows PROLOG syntax. This is also known as Horn Clause syntax. To my knowledge, that portion of the file can be read into PROLOG and manipulated there. Here are the header conventions, special characters, and symbols in PROLOG syntax:

HEADER DATA is expressed as lists enclosed in parentheses with comma-separated elements. The lists are all optional, and may appear in any order, although 'language' normally comes first.
Symbol Meaning
language All text enclosed in double quotes is the name of the language in which these kin term definitions apply. Example: (language, "English")
author Text enclosed in double quotes is the name of the person who created the definitions.
date Text enclosed in double quotes is the date these definitions were published or submitted. Date may be in any format.
polygamyOK Boolean (not in quotes) indicates if multiple concurrent wives are recognized in kinship terminology.
partial Boolean (not in quotes) indicates if this theory is incomplete.
recursiveLevels Integer sets the limit on how many levels of recursion need to be expanded in order to adequately reflect recursive definitions.
non_term Remainder of list are symbols which, when embedded anywhere in a predicate, indicate that this is not a top-level term in the kinship system but rather an auxiliary or 'helper' concept. All such symbols must be enclosed in square brackets. Example: (non_term, [aux], [eq])
citation A single element in double-quotes, however lengthy, will be stored as the academic citation for this language.
synonyms This symbol is followed by pairs of words, enclosed in parentheses and comma separated. These pairs of kin terms are declared to be synonymous.
umbrellas This symbol is followed by lists, comma-separated, that have internal lists. Each top-level list contains a kin term and a list of all other kin terms that are encompassed by that kin term. Example: (ancestor, (parent, grandmother, father))
overlaps This symbol is followed by lists, comma-separated, that have internal lists. Each top-level list contains a kin term and a list of all other kin terms that can legitimately overlap with the meaning of that kin term. Example: (buddy, (cousin, uncle))
userDefinedProperties This symbol is followed by one or more lists, comma-separated. Each of those lists contains the specification of one UDP, expressed as keyword-value pairs. The keywords are listed here for convenience, but complete information is in the UDP portion of the SILK file specification.
  • starName
  • type
  • single_value
  • restricted_to
  • default
  • max
  • min
  • chartable
These SPECIAL CHARACTERS may appear in the PROLOG portion of the file.
;; Comment. Any text following this symbol, to the end of the line, is ignored.
:- 'Implies' or 'Defined As'. Any literal appearing to the left of this symbol has the same meaning as the series of literals appearing on the right. You may also think of this as a 'rewrite rule.'
, Comma. When a Horn Clause contains more than one literal, the literals are separated by commas. When an argument list contains more than one argument, the arguments are separated by commas.
. Period. A Horn Clause statement always ends with a period, meaning 'end of statement' or 'end of clause'.
| Vertical Bar. This symbol joins Horn Clauses into a disjunction. When it appears at the front of a line, it means 'another definition of the previous literal is'.
[ ] Square Brackets. Any expression enclosed in square brackets and included as an element in a clause is treated as a SILKin flag.
HORN CLAUSE COMPONENTS are as follows, and follow these conventions.
predicate a single symbol beginning with a lowercase letter that appears before a list of arguments is called the predicate. Its meaning is arbitrary, defined by the user. In SILKin, we provide a batch of SILKin-defined predicates.
argument list a list of symbols, comma separated, enclosed in parentheses, is considered the list of arguments to which the predicate is applied. An argument list may contain one or more arguments. In SILKin there are almost always two, occasionally just one.
argument an argument is a single symbol or a single literal. If the symbol begins with an uppercase letter, it is considered a variable. If it is a number or a symbol beginning with a lowercase letter, it is considered a constant.
literal a predicate followed by an argument list is called a literal.
Horn Clause A literal, followed by the 'Implies' symbol followed by one or more literals, comma-separated, and terminated with a period constitutes a Horn Clause. A single Horn Clause expresses one definition (or rewrite rule) for the left-hand literal. When multiple definitions or rewrite rules apply to the same literal, multiple Horn Clauses may be written for it. By convention, those multiple clauses are grouped together, and all clauses after the first one begin with the OR symbol (vertical bar).
Ego and Alter In SILKin, and all kinship literature, Ego always refers to the person using a kin term, and Alter is the person who is called by that term.
EXAMPLES
uncle(Alter, Ego) :- parent(P, Ego), brother(Alter, P).
               | parent(A, Ego), sister(S, A), husband(Alter, S).
clanbrother(Alter, Ego) :- *clan(C, Ego), *clan(C, Alter), not(sibling(Ego, Alter)), male(Ego), male(Alter).
;; We assume a UDP (*clan) has been declared previously.
;; Negation is just a predicate applied to an argument that is a literal.

Any file that follows these conventions can be imported into SILKin as a domain theory for the Library. It will be processed, indexed and then stored in the Domain Theory Files folder as a .thy file.
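
The argument conventions described above (uppercase initial for a variable, lowercase initial or number for a constant, or an embedded literal) can be sketched in a few lines of Java. This is a minimal illustration, not SILKin's parser. The class ArgumentKindSketch and its classify method are hypothetical names, and treating any argument that contains a parenthesis as a literal is a simplifying assumption.

```java
final class ArgumentKindSketch {
    enum Kind { VARIABLE, CONSTANT, LITERAL }

    // Classify one argument taken from an argument list.
    static Kind classify(String arg) {
        String a = arg.trim();
        if (a.isEmpty()) {
            throw new IllegalArgumentException("empty argument");
        }
        // Assumption: an argument with its own argument list is a literal,
        // e.g. the argument of not(sibling(Ego, Alter)).
        if (a.contains("(")) {
            return Kind.LITERAL;
        }
        char first = a.charAt(0);
        if (Character.isUpperCase(first)) {
            return Kind.VARIABLE;   // e.g. Alter, Ego, P
        }
        if (Character.isLowerCase(first) || Character.isDigit(first)) {
            return Kind.CONSTANT;   // e.g. male, 42
        }
        throw new IllegalArgumentException("unrecognised argument: " + a);
    }
}
```

For example, classify("Alter") yields VARIABLE, while classify("sibling(Ego, Alter)") yields LITERAL.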

SILKin-Generated Domain Theory Files (.thy)

All of the conventions and special characters detailed above apply to a .thy file as well. In addition, extra information is added, enclosed by special characters. It is no longer usable as PROLOG, although the Horn Clauses embedded in it are. The file is still human-readable, but is not intended for human editing. A .thy file repeats the human-crafted Horn Clause kin term definitions from the original domain theory, but all Horn Clauses are then expanded into expandedDefs: Horn Clauses that use only SILKin's built-in, singular, gender-specific ('primitive') predicates.

Additional Special Characters for .thy Files
Symbol Meaning
Curly Brackets After the human-crafted Horn Clauses I insert a concatenated 'kin type signature' surrounded by curly brackets. This is a sorted, concatenated list of all the kin types covered by this definition.
Each expandedDef is preceded by a comment giving its serial number, which facilitates retrieving a particular expanded clause.
Percent Sign Enclosed within percent signs is a trace of the expansion path that transformed the human-crafted definition into this expandedDef.
Keyword-Value Pairs On the line immediately after the expansion path, I insert 4 keyword-value pairs (joined by an equal sign) plus the kin type of this clause (enclosed in curly brackets).
Keyword Value
Lvl This is the height of Alter in Ego's family tree.
PC The number of parent and child links traversed from Ego to Alter.
S The number of spousal links traversed from Ego to Alter.
Star The number of 'star links' traversed from Ego to Alter (i.e. how many UDPs are involved in the connection).
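
To make the keyword-value line concrete, here is a hedged sketch of how such a line could be read back. The exact layout of the line (separators, spacing, order) is not specified above, so the sample line used in the usage note is an assumption, as are the class name ExpandedDefStatsSketch and its methods. The sketch relies only on what is stated: four keywords joined to integers by an equal sign, plus a kin type in curly brackets.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExpandedDefStatsSketch {
    // One of the four keywords, an equal sign, then a (possibly negative) integer.
    private static final Pattern PAIR =
            Pattern.compile("\\b(Lvl|PC|S|Star)\\s*=\\s*(-?\\d+)");
    // The kin type is whatever sits between the curly brackets.
    private static final Pattern KIN_TYPE = Pattern.compile("\\{([^}]*)\\}");

    // Returns the keyword-value pairs found on the line, in order of appearance.
    static Map<String, Integer> counts(String line) {
        Map<String, Integer> out = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            out.put(m.group(1), Integer.parseInt(m.group(2)));
        }
        return out;
    }

    // Returns the kin type, or null if the line has no curly-bracket section.
    static String kinType(String line) {
        Matcher m = KIN_TYPE.matcher(line);
        return m.find() ? m.group(1).trim() : null;
    }
}
```

With a hypothetical line such as "Lvl=1, PC=3, S=0, Star=0 {FaBro}", counts returns the four integers keyed by Lvl, PC, S and Star, and kinType returns "FaBro".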

A human-crafted domain theory that is fairly compact can expand into a very large .thy file, for several reasons:

Again, although the .thy file is human-readable, it is basically an internal pre-computation of data that will speed SILKin's learning efforts.

The Feature Vectors Folder

During the academic phase of this project, we attempted to cluster the various domain theories in the Library according to shared features. The idea was that a kin term definition emerging in the User's data might resemble several definitions already in the Library. If one of those definitions came from a domain theory that seemed to share features with the User's emerging domain theory, it would be helpful to suggest that the User consider possible parallels between the language/context they are studying and the 'similar' context in the Library.

We tried to identify characteristics of a domain theory that might reveal underlying cultural patterns, and then clustered all the Library's domain theories around these features. By estimating the features of the User's emerging context, we hoped to identify the Library contexts most likely to be relevant. (Details of the scheme are available in the dissertation.) Alas, later evaluation showed the scheme to be of marginal value. Because the clustering did not hurt performance, the feature vectors used in it are preserved in this folder.

The Resources Folder

Until version 2.2, SILKin was written strictly in English. An extensive set of Help files was placed in a Help Files Folder in the Library. They were all HTML files, so a User could go to them directly with a browser, or view them as-needed via a hierarchical topical menu from inside SILKin. Any SILKin installation prior to version 2.2 has just a Help Files folder in the Library Folder. Beginning with 2.2, there will instead be a Resources Folder that contains help and other resources.

Version 2.2 and later are set up to display menus, screen text, and messages in any language for which translated text is available. Following the Java convention, all language-dependent files are stored in a Resources folder in the Library Folder.

Contents of the Resources Folder
File or Folder Description
ValidLanguages.txt A flat file. Each line consists of a standard 2-letter language code, a comma, a space, and a text title for that language. The 2-letter code is stored in the context.stubs file, so it persists across all projects until the User switches it.
EXAMPLE: en, English(USA)
HelpFiles This folder contains the help files to be displayed by default if the User chooses an invalid menu language, or if any file in the chosen language is missing.
HelpFiles_en This folder contains the help files in English(USA).
HelpFiles_fr This folder contains the help files en Francais.
MenuItems.properties A flat file of the menu keywords and their full text in the default language, English. FORMAT: keyword, space, equal sign, space, full text.
MenuItems_en.properties The menu keywords with full text in English(USA).
MenuItems_fr.properties The menu keywords with full text en Francais.
Messages.properties
Messages_en.properties
Messages_fr.properties
Same pattern. These files have message keywords for every pop-up, dialog, or error message in SILKin.
ScreenElements.properties
ScreenElements_en.properties
ScreenElements_fr.properties
Same pattern. These files have keywords for every text item that may appear on a screen in SILKin.
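
The default-plus-language pattern in the table above can be illustrated with java.util.Properties, which accepts the 'keyword = full text' layout and supports a chain of defaults. This is a sketch under stated assumptions, not SILKin's loading code: the class MenuTextSketch, its methods, and the keywords used in the usage note are all hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class MenuTextSketch {
    // File name for a language-specific file, following the naming pattern
    // in the table: MenuItems.properties, MenuItems_fr.properties, and so on.
    static String fileNameFor(String baseName, String langCode) {
        if (langCode == null || langCode.isEmpty()) {
            return baseName + ".properties";
        }
        return baseName + "_" + langCode + ".properties";
    }

    // Parse 'keyword = full text' lines. Any keyword missing from this text
    // is looked up in 'defaults' (which may be null).
    static Properties load(String text, Properties defaults) {
        Properties p = new Properties(defaults);
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }
}
```

For instance, if the default text defines 'file = File' and 'quit = Quit' while the French text defines only 'file = Fichier', then loading the French text with the default Properties as its fallback gives "Fichier" for file and "Quit" for quit.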

The Help Files

The structure and information in the Help files is identical in all languages. The hierarchical menu of topics within SILKin hopefully allows a User to zero in on specific help needed, but the topics are arranged from basic to advanced so the whole set could be read as a User Manual. You may view the entire set (in English) via the User Manual button in the top navigation menu.

Adding New Languages

As configured for version 2.2, SILKin is ready to accept any number of menu languages. The hard part of adding a new language is translating text. The easy part is adding the new language option once the translation work is finished. The following files must be translated, preferably in this order:

  1. MenuItems.properties
  2. ScreenElements.properties
  3. All the Help files, preferably in this order:
    1. Start.html
    2. Chart.html
    3. ContextEdit.html
    4. Suggs.html
    5. Prefs.html
    6. HornClause.html
    7. NonGen.html
  4. Messages.properties
  5. Silk-status.xsl (found in the Suggestions Folder)

The Kin_DFA and GEDCOM_DFA Files

Although parser-builders are common and useful, I perceived early on some advantages to building my own parsers for the various input files. They use Deterministic Finite Automata (DFAs) to guide the conversion from character streams to logical tokens. These two files are for parsing a Domain Theory or a GEDCOM import file. The DFAs are flat files in a somewhat canonical form. Elements on a line are separated by spaces. Each line consists of:

Each CharacterFound is defined using Java's built-in lexical methods in the code file JavaLex.java documented here. Some characters match more than one CharacterFound definition. The first match ends the search and transitions to the NewState. When NewState = -99, then the DFA returns the token for this state and returns to state 0.
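
The table-driven behaviour described here (first matching CharacterFound wins, and a NewState of -99 returns the token and resets to state 0) can be sketched as follows. Everything beyond those two rules is an assumption made for illustration: the three-column row layout (state, character class, new state), the class names "letter", "digit" and so on, the choice not to consume the character that triggers -99, and the class DfaSketch itself. It is not the content of Kin_DFA or GEDCOM_DFA.

```java
import java.util.ArrayList;
import java.util.List;

final class DfaSketch {
    static final int RETURN_TOKEN = -99; // sentinel NewState from the text

    static final class Row {
        final int state; final String charClass; final int newState;
        Row(int state, String charClass, int newState) {
            this.state = state; this.charClass = charClass; this.newState = newState;
        }
    }

    // Assumed row layout: "state charClass newState", separated by spaces.
    static Row parseRow(String line) {
        String[] f = line.trim().split("\\s+");
        return new Row(Integer.parseInt(f[0]), f[1], Integer.parseInt(f[2]));
    }

    // Hypothetical character classes built on Java's own lexical tests.
    static boolean matches(String charClass, char c) {
        switch (charClass) {
            case "letter": return Character.isLetter(c);
            case "digit": return Character.isDigit(c);
            case "letterOrDigit": return Character.isLetterOrDigit(c);
            case "whitespace": return Character.isWhitespace(c);
            case "any": return true;
            default: throw new IllegalArgumentException("unknown class " + charClass);
        }
    }

    static List<String> tokenize(List<Row> rows, String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String text = input + " "; // trailing blank flushes the last token
        int state = 0;
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            Row hit = null;
            for (Row r : rows) {   // the first matching row ends the search
                if (r.state == state && matches(r.charClass, c)) { hit = r; break; }
            }
            if (hit == null) {
                throw new IllegalStateException("no transition from state " + state + " on '" + c + "'");
            }
            if (hit.newState == RETURN_TOKEN) {
                tokens.add(current.toString()); // return the token for this state
                current.setLength(0);
                state = 0;                      // c is re-read from state 0
            } else {
                if (hit.newState != 0) current.append(c); // state 0 only skips
                state = hit.newState;
                i++;
            }
        }
        return tokens;
    }
}
```

Given the hypothetical rows "0 whitespace 0", "0 letter 1", "0 digit 2", "1 letterOrDigit 1", "1 any -99", "2 digit 2" and "2 any -99", the input "uncle 42 Ego" is split into the tokens uncle, 42 and Ego. Note that a letter in state 1 matches both "letterOrDigit" and "any"; row order decides which one applies.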

The KinTermSigTree File

When looking for Library definitions that might match the emerging definition in the User's data, the only candidates we need consider are those that cover all the kin types the User has identified so far, plus perhaps more. This flat file is used to build a Java TreeMap (a red-black tree) where the keys are sorted-concatenated kin types and the values are lists of all definitions in all languages that cover that collection of kin types.
A sample line: BroDa_HuBroDa_HuSisDa_SisDa_WiBroDa_WiSisDa { (([{English, niece}])) }
We call the sorted concatenated list of kin types a KinTerm Signature.
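
A minimal sketch of the idea, assuming that kin types are sorted in plain String order and joined with underscores (which is consistent with the sample line, but is otherwise an assumption). The class KinTermSigSketch, its methods, and the "language/term" strings used as values are hypothetical stand-ins for SILKin's real structures.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class KinTermSigSketch {
    // Build a KinTerm Signature: sort the kin types, then join them with '_'.
    static String signature(Collection<String> kinTypes) {
        return String.join("_", new TreeSet<>(kinTypes));
    }

    // Return every definition whose signature covers all the kin types seen
    // so far (and perhaps more). A linear scan keeps the sketch simple.
    static List<String> candidates(TreeMap<String, List<String>> tree,
                                   Collection<String> seenKinTypes) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : tree.entrySet()) {
            List<String> covered = List.of(e.getKey().split("_"));
            if (covered.containsAll(seenKinTypes)) {
                out.addAll(e.getValue());
            }
        }
        return out;
    }
}
```

Calling signature on the six kin types of the sample line, in any order, reproduces the key BroDa_HuBroDa_HuSisDa_SisDa_WiBroDa_WiSisDa.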

The KinTermSigCompressed File

The full KTSigTree has an entry for each unique kin term signature. Many of those entries contain a single definition from a single language. But often we want to find signatures that have definitions in many different languages. This would indicate a common kinship concept. So the KinTermSigCompressed file is just a pre-computed tree with all single-definition signatures removed. The file format is identical to the KinTermSigTree; this is just a pruned tree.
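
The pruning step can be sketched in a few lines. This is an illustration only: the class SigPruneSketch is hypothetical, and the map's values are plain strings standing in for whatever definition objects SILKin actually stores.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SigPruneSketch {
    // Keep only the signatures that have more than one definition.
    static TreeMap<String, List<String>> compress(Map<String, List<String>> full) {
        TreeMap<String, List<String>> pruned = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : full.entrySet()) {
            if (e.getValue().size() > 1) {
                pruned.put(e.getKey(), e.getValue());
            }
        }
        return pruned;
    }
}
```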

Predicate Encodings and Decodings

These two files are another pre-computed efficiency hack. Kin term definitions in the Library are stored in .thy files (one or two per context; a language may have both a Ref and an Adr context) and listed by the actual term as it is spelled in that language. During machine learning sessions, we do a lot of set operations on groups of candidate definitions. It is more space efficient to refer to each one by a compact code rather than languageName/kinTerm. The Encoding and Decoding files allow translation to and from such codes.

Format for the PredEncodings file is:

The file contains one of the above structures for each language in the Library.

The Decoding file goes from code to full languageName/KinTermName.

NOTE that auxiliary kin terms (i.e. non-terms in the language, but useful intermediate concepts, like 'cross-cousin') are included in the Encoding and Decoding files.
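
The two-way translation can be pictured as a pair of in-memory lookups. The sketch below is an assumption-laden illustration: it uses consecutive integers as the compact codes and "languageName/kinTerm" strings as the full names, neither of which is confirmed by the file descriptions above, and the class PredCodecSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PredCodecSketch {
    private final Map<String, Integer> encode = new HashMap<>(); // name -> code
    private final List<String> decode = new ArrayList<>();       // code -> name

    // Return the code for a term, assigning the next free code on first use.
    int codeFor(String languageName, String kinTerm) {
        String key = languageName + "/" + kinTerm;
        Integer code = encode.get(key);
        if (code == null) {
            code = decode.size();
            encode.put(key, code);
            decode.add(key);
        }
        return code;
    }

    // Return the full languageName/kinTerm for a previously assigned code.
    String nameFor(int code) {
        return decode.get(code);
    }
}
```

Set operations over candidate definitions can then work on small integers, with nameFor used only when a result has to be shown to the User.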

The Standard_Macros File

The Java code models Horn Clauses in Object-Oriented Programming style. It also has 24 built-in predicates: logical, mathematical, or 'biological' predicates representing logical operations or physical facts. They are universal and should never require revision.

However, English-speaking authors often express kinship in cultural terms that are widely understood, but not necessarily universal. 'Brother' is well understood, but a precise definition of it in biological terms would require something like this:

sibling(Alter, Ego) :- mother(M, Ego), father(F, Ego), child(Alter, M),
          child(Alter, F), not(equal(Alter, Ego)).

brother(Alter, Ego) :- sibling(Alter, Ego), male(Alter).

The Standard Macros file contains Horn Clauses that define 15 convenience terms. These standard 'macro definitions' are automatically added to every domain theory, so they can always be used in defining terms.

For a precise definition of these terms, read the Standard_Macros file.
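
To show what the sibling and brother clauses above assert, here is a small Java rendering of the same two rules over a toy data model. It is a sketch, not SILKin's Horn Clause machinery: the class MacroSketch and its Person type are hypothetical, and a person is assumed to record the names of both parents, so that child(Alter, M) becomes "Alter's mother is M".

```java
final class MacroSketch {
    // Hypothetical record of one person: name, sex, and both parents' names.
    static final class Person {
        final String name; final boolean male; final String mother; final String father;
        Person(String name, boolean male, String mother, String father) {
            this.name = name; this.male = male; this.mother = mother; this.father = father;
        }
    }

    // sibling(Alter, Ego) :- mother(M, Ego), father(F, Ego),
    //     child(Alter, M), child(Alter, F), not(equal(Alter, Ego)).
    static boolean sibling(Person alter, Person ego) {
        return ego.mother != null && ego.father != null
                && ego.mother.equals(alter.mother)  // child(Alter, M)
                && ego.father.equals(alter.father)  // child(Alter, F)
                && !alter.name.equals(ego.name);    // not(equal(Alter, Ego))
    }

    // brother(Alter, Ego) :- sibling(Alter, Ego), male(Alter).
    static boolean brother(Person alter, Person ego) {
        return sibling(alter, ego) && alter.male;
    }
}
```

Under this reading a half-sibling who shares only one parent with Ego is not a sibling, and nobody is their own brother.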

The Suggestions Folder

Although suggestions usually play a minor role (if any) for a field worker gathering kinship data, generating them was a major focus of the SILKin dissertation. Machine learning sessions analyze the data gathered so far and try to suggest things that will minimize the data gathering burden and speed completion of a full domain theory. There are six types of suggestions:

  1. Suggested Definition
  2. Potential Synonym
  3. Potential Umbrella Term
  4. Potential that an Umbrella is really a Synonym
  5. Potential Overlapping Definitions
  6. Data Request

Precise definitions of these suggestion types will be given later. The important point here is that when SILKin conducts a learning session, it stores the complete results in this folder as an HTML file. The HTML is the result of creating an intermediate XML file with the suggestion data, and then transforming it into a more attractive format via a formatting file: silk-status.xsl created by Gary Simons of SIL. Gary's XSL file not only formats and explains each suggestion, it also provides a nice statistical analysis of the terms encountered and defined.
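
The XML-to-HTML step described above is a standard XSLT transformation, which the JDK supports through javax.xml.transform. The sketch below shows the general shape of such a step using in-memory strings. It is not SILKin's code and does not reproduce silk-status.xsl; the class SuggestionReportSketch and its toHtml method are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class SuggestionReportSketch {
    // Apply the stylesheet text 'xsl' to the document text 'xml'
    // and return the transformed output as a String.
    static String toHtml(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }
}
```

In a file-based setting the two StringReader sources would be replaced by StreamSource objects built from the intermediate XML file and the XSL file, and the StringWriter by a StreamResult on the project's HTML file.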

Because one laptop might be used for several different projects, the Context Under Construction folder contains a SILK file for each project. Likewise, this Suggestions Folder contains one HTML file for each project, plus the XSL formatting file.

SILKin provides a nice split-screen for examining the suggestions and recording the User's decisions about them. It is also possible to simply open the HTML file with certain browsers and see the suggestions (but not act on them). Whenever a new learning session is requested, a complete reanalysis of the User's data is performed, constrained by any decisions the User made when acting on prior suggestions. The results of the latest session replace any earlier file, so there is only one file per project.

The WeightVector File

As previously discussed, part of the academic effort was an attempt to cluster domain theories in the Library according to similar features. This small flat file is just a list of the weights to be applied to each feature when computing similarity. Although the clustering effort proved to be of little value, clustering is still done when a new domain theory is added to the Library. This file is preserved (along with the Feature Vectors) for that purpose.


Continue to the next topic: the Major Processes in SILKin.