Introduction

This presentation provides a brief tutorial on the maxent.ot R package we’re currently developing. We will show you how to format input data, fit MaxEnt OT models, evaluate their predictions, compare models, and incorporate bias terms.

What is MaxEnt OT?

Maximum Entropy Optimality Theory (henceforth MaxEnt OT; Goldwater & Johnson 2003) is a variant of Optimality Theory (Prince & Smolensky 1993/2004) that uses numeric constraint weights rather than constraint rankings (Pater 2009). Given a set of constraint weights and violation profiles for each output candidate, MaxEnt OT calculates probability distributions over candidates for a given input. This has made it a popular tool for modeling variability in phonological patterns at either the individual or population level (e.g., Hayes & Wilson 2008, Moore-Cantwell & Pater 2016, White 2017, Zuraw & Hayes 2017, Mayer 2021, a.o.).

MaxEnt OT calculates the probability of an output candidate \(y\) given an underlying form \(x\) as:

\[P(y|x; w) = \frac{1}{Z_w(x)}\exp\left(-\sum_{k=1}^{m}{w_k f_k(y, x)}\right)\]

where \(f_k(y, x)\) is the number of violations of constraint \(k\) incurred by mapping underlying \(x\) to surface \(y\), \(w_k\) is the weight of constraint \(k\), and \(m\) is the number of constraints. \(Z_w(x)\) is a normalization term defined as

\[Z_w(x) = \sum_{y\in\mathcal{Y}(x)}{\exp\left(-\sum_{k=1}^{m}{w_k f_k(y, x)}\right)}\] where \(\mathcal{Y}(x)\) is the set of observed surface realizations of input \(x\).

The log likelihood of a dataset \(D\) given a set of constraint weights \(w\) is calculated as

\[LL_w(D) = \sum_{i=1}^{n}{\ln P(y_i|x_i; w)}\]

where \(n\) is the number of data points and \((x_i, y_i)\) is the \(i\)th observed input-output pair in \(D\).
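
To make this concrete, here is a minimal R sketch (not part of the package) that computes candidate probabilities for a single input from a vector of constraint weights and a matrix of violation counts; the constraint names and weights here are made up purely for illustration:

# Toy example: two output candidates for one input and two constraints.
# Rows are candidates; columns are constraint violation counts.
violations <- matrix(c(1, 0,
                       0, 1),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("faithful", "reduced"),
                                     c("Constraint1", "Constraint2")))
weights <- c(Constraint1 = 2, Constraint2 = 1)  # made-up weights

harmonies <- exp(-violations %*% weights)  # exp of negative weighted violation sums
probs <- harmonies / sum(harmonies)        # divide by Z_w(x) to normalize
probs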

Given a data set consisting of input-output pairs and their corresponding violation profiles, it is straightforward to learn the optimal constraint weights using maximum likelihood estimation (MLE) or maximum a posteriori estimation (MAP; see Hayes & Wilson 2008). Because probabilities can be calculated for each observed output, different models fit to the same data (that is, models with different constraint weights or sets of constraints) can be compared numerically.

The data

In this presentation we will walk you through a MaxEnt OT analysis of a simple, fabricated dataset that represents a French-speaking child’s acquisition of onset consonant clusters, based loosely on Rose (2002). For this imaginary child, onset clusters are variably simplified by deleting one of the consonants; deletion is more frequent in unstressed syllables than in stressed ones; and deletion is more frequent for clusters that violate sonority sequencing, like [st] and [sp], than for rising-sonority clusters like [gr].

We plot the overall pattern below:

Loading the library

The first step is to load the maxent.ot library. This needs to be installed from GitHub since it’s not (yet) an official CRAN package.

# Install devtools if needed, then use it to install maxent.ot from GitHub
if (!require(devtools)) {
  install.packages("devtools", repos = "http://cran.us.r-project.org")
}
if (!require(maxent.ot)) {
  devtools::install_github("connormayer/maxent.ot")
  library(maxent.ot)
}

Tableau format

We’ll start with a tableau that contains two simple constraints: *Complex, which penalizes complex onsets, and Max, which penalizes deletion. These constraints obviously can’t capture the influences of stress and sonority, but they let us look at the input data format and set the stage for some model comparison.

Tableaux must be represented in OTSoft format (https://linguistics.ucla.edu/people/hayes/otsoft/). Here’s what this looks like. It doesn’t matter which separator is used (commas, tabs, etc.).

  • The first two rows contain labels for the constraints (by OTSoft convention, full names in the first row and abbreviations in the second).
  • The remaining rows correspond to UR-SR pairs.
  • The first three columns are unlabeled: they contain the UR, the SR candidate, and the observed frequency of that candidate, in that order.
  • The remaining columns correspond to the individual constraints and record the number of violations each candidate incurs.
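
If you’d like to inspect the raw file from within R, you can read it without treating any row as a header (a quick sketch using base R; the file name is the one used in the next section):

# Peek at the raw OTSoft-formatted file. header=FALSE keeps the two
# constraint-name rows as ordinary rows so the full layout is visible.
read.csv("amp_demo_grammar_base.csv", header = FALSE)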

Fitting a MaxEnt grammar and examining its predictions

Fitting a MaxEnt grammar in its most basic form involves finding numeric weights for each constraint that bring the model’s predictions as close as possible to the observed frequencies in the data, by maximizing the log likelihood. We won’t talk about how this fitting is done here, but see Hayes and Wilson (2008) for a nice description.

We can fit weights using the optimize_weights function. In its simplest application, optimize_weights takes a single argument: the path to the OTSoft file. The in_sep argument defines the separator used in the input file. It defaults to tabs (the separator traditionally used by OTSoft), but we’ve used commas here.

base_file <- "amp_demo_grammar_base.csv"
base_model <- optimize_weights(base_file, in_sep=',')

# Get the weights of each constraint
base_model$weights
## StarComplex         Max 
##   1.6088911   0.9898518
# Get the log likelihood assigned to the training data under these weights
base_model$loglik
## [1] -258.9787
# Get the number of free parameters (i.e. number of constraints)
base_model$k
## [1] 2
# Get the number of data points
base_model$n
## [1] 400

Evaluating model predictions

We can calculate the predictions of a fitted model on some dataset using the predict_probabilities function. Its two main arguments are a data file, which has the same format as above, and a set of constraint weights. Here we’ll use it to get the predicted frequencies for the training data, but you can also use it on new data, provided the data uses the same constraints as your trained model.

predict_probabilities(base_file, base_model$weights, in_sep=',')
## $loglik
## [1] -258.9787
## 
## $predictions
##            UR         SR Freq StarComplex Max Predicted Probability
## 1:      'stad      'stad   40           1   0                  0.35
## 2:      'stad       'tad   60           0   1                  0.65
## 3:      'grav      'grav   60           1   0                  0.35
## 4:      'grav       'gav   40           0   1                  0.65
## 5: spa.gə.'ti spa.gə.'ti   10           1   0                  0.35
## 6: spa.gə.'ti  pa.gə.'ti   90           0   1                  0.65
## 7:     gry.'o     gry.'o   30           1   0                  0.35
## 8:     gry.'o      gy.'o   70           0   1                  0.65
##    Observed Probability       Error
## 1:                  0.4 -0.05000002
## 2:                  0.6  0.05000002
## 3:                  0.6 -0.25000002
## 4:                  0.4  0.25000002
## 5:                  0.1  0.24999998
## 6:                  0.9 -0.24999998
## 7:                  0.3  0.04999998
## 8:                  0.7 -0.04999998

We can use predict_probabilities to evaluate how well our model generalizes to new data. We can also use it to evaluate the predictions our model makes about the training data, and identify data points that it is particularly bad at accounting for.

Only the three final columns are new here:

  • Predicted Probability: The predicted probability of each candidate form under the model.
  • Observed Probability: The observed probability in the data.
  • Error: The difference between predicted and observed probability. This can be useful for identifying cases that the model does particularly badly on.
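
For larger datasets it can be handy to sort the predictions by absolute error to spot the worst-fitting candidates. A minimal sketch, assuming the output format shown above (the predictions table with its Error column):

# Store the predictions table and order its rows by absolute error,
# largest errors first
base_preds <- predict_probabilities(base_file, base_model$weights, in_sep=',')$predictions
base_preds[order(-abs(base_preds$Error)), ]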

Above we can see that the model assigns the same pair of probabilities (0.35 for the faithful candidate, 0.65 for the reduced one) to every input, without considering stress or sonority. This suggests our model might do better if we add some additional constraints.
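
As a sanity check, we can reproduce these predicted probabilities directly from the fitted weights using the formula from the introduction (a quick sketch, not part of the package):

# Every faithful candidate violates only StarComplex and every reduced
# candidate violates only Max, which is why the same two probabilities recur
w <- base_model$weights
harmony_faithful <- exp(-w["StarComplex"])
harmony_reduced <- exp(-w["Max"])
z <- harmony_faithful + harmony_reduced
harmony_faithful / z  # ~0.35
harmony_reduced / z   # ~0.65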

Adding in Max-Stressed

Let’s try adding a new constraint that specifically penalizes deleting segments from stressed syllables.

  • Max-Stressed: Don’t delete segments from stressed syllables.

Our updated tableau looks like this:

Let’s fit a new model to it:

stressed_file <- "amp_demo_grammar_stressed.csv"
stressed_model <- optimize_weights(stressed_file, in_sep=',')

# Get the weights of each constraint
stressed_model$weights
##    *Complex         Max MaxStressed 
##   2.3697155   0.9834209   1.3862945
# Get the log likelihood assigned to the training data under these weights
stressed_model$loglik
## [1] -238.7099
# Get the number of free parameters (i.e. number of constraints)
stressed_model$k
## [1] 3
# Get the number of data points
stressed_model$n
## [1] 400

And now let’s look at the predictions it makes:

predict_probabilities(stressed_file, stressed_model$weights, in_sep=',')
## $loglik
## [1] -238.7099
## 
## $predictions
##            UR         SR Freq *Complex Max MaxStressed Predicted Probability
## 1:      'stad      'stad   40        1   0           0                   0.5
## 2:      'stad       'tad   60        0   1           1                   0.5
## 3:      'grav      'grav   60        1   0           0                   0.5
## 4:      'grav       'gav   40        0   1           1                   0.5
## 5: spa.gə.'ti spa.gə.'ti   10        1   0           0                   0.2
## 6: spa.gə.'ti  pa.gə.'ti   90        0   1           0                   0.8
## 7:     gry.'o     gry.'o   30        1   0           0                   0.2
## 8:     gry.'o      gy.'o   70        0   1           0                   0.8
##    Observed Probability       Error
## 1:                  0.4  0.09999998
## 2:                  0.6 -0.09999998
## 3:                  0.6 -0.10000002
## 4:                  0.4  0.10000002
## 5:                  0.1  0.09999997
## 6:                  0.9 -0.09999997
## 7:                  0.3 -0.10000003
## 8:                  0.7  0.10000003

This model does a better job of differentiating deletion rates in stressed vs. unstressed syllables, but it still doesn’t capture the effects of sonority. Let’s add one more constraint.

Adding in SSP

The last constraint we’ll add penalizes onsets that violate the Sonority Sequencing Principle: in this case, the [s]-stop onsets [st] and [sp].

  • SSP: Don’t violate the SSP.

Our final tableau looks like this:

Let’s fit a new model to it:

full_file <- "amp_demo_grammar_full.csv"
full_model <- optimize_weights(full_file, in_sep=',')

# Get the weights of each constraint
full_model$weights
##    *Complex         Max MaxStressed         SSP 
##   1.8104996   0.8510383   1.4618151   1.0047286
# Get the log likelihood assigned to the training data under these weights
full_model$loglik
## [1] -228.811
# Get the number of free parameters (i.e. number of constraints)
full_model$k
## [1] 4
# Get the number of data points
full_model$n
## [1] 400

And now let’s look at the predictions it makes:

predict_probabilities(full_file, full_model$weights, in_sep=',')
## $loglik
## [1] -228.811
## 
## $predictions
##            UR         SR Freq *Complex Max MaxStressed SSP
## 1:      'stad      'stad   40        1   0           0   1
## 2:      'stad       'tad   60        0   1           1   0
## 3:      'grav      'grav   60        1   0           0   0
## 4:      'grav       'gav   40        0   1           1   0
## 5: spa.gə.'ti spa.gə.'ti   10        1   0           0   1
## 6: spa.gə.'ti  pa.gə.'ti   90        0   1           0   0
## 7:     gry.'o     gry.'o   30        1   0           0   0
## 8:     gry.'o      gy.'o   70        0   1           0   0
##    Predicted Probability Observed Probability       Error
## 1:             0.3769828                  0.4 -0.02301723
## 2:             0.6230172                  0.6  0.02301723
## 3:             0.6230123                  0.6  0.02301233
## 4:             0.3769877                  0.4 -0.02301233
## 5:             0.1230143                  0.1  0.02301433
## 6:             0.8769857                  0.9 -0.02301433
## 7:             0.2769861                  0.3 -0.02301393
## 8:             0.7230139                  0.7  0.02301393

Now we’re doing pretty well! The model’s predicted frequencies capture the qualitative behavior of the observed data quite closely.

Model comparison

Which of the models we’ve fit above is the ‘best’ model? We’ve qualitatively decided that the full model best reflects the patterns in the data, but how can we quantify this?

One approach is to simply compare the log likelihood scores the models assign to the data.

base_model$loglik
## [1] -258.9787
stressed_model$loglik
## [1] -238.7099
full_model$loglik
## [1] -228.811

Higher log likelihoods correspond to greater probability, so the full model is best in the sense that it most accurately predicts the patterns in the data. However, just looking at the log likelihood doesn’t take into account the complexity of each model: the base model has two constraints, the stressed model has three, and the full model has four. Adding more parameters that can be fitted to the data will generally improve the log likelihood. We want to know whether each new parameter gets us enough ‘bang for our buck’: is the increase in log likelihood worth the extra complexity of the model?

maxent.ot implements several commonly used model comparison metrics. We won’t define these in detail here, but will simply show how they can be deployed.

Let’s compare our models using the BIC.

compare_models(base_model, stressed_model, full_model, method='bic')
##                       model k   n      bic bic.delta       bic.wt    cum.wt
## 1     amp_demo_grammar_full 4 400 481.5879   0.00000 9.989964e-01 0.9989964
## 2 amp_demo_grammar_stressed 3 400 495.3942  13.80633 1.003595e-03 1.0000000
## 3     amp_demo_grammar_base 2 400 529.9402  48.35233 3.162196e-11 1.0000000
##          ll
## 1 -228.8110
## 2 -238.7099
## 3 -258.9787

Lower values of BIC indicate greater preference for a model, and differences of > 4 or so are generally considered meaningful (see, e.g., Burnham & Anderson, 2002). These results tell us that the extra complexity of the two constraints we added is justified by the better fit to the data.
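
As a sanity check, the bic column above can be reproduced by hand from quantities we’ve already seen, using the standard definition \(\text{BIC} = -2 LL_w(D) + k \ln(n)\):

# Reproduce the bic column: -2 * log likelihood + (number of constraints) * log(number of data points)
-2 * full_model$loglik + full_model$k * log(full_model$n)
-2 * stressed_model$loglik + stressed_model$k * log(stressed_model$n)
-2 * base_model$loglik + base_model$k * log(base_model$n)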

Let’s look at a case where the extra complexity of an added constraint doesn’t buy us enough to be worth it. We’ll add a new, ad hoc constraint (DoTheRightThing in the tableau below) that penalizes exactly those candidates the full model over-predicts, to coax our predicted probabilities in the right direction.

Our new tableau is shown below:

Let’s fit a model to this data and evaluate its predictions.

overfit_file <- "amp_demo_grammar_overfit.csv"
overfit_model <- optimize_weights(overfit_file, in_sep=',')
predict_probabilities(overfit_file, overfit_model$weights, in_sep=',')
## $loglik
## [1] -228.1971
## 
## $predictions
##            UR         SR Freq *Complex Max MaxStressed SSP DoTheRightThing
## 1:      'stad      'stad   40        1   0           0   1               0
## 2:      'stad       'tad   60        0   1           1   0               1
## 3:      'grav      'grav   60        1   0           0   0               1
## 4:      'grav       'gav   40        0   1           1   0               0
## 5: spa.gə.'ti spa.gə.'ti   10        1   0           0   1               1
## 6: spa.gə.'ti  pa.gə.'ti   90        0   1           0   0               0
## 7:     gry.'o     gry.'o   30        1   0           0   0               0
## 8:     gry.'o      gy.'o   70        0   1           0   0               1
##    Predicted Probability Observed Probability         Error
## 1:             0.3999982                  0.4 -1.838151e-06
## 2:             0.6000018                  0.6  1.838151e-06
## 3:             0.6000000                  0.6  6.001617e-09
## 4:             0.4000000                  0.4 -6.001617e-09
## 5:             0.1000001                  0.1  1.120851e-07
## 6:             0.8999999                  0.9 -1.120851e-07
## 7:             0.2999981                  0.3 -1.860377e-06
## 8:             0.7000019                  0.7  1.860377e-06

Our predictions are almost identical to the observed frequencies in the data! But at what cost? Let’s add the new model to our comparison from above.

compare_models(base_model, stressed_model, full_model, overfit_model, method='bic')
##                       model k   n      bic bic.delta       bic.wt    cum.wt
## 1     amp_demo_grammar_full 4 400 481.5879  0.000000 9.145853e-01 0.9145853
## 2  amp_demo_grammar_overfit 5 400 486.3514  4.763534 8.449594e-02 0.9990812
## 3 amp_demo_grammar_stressed 3 400 495.3942 13.806325 9.187953e-04 1.0000000
## 4     amp_demo_grammar_base 2 400 529.9402 48.352330 2.895004e-11 1.0000000
##          ll
## 1 -228.8110
## 2 -228.1971
## 3 -238.7099
## 4 -258.9787

The new model has a slightly better log likelihood, but the BIC still favors the simpler full model: the small improvement in fit isn’t worth the cost of the extra constraint.

Adding in bias terms

The models above were all fit using a procedure called Maximum Likelihood Estimation (MLE). This means that the only consideration when fitting the weights is to maximize the likelihood of the data. An alternative way to fit a model to a dataset is called Maximum A Posteriori Estimation (MAP). MAP estimates in the context of a MaxEnt OT model incorporate some belief about the weights prior to seeing any data. This can serve two purposes:

  • As regularization to prevent overfitting. Overfitting means that a model does well in predicting the data it was fit to, but generalizes poorly to new data.
  • To encode learning biases: previous work has used MAP estimates to encode phonological learning biases (e.g., Wilson, 2006; White, 2013).

maxent.ot allows you to specify a normal prior on the weight of each constraint. This means that rather than optimizing the log likelihood, we optimize the following objective function:

\[J_w(D) = LL_w(D) - \sum_{k=1}^{m}{\frac{(w_k - \mu_k)^2}{2\sigma_k^2}}\]

where \(m\) is the total number of constraints, \(\mu_k\) is the expected value of the \(k^{th}\) constraint weight prior to seeing any data, and \(\sigma_k\) reflects the prior certainty in the weight of the \(k^{th}\) constraint: smaller values of \(\sigma_k\) require more data to shift the learned weight away from the expected value.
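
To make the effect of the prior concrete, here is a minimal sketch (not part of the package API) that computes this penalized objective for a given set of weights, assuming a single \(\mu\) and \(\sigma\) shared by all constraints; the defaults here mirror the values we’ll pass to optimize_weights below:

# Toy illustration of the MAP objective: the log likelihood minus a
# Gaussian penalty on each constraint weight
map_objective <- function(loglik, weights, mu = 0, sigma = 0.5) {
  penalty <- sum((weights - mu)^2 / (2 * sigma^2))
  loglik - penalty
}

# For example, applied to the weights and log likelihood of the MLE full model
map_objective(full_model$loglik, full_model$weights)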

Let’s suppose we get a new set of observations from our French child.

# Display the new tableau (this uses the flextable package for formatting)
library(flextable)
tableau <- flextable(read.csv("amp_demo_grammar_new.csv", header=FALSE))
tableau <- delete_part(tableau, part='header')
border_remove(tableau)

We can train our model again, this time placing the same normal prior on the weight of every constraint: mu_scalar and sigma_scalar set a single \(\mu\) and \(\sigma\) shared by all constraints.

full_model_map <- optimize_weights(full_file, in_sep=',', mu_scalar=0, sigma_scalar=0.5)

Notice that the log likelihood of the training data is worse for the MAP model, since its weights are pulled toward the prior mean rather than being fit to the data alone.

full_model$loglik
## [1] -228.811
full_model_map$loglik
## [1] -230.0765

Let’s compare the predictions of each model on the new data.

new_file <- "amp_demo_grammar_new.csv"
predict_probabilities(new_file, full_model$weights, in_sep=',')
## $loglik
## [1] -258.2738
## 
## $predictions
##          UR       SR Freq *Complex Max MaxStressed SSP Predicted Probability
## 1:    'staʒ    'staʒ   43        1   0           0   1             0.3769828
## 2:    'staʒ     'taʒ   57        0   1           1   0             0.6230172
## 3:    'krɛp    'krep   53        1   0           0   0             0.6230123
## 4:    'krɛp     'kɛp   47        0   1           1   0             0.3769877
## 5:  sti.'lo  sti.'lo   20        1   0           0   1             0.1230143
## 6:  sti.'lo   ti.'lo   80        0   1           0   0             0.8769857
## 7: krɔ.'kɛt kro.'kɛt   35        1   0           0   0             0.2769861
## 8: krɔ.'kɛt  kɔ.'kɛt   65        0   1           0   0             0.7230139
##    Observed Probability       Error
## 1:                 0.43 -0.05301723
## 2:                 0.57  0.05301723
## 3:                 0.53  0.09301233
## 4:                 0.47 -0.09301233
## 5:                 0.20 -0.07698567
## 6:                 0.80  0.07698567
## 7:                 0.35 -0.07301393
## 8:                 0.65  0.07301393
predict_probabilities(new_file, full_model_map$weights, in_sep=',')
## $loglik
## [1] -254.3244
## 
## $predictions
##          UR       SR Freq *Complex Max MaxStressed SSP Predicted Probability
## 1:    'staʒ    'staʒ   43        1   0           0   1             0.3721018
## 2:    'staʒ     'taʒ   57        0   1           1   0             0.6278982
## 3:    'krɛp    'krep   53        1   0           0   0             0.5831654
## 4:    'krɛp     'kɛp   47        0   1           1   0             0.4168346
## 5:  sti.'lo  sti.'lo   20        1   0           0   1             0.1622585
## 6:  sti.'lo   ti.'lo   80        0   1           0   0             0.8377415
## 7: krɔ.'kɛt kro.'kɛt   35        1   0           0   0             0.3137755
## 8: krɔ.'kɛt  kɔ.'kɛt   65        0   1           0   0             0.6862245
##    Observed Probability       Error
## 1:                 0.43 -0.05789816
## 2:                 0.57  0.05789816
## 3:                 0.53  0.05316539
## 4:                 0.47 -0.05316539
## 5:                 0.20 -0.03774152
## 6:                 0.80  0.03774152
## 7:                 0.35 -0.03622455
## 8:                 0.65  0.03622455

Although the MAP model does a poorer job of fitting the training data, its predictions about the new data are closer to what is observed (compare the log likelihoods and Error columns above): the prior acts as regularization, reducing overfitting to the training data.
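
One quick way to quantify this is to compare the two models’ average absolute error on the new data (a minimal sketch using the Error column shown above):

# Mean absolute prediction error on the new data for the MLE and MAP models
mle_preds <- predict_probabilities(new_file, full_model$weights, in_sep=',')$predictions
map_preds <- predict_probabilities(new_file, full_model_map$weights, in_sep=',')$predictions
mean(abs(mle_preds$Error))
mean(abs(map_preds$Error))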

Additional functionality not covered here

maxent.ot has other functionality not covered here, including:

  • Saving model predictions to an output file
  • Specifying different bias parameters for individual constraints
  • Reading bias parameters from a file rather than specifying them in code
  • Changing parameters of the optimizer
  • Setting a temperature parameter when predicting new data (e.g., Hayes et al., 2009)

Planned functionality

We are in the process of implementing some additional functionality in maxent.ot, including:

  • Cross-validation of different bias parameters
  • Reading input from data frames rather than OTSoft-formatted files