This presentation provides a brief tutorial on the
maxent.ot
R package we’re currently developing. We will
show you how to format input data, fit MaxEnt OT grammars, generate predicted probabilities, and compare competing models.
Maximum Entropy Optimality Theory (henceforth MaxEnt OT; Goldwater & Johnson 2003) is a variant of Optimality Theory (Prince & Smolensky 1993/2004) that uses numeric constraint weights rather than constraint rankings (Pater 2009). Given a set of constraint weights and violation profiles for each output candidate, MaxEnt OT calculates probability distributions over candidates for a given input. This has made it a popular tool for modeling variability in phonological patterns at either the individual or population level (e.g., Hayes & Wilson 2008, Moore-Cantwell & Pater 2016, White 2017, Zuraw & Hayes 2017, Mayer 2021, a.o.).
MaxEnt OT calculates the probability of an output candidate \(y\) given an underlying form \(x\) as:
\[P(y|x; w) = \frac{1}{Z_w(x)}\exp(-\sum_{k=1}^{m}{w_k f_k(y, x)})\] where \(f_k(y, x)\) is the number of violations of constraint \(k\) incurred by mapping underlying \(x\) to surface \(y\), \(w_k\) is the weight of constraint \(k\), and \(m\) is the number of constraints. \(Z_w(x)\) is a normalization term defined as
\[Z_w(x) = \sum_{y\in\mathcal{Y}(x)}{\exp(-\sum_{k=1}^{m}{w_k f_k(y, x)})}\] where \(\mathcal{Y}(x)\) is the set of candidate surface realizations of input \(x\).
The log likelihood of a dataset \(D\) given a set of constraint weights \(w\) is calculated as
\[LL_w(D) = \sum_{i=1}^{n}{\ln P(y_i|x_i; w)}\]
where \(n\) is the number of data points.
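To make these formulas concrete, here is a minimal by-hand sketch in R (not using maxent.ot) for a single input with two output candidates; the constraint names, weights, violation counts, and frequencies are all made up for illustration.
# Hypothetical weights for two constraints
w <- c(C1 = 2, C2 = 1)
# Violation counts f_k(y, x) for two candidates of the same input
viols <- rbind(cand1 = c(C1 = 1, C2 = 0),
               cand2 = c(C1 = 0, C2 = 1))
# Exponentiated negative weighted violation sums; their total is Z_w(x)
numerators <- exp(-viols %*% w)
probs <- numerators / sum(numerators)   # P(y | x; w) for each candidate
probs
# With token frequencies, LL_w(D) is the frequency-weighted sum of log probabilities
freqs <- c(cand1 = 40, cand2 = 60)
sum(freqs * log(probs))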
Given a data set consisting of input-output pairs and their corresponding violation profiles, it is straightforward to learn the optimal constraint weights using maximum likelihood estimation (MLE) or maximum a posteriori estimation (MAP; see Hayes & Wilson 2008). Because probabilities can be calculated for each observed output, different models fit to the same data (that is, models with different constraint weights or sets of constraints) can be compared numerically.
In this presentation we will walk you through a MaxEnt OT analysis of a simple, fabricated dataset that represents a French-speaking child’s acquisition of onset consonant clusters, based loosely on Rose (2002). For this imaginary child, cluster reduction is variable: it is more frequent in unstressed syllables than in stressed ones, and more frequent for clusters that violate sonority sequencing (e.g., [st], [sp]) than for those that do not.
We plot the overall pattern below:
The first step is to load the maxent.ot library. This needs to be installed from GitHub since it’s not (yet) an official CRAN package.
# Install devtools if needed, then use it to install maxent.ot from GitHub
if (!require(devtools)) {
  install.packages("devtools", repos = "http://cran.us.r-project.org")
}
if (!require(maxent.ot)) {
  devtools::install_github("connormayer/maxent.ot")
}
library(maxent.ot)
We’ll start with a tableau that contains two simple constraints. These constraints obviously can’t capture the effects of stress and sonority, but they will let us look at the input data format and set the stage for some model comparison.
Tableaux must be represented in OTSoft format (https://linguistics.ucla.edu/people/hayes/otsoft/). Here’s what this looks like. It doesn’t matter which separator is used (commas, tabs, etc.).
|            |            |    | StarComplex | Max |
|            |            |    | StarComplex | Max |
| 'stad      | 'stad      | 40 | 1           |     |
|            | 'tad       | 60 |             | 1   |
| 'grav      | 'grav      | 60 | 1           |     |
|            | 'gav       | 40 |             | 1   |
| spa.gə.'ti | spa.gə.'ti | 10 | 1           |     |
|            | pa.gə.'ti  | 90 |             | 1   |
| gry.'o     | gry.'o     | 30 | 1           |     |
|            | gy.'o      | 70 |             | 1   |
The first two rows have labels for the constraints.
The remaining rows correspond to UR-SR pairs.
The first three columns are unlabeled: they contain the underlying form (UR), the candidate surface form (SR), and the observed frequency of that candidate.
The remaining columns correspond to the individual constraints and their violations by each candidate.
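To make the file format concrete, here is a sketch of how a tableau like the one above could be written out as a comma-separated OTSoft file. The file name is hypothetical, and the exact placement of blank cells is our assumption based on the description above; the real demo files may differ in minor details.
# Write a hypothetical tableau file in OTSoft format (assumed layout)
writeLines(
  c(",,,StarComplex,Max",            # row 1: constraint names (first three cells blank)
    ",,,StarComplex,Max",            # row 2: constraint labels repeated
    "'stad,'stad,40,1,",             # UR, SR, frequency, then violation counts
    ",'tad,60,,1",                   # UR left blank for later candidates of the same input
    "'grav,'grav,60,1,",
    ",'gav,40,,1",
    "spa.gə.'ti,spa.gə.'ti,10,1,",
    ",pa.gə.'ti,90,,1",
    "gry.'o,gry.'o,30,1,",
    ",gy.'o,70,,1"),
  "my_demo_tableau.csv"
)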
Fitting a MaxEnt grammar in its most basic form involves finding numeric weights for each constraint that bring the model’s predicted probabilities as close as possible to the observed frequencies in the data, by maximizing the log likelihood. We won’t talk about how this fitting is done, but see Hayes and Wilson (2008) for a nice description.
We can fit weights using the optimize_weights
function.
In its simplest application, optimize_weights
takes a
single argument, which is the path to the OTSoft file. The
in_sep
argument defines the separator used in the input
file. This defaults to tabs (the separator traditionally used in OTSoft
files), but we’ve used commas here.
base_file <- "amp_demo_grammar_base.csv"
base_model <- optimize_weights(base_file, in_sep=',')
# Get the weights of each constraint
base_model$weights
## StarComplex Max
## 1.6088911 0.9898518
# Get the log likelihood assigned to the training data under these weights
base_model$loglik
## [1] -258.9787
# Get the number of free parameters (i.e. number of constraints)
base_model$k
## [1] 2
# Get the number of data points
base_model$n
## [1] 400
We can calculate the predictions of a fitted model on some data set
using the predict_probabilities
function. This takes two
arguments: the data file, which has the same format as above, and a set
of weights. Here we’ll use this to get the predicted frequencies for the
training data set, but you can also use this on new data provided it
uses the same constraints as your trained model.
predict_probabilities(base_file, base_model$weights, in_sep=',')
## $loglik
## [1] -258.9787
##
## $predictions
## UR SR Freq StarComplex Max Predicted Probability
## 1: 'stad 'stad 40 1 0 0.35
## 2: 'stad 'tad 60 0 1 0.65
## 3: 'grav 'grav 60 1 0 0.35
## 4: 'grav 'gav 40 0 1 0.65
## 5: spa.gə.'ti spa.gə.'ti 10 1 0 0.35
## 6: spa.gə.'ti pa.gə.'ti 90 0 1 0.65
## 7: gry.'o gry.'o 30 1 0 0.35
## 8: gry.'o gy.'o 70 0 1 0.65
## Observed Probability Error
## 1: 0.4 -0.05000002
## 2: 0.6 0.05000002
## 3: 0.6 -0.25000002
## 4: 0.4 0.25000002
## 5: 0.1 0.24999998
## 6: 0.9 -0.24999998
## 7: 0.3 0.04999998
## 8: 0.7 -0.04999998
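We can also tie these numbers back to the formulas at the start of the tutorial. With only these two constraints, every faithful candidate receives the same predicted probability, and the reported log likelihood is just the frequency-weighted sum of log predicted probabilities. A quick by-hand check (not part of the package):
w <- base_model$weights
# Probability of the faithful candidate under the fitted weights (the 0.35 above)
p_faithful <- exp(-w["StarComplex"]) / (exp(-w["StarComplex"]) + exp(-w["Max"]))
p_faithful
# 140 faithful tokens (40 + 60 + 10 + 30) and 260 reduced tokens in the data;
# this reproduces base_model$loglik
140 * log(p_faithful) + 260 * log(1 - p_faithful)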
We can use predict_probabilities
to evaluate how well
our model generalizes to new data. We can also use it to evaluate the
predictions our model makes about the training data, and identify data
points that it is particularly bad at accounting for.
Only the three final columns are new here: Predicted Probability is the probability the model assigns to each candidate, Observed Probability is that candidate’s relative frequency in the data, and Error is the difference between the predicted and observed probabilities.
Above we can see that the model assigns the same probabilities to every input, ignoring stress and sonority. This suggests our model might do better if we add some additional constraints.
Let’s try adding a new constraint that specifically penalizes deleting segments from stressed syllables.
Our updated tableau looks like this:
|            |            |    | *Complex | Max | MaxStressed |
|            |            |    | *Complex | Max | MaxStressed |
| 'stad      | 'stad      | 40 | 1        |     |             |
|            | 'tad       | 60 |          | 1   | 1           |
| 'grav      | 'grav      | 60 | 1        |     |             |
|            | 'gav       | 40 |          | 1   | 1           |
| spa.gə.'ti | spa.gə.'ti | 10 | 1        |     |             |
|            | pa.gə.'ti  | 90 |          | 1   |             |
| gry.'o     | gry.'o     | 30 | 1        |     |             |
|            | gy.'o      | 70 |          | 1   |             |
Let’s fit a new model to it:
stressed_file <- "amp_demo_grammar_stressed.csv"
stressed_model <- optimize_weights(stressed_file, in_sep=',')
# Get the weights of each constraint
stressed_model$weights
## *Complex Max MaxStressed
## 2.3697155 0.9834209 1.3862945
# Get the log likelihood assigned to the training data under these weights
stressed_model$loglik
## [1] -238.7099
# Get the number of free parameters (i.e. number of constraints)
stressed_model$k
## [1] 3
# Get the number of data points
stressed_model$n
## [1] 400
And now let’s look at the predictions it makes:
predict_probabilities(stressed_file, stressed_model$weights, in_sep=',')
## $loglik
## [1] -238.7099
##
## $predictions
## UR SR Freq *Complex Max MaxStressed Predicted Probability
## 1: 'stad 'stad 40 1 0 0 0.5
## 2: 'stad 'tad 60 0 1 1 0.5
## 3: 'grav 'grav 60 1 0 0 0.5
## 4: 'grav 'gav 40 0 1 1 0.5
## 5: spa.gə.'ti spa.gə.'ti 10 1 0 0 0.2
## 6: spa.gə.'ti pa.gə.'ti 90 0 1 0 0.8
## 7: gry.'o gry.'o 30 1 0 0 0.2
## 8: gry.'o gy.'o 70 0 1 0 0.8
## Observed Probability Error
## 1: 0.4 0.09999998
## 2: 0.6 -0.09999998
## 3: 0.6 -0.10000002
## 4: 0.4 0.10000002
## 5: 0.1 0.09999997
## 6: 0.9 -0.09999997
## 7: 0.3 -0.10000003
## 8: 0.7 0.10000003
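Why exactly 0.5 for the stressed items? Under these weights, the penalty for keeping the cluster (*Complex, about 2.37) equals the summed penalty for deleting from a stressed syllable (Max + MaxStressed, about 0.98 + 1.39), so the two candidates split the probability evenly. A quick by-hand check (not part of the package):
w <- stressed_model$weights
keep_cluster <- w["*Complex"]                  # penalty for retaining the cluster
reduce_cluster <- w["Max"] + w["MaxStressed"]  # penalty for reducing it in a stressed syllable
exp(-keep_cluster) / (exp(-keep_cluster) + exp(-reduce_cluster))  # = 0.5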
This model does a better job of differentiating between frequency of deletion in stressed vs. unstressed syllables, but it doesn’t get us the effects of sonority. Let’s add one more constraint.
The last constraint we’ll add penalizes onsets that violate the Sonority Sequencing Principle: in this case, the [st] and [sp] onsets.
Our final tableau looks like this:
|            |            |    | *Complex | Max | MaxStressed | SSP |
|            |            |    | *Complex | Max | MaxStressed | SSP |
| 'stad      | 'stad      | 40 | 1        |     |             | 1   |
|            | 'tad       | 60 |          | 1   | 1           |     |
| 'grav      | 'grav      | 60 | 1        |     |             |     |
|            | 'gav       | 40 |          | 1   | 1           |     |
| spa.gə.'ti | spa.gə.'ti | 10 | 1        |     |             | 1   |
|            | pa.gə.'ti  | 90 |          | 1   |             |     |
| gry.'o     | gry.'o     | 30 | 1        |     |             |     |
|            | gy.'o      | 70 |          | 1   |             |     |
Let’s fit a new model to it:
full_file <- "amp_demo_grammar_full.csv"
full_model <- optimize_weights(full_file, in_sep=',')
# Get the weights of each constraint
full_model$weights
## *Complex Max MaxStressed SSP
## 1.8104996 0.8510383 1.4618151 1.0047286
# Get the log likelihood assigned to the training data under these weights
full_model$loglik
## [1] -228.811
# Get the number of free parameters (i.e. number of constraints)
full_model$k
## [1] 4
# Get the number of data points
full_model$n
## [1] 400
And now let’s look at the predictions it makes:
predict_probabilities(full_file, full_model$weights, in_sep=',')
## $loglik
## [1] -228.811
##
## $predictions
## UR SR Freq *Complex Max MaxStressed SSP
## 1: 'stad 'stad 40 1 0 0 1
## 2: 'stad 'tad 60 0 1 1 0
## 3: 'grav 'grav 60 1 0 0 0
## 4: 'grav 'gav 40 0 1 1 0
## 5: spa.gə.'ti spa.gə.'ti 10 1 0 0 1
## 6: spa.gə.'ti pa.gə.'ti 90 0 1 0 0
## 7: gry.'o gry.'o 30 1 0 0 0
## 8: gry.'o gy.'o 70 0 1 0 0
## Predicted Probability Observed Probability Error
## 1: 0.3769828 0.4 -0.02301723
## 2: 0.6230172 0.6 0.02301723
## 3: 0.6230123 0.6 0.02301233
## 4: 0.3769877 0.4 -0.02301233
## 5: 0.1230143 0.1 0.02301433
## 6: 0.8769857 0.9 -0.02301433
## 7: 0.2769861 0.3 -0.02301393
## 8: 0.7230139 0.7 0.02301393
Now we’re doing pretty well! The model’s predicted frequencies capture the qualitative behavior of the observed data quite closely.
Which of the models we’ve fit above is the ‘best’ model? We’ve qualitatively decided that the full model best reflects the patterns in the data, but how can we quantify this?
One approach is to simply compare the log likelihood scores the models assign to the data.
base_model$loglik
## [1] -258.9787
stressed_model$loglik
## [1] -238.7099
full_model$loglik
## [1] -228.811
Higher log likelihoods correspond to greater probability, so the full model is best in the sense that it most accurately predicts the patterns in the data. However, just looking at the log likelihood doesn’t take into account the complexity of each model: the base model has two constraints, the stress model has three, and the full model has four. Adding more parameters that can be fitted to the data will generally improve the log likelihood, no matter how little each parameter contributes. We want to know whether each new parameter gets us enough ‘bang for our buck’: is the increase in log likelihood worth the extra complexity of the model?
maxent.ot
implements several commonly used model
comparisons. We won’t define these in detail here, but simply show how
they can be deployed in maxent.ot
.
Let’s compare our models using the BIC.
compare_models(base_model, stressed_model, full_model, method='bic')
## model k n bic bic.delta bic.wt cum.wt
## 1 amp_demo_grammar_full 4 400 481.5879 0.00000 9.989964e-01 0.9989964
## 2 amp_demo_grammar_stressed 3 400 495.3942 13.80633 1.003595e-03 1.0000000
## 3 amp_demo_grammar_base 2 400 529.9402 48.35233 3.162196e-11 1.0000000
## ll
## 1 -228.8110
## 2 -238.7099
## 3 -258.9787
Lower values of BIC indicate greater preference for a model, and differences of > 4 or so are considered to be meaningful (see, e.g., Burnham & Anderson, 2002). These results tell us that the extra complexity of the two constraints we added is acceptable because of the better fit to the data.
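The BIC values in this table correspond to the standard formula \(k\ln(n) - 2LL\). As a quick sanity check (not a maxent.ot function), we can reproduce the full model’s BIC by hand:
# BIC = k * log(n) - 2 * log likelihood; reproduces the 481.59 reported above
full_model$k * log(full_model$n) - 2 * full_model$loglik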
Let’s look at a case where the extra complexity of an added constraint doesn’t buy us enough to be worth it. We’ll add a new constraint to try to coax our predicted probabilities in the right directions.
The expanded tableau, with the new DoTheRightThing constraint, is shown below:
|            |            |    | *Complex | Max | MaxStressed | SSP | DoTheRightThing |
|            |            |    | *Complex | Max | MaxStressed | SSP | DoTheRightThing |
| 'stad      | 'stad      | 40 | 1        |     |             | 1   |                 |
|            | 'tad       | 60 |          | 1   | 1           |     | 1               |
| 'grav      | 'grav      | 60 | 1        |     |             |     | 1               |
|            | 'gav       | 40 |          | 1   | 1           |     |                 |
| spa.gə.'ti | spa.gə.'ti | 10 | 1        |     |             | 1   | 1               |
|            | pa.gə.'ti  | 90 |          | 1   |             |     |                 |
| gry.'o     | gry.'o     | 30 | 1        |     |             |     |                 |
|            | gy.'o      | 70 |          | 1   |             |     | 1               |
Let’s fit a model to this data and evaluate its predictions.
overfit_file <- "amp_demo_grammar_overfit.csv"
overfit_model <- optimize_weights(overfit_file, in_sep=',')
predict_probabilities(overfit_file, overfit_model$weights, in_sep=',')
## $loglik
## [1] -228.1971
##
## $predictions
## UR SR Freq *Complex Max MaxStressed SSP DoTheRightThing
## 1: 'stad 'stad 40 1 0 0 1 0
## 2: 'stad 'tad 60 0 1 1 0 1
## 3: 'grav 'grav 60 1 0 0 0 1
## 4: 'grav 'gav 40 0 1 1 0 0
## 5: spa.gə.'ti spa.gə.'ti 10 1 0 0 1 1
## 6: spa.gə.'ti pa.gə.'ti 90 0 1 0 0 0
## 7: gry.'o gry.'o 30 1 0 0 0 0
## 8: gry.'o gy.'o 70 0 1 0 0 1
## Predicted Probability Observed Probability Error
## 1: 0.3999982 0.4 -1.838151e-06
## 2: 0.6000018 0.6 1.838151e-06
## 3: 0.6000000 0.6 6.001617e-09
## 4: 0.4000000 0.4 -6.001617e-09
## 5: 0.1000001 0.1 1.120851e-07
## 6: 0.8999999 0.9 -1.120851e-07
## 7: 0.2999981 0.3 -1.860377e-06
## 8: 0.7000019 0.7 1.860377e-06
Our predictions are almost identical to the observed frequencies in the data! But at what cost? Let’s add the new model to our comparison from above.
compare_models(base_model, stressed_model, full_model, overfit_model, method='bic')
## model k n bic bic.delta bic.wt cum.wt
## 1 amp_demo_grammar_full 4 400 481.5879 0.000000 9.145853e-01 0.9145853
## 2 amp_demo_grammar_overfit 5 400 486.3514 4.763534 8.449594e-02 0.9990812
## 3 amp_demo_grammar_stressed 3 400 495.3942 13.806325 9.187953e-04 1.0000000
## 4 amp_demo_grammar_base 2 400 529.9402 48.352330 2.895004e-11 1.0000000
## ll
## 1 -228.8110
## 2 -228.1971
## 3 -238.7099
## 4 -258.9787
The new model has a slightly better log likelihood, but the BIC still favors the simpler four-constraint model: the small improvement in fit doesn’t justify the extra constraint (the BIC difference is about 4.8).
The models above were all fit using a procedure called Maximum Likelihood Estimation (MLE). This means that the only consideration when fitting the weights is to maximize the likelihood of the data. An alternative way to fit a model to a dataset is called Maximum A Posteriori Estimation (MAP). MAP estimates in the context of a MaxEnt OT model incorporate some belief about the weights prior to seeing any data. This can serve two purposes: it acts as a form of regularization that discourages overfitting to the training data (often improving generalization to new data), and it allows prior expectations about particular constraint weights to be built into the model.
maxent.ot
allows you to specify a normal prior on the
weight of each constraint. This means that rather than optimizing the
log likelihood, we optimize the following objective function:
\[J_w(D) = LL_w(D) - \sum_{k=1}^{m}{\frac{(w_k - \mu_k)^2}{2\sigma_k^2}}\]
where \(m\) is the total number of constraints, \(\mu_k\) is the expected value of the \(k^{th}\) constraint weight prior to seeing any data, and \(\sigma_k\) reflects the prior certainty in the weight of the \(k^{th}\) constraint: smaller values of \(\sigma_k\) require more data to shift the learned weight away from the expected value.
Let’s suppose we get a new set of observations from our French child.
|            |            |    | *Complex | Max | MaxStressed | SSP |
|            |            |    | *Complex | Max | MaxStressed | SSP |
| 'staʒ      | 'staʒ      | 43 | 1        |     |             | 1   |
|            | 'taʒ       | 57 |          | 1   | 1           |     |
| 'krɛp      | 'krep      | 53 | 1        |     |             |     |
|            | 'kɛp       | 47 |          | 1   | 1           |     |
| sti.'lo    | sti.'lo    | 20 | 1        |     |             | 1   |
|            | ti.'lo     | 80 |          | 1   |             |     |
| krɔ.'kɛt   | kro.'kɛt   | 35 | 1        |     |             |     |
|            | kɔ.'kɛt    | 65 |          | 1   |             |     |
We can train our model again, this time placing the same normal prior on the weight of every constraint: a mean of 0 and a standard deviation of 0.5, set via the mu_scalar and sigma_scalar arguments.
full_model_map <- optimize_weights(full_file, in_sep=',', mu_scalar=0, sigma_scalar=0.5)
Notice that the log likelihood of the training data is worse for the MAP model.
full_model$loglik
## [1] -228.811
full_model_map$loglik
## [1] -230.0765
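Note that the quantity the MAP fit actually optimizes is not this log likelihood but the penalized objective \(J_w(D)\) defined above. With \(\mu = 0\) and \(\sigma = 0.5\) for every constraint, we can sketch it by hand:
# Penalized objective J_w(D) = LL - sum((w - mu)^2 / (2 * sigma^2)),
# with mu = 0 and sigma = 0.5 for every constraint
full_model_map$loglik - sum((full_model_map$weights - 0)^2 / (2 * 0.5^2))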
Let’s compare the predictions of each model on the new data.
new_file <- "amp_demo_grammar_new.csv"
predict_probabilities(new_file, full_model$weights, in_sep=',')
## $loglik
## [1] -258.2738
##
## $predictions
## UR SR Freq *Complex Max MaxStressed SSP Predicted Probability
## 1: 'staʒ 'staʒ 43 1 0 0 1 0.3769828
## 2: 'staʒ 'taʒ 57 0 1 1 0 0.6230172
## 3: 'krɛp 'krep 53 1 0 0 0 0.6230123
## 4: 'krɛp 'kɛp 47 0 1 1 0 0.3769877
## 5: sti.'lo sti.'lo 20 1 0 0 1 0.1230143
## 6: sti.'lo ti.'lo 80 0 1 0 0 0.8769857
## 7: krɔ.'kɛt kro.'kɛt 35 1 0 0 0 0.2769861
## 8: krɔ.'kɛt kɔ.'kɛt 65 0 1 0 0 0.7230139
## Observed Probability Error
## 1: 0.43 -0.05301723
## 2: 0.57 0.05301723
## 3: 0.53 0.09301233
## 4: 0.47 -0.09301233
## 5: 0.20 -0.07698567
## 6: 0.80 0.07698567
## 7: 0.35 -0.07301393
## 8: 0.65 0.07301393
predict_probabilities(new_file, full_model_map$weights, in_sep=',')
## $loglik
## [1] -254.3244
##
## $predictions
## UR SR Freq *Complex Max MaxStressed SSP Predicted Probability
## 1: 'staʒ 'staʒ 43 1 0 0 1 0.3721018
## 2: 'staʒ 'taʒ 57 0 1 1 0 0.6278982
## 3: 'krɛp 'krep 53 1 0 0 0 0.5831654
## 4: 'krɛp 'kɛp 47 0 1 1 0 0.4168346
## 5: sti.'lo sti.'lo 20 1 0 0 1 0.1622585
## 6: sti.'lo ti.'lo 80 0 1 0 0 0.8377415
## 7: krɔ.'kɛt kro.'kɛt 35 1 0 0 0 0.3137755
## 8: krɔ.'kɛt kɔ.'kɛt 65 0 1 0 0 0.6862245
## Observed Probability Error
## 1: 0.43 -0.05789816
## 2: 0.57 0.05789816
## 3: 0.53 0.05316539
## 4: 0.47 -0.05316539
## 5: 0.20 -0.03774152
## 6: 0.80 0.03774152
## 7: 0.35 -0.03622455
## 8: 0.65 0.03622455
Although the MAP model does a poorer job of fitting the training data (a log likelihood of -230.08 versus -228.81 for the MLE model), its predictions about the new data are closer to what is observed (-254.32 versus -258.27).
maxent.ot
has other functionality not covered here,
including:
We are in the process of implementing some additional functionality
in maxent.ot
, including: