
Commit c47a197

Author: Kevin (committed)
Commit message: add EMNLP 2016 system
1 parent b89471c, commit c47a197

17 files changed: +535 lines, -299 lines

README.md

Lines changed: 10 additions & 5 deletions
@@ -1,7 +1,7 @@
 # Coreference Resolution with Deep Learning
 
 This repository contains code for training and running the neural coreference models described in two papers:
-* [Coming Soon] ["Deep Reinforcement Learning for Mention-Ranking Coreference Models"](http://cs.stanford.edu/people/kevclark/resources/clark-manning-emnlp2016-deep.pdf), Kevin Clark and Christopher D. Manning, EMNLP 2016.
+* ["Deep Reinforcement Learning for Mention-Ranking Coreference Models"](http://cs.stanford.edu/people/kevclark/resources/clark-manning-emnlp2016-deep.pdf), Kevin Clark and Christopher D. Manning, EMNLP 2016.
 * ["Improving Coreference Resolution by Learning Entity-Level Distributed Representations"](http://cs.stanford.edu/people/kevclark/resources/clark-manning-acl16-improving.pdf), Kevin Clark and Christopher D. Manning, ACL 2016.
 
 ### Requirements
@@ -13,9 +13,14 @@ The easiest way of doing this is within Stanford's [CoreNLP](https://github.com/
 ```
 java -Xmx5g -cp stanford-corenlp.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,mention,coref -coref.algorithm neural -file example_file.txt
 ```
-You will need to fork the latest version from github and download the latest models from [here](http://nlp.stanford.edu/software/stanford-english-corenlp-models-current.jar).
+See the [CorefAnnotator](http://stanfordnlp.github.io/CoreNLP/coref.html) page for more details.
+
 
 #### Training your own model
-1. Download pretrained word embeddings. We use 50 dimensional word2vec embeddings for English ([link](https://drive.google.com/open?id=0B5Y5rz_RUKRmdEFPcGIwZ2xLRW8)) and 64 dimenensional [polyglot](https://sites.google.com/site/rmyeid/projects/polyglot) embeddings for Chinese ([link](http://bit.ly/19bTKeS)) in our paper.
-2. Run the [NeuralCorefDataExporter](https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/coref/neural/NeuralCorefDataExporter.java) class in the development version Stanford's CoreNLP using [this](https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/coref/neural/properties/english-conll.properties) properties file. This does mention detection and feature extraction on the CoNLL data and then outputs the results as json.
-3. Run run_all.py, preferably on a GPU. Training takes roughly 4 days on a GTX TITAN GPU.
+The following steps train the neural mention-ranking model with reward rescaling (the highest-scoring model from the papers).
+1. Download the CoNLL training data from [here](http://conll.cemantix.org/2012/data.html).
+2. Download pretrained word embeddings. We use 50-dimensional word2vec embeddings for English ([link](https://drive.google.com/open?id=0B5Y5rz_RUKRmdEFPcGIwZ2xLRW8)) and 64-dimensional [polyglot](https://sites.google.com/site/rmyeid/projects/polyglot) embeddings for Chinese ([link](http://bit.ly/19bTKeS)) in our paper.
+3. Run the [NeuralCorefDataExporter](https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/coref/neural/NeuralCorefDataExporter.java) class in the latest version of Stanford's CoreNLP using the [neural-coref-conll](https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/coref/properties/neural-english-conll.properties) properties file. This does mention detection and feature extraction on the CoNLL data and then outputs the results as json.
+4. Run run_all.py, preferably on a GPU. Training takes roughly 7 days on a GTX TITAN X GPU.
+
+run_all.py also contains methods to train the other models from the papers.
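
The "reward rescaling" mentioned in the new training instructions refers to the slack-rescaled max-margin objective of the EMNLP 2016 paper, where each wrong antecedent decision is penalized in proportion to how much it hurts the final coreference reward. The snippet below is a minimal illustrative sketch of that idea, not the code added in this commit; the function and argument names are hypothetical.

```python
import numpy as np

def reward_rescaled_loss(scores, true_antecedents, costs):
    """Slack-rescaled max-margin loss for a single anaphor (illustrative only).

    scores           -- model score s(a, m) for each candidate antecedent a
    true_antecedents -- indices of the correct antecedents for this anaphor
    costs            -- per-candidate mistake costs, e.g. the drop in B-cubed
                        reward incurred by linking to that candidate
    """
    best_true = max(scores[t] for t in true_antecedents)
    # Each wrong candidate should score below the best true antecedent by a
    # margin; violations are scaled by that candidate's mistake cost.
    losses = [costs[a] * max(0.0, 1.0 + scores[a] - best_true)
              for a in range(len(scores)) if a not in true_antecedents]
    return max(losses) if losses else 0.0

# Example: candidate 0 is the true antecedent; candidate 2 is a costly mistake.
print(reward_rescaled_loss(np.array([1.2, 0.9, 2.0]), {0}, np.array([0.0, 0.5, 1.0])))
```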

build_dataset.py renamed to build_datasets.py

Lines changed: 7 additions & 7 deletions
@@ -1,6 +1,6 @@
 import directories
-import util
-from dataset import PairDataBuilder, MentionDataBuilder, DocumentDataBuilder
+import utils
+from datasets import PairDataBuilder, MentionDataBuilder, DocumentDataBuilder
 from word_vectors import WordVectors
 import random
 import numpy as np
@@ -9,7 +9,7 @@
 def explore_pairwise_features():
     pos_sum, neg_sum = np.zeros(9), np.zeros(9)
     pos_count, neg_count = 0, 0
-    for i, d in enumerate(util.load_json_lines(directories.RAW + "train")):
+    for i, d in enumerate(utils.load_json_lines(directories.RAW + "train")):
         for key in d["labels"].keys():
             if d["labels"][key] == 1:
                 pos_sum += d["pair_features"][key]
@@ -25,7 +25,7 @@ def explore_pairwise_features():
 
 
 def build_dataset(vectors, name, tune_fraction=0.0, reduced=False, columns=None):
-    doc_vectors = util.load_pickle(directories.MISC + name.replace("_reduced", "") +
+    doc_vectors = utils.load_pickle(directories.MISC + name.replace("_reduced", "") +
                                    "_document_vectors.pkl")
 
     main_pairs = PairDataBuilder(columns)
@@ -35,9 +35,9 @@ def build_dataset(vectors, name, tune_fraction=0.0, reduced=False, columns=None)
     main_docs = DocumentDataBuilder(columns)
     tune_docs = DocumentDataBuilder(columns)
 
-    print "Building dataset", name
-    p = util.Progbar(target=(2 if reduced else util.lines_in_file(directories.RAW + name)))
-    for i, d in enumerate(util.load_json_lines(directories.RAW + name)):
+    print "Building dataset", name + ("/tune" if tune_fraction > 0 else "")
+    p = utils.Progbar(target=(2 if reduced else utils.lines_in_file(directories.RAW + name)))
+    for i, d in enumerate(utils.load_json_lines(directories.RAW + name)):
         if reduced and i > 2:
             break
         p.update(i + 1)
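
build_datasets.py consumes the exporter's output one document per line through utils.load_json_lines, which is not part of this diff. The stand-in below shows the assumed behavior of that helper (one JSON object per line); it is a sketch, not the repository's actual implementation.

```python
import json

def load_json_lines(path):
    """Yield one parsed JSON object (one document) per line of the file.

    Assumed behavior of the repo's utils.load_json_lines helper; the real
    implementation may differ.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Mirroring explore_pairwise_features above: each exported document is expected
# to carry "labels" and "pair_features" keyed by mention pair, for example:
# for doc in load_json_lines("data/raw/train"):
#     positive_pairs = [k for k, y in doc["labels"].items() if y == 1]
```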

clustering_learning.py

Lines changed: 22 additions & 21 deletions
@@ -1,9 +1,9 @@
-import dataset
+import datasets
 import clustering_models
 import model_properties
 import directories
 import timer
-import util
+import utils
 import evaluation
 from document import Document
 from clustering_preprocessing import ActionSpace
@@ -118,7 +118,7 @@ def train_all(self):
         timer.start("train")
 
         model_weights = self.model.get_weights()
-        prog = util.Progbar(len(self.memory))
+        prog = utils.Progbar(len(self.memory))
         random.shuffle(self.memory)
         for i, X in enumerate(self.memory):
             loss = self.train_on_example(X)
@@ -176,7 +176,7 @@ def __init__(self, trainer, docs, data, message, replay_memory=None, beta=0,
         random.shuffle(docs)
         if self.training:
             docs = docs[:docs_per_iteration]
-        prog = util.Progbar(len(docs))
+        prog = utils.Progbar(len(docs))
         for i, (doc, actionstate) in enumerate(docs):
             self.trainer.doc = doc
             self.trainer.actionstate = actionstate
@@ -243,27 +243,28 @@ def evaluate(trainer, docs, data, message):
 
 
 def load_docs(dataset_name, word_vectors):
-    return (dataset.Dataset(dataset_name, model_properties.MentionRankingProps(), word_vectors),
-            zip(util.load_pickle(directories.DOCUMENTS + dataset_name + '_docs.pkl'),
-                util.load_pickle(directories.ACTION_SPACE + dataset_name + '_action_space.pkl')))
+    return (datasets.Dataset(dataset_name, model_properties.MentionRankingProps(), word_vectors),
+            zip(utils.load_pickle(directories.DOCUMENTS + dataset_name + '_docs.pkl'),
+                utils.load_pickle(directories.ACTION_SPACE + dataset_name + '_action_space.pkl')))
 
 
 class Trainer:
     def __init__(self, model_props, train_set='train', test_set='dev', n_epochs=200,
                  empty_buffer=True, betas=None, write_every=1, max_docs=10000):
+        self.model_props = model_props
         if betas is None:
-            betas = [0]
+            betas = [0.8 ** i for i in range(1, 5)]
        self.write_every = write_every
 
-        print "Model=" + directories.CLUSTERER + ", ordering from " + directories.ACTION_SPACE
+        print "Model=" + model_props.path + ", ordering from " + directories.ACTION_SPACE
         self.pair_model, self.anaphoricity_model, self.model, word_vectors = \
             clustering_models.get_models(model_props)
         json_string = self.model.to_json()
-        open(directories.CLUSTERER + 'architecture.json', 'w').write(json_string)
-        util.rmkdir(directories.CLUSTERER + 'src')
+        open(model_props.path + 'architecture.json', 'w').write(json_string)
+        utils.rmkdir(model_props.path + 'src')
         for fname in os.listdir('.'):
             if fname.endswith('.py'):
-                shutil.copyfile(fname, directories.CLUSTERER + 'src/' + fname)
+                shutil.copyfile(fname, model_props.path + 'src/' + fname)
 
         self.train_data, self.train_docs = load_docs(train_set, word_vectors)
         print "Train loaded"
@@ -290,7 +291,7 @@ def __init__(self, model_props, train_set='train', test_set='dev', n_epochs=200,
         replay_memory = ReplayMemory(self, self.model)
         for self.epoch in range(n_epochs):
             print 80 * "-"
-            print "ITERATION", (self.epoch + 1), "model =", directories.CLUSTERER
+            print "ITERATION", (self.epoch + 1), "model =", model_props.path
             ar = AgentRunner(self, self.train_docs, self.train_data, "Training", replay_memory,
                              beta=0 if self.epoch >= len(betas) else betas[self.epoch])
             self.train_pairs = ar.merged_pairs
@@ -312,7 +313,7 @@ def run_evaluation(self):
         epoch_stats.update({"train " + k: v for k, v in train_scores.iteritems()})
         epoch_stats.update({"test " + k: v for k, v in test_scores.iteritems()})
         self.history.append(epoch_stats)
-        util.write_pickle(self.history, directories.CLUSTERER + 'history.pkl')
+        utils.write_pickle(self.history, self.model_props.path + 'history.pkl')
         timer.print_totals()
 
         test_conll = epoch_stats["test conll"]
@@ -327,17 +328,17 @@ def run_evaluation(self):
             print "New best CoNLL in window, saving model"
             self.save_progress(dev_pairs, test_pairs,
                                str(self.write_every * int(self.epoch / self.write_every)))
-            self.model.save_weights(directories.CLUSTERER + "weights.hdf5", overwrite=True)
+            self.model.save_weights(self.model_props.path + "weights.hdf5", overwrite=True)
 
     def save_progress(self, dev_pairs, test_pairs, prefix):
-        self.model.save_weights(directories.CLUSTERER + prefix + "_weights.hdf5", overwrite=True)
-        write_pairs(dev_pairs, prefix + "_dev_pairs")
-        write_pairs(test_pairs, prefix + "_test_pairs")
-        write_pairs(self.train_pairs, prefix + "_train_pairs")
+        self.model.save_weights(self.model_props.path + prefix + "_weights.hdf5", overwrite=True)
+        write_pairs(dev_pairs, self.model_props.path + prefix + "_dev_pairs")
+        write_pairs(test_pairs, self.model_props.path + prefix + "_test_pairs")
+        write_pairs(self.train_pairs, self.model_props.path + prefix + "_train_pairs")
 
 
-def write_pairs(pairs, name):
-    with open(directories.CLUSTERER + name, 'w') as f:
+def write_pairs(pairs, path):
+    with open(path, 'w') as f:
         for did, doc_merged_pairs in pairs.iteritems():
             f.write(str(did) + "\t")
             for m1, m2 in doc_merged_pairs:

clustering_preprocessing.py

Lines changed: 8 additions & 8 deletions
@@ -1,4 +1,4 @@
-import util
+import utils
 import directories
 import shutil
 import timer
@@ -91,7 +91,7 @@ def write_probable_pairs(dataset_name, action_space_path, scores):
     margin_removals = 0
     total_pairs = 0
     total_size = 0
-    for did in util.logged_loop(scores):
+    for did in utils.logged_loop(scores):
         doc_scores = scores[did]
         pairs = sorted([pair for pair in doc_scores.keys() if pair[0] != -1],
                        key=lambda pr: doc_scores[pr] - (-1 - 0.3*doc_scores[(-1, pr[1])]),
@@ -121,17 +121,17 @@ def write_probable_pairs(dataset_name, action_space_path, scores):
     print "avg size without filter: {:.1f}".format(total_pairs / float(len(scores)))
     print "avg size: {:.1f}".format(total_size / float(len(scores)))
     print "margin removals size: {:.1f}".format(margin_removals / float(len(scores)))
-    util.write_pickle(probable_pairs, action_space_path + dataset_name + '_probable_pairs.pkl')
+    utils.write_pickle(probable_pairs, action_space_path + dataset_name + '_probable_pairs.pkl')
     shutil.copyfile('clustering_preprocessing.py',
                     action_space_path + 'clustering_preprocessing.py')
 
 
 def write_action_spaces(dataset_name, action_space_path, model_path, ltr=False):
     output_file = action_space_path + dataset_name + "_action_space.pkl"
     print "Writing candidate actions to " + output_file
-    scores = util.load_pickle(model_path + dataset_name + "_scores.pkl")
+    scores = utils.load_pickle(model_path + dataset_name + "_scores.pkl")
     write_probable_pairs(dataset_name, action_space_path, scores)
-    probable_pairs = util.load_pickle(action_space_path + dataset_name + '_probable_pairs.pkl')
+    probable_pairs = utils.load_pickle(action_space_path + dataset_name + '_probable_pairs.pkl')
 
     possible_pairs_total = 0
     action_spaces = []
@@ -152,12 +152,12 @@ def write_action_spaces(dataset_name, action_space_path, model_path, ltr=False):
         possible_pairs = get_possible_pairs(probable_pairs[did])
         possible_pairs_total += len(possible_pairs)
         action_spaces.append(ActionSpace(did, actions, possible_pairs))
-    util.write_pickle(action_spaces, output_file)
+    utils.write_pickle(action_spaces, output_file)
 
 
 def main(ranking_model):
     write_action_spaces("dev", directories.ACTION_SPACE,
-                        directories.MODELS_BASE + ranking_model + "/")
+                        directories.MODELS + ranking_model + "/")
     write_action_spaces("test", directories.ACTION_SPACE,
-                        directories.MODELS_BASE + ranking_model + "/")
+                        directories.MODELS + ranking_model + "/")

dataset.py renamed to datasets.py

Lines changed: 20 additions & 8 deletions
@@ -1,5 +1,5 @@
 import timer
-import util
+import utils
 import directories
 import numpy as np
 
@@ -37,7 +37,7 @@ def write(self, path):
         self.data = np.array(self.data, dtype='bool') \
             if self.name == 'y' or self.name == 'pf' else np.vstack(self.data)
         print "Writing {:}, dtype={:}, size={:}".format(self.name, str(self.data.dtype),
-                                                        util.sizeof_fmt(self.data.nbytes))
+                                                        utils.sizeof_fmt(self.data.nbytes))
         np.save(path + self.name, self.data)
 
 
@@ -47,7 +47,7 @@ def __init__(self, columns=None):
         self.mention_inds = DatasetColumn('dmi', columns)
         self.pair_inds = DatasetColumn('dpi', columns)
         self.features = DatasetColumn('df', columns)
-        self.genres = util.load_pickle(directories.MISC + 'genres.pkl')
+        self.genres = utils.load_pickle(directories.MISC + 'genres.pkl')
 
     def add_doc(self, ms, me, ps, pe, features):
         self.mention_inds.append(np.array([ms, me], dtype='int32'))
@@ -58,7 +58,7 @@ def add_doc(self, ms, me, ps, pe, features):
     def write(self, dataset_name):
         path = directories.DOC_DATA + dataset_name + '/'
         if not self.columns:
-            util.rmkdir(path)
+            utils.rmkdir(path)
         self.mention_inds.write(path)
         self.pair_inds.write(path)
         self.features.write(path)
@@ -115,7 +115,7 @@ def span_vector(start, end):
     def write(self, dataset_name):
         path = directories.MENTION_DATA + dataset_name + '/'
         if not self.columns:
-            util.rmkdir(path)
+            utils.rmkdir(path)
         self.words.write(path)
         self.spans.write(path)
         self.features.write(path)
@@ -158,7 +158,7 @@ def add_pair(self, y, i1, i2, did, mid1, mid2, features):
     def write(self, dataset_name):
         path = directories.PAIR_DATA + dataset_name + '/'
         if not self.columns:
-            util.rmkdir(path)
+            utils.rmkdir(path)
         self.pair_indices.write(path)
         self.pair_features.write(path)
         self.y.write(path)
@@ -174,6 +174,7 @@ def size(self):
 class Dataset:
     def __init__(self, dataset_name, model_props, word_vectors):
         self.model_props = model_props
+        self.name = dataset_name
         mentions_path = directories.MENTION_DATA + dataset_name + '/'
         pair_path = directories.PAIR_DATA + dataset_name + '/'
         docs_path = directories.DOC_DATA + dataset_name + '/'
@@ -257,6 +258,16 @@ def featurize_pairs(self, m1, m2, batch, did):
 
 
 class DocumentBatchedDataset:
+    """
+    Shuffling and then iterating through all mention pairs in the dataset has two problems:
+    1. We want to compute a representation for a mention (in our case by looking up some
+       word embeddings and applying a hidden layer) once for every mention instead of
+       once for every pair of mentions it appears in.
+    2. For mention-ranking models, all pairs involving the current candidate anaphor must be
+       in the same batch.
+    We deal with this by instead using each document as a batch, except for large documents,
+    which we split into chunks.
+    """
     def __init__(self, dataset_name, model_props, max_pairs=10000, with_ids=False):
         self.name = dataset_name
         self.model_props = model_props
@@ -300,7 +311,6 @@ def __init__(self, dataset_name, model_props, max_pairs=10000, with_ids=False):
                                          np.ones(ana, dtype='int32')
                                          for ana in range(0, me - ms)])
         self.pair_nums += [np.array(p) for p in zip(pair_antecedents, pair_anaphors)]
-
         self.pair_nums = np.vstack(self.pair_nums)
 
         self.doc_sizes = {}
@@ -419,8 +429,8 @@ def __init__(self, dataset_name, model_props, max_pairs=10000, with_ids=False):
 
             min_anaphor = max_anaphor
             min_pair = max_pair
-
         timer.stop("preprocess_dataset")
+
         self.n_batches = len(self.batches)
         self.pairs_per_batch = float(self.n_pairs) / self.n_batches
         self.anaphoric_anaphors_per_batch = float(self.n_anaphoric_anaphors) / self.n_batches
@@ -478,6 +488,8 @@ def __iter__(self):
                 X['ends'] = ends[:, np.newaxis]
                 X['costs'] = costs[:, np.newaxis]
                 X['y'] = np.zeros((starts.size, 1))
+                if self.model_props.use_rewards:
+                    X['cost_ptrs'] = costs
             else:
                 X['y'] = self.y[pairs][:, np.newaxis]
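
The docstring added to DocumentBatchedDataset in the hunks above explains the batching strategy: treat each document as one batch so a mention's representation is computed once and all candidate antecedents of an anaphor stay together, splitting only very large documents into chunks. The sketch below illustrates that chunking idea in isolation; the names and the pair-count threshold are hypothetical simplifications, not the code from this commit.

```python
def document_batches(doc_mention_counts, max_pairs=10000):
    """Group anaphors into batches one document at a time (illustrative sketch).

    Anaphor k of a document contributes k candidate-antecedent pairs (all
    earlier mentions). A chunk is closed once it holds roughly max_pairs
    pairs, so batches never mix documents and never split an anaphor's pairs.
    """
    batches = []
    for doc_id, n_mentions in enumerate(doc_mention_counts):
        chunk, pairs_in_chunk = [], 0
        for anaphor in range(1, n_mentions):
            chunk.append((doc_id, anaphor))
            pairs_in_chunk += anaphor  # pairs (0, anaphor), ..., (anaphor - 1, anaphor)
            if pairs_in_chunk >= max_pairs:
                batches.append(chunk)
                chunk, pairs_in_chunk = [], 0
        if chunk:
            batches.append(chunk)
    return batches

# A 25-mention document stays whole; a 1000-mention document splits into chunks.
print([len(batch) for batch in document_batches([25, 1000])])
```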

directories.py

Lines changed: 14 additions & 31 deletions
@@ -1,10 +1,10 @@
-import util
+import utils
 
-DATA = './data/'
+DATA = '/scr/kevclark/clean_english_conll/'#'./data/'
 
 RAW = DATA + 'raw/'
-MODELS_BASE = DATA + 'models/'
-CLUSTERER_BASE = DATA + 'clusterers/'
+MODELS = DATA + 'models/'
+CLUSTERERS = DATA + 'clusterers/'
 DOCUMENTS = DATA + 'documents/'
 ACTION_SPACES_BASE = DATA + 'action_spaces/'
 GOLD = DATA + 'gold/'
@@ -19,36 +19,19 @@
 PAIR_DATA = FEATURES_BASE + 'mention_pair_data/'
 DOC_DATA = FEATURES_BASE + 'doc_data/'
 
-MODEL_NAME = 'model/'
-MODEL = MODELS_BASE + MODEL_NAME
-
-CLUSTERER_NAME = 'clusterer/'
-CLUSTERER = CLUSTERER_BASE + CLUSTERER_NAME
-
 ACTION_SPACE_NAME = 'action_spaces/'
 ACTION_SPACE = ACTION_SPACES_BASE + ACTION_SPACE_NAME
 
 assert DATA[-1] == '/'
 assert ACTION_SPACE_NAME[-1] == '/'
-assert MODEL_NAME[-1] == '/'
-assert CLUSTERER[-1] == '/'
-
-util.mkdir(MISC)
-util.mkdir(FEATURES_BASE)
-util.mkdir(MENTION_DATA)
-util.mkdir(PAIR_DATA)
-util.mkdir(DOC_DATA)
-util.mkdir(MODELS_BASE)
-util.mkdir(CLUSTERER_BASE)
-util.mkdir(MODEL)
-util.mkdir(CLUSTERER)
-util.mkdir(DOCUMENTS)
-util.mkdir(ACTION_SPACES_BASE)
-util.mkdir(ACTION_SPACE)
-
 
-def set_model_name(model_name):
-    global MODEL_NAME, MODEL
-    MODEL_NAME = model_name + '/'
-    MODEL = MODELS_BASE + MODEL_NAME
-    util.mkdir(MODEL)
+utils.mkdir(MISC)
+utils.mkdir(FEATURES_BASE)
+utils.mkdir(MENTION_DATA)
+utils.mkdir(PAIR_DATA)
+utils.mkdir(DOC_DATA)
+utils.mkdir(MODELS)
+utils.mkdir(CLUSTERERS)
+utils.mkdir(DOCUMENTS)
+utils.mkdir(ACTION_SPACES_BASE)
+utils.mkdir(ACTION_SPACE)
