CoNLL-U Tools is a Python toolkit for working with CoNLL-U files, Universal Dependencies treebanks, and annotated corpora. It provides utilities for format conversion, validation, evaluation, pattern matching, and morphological normalization, supporting workflows with CoNLL-U and brat standoff formats.
- Format Conversion: Bidirectional conversion between brat standoff and CoNLL-U formats
- Validation: Check CoNLL-U files for format compliance and annotation guideline adherence
- Evaluation: Score parser outputs against gold-standard files with comprehensive metrics
- Pattern Matching: Find tokens and sentences matching complex linguistic criteria
- Morphological Utilities: Normalize features, convert between tagsets (Perseus, ITTB, PROIEL, LLCT)
- Extensible: Add custom tagset converters and feature mappings
For detailed information about each feature, see the User Guide.
pip install conllu_tools

For detailed installation instructions, including platform-specific guidance and troubleshooting, see the Installation Guide.
from conllu_tools.io import conllu_to_brat
conllu_to_brat(
    conllu_filename='path/to/conllu/yourfile.conllu',
    output_directory='path/to/brat/files',
    sents_per_doc=10,
    output_root=True,
)
# Outputs .ann and .txt files to 'path/to/brat/files', alongside
# annotation.conf, tools.conf, visual.conf, and metadata.json

from conllu_tools.io import brat_to_conllu
from conllu_tools.io import load_language_data
feature_set = load_language_data('feats', language='la')
brat_to_conllu(
    input_directory='path/to/brat/files',
    output_directory='path/to/conllu',
    ref_conllu='yourfile.conllu',
    feature_set=feature_set,
    output_root=True
)
# Outputs yourfile-from_brat.conllu to 'path/to/conllu'

from conllu_tools import ConlluValidator
validator = ConlluValidator(lang='la', level=2)
reporter = validator.validate_file('path/to/yourfile.conllu')
# Print error count
print(f'Errors found: {reporter.get_error_count()}')
# Inspect first error
sent_id, order, testlevel, error = reporter.errors[0]
print(f'Sentence ID: {sent_id}') # e.g. 34
print(f'Testing at level: {testlevel}') # e.g. 2
print(f'Error test level: {error.testlevel}') # e.g. 1
print(f'Error type: {error.error_type}') # e.g. "Metadata"
print(f'Test ID: {error.testid}') # e.g. "text-mismatch"
print(f'Error message: {error.msg}') # Full error message (see below)
# Print all errors formatted as strings
for error in reporter.format_errors():
    print(error)
# Example output:
# Sentence 34:
# [L2 Metadata text-mismatch] The text attribute does not match the text
# implied by the FORM and SpaceAfter=No values. Expected: 'Una scala....'
# Reconstructed: 'Una scala ....' (first diff at position 9)
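Because reporter.errors exposes structured tuples (as unpacked above), errors can also be filtered programmatically before printing. A minimal sketch, reusing only the fields shown in this example:
# Keep only metadata-related errors, using the fields unpacked above
metadata_errors = [
    (sent_id, err.msg)
    for sent_id, order, testlevel, err in reporter.errors
    if err.error_type == 'Metadata'
]
print(f'Metadata errors found: {len(metadata_errors)}')
for sent_id, msg in metadata_errors:
    print(f'Sentence {sent_id}: {msg}')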
from conllu_tools import ConlluEvaluator

evaluator = ConlluEvaluator(eval_deprels=True, treebank_type='0')
scores = evaluator.evaluate_files(
    gold_path='path/to/gold_standard.conllu',
    system_path='path/to/system_output.conllu',
)
print(f'UAS: {scores["UAS"].f1:.2%}')
print(f'LAS: {scores["LAS"].f1:.2%}')
# Example output:
# UAS: 64.82%
# LAS: 48.16%
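UAS and LAS are just two of the reported metrics. Assuming the returned scores object behaves like a plain dict of score entries (each exposing f1, as used above), the full table can be printed in one loop:
# Print every available metric (each score entry exposes .f1, as above)
for metric, score in scores.items():
    print(f'{metric}: {score.f1:.2%}')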
import conllu
from conllu_tools.matching import build_pattern, find_in_corpus
# Load corpus
with open('corpus.conllu', encoding='utf-8') as f:
    corpus = conllu.parse(f.read())
# Find all adjective + noun sequences
pattern = build_pattern('ADJ+NOUN', name='adj-noun')
matches = find_in_corpus(corpus, [pattern])
for match in matches:
print(f"[{match.sentence_id}] {match.substring}")
print(f" Forms: {match.forms}")
print(f" Lemmata: {match.lemmata}")
# More pattern examples:
build_pattern('NOUN:lemma=rex') # Noun with lemma 'rex'
build_pattern('NOUN:feats=(Case=Abl)') # Ablative noun
build_pattern('DET+ADJ{0,2}+NOUN') # Det + 0-2 adjectives + noun
build_pattern('ADP+NOUN:feats=(Case=Acc)')   # Preposition + accusative noun
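The morphological utilities listed under Features (feature normalization and tagset conversion) are not shown above. The sketch below is purely illustrative: the conllu_tools.morphology module path and the normalize_features and convert_tagset names are assumptions, not the documented API; see the User Guide and API Reference for the actual entry points.
# Illustrative sketch only: the module path and function names below are
# assumed for this example, not taken from the documented API.
from conllu_tools.morphology import normalize_features, convert_tagset  # hypothetical

# Normalize a raw FEATS string against the Latin feature inventory
feats = normalize_features('case=abl|number=sing', language='la')
print(feats)  # e.g. 'Case=Abl|Number=Sing'

# Map a Perseus-style morphological tag to UD POS and features
upos, ud_feats = convert_tagset('n-s---fb-', source='perseus')
print(upos, ud_feats)  # e.g. 'NOUN', 'Case=Abl|Gender=Fem|Number=Sing'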
For more examples and detailed usage, see the Quickstart Guide.

The full documentation includes:
- Installation Guide: Detailed installation instructions and troubleshooting
- Quickstart Guide: Get started quickly with common tasks
- User Guide: Comprehensive guides for all features
- Conversion: CoNLL-U ↔ brat conversion
- Validation: Validation framework and recipes
- Evaluation: Metrics and evaluation workflows
- Pattern Matching: Find complex linguistic patterns
- Utilities: Tagset conversion and normalization
- API Reference: Complete API documentation
This toolkit builds upon and extends code from several sources:
- CoNLL-U/brat conversion logic is based on the tools made available by the brat team.
- CoNLL-U evaluation is based on the work of Milan Straka and Martin Popel for the CoNLL 2018 UD shared task, and Gosse Bouma for the IWPT 2020 shared task.
- CoNLL-U validation is based on work by Filip Ginter and Sampo Pyysalo.
The project is licensed under the MIT License, allowing free use, modification, and distribution.