
CoNLL-U Tools


CoNLL-U Tools is a Python toolkit for working with CoNLL-U files, Universal Dependencies treebanks, and annotated corpora. It provides utilities for format conversion, validation, evaluation, pattern matching, and morphological normalization, supporting workflows with CoNLL-U and brat standoff formats.

Read the documentation

Features

  • Format Conversion: Bidirectional conversion between brat standoff and CoNLL-U formats
  • Validation: Check CoNLL-U files for format compliance and annotation guideline adherence
  • Evaluation: Score parser outputs against gold-standard files with comprehensive metrics
  • Pattern Matching: Find tokens and sentences matching complex linguistic criteria
  • Morphological Utilities: Normalize features, convert between tagsets (Perseus, ITTB, PROIEL, LLCT)
  • Extensible: Add custom tagset converters and feature mappings

For detailed information about each feature, see the User Guide.

Installation

Quick Install

pip install conllu_tools

For detailed installation instructions, including platform-specific guidance and troubleshooting, see the Installation Guide.

Quick Start

Convert CoNLL-U to brat

from conllu_tools.io import conllu_to_brat

conllu_to_brat(
    conllu_filename='path/to/conllu/yourfile.conllu',
    output_directory='path/to/brat/files',
    sents_per_doc=10,
    output_root=True,
)

# Outputs .ann and .txt files to 'path/to/brat/files', alongside
# annotation.conf, tools.conf, visual.conf, and metadata.json
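
To convert an entire directory of CoNLL-U files, the same function can be applied in a loop. This is only a sketch: it relies solely on the parameters shown above, and the directory layout is illustrative.

from pathlib import Path
from conllu_tools.io import conllu_to_brat

# Convert every .conllu file in a directory, writing each file's brat
# output to its own subdirectory (layout is illustrative).
for conllu_file in Path('path/to/conllu').glob('*.conllu'):
    conllu_to_brat(
        conllu_filename=str(conllu_file),
        output_directory=f'path/to/brat/{conllu_file.stem}',
        sents_per_doc=10,
        output_root=True,
    )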

Convert brat to CoNLL-U

from conllu_tools.io import brat_to_conllu, load_language_data

feature_set = load_language_data('feats', language='la')
brat_to_conllu(
    input_directory='path/to/brat/files',
    output_directory='path/to/conllu',
    ref_conllu='yourfile.conllu',
    feature_set=feature_set,
    output_root=True
)

# Outputs yourfile-from_brat.conllu to 'path/to/conllu'
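
A common workflow is to export a treebank to brat for manual correction and then convert the edited annotations back. The sketch below simply chains the conversion call with the validator described in the next section; all names and parameters are taken from the examples in this README.

from conllu_tools.io import brat_to_conllu, load_language_data
from conllu_tools import ConlluValidator

# Convert the edited brat annotations back to CoNLL-U...
feature_set = load_language_data('feats', language='la')
brat_to_conllu(
    input_directory='path/to/brat/files',
    output_directory='path/to/conllu',
    ref_conllu='yourfile.conllu',
    feature_set=feature_set,
    output_root=True,
)

# ...and check the result for format compliance.
validator = ConlluValidator(lang='la', level=2)
reporter = validator.validate_file('path/to/conllu/yourfile-from_brat.conllu')
print(f'Errors found: {reporter.get_error_count()}')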

Validate CoNLL-U Files

from conllu_tools import ConlluValidator

validator = ConlluValidator(lang='la', level=2)
reporter = validator.validate_file('path/to/yourfile.conllu')

# Print error count
print(f'Errors found: {reporter.get_error_count()}')

# Inspect first error
sent_id, order, testlevel, error = reporter.errors[0]
print(f'Sentence ID: {sent_id}')  # e.g. 34
print(f'Testing at level: {testlevel}')  # e.g. 2
print(f'Error test level: {error.testlevel}')  # e.g. 1
print(f'Error type: {error.error_type}')  # e.g. "Metadata"
print(f'Test ID: {error.testid}')  # e.g. "text-mismatch"
print(f'Error message: {error.msg}')  # Full error message (see below)

# Print all errors formatted as strings
for error in reporter.format_errors():
    print(error)

# Example output:
# Sentence 34:
# [L2 Metadata text-mismatch] The text attribute does not match the text 
# implied by the FORM and SpaceAfter=No values. Expected: 'Una scala....' 
# Reconstructed: 'Una scala ....' (first diff at position 9)
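
Because reporter.errors is a plain list of tuples, it is easy to summarize problems before fixing them. The snippet below relies only on the tuple structure and the error_type attribute shown above; it is a sketch, not part of the validator API.

from collections import Counter

# Tally errors by type (e.g. "Metadata", "Format") to see where to start.
error_types = Counter(error.error_type for _, _, _, error in reporter.errors)
for error_type, count in error_types.most_common():
    print(f'{error_type}: {count}')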

Evaluate CoNLL-U Files

from conllu_tools import ConlluEvaluator

evaluator = ConlluEvaluator(eval_deprels=True, treebank_type='0')
scores = evaluator.evaluate_files(
    gold_path='path/to/gold_standard.conllu',
    system_path='path/to/system_output.conllu',
)

print(f'UAS: {scores["UAS"].f1:.2%}')
print(f'LAS: {scores["LAS"].f1:.2%}')

# Example output:
# UAS: 64.82%
# LAS: 48.16%
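
The evaluator returns a dictionary of named scores, so you can report every metric rather than just UAS and LAS. The loop below assumes each score object exposes an f1 attribute, as in the example above.

# Print the F1 score for every metric the evaluator returns.
for metric, score in scores.items():
    print(f'{metric}: {score.f1:.2%}')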

Pattern Matching

import conllu
from conllu_tools.matching import build_pattern, find_in_corpus

# Load corpus
with open('corpus.conllu', encoding='utf-8') as f:
    corpus = conllu.parse(f.read())

# Find all adjective + noun sequences
pattern = build_pattern('ADJ+NOUN', name='adj-noun')
matches = find_in_corpus(corpus, [pattern])

for match in matches:
    print(f"[{match.sentence_id}] {match.substring}")
    print(f"  Forms: {match.forms}")
    print(f"  Lemmata: {match.lemmata}")

# More pattern examples:
build_pattern('NOUN:lemma=rex')                    # Noun with lemma 'rex'
build_pattern('NOUN:feats=(Case=Abl)')             # Ablative noun
build_pattern('DET+ADJ{0,2}+NOUN')                 # Det + 0-2 adjectives + noun
build_pattern('ADP+NOUN:feats=(Case=Acc)')         # Preposition + accusative noun

For more examples and detailed usage, see the Quickstart Guide.

Documentation

The full documentation includes the Installation Guide, the Quickstart Guide, and the User Guide.

Acknowledgments

This toolkit builds upon and extends code from several sources:

License

The project is licensed under the MIT License, allowing free use, modification, and distribution.