
CoNLL-U Tools


CoNLL-U Tools is a Python toolkit for working with CoNLL-U files, Universal Dependencies treebanks, and annotated corpora. It provides utilities for format conversion, validation, evaluation, pattern matching, and morphological normalization, supporting workflows with CoNLL-U and brat standoff formats.

Read the documentation

Features

  • Format Conversion: Bidirectional conversion between brat standoff and CoNLL-U formats
  • Validation: Check CoNLL-U files for format compliance and annotation guideline adherence
  • Evaluation: Score parser outputs against gold-standard files with comprehensive metrics
  • Pattern Matching: Find tokens and sentences matching complex linguistic criteria
  • Morphological Utilities: Normalize features, convert between tagsets (Perseus, ITTB, PROIEL, LLCT)
  • Extensible: Add custom tagset converters and feature mappings

For detailed information about each feature, see the User Guide.

Installation

Quick Install

pip install conllu_tools

For detailed installation instructions, including platform-specific guidance and troubleshooting, see the Installation Guide.

Quick Start

Convert CoNLL-U to brat

from conllu_tools.io import conllu_to_brat

conllu_to_brat(
    conllu_filename='path/to/conllu/yourfile.conllu',
    output_directory='path/to/brat/files',
    sents_per_doc=10,
    output_root=True,
)

# Outputs .ann and .txt files to 'path/to/brat/files', alongside
# annotation.conf, tools.conf, visual.conf, and metadata.json
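
To convert an entire directory of CoNLL-U files, the same function can be applied in a loop. This is only a sketch: it relies solely on the parameters shown above, and the directory layout is illustrative.

from pathlib import Path
from conllu_tools.io import conllu_to_brat

# Convert every .conllu file in a directory, writing each file's brat
# output to its own subdirectory (layout is illustrative).
for conllu_file in Path('path/to/conllu').glob('*.conllu'):
    conllu_to_brat(
        conllu_filename=str(conllu_file),
        output_directory=f'path/to/brat/{conllu_file.stem}',
        sents_per_doc=10,
        output_root=True,
    )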

Convert brat to CoNLL-U

from conllu_tools.io import brat_to_conllu, load_language_data

feature_set = load_language_data('feats', language='la')
brat_to_conllu(
    input_directory='path/to/brat/files',
    output_directory='path/to/conllu',
    ref_conllu='yourfile.conllu',
    feature_set=feature_set,
    output_root=True
)

# Outputs yourfile-from_brat.conllu to 'path/to/conllu'
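
A common workflow is to export a treebank to brat for manual correction and then convert the edited annotations back. The sketch below simply chains the conversion call with the validator described in the next section; all names and parameters are taken from the examples in this README.

from conllu_tools.io import brat_to_conllu, load_language_data
from conllu_tools import ConlluValidator

# Convert the edited brat annotations back to CoNLL-U...
feature_set = load_language_data('feats', language='la')
brat_to_conllu(
    input_directory='path/to/brat/files',
    output_directory='path/to/conllu',
    ref_conllu='yourfile.conllu',
    feature_set=feature_set,
    output_root=True,
)

# ...and check the result for format compliance.
validator = ConlluValidator(lang='la', level=2)
reporter = validator.validate_file('path/to/conllu/yourfile-from_brat.conllu')
print(f'Errors found: {reporter.get_error_count()}')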

Validate CoNLL-U Files

from conllu_tools import ConlluValidator

validator = ConlluValidator(lang='la', level=2)
reporter = validator.validate_file('path/to/yourfile.conllu')

# Print error count
print(f'Errors found: {reporter.get_error_count()}')

# Inspect first error
sent_id, order, testlevel, error = reporter.errors[0]
print(f'Sentence ID: {sent_id}')  # e.g. 34
print(f'Testing at level: {testlevel}')  # e.g. 2
print(f'Error test level: {error.testlevel}')  # e.g. 1
print(f'Error type: {error.error_type}')  # e.g. "Metadata"
print(f'Test ID: {error.testid}')  # e.g. "text-mismatch"
print(f'Error message: {error.msg}')  # Full error message (see below)

# Print all errors formatted as strings
for error in reporter.format_errors():
    print(error)

# Example output:
# Sentence 34:
# [L2 Metadata text-mismatch] The text attribute does not match the text 
# implied by the FORM and SpaceAfter=No values. Expected: 'Una scala....' 
# Reconstructed: 'Una scala ....' (first diff at position 9)
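
Because reporter.errors is a plain list of tuples, it is easy to summarize problems before fixing them. The snippet below relies only on the tuple structure and the error_type attribute shown above; it is a sketch, not part of the validator API.

from collections import Counter

# Tally errors by type (e.g. "Metadata", "Format") to see where to start.
error_types = Counter(error.error_type for _, _, _, error in reporter.errors)
for error_type, count in error_types.most_common():
    print(f'{error_type}: {count}')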

Evaluate CoNLL-U Files

from conllu_tools import ConlluEvaluator

evaluator = ConlluEvaluator(eval_deprels=True, treebank_type='0')
scores = evaluator.evaluate_files(
    gold_path='path/to/gold_standard.conllu',
    system_path='path/to/system_output.conllu',
)

print(f'UAS: {scores["UAS"].f1:.2%}')
print(f'LAS: {scores["LAS"].f1:.2%}')

# Example output:
# UAS: 64.82%
# LAS: 48.16%
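
The evaluator returns a dictionary of named scores, so you can report every metric rather than just UAS and LAS. The loop below assumes each score object exposes an f1 attribute, as in the example above.

# Print the F1 score for every metric the evaluator returns.
for metric, score in scores.items():
    print(f'{metric}: {score.f1:.2%}')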

Pattern Matching

import conllu
from conllu_tools.matching import build_pattern, find_in_corpus

# Load corpus
with open('corpus.conllu', encoding='utf-8') as f:
    corpus = conllu.parse(f.read())

# Find all adjective + noun sequences
pattern = build_pattern('ADJ+NOUN', name='adj-noun')
matches = find_in_corpus(corpus, [pattern])

for match in matches:
    print(f"[{match.sentence_id}] {match.substring}")
    print(f"  Forms: {match.forms}")
    print(f"  Lemmata: {match.lemmata}")

# More pattern examples:
build_pattern('NOUN:lemma=rex')                    # Noun with lemma 'rex'
build_pattern('NOUN:feats=(Case=Abl)')             # Ablative noun
build_pattern('DET+ADJ{0,2}+NOUN')                 # Det + 0-2 adjectives + noun
build_pattern('ADP+NOUN:feats=(Case=Acc)')         # Preposition + accusative noun

For more examples and detailed usage, see the Quickstart Guide.

Documentation

The full documentation includes the Installation Guide, the Quickstart Guide, and the User Guide.

Acknowledgments

This toolkit builds upon and extends code from several sources:

License

The project is licensed under the MIT License, allowing free use, modification, and distribution.