Computational Prescriptivism: Algorithmic Reinforcement of Linguistic Bias in Legal NLP

Master's Thesis in Computational Linguistics
Goldsmiths, University of London | 2024/2025

About This Project

This repository contains the complete implementation, data, and analysis for my Master of Arts thesis in Computational Linguistics at Goldsmiths, University of London (2024/2025). The thesis investigates how large language models systematically erase linguistic features associated with marginalized speakers when summarizing legal testimony.

Author: Malorie Grace Iovino
Degree: MA Computational Linguistics
Institution: Department of Computing, Goldsmiths, University of London
Supervisors: Dr. Geri Popova, Dr. Tony Russell-Rose, Dr. Gregory Mills
Field Project Advisor: Dr. Dave Lewis

Abstract

This Master's thesis investigates computational prescriptivism as an algorithmic enforcement of linguistic standardization in NLP-based summarization systems, with particular focus on legal deposition transcripts. Through empirical analysis of real legal deposition excerpts and curated synthetic examples across six summarization models (BART, Pegasus, T5/Flan-T5, Lead-2, TextRank, and GPT-3.5), this study demonstrates that large language models routinely erase pragmatic features that are disproportionately associated with speakers of non-standard English dialects and women.

Key Findings:

🔴 70-85% of disfluency markers are systematically erased
🟡 30-45% of hedges and modal expressions are removed
🟢 Temporal and conditional markers show higher retention
⚠️ 30-40% of uncertain statements are transformed into categorical claims

Research Question

How do large language models handle linguistic markers of uncertainty, conditional language, temporal expressions, and disfluency when summarizing legal deposition transcripts, and what does this reveal about their reliability and interpretability in general NLP applications?

Thesis Structure

The complete thesis document (PDF) includes:

Introduction - Motivation and research questions
Literature Review - Sociolinguistic variation, computational processing, summarization, legal NLP
Methodology - Data collection, model selection, evaluation framework
Empirical Analysis - Systematic evaluation across 6 models
PDCI Framework - Novel metric for pragmatic distortion
Discussion & Implications - Theoretical and practical contributions
Conclusion - Synthesis and future directions

Quick Start

Prerequisites

Python 3.8 or higher
CUDA-capable GPU (recommended for model inference)
At least 16GB RAM

Installation

# Clone the repository
git clone https://github.com/malorieiovino/Computational-Prescriptivism.git
cd Computational-Prescriptivism

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Running the Analysis

# Run feature extraction on datasets
python code/preprocessing/feature_extraction.py --data_path data/

# Evaluate models
python scripts/run_experiments.py --models all --dataset curated

# Generate PDCI scores
python code/evaluation/pdci_analyzer.py --input results/ --output results/pdci_scores/

# Create visualizations
python scripts/generate_figures.py

Repository Structure

Computational-Prescriptivism/
├── data/                      # Datasets
│   ├── original_datasets/     # Raw deposition data
│   ├── *_curated_data.csv     # Curated synthetic examples
│   └── *_pilot_data.csv       # Pilot study data
├── notebooks/                 # Jupyter notebooks for analysis
├── figures/                   # Visualizations from thesis
│   ├── BART/                  # Model-specific figures
│   ├── Pegasus/
│   ├── GPT-3.5/
│   ├── T5-FlanT5/
│   └── Comparisons/           # Cross-model comparisons
├── code/                      # Source code implementation
└── thesis/                    # Master's Thesis Document
    └── thesis.pdf             # Complete MA thesis

Methodology

Datasets

Real Deposition Dataset: 351 excerpts from 10 anonymized legal depositions
Curated Synthetic Dataset: 126 unique excerpts with gold-standard summaries

Models Evaluated

Model Type	Models	Key Characteristics
Abstractive	BART, Pegasus, T5, Flan-T5	Generate new text, prone to feature erasure
Extractive	Lead-2, TextRank	Select existing sentences, high feature retention
API-based	GPT-3.5	Tested with 4 prompting conditions
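
For readers unfamiliar with the extractive baselines, the following is a minimal sketch of a Lead-2 summarizer, which simply keeps the first two sentences of the input. The function name `lead_k` and the regex-based sentence splitter are assumptions made for illustration and are not necessarily what the thesis code uses.

```python
import re

def lead_k(text: str, k: int = 2) -> str:
    """Return the first k sentences of `text` (Lead-k extractive baseline)."""
    # Naive split after ., ? or ! followed by whitespace. This is an
    # assumption; a real pipeline would likely use a proper sentence tokenizer.
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    return " ".join(sentences[:k])

excerpt = (
    "Um, I think I saw him before noon. "
    "I'm not sure, though. "
    "It could have been later."
)
print(lead_k(excerpt))
```

Because the output is copied verbatim from the source, any hedges or disfluencies in the selected sentences survive unchanged, which is consistent with the high retention reported for extractive systems.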

Linguistic Features Analyzed

Hedges & Modal Expressions (e.g., "I think", "maybe", "could")
Conditional Constructions (factual and counterfactual)
Temporal Expressions (e.g., "before", "after", "at the time")
Disfluency Markers (e.g., "um", "uh", repetitions, self-corrections)
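
The four categories above can be approximated with simple pattern matching. The sketch below is illustrative only: the lexicons are deliberately tiny, the conditional cues ("if", "unless") are an assumption not listed above, and repetitions and self-corrections are not handled. It should not be read as the feature extractor used in the thesis.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons, built mostly from the examples in the list above.
FEATURE_PATTERNS = {
    "hedge_modal": r"\b(?:i think|maybe|could)\b",
    "conditional": r"\b(?:if|unless)\b",  # assumed cues, not from the README
    "temporal": r"\b(?:before|after|at the time)\b",
    "disfluency": r"\b(?:um|uh)\b",
}

def count_features(text: str) -> Counter:
    """Count case-insensitive matches of each feature category in `text`."""
    counts = Counter()
    for name, pattern in FEATURE_PATTERNS.items():
        counts[name] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

sample = "Um, I think he left before lunch, but, uh, maybe it was after."
print(count_features(sample))
```

Word-boundary anchors (`\b`) keep short cues such as "um" from matching inside longer words.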

Key Results

Feature Retention Rates by Model

Model	Overall Retention	Disfluency Retention	Certainty Inflation
BART	91.4%	93.3%	Low
Pegasus	58.7%	74.5%	High
T5	82.1%	84.4%	Moderate
Flan-T5	73.6%	84.3%	Moderate
Lead-2	93.9%	97.2%	None
TextRank	92.7%	96.7%	None
GPT-3.5 (Default)	66.2%	74.0%	High
GPT-3.5 (Feature-Preserving)	80.0%	84.5%	Low
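
As a rough guide to reading the retention columns, one way a retention rate could be computed is as the share of source markers that still appear in the summary. The sketch below assumes a count-based definition with no alignment between source and summary tokens. The helper names and the cap at 1.0 are assumptions for illustration, and the thesis may define retention differently.

```python
import re

HEDGES = ["i think", "maybe", "could"]  # examples from the feature list

def count_markers(text, markers):
    """Count case-insensitive whole-word occurrences of any marker."""
    pattern = r"\b(?:" + "|".join(re.escape(m) for m in markers) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def retention_rate(source, summary, markers):
    """Fraction of source markers retained in the summary, capped at 1.0.

    Returns None when the source contains no markers, since retention
    is undefined in that case.
    """
    n_source = count_markers(source, markers)
    if n_source == 0:
        return None
    n_summary = count_markers(summary, markers)
    return min(n_summary, n_source) / n_source

source = "I think it was maybe around noon. He could have left earlier."
summary = "The witness said it was maybe around noon and he left earlier."
print(retention_rate(source, summary, HEDGES))
```

In this toy pair only one of three hedges survives, and "could have left" becomes "left", which is the kind of shift the Certainty Inflation column describes.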

PDCI Framework Scores (Lower is Better)

Model	PDCI Score	Interpretation
BART	0.055	Lowest distortion
T5	0.105	Moderate distortion
Flan-T5	0.105	Moderate distortion
Pegasus	0.225	Highest distortion
GPT-3.5 (Default)	0.193	High distortion
GPT-3.5 (Feature-Preserving)	0.146	Moderate distortion

PDCI Metric

This repository also contains the implementation of the Prescriptive Discourse Confidence Index (PDCI), a custom evaluation metric for summarization systems that captures pragmatic distortions (hedges, modals, conditionals, disfluencies, certainty shifts).

➡️ See the pdci/ directory for usage instructions and examples.

Academic Context

This thesis was completed as part of the MA Computational Linguistics programme at Goldsmiths, University of London. The research was conducted between September 2024 and September 2025, with field project support from Nextpoint, a legal technology company.

Programme Information

Degree: Master of Arts in Computational Linguistics
Department: Computing
Institution: Goldsmiths, University of London
Academic Year: 2024/2025

Ethical Considerations

This Master's thesis research was conducted with careful attention to ethical implications. All deposition transcripts were anonymized prior to use, with identifying information removed. The work aims to promote fairness and inclusivity in AI systems by making visible the linguistic biases embedded in current NLP technologies.
