Master's Thesis in Computational Linguistics
Goldsmiths, University of London | 2024/2025
This repository contains the complete implementation, data, and analysis for my Master of Arts thesis in Computational Linguistics at Goldsmiths, University of London (2024/2025). The thesis investigates how large language models systematically erase linguistic features associated with marginalized speakers when summarizing legal testimony.
Author: Malorie Grace Iovino
Degree: MA Computational Linguistics
Institution: Department of Computing, Goldsmiths, University of London
Supervisors: Dr. Geri Popova, Dr. Tony Russell-Rose, Dr. Gregory Mills
Field Project Advisor: Dr. Dave Lewis
This Master's thesis investigates computational prescriptivism as an algorithmic enforcement of linguistic standardization in NLP-based summarization systems, with particular focus on legal deposition transcripts. Through empirical analysis of real legal deposition excerpts and curated synthetic examples across six summarization models (BART, Pegasus, T5/Flan-T5, Lead-2, TextRank, and GPT-3.5), this study demonstrates that large language models routinely erase pragmatic features that are disproportionately associated with speakers of non-standard English dialects and women.
Key Findings:
- 🔴 70-85% of disfluency markers are systematically erased
- 🟡 30-45% of hedges and modal expressions are removed
- 🟢 Temporal and conditional markers show higher retention
- ⚠️ 30-40% of uncertain statements are transformed into categorical claims (e.g., "I think it was around noon" rendered as the categorical "It was noon")
Research Question:
How do large language models handle linguistic markers of uncertainty, conditional language, temporal expressions, and disfluency when summarizing legal deposition transcripts, and what does this reveal about their reliability and interpretability in general NLP applications?
The complete thesis document (PDF) includes:
- Introduction - Motivation and research questions
- Literature Review - Sociolinguistic variation, computational processing, summarization, legal NLP
- Methodology - Data collection, model selection, evaluation framework
- Empirical Analysis - Systematic evaluation across 6 models
- PDCI Framework - Novel metric for pragmatic distortion
- Discussion & Implications - Theoretical and practical contributions
- Conclusion - Synthesis and future directions
Requirements:
- Python 3.8 or higher
- CUDA-capable GPU (recommended for model inference)
- At least 16GB RAM
```bash
# Clone the repository
git clone https://github.com/malorieiovino/Computational-Prescriptivism.git
cd Computational-Prescriptivism

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
```bash
# Run feature extraction on datasets
python code/preprocessing/feature_extraction.py --data_path data/

# Evaluate models
python scripts/run_experiments.py --models all --dataset curated

# Generate PDCI scores
python code/evaluation/pdci_analyzer.py --input results/ --output results/pdci_scores/

# Create visualizations
python scripts/generate_figures.py
```
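For a quick sanity check outside the full pipeline, the snippet below sketches how a single excerpt can be summarized with one of the evaluated abstractive models, assuming the Hugging Face transformers library is installed (e.g., via requirements.txt). The checkpoint name, example excerpt, and generation settings are illustrative assumptions, not the thesis's exact configuration:

```python
from transformers import pipeline

# Load a standard BART summarization checkpoint (illustrative choice; the
# thesis experiments may use different checkpoints or decoding settings).
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

excerpt = (
    "Q: Where were you on the evening of the 14th? "
    "A: Um, I think I was, uh, probably at home, maybe around eight."
)

# Bound the summary length and decode deterministically.
result = summarizer(excerpt, max_length=40, min_length=5, do_sample=False)
print(result[0]["summary_text"])
```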
```text
Computational-Prescriptivism/
├── data/                    # Datasets
│   ├── original_datasets/   # Raw deposition data
│   ├── *_curated_data.csv   # Curated synthetic examples
│   └── *_pilot_data.csv     # Pilot study data
├── notebooks/               # Jupyter notebooks for analysis
├── figures/                 # Visualizations from thesis
│   ├── BART/                # Model-specific figures
│   ├── Pegasus/
│   ├── GPT-3.5/
│   ├── T5-FlanT5/
│   └── Comparisons/         # Cross-model comparisons
├── code/                    # Source code implementation
└── thesis/                  # Master's Thesis Document
    └── thesis.pdf           # Complete MA thesis
```
- Real Deposition Dataset: 351 excerpts from 10 anonymized legal depositions
- Curated Synthetic Dataset: 126 unique excerpts with gold-standard summaries
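Since the curated files follow the `*_curated_data.csv` pattern shown in the repository structure, they can be loaded together with pandas. A minimal sketch; the CSV schema is not assumed here, so inspect the printed columns rather than relying on any particular field names:

```python
import glob
import pandas as pd

# Collect every curated CSV under data/ and stack them into one frame.
paths = sorted(glob.glob("data/*_curated_data.csv"))
curated = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)

print(curated.shape)             # excerpts x columns
print(curated.columns.tolist())  # the actual schema of the curated data
```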
| Model Type  | Models                     | Key Characteristics                               |
|-------------|----------------------------|---------------------------------------------------|
| Abstractive | BART, Pegasus, T5, Flan-T5 | Generate new text; prone to feature erasure       |
| Extractive  | Lead-2, TextRank           | Select existing sentences; high feature retention |
| API-based   | GPT-3.5                    | Tested with 4 prompting conditions                |
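The extractive baselines are deliberately simple, which is why their feature retention (see the results table below) sits near the ceiling: Lead-2, for instance, keeps the first two sentences of the source verbatim. A minimal sketch of that baseline; the repository's implementation may differ, e.g., in how it splits sentences:

```python
import re

def lead_2(text: str) -> str:
    """Return the first two sentences of the text, unchanged."""
    # Naive split on sentence-final punctuation followed by whitespace;
    # real deposition transcripts may call for a more robust splitter.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:2])
```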
- Hedges & Modal Expressions (e.g., "I think", "maybe", "could")
- Conditional Constructions (factual and counterfactual)
- Temporal Expressions (e.g., "before", "after", "at the time")
- Disfluency Markers (e.g., "um", "uh", repetitions, self-corrections)
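For illustration only, markers in these categories can be counted with simple lexicon lookups. The toy lexicons below are assumptions, not the inventories the thesis uses; the real extraction logic is in code/preprocessing/feature_extraction.py:

```python
import re

# Assumed toy lexicons per feature category (illustrative only).
LEXICONS = {
    "hedges_modals": r"\b(?:i think|maybe|perhaps|might|could|probably)\b",
    "conditionals":  r"\b(?:if|unless|would have)\b",
    "temporal":      r"\b(?:before|after|at the time|when)\b",
    "disfluencies":  r"\b(?:um|uh|er)\b",
}

def count_features(text: str) -> dict:
    """Count occurrences of each feature category in the text."""
    lowered = text.lower()
    return {name: len(re.findall(pattern, lowered))
            for name, pattern in LEXICONS.items()}
```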
| Model                        | Overall Retention | Disfluency Retention | Certainty Inflation |
|------------------------------|-------------------|----------------------|---------------------|
| BART                         | 91.4%             | 93.3%                | Low                 |
| Pegasus                      | 58.7%             | 74.5%                | High                |
| T5                           | 82.1%             | 84.4%                | Moderate            |
| Flan-T5                      | 73.6%             | 84.3%                | Moderate            |
| Lead-2                       | 93.9%             | 97.2%                | None                |
| TextRank                     | 92.7%             | 96.7%                | None                |
| GPT-3.5 (Default)            | 66.2%             | 74.0%                | High                |
| GPT-3.5 (Feature-Preserving) | 80.0%             | 84.5%                | Low                 |
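One natural reading of the retention figures above is the share of source-side feature occurrences that survive into the summary. The thesis defines its measure precisely; the sketch below is only that reading made concrete, with assumed dict-of-counts inputs:

```python
def retention(source_counts: dict, summary_counts: dict) -> float:
    """Fraction of source feature occurrences preserved in the summary.

    Both arguments map feature categories to occurrence counts, e.g., as
    produced by a counter like count_features sketched earlier.
    """
    total = sum(source_counts.values())
    if total == 0:
        return 1.0  # nothing to preserve, so nothing was erased
    preserved = sum(min(summary_counts.get(cat, 0), n)
                    for cat, n in source_counts.items())
    return preserved / total
```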
| Model                        | PDCI Score | Interpretation      |
|------------------------------|------------|---------------------|
| BART                         | 0.055      | Lowest distortion   |
| T5                           | 0.105      | Moderate distortion |
| Flan-T5                      | 0.105      | Moderate distortion |
| Pegasus                      | 0.225      | Highest distortion  |
| GPT-3.5 (Default)            | 0.193      | High distortion     |
| GPT-3.5 (Feature-Preserving) | 0.146      | Moderate distortion |
This repository also contains the implementation of the Prescriptive Discourse Confidence Index (PDCI), a custom evaluation metric for summarization systems that captures pragmatic distortions (hedges, modals, conditionals, disfluencies, certainty shifts).
➡️ See the pdci code in this repository for usage instructions and examples.
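Purely as an illustration of the idea, not the repository's implementation or API, an index in this family might average per-category erasure rates and add a weighted penalty for certainty inflation; every name and weight below is an assumption:

```python
def pdci_like(retention_by_category: dict,
              certainty_inflation_rate: float,
              inflation_weight: float = 0.5) -> float:
    """Toy pragmatic-distortion score: higher means more distortion.

    Averages per-category erasure (1 - retention) and penalizes hedged
    statements that the summary renders as categorical claims.
    """
    erasure = [1.0 - r for r in retention_by_category.values()]
    base = sum(erasure) / len(erasure)
    return base + inflation_weight * certainty_inflation_rate
```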
This thesis was completed as part of the MA Computational Linguistics programme at Goldsmiths, University of London. The research was conducted between September 2024 and September 2025, with field project support from Nextpoint, a legal technology company.
Degree: Master of Arts in Computational Linguistics
Department: Computing
Institution: Goldsmiths, University of London
Academic Year: 2024/2025
This Master's thesis research was conducted with careful attention to ethical implications. All deposition transcripts were anonymized prior to use, with identifying information removed. The work aims to promote fairness and inclusivity in AI systems by making visible the linguistic biases embedded in current NLP technologies.