This repository contains the code used for the experiments in the paper: From Small to Large Language Models: Revisiting the Federalist Papers
```
.
├── data/            # Raw and processed Federalist Papers data
├── eda/             # Notebooks and scripts for exploratory data analysis
├── llm-embeddings/  # Scripts for generating and saving LLM embeddings
├── postprocess/     # Postprocessing and analysis of embeddings
├── utils/           # Shared utility functions
└── README.md        # Project documentation
```
Located in the `eda/` directory, this step includes:
- Parsing and cleaning the Federalist Papers text
- Visualizing word frequencies, document lengths, and topic assignments (a minimal word-frequency sketch is given below)
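As an illustration of the kind of counts used in this step, here is a minimal, hypothetical word-frequency sketch; the sample text and tokenization are placeholders, not the repository's actual scripts:

```python
import re
from collections import Counter

def word_frequencies(text: str, top_k: int = 20):
    """Lowercase, tokenize on letters/apostrophes, and count the most common words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_k)

# Hypothetical usage with one paper's raw text
sample = "To the People of the State of New York: After an unequivocal experience..."
print(word_frequencies(sample, top_k=5))
```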
We generate embeddings using both small and large language models.
Tested models:
- BERT (`bert-base-uncased`)
- RoBERTa (`roberta-base`)
- BART (`facebook/bart-base`)
- LLaMA (via `transformers`, if locally supported, or via Hugging Face inference endpoints)
You can adapt the following template to load and generate embeddings for any Hugging Face model.
👉 Example: `example-llama.ipynb`
```python
from transformers import BertModel, BertTokenizer
from transformers import pipeline
import torch

# Load the pre-trained BERT model and tokenizer
model = BertModel.from_pretrained("bert-base-uncased")
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "To the People of the State of New York"

# Build a feature-extraction pipeline from the model and tokenizer
bert_extractor = pipeline(
    task="feature-extraction",
    model=model,
    tokenizer=bert_tokenizer,
    device=0 if torch.cuda.is_available() else -1,  # GPU if available, else CPU
)

# Returns token-level embeddings for the input text
embeddings = bert_extractor(text, return_tensors=True)
```
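The extractor returns token-level embeddings of shape `(1, num_tokens, hidden_size)`. A common way to obtain one vector per document is mean pooling over tokens; this pooling choice is an assumption for illustration, not necessarily the paper's exact procedure:

```python
# Mean-pool token embeddings into a single document vector (illustrative choice)
token_embeddings = bert_extractor(text, return_tensors=True)[0]  # (num_tokens, hidden_size)
doc_embedding = token_embeddings.mean(dim=0)                     # (hidden_size,)
```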
We also generate embeddings using OpenAI's `text-embedding-ada-002` and other available GPT models via the API; a minimal call sketch follows the requirements below.
👉 Example: `GPT-API-embedding.ipynb`
Requires:
- OpenAI API key
- Adherence to rate limits and token constraints
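A minimal sketch of such an API call, using the `openai` Python client; the input texts are placeholders, and batching/pacing should be adjusted to your rate limits:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

papers = [
    "To the People of the State of New York: After an unequivocal experience...",
    "The subject speaks its own importance...",
]

# One request can embed several documents at once, subject to token limits
response = client.embeddings.create(model="text-embedding-ada-002", input=papers)
embeddings = [item.embedding for item in response.data]  # one vector per document
print(len(embeddings), len(embeddings[0]))
```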
Located in the `postprocess/` directory, this stage includes:
- **Word2Vec Embedding Generation**
  - Code for generating Word2Vec-based embeddings from the Federalist Papers (a minimal sketch follows this list).
  - 📥 Requires downloading the pretrained Google News Word2Vec model here.
- **Classification Models (BART & LASSO)**
  - Scripts to train and evaluate classifiers (e.g., BART, LASSO) on various embeddings (a LASSO-style sketch follows this list):
    - Continuous LLM embeddings (e.g., from BERT, GPT)
    - Bag-of-Words (BoW) embeddings (e.g., from LDA, LSA, or NMF)
- **Benjamini-Hochberg (BH) procedure for selecting words** (a minimal sketch follows this list)
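For the Word2Vec step, a minimal sketch using `gensim` with mean pooling of word vectors; the local file name and the pooling choice are assumptions, not necessarily the repository's exact procedure:

```python
import numpy as np
from gensim.models import KeyedVectors

# Path to the downloaded pretrained Google News vectors (assumed local filename)
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

def doc_embedding(tokens, kv):
    """Average the Word2Vec vectors of in-vocabulary tokens (zeros if none found)."""
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

vec = doc_embedding("to the people of the state of new york".split(), w2v)
print(vec.shape)  # (300,)
```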
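For the classification step, a sketch of an L1-penalized (LASSO-style) logistic regression on document embeddings with scikit-learn; the embeddings, labels, and split below are placeholders, and the BART classifier is not shown:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: document embeddings (n_papers x dim), y: author labels for papers of known authorship
X = np.random.randn(70, 768)          # placeholder embeddings
y = np.random.randint(0, 2, size=70)  # placeholder labels (e.g., 0 = Hamilton, 1 = Madison)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```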
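For the word-selection step, a minimal sketch of the Benjamini-Hochberg procedure applied to per-word p-values (e.g., from tests comparing word usage across authors); the p-values here are placeholders:

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

# Placeholder p-values, one per candidate word
p_vals = [0.001, 0.009, 0.04, 0.20, 0.75]
print(benjamini_hochberg(p_vals, alpha=0.05))
```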
You’ll need:
- OpenAI API key (if using GPT embeddings)
- Hugging Face Token (if using open-source LLMs)
- Access to GPU for large-scale embedding generation (optional but recommended)
This README was generated and refined with the help of ChatGPT.