
Small to Large Language Models: Revisiting the Federalist Papers

This repository contains the code used for the experiments in the paper: From Small to Large Language Models: Revisiting the Federalist Papers

📁 Project Structure

.
├── data/                  # Contains raw and processed Federalist Papers data
├── eda/                   # Notebooks and scripts for exploratory data analysis
├── llm-embeddings/        # Scripts for generating and saving LLM embeddings
├── postprocess/           # Postprocessing and analysis of embeddings
├── utils/                 # Shared utility functions
└── README.md              # Project documentation

1️⃣ Exploratory Data Analysis (EDA)

Located in the eda/ directory, this step includes:

  • Parsing and cleaning the Federalist Papers text
  • Visualizing word frequencies, document lengths, and topic assignments
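
As a minimal illustration of the word-frequency step (the file layout and tokenizer below are assumptions, not the repository's exact scripts):

import re
from collections import Counter
from pathlib import Path

# Hypothetical location of the cleaned Federalist Papers, one file per paper
paper_paths = sorted(Path("data").glob("federalist_*.txt"))

counts = Counter()
for path in paper_paths:
    text = path.read_text(encoding="utf-8").lower()
    counts.update(re.findall(r"[a-z]+", text))  # crude word tokenizer

print(counts.most_common(10))  # ten most frequent words across all papers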

2️⃣ Embedding Generation

We generate embeddings using both small and large language models.

🧠 Large Model Embeddings

Tested models:

  • BERT (bert-base-uncased)
  • RoBERTa (roberta-base)
  • BART (facebook/bart-base)
  • LLaMA (via transformers, if locally supported or via Hugging Face inference endpoints)

You can adapt the following template to load and generate embeddings for any Hugging Face model.

👉 Example: example-llama.ipynb

from transformers import BertModel, BertTokenizer
from transformers import pipeline
import torch

# Load the pre-trained BERT model and tokenizer
model = BertModel.from_pretrained("bert-base-uncased")
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "To the People of the State of New York"

# Build a feature-extraction pipeline from the pre-trained model and tokenizer
bert_extractor = pipeline(
    task="feature-extraction",
    model=model,
    tokenizer=bert_tokenizer,
    device=0  # GPU device index; set device=-1 to run on CPU
)

# Returns token-level hidden states as tensors
features = bert_extractor(text, return_tensors=True)
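
The feature-extraction pipeline returns one vector per token, not per document. As a minimal sketch (not the exact code in the repository), the token vectors can be mean-pooled into a single document embedding; with AutoModel/AutoTokenizer the same template applies to RoBERTa, BART, or a locally available LLaMA checkpoint (the model name below is illustrative).

import torch
from transformers import AutoModel, AutoTokenizer

# Any Hugging Face checkpoint can be substituted here
model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def document_embedding(text):
    """Mean-pool the last hidden states over non-padding tokens."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, dim)

vec = document_embedding("To the People of the State of New York")
print(vec.shape)  # torch.Size([1, 768]) for roberta-base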

☁️ GPT API Embeddings

We also generate embeddings using OpenAI's text-embedding-ada-002 and other GPT models available through the API; a minimal call sketch follows the requirements list below.

👉 Example: GPT-API-embedding.ipynb

Requires:

  • OpenAI API key
  • Adherence to rate limits and token constraints
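
A minimal sketch using the openai Python client (v1+ interface); the notebook in the repository may differ in batching and error handling.

import os
from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment; mind rate limits and token caps
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

texts = [
    "To the People of the State of New York",
    "After an unequivocal experience of the inefficiency of the subsisting federal government...",
]

response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
embeddings = [item.embedding for item in response.data]
print(len(embeddings), len(embeddings[0]))  # 2 documents, 1536 dimensions for ada-002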

3️⃣ Postprocessing Embeddings

Located in the postprocess/ directory, this stage includes:

  1. Word2Vec Embedding Generation

    • Code for generating Word2Vec-based embeddings from the Federalist Papers (a gensim sketch follows this list).
    • 📥 Requires downloading the pretrained Google News Word2Vec model (GoogleNews-vectors-negative300).
  2. Classification Models (BART & LASSO)

    • Scripts to train and evaluate classifiers (e.g., BART, LASSO) on various embeddings:
      • Continuous LLM embeddings (e.g., from BERT, GPT)
      • Bag-of-Words (BoW) embeddings (e.g., from LDA, LSA, or NMF)
  3. Benjamini-Hochberg (BH) procedure for selecting words (a statsmodels sketch follows this list)
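
A minimal Word2Vec sketch, assuming the Google News vectors have been downloaded locally (the filename below is the standard distribution name) and using gensim; here each paper is embedded by averaging its in-vocabulary word vectors, which may differ in detail from the repository's code.

import numpy as np
from gensim.models import KeyedVectors

# Assumes the standard Google News binary has been downloaded to this path
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)

def doc_vector(text):
    """Average the Word2Vec vectors of the in-vocabulary tokens."""
    tokens = [t for t in text.lower().split() if t in w2v]
    if not tokens:
        return np.zeros(w2v.vector_size)
    return np.mean([w2v[t] for t in tokens], axis=0)

vec = doc_vector("To the People of the State of New York")
print(vec.shape)  # (300,)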
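
The BH step can be reproduced with statsmodels' multipletests using method="fdr_bh"; a minimal sketch, assuming a vector of per-word p-values has already been computed (the words and p-values below are placeholders):

import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical per-word p-values from some word-level test
words = np.array(["upon", "whilst", "while", "there", "on"])
pvals = np.array([0.0004, 0.003, 0.20, 0.45, 0.01])

# Benjamini-Hochberg controls the false discovery rate at level alpha
reject, pvals_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(list(words[reject]))  # words selected after BH correction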

🧪 Requirements

You’ll need:

  • OpenAI API key (if using GPT embeddings)
  • Hugging Face Token (if using open-source LLMs)
  • Access to GPU for large-scale embedding generation (optional but recommended)

🤖 Credits

This README was generated and refined with the help of ChatGPT.
