
Small to Large Language Models: Revisiting the Federalist Papers

This repository contains the code used for the experiments in the paper: From Small to Large Language Models: Revisiting the Federalist Papers

📁 Project Structure

.
├── data/                  # Contains raw and processed Federalist Papers data
├── eda/                   # Notebooks and scripts for exploratory data analysis
├── llm-embeddings/        # Scripts for generating and saving LLM embeddings
├── postprocess/           # Postprocessing and analysis of embeddings
├── utils/                 # Shared utility functions
└── README.md              # Project documentation

1️⃣ Exploratory Data Analysis (EDA)

Located in the eda/ directory, this step includes:

  • Parsing and cleaning the Federalist Papers text
  • Visualizing word frequencies, document lengths, and topic assignments
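
As a minimal illustration of the word-frequency step (the file layout and tokenizer below are assumptions, not the repository's exact scripts):

import re
from collections import Counter
from pathlib import Path

# Hypothetical location of the cleaned Federalist Papers, one file per paper
paper_paths = sorted(Path("data").glob("federalist_*.txt"))

counts = Counter()
for path in paper_paths:
    text = path.read_text(encoding="utf-8").lower()
    counts.update(re.findall(r"[a-z]+", text))  # crude word tokenizer

print(counts.most_common(10))  # ten most frequent words across all papers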

2️⃣ Embedding Generation

We generate embeddings using both small and large language models.

🧠 Large Model Embeddings

Tested models:

  • BERT (bert-base-uncased)
  • RoBERTa (roberta-base)
  • BART (facebook/bart-base)
  • LLaMA (via transformers, if locally supported or via Hugging Face inference endpoints)

You can adapt the following template to load and generate embeddings for any Hugging Face model.

👉 Example: example-llama.ipynb

from transformers import BertModel, BertTokenizer
from transformers import pipeline
import torch

# Load the pre-trained BERT model and tokenizer
model = BertModel.from_pretrained("bert-base-uncased")
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "To the People of the State of New York"

# Build a feature-extraction pipeline from the pre-trained model and tokenizer
bert_extractor = pipeline(
    task="feature-extraction",
    model=model,
    tokenizer=bert_tokenizer,
    device=0  # GPU device index; set device=-1 to run on CPU
)

# Returns token-level hidden states as tensors
features = bert_extractor(text, return_tensors=True)
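
The feature-extraction pipeline returns one vector per token, not per document. As a minimal sketch (not the exact code in the repository), the token vectors can be mean-pooled into a single document embedding; with AutoModel/AutoTokenizer the same template applies to RoBERTa, BART, or a locally available LLaMA checkpoint (the model name below is illustrative).

import torch
from transformers import AutoModel, AutoTokenizer

# Any Hugging Face checkpoint can be substituted here
model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def document_embedding(text):
    """Mean-pool the last hidden states over non-padding tokens."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, dim)

vec = document_embedding("To the People of the State of New York")
print(vec.shape)  # torch.Size([1, 768]) for roberta-base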

☁️ GPT API Embeddings

We also generate embeddings using OpenAI's text-embedding-ada-002 and other GPT models available through the API; a minimal call sketch follows the requirements list below.

👉 Example: GPT-API-embedding.ipynb

Requires:

  • OpenAI API key
  • Adherence to rate limits and token constraints
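
A minimal sketch using the openai Python client (v1+ interface); the notebook in the repository may differ in batching and error handling.

import os
from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment; mind rate limits and token caps
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

texts = [
    "To the People of the State of New York",
    "After an unequivocal experience of the inefficiency of the subsisting federal government...",
]

response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
embeddings = [item.embedding for item in response.data]
print(len(embeddings), len(embeddings[0]))  # 2 documents, 1536 dimensions for ada-002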

3️⃣ Postprocessing Embeddings

Located in the postprocess/ directory, this stage includes:

  1. Word2Vec Embedding Generation

    • Code for generating Word2Vec-based embeddings from the Federalist Papers (a gensim sketch follows this list).
    • 📥 Requires downloading the pretrained Google News Word2Vec model (GoogleNews-vectors-negative300).
  2. Classification Models (BART & LASSO)

    • Scripts to train and evaluate classifiers (e.g., BART, LASSO) on various embeddings:
      • Continuous LLM embeddings (e.g., from BERT, GPT)
      • Bag-of-Words (BoW) embeddings (e.g., from LDA, LSA, or NMF)
  3. Benjamini-Hochberg (BH) procedure for selecting words (a statsmodels sketch follows this list)
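
A minimal Word2Vec sketch, assuming the Google News vectors have been downloaded locally (the filename below is the standard distribution name) and using gensim; here each paper is embedded by averaging its in-vocabulary word vectors, which may differ in detail from the repository's code.

import numpy as np
from gensim.models import KeyedVectors

# Assumes the standard Google News binary has been downloaded to this path
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)

def doc_vector(text):
    """Average the Word2Vec vectors of the in-vocabulary tokens."""
    tokens = [t for t in text.lower().split() if t in w2v]
    if not tokens:
        return np.zeros(w2v.vector_size)
    return np.mean([w2v[t] for t in tokens], axis=0)

vec = doc_vector("To the People of the State of New York")
print(vec.shape)  # (300,)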
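
The BH step can be reproduced with statsmodels' multipletests using method="fdr_bh"; a minimal sketch, assuming a vector of per-word p-values has already been computed (the words and p-values below are placeholders):

import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical per-word p-values from some word-level test
words = np.array(["upon", "whilst", "while", "there", "on"])
pvals = np.array([0.0004, 0.003, 0.20, 0.45, 0.01])

# Benjamini-Hochberg controls the false discovery rate at level alpha
reject, pvals_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(list(words[reject]))  # words selected after BH correction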

🧪 Requirements

You’ll need:

  • OpenAI API key (if using GPT embeddings)
  • Hugging Face Token (if using open-source LLMs)
  • Access to GPU for large-scale embedding generation (optional but recommended)

🤖 Credits

This README was generated and refined with the help of ChatGPT.
