A text classification system that combines per-category TF-IDF weighting with GloVe word embeddings for news article categorization. The system reaches 89.97% accuracy on a 4-category news classification task using cross-category weighting and semantic handling of out-of-vocabulary words.
This project implements an advanced text classification pipeline specifically designed for categorizing news articles into four main categories:
- World (Politics/International News) - Category 1
- Sports (Sports News) - Category 2
- Business (Business/Economic News) - Category 3
- Sci/Tech (Science/Technology News) - Category 4
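For reference, the label-to-name mapping as a plain dict (the name `CATEGORY_NAMES` is ours, matching the list above):

```python
# Integer labels as they appear in the CSV files, mapped to category names
CATEGORY_NAMES = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
```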
### Multi-Category TF-IDF Vectorization
- Separate TF-IDF vectorizers for each category (4 vectorizers total)
- Cross-category weight calculation for enhanced discrimination
- Maximum 5,000 features per category for balanced representation
### GloVe Word Embeddings Integration
- 300-dimensional GloVe vectors (glove.6B.300d)
- Contextual word representation for semantic understanding
- Smart missing word handling with semantic similarity
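Loading these vectors into the `embeddings_index` dictionary used throughout this README can be sketched as follows (the loader name is ours; the format is GloVe's standard one-token-per-line layout):

```python
import numpy as np

def load_glove_embeddings(path="glove/glove.6B.300d.txt"):
    """Parse GloVe's text format: each line is a token followed by its vector."""
    embeddings_index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings_index[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings_index
```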
### Advanced Missing Word Management
- Semantic similarity-based word replacement using cosine similarity
- Context-aware vector generation for OOV (Out-of-Vocabulary) words
- KNN-like approach for finding optimal word replacements
- Caching system for efficient processing
### Dimensionality Reduction
- PCA reduction from 1200D to 800D
- Preserves essential information while reducing computational complexity
- Explained variance ratio tracking
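A minimal sketch of this stage with scikit-learn's PCA; `X_1200d` is a placeholder for the concatenated per-category feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

X_1200d = np.random.rand(1000, 1200)  # placeholder for the real 1200-D features

pca = PCA(n_components=800)
X_800d = pca.fit_transform(X_1200d)

# Track how much of the original variance the 800 components preserve
print(f"Explained variance retained: {pca.explained_variance_ratio_.sum():.2%}")
```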
The system calculates discriminative weights by comparing a word's TF-IDF score in the target category against its scores in other categories:
```python
# Statistical normalization with z-score based distinctiveness
z_score = (target_score - mean_other) / std_other
normalized_distinctiveness = max(0, min(2, z_score)) / 2.0
final_weight = target_score * (1.0 + normalized_distinctiveness)
```

Missing word handling combines four techniques:
- Edit Distance Calculation: Levenshtein distance for word similarity
- Semantic Vector Creation: Weighted average of similar words using cosine similarity (see the sketch after this list)
- Contextual Analysis: Window-based context consideration (±2 words)
- Category-Aware Scoring: Category-specific weight assignment
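A minimal sketch of the two similarity measures these steps rely on; the helper names are illustrative, not taken from the script:

```python
import numpy as np

def cosine_similarity_vec(a, b):
    """Cosine similarity between two dense word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def levenshtein(s, t):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]
```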
- Best Accuracy: 89.97%
- Vector Dimensions: 1200D → 800D (PCA)
- Categories: 4 (World, Sports, Business, Sci/Tech)
- Cross-validation: 5-fold CV for robust evaluation
| Model     | CV Score       | Test Acc | Time (s) | Dimension |
|-----------|----------------|----------|----------|-----------|
| LR_PCA    | 0.8924 ± 0.003 | 0.8997   | 45.2     | 800D      |
| LR_NO_PCA | 0.8891 ± 0.004 | 0.8945   | 78.6     | 1200D     |
| SVM_PCA   | 0.8856 ± 0.005 | 0.8923   | 156.7    | 800D      |
Install the dependencies:

```bash
pip install pandas numpy scikit-learn tqdm requests
```

The main script's imports:

```python
import pandas as pd
import numpy as np
import pickle
import re
import os
import time
import warnings
import requests
import zipfile
import io
from tqdm import tqdm
from sklearn.decomposition import PCA
from scipy.sparse import vstack
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import _stop_words
```

Expected dataset layout:

```
archive/
├── train.csv    # Training data (label, title, description)
└── test.csv     # Test data (label, title, description)
```
CSV Format:
- Column 1: Category label (1-4)
- Column 2: Article title
- Column 3: Article description
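A minimal sketch of loading this layout with pandas, assuming the CSVs have no header row (which matches the column-by-position description above):

```python
import pandas as pd

# CSVs with no header row; columns are (label, title, description)
train_df = pd.read_csv("archive/train.csv", header=None,
                       names=["label", "title", "description"])

# Combine title and description into one text field for vectorization
train_texts = (train_df["title"] + " " + train_df["description"]).tolist()
train_labels = train_df["label"].tolist()  # integer categories 1-4
```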
To run the full pipeline:

```bash
python ml_multi_category_tfidf_faster.py
```

The system will automatically:
- Download GloVe embeddings (if not present)
- Analyze missing words in the dataset
- Train multi-category TF-IDF vectorizers
- Build word score cache for optimization
- Train and evaluate multiple models
- Save the best performing model
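The download step can be done along these lines with `requests` and `zipfile` (a sketch; the URL is the standard Stanford NLP mirror, and the script's actual download logic may differ):

```python
import io
import os
import zipfile
import requests

GLOVE_URL = "https://nlp.stanford.edu/data/glove.6B.zip"

def download_glove(glove_dir="glove"):
    """Fetch and unpack the GloVe 6B archive if it is not already present."""
    if os.path.exists(os.path.join(glove_dir, "glove.6B.300d.txt")):
        return  # already downloaded
    os.makedirs(glove_dir, exist_ok=True)
    response = requests.get(GLOVE_URL, timeout=600)
    response.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
        archive.extractall(glove_dir)
```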
Key parameters in the configuration:
```python
# Core Settings
GLOVE_DIM = 300 # GloVe vector dimension
TFIDF_MAX_FEATURES = 5000 # Features per category
FINAL_DIM = 800 # PCA output dimension
cv = 5 # Cross-validation folds
# Data Paths
train_path = "archive/train.csv"
test_path = "archive/test.csv"
models_dir = "models"
glove_dir = "glove"
# Optional Settings
SAMPLE_SIZE = None # For quick testing (set to int for sampling)
MAX_WORDS = 15000  # Maximum vocabulary size
```

Project layout:

```
topic_classifier/
├── ml_multi_category_tfidf_faster.py   # Main classification script
├── README.md                           # This file
├── models/                             # Saved models directory
│   ├── fast_contextual_logistic_regression.pkl
│   ├── fast_contextual_vectorizer.pkl
│   └── ...
├── glove/                              # GloVe embeddings
│   ├── glove.6B.50d.txt
│   ├── glove.6B.100d.txt
│   ├── glove.6B.200d.txt
│   └── glove.6B.300d.txt
└── archive/                            # Dataset
    ├── train.csv
    ├── test.csv
    └── ...
```
Each category has its own TF-IDF vectorizer trained on category-specific text data:
- Separate feature spaces for better category discrimination
- Cross-category weight calculation for enhanced performance
- Stop word removal and text preprocessing
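A minimal sketch of the per-category training idea (variable names are illustrative; the actual class in the script wraps this logic):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def fit_category_vectorizers(texts, labels, max_features=5000):
    """Fit one TF-IDF vectorizer per category on that category's documents."""
    vectorizers = {}
    for category in sorted(set(labels)):
        category_texts = [t for t, y in zip(texts, labels) if y == category]
        vec = TfidfVectorizer(max_features=max_features, stop_words="english")
        vec.fit(category_texts)
        vectorizers[category] = vec
    return vectorizers
```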
The cross-category weight is computed as:

```python
import numpy as np

def calculate_cross_category_weight(token, target_category, category_scores):
    """
    Calculate a discriminative weight for `token`: its TF-IDF score in the
    target category relative to its average score in the other categories.
    """
    target_score = category_scores[target_category]
    other_scores = [score for cat, score in category_scores.items() if cat != target_category]
    avg_other_score = np.mean(other_scores)

    # Distinctiveness: high in target category, low in the others
    distinctiveness = target_score / (avg_other_score + 0.01)
    final_weight = target_score * min(max(distinctiveness, 0.1), 3.0)
    return min(final_weight, 2.0)
```

For example, a token with a TF-IDF score of 0.8 in Sports and roughly 0.1 in each other category has distinctiveness ≈ 7, which is clamped to 3.0, giving min(0.8 × 3.0, 2.0) = 2.0.

For words not found in GloVe embeddings:
```python
def create_semantic_missing_word_vector(word, embeddings_index, context_words, target_category):
    """
    Generate a vector for a missing word using semantic similarity and context.
    """
    # 1. Find semantically similar words using cosine similarity
    similar_words = find_similar_words_semantic(word, embeddings_index, context_words)

    # 2. Calculate context-aware weights
    weights = calculate_contextual_weights(similar_words, context_words, target_category)

    # 3. Generate a weighted average vector
    vectors = [embeddings_index[w] for w in similar_words]
    return np.average(vectors, axis=0, weights=weights)
```

The current system handles out-of-vocabulary words with semantic similarity and edit distance. Planned improvements include:
- Implementation: Analyze k-nearest neighbors in the embedding space for missing words
- Weight Assignment: Calculate category-specific scores based on neighboring words' category performance
- Dynamic Adjustment: Real-time weight updates based on local context density
```python
# Future enhancement concept
def enhanced_missing_word_vector(word, context, category, k=10):
    """
    Generate a missing-word vector using k-nearest neighbors
    with category-specific scoring.
    """
    # Find k nearest semantic neighbors
    neighbors = find_k_nearest_semantic_words(word, embeddings_index, k=k)

    # Calculate category scores for each neighbor
    category_scores = []
    for neighbor in neighbors:
        if neighbor in word_score_cache:
            score = word_score_cache[neighbor].get(category, 0.0)
            category_scores.append(score)
        else:
            category_scores.append(0.1)  # Default score

    # Weight neighbors by category relevance and semantic similarity
    semantic_weights = calculate_semantic_similarities(word, neighbors)
    category_weights = normalize_weights(category_scores)
    final_weights = combine_weights(semantic_weights, category_weights)

    # Generate the weighted vector
    neighbor_vectors = [embeddings_index[neighbor] for neighbor in neighbors]
    return weighted_average(neighbor_vectors, final_weights)
```

- Transformer Integration: Incorporate BERT/RoBERTa for context-dependent representations
- Fine-tuning: Domain-specific fine-tuning on news data
- Dynamic Contextualization: Real-time context adaptation
- Dynamic Weights: Self-adjusting category weights based on classification feedback
- Category Evolution: Learning from misclassifications to improve category boundaries
- Real-time Optimization: Continuous improvement during inference
- Phase 1: K-nearest neighbor enhancement for missing words
- Phase 2: Integration with transformer-based embeddings
- Phase 3: Dynamic weight adaptation system
- Phase 4: Real-time learning capabilities
```python
def enhanced_clean_text(text):
    """Advanced text preprocessing with multiple cleaning steps."""
    text = text.lower()                                        # 1. Lowercase conversion
    text = re.sub(r"http\S+|www\.\S+|\S+@\S+", " ", text)      # 2. URL and email removal
    text = re.sub(r"\d+", " ", text)                           # 3. Number removal
    text = re.sub(r"[^a-z\s]", " ", text)                      # 4. Punctuation cleaning
    words = [w for w in text.split() if 2 <= len(w) <= 20]     # 5. Short/long word filtering (thresholds illustrative)
    words = [re.sub(r"(.)\1{2,}", r"\1\1", w) for w in words]  # 6. Repeated character normalization
    return " ".join(words)                                     # 7. Whitespace normalization
```

The system provides comprehensive analysis of out-of-vocabulary words:
- Missing word ratio: Typically 8-12% of vocabulary
- Categories: Numerical, technical terms, proper nouns, spelling errors
- Handling strategy: Semantic replacement with contextual weighting
Sample missing word analysis output:

```
MISSING WORD ANALYSIS:
  Total words: 45,231
  Missing words (unique): 3,847
  Missing words (total): 5,124
  Missing word ratio: 11.33%

MISSING WORD CATEGORIES:
  Technical Terms: 892 words
  Proper Nouns: 654 words
  Spelling Errors: 432 words
  Very Long (15+ chars): 287 words
  Mixed (letter+number): 156 words
```
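The headline numbers above can be computed along these lines (a sketch, assuming `embeddings_index` maps GloVe tokens to vectors):

```python
def analyze_missing_words(texts, embeddings_index):
    """Count corpus tokens that have no GloVe vector."""
    tokens = [tok for text in texts for tok in text.split()]
    missing = [tok for tok in tokens if tok not in embeddings_index]
    print(f"Total words: {len(tokens):,}")
    print(f"Missing words (unique): {len(set(missing)):,}")
    print(f"Missing words (total): {len(missing):,}")
    print(f"Missing word ratio: {len(missing) / len(tokens):.2%}")
```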
- Caching System: Global word score cache for repeated calculations
- Batch Processing: Memory-efficient processing for large datasets
- Vectorized Operations: NumPy optimizations for matrix operations
- Sparse Matrix Support: Efficient handling of TF-IDF sparse matrices
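The word-score cache can be as simple as a module-level dict wrapping the `calculate_cross_category_weight` function shown earlier (a sketch; the script's actual cache structure may differ):

```python
# Global cache: token -> {category: discriminative weight}
word_score_cache = {}

def cached_cross_category_weight(token, target_category, category_scores):
    """Memoized wrapper so repeated tokens are scored only once per category."""
    cache_entry = word_score_cache.setdefault(token, {})
    if target_category not in cache_entry:
        cache_entry[target_category] = calculate_cross_category_weight(
            token, target_category, category_scores)
    return cache_entry[target_category]
```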
```python
from ml_multi_category_tfidf_faster import MultiCategoryTFIDFGloVeVectorizer

# Load your data
train_texts = ["Sample news article text...", ...]
train_labels = [1, 2, 3, 4, ...]  # Categories 1-4

# Initialize vectorizer
vectorizer = MultiCategoryTFIDFGloVeVectorizer(
    embeddings_index=embeddings_index,
    dim=300,
    max_features_per_category=5000,
    final_dim=800,
    handle_missing_words=True,
)

# Train and transform
X_train = vectorizer.fit_transform(train_texts, train_labels)

# Train classifier
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(max_iter=2500)
classifier.fit(X_train, train_labels)

# Predict new texts
X_test = vectorizer.transform(test_texts)
predictions = classifier.predict(X_test)
```

```python
# Custom configuration for specific use cases
vectorizer = MultiCategoryTFIDFGloVeVectorizer(
    embeddings_index=embeddings_index,
    dim=300,                          # GloVe dimension
    max_features_per_category=7500,   # More features for complex domains
    final_dim=1000,                   # Higher-dimensional output
    handle_missing_words=True,        # Enable smart missing word handling
)
```

Classification report (best model, test set):

```
              precision    recall  f1-score   support

       World       0.91      0.89      0.90      1900
      Sports       0.89      0.92      0.90      1900
    Business       0.88      0.87      0.87      1900
    Sci/Tech       0.91      0.92      0.91      1900

    accuracy                           0.90      7600
   macro avg       0.90      0.90      0.90      7600
weighted avg       0.90      0.90      0.90      7600
```
- Mean Accuracy: 89.24% ± 0.3%
- Consistency: Low standard deviation indicates stable performance
- Robustness: Performs well across different data splits
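The 5-fold figures can be reproduced along these lines (a sketch; `X_train` and `train_labels` as in the usage example above):

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

scores = cross_val_score(LogisticRegression(max_iter=2500),
                         X_train, train_labels, cv=5, scoring="accuracy")
print(f"Mean Accuracy: {scores.mean():.2%} ± {scores.std():.1%}")
```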
This project is part of an academic research initiative. Contributions are welcome for:
- Algorithm improvements
- Missing word handling enhancements
- Performance optimizations
- Documentation improvements
- GloVe: Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation. EMNLP 2014.
- TF-IDF: Spärck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11-21.
- Cross-Category Analysis: Custom methodology developed for multi-category text classification in this project.
This project is developed for academic purposes as part of YAP470 coursework.
Note: This implementation combines traditional TF-IDF weighting with modern word embeddings and semantic missing-word handling, which together account for its strong performance on news categorization.