
Commit 43b588d

committed
Added 2 Text Summarizer
1 parent 34eb768 commit 43b588d

File tree

12 files changed: +424 -0 lines changed

Python/Text_Summary/Lex_Rank/README.md

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
# Lex_Rank

Lex Rank approach for text summarization.

## Dependencies

- sumy
- spacy
- neologdn
  * _This requires a C++11 compiler_. Click [here](https://pypi.org/project/neologdn/) for documentation and [here](https://nuwen.net/mingw.html#install) for the C++11 compiler I use.

## Language models

- `en_core_web_sm`: A spaCy English multi-task CNN trained on OntoNotes.
- `punkt`: The NLTK Punkt sentence tokenizer.

## Setup

- Set up a `python 3.x` virtual environment.
- `Activate` the environment.
- Install the dependencies using ```pip3 install -r requirements.txt```
  * Install a C++ compiler if `neologdn` is triggering `wheel` errors.
- Set up the models by running the following commands,

```bash
$ python -m spacy download en_core_web_sm
$ python -c "import nltk; nltk.download('punkt')"
```

- Run the `main.py` file.
- Enter the source path.
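To verify the setup above, a short check like the one below (illustrative, not part of the repository) confirms that both models are installed:

```python
# Sanity check (illustrative, not part of the repository): confirm the models from Setup are installed.
import spacy
import nltk

spacy.load("en_core_web_sm")          # raises OSError if the spaCy model is missing
nltk.data.find("tokenizers/punkt")    # raises LookupError if the punkt tokenizer is missing
print("en_core_web_sm and punkt are ready.")
```
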
## Results

Results can be found [here](../assets).

Python/Text_Summary/Lex_Rank/main.py

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
#!/usr/bin/env python
# coding: utf-8

import os

# Import from summary_make.py
from summary_make import summarize_sentences


def main():
    """
    Main function, wrapper around summary_make
    """
    filepath = input("Enter the Source File: ")
    with open(filepath, encoding='utf-8') as f:
        sentences = f.readlines()
        sentences = ' '.join(sentences)

    summary = summarize_sentences(sentences)

    # Write the summary next to the source file, e.g. article.txt -> article_lexRank.txt
    base, _ = os.path.splitext(filepath)
    outputpath = base + '_lexRank.txt'

    with open(outputpath, 'w') as w:
        for sentence in summary:
            w.write(str(sentence) + '\n')


if __name__ == "__main__":
    main()
Python/Text_Summary/Lex_Rank/preprocessing.py

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
#!/usr/bin/env python
# coding: utf-8


# Import
import spacy
import neologdn


class EnglishCorpus:
    """
    A class for retaining the structure of a text file as a corpus.

    ...

    Methods:
        preprocessing(text: str)
            Removes line breaks and normalizes special characters
        make_sentence_list(sentences: str)
            Breaks the text into a list of sentences using NLP
        make_corpus()
            Generates the corpus, preserving the morphological analysis
    """

    # Preparation of morphological analyzer
    def __init__(self):
        """
        Constructor to initialize the spaCy English model (see README)
        """
        self.nlp = spacy.load("en_core_web_sm")

    # Pre-processing of line breaks and special characters
    def preprocessing(self, text: str) -> str:
        """
        Removes line breaks and normalizes special characters.
        :param text: String of text to be processed
        :return: Text without line breaks and with normalized characters
        """
        text = text.replace("\n", "")
        text = neologdn.normalize(text)

        return text

    # Divide the text into sentences while retaining the results of morphological analysis
    def make_sentence_list(self, sentences: str) -> list:
        """
        Divides the text into a list of sentences while retaining the
        morphological analysis, using spaCy NLP.
        :param sentences: Text to be split into sentences
        :return: List of sentences
        """
        doc = self.nlp(sentences)
        self.ginza_sents_object = doc.sents
        sentence_list = [s for s in doc.sents]

        return sentence_list

    # Put a space between words
    def make_corpus(self) -> list:
        """
        Puts white space between words and generates the corpus.
        :return: Corpus for tokenizing
        """
        corpus = []
        for s in self.ginza_sents_object:
            tokens = [str(t) for t in s]
            corpus.append(" ".join(tokens))

        return corpus
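To show how the three methods above chain together, here is a minimal usage sketch (illustrative, not part of the commit; the sample text is made up and `en_core_web_sm` is assumed to be installed, see the README):

```python
# Minimal usage sketch of EnglishCorpus (illustrative, not part of the commit).
from preprocessing import EnglishCorpus

text = "LexRank is an extractive summarizer.\nIt ranks sentences by their similarity to each other."

corpus_maker = EnglishCorpus()
cleaned = corpus_maker.preprocessing(text)            # remove line breaks, normalize characters
sentences = corpus_maker.make_sentence_list(cleaned)  # spaCy sentence segmentation
corpus = corpus_maker.make_corpus()                   # one space-separated token string per sentence
print(corpus)
```
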
Python/Text_Summary/Lex_Rank/requirements.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
sumy==0.8.1
spacy==2.3.2
neologdn==0.4
Python/Text_Summary/Lex_Rank/summary_make.py

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
from preprocessing import EnglishCorpus

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.utils import get_stop_words
from sumy.summarizers.lex_rank import LexRankSummarizer


def summarize_sentences(sentences: str, language="english") -> list:
    """
    Prepares the summary of the sentences.
    Calls preprocessing to generate a list of processed sentences.
    Uses LexRank summarization to prepare the summary.
    :param sentences: Sentences from the text file
    :param language: Language used, default=English
    :return: Summary of the source file
    """
    # Prepare the sentences
    corpus_maker = EnglishCorpus()
    preprocessed_sentences = corpus_maker.preprocessing(sentences)
    # make_sentence_list also stores the spaCy sentences on corpus_maker for make_corpus()
    preprocessed_sentence_list = corpus_maker.make_sentence_list(preprocessed_sentences)
    corpus = corpus_maker.make_corpus()
    parser = PlaintextParser.from_string(" ".join(corpus), Tokenizer(language))

    # LexRank ranks sentences by graph-based similarity scoring
    summarizer = LexRankSummarizer()

    # Stop words are words which do not affect the context of the text
    summarizer.stop_words = get_stop_words(language)

    # Limit the summary to one-fifth of the article (see README)
    summary = summarizer(document=parser.document, sentences_count=len(corpus) * 2 // 10)

    return summary
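A minimal direct call, for reference (illustrative only; `article.txt` is a placeholder file name). Since `sentences_count = len(corpus) * 2 // 10`, inputs of fewer than five sentences produce an empty summary:

```python
# Illustrative call to summarize_sentences (not part of the commit).
from summary_make import summarize_sentences

with open("article.txt", encoding="utf-8") as f:   # placeholder file name
    text = f.read()

# The summary is roughly one-fifth of the source, so very short inputs return nothing.
for sentence in summarize_sentences(text):
    print(sentence)
```
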

Python/Text_Summary/README.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# Text_Summary

[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)

Text summarization is an advanced project that comes under the umbrella of Natural Language Processing.
There are multiple methods people use in order to summarize text.

They can be effectively grouped under two methods:

- Abstractive: Understand the true context of the text before summarization (like a human).
- Extractive: Rank the text within the file and identify the impactful terms.

While both of these approaches are under research, extractive summarization is presently used across multiple platforms.
There are also multiple methods by which text is summarized under the extractive approach.

In this project we use the two important approaches, __Lex Rank__ & __Text Rank__, and discuss their pros and cons.
Click
[here](https://en.wikipedia.org/wiki/Automatic_summarization#:~:text=The%20edges%20between%20sentences%20are,by%20the%20sentences'%20lengths)
for more info.

Both scripts use pretrained models and datasets from Natural Language Processing libraries.

## Structure

- [Lex Rank](Lex_Rank) contains the necessary files for the Lex Ranking approach.
- [Text Rank](Text_Rank) contains the necessary files for the Text Ranking approach.
- [Assets](assets) contains the text files.
## Instructions

A detailed set of instructions can be found in the respective directories.

## Author(s)

Made by [Vybhav Chaturvedi](https://www.linkedin.com/in/vybhav-chaturvedi-0ba82614a/)
## Setup instructions

- Set up a `python 3.x` virtual environment.
- `Activate` the environment.
- Install the dependencies using ```pip3 install -r requirements.txt```
- You are all set, and the scripts in [Lex Rank](Lex_Rank) and [Text Rank](Text_Rank) are ready to run.
- Carefully follow the instructions.
## Further Readings

The [Wikipedia article on automatic summarization](https://en.wikipedia.org/wiki/Automatic_summarization) gives more background on the graph-based sentence ranking that both approaches rely on.
## Usage

Run the script for the chosen approach (`main.py` for Lex Rank, `text_summary.py` for Text Rank) and follow the prompts, comments and guidelines.

```
Sample -
Enter the Source File: <path to the source text file>
```
## Output

Output of the summarizer

![Output](assets/Output.PNG)

Source text

![Source text](assets/Sample.PNG)

Generated summary

![Summary](assets/TextFile.PNG)

Python/Text_Summary/Text_Rank/README.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
# Text_Rank

Text Rank approach for text summarization.

## Dependencies

- nltk
- numpy
- networkx

## NLTK models

- `stopwords`: Stop words are English words which do not add much meaning to a sentence.

## Setup

- Set up a `python 3.x` virtual environment.
- `Activate` the environment.
- Install the dependencies using ```pip3 install -r requirements.txt```
- Set up the models by running the following commands,

```bash
$ python -m nltk.downloader stopwords
```

- Run the `text_summary.py` file.
- Enter the source path.

## Results

The code generates the token weights for the set of words, which show the relative importance of each word according to
the summarizer; to see them, just uncomment line 112 (_l112_) in `text_summary.py`.

Results can be found [here](../assets).
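To illustrate the Text Rank idea described above, the following is a minimal sketch built only from the dependencies listed in this README (nltk, numpy, networkx). It is not the repository's `text_summary.py`, just an approximation of the approach; the naive sentence splitting is an assumption made to keep it short:

```python
# A minimal sketch of the Text Rank idea (not the repository's text_summary.py),
# using only the dependencies listed above: nltk, numpy and networkx.
import numpy as np
import networkx as nx
from nltk.corpus import stopwords


def textrank_summary(text: str, num_sentences: int = 3) -> list:
    stop_words = set(stopwords.words("english"))
    # Naive sentence splitting, just for the sketch.
    sentences = [s.strip() for s in text.split(".") if s.strip()]

    # Bag-of-words vector per sentence, ignoring stop words.
    vocab = sorted({w.lower() for s in sentences for w in s.split()
                    if w.lower() not in stop_words})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = np.zeros((len(sentences), len(vocab)))
    for i, s in enumerate(sentences):
        for w in s.split():
            if w.lower() in index:
                vectors[i, index[w.lower()]] += 1

    # Cosine similarity between every pair of sentences becomes the edge weight.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-9
    similarity = (vectors / norms) @ (vectors / norms).T
    np.fill_diagonal(similarity, 0)

    # PageRank over the sentence-similarity graph; top-ranked sentences form the summary.
    scores = nx.pagerank(nx.from_numpy_array(similarity))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]
```
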
Python/Text_Summary/Text_Rank/requirements.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
nltk==3.2.4
numpy==1.19.5
networkx==2.5

0 commit comments
