Using Python to summarize articles

Earlier this year I got an article published in Acta Borealia.

The paper, The Sami cooperative herding group: the siida system from past to present, is open access.

I usually publish a short summary on this blog, but I’ve recently been learning to analyze text with Python, so I thought I’d try using it to summarize my own paper.

The result? Have a look (I’ve only removed citations and reorganized the sentences for flow):

Background

The Sami – both pastoralists and hunters – in Norway had a larger unit than the family, i.e., the siida.

History

Historically, it has been characterized as a relatively small group based on kinship.

The siida could refer to both the territory, its resources and the people that use it.

The core institutions are the baiki (household) and the siida (band).

Names of siidas were, in other words, local.

Moreover, it was informally led by a wealthy and skillful person whose authority was primarily related to herding.

One of these groups’ critical aspects is that they are dynamic: composition and size change according to the season, and members are free to join and leave groups as they see fit.

Results

Only two herders reported to have changed summer and winter siida since 2000.

Furthermore, while the siida continues to be family-based, leadership is becoming more formal.

Nevertheless, decision-making continues to be influenced by concerns of equality.

Code

The code is shown below. It’s a bit light on comments, but it should work for other documents too. I loaded the text from a docx file.

Imports

import warnings

import docx  # from the python-docx package
import networkx as nx
import nltk
import numpy as np
import textacy # have installed spacy==3.0 and textacy==0.11
import textacy.preprocessing as tprep
from sklearn.feature_extraction.text import TfidfVectorizer

warnings.filterwarnings("ignore")

stop_words = nltk.corpus.stopwords.words('english')
wtk = nltk.tokenize.RegexpTokenizer(r'\w+')
wnl = nltk.stem.wordnet.WordNetLemmatizer()

Loading file

doc = docx.Document('your_filename.docx')

Functions for processing and summarizing text

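def extract_text_doc(doc):
    # Skip empty paragraphs
    paras = [p.text for p in doc.paragraphs if p.text]
    # Keep only paragraphs that contain a period, which drops most headings and captions
    revised_paras = [p for p in paras if len(p.split('.')) > 1]
    text = " ".join(revised_paras)
    return text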
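
The filter in extract_text_doc is a rough heuristic: a paragraph only counts as body text if it contains at least one period. Here is a small sketch of what that keeps and drops. The paragraphs are made up, and plain strings stand in for doc.paragraphs:

```python
# Made-up paragraphs standing in for [p.text for p in doc.paragraphs]
paras = [
    "Introduction",                                        # heading, no period
    "The siida is a herding group. It is based on kinship.",
    "",                                                    # empty paragraph
    "Table 1",                                             # caption, no period
    "Names of siidas were local.",
]

# Same two steps as extract_text_doc: drop empty paragraphs,
# then keep only those that contain at least one period
non_empty = [p for p in paras if p]
kept = [p for p in non_empty if len(p.split(".")) > 1]

text = " ".join(kept)
print(text)
```

Note that a numbered heading such as "2.1 Methods" contains a period and would slip through, so the heuristic is only approximate.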

def normalize_document(paper):
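    '''
    Lowercases and tokenizes a sentence, lemmatizes the tokens, and drops
    numbers, tokens shorter than three characters and stop words.
    String as input, returns the cleaned tokens joined into a string
    '''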
    paper = paper.lower()
    paper_tokens = [token.strip() for token in wtk.tokenize(paper)]
    paper_tokens = [wnl.lemmatize(token) for token in paper_tokens if not token.isnumeric()]
    paper_tokens = [token for token in paper_tokens if len(token) >2]
    paper_tokens = [token for token in paper_tokens if token not in stop_words]

    doc = ' '.join(paper_tokens)
    return doc

def normalize(text):
    '''
    Normalizes text, string as input, returns normalized string
    '''
    text = tprep.normalize.hyphenated_words(text)
    text = tprep.normalize.quotation_marks(text)
    text = tprep.normalize.unicode(text)
    text = tprep.remove.accents(text)
    return text

def text_rank_summarizer(norm_sentences, original_sentences, num_sent):
    # Never ask for more sentences than the document has
    num_sent = min(num_sent, len(original_sentences))

    # One TF-IDF vector per sentence. max_df=1.0 (a float) keeps every term;
    # the integer 1 would drop any term shared by two sentences
    tv_p = TfidfVectorizer(min_df=1, max_df=1.0, ngram_range=(1,1), use_idf=True)
    dt_matrix = tv_p.fit_transform(norm_sentences)
    dt_matrix = dt_matrix.toarray()

    # Sentence-by-sentence similarity: dot products of the TF-IDF vectors
    similarity_matrix = np.matmul(dt_matrix, dt_matrix.T)

    # Build a graph from the similarities and score each sentence with PageRank
    similarity_graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(similarity_graph)

    # Highest-scoring sentences first
    ranked_sentences = sorted(((score,index) for index, score in scores.items()), reverse=True)

    # Take the top num_sent sentences and put them back in document order
    top_sentence_indices = [ranked_sentences[index][1] for index in range(num_sent)]
    top_sentence_indices.sort()
    summary = "\n".join(np.array(original_sentences)[top_sentence_indices])
    return summary
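
The ranking step in text_rank_summarizer hands the similarity matrix to nx.pagerank. To get a feel for what that scoring does, here is a simplified power-iteration sketch in plain Python. It is an illustration only, not networkx's actual implementation, and the function name pagerank_scores and the similarity values are invented for the example:

```python
def pagerank_scores(sim, damping=0.85, iterations=100):
    """Simplified weighted PageRank by power iteration (illustration only)."""
    n = len(sim)
    # Total edge weight leaving each sentence
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Score flowing into sentence j from every sentence i,
            # in proportion to how similar i is to j
            incoming = sum(
                scores[i] * sim[i][j] / out_weight[i]
                for i in range(n)
                if out_weight[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


# Invented similarities for four sentences (symmetric, like dt_matrix times its transpose)
sim = [
    [1.0, 0.8, 0.7, 0.3],
    [0.8, 1.0, 0.2, 0.0],
    [0.7, 0.2, 1.0, 0.0],
    [0.3, 0.0, 0.0, 1.0],
]

scores = pagerank_scores(sim)
# Sentence indices from highest to lowest score
ranked = sorted(range(len(sim)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

Sentence 0, which is similar to all the others, comes out on top, while sentence 3, which is only weakly tied to one other sentence, ranks last. That is the intuition behind the summary: sentences that resemble many other sentences are treated as the most representative.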

Processing text

normalize_corpus = np.vectorize(normalize_document)
text = extract_text_doc(doc)
text1 = normalize(text)
sentences = nltk.sent_tokenize(text1)
norm_sentences = normalize_corpus(sentences)
summary = text_rank_summarizer(norm_sentences, sentences, 10)
print(summary)

Historically, it has been characterized as a relatively small group based on kinship.
Moreover, it was informally led by a wealthy and skilful person whose authority was primarily related to herding.
Only two herders reported to have changed summer and winter siida since 2000.
Furthermore, while the siida continues to be family-based, leadership is becoming more formal.
Nevertheless, decision-making continues to be influenced by concerns of equality.
One of these groups' critical aspects is that they are dynamic: composition and size changes according to the season, and members are free to join and leave groups as they see fit.
Lowie (1945) writes that the Sami – both pastoralists and hunters – in Norway had a larger unit than the family, i.e., the siida.
Names of siidas were, in other words, local.
The siida could refer to both the territory, its resources and the people that use it (see also Riseth 2000, 120).
The core institutions are the baiki (household) and the siida (band).

For more options and better ways of summarizing text, check out https://pypi.org/project/sumy/ with its many summarizer classes. For example, TextRankSummarizer is similar to (but much better than) the approach taken here.

Like the code above, it treats the relationship between sentences as a graph: each sentence is a vertex, and every vertex is linked to every other vertex. But rather than the TF-IDF dot products used here, it measures the similarity between sentences with Jaccard similarity.
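
For reference, Jaccard similarity is the number of words two sentences share divided by the number of distinct words across both. The snippet below is a minimal sketch of the measure itself, not sumy's code; the function name and the naive whitespace tokenization are assumptions made for the example:

```python
def jaccard_similarity(sentence_a, sentence_b):
    """Shared distinct words divided by all distinct words in the two sentences."""
    # Naive tokenization: lowercase and split on whitespace (an assumption)
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


a = "the siida is a herding group"
b = "the siida is a kinship group"
c = "leadership is becoming more formal"

print(jaccard_similarity(a, b))  # 5 shared words out of 7 distinct words
print(jaccard_similarity(a, c))  # 1 shared word out of 10 distinct words
```

Unlike TF-IDF, this gives every word the same weight, so in practice stop words would usually be removed first.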

The pursuit of populations collapses: long-term dynamics of semi-domestic reindeer in Sweden

In Scandinavia there is a growing concern that the reindeer husbandry is in a state of crisis, but results from our recent study indicate that the Swedish reindeer husbandry is in fact in better condition now compared to the past (1945-1965).

By Bård-Jørgen Bårdsen & Marius Warg Næss

New research paper about cooperation in groups of Saami reindeer herders

The Tangled Woof of Fact

People rely on one another in fundamental ways, but cooperation in groups can be fragile. Every day, we face tensions between acting in a socially responsible manner and following our own self-interest. These situations are called social dilemmas and they come in varying shades of subtlety, from littering and eBay to overpopulation and climate change. Overcoming these dilemmas can make all the difference, especially for marginalised groups such as pastoralists – people who make their living from herding animals.

Pastoralists use about a quarter of the world’s land for grazing their herds. Nowadays, all over the world, governments are privatising many of their pastures, and so herders must work together in increasingly fragmented places.

We wanted to learn how groups of Saami reindeer herders living in Norway’s Arctic Circle worked together. Our study, just published in the journal Human Ecology, found that cooperation pivoted around the ‘siida’: a…

Reindeer Husbandry in a Globalizing North – resilience, adaptations and pathways for Actions (ReiGN)

It’s the time of the year when we eagerly await the results from the year’s (many) research proposals.

Workshop in Tromsø February 18

In connection with the project “The Erosion of Cooperative Networks and the Evolution of Social Hierarchies: A Comparative Approach” and NIKU’s 20th anniversary, a workshop will be arranged on Wednesday 18th of February in Tromsø, Norway.

Time: Wednesday February 18 12:30-16:00

HIERARCHIES: New research project from the Research Council of Norway

Last week I got the news that I had been awarded a 4-year research grant by the Research Council of Norway.

What’s killing the reindeer?

Predatory species compete with humans for the use of resources such as livestock. An important tool for managing possible conflicts is damage compensation schemes, which distribute the costs between those who benefit from conservation and those who suffer the costs of damage.

Risk-sensitive reproductive allocation: fitness consequences of body mass losses in two contrasting environments

Just got a paper published in Ecology and Evolution. It is basically about reindeer life history and risk sensitivity.

Why Herd Size Matters – Mitigating the Effects of Livestock Crashes

Just got a paper published in PLOS ONE. Basically, it provides the rationale for why it pays off for pastoralists to keep large herds of livestock.

Reindeer herders’ objectives may differ from official assumptions

A number of explanations have been raised in the literature as to why pastoralists keep large herds of animals: from the “East African cattle complex”, where the prestigious aspect of having large herds was given weight, to nomadic pastoralists seeking reliable food intake and valuing long-term household survival. Importantly, however, large herds have been argued and shown to buffer environmental risks, as in the reindeer husbandry, where herders with comparably larger herds one year also had comparably larger herds the next.