Using Python to summarize articles

Earlier this year I got an article published in Acta Borealia.

The paper, The Sami cooperative herding group: the siida system from past to present, is open access.

I usually publish a short summary on this blog, but recently I’ve been learning to analyze text using Python, so I thought I would try using it to summarize my own paper.

The result? Have a look (I’ve only removed citations and reorganized the sentences for flow):

Background

The Sami – both pastoralists and hunters – in Norway had a larger unit than the family, i.e., the siida.

History

Historically, it has been characterized as a relatively small group based on kinship.

The siida could refer to both the territory, its resources and the people that use it.

The core institutions are the baiki (household) and the siida (band).

Names of siidas were, in other words, local.

Moreover, it was informally led by a wealthy and skillful person whose authority was primarily related to herding.

One of these groups’ critical aspects is that they are dynamic: composition and size change according to the season, and members are free to join and leave groups as they see fit.

Results

Only two herders reported to have changed summer and winter siida since 2000.

Furthermore, while the siida continues to be family-based, leadership is becoming more formal.

Nevertheless, decision-making continues to be influenced by concerns of equality.

Code

The code is shown below. It lacks a bit in comments, but should work for other documents. I’ve loaded the text from a docx file.

Imports

import warnings

import docx # provided by the python-docx package
import networkx as nx
import nltk
import numpy as np
import textacy.preprocessing as tprep # have installed spacy==3.0 and textacy==0.11
from sklearn.feature_extraction.text import TfidfVectorizer

warnings.filterwarnings("ignore")

# Needs the NLTK data packages 'stopwords', 'wordnet' and 'punkt' (see nltk.download)
stop_words = nltk.corpus.stopwords.words('english')
wtk = nltk.tokenize.RegexpTokenizer(r'\w+')
wnl = nltk.stem.wordnet.WordNetLemmatizer()

Loading file

doc = docx.Document('your_filename.docx')

Functions for processing and summarizing text

def extract_text_doc(doc):
    paras = [p.text for p in doc.paragraphs if p.text]
    revised_paras = [p for p in paras if len(p.split('.')) >1]
    text = " ".join(revised_paras)
    return text
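
To see what the paragraph filter in extract_text_doc keeps and drops, here is a minimal sketch that applies the same rule to plain strings instead of a docx file. The function name keep_sentence_paragraphs and the sample paragraphs are made up for illustration.

```python
def keep_sentence_paragraphs(paragraphs):
    """Keep non-empty paragraphs that contain at least one period."""
    non_empty = [p for p in paragraphs if p]
    return [p for p in non_empty if len(p.split('.')) > 1]


# Hypothetical paragraphs, standing in for doc.paragraphs
sample = [
    "Introduction",
    "",
    "The siida is a herding group. It is based on kinship.",
    "Figure 2.1",
    "Results",
]

kept = keep_sentence_paragraphs(sample)
for paragraph in kept:
    print(paragraph)
```

Headings without a period are dropped, but so is any body paragraph that happens to lack one, and a caption such as "Figure 2.1" slips through because it contains a period. For a paper with many captions or numbered headings, a stricter rule may be worth trying.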

def normalize_document(paper):
    '''
    Lowercase, tokenize and lemmatize a string; drop numbers, short tokens and stop words
    '''
    paper = paper.lower()
    paper_tokens = [token.strip() for token in wtk.tokenize(paper)]
    paper_tokens = [wnl.lemmatize(token) for token in paper_tokens if not token.isnumeric()]
    paper_tokens = [token for token in paper_tokens if len(token) >2]
    paper_tokens = [token for token in paper_tokens if token not in stop_words]

    doc = ' '.join(paper_tokens)
    return doc

def normalize(text):
    '''
    Normalizes text, string as input, returns normalized string
    '''
    text = tprep.normalize.hyphenated_words(text)
    text = tprep.normalize.quotation_marks(text)
    text = tprep.normalize.unicode(text)
    text = tprep.remove.accents(text)
    return text

def text_rank_summarizer(norm_sentences, original_sentences, num_sent):
    # Never ask for more sentences than the document contains
    num_sent = min(num_sent, len(original_sentences))

    # One TF-IDF vector per sentence. max_df=1.0 is a proportion (keep all terms);
    # an integer 1 would drop every term shared by two or more sentences
    tv_p = TfidfVectorizer(min_df=1, max_df=1.0, ngram_range=(1,1), use_idf=True)
    dt_matrix = tv_p.fit_transform(norm_sentences)
    dt_matrix = dt_matrix.toarray()

    # Rows are L2-normalized by default, so the dot products are cosine similarities
    similarity_matrix = np.matmul(dt_matrix, dt_matrix.T)

    similarity_graph = nx.from_numpy_array(similarity_matrix)

    scores = nx.pagerank(similarity_graph)

    # Highest PageRank score first
    ranked_sentences = sorted(((score,index) for index, score in scores.items()), reverse=True)

    # Take the top sentences and put them back in document order
    top_sentence_indices = [ranked_sentences[index][1] for index in range(num_sent)]
    top_sentence_indices.sort()
    summary = "\n".join(np.array(original_sentences)[top_sentence_indices])
    return summary
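
To make the ranking step less of a black box, here is a minimal pure-Python sketch of what a PageRank-style power iteration does with a sentence similarity matrix. It is not the networkx implementation: the function name pagerank_scores, the stopping rule and the small similarity matrix are assumptions for illustration, and the damping factor of 0.85 simply mirrors the default alpha in nx.pagerank.

```python
def pagerank_scores(weights, damping=0.85, max_iter=100, tol=1e-10):
    """Score nodes of a weighted graph with a PageRank-style power iteration."""
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(max_iter):
        new_scores = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if row_sums[i] > 0:
                    # Sentence i passes on its score in proportion to edge weight
                    incoming += scores[i] * weights[i][j] / row_sums[i]
                else:
                    # A sentence with no links spreads its score evenly
                    incoming += scores[i] / n
            new_scores.append((1 - damping) / n + damping * incoming)
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores


# Hypothetical similarity matrix for four sentences (symmetric, zero diagonal)
similarity = [
    [0.0, 0.6, 0.5, 0.0],
    [0.6, 0.0, 0.4, 0.1],
    [0.5, 0.4, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],
]

scores = pagerank_scores(similarity)
# Rank sentence indices from highest to lowest score
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
# Keep the two best sentences and restore document order
top_two = sorted(ranked[:2])
print(ranked, top_two)
```

Sentences that are similar to many other sentences collect the most score, while the weakly connected fourth sentence ends up last. That is the intuition behind using the top-ranked sentences as a summary.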

Processing text

normalize_corpus = np.vectorize(normalize_document)
text = extract_text_doc(doc)
text1 = normalize(text)
sentences = nltk.sent_tokenize(text1)
norm_sentences = normalize_corpus(sentences)
summary = text_rank_summarizer(norm_sentences, sentences, 10)
print(summary)

Historically, it has been characterized as a relatively small group based on kinship.
Moreover, it was informally led by a wealthy and skilful person whose authority was primarily related to herding.
Only two herders reported to have changed summer and winter siida since 2000.
Furthermore, while the siida continues to be family-based, leadership is becoming more formal.
Nevertheless, decision-making continues to be influenced by concerns of equality.
One of these groups' critical aspects is that they are dynamic: composition and size changes according to the season, and members are free to join and leave groups as they see fit.
Lowie (1945) writes that the Sami – both pastoralists and hunters – in Norway had a larger unit than the family, i.e., the siida.
Names of siidas were, in other words, local.
The siida could refer to both the territory, its resources and the people that use it (see also Riseth 2000, 120).
The core institutions are the baiki (household) and the siida (band).

For more options and better ways of summarizing text, check out https://pypi.org/project/sumy/ with its many summarizer classes. For example, TextRankSummarizer is similar to (but way better than) the approach taken here.

Like the code above, it conceptualizes the relationship between sentences as a graph: each sentence is a vertex, and each vertex is linked to every other vertex. But rather than weighting the links by TF-IDF similarity, as done here before running PageRank from networkx, it uses Jaccard similarity.
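
As a rough illustration of Jaccard similarity between two sentences, here is a minimal sketch that assumes plain lowercasing and whitespace tokenization. The function name and example sentences are made up, and sumy's own tokenization and weighting may well differ.

```python
def jaccard_similarity(sentence_a, sentence_b):
    """Share of distinct words that two sentences have in common."""
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    union = words_a | words_b
    if not union:
        # Two empty sentences have nothing in common
        return 0.0
    return len(words_a & words_b) / len(union)


# Hypothetical example sentences
first = "the siida is a herding group"
second = "the siida is based on kinship"

similarity_score = jaccard_similarity(first, second)
print(round(similarity_score, 3))
```

The two sentences share three of nine distinct words, so the score is one third. Unlike TF-IDF, every word counts the same here, which is why stop words are usually removed first.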


Cultural group selection and the evolution of reindeer herding in Norway

The debate about reindeer husbandry in Norway is characterised by two contrasting views.

On one hand is the prevailing view of overstocking and rangeland degradation.

On the other hand is the view that the notion of overstocking and overuse represents a misreading of the Arctic landscape that perpetuates a dominant crisis narrative that functions as “… an enduring ‘social fact’, whose narrative reality is in large part decoupled from its supposed scientific basis” (Benjaminsen et al. 2015:228).

While the overstocking perspective is based on a presumed ‘Tragedy of the Commons’, the other perspective argues that reindeer herding is a non-equilibrium system

“…where herbivore populations fluctuate randomly according to external influences, [and] the concepts of carrying capacity and overgrazing have no discernible meaning” (ibid.:223).

In my new paper, Cultural Group Selection and the Evolution of Reindeer Herding in Norway, I argue differently.

Through a comparative historical analysis, I argue that herding is better viewed as an assurance game with two different strategies for minimising risk:

  1. maximising quantity (i.e., increasing livestock numbers or herd size)
  2. maximising livestock quality (i.e., increasing livestock body mass)

I demonstrate that intra-group competition has led to the adoption of (1) in the Northern parts of Norway, while inter-group competition has led to the adoption of (2) in the Southern parts.

Read the full paper here.

The pursuit of populations collapses: long-term dynamics of semi-domestic reindeer in Sweden

In Scandinavia there is a growing concern that reindeer husbandry is in a state of crisis, but results from our recent study indicate that Swedish reindeer husbandry is in fact in better condition now than in the past (1945-1965).

By Bård-Jørgen Bårdsen & Marius Warg Næss


Market economy vs. risk management: how do nomadic pastoralists respond to increasing meat prices?

Just got a paper published in Human Ecology that looks at the old question of what exactly motivates nomadic pastoralists.

Risk-sensitive reproductive allocation: fitness consequences of body mass losses in two contrasting environments

Just got a paper published in Ecology and Evolution. It is basically about reindeer life history and risk sensitivity.

Why Herd Size Matters – Mitigating the Effects of Livestock Crashes


Just got a paper published in PLOS ONE. Basically, it provides the rationale for why it pays off for pastoralists to keep large herds of livestock.

Climate Change, Risk Management and the End of Nomadic Pastoralism

Tibet

While not a particularly good quality map, it at least shows the area my latest publication pertains to (Aru Basin). The paper is published in the International Journal of Sustainable Development & World Ecology.

The topic of the paper is mobility, a classic pastoral strategy for dealing with environmental variation. Mobility is used to manage resource variability, for example, during droughts, when pastoralists have moved from affected areas to unaffected (or less affected) areas.

My latest publication

A bit earlier this year I got a paper published in Evolution and Human Behavior. In general, the paper investigates how pastoral slaughter strategies are shaped in the reindeer husbandry in Norway.

From a governmental point of view, reindeer husbandry is characterised by overstocking of reindeer (especially in the northern parts of the country). As a consequence, the Norwegian government has initiated a subsidy policy aiming to stimulate households to slaughter as many reindeer as possible, so as to reduce the number of reindeer and thereby create a sustainable reindeer husbandry. Nevertheless, in spite of this subsidy policy, the number of reindeer has increased rather than decreased. This indicates that reindeer herders do not make slaughter-related decisions from a purely economic point of view.
