Using Python to summarize articles

Earlier this year I got an article published in Acta Borealia.

The paper, The Sami cooperative herding group: the siida system from past to present, is open access.

I usually publish a short summary on this blog, but I’ve recently been learning to analyze text with Python, so I thought I would use it to help summarize my own paper.

The result? Have a look (I’ve only removed citations and reorganized the sentences for flow):

Background

The Sami – both pastoralists and hunters – in Norway had a larger unit than the family, i.e., the siida.

History

Historically, it has been characterized as a relatively small group based on kinship.

The siida could refer to both the territory, its resources and the people that use it.

The core institutions are the baiki (household) and the siida (band).

Names of siidas were, in other words, local.

Moreover, it was informally led by a wealthy and skillful person whose authority was primarily related to herding.

One of these groups’ critical aspects is that they are dynamic: composition and size change according to the season, and members are free to join and leave groups as they see fit.

Results

Only two herders reported to have changed summer and winter siida since 2000.

Furthermore, while the siida continues to be family-based, leadership is becoming more formal.

Nevertheless, decision-making continues to be influenced by concerns of equality.

Code

The code is shown below. It is light on comments but should work for other documents too. I’ve loaded the text from a docx file.

Imports

import os
import re
import sys
import warnings

import docx
import networkx as nx
import nltk
import numpy as np
import textacy # have installed spacy==3.0 and textacy==0.11
import textacy.preprocessing as tprep
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

warnings.filterwarnings("ignore")

stop_words = nltk.corpus.stopwords.words('english')
wtk = nltk.tokenize.RegexpTokenizer(r'\w+')
wnl = nltk.stem.wordnet.WordNetLemmatizer()

Loading file

doc = docx.Document('your_filename.docx')

Functions for processing and summarizing text

def extract_text_doc(doc):
    paras = [p.text for p in doc.paragraphs if p.text]
    revised_paras = [p for p in paras if len(p.split('.')) >1]
    text = " ".join(revised_paras)
    return text

def normalize_document(paper):
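    '''
    Lowercases and tokenizes a sentence, lemmatizes the tokens, and drops
    numbers, tokens shorter than three characters and stop words.
    Returns the remaining tokens joined into one string.
    '''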
    paper = paper.lower()
    paper_tokens = [token.strip() for token in wtk.tokenize(paper)]
    paper_tokens = [wnl.lemmatize(token) for token in paper_tokens if not token.isnumeric()]
    paper_tokens = [token for token in paper_tokens if len(token) >2]
    paper_tokens = [token for token in paper_tokens if token not in stop_words]

    doc = ' '.join(paper_tokens)
    return doc

def normalize(text):
    '''
    Normalizes text, string as input, returns normalized string
    '''
    text = tprep.normalize.hyphenated_words(text)
    text = tprep.normalize.quotation_marks(text)
    text = tprep.normalize.unicode(text)
    text = tprep.remove.accents(text)
    return text

def text_rank_summarizer(norm_sentences, original_sentences, num_sent):
    # Never ask for more sentences than the document contains
    num_sent = min(num_sent, len(original_sentences))

    # One TF-IDF vector per sentence; max_df=1.0 (a float) keeps terms shared across sentences
    tv_p = TfidfVectorizer(min_df=1, max_df=1.0, ngram_range=(1,1), use_idf=True)
    dt_matrix = tv_p.fit_transform(norm_sentences)
    dt_matrix = dt_matrix.toarray()

    # Sentence-by-sentence similarity: dot products of the TF-IDF vectors
    similarity_matrix = np.matmul(dt_matrix, dt_matrix.T)

    # Sentences become nodes, similarities become edge weights
    similarity_graph = nx.from_numpy_array(similarity_matrix)

    scores = nx.pagerank(similarity_graph)

    # Highest-scoring sentences first
    ranked_sentences = sorted(((score,index) for index, score in scores.items()), reverse=True)

    # Keep the top sentences, then restore document order
    top_sentence_indices = [ranked_sentences[index][1] for index in range(num_sent)]
    top_sentence_indices.sort()
    summary = "\n".join(np.array(original_sentences)[top_sentence_indices])
    return summary

Processing text

normalize_corpus = np.vectorize(normalize_document)
text = extract_text_doc(doc)
text1 = normalize(text)
sentences = nltk.sent_tokenize(text1)
norm_sentences = normalize_corpus(sentences)
summary = text_rank_summarizer(norm_sentences, sentences, 10)
print(summary)

Historically, it has been characterized as a relatively small group based on kinship.
Moreover, it was informally led by a wealthy and skilful person whose authority was primarily related to herding.
Only two herders reported to have changed summer and winter siida since 2000.
Furthermore, while the siida continues to be family-based, leadership is becoming more formal.
Nevertheless, decision-making continues to be influenced by concerns of equality.
One of these groups' critical aspects is that they are dynamic: composition and size changes according to the season, and members are free to join and leave groups as they see fit.
Lowie (1945) writes that the Sami – both pastoralists and hunters – in Norway had a larger unit than the family, i.e., the siida.
Names of siidas were, in other words, local.
The siida could refer to both the territory, its resources and the people that use it (see also Riseth 2000, 120).
The core institutions are the baiki (household) and the siida (band).
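To make the ranking step a bit more concrete, here is a minimal pure-Python sketch of PageRank on a small, made-up similarity matrix. The matrix values, the damping factor of 0.85 and the function name `pagerank_scores` are assumptions for illustration only; it is a simplified stand-in for what `nx.pagerank` does, not its actual implementation.

```python
def pagerank_scores(similarity, damping=0.85, iterations=100):
    """Simplified weighted PageRank via power iteration (illustrative only)."""
    n = len(similarity)
    # Turn each row into transition probabilities (each row needs a positive sum)
    transition = [[w / sum(row) for w in row] for row in similarity]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * transition[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores


# Hypothetical similarities between three sentences (symmetric, made up)
toy_similarity = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.1],
    [0.5, 0.1, 1.0],
]
toy_scores = pagerank_scores(toy_similarity)
# Rank sentence indices from highest to lowest score
ranking = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i], reverse=True)
print(ranking)
```

The sentence that is most similar to the others overall ends up first in the ranking, which is the property the summarizer relies on when it picks the top sentences.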

For more options and better ways of summarizing text, check out https://pypi.org/project/sumy/ with its many summarizer classes. For example, TextRankSummarizer is similar to (but much better than) the approach taken here.

Like the approach here, it conceptualizes the relationship between sentences as a graph: each sentence is a vertex, and each vertex is linked to every other vertex. But rather than deriving sentence similarity from TF-IDF vectors, as done here, it uses Jaccard similarity.
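For readers unfamiliar with Jaccard similarity, here is a minimal sketch of the measure itself: the number of shared words divided by the number of distinct words in both sentences. The function name and the simple whitespace tokenization are my own assumptions for illustration; this is not sumy’s code.

```python
def jaccard_similarity(sentence_a, sentence_b):
    """Share of distinct words two sentences have in common (0.0 to 1.0)."""
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# Two made-up example sentences: 3 shared words out of 9 distinct words
s1 = "the siida is a herding group"
s2 = "the siida is based on kinship"
print(jaccard_similarity(s1, s2))
```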

Density or climate: Is that the question?

A recent paper argues that climate is more important than density in the reindeer husbandry in Norway. Using the same analysis, I find that reindeer density is essential: In high-density environments, average varit (1.5-year-old bucks) carcass weight is 8 kg lower, and calf carcass weight is 4 kg lower compared to low-density environments.

A recent paper, ‘Productivity beyond density: A critique of management models for reindeer pastoralism in Norway’, published in Pastoralism: Research, Policy and Practice sets out to investigate the validity of the premise that there is a strong relationship between density and carcass weights in the reindeer husbandry in Norway.

In short, the paper aims to challenge the official view of overstocking and reframe reindeer herding in terms of non-equilibrium ecology.

Their focus of attack is the Røros model which, according to the authors, hinges on

“… classic ecological equilibrium models where there is a clear unequivocal relationship between animal densities, production, and carcass weights”

p. 15

As such, the article fits nicely in a growing trend: rather than investigating problems currently facing pastoralists, the main point is to establish systems as non-equilibrium, and thus all issues are assumed resolved, or at least externally caused (for an excellent example from the reindeer husbandry in Norway, check out ‘Conceptualising resilience in Norwegian Sámi reindeer pastoralism’).

In another paper, some of the same authors have, for example, argued that reindeer herding in Norway is better characterised as a non-equilibrium system

“…where herbivore populations fluctuate randomly according to external influences, [and] the concepts of carrying capacity and overgrazing have no discernible meaning”.

‘Misreading the Arctic landscape: A political ecology of reindeer, carrying capacities, and overstocking in Finnmark, Norway’, p. 223

‘Productivity beyond density’ at least goes further in attempting to quantify the relative importance of non-equilibrium factors (such as climate) and equilibrium factors (such as density).

While the paper is well-written and exciting, I find it a bit strange that, in the only quantitative analysis they present, the sole focus is on statistically significant effects of precipitation and temperature on the carcass weights of reindeer:

Source: Table 1 in publication.

While the analysis does show that climate factors (precipitation: all the daily observations in the stated period, and growing degree days [GDD]) are significant, the discussion of the table completely fails to address two critical factors:

They never discuss whether the variables measuring climate are correlated or not (as they are monthly based, it wouldn’t be a huge surprise if they are).

High, or even moderate, collinearity is problematic when effects are weak (as the climate effect sizes indicate). If collinearity is ignored, it is possible to end up with a statistical analysis where nothing is significant, but where dropping one predictor may make others significant, or even change the sign of estimated parameters.
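A quick way to screen for this is to look at the correlation between predictors before fitting the model. The sketch below uses synthetic, made-up precipitation values for two adjacent months (I do not have the paper’s data) and assumes only two predictors, in which case the variance inflation factor (VIF) reduces to 1 / (1 − r²). The variable names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)
# Synthetic (made-up) precipitation for two adjacent months; July is built
# from June plus noise, so the two are correlated by construction
june = [rng.gauss(50, 10) for _ in range(200)]
july = [value + rng.gauss(0, 3) for value in june]

r = pearson_r(june, july)
# With only two predictors, the variance inflation factor is 1 / (1 - r^2)
vif = 1 / (1 - r ** 2)
print(f"r = {r:.2f}, VIF = {vif:.1f}")
```

A high VIF would suggest that the individual climate coefficients are estimated imprecisely and should be interpreted with care.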

The point is technical; it would be interesting to see how these potential problems were accounted for.

Concerning effect size, the most substantial effect by far is that of density: -0.16 kg for calves and -0.32 kg for varit (1.5-year-old bucks).

In effect, this has a considerable impact on the carcass weights in high-density vs low-density environments.

Keep in mind that they do not provide information concerning variable transformation, so I take it for granted that the intercept represents average carcass weights when every other variable is at zero. I also take for granted that all variables are continuous. Moreover, not all of the data was in the supplemental material so I couldn’t re-analyse the data properly.

In short, at density 0 average calf carcass weight is 18.52 kg and average varit carcass weight is 25.34 kg.

The paper does not indicate the range of density utilised in this analysis, but Fig. 5 presents the range for mainland districts (which are the same districts used in Table 1) to be from 0 to 25.

Disregarding the climate parameters (since there are no interactions and the range of the climate parameters is not presented), density has a significant effect in high-density environments:

Calves: 18.52 – 0.16 × 25 = 14.52 kg

Varit: 25.34 – 0.32 × 25 = 17.34 kg

In short, the model in Table 1 predicts that the difference in varit carcass weight between a low-density environment and a high-density environment is 8 kg. For calf carcass weight, the model predicts a difference of 4 kg.
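The arithmetic above can be reproduced with a few lines of Python. This is only a sketch under the assumptions already stated: untransformed continuous variables, all other predictors held at zero, and a density range of 0 to 25. The function and variable names are my own.

```python
def predicted_weight(intercept, density_slope, density):
    """Predicted carcass weight (kg) with all other predictors held at zero."""
    return intercept + density_slope * density


# Intercepts and density coefficients as quoted above from Table 1
calf = {"intercept": 18.52, "slope": -0.16}
varit = {"intercept": 25.34, "slope": -0.32}

for name, coef in (("Calves", calf), ("Varit", varit)):
    low = predicted_weight(coef["intercept"], coef["slope"], 0)    # density 0
    high = predicted_weight(coef["intercept"], coef["slope"], 25)  # density 25
    print(f"{name}: {low:.2f} kg -> {high:.2f} kg (difference {low - high:.2f} kg)")
```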

While I fully agree with the authors that an over-emphasis on density and herd size is too simplistic when modelling pastoral production, it is bizarre that the above is not communicated at all.

Part of the problem, I think, stems from the simplified representation of non-equilibrium ecology. Concerning Africa and Asia, for example, they write “…a wholesale paradigm shift from equilibrium to non-equilibrium modelling took place from the early 1990s” (p. 9).

This is, in fact, only partially true.

In the chapter ‘Why are there so many animals? Cattle population dynamics in the communal areas of Zimbabwe’, Ian Scoones, for example, investigated factors affecting herd growth among Zimbabwean pastoralists.

He mainly focused on periodic events such as droughts (a density-independent factor) and more persistent factors such as herd size (a density-dependent factor).

In other words, he investigated the degree to which density-dependent and density-independent factors explained herd size fluctuations (data for 60 years).

In short, he found:

  • In years with high precipitation, the population of cattle approaches a ceiling, which he terms the carrying capacity. As density increases, the birth rate drops and the mortality rate increases (although they never reach equilibrium and the cattle population never reaches its theoretical maximum).
  • The cattle population never reaches a maximum because stochastic events such as droughts occur and kill off large parts. Notably, the number of animals killed by these events was more substantial than what can be predicted from density-dependent factors alone.
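The interplay between the two findings can be illustrated with a toy simulation: logistic growth (the density-dependent part) combined with random drought die-offs (the density-independent part). This is not Scoones’ model, and every parameter value below (growth rate, drought probability, drought loss, capacity) is an arbitrary assumption chosen only to show the pattern.

```python
import random


def simulate_herd(years, start, capacity, growth=0.2, drought_prob=0.1,
                  drought_loss=0.4, seed=0):
    """Toy herd model: logistic growth plus random drought die-offs."""
    rng = random.Random(seed)
    population = start
    history = [population]
    for _ in range(years):
        # Density-dependent part: growth slows as the herd nears capacity
        population += growth * population * (1 - population / capacity)
        # Density-independent part: a drought removes a fixed share of animals
        if rng.random() < drought_prob:
            population *= 1 - drought_loss
        history.append(population)
    return history


# 60 simulated years with made-up starting herd and carrying capacity
herd = simulate_herd(years=60, start=2000, capacity=10000)
print(f"max = {max(herd):.0f}, final = {herd[-1]:.0f}")
```

In this sketch the herd approaches the ceiling in good years but stays below it, and droughts can remove far more animals in a single year than the density-dependent slowdown does.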

In the long term, it thus looks like non-equilibrium factors have the most significant impacts on cattle populations. Still, equilibrium factors are essential in years without stochastic climatic effects and when the population is high.

Scoones’ investigation shows what now seems to be forgotten:

It is unlikely that any system is characterised by either equilibrium or non-equilibrium factors alone, but rather that they both operate on a continuum.

This supports the predominant ecological perspective that at high population sizes, herbivores are sensitive to a combination of density-dependent and -independent factors, which has been shown for reindeer in Norway.

As I argued in the paper ‘Climate Change, Risk Management and the End of Nomadic Pastoralism’:

“To understand the effects of climate change on nomadic pastoralists, it is thus necessary to move beyond the simplistic dichotomy of characterising pastoral system as equilibrial (density dependence: livestock and pastures are regulated by grazing pressure) or non-equilibrial (density independence: livestock and pastures are limited by external factors such as climate) and look at the interplay between density dependent and density independent factors”

p. 131

Cultural group selection and the evolution of reindeer herding in Norway

The debate about reindeer husbandry in Norway is characterised by two contrasting views.

On the one hand is the prevailing view of overstocking and rangeland degradation.

On the other hand is the view that overstocking and overuse represent a misreading of the Arctic landscape that perpetuates a dominant crisis narrative that functions as “… an enduring ‘social fact’, whose narrative reality is in large part decoupled from its supposed scientific basis” (Benjaminsen et al. 2015:228).

While the overstocking perspective is based on a presumed ‘Tragedy of the Commons’, the other perspective argues that reindeer herding is better characterised as a non-equilibrium system

“…where herbivore populations fluctuate randomly according to external influences, [and] the concepts of carrying capacity and overgrazing have no discernible meaning” (ibid.:223).

In my new paper, Cultural Group Selection and the Evolution of Reindeer Herding in Norway, I argue differently.

Through a comparative historical analysis, I argue that herding is better viewed as an assurance game with two different strategies for minimising risk:

  1. maximising quantity (i.e., increasing livestock numbers or herd size)
  2. maximising livestock quality (i.e., increasing livestock body mass)

I demonstrate that intra-group competition has led to the adoption of (1) in the Northern parts of Norway, while inter-group competition has led to the adoption of (2) in the Southern parts.

Read the full paper here.

Workshop in Tromsø February 18

In connection with the project “The Erosion of Cooperative Networks and the Evolution of Social Hierarchies: A Comparative Approach” and NIKU’s 20th anniversary, a workshop will be arranged on Wednesday 18th of February in Tromsø, Norway.

Time: Wednesday February 18, 12:30-16:00

HIERARCHIES: New research project from the Research Council of Norway

Last week I learned that I had been awarded a four-year research grant from the Research Council of Norway.

What’s killing the reindeer?

Predatory species compete with humans for the use of resources such as livestock and an important tool for managing possible conflicts is damage compensation schemes distributing the costs between those who benefit from conservation and those who suffer the costs of damage.

Reindeer herders’ objectives may differ from official assumptions

A number of explanations have been raised in the literature as to why pastoralists keep large herds of animals: from the “East African cattle complex”, where the prestige of having large herds was given weight, to nomadic pastoralists seeking reliable food intake and valuing long-term household survival. Importantly, however, large herds have been argued and shown to buffer environmental risks, as in the reindeer husbandry, where herders with comparably larger herds one year also had comparably larger herds the next.

My latest publication

A bit earlier this year I got a paper published in Evolution and Human Behavior. In general, the paper investigates how pastoral slaughter strategies are shaped in the reindeer husbandry in Norway.

From a governmental point of view, the reindeer husbandry is characterised by overstocking of reindeer (especially in the northern parts of the country). As a consequence, the Norwegian government has initiated a subsidy policy aiming to stimulate households to slaughter as many reindeer as possible so as to reduce the number of reindeer and thereby create a sustainable reindeer husbandry. Nevertheless, in spite of this subsidy policy, the number of reindeer has increased rather than decreased. This indicates that reindeer herders do not make slaughter related decisions from a purely economic point of view.
