Using Python to summarize articles

Earlier this year I got an article published in Acta Borealia.

The paper, The Sami cooperative herding group: the siida system from past to present, is open access.

I usually publish a short summary on this blog, but I’ve recently been learning to analyze text with Python, so I thought I would try using it to summarize my own paper.

The result? Have a look (I’ve only removed citations and reorganized the sentences for flow):

Background

The Sami – both pastoralists and hunters – in Norway had a larger unit than the family, i.e., the siida.

History

Historically, it has been characterized as a relatively small group based on kinship.

The siida could refer to both the territory, its resources and the people that use it.

The core institutions are the baiki (household) and the siida (band).

Names of siidas were, in other words, local.

Moreover, it was informally led by a wealthy and skillful person whose authority was primarily related to herding.

One of these groups’ critical aspects is that they are dynamic: composition and size change according to the season, and members are free to join and leave groups as they see fit.

Results

Only two herders reported to have changed summer and winter siida since 2000.

Furthermore, while the siida continues to be family-based, leadership is becoming more formal.

Nevertheless, decision-making continues to be influenced by concerns of equality.

Code

The code is shown below. It is a bit short on comments, but it should work for other documents. I loaded the text from a docx file.

Imports

import warnings

import docx
import networkx as nx
import nltk
import numpy as np
import textacy.preprocessing as tprep # have installed spacy==3.0 and textacy==0.11
from sklearn.feature_extraction.text import TfidfVectorizer

# NLTK needs its stopwords, wordnet and punkt data (see nltk.download)
warnings.filterwarnings("ignore")

stop_words = nltk.corpus.stopwords.words('english')
wtk = nltk.tokenize.RegexpTokenizer(r'\w+')
wnl = nltk.stem.wordnet.WordNetLemmatizer()

Loading file

doc = docx.Document('your_filename.docx')

Functions for processing and summarizing text

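def extract_text_doc(doc):
    '''
    Joins the paragraphs of a docx document into one string
    '''
    paras = [p.text for p in doc.paragraphs if p.text]
    # Keep only paragraphs containing a period, which drops headings and captions
    revised_paras = [p for p in paras if len(p.split('.')) >1]
    text = " ".join(revised_paras)
    return text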

def normalize_document(paper):
    '''
    Lowercases and tokenizes a string, lemmatizes the tokens, and drops
    numbers, tokens shorter than three characters and stop words
    '''
    paper = paper.lower()
    paper_tokens = [token.strip() for token in wtk.tokenize(paper)]
    paper_tokens = [wnl.lemmatize(token) for token in paper_tokens if not token.isnumeric()]
    paper_tokens = [token for token in paper_tokens if len(token) >2]
    paper_tokens = [token for token in paper_tokens if token not in stop_words]

    return ' '.join(paper_tokens)

def normalize(text):
    '''
    Normalizes text, string as input, returns normalized string
    '''
    text = tprep.normalize.hyphenated_words(text)
    text = tprep.normalize.quotation_marks(text)
    text = tprep.normalize.unicode(text)
    text = tprep.remove.accents(text)
    return text

def text_rank_summarizer(norm_sentences, original_sentences, num_sent):
    # Never ask for more sentences than the document contains
    num_sent = min(num_sent, len(original_sentences))

    # One TF-IDF vector per sentence. max_df must be the float 1.0: the integer 1
    # would discard every term that occurs in more than one sentence
    tv_p = TfidfVectorizer(min_df=1, max_df=1.0, ngram_range=(1,1), use_idf=True)
    dt_matrix = tv_p.fit_transform(norm_sentences)
    dt_matrix = dt_matrix.toarray()

    # Rows are L2-normalized by default, so the dot product is the cosine similarity
    similarity_matrix = np.matmul(dt_matrix, dt_matrix.T)

    # Sentences become nodes and similarities become edge weights
    similarity_graph = nx.from_numpy_array(similarity_matrix)

    scores = nx.pagerank(similarity_graph)

    # Highest-scoring sentences first
    ranked_sentences = sorted(((score,index) for index, score in scores.items()), reverse=True)

    # Keep the top sentences and put them back in document order
    top_sentence_indices = [ranked_sentences[index][1] for index in range(num_sent)]
    top_sentence_indices.sort()
    summary = "\n".join(np.array(original_sentences)[top_sentence_indices])
    return summary

Processing text

normalize_corpus = np.vectorize(normalize_document)
text = extract_text_doc(doc)
text1 = normalize(text)
sentences = nltk.sent_tokenize(text1)
norm_sentences = normalize_corpus(sentences)
summary = text_rank_summarizer(norm_sentences, sentences, 10)
print(summary)

Historically, it has been characterized as a relatively small group based on kinship.
Moreover, it was informally led by a wealthy and skilful person whose authority was primarily related to herding.
Only two herders reported to have changed summer and winter siida since 2000.
Furthermore, while the siida continues to be family-based, leadership is becoming more formal.
Nevertheless, decision-making continues to be influenced by concerns of equality.
One of these groups' critical aspects is that they are dynamic: composition and size changes according to the season, and members are free to join and leave groups as they see fit.
Lowie (1945) writes that the Sami – both pastoralists and hunters – in Norway had a larger unit than the family, i.e., the siida.
Names of siidas were, in other words, local.
The siida could refer to both the territory, its resources and the people that use it (see also Riseth 2000, 120).
The core institutions are the baiki (household) and the siida (band).
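
To make the ranking step a bit more concrete, here is a minimal sketch of what a PageRank-style score does with a sentence-similarity matrix. It is plain Python and is not how networkx implements nx.pagerank. The function name pagerank_scores, the toy_weights matrix and the tolerance are made up for illustration, and the sketch assumes that every sentence has at least one positive similarity. The damping value of 0.85 matches the default alpha in nx.pagerank.

```python
def pagerank_scores(weights, damping=0.85, iterations=100, tol=1e-9):
    """Score the nodes of a weighted graph by power iteration.

    weights[i][j] is the similarity between sentence i and sentence j.
    Assumption: every row has at least one positive weight.
    """
    n = len(weights)
    # Total weight leaving each node, used to turn weights into probabilities
    totals = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Score flowing into j from each node i, in proportion to the edge weight
            inflow = sum(
                scores[i] * weights[i][j] / totals[i]
                for i in range(n)
                if totals[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * inflow)
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores


# Hypothetical similarities between three sentences (the diagonal is self-similarity)
toy_weights = [
    [1.0, 0.6, 0.2],
    [0.6, 1.0, 0.1],
    [0.2, 0.1, 1.0],
]

toy_scores = pagerank_scores(toy_weights)
# Sentence indices ordered from highest to lowest score
ranking = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i], reverse=True)
print(ranking)
```

The sentence that is most similar to the others ends up with the highest score, which is why the top-ranked sentences tend to be the ones that best represent the text as a whole.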

For more options and better ways of summarizing text, check out https://pypi.org/project/sumy/ with its many summarizer classes. For example, TextRankSummarizer is similar to (but way better than) the approach taken here.

Like the approach here, it conceptualizes the relationship between sentences as a graph: each sentence is a vertex, and each vertex is linked to every other vertex. But rather than weighting those links with TF-IDF similarity and ranking them with PageRank from networkx, it uses Jaccard Similarity.
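
For readers who have not met Jaccard similarity before, here is a minimal sketch of the measure itself: the number of words two sentences share, divided by the number of distinct words across both. This is only an illustration of the idea and not sumy's actual code. The helper names and the toy sentences are made up for the example.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard_similarity(sentence_a, sentence_b):
    """Shared words divided by all distinct words in the two sentences."""
    words_a, words_b = tokenize(sentence_a), tokenize(sentence_b)
    union = words_a | words_b
    if not union:
        # Two empty sentences: define the similarity as zero
        return 0.0
    return len(words_a & words_b) / len(union)


# Hypothetical sentences, loosely modelled on the summary above
toy_sentences = [
    "The siida is a herding group.",
    "The siida is based on kinship.",
    "Leadership is becoming more formal.",
]

# Pairwise similarities: these would be the edge weights of the sentence graph
similarity_matrix = [
    [jaccard_similarity(a, b) for b in toy_sentences]
    for a in toy_sentences
]

for row in similarity_matrix:
    print([round(value, 2) for value in row])
```

Note that no stop words are removed in this sketch, so common words such as "the" and "is" count as overlap. In practice you would normalize the sentences first, as in the code above.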


The pursuit of populations collapses: long-term dynamics of semi-domestic reindeer in Sweden

In Scandinavia there is a growing concern that reindeer husbandry is in a state of crisis, but results from our recent study indicate that Swedish reindeer husbandry is in fact in better condition now compared to the past (1945-1965).

By Bård-Jørgen Bårdsen & Marius Warg Næss


New research paper about cooperation in groups of Saami reindeer herders

The Tangled Woof of Fact

People rely on one another in fundamental ways, but cooperation in groups can be fragile. Every day, we face tensions between acting in a socially responsible manner and following our own self-interest. These situations are called social dilemmas and they come in varying shades of subtlety, from littering and eBay to overpopulation and climate change. Overcoming these dilemmas can make all the difference, especially for marginalised groups such as pastoralists – people who make their living from herding animals.

Pastoralists use about a quarter of the world’s land for grazing their herds. Nowadays, all over the world, governments are privatising many of their pastures, and so herders must work together in increasingly fragmented places.

We wanted to learn how groups of Saami reindeer herders living in Norway’s Arctic Circle worked together. Our study, just published in the journal Human Ecology, found that cooperation pivoted around the ‘siida’: a…


Market economy vs. risk management: how do nomadic pastoralists respond to increasing meat prices?

Just got a paper published in Human Ecology that looks at the old question of what exactly motivates nomadic pastoralists.

Anthropology, science and the challenge of subjectivity

My (somewhat limited) experience teaching anthropology (particularly ecological anthropology) has left me somewhat flabbergasted as to what is taught at universities about science.


Workshop in Tromsø February 18

In connection with the project “The Erosion of Cooperative Networks and the Evolution of Social Hierarchies: A Comparative Approach” and NIKU’s 20th anniversary, a workshop will be arranged on Wednesday 18th of February in Tromsø, Norway.

Time: Wednesday February 18 12:30-16:00

Risk-sensitive reproductive allocation: fitness consequences of body mass losses in two contrasting environments

Just got a paper published in Ecology and Evolution. It is basically about reindeer life history and risk sensitivity.

Why Herd Size Matters – Mitigating the Effects of Livestock Crashes


Just got a paper published in PLOS ONE. Basically, it provides the rationale for why it pays off for pastoralists to keep large herds of livestock.

Climate Change, Risk Management and the End of Nomadic Pastoralism

Tibet

While not a particularly good quality map, it at least shows the area my latest publication pertains to (Aru Basin). It is published in the International Journal of Sustainable Development & World Ecology.

The topic of the paper is mobility, a classic pastoral strategy for dealing with environmental variation. Mobility is used to manage resource variability, for example, during droughts, when pastoralists have moved from affected areas to unaffected (or less affected) areas.