Python Download Articles using Newspaper API

29 Dec 2017 3 mins read
python

Developers often need to download articles from the internet for various purposes. While interning, I worked on a project that needed some news articles to be fetched and analyzed. News articles are a great source of information, but the problem lies in their format. They do not follow a standard format, because different articles are written and published by different authors. This is where Newspaper API helps us.

Newspaper API is a great tool for quickly downloading articles from the internet and extracting the plain text from them. The best things about Newspaper API are its ease of use and its fast downloads. It also provides some advanced options, which we will see in a moment. Currently it supports 10+ languages, and everything is in Unicode!

Newspaper API is developed and maintained by Lucas Ou-Yang. You can view the source code for the API here.

Installation

Newspaper API is Python-based and can be installed on Python 3.x as well as Python 2.x.

To install Newspaper API on Python 3, type:

   $ pip3 install newspaper3k

To install Newspaper API on Python 2, type:

   $ pip install newspaper

You might need to install some extra dependencies for extracting images from articles. You can find the installation command for them here.

Features

As I mentioned earlier, Newspaper API provides some advanced options beyond just extracting the news article. Let us dive into some of them.

For creating an Article object from the news article's URL

    >>> from newspaper import Article
    >>> url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
    >>> article = Article(url)

For downloading the article and retrieving its basic details such as title, author, published date

    >>> article.download()
    >>> article.html
    '<!DOCTYPE HTML><html itemscope itemtype="http://...'

    >>> article.parse()

    >>> article.title
    'New Year, new laws: Obamacare, pot, guns and drones'

    >>> article.authors
    ['Leigh Ann Caldwell', 'John Honway']

    >>> article.publish_date
    datetime.datetime(2013, 12, 30, 0, 0)
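The download and parse steps above can be wrapped in a small helper. The following is only a minimal sketch: the function names `article_to_dict` and `fetch_article` are made up for this post, and the optional `language` argument assumes the `language` keyword that `Article` accepts (for example `'zh'`). The `newspaper` import is done lazily, so the dictionary helper can be tried out without the library or a network connection.

```python
def article_to_dict(article):
    """Collect the basic fields of an already-parsed article into a dict."""
    published = article.publish_date
    return {
        "title": article.title,
        "authors": list(article.authors),
        # publish_date can be empty when the page exposes no date
        "published": published.isoformat() if published else None,
        "text": article.text,
    }


def fetch_article(url, language=None):
    """Download and parse one URL (needs network access and newspaper installed)."""
    # Imported here so that article_to_dict stays usable without the library.
    from newspaper import Article

    # Only pass language when one is given; otherwise the library's default applies.
    kwargs = {"language": language} if language else {}
    article = Article(url, **kwargs)
    article.download()
    article.parse()
    return article_to_dict(article)


# Example (not run here, since it needs network access):
# info = fetch_article('http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/')
# print(info["title"], info["authors"])
```

Returning a plain dictionary keeps the result easy to dump to JSON or feed into whatever analysis comes next.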

For extracting the top image from the article

    >>> article.top_image
    'http://someCDN.com/blah/blah/blah/file.png'

For extracting video links from the article

    >>> article.movies
    ['http://youtube.com/path/to/link.com', ...]
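If you want the media of an article in one place, the attributes above can be gathered with a tiny helper. This is just a sketch: `collect_media` is a name I made up, and it assumes that a parsed article also exposes all of its image URLs as `article.images`, next to `top_image` and `movies`.

```python
def collect_media(article):
    """Gather image and video URLs from an already-parsed article."""
    return {
        # top_image may be an empty string when no image was found
        "top_image": article.top_image or None,
        # assumption: article.images holds every image URL found on the page
        "images": sorted(article.images),
        "videos": list(article.movies),
    }


# Example (needs a downloaded and parsed Article):
# media = collect_media(article)
# print(media["top_image"], len(media["images"]), media["videos"])
```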

Applying some basic NLP and analyzing the article

    >>> article.nlp()

    >>> article.keywords
    ['New Years', 'resolution', ...]

    >>> article.summary
    'The study shows that 93% of people ...'
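The order of the calls matters here: as far as I can tell, `nlp()` expects the article to be downloaded and parsed first, and it typically also needs NLTK tokenizer data (such as `punkt`) to be installed. A minimal sketch of the whole pipeline, with `analyze` being a helper name of my own, could look like this:

```python
def analyze(article):
    """Run download -> parse -> nlp in order and return keywords and summary."""
    article.download()
    article.parse()  # nlp() works on the parsed text
    article.nlp()    # assumption: NLTK tokenizer data is already installed
    return {
        "keywords": list(article.keywords),
        "summary": article.summary,
    }


# Example (not run here, since it needs network access and newspaper installed):
# from newspaper import Article
# result = analyze(Article(url))
# print(result["keywords"])
```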

Isn’t that simple? Now go ahead and play with the extracted articles however you want. There are other similar tools such as html2text, Lassie, Python-Goose and Textract, but I would personally recommend Newspaper API for the simplicity it offers and for advanced features that you will not want to give up!

I hope you learned something new from this article. Comment below if you have any questions.
Happy Coding.. :)


All content is licensed under the CC BY-SA 4.0 License unless otherwise specified