The need for content aggregators is pretty clear. The internet is filled with endless information, and to stay updated and informed about the latest news or any other type of content, you might find yourself scrolling through various websites every day.
Content aggregation helps us optimize our content consumption — instead of scrolling through 5 different websites we only need one, and instead of endless scrolling trying to filter the content we care about, we can be presented with content related to our topics of interest.
In this article, you will learn how to create your own customized content aggregator with Python from scratch.
To complete this tutorial, you will need:
- A local development environment for Python 3.6+
- Familiarity with Python.
Step 1 — Installing Dependencies
Create a new file called requirements.txt and copy the following content
Run the following command to install all dependencies
pip install -r requirements.txt
In the next section, we are going to come up with a design for our content aggregator such that it will be easy to add new sources and topics to follow.
Step 2 — Design
In this article, we are going to create a content aggregator from a single source, Reddit. To make it easy to add new sources later, we need to design the project properly.
We are going to create a Source abstract class, which will be the base class for the different sources we want to include (Reddit, Medium, etc.).
In our content aggregator, we are going to create a RedditSource class that inherits from Source.
Lastly, we will create another class, RedditHotProgramming, which will represent a topic we want to fetch content for from our source.
If you wish to add a second topic that will be fetched from the Reddit source, you simply create another topic class that inherits from RedditSource.
To fetch posts from a different platform, for example Medium, you would create a MediumSource class and the topic classes you wish to follow.
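The hierarchy described above could be laid out roughly as follows. This is only a skeleton to show how the three levels relate: the connection strings are placeholders rather than real API calls, and RedditHotPython is a hypothetical name for a second topic, not something defined in this article.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    """Base class: one subclass per platform."""

    @abstractmethod
    def connect(self):
        ...

    @abstractmethod
    def fetch(self):
        ...


class RedditSource(Source):
    """Platform level: would hold the Reddit connection logic."""

    def connect(self):
        return "reddit-connection"  # placeholder, no real API call

    def fetch(self):
        pass  # left to the topic classes


class RedditHotProgramming(RedditSource):
    """Topic level: hot posts from r/programming."""


class RedditHotPython(RedditSource):
    """Hypothetical second topic that reuses the same platform class."""


class MediumSource(Source):
    """A second platform; its topic classes would inherit from it."""

    def connect(self):
        return "medium-connection"  # placeholder, no real API call

    def fetch(self):
        pass
```

Each topic class inherits connect() from its platform class, so adding a topic does not require touching the connection code.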
In the next section, we are going to start coding our content aggregator, starting from the Source abstract class.
Step 3 — Creating the Source Class
Create a new file called content_agg.py and import the following libraries
from abc import ABC, abstractmethod
Now let’s define our Source abstract class.
The Source class will have two abstract methods, which will allow us to connect to a source (its API, for example) and to fetch posts from it.
class Source(ABC):
    @abstractmethod
    def connect(self):
        pass

    @abstractmethod
    def fetch(self):
        pass
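To see what the abstract methods buy us, here is a small self-contained check. It repeats the Source class so that it runs on its own; IncompleteSource, CompleteSource, and can_instantiate are hypothetical names used only for this illustration.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    @abstractmethod
    def connect(self):
        pass

    @abstractmethod
    def fetch(self):
        pass


class IncompleteSource(Source):
    # Hypothetical subclass that forgets to implement fetch().
    def connect(self):
        return None


class CompleteSource(Source):
    # Hypothetical subclass that implements both abstract methods.
    def connect(self):
        return None

    def fetch(self):
        return []


def can_instantiate(cls) -> bool:
    """Return True if cls() succeeds, False if Python refuses to instantiate it."""
    try:
        cls()
    except TypeError:
        return False
    return True


print(can_instantiate(Source))            # False
print(can_instantiate(IncompleteSource))  # False
print(can_instantiate(CompleteSource))    # True
```

In other words, a new source that omits connect or fetch fails as soon as it is instantiated, instead of failing later when the missing method is called.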
In the next section, we are going to create the RedditSource class, which will implement the abstract methods in the Source class.
Step 4 — Creating the Reddit Source class
In this section, we are going to write the RedditSource class into content_agg.py.
In this class, we will implement the connection to the Reddit API.
In order to access Reddit’s API, you will need to generate a key.
Luckily, this procedure is short and easy. Follow the instructions in the reddit-archive repository on GitHub.
Once you have the necessary keys, create environment variables for them or simply use them as constants in your code (only if you are not planning to share your code elsewhere).
If you chose to create environment variables for the two keys like I did, here’s how you access them from your code
import os

CLIENT_ID = os.environ.get('REDDIT_CLIENT_ID')
CLIENT_SECRET = os.environ.get('REDDIT_CLIENT_SECRET')
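Note that os.environ.get returns None when a variable is not set, which can lead to a confusing error later when connecting. A minimal sketch of a stricter loader is shown here; load_reddit_credentials is a hypothetical helper, not part of the article's code, and it accepts any mapping so it can be tried without touching the real environment.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Return (client_id, client_secret) or raise if either one is unset or empty."""
    client_id = env.get("REDDIT_CLIENT_ID")
    client_secret = env.get("REDDIT_CLIENT_SECRET")

    # Collect the names of the variables that are missing or empty.
    missing = [
        name
        for name, value in (
            ("REDDIT_CLIENT_ID", client_id),
            ("REDDIT_CLIENT_SECRET", client_secret),
        )
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return client_id, client_secret
```

Calling load_reddit_credentials() at startup would then fail early with the name of the missing variable.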
Now let’s move on to create the RedditSource class.
import praw


class RedditSource(Source):
    def connect(self):
        # Authenticate with the application's credentials.
        self.reddit_con = praw.Reddit(client_id=CLIENT_ID,
                                      client_secret=CLIENT_SECRET,
                                      grant_type_access='client_credentials',
                                      user_agent='script/1.0')
        return self.reddit_con

    def fetch(self):
        # Implemented by the topic classes.
        pass
And now we have a working connection to Reddit’s API.
The fetch function is left unimplemented in this class since we will implement it for each topic.
In the next section, we will create the RedditHotProgramming class, where we will fetch the hot posts from r/programming.
Step 5 — Create the Reddit Hot Programming class
In this section, we are going to implement a class that will enable us to fetch hot posts from r/programming.
We will rely on the connection to Reddit’s API from our parent class, RedditSource.
class RedditHotProgramming(RedditSource):
    def __init__(self) -> None:
        self.reddit_con = super().connect()
        self.hot_submissions = []

    def fetch(self, limit: int):
        # Store the posts in a list so they can be printed more than once.
        self.hot_submissions = list(
            self.reddit_con.subreddit('programming').hot(limit=limit))

    def __repr__(self):
        urls = []
        for submission in self.hot_submissions:
            urls.append(submission.url)
        return '\n'.join(urls)
Besides the fetch functionality, we implemented the __repr__ method so that when we print a RedditHotProgramming instance, we get its custom representation, which in our case is the list of URLs from the hot posts.
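To try the fetch and __repr__ logic without Reddit credentials, one option is to swap in a stand-in connection. The sketch below assumes only that the connection exposes subreddit(name).hot(limit=...) and that each post has a url attribute, as in the class above. FakeReddit, FakeSubreddit, and HotTopic are hypothetical names, and the URLs are made up.

```python
from types import SimpleNamespace


class FakeSubreddit:
    """Stand-in for the object returned by reddit.subreddit(name)."""

    def __init__(self, urls):
        self._urls = urls

    def hot(self, limit):
        # Mimic a listing: yield objects that expose a .url attribute.
        return (SimpleNamespace(url=u) for u in self._urls[:limit])


class FakeReddit:
    """Stand-in for the Reddit connection; no network is involved."""

    def __init__(self, urls):
        self._urls = urls

    def subreddit(self, name):
        return FakeSubreddit(self._urls)


class HotTopic:
    """Same fetch/__repr__ logic as the topic class, with an injected connection."""

    def __init__(self, reddit_con, subreddit_name):
        self.reddit_con = reddit_con
        self.subreddit_name = subreddit_name
        self.hot_submissions = []

    def fetch(self, limit: int):
        listing = self.reddit_con.subreddit(self.subreddit_name).hot(limit=limit)
        self.hot_submissions = list(listing)

    def __repr__(self):
        return "\n".join(s.url for s in self.hot_submissions)


fake = FakeReddit(["https://a.example", "https://b.example", "https://c.example"])
topic = HotTopic(fake, "programming")
topic.fetch(limit=2)
print(topic)  # prints the first two URLs, one per line
```

Passing the connection in as an argument is a design variation on the article's class, chosen here only so the example runs offline.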
In the next section, we will glue everything together and execute our content aggregator.
Step 6 — Gluing Everything Together
In order to run everything, we will create a RedditHotProgramming instance and fetch a few posts from it. Then we will be able to print our object and get the URL of each of the hot posts.
if __name__ == '__main__':
    reddit_top_programming = RedditHotProgramming()
    reddit_top_programming.fetch(limit=10)
    print(reddit_top_programming)
To execute, simply run python content_agg.py in your terminal.
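Once more than one topic class exists, the main block could loop over them instead of handling each one by hand. The following is a rough sketch of that idea under the assumption that every topic offers fetch(limit=...) and a printable representation, as RedditHotProgramming does. StubTopic, its subclasses, and aggregate are hypothetical names, and the URLs are placeholders.

```python
class StubTopic:
    """Hypothetical stand-in for a topic class such as RedditHotProgramming."""

    urls = []

    def __init__(self):
        self.fetched = []

    def fetch(self, limit: int):
        # A real topic would call its platform's API here.
        self.fetched = self.urls[:limit]

    def __repr__(self):
        return "\n".join(self.fetched)


class StubProgramming(StubTopic):
    urls = ["https://example.com/p1", "https://example.com/p2"]


class StubPython(StubTopic):
    urls = ["https://example.com/py1"]


def aggregate(topics, limit=5):
    """Fetch every topic and return one report with a header per topic."""
    sections = []
    for topic in topics:
        topic.fetch(limit=limit)
        sections.append(f"== {type(topic).__name__} ==\n{topic!r}")
    return "\n\n".join(sections)


print(aggregate([StubProgramming(), StubPython()], limit=1))
```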
In this article, you built a content aggregator completely from scratch. Now you can pick any platform and topic you want to add and expand this project.