Part 1 - Creating a Python-based backend
I decided to build an online course aggregator and introduced the project in this post. This post is part 1 of the series, and I will break down how I built the backend aggregation framework. The whole point of this project is to build a full-stack application as both a learning exercise and an in-depth blogging topic.
The idea for the backend is to collect course data from the major online course providers, normalize it, and store it locally. The aggregator will most likely run on an interval via cron or maybe supervisord. I also wanted the code to be modular so it is very easy to add a provider or a storage engine. The collection methods will vary from provider to provider; for instance, some providers like Coursera expose listings via an API, while others may need to be scraped using something like Beautiful Soup.
To make it modular, I can create a provider base class and a storage engine base class that each module inherits. This makes the main program very simple: define a list of providers and engines and iterate over them.
import importlib

#names must match the provider/storage class names; module names are the lowercased versions
provider_list = ["Coursera", "Edx"]
storage_list = ["Mongodb", "Postgres"]

#loop over the providers and call their get_courses method
for provider in provider_list:
    print "Collecting Courses for provider: {}".format(provider)
    #use importlib to call the providers polymorphically
    mod = importlib.import_module("providers." + provider.lower())  #import the provider module
    func = getattr(mod, provider)()  #create an instance of the provider class
    data = getattr(func, "get_courses")()  #call get_courses; every provider returns the same dictionary structure

    #loop through the storage engines
    for store in storage_list:
        print "Saving Courses into Engine: {}".format(store)
        mod = importlib.import_module("storage." + store.lower())
        func = getattr(mod, store)()  #create an instance of the storage engine
        getattr(func, "store_courses")(data)  #call store_courses to save the data
This is the entirety of the main program; to collect data from a new provider, simply create another provider class and add its name to the provider list. Let's look at how to create a new provider.
Creating Providers
I am starting with Coursera because it is the most popular MOOC provider and it has REST/JSON endpoints that make it easy to collect the data. Before we dive into the Coursera driver, let's talk about the generic provider class that each driver will inherit. A driver simply needs to provide a get_courses method that returns a normalized dictionary of data. In my generic provider class I created a method that supplies that basic dictionary structure. I used Python's abc module for creating abstract methods. Here is the basic provider base class.
from abc import ABCMeta, abstractmethod

class ProviderBase(object):
    __metaclass__ = ABCMeta

    #all provider drivers need to implement a get_courses method
    @abstractmethod
    def get_courses(self):
        #abstract; implemented in each provider
        pass
I also wanted to define the dictionary structure that the storage engines will expect, so I created a method that returns that dictionary structure. Since it belongs to the class itself rather than to any instance, I used the @classmethod decorator.
    @classmethod
    def get_schema_map(cls):
        course_schema = {
            "course_name": None,
            "provider": None,
            "language": None,
            "instructor": None,
            "providers_id": None,
            "media": {
                "photo_url": None,
                "icon_url": None,
                "video_url": None,
                "video_type": None,
                "video_id": None
            },
            "short_description": None,
            "full_description": None,
            "course_url": None,
            "institution": {
                "name": None,
                "description": None,
                "id": None,
                "website": None,
                "logo_url": None,
                "city": None,
                "state": None,
                "country": None
            },
            "sessions": [],
            "workload": None,
            "categories": [],
            "tags": []
        }
        return course_schema
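To make the "just add another provider" claim concrete, a second driver only needs to inherit ProviderBase, fill in the schema map, and return a list of those dictionaries. Here is a hypothetical skeleton for an edX driver living in providers/edx.py; the class and the placeholder values are just illustrative, and the real collection logic would go inside get_courses:

from provider import ProviderBase

class Edx(ProviderBase):
    #hypothetical skeleton; the actual edX collection logic (API calls or scraping)
    #would replace the placeholder values below
    def get_courses(self):
        courses = []
        course = Edx.get_schema_map()          #start from the shared schema
        course['course_name'] = "Example Course"  #placeholder
        course['provider'] = "edx"
        courses.append(course)
        return courses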
Creating the first driver
So now let's dive into the Coursera driver. This should be pretty simple to implement, as I was able to find some REST endpoints in their front-end backbone.js code. The URL to get all courses is https://www.coursera.org/maestro/api/topic/list?full=1, and the URL to get the details for a specific course is https://www.coursera.org/maestro/api/topic/information?topic-id=:id. This makes it very simple to collect and normalize. I used the requests library to collect the data, which made the Coursera driver pretty easy to build. However, I didn't want to keep hitting the API over and over while I developed the driver, so I used a test-driven approach and mocked the requests library. I was able to save the JSON responses into files and load them via a mock.
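That means the raw JSON needs to be captured once and saved as stub files. I won't dwell on that step, but a throwaway script along these lines would do it (a sketch; the stub paths and the "ml" topic id match the ones used in the unit test later in this post):

#one-off helper to capture live API responses as test stubs
import requests

list_url = "https://www.coursera.org/maestro/api/topic/list?full=1"
detail_url = "https://www.coursera.org/maestro/api/topic/information?topic-id=ml"

with open('test/stubs/coursera_courses.json', 'wb') as f:
    f.write(requests.get(list_url).content)

with open('test/stubs/coursera_course.json', 'wb') as f:
    f.write(requests.get(detail_url).content)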
Here is my final Coursera driver (it relies on a couple of small helper methods, create_id and get_valid_language, that are not shown here):
from provider import ProviderBase
import requests
from datetime import date

class Coursera(ProviderBase):

    def __init__(self):
        self.course_data = []

    def get_courses(self):
        coursera_url = "https://www.coursera.org/maestro/api/topic/list?full=1"
        response = requests.get(coursera_url)
        courses = response.json()
        catalog = []
        for item in courses:
            course = Coursera.get_schema_map()
            print "Processing Course: Coursera - {}".format(item.get("name", "Unknown").encode('utf-8'))
            try:
                #get required items
                course['course_name'] = item['name']
                course['providers_id'] = item["short_name"]
                course['provider'] = "coursera"
                course['language'] = Coursera.get_valid_language(item['language'])
                course['instructor'] = item['instructor']
                course['course_url'] = "http://class.coursera.org/{}/".format(item["short_name"])
                #create an ID for the record
                course['id'] = Coursera.create_id(course['provider'] + course["course_name"])
                #get institution data
                university = item['universities'][0]
                institution = {
                    "name": university['name'],
                    "description": university.get("description", None),
                    "id": Coursera.create_id(university['name']),
                    "website": university["home_link"],
                    "logo_url": university["logo"],
                    "city": university["location_city"],
                    "state": university["location_state"],
                    "country": university["location_country"]
                }
                course['institution'] = institution
                #get the data we need from the full course detail
                more_details = self.__get_course_detail(course["providers_id"])
                course['full_description'] = more_details.get("about_the_course", "not found")
            except KeyError:
                #we don't have all the required fields; skip it for now and log it
                continue
            #get media info
            media = {
                "photo_url": more_details.get("photo", None),
                "icon_url": more_details.get("large_icon", None),
                "video_url": more_details.get("video_baseurl", None),
                "video_type": "mp4",
                "video_id": more_details.get("video_id", None)
            }
            course["media"] = media
            #get optional fields
            course['short_description'] = item.get('short_description', None)
            course['categories'] = item.get('categories', [])
            course['workload'] = more_details.get('estimated_class_workload', None)
            catalog.append(item['short_name'])
            #get tags
            tags = []
            for cat in more_details["categories"]:
                tags.append(cat["name"])
            course["tags"] = tags
            #get the session data
            for c in item.get('courses', []):
                session = {}
                session['duration'] = c.get('duration_string', None)
                session['provider_session_id'] = c.get('id', None)
                #get the start date
                if all(name in c for name in ['start_year', 'start_month', 'start_day']):
                    try:
                        session['start_date'] = date(c['start_year'], c['start_month'], c['start_day']).strftime('%Y%m%d')
                    except TypeError:
                        #we don't have a valid start date, skip it
                        continue
                else:
                    #missing a start date, skip it
                    continue
                course['sessions'].append(session)
            self.course_data.append(course)
        return self.course_data

    def __get_course_detail(self, topic_id):
        response = requests.get("https://www.coursera.org/maestro/api/topic/information?topic-id=" + topic_id)
        return response.json()
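Before wiring up the mocks, a quick one-off sanity check against the live API looks something like this (just to confirm the driver returns data, not something to run repeatedly):

#quick manual sanity check of the driver against the live API
from providers.coursera import Coursera

coursera = Coursera()
courses = coursera.get_courses()
print "Collected {} courses".format(len(courses))
print courses[0]['course_name']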
Unit testing
In order to unit test, I used nose as my test runner because it is what I am most familiar with. Here is what my unit test looks like.
import unittest
from mock import Mock, patch
from providers.coursera import Coursera
import json

#set up our mock responses
class ListResponse:
    def json(self):
        with open('test/stubs/coursera_courses.json') as f:
            data = f.read()
        courses = json.loads(data)
        return courses

class CourseResponse:
    def json(self):
        with open('test/stubs/coursera_course.json') as f:
            data = f.read()
        course = json.loads(data)
        return course

class BadCourseResponse:
    def json(self):
        with open('test/stubs/coursera_course2.json') as f:
            data = f.read()
        course = json.loads(data)
        return course

#map each url to the matching mock response object
def get(*args):
    if args[0] == "https://www.coursera.org/maestro/api/topic/list?full=1":
        return ListResponse()
    elif args[0] == "https://www.coursera.org/maestro/api/topic/information?topic-id=ml":
        return CourseResponse()
    elif args[0] == "https://www.coursera.org/maestro/api/topic/information?topic-id=rt":
        return BadCourseResponse()
    else:
        print "nothing"

#test the coursera provider
class CourseraTest(unittest.TestCase):

    def setUp(self):
        pass

    @patch('requests.get')
    def test_input(self, MockClass):
        MockClass.side_effect = get
        coursera = Coursera()
        courses = coursera.get_courses()
        self.assertTrue(MockClass.called)
        self.assertEquals(len(courses), 3)
        self.assertEquals(courses[0].get("course_name"), "Machine Learning")
        self.assertEquals(courses[2].get("language"), "japanese")
Using nose I can run this test with the nosetests command. However, I like to check my code coverage as well, so I installed the coverage package and I can test my providers like this:
(wisdom)[brett:~/src/wisdom (master)]$ nosetests providers.coursera --with-coverage
Name                 Stmts   Miss  Cover   Missing
--------------------------------------------------
providers                1      0   100%
providers.coursera      55      1    98%   90
providers.provider      14      1    93%   11
--------------------------------------------------
TOTAL                   70      2    97%
----------------------------------------------------------------------
Ran 0 tests in 0.000s

OK
In my next post I will show my storage implementations. Initially I decided to build an engine for MongoDB and one for Postgres. However, I used SQLAlchemy for the relational engine, so it will support any RDBMS that SQLAlchemy supports. You can take an early peek at the GitHub page linked below, where you will find the final version of the wisdom backend. I used virtualenv and a YAML config file for the final version, so it is a little different from this post.
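The storage engines are for the next post, but based on how the main loop calls store_courses, the storage base class will mirror ProviderBase. Something along these lines (a sketch; the actual implementation may differ):

from abc import ABCMeta, abstractmethod

class StorageBase(object):
    __metaclass__ = ABCMeta

    #every storage engine needs a store_courses method that accepts
    #the normalized course dictionaries produced by the providers
    @abstractmethod
    def store_courses(self, course_data):
        pass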
Part 1 continued, MongoDB storage engine