Keeps an eye on rents. Everywhere in Europe, all the time.
Writing scrapers with Rentswatch
Rentswatch focuses on extracting classified ads. To do so, we set up a large collection
of tiny robots that analyze and extract data from websites. For the sake of efficiency,
we created a framework that harmonizes the way we code those scrapers.
This post details how to write a scraper in Python using this framework.
How to install
Install using pip…
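The exact command is elided above. Assuming the package is published on PyPI under the project's name (an assumption — verify against the repository README), installation would look something like:

```shell
# Hypothetical package name — verify against the project's README.
pip install rentswatch-scraper
```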
How to use
Let’s take a look at a quick example that uses Rentswatch Scraper to
build a simple model-backed scraper collecting data from a website.
First, import the package components to build your scraper:
To share as much code as possible, we created an abstract class that
every scraper implements. For the sake of simplicity, we’ll use a
dummy website as follows:
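As a sketch of what such a subclass might look like — the names `Scraper`, `site` and the default selector below are assumptions based on this post, not the library's documented API:

```python
# Simplified stand-in for the framework's abstract base class; the real
# class lives in the rentswatch-scraper package.
class Scraper:
    site = None
    # Default CSS selector used to find links to individual ads.
    ad_link_selector = ".ad-page-link"


class DummyScraper(Scraper):
    # The dummy website this scraper will collect ads from.
    site = "http://dummy.io"


print(DummyScraper.site)  # http://dummy.io
```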
Without any further configuration, this scraper will start to collect
ads from the list page of dummy.io. To find links to the ads, it will
use the CSS selector .ad-page-link to select <a> elements and follow
their href attributes.
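The link-discovery step can be illustrated with a stdlib-only stand-in. The real framework resolves the `.ad-page-link` CSS selector with an HTML library; here the class check is hard-coded for clarity:

```python
from html.parser import HTMLParser

# Find every <a class="ad-page-link"> on a list page and collect its href,
# as the scraper does by default.
class AdLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "ad-page-link" in attrs.get("class", "").split():
            self.hrefs.append(attrs.get("href"))

list_page = """
<div class="ad"><a class="ad-page-link" href="/ads/1">Flat, 2 rooms</a></div>
<div class="ad"><a class="ad-page-link" href="/ads/2">Studio</a></div>
"""

parser = AdLinkParser()
parser.feed(list_page)
print(parser.hrefs)  # ['/ads/1', '/ads/2']
```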
We now have to teach the scraper how to extract key figures from each ad page.
Every extracted attribute will be saved as a property of the Ad, according to
the Ad model.
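As an illustration of the extraction idea — the `Field` class below is a simplified stand-in using a regex, not the framework's actual field type:

```python
import re

# Each declared field pulls one value out of an ad's HTML and becomes a
# property of the resulting Ad.
class Field:
    def __init__(self, pattern):
        self.pattern = re.compile(pattern)

    def extract(self, html):
        match = self.pattern.search(html)
        return match.group(1) if match else None

ad_html = '<span class="price">850 €</span><span class="surface">54 m²</span>'

price = Field(r'class="price">(\d+)')
surface = Field(r'class="surface">(\d+)')

print(price.extract(ad_html), surface.extract(ad_html))  # 850 54
```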
Some properties may not be directly extractable from the HTML. You may need to
use a custom function that receives the already-extracted properties. For this
reason we created a second field type named ComputedField. Since the order in
which properties are declared is recorded, we can use previously
declared (and extracted) values to compute new ones.
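A minimal sketch of that mechanism — `ComputedField` here, and the price-per-square-metre example, are illustrative stand-ins rather than the library's real implementation:

```python
# A ComputedField derives a new property from values already extracted by
# earlier fields; declaration order guarantees those values exist.
class ComputedField:
    def __init__(self, compute):
        # ``compute`` receives a dict of the properties extracted so far.
        self.compute = compute

# Values extracted by earlier, regular fields.
extracted = {"price": 850.0, "surface": 54.0}

# Declared after ``price`` and ``surface``, so both are available here.
price_per_sqm = ComputedField(
    lambda props: round(props["price"] / props["surface"], 2)
)

extracted["price_per_sqm"] = price_per_sqm.compute(extracted)
print(extracted["price_per_sqm"])  # 15.74
```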
All you need to do now is create an instance of your class and run it.
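Assuming the base class exposes a `run()` method (a guess based on this post — check the README for the actual entry point), that final step might look like:

```python
# Simplified stand-in for the framework's base class.
class Scraper:
    site = None

    def run(self):
        # The real framework would fetch the list page, follow every ad
        # link and extract the declared fields; this stub just reports it.
        return f"scraping {self.site}"


class DummyScraper(Scraper):
    site = "http://dummy.io"


scraper = DummyScraper()
print(scraper.run())  # scraping http://dummy.io
```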
As we wanted to make the Scraper very flexible, we isolated most of the extraction
steps in separate methods. The full list of methods is available on GitHub.
By overriding those methods you can completely change the behavior of your scraper.
For instance, the get_series method extracts each list of ads and parses
the page to create an iterator over every ad in that list. The get_ad_href method
receives the soup of an ad’s block in order to extract the link to the ad.
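A sketch of overriding those two hooks — the method names come from this post, but their exact signatures in the library may differ, and plain dicts stand in for soup objects here:

```python
# Simplified stand-in exposing the two extraction hooks discussed above.
class Scraper:
    def get_series(self, soup):
        # Default: yield each ad block found on the list page.
        return iter(soup)

    def get_ad_href(self, ad_soup):
        # Default: return the link stored in the ad block.
        return ad_soup.get("href")


class CustomScraper(Scraper):
    def get_series(self, soup):
        # Override: skip sponsored ads, for example.
        return (ad for ad in soup if not ad.get("sponsored"))

    def get_ad_href(self, ad_soup):
        # Override: the link lives under a different key on this site.
        return ad_soup["url"]


soup = [
    {"url": "/ads/1"},
    {"url": "/ads/2", "sponsored": True},
]
scraper = CustomScraper()
links = [scraper.get_ad_href(ad) for ad in scraper.get_series(soup)]
print(links)  # ['/ads/1']
```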