Memorious
Memorious is a lightweight web scraping toolkit. It supports scrapers that collect structured or unstructured data. Its goals are to:
- Make crawlers modular and simple tasks re-usable
- Provide utility functions for common tasks such as data storage and HTTP session management
- Integrate crawlers with the Aleph and FollowTheMoney ecosystem
- Get out of your way as much as possible
Design
When writing a scraper, you often need to paginate through an index page, download an HTML page for each result, and finally parse that page and insert or update a record in a database.
Memorious handles this by managing a set of crawlers, each of which can be composed of multiple stages. Each stage is implemented using a Python function, which can be re-used across different crawlers.
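To make the crawler/stage model concrete, here is a minimal, self-contained sketch of the idea. It does not import Memorious. `SketchContext`, `run_crawler`, the `pipeline` dictionary, and the example URL are all hypothetical stand-ins invented for illustration. The `(context, data)` stage signature and the `emit` call are modeled on how Memorious stage functions are commonly written, but treat them as assumptions and check the Crawler Reference for the actual API.

```python
from collections import deque


class SketchContext:
    """Hypothetical stand-in for the context object passed to a stage."""

    def __init__(self, params, queue, next_stage):
        self.params = params          # per-stage settings (would come from YAML)
        self._queue = queue
        self._next_stage = next_stage

    def emit(self, data):
        # Hand the data to the next stage, if one is configured.
        if self._next_stage is not None:
            self._queue.append((self._next_stage, data))


def paginate(context, data):
    """Emit one item per index page."""
    for page in range(1, context.params["pages"] + 1):
        context.emit({"url": f"{context.params['base_url']}?page={page}"})


def fetch(context, data):
    """Pretend to download a page; a real stage would issue an HTTP request."""
    context.emit({**data, "html": f"<title>Result for {data['url']}</title>"})


def store(context, data):
    """Append the record to a sink; a real stage would write to a database."""
    context.params["sink"].append(data)


def run_crawler(pipeline, start):
    """Run stages in FIFO order. pipeline: name -> (function, params, next stage)."""
    queue = deque([(start, {})])
    while queue:
        name, data = queue.popleft()
        func, params, next_stage = pipeline[name]
        func(SketchContext(params, queue, next_stage), data)


records = []
pipeline = {
    "index": (paginate, {"base_url": "https://example.org/list", "pages": 3}, "fetch"),
    "fetch": (fetch, {}, "store"),
    "store": (store, {"sink": records}, None),
}
run_crawler(pipeline, "index")
```

Because each stage only reads its parameters and emits data, the same `fetch` or `store` function could be wired into a different crawler by changing the pipeline description alone, which is the kind of re-use the design aims for.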
The basic steps of writing a Memorious crawler:
- Create a YAML crawler configuration file
- Add the stages the crawler needs
- Write code for stage operations (optional)
- Test, rinse, repeat
Documentation
Installation
Install Memorious and run your own crawlers.
CLI Reference
Reference for the command-line tool to run and monitor crawlers.
Crawler Reference
Build your own crawler using YAML configuration.
Development
Links to our Git repository and licensing information.