Data Trumps Software
Effective machine learning requires quality data
- Ocean Protocol https://oceanprotocol.com - is a blockchain-based ecosystem for sharing data. It serves both data producers who want to monetize their data assets and data consumers who need specific data at an affordable price. The ecosystem is still under development, but portions of the infrastructure (all of which will be open source) are already available. If you have Docker installed, you can quickly run their data marketplace demonstration system https://docs.oceanprotocol.com/setup/quickstart/.
- Common Crawl http://commoncrawl.org - is a free source of web crawl data, something that was previously available only to large search engine companies. There are many open source libraries for accessing and processing crawl data. The easiest way to get started is to download a few WARC data segment files to your laptop. My open source Java and Clojure libraries for processing WARC files are at https://github.com/commoncrawl/example-warc-java
- Amazon Public Dataset Program https://aws.amazon.com/opendata/public-datasets/ - is a free service for hosting public datasets. If you have data to share, AWS evaluates applications to contribute data quarterly. To access data sources, search using the form at https://registry.opendata.aws to find useful datasets, then use the S3 bucket URIs (or ARNs) to access them. Most data sources have documentation pages, example client libraries, and examples.
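To give a feel for what processing the Common Crawl WARC files mentioned above involves, here is a minimal Python sketch that reads records from an uncompressed WARC stream. It assumes the basic record layout: a version line such as WARC/1.0, "Name: value" header lines, a blank line, and then a payload of Content-Length bytes. The names iter_warc_records and make_record are hypothetical helpers written for this illustration, and the sample records are synthetic rather than real crawl data. This is not the approach used in the Java and Clojure libraries linked above, and for real work a dedicated WARC library is likely a better choice.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        # Read "Name: value" header lines until the blank line that ends them.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip()] = value.strip()

        # The payload length is given in bytes by the Content-Length header.
        length = int(headers.get("Content-Length", "0"))
        yield headers, stream.read(length)


def make_record(record_type: str, uri: str, body: bytes) -> bytes:
    """Build one synthetic WARC record (hypothetical helper for this demo)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# Two made-up records standing in for a real WARC segment file.
SAMPLE = make_record(
    "response", "http://example.com/", b"HTTP/1.1 200 OK\r\n\r\nhello"
) + make_record("request", "http://example.org/page", b"GET /page HTTP/1.1\r\n\r\n")


if __name__ == "__main__":
    for headers, payload in iter_warc_records(io.BytesIO(SAMPLE)):
        print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

Common Crawl distributes its segment files gzip-compressed, so for a downloaded file you would pass the object returned by gzip.open(path, "rb") from the Python standard library instead of the in-memory sample used here.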
Overview of Ocean Protocol
- Publisher: a service that provides access to data from data producers. Data producers will often also act as publishers of their own data.
- Consumer: any person or organization who needs access to data. Access is via client libraries or web interfaces.
- Marketplace: a service that lists assets and facilitates access to free datasets and datasets available for purchase.
- Verifier: a software service that checks and validates steps in transactions for selling and buying data. A verifier is paid for this service.
- Service Execution Agreement (SEA): a smart contract used by providers, consumers, and verifiers.
- Aquarius: a service for storing and managing metadata for data assets. It uses the off-chain database OceanDB.
- Brizo: used by publishers for managing interactions with marketplaces and data consumers.
- Keeper: a service that runs a blockchain client and uses Ocean Protocol to process smart contracts.
- Pleuston: an example/demo marketplace that you can run locally with Docker on your laptop.