Documentation: Cardano on BigQuery

Documentation of our product Cardano on BigQuery.

Preliminaries

To start querying data, you need to have a Google project. In case you don’t yet have a Google project:

NB. If you don't query the dataset from your own project you'll be getting the error: "Access Denied: Project blockchain-analytics-392322: User does not have bigquery.jobs.create permission in project blockchain-analytics-392322."

This project is open source

The Github repository of this project contains source code and schema files. Feel free to browse:
https://github.com/Blockchain-Data-Analytics/Cardano_on_BigQuery

About BigQuery

According to its official description🔗, BigQuery is a fully managed enterprise data warehouse that helps to manage and analyze data with built-in features like machine learning, geospatial analysis, and business intelligence. BigQuery's serverless architecture allows using SQL queries to get the data you are looking for with zero infrastructure management.

Our motivation to use BigQuery

Cardano’s on-chain data has considerably grown over the last few months. This means that the time to sync the whole history of the blockchain increases accordingly. Running a node and a Db-sync process (mapping the on-chain data to a relational database) now requires more time and a more robust software instance. We estimate these costs to be around $300 per month. As the on-chain data is considered immutable after a number of confirmations or blocks on top, our BigQuery project allows querying these data at very low costs.

Google BigQuery makes it easier to look up data without the need to run specialized software. And using Google Looker Studio, one can seamlessly create advanced visualizations and dashboards based on this data.

About Db-sync

The Db-sync project was initiated by IOG to map Cardano's blockchain data to a relational model in PostgreSQL. Today, this project is continued in the Intersect MBO. Source code and documentation about Db-sync can be found on:
https://github.com/IntersectMBO/cardano-db-sync

Costs

The cost of querying data from BigQuery is paid by the user initiating the query. BigQuery charges an amount analogous to the amount of the data being transferred for the query.

Currently, the cost is $5 per terabyte (TB) of data queried. Find more information on the BigQuery pricing page🔗.

Note: the data tables in this project are divided by epoch numbers so that queries restrict access to the epochs they need, resulting in less data transfer and thus incurring less costs.

Data validation

After every epoch we validate the exported data on BigQuery against Db-sync in a byte-by-byte manner to ensure that the complete data has been exported and that it matches the source.

Reports about the success of the validation are available to registered members from the Cardano_on_BigQuery product site.

Update process

Currently, every hour we export the newest data from Db-sync to the BigQuery dataset.

explanation of the data update process

The process includes a safety margin: a number of slots are deleted from BigQuery and reexported. This should practically eliminate any data being exported which later is walked back due to chain forks. Moreover, the latest slots are not included for the same reason.