Getting started with the Datasette Cloud API
Sept. 28, 2023, 10:20 p.m. Simon Willison
Your team's space on Datasette Cloud includes a comprehensive JSON API. This can be used to both read and write data to your private databases. In this tutorial you'll learn how to use Python to import data from the Federal Register into a Datasette Cloud space, and how to run that code on a schedule using GitHub Actions.
Datasette Cloud API features include:
- Create a new table, either by specifying the columns you want or by passing in a list of JSON objects and letting Datasette design a schema to fit them
- Insert rows into an existing database table
- Retrieve rows from a table, with filtering, sorting and pagination
- Execute read-only SQL queries against your database and return the results as JSON
- Issue finely-grained tokens that can only read or write to specific tables within your database
The API can be used to build entirely custom applications, but is also designed for running extract, transform and load (ETL) jobs in automation environments such as GitHub Actions.
Authentication with API tokens
Datasette Cloud spaces are private, so every API request needs to be authenticated using an API token.
These tokens can be created at the /-/api/tokens/create page within your space, linked to from the top right navigation menu as "Create API token".
Your tokens act on behalf of your user account, performing any operation within Datasette that your user has permission to do.
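To give a sense of what that looks like in practice, here's a minimal Python sketch of an authenticated request. The space URL is a placeholder, and the /data.json path assumes a database named data, like the one used later in this tutorial:

import json
import urllib.request

# Placeholder values - substitute your own space URL and API token
SPACE_URL = "https://your-space.datasette.cloud"
API_TOKEN = "..."  # created via the "Create API token" page

# Every request is authenticated with an Authorization: Bearer header
req = urllib.request.Request(
    SPACE_URL + "/data.json",
    headers={"Authorization": "Bearer {}".format(API_TOKEN)},
)
with urllib.request.urlopen(req) as response:
    print(json.loads(response.read()))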
Finely-grained access control
Rather than granting a token all of your permissions, you can instead create a token that is restricted to only being able to perform a subset of the actions that are available to your user.
This is great practice from a security point of view, as it limits the amount of damage that could be done if your token was compromised somehow.
As an additional security measure, each token issued for your space is displayed along with the date it was last used to access the API, and can be revoked by clicking the "Revoke" button.
The JSON API
Datasette Cloud is running a preview of the forthcoming Datasette 1.0 release - currently 1.0 alpha 7. This means the API uses the new JSON format that is planned to be stable from the 1.0 release onwards.
Every page in Datasette has a corresponding JSON API - usually available by adding .json to the end of the URL.
You can start experimenting with the API using a public, un-authenticated database on the latest.datasette.io demo instance:
- https://latest.datasette.io/fixtures/roadside_attractions.json is the JSON for the roadside_attractions table.
- This example demonstrates a SQL query that returns a subset of the columns.
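For example, here's a short Python sketch that fetches the first of those URLs. No token is needed because the demo database is public, _size is a standard Datasette parameter for limiting the number of rows returned, and the code assumes the new 1.0-style format described above, which returns rows as a list of JSON objects under a rows key:

import json
import urllib.request

# Fetch the first two rows of the public roadside_attractions demo table
url = "https://latest.datasette.io/fixtures/roadside_attractions.json?_size=2"
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read())

for row in data["rows"]:
    print(row)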
The Datasette Cloud API also includes endpoints for writing to your database. The easiest way to explore those is to use the API explorer tool, available at /-/api within your space.
Using GitHub Actions to import data from the Federal Register into Datasette Cloud
Let's build something with the API.
The Federal Register is the official daily publication for notices from federal agencies and related organizations across the USA government.
It includes a neat JSON API, which can be used to retrieve metadata about documents published in the register. You can preview results from that API here:
https://www.federalregister.gov/api/v1/documents.json
Let's build a scheduled GitHub Action that scrapes the Federal Register and pushes the results to a table in Datasette Cloud.
We'll run this on a schedule, fetching documents from the current and previous day to ensure we don't miss anything.
The Python script
I'm going to do the work of fetching and then writing the data to Datasette Cloud using a Python script. I'll write it without any dependencies outside of the standard library, to simplify the process of running it in GitHub Actions later on.
First, the imports:
import datetime
import urllib.parse
import urllib.request
import json
import os
Next, a function to fetch documents for a specific date from the Federal Register API. This needs to handle pagination - the API returns a next_page_url key if there are more results available, indicating the URL for the next page of results.
def fetch_documents(date):
    base_url = "https://www.federalregister.gov/api/v1/documents.json"
    url = (
        base_url
        + "?"
        + urllib.parse.urlencode(
            {
                "conditions[publication_date][is]": date.strftime("%m/%d/%Y"),
                "per_page": 100,
            }
        )
    )
    while url:
        with urllib.request.urlopen(url) as response:
            data = json.loads(response.read())
        for result in data["results"]:
            yield result
        url = data.get("next_page_url")
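As a quick sanity check - not part of the final script - you could preview a few of the documents returned for a given date like this, relying on the document_number and title fields the Federal Register API includes by default:

import datetime
import itertools

# Print the document number and title of the first three documents for a date
for doc in itertools.islice(fetch_documents(datetime.date(2023, 9, 28)), 3):
    print(doc["document_number"], doc["title"])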
Now we need a function for sending documents to the Datasette Cloud API. We're going to use the /data/-/create endpoint, which accepts the name of a table and a list of rows, and creates that table with the necessary schema if it doesn't already exist.
The API call will look like this:
POST https://demos.datasette.cloud/data/-/create
{
    "table": "documents",
    "rows": [{...}, {...}, ...],
    "pk": "document_number",
    "replace": true
}
We specify that the primary key on the table should be the document_number from the Federal Register API, and that if a row with that primary key already exists it should be replaced with the new row.
Here's the Python function to make that API call:
def send_documents(documents, api_token):
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer {}".format(api_token),
    }
    body = {
        "table": "documents",
        "rows": documents,
        "pk": "document_number",
        "replace": True,
    }
    req = urllib.request.Request(
        "https://demos.datasette.cloud/data/-/create",
        data=json.dumps(body).encode(),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req) as response:
        print(response.status, len(documents))
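Because we set a primary key and "replace": true, re-running this with the same documents overwrites the existing rows rather than creating duplicates - which is what makes it safe to run the script repeatedly on a schedule.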
The data provided by the Federal Register API is almost exactly the shape we want for submitting to Datasette, with the exception of the agencies field, which contains nested JSON that looks like this:
[
    {
        "raw_name": "DEPARTMENT OF COMMERCE",
        "name": "Commerce Department",
        "id": 54,
        "url": "https://www.federalregister.gov/agencies/commerce-department",
        "json_url": "https://www.federalregister.gov/api/v1/agencies/54",
        "parent_id": null,
        "slug": "commerce-department"
    },
    {
        "raw_name": "Foreign-Trade Zones Board",
        "name": "Foreign-Trade Zones Board",
        "id": 208,
        "url": "https://www.federalregister.gov/agencies/foreign-trade-zones-board",
        "json_url": "https://www.federalregister.gov/api/v1/agencies/208",
        "parent_id": 54,
        "slug": "foreign-trade-zones-board"
    }
]
We could store this directly in Datasette, but I decided to instead store a JSON array of just the names of the departments. Datasette has features for filtering and faceting arrays of plain strings which would be useful for slicing and dicing this data.
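For the example above, that means the agencies value we store becomes simply ["Commerce Department", "Foreign-Trade Zones Board"].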
The Datasette API accepts rows in batches of up to 100. Here's a function that ties this all together - for a specific date it fetches documents from the Federal Register, transforms their agencies key to the new shape and then sends them to Datasette Cloud in batches:
def save_documents_for_date(date, api_token):
    batch = []
    for document in fetch_documents(date):
        batch.append(
            dict(
                document,
                agencies=[
                    # Most agencies have a name, but some need to fall back to raw_name
                    agency.get("name", agency["raw_name"])
                    for agency in document["agencies"]
                ],
            )
        )
        if len(batch) == 100:
            send_documents(batch, api_token)
            batch = []
    if batch:
        send_documents(batch, api_token)
Finally, some Python boilerplate to read our API token from an environment variable and call that function.
We actually call it twice: once for today and once for yesterday. This means we can run it on a repeating schedule to catch any documents published since the last run, while also backfilling from the day before in case we missed any.
if __name__ == "__main__":
    token = os.environ.get("DATASETTE_API_TOKEN")
    for date in (
        datetime.date.today(),
        datetime.date.today() - datetime.timedelta(days=1),
    ):
        save_documents_for_date(date, token)
Here's the full finished script.
Creating an API token
We need an API token that can create tables, insert rows and update rows in our data database.
I gave it the ability to view rows too, in case it needs to read back data it has written at some point in the future.
Here's how I did that using the "create token" form:
We can swap this out for an even more tightly scoped token later on if we want to - one that only has write access to the documents table.
Running it on a schedule using GitHub Actions
There are many options for running a Python script like this on a schedule. One of my favorites is via GitHub Actions.
I created a repository for this project at simonw/federal-register-to-datasette, added the fetch_documents.py script and added a .github/workflows/main.yml file containing this:
name: Fetch Federal Register documents

on:
  # Push to repository
  push:
  # Can be activated manually:
  workflow_dispatch:
  # Runs twice a day
  schedule:
    - cron: "0 0,12 * * *"

jobs:
  fetch-documents:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Fetch documents
        run: python fetch_documents.py
        # Make the token available as an environment variable
        env:
          DATASETTE_API_TOKEN: ${{ secrets.DATASETTE_API_TOKEN }}
Since we don't have any extra Python dependencies this is a really simple workflow.
I added a GitHub Actions secret called DATASETTE_API_TOKEN to the repository, containing the API token I created earlier.
Now the script will run every time I push to the repository, whenever I trigger it manually with "run this workflow", and automatically twice a day on the schedule - fetching the latest documents from the Federal Register and writing them to Datasette Cloud.
Enabling full-text search
Having created the table, let's configure it for full-text search.
Here's how to do that using the table cog menu in Datasette Cloud:
Now we can run searches for things like FAA or flood hazard.
The result
I've made the resulting table public at demos.datasette.cloud/data/documents. Here's what it looks like with some facets applied to allow filtering by agency and document type:
Now that it's in Datasette Cloud we can filter it, run SQL queries against it and build further integrations on top of it using the JSON API.
For example, here's the JSON API result for a search for "flood" sorted descending by publication date.
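Here's a sketch of fetching those search results from Python, using Datasette's standard _search and _sort_desc table parameters - no token is needed because this table has been made public:

import json
import urllib.parse
import urllib.request

# Full-text search for "flood", sorted by publication_date descending
params = urllib.parse.urlencode({
    "_search": "flood",
    "_sort_desc": "publication_date",
})
url = "https://demos.datasette.cloud/data/documents.json?" + params

with urllib.request.urlopen(url) as response:
    results = json.loads(response.read())

for row in results["rows"]:
    print(row["publication_date"], row["title"])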
Bonus: locking it down further
The token we used for this project still has quite extensive permissions - it can create new tables and insert and update rows in any table within the data database.
Now that we've created the table, we can lock our system down further by creating a new, more tightly scoped token. We can grant it just insert-row and update-row access to our data/documents table.
Here's what the form for creating that token looks like:
We'll need to update our code as well. We can't use this new token with the /data/-/create API endpoint - instead, it needs to use /data/documents/-/insert - and it should send just the rows and replace keys.
Here's the updated send_documents() function:
def send_documents(documents, api_token):
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer {}".format(api_token),
    }
    body = {
        "rows": documents,
        "replace": True,
    }
    req = urllib.request.Request(
        "https://demos.datasette.cloud/data/documents/-/insert",
        data=json.dumps(body).encode(),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req) as response:
        print(response.status, len(documents))
Having updated our code and deployed our new token, we can safely revoke the old one on the "API tokens" page.
Here's the updated script using that new token.
Sign up for the preview
Want to try this out for your own organization? Datasette Cloud is currently in preview, and you can request an invitation to the preview here.
Or run this yourself
Datasette Cloud runs on open source components from the Datasette ecosystem. The following components should be everything you need to get a version of this running on your own server:
- Datasette 1.0 alpha - the 1.0 alpha series introduces the JSON write API used in this tutorial
- datasette-auth-tokens alpha 0.4 introduces database-backed API tokens with finely grained permissions