Getting started with the Datasette Cloud API
Sept. 28, 2023, 10:20 p.m. Simon Willison
Your team's space on Datasette Cloud includes a comprehensive JSON API. This can be used to both read and write data to your private databases. In this tutorial you'll learn how to use Python to import data from the Federal Register into a Datasette Cloud space, and how to run that code on a schedule using GitHub Actions.
Datasette Cloud API features include:
- Create a new table, either by specifying the columns you want or by passing in a list of JSON objects and letting Datasette design a schema to fit them
- Insert rows into an existing database table
- Retrieve rows from a table, with filtering, sorting and pagination
- Execute read-only SQL queries against your database and return the results as JSON
- Issue finely-grained tokens that can only read or write to specific tables within your database
The API can be used to build entirely custom applications, but is also designed for running extract, transform and load (ETL) jobs in automation environments such as GitHub Actions.
Authentication with API tokens
Datasette Cloud spaces are private, so every API request needs to be authenticated using an API token.
These tokens can be created at the /-/api/tokens/create page within your space, linked to from the top right navigation menu as "Create API token".
Your tokens act on behalf of your user account, performing any operation within Datasette that your user has permission to do.
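To give a sense of what that looks like in practice, here's a minimal Python sketch of an authenticated request. The space URL is a placeholder, and the /data.json path assumes a database named data, like the one used later in this tutorial:

import json
import urllib.request

# Placeholder values - substitute your own space URL and API token
SPACE_URL = "https://your-space.datasette.cloud"
API_TOKEN = "..."  # created via the "Create API token" page

# Every request is authenticated with an Authorization: Bearer header
req = urllib.request.Request(
    SPACE_URL + "/data.json",
    headers={"Authorization": "Bearer {}".format(API_TOKEN)},
)
with urllib.request.urlopen(req) as response:
    print(json.loads(response.read()))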
Finely-grained access control
Rather than granting a token all of your permissions, you can instead create a token that is restricted to only being able to perform a subset of the actions that are available to your user.
This is great practice from a security point of view, as it limits the amount of damage that could be done if your token was compromised somehow.
As an additional security measure, each token issued for your space is displayed along with the date it was last used to access the API, and can be revoked by clicking the "Revoke" button.
The JSON API
Datasette Cloud is running a preview of the forthcoming Datasette 1.0 release - currently 1.0 alpha 7. This means the API uses the new JSON format that is planned to be stable from the 1.0 release onwards.
Every page in Datasette has a corresponding JSON API - usually available by adding .json to the end of the URL.
You can start experimenting with the API using a public, un-authenticated database on the latest.datasette.io demo instance:
- https://latest.datasette.io/fixtures/roadside_attractions.json is the JSON for the roadside_attractions table.
- This example demonstrates a SQL query that returns a subset of the columns.
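For example, here's a short Python sketch that fetches the first of those URLs. No token is needed because the demo database is public, _size is a standard Datasette parameter for limiting the number of rows returned, and the code assumes the new 1.0-style format described above, which returns rows as a list of JSON objects under a rows key:

import json
import urllib.request

# Fetch the first two rows of the public roadside_attractions demo table
url = "https://latest.datasette.io/fixtures/roadside_attractions.json?_size=2"
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read())

for row in data["rows"]:
    print(row)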
The Datasette Cloud API also includes endpoints for writing to your database. The easiest way to explore those is to use the API explorer tool, available at /-/api within your space.
Using GitHub Actions to import data from the Federal Register into Datasette Cloud
Let's build something with the API.
The Federal Register is the official daily publication for notices from federal agencies and related organizations across the USA government.
It includes a neat JSON API, which can be used to retrieve metadata about documents published in the register. You can preview results from that API here:
https://www.federalregister.gov/api/v1/documents.json
Let's build a scheduled GitHub Action that scrapes the Federal Register and pushes the results to a table in Datasette Cloud.
We'll run this on a schedule, fetching documents from the current and previous day to ensure we don't miss anything.
The Python script
I'm going to do the work of fetching and then writing the data to Datasette Cloud using a Python script. I'll write it without any dependencies outside of the standard library, to simplify the process of running it in GitHub Actions later on.
First, the imports:
import datetime
import urllib.parse
import urllib.request
import json
import os
Next, a function to fetch documents for a specific date from the Federal Register API. This needs to handle pagination - the API returns a next_page_url key if there are more results available, indicating the URL for the next page of results.
def fetch_documents(date):
    base_url = "https://www.federalregister.gov/api/v1/documents.json"
    url = (
        base_url
        + "?"
        + urllib.parse.urlencode(
            {
                "conditions[publication_date][is]": date.strftime("%m/%d/%Y"),
                "per_page": 100,
            }
        )
    )
    while url:
        with urllib.request.urlopen(url) as response:
            data = json.loads(response.read())
        for result in data["results"]:
            yield result
        url = data.get("next_page_url")
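As a quick sanity check - not part of the final script - you could preview a few of the documents returned for a given date like this, relying on the document_number and title fields the Federal Register API includes by default:

import datetime
import itertools

# Print the document number and title of the first three documents for a date
for doc in itertools.islice(fetch_documents(datetime.date(2023, 9, 28)), 3):
    print(doc["document_number"], doc["title"])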
Now we need a function for sending documents to the Datasette Cloud API. We're going to use the /data/-/create endpoint, which accepts the name of a table and a list of rows, and creates that table with the necessary schema if it doesn't already exist.
The API call will look like this:
POST https://demos.datasette.cloud/data/-/create
{
    "table": "documents",
    "rows": [{...}, {...}, ...],
    "pk": "document_number",
    "replace": true
}
We specify that the primary key on the table should be the document_number from the Federal Register API, and that if a row with that primary key already exists it should be replaced with the new row.
Here's the Python function to make that API call:
def send_documents(documents, api_token):
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer {}".format(api_token),
    }
    body = {
        "table": "documents",
        "rows": documents,
        "pk": "document_number",
        "replace": True,
    }
    req = urllib.request.Request(
        "https://demos.datasette.cloud/data/-/create",
        data=json.dumps(body).encode(),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req) as response:
        print(response.status, len(documents))
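Because we set a primary key and "replace": true, re-running this with the same documents overwrites the existing rows rather than creating duplicates - which is what makes it safe to run the script repeatedly on a schedule.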
The data provided by the Federal Register API is almost exactly the shape we want for submitting to Datasette, with the exception of the agencies field, which contains nested JSON that looks like this:
[
    {
        "raw_name": "DEPARTMENT OF COMMERCE",
        "name": "Commerce Department",
        "id": 54,
        "url": "https://www.federalregister.gov/agencies/commerce-department",
        "json_url": "https://www.federalregister.gov/api/v1/agencies/54",
        "parent_id": null,
        "slug": "commerce-department"
    },
    {
        "raw_name": "Foreign-Trade Zones Board",
        "name": "Foreign-Trade Zones Board",
        "id": 208,
        "url": "https://www.federalregister.gov/agencies/foreign-trade-zones-board",
        "json_url": "https://www.federalregister.gov/api/v1/agencies/208",
        "parent_id": 54,
        "slug": "foreign-trade-zones-board"
    }
]
We could store this directly in Datasette, but I decided to instead store a JSON array of just the names of the departments. Datasette has features for filtering and faceting arrays of plain strings which would be useful for slicing and dicing this data.
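For the example above, that means the agencies value we store becomes simply ["Commerce Department", "Foreign-Trade Zones Board"].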
The Datasette API accepts rows in batches of up to 100. Here's a function that ties this all together - for a specific date it fetches documents from the Federal Register, transforms their agencies key to the new shape and then sends them to Datasette Cloud in batches:
def save_documents_for_date(date, api_token):
    batch = []
    for document in fetch_documents(date):
        batch.append(
            dict(
                document,
                agencies=[
                    # Most agencies have a name, but some need to fall back to raw_name
                    agency.get("name", agency["raw_name"])
                    for agency in document["agencies"]
                ],
            )
        )
        if len(batch) == 100:
            send_documents(batch, api_token)
            batch = []
    if batch:
        send_documents(batch, api_token)
Finally, some Python boilerplate to read our API token from an environment variable and call that function.
We actually call it twice: once for today and once for yesterday. This means we can run it on a repeating schedule to catch any documents published since the last run, while also backfilling from the day before in case we missed any.
if __name__ == "__main__":
    token = os.environ.get("DATASETTE_API_TOKEN")
    for date in (
        datetime.date.today(),
        datetime.date.today() - datetime.timedelta(days=1),
    ):
        save_documents_for_date(date, token)
Here's the full finished script.
Creating an API token
We need an API token that can create tables, insert rows and update rows in our data database.
I gave it the ability to view rows too, in case it needs to read back data it has written at some point in the future.
Here's how I did that using the "create token" form:
We can swap this out for an even more tightly scoped token later on if we want to - one that only has write access to the documents table.
Running it on a schedule using GitHub Actions
There are many options for running a Python script like this on a schedule. One of my favorites is via GitHub Actions.
I created a repository for this project at simonw/federal-register-to-datasette, added the fetch_documents.py script and added a .github/workflows/main.yml file containing this:
name: Fetch Federal Register documents

on:
  # Push to repository
  push:
  # Can be activated manually:
  workflow_dispatch:
  # Runs twice a day
  schedule:
    - cron: "0 0,12 * * *"

jobs:
  fetch-documents:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Fetch documents
        run: python fetch_documents.py
        # Make the token available as an environment variable
        env:
          DATASETTE_API_TOKEN: ${{ secrets.DATASETTE_API_TOKEN }}
Since we don't have any extra Python dependencies this is a really simple workflow.
I added a GitHub Actions secret called DATASETTE_API_TOKEN to the repository, containing the API token I created earlier.
Now the script will run every time I push to the repository, whenever I trigger it manually with "run this workflow", and automatically twice a day on the schedule - fetching the latest documents from the Federal Register and writing them to Datasette Cloud.
Enabling full-text search
Having created the table, let's configure it for full-text search.
Here's how to do that using the table cog menu in Datasette Cloud:
Now we can run searches for things like FAA or flood hazard.
The result
I've made the resulting table public at demos.datasette.cloud/data/documents. Here's what it looks like with some facets applied to allow filtering by agency and document type:
Now that it's in Datasette Cloud we can filter it, run SQL queries against it and build further integrations on top of it using the JSON API.
For example, here's the JSON API result for a search for "flood" sorted descending by publication date.
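Here's a sketch of fetching those search results from Python, using Datasette's standard _search and _sort_desc table parameters - no token is needed because this table has been made public:

import json
import urllib.parse
import urllib.request

# Full-text search for "flood", sorted by publication_date descending
params = urllib.parse.urlencode({
    "_search": "flood",
    "_sort_desc": "publication_date",
})
url = "https://demos.datasette.cloud/data/documents.json?" + params

with urllib.request.urlopen(url) as response:
    results = json.loads(response.read())

for row in results["rows"]:
    print(row["publication_date"], row["title"])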
Bonus: locking it down further
The token we used for this project still has quite extensive permissions - it can create new tables and insert and update rows in any table within the data database.
Now that we've created the table, we can lock our system down further by creating a new, more tightly scoped token. We can grant it just insert-row and update-row access to our data/documents table.
Here's what the form for creating that token looks like:
We'll need to update our code as well. We can't use this new token with the /data/-/create API endpoint - instead, it needs to use /data/documents/-/insert - and it should send just the rows and replace keys.
Here's the updated send_documents() function:
def send_documents(documents, api_token):
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer {}".format(api_token),
    }
    body = {
        "rows": documents,
        "replace": True,
    }
    req = urllib.request.Request(
        "https://demos.datasette.cloud/data/documents/-/insert",
        data=json.dumps(body).encode(),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req) as response:
        print(response.status, len(documents))
Having updated our code and deployed our new token, we can safely revoke the old one on the "API tokens" page.
Here's the updated script using that new token.
Sign up for the preview
Want to try this out for your own organization? Datasette Cloud is currently in preview, and you can request an invitation to the preview here.
Or run this yourself
Datasette Cloud runs on open source components from the Datasette ecosystem. The following components should be everything you need to get a version of this running on your own server:
- Datasette 1.0 alpha - the 1.0 alpha series introduces the JSON write API used in this tutorial
- datasette-auth-tokens alpha 0.4 introduces database-backed API tokens with finely grained permissions