Extracting data from unstructured text and images with Datasette and GPT-4 Turbo
April 9, 2024, 10:26 p.m. Simon Willison
Clean data, with well defined columns and rows, is a beautiful thing - the ideal starting point for any data analysis or visualization project. Sadly, very little of the world's interesting data is published in a structured format that we can start using straight away. datasette-extract is a new tool for Datasette that uses GPT-4 Turbo to create and populate database tables using data extracted from unstructured text and images.
Processing unstructured data is one of the most directly useful applications of Large Language Models - the technology behind tools like ChatGPT and Google Gemini.
OpenAI today made their GPT-4 Turbo with Vision model generally available via their API.
Datasette Extract uses that API to convert text and images into structured table data. Here's a demo video of the new plugin in action:
How to use Datasette Extract
Once the datasette-extract
plugin is installed, Datasette will gain a new option in the Database action menu: "Create table with AI extracted data".
This will take you to a page where you can configure the table you would like to create.
Here I'm creating an events
table with event_title
, event_date
, description
, venue_name
and start_time
columns.
All of these are text columns. I've set a hint of YYYY-MM-DD
for the event_date
column and HH:MM 24hr
for the start_time
column.
Next, paste data into the box below. This can be completely unstructured - in this case I hit Command-A
and then Command-C
on the upcoming events calendar for the Bach Dancing & Dynamite Society jazz venue in Half Moon Bay.
Hit that "Extract" button an Datasette will send the text to OpenAI, along with the schema for the table you want to create.
The next screen shows the extracted data as it flows in.
Once the extraction is complete you'll be redirected to the newly created database table:
In addition to pasted text, the plugin can handle images. Here's a screenshot image I took of event information from the Visit Half Moon Bay events calendar:
In the demo I uploaded this image using the "Extract data into this table with AI" table action menu, which meant I didn't have to configure the columns again.
The first time I tried this it didn't quite work: the events in the image did not have a visible year, and GPT-4 Turbo defaulted to assuming they took place in 2019 for some reason.
I fixed this by adding an extra hint to the event_date
column telling it to assume 2024. Running the import again successfully extracted the events with the correct dates.
Try this out yourself
The Datasette Extract plugin is available for you to use with the Datasette open source project right now, or get in touch if you'd like to try It out on Datasette Cloud.