TL;DR: Download the shortcut here and follow the instructions in the appendix to get yourself set up.
Apple Shortcuts is a funny tool. On the one hand, it is the closest thing to free programmatic access to the famously closed platform of iOS. It has some genuinely useful functions, making some cool things easy to set up.1 And it has an active community on Reddit, making your questions easy to answer.
But that community on Reddit is also the closest thing it has to documentation. It has a janky drag-and-drop interface that doesn't allow for comments, useful error codes, or even editing the top of the script. And half the time you try to do something, it turns out to be impossible.
In short, Apple shortcuts is fun in the way that a puzzle game is fun, except the puzzle game has a higher production value.
LLMs (Large Language Models) are similarly spiky. There are moments when it genuinely scares me how good the models are, and other times when I'm ... not scared at all. But glaring weaknesses aside, my issue with AI systems is often not whether an LLM can do something, but whether it's easy to get data into and out of the model.
A great example of this is transcribing writing. I handwrite a LOT2. The first draft of this post was handwritten. Every essay, most emails, heck, even some texts start life on a piece of paper. So I've spent a lot of hours typing up what I've written, puzzling over my handwriting or awkward turns of phrase. What if I didn't need to do this?
Current LLMs are actually quite good at taking a picture of handwritten text and transcribing the contents, but it's a pain to (1) take a picture on my phone, (2) send it to my computer, (3) download the picture, (4) upload it to the chatbot interface, and then copy the results. Why not just use my phone? It's a little easier, sure, but I still have to (1) take the picture, (2) copy the text, (3) send it to my computer, and (4) copy it from my email to wherever it needs to go. Even if that saved time, the effort didn't feel worth it. Until now...
I think you see where this is going.
One of Apple Shortcuts' saving graces is that it allows you to make API requests via a URL (see footnote for an explanation of what an API is)3. So I thought to myself, "how hard would it be to make a shortcut that takes a picture and sends it to the Gemini API?"
Not that hard, it turns out. But let's buckle up because it's kind of a long ride...4
The Gemini API request URL looks like this5:
https://generativelanguage.googleapis.com/<apiversion>/models/<model>:<command>?key=<API key>
where <> denotes a variable.
Let's break this down: https://generativelanguage.googleapis.com/ is the API endpoint. It's the consistent entry point for this specific API. That is, even if you run multiple commands, this is the part that never changes.
The <apiversion> denotes which version of the API you are using. I used v1beta, but only because the sample code I took from Google used v1beta.6
There are many models you can use.7 I chose gemini-2.0-flash-exp. It's their newest model, and it's fast, cheap, and very competent.
There are many <commands> you can run, but we want the model to generate text, so our command is generateContent.
And finally, the API key is Google's way of knowing who is making the request. To make an API key, go to this page and follow the instructions. Once you have an API key, you can either store it as a variable or hard-code it into your URL.
All in all, our URL looks like8:
https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-exp:generateContent?key=<APIkey>
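To make the structure concrete, here's how you might assemble that URL in Python. This is just an illustrative sketch; YOUR_API_KEY is a placeholder, not a real key.

```python
# Assemble the Gemini request URL from its parts:
# endpoint, API version, model, command, and key.
endpoint = "https://generativelanguage.googleapis.com"
api_version = "v1beta"
model = "gemini-2.0-flash-exp"
command = "generateContent"
api_key = "YOUR_API_KEY"  # placeholder -- substitute your own key

url = f"{endpoint}/{api_version}/models/{model}:{command}?key={api_key}"
```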
Thankfully, it looks quite similar in Apple Shortcuts.9
At this point, you might be thinking "great, but where is the actual image I'm going to be sending?" That's a great question. We send the prompt and image in the body of the request, just as you would with curl!
curl requests are a standard way to make requests over the web -- they're detailed messages sent to web services.10 They let you use headers to specify what type of content you are sending and what responses you expect, specify the HTTP method you are using (like POST to send data and GET to retrieve it), and include extra data, separate from the URL parameters, in the body of the request.
curl requests have a specific shape and look to them when used on the command line, but I won't worry about explaining that here, because Shortcuts deals with them in a slightly different way.
Remember the URL I showed above? There is a Shortcuts action called "Get contents of URL" which takes the URL we have defined, calls it, and returns whatever the server responds with. If we click the little arrow, we get more options, which correspond to the options in a curl request.
Our method is POST, because we are sending data that will create some other data, which is then returned to us. Our Content-Type header is application/json, which is a fancy way of saying that the body we send is JSON (more details below). And our body is a file, which we've called JSON_Payload. But what's in JSON_Payload? Keep reading ...
We have two pieces of data we need to send to Gemini: our image, and a prompt saying how we want Gemini to deal with the image. To contain this information in one message, we will use JSON, which is a way to store and send dictionaries.
What is a dictionary? It is a way of associating keys with values: you put a key in and it returns the associated value. At its core, ours would look something like dict = {"prompt": <our-prompt>, "image": <our-image written as text>}. If you look up dict["prompt"], you'll get back whatever our prompt is.
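In Python syntax, that same idea looks like this (the values here are made up for illustration):

```python
# A dictionary maps keys to values.
request_data = {
    "prompt": "Please transcribe this page.",
    "image": "aGVsbG8=",  # image bytes written out as text
}

# Looking up a key returns its associated value.
request_data["prompt"]  # -> "Please transcribe this page."
```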
However, in the real world, there is a little more we have to specify, so it ends up looking like
{
  "contents": [{
    "parts": [
      { "text": <Our Prompt> },
      { "inline_data": { "mime_type": "image/jpeg", "data": <Base 64 Encoded> } }
    ]
  }]
}
The exact breakdown doesn't really matter, but two things are worth noticing. First, the mime_type is just what type of data is being sent (in this case, a JPEG image). Second, notice the <Base 64 Encoded>.
Base 64:
When passing things over the web, it can be useful to pass everything as plain text, rather than in a more complicated scheme. As best as I can tell from the Wikipedia page, Base64 is one way to turn binary data (like an image) into text that can be more consistently parsed, at least historically.11 Shortcuts does this for us:12
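If you're curious what the whole request looks like outside of Shortcuts, here's a minimal Python sketch under a few assumptions: the URL is the one from above, the prompt and image bytes are placeholders you'd supply yourself, and I'm using the standard library's urllib rather than anything Shortcuts does internally.

```python
import base64
import json
import urllib.request

def build_payload(prompt: str, image_bytes: bytes) -> dict:
    """Build the JSON body: our prompt plus a Base64-encoded JPEG."""
    return {
        "contents": [{
            "parts": [
                {"text": prompt},
                {"inline_data": {
                    "mime_type": "image/jpeg",
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
            ]
        }]
    }

def transcribe(url: str, prompt: str, image_bytes: bytes) -> dict:
    """POST the payload with a Content-Type of application/json."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt, image_bytes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The shortcut does the same three things: encode the image, wrap it in JSON, and POST it to the URL.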
Congrats! We've now made a valid request to the Gemini API! What now?
The API will return something that looks like:13
{
  "usageMetadata": { "promptTokenCount": 266, "totalTokenCount": 273, "candidatesTokenCount": 7 },
  "modelVersion": "gemini-2.0-flash-exp",
  "candidates": [{
    "avgLogprobs": -0.05672707302229451,
    "content": { "parts": [{ "text": "You are holding up one finger." }], "role": "model" },
    "finishReason": "STOP",
    "safetyRatings": [
      { "category": "HARM_CATEGORY_HATE_SPEECH", "probability": "NEGLIGIBLE" },
      { "category": "HARM_CATEGORY_DANGEROUS_CONTENT", "probability": "NEGLIGIBLE" },
      { "category": "HARM_CATEGORY_HARASSMENT", "probability": "NEGLIGIBLE" },
      { "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "probability": "NEGLIGIBLE" }
    ]
  }]
}
Our job is to extract the little value next to "text", which says "You are holding up one finger."
Navigating Nested Dictionaries
If we squint closely, we see that the path to "text" goes like this:
- We get the value of candidates, which is a list
- We take the first element of the candidates list, which is a dictionary
- We get the value of “content”, which is a dictionary
- We get the value of “parts” from the value of content, which is a list
- We get the first element of that list, which is a dictionary
- We get the value of "text"
Phew ... that's a lot. But many languages let us stack all these steps together into one line (a little like specifying a full file path). In Python, we could just write
response["candidates"][0]["content"]["parts"][0]["text"]
where a["b"] means getting the value corresponding to the key "b", and c[0] means getting the first element of the list c. It's long, but honestly pretty easy to read.
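To see both the step-by-step walk and the one-liner in action, here they are applied to a trimmed-down version of the sample response above:

```python
# A trimmed-down version of the API response shown earlier.
response = {
    "candidates": [{
        "content": {
            "parts": [{"text": "You are holding up one finger."}],
            "role": "model",
        },
        "finishReason": "STOP",
    }]
}

# Step by step...
candidates = response["candidates"]      # a list
first_candidate = candidates[0]          # a dictionary
content = first_candidate["content"]     # a dictionary
parts = content["parts"]                 # a list
first_part = parts[0]                    # a dictionary
text = first_part["text"]                # the string we want

# ...which matches the one-liner exactly.
assert text == response["candidates"][0]["content"]["parts"][0]["text"]
```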
In Shortcuts, nothing is that nice. We have to write everything out. One. By. One.
At each step in this chain we get value from dictionary or get first item from list, and then set the result to a variable name. It looks like this:
And that's not even the entire thing ... I'll spare you the rest.14
It's not that hard, but it can be frustrating if you either use quotes around your key or forget to set the variable name.15
After finishing out all the getting values from dictionaries and lists, we have our text! Are we done?
Not quite! We still need to get the transcribed text off the phone. Thankfully, this part is easy.
There is a "Send email" action in shortcuts, with some fairly straightforward options:
You need to have the iOS Mail app set up and use your email address there, but that's fairly straightforward. I would also input your destination email and disable the "Show Compose Sheet" option so that the transcript sends automatically, but again, that's optional.16
And there you have it! You have a functioning shortcut which takes a picture, transcribes it, and then sends you an email with the transcribed text.
It's a lot to describe, but this is all something I did in about 3 hours one evening after work.17 I genuinely use it every day, and I learned a ton about APIs and requests by doing this.
What's next? I'd like to modify the shortcut to take multiple images in one go, so that I don't have an essay like this split over 8 separate emails! But that's a project for another time ...
Setting up the shortcut
If you're interested in using this shortcut yourself, follow these instructions:18
- Download the API Keys Template Shortcut and the transcription shortcut on an iOS device.
- Just click the above links on your desired device
- Go to this website to obtain a Gemini API Key. (You can do this on a computer, even if it isn't Apple.)
- After you make the API key, copy it somewhere safe, where you can access it but other people can't see it.
- Important: Make sure that the API key you create is on the paid tier. It costs literally less than a tenth of a cent per photo, and it prevents Google from training on your data. If you put anything sensitive in, you'll want the paid API key.
- To make sure the API key is on the paid tier, click the button that says "billing" under the plan header and edit your credit/debit card information.
- Edit the API Keys Template Shortcut
- If necessary, navigate to the shortcuts app
- Press the three dots to edit the shortcut (may be different on MacOS)
- Enter your API key where it says to do so.
- Edit the Gemini Transcription Public shortcut
- Press the three dots to call the Template Shortcut
- Click on the first cell, and choose "API Keys Template Shortcut" as the shortcut to call
- Click the arrow on the last cell, and toggle "Show Compose Sheet." It will prompt you to include your email address; do so.
- Make sure you are using the iOS mail app (may be different on MacOS)
- Test the shortcut by pressing the little triangle at the very bottom right of your screen.
- When the shortcut prompts you, press "always allow"
With any luck, it should be ready to go!
I like having the shortcut as a widget on my home screen. To do so yourself, follow these instructions from Apple.
The prompt I've included in the public version is
Please transcribe this handwritten text, making sure to transcribe each word. If there is a reasonable English word that could have been meant, try to write that. Ignore all crossed out words. Do not add new line markers for new lines, but instead represent them as spaces. Do separate paragraphs.
This is the result of some trial and error, but I am still experimenting with better prompts. It may be worthwhile to tweak or change the prompt if you are not getting good performance. There are good resources online for writing good prompts.
Thank you to Claire Pettit for her comments and providing valuable feedback on the installation process. Thank you also to Atty Yang, Zachary Kelly, and Julia Evans for reading draft posts and providing feedback!