What can GPT3 actually do?
Reflections from using GPT to reduce email newsletter overload
Large Language Models (LLMs) like GPT3 made a big splash in 2022 and sparked a lot of interest. I wanted to see what they could do, so I used GPT3 to build a TLDR bot over the holidays to help with email newsletter overload. My solution was to have GPT3 read my newsletters and send me a summary of the top 5 takeaways each day (check out Dec 27th’s TLDR in the screenshot below).
It was a fun project and I used AI in two ways: as a backend service to read and summarize newsletters, and as a coding assistant to write code and brainstorm technical approaches. I’ll share my overall thoughts on using GPT3 and a high-level outline of my technical approach.
Overall takeaway
GPT3 is a good educated guesser, but not yet great at simulating bespoke reasoning. It's like a well-intentioned friend who has read everything ever written, but doesn't like to think too much and just speaks from the gut. Because it's so well-read, its gut sense is pretty good for short, directionally correct answers on public-domain topics. But you wouldn't want to take it at face value if you need a precise answer, especially a longer one or one that requires bespoke reasoning with situation-specific context. For example, ChatGPT was very helpful for writing generic few-line code snippets, but didn't offer much leverage for coding up extended app-specific logic.
I found the following ideas helpful to make the most of GPT3:
Breaking down the overall task into chunks that can be solved by an educated guesser. E.g., for a bootstrapped ranking algorithm, instead of asking it to rate the "usefulness" of each newsletter summary, I found it more effective to have it rate subcomponents, like whether a summary has a clear focus and is informative.
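A minimal sketch of this decomposition (the subcomponent prompts and the `rate_fn` helper here are hypothetical stand-ins, not my exact prompts; in practice `rate_fn` would wrap a GPT3 completion call that returns a numeric rating):

```python
# Score a summary by averaging ratings of simpler subcomponent questions,
# rather than asking GPT3 for one holistic "usefulness" rating.
# rate_fn stands in for a GPT3 call that returns a 1-5 rating for a prompt.

SUBCOMPONENT_PROMPTS = [
    "On a scale of 1-5, does this summary have a single clear focus?\n\n{summary}",
    "On a scale of 1-5, how informative is this summary?\n\n{summary}",
]

def usefulness_score(summary, rate_fn):
    """Average the ratings GPT3 gives to each subcomponent question."""
    ratings = [rate_fn(p.format(summary=summary)) for p in SUBCOMPONENT_PROMPTS]
    return sum(ratings) / len(ratings)
```

Each subcomponent question is a short, answerable "educated guess" on its own, which tends to produce more reliable ratings than one open-ended judgment.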
Optimization is the name of the game, because small changes to prompts, parameters, system design, and fine-tuning make big differences in output quality. See OpenAI's prompt engineering best practices and API docs.
LLMs are good hammers but not everything is a nail. For example, GPT3 was great for quickly summarizing text excerpts but, given its limited ~4k token context window, it was more effective to use good old k-means clustering to identify topics across an unbounded corpus of newsletters.
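One practical consequence of the ~4k token limit is that long newsletters have to be split before GPT3 can summarize them. A rough chunking sketch (the ~4 characters/token heuristic is a crude assumption; a real implementation would count tokens with a proper tokenizer like tiktoken):

```python
# Sketch: group paragraphs into chunks that fit a rough token budget,
# so each excerpt can be summarized within GPT3's ~4k-token context window.
# Assumes ~4 characters per token; use a real tokenizer for accuracy.

def chunk_text(paragraphs, max_tokens=3000, chars_per_token=4):
    budget = max_tokens * chars_per_token
    chunks, current, size = [], [], 0
    for para in paragraphs:
        # Start a new chunk if adding this paragraph would blow the budget.
        if current and size + len(para) > budget:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Topic discovery across the whole corpus, by contrast, doesn't need to fit in any context window at all, which is why plain k-means on embeddings was the better tool there.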
There are likely opportunities to improve the developer experience of running production LLM-powered apps as well (e.g., model iteration, and testing & system reliability for a probabilistic AI function), but I haven’t delved as deeply here.
All told, LLMs like GPT3 are exciting because they’re a flexible & rapidly improving tool that opens up a large design space to developers. So the best way to know whether they’re useful for a particular application is to try them at it.
Technical outline
I wanted an email every morning with summaries of the top 5 topics from the newsletters I received the previous day.
The script follows these steps to make that happen:
Ingest newsletters via Gmail API. Pull all emails in the “Forum” category (where my newsletters already go) from the past day.
Preprocess newsletters. Clean & slice text into paragraph-length sections to improve topic tagging & summarization granularity.
Extract topics. Apply k-means clustering on embedding representations of sections (generated using OpenAI's 🔥 new Ada-002 embedding model), and use the silhouette method to determine the optimal number of distinct topics.
Draft TL;DR. Use GPT3 prompts augmented with relevant sections to generate an appropriate title and pithy 2-3 sentence summary for each topic.
Rank TL;DR. Estimate a "usefulness" score for each TL;DR via zero-shot GPT3 prompts that rate its clarity of focus and informativeness.
Send TL;DR via Gmail API.
Automate. Host on AWS Lambda using Zappa, scheduled to run at 10am ET every day.
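The topic-extraction step (k-means over section embeddings, with silhouette-based selection of the cluster count) can be sketched roughly like this; the synthetic vectors in the example stand in for real Ada-002 embeddings:

```python
# Sketch: cluster section embeddings with k-means and pick the number of
# clusters (topics) that maximizes the silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def extract_topics(embeddings, k_range=range(2, 8)):
    """Return (best_k, cluster labels) for the section embeddings."""
    embeddings = np.asarray(embeddings)
    best_k, best_labels, best_score = None, None, -1.0
    for k in k_range:
        if k >= len(embeddings):
            break  # can't have more clusters than points
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_labels, best_score = k, labels, score
    return best_k, best_labels
```

In the real pipeline, each row of `embeddings` comes from OpenAI's Ada-002 embedding endpoint (one vector per newsletter section), and the sections in each resulting cluster feed the TL;DR prompt for that topic.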
It works pretty well, and as a user I'm happy with ~70% of the summaries I have received so far. Improving the remaining ~30% will likely require setting up a performance benchmark dataset to quantitatively iterate against, which I'll probably try next (maybe using existing summarization datasets to start).


