My dad, Bob Ambrogi, has a very popular blog on legal technology that he’s been writing for over 20 years. The blog is called LawSites, and you can find it here: lawnext.com. He uses AI a bit in his workflows, but not as a tool to actually do writing on his behalf. He and I have discussed how a major issue with most AI writing tools is that they fail to mimic your style. They produce writing, but it doesn't sound like you. The content on my dad’s blog is matter-of-fact journalism with a well-defined style that is quite different from the hyperbolic fluff LLMs tend to produce.
Thinking about this, I was inspired to see if I could fine-tune a language model to produce writing that sounded exactly like him. And so I did just that. I primarily wanted to test whether fine-tuning enabled the model to adopt my dad’s style more accurately than a simple prompting-based approach. How best to get an LLM to adopt a certain personality or writing style is an unsolved problem, both for builders and for consumers. My hope is that this narrow experiment might shed some light on the broader problem.
In the end I trained five variations of gpt-4o, put together a competing approach using Claude with in-context examples, and wrapped this all up in a web app. The results were very interesting and I thought it would be worthwhile to share.
I came away from this experiment with a few conclusions, which I’ll share at the end of the post.
In this post I’ll touch on: (1) how I set up the data for fine-tuning, (2) the training and experimentation process, (3) the results of the best model compared to prompting an existing model, and (4) some conclusions as to whether I would suggest fine-tuning over prompting for this type of task. I’ll use some technical terms related to fine-tuning that assume knowledge of neural networks, but if you are only interested in the high level, the post should still be accessible and might help you learn a bit about the training process.
Before we dive into the details, here’s an example of the output from one of my models. I took a recent post on LawSites that was not included in the training data and ran it through a prompt to reverse-engineer notes that might have feasibly led to the post. Here are the first few paragraphs of the output produced by running those notes through my tool:
Bloomberg Law today released for general availability two generative AI tools that it has been testing and developing for the legal research platform over the past year through its Innovation Studio. Both tools are now available in beta to all Bloomberg Law subscribers at no additional cost.
The first of these tools, Bloomberg Law Answers, is a research tool that provides brief, cited responses to users' natural language questions. The second, AI Assistant, is a document-specific tool that can be used to generate summaries and answers about a specific document the user is viewing.
"We took our time to bring these products to market," Bobby Puglia, chief product officer at Bloomberg Industry Group, told me during an interview yesterday. "We were very deliberate and patient in bringing these things to beta until we felt like we'd had extensive user testing and that we've got it to a place where we could confidently say that we could trust the answers."
And here’s the first few paragraphs of the actual post:
Bloomberg Law has launched two new generative AI-powered research tools – Bloomberg Law Answers and Bloomberg Law AI Assistant – marking the company’s most significant foray into providing its customers with AI-enhanced legal research capabilities.
While Bloomberg Law had previously rolled out targeted gen AI applications, such as AI-powered complaint summaries on dockets, these new tools are its first to deploy gen AI broadly within its platform for legal research and document analysis.
The new features, released in beta after a year of testing through Bloomberg Law’s Innovation Studio, aim to streamline legal research workflows by providing direct answers to legal questions and enabling document-specific querying. Both tools are available to current Bloomberg Law subscribers at no additional cost.
“We took our time to bring these products to market,” Bobby Puglia, chief product officer at Bloomberg Industry Group, told me in an interview. “We were very deliberate and patient in bringing these things to beta until we felt like we’d had extensive user testing and that we’ve got it to a place where we could confidently say that we could trust the answers.”
Would you be able to tell which of these was AI generated?
Fine-tuning data setup
In order to fine-tune an LLM like gpt-4o, you need to provide it with a dataset of input/output pairs that represent your desired behavior. In this case, the output I wanted to train the model on was complete blog posts written by my dad, which I gathered by scraping his blog. The ideal input for this training is more subjective. My dad is a journalist who is often taking notes on products or interviews and later turning those into complete blog posts, so my intuition was that a good input would be a page of notes or an outline of a post. Then, when using the model, he could feed in notes and get a draft blog post in return. Another benefit of notes as input is that it’s easy to condense other potential input formats, like an interview transcript, into a page of notes using an LLM.
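For concreteness, OpenAI’s chat fine-tuning API expects each training example as a JSON object with a messages array, written one per line in a JSONL file. Here’s a minimal sketch of what a single notes → post pair might look like; the system prompt wording and contents are illustrative placeholders, not my exact data:

```python
# One fine-tuning example in OpenAI's chat format.
# Each example becomes a single JSON line in the training .jsonl file.
example = {
    "messages": [
        {"role": "system", "content": "You turn a page of reporter's notes into a complete LawSites blog post."},
        {"role": "user", "content": "NOTES:\n- Bloomberg Law releases two gen AI research tools\n- Answers: brief, cited responses to natural language questions\n- ..."},
        {"role": "assistant", "content": "<the full published blog post goes here>"},
    ]
}
```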
I didn’t have access to the notes he might have used while writing all of these blog posts, so I used Claude to synthetically generate them. With a bit of prompt engineering, I was able to get a setup that would take in an existing blog post and generate notes that feasibly could have led to it. By looping the posts through this setup, I was able to create a dataset of notes → final blog post pairs.
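The core of that loop is simple. Here’s a rough sketch of how it might look with the Anthropic Python SDK; the prompt wording and model name are illustrative rather than exactly what I ran:

```python
import json
import anthropic

client = anthropic.Anthropic()

# Illustrative prompt for reverse-engineering notes from a finished post.
NOTES_PROMPT = (
    "Below is a published blog post. Write the rough page of notes that could plausibly "
    "have led to it: key facts, quotes, and product details in terse bullet form. "
    "Do not copy sentences from the post verbatim.\n\n{post}"
)

def post_to_notes(post_text: str) -> str:
    """Generate plausible reporter's notes for a finished post."""
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model choice
        max_tokens=1500,
        messages=[{"role": "user", "content": NOTES_PROMPT.format(post=post_text)}],
    )
    return message.content[0].text

def build_dataset(posts: list[str], out_path: str = "train.jsonl") -> None:
    """Loop over scraped posts and write notes -> post pairs as fine-tuning examples."""
    with open(out_path, "w") as f:
        for post in posts:
            notes = post_to_notes(post)
            example = {
                "messages": [
                    {"role": "system", "content": "Turn the notes into a complete LawSites blog post."},
                    {"role": "user", "content": notes},
                    {"role": "assistant", "content": post},
                ]
            }
            f.write(json.dumps(example) + "\n")
```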
Creating a high-quality dataset is probably the most important part of this whole process. I wanted to time-box this experiment, but given more time I would have also manually reviewed all of the pairs to make sure they were representative of what I wanted the model to learn. But with a starting point in hand, I jumped into model training.
The Training Process and Results
When fine-tuning gpt-4o you are able to tweak four hyper-parameters: the size of the training set, the number of epochs, the batch size, and the learning rate multiplier.
It’s usually best practice to start with a simple setup to see if there’s any evidence the model can learn the task. I started with a modest dataset size and allowed OpenAI to automatically assign the other hyper-parameters. In my case this led to the following for my first run:
blog posts: 150
tokens: 467k
epochs: 3
batch size: 1
learning rate multiplier: 2
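Kicking off a run like this takes only a few lines with the OpenAI Python SDK. This is a rough sketch of the shape of the call; the file names are placeholders, and the hyper-parameters match the first run above (omit them and OpenAI will pick defaults for you):

```python
from openai import OpenAI

client = OpenAI()

# Upload the JSONL training data (and optionally a validation split).
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
valid_file = client.files.create(file=open("valid.jsonl", "rb"), purpose="fine-tune")

# Create the fine-tuning job with explicit hyper-parameters.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",
    training_file=train_file.id,
    validation_file=valid_file.id,
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 1,
        "learning_rate_multiplier": 2,
    },
)
print(job.id, job.status)
```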
This setup took about half an hour to train and cost on the order of $10. I was immediately able to tell that the model had picked up some of my Dad’s style. When I tested it, it produced blog posts that sounded significantly more like my dad than the base model did, and it had without a doubt picked up on some of the structure and style of his writing. However, when looking at the loss curve, there was little evidence of learning.
Typically what we’d like to see here is the validation loss showing a clear downward trend. That was not the case. Additionally, the validation loss being so much higher than the training loss suggests overfitting - meaning the model learned to memorize the examples it was given but did not perform as well on withheld data. However, loss does not perfectly translate to real-world performance, and given that I could clearly see the model had learned something, I continued to try a variety of other approaches.
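If you want to inspect the loss curves yourself rather than relying on OpenAI’s dashboard, a completed job exposes a result file with per-step metrics. A rough sketch of pulling and plotting them; the column names reflect OpenAI’s result CSV as I understand it and may change:

```python
import io

import pandas as pd
from openai import OpenAI

client = OpenAI()

# Download the per-step metrics for a completed fine-tuning job (your job ID here).
job = client.fine_tuning.jobs.retrieve("ftjob-...")
metrics_csv = client.files.content(job.result_files[0]).read()
metrics = pd.read_csv(io.BytesIO(metrics_csv))

# Plot training vs. validation loss to check for overfitting.
ax = metrics.plot(x="step", y=["train_loss", "valid_loss"])
ax.figure.savefig("loss_curves.png")
```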
In total I spent over $100 trying four more variations of the hyper-parameters. I tweaked all of the variables mentioned above and also explored holding out different amounts of the dataset for validation. Here are the details for each run:
Run 2: blog posts: 300, tokens: 1.15M, epochs: 3, batch size: 1, lr multiplier: 2
Run 3: blog posts: 300, tokens: 1.9M, epochs: 5, batch size: 1, lr multiplier: 1.5
Run 4: blog posts: 300, tokens: 1.15M, epochs: 3, batch size: 2, lr multiplier: 1.5
Run 5: blog posts: 300, tokens: 2.6M, epochs: 7, batch size: 2, lr multiplier: 3.5
It was not until the final run that I saw very clear evidence of the model learning.
Looking at the training loss, we see a definitive downward trend that was missing in all of my previous runs. However, we also see very clear evidence of overfitting, given that validation loss begins to rise dramatically as training loss continues to drop.
Traditionally this would be a serious red flag. Overfitting usually means the model is memorizing training examples at the expense of generalizing well to new inputs. But there is something strange about language model fine-tuning, especially for style, where overfitting doesn’t seem to be such a big problem. I’m not the only one to come to this conclusion.
From first principles, there are a few reasons this configuration might have worked well. It seems that a more aggressive learning rate helped the model explore the solution space and escape a performance plateau. The batch size of 2 smooths the learning curve out a little while still allowing the model to update on the nuances of individual examples fairly frequently, which we need with a smaller training set.
Part of my inspiration for this project was a very similar experiment by Linus Lee. And after the fact I actually discovered this was exactly in line with what had worked for him.
This model undoubtedly produced the outputs that sounded most like my Dad’s blog. Some of the posts it produced in testing sounded almost indistinguishable from him. If I were to train it one more time, I would keep the high learning rate and stop training earlier to mitigate the overfitting.
Prompting based approach
The question remained: are these outputs noticeably better than what one could get from a well-crafted prompt? In my, and my Dad’s, experience, Claude produces far better writing than any of the OpenAI models, probably for the same reasons that make it widely loved as a ‘daily driver’ LLM: it just sounds more human. So it seemed possible that even without any fine-tuning, Claude might produce outputs that felt better than our fine-tuned model.
I spent some time crafting a prompt that contained both detailed instructions and a handful of examples of desired input/output pairs. At this point I had created a web app for testing these models, and I added this prompt-based Claude version to the model selector.
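The structure of that prompt is nothing exotic. Here’s a minimal sketch of the few-shot setup, again with illustrative wording and model name rather than my exact prompt:

```python
import anthropic

client = anthropic.Anthropic()

# Illustrative instructions; the real prompt was more detailed.
SYSTEM_PROMPT = (
    "You are drafting posts for LawSites, a legal technology blog. Write in a matter-of-fact, "
    "journalistic style: inverted pyramid structure, short paragraphs, no hype. "
    "The user will give you a page of notes; return a complete draft post."
)

# A handful of real notes -> post pairs from the blog, shown to the model in context.
FEW_SHOT_EXAMPLES = [
    {"role": "user", "content": "<notes for example post 1>"},
    {"role": "assistant", "content": "<full text of example post 1>"},
    {"role": "user", "content": "<notes for example post 2>"},
    {"role": "assistant", "content": "<full text of example post 2>"},
]

def draft_post(notes: str) -> str:
    """Draft a blog post from notes using in-context examples instead of fine-tuning."""
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model choice
        max_tokens=4000,
        system=SYSTEM_PROMPT,
        messages=FEW_SHOT_EXAMPLES + [{"role": "user", "content": notes}],
    )
    return message.content[0].text
```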
Results
I sent this tool to my Dad to have him test it. To me it seemed quite clear that of all the models the final fine-tune produced writing that sounded most like him.
But his feedback was interesting. His clear favorite was not the final version of the fine-tuned models, but the Claude-based version. He said: “I’m not sure if it sounds more like me or if I just like how it sounds more. It consistently sounds more like a professionally written news story.”
There’s a lot you could take away from this, but I think the most important lesson is that you need to stay very close to your customers when evaluating what a ‘good’ output from an LLM-based feature is. In this case, simply mimicking his style was not the only criterion for a preferred output. The non-fine-tuned model was certainly less likely to phrase things how he would, use the exact words he would, or even structure the blog post exactly how he would. Yet something about it was preferable as a tool. Perhaps getting high-quality writing that you can quickly adjust to match your style is actually more useful than getting lower-quality writing that more closely mimics the structure of your own.
It's also noteworthy that my Dad has enough writing online that LLMs already know who he is and may, to some extent, already understand his writing style. That could reasonably cause prompting alone to work better for him than it would for others.
Takeaways: In-Context Learning vs. Fine-Tuning for Style
Fine-tuning requires a lot of resources: data availability, time, and money. The optimal hyper-parameters are not clear and may even be totally dependent on the type of content you are trying to mimic.
On the other hand, prompting is very accessible. Even with just three examples in context, the Claude output was fairly good. From existing research, there’s reason to believe that the output would only get better with more examples. It’s also much easier to iterate on and tweak with specific instructions - e.g. ‘don’t use this word’, ‘make sure to use inverted pyramid style unless the post is of this type…’, etc.
It’s impossible to definitively say which method works better for style adoption; I certainly could have squeezed better results out of both approaches. But I think the takeaway for builders is clear: if you want to enable users to get writing that sounds like themselves, it probably makes sense to start with a prompting-based approach. This is also good news for those who are not building products but just like to use LLMs. You might not get a perfect replication of your tone, but you can get a pretty useful tool just by showing the model some examples of your style and thinking carefully about the instructions that accompany them. This is also promising for researchers interested in post-training models for style and personality on a broader scale. If I was able to find some success with a modest amount of training data, we should only expect better results at scale. The biggest obstacle, as we often see in fine-tuning, is curating the right data.