Three times over the past year (as new generations came out), I tested the usual suspects to see which model writes the best short-to-medium content using my prompts and method.
The tests weren’t lab-grade. No crazy rubric. Just the same prompts, lots of outputs, and a practitioner’s eye for what breaks in real workflows.
What kept surfacing: larger OpenAI models follow instructions best, respect the source, and leave me with the least editing. That matters more than “creative voice.”
Context
My prompts are 2–4 pages. They include:
Example Post to mirror tone, hook, length, and format.
Target Audience so the piece lands where it should.
Source Material that the model must stick to for facts and framing.
Instructions to glue it all together (structure, cadence, sectioning, constraints).
This isn’t zero-shot “go write something.”
It’s a moderately sophisticated prompt with many moving parts and points of failure. I’m optimizing for instruction following and source faithfulness. It is much easier to add personality and flair to content than it is to restructure it.
Here is the prompt that I most recently tested.
You are an expert content and social media creator.
Write a <Platform> post that matches the GOAL, STYLE, TONE, LENGTH, READABILITY, and PUNCTUATION of the CONTENT_EXAMPLE, using the SOURCE_MATERIAL as the single source of truth, and is maximally relevant to the TARGET_AUDIENCE.
Matching The Content Example
I'm giving you a CONTENT_EXAMPLE that performed very well on <Platform>. Analyze that piece of content and understand why it performed well.
Is it making a polarizing claim and then giving an unpopular opinion? Directly solving the TARGET_AUDIENCE’s pain point? Etc.
First, make a central claim that has the same tone and goal as the CONTENT_EXAMPLE, but uses the SOURCE_MATERIAL as the inspiration for the post.
Mirror cadence, sentence length, heading/bold usage, list style, and spacing of CONTENT_EXAMPLE.
Using the Source Material
The SOURCE_MATERIAL is the inspiration for your new post. It is what your new post is ABOUT.
Although the post should read just like the CONTENT_EXAMPLE, the SOURCE_MATERIAL should guide what's in the post.
You do not need to RIGIDLY COPY the SOURCE_MATERIAL.
You can make opinionated claims that do not appear in the SOURCE_MATERIAL if it helps you match the CONTENT_EXAMPLE better.
But do not state actions, processes, results, or metrics as facts if they do not directly appear in the SOURCE_MATERIAL.
Aligning the Post For The Target Audience
Identify how the SOURCE_MATERIAL is relevant to the TARGET_AUDIENCE based on pain point and desired outcomes.
Let that guide how you shape your central claim to be relevant to both the SOURCE_MATERIAL and the TARGET_AUDIENCE.
Continue to make the post about solving the pain point and achieving the desired outcomes of the TARGET_AUDIENCE within the bounds of the content example framework.
Copy Editing Considerations
Follow the principles of Hormozi's Content Method (Hook, Promise, Reward)
Review your work to make sure you have followed this method as well as you can while staying within the instructions given above.
Hook - Needs to grab the reader’s attention and get them to stop scrolling
Promise - Needs to convince the reader to read the whole post
Reward - Needs to fulfill the stated promise.
Prioritize the Reddit Snippet
The title and the first 6-7 lines are what a reader can see before deciding to click the post, so those parts need to be compelling enough to get the reader to click.
Inputs
CONTENT_EXAMPLE = """<paste example post>"""
TARGET_AUDIENCE = """<your audience snippet>"""
SOURCE_MATERIAL = """<your source>"""
Silent Consistency QA (do not print)
Before returning the post, silently verify and revise if needed:
Give yourself a score out of 10 on how well you followed each instruction. If below a 7 on anything, make the necessary changes to get above a 7.
Return only the final post.
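If you want to reuse this template programmatically, assembly is just string substitution. Here is a minimal sketch in Python; PROMPT_TEMPLATE and fill_prompt are illustrative names (not from any library), the template body is abbreviated, and the file names in the usage example are hypothetical.

```python
# Minimal sketch: fill the prompt's input slots before sending it to a model.
# PROMPT_TEMPLATE and fill_prompt are illustrative names, not a library API.

PROMPT_TEMPLATE = '''You are an expert content and social media creator.
... (paste the full instructions from the prompt above) ...
CONTENT_EXAMPLE = """{content_example}"""
TARGET_AUDIENCE = """{target_audience}"""
SOURCE_MATERIAL = """{source_material}"""
'''

def fill_prompt(content_example: str, target_audience: str, source_material: str) -> str:
    """Substitute the three inputs into the template."""
    return PROMPT_TEMPLATE.format(
        content_example=content_example,
        target_audience=target_audience,
        source_material=source_material,
    )

# Example usage with hypothetical file names:
prompt = fill_prompt(
    content_example=open("example_post.txt").read(),
    target_audience=open("audience.txt").read(),
    source_material=open("source.md").read(),
)
```

1. Gemini Seems To Get It… and Disagree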
Gemini understands the brief, but tends to do its own thing anyway.
It recognizes the ask but wanders on structure and formatting.
It tends to default to a safe, moderately formal tone.
It uses the source material… but doesn’t hold to the content example as tightly as needed.
I would frankly rather it hold tightly to the content example, even if the wording sounds weird because it’s not quite a fit.
Otherwise, I have to spend my editing time re-imposing my structure onto the post.
Verdict: Good comprehension, weak obedience. If you value predictable formatting and example-mirroring, you’ll be herding cats.
2. Claude Has an Attitude
Claude’s voice is great. Its judgment can overrule your brief.
Most readable, “human” tone.
But when my instructions collided with its preferences, it adjusted the deliverable on its own.
It respected the spirit, not the letter, of the brief.
Really, it’s much like Gemini, but with more fun, casual writing.
Verdict: Best raw voice, not the best follower of specific detail inside your prompt.
Note: Claude is still my favorite for coding.
3. Open Source: I Wanted It to Win
I use a lot of tokens for content (21M in the last 14 days), so I wanted open source to work. It would save me $100+ per month.
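For rough scale, here is the back-of-envelope math; the input/output split and the per-token prices are assumptions for illustration, not quoted rates:

```python
# Back-of-envelope monthly cost. The input/output split and the per-million
# token prices below are assumptions for illustration; plug in real rates.

tokens_14_days = 21_000_000
tokens_per_month = tokens_14_days / 14 * 30   # ~45M tokens/month

input_share = 0.9        # assumed: long prompts, comparatively short posts
price_in_per_m = 1.25    # assumed $ per 1M input tokens
price_out_per_m = 10.00  # assumed $ per 1M output tokens

cost = (tokens_per_month * input_share / 1e6 * price_in_per_m
        + tokens_per_month * (1 - input_share) / 1e6 * price_out_per_m)
print(f"~${cost:.0f}/month")  # ~$96/month under these assumptions
```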
I tested locally with the biggest model I could practically run (e.g., Llama 3.1 70B). Two problems:
Throughput: Too slow to use. Reasoning models generally do better here, so I could have used a DeepSeek model, but with thinking it would be even slower.
Capacity: Even at 70B, it struggled to juggle all the instructions + example + source content.
For content that must follow a complex prompt and not hallucinate beyond your source, you need more capability.
For internal tasks that other people don’t see, I could see myself using open source.
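If you want to try the same test locally, here is a minimal sketch assuming the ollama Python client and its llama3.1:70b tag; swap in whatever runtime and model you actually use:

```python
# Minimal local test sketch, assuming the `ollama` Python client
# (pip install ollama) and a pulled model (`ollama pull llama3.1:70b`).
import time
import ollama

prompt = open("filled_prompt.txt").read()  # hypothetical file: the assembled prompt

start = time.time()
response = ollama.chat(
    model="llama3.1:70b",
    messages=[{"role": "user", "content": prompt}],
)
elapsed = time.time() - start

post = response["message"]["content"]
print(f"{elapsed:.1f}s for {len(post.split())} words")
print(post)
```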
4. OpenAI (and Why the Biggest Model Wins)
Across versions, one pattern was stable: the bigger the OpenAI model, the better the content quality for instruction-heavy prompts. This is probably obvious, but for content you intend to publish, it really matters.
So I’m not recommending OpenAI models in general; I’m specifically recommending GPT-5 with reasoning effort set to “high,” which increases the number of thinking tokens used for each prompt.
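In API terms that is one extra parameter. A minimal sketch using the OpenAI Python SDK’s Responses API (check the current docs if the parameter shape has changed):

```python
# Sketch: calling GPT-5 with reasoning effort set to high via the
# OpenAI Responses API. `prompt` is the assembled template from earlier.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "high"},  # more thinking tokens per prompt
    input=prompt,
)
print(response.output_text)
```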
Why it’s better:
Instruction adherence: It mirrored the content example I gave it ruthlessly, even when that meant the posts didn’t exactly make sense. That might sound like a weird thing to want, but those posts are way easier for me to edit.
Source faithfulness: It derived all of the post’s substance from my original source content. Didn’t make up anything.
Editability: Even when the style is a touch “plain,” the structure is correct, so improving it is fast.
Is it worth saving money with the smaller models? No. It would cost me more in editing time. If OpenAI had a model that was twice the size and twice the cost of GPT-5, I would probably use that one instead.
The End
For now, I’m sticking with OpenAI’s biggest model for content, specifically GPT-5 with high-effort thinking, because it follows instructions and doesn’t make up new stuff.
Voice is easy to add. Structure is expensive to fix.