Behind the scenes, on making the ‘Competition Notes from IPI’ series – LLM Struggles, Snakes & Sanity Checks
This post reports the difficulties I faced while making this series of write-ups.
- Deciding what to write for each post (planning phase)
There are 41 behaviors in total (I didn’t do all of them, but still quite a lot), so I didn’t think I should put everything in one post.
Therefore, I started my write-up by categorizing the behaviors in this IPI competition. It was truly much harder than I first imagined. I was afraid I might miss something and fail to put all the behaviors that belong in the same group in one place.
I tried to use LLMs to assist with that, but they did quite horribly. For example, GPT-5.2 showed me groupings where both the ‘challenge ID’ (which I had assigned for my own reference) and the ‘verbatim description’ were hallucinated.
I tried to create (1) extraction files that extract key ideas from each behavior, (2) validation files that check whether the LLM did it correctly, and (3) prompts to let the LLM handle grouping and taxonomy.
The problem with this method: it took far too long, especially for one person.
Moreover, some nuances were missing in the extraction files. What the LLM judged as “necessary” and “sufficient” was not exactly the same as what I had in mind.
Still, with 41 behaviors it was quite confusing and cognitively taxing for me, and I knew my own limits. I might actually make some mistakes, but I would do my best, and if I found mistakes later, I would fix them.
Claude Sonnet 4.5 and Grok 4 did a bit better; they could produce more verbatim output. However, they still failed to group everything correctly. For example, when I asked, “Are you sure all the email-related ones are there?” they replied with great confidence that they were. But when I checked again, I found there were more.
So, in the end, I manually grouped all the behaviors myself. Later, I found that I had missed one behavior and hadn’t put it in the coding-agent group. I did ask the LLMs to help me check, but they were so nice that they said my version was fine and complete. I didn’t fully believe them, so I checked myself, but I slipped up too.
- Making posts
Then came the writing part… I tried to make the LLM fill in my template for me (from the ‘location’ field to the ‘criteria’ field). Again, the problem was being verbatim; it goes against their nature. The results were too sloppy for me, so I ended up checking their work and doing it all myself again.
It’s not that I wrote every word myself. LLMs are very good at generating five variants of a paragraph, especially when I want “impact,” because my writing experience is more in factual reporting and not very dramatic. I could save time by picking the wording I liked; looking up synonyms myself would have taken longer.
- Checking for mistakes
I wanted someone to help me check, but I didn’t have anyone, so I asked the LLM to cross-check against the raw files and the structure. It said everything looked fine, but luckily I later found that I hadn’t written the “normal audience openings” for a few behaviors. One reason was that what Claude saw was truncated, and I needed to nudge it to check the full file. I know that feeding it slowly, checking page by page, is much more reliable, but if I’m only doing a small amount each time, I can just skim and verify it myself anyway. Even then, with 30% of my weekly limit consumed, it produced a verification file showing all “passed” results even though they weren’t correct.
Therefore, I asked ChatGPT (it’s much cheaper) to do it bit by bit on Canvas to verify whether I had made any mistakes. It could not do it well: it hallucinated the criteria, and the verification results came out incorrect. I couldn’t order it to ‘verify the first one on my list’ either; I had to state the behavior description verbatim. And even then, it still got it wrong. I really felt I was working twice: (1) I manually checked everything myself, and (2) I checked whether the AI could do what I was doing.
On the bright side, now I know that GPT-5.2 still can’t do what I can yet… but for me, I think it’s quite labor-intensive work that humans should be able to delegate to AI.
I let Grok verify too; it pointed out a few mismatches, which I found useful.
- Analysis of the feedback
I posted them slowly, one by one, because I hadn’t really finished the next one yet. For example, when I posted the first one, I already had a full draft of the second. The strategy was to revise that draft based on the feedback on post 1.
The posts didn’t do well on my personal Facebook at all. I believe it’s because they were a poor fit for Facebook: too technical, with not enough emotion.
However, I tried to explain better in post 2, in case someone found it useful.
Reading the feedback was quite hard here because I actually had very little of it. Unlike when I taught classes, where I could see students’ expressions immediately and change how I explained right away.
I had some friends DM me on Discord though, and I felt thankful.
In post 3, I tried to make it easier to understand for a more general audience than in post 2.
- Translation
I always aim for both Thai and international readers. I know that if I only made an English version, almost no one here would read it. Therefore, after finishing each post, I translated it using an LLM.
The translation part was a pain too. No LLM can translate 100% correctly, and they are horrible at Thai: they can’t produce a Thai essay on par with one from an Arts graduate. Their Thai is much worse than my English, and I can’t stand it. That’s why I always talk to them in English, and it’s also why I mostly wrote in Thai and then translated to English.
The problem with the translation was that the full text was too long. The first post, on the email attack, seemed OK to me, with just a little drift. However, for post 2, Claude saw truncated text and still acted as if it had seen everything, making up a very plausible English version for me.
And I even asked Claude to verify whether (1) the structure of the translation was correct and (2) all the meaning was correct. Claude assured me it was a faithful translation. However, when I checked, I found that it wasn’t a faithful translation at all.
The tradeoff was that the translation looked stiff, like a word-for-word rendering, but it’s much cheaper than hiring professional translators. It’s well known that LLMs can’t replace professional translators yet. Moreover, the tone came out more formal than in my Thai version. But I’m not such a good translator that I can capture all the nuances either. (The academic writing I learned in school is more formal, so it’s harder to capture the right nuance.)
I felt so tired from checking their translation. One reason is that Thai lacks explicit tense; the LLM needs to infer from the words what tense I meant, and it can’t do that with a 100% success rate. Sometimes I doubt myself: why did I do quite well in the competition when I can’t even make my “friendly AI” do what I want?
For the third post, I still used Claude, but this time I only did it bit by bit. The problem is that the translation doesn’t look very natural; it feels like a word-for-word translation.
- Publication-ready markdown
The last part was using the LLM to make markdown for styling the post. I was paranoid about whether they would keep everything else as is and only add readability. For short files, they have no problem, but for long files with 1000+ lines, I don’t really trust them.
I asked ChatGPT, “Can you make a markdown version?” It said yes, as long as I gave it the text little by little. I asked Claude; it said yes, definitely. But Claude took very long, and then it stopped halfway. I tried twice, the problem persisted, and 20% of my 5-hour session limit was consumed.
I asked Grok; it said yes, but the markdown it produced cut off partway, ending with something like ‘and then continue....’
In the end, I used ChatGPT, feeding it only about a tenth of the file each time. But the results were not what I expected: it cut my paragraphs into pieces “for readability,” which I didn’t like. So I made some words bold, italic, and larger myself. The consistency between the Thai and English versions is not that great, but since this is not an academic report, only my blog... I let it be.
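The “bit by bit” workflow I kept falling back on, for checking, translation, and markdown alike, amounts to paragraph-aligned chunking: split the document at paragraph breaks so each piece fits comfortably in what the model will actually read in full. A minimal sketch; the character limit is an arbitrary assumption, not a real model limit:

```python
# Split a long document into paragraph-aligned chunks so each chunk can be
# fed to an LLM in full, instead of the model silently truncating the text.
# The max_chars threshold is an assumed, arbitrary budget.
def chunk_paragraphs(text: str, max_chars: int = 4000) -> list[str]:
    chunks, current = [], ""
    for para in text.split("\n\n"):  # paragraphs separated by blank lines
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Because the chunks never split a paragraph, rejoining them with blank lines reproduces the original text exactly, which makes it easy to verify nothing was dropped between round trips. (A single paragraph longer than the budget would still become one oversized chunk; this sketch doesn’t handle that case.)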
- What was running in the background: My busy January
This was my busiest January ever. I had issues like a persistent snake that kept peeking out of my toilet. At first my maid saw it and reported it to be quite big… around 6 feet. But once I got a clear view, I thought it might be much smaller, maybe not even 4 feet. Or maybe there were several different snakes showing up. It kept visiting, startling various residents of my house every 1–3 days, and this went on for 2 weeks.
I called the emergency service, but they would not come, so I bought snake grabbers and set up a monitor to watch out in case it came out. But it quickly went away as soon as I went into the toilet with a grabber.
I tried changing to a new toilet bowl plus a snake-prevention box. That didn’t work. So I escalated and had a new drain system installed.
My kids got norovirus too, stayed in the hospital, and then took a week off to recover at home. So my usual quiet mornings turned into full-blown chaos.
That’s why this series took me so long. I stagnated on the Gray Swan Proving Ground during January.
P.S. This post was originally written in English. There is no Thai version.