May 30, 2026 / 12 min read

I built my portfolio and blog with Codex. Here is what I learned.

A practical case study about building my portfolio and blog with Codex, using TDD-like workflows, CI/CD, e2e tests, security checks, human-in-the-loop review and lessons learned from real deployment issues.

Tags: AI, CI/CD, Codex, DevOps, GCP, Security, Software Engineering, Testing

Recently, I built my personal portfolio and blog application. I used Codex as an AI coding agent because I wanted to see how much it could realistically speed up the development process.

I did not want this project to be just a quick AI-generated static page. I wanted to treat it like a real software project: with tests, CI/CD, staging and production environments, infrastructure decisions, security review and enough discipline to actually ship it with confidence. In my case, the project was a web/API application deployed as Dockerized services, with GitHub Actions for CI/CD, Caddy/proxy configuration, GCP-related infrastructure work, and both local and live E2E checks.

In practice, I used a TDD-like workflow where tests and checks were not an afterthought. Sometimes Codex helped write the tests. Sometimes I adjusted them myself. But generated code was never accepted only because it looked correct. It had to pass checks, match the expected behavior and survive review.

The experience gave me a much clearer view of where AI coding agents are already very useful, and where they still need strong human supervision.

What worked really well

The biggest productivity boost came from frontend development. Codex did a very good job designing and implementing the frontend. It created a good-looking, interactive UI much faster than I would have done manually. This was especially useful for me because frontend design is usually the part where I would either spend a lot of time iterating or need help from someone more design-oriented.

Of course, it was not perfect. At one point it failed to properly link CSS styles and the layout was broken. This was not a deep architectural failure, but it was a good reminder that even simple generated UI code still needs to be run and inspected visually. The fix itself was easy, but the broken layout would have been obvious to any user. Still, compared to the amount of frontend work Codex delivered, this was a small issue. I moved from idea to a functional, responsive interface much faster than I expected.

I was also satisfied with the infrastructure direction. The application itself is not huge, so I did not need anything overengineered or expensive. The final setup was closer to a pragmatic low-cost deployment than a shiny cloud architecture: Dockerized services, CI/CD automation, deployment checks and infrastructure choices that were good enough for a production portfolio project without creating unnecessary cloud costs. That mattered to me. A portfolio project should prove engineering judgment, not burn money just to look “enterprise”.

The backend part also went relatively well, but it required much more supervision. And this is where the most interesting lessons started to appear.

Where Codex struggled

Codex struggled the most in areas where it could not fully test the result by itself. The best examples were environment-specific behavior: staging vs production, CI/CD variables, backend configuration, proxy behavior and browser security mechanisms like CORS, CSRF and secure cookies.

One concrete issue was related to CSRF and environment-specific API URLs. Some flows worked locally or in staging-like conditions, but failed when tested closer to the real deployment surface. Cloud Shell preview origins caused `csrf_invalid` for admin mutations, and one production-like flow exposed an incorrect `/api/api/...` URL composition. This was not a simple “Codex wrote bad code” problem. The tests were too local and too mocked for behavior that actually depended on origins, cookies, proxies and deployment topology.

Codex could work well when the feedback loop was short: change code, run test, see failure, fix. But if the bug lived at the boundary between browser, backend, proxy and environment configuration, local confidence was not enough.

The fix was not to throw away local tests. They are still valuable. The fix was to add the right kind of verification at the right boundary: live staging checks for admin mutations, preview-origin behavior, URL composition and production-like API prefix handling. That became one of the most important testing rules from this project:

> Local mocks are useful, but for environment-sensitive behavior they are not proof.

Another backend/runtime example was configuration reload. In one case, the frontend flow was correct and the browser sent the expected token, but the deployed API still failed verification because the running container had a stale runtime environment and had not picked up the updated secret/configuration. The mistake was treating an `.env` or deploy-script change as enough, without verifying that the process consuming it had actually restarted or reloaded.

That is the kind of bug an agent can easily miss if it focuses only on code diff correctness. The code can be fine, the environment can be wrong, and the feature can still fail.
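To make the `/api/api/...` composition issue from this section more concrete, here is a minimal sketch of a URL helper that keeps the API prefix from being applied twice. It is not the code from the project: the function name `build_api_url`, the default `/api` prefix and the `example.com` host are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def build_api_url(base_url: str, path: str, prefix: str = "/api") -> str:
    """Join base_url and path so that the API prefix appears exactly once.

    Hypothetical helper: the name and the default prefix are illustrative.
    """
    parts = urlsplit(base_url)
    base_path = parts.path.rstrip("/")
    clean_path = "/" + path.lstrip("/")

    # Drop the prefix from the request path if the caller already included it.
    if clean_path == prefix or clean_path.startswith(prefix + "/"):
        clean_path = clean_path[len(prefix):]

    # Add the prefix to the base only when the base does not already end with it.
    if not base_path.endswith(prefix):
        base_path += prefix

    return urlunsplit((parts.scheme, parts.netloc, base_path + clean_path, "", ""))


# Both sides carry the prefix, which is the combination that produced /api/api/...
print(build_api_url("https://example.com/api", "/api/posts"))
# Neither side carries the prefix.
print(build_api_url("https://example.com", "posts"))
```

Both calls print `https://example.com/api/posts`. A helper like this only removes one class of bug. Whether the proxy and the deployed services agree on the prefix still has to be checked against a live environment, which is the point of the rule above.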

A green deploy can still hide a broken system

Another good example came from CI/CD. At one point, production served a new web build that expected an API route to exist, but the production API service was still running an older image. A later web-only deploy succeeded, so the workflow looked green, but the system was internally inconsistent.

This kind of problem is easy to miss if the deployment pipeline is optimized around changed files only. The deploy technically succeeded, but only for the selected service. It did not prove that the whole application was current.

The fix was to stop treating “production deploy succeeded” as a global state. In a partial deploy system, each deployable service has to be treated as its own stateful unit. The workflow needs to track or infer which service revisions are expected, detect stale services, and force a redeploy when a previous API deploy failed, was skipped, or was masked by a later unrelated web deploy. In other words:

> A green partial deploy is not proof that the whole application is healthy.

This was one of the most valuable lessons from the whole project, because it is not really an AI-specific problem. It is a software delivery problem. The AI agent just made it easier to move fast enough to hit this kind of edge case.
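The idea of treating each service as its own stateful unit can be sketched in a few lines. This is an illustration under assumptions, not the actual workflow logic: the `ServiceState` type, the function names and the revision strings are hypothetical, and a real pipeline would have to obtain the deployed revision from the running service itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceState:
    name: str
    expected_revision: str  # revision the release expects this service to run
    deployed_revision: str  # revision the service actually reports


def find_stale_services(states):
    """Return the names of services whose deployed revision is not the expected one."""
    return [s.name for s in states if s.deployed_revision != s.expected_revision]


def plan_deploy(changed_services, states):
    """Deploy what changed plus anything that is stale from an earlier run."""
    return sorted(set(changed_services) | set(find_stale_services(states)))


def release_is_healthy(states):
    """A release counts as healthy only when no service is stale."""
    return not find_stale_services(states)


# Hypothetical state: the web service is current, the API is one revision behind.
states = [
    ServiceState("web", expected_revision="rev-42", deployed_revision="rev-42"),
    ServiceState("api", expected_revision="rev-42", deployed_revision="rev-41"),
]
print(plan_deploy(["web"], states))   # ['api', 'web']
print(release_is_healthy(states))     # False
```

With a plan built only from changed files, the web-only change would have deployed `web` and reported success. Taking the union with stale services pulls `api` back into the plan, so a later green run cannot mask the earlier failure.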

The agent needed structure

During the project, I noticed a more general limitation. Sometimes the agent started to go beyond the scope of the task. It could add unnecessary things, solve one problem while creating another, or lose context after the context window was compressed. This did not happen all the time, but it happened often enough that I had to adjust the way I worked with it. The solution was not to give the agent more freedom. The solution was to give it more structure. I relied on one main principle — constant human-in-the-loop supervision — and supported it with two practical mechanisms: project guidelines and a `lessons-learned.md` workflow.

Human-in-the-loop was not optional

I did not use the agent in a full autopilot mode. I supervised its work all the time. I accepted good ideas, rejected bad ones, explained what was wrong with some approaches and pointed it in a better direction when it started drifting. This made the whole process much more effective. For me, human-in-the-loop was not just a final review after the code was already written. It was part of the workflow. This is important not only for quality, but also for safety. AI agents can make mistakes, misunderstand instructions or be influenced by unwanted content. Human supervision does not magically solve risks like prompt injection, but it is an important safety layer. Someone still needs to understand what the agent is doing and decide whether a change should be accepted. Codex was not the owner of the project. I was.

The `lessons-learned.md` pattern

One thing that worked surprisingly well was creating a `lessons-learned.md` file. After the agent made a few repetitive mistakes, especially after longer sessions or context compression, I started thinking about how to reduce that. I wanted the agent to keep track of its mistakes and reuse those lessons before touching similar areas again.

The file was not just an archive. It became a pre-flight gate. Before making implementation, CI, deployment, infrastructure, auth, admin or public-surface changes, the agent had to read the relevant lessons and state which rules were guiding the work.

A simplified example of the kind of lesson I wanted the agent to keep in mind looked like this:

```md
Lesson: Partial deploy success can hide stale service drift

What happened:
- The web service was deployed successfully.
- The API service was still running an older image.
- The final workflow looked green, but the application was internally inconsistent.

Why it happened:
- The deploy pipeline selected services based only on the latest changed files.
- A later web-only success masked an earlier failed or skipped API deploy.

Working rule:
- Treat each deployable service as its own stateful unit.
- Track or infer expected service revisions.
- Force redeploy of stale services before reporting the release healthy.
```

This was simple, but useful. It did not make the agent perfect. But I noticed fewer repeated mistakes and better alignment with the project-specific rules over time. More importantly, it forced the agent to treat previous failures as part of the current context.

For me, this became one of the most practical patterns for working with coding agents: do not rely only on the current prompt. Give the agent a place where project-specific lessons can accumulate.
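The pre-flight gate can also be pictured as a small lookup: given the areas a change is about to touch, pull out the lessons that apply and list their working rules. The sketch below is only an illustration. In the project the agent read the markdown file directly, so the `Lesson` type, the area tags and the function names here are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lesson:
    title: str
    areas: frozenset  # hypothetical tags, e.g. {"ci", "deployment"}
    rules: tuple      # working rules recorded for this lesson


def relevant_lessons(lessons, touched_areas):
    """Return the lessons whose areas overlap with the areas about to be changed."""
    touched = frozenset(touched_areas)
    return [lesson for lesson in lessons if lesson.areas & touched]


def preflight_briefing(lessons, touched_areas):
    """Build the text the agent should state before starting the change."""
    selected = relevant_lessons(lessons, touched_areas)
    if not selected:
        return "No recorded lessons apply to: " + ", ".join(sorted(touched_areas))
    lines = []
    for lesson in selected:
        lines.append(f"Lesson: {lesson.title}")
        lines.extend(f"- {rule}" for rule in lesson.rules)
    return "\n".join(lines)


lessons = [
    Lesson(
        title="Partial deploy success can hide stale service drift",
        areas=frozenset({"ci", "deployment"}),
        rules=(
            "Treat each deployable service as its own stateful unit.",
            "Track or infer expected service revisions.",
            "Force redeploy of stale services before reporting the release healthy.",
        ),
    ),
]
print(preflight_briefing(lessons, ["deployment"]))
```

Tagging lessons by area is one possible design choice. It keeps the briefing short as the file grows, at the cost of having to maintain the tags.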

Project guidelines matter

Another thing that helped was defining clear project guidelines for the agent. Modern coding agents allow us to define instructions in dedicated files or directories that the agent can read and follow. I used that to describe the expected workflow, project rules and quality expectations. For example, the agent had to follow rules like:

* verify current state before acting: branch, latest `main`, commit under test and target environment;
* identify the boundary first when behavior crosses browser/server, build/runtime, CI/VM, proxy/app or GitHub/GCP/Cloudflare;
* avoid treating local mocks as proof for environment-sensitive behavior;
* validate the full lifecycle for cross-system features: provision, configure, deploy, verify, operate and roll back;
* inspect the full execution plan, not only the part it intended to change;
* never commit real or reused operational secrets as test fixtures.

This was not about hyper-optimized prompting. It was standard engineering onboarding. If I were giving the project to a new developer, I would also give them context, rules, previous incidents and constraints. With an AI agent, that discipline matters even more, because vague context forces it to guess. And an AI agent guessing backend security defaults, deployment behavior or infrastructure state is a recipe for a bad deployment.

The literal cost of guardrails

However, implementing this level of structure is not free. It introduces a very practical bottleneck: token consumption. On the Codex plan and environment I was using at the time, the available usage was enough for roughly 1–1.5 hours of continuous intensive development in a 5-hour window. For this project it was not a disaster because I did not have an external deadline. But it was still frustrating. There were moments when I had momentum, knew what I wanted to do next, and then suddenly had to stop.

This connected directly with the workflow above. The more context I gave the agent, the better it worked. But that context also consumed more tokens. More verification meant more iterations. More careful agent work meant more usage.

That made the dependency very visible. If your development workflow depends entirely on an AI agent, then your productivity is also affected by LLM uptime, rate limits, pricing, model quality and context limits. This does not mean we should avoid AI tools. I clearly benefited from them. But it does mean we should not pretend they are free, unlimited or always available.

That is why I believe important workflows should have fallback paths whenever possible. Sometimes the fallback is a deterministic process. Sometimes it is a manual approval flow. Sometimes it is a rule-based system. And sometimes, in the context of software development, the fallback is simply a developer who still understands the code and can continue working without the agent.

Vibe coding is fine. Vibe shipping is not.

After this project, I do not think vibe coding is a bad thing. Actually, I think it can be very useful. It helps with exploration, prototyping, learning unfamiliar technologies and moving faster in areas where we are not experts. In my case, it was especially helpful with frontend development and design. Without Codex, this project would have taken me much longer.

But there is a difference between using AI to move faster and shipping code without understanding it. AI can generate code, but it does not take responsibility for the code. If something breaks in production, nobody will blame the model. The responsibility stays with the people and the team that shipped the change.

That means generated code should still go through review, tests and security checks. LLM-based review tools can help, but they should not replace human review completely. We still need to understand what we accept, what we deploy and what we will have to debug when the happy path is gone. In my opinion, the more AI-generated code we use, the more important engineering judgment becomes.

So my final takeaway is simple:

> Vibe coding is fine. Vibe shipping is not.

Codex helped me move faster, but engineering discipline made the project shippable.

BP
