AI Image Generation

The Client is a manufacturer of roof windows and skylights, focused on bringing daylight and fresh air into indoor spaces. Their brand is built on the transformative quality of natural light — how it shapes rooms, moods, and everyday life.

We were asked to explore the current state of AI image generation and its potential for the client's visual content. The goal was to understand what's possible, what's not, and where the technology could be heading.

AI-generated room with roof-window light

Two Directions

We defined two creative directions to explore, both centered on the presence of natural daylight without depicting the product directly.

Direction 1
Abstract

Evoking the presence of daylight through close-up textures, materials, and atmospheric details — without showing a window or room. Think light cones falling on linen, ceramic, fruit.

Direction 2
Indicative

Full room scenes where the light source is implied but never shown. Hinting at the product through distinctive angled light patterns and shadows cast from above.

Abstract output example Indicative output example

Starting Point

I started with a basic Flux Dev workflow in ComfyUI — text prompt to image generation. The initial outputs had reasonable composition but didn't feel right. The textures were plastic. The lighting was generic. Nothing matched the editorial, photographic quality of the client's existing imagery.
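For reference, that baseline is roughly equivalent to the following diffusers sketch. The actual workflow was a ComfyUI graph; the model ID is the public Flux Dev checkpoint, and the prompt and settings here are representative rather than the exact project values.

```python
# Minimal text-to-image baseline, roughly equivalent to the ComfyUI Flux Dev workflow.
# Illustrative only: prompt and parameters are representative, not the project's settings.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    prompt=(
        "sunlit attic bedroom, warm natural daylight, "
        "editorial interior photography, photorealistic textures"
    ),
    height=1024,
    width=1024,
    guidance_scale=3.5,        # a common default for Flux Dev's distilled guidance
    num_inference_steps=50,
).images[0]

image.save("baseline.png")
```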

More importantly, the light patterns were wrong. Every image showed light from a vertical window — sharp horizontal beams across a wall. But roof windows cast fundamentally different light: angled rectangular patches on the floor, coming from above. This distinction is central to the brand, and the AI couldn't produce it.

Vertical window light — common in AI training data Roof window light — rare in training data

3D simulations comparing vertical vs. roof window light cone geometry.

Prompt Architecture

Before resorting to model training, I spent significant time on prompt engineering — testing over 500 variations. I developed a structured prompt format with five layers: subject, camera position, light source, LoRA activators, and style prompts. Each layer controlled a different aspect of the output.

The style prompts referenced specific camera systems (Hasselblad X2D 100C), photographic qualities (ultra-high resolution, photorealistic textures), and editorial approaches (soft but directional lighting, editorial food photography). This pushed the aesthetic quality significantly, but couldn't solve the light cone geometry problem.
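The layered structure is easiest to show as a sketch. The layer contents below are illustrative examples rather than the production prompts, and the trigger word is a hypothetical stand-in.

```python
# Sketch of the five-layer prompt structure: subject, camera, light, LoRA activators, style.
# Layer contents are examples, not the production prompts.
from dataclasses import dataclass

@dataclass
class LayeredPrompt:
    subject: str          # what is in the frame
    camera: str           # camera position and framing
    light: str            # light source and direction
    lora_activators: str  # trigger words for the custom LoRA
    style: str            # camera system, photographic and editorial qualities

    def compose(self) -> str:
        # Layers are joined into one comma-separated prompt string.
        return ", ".join(
            [self.subject, self.camera, self.light, self.lora_activators, self.style]
        )

prompt = LayeredPrompt(
    subject="bowl of pears on a linen tablecloth",
    camera="eye-level close-up, shallow depth of field",
    light="angled rectangular patch of daylight falling from above",
    lora_activators="BRANDSTYLE",  # hypothetical trigger word
    style="Hasselblad X2D 100C, ultra-high resolution, photorealistic textures, "
          "soft but directional lighting, editorial food photography",
).compose()
```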

No amount of prompt engineering could make the model generate light from a roof window. The concept simply didn't exist well enough in the training data.

Training Custom LoRAs

When prompting hit its ceiling, I trained custom LoRA adapters using the client's existing image library from their DAM. The process involved multiple iterations — varying dataset size, training epochs, LoRA rank, and learning rate to find the right balance between style adherence and flexibility.

Dataset: 48 curated images from client DAM
Training: 4 repeats × 7 epochs (1,344 steps)
LoRA Rank: 4
Learning Rate: 1e-4
Data Size: 1024px
Training Time: ~3 hours
Trigger Word: Case-sensitive brand style activator
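Collected as a config sketch, with a quick check of the step count. Key names are illustrative and vary by trainer; batch size 1 is an assumption, inferred from 48 × 4 × 7 = 1,344.

```python
# The hyperparameters above, gathered into a single illustrative config dict.
# Key names depend on the LoRA trainer used; these are not tool-specific flags.
lora_config = {
    "base_model": "flux-dev",
    "dataset_size": 48,            # curated images from the client DAM
    "num_repeats": 4,
    "epochs": 7,
    "batch_size": 1,               # assumed, to match the reported 1,344 steps
    "resolution": 1024,
    "network_rank": 4,             # LoRA rank
    "learning_rate": 1e-4,
    "trigger_word": "BRANDSTYLE",  # hypothetical stand-in for the brand activator
}

# Step count check: 48 images x 4 repeats x 7 epochs / batch size 1 = 1,344 steps.
steps = (
    lora_config["dataset_size"]
    * lora_config["num_repeats"]
    * lora_config["epochs"]
    // lora_config["batch_size"]
)
assert steps == 1344
```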

I also experimented with CLIP model selection. Switching the text encoder from the standard CLIP-L to Long-CLIP ViT-L made a noticeable difference: fewer artifacts, cleaner compositions, and better adherence to complex multi-layered prompts.

Results — Abstract

The LoRA transformed what the model could produce. Abstract outputs now matched the client's DAM imagery — photorealistic textures, correct directional daylight, and an editorial quality that felt intentionally photographed rather than generated.

DAM reference image AI-generated output

Left: reference from the client's existing asset library. Right: AI-generated with custom LoRA + structured prompts.

Abstract result 1 Abstract result 2 Abstract result 3

Results — Indicative

The first LoRA iteration still struggled with the indicative direction — many outputs showed roof windows in the frame, which wasn't desired. The model needed to imply the light source without depicting it.

After refining the training dataset and adjusting the approach, the second LoRA iteration cracked it. The model could generate convincing attic and loft spaces — bedrooms, bathrooms, kitchens, home offices, dining rooms — all with characteristic roof-window light patterns, without showing the product.

Window visible — not desired Window implied — correct approach

Left: early LoRA showing window in frame (not desired). Right: refined LoRA with correct light but no visible product.

Dining room Bathroom Home office

Inpainting & Refinement

Even fine-tuned outputs contain artifacts — garbled text on book spines, impossible object geometry, inconsistent shadows. Single-shot generation isn't production-ready. I built two additional ComfyUI workflows to close the gap.

Workflow 1
Object Inpainting

Mask and regenerate specific objects — fix broken geometry, replace garbled text, add or remove elements while preserving composition and lighting.

Workflow 2
Detail Pass

A second-pass workflow using differential diffusion to upscale and refine textures, sharpen edges, and add photographic detail without altering the base composition.
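A minimal sketch of the object-inpainting step, written with diffusers' FluxInpaintPipeline for illustration. The production workflows are ComfyUI graphs, and the file names, prompt, and parameter values here are representative rather than the project's settings.

```python
# Sketch of Workflow 1 (object inpainting): mask a region and regenerate only that area
# while the rest of the image, and therefore the composition and lighting, stays fixed.
import torch
from diffusers import FluxInpaintPipeline
from diffusers.utils import load_image

pipe = FluxInpaintPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

base = load_image("generated_room.png")   # output of the main generation pass
mask = load_image("bookshelf_mask.png")   # white = regenerate, black = keep

fixed = pipe(
    prompt="bookshelf with plain linen-bound books, no text on the spines",
    image=base,
    mask_image=mask,
    strength=0.85,            # how strongly the masked region is re-noised
    guidance_scale=3.5,
    num_inference_steps=50,
).images[0]

fixed.save("generated_room_fixed.png")
```

The detail pass works on the same principle as a low-strength second pass over the upscaled image, although plain image-to-image in diffusers is not identical to the differential-diffusion approach used in the ComfyUI workflow.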

Before inpainting After inpainting

The Generator Tool

Beyond the R&D outputs, I built a web-based image generation tool for the client's team. The interface abstracts away the complexity of ComfyUI and prompt engineering, enabling creatives to generate brand-consistent imagery directly.

The tool supports three modes matching the creative categories (Abstract, Indicative, Edit), includes an AI-enhanced prompt toggle, and gives power users control over aspect ratio, composition locking, and CFG scale. A beta launched with a select group within the client organization.
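As a sketch of what the interface exposes, a hypothetical request schema covering those controls might look like the following. The real tool's API, field names, and defaults are not shown here.

```python
# Hypothetical request schema for the generator tool, covering the controls described above.
# Names and defaults are illustrative, not the tool's actual API.
from enum import Enum
from pydantic import BaseModel

class Mode(str, Enum):
    ABSTRACT = "abstract"
    INDICATIVE = "indicative"
    EDIT = "edit"

class GenerationRequest(BaseModel):
    mode: Mode = Mode.ABSTRACT
    prompt: str
    enhance_prompt: bool = True      # AI-enhanced prompt toggle
    aspect_ratio: str = "4:3"        # power-user control
    lock_composition: bool = False   # keep composition fixed across regenerations
    cfg_scale: float = 3.5           # power-user control
```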

Screenshot of the image generator tool interface

What I Learned

Prompt engineering has a ceiling. When a concept doesn't exist well in training data, no amount of prompt craft will overcome it. Fine-tuning is the answer — and surprisingly accessible. A rank-4 LoRA trained on just 48 images for three hours fundamentally changed what the model could produce.

CLIP model selection matters more than expected. Switching encoders reduced artifacts and improved complex prompt adherence with no additional training. And the real production workflow isn't generation — it's generation plus inpainting plus a detail pass. The multi-step pipeline is what makes the output actually usable.

The LoRA didn't just learn a style — it learned the physics of roof-window light.