# FaceForge Generator: Vision Transformer-Based Face Manipulation

🎨 **252M Parameters | ViT-Based | Baseline Training Complete**

> ⚠️ **RESEARCH USE ONLY**: This model is intended for academic research and for developing detection systems.
## Model Description

FaceForge Generator is a Vision Transformer-based facial manipulation system that learns to synthesize realistic face swaps. It combines dual ViT encoders, a cross-attention fusion module, a transformer decoder, and a CNN upsampler to generate high-quality facial manipulations.
**Key Features:**

- 252 million trainable parameters
- Dual encoder architecture for source and target faces
- Cross-attention fusion mechanism
- Generates 224×224 RGB face images
- ~300 ms inference time per image
- 0.204 validation loss after 3 epochs of baseline training
## Model Architecture

```text
FaceForge Generator (252.5M parameters)
│
├── ViT Encoders (172M params)
│   ├── Source Encoder: ViT-B/16 (86M)
│   │   └── 12 layers, 768-dim, 12 heads
│   └── Target Encoder: ViT-B/16 (86M)
│       └── 12 layers, 768-dim, 12 heads
│
├── Cross-Attention Module (14M params)
│   ├── 2 layers, 8 heads
│   ├── FFN: 768 → 3072 → 768
│   └── Dropout: 0.1
│
├── Transformer Decoder (58M params)
│   ├── 256 learnable queries (16×16)
│   ├── 6 decoder layers, 8 heads
│   └── 2D positional embeddings
│
└── CNN Upsampler (9M params)
    ├── TransposeConv: 768 → 512 → 256 → 128 → 64
    ├── 4 upsampling stages (16×16 → 224×224)
    └── Conv: 64 → 32 → 3 + Tanh
```
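As a quick sanity check, the per-component parameter counts in the tree sum to the stated total (within rounding of the sub-million digits):

```python
# Component parameter counts as quoted in the architecture tree.
components = {
    "ViT encoders": 172e6,
    "Cross-attention": 14e6,
    "Transformer decoder": 58e6,
    "CNN upsampler": 9e6,
}

total = sum(components.values())
print(f"{total / 1e6:.1f}M")  # 253.0M -- consistent with the 252.5M total once rounding is accounted for
```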
## Training Progress

### Baseline Training (3 Epochs)
| Epoch | Train Loss | Val Loss | Time (min) |
|---|---|---|---|
| 1 | 0.2873 | 0.2804 | 227.5 |
| 2 | 0.2432 | 0.2304 | 231.2 |
| 3 | 0.2143 | 0.2043 | 228.8 |
**Total Training Time:** 11.5 hours (687.5 minutes)

### Loss Reduction

- Training loss: 0.287 → 0.214 (25.4% reduction)
- Validation loss: 0.280 → 0.204 (27.1% reduction)
- Minimal overfitting (train-val gap: 0.010)
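These figures can be recomputed directly from the epoch table above (from the unrounded epoch-1/epoch-3 values, the training-loss reduction works out to about 25.4%):

```python
# Epoch 1 and epoch 3 losses from the baseline training table.
train_start, train_end = 0.2873, 0.2143
val_start, val_end = 0.2804, 0.2043

train_drop = (train_start - train_end) / train_start * 100  # relative reduction, %
val_drop = (val_start - val_end) / val_start * 100
gap = train_end - val_end                                   # train-val gap at epoch 3

print(f"train: {train_drop:.1f}%, val: {val_drop:.1f}%, gap: {gap:.3f}")
# train: 25.4%, val: 27.1%, gap: 0.010
```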
## Usage

### Installation

```bash
pip install torch torchvision timm pillow numpy
```

### Loading the Model
```python
import torch
import torch.nn as nn
import timm
from PIL import Image
from torchvision import transforms

class FaceForgeGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Source and target ViT encoders
        self.source_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        self.target_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        # Cross-attention, transformer decoder, and CNN upsampler are
        # omitted here -- implement them as described in the paper.

    def forward(self, source_face, target_face):
        # Encode both faces
        source_features = self.source_encoder.forward_features(source_face)
        target_features = self.target_encoder.forward_features(target_face)
        # Cross-attention fusion
        fused_features = self.cross_attention(source_features, target_features)
        # Decode to a spatial feature map
        spatial_features = self.transformer_decoder(fused_features)
        # Upsample to 224x224
        generated_face = self.cnn_upsampler(spatial_features)
        return generated_face

# Load checkpoint
model = FaceForgeGenerator()
checkpoint = torch.load('generator_best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# Generate a face swap
def generate_face_swap(source_path, target_path):
    source = transform(Image.open(source_path).convert('RGB')).unsqueeze(0)
    target = transform(Image.open(target_path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        generated = model(source, target)
    # Denormalize and convert back to a PIL image
    generated = (generated[0] * 0.5 + 0.5).clamp(0, 1)
    return transforms.ToPILImage()(generated)

# Example
result = generate_face_swap("source.jpg", "target.jpg")
result.save("generated.jpg")
```
## Training Details

### Dataset

- Source: FaceForensics++ (c40 compression)
- Training: 7,000 face images (triplets: source, target, ground truth)
- Validation: 1,500 face images
- Resolution: 224×224 RGB
### Hyperparameters

```yaml
optimizer: AdamW
learning_rate: 1e-4
betas: [0.9, 0.999]
weight_decay: 1e-4
batch_size: 16
epochs: 3 (baseline)
loss_function: L1 (mean absolute error)
lr_schedule: cosine annealing (1e-4 → 1e-6)
```
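A minimal sketch of how these hyperparameters map onto PyTorch; the `nn.Linear` stands in for the real generator, and the training loop itself is not shown:

```python
import torch
from torch import nn

# Placeholder module standing in for FaceForgeGenerator.
model = nn.Linear(8, 8)

# AdamW with the listed betas and weight decay.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-4
)

# Cosine annealing from 1e-4 down to 1e-6 over the 3 baseline epochs
# (call scheduler.step() once per epoch).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=3, eta_min=1e-6
)

# L1 loss = mean absolute error between generated and ground-truth faces.
criterion = nn.L1Loss()
```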
### Training Configuration

- Hardware: CPU
- Throughput: ~32 samples/minute
- Batch processing: 219 train batches, 47 val batches per epoch
- Best model: saved at epoch 3
## Current Status

> ⚠️ **Baseline Training:** This model has completed 3 epochs of baseline training. For production-quality face generation, extended training (15-20 epochs) is recommended.

**Current Capabilities:**

- ✅ Learns pose transfer
- ✅ Captures facial structures
- ✅ Shows a clear convergence trend
- ⏳ Some blur in generated images (expected at the baseline stage)
- ⏳ Would benefit from extended training
## Use Cases

### Research Applications

- **Detector training:** generate challenging samples for deepfake detection
- **Adversarial training:** min-max game between generator and detector
- **Understanding manipulation:** study how synthetic faces are created
- **Benchmark creation:** generate test sets for evaluation

### Educational Uses

- Demonstrate face generation techniques
- Teach computer vision concepts
- Illustrate transformer architectures
- Visualize attention mechanisms
## Limitations

- **Training duration:** only 3 epochs completed; extended training is needed for photo-realism
- **Blur:** generated faces show some blur at the baseline stage
- **Dataset scale:** trained on an 8,500-image dataset (7,000 train / 1,500 val); larger datasets would improve quality
- **Single frame:** does not model temporal consistency for video
- **Compute:** the 252M-parameter model requires significant memory
## Ethical Guidelines

> ⚠️ **Responsible Use Required**

**This model is intended for:**

- ✅ Academic research
- ✅ Deepfake detection development
- ✅ Educational demonstrations
- ✅ Ethical AI studies

**Prohibited uses:**

- ❌ Creating misinformation
- ❌ Identity theft or impersonation
- ❌ Non-consensual face manipulation
- ❌ Malicious content creation

**Recommendations:**

- Watermark generated content
- Maintain audit logs
- Require user consent
- Implement content filters
## Future Improvements

Planned enhancements:

- Extended training (15-20 epochs)
- Perceptual loss functions (VGG, LPIPS)
- GAN-based adversarial training
- Multi-scale architecture
- Attention visualization
- Temporal consistency for video
## Citation

```bibtex
@techreport{nasir2026faceforge,
  title={FaceForge: A Deep Learning Framework for Facial Manipulation Generation and Detection},
  author={Nasir, Huzaifa},
  institution={National University of Computer and Emerging Sciences},
  year={2026},
  doi={10.5281/zenodo.18530439}
}
```
## Links

- Paper: https://doi.org/10.5281/zenodo.18530439
- Code: https://github.com/Huzaifanasir95/FaceForge
- Detector model: https://huggingface.co/Huzaifanasir95/faceforge-detector
- Notebooks: see the repository for training/inference notebooks
## Architecture Details

### Vision Transformer Encoder

- Patch size: 16×16
- Patches: 196 (+ 1 CLS token)
- Embedding dim: 768
- Layers: 12
- Attention heads: 12
- MLP ratio: 4.0
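These numbers follow directly from the patch geometry:

```python
# ViT-B/16 token geometry for a 224x224 input.
image_size, patch_size = 224, 16
patches_per_side = image_size // patch_size   # 14 patches along each side
num_patches = patches_per_side ** 2           # 196 patch tokens
num_tokens = num_patches + 1                  # plus 1 CLS token = 197

embed_dim, mlp_ratio = 768, 4.0
ffn_hidden = int(embed_dim * mlp_ratio)       # 3072-dim MLP hidden layer

print(num_patches, num_tokens, ffn_hidden)  # 196 197 3072
```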
### Cross-Attention Mechanism

- Query: source features
- Key/Value: target features
- Attention: multi-head (8 heads)
- FFN expansion: 4× (768 → 3072 → 768)
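A minimal sketch of one such layer, assuming standard pre-norm residual wiring; `CrossAttentionBlock` is an illustrative name, not the repository's API, and the released implementation may wire the residuals differently:

```python
import torch
from torch import nn

class CrossAttentionBlock(nn.Module):
    """Source tokens attend to target tokens (Q = source, K/V = target)."""

    def __init__(self, dim=768, heads=8, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        # FFN with 4x expansion: 768 -> 3072 -> 768.
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(dim * 4, dim),
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, source, target):
        # Query = source features, Key/Value = target features.
        attended, _ = self.attn(self.norm1(source), target, target)
        x = source + attended
        x = x + self.ffn(self.norm2(x))
        return x

block = CrossAttentionBlock().eval()
with torch.no_grad():
    fused = block(torch.randn(1, 197, 768), torch.randn(1, 197, 768))
print(fused.shape)  # torch.Size([1, 197, 768])
```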
### CNN Upsampler

- Input: 768×16×16
- Output: 3×224×224
- Stages: 4 transposed convolutions
- Kernel: 4×4, stride: 2, padding: 1
- Activation: ReLU (hidden) → Tanh (output)
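Note that four stride-2 transposed convolutions take 16×16 to 256×256, so reaching exactly 224×224 requires an extra resize somewhere; the sketch below assumes a bilinear interpolation before the output head, which may differ from the released implementation:

```python
import torch
from torch import nn
import torch.nn.functional as F

class CNNUpsampler(nn.Module):
    """Illustrative upsampler: 768x16x16 feature map -> 3x224x224 image."""

    def __init__(self):
        super().__init__()
        chans = [768, 512, 256, 128, 64]
        stages = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            # Each stage doubles spatial size: kernel 4, stride 2, padding 1.
            stages += [nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.stages = nn.Sequential(*stages)  # 16x16 -> 256x256
        # Output head: 64 -> 32 -> 3 with Tanh, matching the [-1, 1] normalization.
        self.head = nn.Sequential(
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, kernel_size=3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        x = self.stages(x)                                  # (B, 64, 256, 256)
        # Assumed resize step to hit 224x224 exactly.
        x = F.interpolate(x, size=224, mode="bilinear", align_corners=False)
        return self.head(x)                                 # (B, 3, 224, 224)

with torch.no_grad():
    out = CNNUpsampler()(torch.randn(1, 768, 16, 16))
print(out.shape)  # torch.Size([1, 3, 224, 224])
```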
## License

This model is released under the CC BY 4.0 license. Use responsibly and ethically.

## Author

**Huzaifa Nasir**
National University of Computer and Emerging Sciences (NUCES)
Islamabad, Pakistan
📧 [email protected]

## Acknowledgments

- Vision Transformer (Dosovitskiy et al.)
- FaceForensics++ dataset
- PyTorch and timm libraries
- Open-source AI community