# FaceForge Generator: Vision Transformer-Based Face Manipulation

🎨 **252M Parameters | ViT-Based | Baseline Training Complete**

> ⚠️ **RESEARCH USE ONLY**: This model is intended for academic research and for developing detection systems.
## Model Description

FaceForge Generator is a Vision Transformer-based facial manipulation system that learns to synthesize realistic face swaps. It combines dual ViT encoders, a cross-attention fusion module, a transformer decoder, and a CNN upsampler to generate high-quality facial manipulations.
**Key Features:**

- 252 million trainable parameters
- Dual encoder architecture for source and target faces
- Cross-attention fusion mechanism
- Generates 224×224 RGB face images
- ~300 ms inference time per image
- 0.204 validation loss after 3 epochs of baseline training
## Model Architecture

```text
FaceForge Generator (252.5M parameters)
│
├── ViT Encoders (172M params)
│   ├── Source Encoder: ViT-B/16 (86M)
│   │   └── 12 layers, 768-dim, 12 heads
│   └── Target Encoder: ViT-B/16 (86M)
│       └── 12 layers, 768-dim, 12 heads
│
├── Cross-Attention Module (14M params)
│   ├── 2 layers, 8 heads
│   ├── FFN: 768 → 3072 → 768
│   └── Dropout: 0.1
│
├── Transformer Decoder (58M params)
│   ├── 256 learnable queries (16×16)
│   ├── 6 decoder layers, 8 heads
│   └── 2D positional embeddings
│
└── CNN Upsampler (9M params)
    ├── TransposeConv: 768 → 512 → 256 → 128 → 64
    ├── 4 upsampling stages (16×16 → 224×224)
    └── Conv: 64 → 32 → 3 + Tanh
```
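As a quick sanity check, the per-component parameter counts in the tree sum to the stated total (within rounding of the sub-million digits):

```python
# Component parameter counts as quoted in the architecture tree.
components = {
    "ViT encoders": 172e6,
    "Cross-attention": 14e6,
    "Transformer decoder": 58e6,
    "CNN upsampler": 9e6,
}

total = sum(components.values())
print(f"{total / 1e6:.1f}M")  # 253.0M -- consistent with the 252.5M total once rounding is accounted for
```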
## Training Progress

### Baseline Training (3 Epochs)
| Epoch | Train Loss | Val Loss | Time (min) |
|---|---|---|---|
| 1 | 0.2873 | 0.2804 | 227.5 |
| 2 | 0.2432 | 0.2304 | 231.2 |
| 3 | 0.2143 | 0.2043 | 228.8 |
**Total Training Time:** 11.5 hours (687.5 minutes)

### Loss Reduction

- Training loss: 0.287 → 0.214 (25.4% reduction)
- Validation loss: 0.280 → 0.204 (27.1% reduction)
- Minimal overfitting (train-val gap: 0.010)
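These figures can be recomputed directly from the epoch table above (from the unrounded epoch-1/epoch-3 values, the training-loss reduction works out to about 25.4%):

```python
# Epoch 1 and epoch 3 losses from the baseline training table.
train_start, train_end = 0.2873, 0.2143
val_start, val_end = 0.2804, 0.2043

train_drop = (train_start - train_end) / train_start * 100  # relative reduction, %
val_drop = (val_start - val_end) / val_start * 100
gap = train_end - val_end                                   # train-val gap at epoch 3

print(f"train: {train_drop:.1f}%, val: {val_drop:.1f}%, gap: {gap:.3f}")
# train: 25.4%, val: 27.1%, gap: 0.010
```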
## Usage

### Installation

```bash
pip install torch torchvision timm pillow numpy
```

### Loading the Model
```python
import torch
import torch.nn as nn
import timm
from PIL import Image
from torchvision import transforms

class FaceForgeGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Source and target ViT encoders
        self.source_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        self.target_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        # Cross-attention, transformer decoder, and CNN upsampler are
        # omitted here -- implement them as described in the paper.

    def forward(self, source_face, target_face):
        # Encode both faces
        source_features = self.source_encoder.forward_features(source_face)
        target_features = self.target_encoder.forward_features(target_face)
        # Cross-attention fusion
        fused_features = self.cross_attention(source_features, target_features)
        # Decode to a spatial feature map
        spatial_features = self.transformer_decoder(fused_features)
        # Upsample to 224x224
        generated_face = self.cnn_upsampler(spatial_features)
        return generated_face

# Load checkpoint
model = FaceForgeGenerator()
checkpoint = torch.load('generator_best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# Generate a face swap
def generate_face_swap(source_path, target_path):
    source = transform(Image.open(source_path).convert('RGB')).unsqueeze(0)
    target = transform(Image.open(target_path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        generated = model(source, target)
    # Denormalize and convert back to a PIL image
    generated = (generated[0] * 0.5 + 0.5).clamp(0, 1)
    return transforms.ToPILImage()(generated)

# Example
result = generate_face_swap("source.jpg", "target.jpg")
result.save("generated.jpg")
```
## Training Details

### Dataset

- Source: FaceForensics++ (c40 compression)
- Training: 7,000 face images (triplets: source, target, ground truth)
- Validation: 1,500 face images
- Resolution: 224×224 RGB
### Hyperparameters

```yaml
optimizer: AdamW
learning_rate: 1e-4
betas: [0.9, 0.999]
weight_decay: 1e-4
batch_size: 16
epochs: 3 (baseline)
loss_function: L1 (mean absolute error)
lr_schedule: cosine annealing (1e-4 → 1e-6)
```
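A minimal sketch of how these hyperparameters map onto PyTorch; the `nn.Linear` stands in for the real generator, and the training loop itself is not shown:

```python
import torch
from torch import nn

# Placeholder module standing in for FaceForgeGenerator.
model = nn.Linear(8, 8)

# AdamW with the listed betas and weight decay.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-4
)

# Cosine annealing from 1e-4 down to 1e-6 over the 3 baseline epochs
# (call scheduler.step() once per epoch).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=3, eta_min=1e-6
)

# L1 loss = mean absolute error between generated and ground-truth faces.
criterion = nn.L1Loss()
```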
### Training Configuration

- Hardware: CPU
- Throughput: ~32 samples/minute
- Batch processing: 219 train batches, 47 val batches per epoch
- Best model: saved at epoch 3
## Current Status

> ⚠️ **Baseline Training:** This model has completed 3 epochs of baseline training. For production-quality face generation, extended training (15-20 epochs) is recommended.

**Current Capabilities:**

- ✅ Learns pose transfer
- ✅ Captures facial structures
- ✅ Shows a clear convergence trend
- ⏳ Some blur in generated images (expected at the baseline stage)
- ⏳ Would benefit from extended training
## Use Cases

### Research Applications

- **Detector training:** generate challenging samples for deepfake detection
- **Adversarial training:** min-max game between generator and detector
- **Understanding manipulation:** study how synthetic faces are created
- **Benchmark creation:** generate test sets for evaluation

### Educational Uses

- Demonstrate face generation techniques
- Teach computer vision concepts
- Illustrate transformer architectures
- Visualize attention mechanisms
## Limitations

- **Training duration:** only 3 epochs completed; extended training is needed for photo-realism
- **Blur:** generated faces show some blur at the baseline stage
- **Dataset scale:** trained on an 8,500-image dataset (7,000 train / 1,500 val); larger datasets would improve quality
- **Single frame:** does not model temporal consistency for video
- **Compute:** the 252M-parameter model requires significant memory
## Ethical Guidelines

> ⚠️ **Responsible Use Required**

**This model is intended for:**

- ✅ Academic research
- ✅ Deepfake detection development
- ✅ Educational demonstrations
- ✅ Ethical AI studies

**Prohibited uses:**

- ❌ Creating misinformation
- ❌ Identity theft or impersonation
- ❌ Non-consensual face manipulation
- ❌ Malicious content creation

**Recommendations:**

- Watermark generated content
- Maintain audit logs
- Require user consent
- Implement content filters
## Future Improvements

Planned enhancements:

- Extended training (15-20 epochs)
- Perceptual loss functions (VGG, LPIPS)
- GAN-based adversarial training
- Multi-scale architecture
- Attention visualization
- Temporal consistency for video
## Citation

```bibtex
@techreport{nasir2026faceforge,
  title={FaceForge: A Deep Learning Framework for Facial Manipulation Generation and Detection},
  author={Nasir, Huzaifa},
  institution={National University of Computer and Emerging Sciences},
  year={2026},
  doi={10.5281/zenodo.18530439}
}
```
## Links

- Paper: https://doi.org/10.5281/zenodo.18530439
- Code: https://github.com/Huzaifanasir95/FaceForge
- Detector model: https://huggingface.co/Huzaifanasir95/faceforge-detector
- Notebooks: see the repository for training/inference notebooks
## Architecture Details

### Vision Transformer Encoder

- Patch size: 16×16
- Patches: 196 (+ 1 CLS token)
- Embedding dim: 768
- Layers: 12
- Attention heads: 12
- MLP ratio: 4.0
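These numbers follow directly from the patch geometry:

```python
# ViT-B/16 token geometry for a 224x224 input.
image_size, patch_size = 224, 16
patches_per_side = image_size // patch_size   # 14 patches along each side
num_patches = patches_per_side ** 2           # 196 patch tokens
num_tokens = num_patches + 1                  # plus 1 CLS token = 197

embed_dim, mlp_ratio = 768, 4.0
ffn_hidden = int(embed_dim * mlp_ratio)       # 3072-dim MLP hidden layer

print(num_patches, num_tokens, ffn_hidden)  # 196 197 3072
```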
### Cross-Attention Mechanism

- Query: source features
- Key/Value: target features
- Attention: multi-head (8 heads)
- FFN expansion: 4× (768 → 3072 → 768)
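A minimal sketch of one such layer, assuming standard pre-norm residual wiring; `CrossAttentionBlock` is an illustrative name, not the repository's API, and the released implementation may wire the residuals differently:

```python
import torch
from torch import nn

class CrossAttentionBlock(nn.Module):
    """Source tokens attend to target tokens (Q = source, K/V = target)."""

    def __init__(self, dim=768, heads=8, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        # FFN with 4x expansion: 768 -> 3072 -> 768.
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(dim * 4, dim),
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, source, target):
        # Query = source features, Key/Value = target features.
        attended, _ = self.attn(self.norm1(source), target, target)
        x = source + attended
        x = x + self.ffn(self.norm2(x))
        return x

block = CrossAttentionBlock().eval()
with torch.no_grad():
    fused = block(torch.randn(1, 197, 768), torch.randn(1, 197, 768))
print(fused.shape)  # torch.Size([1, 197, 768])
```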
### CNN Upsampler

- Input: 768×16×16
- Output: 3×224×224
- Stages: 4 transposed convolutions
- Kernel: 4×4, stride: 2, padding: 1
- Activation: ReLU (hidden) → Tanh (output)
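Note that four stride-2 transposed convolutions take 16×16 to 256×256, so reaching exactly 224×224 requires an extra resize somewhere; the sketch below assumes a bilinear interpolation before the output head, which may differ from the released implementation:

```python
import torch
from torch import nn
import torch.nn.functional as F

class CNNUpsampler(nn.Module):
    """Illustrative upsampler: 768x16x16 feature map -> 3x224x224 image."""

    def __init__(self):
        super().__init__()
        chans = [768, 512, 256, 128, 64]
        stages = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            # Each stage doubles spatial size: kernel 4, stride 2, padding 1.
            stages += [nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.stages = nn.Sequential(*stages)  # 16x16 -> 256x256
        # Output head: 64 -> 32 -> 3 with Tanh, matching the [-1, 1] normalization.
        self.head = nn.Sequential(
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, kernel_size=3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        x = self.stages(x)                                  # (B, 64, 256, 256)
        # Assumed resize step to hit 224x224 exactly.
        x = F.interpolate(x, size=224, mode="bilinear", align_corners=False)
        return self.head(x)                                 # (B, 3, 224, 224)

with torch.no_grad():
    out = CNNUpsampler()(torch.randn(1, 768, 16, 16))
print(out.shape)  # torch.Size([1, 3, 224, 224])
```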
## License

This model is released under the CC BY 4.0 license. Use responsibly and ethically.

## Author

**Huzaifa Nasir**
National University of Computer and Emerging Sciences (NUCES)
Islamabad, Pakistan
📧 [email protected]

## Acknowledgments

- Vision Transformer (Dosovitskiy et al.)
- FaceForensics++ dataset
- PyTorch and timm libraries
- Open-source AI community