Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models
Official repository for the paper "Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models" (https://arxiv.org/pdf/2602.01738).
If you have any questions, please feel free to open a discussion in the Community tab. For direct inquiries, you can also reach out to us via email at [email protected].
This directory contains the 7 vision foundation model baselines used in the paper:
- MetaCLIP-Linear
- MetaCLIP2-Linear
- SigLIP-Linear
- SigLIP2-Linear
- PE-CLIP-Linear
- DINOv2-Linear
- DINOv3-Linear

Repository contents:

- models.py: unified model-loading code for all 7 baselines
- test_vfm_baselines.py: unified evaluation script
- weights/: released checkpoints
- core/vision_encoder/: vendored PE vision encoder code required by PE-CLIP-Linear

The unified loader and test script accept these names:
- metacliplin
- metaclip2lin
- sigliplin
- siglip2lin
- pelin
- dinov2lin
- dinov3lin

The paper names such as MetaCLIP-Linear and DINOv3-Linear are also accepted.
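Since both naming schemes are accepted, the loader presumably normalizes paper-style names down to the short keys. The sketch below shows one plausible normalization rule; the canonical key list is taken from this README, but the function itself is an illustrative assumption, not the repository's actual code.

```python
# Canonical loader keys, as listed in this README.
CANONICAL = {
    "metacliplin", "metaclip2lin", "sigliplin", "siglip2lin",
    "pelin", "dinov2lin", "dinov3lin",
}

def normalize_model_name(name: str) -> str:
    """Map a paper-style name like 'DINOv3-Linear' to a short loader key.

    This is a hypothetical helper, not the repo's real implementation.
    """
    key = name.lower().replace("-", "").replace("_", "")
    # Paper names end in "Linear"; loader keys use the short "lin" suffix.
    if key.endswith("linear"):
        key = key[: -len("linear")] + "lin"
    # Special case: "PE-CLIP-Linear" maps to the key "pelin".
    if key == "pecliplin":
        key = "pelin"
    if key not in CANONICAL:
        raise ValueError(f"unknown model name: {name!r}")
    return key
```

For example, `normalize_model_name("MetaCLIP-Linear")` yields `"metacliplin"`, and short keys such as `"pelin"` pass through unchanged.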
Evaluate a single model:
python test_vfm_baselines.py \
--model sigliplin \
--real-dir /path/to/0_real \
--fake-dir /path/to/1_fake \
--max-samples 100
Evaluate all 7 models:
python test_vfm_baselines.py \
--model all \
--real-dir /path/to/0_real \
--fake-dir /path/to/1_fake \
--max-samples 100
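The commands above follow the common 0_real / 1_fake directory convention: images under --real-dir are labeled 0 and images under --fake-dir are labeled 1, with --max-samples capping each class. A minimal sketch of such a collection step, using only the standard library (an assumption about the dataset layout, not the script's actual dataset code):

```python
from pathlib import Path

IMG_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}

def collect_samples(real_dir, fake_dir, max_samples=None):
    """Return (path, label) pairs: 0 for real images, 1 for fake images.

    Hypothetical helper mirroring the --real-dir / --fake-dir /
    --max-samples options; not the repository's real loader.
    """
    samples = []
    for label, root in ((0, real_dir), (1, fake_dir)):
        paths = sorted(
            p for p in Path(root).rglob("*") if p.suffix.lower() in IMG_EXTS
        )
        if max_samples is not None:
            # Cap each class independently, matching --max-samples.
            paths = paths[:max_samples]
        samples.extend((p, label) for p in paths)
    return samples
```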
Optional arguments:
- --checkpoint: override the default checkpoint for single-model evaluation
- --batch-size: batch size for evaluation
- --num-workers: dataloader workers
- --device: explicit device such as cuda:0 or cpu
- --save-json: save results to a JSON file

The release code expects these Python packages:
- torch
- torchvision
- transformers
- scikit-learn
- Pillow
- timm
- einops
- ftfy
- regex
- huggingface_hub

PE-CLIP-Linear uses the vendored core/vision_encoder code in this directory. The weights/ directory is arranged locally for packaging convenience; for public release, the checkpoints can be uploaded under the same filenames.
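For reference, here is a minimal sketch of what --save-json output might look like: one entry per model name, written with the standard library. The field names and metric values are illustrative assumptions, not the script's actual schema.

```python
import json
from pathlib import Path

def save_results(results, path):
    """Write evaluation results to a JSON file, creating parent dirs.

    Hypothetical helper illustrating the --save-json option.
    """
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(results, indent=2))

# Example payload with placeholder metric values.
results = {
    "sigliplin": {"accuracy": 0.97, "ap": 0.99, "num_samples": 100},
}
save_results(results, "results/vfm_baselines.json")
```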