- HD resolution - 1280×720 · 32 fps
- Per-frame keyboard and mouse input plus world state (player position, velocity, weapon ...)
- HD Stereo audio
- All 10 players' perspectives
https://huggingface.co/collections/blanchon/opencs2
The pipeline adapts to the source. It begins with collecting target URLs from sitemaps or APIs into a text file, which also lets me track progress.

I then fetch the content concurrently: Go with 50 to 200 goroutines handles large scrapes, while Python's ThreadPoolExecutor works for smaller jobs. This stage requires retry logic, rate limiting, and checkpoint files to resume interrupted downloads.

The custom work happens during parsing, since every site structures its data differently. I extract the target data using BeautifulSoup or goquery for HTML and standard parsers for APIs, then filter the output to drop binaries, validate UTF-8, and skip generated files using tools like go-enry. The clean data gets written to an intermediate JSONL format, appending with a file lock for thread safety.

Finally, I convert the JSONL files to Parquet using DuckDB, PyArrow, or parquet-go. These get compressed with Zstandard at level 19, using row groups of 10K to 100K rows and shards of 512 MB to 1 GB. In short, Go handles the high-throughput scraping, Python manages the custom parsing, and DuckDB takes care of the format conversions.
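For the smaller-job Python path, the fetch stage described above could look roughly like the sketch below. It is not the actual pipeline code: the function names (`scrape`, `fetch_with_retry`, `append_record`, `load_done`), the record fields, and the one-URL-per-line checkpoint layout are all assumptions made for illustration. An in-process `threading.Lock` stands in for the file lock, and rate limiting is left out to keep it short.

```python
import json
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

# Assumption: a single in-process lock stands in for the file lock.
_write_lock = threading.Lock()


def load_done(checkpoint):
    """Return the set of URLs already recorded in the checkpoint file."""
    if not checkpoint.exists():
        return set()
    return set(checkpoint.read_text(encoding="utf-8").splitlines())


def fetch_with_retry(fetch, url, retries=3, backoff=0.5):
    """Call fetch(url), retrying with exponential backoff on any exception."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)


def append_record(out, checkpoint, url, text):
    """Append one JSONL row and mark the URL as done, under one lock."""
    line = json.dumps({"url": url, "text": text}, ensure_ascii=False)
    with _write_lock:
        with out.open("a", encoding="utf-8") as f:
            f.write(line + "\n")
        with checkpoint.open("a", encoding="utf-8") as f:
            f.write(url + "\n")


def scrape(urls, fetch, out, checkpoint, workers=8, backoff=0.5):
    """Fetch every URL not yet checkpointed; return how many rows were written."""
    done = load_done(checkpoint)
    pending = [u for u in urls if u not in done]

    def work(url):
        text = fetch_with_retry(fetch, url, backoff=backoff)
        append_record(out, checkpoint, url, text)

    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(work, u) for u in pending]
        for fut in as_completed(futures):
            # A URL that failed every retry is never checkpointed,
            # so the next run picks it up again.
            if fut.exception() is None:
                written += 1
    return written
```

Here `fetch` is any callable that takes a URL and returns text, for example a thin wrapper around `urllib.request.urlopen`. Passing it in keeps the retry and checkpoint logic independent of the HTTP client, and rerunning `scrape` with the same checkpoint file skips everything that already succeeded.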
Thanks! Since the datasets vary so much in size and format, I write custom parsing and processing pipelines for almost every single one.
In short, the students won by fine-tuning LFM2, a foundation model built by Liquid AI, a $2 billion startup from MIT.
SEO spam has also become a lot less noticeable. I'm hoping that the next step will be to crack down on storage and traffic abuse, and maybe that will mean more generous storage limits.