@giffmana @ducha_aiki I don't understand the purpose of benchmarking data loaders on such minuscule datasets. What problem is this solving? All of these datasets fit in RAM.
@rom1504 @giffmana @ducha_aiki Exactly, these are completely invalid use cases for testing these datasets... and why would you ever want to train on a home machine from an S3 bucket?
@rom1504 @giffmana @ducha_aiki Also, this is a really bad thing to do for your credit card bill. Egress from buckets can get very costly. I racked up a $3k bill by mixing up compute vs bucket regions in a training session over a 24h period... yeurk.
@wightmanr @rom1504 @giffmana What would be a reasonable strategy in terms of speed and cost?
@ducha_aiki @rom1504 @giffmana If you train in the cloud, use sharded (record-based) datasets in buckets in the same region as your compute: S3 for AWS, GCS for GCP, etc., with webdataset (tar files) or TFDS (TFRecord). For local training, if the data is beyond the size of a fast SSD, shard it on a local NAS or spinning disk.
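To make the sharded approach concrete, here is a minimal WebDataset sketch; the bucket path, shard range, and .jpg/.cls key names are hypothetical, and it assumes the shards were written as tar files with one image and one label entry per sample:

import webdataset as wds

# Hypothetical shard location; keep the bucket in the same region as your compute.
shards = "pipe:aws s3 cp s3://my-bucket/train-{000000..000639}.tar -"

dataset = (
    wds.WebDataset(shards)
    .decode("pil")             # decode image bytes into PIL images
    .to_tuple("jpg", "cls")    # yield (image, label) pairs from each tar record
)

for image, label in dataset:
    pass  # hand off to the usual DataLoader / training loop

The "pipe:" URL streams each shard through the aws CLI, so shards are read sequentially from the bucket instead of doing per-sample random reads.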
@wightmanr @ducha_aiki @rom1504 @giffmana There's also a new format on top of @ApacheArrow for CV datasets github.com/eto-ai/lance
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow I don't see why you'd want to put compressed data in Arrow; it doesn't make sense...
@wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow Lance is an alternative to Parquet or TFRecords; being Arrow compatible makes it usable with things like DuckDB, Arrow Flight, and Ray.
@wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow @changhiskhan might have a better pitch
@michalwols @wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow You pretty much nailed it. Being Arrow compatible makes it easy to analyze CV datasets (e.g., to ensure the training data has the right distribution). You can also consolidate rich metadata with the training data, and it supports really fast scans over cloud storage.
@michalwols @wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow The flip side of the cloud-storage thing is that having a single source of truth in cloud storage makes MLOps a lot easier.
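To illustrate the analysis point above, here is a rough sketch of what Arrow compatibility buys; it uses a plain pyarrow table of image metadata as a stand-in for a real Lance/Parquet dataset (the column names are made up) and relies on DuckDB's ability to query Arrow tables in place:

import duckdb
import pyarrow as pa

# Stand-in metadata table; in practice this would be read from the dataset itself.
meta = pa.table({
    "uri":   ["img/0.jpg", "img/1.jpg", "img/2.jpg"],
    "label": ["cat", "dog", "cat"],
    "width": [640, 480, 640],
})

# DuckDB scans Arrow tables directly, so checking the label distribution
# (or any other sanity check on the training data) is a one-line SQL query.
print(duckdb.sql("SELECT label, count(*) AS n FROM meta GROUP BY label ORDER BY n DESC"))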