@ducha_aiki @rom1504 @giffmana If you train in the cloud, use sharded (record-based) datasets in buckets in the same region as your compute: s3 for AWS, gs for GCP, etc. webdataset (tarfiles), TFDS (tfrecord). For local training, if the dataset is beyond the size of a fast SSD, shard it on a local NAS or spinning disk.
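A minimal sketch of that setup, assuming a PyTorch + webdataset stack; the bucket name and shard naming pattern are made up for illustration, not anyone's actual pipeline:

```python
import webdataset as wds
from torch.utils.data import DataLoader
from torchvision import transforms

# Hypothetical shard layout: tar shards in a bucket in the same region as the compute.
shards = "pipe:gsutil cat gs://my-bucket/train-{000000..000511}.tar"

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

dataset = (
    wds.WebDataset(shards)            # streams tar shards sequentially
    .shuffle(1000)                    # bounded in-memory shuffle buffer
    .decode("pil")                    # decode image bytes with PIL
    .to_tuple("jpg", "cls")           # (image, label) keyed by extension inside the tar
    .map_tuple(preprocess, lambda y: y)
)

loader = DataLoader(dataset, batch_size=64, num_workers=8)
```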
@wightmanr @ducha_aiki @rom1504 @giffmana There's also a new format on top of @ApacheArrow for CV datasets github.com/eto-ai/lance
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow I don't see why you'd want to put compressed data in Arrow; it doesn't make sense....
@wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow Lance is an alternative to parquet or tfrecords; being Arrow-compatible makes it usable with things like DuckDB, Arrow Flight, and Ray.
@wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow Personally I'd love a parquet with fast point lookups, filtering, and built-in indexing support (ANN, inverted index, FTS).
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow But you can't beat physics. If you have terabytes of large binary objects, you either need them all in RAM or on a very, very fast SSD on a VERY expensive cluster, OR you wait on seeks/lookups to slower disks or network storage. No freebies.
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow Further, packing already maximally compressed image/video data into columnar or other formats just adds more decode overhead and extra passes over the same data.
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow For training, record formats are optimal because you usually want to see every record. For analysis, columnar / DB formats make sense. So store metadata and extracted features in that form and have them point to the raw data in record blobs. Covers both use cases....
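A rough sketch of that split, assuming pyarrow and duckdb on the analysis side; the column names, paths, and feature layout are illustrative assumptions:

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Columnar metadata table that points into record-format (tar) shards.
table = pa.table({
    "sample_key": ["000000012", "000000013"],                 # key of the sample inside its tar shard
    "shard": ["train-000000.tar", "train-000000.tar"],        # which shard holds the raw blob
    "label": [3, 7],
    "width": [640, 480],
    "height": [480, 640],
    "clip_embedding": [[0.1] * 4, [0.2] * 4],                 # extracted features (truncated for the sketch)
})
pq.write_table(table, "metadata.parquet")

# Analysis side: filter/aggregate on the table without touching the image bytes.
keep = duckdb.sql(
    "SELECT shard, sample_key FROM 'metadata.parquet' WHERE label = 3"
).fetchall()
```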
@wightmanr @michalwols @ducha_aiki @rom1504 @ApacheArrow Exactly, and with the move to ever fewer epochs over ever larger datasets, there's less and less need for efficient random indexing _for training_ (though you still need pseudorandom ordering).
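One way to get that pseudorandom ordering without any random indexing, sketched with webdataset and hypothetical shard names: reshuffle the shard list each epoch, then add a modest sample-level shuffle buffer.

```python
import random
import webdataset as wds

# Hypothetical shard list; reshuffle it every epoch.
shards = [f"pipe:gsutil cat gs://my-bucket/train-{i:06d}.tar" for i in range(512)]
random.shuffle(shards)   # shard-level shuffle: reads inside each shard stay sequential

dataset = (
    wds.WebDataset(shards)
    .shuffle(5000)       # sample-level shuffle buffer: bounded memory, no random seeks
    .decode("pil")
    .to_tuple("jpg", "cls")
)
```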
@giffmana @wightmanr @ducha_aiki @rom1504 @ApacheArrow That really depends on what you're doing. For datasets with large class imbalance and metric learning it's hard to get away with pure sequential scans.
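For context, a sketch of the access pattern that imbalance-aware sampling implies (plain PyTorch; the label array is fabricated): every draw is an arbitrary index, i.e. a random lookup rather than a sequential scan.

```python
import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.randint(0, 10, (100_000,))      # pretend dataset labels
class_counts = torch.bincount(labels).float()
weights = 1.0 / class_counts[labels]           # rare classes get sampled more often

# Each sampled index is an arbitrary position in the dataset ->
# a random seek against disk or object storage.
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
```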
@michalwols @giffmana @ducha_aiki @rom1504 @ApacheArrow You can still address that by sharding with oversampling (the extra cheap storage still costs much less than random access). Filtering down is easy, but there are limits to that. You can also get creative and tier the data into different sets of shards, then adjust the mix on read. Not out-of-the-box though.
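A minimal sketch of "tier the data into different shard sets and adjust the mix on read": interleave two shard-backed streams with a tunable probability. The 0.3 mix and the stream names are made up; the two streams could be webdataset pipelines over rare-class and common-class shard sets.

```python
import random

def mix(stream_a, stream_b, p_a=0.3):
    """Yield samples from stream_a with probability p_a, otherwise from stream_b."""
    a, b = iter(stream_a), iter(stream_b)
    while True:
        src = a if random.random() < p_a else b
        try:
            yield next(src)
        except StopIteration:
            return

# Usage (hypothetical): rare_ds and common_ds are sequential shard readers.
# for image, label in mix(rare_ds, common_ds, p_a=0.5):
#     ...
```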