@wightmanr @ducha_aiki @rom1504 @giffmana There's also a new format on top of @ApacheArrow for CV datasets github.com/eto-ai/lance
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow I don't see why you'd want to put compressed data in arrow, it doesn't make sense....
@wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow Lance is an alternative to parquet or tfrecords, being arrow compatible makes it usable with things like duckdb, arrow flight and ray.
@wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow Personally I'd love a parquet with fast point lookups, filtering and built in indexing support (ANN, inverted index, FTS).
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow But you can't beat physics. If you have terabytes of large binary objects you either need to have them all in RAM or on very, very fast SSD on a VERY expensive cluster, OR you wait for seek/lookup on slower disks or network storage. No freebies.
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow Further, packing already maximally compressed image/video data into columnar (or other) formats just adds decode overhead and extra passes over the same data.
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow For training, record formats are optimal because you usually want to see every record. For analysis, columnar / DB makes sense. So store metadata and extracted features in columnar form and point to the raw data in record blobs. Covers both use cases ....
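The split described in that tweet (columnar metadata pointing into record blobs) can be sketched in a few lines. This is a hypothetical illustration, not any real library's layout: `pack_records` and `lookup` are made-up names, and the "index" stands in for what you would actually keep in parquet/duckdb.

```python
import io

# Hypothetical sketch: pack records (e.g. encoded images) into one blob,
# and keep a separate metadata index -- the part you'd store columnar --
# holding byte offsets that point back into the blob.

def pack_records(records):
    """Concatenate raw byte records; return blob plus (offset, length) index."""
    blob = io.BytesIO()
    index = []
    for rec in records:
        index.append({"offset": blob.tell(), "length": len(rec)})
        blob.write(rec)
    return blob.getvalue(), index

def lookup(blob, index, i):
    """Point lookup of record i: one seek + read, no scan over other records."""
    entry = index[i]
    return blob[entry["offset"]: entry["offset"] + entry["length"]]

records = [b"jpeg-bytes-0", b"jpeg-bytes-11", b"jpeg-bytes-222"]
blob, index = pack_records(records)
print(lookup(blob, index, 1))  # b'jpeg-bytes-11'
```

Analytics queries run over the small index; training loops stream the blob sequentially; neither touches the compressed image bytes until decode time.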
@wightmanr @michalwols @ducha_aiki @rom1504 @ApacheArrow Exactly, and with the move to ever fewer epochs over ever larger datasets, there's less and less need for efficient random indexing _for training_ (but yes need for pseudorandom ordering)
@giffmana @wightmanr @ducha_aiki @rom1504 @ApacheArrow That really depends on what you're doing. For datasets with large class imbalance and metric learning it's hard to get away with pure sequential scans.
@michalwols @giffmana @wightmanr @ducha_aiki @rom1504 @ApacheArrow How do LLM data loaders handle sliding windows over blocks of data? I assume you'd want the data in sequential storage, but without a shuffle buffer that breaks the pseudorandomness. Shuffle between worker shards reading sequentially?
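The scheme the question gestures at (sequential shard reads plus a shuffle buffer) can be sketched like this. Purely illustrative: `sharded_shuffle` is a made-up name, and real loaders split the shards across workers rather than looping over them in one process.

```python
import random

# Hypothetical sketch: each shard is read sequentially (cheap I/O);
# pseudorandomness comes from (1) shuffling shard order per epoch and
# (2) a small in-memory shuffle buffer over the sequential stream.

def sharded_shuffle(shards, buffer_size, seed):
    rng = random.Random(seed)
    order = list(range(len(shards)))
    rng.shuffle(order)                       # shuffle shard order per epoch
    buf = []
    for s in order:
        for item in shards[s]:               # sequential read within a shard
            buf.append(item)
            if len(buf) >= buffer_size:
                j = rng.randrange(len(buf))  # emit a random buffered item
                yield buf.pop(j)
    rng.shuffle(buf)
    yield from buf                           # drain the tail of the buffer

shards = [list(range(i * 100, (i + 1) * 100)) for i in range(8)]
stream = list(sharded_shuffle(shards, buffer_size=64, seed=0))
```

Every item is emitted exactly once, but the order is only locally random: items can never move more than roughly `buffer_size` plus one shard apart, which is the usual trade-off of buffer-based shuffling.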
@ericjang11 @michalwols @wightmanr @ducha_aiki @rom1504 @ApacheArrow Because pure text is rather small, you can cover a huge range of pages/examples in very little space, so these things are simpler. But I can't answer your question exactly, as I've never implemented one.