@wightmanr @ducha_aiki @rom1504 @giffmana There's also a new format on top of @ApacheArrow for CV datasets github.com/eto-ai/lance
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow I don't see why you'd want to put compressed data in arrow, it doesn't make sense....
@wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow Lance is an alternative to parquet or tfrecords, being arrow compatible makes it usable with things like duckdb, arrow flight and ray.
@wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow Personally I'd love a parquet with fast point lookups, filtering and built in indexing support (ANN, inverted index, FTS).
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow But you can't beat physics. If you have terabytes of large binary objects you either need to have them all in RAM or on very, very fast SSD on a VERY expensive cluster, OR you wait for seek/lookup on slower disks or network storage. No freebies.
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow Further, packing already maximally compressed image/video data into columnar (or other) formats just adds decode overhead and extra passes over the same data.
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow For training, record formats are optimal because you usually want to see every record. For analysis, columnar / DB makes sense. So store metadata and extracted features in columnar form and point to the raw data in record blobs. Covers both use cases ....
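The split described in that tweet (columnar metadata pointing into record blobs) can be sketched in a few lines. This is a hypothetical illustration, not any real library's layout: `pack_records` and `lookup` are made-up names, and the "index" stands in for what you would actually keep in parquet/duckdb.

```python
import io

# Hypothetical sketch: pack records (e.g. encoded images) into one blob,
# and keep a separate metadata index -- the part you'd store columnar --
# holding byte offsets that point back into the blob.

def pack_records(records):
    """Concatenate raw byte records; return blob plus (offset, length) index."""
    blob = io.BytesIO()
    index = []
    for rec in records:
        index.append({"offset": blob.tell(), "length": len(rec)})
        blob.write(rec)
    return blob.getvalue(), index

def lookup(blob, index, i):
    """Point lookup of record i: one seek + read, no scan over other records."""
    entry = index[i]
    return blob[entry["offset"]: entry["offset"] + entry["length"]]

records = [b"jpeg-bytes-0", b"jpeg-bytes-11", b"jpeg-bytes-222"]
blob, index = pack_records(records)
print(lookup(blob, index, 1))  # b'jpeg-bytes-11'
```

Analytics queries run over the small index; training loops stream the blob sequentially; neither touches the compressed image bytes until decode time.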
@wightmanr @michalwols @ducha_aiki @rom1504 @ApacheArrow Exactly, and with the move to ever fewer epochs over ever larger datasets, there's less and less need for efficient random indexing _for training_ (but yes need for pseudorandom ordering)
@giffmana @wightmanr @ducha_aiki @rom1504 @ApacheArrow That really depends on what you're doing. For datasets with large class imbalance and metric learning it's hard to get away with pure sequential scans.
@michalwols @giffmana @wightmanr @ducha_aiki @rom1504 @ApacheArrow How do LLM data loaders handle sliding windows over blocks of data? I assume you'd want the data in sequential storage, but without a shuffle buffer that breaks the pseudorandomness. Shuffle between worker shards reading sequentially?
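The scheme the question gestures at (sequential shard reads plus a shuffle buffer) can be sketched like this. Purely illustrative: `sharded_shuffle` is a made-up name, and real loaders split the shards across workers rather than looping over them in one process.

```python
import random

# Hypothetical sketch: each shard is read sequentially (cheap I/O);
# pseudorandomness comes from (1) shuffling shard order per epoch and
# (2) a small in-memory shuffle buffer over the sequential stream.

def sharded_shuffle(shards, buffer_size, seed):
    rng = random.Random(seed)
    order = list(range(len(shards)))
    rng.shuffle(order)                       # shuffle shard order per epoch
    buf = []
    for s in order:
        for item in shards[s]:               # sequential read within a shard
            buf.append(item)
            if len(buf) >= buffer_size:
                j = rng.randrange(len(buf))  # emit a random buffered item
                yield buf.pop(j)
    rng.shuffle(buf)
    yield from buf                           # drain the tail of the buffer

shards = [list(range(i * 100, (i + 1) * 100)) for i in range(8)]
stream = list(sharded_shuffle(shards, buffer_size=64, seed=0))
```

Every item is emitted exactly once, but the order is only locally random: items can never move more than roughly `buffer_size` plus one shard apart, which is the usual trade-off of buffer-based shuffling.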
@ericjang11 @michalwols @wightmanr @ducha_aiki @rom1504 @ApacheArrow Because pure text is rather small, you can cover a huge range of pages/examples in very little space, so these things are simpler. But I can't answer your question exactly, as I've never implemented one.