@ducha_aiki @rom1504 @giffmana If you train in the cloud, use sharded (record-based) datasets in buckets in the same region as your compute: s3 for AWS, gs for GCP, etc. webdataset (tarfiles), TFDS (tfrecord). For local training, if the dataset is beyond the size of a fast SSD, shard it on a local NAS or spinning disk.
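A minimal sketch of that setup, assuming a PyTorch + webdataset stack; the bucket name and shard naming pattern are made up for illustration, not anyone's actual pipeline:

```python
import webdataset as wds
from torch.utils.data import DataLoader
from torchvision import transforms

# Hypothetical shard layout: tar shards in a bucket in the same region as the compute.
shards = "pipe:gsutil cat gs://my-bucket/train-{000000..000511}.tar"

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

dataset = (
    wds.WebDataset(shards)            # streams tar shards sequentially
    .shuffle(1000)                    # bounded in-memory shuffle buffer
    .decode("pil")                    # decode image bytes with PIL
    .to_tuple("jpg", "cls")           # (image, label) keyed by extension inside the tar
    .map_tuple(preprocess, lambda y: y)
)

loader = DataLoader(dataset, batch_size=64, num_workers=8)
```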
@wightmanr @ducha_aiki @rom1504 @giffmana There's also a new format on top of @ApacheArrow for CV datasets github.com/eto-ai/lance
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow I don't see why you'd want to put compressed data in Arrow; it doesn't make sense....
@wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow Lance is an alternative to parquet or tfrecords; being Arrow-compatible makes it usable with things like DuckDB, Arrow Flight, and Ray.
@wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow Personally I'd love a parquet with fast point lookups, filtering, and built-in indexing support (ANN, inverted index, FTS).
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow But you can't beat physics. If you have terabytes of large binary objects, you either need them all in RAM or on a very, very fast SSD on a VERY expensive cluster, OR you wait on seeks/lookups to slower disks or network storage. No freebies.
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow Further, packing already maximally compressed image/video data into columnar or other formats just adds more decode overhead and extra passes over the same data.
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow For training, record formats are optimal because you usually want to see every record. For analysis, columnar / DB formats make sense. So store metadata and extracted features in that form and have them point to the raw data in record blobs. Covers both use cases....
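A rough sketch of that split, assuming pyarrow and duckdb on the analysis side; the column names, paths, and feature layout are illustrative assumptions:

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Columnar metadata table that points into record-format (tar) shards.
table = pa.table({
    "sample_key": ["000000012", "000000013"],                 # key of the sample inside its tar shard
    "shard": ["train-000000.tar", "train-000000.tar"],        # which shard holds the raw blob
    "label": [3, 7],
    "width": [640, 480],
    "height": [480, 640],
    "clip_embedding": [[0.1] * 4, [0.2] * 4],                 # extracted features (truncated for the sketch)
})
pq.write_table(table, "metadata.parquet")

# Analysis side: filter/aggregate on the table without touching the image bytes.
keep = duckdb.sql(
    "SELECT shard, sample_key FROM 'metadata.parquet' WHERE label = 3"
).fetchall()
```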
@wightmanr @michalwols @ducha_aiki @rom1504 @ApacheArrow Exactly, and with the move to ever fewer epochs over ever larger datasets, there's less and less need for efficient random indexing _for training_ (though you still need pseudorandom ordering).
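One way to get that pseudorandom ordering without any random indexing, sketched with webdataset and hypothetical shard names: reshuffle the shard list each epoch, then add a modest sample-level shuffle buffer.

```python
import random
import webdataset as wds

# Hypothetical shard list; reshuffle it every epoch.
shards = [f"pipe:gsutil cat gs://my-bucket/train-{i:06d}.tar" for i in range(512)]
random.shuffle(shards)   # shard-level shuffle: reads inside each shard stay sequential

dataset = (
    wds.WebDataset(shards)
    .shuffle(5000)       # sample-level shuffle buffer: bounded memory, no random seeks
    .decode("pil")
    .to_tuple("jpg", "cls")
)
```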
@giffmana @wightmanr @ducha_aiki @rom1504 @ApacheArrow That really depends on what you're doing. For datasets with large class imbalance and metric learning it's hard to get away with pure sequential scans.
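For context, a sketch of the access pattern that imbalance-aware sampling implies (plain PyTorch; the label array is fabricated): every draw is an arbitrary index, i.e. a random lookup rather than a sequential scan.

```python
import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.randint(0, 10, (100_000,))      # pretend dataset labels
class_counts = torch.bincount(labels).float()
weights = 1.0 / class_counts[labels]           # rare classes get sampled more often

# Each sampled index is an arbitrary position in the dataset ->
# a random seek against disk or object storage.
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
```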
@michalwols @giffmana @ducha_aiki @rom1504 @ApacheArrow You can still address that by sharding with oversampling (the extra cheap storage still costs much less than random access). Filtering down is easy, but there are limits to that. You can also get creative and tier the data into different sets of shards, then adjust the mix on read. Not out-of-the-box though.
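A minimal sketch of "tier the data into different shard sets and adjust the mix on read": interleave two shard-backed streams with a tunable probability. The 0.3 mix and the stream names are made up; the two streams could be webdataset pipelines over rare-class and common-class shard sets.

```python
import random

def mix(stream_a, stream_b, p_a=0.3):
    """Yield samples from stream_a with probability p_a, otherwise from stream_b."""
    a, b = iter(stream_a), iter(stream_b)
    while True:
        src = a if random.random() < p_a else b
        try:
            yield next(src)
        except StopIteration:
            return

# Usage (hypothetical): rare_ds and common_ds are sequential shard readers.
# for image, label in mix(rare_ds, common_ds, p_a=0.5):
#     ...
```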