@giffmana @ducha_aiki I don't understand the purpose of benchmarking data loaders on such minuscule datasets. What problem is this solving? All of these datasets fit in RAM.
@rom1504 @giffmana @ducha_aiki Exactly, these are completely invalid use cases for testing these datasets... and why would you ever want to train on a home machine from an S3 bucket?
@rom1504 @giffmana @ducha_aiki Also, this is a really bad thing to do for your credit card bill. Egress from buckets can get very costly. I racked up a $3k bill by mixing up compute vs bucket regions in a training session over a 24h period... yeurk.
@wightmanr @rom1504 @giffmana What would be a reasonable strategy in terms of speed and cost?
@ducha_aiki @rom1504 @giffmana If you train in the cloud, use sharded (record-based) datasets in buckets in the same region as your compute: S3 for AWS, GCS for GCP, etc., with webdataset (tar files) or TFDS (TFRecord). For local training, if the data is beyond the size of a fast SSD, shard it on a local NAS or spinning disk.
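To make the sharded approach concrete, here is a minimal WebDataset sketch; the bucket path, shard range, and .jpg/.cls key names are hypothetical, and it assumes the shards were written as tar files with one image and one label entry per sample:

import webdataset as wds

# Hypothetical shard location; keep the bucket in the same region as your compute.
shards = "pipe:aws s3 cp s3://my-bucket/train-{000000..000639}.tar -"

dataset = (
    wds.WebDataset(shards)
    .decode("pil")             # decode image bytes into PIL images
    .to_tuple("jpg", "cls")    # yield (image, label) pairs from each tar record
)

for image, label in dataset:
    pass  # hand off to the usual DataLoader / training loop

The "pipe:" URL streams each shard through the aws CLI, so shards are read sequentially from the bucket instead of doing per-sample random reads.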
@wightmanr @ducha_aiki @rom1504 @giffmana There's also a new format on top of @ApacheArrow for CV datasets github.com/eto-ai/lance
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow I don't see why you'd want to put compressed data in Arrow; it doesn't make sense...
@wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow Lance is an alternative to Parquet or TFRecords; being Arrow compatible makes it usable with things like DuckDB, Arrow Flight, and Ray.
@wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow @changhiskhan might have a better pitch
@michalwols @wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow You pretty much nailed it. Being Arrow compatible makes it easy to analyze CV datasets (e.g., to ensure the training data has the right distribution). You can also consolidate rich metadata with the training data, and it supports really fast scans over cloud storage.
@michalwols @wightmanr @ducha_aiki @rom1504 @giffmana @ApacheArrow The flip side of the cloud-storage thing is that having a single source of truth in cloud storage makes MLOps a lot easier.
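To illustrate the analysis point above, here is a rough sketch of what Arrow compatibility buys; it uses a plain pyarrow table of image metadata as a stand-in for a real Lance/Parquet dataset (the column names are made up) and relies on DuckDB's ability to query Arrow tables in place:

import duckdb
import pyarrow as pa

# Stand-in metadata table; in practice this would be read from the dataset itself.
meta = pa.table({
    "uri":   ["img/0.jpg", "img/1.jpg", "img/2.jpg"],
    "label": ["cat", "dog", "cat"],
    "width": [640, 480, 640],
})

# DuckDB scans Arrow tables directly, so checking the label distribution
# (or any other sanity check on the training data) is a one-line SQL query.
print(duckdb.sql("SELECT label, count(*) AS n FROM meta GROUP BY label ORDER BY n DESC"))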