Dmytro Mishkin 🇺🇦 @ducha_aiki, Twitter Profile

Dmytro Mishkin 🇺🇦 @ducha_aiki

2 years ago

An Overview of the Data-Loader Landscape: Comparative Performance Analysis
Iason Ofeidis, Diego Kiedanski, Leandros Tassiulas
tl;dr: use ffcv if you can and DeepLake otherwise.
arxiv.org/abs/2209.13705…

Lucas Beyer (bl16) @giffmana

2 years ago

@ducha_aiki Should also have included tf.data. I hate it with a passion, but can't deny its efficiency, especially when reading data over the wire like in cloud environments.

Ross Wightman @wightmanr

2 years ago

@giffmana @ducha_aiki Yeah, definitely should compare tf.data / TFDS ... it's my default for most cloud training, esp in GCP even though I'm always using PyTorch. For other large scale training webdataset is the default.

Ross Wightman @wightmanr

2 years ago

@giffmana @ducha_aiki There appear to be issues with the analysis, and don't feel it focuses on interesting configs/scenarios. Doing any analysis of CIFAR in formats like webdataset is pretty much pointless.

Lucas Beyer (bl16) @giffmana

2 years ago

@wightmanr @ducha_aiki Good point. We can actually hold all the tested datasets in RAM nowadays ¯\_(ツ)_/¯

Mark Tenenholtz @marktenenholtz

2 years ago

@giffmana @wightmanr @ducha_aiki Every time I try to write some extra fancy streaming dataloader or optimized disk loading beyond the standard stuff, I realize it'd probably be cheaper to run fewer experiments on a bigger machine 😂

Sebastian Raschka @rasbt

2 years ago

@marktenenholtz @giffmana @wightmanr @ducha_aiki Or back in grad school when everyone was trying to configure Spark when you could just run it on a different machine using pandas 😆

Lucas Beyer (bl16) @giffmana

2 years ago

@marktenenholtz @wightmanr @ducha_aiki Yes! And the older we get, the more we value our time ;-)
