26 March 2021

ETL benchmark: how long does it take to process 1 billion lines?

By Pierre-Nicolas Schwab, PhD in marketing, director of IntoTheMinds

In a previous article, I highlighted the importance of processing speed when choosing a data preparation solution (ETL). I ran a first benchmark comparing Alteryx, Tableau Prep, and Anatella on a file of 108 million lines. This time, I repeated the operation on 1.039 billion lines and added Talend to the benchmark. The results are unexpected: processing speeds vary by a factor of almost 20 between the fastest and the slowest solution.

TEASER: stay tuned. In my next article, I'll show you a trick to increase processing speed by a factor of 10.

Credits: Shutterstock

Introduction

When it comes to data preparation, speed is, in my opinion, one of the differentiating factors between "ETL"-type solutions. Much data preparation work is still done on files extracted from information systems, and handling large files can quickly make that work very laborious.

In my first test, I showed a 39% difference in processing time between Alteryx and Anatella for a file of 108 million lines. The processing was straightforward: open a flat file (CSV) and sort on the first column. Here, we go to the next level with a flat file of one billion lines (almost 50 GB!).


Credits: Shutterstock

Methodology

For this test, I prepared a .CSV file with 1.039 billion lines and 9 columns. The file "weighed" 48 GB, and I stored everything on a standard hard disk.
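For readers who want to reproduce something comparable, here is a minimal Python sketch of how a file of the same shape could be generated (illustrative assumptions only: the actual test file was not produced with this script, and the column values shown are arbitrary):

    # Illustrative generator for a CSV of the same shape: 9 columns,
    # 7th column restricted to 0/1. Values are arbitrary assumptions.
    import csv
    import random

    ROWS = 1_039_000_000  # 1.039 billion lines; reduce this for a quick local test
    with open("benchmark.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([f"col{i}" for i in range(1, 10)])  # 9 column headers
        for _ in range(ROWS):
            row = [random.randint(0, 999_999) for _ in range(9)]
            row[6] = random.randint(0, 1)  # 7th column only contains 0 and 1
            writer.writerow(row)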

I developed a straightforward data preparation process consisting of 3 steps (a minimal code sketch follows the list):

  • opening the CSV file
  • descending sort on the first column
  • group by the values of the 7th column (which only contained 0 and 1)
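
For reference, here is what those same three steps look like expressed in plain Python/pandas. This is a sketch only: a naive in-memory approach like this would need far more RAM than the 48 GB file itself, which is precisely why streaming ETL tools exist.

    # The three benchmark steps, expressed in pandas for clarity.
    # Do not expect this to handle 48 GB comfortably in memory.
    import pandas as pd

    df = pd.read_csv("benchmark.csv")                    # step 1: open the CSV file
    df = df.sort_values(df.columns[0], ascending=False)  # step 2: descending sort on the first column
    counts = df.groupby(df.columns[6]).size()            # step 3: group by the 7th column (0/1)
    print(counts)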

For this benchmark, I compared 4 well-known data preparation solutions:

  • Talend Open Studio v7.3.1
  • Anatella v2.35
  • Tableau Prep 2020.2.1
  • Alteryx 2020.1

I performed all of the tests on a desktop machine equipped with 96 GB of RAM and a 7th-generation Intel Core i7 processor.

The data was stored on a 6 TB Western Digital HDD.



The ETL solution at the top of the podium is 19 times faster than the one at the bottom.



Results

The results are pretty surprising and totally unexpected for me. I also performed tests on smaller data sets and observed that processing time does not scale linearly with file size.
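If you want to check this non-linearity yourself, a simple approach is to time the same pipeline on growing samples of the file (a quick sketch, reusing the pandas version of the pipeline shown above):

    # Time the pipeline on increasing sample sizes to see how it scales.
    import time
    import pandas as pd

    for n in (1_000_000, 10_000_000, 100_000_000):
        df = pd.read_csv("benchmark.csv", nrows=n)  # read only the first n rows
        t0 = time.perf_counter()
        df.sort_values(df.columns[0], ascending=False).groupby(df.columns[6]).size()
        print(f"{n:>11,} rows: {time.perf_counter() - t0:.1f} s")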

Solution        Processing time (seconds)
Anatella        730
Alteryx         2,290
Tableau Prep    2,526
Talend          13,954
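
Converted into throughput, these times make the gap even more tangible (a quick back-of-the-envelope computation from the table above):

    # Throughput derived from the measured times (1.039 billion rows).
    rows = 1_039_000_000
    times = {"Anatella": 730, "Alteryx": 2290, "Tableau Prep": 2526, "Talend": 13954}
    for tool, seconds in times.items():
        print(f"{tool}: {rows / seconds / 1e6:.2f} million rows/s")
    print(f"Talend / Anatella ratio: {times['Talend'] / times['Anatella']:.1f}x")  # ~19.1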

You’ll find below the screenshots of the different pipelines and the associated processing times.

The results of this test are surprising: the tools show huge differences on what is an identical, very simple process.

The best-performing solution is Anatella v2.35, developed by TIMi. The processing is completed in only 730 seconds.

The slowest-performing solution is Talend Open Studio v7.3.1, which takes almost 4 hours to process the same dataset. Talend Open Studio is, therefore, 19 times slower than Anatella! Imagine the poor data scientist who has to work with it: he starts processing his data in the morning and can only begin working on the results when he comes back from his lunch break around 1 pm.

The 2nd and 3rd places in the ranking are occupied by Alteryx and Tableau Prep respectively, whose processing times are separated by only about 4 minutes.


Result of processing a dataset of one billion lines using Tableau Prep.


Processing a billion lines using Talend Open Studio.


Processing a dataset of one billion lines using Alteryx.


Test pipeline for processing one billion lines using Anatella.



Credits: Shutterstock

Conclusion

The differences between the solutions tested are huge. We are talking about a factor of roughly 19 between the best-performing solution and the one that performed the worst.

This highlights the importance of processing speed in the data preparation process. "No-code" solutions are far from being equally optimized. The work of the data scientist can therefore be strongly impacted and, as a consequence, slowed down.

However, there are ways to speed up processing (using a proprietary file format as input, storing the data on an SSD rather than on an HDD). They will be the subject of a future test, which will highlight the room for improvement of each of the ETLs tested.
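As a generic illustration of the first idea (not the specific trick of the next article; Parquet is an open columnar format standing in here for the tool-specific proprietary formats), converting the CSV once so that later runs avoid re-parsing text can already pay off:

    # One-off conversion to a columnar format; subsequent reads skip CSV parsing.
    # Requires the pyarrow (or fastparquet) package. Each ETL tool has its own
    # optimized input format; Parquet is only an example.
    import pandas as pd

    pd.read_csv("benchmark.csv").to_parquet("benchmark.parquet")  # one-off conversion
    df = pd.read_parquet("benchmark.parquet")  # much faster on repeated runs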

 

 


