
Ablation on data size #66

Open
yssjtu opened this issue Mar 15, 2022 · 2 comments
Labels
question Further information is requested

Comments


yssjtu commented Mar 15, 2022

Hi, appreciate the amazing work on unsupervised code translation!
I wonder if you have done an ablation study on the training data size of TransCoder. The unsupervised model needs far more training data (over 500M functions for 3 languages) than existing code PLMs such as CodeT5 (8.35M functions for 7 languages).
How does TransCoder perform if less data is provided?

baptisteroziere added the question label Mar 15, 2022
baptisteroziere (Contributor) commented

Hi,
Thank you.
We have not really done an ablation study on the dataset size. However, the numbers you are quoting are for non-deduplicated functions; we get about the same results training on around 15M deduplicated functions.
I also remember that we lost only a few points of computational accuracy when using only a fraction (1/8th) of the data.
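
For context, here is a minimal sketch of what function-level deduplication might look like, assuming the corpus is available as plain function strings and that copies differing only in whitespace should collapse to one entry. This is just an illustration of the idea behind going from ~500M raw functions to a much smaller deduped set, not TransCoder's actual preprocessing pipeline:

```python
import hashlib

def normalize(fn_source: str) -> str:
    # Collapse all whitespace so trivially reformatted copies hash identically.
    return " ".join(fn_source.split())

def dedup_functions(functions):
    """Keep the first occurrence of each (normalized) function."""
    seen = set()
    unique = []
    for fn in functions:
        digest = hashlib.sha256(normalize(fn).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(fn)
    return unique

# Hypothetical example: three functions, two distinct after dedup.
corpus = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):   return a + b",  # whitespace-only variant, deduped away
    "def sub(a, b):\n    return a - b",
]
print(len(dedup_functions(corpus)))  # -> 2
```

Hashing a normalized form keeps the memory cost at one digest per unique function, which matters at the hundreds-of-millions-of-functions scale discussed above.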


yssjtu commented Mar 15, 2022

Hi, thanks for the quick reply!
I see that TransCoder uses functions for the DAE and BT training steps, but complete source files for XLM pretraining (https://github.com/facebookresearch/TransCoder#data-needed).
So are the 15M deduplicated functions what you used for DAE and BT?
And what data size was used for XLM?
