Hi, I appreciate the amazing work on unsupervised code translation!
I wonder if you have done an ablation study on the training data size of TransCoder? The unsupervised model needs far more training data (over 500M functions for 3 languages) than existing code PLMs such as CodeT5 (8.35M functions for 7 languages).
How does TransCoder perform if less data is provided?
Hi,
Thank you.
We have not really done an ablation study on the dataset size. However, the numbers you are quoting are for non-deduplicated functions. We get about the same results when training on around 15M deduplicated functions.
I also remember that we were losing only a few points of computational accuracy when using only a fraction (1/8th) of the data.
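For context on what "deduplicated functions" can mean in practice, here is a minimal sketch of function-level deduplication by hashing normalized source text. This is not TransCoder's actual preprocessing pipeline; `normalize` and `dedup_functions` are hypothetical helpers for illustration only.

```python
import hashlib
import re

def normalize(code: str) -> str:
    """Collapse whitespace so trivially different copies hash the same (illustrative only)."""
    return re.sub(r"\s+", " ", code).strip()

def dedup_functions(functions):
    """Keep one copy of each function, keyed by a hash of its normalized text."""
    seen = set()
    unique = []
    for fn in functions:
        key = hashlib.sha256(normalize(fn).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(fn)
    return unique

# Example: three functions, two of which differ only in whitespace.
funcs = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n        return a + b",
    "def sub(a, b):\n    return a - b",
]
print(len(dedup_functions(funcs)))  # 2
```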
Hi, thanks for the quick reply!
I see that TransCoder uses functions for DAE and BT training, but complete source files for XLM pretraining (https://github.com/facebookresearch/TransCoder#data-needed).
So are the 15M deduped functions used for DAE and BT?
What about the data size used for XLM?