
Ablation on data size #66

Open
yssjtu opened this issue Mar 15, 2022 · 2 comments
Labels
question Further information is requested

Comments


yssjtu commented Mar 15, 2022

Hi, appreciate the amazing work on unsupervised code translation!
I wonder if you have done an ablation study on the training data size of TransCoder. The unsupervised model needs far more training data (over 500M functions for 3 languages) than existing code PLMs such as CodeT5 (8.35M functions for 7 languages).
How does TransCoder perform if less data is provided?

baptisteroziere added the question label Mar 15, 2022
baptisteroziere (Contributor) commented

Hi,
Thank you.
We have not really done an ablation study on the dataset size. However, the numbers you are quoting are for non-deduplicated functions; we get about the same results training on around 15M deduplicated functions.
I also remember that we lost only a few points of computational accuracy when using only a fraction (1/8th) of the data.
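
For context, here is a minimal sketch of what function-level deduplication might look like, assuming the corpus is available as plain function strings and that copies differing only in whitespace should collapse to one entry. This is just an illustration of the idea behind going from ~500M raw functions to a much smaller deduped set, not TransCoder's actual preprocessing pipeline:

```python
import hashlib

def normalize(fn_source: str) -> str:
    # Collapse all whitespace so trivially reformatted copies hash identically.
    return " ".join(fn_source.split())

def dedup_functions(functions):
    """Keep the first occurrence of each (normalized) function."""
    seen = set()
    unique = []
    for fn in functions:
        digest = hashlib.sha256(normalize(fn).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(fn)
    return unique

# Hypothetical example: three functions, two distinct after dedup.
corpus = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):   return a + b",  # whitespace-only variant, deduped away
    "def sub(a, b):\n    return a - b",
]
print(len(dedup_functions(corpus)))  # -> 2
```

Hashing a normalized form keeps the memory cost at one digest per unique function, which matters at the hundreds-of-millions-of-functions scale discussed above.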


yssjtu commented Mar 15, 2022

Hi, thanks for the quick reply!
I see that TransCoder uses functions for the DAE and BT training steps, but complete source files for XLM pretraining (https://github.com/facebookresearch/TransCoder#data-needed).
So are the 15M deduplicated functions what you used for DAE and BT?
And what data size was used for XLM?
