Published February 7, 2019 | Version v1
Dataset
Open
Source Code Embeddings
Description
A set of six pretrained fastText models for semantic representations of source code.
Each model was trained on high-quality GitHub repositories whose primary language is one of Java, Python, C++, C#, C, or PHP. To collect the training data, 13,144 repositories were cloned, and 2,402,790,348 lines of code were read from 944,467,560 files and preprocessed, yielding a total of 944,467,560 tokens of clean training data.
For further details refer to the following paper:
Efstathiou, V., Spinellis, D., 2019. "Semantic Source Code Models Using Identifier Embeddings". In 16th International Conference on Mining Software Repositories: Data Showcase Track. MSR'19.
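As a rough illustration of how one of these pretrained models might be queried, the sketch below loads a model and lists the identifiers closest to a query term. It rests on several assumptions not stated in this record: that the files are in fastText's native binary format, that the third-party `gensim` package is installed, and that a model has been saved locally under the hypothetical name `java_model.bin`. The helper names `load_code_embeddings` and `similar_identifiers` are illustrative only.

```python
from pathlib import Path


def load_code_embeddings(model_path):
    """Load a fastText binary model as gensim keyed vectors.

    Assumes the file is in fastText's native binary format.
    """
    # Imported lazily so the helper below can be used without gensim installed.
    from gensim.models.fasttext import load_facebook_vectors

    return load_facebook_vectors(str(model_path))


def similar_identifiers(vectors, term, topn=5):
    """Return (identifier, similarity) pairs closest to `term`."""
    return [
        (word, float(score))
        for word, score in vectors.most_similar(term, topn=topn)
    ]


if __name__ == "__main__":
    # Hypothetical local file name; the record does not list file names.
    model_path = Path("java_model.bin")
    if model_path.exists():
        vectors = load_code_embeddings(model_path)
        for word, score in similar_identifiers(vectors, "parse"):
            print(f"{word}\t{score:.3f}")
```

Because fastText composes vectors from character n-grams, a query term that never appeared in the training data can usually still be embedded, which may help with rare or project-specific identifiers.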
Files
(13.2 GB)
Checksum | Size |
---|---|
md5:0a1797b09aa8020deaea4096e2dad518 | 3.1 GB |
md5:0331aa4fad384854552f79b9f7d382dc | 2.6 GB |
md5:68de3f02881ff244033ee7a4fc4a7135 | 1.6 GB |
md5:f6701447ee02802c8dcc35a76c40d661 | 2.8 GB |
md5:2e691933bd4b7a5114cf09a06c91c1ed | 1.4 GB |
md5:85127ad0f34bfeb1a17edf2cead912e8 | 1.6 GB |
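Since each file is several gigabytes, it can be worth checking a finished download against the MD5 checksum listed for it. The following is a minimal sketch using only the Python standard library; the function names and the local file name in the usage comment are hypothetical, and the checksum string is passed in exactly as it appears in the listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file's MD5 digest with a listed value such as 'md5:0a17...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Example usage (hypothetical local file name):
# ok = matches_listed_checksum("java_model.bin",
#                              "md5:0a1797b09aa8020deaea4096e2dad518")
```

Reading in chunks keeps memory use flat regardless of file size, which matters for multi-gigabyte model files.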