Source Code Embeddings

doi:10.5281/zenodo.2558730

Published February 7, 2019 | Version v1

Dataset Open

Affiliation: Athens University of Economics and Business

A set of six pretrained fastText models for semantic representations of source code.

Each model was trained on high-quality GitHub repositories whose primary language is one of Java, Python, C++, C#, C, or PHP. To collect the training data, 13,144 repositories were cloned and 2,402,790,348 lines of code were read and preprocessed, producing a total of 944,467,560 tokens of clean training data.
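As a minimal sketch of how one of these models might be queried, the snippet below loads a `.bin` file with the official `fasttext` Python package and compares two identifiers by cosine similarity. It assumes the files are standard fastText binary models (as the `.bin` extension and the description suggest) and that `fasttext` is installed separately; the query identifiers are hypothetical and chosen only for illustration.

```python
import math
from pathlib import Path


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def load_model(path):
    """Load a fastText binary model. Requires the third-party `fasttext` package."""
    import fasttext  # imported lazily so the helpers above work without it

    return fasttext.load_model(str(path))


def identifier_similarity(model, a, b):
    """Similarity between the embeddings of two identifiers."""
    return cosine_similarity(model.get_word_vector(a), model.get_word_vector(b))


if __name__ == "__main__":
    model_path = Path("java-ftskip-dim100-ws5.bin")  # one of the files listed under Files
    if model_path.exists():
        model = load_model(model_path)
        # "iterator" and "list" are hypothetical query identifiers.
        print(identifier_similarity(model, "iterator", "list"))
        print(model.get_nearest_neighbors("iterator", k=5))
```

Because fastText builds vectors from character n-grams, `get_word_vector` also returns a vector for identifiers that never appeared in the training data, although the quality of such vectors is not something this record documents.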

For further details refer to the following paper:

Efstathiou, V., Spinellis, D., 2019. "Semantic Source Code Models Using Identifier Embeddings". In 16th International Conference on Mining Software Repositories: Data Showcase Track. MSR'19.

Files (13.2 GB)

Name	MD5	Size
c-ftskip-dim100-ws5.bin	0a1797b09aa8020deaea4096e2dad518	3.1 GB
cpp-ftskip-dim100-ws5.bin	0331aa4fad384854552f79b9f7d382dc	2.6 GB
csharp-ftskip-dim100-ws5.bin	68de3f02881ff244033ee7a4fc4a7135	1.6 GB
java-ftskip-dim100-ws5.bin	f6701447ee02802c8dcc35a76c40d661	2.8 GB
php-ftskip-dim100-ws5.bin	2e691933bd4b7a5114cf09a06c91c1ed	1.4 GB
python-ftskip-dim100-ws5.bin	85127ad0f34bfeb1a17edf2cead912e8	1.6 GB
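Since each file is several gigabytes, it may be worth checking a download against the MD5 checksum published in the listing before loading it. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `verify` are illustrative, and the dictionary holds just two of the listed checksums as examples; the remaining entries can be added in the same way.

```python
import hashlib
from pathlib import Path

# Two checksums copied from the file listing; extend with the other files as needed.
EXPECTED_MD5 = {
    "c-ftskip-dim100-ws5.bin": "0a1797b09aa8020deaea4096e2dad518",
    "java-ftskip-dim100-ws5.bin": "f6701447ee02802c8dcc35a76c40d661",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True only if the file's name is known and its MD5 matches the expected value."""
    path = Path(path)
    wanted = expected.get(path.name)
    return wanted is not None and md5_of_file(path) == wanted


if __name__ == "__main__":
    for name in EXPECTED_MD5:
        if Path(name).exists():
            print(name, "OK" if verify(name) else "MISMATCH")
```

Reading in fixed-size chunks keeps memory use constant regardless of file size, which matters for multi-gigabyte models.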

Additional details

Funding: European Commission, grant 732223, CROSSMINER – Developer-Centric Knowledge Mining from Large Open-Source Software Repositories
