CoNaLa: The Code/Natural Language Challenge


Welcome to the site of CMU CoNaLa, the Code/Natural Language Challenge, a joint project of the Carnegie Mellon University NeuLab and STRUDEL Lab! This challenge was designed to test systems for generating program snippets from natural language. For example, if the input is "sort list x in reverse order", then the system would be required to output x.sort(reverse=True) in Python.

Dataset Information

We have released a dataset crawled from Stack Overflow, automatically filtered, then curated by annotators, split into 2,379 training and 500 test examples (read more about the process here). We also provide a large automatically-mined dataset with 600k examples, and links to other similar datasets. These datasets can be used for the CoNaLa challenge, or for any other research on the intersection of code and natural language.

We describe the data briefly below, and you can find more detail in our MSR 2018 paper, which we'd appreciate you citing if you use the corpus in your research or participate in the challenge:

@inproceedings{yin2018mining,
  author = {Yin, Pengcheng and Deng, Bowen and Chen, Edgar and Vasilescu, Bogdan and Neubig, Graham},
  title = {Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow},
  booktitle = {International Conference on Mining Software Repositories},
  series = {MSR},
  pages = {476--486},
  year = {2018},
  publisher = {ACM},
  doi = {10.1145/3196398.3196408},
}

Manually Curated Data

The manually curated CoNaLa dataset contains high-quality natural language intent and source code snippet pairs in Python, split into the conala-train and conala-test datasets.

The train/test splits are stored in JSON format; some examples from the dataset are shown below:

{
  "question_id": 36875258,
  "intent": "copying one file's contents to another in python",
  "rewritten_intent": "copy the content of file 'file.txt' to file 'file2.txt'",
  "snippet": "shutil.copy('file.txt', 'file2.txt')"
}

{
  "intent": "How do I check if all elements in a list are the same?", 
  "rewritten_intent": "check if all elements in list `mylist` are the same", 
  "snippet": "len(set(mylist)) == 1", 
  "question_id": 22240602
}

{
  "intent": "Iterate through words of a file in Python", 
  "rewritten_intent": "get a list of words `words` of a file 'myfile'", 
  "snippet": "words = open('myfile').read().split()", 
  "question_id": 7745260
}

Here is the description of each field in an example:

Field             Description
question_id       ID of the Stack Overflow question
intent            Natural language intent (i.e., the title of a Stack Overflow question)
rewritten_intent  Crowdsourced revised intent that aims to better reflect the full meaning of the code, typically by incorporating the variable names and function arguments that appear in the snippet. This is the input to be used by systems in the CoNaLa challenge.
snippet           A code snippet that implements the intent. This is the output of systems in the challenge.
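
For reference, here is a minimal loading sketch in Python. It assumes the files are named conala-train.json and conala-test.json and that each holds a JSON array of example objects as above; adjust the paths to your local copy.

import json

# Assumed filenames; adjust to wherever you extracted the data.
with open('conala-train.json') as f:
    train = json.load(f)  # a list of example dicts
with open('conala-test.json') as f:
    test = json.load(f)

example = train[0]
print(example['intent'])
print(example['rewritten_intent'])  # may be None for some training examples
print(example['snippet'])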

Other Data Sources

In the CoNaLa challenge, you are allowed to use other data sources to improve your system's accuracy, as long as you exclude any information from the specific Stack Overflow questions that are included in the test set. We provide links to a number of data sources below, but other sources may be used as well:

Automatically Mined Intent/Snippet Pairs

The conala-mined dataset contains 598,237 candidate intent/snippet pairs mined by our system. The file is stored in JSON Lines format, one pair per line. A description of each field:

Field                  Description
question_id            ID of the Stack Overflow question
parent_answer_post_id  ID of the answer post from which the candidate snippet was extracted
intent                 The natural language intent
snippet                The extracted code snippet
id                     Unique ID for this intent/snippet pair
prob                   Probability given by the mining model
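
As a sketch (assuming the file is named conala-mined.jsonl), the mined pairs can be read line by line, and the prob field used as a confidence filter, e.g. keeping the 100k highest-probability pairs used by the baseline below:

import json

# Assumed filename; each line of the file is one JSON object.
pairs = []
with open('conala-mined.jsonl') as f:
    for line in f:
        pairs.append(json.loads(line))

# The mining model's probability can serve as a confidence filter,
# e.g. keeping the 100k highest-probability pairs as in the baseline.
pairs.sort(key=lambda p: p['prob'], reverse=True)
top_pairs = pairs[:100000]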

External Datasets

You may also use data from other external sources.

Training Systems

To participate in the CoNaLa challenge, you should use the conala-train and/or conala-mined datasets to train a system, take the rewritten_intent field of the conala-test dataset as input, and generate output from it. More details on how to do so, along with example scripts to perform preprocessing and train a baseline sequence-to-sequence model, can be found in the conala-baseline GitHub repository.
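
As a minimal sketch of the input/output convention (not the baseline itself): each training pair maps an intent to its snippet. Note that rewritten_intent can be null in the training data, so one common choice, assumed here, is to fall back to the original question title.

import json

with open('conala-train.json') as f:
    train = json.load(f)

# Map natural language inputs to code outputs. rewritten_intent can be
# null (None) in conala-train, so fall back to the raw intent if needed.
pairs = [(ex['rewritten_intent'] or ex['intent'], ex['snippet'])
         for ex in train]

src, tgt = pairs[0]
print('input :', src)
print('output:', tgt)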

Submitting Results

The results are submitted by creating a zip file containing a single file, answer.txt, which is a JSON array with one entry per generated code snippet. An example of how to create this file can also be found in the conala-baseline directory.
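
A minimal packaging sketch, assuming predictions is a list of generated snippets in the same order as conala-test (the canonical script lives in the conala-baseline directory):

import json
import zipfile

# Placeholder predictions: one generated snippet (string) per test
# example, in the same order as conala-test.
predictions = ['x.sort(reverse=True)'] * 500

with open('answer.txt', 'w') as f:
    json.dump(predictions, f)

# CodaLab expects a zip archive containing just answer.txt.
with zipfile.ZipFile('submission.zip', 'w') as zf:
    zf.write('answer.txt')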

Once you have created this file, you can submit it to the leaderboard on CodaLab. The results are evaluated according to BLEU score after tokenization, as detailed in the scripts in the baseline GitHub repository. The official results are on the leaderboard, but we'll also be maintaining a (potentially outdated) copy here for easy browsing:

Date       Team        Name                Description                                                                            BLEU
6/18/2018  Organizers  seq2seq annot+mine  A baseline sequence-to-sequence model trained on both annotated and 100k mined data.  14.26
6/18/2018  Organizers  seq2seq annot       A baseline sequence-to-sequence model trained on annotated data only.                  10.58
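
For a rough pre-submission sanity check, NLTK's corpus BLEU over whitespace-tokenized snippets gives a ballpark number. This is not the official scorer, which uses the code tokenization from the baseline repository's scripts:

import json
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

with open('conala-test.json') as f:
    test = json.load(f)
with open('answer.txt') as f:
    predictions = json.load(f)

# Whitespace tokenization only approximates the official code tokenizer.
references = [[ex['snippet'].split()] for ex in test]
hypotheses = [p.split() for p in predictions]

smooth = SmoothingFunction().method3
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print('approximate BLEU: %.2f' % (100 * score))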

Organizers

Carnegie Mellon University NeuLab and STRUDEL Lab

Acknowledgement

The development and maintenance of the CoNaLa corpus are supported in part by the National Science Foundation under Grant No. 1815287, "Open-domain, Data-driven Code Synthesis from Natural Language". Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.