Today we are going to show a few demos of sequence-to-sequence (seq2seq) models for code completion.
In the first demo we trained a seq2seq model (more on which below) on all the Accepted Java solutions on Codeforces. The goal of the model is to predict the next token based on all the tokens seen so far. We then plug the model's output into a code editor to see how it behaves when one actually solves a competitive programming problem. Here we are solving problem A from Codeforces Round 407:
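The editor-facing setup can be sketched as a next-token predictor driving a greedy completion loop. Below is a minimal illustration in Java, with a tiny hand-written toy model standing in for the trained seq2seq network (every name and probability here is made up for illustration, not taken from the actual demo):

```java
import java.util.*;

public class GreedyCompletion {
    // Stand-in for the trained seq2seq model: maps a prefix of tokens
    // to a probability distribution over the next token.
    interface TokenModel {
        Map<String, Double> nextTokenProbs(List<String> prefix);
    }

    // Repeatedly append the single most likely next token.
    static List<String> complete(TokenModel model, List<String> prefix, int maxTokens) {
        List<String> out = new ArrayList<>(prefix);
        for (int i = 0; i < maxTokens; i++) {
            Map<String, Double> probs = model.nextTokenProbs(out);
            String best = Collections.max(probs.entrySet(),
                                          Map.Entry.comparingByValue()).getKey();
            out.add(best);
            if (best.equals(";")) break; // stop at end of statement
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy model: after "int n =" it suggests reading from the scanner.
        TokenModel toy = prefix -> {
            String last = prefix.isEmpty() ? "" : prefix.get(prefix.size() - 1);
            switch (last) {
                case "=":       return Map.of("sc", 0.8, "0", 0.2);
                case "sc":      return Map.of(".", 0.99, ";", 0.01);
                case ".":       return Map.of("nextInt", 0.9, "nextLong", 0.1);
                case "nextInt": return Map.of("(", 1.0);
                case "(":       return Map.of(")", 1.0);
                case ")":       return Map.of(";", 1.0);
                default:        return Map.of(";", 1.0);
            }
        };
        List<String> line = complete(toy, List.of("int", "n", "="), 10);
        System.out.println(String.join(" ", line));
        // prints: int n = sc . nextInt ( ) ;
    }
}
```

The real demo runs a trained network instead of the `switch` statement, but the decoding loop is the same shape: feed the prefix in, take the argmax out, repeat.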
Notice how early on the model nearly perfectly predicts all the tokens, which is not surprising, since most Java solutions begin with rather standard imports and the solution class definition. Later it perfectly predicts the entire line that reads `n`, but doesn't do as well on the line that reads `k`, which is not surprising, since `k` is a rather rare name for the second variable to be read.
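For context, a typical Accepted Java solution opens with near-identical boilerplate, which is why the early predictions are so accurate. A representative opening (a hypothetical reconstruction, not code from the demo) looks like this:

```java
import java.util.Scanner;

public class Main {
    public static void main(String[] args) {
        // In a real submission this would be new Scanner(System.in);
        // a fixed input string keeps this sketch self-contained.
        Scanner sc = new Scanner("3 7");
        int n = sc.nextInt(); // the line the model predicts perfectly
        int k = sc.nextInt(); // "k" is a rarer name, so this prediction is harder
        System.out.println(n + " " + k);
        sc.close();
    }
}
```

Nearly every solution in the training set contains some variant of the first six lines, so the model has seen them thousands of times.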
There are several interesting moments in the video. First, note how after `int n =` it predicts `sc`, understanding that `n` will probably be read from the scanner (and, while not shown in the video, if the scanner were named `in`, the model would have properly predicted `in` after `int n =`). However, when the line starts with `int ans =`, it properly recognizes that `ans` is rarely read from the input.
The second interesting moment is what happens when we print the answer. At first, when the line contains `System.out.println(ans`, it predicts a semicolon (mistakenly) and the closing parenthesis as possible next tokens, but not `- 1`. However, when we introduce the second parenthesis, `System.out.println((ans`, it then properly predicts `- 1`, the closing parenthesis, and the division by two.
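The completed line presumably ends up as something like the following (the value of `ans` is made up for illustration):

```java
public class PrintAnswer {
    public static void main(String[] args) {
        int ans = 9; // made-up value, purely for illustration
        // With a single opening parenthesis, "System.out.println(ans" can
        // legally be followed by ")" right away, which is why the model
        // suggested closing. The second "(" makes a longer arithmetic
        // expression like the one below much more likely.
        System.out.println((ans - 1) / 2);
        // prints: 4
    }
}
```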
You can also notice a distinct pause before the `for` loop is written. This is due to the fact that using such artificial intelligence suggestions completely turns off the natural intelligence the operator of the machine possesses :)
One concern with such autocomplete is that in the majority of cases most tokens are faster to type than to select from a list. To address this, in the second demo we introduce beam search, which searches for the most likely sequences of tokens. Here's what it looks like:
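Beam search keeps the k most probable partial sequences at every step instead of committing to a single token, which is what lets the model propose whole lines at once. A minimal sketch, again with an illustrative toy model in place of the trained network:

```java
import java.util.*;

public class BeamSearch {
    interface TokenModel {
        Map<String, Double> nextTokenProbs(List<String> prefix);
    }

    static class Hypothesis {
        final List<String> tokens;
        final double logProb; // sum of log-probabilities of appended tokens
        Hypothesis(List<String> tokens, double logProb) {
            this.tokens = tokens;
            this.logProb = logProb;
        }
    }

    // Expand every hypothesis by every candidate token, keep the top beamWidth.
    static List<Hypothesis> search(TokenModel model, List<String> prefix,
                                   int beamWidth, int maxTokens) {
        List<Hypothesis> beam = List.of(new Hypothesis(prefix, 0.0));
        for (int step = 0; step < maxTokens; step++) {
            List<Hypothesis> candidates = new ArrayList<>();
            for (Hypothesis h : beam) {
                String last = h.tokens.get(h.tokens.size() - 1);
                if (last.equals(";")) { candidates.add(h); continue; } // line finished
                for (Map.Entry<String, Double> e : model.nextTokenProbs(h.tokens).entrySet()) {
                    List<String> next = new ArrayList<>(h.tokens);
                    next.add(e.getKey());
                    candidates.add(new Hypothesis(next, h.logProb + Math.log(e.getValue())));
                }
            }
            candidates.sort((a, b) -> Double.compare(b.logProb, a.logProb));
            beam = candidates.subList(0, Math.min(beamWidth, candidates.size()));
        }
        return beam;
    }

    public static void main(String[] args) {
        // Toy distribution: a greedy decoder would commit to "0" (p = 0.5)
        // and end up with the less likely line "int n = 0 + 1 ;" (p = 0.35),
        // while beam search finds "int n = sc . nextInt ( ) ;" (p ≈ 0.43).
        TokenModel toy = prefix -> {
            String last = prefix.get(prefix.size() - 1);
            switch (last) {
                case "=":       return Map.of("0", 0.5, "sc", 0.45, "in", 0.05);
                case "0":       return Map.of(";", 0.3, "+", 0.7);
                case "+":       return Map.of("1", 1.0);
                case "sc":      return Map.of(".", 1.0);
                case ".":       return Map.of("nextInt", 0.95, "next", 0.05);
                case "nextInt": return Map.of("(", 1.0);
                case "next":    return Map.of("(", 1.0);
                case "(":       return Map.of(")", 1.0);
                default:        return Map.of(";", 1.0); // "1", ")", etc.
            }
        };
        List<Hypothesis> beam = search(toy, List.of("int", "n", "="), 3, 8);
        System.out.println(String.join(" ", beam.get(0).tokens));
        // prints: int n = sc . nextInt ( ) ;
    }
}
```

This is the essential trade-off from the demo: a wider beam surfaces entire likely lines (worth selecting from a list), at the cost of more model evaluations per keystroke.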
There are more rough edges here, but notice how early on the model can correctly predict entire lines of code.
Currently we do not condition the predictions on the problem statement, partly because the number of tasks available on the Internet is too small for a machine learning model to learn anything reasonable (so please help us fix that by participating here: https://r-nn.com). Once we have a working model conditioned on the statement, we expect it to be able to predict variable names, snippets that read and write data, and some basic logic.
Let’s review Seq2Seq models in the following section.