Please review the full instructions.

Conversation

${conversation}

Human response

${responseGT}
Given the conversation on the right, evaluate the responses below based on four characteristics: grammaticality, naturalness, context awareness and semantical correctness. Every characteristic is graded on a scale of 1 to 3. See the full instructions for details and examples of the scale.

Grammaticality

Decide whether a sentence is correct according to English grammar. 1 - a word sequence without any structure, 2 - two or three minor mistakes which do not harm the understanding, 3 - a correct English sentence with at most one minor mistake.

Naturalness

Decide whether a response sounds natural, i.e. whether a human would use that sentence. 1 - a concatenation of phrases, possibly in an unnatural order, 2 - the response is possibly based on a template, 3 - the response cannot be distinguished from a human response

Context awareness

Decide whether a response fits into the context of the conversation. 1 - the response assumes a different context and does not suit the user question, 2 - the response is generic and unspecific to the question, 3 - the response is specific to the user question and fits well in the context

Semantical correctness

Decide whether the response has the same content as the provided human response (the ground truth). 1 - the response is mostly unrelated to the human ground truth, 2 - the response differs in a few minor details from the human ground truth, 3 - the response resembles a paraphrase without changing the content

Introduction

You are provided with a short conversation between a user and an agent, in which the user asks for information about and bookings of restaurants, hotels, attractions, taxis and trains. Both roles, user and agent, were performed by humans. For a full description of the dataset, see MULTIWOZ.
The goal of this work is to replace the agent with an automated model that generates the responses instead. To evaluate the quality of the model, example responses need to be reviewed by humans, which is the aim of this MTurk project.

Review process

Below the conversation, you are provided with the next human response for the agent. This response can be seen as the "ground truth", and all the generated responses should express the same content (more on the review metrics below). At the bottom of the page, you will find a table of 6 responses, each generated by a different model setting. Your task is to evaluate these responses based on four metrics: grammaticality, naturalness, context awareness, and semantical correctness. All metrics are rated on a scale of 1 (worst) to 3 (best) with a step size of 0.5. Below, the four metrics and the corresponding grading schemes for the values 1, 2 and 3 are explained in more detail. Please use the intermediate steps (1.5 and 2.5) if a response cannot be clearly assigned to a single bin.

Grammaticality

For this metric, you have to evaluate whether the response uses the English language correctly. This includes a correct sentence structure, but also the correct tense and form of words. Note that this does not include capitalization, because most words were lowercased for simplicity.
Please use the following scale to evaluate the grammaticality of a sentence:
1 - The sentence is a (random) sequence of words without any clear structure, and is not understandable.
Example 1: "The the the the the the"
Example 2: "Drive hotel center I sorry"
2 - The sentence contains a few minor mistakes, such as a repeated word or the wrong tense of a word. However, these do not significantly harm the understanding of the sentence.
Example 1: "The the churchill college is in the west, and the address king's parade ."
Example 2: "Yes, there is a museums, parks and boat tours. I would recommend a cinemas."
3 - The response is a valid English sentence with at most one small (grammar) mistake.
Example 1: "I would recommend the museum of classical art. It is located in the centre, address wollaston road."
Example 2: "I absolutely can! What type of entertainment are you looking for?"

Naturalness

The "naturalness" of a sentence summarizes the characteristics of a response that make it sound as if a human had said it, rather than like a typical machine-generated response. Here, we consider words which are not strictly necessary in a sentence, but make it sound more like a human conversation. This includes words at the beginning like "Sure" and "Certainly", but also words like "also" or "as well". Nevertheless, a sentence is not required to include any of these words to be considered "natural". If the sentence/response could also have been written by a human, then the sentence is considered natural. Note that natural sentences are not required to be free of grammar or spelling mistakes, as humans also make mistakes. Hence, "naturalness" can also be understood as the "human-likeness" of the response.
Please use the following scale to evaluate the naturalness of a sentence:
1 - A sentence is considered unnatural if it is simply a concatenation of short phrases with no words to connect them. This also includes unnatural orderings of information, such as stating detailed information before general information (e.g. the phone number before the actual name).
Example 1: "The phone number is 1234. The address is ABC street. The name is XYZ restaurant."
Example 2: "Center, ABC Street, Phone 1234, "
2 - The sentence is generic and is similar to a template response.
Example 1: "I have found the Christian Art museum. The entrance fee is none."
Example 2: "The museum is located in the south. The phone number is 123000123."
3 - The sentence is possibly written by a human and/or cannot be distinguished from any other human response in terms of naturalness. It combines information into a single sentence and is less likely to be created from a template.
Example 1: "The fitzwilliam museum is a free museum in the center of town, on trumpington street."
Example 2: "I absolutely can! What type of entertainment are you looking for?"

Context awareness

To have a natural, consistent conversation, the responses of the agent need to fit the question of the user and the previous context of the conversation. Therefore, the metric "context awareness" measures whether or not a response takes the conversation so far into account. This includes both short-term dependencies, which directly relate to the user's last question, and long-term dependencies, which model relations to earlier questions and/or responses. Please use the following scale to evaluate the context awareness of a sentence:
1 - The response is clearly not suitable for the user question, and assumes a different context.
Example 1:
User question: "When is the last train leaving from Cambridge on Thursday?"
Generated response: "Great choice! The last train is leaving at 5pm."
Example 2:
User question: "How much does entrance to the cinema cost?"
Generated response: "I absolutely can! It costs 4 pounds."
2 - The response is generic (i.e. can fit in almost every context) and unspecific to the question.
Example 1:
User question: "Which cinema would you recommend?"
Generated response: "There are four cinemas in the south. The moonlight cinema has a 3.50 pounds entrance."
Example 2:
User question: "Can you tell me the phone number of the restaurant?"
Generated response: "The phone number is 123000123."

3 - The response is specific to the question and fits well in the context.
Example 1:
User question: "Which cinema would you recommend?"
Generated response: "I can recommend the moonlight cinema, it only costs 3.50 pounds!"
Example 2:
User question: "Ok, then I take the indian restaurant. Can you book me a table for four people?"
Generated response: "Great choice! I reserved a table for you, the reference number is ABC123."

Semantical correctness

For this metric, you need to compare the responses with the "ground truth" human response that is stated below the conversation. An optimal response only paraphrases that sentence and expresses exactly the same content. This includes the task (i.e. whether the agent tells the user the phone number, informs about a booking, asks for more specific information, etc.), the sentiment (i.e. "Yes I can do it" or "No I'm sorry"), and all information mentioned (i.e. the response returns all information that was asked for, like the phone number, address, etc.). Note that ungrammatical sentences which are hard or impossible to understand should also receive a low score, as they do not reflect the same content and are not paraphrases of the human response.
Please use the following scale to evaluate the semantical correctness of a sentence:
1 - The response is mostly unrelated to the human response and/or does not answer the last question of the conversation. This also includes sentences which do not have a clear content and/or are not understandable.
Example 1:
Human response: "Train ID123 leaves at 4pm."
Generated response: "I am sorry, there are no trains, there is ID123."
Example 2:
Human response: "I have the West Side hotel and the Cambridge suite. Which one would you like?"
Generated response: "There are no hotels."

2 - The response is related to the human response, but differs in a few details, such as omitting one piece of information (e.g. phone number, postcode) or confusing numbers.
Example 1:
Human response: "Train ID123 leaves at 4pm."
Generated response: "."
Example 2:
Human response: "I have the West Side hotel and the Cambridge suite. Which one would you like?"
Generated response: "."

3 - The response is a paraphrase of the human response, representing the same content.
Example 1:
Human response: "Train ID123 leaves at 4pm."
Generated response: "I have one train that leaves at 4pm, namely ID123."
Example 2:
Human response: "I have the West Side hotel and the Cambridge suite. Which one would you like?"
Generated response: "Both the Cambridge suite and the West Side hotel are available."

Contact/Questions

In case any of the instructions are unclear or the data seems to be incorrect, please don't hesitate to contact me (p.lippe@uva.nl).

Generated responses

Response
Grammaticality
Naturalness
Context awareness
Semantical correctness
${response1}
${response2}
${response3}
${response4}
${response5}
${response6}