Create Quality Utterances
Good-quality utterances have two characteristics: Typicality and Pattern Diversity.
An utterance has Typicality when it can only be interpreted within the context of its topic subject. In other words, when builders create utterances, they should consider whether their submissions are relevant and sound natural. An utterance is relevant when it clearly relates to the topic subject, and natural when a native speaker would accept it without finding it awkward.
Note that what counts as ‘natural’ differs from one language user to another. While social and regional varieties are equally valid, we focus on collecting utterances that are in line with the most common linguistic standards.
Remember that all topic subjects are assigned to specific domains. Thus, whether an utterance is relevant and natural will be judged within the context of each related domain. Builders need to take into account the particular characteristics of the domains and natural language expressions when creating utterances.
Pattern Diversity
Utterances have Pattern Diversity when their syntactic patterns vary. Pattern Diversity is an important standard for good-quality data because it means the utterances cover many different ways of expressing the same intent. Good data has pattern diversity without duplicates. That is, changing a single word or swapping in different specific entities does NOT count as pattern diversity. A set of utterances with bad pattern diversity shows no variation in sentence structure, only swapped words.
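As a rough illustration of the "one-word swap" problem, the following Python sketch flags pairs of utterances that have the same number of words and differ in exactly one position. The function names and the sample utterances are hypothetical, and whitespace splitting is an assumed stand-in for real tokenization; this is not the validators' actual check.

```python
from itertools import combinations


def differs_by_one_word(a: str, b: str) -> bool:
    """True when both utterances have the same word count and differ in exactly one position."""
    words_a, words_b = a.lower().split(), b.lower().split()
    if len(words_a) != len(words_b):
        return False
    return sum(x != y for x, y in zip(words_a, words_b)) == 1


def find_word_swaps(utterances):
    """Return every pair of utterances that only swaps a single word."""
    return [(a, b) for a, b in combinations(utterances, 2) if differs_by_one_word(a, b)]


# Hypothetical submissions for one intent.
utterances = [
    "Can I use a sheet mask twice?",
    "Can I use a face mask twice?",   # same structure, one word swapped
    "Is reusing a sheet mask okay?",  # different sentence structure
]

swaps = find_word_swaps(utterances)
for first, second in swaps:
    print(f"Low pattern diversity: {first!r} vs {second!r}")
```

A pair reported here is only a candidate for review; utterances with different structures can still be poor data for other reasons.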
Types of Bad Data
Boolean and multiple intents
Conversational AI (CAI) understands natural language in standardized forms. Despite the diverse forms sentences can take, CAI values the core meaning of the sentence. In computer science, Boolean is a data type that has two possible values. In natural language, a combination of sentences is mostly considered Boolean. For CAI, sentences that include Booleans tend to cause confusion.
Examples
Is it safe and effective to use a sheet mask twice?
Is it safe to use a sheet mask twice?
Is it effective to use a sheet mask twice?
I need to have an eye check up, it's been hurting since I used the sheet mask
I need to have an eye check up
Eyes have been hurting since I used the sheet mask
When two sentences are combined, there is a 50% chance that CAI understands only one of them (the separated versions are shown above). Therefore, we suggest including only one intent per sentence.
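A simple way to catch candidates for splitting is to look for surface cues of combination. The sketch below is a heuristic under stated assumptions: the cue list ("and", "or", commas, semicolons) and the function name are hypothetical choices, and a match only means the utterance deserves a human look, since phrases such as "salt and pepper" would also match.

```python
import re

# Assumed cue list: coordinating words and clause-joining punctuation.
COMBINATION_CUES = re.compile(r"\b(and|or)\b|[,;]", re.IGNORECASE)


def may_have_multiple_intents(utterance: str) -> bool:
    """Flag utterances that contain a cue suggesting two combined sentences."""
    return bool(COMBINATION_CUES.search(utterance))


examples = [
    "Is it safe and effective to use a sheet mask twice?",
    "Is it safe to use a sheet mask twice?",
    "I need to have an eye check up, it's been hurting since I used the sheet mask",
    "I need to have an eye check up",
]

flags = [may_have_multiple_intents(u) for u in examples]
for utterance, flagged in zip(examples, flags):
    print("REVIEW" if flagged else "ok    ", utterance)
```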
Builders and validators need to reflect carefully on the length of utterances. Relatively long utterances warrant closer review, unless the extra length comes from adjunct (optional) elements that are necessary to express the intent. On the other hand, average-length utterances (3-8 tokens) promote efficient classification. Longer utterances tend to contain semantically redundant components. Builders and validators are advised to pay close attention to the naturalness of the given utterances.
Submitting average-length sentences (3 to 8 words) is recommended but is not a strict rule. The ultimate goal of Synesis train2earn is to crowdsource a wide range of natural utterances that conversational AI can encounter across different domains and subjects.
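The length recommendation can be sketched as a small helper that labels an utterance as short, average, or long. This assumes a token is a whitespace-separated word, which may differ from how the platform actually counts tokens; the function name and labels are hypothetical. Because the range is a recommendation rather than a strict rule, the result is meant to prompt a closer look, not to reject anything.

```python
def length_label(utterance: str, low: int = 3, high: int = 8) -> str:
    """Label an utterance by word count relative to the recommended 3-8 range."""
    count = len(utterance.split())  # assumption: whitespace-separated words
    if count < low:
        return "short"
    if count > high:
        return "long"
    return "average"


samples = [
    "I need to have an eye check up",  # 8 words
    "I need to have an eye check up, it's been hurting since I used the sheet mask",  # 17 words
    "Eye check",  # 2 words
]

labels = [length_label(s) for s in samples]
for sample, label in zip(samples, labels):
    print(f"{label:8} {sample}")
```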
Punctuation & Capitalization
Capitalization and punctuation are ignored when it comes to AI understanding.
The Hobbits were taken to Isengard.
The hobbits were taken to isengard.
The Hobbits were taken to Isengard!
To the AI, these sentences are the same. Utterances created only by varying spelling, capitalization, or punctuation will be rejected by the validators.
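To see why the three Isengard sentences count as one utterance, the sketch below lowercases each sentence, strips punctuation, and keeps only the first utterance for each normalized form. The exact normalization used by the platform is not documented here, so this is an assumption; the function names are hypothetical, and `string.punctuation` covers ASCII punctuation only.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(utterance: str) -> str:
    """Lowercase, remove ASCII punctuation, and collapse repeated whitespace."""
    return " ".join(utterance.translate(_STRIP_PUNCT).lower().split())


def drop_case_punct_duplicates(utterances):
    """Keep the first utterance for each normalized form, preserving order."""
    seen = set()
    kept = []
    for utterance in utterances:
        key = normalize(utterance)
        if key not in seen:
            seen.add(key)
            kept.append(utterance)
    return kept


submitted = [
    "The Hobbits were taken to Isengard.",
    "The hobbits were taken to isengard.",
    "The Hobbits were taken to Isengard!",
]

unique = drop_case_punct_duplicates(submitted)
print(unique)
```

Running a check like this before submitting helps avoid sending variants that validators would treat as duplicates.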