# Create Quality Utterances

Good-quality utterances have two characteristics: Typicality and Pattern Diversity.

### Typicality

Typicality is fulfilled when an utterance can only be interpreted within the context of its topic subject. In other words, when builders create utterances, they should consider whether their submissions are relevant and sound natural. An utterance is relevant when it clearly relates to the topic subject, and natural when a native speaker would accept it without finding it awkward.

Note that what counts as ‘natural’ differs from one language user to another. While social and regional varieties are equally valid, we focus on collecting utterances that are in line with the most common linguistic standards.

<figure><img src="https://3840513626-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJt9AkBI7BXJFcU5QCFLb%2Fuploads%2FTrTIciQKWZzl8AZP9zKO%2Fquality1.png?alt=media&#x26;token=b80bdf25-1e2d-4fec-b87c-455af4a32843" alt=""><figcaption></figcaption></figure>

Remember that all topic subjects are assigned to specific domains. Thus, whether an utterance is relevant and natural will be judged within the context of each related domain. Builders need to take into account the particular characteristics of the domains and natural language expressions when creating utterances.

<figure><img src="https://3840513626-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJt9AkBI7BXJFcU5QCFLb%2Fuploads%2FJWjTbbkYZQymWw4RtPsY%2Fquality2.png?alt=media&#x26;token=0fb997f4-40a9-4bbc-a24c-63f84e83bb33" alt=""><figcaption></figcaption></figure>

### Pattern Diversity

We say utterances have Pattern Diversity when they have varied syntactic patterns. Pattern Diversity is an important standard of good-quality data because it means the utterances cover many different ways of expressing the same intent. Good data have pattern diversity without duplicates. That is, utterances that differ only by one changed word, or only by a different specific entity, do NOT count as diverse. The example of bad pattern diversity shows no variation in sentence structure, only swapped words.

<figure><img src="https://3840513626-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJt9AkBI7BXJFcU5QCFLb%2Fuploads%2F0ukxBoMaKNKvzRk3yfIR%2Fquality3.png?alt=media&#x26;token=fd87f8c5-6b63-4bb2-a168-5cb8b6a4709d" alt=""><figcaption></figcaption></figure>
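
One rough way to screen a batch for the swapped-word problem described above is to look for pairs of utterances that have the same number of words and differ at exactly one position. The sketch below is only an illustration: the function names and the sample utterances are invented for this example, whitespace splitting stands in for real tokenization, and it is not the check that validators actually apply.

```python
from itertools import combinations


def tokens(utterance: str) -> list[str]:
    """Lowercase the utterance and split it on whitespace (a simplifying assumption)."""
    return utterance.lower().split()


def differs_by_one_word(a: str, b: str) -> bool:
    """True when a and b have the same word count and exactly one word differs."""
    ta, tb = tokens(a), tokens(b)
    if len(ta) != len(tb):
        return False
    return sum(x != y for x, y in zip(ta, tb)) == 1


def one_word_swaps(utterances: list[str]) -> list[tuple[str, str]]:
    """Return every pair in the batch that differs only by a single swapped word."""
    return [
        (a, b)
        for a, b in combinations(utterances, 2)
        if differs_by_one_word(a, b)
    ]


# Hypothetical batch: the first two differ only in the entity ("Paris" vs "London").
batch = [
    "book a flight to Paris",
    "book a flight to London",
    "can you find me a flight to Paris",
]
for pair in one_word_swaps(batch):
    print("Low diversity:", pair)
```

A pair flagged here is a candidate for rewriting with a different sentence structure. The check only catches same-length swaps, so it will miss other kinds of low diversity.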

<mark style="color:green;">**Types of Bad Data**</mark>

#### Boolean and Multiple Intents

Conversational AI (CAI) understands natural language in standardized forms. However varied the form of a sentence, CAI relies on its core meaning. In computer science, a Boolean is a data type that has two possible values. In natural language, a sentence that combines two statements or questions is treated here as Boolean. For CAI, such sentences tend to cause confusion.

**Examples**

| Is it safe and effective to use a sheet mask twice? | <mark style="background-color:orange;">Is it safe to use a sheet mask twice?</mark>      |
| --------------------------------------------------- | ---------------------------------------------------------------------------------------- |
|                                                     | <mark style="background-color:orange;">Is it effective to use a sheet mask twice?</mark> |

<table data-header-hidden><thead><tr><th width="340.5"></th><th></th></tr></thead><tbody><tr><td>I need to have an eye check up, it's been hurting since I used the sheet mask</td><td><mark style="background-color:orange;">I need to have an eye check up</mark></td></tr><tr><td></td><td><mark style="background-color:orange;">Eyes have been hurting since I used the sheet mask</mark></td></tr></tbody></table>

When a sentence combines two intents, CAI has roughly a 50% chance of understanding it as either one of the two separated sentences shown above. Therefore, we suggest including only one intent per sentence.
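
As a rough aid, a builder could pre-screen drafts for surface signs of a combined sentence before submitting. The sketch below is a heuristic built on assumptions: the function name, the conjunction list, and the choice of commas and semicolons as clause markers are all made up for illustration, and a flagged utterance still needs human judgment, because a phrase such as "salt and pepper" is a single intent.

```python
import re

# Assumed markers of a combined sentence; not an official list.
CONJUNCTIONS = {"and", "or"}


def may_have_multiple_intents(utterance: str) -> bool:
    """Flag utterances that contain a coordinating conjunction or a clause break."""
    words = re.findall(r"[a-z']+", utterance.lower())
    has_conjunction = any(word in CONJUNCTIONS for word in words)
    has_clause_break = re.search(r"[,;]", utterance) is not None
    return has_conjunction or has_clause_break


drafts = [
    "Is it safe and effective to use a sheet mask twice?",
    "Is it safe to use a sheet mask twice?",
    "I need to have an eye check up, it's been hurting since I used the sheet mask",
]
for draft in drafts:
    label = "review" if may_have_multiple_intents(draft) else "ok"
    print(f"{label}: {draft}")
```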

#### Sentence Length

Builders and validators need to consider the length of utterances carefully. Relatively long utterances warrant closer review unless the extra length, such as adjunct (optional) elements, is necessary to express the intent. Average-length utterances (3-8 tokens), on the other hand, promote efficient classification. Longer utterances typically contain semantically redundant components. Builders and validators are advised to pay close attention to the naturalness of the given utterances.

<figure><img src="https://3840513626-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJt9AkBI7BXJFcU5QCFLb%2Fuploads%2FmCRDzgw2h3Wkg5Nd5vPS%2Ftable1.jpg?alt=media&#x26;token=5cfcac94-b784-417d-bd33-0a271f428726" alt=""><figcaption></figcaption></figure>

<figure><img src="https://3840513626-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJt9AkBI7BXJFcU5QCFLb%2Fuploads%2FVAzE9pCE59cxQ0lTodbp%2Ftable%202.jpg?alt=media&#x26;token=0010d894-d09b-4e36-b21e-d67dd90bcb4e" alt=""><figcaption></figcaption></figure>

Submitting average-length sentences (3 to 8 words) is recommended, but it is not a strict rule. The ultimate goal of Synesis train2earn is to crowdsource a wide range of natural utterances that conversational AI can encounter across different domains and subjects.
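
The length recommendation can be checked mechanically. The following is a minimal sketch, assuming that words are separated by whitespace (a real tokenizer may count differently) and using hypothetical function names. Because the 3 to 8 range is a recommendation rather than a rule, the function only marks an utterance for closer review and never rejects it.

```python
def word_count(utterance: str) -> int:
    """Count whitespace-separated words (an approximation of token count)."""
    return len(utterance.split())


def needs_length_review(utterance: str, min_words: int = 3, max_words: int = 8) -> bool:
    """True when the utterance falls outside the recommended 3 to 8 word range."""
    count = word_count(utterance)
    return count < min_words or count > max_words


for text in [
    "I need to have an eye check up",
    "I need to have an eye check up, it's been hurting since I used the sheet mask",
]:
    print(word_count(text), needs_length_review(text), text)
```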

#### Punctuation & Capitalization

Capitalization and punctuation are ignored when the AI interprets an utterance. For example:

The Hobbits were taken to Isengard.\
The hobbits were taken to isengard.\
The Hobbits were taken to Isengard!

To the AI, these sentences are the same. Utterances that differ from existing ones only in capitalization or punctuation will be rejected by the validators.
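
To see why such variants count as the same utterance, it can help to normalize each one before comparing. The sketch below assumes a simple normalization (lowercasing, removing ASCII punctuation, collapsing spaces). The function names are hypothetical, and the actual preprocessing used by the platform may differ.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(utterance: str) -> str:
    """Lowercase, remove ASCII punctuation, and collapse runs of whitespace."""
    without_punctuation = utterance.translate(_STRIP_PUNCTUATION)
    return " ".join(without_punctuation.lower().split())


def find_surface_duplicates(utterances: list[str]) -> list[tuple[str, str]]:
    """Pair each utterance with the earlier one it duplicates after normalization."""
    first_seen: dict[str, str] = {}
    duplicates = []
    for utterance in utterances:
        key = normalize(utterance)
        if key in first_seen:
            duplicates.append((first_seen[key], utterance))
        else:
            first_seen[key] = utterance
    return duplicates


variants = [
    "The Hobbits were taken to Isengard.",
    "The hobbits were taken to isengard.",
    "The Hobbits were taken to Isengard!",
]
print(find_surface_duplicates(variants))
```

All three variants normalize to the same string, so the second and third are reported as duplicates of the first.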
