I. What is Data Annotation?
First, let's talk about what data annotation is. There are many types of data annotation, such as classification, bounding boxes, area annotation, and point annotation. We will discuss them in detail below.
To understand data labeling, you must first understand that AI is, in effect, a partial replacement for human cognitive functions. Think back to how we learn. For example, to learn to recognize an apple, someone needs to show us an apple and tell us that it is an apple. Then, when we encounter an apple later, we know that this thing is called an "apple."
Machine learning works in an analogous way. If we want to teach a machine to recognize an apple and simply give it a picture of one, it has no idea what it is looking at. We first need pictures of apples labeled with the word "apple," from which the machine learns the features of an apple. After that, when you give the machine a new picture of an apple, it can usually recognize it.
Here we can also introduce the concepts of the training set and the test set. Both are labeled data. Taking apples as an example, suppose we have 1,000 labeled images of apples: we can use 900 images as the training set and 100 as the test set. The machine learns a model from the 900 training images, and then we ask it to recognize the remaining 100 images, which it has not seen, and compare its answers with the labels to measure the model's accuracy. Think back to school: exam questions always differ from the usual homework, and only in this way can an exam test how well we have really learned. So it is not hard to understand why we need to set aside a test set.
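The 900/100 split described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the file names, the helper names split_dataset and accuracy, and the fixed random seed are all hypothetical and not part of any particular toolkit.

```python
import random


def split_dataset(samples, test_fraction=0.1, seed=0):
    """Shuffle labeled samples and split them into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# 1,000 hypothetical labeled images: (file name, label)
dataset = [(f"img_{i:04d}.jpg", "apple") for i in range(1000)]
train_set, test_set = split_dataset(dataset, test_fraction=0.1)
print(len(train_set), len(test_set))  # 900 100
```

The model is trained only on train_set; its predictions on test_set are then passed to accuracy together with the test labels.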
Machine learning is commonly divided into supervised and unsupervised learning. The results of unsupervised learning are harder to control, so it is often used for exploratory experiments. In practical product applications, supervised learning is usually employed, and supervised learning requires labeled data as prior experience.
Before labeling, we first need to clean the data so that it meets our requirements. Data cleaning includes removing invalid data, organizing the data into a structured format, and so on. Specific data requirements can be confirmed with algorithm personnel.
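As a rough sketch of the cleaning step, the snippet below drops invalid entries and turns the rest into structured records ready for labeling. What counts as "invalid" depends on the project; here it is assumed to mean empty or duplicate lines, and the record fields (id, text, label) are hypothetical.

```python
def clean_records(raw_lines):
    """Drop empty and duplicate lines and return structured records."""
    seen = set()
    records = []
    for line in raw_lines:
        text = line.strip()
        if not text or text in seen:
            continue  # skip invalid (empty) or duplicate data
        seen.add(text)
        # label is None until a labeler fills it in
        records.append({"id": len(records), "text": text, "label": None})
    return records


raw = ["  Is this an apple? ", "", "Is this an apple?", "Apples are red.", "   "]
cleaned = clean_records(raw)
for record in cleaned:
    print(record)
```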
II. Common Types of Data Labeling
1. Classification labeling: This is what we commonly call tagging. Generally, it involves selecting the appropriate labels from a predefined, closed set of labels. As shown in the image below, a single image can carry many categories/labels: adult, female, Asian, long hair, etc. For text, labels can mark subjects, predicates, and objects, or nouns and verbs.
Applicable: Text, Images, Speech, Video
Applications: Age recognition, Emotion recognition, Gender recognition
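Because classification labels come from a closed set, a labeling tool can reject anything outside that set. The sketch below assumes a small hypothetical label set and a hypothetical helper named validate_labels; a real project would define its own set.

```python
# Hypothetical closed label set for an image tagging task
LABEL_SET = {"adult", "child", "female", "male", "long hair", "short hair"}


def validate_labels(labels, label_set=LABEL_SET):
    """Return the de-duplicated labels, or raise if any is outside the closed set."""
    unknown = sorted(set(labels) - label_set)
    if unknown:
        raise ValueError(f"labels outside the closed set: {unknown}")
    return sorted(set(labels))


# One image can carry several labels at once
print(validate_labels(["female", "adult", "long hair"]))
```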
2. Bounding Box Annotation: Bounding box annotation in machine vision is easy to understand; it involves drawing a box around each object to be detected. For example, in face recognition, we first need to determine the position of the face. For pedestrian recognition, see the image below.
Applicable: Images
Applications: Face recognition, Object recognition
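One common way to store a bounding box is as the pixel coordinates of its top-left and bottom-right corners plus a label. The sketch below assumes that convention; the class name, field names, and image size are illustrative only, and other tools use different layouts (for example, center point plus width and height).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in pixel coordinates, origin at the image's top-left."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def is_valid(self, image_width, image_height):
        """True if the box has positive size and lies inside the image."""
        return (0 <= self.x_min < self.x_max <= image_width
                and 0 <= self.y_min < self.y_max <= image_height)

    def area(self):
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


box = BoundingBox("pedestrian", x_min=40, y_min=30, x_max=120, y_max=230)
print(box.is_valid(640, 480), box.area())  # True 16000
```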
3. Area Annotation: Compared with bounding box annotation, area annotation requires more precision. The edges of the region can follow irregular shapes, as in road recognition for autonomous driving.
Applicable: Images
Applications: Autonomous driving
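To illustrate why area annotation is more precise than a box, a region can be stored as a polygon, that is, an ordered list of vertices. The sketch below uses a made-up road-shaped polygon and the standard shoelace formula to compare the polygon's area with that of its enclosing box; the function names and coordinates are assumptions for the example.

```python
def polygon_area(vertices):
    """Area of a simple polygon given as ordered (x, y) vertices (shoelace formula)."""
    total = 0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def enclosing_box_area(vertices):
    """Area of the smallest axis-aligned box that contains the polygon."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# Hypothetical road region: wide at the bottom of the image, narrow in the distance
road = [(0, 100), (200, 100), (150, 0), (50, 0)]
print(polygon_area(road), enclosing_box_area(road))  # 15000.0 20000
```

In this example the polygon covers only three quarters of its enclosing box; the remaining quarter is background that a bounding box would wrongly include.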
4. Point Annotation: Applications that depend on fine-grained features often need point annotation, such as face recognition and skeletal recognition.
Applicable: Images
Applications: Face recognition, Skeletal recognition
5. Other Annotations: In addition to the common types mentioned above, there are many task-specific types of annotation. Different requirements need different annotations. For example, automatic summarization requires labeling the main points of articles, which strictly speaking does not belong to any of the above types. (You could treat it as classification, but the standard for what counts as a main point is less objective. When labeling apples, by contrast, most people's results are likely to be similar.)
III. The Process of Data Labeling
1. Determining the Labeling Standards
Determining standards is a key step in ensuring data quality, because labelers need a reference standard to follow. Generally, you can:
Set labeling examples and templates, for example a standard color card for color labeling.
Set a unified processing method for ambiguous data, such as discarding it or labeling it in one agreed way.
Reference standards sometimes also need to take the industry into account. For example, in text sentiment analysis, the word 'scar' may be negative in the psychology field but neutral in the medical field.
2. Determining the Form of Labeling
Labeling forms are generally determined by algorithm personnel. For example, some text annotation tasks, such as question identification, only require labeling each sentence with 0 or 1: if the sentence is a question, label it 1; if not, label it 0.
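As an illustration of such a labeling form, the sketch below writes sentence/label pairs to CSV text and rejects any label other than 0 or 1. The column names and the helper write_binary_labels are hypothetical; the actual format should follow whatever the algorithm personnel specify.

```python
import csv
import io


def write_binary_labels(rows):
    """Serialize (sentence, label) pairs to CSV text; each label must be 0 or 1."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sentence", "is_question"])
    for sentence, label in rows:
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label!r}")
        writer.writerow([sentence, label])
    return buffer.getvalue()


labeled = [("Is this an apple?", 1), ("This is an apple.", 0)]
print(write_binary_labels(labeled))
```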
3. Choosing Labeling Tools
Once the form of labeling is determined, the next step is to choose the labeling tool, which is generally provided by algorithm personnel. Large companies may develop a dedicated visual tool for data labeling internally. For example:
There are also open-source data labeling tools, such as labelImg, a small tool hosted on GitHub.
IV. Design of Data Labeling Products
Drawing on my experience building a data labeling tool, I will share some tips for designing one.
A data labeling tool generally includes:
Progress bar: Indicates the progress of data labeling. Labelers generally have task quotas, so a progress bar helps them check their progress and makes statistics easier.
Labeling subject: The main labeling area can be designed according to the labeling form; in principle, the simpler and more user-friendly, the better. Depending on how much attention the labeling requires, it can take a single-labeling or a multiple-labeling form, chosen based on needs.
Data import and export functions: If your labeling tool interfaces directly with the model, these may not be necessary.
Favorite function: Those who have not worked with data labeling may not think of this. Labelers often get tired or run into ambiguous data, and this function lets them save an item and label it later.
Quality inspection mechanism: When distributing data, some already labeled data can be randomly mixed in to check the reliability of the labelers.
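The quality inspection mechanism can be sketched as follows: a few items whose correct labels are already known are shuffled into each batch, and a labeler's answers on those items are compared with the known labels. This is a minimal sketch under assumptions; the item ids, function names, and the choice to score reliability as simple agreement are all hypothetical.

```python
import random


def build_batch(unlabeled_ids, gold, n_gold, seed=0):
    """Mix n_gold already-labeled items into a batch of unlabeled items."""
    rng = random.Random(seed)
    gold_ids = rng.sample(sorted(gold), n_gold)
    batch = list(unlabeled_ids) + gold_ids
    rng.shuffle(batch)  # hide which items are checks
    return batch


def labeler_reliability(submitted, gold):
    """Share of the labeler's answers on known items that match the known labels."""
    checked = [item for item in submitted if item in gold]
    if not checked:
        return None  # no known items seen yet, so nothing to measure
    correct = sum(submitted[item] == gold[item] for item in checked)
    return correct / len(checked)


gold = {"g1": 1, "g2": 0, "g3": 1}  # item id -> known label
batch = build_batch(["u1", "u2", "u3", "u4"], gold, n_gold=2)
submitted = {"u1": 1, "g1": 1, "g2": 1, "u2": 0}  # one labeler's answers
print(labeler_reliability(submitted, gold))  # 0.5
```

A tool could flag labelers whose score falls below a threshold chosen by the project, and send their batches for re-labeling.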