Labeling 10,000 Tickets Efficiently: Semi-Automated Labeling Strategies
Labeling thousands of support tickets manually is time-consuming and expensive. A semi-automated workflow uses large language models (LLMs) like GPT to pre-label tickets (via zero-shot or few-shot prompts) and then has human annotators review and correct those labels. This hybrid approach dramatically cuts annotation effort: one case study, for example, found GPT-generated “pre-annotations” “good enough to help us speed up the labeling process”. Even imperfect model labels reduce the time and cost of annotation, because reviewing is faster than labeling from scratch. In this article we explain how to set up such a pipeline, show Python examples (using GPT via OpenAI or OpenRouter), and discuss tools like Label Studio for the review step.
Using GPT for Zero-Shot/Few-Shot Pre-Labeling
Modern LLMs can classify text with zero or few examples. In zero-shot labeling, the model assigns categories without being explicitly trained on ticket data. As one tutorial puts it: “Zero-shot learning allows models to classify new instances without labeled examples”. In practice, you craft a prompt instructing GPT to tag a ticket. For example:
Ticket: "Cannot login to account."
Classify this ticket into one of {Bug, Feature Request, Question}.
The model then replies with a label. Few-shot labeling adds a couple of examples in the prompt to improve accuracy. This means we can generate initial labels directly via the API without any model training.
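A few-shot prompt, for instance, might look like this (the labeled examples are invented for illustration):
Ticket: "App crashes when I open settings." -> Bug
Ticket: "Please add an option to export reports as CSV." -> Feature Request
Ticket: "Cannot login to account." -> ?
Classify the last ticket into one of {Bug, Feature Request, Question}.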
Tip: Use a structured prompt or ask for JSON output to make parsing easy. For example:
Ticket: "Password reset email bounced." Respond in JSON like {"category": "..."}.
This helps integrate the response into your pipeline.
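A minimal parsing sketch, assuming the model follows the JSON instruction:
import json

reply = '{"category": "Bug"}'  # e.g. response.choices[0].message.content
label = json.loads(reply)["category"]
print(label)  # Bug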
Example: Python Code for Pre-Labeling
Below is a Python example using OpenAI’s API via the openai Python library. It loops over a dummy ticket list, prompts GPT-4 to classify each ticket, and records the category. (You can also use OpenRouter by setting base_url="https://openrouter.ai/api/v1" and changing the model parameter, as shown further below.)
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

tickets = [
    {"id": 1, "text": "User cannot login to account", "category": None},
    {"id": 2, "text": "Error 404 when uploading file", "category": None},
    {"id": 3, "text": "Request to add dark mode feature", "category": None},
    {"id": 4, "text": "Payment declined on checkout", "category": None},
]

categories = ["Bug", "Feature Request", "Question"]

for ticket in tickets:
    prompt = f"Ticket: \"{ticket['text']}\". Classify it as one of {categories}."
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # Extract the category from GPT's reply
    ticket["category"] = response.choices[0].message.content.strip()

print(tickets)
After running this, tickets might become:
[
    {'id': 1, 'text': 'User cannot login to account', 'category': 'Bug'},
    {'id': 2, 'text': 'Error 404 when uploading file', 'category': 'Bug'},
    {'id': 3, 'text': 'Request to add dark mode feature', 'category': 'Feature Request'},
    {'id': 4, 'text': 'Payment declined on checkout', 'category': 'Bug'},
]
These are pre-labels that human reviewers will check. Note how OpenRouter makes switching models easy: change model="openai/gpt-4" to another provider’s model ID (e.g. a Claude model or a lighter model) and the same code works. In fact, OpenRouter’s unified API lets you try multiple providers or fall back to another model if one is down. For example:
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

client.chat.completions.create(
    model="openai/gpt-4o",  # try GPT-4o first
    extra_body={"models": ["anthropic/claude-3.5-sonnet", "google/palm-2"]},  # fallbacks
    messages=[{"role": "user", "content": "Ticket: 'Login page error'. Classify it."}],
)
This will use GPT-4o if available, and otherwise fall back to Claude or PaLM, as shown in OpenRouter’s docs. Such flexibility is useful for companies that need high availability or want to compare models.
Integrating Pre-Labels with Labeling Tools
Once GPT generates labels, the next step is importing them into a labeling interface for human review. One popular open-source solution is Label Studio. Label Studio supports importing model predictions as “pre-annotations” alongside the data. Annotators see the suggested label and only need to correct mistakes, not label from scratch. In effect, the team “shifts from the time-intensive task of data labeling to the far more efficient process of reviewing and refining the preliminary labels”.
Label Studio even offers an ML backend: you can write a small server using the LabelStudioMLBase class that calls GPT for each task. In their tutorial, Label Studio shows wrapping GPT-4 calls in this class to return predictions on the fly.
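For reference, here is a rough sketch of such a backend (assuming the label-studio-ml package; the exact predict() signature varies between versions, classify_with_gpt() is a hypothetical stand-in for the GPT call shown earlier, and the "category"/"text" names must match your labeling config):
from label_studio_ml.model import LabelStudioMLBase

class GPTPreLabeler(LabelStudioMLBase):
    def predict(self, tasks, **kwargs):
        predictions = []
        for task in tasks:
            # classify_with_gpt() is a hypothetical helper wrapping the GPT call
            label = classify_with_gpt(task["data"]["text"])
            predictions.append({
                "result": [{
                    "from_name": "category",
                    "to_name": "text",
                    "type": "choices",
                    "value": {"choices": [label]},
                }]
            })
        return predictions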
Alternatively, you can import a JSON file of predictions. The required JSON format has a data field (the ticket text) and a predictions array (containing each label); the from_name and to_name values must match the names in your labeling config. For example (simplified):
[
    {
        "data": {
            "text": "User cannot login to account"
        },
        "predictions": [
            {
                "result": [
                    {
                        "from_name": "category",
                        "to_name": "text",
                        "type": "choices",
                        "value": {
                            "choices": ["Bug"]
                        }
                    }
                ]
            }
        ]
    },
    {
        "data": {
            "text": "Add dark mode to settings"
        },
        "predictions": [
            {
                "result": [
                    {
                        "from_name": "category",
                        "to_name": "text",
                        "type": "choices",
                        "value": {
                            "choices": ["Feature Request"]
                        }
                    }
                ]
            }
        ]
    }
]
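If you generated labels with the loop from earlier, a small sketch like the following can produce such a file (again assuming the "category"/"text" names match your labeling config):
import json

# Convert the pre-labeled tickets from the earlier loop into Label Studio's
# import format, attaching each GPT label as a prediction
tasks = []
for t in tickets:
    tasks.append({
        "data": {"text": t["text"]},
        "predictions": [{
            "result": [{
                "from_name": "category",
                "to_name": "text",
                "type": "choices",
                "value": {"choices": [t["category"]]},
            }]
        }],
    })

with open("tasks_with_predictions.json", "w") as f:
    json.dump(tasks, f, indent=2)
You can then import tasks_with_predictions.json through the Label Studio UI.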
After importing, Label Studio will display each ticket with the model’s label pre-filled. The annotator’s job is to review and correct. This semi-automated workflow has been shown to work well: a Kili Technology example demonstrated loading a GPT-pre-labeled dataset and noted “we have successfully pre-annotated our dataset” and that this approach “has the potential to save us a lot of time”. In practice, GPT’s accuracy on labeling might be ~80–90%, meaning humans correct only the remaining 10–20%.
Tools and Workflow Steps
To summarize, a typical semi-automated labeling pipeline looks like this:
- Prepare the ticket dataset. Export your 10,000 unlabeled tickets (e.g. as JSON or CSV).
- Generate pre-labels via LLM. Run code (like above) calling GPT-4 (or another model via OpenRouter) to classify each ticket. Save the responses.
- Import predictions into a labeling tool. Use Label Studio (or similar) to load tickets and associate each with the GPT-generated label (the “prediction”). Label Studio docs explain how to import predictions with your data.
- Human review. Annotators go through the tickets in Label Studio, accepting or correcting the labels. This is much faster than labeling from scratch. Label Studio’s interface highlights the model suggestion for each task, so the task becomes quick validation.
- Export final labels. Once reviewed, export the corrected annotations for model training or analytics.
Key public tools that support this approach include:
- OpenRouter – a unified LLM API gateway (openrouter.ai). It lets you easily switch between GPT-4, Anthropic Claude, Google PaLM, and other models. You can even specify a fallback list in one API call.
- OpenAI API (GPT-4/3.5) – the core engine for generating labels with zero/few-shot prompts.
- Label Studio – an open-source data labeling UI. It supports importing predictions and has an ML backend to call models.
- Doccano – a simpler open-source tool for text annotation (classification, NER, etc.). It does not have built-in LLM integration, but you can still use GPT offline to generate labels and load them as initial choices (see the sketch after this list).
- Snorkel/Programmatic Labeling – for some rule-based or weak-supervision cases, tools like Snorkel can complement LLM labels, but modern LLMs often cover many cases out of the box.
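For the Doccano route, a minimal sketch: write the GPT labels as JSONL, assuming Doccano's text-classification import convention of a text field and a label list per line:
import json

# One JSON object per line: {"text": ..., "label": [...]}
with open("doccano_import.jsonl", "w") as f:
    for t in tickets:
        f.write(json.dumps({"text": t["text"], "label": [t["category"]]}) + "\n")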
Dummy Ticket Data Example
To illustrate, here is some dummy ticket data you might work with:
tickets = [
    {"id": 101, "text": "Error 500 when saving profile", "label": None},
    {"id": 102, "text": "How do I change my subscription plan?", "label": None},
    {"id": 103, "text": "Feature request: dark mode in settings", "label": None},
    {"id": 104, "text": "Application crashes on startup", "label": None},
]
You could feed each ticket['text'] to GPT with a prompt like:
Ticket: "Error 500 when saving profile."
Classify this issue as one of {Bug, Feature, Question}.
Suppose GPT returns "Bug", "Question", "Feature", "Bug" respectively. After the loop, tickets might be:
[
    {'id': 101, 'text': 'Error 500 when saving profile', 'label': 'Bug'},
    {'id': 102, 'text': 'How do I change my subscription plan?', 'label': 'Question'},
    {'id': 103, 'text': 'Feature request: dark mode in settings', 'label': 'Feature'},
    {'id': 104, 'text': 'Application crashes on startup', 'label': 'Bug'},
]
These labels would then be loaded into the review interface. Even if some are wrong (e.g. GPT might mislabel a tricky bug as a feature), the annotator only needs to fix them instead of starting from scratch. Empirically, GPT-generated labels often reach ~80–90% accuracy, so reviewing is much faster than full labeling.
Results and Takeaways
The semi-automated approach scales well. In a large project, human annotators might only need to fix a few hundred or a few thousand labels instead of 10,000. As the Kili tutorial observed after running GPT pre-labels: “Great! We have successfully pre-annotated our dataset. Looks like this solution has the potential to save us a lot of time in future projects.” In other words, LLMs serve as a force multiplier: even though the model isn’t 100% correct, it “speeds up the labeling process” by doing most of the work.
Best practices: Use a low temperature (e.g. 0.0–0.3) for consistent labels, and provide clear instructions or a small list of examples. Monitor GPT’s mistakes: you may need to adjust prompts or add few-shot examples for underperforming categories. Keep the prompt simple (e.g. “Classify the ticket text into A, B, or C”). You can also batch multiple tickets in one API call, if the model and API allow it, to save cost; a sketch follows below. And always include human review – this ensures high quality and catches LLM errors or drift.
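Here is a hedged sketch of that batching idea (the model name and prompt format are illustrative; it combines with the JSON-output tip from earlier):
import json
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

def classify_batch(batch):
    # Number the tickets so the model can reference them by position
    numbered = "\n".join(f'{i + 1}. "{t["text"]}"' for i, t in enumerate(batch))
    prompt = (
        "Classify each ticket as one of [Bug, Feature Request, Question].\n"
        + numbered
        + '\nRespond only with a JSON array like [{"id": 1, "category": "..."}].'
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # Will raise if the model adds extra text around the JSON; strip as needed
    return json.loads(response.choices[0].message.content)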
Conclusion
Semi-automated labeling with GPT and tools like OpenRouter and Label Studio is a powerful strategy for quickly labeling large text datasets. By pre-labeling 10,000 tickets with an LLM and then reviewing, companies can jump-start their AI workflows with minimal initial data. This approach cuts costs and time dramatically while still ensuring quality through human oversight. As one implementation guide notes, shifting the workflow from “data labeling” to “reviewing and refining” of LLM-generated labels “significantly accelerates your workflow.” In short, combining GPT-based pre-annotation with a friendly UI (Label Studio, Doccano, etc.) helps software/AI teams label massive ticket datasets efficiently and accurately.