The majority of digital data stored worldwide is unstructured. The ability to analyze it can translate into profits and business success. Unfortunately, this is quite complex, and using large language models (LLMs) hosted externally may raise privacy concerns. We examined how small language models (SLMs) handle this challenge.
Why Analyzing Unstructured Data Matters
Over 80% of the data stored on servers worldwide consists of unstructured data: text files, images, audio and video recordings, emails, and social media information, all saved in various formats. This data lacks a predefined format or organized schema, making it difficult to store, search, and analyze using traditional relational databases.
Unlike structured data stored in tables with rows and columns, unstructured data requires specialized tools for processing and analysis, such as natural language processing (NLP), machine learning, and big data technologies. Because of its complexity and volume, managing unstructured data often involves techniques such as indexing, metadata tagging, and AI-driven classification to extract meaningful insights.
SLMs perform remarkably well in these tasks. With fewer parameters than LLMs, they require less computational power, memory, and storage. This makes them more efficient, particularly for specific, well-defined tasks like extracting structured information or running chatbots on edge devices. Crucially, they are also more cost-effective and can often be run locally, which addresses privacy concerns.
Our Approach to Testing SLMs
To evaluate the usefulness of small language models in unstructured data analysis, we fed them data on workplace accidents and asked them to identify the type of incident, its severity, the type of risk, the location of occurrence, and the reason behind it. For some of these categories, the prompt included a list of possible values the model was supposed to assign; for others, we left it to the model's (controlled) creativity.
Our primary goal was to obtain reliable results: correctly assigning cases to predefined categories without generating additional instances. For all cases, we set the response temperature to the lowest value, 0. This helped reduce the models' creativity and ensured the most precise answers possible. After all, we aimed to structure the obtained data for further analysis, which requires it to be as accurate and consistent as possible.
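For illustration, a minimal sketch of such an extraction prompt might look like the snippet below. The field names, allowed values, and sample report are hypothetical, not the exact ones used in our tests:

```python
# A minimal sketch of an extraction prompt with predefined category values.
# The categories, allowed values, and report text below are illustrative only.
import json

ALLOWED_SEVERITIES = ["minor", "moderate", "serious", "fatal"]            # hypothetical values
ALLOWED_RISK_TYPES = ["fall", "machinery", "chemical", "electrical", "other"]

def build_prompt(report_text: str) -> str:
    return f"""You are analyzing a workplace accident report.
Return ONLY a JSON object with these fields:
- "incident_type": short phrase describing the incident
- "severity": one of {json.dumps(ALLOWED_SEVERITIES)}
- "risk_type": one of {json.dumps(ALLOWED_RISK_TYPES)}
- "location": where the incident occurred
- "reason": the likely cause, in one sentence
Do not invent values outside the provided lists.

Report:
{report_text}"""

print(build_prompt("An employee slipped on a wet floor in the warehouse and sprained an ankle."))
```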
All of the described SLMs can be run locally (e.g., using Ollama or similar tools) or hosted on clusters in private networks, which resolves potential privacy issues. Some of them are also hosted by external cloud providers, making them easier to use.
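As a rough sketch of what local use can look like in practice, the snippet below sends a prompt to a model served by Ollama's local REST API with the temperature set to 0. It assumes the default Ollama endpoint, a pulled model tag such as gemma2:27b, and the `requests` library; the details are illustrative rather than a prescribed setup:

```python
# Sketch: querying a locally served model through Ollama's REST API.
# Assumes `ollama serve` is running and the model has been pulled, e.g. `ollama pull gemma2:27b`.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def extract_incident_fields(prompt: str, model: str = "gemma2:27b") -> dict:
    response = requests.post(
        OLLAMA_URL,
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0},  # lowest creativity, most deterministic output
        },
        timeout=300,
    )
    response.raise_for_status()
    # The generated text should be the JSON object requested in the prompt.
    return json.loads(response.json()["response"])
```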
One of the biggest challenges for the models turned out to be selecting the best value from those provided, represented here by “severity recognition”. Severity is an easy-to-define measure, the provided values don’t overlap, and they were delivered alongside descriptions. To assess performance, we calculated the accuracy rate as a percentage. However, it’s important to note that achieving 100% accuracy is unrealistic—some information simply cannot be systematized at the level of contextual text analysis.
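The accuracy rate itself is simply the share of reports where the model's severity label matched our reference label. A minimal sketch of that calculation, with made-up labels:

```python
# Sketch: severity-recognition accuracy as a percentage (labels are illustrative).
reference = ["serious", "minor", "moderate", "minor", "fatal"]
predicted = ["serious", "minor", "serious",  "minor", "fatal"]

correct = sum(ref == pred for ref, pred in zip(reference, predicted))
accuracy = 100 * correct / len(reference)
print(f"Severity recognition accuracy: {accuracy:.0f}%")  # -> 80%
```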
Another issue arose when a model struggled with contextual searching, generating too many synonyms for incidents. This was the case, for example, with Phi-3.
Speed and hardware requirements were also crucial factors. Higher computational power means higher costs, and a model's hardware demands determine whether it can be tested locally (on a standard laptop) rather than in the cloud. This was one of the key aspects in comparing the efficiency of small models.
How did the different small AI models handle this challenge? Read on to find out.
SLM Performance: Characteristics and Results
Phi-3 – Medium 14B
Developer: Microsoft
Type: Lightweight AI
Hosting: Cloud & local (via Azure AI or local deployment without GPU)
Cost per million tokens*: Input – $0.17, output – $0.68 (cloud only, local use incurs only electricity costs)
Intended use: Language tasks
Test Results:
Severity recognition: Not defined (results too inconsistent to measure)
Pros
+ Maintains structure defined in the prompt
+ Fast when run locally
Cons
– Frequent typos
– Inconsistent, often ignored contextual rules
Gemma 2 – 27B
Developer: Google
Type: General NLP, multilingual
Hosting: Cloud (Vertex AI) & local (via Ollama)
Cost per million tokens*: Input – $0.27, output – $0.27
Intended use: Language tasks, multilingual reasoning
Test Results:
Severity recognition: 63%
Pros
+ Maintains structure defined in the prompt
+ Consistent; follows contextual rules
+ Can be run locally
Cons
– Slow when run locally
– Not available in Snowflake Cortex
Mixtral 8x7B
Developer: Mistral AI
Type: Scalable AI, open source
Hosting: Cloud (Hugging Face, AWS) & local (via Ollama)
Cost per million tokens*: Input – $0.60, output – $0.60
Intended use: Language tasks, reasoning
Test Results:
Severity recognition: 83%
Pros
+ Maintains structure defined in the prompt
Cons
– Inconsistent output, improper categorization
– High computational requirements for local use
Llama 3.1 – 70B
Developer: Meta
Hosting: Cloud (AWS, Azure, Google Cloud) & local (Hugging Face; requires high-end GPU)
Cost per million tokens*: Input – $0.12, output – $0.30
Intended use: Language tasks, deep reasoning, research, coding
Test Results:
Severity recognition: 86%
Pros
+ Maintains structure defined in the prompt
Cons
– Consistency issues (created new categories unexpectedly)
– Requires powerful hardware for local deployment
Quick Model Comparison
| Model | Severity recognition | Pros | Cons |
|---|---|---|---|
| Gemma 2 – 27B | 63% | + Structured output + Consistent + Local use | – Slow locally – Not in Snowflake Cortex |
| Llama 3.1 – 70B | 86% | + Structured output | – Inconsistencies – Hardware-heavy |
| Mixtral 8x7B | 83% | + Structured output | – Inconsistencies – Requires strong hardware |
| Phi-3 – Medium 14B | Undefined | + Fast + Structured output | – Typos – Inconsistent categorization |
Key Findings and Ranking
Each of the tested models had its strengths and weaknesses. All of them successfully maintained the structure defined in the prompt, which is crucial for further processing. The real challenge arose when they had to extract contextual data and assign it to the appropriate categories.
The main hurdle was consistency: choosing one of the values provided in the context without creating new terms. The second was accuracy, represented by severity recognition, i.e., the ability to assess the seriousness of an incident based on text analysis. Lastly, local deployment options and hardware requirements also played a key role in overall usability and cost-efficiency.
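One simple way to quantify the consistency hurdle is to check what share of a model's answers stay within the allowed value list instead of inventing new terms. A rough sketch, with hypothetical values and outputs:

```python
# Sketch: measuring how consistently a model sticks to the predefined category values.
# The allowed list and the sample outputs are hypothetical.
ALLOWED_SEVERITIES = {"minor", "moderate", "serious", "fatal"}

model_outputs = ["minor", "serious", "life-threatening", "moderate", "catastrophic"]

in_list = [value for value in model_outputs if value in ALLOWED_SEVERITIES]
new_terms = [value for value in model_outputs if value not in ALLOWED_SEVERITIES]

consistency = 100 * len(in_list) / len(model_outputs)
print(f"Consistency: {consistency:.0f}%")           # share of answers taken from the provided list
print(f"Invented terms: {sorted(set(new_terms))}")  # -> ['catastrophic', 'life-threatening']
```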
Based on these criteria, here is how the models ranked:
- Gemma 2 – 27B: Despite lower accuracy in severity recognition, it performed exceptionally well on the tasks specified in the prompt and was nearly flawless in terms of consistency, avoiding the creation of new categories. It handled information extraction from incident descriptions best, resulting in the highest-quality outcomes. It can be hosted in the cloud, but only through manual setup, which is not very convenient, and processing results takes some time.
- Llama 3.1 – 70B: The best at assessing incident severity based on context, achieving 86% accuracy. It maintained 97% consistency and was slightly worse in data analysis than Gemma, but it has the advantage of being readily available in the cloud.
- Phi-3 – Medium 14B: Microsoft's model stands out for its speed. However, its drawbacks include average accuracy in category selection and frequent typos, which require additional cleaning of the extracted data and, in turn, slow down further analysis.
- Mixtral 8x7B: While it performed well in severity assessment, it was overly creative with its outputs, even with the temperature set to 0. Unlike the other models, which worked with nearly identical prompts, Mixtral required multiple prompt modifications to maintain the defined structure. Additionally, it requires a very powerful computer to run locally.
Final Thoughts
Choosing the right SLM for unstructured data analysis ultimately depends on your priorities—whether it’s consistency, accuracy, speed, or local deployment. While no single model excels in every area, SLMs offer powerful and flexible tools that help organizations process complex, text-heavy data efficiently and securely.
As these models become increasingly capable and accessible, they are being applied to a growing range of business use cases, including classification, search, and information extraction, often as part of broader AI/machine learning solutions tailored to specific data challenges.