A Comprehensive Guide to Evaluating LLM-Generated Knowledge Graphs: From PoC to Production

12/8/2024 • Ezhil Vendhan

TLDR; Creating Knowledge Graphs using Large Language Models (LLMs) with Graphora is powerful but requires robust evaluation. This guide shows you how to:

Build a production-ready evaluation pipeline for LLM-driven knowledge graphs
Implement automated quality checks using LLMs as judges
Set up comprehensive tracing and monitoring
Use industry-standard metrics to measure graph quality
Handle entity resolution and graph integration

Key tools covered: Instructor, LangSmith, Lilac.
Complete implementation available in the accompanying Jupyter notebook.
Target audience: ML Engineers and Data Scientists working on Knowledge Graph creation and maintenance.

Introduction

Creating Knowledge Graphs—structured representations of entities and their relationships—from unstructured data has become increasingly valuable beyond just Graph Retrieval Augmented Generation (GraphRAG). While Large Language Models (LLMs) have simplified this process, the critical step of evaluation is often overlooked or missing entirely. Drawing inspiration from comprehensive works on Applied LLM evaluations by Hamel Husain and Eugene Yan, this article provides a practical guide for evaluating your LLM-driven Knowledge Graph creation process.
With this foundation in mind, let's explore how to build these knowledge graphs efficiently, particularly using modern LLM approaches.

High-Level Architecture

Before diving into implementation details, let's understand the overall architecture of our LLM-driven Knowledge Graph evaluation pipeline. Here is the accompanying Jupyter notebook.

Building Knowledge Graphs

Processing unstructured data as graphs offers significant advantages, particularly in quick traversal and insight generation from connected information. Consider fraud detection: by analyzing a person's connections and associations derived from social media posts, news events, and transaction logs, a graph-based system can assess fraud risk in a low latency manner.

While traditionally building such graphs required substantial time and expertise, LLMs can accelerate this process through three main approaches:

Domain-specific Ontology Extraction: Using predefined rules and relationships to structure information
Hierarchical Text Chunk Extraction: Breaking down and storing information as connected, vectorized chunks
Hybrid Approach: Combining ontology-driven structure with flexible text extraction

The ontology-driven approach, though requiring initial investment in taxonomy development, provides precise control. However, building the graph is only half the battle - ensuring its quality and reliability is equally crucial.

Why Evals?

Regardless of your chosen graph-building approach, robust evaluation is crucial—similar to writing test cases for software applications. This evaluation serves multiple critical purposes:

Quality Assurance:

Ensures correctness, factuality, semantic accuracy, and knowledge completeness
Validates proper handling of sensitive and Personally Identifiable Information (PII) in regulated industries
Verifies compliance with ontology rules for cleaner, more reliable data

Risk Mitigation:

Prevents costly errors from incorrect graph relationships
Reduces negative impacts on downstream applications
Protects against data drift and model decay

With the rapid pace of model releases and updates, implementing thorough evaluation processes becomes critical. To move from theory to practice, let's explore these concepts through a concrete example.

Evaluating your Graph: A Real-World Example

Let's apply these evaluation concepts to a practical use case: building a Knowledge Graph from Securities and Exchange Commission (SEC) Form 10K Annual reports. We'll focus specifically on Part I of Form 10-K, with particular emphasis on the Business section.

For context, Form 10-K is a comprehensive report mandated by the SEC that provides a complete summary of a company's financial performance. Part I of Form 10-K contains critical narrative information including:

Business Overview - Company operations and primary activities - Revenue streams and business model - Market presence and competition - Key products and services
Risk Factors
Properties
Legal Proceedings
Mine Safety Disclosures (if applicable)

We'll concentrate on the Business section as it provides rich, structured information about a company's operations, making it an ideal candidate for knowledge graph creation. To effectively capture this information, we need a well-defined structure for our graph.

Building Knowledge Graphs with Domain-Specific Ontologies

Knowledge graph creation begins with defining clear rules for entity extraction and relationships. This process typically involves collaboration with Subject Matter Experts (SMEs) who understand both the domain (in our case, financial documents) and the specific information we need to capture.

For our Form 10-K Part I analysis, we'll use a Pydantic-based ontology structure. While other formats like RDF, Turtle, or JSON-LD are equally valid, Pydantic offers several advantages:

Type safety and validation
Easy integration with Python ecosystems
Clear, maintainable code structure
IDE support for better development experience

Here's our ontology definition:

1class Business(BaseModel):  
2   """Item 1: Business description and operations"""  
3   id: str = Field(..., description="Unique identifier for the business section")  
4   description: str = Field(..., description="General description of the business")  
5   segments: List[str] = Field(default_factory=list, description="List of Business segments. Do no club segments together")  
6   products: List[str] = Field(default_factory=list, description="List of Main products and services")  
7   competition: str = Field(..., description="Competitive environment description")  
8   raw_materials: Optional[str] = Field(None, description="Description of raw materials used")  
9   intellectual_property: Optional[str] = Field(None, description="IP rights and portfolio")  
10   seasonality: Optional[str] = Field(None, description="Business seasonality patterns")  
11   employees: Optional[int] = Field(None, description="Number of employees")  
12   regulatory_environment: Optional[str] = Field(None, description="Regulatory framework description")
13
14class RiskFactor(BaseModel):  
15   """Item 1A: Risk Factors"""  
16   id: str = Field(..., description="Unique identifier for the risk factor")  
17   risk_category: str = Field(..., description="Category of the risk")  
18   risk_description: str = Field(..., description="Detailed description of the risk")  
19   potential_impact: str = Field(..., description="Potential impact on business")  
20   mitigation_strategy: Optional[str] = Field(None, description="Risk mitigation strategy")
21
22class Property(BaseModel):  
23   """Item 2: Properties"""  
24   id: str = Field(..., description="Unique identifier for the property")  
25   location: str = Field(..., description="Property location")  
26   property_type: str = Field(..., description="Type of property (Owned/Leased)")  
27   square_footage: Optional[float] = Field(None, description="Property size in square feet")  
28   primary_use: str = Field(..., description="Primary use of the property")  
29   expiration_date: Optional[date] = Field(None, description="Lease expiration date if applicable")
30
31class LegalProceeding(BaseModel):  
32   """Item 3: Legal Proceedings"""  
33   id: str = Field(..., description="Unique identifier for the legal proceeding")  
34   case_description: str = Field(..., description="Description of the legal case")  
35   jurisdiction: str = Field(..., description="Legal jurisdiction")  
36   status: str = Field(..., description="Current status of the proceeding")  
37   potential_liability: Optional[float] = Field(None, description="Potential financial liability")  
38   expected_resolution: Optional[str] = Field(None, description="Expected resolution timeline")
39
40class MineSafety(BaseModel):  
41   """Item 4: Mine Safety Disclosures"""  
42   id: str = Field(..., description="Unique identifier for mine safety record")  
43   violations: Optional[int] = Field(None, description="Number of safety violations")  
44   citations: List[str] = Field(default_factory=list, description="Safety citations received")  
45   assessments: Optional[float] = Field(None, description="Financial assessments or penalties")
46
47class PartI(BaseModel):  
48   """Part I of Form 10-K"""  
49   id: str = Field(..., description="Unique identifier for Part I")  
50   business: Business = Field(..., description="Item 1: Business")  
51   risk_factors: List[RiskFactor] = Field(default_factory=list, description="Item 1A: Risk Factors")  
52   properties: List[Property] = Field(default_factory=list, description="Item 2: Properties")  
53   legal_proceedings: List[LegalProceeding] = Field(default_factory=list, description="Item 3: Legal Proceedings")  
54   mine_safety: Optional[MineSafety] = Field(None, description="Item 4: Mine Safety Disclosures")
55
56class Form10K(BaseModel):  
57   """Form 10-K Document"""  
58   id: str = Field(..., description="Unique identifier for the Form 10-K")  
59   fiscal_year: int = Field(..., description="Fiscal year of the report")  
60   filing_date: date = Field(..., description="Date the form was filed")  
61   company_name: str = Field(..., description="Name of the company")  
62   cik: str = Field(..., description="Central Index Key (CIK) number")  
63   part_i: PartI = Field(..., description="Part I of Form 10-K")

Developing Evaluation Data

In our case, Form 10-K reports provide an excellent source of training and evaluation data due to their:

Standardized structure
Rich business information
Public availability
Consistent formatting

However, if you're working with limited data or need to test edge cases, synthetic data generation using LLMs can be valuable. The Instructor library by Jason Liu provides an elegant solution for this:

Enforces ontology constraints during generation
Maintains data consistency
Simplifies prompt engineering

Implementation example:

1import instructor  
2from anthropic import Anthropic  
3from langsmith import wrappers, traceable
4
5# Set up the client with prompt caching  
6client = instructor.from_anthropic(Anthropic())
7
8@traceable(run_type="llm")  
9def extract_form10k_part1(all_form10k):  
10 results = []  
11 for content in all_form10k[0:3]:  
12   resp, completion = client.chat.completions.create_with_completion(  
13         model="claude-3-5-haiku-20241022",  
14         messages=[  
15             {  
16                 "role": "user",  
17                 "content": [  
18                     {  
19                         "type": "text",  
20                         "text": "<Form10K-PART-1>" + content + "</Form10K-PART-1>"  
21                     },  
22                     {  
23                         "type": "text",  
24                         "text": "Extract Form 10K from the Part I text given above",  
25                     },  
26                 ],  
27             },  
28         ],  
29         response_model=Form10K,  
30         max_tokens=4000,  
31     )  
32   results.append((resp, completion, content))  
33 return results
34
35results = extract_form10k_part1(all_form10k)

This approach generates structured outputs like:

1Form10K(id='AE-2023-10K', fiscal_year=2023, filing_date=datetime.date(2024, 2, 29), company_name='Adams Resources & Energy, Inc.', cik='0000032945', part_i=PartI(id='PART-I-2023', business=Business(id='AE-BUSINESS-2023', description='Adams Resources & Energy, Inc. is a publicly traded Delaware corporation organized in 1973. Through its subsidiaries, the company is primarily engaged in crude oil marketing, truck and pipeline transportation of crude oil, terminalling and storage in various crude oil and natural gas basins in the lower 48 states of the United States. The company also conducts tank truck transportation of liquid chemicals, pressurized gases, asphalt and dry bulk, and recycles and repurposes off-specification fuels, lubricants, crude oil and other chemicals from producers in the U.S.', segments=['crude oil marketing, transportation and storage', 'tank truck transportation of liquid chemicals, pressurized gases, asphalt and dry bulk', 'pipeline transportation, terminalling and storage of crude oil', 'interstate bulk transportation logistics of crude oil, condensate, fuels, oils and other petroleum products and recycling and repurposing of off-specification fuels, lubricants, crude oil and other chemicals'], products=['Crude oil marketing', 'Truck transportation services', 'Pipeline transportation', 'Chemical and fuel recycling'], competition='Strong competition from regional and national entities with significant financial resources and more expansive geographic footprints. Competition is primarily in establishing trade credit, pricing, quality of service, and location of service.', raw_materials=None, intellectual_property=None, seasonality='Trucking revenue historically follows seasonal patterns, with peak freight demand in September, October, and November. Winter months typically have lower freight volumes and higher operating expenses.', employees=741, regulatory_environment='Subject to extensive federal, state, and local regulations including environmental laws, transportation regulations, pipeline safety rules, and emerging climate change legislation.'), risk_factors=[RiskFactor(id='RF-DRIVER-SHORTAGE', risk_category='Operational Risk', risk_description='Difficulty in attracting and retaining qualified truck drivers', potential_impact='Could limit growth, increase compensation costs, and reduce operational efficiency', mitigation_strategy='Competitive wages, benefits, modern equipment, and driver training programs'), RiskFactor(id='RF-CREDIT-RISK', risk_category='Financial Risk', risk_description='Dependence on ability to obtain trade and other credit', potential_impact='Could limit marketing businesses, improvements, and future growth', mitigation_strategy='Evaluating financial strength of counterparties and maintaining credit arrangements')], properties=[Property(id='HQ-PROPERTY', location='17 South Briar Hollow Lane, Suite 100, Houston, Texas 77027', property_type='Leased', square_footage=22197.0, primary_use='Headquarters Office', expiration_date=None)], legal_proceedings=[], mine_safety=None))

Development Tips

Experiment with different chunk sizes for optimal extraction
Iterate on ontology fragments based on extraction results
Validate generated data against your original ontology rules
Keep track of edge cases and common failure modes

Tracing and Monitoring

Tracing is crucial for understanding and improving your knowledge graph creation pipeline. It helps capture not just user queries, but the entire extraction and validation process, enabling continuous improvement. Here's how you can implement effective tracing:

Query and Response Tracking

Tools like LangSmith provide comprehensive tracing capabilities:

Capture input queries and outputs
Monitor latency and performance
Track token usage and costs
Experiment with different prompts and models
Store examples for in-context learning

In our code above, tracing is enabled using @traceable annotation:

1from langsmith import wrappers, traceable
2
3….
4
5@traceable(run_type="llm")  
6def extract_form10k_part1(all_form10k):  
7….

Performance Analysis

Lilac offers powerful analytics capabilities:

Cluster similar queries for pattern recognition
Identify common failure modes
Analyze semantic similarity in responses
Track data drift over time
Monitor extraction quality metrics

While tracing and monitoring give us insights into system performance, we still need a reliable way to evaluate the quality of our extractions. This brings us to an innovative approach: using LLMs themselves as evaluation judges.

LLM as Judge: Building Reliable Evaluation Systems

While successful extraction is crucial, a production-ready system requires robust evaluation mechanisms to ensure accuracy and prevent hallucinations. Let's explore how to build a reliable evaluation system using LLMs.

Evaluation Approaches

There are several approaches to evaluate extracted data:
1. Fine-tuned Custom Models

Trained on domain-specific data
Optimized for specific evaluation tasks
Requires significant training data

2. High-Performance LLMs as Judges

Uses more capable models for evaluation
More cost-effective than fine-tuning
Flexible and adaptable to changes

Building a Reliable LLM Judge

The key challenge is ensuring the LLM judge itself is reliable. Based on research from this paper, we implement an iterative alignment process:

1. Initial Evaluation

Create baseline evaluation prompts
Gather LLM judgments and rationales
Document evaluation criteria

2. SME Alignment

Have domain experts review LLM evaluations
Compare LLM decisions with expert judgment
Identify gaps and inconsistencies

3. Prompt Refinement

Iterate on prompts based on feedback
Target specific failure modes
Maintain clear evaluation criteria

Implementation Example:

1JUDGE_PROMPT_FOR_BUSINESS_SECTION = f"""You are a Named Entity & Relationships Extraction evaluator with advanced capabilities to judge if the extraction is good or not.  
2You understand the nuances of the Financial Industry - more specifically SEC's Form 10K and Neo4j Knowledge Graph, including what is likely to be most useful from a Graph perspective.
3
4Here are some guidelines for evaluating queries:  
5- ensure that the extraction is accurate and do so only from the text provided  
6- For long text descriptions, try to summarize them and capture only what is necessary from a finance perspective
7
8Example evaluations:
9
10<examples>
11
12<example-1>  
13<form10k>  
14ITEM 1. BUSINESS  
15Overview  
16GreenTech Solutions, Inc. ("GreenTech" or the "Company") is a leading provider of renewable energy technologies and sustainable infrastructure solutions. Founded in 1995, we design, manufacture, and install solar power systems, energy storage solutions, and smart grid technologies for residential, commercial, and utility-scale applications. Through continued innovation and strategic expansion, we have established ourselves as a trusted partner in the global transition to sustainable energy.  
17Business Operations and Segments  
18Our operations are structured to address the diverse needs of the renewable energy market. In the residential sector, we provide comprehensive solar solutions to homeowners through our Residential Solar Solutions division, which handles everything from initial energy assessment to system design, installation, and ongoing monitoring. Our Commercial & Industrial Systems division serves businesses, schools, and industrial facilities with larger-scale solar implementations and energy management solutions tailored to reduce operational costs and achieve sustainability goals. Through our Utility Infrastructure segment, we partner with power utilities to develop and maintain large-scale solar farms and energy storage facilities that feed directly into the power grid.  
19Products, Services and Technology  
20The backbone of our business consists of integrated solar and energy management solutions. We provide complete solar panel systems supported by our advanced installation methodologies. Our energy storage solutions incorporate cutting-edge battery technology to maximize energy efficiency. The Company's smart energy management software enables real-time monitoring and optimization of energy consumption. We also supply essential grid infrastructure components and provide comprehensive maintenance and monitoring services to ensure optimal system performance. Our technical support team offers 24/7 monitoring and rapid response services to maintain system efficiency and reliability.  
21Competition and Market Position  
22The renewable energy sector remains highly competitive, characterized by rapid technological advancement and evolving customer needs. We face competition from traditional energy companies that are expanding their renewable energy portfolios, pure-play solar providers, and international manufacturers. Key competitive factors in our market include technology efficiency, price, warranty terms, and installation quality. Major competitors include SunPower Corporation, Tesla, and First Solar, who generally have greater financial resources than us. Despite this competitive landscape, we maintain our market position through technological innovation, superior customer service, and comprehensive solution offerings.  
23Raw Materials and Supply Chain Management  
24The production and installation of our systems depend on various raw materials and components. Primary materials include silicon wafers, solar glass, aluminum frames, and lithium-ion battery components. We have established strategic relationships with multiple suppliers across Asia and North America to ensure supply chain resilience and maintain competitive pricing. Our diversified supplier base helps mitigate risks associated with supply chain disruptions and geopolitical uncertainties.  
25</form10k>
26
27<query>  
28Form10K(part_i=PartI(  
29   id="GTECH_2023_PART1",  
30   business=Business(  
31       id="GTECH_2023_BUSINESS",  
32       description="GreenTech Solutions, Inc. ('GreenTech' or the 'Company') is a leading provider of renewable energy technologies and sustainable infrastructure solutions. Founded in 1995, we design, manufacture, and install solar power systems, energy storage solutions, and smart grid technologies for residential, commercial, and utility-scale applications. Through continued innovation and strategic expansion, we have established ourselves as a trusted partner in the global transition to sustainable energy.",  
33       segments=[  
34           "Residential Solar Solutions",  
35           "Commercial & Industrial Systems",  
36           "Utility Infrastructure"  
37       ],  
38       products=[  
39           "Solar panel systems",  
40           "Energy storage solutions",  
41           "Smart energy management software",  
42           "Grid infrastructure components",  
43           "Maintenance and monitoring services",  
44           "24/7 monitoring and rapid response services"  
45       ],  
46       competition="The renewable energy sector remains highly competitive, characterized by rapid technological advancement and evolving customer needs. We face competition from traditional energy companies that are expanding their renewable energy portfolios, pure-play solar providers, and international manufacturers. Key competitive factors in our market include technology efficiency, price, warranty terms, and installation quality. Major competitors include SunPower Corporation, Tesla, and First Solar, who generally have greater financial resources than us.",  
47       raw_materials="Primary materials include silicon wafers, solar glass, aluminum frames, and lithium-ion battery components. We have established strategic relationships with multiple suppliers across Asia and North America to ensure supply chain resilience and maintain competitive pricing. Our diversified supplier base helps mitigate risks associated with supply chain disruptions and geopolitical uncertainties.",  
48       intellectual_property=None,  # Not mentioned in the provided text  
49       seasonality=None,  # Not mentioned in the provided text  
50       employees=None,  # Not mentioned in the provided text  
51       regulatory_environment=None  # Not mentioned in the provided text  
52   )  
53))  
54</query>  
55<critique>  
56{{  
57 "critique": "Extraction is accurate. Unspecified sections are left out. However, description could have been shorter and simpler",  
58 "is_pass": "True"  
59}}  
60</critique>  
61</example-1>  
62"""

Best Practices for LLM Evaluation

1. Binary Decisions

Use PASS/FAIL instead of scales
Reduces ambiguity
Easier to measure alignment

2. Structured Feedback / Critique

Require specific rationales
Document decision criteria
Track improvement areas

This is our Evaluation Response Structure:
class LLMEvalResults(BaseModel):
is_pass: bool = Field(..., description="Is Business section extraction is accurate in the Form10K object? i.e Business content represented by Form10K.part_i.business")
critique: str = Field(..., description="Reason why you think the Business section extraction is bad")

This is how we call the prompt:

1llm_evals = []  
2for resp, completion, text in results:  
3 eval_res, c = client.chat.completions.create_with_completion(  
4   model="claude-3-5-haiku-20241022",  
5   messages=[  
6       {  
7           "role": "user",  
8           "content": [  
9               {  
10                   "type": "text",  
11                   "text": "<form10k>" + text + "</form10k>"  
12               },  
13               {  
14                   "type": "text",  
15                   "text": JUDGE_PROMPT_FOR_BUSINESS_SECTION,  
16               },  
17           ],  
18       },  
19   ],  
20   response_model=LLMEvalResults,  
21   max_tokens=1000,  
22 )  
23 llm_evals.append((resp, eval_res, completion, text))

Alignment Process:

Data Collection
- Convert evaluations to CSV format
- Include LLM rationales
- Track evaluation metrics
SME Review
- Expert validation of results
- Critique of LLM reasoning
- Documentation of discrepancies
Threshold Achievement
- Target 80-90% alignment with SME
- Iterate until the threshold is met

Production Integration

Once aligned, integrate the evaluation system:

Set up LangSmith Auto-Evaluation
Configure expert-vetted prompts
Monitor evaluation metrics
Set up alerting for misalignments

Performance Metrics and Evaluation

When evaluating LLM-driven graph creation, we're essentially dealing with a classification task. Understanding the following key metrics is crucial:

Precision: Accuracy of positive predictions
- Crucial for reducing false positives in entity extraction
- Helps measure hallucination rate
Recall: Ability to find all relevant entities/relationships
- Important for knowledge completeness
- Helps identify missing information
ROC-AUC (Receiver Operating Characteristic - Area Under Curve)
- Evaluates model performance across different thresholds
- Useful for comparing different LLM models
PR-AUC (Precision-Recall - Area Under Curve)
- Particularly valuable for imbalanced datasets
- Better metric when missing data is costly
Distribution Separation
- Helps understand model confidence
- Identifies potential edge cases

If you use multiple models, metrics can help you to compare each other over time.

Graph Integration and Quality Assurance

Once confidence in extraction quality is established through metrics, integration with existing graph systems requires careful consideration of several factors:

Entity Resolution and Deduplication
Schema conformity
Versioning & Provenance
Rollback capability
Performance evaluation

These metrics provide a quantitative foundation for assessing and improving your graph creation pipeline. With this comprehensive evaluation framework in place, you're well-equipped to build production-grade knowledge graphs.

Conclusion & Summary

Building Knowledge Graphs using Large Language Models opens up exciting possibilities for processing unstructured data, but like any production system, it requires robust evaluation mechanisms. Through this guide, we've explored a practical approach to make your LLM-driven graph creation production-ready:

Key Takeaways

Structured Approach
1. Start with clear ontology definitions
2. Collaborate with domain experts
3. Focus on specific use cases (like Form 10-K analysis)
4. Build incrementally with continuous evaluation
Evaluation Framework
1. Use LLMs as judges with expert alignment
2. Implement comprehensive tracing
3. Monitor key metrics consistently
4. Maintain feedback loops for improvement
Quality Assurance
1. Validate against ontology rules
2. Prevent hallucinations through rigorous checking
3. Ensure data completeness and accuracy
4. Maintain clear versioning and provenance
Production Readiness
1. Integrate robust evaluation pipelines
2. Implement entity resolution strategies
3. Set up monitoring and alerting
4. Enable rollback capabilities

Remember, there's no one-size-fits-all solution. Your evaluation needs will depend on your specific use case, data sensitivity, and accuracy requirements. The key is to start with a basic evaluation framework and iteratively improve it based on real-world feedback and performance metrics.

Next Steps

Review the accompanying notebook for implementation details
Start with small, focused evaluations
Gradually expand your evaluation coverage
Keep iterating on your prompts and processes

By following these practices, you can build reliable, production-grade knowledge graphs that provide real value while maintaining high quality standards.