Handling long text

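When working with files such as PDFs, you will often encounter text that exceeds the language model's context window. To process such text, consider the following strategies:

  1. Change the LLM: choose a different LLM that supports a larger context window.
  2. Brute force: split the document into chunks and extract content from every chunk.
  3. RAG: split the document into chunks, index the chunks, and extract content only from the subset of chunks that look "relevant".

Keep in mind that these strategies involve different trade-offs in cost, speed, and accuracy, and the best one depends on the application you are designing.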
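
As a rough illustration of how one might decide whether a document needs chunking at all, the sketch below estimates a token count and compares it with a model's context window. The function names, the strategy labels, the 4-characters-per-token heuristic, and the reserved-token budget are all assumptions made for illustration, not part of LangChain; a real tokenizer would give a more accurate count.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; ~4 characters per token is only a rule of thumb for English."""
    return int(len(text) / chars_per_token) + 1


def pick_strategy(text: str, context_window: int, reserved_tokens: int = 1000) -> str:
    """Return a hypothetical strategy label for a document.

    reserved_tokens leaves room for the prompt, the schema, and the model's output.
    """
    budget = context_window - reserved_tokens
    if budget <= 0:
        raise ValueError("context_window must exceed reserved_tokens")
    if estimate_tokens(text) <= budget:
        return "single_call"   # the whole document fits in one request
    return "chunk_or_rag"      # the document must be split: brute force or RAG


if __name__ == "__main__":
    doc = "x" * 78967  # a placeholder string of roughly article length
    print(pick_strategy(doc, context_window=8192))
    print(pick_strategy(doc, context_window=128000))
```

With these assumed numbers, a small context window forces chunking, while a large one lets the whole document go through in a single call.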

Setup

We need some test data. Here we download a Wikipedia article about cars and load it as a LangChain Document.

import re
import requests
from langchain_community.document_loaders import BSHTMLLoader

response = requests.get("https://en.wikipedia.org/wiki/Car")
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)
loader = BSHTMLLoader("car.html")
document = loader.load()[0]
document.page_content = re.sub("\n\n+", "\n", document.page_content)
print(len(document.page_content))
78967

Define the schema (the data structure to extract)

Here we define a schema for extracting key developments from the text.

from typing import List
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

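class KeyDevelopment(BaseModel):
    """Information about a development in the history of cars."""

    year: int = Field(
        ..., description="The year in which an important historic development occurred."
    )
    description: str = Field(
        ..., description="What happened in this year? What was the development?"
    )
    evidence: str = Field(
        ...,
        description="Repeat verbatim the sentence(s) from which the year and description were extracted.",
    )


class ExtractionData(BaseModel):
    """Extracted information about key developments in the history of cars."""

    key_developments: List[KeyDevelopment]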

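# Define the prompt
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at identifying key historic developments in text. "
            "Only extract important historic developments. "
            "Extract nothing if no important information can be found in the text.",
        ),
        ("human", "{text}"),
    ]
)
# Define the model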
llm = ChatOpenAI(
    model="gpt-4-0125-preview",
    temperature=0,
)

# Build the chain with an LCEL expression
extractor = prompt | llm.with_structured_output(
    schema=ExtractionData,
    method="function_calling",
    include_raw=False,
)

Brute-force approach

Split the document into chunks that each fit within the LLM's context window, then extract information from every chunk.

from langchain_text_splitters import TokenTextSplitter

# Define the text splitter and split the text
text_splitter = TokenTextSplitter(
    chunk_size=2000,
    chunk_overlap=20,
)

texts = text_splitter.split_text(document.page_content)
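
To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sliding-window sketch. It is not how TokenTextSplitter is implemented: it operates on an already tokenized list (the caller is assumed to supply tokens, for example whitespace-separated words), whereas the real splitter relies on a model tokenizer. The function name split_with_overlap is hypothetical.

```python
from typing import List


def split_with_overlap(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


if __name__ == "__main__":
    words = "the modern car was invented in 1886 by Carl Benz".split()
    for chunk in split_with_overlap(words, chunk_size=4, chunk_overlap=1):
        print(chunk)
```

Each window starts chunk_size - chunk_overlap tokens after the previous one, so the last chunk_overlap tokens of one chunk reappear at the start of the next. That repetition is what lets a sentence cut at a boundary survive in at least one chunk, and it is also why overlapping chunks can yield duplicate extractions.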

Use the .batch method to run the extraction logic on each chunk in parallel.

Under the hood, batch uses a thread pool to parallelize the LLM calls.

first_few = texts[:3]

extractions = extractor.batch(
    [{"text": text} for text in first_few],
    {"max_concurrency": 5},  # cap the number of concurrent calls
)
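
Conceptually, a bounded thread pool over a list of inputs looks like the sketch below. This is only an analogy built on the standard library, not LangChain's implementation; run_batch and fake_extract are hypothetical names, and fake_extract stands in for a real LLM call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_batch(
    func: Callable[[Dict[str, str]], Dict[str, int]],
    inputs: List[Dict[str, str]],
    max_concurrency: int = 5,
) -> List[Dict[str, int]]:
    """Apply func to every input using at most max_concurrency worker threads."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map returns results in input order, regardless of completion order
        return list(pool.map(func, inputs))


def fake_extract(payload: Dict[str, str]) -> Dict[str, int]:
    """Stand-in for an LLM extraction call (assumption for illustration)."""
    return {"n_chars": len(payload["text"])}


if __name__ == "__main__":
    chunks = ["first chunk", "second chunk", "third"]
    print(run_batch(fake_extract, [{"text": c} for c in chunks], max_concurrency=2))
```

Threads suit this workload because each call spends most of its time waiting on the network. The concurrency cap helps keep the request rate within the provider's limits.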

Merge the results

After extracting data from each chunk, we merge the extractions into a single list.

key_developments = []

for extraction in extractions:
    key_developments.extend(extraction.key_developments)

key_developments[:20]

Example extraction results

[KeyDevelopment(year=1966, description="The Toyota Corolla began production, recognized as the world's best-selling automobile.", evidence="The Toyota Corolla has been in production since 1966 and is recognized as the world's best-selling automobile."),
 KeyDevelopment(year=1769, description='Nicolas-Joseph Cugnot built the first steam-powered road vehicle.', evidence='French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769.'),
 KeyDevelopment(year=1808, description='François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile.', evidence='French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),
 KeyDevelopment(year=1886, description='Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car.', evidence='The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German inventor Carl Benz patented his Benz Patent-Motorwagen.'),
 KeyDevelopment(year=1908, description='The 1908 Model T, an affordable car for the masses, was manufactured by the Ford Motor Company.', evidence='One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company.'),
 KeyDevelopment(year=1881, description='Gustave Trouvé demonstrated a three-wheeled car powered by electricity.', evidence='In November 1881, French inventor Gustave Trouvé demonstrated a three-wheeled car powered by electricity at the International Exposition of Electricity.'),
 KeyDevelopment(year=1888, description="Bertha Benz undertook the first road trip by car to prove the road-worthiness of her husband's invention.", evidence="In August 1888, Bertha Benz, the wife of Carl Benz, undertook the first road trip by car, to prove the road-worthiness of her husband's invention."),
 KeyDevelopment(year=1896, description='Benz designed and patented the first internal-combustion flat engine, called boxermotor.', evidence='In 1896, Benz designed and patented the first internal-combustion flat engine, called boxermotor.'),
 KeyDevelopment(year=1897, description='Nesselsdorfer Wagenbau produced the Präsident automobil, one of the first factory-made cars in the world.', evidence='The first motor car in central Europe and one of the first factory-made cars in the world, was produced by Czech company Nesselsdorfer Wagenbau (later renamed to Tatra) in 1897, the Präsident automobil.'),
 KeyDevelopment(year=1890, description='Daimler Motoren Gesellschaft (DMG) was founded by Daimler and Maybach in Cannstatt.', evidence='Daimler and Maybach founded Daimler Motoren Gesellschaft (DMG) in Cannstatt in 1890.'),
 KeyDevelopment(year=1902, description='A new model DMG car was produced and named Mercedes after the Maybach engine.', evidence='Two years later, in 1902, a new model DMG car was produced and the model was named Mercedes after the Maybach engine, which generated 35 hp.'),
 KeyDevelopment(year=1891, description='Auguste Doriot and Louis Rigoulot completed the longest trip by a petrol-driven vehicle using a Daimler powered Peugeot Type 3.', evidence='In 1891, Auguste Doriot and his Peugeot colleague Louis Rigoulot completed the longest trip by a petrol-driven vehicle when their self-designed and built Daimler powered Peugeot Type 3 completed 2,100 kilometres (1,300 mi) from Valentigney to Paris and Brest and back again.'),
 KeyDevelopment(year=1895, description='George Selden was granted a US patent for a two-stroke car engine.', evidence='After a delay of 16 years and a series of attachments to his application, on 5 November 1895, Selden was granted a US patent (U.S. patent 549,160) for a two-stroke car engine.'),
 KeyDevelopment(year=1893, description='The first running, petrol-driven American car was built and road-tested by the Duryea brothers.', evidence='In 1893, the first running, petrol-driven American car was built and road-tested by the Duryea brothers of Springfield, Massachusetts.'),
 KeyDevelopment(year=1897, description='Rudolf Diesel built the first diesel engine.', evidence='In 1897, he built the first diesel engine.'),
 KeyDevelopment(year=1901, description='Ransom Olds started large-scale, production-line manufacturing of affordable cars at his Oldsmobile factory.', evidence='Large-scale, production-line manufacturing of affordable cars was started by Ransom Olds in 1901 at his Oldsmobile factory in Lansing, Michigan.'),
 KeyDevelopment(year=1913, description="Henry Ford began the world's first moving assembly line for cars at the Highland Park Ford Plant.", evidence="This concept was greatly expanded by Henry Ford, beginning in 1913 with the world's first moving assembly line for cars at the Highland Park Ford Plant."),
 KeyDevelopment(year=1914, description="Ford's assembly line worker could buy a Model T with four months' pay.", evidence="In 1914, an assembly line worker could buy a Model T with four months' pay."),
 KeyDevelopment(year=1926, description='Fast-drying Duco lacquer was developed, allowing for a variety of car colors.', evidence='Only Japan black would dry fast enough, forcing the company to drop the variety of colours available before 1913, until fast-drying Duco lacquer was developed in 1926.')]
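
Merged results like these can contain repeats when overlapping chunks surface the same sentence twice. A minimal sketch of one possible post-processing step follows, assuming that two items with the same year and the same whitespace-normalized evidence are duplicates. The Development dataclass is a stand-in for the KeyDevelopment model so that the snippet runs without LangChain, and dedupe_and_sort is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Development:
    """Stand-in for the KeyDevelopment schema (assumption for illustration)."""
    year: int
    description: str
    evidence: str


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def dedupe_and_sort(items: List[Development]) -> List[Development]:
    """Drop items with the same (year, evidence) key, then sort by year."""
    seen: Set[Tuple[int, str]] = set()
    unique: List[Development] = []
    for item in items:
        key = (item.year, _normalize(item.evidence))
        if key in seen:
            continue  # same year and same supporting sentence: keep the first one
        seen.add(key)
        unique.append(item)
    return sorted(unique, key=lambda d: d.year)
```

Keying on the evidence rather than the description is a design choice: descriptions are paraphrased by the model and may differ between chunks, while the quoted evidence should be identical when it comes from the same sentence.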

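RAG-based approach

Another simple idea is to split the text into chunks but, instead of extracting from every chunk, focus only on the most relevant ones, since not every chunk contains the information we want to extract.

The idea is workable, but it has a weakness: it is hard to determine accurately which chunks contain the information we need.

For example, in the car article used here, most of the article contains key development information. By using RAG, we will therefore likely discard a lot of relevant information.

We present this approach only as an idea to experiment with; test it against your own use case to decide whether it is reliable enough.

Below is an example based on a FAISS vector store.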

from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

texts = text_splitter.split_text(document.page_content)
vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 1}
)  # k = 1 retrieves only the single most similar chunk

In this case, the RAG extractor looks only at the top-ranked document.

rag_extractor = {
    "text": retriever | (lambda docs: docs[0].page_content)  # take the content of the top document
} | extractor
results = rag_extractor.invoke("Key developments associated with cars")
for key_development in results.key_developments:
    print(key_development)
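year=1924 description="Germany's first mass-manufactured car, the Opel 4PS Laubfrosch, was produced, making Opel the top car builder in Germany with 37.5% of the market." evidence="Germany's first mass-manufactured car, the Opel 4PS Laubfrosch (Tree Frog), came off the line at Rüsselsheim in 1924, soon making Opel the top car builder in Germany, with 37.5 per cent of the market."
year=1925 description='Morris had 41% of total British car production, dominating the market.' evidence='In 1925, Morris had 41 per cent of total British car production.'
year=1925 description='Citroën, Renault, and Peugeot produced 550,000 cars in France, dominating the market.' evidence="Citroën began producing cars in France in 1919; together with other cheap cars such as Renault's 10CV and Peugeot's 5CV, they produced 550,000 cars in 1925."
year=2017 description='Production of petrol-fuelled cars peaked.' evidence='Production of petrol-fuelled cars peaked in 2017.'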

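Common issues

Each approach has its own pros and cons in terms of cost, speed, and accuracy.

Watch out for the following issues:

  • Chunking means the LLM may fail to extract information that is spread across multiple chunks.
  • A large chunk overlap can cause the same information to be extracted twice, so be prepared to de-duplicate.
  • LLMs can make up data. If you are looking for a single fact in a large text and use the brute-force approach, you may end up with more made-up data.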
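
One inexpensive guard against made-up data is to check that each extracted evidence string really occurs in the source text. The sketch below assumes the evidence is meant to be a verbatim quote, as the schema's evidence field requests; evidence_is_grounded is a hypothetical helper, and the whitespace and case normalization is a simplifying choice that will reject legitimate quotes that the model altered slightly.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not cause false misses."""
    return " ".join(text.lower().split())


def evidence_is_grounded(evidence: str, source_text: str) -> bool:
    """Return True if the (normalized) evidence appears in the (normalized) source."""
    ev = _normalize(evidence)
    return bool(ev) and ev in _normalize(source_text)


if __name__ == "__main__":
    source = "In 1897, he built the first\ndiesel engine."
    print(evidence_is_grounded("he built the first diesel engine", source))
    print(evidence_is_grounded("he built the first jet engine", source))
```

Items that fail the check are not necessarily wrong, but they are good candidates for manual review or for dropping.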

