
LangChain: Splitting by Character


Split by character

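This is the simplest text-splitting method in LangChain. It splits the text on a separator string (by default "\n\n") and measures the length of each chunk by its number of characters.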

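  1. How the text is split: on a single separator (a character sequence such as "\n\n")
  2. How the chunk size is measured: by number of characters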
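The two points above can be illustrated with a small pure-Python sketch. This is not LangChain's implementation: the function name `split_by_separator` is hypothetical, overlap is left out, and a piece longer than `chunk_size` is simply kept whole. It only shows the idea of splitting on a separator and packing the pieces into chunks whose length is measured with `len`.

```python
def split_by_separator(text, separator="\n\n", chunk_size=1000, length_function=len):
    """Hypothetical helper: split on `separator`, then greedily pack pieces into chunks."""
    # Step 1: split the text on the separator and drop empty pieces.
    pieces = [p for p in text.split(separator) if p.strip()]

    chunks = []
    current = []  # pieces collected for the chunk being built
    for piece in pieces:
        candidate = separator.join(current + [piece])
        # Step 2: measure the candidate chunk by character count.
        if current and length_function(candidate) > chunk_size:
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
chunks = split_by_separator(sample, chunk_size=10)
print(chunks)  # ['aaaa\n\nbbbb', 'cccc\n\ndddd']
```

Each chunk here is at most 10 characters long, counting the separator that joins the pieces.
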
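from langchain.text_splitter import CharacterTextSplitter

# Read the document to be split
with open('../../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()

# Define the text splitter
text_splitter = CharacterTextSplitter(
    separator="\n\n",      # string to split on
    chunk_size=1000,       # maximum chunk length, as measured by length_function
    chunk_overlap=200,     # maximum overlap between consecutive chunks
    length_function=len,   # measure length in characters
)

# Split the text and print the first chunk
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])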
# Output
page_content='Madam Speaker, Madam Vice President (text omitted)....' lookup_str='' metadata={} lookup_index=0
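
The splitter above also sets `chunk_overlap`, which lets consecutive chunks share some trailing text. The following is a rough sketch of that idea under simplifying assumptions; the function `merge_with_overlap` is hypothetical, and LangChain's own merging logic may differ in detail.

```python
def merge_with_overlap(pieces, separator="\n\n", chunk_size=1000, chunk_overlap=200):
    """Hypothetical helper: pack pieces into chunks that may share a trailing tail."""
    chunks = []
    current = []  # pieces in the chunk being built
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carried-over tail is no longer
            # than chunk_overlap and still leaves room for the new piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


pieces = ["aaaa", "bbbb", "cccc", "dddd"]
overlapped = merge_with_overlap(pieces, chunk_size=10, chunk_overlap=4)
print(overlapped)  # ['aaaa\n\nbbbb', 'bbbb\n\ncccc', 'cccc\n\ndddd']
```

With an overlap of 4 characters, the last piece of each chunk is repeated at the start of the next one, so context at a chunk boundary is not lost.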

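The following example passes metadata along with the documents. Note that the metadata is split together with the documents, so every chunk carries the metadata of the document it came from.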

metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents([state_of_the_union, state_of_the_union], metadatas=metadatas)
print(documents[0])
# Output
page_content='..(text omitted)..' lookup_str='' metadata={'document': 1} lookup_index=0

# split_text returns plain strings rather than Document objects
text_splitter.split_text(state_of_the_union)[0]
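
To make the metadata behaviour in the example above concrete, here is a minimal sketch that does not use LangChain. The `Doc` class and `create_docs` function are hypothetical stand-ins, and the splitting is reduced to a plain `str.split`; the point is only that each chunk receives its own copy of the source document's metadata.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document object with content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def create_docs(texts, metadatas=None, separator="\n\n"):
    """Split each text and attach a copy of its metadata to every chunk."""
    metadatas = metadatas or [{} for _ in texts]
    docs = []
    for text, meta in zip(texts, metadatas):
        for chunk in text.split(separator):
            # Copy the metadata so that chunks do not share one mutable dict.
            docs.append(Doc(page_content=chunk, metadata=deepcopy(meta)))
    return docs


docs = create_docs(["a\n\nb", "c\n\nd"], metadatas=[{"document": 1}, {"document": 2}])
print([d.metadata for d in docs])
# [{'document': 1}, {'document': 1}, {'document': 2}, {'document': 2}]
```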

