核心模块(Modules)
数据连接(DataConnection)
文档加载器(DocumentLoaders)
文件目录(FileDirectory)

文件目录

这篇文章介绍了如何加载一个目录中的所有文档。

底层默认使用UnstructuredLoader

from langchain_community.document_loaders import DirectoryLoader

我们可以使用glob参数来控制加载哪些文件。注意这里不加载.rst.html文件。

loader = DirectoryLoader('../', glob="**/*.md")
docs = loader.load()
len(docs)

显示进度条

默认情况下不会显示进度条。要显示进度条,需要安装tqdm库(例如pip install tqdm),并将show_progress参数设置为True

loader = DirectoryLoader('../', glob="**/*.md", show_progress=True)
docs = loader.load()
    Requirement already satisfied: tqdm in /Users/jon/.pyenv/versions/3.9.16/envs/microbiome-app/lib/python3.9/site-packages (4.65.0)


    0it [00:00, ?it/s]

使用多线程

默认情况下,加载是在一个线程中完成的。为了利用多个线程,将use_multithreading标志设置为true。

loader = DirectoryLoader('../', glob="**/*.md", use_multithreading=True)
docs = loader.load()

更改加载类

默认情况下,使用UnstructuredLoader类。不过,你可以很容易地更改加载器的类型。

from langchain_community.document_loaders import TextLoader
loader = DirectoryLoader('../', glob="**/*.md", loader_cls=TextLoader)
docs = loader.load()
len(docs)

    1

如果需要加载Python源代码文件,请使用PythonLoader

from langchain_community.document_loaders import PythonLoader
loader = DirectoryLoader('../../../../../', glob="**/*.py", loader_cls=PythonLoader)
docs = loader.load()
len(docs)
    691

使用TextLoader自动检测文件编码

在这个例子中,我们将看到在使用TextLoader类加载目录中的大量任意文件时,一些有用的策略。

首先我们来说明问题,试着加载多个使用不同编码的文件。

path = '../../../../../tests/integration_tests/examples'
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader)

A. 默认行为

loader.load()
<pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"><span style="color: #800000; text-decoration-color: #800000">╭─────────────────────────────── </span><span style="color: #800000; text-decoration-color: #800000; font-weight: bold">Traceback </span><span style="color: #bf7f7f; text-decoration-color: #bf7f7f; font-weight: bold">(most recent call last)</span><span style="color: #800000; text-decoration-color: #800000"> ────────────────────────────────╮</span>
<span style="color: #800000; text-decoration-color: #800000">│</span> <span style="color: #bfbf7f; text-decoration-color: #bfbf7f">/data/source/langchain/langchain/document_loaders/</span><span style="color: #808000; text-decoration-color: #808000; font-weight: bold">text.py</span>:<span style="color: #0000ff; text-decoration-color: #0000ff">29</span> in <span style="color: #00ff00; text-decoration-color: #00ff00">load</span>                             <span style="color: #800000; text-decoration-color: #800000">│</span>
<span style="color: #800000; text-decoration-color: #800000">│</span>                                                                                                  <span style="color: #800000; text-decoration-color: #800000">│</span>
<span style="color: #800000; text-decoration-color: #800000">│</span>   <span style="color: #7f7f7f; text-decoration-color: #7f7f7f">26 </span><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">│   │   </span>text = <span style="color: #808000; text-decoration-color: #808000">""</span>                                                                           <span style="color: #800000; text-decoration-color: #800000">│</span>
<span style="color: #800000; text-decoration-color: #800000">│</span>   <span style="color: #7f7f7f; text-decoration-color: #7f7f7f">27 </span><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">│   │   </span><span style="color: #0000ff; text-decoration-color: #0000ff">with</span> <span style="color: #00ffff; text-decoration-color: #00ffff">open</span>(<span style="color: #00ffff; text-decoration-color: #00ffff">self</span>.file_path, encoding=<span style="color: #00ffff; text-decoration-color: #00ffff">self</span>.encoding) <span style="color: #0000ff; text-decoration-color: #0000ff">as</span> f:                             <span style="color: #800000; text-decoration-color: #800000">│</span>
<span style="color: #800000; text-decoration-color: #800000">│</span>   <span style="color: #7f7f7f; text-decoration-color: #7f7f7f">28 </span><span style="color: #7f7f7f; text-decoration-color: #7f7f7f">│   │   │   </span><span style="color: #0000ff; text-decoration-color: #0000ff">try</span>:                                                                            <span style="color: #800000; text-decoration-color: #800000">│</span>
...
...
...

example-non-utf8.txt文件使用了不同的编码,所以load()函数会失败,并给出一个有帮助的提示,指示哪个文件解码失败。

使用TextLoader的默认行为是,如果任何一个加载失败,整个加载过程都会失败,并且不会加载任何文档。

B. 静默失败

我们可以将silent_errors参数传递给DirectoryLoader,以跳过无法加载的文件,并继续加载过程。

loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, silent_errors=True)
docs = loader.load()
doc_sources = [doc.metadata['source']  for doc in docs]
doc_sources

    ['../../../../../tests/integration_tests/examples/whatsapp_chat.txt',
     '../../../../../tests/integration_tests/examples/example-utf8.txt']

C. 自动检测编码

我们还可以要求TextLoader在解码失败之前自动检测文件的编码,通过传递autodetect_encoding参数给加载器类。

text_loader_kwargs={'autodetect_encoding': True}
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
docs = loader.load()
doc_sources = [doc.metadata['source']  for doc in docs]
doc_sources

    ['../../../../../tests/integration_tests/examples/example-non-utf8.txt',
     '../../../../../tests/integration_tests/examples/whatsapp_chat.txt',
     '../../../../../tests/integration_tests/examples/example-utf8.txt']