Building a Document Search System with Python and Streamlit (No ChatGPT API Required)

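Organizations generate huge numbers of documents and files, and finding the information you need among them efficiently is difficult. This article explains how to build a document search system with Python and Streamlit without using the ChatGPT API. The system supports several file formats (PDF, Word, Excel, and PowerPoint) and, given a user's search query, automatically picks out the most relevant documents.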

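System overview

Key features

Recursive document loading:
- Recursively searches the specified directory for all PDF, Word (.docx), Excel (.xlsx), and PowerPoint (.pptx) files and reads their contents.
Query search:
- Searches the text of the documents for the user's query and lists the most relevant documents.
TF-IDF and cosine similarity:
- Computes the similarity between the query and each document and displays the results as a ranking.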
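To make the ranking idea concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is an illustration rather than the implementation used later in the article: the helper names `tf_idf_vectors` and `cosine`, the whitespace tokenization, and the two sample texts are all assumptions made for this example, and the IDF formula is the smoothed variant that scikit-learn documents as its default.

```python
import math
from collections import Counter


def tf_idf_vectors(token_lists):
    """Build one sparse TF-IDF vector (a dict) per tokenized text."""
    n_docs = len(token_lists)
    # Document frequency: the number of texts that contain each term.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF, ln((1 + n) / (1 + df)) + 1, keeps every weight positive.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample texts, tokenized on whitespace for simplicity.
docs = [
    "quarterly sales report for the tokyo office".split(),
    "employee onboarding checklist and office rules".split(),
]
query = "sales report".split()

# For brevity, the documents and the query are weighted in one pass.
*doc_vectors, query_vector = tf_idf_vectors(docs + [query])
scores = [cosine(query_vector, d) for d in doc_vectors]
```

The first text shares two terms with the query, so it receives a positive score, while the second shares none and scores zero. Terms that occur in fewer texts get a larger IDF weight, which is why rare query words influence the ranking more than common ones.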

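Required tools and setup

1. Required libraries

The following libraries are used:

streamlit (building the web application)
PyMuPDF (text extraction from PDF files)
python-docx (text extraction from Word files)
openpyxl (text extraction from Excel files)
python-pptx (text extraction from PowerPoint files)
scikit-learn (TF-IDF and cosine similarity computation)

2. Installing the libraries

Run the following command to install the required libraries.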

pip install streamlit pymupdf python-docx openpyxl python-pptx scikit-learn

Directory structure

document_search_app/
│
├── app.py                   # Main application
├── requirements.txt         # List of required libraries
└── utils/
    ├── file_loader.py       # File loading and text extraction
    └── text_search.py       # Search algorithm

Implementation

1. File loading and text extraction (`file_loader.py`)

import os
import fitz  # PyMuPDF for PDF
import docx
import openpyxl
from pptx import Presentation

def extract_text_from_pdf(file_path):
    text = ""
    with fitz.open(file_path) as pdf:
        for page in pdf:
            text += page.get_text()
    return text

def extract_text_from_word(file_path):
    doc = docx.Document(file_path)
    return "\n".join([para.text for para in doc.paragraphs])

def extract_text_from_excel(file_path):
    workbook = openpyxl.load_workbook(file_path)
    text = []
    for sheet in workbook.sheetnames:
        worksheet = workbook[sheet]
        for row in worksheet.iter_rows(values_only=True):
            # Compare against None so that cells holding 0 or False are kept
            text.append(" ".join(str(cell) for cell in row if cell is not None))
    return "\n".join(text)

def extract_text_from_powerpoint(file_path):
    presentation = Presentation(file_path)
    text = []
    for slide in presentation.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                text.append(shape.text)
    return "\n".join(text)

def extract_text(file_path):
    if file_path.lower().endswith('.pdf'):
        return extract_text_from_pdf(file_path)
    elif file_path.lower().endswith('.docx'):
        return extract_text_from_word(file_path)
    elif file_path.lower().endswith('.xlsx'):
        return extract_text_from_excel(file_path)
    elif file_path.lower().endswith('.pptx'):
        return extract_text_from_powerpoint(file_path)
    else:
        return None

def load_documents_from_directory(directory):
    documents = {}
    for root, _, files in os.walk(directory):
        for file in files:
            if file.startswith('~$'):  # Skip temporary files
                continue
            file_path = os.path.join(root, file)
            if file.lower().endswith(('.pdf', '.docx', '.xlsx', '.pptx')):
                try:
                    text = extract_text(file_path)
                    if text:
                        documents[file_path] = text
                except Exception as e:
                    print(f"Error reading {file_path}: {e}")
    return documents
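
Before running text extraction over a large directory, it can help to preview which files would be picked up. The following is a minimal sketch of the traversal step alone, using only the standard library. The function name `list_supported_files` and the constant `SUPPORTED_SUFFIXES` are hypothetical and are not part of `file_loader.py`; the filtering rules are assumed to mirror the ones in the loader above.

```python
from pathlib import Path

# Assumed to match the extensions handled by the loader.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".xlsx", ".pptx"}


def list_supported_files(directory):
    """Return sorted paths of supported files under directory, skipping temporary files."""
    found = []
    for path in Path(directory).rglob("*"):
        if not path.is_file():
            continue
        if path.name.startswith("~$"):  # temporary file left next to an open document
            continue
        if path.suffix.lower() in SUPPORTED_SUFFIXES:
            found.append(path)
    return sorted(found)
```

Calling `list_supported_files("some/directory")` returns `Path` objects, so the result can be printed or counted without opening any of the files.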

2. Search algorithm (`text_search.py`)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search_documents(query, documents):
    doc_texts = list(documents.values())
    file_paths = list(documents.keys())

    # Learn the vocabulary and IDF weights from the documents only,
    # then project the query into the same vector space.
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(doc_texts)
    query_vector = vectorizer.transform([query])

    similarities = cosine_similarity(query_vector, doc_vectors).flatten()

    # Sort by descending similarity and drop documents with zero similarity
    sorted_indices = np.argsort(-similarities)
    results = [(file_paths[i], similarities[i]) for i in sorted_indices if similarities[i] > 0]

    return results
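
One caveat: with its default settings, `TfidfVectorizer` tokenizes on word boundaries, so an unbroken run of text in a language written without spaces (Japanese, for example) ends up as a single token and a shorter query will not match it. If your documents contain such text, one possible workaround is to weight character n-grams instead of words. The sketch below assumes that situation; the function name `rank_with_char_ngrams`, the n-gram range, and the sample file names and texts are illustrative choices, not part of the article's code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_with_char_ngrams(query, documents):
    """Rank documents by TF-IDF cosine similarity over character 2- and 3-grams."""
    paths = list(documents.keys())
    texts = list(documents.values())
    # analyzer="char" builds features from character n-grams rather than words.
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
    doc_vectors = vectorizer.fit_transform(texts)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors).flatten()
    ranked = sorted(zip(paths, scores), key=lambda item: item[1], reverse=True)
    return [(path, float(score)) for path, score in ranked if score > 0]


# Hypothetical in-memory documents standing in for extracted text.
sample = {
    "guide.docx": "機械学習の入門資料",
    "expenses.xlsx": "経費精算の手順書",
}
results = rank_with_char_ngrams("機械学習", sample)
```

Here only `guide.docx` shares character n-grams with the query, so it is the only entry returned. A morphological tokenizer would be another option, at the cost of an extra dependency.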

3. Streamlit application (`app.py`)

import streamlit as st
from utils.file_loader import load_documents_from_directory
from utils.text_search import search_documents

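st.title("Document Search System")

# Directory path input
directory = st.text_input("Enter the directory to search")

# Streamlit reruns the whole script on every interaction, and st.button is
# True only for the run triggered by its click. The loaded documents are
# therefore kept in st.session_state rather than nested under the load button.
if directory and st.button("Load"):
    documents = load_documents_from_directory(directory)
    st.session_state["documents"] = documents
    if documents:
        st.success(f"Loaded {len(documents)} documents.")
    else:
        st.error("No documents were found.")

documents = st.session_state.get("documents")
if documents:
    query = st.text_input("Enter a search query")
    if st.button("Search") and query:
        results = search_documents(query, documents)
        if results:
            st.subheader("Search results")
            for file_path, score in results:
                st.write(f"File: {file_path} - Similarity: {score:.2f}")
        else:
            st.info("No matching results were found.")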

Verifying the system

1. Starting the application

Run the following command from a terminal in the document_search_app directory.

streamlit run app.py

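2. Using the application

Enter the directory to search and press the load button to read the documents in that directory. The directory path is typed in by hand, so copy and paste the path of the target directory.
Enter a search query and press the search button to display the documents related to the query, ordered by similarity.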