Files
sure/db/migrate/20260316120000_create_vector_store_chunks.rb
Dream 6d22514c01 feat(vector-store): Implement pgvector adapter for self-hosted RAG (#1211)
* Add conditional migration for vector_store_chunks table

Creates the pgvector-backed chunks table when VECTOR_STORE_PROVIDER=pgvector.
Enables the vector extension, adds store_id/file_id indexes, and uses
vector(1024) column type for embeddings.

* Add VectorStore::Embeddable concern for text extraction and embedding

Shared concern providing extract_text (PDF via pdf-reader, plain-text as-is),
paragraph-boundary chunking (~2000 chars, ~200 overlap), and embed/embed_batch
via OpenAI-compatible /v1/embeddings endpoint using Faraday. Configurable via
EMBEDDING_MODEL, EMBEDDING_URI_BASE, with fallback to OPENAI_* env vars.
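As a rough illustration of the chunking strategy described above (method and constant names here are hypothetical, not the concern's actual API):

```ruby
# Hypothetical sketch of paragraph-boundary chunking with overlap: split on
# blank lines, pack paragraphs into ~2000-char chunks, and seed each new
# chunk with the ~200-char tail of the previous one so context carries
# across chunk boundaries.
CHUNK_SIZE = 2000
OVERLAP = 200

def chunk_text(text)
  chunks = []
  current = +""
  text.split(/\n{2,}/).each do |para|
    if !current.empty? && current.length + para.length + 2 > CHUNK_SIZE
      chunks << current
      # Carry the tail of the finished chunk into the next one.
      current = (current[-OVERLAP..] || current).dup
    end
    current << "\n\n" unless current.empty?
    current << para
  end
  chunks << current unless current.empty?
  chunks
end
```

Note that a single paragraph longer than CHUNK_SIZE is not split here; that gap is what the hard_split fallback added later in this PR addresses.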

* Implement VectorStore::Pgvector adapter with raw SQL

Replaces the stub with a full implementation using
ActiveRecord::Base.connection with parameterized binds. Supports
create_store, delete_store, upload_file (extract+chunk+embed+insert),
remove_file, and cosine-similarity search via the <=> operator.
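A sketch of what the cosine-similarity query shape could look like (helper names are illustrative, not the adapter's actual API; pgvector's `<=>` operator returns cosine distance, so similarity is `1 - distance`):

```ruby
# Hypothetical helpers illustrating the search query. The embedding is
# serialized to pgvector's bracketed literal form ("[0.1,0.2,...]") and
# passed as a bind parameter; LIMIT is coerced to an integer so it cannot
# carry injected SQL.
def vector_literal(embedding)
  "[#{embedding.map(&:to_f).join(',')}]"
end

def search_sql(limit:)
  <<~SQL
    SELECT content, metadata, 1 - (embedding <=> $1::vector) AS score
    FROM vector_store_chunks
    WHERE store_id = $2
    ORDER BY embedding <=> $1::vector
    LIMIT #{limit.to_i}
  SQL
end
```

The adapter would then execute this through `ActiveRecord::Base.connection`, supplying the vector literal and store id as binds.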

* Add registry test for pgvector adapter selection

* Configure pgvector in compose.example.ai.yml

Switch db image to pgvector/pgvector:pg16, add VECTOR_STORE_PROVIDER,
EMBEDDING_MODEL, and EMBEDDING_DIMENSIONS env vars, and include
nomic-embed-text in Ollama's pre-loaded models.
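Illustratively, the relevant fragment of such a compose file might look like the following (values are examples, not the actual file contents; `EMBEDDING_DIMENSIONS` must match the embedding model's output size):

```yaml
services:
  db:
    # Postgres image with the vector extension pre-installed
    image: pgvector/pgvector:pg16
  app:
    environment:
      VECTOR_STORE_PROVIDER: pgvector
      EMBEDDING_MODEL: nomic-embed-text
      EMBEDDING_DIMENSIONS: "768"  # must match the model's vector size
```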

* Update pgvector docs from scaffolded to ready

Document env vars, embedding model setup, pgvector Docker image
requirement, and Ollama pull instructions.

* Address PR review feedback

- Migration: remove env guard, use pgvector_available? check so it runs
  on plain Postgres (CI) but creates the table on pgvector-capable servers.
  Add NOT NULL constraints on content/embedding/metadata, unique index on
  (store_id, file_id, chunk_index).
- Pgvector adapter: wrap chunk inserts in a DB transaction to prevent
  partial file writes. Override supported_extensions to match formats
  that extract_text can actually parse.
- Embeddable: add hard_split fallback for paragraphs exceeding CHUNK_SIZE
  to avoid overflowing embedding model token limits.
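A minimal sketch of such a hard-split fallback (hypothetical name and signature; the real implementation may differ):

```ruby
# Force-split any paragraph longer than the chunk size into fixed-size
# windows so a single oversized paragraph cannot exceed the embedding
# model's input limit.
def hard_split(paragraph, size = 2000)
  paragraph.chars.each_slice(size).map(&:join)
end
```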

* Bump schema version to include vector_store_chunks migration

CI uses db:schema:load which checks the version — without this bump,
the migration is detected as pending and tests fail to start.

* Update 20260316120000_create_vector_store_chunks.rb

---------

Co-authored-by: sokiee <sokysrm@gmail.com>
2026-03-20 17:01:31 +01:00

44 lines
1.5 KiB
Ruby

class CreateVectorStoreChunks < ActiveRecord::Migration[7.2]
  def up
    return unless pgvector_available?

    enable_extension "vector" unless extension_enabled?("vector")

    create_table :vector_store_chunks, id: :uuid do |t|
      t.string :store_id, null: false
      t.string :file_id, null: false
      t.string :filename
      t.integer :chunk_index, null: false, default: 0
      t.text :content, null: false
      t.column :embedding, "vector(#{ENV.fetch('EMBEDDING_DIMENSIONS', '1024')})", null: false
      t.jsonb :metadata, null: false, default: {}
      t.timestamps null: false
    end

    add_index :vector_store_chunks, :store_id
    add_index :vector_store_chunks, :file_id
    add_index :vector_store_chunks, [ :store_id, :file_id, :chunk_index ], unique: true,
      name: "index_vector_store_chunks_on_store_file_chunk"
  end

  def down
    drop_table :vector_store_chunks, if_exists: true
    disable_extension "vector" if extension_enabled?("vector")
  end

  private

    # Check if the pgvector extension is installed in the PostgreSQL server,
    # not just whether it is enabled in this database. This lets the migration
    # run harmlessly on plain Postgres (CI, dev without pgvector) while still
    # creating the table on pgvector-capable servers.
    def pgvector_available?
      result = ActiveRecord::Base.connection.execute(
        "SELECT 1 FROM pg_available_extensions WHERE name = 'vector' LIMIT 1"
      )
      result.any?
    rescue
      false
    end
end