Nice work. pg_search has been on my radar for a while, having BM25 natively in Postgres instead of bolting on Elasticsearch is a huge DX win. Curious about the index build time on larger datasets though. I'm working with ~2M row tables and the bottleneck for most Postgres extensions I've tried isn't query speed, it's the initial indexing. Any benchmarks on that?
Yep, there are numbers in the blog post and repo. We are able to index MS-MARCO v2 (138M documents, around 50GB of raw data) in a bit under 18 minutes.
For 2M scale dataset, you should be able to index in about 1 minute on low-end hardware. See the MS-MARCO v1 (8M documents) numbers, measured on cheap Github runners.
This is really cool. I've built things on PostgreSQL ts_vector() FTS in the past which works well but doesn't have whole-index ranking algorithms so can't do BM25.
It's a bit surprising to me that this doesn't appear to have a mechanism to say "filter for just documents matching terms X and Y, then sort by BM25 relevance" - it looks like this extension currently handles just the BM25 ranking but not the FTS filtering. Are you planning to address that in the future?
I found this example in the README quite confusing:
SELECT * FROM documents
WHERE content <@> to_bm25query('search terms', 'docs_idx') < -5.0
ORDER BY content <@> 'search terms'
LIMIT 10;
That -5.0 is a magic number which, based on my understanding of BM25, is difficult to predict in advance since the threshold you would want to pick varies for different datasets.
I actually don't love this example either, for the reasons you mention, but at some point we had questions about how to filter based on numeric ranking. Thanks for the reminder to revisit this.
Re filtering, there are often reasonable workarounds in the SQL context that caused me to deprioritize this for GA. With your example, the workaround is to apply post-filtering to select just matches with all desired terms. This is not ideal ergonomics since you may have to play with the LIMIT that you'll need to get enough results, but it's already a familiar pattern if you're using vector indexes. For very selective conditions, pre-filtering by those conditions and then ranking afterwards is also an option for the planner, provided you've created indexes on the columns in question.
All this is just an argument about priorities for GA. Now that v1.0 is out, we'll get signal about which features to prioritize next.
TigerData:
> pg_textsearch v1.0 is freely available via open source (Postgres license)
They deemed AGPL untenable for their business and decided to create an OSS solution that used a license they were comfortable with and they are somehow "pernicious"? Perhaps take a moment to reflect on your characterization of a group that just contributed an alternative OSS project for a specific task. Not only that, but they used a VERY permissive license. I'd argue that they are being a better OSS community member for selecting a more permissive license.
Nice work. pg_search has been on my radar for a while, having BM25 natively in Postgres instead of bolting on Elasticsearch is a huge DX win. Curious about the index build time on larger datasets though. I'm working with ~2M row tables and the bottleneck for most Postgres extensions I've tried isn't query speed, it's the initial indexing. Any benchmarks on that?
Yep, there are numbers in the blog post and repo. We are able to index MS-MARCO v2 (138M documents, around 50GB of raw data) in a bit under 18 minutes.
For 2M scale dataset, you should be able to index in about 1 minute on low-end hardware. See the MS-MARCO v1 (8M documents) numbers, measured on cheap Github runners.
FWIW TJ is not your average vibe coder imo: https://www.linkedin.com/in/todd-j-green/
In september he burned through 3000$ in API credits though, but I think that's before we finally bought max plans for everyone that wanted it.
This is really cool. I've built things on PostgreSQL ts_vector() FTS in the past which works well but doesn't have whole-index ranking algorithms so can't do BM25.
It's a bit surprising to me that this doesn't appear to have a mechanism to say "filter for just documents matching terms X and Y, then sort by BM25 relevance" - it looks like this extension currently handles just the BM25 ranking but not the FTS filtering. Are you planning to address that in the future?
I found this example in the README quite confusing:
That -5.0 is a magic number which, based on my understanding of BM25, is difficult to predict in advance since the threshold you would want to pick varies for different datasets.I actually don't love this example either, for the reasons you mention, but at some point we had questions about how to filter based on numeric ranking. Thanks for the reminder to revisit this.
Re filtering, there are often reasonable workarounds in the SQL context that caused me to deprioritize this for GA. With your example, the workaround is to apply post-filtering to select just matches with all desired terms. This is not ideal ergonomics since you may have to play with the LIMIT that you'll need to get enough results, but it's already a familiar pattern if you're using vector indexes. For very selective conditions, pre-filtering by those conditions and then ranking afterwards is also an option for the planner, provided you've created indexes on the columns in question.
All this is just an argument about priorities for GA. Now that v1.0 is out, we'll get signal about which features to prioritize next.
> ParadeDB, is guarded behind AGPL
What a wonderful ad for ParadeDB, and clear signal that "TigerData" is a pernicious entity.
You: > "TigerData" is a pernicious entity
TigerData: > pg_textsearch v1.0 is freely available via open source (Postgres license)
They deemed AGPL untenable for their business and decided to create an OSS solution that used a license they were comfortable with and they are somehow "pernicious"? Perhaps take a moment to reflect on your characterization of a group that just contributed an alternative OSS project for a specific task. Not only that, but they used a VERY permissive license. I'd argue that they are being a better OSS community member for selecting a more permissive license.
https://malus.sh/
Okay then!
https://malus.sh/