README.md

   1 # Wildspeed - fast wildcard search for LIKE operator
   2
   3 Wildspeed extension provides GIN index support for wildcard search
   4 for LIKE operator.
   5
   6 Online version of this document is available
   7 http://www.sai.msu.su/~megera/wiki/wildspeed
   8
   9 ## Authors
  10
  11 * Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
  12 * Teodor Sigaev <teodor@sigaev.ru>, Moscow, Moscow University,Russia
  13
  14 ## License
  15
  16 Stable version, included into PostgreSQL distribution, released under
  17 BSD license. Development version, available from this site, released
  18 under the GNU General Public License, version 2 (June 1991)
  19
  20 ## Downloads
  21
  22 Stable version of wildspeed is available from
  23 http://www.sigaev.ru/cvsweb/cvsweb.cgi/wildspeed/
  24
  25 ## Installation
  26
  27     % make USE_PGXS=1
  28     % make install
  29     % make installcheck
  30     % psql DB -c 'CREATE EXTENSION wildspeed'
  31
  32 Wildspeed provides opclass (wildcard_ops) and uses partial match
  33 feature of GIN, available since 8.4. Also, it supports full index scan.
  34
  35 The size of index can be very big, since it contains entries for all
  36 permutations of the original word, see [1] for details. For example,
  37 word hello will be indexed as well as its all permutations:
  38
  39     =# select permute('hello');
  40                    permute
  41     --------------------------------------
  42      {hello$,ello$h,llo$he,lo$hel,o$hell}
  43
  44        Notice, symbol '$' is used only for visualization, in actual
  45        implementation null-symbol '\0' is used.
  46
  47        Search query rewritten as prefix search:
  48     *X  -> X$*
  49     X*Y -> Y$X*
  50     *X* -> X*
  51
  52 For example, search for 'hel*o' will be rewritten as 'o$hel'.
  53
  54 Special function `permute(TEXT)`, which returns all permutations of
  55 argument, provided for test purposes.
  56
  57 Performance of wildspeed depends on search pattern. Basically,
  58 wildspeed is less effective than btree index with text_pattern_ops for
  59 prefix search (the difference is greatly reduced for long prefixes) and
  60 much faster for wildcard search.
  61
  62 Wildspeed by default uses optimization (skip short patterns if there
  63 are long one), which can be turned off in Makefile by removing define
  64 `-DOPTIMIZE_WILDCARD_QUERY`.
  65
  66 ## References
  67
  68 * http://www.cs.wright.edu/~tkprasad/courses/cs499/L05TolerantIR.ppt
  69 * http://nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html
  70
  71 ## Examples
  72
  73 Table words contains 747358 records, w1 and w2 columns contains the
  74 same data in order to test performance of Btree (w1) and GIN (w2)
  75 indexes:
  76
  77        Table "public.words"
  78      Column | Type | Modifiers
  79     --------+------+-----------
  80      w1     | text |
  81      w2     | text |
  82
  83     words=# create index bt_idx on words using btree (w1 text_pattern_ops);
  84     CREATE INDEX
  85     Time: 1885.195 ms
  86     words=# create index gin_idx on words using gin (w2 wildcard_ops);
  87     vacuum analyze;
  88     CREATE INDEX
  89     Time: 530351.223 ms
  90
  91 Size:
  92
  93     words=# select pg_relation_size('words');
  94      pg_relation_size
  95     ------------------
  96              43253760
  97
  98     words=# select pg_relation_size('gin_idx');
  99      pg_relation_size
 100     ------------------
 101             417816576
 102     (1 row)
 103
 104     words=# select pg_relation_size('bt_idx');
 105      pg_relation_size
 106     ------------------
 107              23437312
 108     (1 row)
 109
 110        Prefix search:
 111     words=# select count(*) from words where w1 like 'a%';
 112      count
 113     -------
 114      15491
 115     (1 row)
 116
 117     Time: 7.502 ms
 118     words=# select count(*) from words where w2 like 'a%';
 119      count
 120     -------
 121      15491
 122     (1 row)
 123
 124     Time: 31.152 ms
 125
 126 Wildcard search:
 127
 128     words=# select count(*) from words where w1 like '%asd%';
 129      count
 130     -------
 131         26
 132     (1 row)
 133
 134     Time: 147.308 ms
 135     words=# select count(*) from words where w2 like '%asd%';
 136      count
 137     -------
 138         26
 139     (1 row)
 140
 141     Time: 0.339 ms
 142
 143 Full index scan:
 144
 145     words=# set enable_seqscan to off;
 146     words=# explain analyze select count(*) from words where w2 like '%';
 147                                                                     QUERY PLAN
 148
 149     --------------------------------------------------------------------------------------------------------------
 150     -----------------------------
 151      Aggregate  (cost=226274.98..226274.99 rows=1 width=0) (actual time=2218.709..2218.709 rows=1 loops=1)
 152        ->  Bitmap Heap Scan on words  (cost=209785.73..224406.77 rows=747283 width=0) (actual time=1510.516..1913.
 153     430 rows=747358 loops=1)
 154              Filter: (w2 ~~ '%'::text)
 155              ->  Bitmap Index Scan on gin_idx  (cost=0.00..209598.91 rows=747283 width=0) (actual time=1509.358..1
 156     509.358 rows=747358 loops=1)
 157                    Index Cond: (w2 ~~ '%'::text)
 158      Total runtime: 2218.747 ms
 159     (6 rows)