README.wildspeed

   1 WildSpeed - fast wildcard search for LIKE operator
   2
   3    Wildspeed extension provides GIN index support for wildcard search
   4    for LIKE operator.
   5
   6    Online version of this document is available
   7    http://www.sai.msu.su/~megera/wiki/wildspeed
   8
   9 Authors
  10
  11      * Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
  12      * Teodor Sigaev <teodor@sigaev.ru>, Moscow, Moscow University,Russia
  13
  14 License
  15
  16    Stable version, included into PostgreSQL distribution, released under
  17    BSD license. Development version, available from this site, released
  18    under the GNU General Public License, version 2 (June 1991)
  19
  20 Downloads
  21
  22    Stable version of wildspeed is available from
  23    http://www.sigaev.ru/cvsweb/cvsweb.cgi/wildspeed/
  24
  25 Installation
  26
  27 % cd PGSQLSRC/contrib
  28 % tar xzvf wildspeed.tar.gz
  29 % make
  30 % make install
  31 % make installcheck
  32 % psql DB < wildspeed.sql
  33
  34    Wildspeed provides opclass (wildcard_ops) and uses partial match
  35    feature of GIN, available since 8.4. Also, it supports full index scan.
  36
  37    The size of index can be very big, since it contains entries for all
  38    permutations of the original word, see [1] for details. For example,
  39    word hello will be indexed as well as its all permutations:
  40 =# select permute('hello');
  41                permute
  42 --------------------------------------
  43  {hello$,ello$h,llo$he,lo$hel,o$hell}
  44
  45    Notice, symbol '$' is used only for visualization, in actual
  46    implementation null-symbol '\0' is used.
  47
  48    Search query rewritten as prefix search:
  49 *X  -> X$*
  50 X*Y -> Y$X*
  51 *X* -> X*
  52
  53    For example, search for 'hel*o' will be rewritten as 'o$hel'.
  54
  55    Special function permute(TEXT), which returns all permutations of
  56    argument, provided for test purposes.
  57
  58    Performance of wildspeed depends on search pattern. Basically,
  59    wildspeed is less effective than btree index with text_pattern_ops for
  60    prefix search (the difference is greatly reduced for long prefixes) and
  61    much faster for wildcard search.
  62
  63    Wildspeed by default uses optimization (skip short patterns if there
  64    are long one), which can be turned off in Makefile by removing define
  65    -DOPTIMIZE_WILDCARD_QUERY.
  66
  67 References
  68
  69    1.  http://www.cs.wright.edu/~tkprasad/courses/cs499/L05TolerantIR.ppt, see also,
  70    http://nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html
  71
  72 Examples
  73
  74    Table words contains 747358 records, w1 and w2 columns contains the
  75    same data in order to test performance of Btree (w1) and GIN (w2)
  76    indexes:
  77    Table "public.words"
  78  Column | Type | Modifiers
  79 --------+------+-----------
  80  w1     | text |
  81  w2     | text |
  82
  83 words=# create index bt_idx on words using btree (w1 text_pattern_ops);
  84 CREATE INDEX
  85 Time: 1885.195 ms
  86 words=# create index gin_idx on words using gin (w2 wildcard_ops);
  87 vacuum analyze;
  88 CREATE INDEX
  89 Time: 530351.223 ms
  90
  91 Size:
  92
  93 words=# select pg_relation_size('words');
  94  pg_relation_size
  95 ------------------
  96          43253760
  97
  98 words=# select pg_relation_size('gin_idx');
  99  pg_relation_size
 100 ------------------
 101         417816576
 102 (1 row)
 103
 104 words=# select pg_relation_size('bt_idx');
 105  pg_relation_size
 106 ------------------
 107          23437312
 108 (1 row)
 109
 110    Prefix search:
 111 words=# select count(*) from words where w1 like 'a%';
 112  count
 113 -------
 114  15491
 115 (1 row)
 116
 117 Time: 7.502 ms
 118 words=# select count(*) from words where w2 like 'a%';
 119  count
 120 -------
 121  15491
 122 (1 row)
 123
 124 Time: 31.152 ms
 125
 126    Wildcard search:
 127 words=# select count(*) from words where w1 like '%asd%';
 128  count
 129 -------
 130     26
 131 (1 row)
 132
 133 Time: 147.308 ms
 134 words=# select count(*) from words where w2 like '%asd%';
 135  count
 136 -------
 137     26
 138 (1 row)
 139
 140 Time: 0.339 ms
 141
 142    Full index scan:
 143 words=# set enable_seqscan to off;
 144 words=# explain analyze select count(*) from words where w2 like '%';
 145                                                                 QUERY PLAN
 146
 147 --------------------------------------------------------------------------------------------------------------
 148 -----------------------------
 149  Aggregate  (cost=226274.98..226274.99 rows=1 width=0) (actual time=2218.709..2218.709 rows=1 loops=1)
 150    ->  Bitmap Heap Scan on words  (cost=209785.73..224406.77 rows=747283 width=0) (actual time=1510.516..1913.
 151 430 rows=747358 loops=1)
 152          Filter: (w2 ~~ '%'::text)
 153          ->  Bitmap Index Scan on gin_idx  (cost=0.00..209598.91 rows=747283 width=0) (actual time=1509.358..1
 154 509.358 rows=747358 loops=1)
 155                Index Cond: (w2 ~~ '%'::text)
 156  Total runtime: 2218.747 ms
 157 (6 rows)
 158