Search Engine Optimization > Webmaster World > Suggestions for a database management system
Suggestions for a database management system
Posted by Matt Probert on March 23rd, 2006

Suggestions requested, please.

For a database management system to handle a two record, free format
text data of highly variable record length (between 2 and 10,000
bytes), with an expected maximum of 500,000 record pairs.

Indexed within one record of the record pair on all 'words' not
contained within a 'stop word' list or perhaps comprising individual
records within another data file. Index of 'words' and also phrases to
correspond to the other record of the record pair.

Expected access to be around 50,000 requests per 24 hour period,
response time is important, disk space is at a premium and as such
solutions with fixed block text fields may be overly inefficient,
platform prefered Linux.

Sensible, thought out, suggestions by experienced database developers
should be welcomed please.

Matt


Posted by David Cary Hart on March 23rd, 2006

On Thu, 23 Mar 2006 12:09:45 GMT
www@probertencyclopaedia.com (Matt Probert) opined:
http://www.corpit.ru/mjt/tinycdb.html
--
Displayed Email Address is a SPAM TRAP
Our DNSRBL -
Eliminate Spam: http://www.TQMcube.com
Multi-RBL Check: http://www.TQMcube.com/rblcheck.php
Zombie Graphs: http://www.TQMcube.com/zombies.php

Posted by Paul Ding on March 23rd, 2006

On Thu, 23 Mar 2006 12:09:45 GMT, www@probertencyclopaedia.com (Matt
Probert) posted something that included:

I am making some assumptions here. One is that your average article
runs about 2kb. Another is that your file system was set up so that
there's a performance hit when you get more than about 250 files in a
directory. A third assumption is that changes are relatively
infrequent. Some of the articles on Wikipedia change 20 times a day. I
am assuming that it's satisfactory if you can update weekly.

A good custom database application will almost always conserve disk
space and response time, compared to a good RDBMS solution. Mostly
what you lose is the flexibility to rapidly redesign the system.

Set up 10 data directories, with 100 files in each. Store an equal
number of articles in each file, so you have a maximum of 500 articles
per file, and file size is a piddlin' little 100kb. That'll keep the
memory requirements low and performance high. Articles are stored as a
series of ASCIIZ strings, each article having the article title as one
string, and the article text as another.

You then create a "table of contents" database. Each record contains a
unique "article number", the title string, the file in which the
article is found, and the offset of the article within the file.

If you replace an article with a longer article, you need to rebuild
the data file in which it appears. Since you have the titles within
that file, you can record the new offsets as you write the new file,
then save the new file under the old file name and write the changes
to the table of contents database.

Next, you need a dictionary database. When you add an article to the
data files, the article is tokenized into words, the words are
lowercased, and duplicates removed. Then, the words (if not on your
list of stopwords) are entered into the dictionary database. If you
are replacing an article, a list of deletions as well as a list of
additions would be generated. A record in that database might have
"mandelbrot" in the word field, and "291,99072,266969" in the found
field.

If someone searches for "mandelbrot AND fingernail NOT pyramid", then
you simply add together the dictionary responses for mandelbrot and
for fingernail, eliminate dupes, and eliminate the responses for
pyramid, then convert the article numbers into titles for the user's
benefit.

I would tend to put the TOC and dictionary database into MySQL in
order to give you flexibility, and write an Apache handler that would
read a template file, and the TOC database, then insert the article
found in file 0232.dat at offset 23B7 within the template.

If space is more of a problem and speed is less of a problem, you
could store both offset and length of the stories, and store a zipped
version of each story.



--
AmishHosting.com


Posted by hug on March 23rd, 2006

www@probertencyclopaedia.com (Matt Probert) wrote:

Roughly how much real storage do you have on your server (I assume it
is dedicated and not some old 486 dog, nyet?)

This paragraph is a little brief for me to understand it Matt.

What language is your site implemented in? How quickly do you need to
get this thing up and running? How do you expect to access the data?

--
http://www.ren-prod-inc.com/hug_soft...action=contact

Posted by Toby Inkster on March 23rd, 2006

Matt Probert wrote:

I would expect PostgreSQL 8.x with the TSearch2 extension to cope with
this no trouble.

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact


Posted by GreyWyvern on March 23rd, 2006

And lo, Toby Inkster didst speak in alt.www.webmaster:

50,000 requests / 24 hours
= 2084 requests / hour
= 35 requests / minute
= < 1 request / second

What database system, running on a decent server, *can't* cope with that
load? I'm only being partly rhetorical; I am mostly curious.

Grey

--
The technical axiom that nothing is impossible sinisterly implies the
pitfall corollary that nothing is ridiculous.
- http://www.greywyvern.com/orca#search - Orca Search: Full-featured
spider and site-search engine

Posted by David Cary Hart on March 23rd, 2006

On Thu, 23 Mar 2006 21:56:24 +0000
Toby Inkster <usenet200603@tobyinkster.co.uk> opined:
pair set, particularly with variable lenghts. It cannot be optimized.
BTW, I have NEVER understood why anyone would use PostgreSQL over
MySQL but that's a different food fight.

--
Displayed Email Address is a SPAM TRAP
Our DNSRBL -
Eliminate Spam: http://www.TQMcube.com
Multi-RBL Check: http://www.TQMcube.com/rblcheck.php
Zombie Graphs: http://www.TQMcube.com/zombies.php

Posted by Toby Inkster on March 24th, 2006

GreyWyvern wrote:

I would expect them to be clustered rather than evenly spaced though; but
you're right: it's not very much.

A site I'm working on right now will probably make about half a million
queries per day.

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact


Posted by Toby Inkster on March 24th, 2006

David Cary Hart wrote:

MySQL is fairly speedy, but PostgreSQL is catching up in that department.

However, much of MySQL's speed is an illusion. If MySQL takes 0.05s per
query, and PostgreSQL takes 0.10s per query; PostgreSQL could still end up
the faster database overall: PostgreSQL is capable of performing more
complicated queries than MySQL, so you may need to perform fewer queries.

e.g. in MySQL you might need to retrieve all the rows from a particular
table, find some value from each row; and then for each of those values
perform some query on another table. If the initial table had 20 rows,
then that's 21 rows at 0.05s = 1.05s. If PostgreSQL can manage your
request in just one query, then it will only be 0.10s.

As someone who does need to write database stuff that uses a lot of
aggregate functions, cross-table joins, subqueries and so forth, I
appreciate this.

MySQL is catching up with PostgreSQL in this area, but it's still miles
behind.

Transaction support in PostgreSQL is a big plus for data integrity. MySQL
has recently added transaction support, but only in very limited
circumstances. In general, PostgreSQL far outdoes MySQL at ACID:

* Atomicity (either the whole transaction completes, or none
of it does -- there are no half-performed operations)
* Consistancy (before and after a transaction the database is
in a consistant state -- no integrity constraints are
violated)
* Isolation (if I'm only half-way through a transaction, you
can't see my changes)
* Durability (once an application is told that a change has
been made, then it's never going to be unmade)

Triggers are handy -- a way of specifying that when a particular table is
changed/queried, then a certain piece of code should be run.

Stored procedures. (Perhaps MySQL has them now though?)

In general, PostgreSQL is better than MySQL in the same way that something
like SciTE or Metapad or BBEdit is better than Windows Notepad. Sure,
Notepad will normally do the job, but a more advanced text editor might
help you do the job better/faster.

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact


Posted by David Cary Hart on March 24th, 2006

On Fri, 24 Mar 2006 08:17:18 +0000
Toby Inkster <usenet200603@tobyinkster.co.uk> opined:
Yes as well as foreign keys. BTW, I tried PostgreSQL about a year
ago. Our data is constantly and rapidly changing. While PostgreSQL is
getting faster at queries, it is remarkably slow at add, update and
delete. Vacuum drove me nuts.

LAMP is still the gold standard. If, for no other reason, the huge
difference in the user base makes MySQL more consistent and safer.
OTOH, with relatively static data and very complex queries, PostgreSQL
might offer a superior solution.

Getting back to the OP, I still think that a large set value pairs
are better handled by a Berkeley-like db (except with run-time
modification capability). These inlcude Tokarev's tinycdb and
Dan Bernstein's cdb. IIRC CDB is not open source.
--
Displayed Email Address is a SPAM TRAP
Our DNSRBL -
Eliminate Spam: http://www.TQMcube.com
Multi-RBL Check: http://www.TQMcube.com/rblcheck.php
Zombie Graphs: http://www.TQMcube.com/zombies.php

Funbolt.com - Entertainment portal, wallpapers, sexy celebs