Are you regular?
Posted by William Tasso on March 22nd, 2006

Greetings One & All

Regular expressions are far from regular it seems to me :|

I'd be grateful if some kind soul could show an r/e that matches every HTML
entity expression, e.g. &nbsp; &lt; etc.

Also looking for an r/e that matches anything not between unordered list
tags.

Of course I may be attacking this problem with entirely the wrong hammer,
but while I'm on this road I'll use the tools I know about.

Thanks for reading.
--
William Tasso

Posted by hug on March 22nd, 2006

"William Tasso" <SpamBlocked@tbdata.com> wrote:

William, I really don't think RE's are the hammer to crack this nut.
It sounds like you need some custom code that performs partial
parsing. More specifics about your goal here could help someone help
you.

--
http://www.ren-prod-inc.com/hug_soft...action=contact

Posted by Justin Koivisto on March 22nd, 2006

William Tasso wrote:
Why? What are you attempting to do?

Would something generic work for you like this:

\&[0-9a-z]+;

This leads to my next question below...

What language are you using (or type of regex)? That will be the first
thing someone will need to know...

--
Justin Koivisto, ZCE - justin@koivi.com
http://koivi.com

Posted by Dylan Parry on March 22nd, 2006

Pondering the eternal question of "Hobnobs or Rich Tea?", Justin
Koivisto finally proclaimed:

That won't match them all, but:

\&(#[0-9]+|#x[0-9a-fA-F]+|[a-zA-Z0-9]+);

Should match decimal, hexadecimal and named character entities. It
should also ignore invalid ones (like &#x00G, which isn't valid hex).
Your example wouldn't have matched &Eacute;, for instance, and neither
would it have matched numerical entities.
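
A quick way to check both patterns is to run them over a sample string.
The sketch below uses Python's re module purely for illustration (the
thread's app is VB); the sample text and the names JUSTIN_RE and DYLAN_RE
are made up for the example, and the leading backslash before & is
dropped because Python does not need it.

```python
import re

# Justin's generic pattern: lowercase letters and digits only.
JUSTIN_RE = re.compile(r"&[0-9a-z]+;")

# Dylan's pattern: decimal (&#160;), hexadecimal (&#xA0;) or named (&nbsp;).
DYLAN_RE = re.compile(r"&(#[0-9]+|#x[0-9a-fA-F]+|[a-zA-Z0-9]+);")

# Hypothetical test string with valid, invalid and non-entity ampersands.
sample = "Caf&eacute; &nbsp;&#160;&#xA0; &Eacute; &#x00G; fish & chips"

justin_matches = JUSTIN_RE.findall(sample)
dylan_matches = [m.group(0) for m in DYLAN_RE.finditer(sample)]

print(justin_matches)  # ['&eacute;', '&nbsp;']
print(dylan_matches)   # ['&eacute;', '&nbsp;', '&#160;', '&#xA0;', '&Eacute;']
```

Note that the named branch accepts anything shaped like a name, so a
made-up reference such as &bogus; would match too.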

--
Dylan Parry
http://electricfreedom.org -- Where the Music Progressively Rocks!

Posted by William Tasso on March 22nd, 2006

Fleeing from the madness of the jungle
Justin Koivisto <justin@koivi.com> stumbled into news:alt.www.webmaster
and said:

in this case, strip some garbage out of a string. It's gonna sit in my
code right next to the line that converts multiple spaces to a single
space.

no idea - looks a little /too/ generic to me.

well - this is a VB app, but I guess it could call up the result from any
available component that exposes the necessary functionality.

FWIW the app uses data that so far I have only found available on a
last-century style web page - yes the data is freely available and free to
use especially for the use I am putting it to - unfortunately this appears
to be the primary source of data. I can't simply cut/paste the data as it
is updated from time to time, albeit infrequently.

The data itself is a list of links surrounded by other data some of which
is useful/relevant/necessary, some is simply noise.

--
William Tasso

whither a trophy?

Posted by Justin Koivisto on March 22nd, 2006

William Tasso wrote:
ah.. I'll jump ship right now.

I've never had luck with regex in VB, especially since I can't use
perl-style to take advantage of negative look-aheads and such...

--
Justin Koivisto, ZCE - justin@koivi.com
http://koivi.com

Posted by mbstevens on March 22nd, 2006

William Tasso wrote:
A long regular expression is possible, but
I would run the strings to be tested through a number
of shorter regular expressions. This might or might not
be a bit slower, but would certainly be easier to read
and understand.

Also look at Perl's and Python's HTML modules for
your task. I don't know exactly what you want to
do, but they're safer to use than regexes in most
cases because they've been tested by many programmers.
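
As a rough idea of what the parser route could look like, here is a
minimal sketch using Python's standard html.parser module. It assumes the
goal is to keep the text that sits outside <ul>...</ul>; the class name,
the helper text_outside_lists and the sample page are all hypothetical.

```python
from html.parser import HTMLParser


class OutsideListText(HTMLParser):
    """Collect text that is not inside any <ul>...</ul> element."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities for us.
        super().__init__(convert_charrefs=True)
        self.ul_depth = 0  # how many <ul> elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.ul_depth += 1

    def handle_endtag(self, tag):
        if tag == "ul" and self.ul_depth > 0:
            self.ul_depth -= 1

    def handle_data(self, data):
        if self.ul_depth == 0:
            self.chunks.append(data)


def text_outside_lists(html_text):
    parser = OutsideListText()
    parser.feed(html_text)
    parser.close()
    # Join the pieces, then collapse runs of whitespace to single spaces.
    return " ".join(" ".join(parser.chunks).split())


page = ("<p>Intro&nbsp;text</p>"
        "<ul><li><a href='/a'>A</a></li></ul>"
        "<p>Footer &amp; notes</p>")
result = text_outside_lists(page)
print(result)  # Intro text Footer & notes
```

Flipping the test in handle_data (ul_depth > 0) would keep only the list
contents instead.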
--
mbstevens
http://www.mbstevens.com/

Posted by Justin Koivisto on March 22nd, 2006

Dylan Parry wrote:
DOH! I knew I should have copied it from my file... That's what I get
for doing it off the top of my head without thinking first.

In fact, that is very similar to what I have (pretty much exact) except
the | parts are in a different order, and I have it set for case
insensitive, so it's a little shorter.

--
Justin Koivisto, ZCE - justin@koivi.com
http://koivi.com

Posted by Dylan Parry on March 22nd, 2006

Pondering the eternal question of "Hobnobs or Rich Tea?", Justin
Koivisto finally proclaimed:

I wrote mine off the top of my head, but I did sit there for around 10
minutes making sure I wrote it right.

At least I now know that I didn't write a load of crap!

--
Dylan Parry
http://webpageworkshop.co.uk -- FREE Web tutorials and references

Posted by John Bokma on March 22nd, 2006

Dylan Parry <usenet@dylanparry.com> wrote:


You also might want to limit the length, e.g. (Perl)

[a-zA-Z0-9]{1,8}

Note 1 and 8 are examples. Best to look up what is in *your* case a
sensible upper and lower bound.

Also, I guess that with XML, Unicode can be used in the reference (if I
have the naming correct).

Probably the best approach is to create a lookup table of the things you
want to allow, make a regexp that matches the general form, and then do a
lookup for each match. If it's in the table, use the value in the table;
otherwise reject it or ignore it.
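
The lookup-table idea can be sketched roughly as follows. Python is used
here only for illustration; the ALLOWED table, the function replace_known
and its on_unknown switch are assumptions for the example, and the {1,8}
bounds are the same placeholder values as above.

```python
import re

# Hypothetical whitelist: only these named entities are accepted.
ALLOWED = {
    "nbsp": " ",
    "amp": "&",
    "lt": "<",
    "gt": ">",
    "quot": '"',
    "eacute": "\u00e9",
}

# Match the general shape, with the name limited to 1-8 characters.
NAMED_RE = re.compile(r"&([a-zA-Z0-9]{1,8});")


def replace_known(text, table=ALLOWED, on_unknown="keep"):
    """Replace references found in the table; keep or drop the rest."""

    def lookup(match):
        name = match.group(1)
        if name in table:
            return table[name]
        # Not in the table: leave it untouched, or remove it entirely.
        return match.group(0) if on_unknown == "keep" else ""

    return NAMED_RE.sub(lookup, text)


cleaned = replace_known("a&nbsp;&lt;b&gt; &bogus; &eacute;")
print(cleaned)  # a <b> &bogus; é
```

Python also ships a ready-made name table in html.entities.name2codepoint,
which could stand in for the hand-written dictionary.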


--
John Experienced (web) developer: http://castleamber.com/

Perl RSS Builder: http://johnbokma.com/perl/rss-web-feed-builder.html
