Commit Graph

17 Commits

Author SHA1 Message Date
Nigel Tao e5f3dc8bc5 html: refactor the tokenizer; parse "</>" correctly.
Previously, Next would call either nextText or nextTag, but nextTag
could also call nextText. Both nextText and nextTag were responsible
for detecting "</a" end tags and "<!" comments. This change simplifies
the call chain and puts that responsibility in a single place.

R=andybalholm
CC=golang-dev
https://golang.org/cl/5263050
2011-10-18 09:42:16 +11:00
Nigel Tao 1887907fee html: tokenize "a < b" as one whole text token.
R=andybalholm
CC=golang-dev
https://golang.org/cl/5284042
2011-10-16 20:50:11 +11:00
Andrew Balholm b770c9e9a2 html: improve parsing of comments and "bogus comments"
R=nigeltao
CC=golang-dev
https://golang.org/cl/5279044
2011-10-15 12:22:08 +11:00
Nigel Tao b82a8e7c22 html: fix some tokenizer bugs with attribute key/values.
The relevant spec sections are 13.2.4.38-13.2.4.40.
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#attribute-value-(double-quoted)-state

R=andybalholm
CC=golang-dev
https://golang.org/cl/5262044
2011-10-14 15:22:02 +11:00
Nigel Tao a49b8b9875 html: rewrite the tokenizer to be more consistent.
Previously, the tokenizer made two passes per token. The first pass
established the token boundary. The second pass picked out the tag name
and attributes inside that boundary. This was problematic when the two
passes disagreed. For example, "<p id=can't><p id=won't>" caused an
infinite loop because the first pass skipped everything inside the
single quotes, and recognized only one token, but the second pass never
got past the first '>'.

This change rewrites the tokenizer to use one pass, accumulating the
boundary points of token text, tag names, attribute keys and attribute
values as it looks for the token endpoint.

It should still be reasonably efficient: text, names, keys and values
are not lower-cased or unescaped (and converted from []byte to string)
until asked for.

One of the token_test test cases was fixed to be consistent with
html5lib. Three more test cases were temporarily disabled, and will be
re-enabled in a follow-up CL. All the parse_test test cases pass.

R=andybalholm, gri
CC=golang-dev
https://golang.org/cl/5244061
2011-10-14 09:58:39 +11:00
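Note on a49b8b9875: the one-pass approach described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code. The names span, rawTag and scanStartTag are hypothetical, and only unquoted attribute values are handled. The scanner records boundary points in a single walk over the input and defers lower-casing until a name or key is requested.

```go
package main

import (
	"fmt"
	"strings"
)

// span marks a half-open byte range [start, end) in the input.
// All types and functions here are hypothetical illustrations.
type span struct{ start, end int }

// rawTag stores only the boundary points found during the single pass.
type rawTag struct {
	buf  string
	name span
	attr [][2]span // key span and value span for each attribute
	end  int       // index just past the closing '>' (or len(buf))
}

func isSpace(c byte) bool { return c == ' ' || c == '\t' || c == '\n' }

// scanStartTag walks buf once, starting at its leading '<', and records
// spans for the tag name and for unquoted attribute keys and values.
func scanStartTag(buf string) rawTag {
	t := rawTag{buf: buf}
	i := 1 // skip '<'
	t.name.start = i
	for i < len(buf) && !isSpace(buf[i]) && buf[i] != '>' {
		i++
	}
	t.name.end = i
	for {
		for i < len(buf) && isSpace(buf[i]) {
			i++
		}
		if i >= len(buf) || buf[i] == '>' {
			break
		}
		key := span{start: i}
		for i < len(buf) && buf[i] != '=' && !isSpace(buf[i]) && buf[i] != '>' {
			i++
		}
		key.end = i
		val := span{i, i}
		if i < len(buf) && buf[i] == '=' {
			i++
			val.start = i
			for i < len(buf) && !isSpace(buf[i]) && buf[i] != '>' {
				i++
			}
			val.end = i
		}
		t.attr = append(t.attr, [2]span{key, val})
	}
	if i < len(buf) {
		i++ // consume '>'
	}
	t.end = i
	return t
}

// Name lower-cases the tag name only when it is asked for.
func (t rawTag) Name() string {
	return strings.ToLower(t.buf[t.name.start:t.name.end])
}

// Attr returns the n'th attribute, lower-casing the key on demand.
func (t rawTag) Attr(n int) (key, val string) {
	a := t.attr[n]
	return strings.ToLower(t.buf[a[0].start:a[0].end]), t.buf[a[1].start:a[1].end]
}

func main() {
	buf := "<P ID=can't><p id=won't>"
	for len(buf) > 0 {
		t := scanStartTag(buf)
		k, v := t.Attr(0)
		fmt.Println(t.Name(), k, v)
		buf = buf[t.end:]
	}
}
```

Because the token endpoint and the attribute boundaries come from the same pass, the two can never disagree. The sketch therefore always advances past the first '>' on the commit's example input.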
Nigel Tao bca65e395e html: parse more malformed tags.
This continues the work in revision 914a659b44ff, now passing more test
cases. As before, the new tokenization tests match html5lib's behavior.

Fixes #2124.

R=dsymonds, r
CC=golang-dev
https://golang.org/cl/4867042
2011-08-11 18:49:09 +10:00
Nigel Tao 37afff2978 html: parse malformed tags missing a '>', such as `<p id=0</p>`.
The additional token_test.go cases match html5lib behavior.

Fixes #2124.

R=gri
CC=golang-dev
https://golang.org/cl/4844055
2011-08-10 13:39:07 +10:00
Andrew Balholm 816c972ff0 html: handle character entities without semicolons
Fix the TODO: unescape("&notit;") should be "¬it;"

Also accept digits in entity names.

R=nigeltao
CC=golang-dev, rsc
https://golang.org/cl/4781042
2011-07-21 09:10:49 +10:00
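Note on 816c972ff0: the behavior described above can be tried with html.UnescapeString from today's standard library. Whether the current function behaves exactly like the 2011 code is an assumption. The unescape wrapper is a hypothetical helper added only for this illustration.

```go
package main

import (
	"fmt"
	"html"
)

// unescape is a hypothetical thin wrapper around the standard library's
// html.UnescapeString, used here only to keep the example compact.
func unescape(s string) string {
	return html.UnescapeString(s)
}

func main() {
	// "&notit;" is not a named entity, but its prefix "&not" is one that
	// is accepted without a trailing semicolon.
	fmt.Println(unescape("&notit;"))
	// An entity whose name contains digits.
	fmt.Println(unescape("&frac12;"))
	// A common entity written without its semicolon.
	fmt.Println(unescape("a &amp b"))
}
```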
Rob Pike ebb1566a46 strings.Split: make the default to split all.
Change the signature of Split to have no count,
assuming a full split, and rename the existing
Split with a count to SplitN.
Do the same to package bytes.
Add a gofix module.

R=adg, dsymonds, alex.brainman, rsc
CC=golang-dev
https://golang.org/cl/4661051
2011-06-28 09:43:14 +10:00
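Note on ebb1566a46: a short illustration of the resulting API, using strings.Split and strings.SplitN as they exist in the standard library today. The helper names splitAll and splitFirst are hypothetical and exist only for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// splitAll performs a full split: strings.Split takes no count argument.
func splitAll(s string) []string {
	return strings.Split(s, ",")
}

// splitFirst returns at most n substrings, using strings.SplitN, the
// renamed form of the old count-taking Split.
func splitFirst(s string, n int) []string {
	return strings.SplitN(s, ",", n)
}

func main() {
	fmt.Printf("%q\n", splitAll("a,b,c,d"))
	fmt.Printf("%q\n", splitFirst("a,b,c,d", 2))
}
```

With a count of 2, the last element holds the unsplit remainder, so callers that relied on the old count argument keep the same results after moving to SplitN.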
Brad Fitzpatrick 5e03143c1a html: improve attribute parsing, note package status
Fixes #1890

R=nigeltao
CC=golang-dev
https://golang.org/cl/4528102
2011-06-06 15:56:15 -07:00
Robert Hencke c8727c81bb pkg: spelling tweaks, A-H
R=ality, bradfitz, rsc, dsymonds, adg, qyzhai, dchest
CC=golang-dev
https://golang.org/cl/4536063
2011-05-18 13:14:56 -04:00
Brad Fitzpatrick f4e5f364c7 html: parse empty, unquoted, and single-quoted attribute values
Fixes #1391

R=nigeltao
CC=golang-dev
https://golang.org/cl/4453054
2011-05-12 16:11:35 -07:00
Nigel Tao a5ff8ad9db html: tokenize HTML comments.
I'm not sure if it's 100% correct wrt the HTML5 specification,
but the test suite has plenty of HTML comment test cases, and
we'll shake out any tokenization bugs as the parser improves its
coverage.

R=gri
CC=golang-dev
https://golang.org/cl/4186055
2011-02-17 10:45:30 +11:00
Ryan Hitchman f503e26379 html: unescape numeric entities, and complete the named entities table, including two-character entities.
Fixes #1233.

R=nigeltao
CC=golang-dev
https://golang.org/cl/3445041
2010-12-07 12:13:47 +11:00
Nigel Tao 08a47d6f60 html: first cut at a parser.
R=gri
CC=golang-dev
https://golang.org/cl/3355041
2010-12-07 12:02:36 +11:00
Robert Griesemer 3478891d12 gofmt -s -w src misc
R=r, rsc
CC=golang-dev
https://golang.org/cl/2662041
2010-10-22 10:06:33 -07:00
Nigel Tao 56b989f1b9 First cut of an HTML tokenizer (and eventually a parser).
R=r, rsc, gri, rsc1
CC=golang-dev
https://golang.org/cl/1814044
2010-08-10 16:08:21 +10:00