mirror/go - go - Git Fam. Sieh

Commit Graph

Author	SHA1	Message	Date
Nigel Tao	e5f3dc8bc5	html: refactor the tokenizer; parse "</>" correctly. Previously, Next would call either nextText or nextTag, but nextTag could also call nextText. Both nextText and nextTag were responsible for detecting "</a" end tags and "<!" comments. This change simplifies the call chain and puts that responsibility in a single place. R=andybalholm CC=golang-dev https://golang.org/cl/5263050	2011-10-18 09:42:16 +11:00
Nigel Tao	1887907fee	html: tokenize "a < b" as one whole text token. R=andybalholm CC=golang-dev https://golang.org/cl/5284042	2011-10-16 20:50:11 +11:00
Andrew Balholm	b770c9e9a2	html: improve parsing of comments and "bogus comments" R=nigeltao CC=golang-dev https://golang.org/cl/5279044	2011-10-15 12:22:08 +11:00
Nigel Tao	b82a8e7c22	html: fix some tokenizer bugs with attribute key/values. The relevant spec sections are 13.2.4.38-13.2.4.40. http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#attribute-value-(double-quoted)-state R=andybalholm CC=golang-dev https://golang.org/cl/5262044	2011-10-14 15:22:02 +11:00
Nigel Tao	a49b8b9875	html: rewrite the tokenizer to be more consistent. Previously, the tokenizer made two passes per token. The first pass established the token boundary. The second pass picked out the tag name and attributes inside that boundary. This was problematic when the two passes disagreed. For example, "<p id=can't><p id=won't>" caused an infinite loop because the first pass skipped everything inside the single quotes, and recognized only one token, but the second pass never got past the first '>'. This change rewrites the tokenizer to use one pass, accumulating the boundary points of token text, tag names, attribute keys and attribute values as it looks for the token endpoint. It should still be reasonably efficient: text, names, keys and values are not lower-cased or unescaped (and converted from []byte to string) until asked for. One of the token_test test cases was fixed to be consistent with html5lib. Three more test cases were temporarily disabled, and will be re-enabled in a follow-up CL. All the parse_test test cases pass. R=andybalholm, gri CC=golang-dev https://golang.org/cl/5244061	2011-10-14 09:58:39 +11:00
Nigel Tao	bca65e395e	html: parse more malformed tags. This continues the work in revision 914a659b44ff, now passing more test cases. As before, the new tokenization tests match html5lib's behavior. Fixes #2124. R=dsymonds, r CC=golang-dev https://golang.org/cl/4867042	2011-08-11 18:49:09 +10:00
Nigel Tao	37afff2978	html: parse malformed tags missing a '>', such as `<p id=0</p>`. The additional token_test.go cases match html5lib behavior. Fixes #2124. R=gri CC=golang-dev https://golang.org/cl/4844055	2011-08-10 13:39:07 +10:00
Andrew Balholm	816c972ff0	html: handle character entities without semicolons. Fix the TODO: unescape("&notit;") should be "¬it;". Also accept digits in entity names. R=nigeltao CC=golang-dev, rsc https://golang.org/cl/4781042	2011-07-21 09:10:49 +10:00
Rob Pike	ebb1566a46	strings.Split: make the default to split all. Change the signature of Split to have no count, assuming a full split, and rename the existing Split with a count to SplitN. Do the same to package bytes. Add a gofix module. R=adg, dsymonds, alex.brainman, rsc CC=golang-dev https://golang.org/cl/4661051	2011-06-28 09:43:14 +10:00
Brad Fitzpatrick	5e03143c1a	html: improve attribute parsing, note package status Fixes #1890 R=nigeltao CC=golang-dev https://golang.org/cl/4528102	2011-06-06 15:56:15 -07:00
Robert Hencke	c8727c81bb	pkg: spelling tweaks, A-H R=ality, bradfitz, rsc, dsymonds, adg, qyzhai, dchest CC=golang-dev https://golang.org/cl/4536063	2011-05-18 13:14:56 -04:00
Brad Fitzpatrick	f4e5f364c7	html: parse empty, unquoted, and single-quoted attribute values Fixes #1391 R=nigeltao CC=golang-dev https://golang.org/cl/4453054	2011-05-12 16:11:35 -07:00
Nigel Tao	a5ff8ad9db	html: tokenize HTML comments. I'm not sure if it's 100% correct wrt the HTML5 specification, but the test suite has plenty of HTML comment test cases, and we'll shake out any tokenization bugs as the parser improves its coverage. R=gri CC=golang-dev https://golang.org/cl/4186055	2011-02-17 10:45:30 +11:00
Ryan Hitchman	f503e26379	html: unescape numeric entities, and complete the named entities table, including two-character entities. Fixes #1233. R=nigeltao CC=golang-dev https://golang.org/cl/3445041	2010-12-07 12:13:47 +11:00
Nigel Tao	08a47d6f60	html: first cut at a parser. R=gri CC=golang-dev https://golang.org/cl/3355041	2010-12-07 12:02:36 +11:00
Robert Griesemer	3478891d12	gofmt -s -w src misc R=r, rsc CC=golang-dev https://golang.org/cl/2662041	2010-10-22 10:06:33 -07:00
Nigel Tao	56b989f1b9	First cut of an HTML tokenizer (and eventually a parser). R=r, rsc, gri, rsc1 CC=golang-dev https://golang.org/cl/1814044	2010-08-10 16:08:21 +10:00

17 Commits
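
Two of the commits listed above describe API behavior that can be checked against a current Go toolchain: ebb1566a46 (Split without a count, SplitN with one) and 816c972ff0 (entities without a trailing semicolon). The following is a minimal sketch, not code from either commit. It assumes the present-day standard library packages `strings` and `html`, which may differ from the 2011 tree; the helper names `splitExamples` and `unescapeExample` are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// splitExamples contrasts the two signatures described in commit ebb1566a46:
// Split takes no count and splits fully; SplitN takes an explicit count.
func splitExamples() (all, limited []string) {
	// No count argument: every separator produces a split.
	all = strings.Split("a,b,c", ",")
	// At most 2 pieces: the remainder stays unsplit in the last piece.
	limited = strings.SplitN("a,b,c", ",", 2)
	return all, limited
}

// unescapeExample passes the input quoted in commit 816c972ff0 to
// html.UnescapeString. Assumption: today's html package follows the
// behavior that commit message describes.
func unescapeExample() string {
	return html.UnescapeString("&notit;")
}

func main() {
	all, limited := splitExamples()
	fmt.Printf("%q\n", all)
	fmt.Printf("%q\n", limited)
	fmt.Printf("%q\n", unescapeExample())
}
```

Code written before ebb1566a46 that passed a count to Split would need to call SplitN instead; the commit message notes that a gofix module was added for that migration.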