29 Commits

Author SHA1 Message Date
Andrew Balholm 74db9d298b exp/html: don't treat SVG <title> like HTML <title>
The content of an HTML <title> element is RCDATA, but the content of an SVG
<title> element is parsed as tags. Now the parser doesn't go into RCDATA
mode in foreign content.

Pass 4 additional tests.

R=nigeltao
CC=golang-dev
https://golang.org/cl/6448111
2012-08-05 22:32:35 +10:00
Andrew Balholm eff32f573b exp/html: replace NUL with U+FFFD in text in foreign content
Pass 5 additional tests.

R=nigeltao
CC=golang-dev
https://golang.org/cl/6452055
2012-07-29 16:29:49 +10:00
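A minimal sketch of the substitution this commit describes, assuming text arrives as a Go string. The helper name `replaceNUL` is hypothetical and is not taken from the exp/html source.

```go
package main

import (
	"fmt"
	"strings"
)

// replaceNUL is a hypothetical helper: it swaps every NUL byte in s
// for U+FFFD, the Unicode replacement character.
func replaceNUL(s string) string {
	return strings.ReplaceAll(s, "\x00", "\uFFFD")
}

func main() {
	// %q makes the replacement character visible in the output.
	fmt.Printf("%q\n", replaceNUL("a\x00b"))
}
```

Strings without a NUL byte pass through unchanged.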
Andrew Balholm a1f340fa1a exp/html: parse CDATA sections in foreign content
Also convert NUL to U+FFFD in comments.

Pass 23 additional tests.

R=nigeltao
CC=golang-dev
https://golang.org/cl/6446055
2012-07-27 16:05:25 +10:00
Andrew Balholm 899be50991 exp/html: don't insert empty text nodes
Pass 1 additional test.

R=nigeltao
CC=golang-dev
https://golang.org/cl/6443048
2012-07-26 10:32:24 +10:00
Andrew Balholm 4d22519678 exp/html: allow frameset if body contains whitespace
If the body of an HTML document contains text, the <frameset> tag is
ignored. But not if the text is only whitespace.

Pass 4 additional tests.

R=nigeltao
CC=golang-dev
https://golang.org/cl/6442043
2012-07-25 12:09:58 +10:00
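The rule in this commit can be sketched as follows. This assumes a boolean flag that records whether a <frameset> is still acceptable; the names `framesetStillOK` and `htmlWhitespace` are hypothetical and not taken from the parser.

```go
package main

import (
	"fmt"
	"strings"
)

// htmlWhitespace lists tab, line feed, form feed, carriage return, and space.
const htmlWhitespace = " \t\n\f\r"

// framesetStillOK is a hypothetical helper: given the current flag and a
// run of body text, it returns the new flag. Whitespace-only text leaves
// the flag alone; any other text clears it.
func framesetStillOK(framesetOK bool, text string) bool {
	if strings.Trim(text, htmlWhitespace) != "" {
		return false
	}
	return framesetOK
}

func main() {
	fmt.Println(framesetStillOK(true, " \n\t")) // whitespace only
	fmt.Println(framesetStillOK(true, " hi "))  // real text
}
```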
Nigel Tao 66429dcf75 exp/html: simplify some of the parser's internal methods.
benchmark          old ns/op    new ns/op    delta
BenchmarkParser      4006888      3950604   -1.40%

R=r, andybalholm
CC=golang-dev
https://golang.org/cl/6301070
2012-06-13 10:13:05 +10:00
Nigel Tao 6c204982e0 exp/html: check the context node for consistency when parsing fragments.
R=rsc
CC=golang-dev
https://golang.org/cl/6303053
2012-06-08 13:55:15 +10:00
Nigel Tao c8fac7b967 exp/html: when parsing, compare atoms (ints) instead of strings.
This is the mechanical part of the 2-part change that started with
https://golang.org/cl/6305053/

R=rsc
CC=andybalholm, golang-dev, r
https://golang.org/cl/6295055
2012-06-07 13:46:57 +10:00
Nigel Tao cd21eff705 exp/html: make the tokenizer return atoms for tag tokens.
This is part 1 of a 2-part changelist. Part 2 contains the mechanical
change to parse.go to compare atoms (ints) instead of strings.

The overall effect of the two changes is:
benchmark                      old ns/op    new ns/op    delta
BenchmarkParser                  4462274      4058254   -9.05%
BenchmarkRawLevelTokenizer        913202       912917   -0.03%
BenchmarkLowLevelTokenizer       1268626      1267836   -0.06%
BenchmarkHighLevelTokenizer      1947305      1968944   +1.11%

R=rsc
CC=andybalholm, golang-dev, r
https://golang.org/cl/6305053
2012-06-07 13:05:35 +10:00
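The idea behind the two-part change can be illustrated with a small sketch: the tokenizer resolves each tag name to an integer once, and the parser then switches on that integer instead of comparing strings. All of the types, constants, and the table below are hypothetical, not the actual atom table.

```go
package main

import "fmt"

// atom is a hypothetical integer code for a known tag name.
type atom int

const (
	atomUnknown atom = iota // zero value: name not in the table
	atomBody
	atomFrameset
	atomTitle
)

// atomTable maps tag names to their integer codes.
var atomTable = map[string]atom{
	"body":     atomBody,
	"frameset": atomFrameset,
	"title":    atomTitle,
}

// token carries both the tag name and its atom, resolved once.
type token struct {
	data string
	atom atom
}

// newTagToken does the single string lookup on behalf of the parser.
func newTagToken(name string) token {
	return token{data: name, atom: atomTable[name]}
}

func main() {
	tok := newTagToken("title")
	switch tok.atom { // integer comparison, no string compares
	case atomTitle:
		fmt.Println("title start tag")
	default:
		fmt.Println("other tag:", tok.data)
	}
}
```

Names missing from the table map to `atomUnknown`, so the tag name string is still needed for unknown elements.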
Andrew Balholm 9c14184e25 exp/html: implement Noah's Ark clause
Implement the (3-per-family) Noah's Ark clause (i.e. don't put
more than three identical elements on the list of active formatting
elements).

Also, when running tests, sort attributes by name before dumping
them.

Pass 4 additional tests with Noah's Ark clause (including one
that needs attributes to be sorted).

Pass 5 additional, unrelated tests because of sorting attributes.

R=nigeltao, rsc
CC=golang-dev
https://golang.org/cl/6247056
2012-05-29 13:39:54 +10:00
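A simplified sketch of the clause, assuming elements are identical when their tag names and attributes match, and that only entries after the last marker are counted. The `element` type and `pushFormatting` function are hypothetical stand-ins, not the parser's own code.

```go
package main

import "fmt"

// element is a hypothetical, simplified entry in the list of
// active formatting elements.
type element struct {
	tag    string
	attrs  map[string]string
	marker bool // marker entries separate groups in the list
}

// sameElement reports whether a and b have the same tag and attributes.
func sameElement(a, b element) bool {
	if a.marker || b.marker || a.tag != b.tag || len(a.attrs) != len(b.attrs) {
		return false
	}
	for k, v := range a.attrs {
		if w, ok := b.attrs[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// pushFormatting appends e. If three identical entries already follow
// the last marker, the earliest of them is removed first.
func pushFormatting(list []element, e element) []element {
	start := 0
	for i := len(list) - 1; i >= 0; i-- {
		if list[i].marker {
			start = i + 1
			break
		}
	}
	count, earliest := 0, -1
	for i := start; i < len(list); i++ {
		if sameElement(list[i], e) {
			if earliest < 0 {
				earliest = i
			}
			count++
		}
	}
	if count >= 3 {
		list = append(list[:earliest], list[earliest+1:]...)
	}
	return append(list, e)
}

func main() {
	var list []element
	for i := 0; i < 5; i++ {
		list = pushFormatting(list, element{tag: "b"})
	}
	fmt.Println(len(list)) // never more than three identical entries
}
```

Comparing attributes through a map makes their order irrelevant to the comparison.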
Andrew Balholm c23041efd9 exp/html: adjust parseForeignContent to match spec
Remove redundant checks for integration points.

Ignore null bytes in text.

Don't break out of foreign content for a <font> tag unless it
has a color, face, or size attribute.

Check for MathML text integration points when breaking out of
foreign content.

Pass two new tests.

R=nigeltao
CC=golang-dev
https://golang.org/cl/6256045
2012-05-25 10:03:59 +10:00
Andrew Balholm 82e2272566 exp/html: detect "integration points" in SVG and MathML content
Detect HTML integration points and MathML text integration points.
At these points, process tokens as HTML, not as foreign content.

Pass 33 more tests.

R=nigeltao
CC=golang-dev
https://golang.org/cl/6249044
2012-05-24 13:46:41 +10:00
Andrew Balholm 33a89b5fda exp/html: adjust the last few insertion modes to match the spec
Handle text, comment, and doctype tokens in afterBodyIM, afterAfterBodyIM,
and afterAfterFramesetIM.

Pass three more tests.

R=nigeltao
CC=golang-dev
https://golang.org/cl/6231043
2012-05-23 11:11:34 +10:00
Andrew Balholm 8f66d7dc32 exp/html: adjust inSelectIM to match spec
Simplify the flow of control.

Handle EOF, null bytes, <html>, <input>, <keygen>, <textarea>, <script>.

Pass 5 more tests.

R=golang-dev, rsc, nigeltao
CC=golang-dev
https://golang.org/cl/6220062
2012-05-22 15:30:13 +10:00
Andrew Balholm 7648f61c7d exp/html: adjust inCellIM to match spec
Clean up flow of control.

Ignore </table>, </tbody>, </tfoot>, </thead>, </tr> if there is not
an appropriate element in table scope.

Pass 3 more tests.

R=golang-dev, nigeltao
CC=golang-dev
https://golang.org/cl/6206093
2012-05-22 10:31:08 +10:00
Andrew Balholm 4973c1fc7e exp/html: adjust inRowIM to match spec
Delete cases that just fall down to "anything else" action.

Handle </tbody>, </tfoot>, and </thead>.

R=golang-dev, nigeltao
CC=golang-dev
https://golang.org/cl/6203061
2012-05-20 14:26:20 +10:00
Andrew Balholm a09e9811dc exp/html: adjust inTableBodyIM to match spec
Clean up flow of control.

Handle </tbody>, </tfoot>, and </thead>.

Pass 5 additional tests.

R=nigeltao
CC=golang-dev
https://golang.org/cl/6117057
2012-04-26 11:48:35 +10:00
Andrew Balholm dde8358a1c exp/html: adjust inTableIM to match spec
Don't foster-parent text nodes that consist only of whitespace.
(I implemented this entirely in inTableIM instead of creating an
inTableTextIM, because the sole purpose of inTableTextIM seems to be
to combine character tokens into a string, which our tokenizer does
already.)

Use parseImpliedToken to clarify a couple of cases.

Handle <style>, <script>, <input>, and <form>.

Ignore doctype tokens.

Pass 20 additional tests.

R=nigeltao
CC=golang-dev
https://golang.org/cl/6117048
2012-04-25 10:49:27 +10:00
Andrew Balholm b885633d62 exp/html: make inBodyIM match spec
This CL corrects the remaining differences that I could find between the
implementation of inBodyIM and the spec:

Handle <rp> and <rt>.

Adjust SVG and MathML attributes.

Reconstruct active formatting elements in the "any other start tag" case.

Pass 7 additional tests.

R=nigeltao
CC=golang-dev
https://golang.org/cl/6101055
2012-04-24 15:27:48 +10:00
Andrew Balholm 0cc8ee9808 exp/html: add more cases to inBodyIM
Don't set framesetOK to false for hidden input elements.

Handle <param>, <source>, <track>, <textarea>, <iframe>, <noembed>,
and <noscript>.

Pass 7 additional tests.

R=nigeltao
CC=golang-dev
https://golang.org/cl/6094045
2012-04-22 16:19:21 +10:00
Andrew Balholm 904c7c8e99 exp/html: more work on inBodyIM
Reorder some cases.
Handle <pre>, <listing>, </form>, </li>, </dd>, </dt>, </h1>, </h2>,
</h3>, </h4>, </h5>, and </h6> tags.

Pass 6 additional tests.

R=golang-dev, nigeltao
CC=golang-dev
https://golang.org/cl/6089043
2012-04-21 09:20:38 +10:00
Andrew Balholm eea5a432cb exp/html: start making inBodyIM match the spec
Reorder some start tags.

Improve handling of </body>.
Handle </html>.

Pass 2 additional tests (by handling </html>).

R=golang-dev, nigeltao
CC=golang-dev
https://golang.org/cl/6082043
2012-04-20 15:48:13 +10:00
Andrew Balholm 6791057296 exp/html: ignore null bytes in text
Pass one additional test.

R=golang-dev, nigeltao
CC=golang-dev
https://golang.org/cl/6048051
2012-04-20 14:25:42 +10:00
Andrew Balholm 7d63ff09a5 exp/html: improve afterHeadIM
Clean up the flow of control.
Fix the TODO for handling <html> tags.
Add a case to ignore doctype declarations.

Pass one additional test.

R=nigeltao
CC=golang-dev
https://golang.org/cl/6072047
2012-04-20 10:48:10 +10:00
Andrew Balholm fca32f02e9 exp/html: improve InHeadIM
Clean up the flow of control, and add a case for doctype tokens (to
ignore them).

R=nigeltao
CC=golang-dev
https://golang.org/cl/6069045
2012-04-20 09:08:58 +10:00
Andrew Balholm c88ca5906c exp/html: add parseImpliedToken method to parser
This method will allow us to be explicit about what we're doing when
we insert an implied token, and avoid repeating the logic involved in
multiple places.

R=nigeltao
CC=golang-dev
https://golang.org/cl/6060048
2012-04-19 11:48:17 +10:00
Andrew Balholm b65c9a633e exp/html: improve beforeHeadIM
Add a case to ignore doctype tokens.

Clean up the flow of control to more clearly match the spec.

Pass one more test.

R=nigeltao
CC=golang-dev
https://golang.org/cl/6062047
2012-04-18 22:45:36 +10:00
Andrew Balholm b39bbf1e5b exp/html: adjust beforeHTMLIM to match spec
Add case for doctype tokens (which are ignored).

This CL does not change the status of any tests.

R=golang-dev, nigeltao
CC=golang-dev
https://golang.org/cl/6061047
2012-04-18 13:26:35 +10:00
Nigel Tao 324513bc5f html: move the HTML parser to an exp/html package. The parser is a
work in progress, and we are not ready to freeze its API for Go 1.

Package html still exists, containing just two functions: EscapeString
and UnescapeString.

Both the packages at exp/html and html are "package html". The former
is a superset of the latter.

At some point in the future, the exp/html code will move back into
html, once we have finalized the parser API.

R=rsc, dsymonds
CC=golang-dev
https://golang.org/cl/5571059
2012-01-25 10:54:59 +11:00