Commit Graph

27 Commits

Author SHA1 Message Date
Joe Tsai e7fed7fa35 encoding/csv: forbid certain Comma and Comment runes
The record delimiter (not configurable by user) is "\r\n" or "\n".
It is insensible for the user to set Comma or Comment delimiters
to be some character that conflicts with the record delimiter.
Furthermore, it is insensible for Comma or Comment to be the same rune.
Allowing this leaks implementation details to the user in regards to
the evaluation order of which rune is checked for first.

Fixes #22404

Change-Id: I31e86abc9b3a8fb4584e090477795587740970ae
Reviewed-on: https://go-review.googlesource.com/72793
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-10-25 01:43:46 +00:00
Joe Tsai 8c532f5fc4 encoding/csv: update ErrQuote message
The ErrQuote variable is only returned when a parsing error
occurs within a quoted string. Make that clear in the message.

Change-Id: I06ad5a9edb41afedde193c4f8b93551bb8342bbb
Reviewed-on: https://go-review.googlesource.com/72794
Reviewed-by: Avelino <t@avelino.xxx>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2017-10-24 07:10:28 +00:00
Joe Tsai 744da64947 encoding/csv: fix error documentation
We should be referring to ParseError.Err, which is the underlying error,
not ParseError.Error, which is the error method.

Change-Id: Ic3cef5ecbe1ada5fa14b9573222f29da8fc9a8d5
Reviewed-on: https://go-review.googlesource.com/72450
Reviewed-by: Tim Cooper <tim.cooper@layeh.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2017-10-24 07:10:10 +00:00
Joe Tsai 29ea82d072 encoding/csv: add ParseError.RecordLine
CL 72150 fixes #22352 by reverting the problematic parts of that CL
where the line number and column number were inconsistent with each other.
This CL adds back functionality to address the issue that CL 72150
was trying to solve in the first place. That is, it reports the starting
line of the record, so that users have a frame of reference to start with
when debugging what went wrong.

In the event of gnarly CSV files with multiline quoted strings, a parse
failure likely occurs somewhere between the start of the record and
the point where the parser finally detected an error.
Since ParserError.{Line,Column} reports where the *error* occurs, we
add a RecordLine field to report where the record starts.

Also take this time to cleanup and modernize TestRead.

Fixes #19019
Fixes #22352

Change-Id: I16cebf0b81922c35f75804c7073e9cddbfd11a04
Reviewed-on: https://go-review.googlesource.com/72310
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2017-10-21 01:32:28 +00:00
Joe Tsai 89ccfe4962 encoding/csv: simplify and optimize Reader
The Reader implementation is slow because it operates on a rune-by-rune
basis via bufio.Reader.ReadRune. We speed this up by operating on entire
lines that we read from bufio.Reader.ReadSlice.

In order to ensure that we read the full line, we augment ReadSlice
in our Reader.readLine method to automatically expand the slice if
bufio.ErrBufferFull is every hit.

This change happens to fix #19410 because it no longer relies on
rune-by-rune parsing and only searches for the relevant delimiter rune.

In order to keep column accounting simple and consistent, this change
reverts parts of CL 52830.

This CL is an alternative to CL 36270 and builds on some of the ideas
from that change by Diogo Pinela.

name                                     old time/op    new time/op    delta
Read-8                                   3.12µs ± 1%    2.54µs ± 2%  -18.76%   (p=0.000 n=10+9)
ReadWithFieldsPerRecord-8                3.12µs ± 1%    2.53µs ± 1%  -18.91%    (p=0.000 n=9+9)
ReadWithoutFieldsPerRecord-8             3.13µs ± 0%    2.57µs ± 3%  -18.07%  (p=0.000 n=10+10)
ReadLargeFields-8                        52.3µs ± 1%     5.3µs ± 2%  -89.93%   (p=0.000 n=10+9)
ReadReuseRecord-8                        2.05µs ± 1%    1.40µs ± 1%  -31.48%   (p=0.000 n=10+9)
ReadReuseRecordWithFieldsPerRecord-8     2.05µs ± 1%    1.41µs ± 0%  -31.03%   (p=0.000 n=10+9)
ReadReuseRecordWithoutFieldsPerRecord-8  2.06µs ± 1%    1.40µs ± 1%  -31.70%   (p=0.000 n=9+10)
ReadReuseRecordLargeFields-8             50.9µs ± 0%     4.1µs ± 3%  -92.01%  (p=0.000 n=10+10)

name                                     old alloc/op   new alloc/op
Read-8                                       664B ± 0%      664B ± 0%
ReadWithFieldsPerRecord-8                    664B ± 0%      664B ± 0%
ReadWithoutFieldsPerRecord-8                 664B ± 0%      664B ± 0%
ReadLargeFields-8                          3.94kB ± 0%    3.94kB ± 0%
ReadReuseRecord-8                           24.0B ± 0%     24.0B ± 0%
ReadReuseRecordWithFieldsPerRecord-8        24.0B ± 0%     24.0B ± 0%
ReadReuseRecordWithoutFieldsPerRecord-8     24.0B ± 0%     24.0B ± 0%
ReadReuseRecordLargeFields-8               2.98kB ± 0%    2.98kB ± 0%

name                                     old allocs/op  new allocs/op
Read-8                                       18.0 ± 0%      18.0 ± 0%
ReadWithFieldsPerRecord-8                    18.0 ± 0%      18.0 ± 0%
ReadWithoutFieldsPerRecord-8                 18.0 ± 0%      18.0 ± 0%
ReadLargeFields-8                            24.0 ± 0%      24.0 ± 0%
ReadReuseRecord-8                            8.00 ± 0%      8.00 ± 0%
ReadReuseRecordWithFieldsPerRecord-8         8.00 ± 0%      8.00 ± 0%
ReadReuseRecordWithoutFieldsPerRecord-8      8.00 ± 0%      8.00 ± 0%
ReadReuseRecordLargeFields-8                 12.0 ± 0%      12.0 ± 0%

Updates #22352
Updates #19019
Fixes #16791
Fixes #19410

Change-Id: I31c27cfcc56880e6abac262f36c947179b550bbf
Reviewed-on: https://go-review.googlesource.com/72150
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-10-20 23:20:46 +00:00
Justin Nuß 9fbc06e6aa encoding/csv: preserve \r\n in quoted fields
The parser mistakenly assumed it could always fold \r\n into \n, which
is not true since a \r\n inside a quoted fields has no special meaning
and should be kept as is.

Fix this by not folding \r\n to \n inside quotes fields.

Fixes #21201

Change-Id: Ifebc302e49cf63e0a027ee90f088dbc050a2b7a6
Reviewed-on: https://go-review.googlesource.com/52810
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-08-14 18:42:20 +00:00
Justin Nuß 5d14ac74f6 encoding/csv: report line start line in errors
Errors returned by Reader contain the line where the Reader originally
encountered the error. This can be suboptimal since that line does not
always correspond with the line the current record/field started at.

This can easily happen with LazyQuotes as seen in #19019, but also
happens for example when a quoted fields has no closing quote and
the parser hits EOF before it finds another quote.

When this happens finding the erroneous field can be somewhat
complicated and time consuming, and in most cases it would be better to
report the line where the record started.

This change updates Reader to keep track of the line on which a record
begins and uses it for errors instead of the current line, making it
easier to find errors.

Although a user-visible change, this should have no impact on existing
code, since most users don't explicitly work with the line in the error
and probably already expect the new behaviour.

Updates #19019

Change-Id: Ic9bc70fad2651c69435d614d537e7a9266819b05
Reviewed-on: https://go-review.googlesource.com/52830
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-08-14 04:45:38 +00:00
Justin Nuß 2181653be6 encoding/csv: add option to reuse slices returned by Read
In many cases the records returned by Reader.Read will only be used between calls
to Read and become garbage once a new record is read. In this case, instead of
allocating a new slice on each call to Read, we can reuse the last allocated slice
for successive calls to avoid unnecessary allocations.

This change adds a new field ReuseRecord to the Reader struct to enable this reuse.

ReuseRecord is false by default to avoid breaking existing code which dependss on
the current behaviour.

I also added 4 new benchmarks, corresponding to the existing Read benchmarks, which
set ReuseRecord to true.

Benchstat on my local machine (old is ReuseRecord = false, new is ReuseRecord = true)

name                          old time/op    new time/op    delta
Read-8                          2.75µs ± 2%    1.88µs ± 1%  -31.52%  (p=0.000 n=14+15)
ReadWithFieldsPerRecord-8       2.75µs ± 0%    1.89µs ± 1%  -31.43%  (p=0.000 n=13+13)
ReadWithoutFieldsPerRecord-8    2.77µs ± 1%    1.88µs ± 1%  -32.06%  (p=0.000 n=15+15)
ReadLargeFields-8               55.4µs ± 1%    54.2µs ± 0%   -2.07%  (p=0.000 n=15+14)

name                          old alloc/op   new alloc/op   delta
Read-8                            664B ± 0%       24B ± 0%  -96.39%  (p=0.000 n=15+15)
ReadWithFieldsPerRecord-8         664B ± 0%       24B ± 0%  -96.39%  (p=0.000 n=15+15)
ReadWithoutFieldsPerRecord-8      664B ± 0%       24B ± 0%  -96.39%  (p=0.000 n=15+15)
ReadLargeFields-8               3.94kB ± 0%    2.98kB ± 0%  -24.39%  (p=0.000 n=15+15)

name                          old allocs/op  new allocs/op  delta
Read-8                            18.0 ± 0%       8.0 ± 0%  -55.56%  (p=0.000 n=15+15)
ReadWithFieldsPerRecord-8         18.0 ± 0%       8.0 ± 0%  -55.56%  (p=0.000 n=15+15)
ReadWithoutFieldsPerRecord-8      18.0 ± 0%       8.0 ± 0%  -55.56%  (p=0.000 n=15+15)
ReadLargeFields-8                 24.0 ± 0%      12.0 ± 0%  -50.00%  (p=0.000 n=15+15)

Fixes #19721

Change-Id: I79b14128bb9bb3465f53f40f93b1b528a9da6f58
Reviewed-on: https://go-review.googlesource.com/41730
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-04-26 15:55:56 +00:00
Russ Cox 30651b3bbb encoding/csv: document Read error behavior
Fixes #17342.

Change-Id: I76af756d7aff464554c5564d444962a468d0eccc
Reviewed-on: https://go-review.googlesource.com/32172
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Quentin Smith <quentin@golang.org>
2016-10-28 19:38:12 +00:00
Justin Nuß bd06d4827a encoding/csv: avoid allocations when reading records
This commit changes parseRecord to allocate a single string per record,
instead of per field, by using indexes into the raw record.

Benchstat (done with f69991c17)

name                          old time/op    new time/op    delta
Read-8                          3.17µs ± 0%    2.78µs ± 1%  -12.35%  (p=0.016 n=4+5)
ReadWithFieldsPerRecord-8       3.18µs ± 1%    2.79µs ± 1%  -12.23%  (p=0.008 n=5+5)
ReadWithoutFieldsPerRecord-8    4.59µs ± 0%    2.77µs ± 0%  -39.58%  (p=0.016 n=4+5)
ReadLargeFields-8               57.0µs ± 0%    55.7µs ± 0%   -2.18%  (p=0.008 n=5+5)

name                          old alloc/op   new alloc/op   delta
Read-8                            660B ± 0%      664B ± 0%   +0.61%  (p=0.008 n=5+5)
ReadWithFieldsPerRecord-8         660B ± 0%      664B ± 0%   +0.61%  (p=0.008 n=5+5)
ReadWithoutFieldsPerRecord-8    1.14kB ± 0%    0.66kB ± 0%  -41.75%  (p=0.008 n=5+5)
ReadLargeFields-8               3.86kB ± 0%    3.94kB ± 0%   +1.86%  (p=0.008 n=5+5)

name                          old allocs/op  new allocs/op  delta
Read-8                            30.0 ± 0%      18.0 ± 0%  -40.00%  (p=0.008 n=5+5)
ReadWithFieldsPerRecord-8         30.0 ± 0%      18.0 ± 0%  -40.00%  (p=0.008 n=5+5)
ReadWithoutFieldsPerRecord-8      50.0 ± 0%      18.0 ± 0%  -64.00%  (p=0.008 n=5+5)
ReadLargeFields-8                 66.0 ± 0%      24.0 ± 0%  -63.64%  (p=0.008 n=5+5)

For a simple application that I wrote, which reads in a CSV file (via
ReadAll) and outputs the number of rows read (15857625 rows), this change
reduces the total time on my notebook from ~58 seconds to ~48 seconds.

This reduces time and allocations (bytes) each by ~6% for a real world
CSV file at work (~230000 rows, 13 colums).

Updates #16791

Change-Id: Ia07177c94624e55cdd3064a7d2751fb69322d3e4
Reviewed-on: https://go-review.googlesource.com/24723
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-10-05 16:57:44 +00:00
Brad Fitzpatrick efaa36017e encoding/csv: update and add CSV reading benchmarks
Benchmarks broken off from https://golang.org/cl/24723 and modified to
allocate less in the places we're not trying to measure.

Updates #16791

Change-Id: I508e4cfeac60322d56f1d71ff1912f6a6f183a63
Reviewed-on: https://go-review.googlesource.com/30357
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-10-05 04:29:07 +00:00
Jess Frazelle 8042bfe347 encoding/csv: update doc about comments whitespace
This patch updates the doc about comments whitespace for the
encoding/csv package to reflect that leading whitespace before
the hash will treat the line as not a comment.

Fixes #13775.

Change-Id: Ia468c75b242a487b4b2b4cd3d342bfb8e07720ba
Reviewed-on: https://go-review.googlesource.com/23302
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-06-10 01:00:09 +00:00
Ian Lance Taylor fa3f484800 encoding/csv: clarify that this package supports RFC 4180
The intent of this comment is to reduce the number of issues opened
against the package to add support for new kinds of CSV formats, such as
issues #3150, #8458, #12372, #12755.

Change-Id: I452c0b748e4ca9ebde3e6cea188bf7774372148e
Reviewed-on: https://go-review.googlesource.com/23401
Reviewed-by: Andrew Gerrand <adg@golang.org>
2016-05-25 01:47:53 +00:00
Robert Griesemer 93e8e70499 all: fixed a handful of typos
Change-Id: Ib0683f27b44e2f107cca7a8dcc01d230cbcd5700
Reviewed-on: https://go-review.googlesource.com/23404
Reviewed-by: Alan Donovan <adonovan@google.com>
2016-05-24 21:18:03 +00:00
Brad Fitzpatrick 5fea2ccc77 all: single space after period.
The tree's pretty inconsistent about single space vs double space
after a period in documentation. Make it consistently a single space,
per earlier decisions. This means contributors won't be confused by
misleading precedence.

This CL doesn't use go/doc to parse. It only addresses // comments.
It was generated with:

$ perl -i -npe 's,^(\s*// .+[a-z]\.)  +([A-Z]),$1 $2,' $(git grep -l -E '^\s*//(.+\.)  +([A-Z])')
$ go test go/doc -update

Change-Id: Iccdb99c37c797ef1f804a94b22ba5ee4b500c4f7
Reviewed-on: https://go-review.googlesource.com/20022
Reviewed-by: Rob Pike <r@golang.org>
Reviewed-by: Dave Day <djd@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-03-02 00:13:47 +00:00
Brad Fitzpatrick 351c15f1ce all: remove public named return values when useless
Named returned values should only be used on public funcs and methods
when it contributes to the documentation.

Named return values should not be used if they're only saving the
programmer a few lines of code inside the body of the function,
especially if that means there's stutter in the documentation or it
was only there so the programmer could use a naked return
statement. (Naked returns should not be used except in very small
functions)

This change is a manual audit & cleanup of public func signatures.

Signatures were not changed if:

* the func was private (wouldn't be in public godoc)
* the documentation referenced it
* the named return value was an interesting name. (i.e. it wasn't
  simply stutter, repeating the name of the type)

There should be no changes in behavior. (At least: none intended)

Change-Id: I3472ef49619678fe786e5e0994bdf2d9de76d109
Reviewed-on: https://go-review.googlesource.com/20024
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Andrew Gerrand <adg@golang.org>
2016-02-29 03:31:19 +00:00
Eric Lagergren 4feb47bc76 encoding/csv: clarify that TrimLeadingSpace can trim the delimiter
Fixes #14464

Change-Id: Iafc21641cca7d35b7a5631cfc94742ee8e7d5042
Reviewed-on: https://go-review.googlesource.com/19861
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2016-02-24 18:42:09 +00:00
Nathan VanBenschoten b04f3b06ec all: replace strings.Index with strings.Contains where possible
Change-Id: Ia613f1c37bfce800ece0533a5326fca91d99a66a
Reviewed-on: https://go-review.googlesource.com/18120
Reviewed-by: Robert Griesemer <gri@golang.org>
Run-TryBot: Robert Griesemer <gri@golang.org>
2016-02-19 01:06:05 +00:00
Katrina Owen a6f69b31e0 encoding/csv: indicate package of EOF in docs
The documentation listing err == EOF can be confusing to newcomers
to the language who are looking for the relevant documentation for
that error.

Change-Id: I301885950d0e1d0fbdf3a1892fca86eac7a0c616
Reviewed-on: https://go-review.googlesource.com/15806
Reviewed-by: Andrew Gerrand <adg@golang.org>
2015-10-14 00:46:21 +00:00
Carlos C 1be335b608 encoding/csv: add examples for package
Change-Id: I3463826aa760aa5984dec4fc043b95fd2a5120ac
Reviewed-on: https://go-review.googlesource.com/11240
Reviewed-by: Andrew Gerrand <adg@golang.org>
2015-06-22 11:11:37 +00:00
Damien Neil ab89378cb7 encoding/csv: skip blank lines when FieldsPerRecord >= 0
Fixes #11050.

Change-Id: Ie5d16960a1f829af947d82a63fe414924cd02ff6
Reviewed-on: https://go-review.googlesource.com/10666
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
2015-06-12 18:21:12 +00:00
Ainar Garipov 7f9f70e5b6 all: fix misprints in comments
These were found by grepping the comments from the go code and feeding
the output to aspell.

Change-Id: Id734d6c8d1938ec3c36bd94a4dbbad577e3ad395
Reviewed-on: https://go-review.googlesource.com/10941
Reviewed-by: Aamir Khan <syst3m.w0rm@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2015-06-11 14:18:57 +00:00
Ainar Garipov 4da658cf96 encoding/csv: fix comment in parseRecord
Change-Id: I82edd9364e1b4634006f5e043202a69f383dcdbe
Reviewed-on: https://go-review.googlesource.com/10826
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2015-06-10 14:26:10 +00:00
Justin Nuß 2db58f8f2d encoding/csv: Preallocate records slice
Currently parseRecord will always start with a nil
slice and then resize the slice on append. For input
with a fixed number of fields per record we can preallocate
the slice to avoid having to resize the slice.

This change implements this optimization by using
FieldsPerRecord as capacity if it's > 0 and also adds a
benchmark to better show the differences.

benchmark         old ns/op     new ns/op     delta
BenchmarkRead     19741         17909         -9.28%

benchmark         old allocs     new allocs     delta
BenchmarkRead     59             41             -30.51%

benchmark         old bytes     new bytes     delta
BenchmarkRead     6276          5844          -6.88%

Change-Id: I7c2abc9c80a23571369bcfcc99a8ffc474eae7ab
Reviewed-on: https://go-review.googlesource.com/8880
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2015-04-26 16:28:51 +00:00
Josh Bleecher Snyder 2adc4e8927 all: use "reports whether" in place of "returns true if(f)"
Comment changes only.

Change-Id: I56848814564c4aa0988b451df18bebdfc88d6d94
Reviewed-on: https://go-review.googlesource.com/7721
Reviewed-by: Rob Pike <r@golang.org>
2015-03-18 15:14:06 +00:00
Russ Cox 6ad2749dcd encoding/csv: for Postgres, unquote empty strings, quote \.
In theory both of these lines encode the same three fields:

        a,,c
        a,"",c

However, Postgres defines that when importing CSV, the unquoted
version is treated as NULL (missing), while the quoted version is
treated as a string value (empty string). If the middle field is supposed to
be an integer value, the first line can be imported (NULL is okay), but
the second line cannot (empty string is not).

Postgres's import command (COPY FROM) has an option to force
the unquoted empty to be interpreted as a string but it does not
have an option to force the quoted empty to be interpreted as a NULL.

From http://www.postgresql.org/docs/9.0/static/sql-copy.html:

        The CSV format has no standard way to distinguish a NULL
        value from an empty string. PostgreSQL's COPY handles this
        by quoting. A NULL is output as the NULL parameter string
        and is not quoted, while a non-NULL value matching the NULL
        parameter string is quoted. For example, with the default
        settings, a NULL is written as an unquoted empty string,
        while an empty string data value is written with double
        quotes (""). Reading values follows similar rules. You can
        use FORCE_NOT_NULL to prevent NULL input comparisons for
        specific columns.

Therefore printing the unquoted empty is more flexible for
imports into Postgres than printing the quoted empty.

In addition to making the output more useful with Postgres, not
quoting empty strings makes the output smaller and easier to read.
It also matches the behavior of Microsoft Excel and Google Drive.

Since we are here and making concessions for Postgres, handle this
case too (again quoting the Postgres docs):

        Because backslash is not a special character in the CSV
        format, \., the end-of-data marker, could also appear as a
        data value. To avoid any misinterpretation, a \. data value
        appearing as a lone entry on a line is automatically quoted
        on output, and on input, if quoted, is not interpreted as
        the end-of-data marker. If you are loading a file created by
        another application that has a single unquoted column and
        might have a value of \., you might need to quote that value
        in the input file.

Fixes #7586.

LGTM=bradfitz
R=bradfitz
CC=golang-codereviews
https://golang.org/cl/164760043
2014-10-23 23:44:47 -04:00
Russ Cox c007ce824d build: move package sources from src/pkg to src
Preparation was in CL 134570043.
This CL contains only the effect of 'hg mv src/pkg/* src'.
For more about the move, see golang.org/s/go14nopkg.
2014-09-08 00:08:51 -04:00