Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
Clear All
new posts

    BatchUploader: Support files with non UTF-8 encoding (e.g Excel's Save As-CSV format)

    Hi Isomorphic,

    please see the online showcase BatchUploader sample (now on v11.0p_2016-10-13, but also happening locally with latest 5.1p).
    I downloaded the supplyItemTest.csv file and opened it in LibreOffice (v 5.1.5) and saved it as XLSX. This works and is portable (file can be opened in Excel as well).

    I then closed LibreOffice and reopened the file in it and exported it with different encodings.
    I did the same in Windows Excel (latest Office 365, but older versions and MacOS versions show the same behavior).

    I uploaded these exported CSVs to the sample. This is the result (files included in this thread):
    1. supplyItemTest_LibreOffice_UTF8.csv: OK
    2. supplyItemTest_LibreOffice_CP1252.csv: Questionmark-Umlauts
    3. supplyItemTest_Excel-CSV.csv: Questionmark-Umlauts (Default delimiter: Semicolon, needed to search-replace here, then the same as supplyItemTest_LibreOffice_CP1252.csv according to WinMerge)
    4. supplyItemTest_Excel-CSVMAC.csv: Questionmark-Umlauts (Default delimiter: Semicolon, needed to search-replace here)
    5. supplyItemTest_Excel-CSVMSDOS.csv: Questionmark-Umlauts (Default delimiter: Semicolon, needed to search-replace here)


    1) is OK obviously. Let's ignore 4) / 5) for now, as I don't know what the encoding here is, either.
    3) is the same as 2) after taking care of the different default delimiters (or one could add a SelectItem to configure the delimiter for the BatchUploader), and the encoding is ISO 8859-1 / CP1252 / WinLatin1.

    This is how uploads of 2) / 3) look for me.
    Click image for larger version

Name:	Unbenannt.PNG
Views:	67
Size:	12.3 KB
ID:	240753



    My problem is with 2/3. Our customers all use excel, no one uses LibreOffice. All end user can do Excel save-as-csv, but when they want to import the CSV in our tool, we have Umlaut-problems.

    Could you add an encoding setting for the BatchUploader like you do with setDefaultDelimiter() and setDefaultQuoteString() (also to 5.1p, switch to 6.0p delayed because of this one)?
    That encoding setting should define the encoding of the input file. The result should of course still be UTF-8.

    This one is a pretty serious one for us, as now end users are using the Batch Upload on a regular basis, and they start complaining about the umlaut-errors. Before, we used it mainly for setup where we did the export in LibreOffice and could control the export encoding.
    IMHO the BatchUploader should be able to handle different encodings regardless of our problem here, because different tools will create different exports (until one day, maybe, all are using the now already 10 year old UTF-8...).
    I don't know if support for all possible different encodings is needed, but UTF-8 and ISO 8859-1 / CP1252 / WinLatin1 should definitely work and will make 99% of all uploads, I'd guess.

    Thank you & Best regards
    Blama


    PS: This is not related to this report.
    Attached Files

    #2
    What are you planning to do here, have your end users specify the encoding in the UI? Or are you hoping for server-side auto-detection of some kind?

    Comment


      #3
      Hi Isomorphic,

      I'd let the end user specify the encoding in the UI - I already do the same for the delimiter and the quote string.
      Should be feasible for the end user with list of the encodings and an explanation.

      Best regards
      Blama

      Comment


        #4
        Hi Isomorphic,

        most likely needless so say, but the encoding should also be settable when using DataImport on the serverside without clientside BatchUploader.

        So clientside BatchUploader will tell serverside BatchUploader about the selected encoding and serverside BatchUploader will configure the DataImport instance accordingly - most likely just the same that you already do for delimiter and quote string.

        Best regards
        Blama

        Comment


          #5
          We've added uploadEncoding setting to the client-side BatchUploader, use it to set uploaded file encoding. This is available for download in latest nightly builds.
          Note, that defaultQuoteString and defaultDelimiter properties were renamed to uploadQuoteString and uploadDelimiter, although old names are also supported.

          On server-side, DataImport documented APIs take java.io.Reader, which can have encoding configured. See standalone code sample below:
          Code:
          InputStream is = ServletTools.loadWebRootFile("files/portfolios.csv");
          Reader reader = [b]new InputStreamReader(is, "CP1252")[/b];
          List portfolios = new DataImport().importToRows(reader);

          Comment


            #6
            Hi Isomorphic,

            thanks for adding. I'll test it immediately and let you know how it worked out.
            I already looked it up in the docs. Small flaw: The 2nd time default is mentioned there it says the default is ",".

            Thanks again & Best regards
            Blama

            Comment


              #7
              Hi Isomorphic,

              the feature is working as expected for me. Bigger problem is to find out the encoding Excel actually uses when exporting :(
              SO suggests that this depends on your Windows Regional settings.

              For anyone finding this thread, these are my observations (latest Office 365 Excel with German Win10):
              • Best export from Excel is not CSV, but Unicode text. This results in a tab-delimited, "-quoted UTF-16 file.
              • Exports done as CSV are ;-delimited, "-quoted ANSI encoded files.
              • Exports done as CSV (MS DOS) are ;-delimited, "-quoted Codepage 850 / DOS-Latin-1 encoded files.
              • LibreOffice lets you configure the wanted codepage when exporting.
              Best regards
              Blama

              Comment


                #8
                Hi Isomorphic,

                this came up again because of the wonderful Excel program and it ever changing CSV export between versions and platforms.

                1st: I had to search for this thread as I was sure this came up before and I did not find anything in the normal docs. Can you add there the information from #5 that this is just passed though to this InputStreamReader constructor?

                My real problem is with Excel CSV exports. Excel creates these files as UTF-8-BOM, which of course is stupid, because the whole world is using UTF-8 without BOM.
                To solve this there would be the Apache Common IO 2.5 (which you already ship) BOMInputStream class. But I get that you might be reluctant to add an "if" and then use the normal Java class or this one.
                But according to this SO answer and this linked Java (wont-fix) bug report the solution is simply to read and skip the 1st character if it is the BOM. Could you do that? See here for the BOM in different UTF encodings (solution from the SO post).

                The issue here is clearly with Excel using this silly export format, but here is not much we can do about this. We do recommend using Libre Office's configurable CSV export in order to import in our software, but not every customer can install software.
                Otherwise the workaround is to have a dummy column that the BatchImporter then can ignore because it does not recognize the column "col1" which looks like it's name is "[BOM]col1". Again, not something you want to have to recommend customers.

                Could you add this special BOM handling to 12.0p+, so that everyone can properly import Excel-CSVs?

                Thank you & Best regards
                Blama

                Comment


                  #9
                  We've updated the docs with more information right away. We'll get back to you on the rest in a bit.

                  Comment


                    #10
                    On your second request (about BOMs): we are receiving CSV files from customers, and exporting our own, and uploading them and processing them using DataImport, and we're not seeing this reported issue with BOMs. How can you explain this? Is this BOM behavior perhaps something that Excel only does if your Windows machine is set to the German locale? Even then, we'd assume we'd see it occasionally, we never have.

                    Comment


                      #11
                      Hi Isomorphic,

                      thank you, I saw the docs update.

                      W.r.t. the BOM issue:
                      I don't have a real explanation, but sometimes MS deliberately makes stepping away from it's products difficult, see e.g. here for OOXML.
                      I don't know what versions of Excel do produce the BOM and what don't, but for us / our customer files this is happening regularly, if not always.
                      Also google is full of this problem.

                      Perhaps it is the non-US versions, perhaps it is related to some locale setting. It definitely is annoying and hard to explain when it happens.
                      Could you just remove the BOM if you hit it as 1st character? I don't think that this could cause any harm with legitimate CSV files.

                      If you need some demo files, my colleague can surly export some for you. Unfortunately I can't do this myself, as I don't have Excel.

                      Best regards
                      Blama

                      Comment


                        #12
                        Yes, it would be great if you could post a sample file as we do not seem to be able to create one with a BOM.

                        Comment


                          #13
                          Hi Isomorphic,

                          here is an example file, saved on German MacOS Catalina 10.15.7 and Excel v16.5.0. Selected save-as option is "UTF-8-durch Trennzeichen getrennt".

                          This is what this file looks like in notepad++:
                          Click image for larger version

Name:	Display in notepad++.png
Views:	38
Size:	72.4 KB
ID:	265936

                          Best regards
                          Blama
                          Attached Files

                          Comment


                            #14
                            Hi Isomorphic,

                            could you reproduce this one with the example file?

                            Best regards
                            Blama

                            Comment


                              #15
                              Hi Blama
                              Sorry for the silence on this. We have someone looking at the issue and we'll follow up as soon as we have more information.

                              Comment

                              Working...
                              X