Xlsxtract

From BATCOM-IT Services Wiki
Jump to: navigation, search

Extracts cell data from .xlsx files

Introduction

This page is about xlsxtract, a command line tool which can be used to extract cell data from .xlsx files.

It is based on the great and easy-to-use QtXlsx Qt 5 library/module by Debao Zhang (distributed & built together with the xlsxtract source code).

Downloads

Source code

All platforms:

Pre-built binaries

Win32 x64:

Win32 x86:

Building from source

Requirements

  • A working GNU C/C++ compiler (gcc/g++) tool-chain installation
  • Qt 5 development packages (including Qt 5 private headers)

For instance, on an openSUSE 13.1 system you would install these packages (and all their dependencies):

# zypper in libqt5-qtbase-devel libqt5-qttools-devel libqt5-qtbase-private-headers-devel

Compile & link

Unpack the downloaded source code archive to some directory:

$ mkdir -p ~/src
$ cd ~/src
$ bzcat ~/Downloads/xlsxtract-0.10.tar.bz2 | tar xvf -
$ cd xlsxtract

Configure it:

$ qmake-qt5 CONFIG+=release
Project MESSAGE: This project is using private headers and will therefore be tied to this specific Qt module build version.
Project MESSAGE: Running this project against other versions of the Qt modules may crash at any arbitrary point.
Project MESSAGE: This is not a bug, but a result of using Qt internals. You have been warned!

This warning is no problem. However, it tells us that the binary we're going to build will only work with this specific version of Qt. It's not (guaranteed to be) binary compatible with any other Qt version and needs to be rebuilt for them specifically. So if you intend to create binary packages yourself, make sure they depend on the Qt version / packages in question or redistribute the required shared objects / DLLs with your package. Qt 5 isn't really designed to be built statically so we're not recommending it, but theoretically that's another possible solution to this.

Build it:

$ make -j5
...

When building is complete the executable file will be found in the current working directory on Linux/UNIX and Mac OS X (or in the release/ sub-folder on Windows):

$ ./xlsxtract -v
xlsxtract v0.10
...

Clean up

Cleaning up the build tree without removing the xlsxtract binary and Makefile:

$ make clean

Cleaning up the build tree completely including the xlsxtract binary and Makefile:

$ make distclean

Usage

Xlsxtract was designed to be used on the command line and/or in scripts (i.e. for batch processing).

The available command line arguments are as follows:

$ ./xlsxtract -h
Usage: ./xlsxtract [-r <start_row>] [-R <end_row>] [-c <start_column>] [-C <end_column>] [-s <separator>] [-d <delimiter>]
                   [-e <empty_rows>] [-S <sheet>] [-i] [-v] [-h] <file1> [<file2> .. <fileN>]

Arguments in detail:

Argument

Default value

Meaning / Description

-r <start_row> 1 / A First row for data extraction specified as a numerical or alphabetical index ("1 2 3 ..." or "A B C ...")
-R <end_row> 1 / A Last row for data extraction specified as a numerical or alphabetical index ("1 2 3 ..." or "A B C ...")
-c <start_column> 1 / A First column for data extraction specified as a numerical or alphabetical index ("1 2 3 ..." or "A B C ...")
-C <end_column> 1 / A Last column for data extraction specified as a numerical or alphabetical index ("1 2 3 ..." or "A B C ...")
-s <separator>  ; Separator character or string
-d <delimiter> empty Delimiter character or string
-e <empty_rows> 10 Number of consecutive empty rows after which xlsxtract will stop data extraction (regardless of <end_row>)
-S <sheet> empty Name of the sheet to select for data extraction (the first sheet is auto-selected if <sheet> is empty) -- can be specified multiple times, supports UNIX wild-card matching
-i -- Disables case-sensitivity on sheet name matching
-v -- Prints the xlsxtract version and exits
-h -- Prints the usage help and exits
<file> empty Any number of file names or file name patterns with shell wild-cards -- at least one file must be specified (after possible shell expansion)

Usage notes:

  • All cell range arguments (-r <start_row>, -R <end_row>, -c <start_column> and -C <end_column>) can be specified multiple times, that is you can extract multiple cell ranges at once.
  • In case you're only interested in the data from a single row or column, specifying the -r <start_row> and/or -c <start_column> arguments is actually sufficient. Conversely, when just -R <end_row> and/or -C <end_column> are specified it means extract from index 1/A until this index (inclusively).
  • If an <end_row> is smaller than its <start_row> or an <end_column> is smaller than its <start_column> no data will be extracted (obviously).
  • Extracted cell data are printed on stdout, all other messages go to stderr.

Examples

Suppose we had this example .xlsx file which consists of two sheets (named Sheet-1 and Sheet-2), each holding a small table of 8 columns and 6 rows (including the header row) ranging from A1 to H6. To fully extract the data from both sheets we'd do the following:

$ for s in Sheet-1 Sheet-2; do ./xlsxtract -r 1 -R 100 -c A -C H -S ${s} ~/tmp/example-sheet.xlsx; done
Extracting cell data from '/home/rene/tmp/example-sheet.xlsx'
Processing sheet 'Sheet-1'
Column-1 1;Column-1 2;Column-1 3;Column-1 4;Column-1 5;Column-1 6;Column-1 7;Column-1 8
Data-1 11;Data-1 21;Data-1 31;Data-1 41;Data-1 51;Data-1 61;Data-1 71;Data-1 81
Data-1 12;Data-1 22;Data-1 32;Data-1 42;Data-1 52;Data-1 62;Data-1 72;Data-1 82
Data-1 13;Data-1 23;Data-1 33;Data-1 43;Data-1 53;Data-1 63;Data-1 73;Data-1 83
Data-1 14;Data-1 24;Data-1 34;Data-1 44;Data-1 54;Data-1 64;Data-1 74;Data-1 84
Data-1 15;Data-1 25;Data-1 35;Data-1 45;Data-1 55;Data-1 65;Data-1 75;Data-1 85
Done (extracting cell data from '/home/rene/tmp/example-sheet.xlsx')
Extracting cell data from '/home/rene/tmp/example-sheet.xlsx'
Processing sheet 'Sheet-2'
Column-2 1;Column-2 2;Column-2 3;Column-2 4;Column-2 5;Column-2 6;Column-2 7;Column-2 8
Data-2 11;Data-2 21;Data-2 31;Data-2 41;Data-2 51;Data-2 61;Data-2 71;Data-2 81
Data-2 12;Data-2 22;Data-2 32;Data-2 42;Data-2 52;Data-2 62;Data-2 72;Data-2 82
Data-2 13;Data-2 23;Data-2 33;Data-2 43;Data-2 53;Data-2 63;Data-2 73;Data-2 83
Data-2 14;Data-2 24;Data-2 34;Data-2 44;Data-2 54;Data-2 64;Data-2 74;Data-2 84
Data-2 15;Data-2 25;Data-2 35;Data-2 45;Data-2 55;Data-2 65;Data-2 75;Data-2 85
Done (extracting cell data from '/home/rene/tmp/example-sheet.xlsx')

Sheet names also support wild-card matching, so let's use this instead of a script:

$ ./xlsxtract -r 1 -R 100 -c A -C H -S '*' ~/tmp/example-sheet.xlsx
Extracting cell data from '/home/rene/tmp/example-sheet.xlsx'
Processing sheet 'Sheet-1'
Column-1 1;Column-1 2;Column-1 3;Column-1 4;Column-1 5;Column-1 6;Column-1 7;Column-1 8
Data-1 11;Data-1 21;Data-1 31;Data-1 41;Data-1 51;Data-1 61;Data-1 71;Data-1 81
Data-1 12;Data-1 22;Data-1 32;Data-1 42;Data-1 52;Data-1 62;Data-1 72;Data-1 82
Data-1 13;Data-1 23;Data-1 33;Data-1 43;Data-1 53;Data-1 63;Data-1 73;Data-1 83
Data-1 14;Data-1 24;Data-1 34;Data-1 44;Data-1 54;Data-1 64;Data-1 74;Data-1 84
Data-1 15;Data-1 25;Data-1 35;Data-1 45;Data-1 55;Data-1 65;Data-1 75;Data-1 85
Processing sheet 'Sheet-2'
Column-2 1;Column-2 2;Column-2 3;Column-2 4;Column-2 5;Column-2 6;Column-2 7;Column-2 8
Data-2 11;Data-2 21;Data-2 31;Data-2 41;Data-2 51;Data-2 61;Data-2 71;Data-2 81
Data-2 12;Data-2 22;Data-2 32;Data-2 42;Data-2 52;Data-2 62;Data-2 72;Data-2 82
Data-2 13;Data-2 23;Data-2 33;Data-2 43;Data-2 53;Data-2 63;Data-2 73;Data-2 83
Data-2 14;Data-2 24;Data-2 34;Data-2 44;Data-2 54;Data-2 64;Data-2 74;Data-2 84
Data-2 15;Data-2 25;Data-2 35;Data-2 45;Data-2 55;Data-2 65;Data-2 75;Data-2 85
Done (extracting cell data from '/home/rene/tmp/example-sheet.xlsx')

Or if we wanted to extract the actual data of the first sheet (auto-selected) in clean CSV format, we'd pass " as delimiter:

$ ./xlsxtract -d '"' -r2 -R6 -C8 ~/tmp/example-sheet.xlsx > example.csv
Extracting cell data from '/home/rene/tmp/example-sheet.xlsx'
Processing sheet 'Sheet-1'
Done (extracting cell data from '/home/rene/tmp/example-sheet.xlsx')
$ cat example.csv
"Data-1 11";"Data-1 21";"Data-1 31";"Data-1 41";"Data-1 51";"Data-1 61";"Data-1 71";"Data-1 81"
"Data-1 12";"Data-1 22";"Data-1 32";"Data-1 42";"Data-1 52";"Data-1 62";"Data-1 72";"Data-1 82"
"Data-1 13";"Data-1 23";"Data-1 33";"Data-1 43";"Data-1 53";"Data-1 63";"Data-1 73";"Data-1 83"
"Data-1 14";"Data-1 24";"Data-1 34";"Data-1 44";"Data-1 54";"Data-1 64";"Data-1 74";"Data-1 84"
"Data-1 15";"Data-1 25";"Data-1 35";"Data-1 45";"Data-1 55";"Data-1 65";"Data-1 75";"Data-1 85"

All data from the first sheet (auto-selected), column A (or 1), starting at row 2:

$ ./xlsxtract -r2 -R100 ~/tmp/example-sheet.xlsx 2>/dev/null
Data-1 11
Data-1 12
Data-1 13
Data-1 14
Data-1 15

The contents of a single cell (C3 in this case) on every sheet whose name starts with an upper- or lower-case s could be extracted this way:

$ ./xlsxtract -r3 -cC -S 's*' -i ~/tmp/example-sheet.xlsx 2>/dev/null
Data-1 32
Data-2 32

You can even pass special characters as delimiter or separator. This (bash) example uses the tab-character (\t) as separator:

$ ./xlsxtract -s $'\t' -r4 -R6 -C8 -S '*' ~/tmp/example-sheet.xlsx 2>/dev/null
Data-1 13       Data-1 23       Data-1 33       Data-1 43       Data-1 53       Data-1 63       Data-1 73       Data-1 83
Data-1 14       Data-1 24       Data-1 34       Data-1 44       Data-1 54       Data-1 64       Data-1 74       Data-1 84
Data-1 15       Data-1 25       Data-1 35       Data-1 45       Data-1 55       Data-1 65       Data-1 75       Data-1 85
Data-2 13       Data-2 23       Data-2 33       Data-2 43       Data-2 53       Data-2 63       Data-2 73       Data-2 83
Data-2 14       Data-2 24       Data-2 34       Data-2 44       Data-2 54       Data-2 64       Data-2 74       Data-2 84
Data-2 15       Data-2 25       Data-2 35       Data-2 45       Data-2 55       Data-2 65       Data-2 75       Data-2 85

Using the new-line special character (\n) as separator allows for automatic output arrangement so that each cell's data is printed on a new line, thus serializing the data:

$./xlsxtract -s $'\n' -r4 -R6 -C4 ~/tmp/example-sheet.xlsx 2>/dev/null
Data-1 13
Data-1 23
Data-1 33
Data-1 43
Data-1 14
Data-1 24
Data-1 34
Data-1 44
Data-1 15
Data-1 25
Data-1 35
Data-1 45

License

Xlsxtract, Copyright © 2014 René Reucher, is free software distributed under the terms of the GNU General Public License v3.