Coding Pieces

Monday, August 8, 2016

Installing 7-Zip without Administrator access

I often need to install software on my work computer, where I do not have admin access. Most programs don't actually require admin access at all provided you install into your user profile directory (C:\Users\...) instead of Program Files. Unfortunately, many installers don't have support for per-user installations and request elevation anyway.

7-Zip is an example of one such program. But you can work around this issue if you download the MSI installer and use the following command line to open it.

msiexec /i 7z1602-x64.msi INSTALLDIR=%USERPROFILE%\7-Zip\ MSIINSTALLPERUSER=1

As you might have guessed, the MSIINSTALLPERUSER=1 flag on the command line is doing the work here by requesting a per-user install. The installer will succeed without attempting elevation.

Thursday, May 19, 2016

What's the fastest way to concatenate worksheet cells into a string?

I often store string resources (e.g. SQL queries, XML documents) in Excel worksheets. This makes the strings easy to update during development and avoids hassles associated with hard-coding the strings in the VBA code, or using external text files which must be distributed with the workbook.

In order to use the strings in the code, a function is needed that concatenates worksheet cells and returns a string. To keep things simple, I'm going to assume the worksheet only contains data in the first column. The most naive way to write this function is to concatenate in a loop, like this.

Function GetText(ws As Worksheet) As String
    Dim CellData As Range
    Dim Result As String
    Dim i As Long

    Set CellData = Intersect(ws.[A:A], ws.UsedRange)
    For i = 1 To CellData.Cells.Count
        Result = Result & CellData(i) & vbCrLf
    Next

    GetText = Result
End Function

The performance of this method is unacceptable even for a relatively small number of cells, with one call taking up to several minutes to run. At each iteration of the loop, the previous string is copied into a new allocated string and then left as garbage. As the loop continues, the intermediate string gets longer, more data has to be copied, and more and more garbage is generated. This is a textbook instance of a Schlemiel the painter's algorithm.

A better method is to copy the cells into an array and use the Join function to generate the final result.

Function GetText(ws As Worksheet) As String
    Dim CellData As Range
    Dim Result() As String
    Dim i As Long

    Set CellData = Intersect(ws.[A:A], ws.UsedRange)
    ReDim Result(1 To CellData.Cells.Count)
    For i = 1 To UBound(result)
        Result(i) = CellData(i)
    Next

    GetText = Join(Result, vbCrLf)
End Function

Performance is much better, but you'll notice that even this version takes several seconds to process a few thousand lines of data. It would be better if we could just concatenate the cells directly without having to employ an intermediate array. Luckily, it's possible to do this using the WorksheetFunction.Transpose function, which is smart enough to generate the 1-dimensional array we need for Join when a single column 2-dimensonal array is provided as input.

Function GetText(ws As Worksheet) As String
    Dim CellData As Range

    Set CellData = Intersect(ws.[A:A], ws.UsedRange)
    GetText = Join(WorksheetFunction.Transpose(CellData), vbCrLf)
End Function

This function will run in just a second or two even with 10s of thousands of lines of data. It's always worth it to keep an eye out for a better method even when you think you've already optimized your solution.

Friday, May 13, 2016

Extracting a lot of data from a SQL Server database

Often when working in SQL Server Management Studio, you will end up with a complicated query that results in a huge number of records that need to be exported for processing by another tool. When working with more than a few megabytes of data, simply copying and pasting it into Excel isn't an option.

One potential solution is use the bcp utility to export the results into a tab-delimited CSV file using a temporary table. In order for the temporary table to be visible to bcp, we need to use a global temp table. This is achieved by using an INTO clause and prepending two hashes to the temp table name.

SELECT table1.col1, table2.col2, table3.col3
INTO ##tempbcp
FROM table1
INNER JOIN table2 ON table1.key1 = table2.key1
LEFT JOIN table3 ON table1.key2 = table3.key2;

After the temporary table is populated, invoke bcp from a command prompt to extract the data.

bcp ##tempbcp out tempbcp.csv -S sqlhost -T -w

The -T option uses your network credentials to log on to the SQL Server. -w specifies that we want our results in a tab-delimited, unicode format.

Using bcp in this way to move data around is usually the best solution because, most of the time, the ability to create tables in tempdb is unrestricted, so no special database permissions are needed. Also, bcp is much faster and more memory efficient than most other methods.

However, a major drawback is that bcp is unable to include header titles in the output file. In order to get the field names, you can query the system views. From there you can use a variety of methods to get the field names into the CSV file or wherever the data is ultimately being transferred.

SELECT name FROM tempdb.sys.columns
WHERE [object_id] = OBJECT_ID('tempdb..##tempbcp')
ORDER BY column_id;

Wednesday, May 4, 2016

Getting a stored procedure definition in MS SQL Server

The UI in SQL Server Management Studio is so sluggish and disorganized, it usually ends up being faster and easier to directly query the system views to get schema information instead of navigating to the appropriate objects using SSMS. You can do this by preparing your schema queries as needed and saving them in a text file, then cutting and pasting as needed, changing the query parameters to inspect the schema objects that you're interested in as you work. Or you might even develop a simple GUI app to get this information, which would be very handy to use while working with a complicated database.

There's a lot of stuff you might want to query in the system views. Just to get started, let's look at how to print out the SQL definition of a stored procedure. T-SQL's PRINT statement will truncate a string at about 3K characters, so we'll need to PRINT substrings in a loop to make sure that the text for long stored procedures is not being cut off. We also need to make sure that our substrings begin and end at line boundaries; otherwise the output will get messed up with additional newlines sprinkled throughout the output.

DECLARE @proc sysname = 'usp_myproc';
DECLARE @i int = 1, @j int, @eol nchar(2) = CHAR(13) + CHAR(10),
    @minlen int = 99, @def nvarchar(max) =
        (SELECT [definition] FROM sys.sql_modules
        WHERE [object_id] = OBJECT_ID(@proc));
WHILE @i < LEN(@def) BEGIN
    SET @j = ISNULL(NULLIF(CHARINDEX(@eol, @def, @i + @minlen),
                           0),
                    LEN(@def)) - @i;
    PRINT SUBSTRING(@def, @i, @j);
    SET @i = @i + @j + LEN(@eol);
END;

Tuesday, May 3, 2016

Iterative development using the Python interpreter

Because Python is an interpreted language, a productive coding style is to write relatively self-contained functions while repeatedly testing them during development using the interactive interpreter. This is a form of unit testing, and you could almost call this Test-driven development if you copy your Python session into your docstrings using the doctest module. (But you'd be writing the tests after writing the code, which wouldn't be considered TDD by strict adherants.)

This following method might seem obvious to people with a lot of experience in Python, but I've never seen it written down anywhere. An easy way to establish this workflow is to use the reload function to reload your Python script as a module as you make changes in your text editor.

First, you start up a Python session and import your Python script as a module using the import statement. In this example, my script is called test.py.

$ python
Python 2.7.3 (default, Mar 18 2014, 05:13:23)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>import test

Now you can make changes to test.py in your text editor and test them in your Python session.

test.py
def example(*args): return map(lambda x: x + 1, args)

>>>reload(test).example(1, 2, 3)
[2, 3, 4]

After calling your function the first time, you can re-test with the same arguments by pressing ↑ to pull the reload statement from the command history in the interpreter.

test.py
def example(*args): return map(lambda x: x + 2, args)

>>>reload(test).example(1, 2, 3)
[3, 4, 5]

Wednesday, October 28, 2015

Concatenating in T-SQL without abusing XML

Sometimes when you're writing an aggregate query in T-SQL you want to concatenate strings in a group together, instead of applying an operation like min, max, or count. The common suggestion is to use FOR XML PATH in a magical way that concatenates for a reason that no one understands (unless you actually take the time to learn the T-SQL XML functionality, which I highly recommend).

I'm going to demonstrate an alternate approach that involves the use of a PIVOT query. In order to use this method, you have to define the maximum number of values that will be combined. This may or may not be a drawback. (In my experience, it's actually a useful sanity check.) If you do it this way, you have a lot more flexibility in join delimiters and the syntax is more straightforward.

-- Prepare example tables. @result is an ordinary data table, and @msg is a
-- related table that stores comments for the records in @result.
DECLARE @result TABLE
    (record_id int IDENTITY(1, 1) PRIMARY KEY,
    col1 int, col2 int);
 
DECLARE @msg TABLE
    (msg_id int IDENTITY(1, 1) PRIMARY KEY,
    record_id int NOT NULL,
    msg nvarchar(300) NOT NULL);
 
INSERT INTO @result (col1, col2)
VALUES (1, 2), (2, 4), (3, 6), (4, 8);
 
INSERT INTO @msg (record_id, msg)
VALUES
    (2, N'first message'),
    (2, N'second message'),
    (4, N'another message');
 
-- Use pivoting to concatenate messages for the records together into a
-- single column. A CTE is used to create a temporary result set that joins
-- @result and @msg and includes a column that indicates the message
-- sequence. A pivot query flattens the result set by concatenating the
-- messages together.
WITH q AS
    (SELECT [@result].record_id, [@result].col1, [@result].col2,
        [@msg].msg,
        'Msg' + CONVERT(varchar, ROW_NUMBER() OVER
            (PARTITION BY [@msg].record_id
            ORDER BY [@msg].msg_id)) AS rn
    FROM @result
    LEFT JOIN @msg
    ON [@result].record_id = [@msg].record_id)
SELECT record_id, col1, col2,
    COALESCE(Msg1, '') +
    COALESCE(', ' + Msg2, '') +
    COALESCE(', ' + Msg3, '') AS msgs
FROM q
PIVOT (MAX(msg) FOR rn IN (Msg1, Msg2, Msg3)) p;
 
-- Results:
-- +-----------+------+------+-------------------------------+
-- | record_id | col1 | col2 | msgs                          |
-- +-----------+------+------+-------------------------------+
-- | 1         | 1    | 2    |                               |
-- | 2         | 2    | 4    | first message, second message |
-- | 3         | 3    | 6    |                               |
-- | 4         | 4    | 8    | another message               |
-- +---------------------------------------------------------+

Wednesday, August 19, 2015

Merging data into an MS Access table

One of my favorite commands in T-SQL is the MERGE statement, which allows you to merge a table expression into a table, INSERTing records if they don't exist, or UPDATEing records when they do in a single command. Unfortunately, it's a little verbose, since you still have to write the full INSERT and UPDATE statements that you would be using instead of the MERGE statement.

If you're using MS Access, you don't have MERGE. If you want to merge data into an MS Access table, you need to write the INSERT and UPDATE statements separately, and you'll have to qualify your INSERT statement in some way to prevent INSERTing keys that already exist. There are two alternatives that have some big drawbacks:

You could run a DELETE beforehand to get rid of the keys that you're inserting; then you would just have to write the INSERT statement. This will cause problems if you have defined relationships on the table you're merging into: the DELETE will either fail or cascade in an inappropriate way.
You could INSERT just the keys that are missing and then run the UPDATE statement to fill in the rest of the columns. This requires you to define all the non-key columns as nullable. Nullable columns cause all kinds of issues, so I prefer to avoid them unless NULLs are truly required.

I don't think there's a pure SQL solution to this problem, but there is a solution if you use ADO to merge the records. ADO Recordsets support a Seek operation, which will efficiently locate a record when you provide a key. Merging can be achieved by using Seek to search for each key you are merging, and if the key is not found, adding a new record.

I've written up an implementation of this technique in Python. Before you can use Seek, you must specify the unique index name you are using in the Index property, and you must use a server-side cursor. By default, MS Access will name a table's primary key "PrimaryKey", so this code assumes that the key you are using has this name.

import logging, os.path
from win32com.client.gencache import EnsureDispatch
from win32com.client import constants as c

log = logging.getLogger(__name__)

class AccessDb(object):
    def __init__(self, dbpath):
        self.dbpath = os.path.abspath(dbpath)

    def __enter__(self):
        self.db = EnsureDispatch('ADODB.Connection')
        self.db.Open('Provider=Microsoft.ACE.OLEDB.12.0;'
                     'Data source=%s' % self.dbpath)
        return self

    def __exit__(self, etyp, einst, etb):
        self.db.Close()

    def bulkload(self, table, data, key=None):
        """Bulk merge data into a table.

        Args:
            table (str): Name of table where data is to be loaded.
            data: Data rows to be merged in (sequence of sequences).
            key (int sequence): List of indexes of key columns. If not
                provided, all rows will be appended.
        """
        rs = EnsureDispatch('ADODB.Recordset')
        if key:
            rs.Index = 'PrimaryKey'
            rs.CursorLocation = c.adUseServer

        rs.Open(table, self.db, c.adOpenKeyset, c.adLockOptimistic,
                c.adCmdTableDirect)
        try:
            for record in data:
                if key:
                    rs.Seek([record[i] for i in key], c.adSeekFirstEQ)
                    if rs.EOF:
                        rs.AddNew()
                else:
                    rs.AddNew()

                for index, field in enumerate(record):
                    rs.Fields(index).Value = field

                rs.Update()

        finally:
            if rs.EditMode == c.adEditAdd:
                log.info(record)
                rs.CancelUpdate()

            rs.Close()

Example usage:

# TestTable is defined as:
# CREATE TABLE TestTable
#     (id int PRIMARY KEY,
#      c1 VARCHAR(5),
#      c2 VARCHAR(5));
with AccessDb('test.accdb') as db:
    db.bulkload('TestTable',
                [(1, 'a', 'b'),
                 (2, 'c', 'd'),
                 (3, 'e', 'f')], key=[0])