Thursday, May 19, 2016

What's the fastest way to concatenate worksheet cells into a string?

I often store string resources (e.g. SQL queries, XML documents) in Excel worksheets. This makes the strings easy to update during development and avoids hassles associated with hard-coding the strings in the VBA code, or using external text files which must be distributed with the workbook.

In order to use the strings in the code, a function is needed that concatenates worksheet cells and returns a string. To keep things simple, I'm going to assume the worksheet only contains data in the first column. The most naive way to write this function is to concatenate in a loop, like this.

Function GetText(ws As Worksheet) As String
    Dim CellData As Range
    Dim Result As String
    Dim i As Long

    Set CellData = Intersect(ws.[A:A], ws.UsedRange)
    For i = 1 To CellData.Cells.Count
        Result = Result & CellData(i) & vbCrLf
    Next

    GetText = Result
End Function

The performance of this method is unacceptable even for a relatively small number of cells, with one call taking up to several minutes to run. At each iteration of the loop, the previous string is copied into a new allocated string and then left as garbage. As the loop continues, the intermediate string gets longer, more data has to be copied, and more and more garbage is generated. This is a textbook instance of a Schlemiel the painter's algorithm.

A better method is to copy the cells into an array and use the Join function to generate the final result.

Function GetText(ws As Worksheet) As String
    Dim CellData As Range
    Dim Result() As String
    Dim i As Long

    Set CellData = Intersect(ws.[A:A], ws.UsedRange)
    ReDim Result(1 To CellData.Cells.Count)
    For i = 1 To UBound(result)
        Result(i) = CellData(i)
    Next

    GetText = Join(Result, vbCrLf)
End Function

Performance is much better, but you'll notice that even this version takes several seconds to process a few thousand lines of data. It would be better if we could just concatenate the cells directly without having to employ an intermediate array. Luckily, it's possible to do this using the WorksheetFunction.Transpose function, which is smart enough to generate the 1-dimensional array we need for Join when a single column 2-dimensonal array is provided as input.

Function GetText(ws As Worksheet) As String
    Dim CellData As Range

    Set CellData = Intersect(ws.[A:A], ws.UsedRange)
    GetText = Join(WorksheetFunction.Transpose(CellData), vbCrLf)
End Function
This function will run in just a second or two even with 10s of thousands of lines of data. It's always worth it to keep an eye out for a better method even when you think you've already optimized your solution.

Friday, May 13, 2016

Extracting a lot of data from a SQL Server database

Often when working in SQL Server Management Studio, you will end up with a complicated query that results in a huge number of records that need to be exported for processing by another tool. When working with more than a few megabytes of data, simply copying and pasting it into Excel isn't an option.

One potential solution is use the bcp utility to export the results into a tab-delimited CSV file using a temporary table. In order for the temporary table to be visible to bcp, we need to use a global temp table. This is achieved by using an INTO clause and prepending two hashes to the temp table name.

SELECT table1.col1, table2.col2, table3.col3
INTO ##tempbcp
FROM table1
INNER JOIN table2 ON table1.key1 = table2.key1
LEFT JOIN table3 ON table1.key2 = table3.key2;

After the temporary table is populated, invoke bcp from a command prompt to extract the data.

bcp ##tempbcp out tempbcp.csv -S sqlhost -T -w

The -T option uses your network credentials to log on to the SQL Server. -w specifies that we want our results in a tab-delimited, unicode format.

Using bcp in this way to move data around is usually the best solution because, most of the time, the ability to create tables in tempdb is unrestricted, so no special database permissions are needed. Also, bcp is much faster and more memory efficient than most other methods.

However, a major drawback is that bcp is unable to include header titles in the output file. In order to get the field names, you can query the system views. From there you can use a variety of methods to get the field names into the CSV file or wherever the data is ultimately being transferred.

SELECT name FROM tempdb.sys.columns
WHERE [object_id] = OBJECT_ID('tempdb..##tempbcp')
ORDER BY column_id;

Wednesday, May 4, 2016

Getting a stored procedure definition in MS SQL Server

The UI in SQL Server Management Studio is so sluggish and disorganized, it usually ends up being faster and easier to directly query the system views to get schema information instead of navigating to the appropriate objects using SSMS. You can do this by preparing your schema queries as needed and saving them in a text file, then cutting and pasting as needed, changing the query parameters to inspect the schema objects that you're interested in as you work. Or you might even develop a simple GUI app to get this information, which would be very handy to use while working with a complicated database.

There's a lot of stuff you might want to query in the system views. Just to get started, let's look at how to print out the SQL definition of a stored procedure. T-SQL's PRINT statement will truncate a string at about 3K characters, so we'll need to PRINT substrings in a loop to make sure that the text for long stored procedures is not being cut off. We also need to make sure that our substrings begin and end at line boundaries; otherwise the output will get messed up with additional newlines sprinkled throughout the output.

DECLARE @proc sysname = 'usp_myproc';
DECLARE @i int = 1, @j int, @eol nchar(2) = CHAR(13) + CHAR(10),
    @minlen int = 99, @def nvarchar(max) =
        (SELECT [definition] FROM sys.sql_modules
        WHERE [object_id] = OBJECT_ID(@proc));
WHILE @i < LEN(@def) BEGIN
    SET @j = ISNULL(NULLIF(CHARINDEX(@eol, @def, @i + @minlen),
                           0),
                    LEN(@def)) - @i;
    PRINT SUBSTRING(@def, @i, @j);
    SET @i = @i + @j + LEN(@eol);
END;

Tuesday, May 3, 2016

Iterative development using the Python interpreter

Because Python is an interpreted language, a productive coding style is to write relatively self-contained functions while repeatedly testing them during development using the interactive interpreter. This is a form of unit testing, and you could almost call this Test-driven development if you copy your Python session into your docstrings using the doctest module. (But you'd be writing the tests after writing the code, which wouldn't be considered TDD by strict adherants.)

This following method might seem obvious to people with a lot of experience in Python, but I've never seen it written down anywhere. An easy way to establish this workflow is to use the reload function to reload your Python script as a module as you make changes in your text editor.

First, you start up a Python session and import your Python script as a module using the import statement. In this example, my script is called test.py.

$ python
Python 2.7.3 (default, Mar 18 2014, 05:13:23)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>import test

Now you can make changes to test.py in your text editor and test them in your Python session.

test.py  
def example(*args):
    return map(lambda x: x + 1, args)
>>>reload(test).example(1, 2, 3)
[2, 3, 4]

After calling your function the first time, you can re-test with the same arguments by pressing to pull the reload statement from the command history in the interpreter.

test.py  
def example(*args):
    return map(lambda x: x + 2, args)
>>>reload(test).example(1, 2, 3)
[3, 4, 5]