Adventures in SQL: Using ASCII to remove carriage returns in user input

A current piece of work involves migrating data from a source system, where there is a lot of user input, into a target system.  The data migration process can always be tricky when user-generated input is involved, since the process involves formalising and sanitising this fluid, informal user input into a formal structure. The informal nature of user input can make this process an interesting challenge!
One recent project involved taking data from a source system and inserting it into a well-known target database, which is a well-known timesheet system. The target system takes comma separated value (CSV) files and imports the data using a proprietary engine.  The data is then propagated to its own underlying SQL Server database.
The CSV file production strategy has involved using SSIS 2008 to extract data from the source system, and place the data into the formal structure of the CSV files. One problem area, as expected, are free-text fields, where users can type in whatever they like!
One issue for CSV file production is carriage returns in free-text fields. SSIS sees the carriage returns, and behaves as its been told to do: it starts a new line.  This can be seen in the Flat File Connection Editor below:

Carriage Returns in Flat File Editor

The {CR}{LF} are control characters which denote ‘carriage return’ and ‘line feed’, respectively. When SSIS sees these control characters in the incoming text field, it naturally believes that this means it should create a new line in the CSV file: which it then goes and undertakes this activity. However, this causes an issue for the CSV file format overall, because the next line will only have one column in it, and will not be correctly formatted for the data import to work properly.
In order to resolve this issue, it is possible to use lower-level constructs in order to strip out any erroneous carriage returns in the user input. The SQL CHAR command is a string function that converts an INT ASCII code to a specific character. Here, it is used to insert control characters that we see in the earlier image, into character strings. Here is an example, which looks for a carriage return (CHAR(13)) and line feed (CHAR(10)):

select REPLACE(cast(w.userentereddetail as nvarchar(4000)), CHAR(13) + CHAR(10), ‘  ‘) as UserComments from tbl w

This SQL Statement uses a combination of CHAR and REPLACE to find any carriage returns in the ‘userentereddetail’ column, and will replace them with a space. This means that any carriage returns in the user input field are captured and replaced so that they will not interfere with the CSV production by SSIS.
To summarise, perhaps I’m showing my age by talking about ASCII. However, these little tidbits can be useful to know!

5 thoughts on “Adventures in SQL: Using ASCII to remove carriage returns in user input

  1. Jen,
    This is helpful and I would like to offer another way of cleaning data. I often pass the entire stream through a script component within my data flow. By using a customized VB script I can pretty much strip out all control characters and anything else I don't want in my data.

    Jon

  2. Hi I found your post interesting, but it doesn't solve my problem 😦
    Do you use REPLACE function in SSIS because I'm trying this and it doesn't accept REPLACE([mycolumn], “CHAR(13)+CHAR(10)”, “;”).
    I also tried REPLACE([mycolumn], “{CR}{LF}”, “;”) or REPLACE([mycolumn], “\n”, “;”) but nothing seems to work.
    Did you used ascii code in sql ? and not in SSIS ? That would explain why it worked for you…

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s