Importing Data with MySQL

A Tutorial by Russell J.T. Dyer

When a MySQL developer first creates a MySQL database for a client, often times the client has already accumulated data in other, simpler applications. Being able to convert data easily to MySQL is critical. In the previous two articles of this MySQL series, we explored how to set up a database and how to query one. In this third installment, I will introduce some methods and tools for bulk importing of data into MySQL. This isn't an overly difficult task, but the processing of large amounts of data can be intimidating for a newcomer and and as a result it can be a barrier to getting started with MySQL. Additionally, for intermediate developers, there are many nuances to consider for a clean import, which is especially important for automating regularly scheduled imports. There are also restraints to deal with that may be imposed on a developer when using a web hosting company.

Foreign Data Basics

Clients sometimes give developers raw data in formats created by programs like MS Access or FoxPro. Since non-technical clients don't typically understand database concepts, new clients often give me their initial data in Excel spreadsheets. Let's first look at a simple method for importing data. The simplest way to deal with incompatible data in any format is to load it up in its original software and to export it out to a delimited text file. Most applications have the ability to export data to a text format and will allow the user to set the delimiters. I like to use the bar (i.e., |, a.k.a. pipe) to separate fields and the line-feed to separate records.

For the examples in this article, I will assume that a fictitious client's data was in Excel and that the exported text file will be named prospects.txt. It contains contact information about prospective customers for the client's sales department, located on the client's intranet site. The data is to be imported into a MySQL table called prospect_contact, in a database called sales_dept. To make the process simpler, the order and number of columns in Excel (the format of the data provided by the client) should be the same as the table into which the data is going to be imported. So if prospect_contact has columns that are not included in the spreadsheet, one would make a copy of the spreadsheet and add the missing columns and leave them blank. If there are columns in the spreadsheet that aren't in prospect_contact, one would either add them to the MySQL table, or, if they're not to be imported, one would delete the extra columns from the spreadsheet. One should also delete any headings and footnotes from the spreadsheet. After this is completed then the data can be exported. Since this is Unix Review, I'll skip how one would export data in Excel and assume that the task was accomplished easily enough using its export wizard.

The next step is to upload the data text file to the client's web site by FTP. It should be uploaded in ASCII mode. Binary mode may send binary hard-returns for row-endings. Also, it's a good security habit to upload data files to non-public directories. Many web hosting companies provide virtual domains with a directory like /public_html, which is the document root for the Apache web server; it typically contains the site's web pages. In such a situation, / is a virtual root containing logs and other files that are inaccessible to the public. I usually create a directory called tmp in the virtual root directory to hold data files temporarily for importing into MySQL. Once that's done, all that's required is to log into MySQL with the mysql client as an administrative user (if not root, then a user with FILE privileges), and run the proper SQL statement to import the data.

Loading Data Basics

The LOAD DATA INFILE statement is the easiest way to import data from a plain text file into MySQL. Below is what one would enter in the mysql client to load the data in the file called prospects.txt into the table prospect_contact:

LOAD DATA INFILE '/tmp/prospects.txt'
INTO TABLE prospect_contact
FIELDS TERMINATED BY '|';

Before entering the statement above, the MySQL session would, of course, be switched to the sales_dept database with a USE statement. It is possible, though, to specify the database along with the table name (e.g., sales_dept.prospect_contact). If the server is running Windows, the forward slashes are still used for the text file's path, but a drive may need to be specified at the beginning of the path: 'c:/tmp/prospects.txt'. Notice that the SQL statement above has | as the field delimiter. If the delimiter was [TAB]--which is common--then one would replace | with \t here. A line-feed (\n) isn't specified as the record delimiter since it's assumed. If the rows start and end with something else, though, then they will need to be stated. For instance, suppose the rows in the text file start with a double-quote and end with a double-quote and a Windows hard-return (i.e., a return and a line-feed). The statement would need to read like this:

LOAD DATA INFILE '/tmp/prospects.txt'
INTO TABLE prospect_contact
FIELDS TERMINATED BY '|'
LINES STARTING BY '"'
TERMINATED BY '"\r\n';

Notice that the starting double-quote is inside of single-quotes. If one needs to specify a single-quote as the start of a line, one could either put the one single-quote within double-quotes or one could escape the inner single-quote with a back-slash, thus telling MySQL that the single-quote that follows is to be taken literally and is not part of the statement, per se:

...
LINES STARTING BY '\'' 
...

Duplicate Rows

If the table prospect_contact already contains some of the records that are about to be imported from prospects.txt (that is to say, records with the same primary key), then a decision should be made as to what MySQL is to do about the duplicates. The SQL statement, as it stands above, will cause MySQL to try to import the duplicate records and to create duplicate rows in prospect_contact for them. If the table's properties are set not to allow duplicates, then MySQL will kick out errors. To get MySQL to replace the duplicate existing rows with the ones being imported in, one would add the REPLACE flag just before the INTO TABLE clause like this:

LOAD DATA INFILE '/tmp/prospects.txt'
REPLACE INTO TABLE prospect_contact
FIELDS TERMINATED BY '|'
LINES STARTING BY '"'
TERMINATED BY '"\n';

To import only records for prospects that are not already in prospect_contact, one would substitute REPLACE with the IGNORE flag. This instructs MySQL to ignore records read from the text file that already exist in the table.

Live Data

For importing data into a table while it's in use, table access needs to be addressed. If access to the table by other users may not be interrupted, then a LOW_PRIORITY flag can be added to the LOAD DATA INFILE statement. This tells MySQL that the loading of this data is a low priority. One would only need to change the first line of the SQL statement above to set its priority to low:

LOAD DATA LOW_PRIORITY INFILE '/tmp/prospects.txt'
...

If the LOW_PRIORITY flag isn't included, the table will be locked temporarily during the import and other users will be prevented from accessing it.

Being Difficult

I mentioned earlier that uploading of the text file should not be done in binary mode so as to avoid the difficulties associated with Windows line endings. If this is unavoidable, however, there is an easy way to import binary row-endings with MySQL. One would just specify the appropriate hexadecimals for a carriage-return combined with a line-feed (i.e., CRLF) as the value of TERMINATED BY:

...
TERMINATED BY 0x0d0a;

Notice that there are intentionally no quotes around the binary value. If there were, MySQL would take the value for text and not a binary code. The semi-colon is not part of the value; it's the SQL statement terminator.

Earlier I also stated that the first row in the spreadsheet containing the column headings should be deleted before exporting to avoid the difficulty of importing the headings as a record. It's actually pretty easy to tell MySQL to just skip the top line. One would add the following line to the very end of the LOAD DATA INFILE statement:

...
IGNORE 1 LINES;

The number of lines for MySQL to ignore can, of course, be more than one.

Another difficulty arises when some Windows application wizards export data with each field surrounded by double-quotes, as well as around the start and end of records. This can be a problem when a field contains a double-quote. To deal with this, some applications use back-slash (\) to escape embedded double-quotes, to indicate that a particular double-quote is not a field ending but part of the field's content. However, some applications will use a different character (like a pound-sign) to escape embedded quotes. This can cause problems if MySQL isn't prepared for the odd escape-character. MySQL will think the escape character is actually text and the embedded quote-mark, although it's escaped, is a field ending. The unenclosed text that follows will be imported into the next column and the remaining columns will be one column off, leaving the last column not imported. As maddening as this can be, it's quite manageable in MySQL by adding an ENCLOSED BY and an ESCAPED BY clause:

LOAD DATA LOW_PRIORITY INFILE '/tmp/prospects.txt'
REPLACE INTO TABLE prospect_contact
FIELDS TERMINATED BY '"'
ENCLOSED BY '"' ESCAPED BY '#'
LINES STARTING BY '"'
TERMINATED BY '"\n'
IGNORE 1 LINES;

In the Foreign Data Basics section above, I said that the columns in the spreadsheet should be put in the same order and quantity as the receiving table. This really isn't necessary if MySQL is cued in as to what it should expect. To illustrate, let's assume that prospect_contact has four columns in the following order: row_id, name_first, name_last, telephone. Whereas, the spreadsheet has only three columns, differently named, in this order: Last Name, First Name, Telephone. If the spreadsheet isn't adjusted, then the SQL statement will need to be changed to tell MySQL the field order:

LOAD DATA LOW_PRIORITY INFILE '/tmp/prospects.txt'
REPLACE INTO TABLE sales_dept.prospect_contact
FIELDS TERMINATED BY 0x09
ENCLOSED BY '"' ESCAPED BY '#'
TERMINATED BY 0x0d0a
IGNORE 1 LINES
(name_last, name_first, telephone);

This SQL statement tells MySQL the name of each table column associated with each spreadsheet column in the order that they appear in the text file. From there it will naturally insert the data into the appropriate columns in the table. As for columns that are missing like row_id, MySQL will fill in those fields with the default value if one has been supplied in the table's properties. If not, it will leave the field as NULL. Incidentally, I slipped in the binary [TAB] (0x09) as a field delimiter.

mysqlimport

For some clients and for certain situations it may be of value to be able to import data into MySQL without using the mysql client. This could be necessary when constructing a shell script to import text files on an automated, regular schedule. To accomplish this, the mysqlimport utility may be used as it encompasses the LOAD DATA INFILE statement and can easily be run from a script. So if one wants to enter the involved SQL statement at the end of the last section above, the following could be entered from the command-line (i.e., not in the mysql client):

mysqlimport --user='marie_dyer' --password='angelle1207' \
--fields-terminated-by=0x09 --lines-terminated-by=0x0d0a \
--replace --low-priority --fields-enclosed-by='"' \
 --fields-escaped-by='#' --ignore-lines='1' --verbose \
--columns='name_last, name_first, telephone' \
sales_dept '/tmp/prospect_contact.txt'

Although this statement is written over several lines here, it either has to be on the same line when entered or a space followed by a back-slash has to be entered at the end of each line (as seen here) to indicate that more follows. Since the above is entered at the command-line prompt, the user isn't logged into MySQL. Therefore the first line contains the user name and password for mysqlimport to pass to MySQL. The password itself is optional, but the directive --password (without the equal sign) isn't. If the password value is not given in the statement, then the user will be prompted for it. Notice that the order of directives doesn't matter after the initial command, except that the database and file name go last. Regarding the file name, its prefix must be the same as the table--the dot and the extension are ignored. This requires that prospects.txt be renamed to prospect_contact.txt. If the file isn't renamed, then MySQL would create a new table called prospects and the --replace flag would be pointless. After the file name, incidentally, one could list more text files, separated by a space, to process using the mysqlimport statement. I've added the --verbose directive so as to be able to see what's going on. One probably would leave this out in an automated script. By the way, --low-priority and --ignore-lines are available starting in version 4.0 of MySQL. So if you're using an earlier version, you'll have to use my simpler suggestion of earlier and delete the headings after all.

Web Hosting Restraints

Some web hosting companies do not allow the use of LOAD DATA INFILE or mysqlimport statements due to security vulnerabilities in these statements for them. To get around this, some extra steps are necessary to avoid having to manually enter the data one row at a time. First, one needs to have MySQL installed on one's local workstation. For simplicity I'll assume this is done and is running Linux on the main partition and MS Windows 98 on an extra partition. Recapping the on-going example of this article based on these new circumstances, one would boot up into Windows and start Excel, load the client's spreadsheet into it and then run the export wizard as before--saving the file prospects.txt to the 'My Documents' directory. Then one would reboot into Linux and mount the Windows 98 partition and copy the data text file to /tmp in Linux, locally. Next one would log into the local (not the client's) MySQL server and import the text file using a LOAD DATA INFILE as I've extensively outline above. From there one would exit MySQL and export the data out of MySQL using the mysqldump utility locally, from the command-line like this:

mysqldump --user='root' --password='geronimo' sales_dept prospect_contact > /tmp/prospects.sql

This creates an interesting text file complete with all of the SQL commands necessary to insert the data back into MySQL one record, one INSERT command at a time. When you run mysqlimport, it's very educational to open up it in a text editor to see what it generates.

After creating this table dump, one would upload the resulting file (in ASCII mode) to the /tmp directory on the client's web server. From the command prompt on the client's server one would enter the following:

mysql --user='marie_dyer' --password='angelle12107' sales_dept < '/tmp/prospects.sql'

This line along with the mysqldump line show above are simple approaches. Like the Windows application wizard, with mysqldump one can specify the format of the output file and several other factors. One important factor related to the scenario used in this article is the CREATE TABLE statement that will be embedded in the mysqldump output file. This will fail and kick out an error because of the existing table prospect_contact in the client's database. To limit the output to only INSERT statements and no CREATE TABLE statements, the mysqldump statement would look like this:

mysqldump -u marie_dyer -p --no-create-info sales_dept prospect_contact > /tmp/prospects.sql

Notice that I've used acceptable abbreviations for the user name and the password directives. Since the password was given here, the user will be prompted for it.

The mysqldump utility usually works pretty well. However, one feature it's lacking at this time is a REPLACE flag as is found in the LOAD DATA INFILE statement and with the mysqlimport tool. So if a record already exists in the prospect_contact, it won't be imported. Instead it will kick out an error message and stop at that record, which can be a mess if one has imported several hundred rows and have several hundred more to go. One easy fix for this is to open up prospects.sql in a text editor and do a search on the word INSERT and replace it with REPLACE. The syntax of both of these statements are the same, fortunately. So one would only need to replace the keyword for new records to be inserted and for existing records to be replaced.

Concluding Observations and Admissions

It's always amazing to me how much can be involved in the simplest of statements in MySQL. MySQL is deceptively powerful and feature rich. One can keep the statements pretty minimal or one can develop a fairly detailed, single statement to allow for accuracy of action. There are many other aspects of importing data into MySQL that I did not address--in particular dealing with utilities. I also didn't talk about the Perl modules that could be used to convert data files. These can be useful in scripting imports. There are many ways in which one can handle importing data. Hopefully, this article has presented most of the basics and pertinent advanced details that may be of use to most MySQL developers.