Creating a database from a large CSV file.

In this article, Jean-Michel explains how to create a database from a large CSV file.

In this example, I will use a CSV file, available on the French government's open data website, containing information about all the French gas stations: GasStationsDatabase.

As Convertigo provides a step to read CSV files, it is very easy to create a project that reads a CSV file and then writes each record into a database.
The weak point of this method is speed. Indeed, writing into a database takes a while, and dealing with large CSV files could take "forever".
In this article, I will show you how to create an 11,000-record database from the corresponding CSV file while minimizing the write bottleneck.

The source code for this sample can be downloaded here: DatabaseFromCSV.

Bulk reading and writing

The idea is to make Convertigo read the CSV file but, instead of taking each line in turn and writing it into the database, read a chunk of lines (250 here), dynamically build a single multi-insert SQL query out of it, use that query to fill up the database, and loop until the end of the file. This saves a huge amount of time.
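
For instance, each chunk will produce a single statement with the following shape (column list truncated, values invented for the illustration):

insert into pdv (id, latlng, typeroute, adresse, ...) values
('1000001', '46.2', 'R', '1 rue Exemple', ...),
('1000002', '48.8', 'A', '2 avenue Exemple', ...),
...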

To keep things simple, we will use MySQL syntax, but you can adapt the query to any other database. We will also assume the database table has already been created to match the labels of the CSV title line.

DROP TABLE IF EXISTS `pdv`;
CREATE TABLE `pdv` (
 `id` int(11) unsigned NOT NULL,
 `latlng` decimal(18,12) DEFAULT NULL,
 `typeroute` varchar(1) DEFAULT NULL,
 `adresse` varchar(256) DEFAULT NULL,
 `commune` varchar(45) DEFAULT NULL,
 `codepostal` varchar(5) DEFAULT NULL,
 `hdebut` time(3) DEFAULT NULL,
 `hfin` time(3) DEFAULT NULL,
 `saufjour` varchar(128) DEFAULT NULL,
 `services` varchar(512) DEFAULT NULL,
 `carburants` varchar(256) DEFAULT NULL,
 `activite` varchar(1) DEFAULT NULL,
 PRIMARY KEY (`id`),
 UNIQUE KEY `id_UNIQUE` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


fig 1: 'pdv' table schema

In our CsvToSQL sequence, we have the following steps:

- a JavaScript step to create some global variables


/*
* CSV related variables
*/
hasTitleLine = true;    // this CSV file contains a title line
titleSize = 0;          // global variable to hold the number of titles in CSV

/*
* Dynamic query related variables
*/
query = "";             // global variable to hold the final query
index = 0;              // global variable to hold the current line index in the CSV
chunkSize = 250;        // global variable to hold the number of lines in the chunk


- a transaction to truncate the 'pdv' table
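
Assuming MySQL, this transaction's query can be as simple as:

TRUNCATE TABLE `pdv`;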

- a readCSV step where you should set the encoding to UTF-8, the source to ".//data/pdv.csv", and titleline to false (because we will handle the title line ourselves)

- a jSource step sourced on the readCSV step, defining the lineNodes variable that holds the CSV line nodes (the first one being the title line)

- a while loop to iterate over the lines of the CSV file, chunk by chunk
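
Assuming the while step's condition is a JavaScript expression, it only needs to check our global index against the total number of lines:

index < lineNodes.length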

- a JavaScript step "readChunkOfLines" that we will study in detail

var i = 0;

/**
* deal with the line of labels
* only once when reading the first line
*/
if (hasTitleLine && (index == 0)) {
    // for convenience, get a variable on the node
    var node = lineNodes.item(0);
    // count the number of titles in the line
    titleSize = node.childNodes.length;
    // initialize the query header
    queryHeader = "insert into pdv (";

    // loop throughout the titles to build up the query header
    for(var j=0; j<titleSize; j++) {
        // add a comma separator before every label but the first
        if (j>0)
            queryHeader += ", ";

        // add title to the query header
        queryHeader += node.childNodes.item(j).getTextContent();
    }

    // we are done with the queryHeader,
    // close it and start the values
    queryHeader += ") values ";
    // because we consumed the title line,
    // increment the loop counter and chunkSize to skip it
    i++;
    chunkSize++;
}
// reset the values part
values = "";

/*
* now get the CSV lines up to chunkSize or end of file
* note the index variable that keeps the starting point
* index of each chunk
*/
for(; i<chunkSize && (index+i) < lineNodes.length; i++) {
    // add a comma separator before every value block but the first
    if (values.length > 0)
        values += ",";

    // get, in turn, all the values of a value block
    for(var j=0; j<titleSize; j++) {
        // open the value block or add a comma separator
        if (j == 0)
            values += "(";
        else
            values += ",";
                        
        // add the current value to the values query string,
        // quoted for SQL
        values += "'" + lineNodes.item(index + i).childNodes.item(j).getTextContent() + "'";
    }

    // close the values part of the query
    values += ")";
}

// assemble the queryHeader and the value part
query = queryHeader + values;

// adjust the index, so next chunk
// will start at the correct index
index += i;
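
At this point, the query variable holds a complete multi-insert statement; an SQL transaction step placed next in the loop can execute it, writing the whole 250-line chunk to MySQL in a single round trip.

Note that the values are concatenated as-is, so a single quote inside the data (frequent in French addresses) would break the statement. A minimal escaping helper, not part of the original sequence, could be added at the top of this step:

function sqlEscape(s) {
    // double single quotes so MySQL treats them as data, not as delimiters
    return String(s).replace(/'/g, "''");
}

Each value would then be appended as "'" + sqlEscape(...) + "'".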


Conclusion

Using this method, you will considerably reduce the overhead of writing a large number of records into a database.
