In this example, I will use a CSV file, available on the French government's open data website, containing information about all the French gas stations: GasStationsDatabase.
As Convertigo provides a step to read CSV files, it is very easy to create a project that reads a CSV file and then writes each record into a database.
The weak point of this method is speed: writing into a database takes a while, and processing a large CSV file record by record can take forever.
In this article, I will show you how to build an 11,000-record database from the corresponding CSV file while minimizing the write bottleneck.
The source code for this sample can be downloaded here: DatabaseFromCSV.
The first thing to do is to make Convertigo read the CSV file. Instead of fetching each line in turn and writing it into the database, we will read a chunk of lines (250 at a time), dynamically build a multi-insert SQL query out of it, use that query to fill the table, and loop until the end of the file. This saves a huge amount of time.
To keep things simple, we will use MySQL syntax, but you can adapt the query to any other database. We will also assume the database table has already been created to match the labels of the CSV title line.
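For reference, here is the shape of the multi-insert statement we will be generating. The column names match a few columns of the 'pdv' table used in this article; the values themselves are purely illustrative:

```sql
insert into pdv (id, latlng, commune) values
  ("1", "46.21", "Paris"),
  ("2", "45.75", "Lyon"),
  ("3", "43.70", "Nice");
```

A single statement like this inserts one whole chunk of rows in one round trip, instead of one round trip per row.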
DROP TABLE IF EXISTS `pdv`;
CREATE TABLE `pdv` (
`id` int(11) unsigned NOT NULL,
`latlng` decimal(18,12) DEFAULT NULL,
`typeroute` varchar(1) DEFAULT NULL,
`adresse` varchar(256) DEFAULT NULL,
`commune` varchar(45) DEFAULT NULL,
`codepostal` varchar(5) DEFAULT NULL,
`hdebut` time(3) DEFAULT NULL,
`hfin` time(3) DEFAULT NULL,
`saufjour` varchar(128) DEFAULT NULL,
`services` varchar(512) DEFAULT NULL,
`carburants` varchar(256) DEFAULT NULL,
`activite` varchar(1) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_UNIQUE` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
fig 1: 'pdv' table schema
In our CsvToSQL sequence, we have the following steps:
- a JavaScript step to create some global variables
/*
* CSV related variables
*/
hasTitleLine = true; // this CSV file contains a title line
titleSize = 0; // global variable to hold the number of titles in the CSV
/*
* Dynamic query related variables
*/
query = ""; // global variable to hold the final query
index = 0; // global variable to hold the current line index in the CSV
chunkSize = 250; // global variable to hold the number of lines in the chunk
- a transaction to truncate the database
- a readCSV step where you should set the encoding to UTF-8, the source to ".//data/pdv.csv", and titleline to false (because we will handle the title line ourselves)
- a jSource step sourced on the readCSV step to retrieve the first line of the CSV file (the title line)
- a while loop to iterate over the lines of the CSV file
- a JavaScript step, "readChunkOfLines", that we will study in detail
var i = 0;
/**
* deal with the line of labels
* only once when reading the first line
*/
if (hasTitleLine && (index == 0)) {
// for convenience, get a variable on the node
var node = lineNodes.item(0);
// count the number of titles in the line
titleSize = node.childNodes.length;
// initialize the query header
queryHeader = "insert into pdv (";
// loop throughout the titles to build up the query header
for(var j=0; j<titleSize; j++) {
// trick to add a comma after each label
if (j>0)
queryHeader += ", ";
// add title to the query header
queryHeader += node.childNodes.item(j).getTextContent();
}
// we are done with the queryHeader,
// close it and start the values
queryHeader += ") values ";
// because we had a titleLine, we need to
// increment loop counter, and chunkSize
i++;
chunkSize++;
}
// reset the values part
values = "";
/*
 * now get the CSV lines, up to chunkSize or end of file;
 * the index variable keeps track of the starting
 * index of each chunk
 */
for(; i<chunkSize && (index+i) < lineNodes.length; i++) {
// trick to add a comma after each value block
if (values.length > 0)
values += ",";
// get in turn all the value of a value block
for(var j=0; j<titleSize; j++) {
// trick to add a comma or a parenthesis
if (j == 0)
values += "(";
else
values += ",";
// add the current value to the values query string
values += """ + lineNodes.item(index+i).item(j).getTextContent() + """;
}
// close the values part of the query
values += ")";
}
// assemble the queryHeader and the value part
query = queryHeader + values;
// adjust the index, so next chunk
// will start at the correct index
index += i;
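Outside Convertigo, the chunking logic above can be sketched as a small standalone function. The buildChunkQueries name, the in-memory rows array, and the fixed 'pdv' table name are assumptions made for the sake of illustration; in the sequence, the rows come from the readCSV step instead:

```javascript
// Build multi-insert queries from CSV rows, one query per chunk.
// "rows" is an array of arrays, whose first row is the title line.
function buildChunkQueries(rows, chunkSize) {
  var titles = rows[0];
  // query header built once from the title line
  var header = "insert into pdv (" + titles.join(", ") + ") values ";
  var queries = [];
  // walk the data rows chunk by chunk
  for (var index = 1; index < rows.length; index += chunkSize) {
    var values = [];
    var end = Math.min(index + chunkSize, rows.length);
    for (var i = index; i < end; i++) {
      // one "(v1,v2,...)" block per row
      values.push("(" + rows[i].map(function (v) {
        return "\"" + v + "\"";
      }).join(",") + ")");
    }
    queries.push(header + values.join(","));
  }
  return queries;
}
```

Each returned string is one multi-insert query for one chunk, ready to be handed to a SQL transaction.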
Using this method, you will considerably reduce the overhead of writing a large number of records into a database.