Lately, I’ve been spending a lot of time working with SQL and have encountered a few pitfalls. Here are some insights on how to handle large-scale data updates in production.
Background
In our current system, performing complex UPDATE operations on tables with hundreds of thousands of rows can lead to performance issues or even system crashes. To mitigate this, we adopted a strategy of breaking the updates into small, manageable chunks.
Scripted Approach
The following Node.js script demonstrates how we generate batched SQL statements from a list of primary keys.
/**
* @description
*
* 1. Query the primary keys of the target data and export them to a CSV file.
* 2. Define a single-transaction UPDATE SQL template.
* 3. Run this script to generate the batched SQL file.
* 4. Execute the resulting SQL file on the target server.
*
* Note: Adjust CHUNK_SIZE to control the number of records updated per transaction.
*/
const csv = require('csv-parser');
const fs = require('fs');
const ids = [];
const CHUNK_SIZE = 10; // Number of records per batch
// Input file containing IDs
const INPUT_FILE = 'quote_ids.csv';
// Target output SQL file
const OUTPUT_FILE = 'repair_data.sql';
const createSQLTemplate = (idArr) => `
UPDATE \`a-prod\`.quote a
  LEFT JOIN \`b-prod\`.geographic_unit b
    ON a.sales_org = b.sales_org
    AND a.sales_office = b.sales_office
    AND a.business_unit = b.business_unit
    AND a.country = b.country_iso_code
SET a.geographic_unit_id = b.id
WHERE a.id IN ('${idArr.join("','")}');\n\n\n`;
fs.createReadStream(INPUT_FILE)
  .pipe(csv())
  .on('data', (row) => {
    ids.push(row.id);
  })
  .on('end', () => {
    // Split the collected IDs into groups of CHUNK_SIZE
    const chunkIds = ids.reduce((res, item, index) => {
      const group = Math.floor(index / CHUNK_SIZE);
      if (res[group] === undefined) {
        res[group] = [];
      }
      res[group].push(item);
      return res;
    }, []);
    // Render one UPDATE statement per chunk and write them all to the output file
    fs.writeFile(OUTPUT_FILE, chunkIds.map((item) => createSQLTemplate(item)).join(''), (err) => {
      if (err) throw err;
      console.log('SQL generation complete.');
    });
  });
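For reference, each batch written to repair_data.sql is an independent statement like the one below (only a few placeholder IDs are shown; with CHUNK_SIZE = 10 each IN list holds ten). Under the default autocommit setting, every batch therefore commits as its own short transaction.

UPDATE `a-prod`.quote a
  LEFT JOIN `b-prod`.geographic_unit b
    ON a.sales_org = b.sales_org
    AND a.sales_office = b.sales_office
    AND a.business_unit = b.business_unit
    AND a.country = b.country_iso_code
SET a.geographic_unit_id = b.id
WHERE a.id IN ('1001','1002','1003');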
Using this method prevents long-running transactions that could lock tables or exhaust database resources.
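As for step 1, the ID list can be produced straight from the database. Below is a minimal sketch assuming MySQL's SELECT ... INTO OUTFILE and a hypothetical geographic_unit_id IS NULL condition for identifying the broken rows. Note that INTO OUTFILE writes no header row, while csv-parser expects an id header, so prepend one to the file (or export the result from your SQL client instead).

-- Step 1 sketch: export the primary keys of the rows that need repair.
-- The WHERE condition is hypothetical, and the output path must be
-- permitted by the server's secure_file_priv setting.
SELECT id
FROM `a-prod`.quote
WHERE geographic_unit_id IS NULL
INTO OUTFILE '/tmp/quote_ids.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';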
Reflections
- Data Volume vs. Performance: Even with only 200,000 rows, queries can become sluggish. This is often due to large columns (like TEXT) being included in the result set unnecessarily. Always select only the columns you need; this is why SELECT * is discouraged. Complexity also increases significantly when joining multiple large tables.
- Strict Filtering: For any UPDATE operation, a WHERE clause is mandatory. You must be extremely certain of your logic before performing an update without a precise filter (see the sketch after this list).
- Root Cause Analysis: If you find yourself needing to perform massive data corrections in production, it usually signals a bug in the application logic. Furthermore, if the update process is overly complicated, it might indicate that the database schema itself needs a redesign.
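To make the first two points concrete, here is a minimal sketch (hypothetical orders table and column names) of the pattern I follow before touching production data: verify the target rows by selecting only the columns needed for the check, then run the UPDATE with exactly the same precise WHERE clause.

-- 1. Verify the scope first: fetch only the columns needed for the check,
--    never SELECT * (table and columns are hypothetical).
SELECT id, status
FROM orders
WHERE status = 'PENDING'
  AND created_at < '2024-01-01';

-- 2. Only after confirming the row count and sampled rows, run the UPDATE
--    with the same WHERE clause.
UPDATE orders
SET status = 'CANCELLED'
WHERE status = 'PENDING'
  AND created_at < '2024-01-01';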

