Best Practices for Batch Updating Production Data

Nov 3, 2020 · 2 min read · 423 Words

Lately, I’ve been spending a lot of time working with SQL and have encountered a few pitfalls. Here are some insights on how to handle large-scale data updates in production.

Background

In our current system, performing complex UPDATE operations on tables with hundreds of thousands of rows can lead to performance issues or even system crashes. To mitigate this, we adopted a strategy of breaking the updates into small, manageable chunks.

Scripted Approach

The following Node.js script demonstrates how we generate batched SQL statements from a list of primary keys.

/**
 * @description
 *
 * 1. Query the primary keys of the target data and export them to a CSV file.
 * 2. Define a single-transaction UPDATE SQL template.
 * 3. Run this script to generate the batched SQL file.
 * 4. Execute the resulting SQL file on the target server.
 *
 * Note: Adjust CHUNK_SIZE to control the number of records updated per transaction.
 */
const csv = require('csv-parser');
const fs = require('fs');
const ids = [];
const CHUNK_SIZE = 10; // Number of records per batch

// Input file containing IDs
const INPUT_FILE = 'quote_ids.csv';
// Target output SQL file
const OUTPUT_FILE = 'repair_data.sql';

// Build one single-transaction UPDATE statement for a chunk of ids.
// Backticks used as MySQL identifier quotes must be escaped inside the
// template literal, and ids are joined with "','" so each value stays quoted.
const createSQLTemplate = (idArr) => `
UPDATE \`a-prod\`.quote a
LEFT JOIN \`b-prod\`.geographic_unit b
    ON a.sales_org = b.sales_org
    AND a.sales_office = b.sales_office
    AND a.business_unit = b.business_unit
    AND a.country = b.country_iso_code
SET a.geographic_unit_id = b.id
WHERE a.id IN ('${idArr.join("','")}');\n\n\n`;

fs.createReadStream(INPUT_FILE)
    .pipe(csv())
    .on('data', (row) => {
        ids.push(row.id);
    })
    .on('end', () => {
        const chunkIds = ids.reduce((res, item, index) => {
            let group = Math.floor(index / CHUNK_SIZE);
            if (res[group] === undefined) {
                res[group] = [];
            }
            res[group].push(item);
            return res;
        }, []);

        fs.writeFile(OUTPUT_FILE, chunkIds.map((item) => createSQLTemplate(item)).join(''), (err) => {
            if (err) throw err;
            console.log('SQL generation complete.');
        });
    });

Using this method prevents long-running transactions that could lock tables or exhaust database resources.
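
For a sense of the output, each chunk in repair_data.sql ends up as an independent statement along these lines (the id values below are hypothetical placeholders; with CHUNK_SIZE = 10 each IN list would hold ten ids):

UPDATE `a-prod`.quote a
LEFT JOIN `b-prod`.geographic_unit b
    ON a.sales_org = b.sales_org
    AND a.sales_office = b.sales_office
    AND a.business_unit = b.business_unit
    AND a.country = b.country_iso_code
SET a.geographic_unit_id = b.id
WHERE a.id IN ('1001','1002','1003');

The generated file can then be applied on the target server in the usual way, for example by piping it into the mysql command-line client against the production database.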

Reflections

  • Data Volume vs. Performance: Even with only 200,000 rows, queries can become sluggish. This is often due to large fields (like TEXT) being included in the result set unnecessarily. Always select only the columns you need; this is why using SELECT * is discouraged. Complexity also increases significantly when joining multiple large tables.
  • Strict Filtering: For any UPDATE operation, a WHERE clause is mandatory. Only run an update without a precise filter if you are absolutely certain of your logic and its consequences. A cheap sanity check to run before any large update is sketched after this list.
  • Root Cause Analysis: If you find yourself needing to perform massive data corrections in production, it usually signals a bug in the application logic. Furthermore, if the update process is overly complicated, it might indicate that the database schema itself needs a redesign.
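
As mentioned above, an inexpensive safeguard before running a batched update is to execute the same join and filter as a plain SELECT and confirm that the affected row count matches expectations. A minimal sketch against the tables used in this post (the id values are hypothetical):

-- Dry run: count the rows the batched UPDATE would touch
SELECT COUNT(*)
FROM `a-prod`.quote a
LEFT JOIN `b-prod`.geographic_unit b
    ON a.sales_org = b.sales_org
    AND a.sales_office = b.sales_office
    AND a.business_unit = b.business_unit
    AND a.country = b.country_iso_code
WHERE a.id IN ('1001','1002','1003');

Comparing COUNT(*) with COUNT(b.id) in the same query also reveals how many rows would end up with a NULL geographic_unit_id because no matching geographic unit exists.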