merge csv files python

2 min read 02-10-2024
Combining multiple CSV files into one is a common task for data analysts, developers, and anyone else who works with data. Whether you're dealing with sales reports, survey responses, or large datasets, merging these files with a short script saves time over copying rows by hand. This article walks through how to merge CSV files in Python using pandas.

Problem Scenario

Let's assume you have several CSV files that contain sales data from different regions for a particular year, and you want to consolidate this information into a single file for analysis. Below is an example code snippet that demonstrates how to achieve this:

import pandas as pd
import glob

# Path to your CSV files (keep the trailing slash so the pattern below is valid)
path = 'path/to/csv/files/'

# glob returns matches in arbitrary order, so sort them for a reproducible result
all_files = sorted(glob.glob(path + "*.csv"))

# List to hold the data
dataframes = []

# Loop through the list of files and read them into DataFrames
for filename in all_files:
    df = pd.read_csv(filename)
    dataframes.append(df)

# Concatenate all DataFrames into a single DataFrame
# (pd.concat raises ValueError if no files were found)
merged_df = pd.concat(dataframes, ignore_index=True)

# Save the merged DataFrame to a new CSV file
merged_df.to_csv('merged_output.csv', index=False)

Step-by-Step Breakdown

  1. Import Necessary Libraries: We use pandas for handling data and glob for file path operations.

  2. Define the Path: Set the path where your CSV files are stored. Replace 'path/to/csv/files/' with the actual path.

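  3. Get All CSV Files: glob.glob() returns the paths of all files ending in .csv in the specified directory. The order of the returned list is not guaranteed, so sort it if the row order of the merged file matters.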

  4. Read and Store DataFrames: We create an empty list dataframes to hold the data read from each file. A loop iterates through all CSV files, reads them into a DataFrame, and appends it to the list.

  5. Merge DataFrames: Using pd.concat(), we stack all DataFrames into one. Passing ignore_index=True discards each file's original row index and numbers the merged rows consecutively from 0.

  6. Output the Merged File: Finally, we save the merged DataFrame into a new CSV file named merged_output.csv using the to_csv() method.
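The steps above can also be wrapped in a reusable function. The following is a minimal sketch, not the only way to do it: the function name merge_csv_files and its parameters are hypothetical, and it assumes every input file shares the same header. It uses pathlib instead of glob, sorts the file list, and skips the output file in case it sits in the same directory as the inputs.

```python
from pathlib import Path

import pandas as pd


def merge_csv_files(input_dir, output_path, pattern="*.csv"):
    """Merge every CSV matching `pattern` in `input_dir` into `output_path`.

    Hypothetical helper for illustration; assumes all files share one header.
    """
    output_path = Path(output_path)

    # Sort for a reproducible row order, and skip the output file itself
    # so a second run does not merge the previous result back in.
    files = sorted(
        f for f in Path(input_dir).glob(pattern)
        if f.resolve() != output_path.resolve()
    )
    if not files:
        raise FileNotFoundError(f"No files matching {pattern!r} in {input_dir}")

    # Read each file, then stack the rows and renumber the index from 0.
    frames = [pd.read_csv(f) for f in files]
    merged = pd.concat(frames, ignore_index=True)

    merged.to_csv(output_path, index=False)
    return merged
```

A call such as merge_csv_files('path/to/csv/files/', 'merged_output.csv') would then do the same work as the script, with an explicit error when the directory contains no matching files.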

Practical Example

Imagine that you have three CSV files containing the following data:

  • sales_region1.csv

    Product,Sales
    A,100
    B,150
    
  • sales_region2.csv

    Product,Sales
    A,200
    C,300
    
  • sales_region3.csv

    Product,Sales
    B,250
    C,100
    

After running the merging script on these three files, read in file-name order, merged_output.csv will contain:

Product,Sales
A,100
B,150
A,200
C,300
B,250
C,100
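To try this without touching real data, the example can be reproduced in a temporary directory. This is a small sketch under the assumption that the three files contain exactly the rows listed above; the samples dictionary and the temporary directory are illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# The three sample files from the example, keyed by file name.
samples = {
    "sales_region1.csv": "Product,Sales\nA,100\nB,150\n",
    "sales_region2.csv": "Product,Sales\nA,200\nC,300\n",
    "sales_region3.csv": "Product,Sales\nB,250\nC,100\n",
}

with tempfile.TemporaryDirectory() as tmp:
    # Write the sample files to disk.
    for name, text in samples.items():
        Path(tmp, name).write_text(text)

    # Sorting the paths gives region1, region2, region3.
    files = sorted(Path(tmp).glob("sales_region*.csv"))
    frames = [pd.read_csv(f) for f in files]
    merged_df = pd.concat(frames, ignore_index=True)

# Show the merged rows in CSV form.
print(merged_df.to_csv(index=False))
```

The printed rows follow the order of the sorted file names, which is why the rows from region 1 come first and those from region 3 last.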

Additional Insights

  • Handling Duplicate Data: After merging, you might want to remove duplicates or aggregate the data. merged_df.drop_duplicates() removes rows that are identical in every column, and groupby() summarizes the data, for example merged_df.groupby('Product')['Sales'].sum() for total sales per product.

  • Optimizing Performance: For datasets too large to fit comfortably in memory, consider a library such as dask, which processes the data in partitions rather than loading everything at once.

  • Working with Headers: Ensure that all your CSV files have the same header structure. If they differ, pd.concat() aligns columns by name and fills the gaps with NaN, so rename or reorder the columns before merging.
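The duplicate handling and header check mentioned above might look like the sketch below. The data is made up for illustration: the extra "A,100" row and the lowercase "product" header are not in the sample files and were added only so that each step has something to act on.

```python
import pandas as pd

# Stand-in for the merged sales data; the final "A,100" row is a
# deliberate duplicate added for this illustration.
merged_df = pd.DataFrame({
    "Product": ["A", "B", "A", "C", "B", "C", "A"],
    "Sales": [100, 150, 200, 300, 250, 100, 100],
})

# Remove rows that are identical in every column.
deduped = merged_df.drop_duplicates()

# Total sales per product across all regions.
totals = deduped.groupby("Product", as_index=False)["Sales"].sum()
print(totals)

# Header check: compare every frame's columns with those of the first frame.
frames = {
    "sales_region1.csv": pd.DataFrame({"Product": ["A"], "Sales": [100]}),
    # Hypothetical file whose header uses a lowercase "product".
    "sales_region2.csv": pd.DataFrame({"product": ["A"], "Sales": [200]}),
}
expected = list(next(iter(frames.values())).columns)
mismatched = [name for name, df in frames.items() if list(df.columns) != expected]
print(mismatched)
```

Note that drop_duplicates() cannot tell a true duplicate from two regions that happened to report the same product and amount, so it is worth checking whether identical rows are legitimate before removing them.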

By understanding how to merge CSV files using Python, you can streamline your data processing workflow, making it more efficient and manageable. With the help of the provided code and explanations, you can easily adapt it to your specific needs. Happy coding!