Comprehensive Guide to Git and GitHub for Version Control and Collaboration

Git and GitHub are essential tools for developers to manage code efficiently and collaborate effectively on projects. This guide covers Git (the version control system that runs on your machine) and GitHub (a hosting platform for remote repositories), their key features, a basic PySpark project structure, and practical steps for using the two together.


Git: A Local Version Control System

  1. Version Control
    Git records each set of changes to the code as a commit, making it easy to revert to a previous working version if needed.

  2. Collaboration
    Git allows multiple developers to work on a shared codebase, develop features on separate branches, and merge their work into the final code.


GitHub: A Remote Platform for Collaboration

GitHub is an online platform for hosting, tracking, and collaborating on projects. It acts as a central repository where project teams can store and manage their code.


Basic Project Structure in PySpark

  1. conf Folder
    Contains configuration files for different environments such as Development, Testing, and Production.

  2. lib Folder
    Stores reusable, modular code.

  3. logger.py
    Holds the logging setup and other log-related details.

  4. main.py
    Acts as the entry point of the application, containing the execution sequence and calls to various functions.


Steps to Create a Project Locally

  1. Create Project Folder

    cd ~/Desktop
    mkdir project_name
    cd project_name
    
  2. Add Main File

    vi main.py
    
  3. Create Logger File

    vi logger.py
    
  4. Setup Conf Folder

    mkdir conf
    cd conf
    vi spark.conf
    vi project.conf
    

Note: Use IDEs like PyCharm or Visual Studio Code for easier project structuring.
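
The same layout can also be created without opening an editor. The sketch below is one way to script it, assuming a POSIX shell. It builds the skeleton under a temporary directory instead of the Desktop so it is safe to run as-is, and it also creates the lib folder from the project structure section. The variable name project_dir is illustrative.

```shell
set -e

# Assumption: a throwaway location; swap in your real project path later.
project_dir="$(mktemp -d)/project_name"

# Create the project folder together with its conf and lib subfolders.
mkdir -p "$project_dir/conf" "$project_dir/lib"

# Create empty placeholder files for the entry point and the logger.
touch "$project_dir/main.py" "$project_dir/logger.py"

# Create empty configuration files inside conf.
touch "$project_dir/conf/spark.conf" "$project_dir/conf/project.conf"

# List everything that was created.
find "$project_dir" | sort
```

Once the listing looks right, point project_dir at the real location and fill in the files with an editor or IDE.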


Installing Git

On Mac:

Install Git using Homebrew:

    brew install git

On Windows:

Download Git from git-scm.com and follow the installation wizard.

On either system, verify the installation with:

    git --version
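
The verification step can be wrapped in a small check, sketched below under the assumption of a POSIX shell (Git Bash on Windows, Terminal on Mac). It also prints the name and email that Git would attach to commits, since a commit fails if no identity is configured. The variable name git_status is illustrative.

```shell
# Check whether the git executable is reachable on PATH.
if command -v git >/dev/null 2>&1; then
    git_status="installed"
    git --version
    # Show the configured commit identity, or say that it is missing.
    git config --get user.name  || echo "user.name is not set"
    git config --get user.email || echo "user.email is not set"
else
    git_status="missing"
    echo "git not found on PATH"
fi
```

If either value is reported as not set, it can be configured once per machine with git config --global user.name "Your Name" and git config --global user.email "you@example.com" (placeholder values).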

Working with GitHub

  1. Create a Remote Repository
    Go to GitHub -> Repositories -> New.

  2. Clone Repository Locally

    mkdir mag-project
    cd mag-project
    git clone <repository-link>
    
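  3. Push an Existing Local Project to GitHub
    Use this sequence for a project folder that was not cloned. A cloned repository is already initialized and already has origin configured, so it only needs the add, commit, and push steps.

    git init
    git add .
    git commit -m "Initial Commit"
    git branch -M main
    git remote add origin <URL>
    git push -u origin main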
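
To try the push sequence without touching a real GitHub repository, the sketch below runs it end to end in a temporary directory. It assumes a local bare repository (remote.git) as a stand-in for the GitHub URL, and the identity values and file contents are placeholders. The git branch -M main line renames the current branch to main so the push target matches regardless of the default branch name Git was installed with.

```shell
set -e

workdir=$(mktemp -d)

# Assumption: a local bare repository plays the role of the GitHub remote.
git init --bare --quiet "$workdir/remote.git"

# Create the local project and turn it into a Git repository.
mkdir "$workdir/project"
cd "$workdir/project"
git init --quiet

# Placeholder identity, set for this repository only.
git config user.name "Example User"
git config user.email "user@example.com"

# Add a file, stage it, and record the first commit.
echo 'print("hello")' > main.py
git add .
git commit --quiet -m "Initial Commit"

# Make sure the branch is called main, then connect and push.
git branch -M main
git remote add origin "$workdir/remote.git"
git push --quiet -u origin main

# The remote-tracking branch now points at the pushed commit.
git log --oneline origin/main
```

With a real project, replace the bare repository path with the HTTPS or SSH URL shown on the GitHub repository page.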
    

Important Git Commands

  1. Initialize Repository

    git init
    
  2. Stage Changes

    git add <filename>
    git add .
    
  3. Commit Changes

    git commit -m "Commit message"
    
  4. Push Changes

    git push origin main
    
  5. Check Status

    git status
    
  6. Branching

    • Create a branch:
      git branch <branch-name>
      
    • Switch to a branch:
      git checkout <branch-name>
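
The branching commands fit into a typical feature workflow, sketched below in a throwaway repository. The branch name feature-logging, the file contents, and the identity values are placeholders. The sketch uses git checkout -b, which creates a branch and switches to it in one step, and finishes with git merge to bring the feature back into main.

```shell
set -e

repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"
git config user.email "user@example.com"

# Start with one commit on main.
echo "v1" > main.py
git add main.py
git commit --quiet -m "Initial Commit"
git branch -M main

# Create the feature branch and switch to it in a single command.
git checkout --quiet -b feature-logging

# Do the feature work and commit it on the branch.
echo "logging setup" > logger.py
git add logger.py
git commit --quiet -m "Add logger"

# Return to main and merge the finished feature.
git checkout --quiet main
git merge --quiet feature-logging

# Both branches exist; the current one is marked with an asterisk.
git branch --list
```

Because main had no new commits in the meantime, this merge is a fast-forward: Git simply moves main to the tip of the feature branch.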
      

Handling Merge Conflicts

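Conflicts arise during a merge when two branches change the same lines of a file, or when one branch deletes a file that the other modified, so Git cannot decide which version to keep. Git marks the conflicting regions inside the affected files. Resolve a conflict by editing each file to keep the changes you need and remove the rest (including the conflict markers), then stage the file with git add and complete the merge with git commit.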
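
A conflict is easiest to understand by producing one on purpose. The sketch below, using placeholder branch names, file contents, and identity values, changes the same line of project.conf on two branches, triggers the conflict, and then resolves it by keeping the version from main.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"
git config user.email "user@example.com"

# Common starting point for both branches.
echo "app_name = demo" > project.conf
git add project.conf
git commit --quiet -m "Initial Commit"
git branch -M main

# Change the line on a feature branch.
git checkout --quiet -b feature
echo "app_name = feature-demo" > project.conf
git commit --quiet -am "Rename app on feature"

# Change the same line differently on main.
git checkout --quiet main
echo "app_name = main-demo" > project.conf
git commit --quiet -am "Rename app on main"

# The merge stops and exits non-zero because both sides edited the same line.
git merge feature || echo "merge stopped with a conflict"

# The file now contains both versions, separated by conflict markers
# (lines starting with <<<<<<<, =======, and >>>>>>>).
cat project.conf

# Resolve: write the content to keep, stage it, and finish the merge.
echo "app_name = main-demo" > project.conf
git add project.conf
git commit --quiet -m "Merge feature, keep app name from main"
```

In a real conflict you would normally edit the file by hand, or in an IDE merge tool, rather than overwrite it. The overwrite here only keeps the sketch non-interactive.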


Continuous Integration and Deployment (CI/CD)

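Every project has a central repository that holds its source code. CI/CD builds on that repository: each push can automatically trigger a pipeline that builds, tests, and deploys the code. GitHub, GitLab, and Bitbucket each provide such pipelines alongside repository hosting (GitHub Actions, GitLab CI/CD, and Bitbucket Pipelines, respectively).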


Reverting Changes

  1. Unstaged Changes

    git restore <file-name>
    
  2. Staged Changes

    git restore --staged <file-name>
    git restore <file-name>
    
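  3. Committed Changes
    git reset moves the current branch back to the given commit. By default the later changes stay in the working directory as unstaged edits; add --hard to discard them as well. For commits already pushed to a shared branch, git revert <commit-hash> is the safer choice because it adds a new commit instead of rewriting history.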

    git log
    git reset <commit-hash>
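
The three cases can be walked through in a throwaway repository. The sketch below uses placeholder file contents and identity values, and assumes Git 2.23 or later, the first release that includes git restore. It shows that a plain git reset moves the branch back but leaves the file edits in place until they are restored.

```shell
set -e

repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"
git config user.email "user@example.com"

echo "version 1" > main.py
git add main.py
git commit --quiet -m "Initial Commit"

# 1. Unstaged change: discard the edit in the working directory.
echo "broken edit" > main.py
git restore main.py

# 2. Staged change: unstage it first, then discard the edit.
echo "another edit" > main.py
git add main.py
git restore --staged main.py
git restore main.py

# 3. Committed change: move the branch back to the first commit.
echo "version 2" > main.py
git commit --quiet -am "Second commit"
first_commit=$(git rev-list --max-parents=0 HEAD)
git reset --quiet "$first_commit"

# The commit is gone from history, but main.py still holds the edit.
git status --short

# Discard the leftover edit to get back the committed content.
git restore main.py
cat main.py
```

Running git reset --hard with the same commit hash would combine the last two steps, at the cost of discarding the edits immediately.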
    

Akash

I am working as a Data Engineer
