IT Salaries Survey Cleaning

Sysarmy is a prominent technical community in Argentina that promotes collaboration and knowledge-sharing among IT professionals. The organization has several initiatives, including its biannual Salary Survey, a crowdsourcing project designed to provide transparency and insights into salaries and working conditions in the tech industry.
The Salary Survey collects anonymous data from IT workers, primarily in Argentina and other Spanish-speaking regions. Its main goal is to equip professionals with tools to compare working conditions, identify market trends, and detect inequalities, such as gender gaps or geographic disparities.
In my position as a data analysis intern at Fillex, an Argentine IT solutions firm, I have been assigned the task of cleaning, transforming, and analyzing the 2023.2 edition of the Sysarmy Salary Survey. This assignment aims to evaluate my technical skills and determine my potential role within the team.

Goals

This project aims to conduct an exploratory data analysis (EDA) of Sysarmy's 2023.2 Salary Survey and extract meaningful insights about the IT market in Argentina.

The following specific sub-objectives are outlined:

1. Data Cleaning and Transformation

Detect and correct missing values, inconsistencies, and errors in the dataset.
Standardize formats to facilitate analysis.

2. Exploratory Data Analysis (EDA)

Identify patterns and trends in salaries, roles, and technologies.
Analyze the distribution of demographic and labor data.
Assess inequalities, such as salary gaps by gender or geographic location.

3. Process Documentation

Record each stage of the analysis, from data preparation to final findings.
Create clear and visual reports to communicate the results effectively.

4. Technical Skills Evaluation

Demonstrate proficiency in data handling and analysis using learned tools and techniques.
Define a potential role within the team based on the project's outcomes.

Stakeholders

Fillex SA | Data Team
- Role: They act as mentors and supervisors of the project, reviewing each stage of the analysis to ensure the quality and accuracy of the results. They are also responsible for providing feedback to the intern.
- Interest: Obtain a well-structured and documented project that reflects the intern's technical skills and knowledge of the processes involved. This project could serve as a reference to define the intern's role within the team and organization and highlight their strengths and weaknesses.

Fillex SA | Intern Simón Zanetti
- Role: Execute the technical tasks and effectively document the analysis. Act as a bridge between the raw data and the final insights, working with the learned tools and techniques.
- Interest: Demonstrate skills in data cleaning, exploratory analysis, and visualization while gaining practical experience in a realistic project. Documenting the process is also crucial to measuring professional growth and establishing a potential role within the team.

IT Community
- Role: Indirect beneficiaries of the analysis, as the insights and findings can add extra value to the open results of the Salary Survey.
- Interest: Access new insights that can help IT professionals make informed decisions about their careers, salaries, and working conditions. This also highlights the importance of the survey as a transparency tool in the industry.

Data

Data Specifications

Dataset Dictionary

Tools

We’ll be handling the analysis and data cleanup for Sysarmy’s 2023.2 Salary Survey using a mix of Python tools and libraries. All tasks will be performed in an interactive Jupyter Notebook environment, which allows us to document the analysis process in an organized and reproducible way.

Below are the main tools and libraries used:

re: Used to apply cleaning patterns through regular expressions, making it easier to work with text data and tricky formats.
warnings: Employed to filter out unnecessary warning messages, especially those generated by visualizations in matplotlib, ensuring a cleaner working environment.
pandas: The core library for data manipulation. It allows for efficient data transformation, cleaning, and preliminary analysis.
seaborn: A key tool for data visualization. It helps identify data distribution and patterns, guiding decisions during cleaning.
matplotlib: Complements seaborn by offering greater control and customization over graphical visualizations.
scipy: Used for statistical checks during data cleaning, so that adjustments are backed by evidence rather than intuition.
display: The IPython helper for rendering DataFrames in the notebook, which makes it easier to inspect the changes made at each step of the analysis.

Together, these tools support an efficient, reproducible workflow and help produce reliable, clearly visualized results.
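
As an illustration of the kind of pattern-based cleaning that re makes possible, the sketch below pulls a numeric value out of free-text answers, such as those that might appear in the exchange-rate column. The helper name parse_rate and the sample strings are assumptions made for this example, not code or data from the project.

```python
import re

def parse_rate(raw):
    """Return the first number found in a free-text answer as a float, or None."""
    if raw is None:
        return None
    # One or more digits, optionally followed by a decimal part ("," or ".").
    # Caveat: thousands separators (e.g. "1.000") are not handled by this sketch.
    match = re.search(r"\d+(?:[.,]\d+)?", str(raw))
    if match is None:
        return None
    return float(match.group().replace(",", "."))

# Hypothetical answers; the real column may contain other formats.
samples = ["490", "$ 490", "490,50 (blue)", "no sé"]
parsed = [parse_rate(s) for s in samples]
print(parsed)
```

Answers without any digits come back as None, which leaves the decision about dropping or imputing them to the missing-values step.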

Methodology

The project follows a structured methodology, divided into two main stages: data cleaning and exploratory data analysis (EDA). These stages are organized into separate working files to ensure traceability and modularity.

Data Cleaning
Cleaning is essential to guarantee the quality and reliability of the analysis results. The main steps include:

Renaming Columns:
The dataset columns will be renamed using a clear, descriptive, and consistent naming convention, reflecting the content and purpose of each one.
Reordering Columns:
Columns will be reorganized to prioritize the most relevant information, facilitating the understanding and analysis of the dataset.
Removing Unnecessary Columns:
Columns that do not contribute to the analysis, such as those with redundant or irrelevant information, will be identified and removed.
Handling Missing Values (NaNs):
Missing values will be handled according to context: some will be filled in where a reasonable value can be inferred, and others will be removed.
Reviewing and Removing Duplicates:
Duplicate entries will be identified and removed to preserve the dataset's integrity.
Normalizing Values:
Data will be standardized and transformed as needed to ensure consistency. For example, salaries will be converted to a common currency, and date and time formats will be adjusted.
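
The cleaning steps listed above could be sketched in pandas roughly as follows. This is a minimal illustration on toy rows, not the project's actual notebook: the short column names in RENAMES, the "Unknown" placeholder, and the exchange rate passed to clean are all assumptions made for the example.

```python
import pandas as pd

# Shortened, hypothetical names for a few of the survey's real columns.
RENAMES = {
    "Trabajo de": "role",
    "Tengo (edad)": "age",
    "Último salario mensual o retiro BRUTO (en tu moneda local)": "gross_salary",
    "Salir o seguir contestando?": "survey_flow",
}

def clean(df, ars_per_usd):
    """Apply the cleaning steps in the order they are listed in the text."""
    df = df.rename(columns=RENAMES)                          # renaming columns
    df = df[["role", "age", "gross_salary", "survey_flow"]]  # reordering columns
    df = df.drop(columns=["survey_flow"])                    # removing unnecessary columns
    df = df.dropna(subset=["gross_salary"])                  # NaNs: drop rows without a salary
    df = df.fillna({"role": "Unknown"})                      # NaNs: fill missing roles
    df = df.drop_duplicates()                                # removing duplicates
    df = df.assign(                                          # normalizing values
        role=df["role"].str.strip(),
        gross_salary_usd=(df["gross_salary"] / ars_per_usd).round(2),
    )
    return df.reset_index(drop=True)

# Toy rows with made-up values; only the column names come from the survey.
raw = pd.DataFrame({
    "Tengo (edad)": [25, 31, 31, 40, 28],
    "Trabajo de": ["Developer ", "QA", "QA", None, "SysAdmin"],
    "Último salario mensual o retiro BRUTO (en tu moneda local)": [
        345000.0, 410000.0, 410000.0, 500000.0, None,
    ],
    "Salir o seguir contestando?": ["Terminar encuesta"] * 5,
})

cleaned = clean(raw, ars_per_usd=490.0)  # hypothetical exchange rate
print(cleaned)
```

Keeping every step as a separate assignment makes it easy to inspect the intermediate DataFrame in the notebook after each one.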

Exploratory Data Analysis (EDA)
In this stage, the clean dataset generated in the previous phase will be used to extract meaningful insights that provide a better understanding of the IT market dynamics reflected in the survey.

This analysis will be conducted in a separate notebook, organized to facilitate the interpretation and documentation of results.

The main EDA approaches will include:

Distribution of Key Variables:
Variables such as salaries, years of experience, age, and educational level will be analyzed to understand their distribution and range.
Graphs such as histograms, box plots, and violin plots will be used to identify possible outliers and general patterns.
Relationship Between Demographic and Labor Variables:
Exploration of the relationship between personal characteristics (gender, age, location) and work variables (role, contract type, salary).
Visualizations like scatter plots, heat maps, and stacked bar charts will be used to detect correlations and significant differences.
Identifying Relevant Trends and Patterns:
Analysis of how variables change based on different factors, such as salary evolution according to years of experience or the most in-demand roles in the sector.
Identification of the most common technologies in specific salary ranges or roles.
Evaluating Inequalities:
Study of possible salary gaps by gender and region.
Analysis of disparities in working conditions, such as access to benefits depending on location or company type.
Visual representation of these inequalities through comparative graphs.
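
Before any plotting, the numbers behind several of these approaches can be computed directly in pandas. The sketch below, on made-up values, summarizes a salary distribution, flags outliers with the common 1.5 x IQR rule, and computes a median gap between two groups. The column names gender and gross_salary, the "Mujer Cis" label, and the choice of the IQR rule are assumptions made for illustration.

```python
import pandas as pd

# Toy data; values are invented and do not come from the survey.
clean = pd.DataFrame({
    "gender": ["Varón Cis", "Varón Cis", "Varón Cis",
               "Mujer Cis", "Mujer Cis", "Mujer Cis"],
    "gross_salary": [400.0, 500.0, 3000.0, 380.0, 450.0, 520.0],
})

# Distribution of a key variable: count, mean, quartiles, min and max.
summary = clean["gross_salary"].describe()

# Outlier detection with the 1.5 * IQR rule (the same rule a box plot draws).
q1, q3 = clean["gross_salary"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (clean["gross_salary"] < q1 - 1.5 * iqr) | (clean["gross_salary"] > q3 + 1.5 * iqr)
outliers = clean[is_outlier]

# Inequality check: median salary per group and the relative gap between them.
medians = clean.groupby("gender")["gross_salary"].median()
gap_pct = (medians["Varón Cis"] - medians["Mujer Cis"]) / medians["Varón Cis"] * 100

print(summary)
print(outliers)
print(f"Median gap: {gap_pct:.1f}%")
```

Medians are used here instead of means because a single extreme salary, like the flagged outlier, would otherwise dominate the comparison.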

The decision to divide the project into two notebooks (cleaning and analysis) stems from the need for modularity, which simplifies project maintenance and enables the reuse of clean data for future analyses. This structure also allows for clearer documentation of each stage, which is key both for evaluating the intern's performance and for ensuring the process is reproducible.

Results

Summary

The analysis of the Sysarmy 2023.2 Salary Survey provided several important insights into the IT market in Argentina:

About 50% of respondents live in Ciudad Autónoma de Buenos Aires and another 20% live in Buenos Aires Province, so 7 out of 10 respondents are concentrated in that region.
Respondents' ages range from 18 to 55, with most falling between 28 and 39 and a mean of 33.
75% of respondents are men and 18% are women. Only 2% of respondents do not feel represented by
Most respondents are satisfied with their workplace.
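
Percentages like the ones above can be obtained with value_counts(normalize=True). The sketch below shows the calculation on toy answers; the exact region labels used in the real dataset are an assumption here, and the values are invented.

```python
import pandas as pd

# Toy answers for the "Dónde estás trabajando" column; not real survey data.
region = pd.Series(
    ["Ciudad Autónoma de Buenos Aires"] * 5
    + ["Buenos Aires"] * 2
    + ["Chaco"] * 2
    + ["Córdoba"],
    name="Dónde estás trabajando",
)

# Share of respondents per region, as a percentage.
shares = region.value_counts(normalize=True).mul(100).round(1)

# Combined share of the city and the province of Buenos Aires.
metro_share = shares.loc[["Ciudad Autónoma de Buenos Aires", "Buenos Aires"]].sum()

print(shares)
print(metro_share)
```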

Salary Insights

Tools Insights

Studies Insights

75% of respondents are men, which is consistent with statistics for the sector as a whole, where men are the majority, although the inclusion of women in IT positions grows stronger every year. Respondents from sexual minorities who did not identify with these genders were grouped under the label 'Otros' (Others).
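
One way to perform this kind of grouping in pandas is sketched below. Only "Varón Cis" appears in the dataset dictionary; the other labels and the MAIN_GENDERS list are assumptions made for the example, so the real notebook may use different values.

```python
import pandas as pd

# Labels kept as-is; "Mujer Cis" is an assumed label, not confirmed by the source.
MAIN_GENDERS = ["Varón Cis", "Mujer Cis"]

# Toy answers for the "Me identifico (género)" column.
gender = pd.Series(
    ["Varón Cis", "Mujer Cis", "No binarie", "Varón Cis", None],
    name="Me identifico (género)",
)

# Keep the main labels and missing answers untouched; relabel everything else.
grouped = gender.where(gender.isin(MAIN_GENDERS) | gender.isna(), "Otros")

print(grouped.value_counts(dropna=False))
```

Missing answers are deliberately left as missing rather than folded into 'Otros', so that non-response stays distinguishable from a stated identity.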

Data Specifications

Specification	Details
Shape	5805 rows x 43 columns
Format	CSV
Size	3 MB
Encoding	UTF-8
Date Range	01/07/2023 - 31/12/2023
Geographical Location	Argentina
License	Public Domain
Source	https://github.com/simonzanetti/2023.2-SysArmy-IT-Salaries-Survey/blob/main/dataset.csv
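
A loader that checks the file against these specifications might look like the sketch below. The function name load_survey and its expected_columns parameter are hypothetical, and the in-memory sample stands in for the real CSV so the example runs without the file.

```python
import io
import pandas as pd

def load_survey(source, expected_columns=43):
    """Read the survey CSV and fail early if the column count is unexpected."""
    df = pd.read_csv(source, encoding="utf-8")
    if df.shape[1] != expected_columns:
        raise ValueError(
            f"expected {expected_columns} columns, found {df.shape[1]}"
        )
    return df

# Stand-in for the real file: two columns, two rows, UTF-8 encoded.
sample = io.BytesIO("Tengo (edad),Trabajo de\n25,Developer\n31,QA\n".encode("utf-8"))
survey = load_survey(sample, expected_columns=2)
print(survey.shape)
```

For the real dataset, source would be a local path to dataset.csv and the default of 43 columns would apply.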

Dataset Dictionary

Column Name	Type	Example
Estoy trabajando en	object	Argentina
Dónde estás trabajando	object	Chaco
Dedicación	object	Full-Time
Tipo de contrato	object	Contractor
Último salario mensual o retiro BRUTO (en tu moneda local)	float64	345000
Último salario mensual o retiro NETO (en tu moneda local)	float64	330000
Pagos en dólares	object	Cobro todo el salario en dólares
Si tu sueldo está dolarizado ?Cuál fue el último valor del dólar que tomaron?	object	490
Recibís algún tipo de bono	object	No
A que está atado el bono	object	No recibo bono
Tuviste actualizaciones de tus ingresos laborales durante 2023?	object	No
De que % fue el ajuste total acumulado?	float64	0
En que mes fue el último ajuste?	object	No tuve
Cómo considerás que están tus ingresos laborales comparados con el semestre anterior	int64	3
Contás con beneficios adicionales?	object	Capacitaciones y/o cursos, Clases de idiomas, ...
Qué tan conforme estás con tus ingresos laborales?	int64	3
Trabajo de	object	Developer
Años de experiencia	float64	1
Antiguedad en la empresa actual	float64	2
Tiempo en el puesto actual	float64	1
Cuántas personas a cargo tenés?	int64	0
Plataformas que utilizas en tu puesto actual	object	Docker, Linux
Lenguajes de programación o tecnologías que utilices en tu puesto actual	object	PHP
Frameworks, herramientas y librerías que utilices en tu puesto actual	object	Laravel
Bases de datos	object	MySQL
QA / Testing	object	PHPUnit
Cantidad de personas en tu organización	object	De 11 a 50 personas
Modalidad de trabajo	object	100% remoto
Si trabajós bajo un esquema híbrido Cuántos días a la semana vas a la oficina?	int64	0
La recomendás como un buen lugar para trabajar?	int64	9
Qué tanto estás usando Copilot, ChatGPT u otras herramientas de IA para tu trabajo?	int64	4
Salir o seguir contestando?	object	Responder sobre mis estudios
Máximo nivel de estudios	object	Secundario
Estado	object	Completo
Carrera	object	Ingeniería en Sistemas de Información
Institución educativa	object	UTN - Universidad Tecnológica Nacional
Salir o seguir contestando sobre las guardias?	object	Responder sobre guardias
Tenés guardias?	object	No
Cuánto cobrás por guardia	float64	0
Aclará el número que ingresaste en el campo anterior	object	Porcentaje de mi sueldo bruto
Salir o seguir contestando sobre estudios?	object	Terminar encuesta
Tengo (edad)	int64	25
Me identifico (género)	object	Varón Cis
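
Several columns in this dictionary, such as the platforms and languages columns, hold comma-separated lists in a single cell. A possible way to count individual entries is to split and explode them, as in the sketch below. The sample rows are invented (only "Docker, Linux" comes from the dictionary), and splitting on commas is an assumption about how the answers are delimited.

```python
import pandas as pd

COLUMN = "Plataformas que utilizas en tu puesto actual"

# Toy answers; only "Docker, Linux" is taken from the dictionary example.
df = pd.DataFrame({
    COLUMN: ["Docker, Linux", "Linux", None, "Docker, Linux, AWS"],
})

platforms = (
    df[COLUMN]
    .dropna()        # skip respondents who left the field empty
    .str.split(",")  # one list of platforms per respondent
    .explode()       # one row per (respondent, platform) pair
    .str.strip()     # remove the space left after each comma
)

counts = platforms.value_counts()
print(counts)
```

Because explode keeps the original index, the long-format result can be joined back to salary or role columns when looking at technologies by salary range.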