Simon Zanetti
IT Salaries Survey Cleaning
Sysarmy is a prominent technical community in Argentina that promotes collaboration and knowledge-sharing among IT professionals. The organization has several initiatives, including its biannual Salary Survey, a crowdsourcing project designed to provide transparency and insights into salaries and working conditions in the tech industry.
The Salary Survey collects anonymous data from IT workers, primarily in Argentina and other Spanish-speaking regions. Its main goal is to equip professionals with tools to compare working conditions, identify market trends, and detect inequalities, such as gender gaps or geographic disparities.
In my position as a data analysis intern at Fillex, an Argentine IT solutions firm, I have been assigned the task of cleaning, transforming, and analyzing the 2023.2 edition of the Sysarmy Salary Survey. This assignment aims to evaluate my technical skills and determine my potential role within the team.
Goals
This project aims to conduct an exploratory data analysis (EDA) of Sysarmy's 2023.2 Salary Survey and extract meaningful insights about the IT market in Argentina.
The following specific sub-objectives are outlined:
1. Data Cleaning and Transformation
   - Detect and correct missing values, inconsistencies, and errors in the dataset.
   - Standardize formats to facilitate analysis.
2. Exploratory Data Analysis (EDA)
   - Identify patterns and trends in salaries, roles, and technologies.
   - Analyze the distribution of demographic and labor data.
   - Assess inequalities, such as salary gaps by gender or geographic location.
3. Process Documentation
   - Record each stage of the analysis, from data preparation to final findings.
   - Create clear and visual reports to communicate the results effectively.
4. Technical Skills Evaluation
   - Demonstrate proficiency in data handling and analysis using learned tools and techniques.
   - Define a potential role within the team based on the project's outcomes.
Stakeholders
- Fillex SA | Data Team
  - Role: They act as mentors and supervisors of the project, reviewing each stage of the analysis to ensure the quality and accuracy of the results. They are also responsible for providing feedback to the intern.
  - Interest: Obtain a well-structured and documented project that reflects the intern's technical skills and knowledge of the processes involved. This project could serve as a reference to define the intern's role within the team and organization and highlight their strengths and weaknesses.
- Fillex SA | Intern Simón Zanetti
  - Role: Execute the technical tasks and effectively document the analysis. Act as a bridge between the raw data and the final insights, working with the learned tools and techniques.
  - Interest: Demonstrate skills in data cleaning, exploratory analysis, and visualization while gaining practical experience in a realistic project. Documenting the process is also crucial to measuring professional growth and establishing a potential role within the team.
- IT Community
  - Role: Indirect beneficiaries of the analysis, as the insights and findings can add extra value to the open results of the Salary Survey.
  - Interest: Access new insights that can help IT professionals make informed decisions about their careers, salaries, and working conditions. This also highlights the importance of the survey as a transparency tool in the industry.
Data
Data Specifications
Dataset Dictionary
Tools
The cleaning and analysis of Sysarmy's 2023.2 Salary Survey rely on a mix of Python tools and libraries. All tasks are performed in an interactive Jupyter Notebook environment, which allows the analysis process to be documented in an organized and reproducible way.
Below are the main tools and libraries used:
- re: Used to apply cleaning patterns through regular expressions, making it easier to work with text data and tricky formats.
- warnings: Employed to filter out unnecessary warning messages, especially those generated by visualizations in matplotlib, ensuring a cleaner working environment.
- pandas: The core library for data manipulation. It allows for efficient data transformation, cleaning, and preliminary analysis.
- seaborn: A key tool for data visualization. It helps identify data distribution and patterns, guiding decisions during cleaning.
- matplotlib: Complements seaborn by offering greater control and customization over graphical visualizations.
- scipy: Used for statistical analysis during data cleaning, so that adjustments rest on statistical evidence rather than intuition.
- display: Improves the presentation of datasets in the notebook, making it easy to inspect the changes made at each step of the analysis.

Together, these tools support an efficient and reproducible workflow for both cleaning and analysis, and help produce reliable, clearly visualized results.
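
As an illustration of how `re`, `warnings`, and `pandas` can work together during cleaning, the sketch below extracts a number from free-text answers such as an exchange rate. The helper `extract_number` and the sample answers are hypothetical and are not taken from the project's notebooks.

```python
import re
import warnings

import pandas as pd

# Broadly silence warnings to keep notebook output readable.
# A narrower filter (by category or module) is usually preferable.
warnings.filterwarnings("ignore")

# Matches an integer or a decimal written with "." or "," as separator.
_NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def extract_number(text):
    """Hypothetical helper: return the first number in `text` as a float, or None."""
    if pd.isna(text):
        return None
    match = _NUMBER.search(str(text))
    if match is None:
        return None
    return float(match.group().replace(",", "."))


# Invented raw answers, only to exercise the helper.
raw = pd.Series(["490", "$ 490,50", "aprox 500 pesos", "no sé", None])
rates = raw.map(extract_number)
print(rates)
```

This simple pattern does not handle thousands separators (for example `1.000`), so real survey answers would likely need additional rules.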
Methodology
The project follows a structured methodology, divided into two main stages: data cleaning and exploratory data analysis (EDA). These stages are organized into separate working files to ensure traceability and modularity.
Data Cleaning
Cleaning is essential to guarantee the quality and reliability of the analysis results. The main steps include:
- Renaming Columns: Columns are renamed using a clear, descriptive, and consistent naming convention that reflects the content and purpose of each one.
- Reordering Columns: Columns are reorganized to prioritize the most relevant information, making the dataset easier to understand and analyze.
- Removing Unnecessary Columns: Columns that do not contribute to the analysis, such as those with redundant or irrelevant information, are identified and removed.
- Handling Missing Values (NaNs): Missing values are handled according to context. Some are replaced with a reasonable value, and others are removed.
- Reviewing and Removing Duplicates: Duplicate entries are identified and removed to preserve the dataset's integrity.
- Normalizing Values: Data is standardized and transformed as needed to ensure consistency. For example, salaries are converted to a common currency, and date and time formats are adjusted.
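
The steps above can be sketched in pandas as follows. This is a minimal illustration on invented rows, not the project's actual notebook: the new column names (`age`, `role`, `gross_salary`), the `"unknown"` placeholder, and the exchange rate `ARS_PER_USD` are all assumptions.

```python
import numpy as np
import pandas as pd

SALARY = "Último salario mensual o retiro BRUTO (en tu moneda local)"

# Invented rows that reuse a few real column names from the dataset.
raw = pd.DataFrame({
    "Tengo (edad)": [25, 31, 31, 40, 28],
    "Trabajo de": ["Developer", "developer ", "developer ", "SysAdmin", None],
    SALARY: [345000.0, 500000.0, 500000.0, np.nan, 600000.0],
    "Salir o seguir contestando?": ["Terminar encuesta"] * 5,
})

ARS_PER_USD = 490.0  # assumption: illustrative exchange rate only

# 1. Rename columns to short, consistent names (hypothetical names).
df = raw.rename(columns={"Tengo (edad)": "age", "Trabajo de": "role", SALARY: "gross_salary"})
# 2. Drop a column that carries no analytical value.
df = df.drop(columns=["Salir o seguir contestando?"])
# 3. Reorder columns so the most relevant come first.
df = df[["role", "age", "gross_salary"]]
# 4. Handle missing values: fill where a placeholder is reasonable, drop otherwise.
df = df.assign(role=df["role"].fillna("unknown"))
df = df.dropna(subset=["gross_salary"])
# 5. Remove exact duplicate rows.
df = df.drop_duplicates()
# 6. Normalize text values and convert salaries to a common currency.
df = df.assign(
    role=df["role"].str.strip().str.lower(),
    gross_salary_usd=(df["gross_salary"] / ARS_PER_USD).round(2),
)
clean = df.reset_index(drop=True)
print(clean)
```

Using `assign` rather than in-place column assignment keeps each step a pure transformation, which makes the order of operations easy to follow and change.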
Exploratory Data Analysis (EDA)
In this stage, the clean dataset generated in the previous phase will be used to extract meaningful insights that provide a better understanding of the IT market dynamics reflected in the survey.
This analysis will be conducted in a separate notebook, organized to facilitate the interpretation and documentation of results.
The main EDA approaches will include:
- Distribution of Key Variables: Variables such as salaries, years of experience, age, and educational level will be analyzed to understand their distribution and range. Graphs such as histograms, box plots, and violin plots will be used to identify possible outliers and general patterns.
- Relationship Between Demographic and Labor Variables: Exploration of the relationship between personal characteristics (gender, age, location) and work variables (role, contract type, salary). Visualizations like scatter plots, heat maps, and stacked bar charts will be used to detect correlations and significant differences.
- Identifying Relevant Trends and Patterns: Analysis of how variables change based on different factors, such as salary evolution according to years of experience or the most in-demand roles in the sector. Identification of the most common technologies in specific salary ranges or roles.
- Evaluating Inequalities: Study of possible salary gaps by gender and region. Analysis of disparities in working conditions, such as access to benefits depending on location or company type. Visual representation of these inequalities through comparative graphs.
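
Two of these approaches, outlier detection and salary gaps between groups, can be sketched numerically before any plotting. The example below is an assumption-laden illustration: the column names, the group labels, the helpers `iqr_outliers` and `median_gap`, and the data are all invented, and the 1.5 x IQR rule is one common convention rather than the method confirmed by the project.

```python
import pandas as pd

# Invented sample; `gender` and `gross_salary` are hypothetical cleaned column names.
df = pd.DataFrame({
    "gender": ["Hombre", "Hombre", "Hombre", "Mujer", "Mujer", "Otros"],
    "gross_salary": [900.0, 1000.0, 1100.0, 800.0, 900.0, 5000.0],
})


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def median_gap(data: pd.DataFrame, group_col: str, value_col: str, reference: str) -> pd.Series:
    """Percentage difference of each group's median versus the reference group."""
    medians = data.groupby(group_col)[value_col].median()
    return (medians / medians[reference] - 1) * 100


summary = df["gross_salary"].describe()          # count, mean, quartiles, ...
outliers = df[iqr_outliers(df["gross_salary"])]  # rows flagged as outliers
gap = median_gap(df, "gender", "gross_salary", reference="Hombre")

print(summary)
print(outliers)
print(gap)
```

The same columns could then be passed to a plotting call such as `seaborn.boxplot(data=df, x="gender", y="gross_salary")` to show the comparison visually.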
The decision to divide the project into two notebooks (cleaning and analysis) stems from the need for modularity, which simplifies project maintenance and enables the reuse of clean data in future analyses. This structure also allows for clearer documentation of each stage, which is key both for evaluating the intern's performance and for ensuring that the process is reproducible.
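
One simple way to connect the two notebooks is to have the cleaning notebook write a CSV file that the analysis notebook reads. The sketch below assumes this handoff; the file name `clean_dataset.csv` and the helpers `save_clean` and `load_clean` are hypothetical, and a temporary directory stands in for the project folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_clean(df: pd.DataFrame, path) -> None:
    """Last cell of the cleaning notebook: persist the cleaned data."""
    df.to_csv(path, index=False, encoding="utf-8")


def load_clean(path) -> pd.DataFrame:
    """First cell of the analysis notebook: load the cleaned data."""
    return pd.read_csv(path, encoding="utf-8")


# Invented cleaned rows, only to demonstrate the round trip.
cleaned = pd.DataFrame({
    "role": ["developer", "sysadmin"],
    "gross_salary": [345000.0, 900000.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "clean_dataset.csv"  # hypothetical file name
    save_clean(cleaned, path)
    reloaded = load_clean(path)

print(reloaded)
```

CSV does not preserve every dtype (dates and categoricals, for instance), so a format such as Parquet may be worth considering if the cleaned dataset relies on them.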
Results
Summary
The analysis of the Sysarmy 2023.2 Salary Survey provided several important insights into the IT market in Argentina:
- About 50% of respondents live in the Ciudad Autónoma de Buenos Aires and another 20% live in Buenos Aires Province, so 7 out of 10 respondents are concentrated in that region.
- Respondents are between 18 and 55 years old, with most falling in the 28 to 39 range and a mean age of 33.
- 75% of respondents are men and 18% are women. Only 2% of respondents do not feel represented by
- Most respondents are satisfied with their workplace.
Salary Insights
Tools Insights
Studies Insights
75% of respondents are men, which is consistent with statistics for the sector, which is predominantly male, although the inclusion of women in IT roles grows stronger year after year. Sexual minorities who did not identify with the listed genders were grouped under the label 'Otros'.
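
A grouping like the one described above could be sketched as a mapping over the gender column. Only the column name, the value `Varón Cis`, and the label `Otros` come from this document. The other raw value, the target labels `Hombre` and `Mujer`, and the helper `group_gender` are assumptions made for illustration.

```python
import pandas as pd

GENDER_COLUMN = "Me identifico (género)"

# Assumption: raw labels mapped to the two majority groups.
GENDER_GROUPS = {"Varón Cis": "Hombre", "Mujer Cis": "Mujer"}


def group_gender(column: pd.Series) -> pd.Series:
    """Map known labels, send any other answer to 'Otros', keep missing values missing."""
    grouped = column.map(GENDER_GROUPS)
    unmapped = column.notna() & grouped.isna()
    return grouped.mask(unmapped, "Otros")


# Invented responses, only to exercise the helper.
responses = pd.Series(
    ["Varón Cis", "Mujer Cis", "No binarie", "Varón Cis", None],
    name=GENDER_COLUMN,
)
grouped = group_gender(responses)
shares = grouped.value_counts(normalize=True) * 100  # percentage per group

print(grouped)
print(shares)
```

Keeping missing answers as missing, instead of folding them into `Otros`, avoids inflating that group with respondents who simply skipped the question.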
Data Specifications
Specification | Details |
---|---|
Shape | 5805 rows x 43 columns |
Format | CSV |
Size | 3 MB |
Encoding | UTF-8 |
Date Range | 01/07/2023 - 31/12/2023 |
Geographical Location | Argentina |
Licence | Public Domain |
Source | https://github.com/simonzanetti/2023.2-SysArmy-IT-Salaries-Survey/blob/main/dataset.csv |
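
The specifications above can double as a sanity check when loading the file. The sketch below is one possible approach, with a hypothetical `load_survey` helper. It reads a CSV as UTF-8 and optionally verifies the shape, and a tiny in-memory CSV stands in for the real `dataset.csv`.

```python
import io

import pandas as pd

EXPECTED_SHAPE = (5805, 43)  # rows x columns, from the specification table


def load_survey(source, expected_shape=None) -> pd.DataFrame:
    """Hypothetical helper: read the survey CSV and optionally verify its shape."""
    df = pd.read_csv(source, encoding="utf-8")
    if expected_shape is not None and df.shape != expected_shape:
        raise ValueError(f"expected shape {expected_shape}, got {df.shape}")
    return df


# Stand-in data so the example runs without the real file.
sample = io.StringIO("Estoy trabajando en,Dedicación\nArgentina,Full-Time\n")
df = load_survey(sample, expected_shape=(1, 2))
print(df)

# With the real file, the call would look like:
# df = load_survey("dataset.csv", expected_shape=EXPECTED_SHAPE)
```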

Dataset Dictionary

Column Name | Type | Example |
---|---|---|
Estoy trabajando en | object | Argentina |
Dónde estás trabajando | object | Chaco |
Dedicación | object | Full-Time |
Tipo de contrato | object | Contractor |
Último salario mensual o retiro BRUTO (en tu moneda local) | float64 | 345000 |
Último salario mensual o retiro NETO (en tu moneda local) | float64 | 330000 |
Pagos en dólares | object | Cobro todo el salario en dólares |
Si tu sueldo está dolarizado ?Cuál fue el último valor del dólar que tomaron? | object | 490 |
Recibís algún tipo de bono | object | No |
A que está atado el bono | object | No recibo bono |
Tuviste actualizaciones de tus ingresos laborales durante 2023? | object | No |
De que % fue el ajuste total acumulado? | float64 | 0 |
En que mes fue el último ajuste? | object | No tuve |
Cómo considerás que están tus ingresos laborales comparados con el semestre anterior | int64 | 3 |
Contás con beneficios adicionales? | object | Capacitaciones y/o cursos, Clases de idiomas, ... |
Qué tan conforme estás con tus ingresos laborales? | int64 | 3 |
Trabajo de | object | Developer |
Años de experiencia | float64 | 1 |
Antiguedad en la empresa actual | float64 | 2 |
Tiempo en el puesto actual | float64 | 1 |
Cuántas personas a cargo tenés? | int64 | 0 |
Plataformas que utilizas en tu puesto actual | object | Docker, Linux |
Lenguajes de programación o tecnologías que utilices en tu puesto actual | object | PHP |
Frameworks, herramientas y librerías que utilices en tu puesto actual | object | Laravel |
Bases de datos | object | MySQL |
QA / Testing | object | PHPUnit |
Cantidad de personas en tu organización | object | De 11 a 50 personas |
Modalidad de trabajo | object | 100% remoto |
Si trabajós bajo un esquema híbrido Cuántos días a la semana vas a la oficina? | int64 | 0 |
La recomendás como un buen lugar para trabajar? | int64 | 9 |
Qué tanto estás usando Copilot, ChatGPT u otras herramientas de IA para tu trabajo? | int64 | 4 |
Salir o seguir contestando? | object | Responder sobre mis estudios |
Máximo nivel de estudios | object | Secundario |
Estado | object | Completo |
Carrera | object | Ingeniería en Sistemas de Información |
Institución educativa | object | UTN - Universidad Tecnológica Nacional |
Salir o seguir contestando sobre las guardias? | object | Responder sobre guardias |
Tenés guardias? | object | No |
Cuánto cobrás por guardia | float64 | 0 |
Aclará el número que ingresaste en el campo anterior | object | Porcentaje de mi sueldo bruto |
Salir o seguir contestando sobre estudios? | object | Terminar encuesta |
Tengo (edad) | int64 | 25 |
Me identifico (género) | object | Varón Cis |
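
Several columns in the dictionary, such as the platforms and languages columns, hold comma-separated lists in a single cell (for example `Docker, Linux`). A possible way to count individual items is to split and explode them, as sketched below. The helper `count_items` and the sample rows are hypothetical, and lowercasing is an assumed normalization choice.

```python
import pandas as pd

COLUMN = "Plataformas que utilizas en tu puesto actual"

# Invented answers that follow the "item, item" pattern shown in the dictionary.
df = pd.DataFrame({COLUMN: ["Docker, Linux", "Linux", None, "docker , AWS"]})


def count_items(column: pd.Series) -> pd.Series:
    """Count individual items in a column of comma-separated strings."""
    items = (
        column.dropna()      # ignore respondents who left the field empty
        .str.split(",")      # "Docker, Linux" -> ["Docker", " Linux"]
        .explode()           # one row per item
        .str.strip()         # remove surrounding whitespace
        .str.lower()         # treat "Docker" and "docker" as the same item
    )
    return items[items != ""].value_counts()


counts = count_items(df[COLUMN])
print(counts)
```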