Data Engineer
Build and maintain the pipelines, platforms, and infrastructure that collect, transform, store, and deliver data at scale — enabling data scientists, analysts, and AI systems to work with reliable, high-quality data across Sri Lanka's growing data economy and globally.
A Data Engineer designs, builds, and operates the data infrastructure that organisations depend on for analytics, business intelligence, machine learning, and operational intelligence. While a Data Scientist builds models and a Data Analyst interprets results, the Data Engineer builds the pipelines that move raw data from source systems into the storage and processing platforms where analysis happens. Without well-engineered data infrastructure, data science and analytics cannot function.
Data engineering is one of the fastest-growing and highest-paying roles in the global technology industry. The proliferation of data sources (transactional databases, web and app clickstreams, IoT sensors, social media, third-party APIs, log files, financial feeds) has created an enormous demand for engineers who can reliably collect, clean, transform, and deliver this data at scale. The explosion of cloud data platforms (Snowflake, Databricks, AWS Redshift, Google BigQuery, Azure Synapse Analytics) has transformed data engineering from a specialised on-premises discipline into a cloud-native engineering specialty accessible from anywhere.
In Sri Lanka, data engineering demand is concentrated in the financial services sector (banks and insurance companies processing transaction data for fraud detection, regulatory reporting, and customer analytics), the telecommunications sector (Dialog Axiata, SLT-Mobitel, and Hutch processing call records, network data, and customer behaviour data), the IT services export sector (Virtusa, WSO2, 99x Technology, Mitra Innovation delivering data platform services to international clients), and the e-commerce sector (Takas.lk, Ikman.lk, Kapruka processing clickstream and transaction data). The Ministry of Statistics and the Central Bank of Sri Lanka also employ data engineers for national statistical data platforms.
Globally, Sri Lankan data engineers are in strong demand. The combination of strong mathematical foundations (from the A/L Combined Mathematics curriculum), English language competence, and relatively lower cost compared to US/UK data engineers makes Sri Lankan data engineers attractive for remote and offshore roles. Companies like Virtusa and 99x Technology deliver significant data engineering work for US and European financial services clients from Sri Lanka. The self-directed learning resources available for data engineering (dbt, Apache Spark, Airflow, Snowflake, and all major cloud data platforms have free tiers and excellent documentation) make it one of the most accessible high-value technology specialisations for self-taught professionals.
What a Data Engineer does daily
- Data pipeline design and development: building ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) pipelines that move data from source systems into data warehouses, data lakes, or data lakehouses. The modern paradigm has shifted from ETL (transforming before loading, using on-premises tools like Informatica or SSIS) to ELT (loading raw data first, then transforming it with SQL inside the data warehouse), which is the standard pattern with cloud data warehouses. Pipeline orchestration relies on Apache Airflow, the most widely used open-source pipeline orchestrator: it is free, defines pipeline dependencies as Python-based DAGs (Directed Acyclic Graphs), and is the standard tool in the Sri Lankan and global data engineering market. Transformation relies on dbt (data build tool), the most important transformation tool in modern data engineering: it runs SQL transformations inside the warehouse, brings version control to SQL, tests data quality, and generates documentation; dbt Core is free and open-source, and its adoption is the most significant tooling shift in data engineering since 2020.
- Data warehouse and lakehouse architecture — designing and building the storage and query layer for analytical data; Snowflake (the dominant cloud data warehouse; SQL-based; separated storage and compute; auto-scaling; Snowflake Time Travel for historical data access; Snowflake Marketplace for data sharing; widely used by Sri Lankan IT services companies for international client data platforms); Google BigQuery (serverless data warehouse; the most cost-effective option for large-scale analytics; columnar storage; partitioned and clustered tables for query optimisation; used extensively by global tech companies); AWS Redshift (Amazon's data warehouse; RA3 node type with managed storage; Redshift Spectrum for querying S3 data lakes; widely used in US financial services); Azure Synapse Analytics (Microsoft's integrated analytics service; SQL pools, Spark pools, and data integration in one platform; the most relevant for Sri Lankan organisations already on Microsoft Azure); Delta Lake and Apache Iceberg (open table formats for lakehouse architectures — combining data lake storage economics with data warehouse ACID transaction capabilities; increasingly the default architecture for new data platforms)
- Data ingestion and streaming — building the systems that collect data from source systems and deliver it to the data platform; batch ingestion (scheduled full or incremental loads from databases, APIs, and files — the most common pattern in Sri Lankan enterprise data engineering); real-time streaming ingestion (Apache Kafka — the most widely deployed distributed event streaming platform; producers, consumers, topics, partitions, consumer groups; used by large Sri Lankan telcos and banks for real-time transaction and event data; free self-hosted; Confluent Cloud managed service); AWS Kinesis (managed streaming for AWS-native architectures); Apache Flink (stream processing engine for real-time analytics; complementary to Kafka); change data capture (CDC — capturing database changes in real-time; Debezium — the most widely used open-source CDC tool; captures PostgreSQL, MySQL, and Oracle transaction log changes and publishes to Kafka)
- Data transformation and modelling — structuring raw data into analytical models that are useful for reporting and analysis; dimensional modelling (Ralph Kimball's approach — star schemas with fact tables and dimension tables; the standard modelling pattern for data warehouses; the ability to design a correct star schema is a foundational data engineering skill); dbt transformations (writing modular SQL transformations with refs, sources, and tests; dbt documentation and lineage; dbt tests for data quality validation); data vault modelling (Dan Linstedt — an alternative to dimensional modelling for large enterprise data warehouses with complex source system landscapes; used in some Sri Lankan banking data platforms); slowly changing dimensions (handling how dimension data changes over time — SCD Type 1, 2, 3)
- Data quality and observability — ensuring that data in the pipeline is accurate, complete, consistent, and timely; dbt tests (schema tests: not_null, unique, accepted_values, relationships; custom SQL tests for business logic validation); Great Expectations (open-source data quality framework — the most widely used Python-based data validation library; free); Monte Carlo or Bigeye (commercial data observability platforms); data quality SLAs (defining acceptable thresholds for completeness and freshness); data lineage (tracking where each field in a report came from — which source systems, which transformations; critical for debugging data quality issues and for regulatory compliance in banking)
- Cloud data platform operations — managing and optimising cloud data platform performance and cost; Snowflake performance tuning (clustering keys; query profiling in Query History; materialized views; resource monitors for cost control); BigQuery cost optimisation (slot reservations; partitioning strategies; clustered tables; query cost estimation before execution); Redshift optimisation (distribution keys; sort keys; ANALYZE and VACUUM; WLM — Workload Management); Azure Synapse performance (dedicated SQL pool distribution; statistics maintenance; workload management); the ability to reduce cloud data platform costs while maintaining query performance is highly valued by cost-conscious data platform owners
- Data lakehouse and object storage management — managing data stored in cloud object storage (AWS S3, Azure Blob Storage, Google Cloud Storage); file format optimisation (Parquet — the most widely used columnar file format for analytical workloads; Avro for streaming data with schema evolution; ORC; Delta Lake and Iceberg table formats for ACID transactions on object storage); data partitioning strategies; storage lifecycle policies (automatic archival to cheaper storage tiers); data compaction (merging small files — a common performance problem in streaming pipelines); the data lakehouse pattern (Delta Lake on Databricks; Apache Iceberg on AWS Glue or Snowflake) is the dominant architecture for new large-scale data platforms in 2026
- Data infrastructure as code and DataOps: applying software engineering discipline to data infrastructure; Terraform for provisioning cloud data infrastructure (Snowflake, Redshift, BigQuery, S3 buckets, IAM roles); CI/CD for dbt transformations (GitHub Actions or GitLab CI running dbt tests and deploying transformations on merge); data platform version control; environment management (development, staging, production data environments); DataOps, the application of DevOps practices to data engineering, is the main marker of professional maturity that separates senior data engineers from junior ones
- Data governance and compliance for data pipelines — implementing the data governance controls that ensure regulated data is handled correctly in data pipelines; data classification (identifying PII — Personally Identifiable Information — in data pipelines; particularly important under Sri Lanka's Personal Data Protection Act 2022); data masking and anonymisation (masking customer names, NIC numbers, phone numbers in development and testing environments; dynamic data masking in Snowflake and BigQuery); data retention policies (purging data that has exceeded its retention period from all pipeline stages); GDPR compliance for European client data (relevant for Sri Lankan IT services companies delivering data platforms for UK and EU clients)
- API and database integration — connecting data pipelines to source systems; REST API ingestion (using Python requests or Singer taps to pull data from third-party APIs — Salesforce, HubSpot, Stripe, Google Analytics, Facebook Ads — the most common data sources for modern digital business analytics); database replication (PostgreSQL logical replication; MySQL binlog replication; Oracle GoldenGate; understanding how database change capture works at a technical level); JDBC connectors for traditional enterprise data warehouse sources; FTP/SFTP ingestion for legacy file-based data sources (still common in Sri Lankan banking and insurance for inter-system data exchange)
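The ELT pattern described in the pipeline bullet above (load raw data first, then transform it with SQL inside the warehouse) can be sketched with Python's built-in sqlite3 module standing in for a cloud data warehouse. This is a minimal illustration under stated assumptions, not a production pipeline: the table names raw_orders and stg_orders and the sample rows are invented for the example.

```python
import sqlite3

# Invented sample rows as they might arrive from a source system: all text, slightly messy.
raw_orders = [
    ("1001", "2024-01-05", " 2500.00 ", "lkr"),
    ("1002", "2024-01-05", "1200.50", "LKR"),
    ("1003", "2024-01-06", "", "LKR"),  # missing amount
]

conn = sqlite3.connect(":memory:")  # in-memory database standing in for the warehouse

# Extract + Load: land the source rows untouched in a raw table.
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, order_date TEXT, amount TEXT, currency TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_orders)

# Transform: clean and type the data with SQL, after it has been loaded.
conn.execute(
    """
    CREATE TABLE stg_orders AS
    SELECT CAST(order_id AS INTEGER)  AS order_id,
           order_date,
           CAST(TRIM(amount) AS REAL) AS amount,
           UPPER(currency)            AS currency
    FROM raw_orders
    WHERE TRIM(amount) <> ''
    """
)

# Downstream query against the cleaned table.
daily = conn.execute(
    "SELECT order_date, SUM(amount) FROM stg_orders GROUP BY order_date ORDER BY order_date"
).fetchall()
print(daily)
```

Because the raw table is kept as loaded, the transformation can be corrected and re-run later without going back to the source system, which is the main practical argument for ELT over ETL.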
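The slowly changing dimension handling mentioned in the modelling bullet above is easiest to see in a small example. The sketch below shows one common way to apply an SCD Type 2 change: the current row is closed rather than overwritten, and a new current row is appended. The names CustomerVersion and apply_scd2, and the sample customer, are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerVersion:
    customer_id: int          # business key from the source system
    city: str                 # the attribute whose history is tracked
    valid_from: date
    valid_to: Optional[date]  # None means the version is still open
    is_current: bool


def apply_scd2(history: List[CustomerVersion], customer_id: int,
               new_city: str, change_date: date) -> List[CustomerVersion]:
    """Return a new history list with an SCD Type 2 change applied."""
    result = []
    needs_new_row = True
    for row in history:
        if row.customer_id == customer_id and row.is_current:
            if row.city == new_city:
                needs_new_row = False  # nothing changed, keep the row as is
                result.append(row)
            else:
                # Close the old version instead of overwriting it.
                result.append(replace(row, valid_to=change_date, is_current=False))
        else:
            result.append(row)
    if needs_new_row:
        result.append(CustomerVersion(customer_id, new_city, change_date, None, True))
    return result


history = [CustomerVersion(1, "Colombo", date(2023, 1, 1), None, True)]
history = apply_scd2(history, 1, "Kandy", date(2024, 6, 1))
for version in history:
    print(version)
```

By contrast, SCD Type 1 would simply overwrite the city and lose the history, while Type 3 would keep only the previous value in an extra column.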
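The four dbt schema tests named in the data quality bullet above (not_null, unique, accepted_values, relationships) each assert a simple property of a column. dbt itself declares these tests in YAML and compiles them to SQL, so the plain-Python functions below are not dbt's implementation. They are a rough sketch of what each test checks, run against invented sample rows.

```python
from collections import Counter


def not_null(rows, column):
    """Rows where `column` is missing or None."""
    return [r for r in rows if r.get(column) is None]


def unique(rows, column):
    """Non-null values of `column` that appear more than once."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


def accepted_values(rows, column, allowed):
    """Rows whose non-null `column` value is outside the allowed set."""
    return [r for r in rows if r.get(column) is not None and r[column] not in allowed]


def relationships(child_rows, column, parent_rows, parent_column):
    """Child rows whose non-null foreign key has no matching parent row."""
    parent_keys = {p[parent_column] for p in parent_rows}
    return [r for r in child_rows
            if r.get(column) is not None and r[column] not in parent_keys]


# Invented sample data containing one violation of each rule.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "status": "paid"},
    {"order_id": 11, "customer_id": 2, "status": "refunded"},
    {"order_id": 11, "customer_id": 3, "status": "paid"},
    {"order_id": 12, "customer_id": None, "status": "pending"},
]

failures = {
    "not_null(customer_id)": not_null(orders, "customer_id"),
    "unique(order_id)": unique(orders, "order_id"),
    "accepted_values(status)": accepted_values(orders, "status", {"paid", "pending"}),
    "relationships(customer_id)": relationships(orders, "customer_id",
                                                customers, "customer_id"),
}
for name, failed in failures.items():
    print(name, "->", len(failed), "failure(s)")
```

Each function returns the offending rows or values rather than a boolean, which mirrors the common practice of making a data test pass only when its failure set is empty.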
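The data masking point in the governance bullet above (hiding names, NIC numbers, and phone numbers in development and testing environments) might look roughly like the following. This sketch shows static masking applied before data is copied to a lower environment; it is not the dynamic data masking feature of Snowflake or BigQuery. The function names, the SECRET_KEY placeholder, and the sample record are all assumptions made for illustration.

```python
import hashlib
import hmac

# Placeholder only: a real key would come from a secrets manager, never from source code.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymise(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash so joins still work but the value is hidden."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_phone(phone: str, visible: int = 3) -> str:
    """Keep only the last `visible` digits of a phone number."""
    digits = [c for c in phone if c.isdigit()]
    hidden = max(len(digits) - visible, 0)
    return "*" * hidden + "".join(digits[hidden:])


def mask_record(record: dict) -> dict:
    """Return a masked copy of the record; non-PII fields pass through unchanged."""
    masked = dict(record)
    masked["name"] = "REDACTED"
    masked["nic"] = pseudonymise(record["nic"])
    masked["phone"] = mask_phone(record["phone"])
    return masked


# Made-up customer record, not real data.
record = {"name": "A. Perera", "nic": "199012345678",
          "phone": "+94 77 123 4567", "balance": 1500.0}
masked = mask_record(record)
print(masked)
```

A keyed hash is used for the NIC number so that the same input always maps to the same token, which keeps joins between masked tables working. Whether that counts as sufficient anonymisation under a given regulation is a legal question that this sketch does not settle.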
Step-by-Step Career Roadmap
- Build strong mathematics foundations — data engineering requires mathematical thinking (set theory, logic, algebraic manipulation); Khan Academy Mathematics through Algebra 2 and Precalculus; the mathematical intuition developed at this stage underpins the SQL window functions, statistical data quality assessments, and performance optimisation calculations that data engineering requires
- Learn Python fundamentals: Python is the primary programming language of data engineering; CS50x (Harvard, free) Weeks 1–5 build general programming fundamentals in C before the course introduces Python in Week 6; Automate the Boring Stuff with Python (free online); focus on: variables, loops, functions, dictionaries, file I/O; the ability to write correct, readable Python code is the foundational skill for all subsequent data engineering work
- Understand databases conceptually — what is a database? what is a table, a row, a column? what is a primary key? what is a foreign key? why do we use databases instead of spreadsheets? Khan Academy SQL introduction; W3Schools SQL (SELECT, WHERE, ORDER BY, simple GROUP BY); this conceptual database foundation makes all subsequent SQL and data modelling learning significantly faster
- Build Excel / Google Sheets data analysis skills — working with tables; sorting and filtering; pivot tables; VLOOKUP; these spreadsheet skills build data intuition and an understanding of what analytical data needs look like before transitioning to programmatic data processing
- CS50x: complete Weeks 0–3 (Scratch, C basics, arrays, algorithms)
- W3Schools SQL: complete all beginner SQL exercises
- Excel: build a simple data analysis of a public dataset (Sri Lanka weather data; World Bank Sri Lanka data)
- Python: write a script that reads a CSV file and calculates summary statistics
- Khan Academy: complete Algebra 1 and Algebra 2 units
- Data engineering requires genuine programming ability, not just familiarity with code — students who learn Python syntax from YouTube videos without writing and debugging their own programs develop false confidence; writing programs that fail, debugging them, and fixing them is the actual learning process; prioritise writing your own code over watching explanations
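The Python milestone in the list above (a script that reads a CSV file and calculates summary statistics) can be attempted with the standard library alone. The sketch below is one possible shape for it. The function name summarise, the column name rainfall_mm, and the sample values are invented for the example, and the data is embedded in the script so that it runs without any external file.

```python
import csv
import io
import statistics

# Invented sample data; a real exercise would read a downloaded public dataset instead.
SAMPLE_CSV = """month,rainfall_mm
Jan,62.0
Feb,69.5
Mar,130.0
Apr,253.5
"""


def summarise(csv_file, column):
    """Read a CSV from an open file object and summarise one numeric column."""
    reader = csv.DictReader(csv_file)
    # Skip blank cells so a missing value does not crash the float conversion.
    values = [float(row[column]) for row in reader if row[column].strip()]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


summary = summarise(io.StringIO(SAMPLE_CSV), "rainfall_mm")
print(summary)
```

To run the same function on a file from disk, open it with `open("your_file.csv", newline="")` (the file name here is a placeholder) and pass the file object in place of the `io.StringIO` wrapper.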
