Why Scala Is an Awesome Programming Language

No one will argue that IT is one of the fastest-developing areas of engineering. New tools, approaches, and ideas complement or even supersede existing ones. One of the fastest-growing technology stacks is the Scala language stack. In this blog we explore what makes this language awesome.

Language Design

Scala was designed to help developers write thread-safe and concise code. It overcomes some JVM limitations and provides features that cannot be achieved in Java. Scala has a clean, expressive, and extensible syntax with many built-in shorthands for the most common cases.

Since less code needs to be written to accomplish the same task, the programmer can focus on solving the problem instead of spending time on boilerplate code (which is especially nice if you pay your programmers per SLOC). Furthermore, Scala reduces the number of places where mistakes can be made and hence improves implementation quality.

Here is a short list of Scala language and compiler features:

  • Type inference – in most cases the compiler can automatically infer the types of values:

val i = 1 // Int

val s = "Hello, world!" // String

  •  Named and default function arguments:

class Point(x: Int = 0, y: Int = 0)

new Point(y = 1) // Point(0, 1)

  •  Tail recursion optimization – recursive self-calls are transformed into loops by the compiler – no more StackOverflowError.
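
A minimal sketch; the @tailrec annotation asks the compiler to verify that the optimization actually applies:

import scala.annotation.tailrec

@tailrec
def sum(xs: List[Int], acc: Int = 0): Int = xs match {
  case Nil          => acc
  case head :: tail => sum(tail, acc + head)
}

sum((1 to 100000).toList) // completes without a StackOverflowError
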
  •  Support for both imperative object-oriented and functional programming – Scala combines object composition, methods, state encapsulation, inheritance, traits and mixins with lazy evaluation, algebraic data types, pattern matching, type classes and, of course, first-class functions.
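
A small illustrative sketch of an algebraic data type with pattern matching (the Shape hierarchy is our own example, not a standard library type):

sealed trait Shape
case class Circle(radius: Double) extends Shape
case class Rectangle(width: Double, height: Double) extends Shape

def area(shape: Shape): Double = shape match {
  case Circle(r)       => math.Pi * r * r
  case Rectangle(w, h) => w * h
}

area(Circle(1.0)) // 3.14159...
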
  •  Lots of immutable and mutable, finite and infinite collections with many built-in transformations like map, reduce, and filter:

users.filter(_.lastVisitDate before today).map(_.email).foreach(sendNotification)

// Sending notifications to all users that visited our site too long ago

  •  Concurrency through the mechanisms of actors and futures:

val sum = (sumActor ? List(1, 2, 3)).mapTo[Int] // sum == Future(6)

val x = sum.map(_ * 2) // x == Future(12)

  •  Generics – the JVM uses type erasure and knows nothing about the actual type parameters at runtime, but Scala can keep that information:
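
A small sketch using TypeTag (part of the scala-reflect module), one way Scala retains type information that the JVM erases:

import scala.reflect.runtime.universe._

def typeName[T: TypeTag](value: T): String = typeOf[T].toString

typeName(List(1, 2, 3)) // "List[Int]"
typeName(Map("a" -> 1)) // "Map[String,Int]"
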
  •  Compile-time metaprogramming – in addition to runtime reflection, the Scala compiler supports macros – functions evaluated during compilation:
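
A minimal sketch of a def macro in Scala 2 syntax (note that macros must live in a module that is compiled before the code that calls them):

import scala.language.experimental.macros
import scala.reflect.macros.blackbox

object Macros {
  // Expands at compile time into a string literal with the call site's file and line
  def location: String = macro locationImpl

  def locationImpl(c: blackbox.Context): c.Tree = {
    import c.universe._
    val pos = c.enclosingPosition
    q"${pos.source.file.name + ":" + pos.line}"
  }
}

// Elsewhere: Macros.location // e.g. "Main.scala:7", computed during compilation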

Distributed Computations and Big Data

One of the fields where Scala has found its widest application is distributed computing. Scala has great mechanisms for working with data sequences, even when they are spread across a cluster, which is why it is frequently used by Big Data engineers and data scientists. Here is a list of the most well-known technologies that use Scala:

  • Akka – a framework for creating distributed systems. It is based on the actor model and makes it easier to implement concurrent applications without race conditions and explicit synchronization.
  •  Spark – a very popular batch data processing framework. Spark integrates with different data sources such as Cassandra, HBase, HDFS, and Parquet files. In addition, it has a Streaming extension that provides tools for building stream processing pipelines. One of the most powerful features of Spark is the ability to run quick ad-hoc jobs to test hypotheses (see the sketch after this list). This is enabled by Spark’s in-memory design, which in some cases performs up to 100 times faster than comparable Hadoop jobs.
  •  Kafka – a high-performance message queue and one of the key middleware components in data streaming systems. It is distributed by design and usually acts as a data buffer between the different stages and parts of a streaming system.
  •  Samza – one more framework for stream processing. It is similar to Spark Streaming but works in a different way: it does not create micro-batches as Spark does, but instead processes data pieces as soon as they arrive, which makes Samza preferable in some cases.
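
As a rough illustration of how concise an ad-hoc Spark job can be, here is a sketch using the classic RDD API; the input path and record format are invented for this example:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ad-hoc-check"))

// Hypothetical input: one "email,daysSinceLastVisit" record per line
val users = sc.textFile("s3://our-bucket/users.csv")
val stale = users
  .map(_.split(','))
  .filter(fields => fields(1).trim.toInt > 90) // inactive for over 90 days
  .map(fields => fields(0))                    // keep only the email

stale.take(10).foreach(println)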

In addition to these tools, it is also possible to use Scala to implement good old Hadoop map-reduce jobs, either directly or via Scalding. Either way, Scala is a great choice for data processing and distributed computing.

Compatibility with Java

One more important thing is compatibility with libraries written in Java. Java code can be used from Scala directly and without limitations, which makes it possible to keep existing modules without re-implementing them. This is quite helpful when you need a rare library, legacy code, or an API that is only implemented for Java.
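
A small sketch of this interoperability; the CollectionConverters import shown is Scala 2.13 syntax (earlier versions use scala.collection.JavaConverters):

import java.util.concurrent.ConcurrentHashMap
import scala.jdk.CollectionConverters._

// A Java collection used directly from Scala, no wrappers required
val cache = new ConcurrentHashMap[String, Int]()
cache.put("visits", 42)
cache.put("signups", 7)

// ...and viewed as a Scala collection when convenient
val total = cache.asScala.values.sum // 49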

Why Scala?

To sum it up, Scala is well designed and very extensible. It has dedicated processes (SIP and SLIP) that let any Scala developer propose enhancements. In conjunction with a large community, the Scala ecosystem has been growing rapidly. Scala has its own stack of tools and is compatible with existing Java code. It brings effective new approaches that let programmers do their job more efficiently. All these features make Scala one of the most attractive modern programming languages.

Contact us at contact@dsr-company.com to learn more or if you have a Scala project to discuss.

Which Big Data technology stack is right for your business?

If data analysis is one of the core features of your product, then you probably already know that choosing a data storage and processing solution requires careful consideration. Let’s discuss the pros and cons of the most popular choices: Redshift/EMR, DynamoDB + EMR, AWS RDS for PGSQL, and Cassandra + Spark.

Managed Amazon Redshift/EMR

PRO – It’s fully-managed by Amazon with no need to hire support staff for maintenance.
PRO – It’s scalable to petabyte-size with very few mouse clicks.
PRO – Redshift is SQL-compatible, so you can use external BI tools to analyze data.
PRO – Redshift is quite fast and performant for its price on typical BI queries.
CON – Redshift’s SQL is the only way to structure and analyze data inside Redshift. It may be enough for simple tasks, but for complex tasks like social network analysis or text mining (or even running custom AWS EMR tasks) you have to manually export all data to external storage (to S3, for example), run your external analytics tasks, and load the results back into Redshift. The amount of manual work will only grow with time, ultimately making Redshift an obstacle.
CON – Redshift’s SQL dialect is also very limited (a tradeoff for its performance). The main drawbacks are missing secondary index support, no full-text search, and no support for unstructured JSON data. It is usually fine for structured, pre-cleaned data, but it is really hard to store and analyze semi-structured data there (like data from social networks or text from web pages).
CON – EMR has very weak integration with Redshift: you have to export/import all data through S3.
CON – To write analytical EMR jobs, you have to hire people with pricey Big Data/Hadoop competence.

Managed Amazon DynamoDB + EMR

PRO – It’s fully-managed by Amazon with no need to hire support staff for maintenance.
PRO – It’s scalable to petabyte-size with very few clicks of the mouse.
CON – Pricing is opaque, and it may be rather costly to run analytical workloads (such as full-table scans for text mining) on large datasets.
CON – DynamoDB is a key-value NoSQL store. For most analytical queries you have to use EMR tools like Hive, which is rather slow, taking minutes for simple queries that typically execute almost instantly on Redshift/RDS.
CON – DynamoDB is a closed technology that is unpopular in the Big Data community (mostly because of its pricing). We’ve also noticed difficulty finding people with the required competences to extend the system later.

Custom ‘light’ solution with AWS RDS for PGSQL

PRO – PostgreSQL is easily deployable anywhere, has a very large community, and there are many people with the required competence. You can use either the hosted RDS version or install your own on EC2 – it does not require any hardcore maintenance (unlike a custom Hadoop cluster) and just works.
PRO – RDS PostgreSQL supports querying unstructured JSON data (so you can store social network data in a more natural way than in Redshift), full-text search (so you can search a user’s friends for custom keywords), and multiple data types (like arrays, which are very useful for storing social graph data); see the short sketch after this list.
PRO – Has full-featured, unrestricted SQL support for your analytical needs and external BI tools.
CON – PGSQL is not “big” data friendly. Although versatile for small-to-medium data, our experience has uncovered difficulties when scaling to large dataset sizes. Scaling may become a serious issue later and require non-trivial architectural changes in the whole analytical backend, but this option may speed up development if data size is not an issue.
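
As a rough sketch of what querying JSON in PostgreSQL looks like from application code (Scala over plain JDBC here; the table, column, and connection details are invented for the example, and the PostgreSQL JDBC driver must be on the classpath):

import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:postgresql://localhost/analytics", "user", "secret")

// 'profiles' is a hypothetical table with a jsonb column named 'data'
val rs = conn.createStatement().executeQuery(
  """SELECT data->>'name' AS name
    |FROM profiles
    |WHERE data @> '{"city": "Boston"}'::jsonb""".stripMargin)
while (rs.next()) println(rs.getString("name"))
conn.close()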

Custom ‘heavy’ solution with Cassandra + Spark

PRO – Cassandra + Spark can easily handle storing and analyzing petabytes of data (see the sketch after this list).
PRO – Cassandra deals with semi-structured data well, which comes in handy when storing social network data like user’s Facebook wall posts, friends, etc.
PRO – Spark ships with good machine-learning (for example, dimensionality reduction) and graph-processing (usable for SNA analysis) libraries. It also has a Python API, so external tools from NumPy and scikit-learn can be used.
PRO – As a self-hosted solution, Cassandra + Spark is much more flexible for future complex analysis tasks.
PRO – Spark has Spark SQL, which makes it easy to integrate external BI tools.
CON – Cassandra may need higher-tier competences for the challenges that arise when scaling, which, in turn, may require additional investment in support staff.
CON – Spark is a rather new technology, but it has already positioned itself within the Big Data community as a next-gen Hadoop. At present it may be hard to find people with Spark competency, but the user community is growing quickly, making these skills easier to find as time passes.
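
As a rough sketch of how the two fit together, here is what reading a Cassandra table into Spark looks like with the DataStax spark-cassandra-connector; the keyspace, table, and column names are invented for the example:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-analytics")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Hypothetical schema: keyspace "social", table "wall_posts" with a "user_id" column
val posts = sc.cassandraTable("social", "wall_posts")
val postsPerUser = posts
  .map(row => (row.getString("user_id"), 1))
  .reduceByKey(_ + _)

postsPerUser.take(10).foreach(println)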

To Conclude

The final choice is dictated by your current business priorities.

If you need to move forward fast with less maintenance routine and are not afraid of later technical debt, we recommend the ‘light’ solution or Amazon DynamoDB. If your top priority is system scalability, then the ‘heavy’ solution is the clear choice.

The impact of Big Data is really Big

Data analytics (and big data as a part of it) is not just an ordinary business tool, and it is not just a buzzword: it deeply impacts almost all modern industries. By the end of 2014 the Big Data industry had reached $16 billion in size, with a forecast value of $48 billion over the next five years. As an innovation, big data has much in common with the invention of the internet – it can revolutionize every industry and affect everybody.

Many industries benefit from data analysis, and here are a few not-so-obvious examples:
  • Farming. With the help of drones, farmers are now able to collect precise information about the health of their crops, the level of field hydration, and crop growth dynamics. By analyzing that data, they can use fertilizers more economically or build a more effective irrigation system. As a result, production costs decrease and revenue grows.
  • Film making. Prior to filming its TV series “House of Cards,” Netflix performed extensive research to determine who should direct (Fincher), who should play the lead role (Kevin Spacey), and what the plot should be (it is a remake of an older series) in order to hit a certain audience. As its IMDB rating is 9.1, it is safe to say that the analysis was executed well and contributed to the overall success of the show.
  • Oil extraction. Kaggle has built software that helps oil companies determine how much to bid on an oil lease or find the optimal well spacing. Oil companies are now able to make more informed decisions, resulting in improved operational indicators.
  • Professional athlete training. Using dozens of different detectors and tools that measure almost every physical parameter of an athlete – blood pressure, heart rate, body temperature, muscle tension – coaches are now able to find the best training strategy or even predict an athlete’s performance.
  • Fighting crime. The Smart Policing program, implemented in 38 different American police departments, funds and empowers local, data-focused, crime prevention tactics. A key feature of the program is “hot spot policing,” which analyzes geographic patterns to uncover highly likely crime locales. 

Big data and data analysis have the potential to improve the performance and operations of any business. How has your company been affected by the big data revolution? With the help of data analytics you can make your product offering more competitive, your services more targeted, and your next steps in the market more strategic. It is very likely that your competitors are already taking advantage of data analytics to strengthen their product or service offering. If you haven’t already, it is time to consider what big data and data analysis can do for you.