Archive for the ‘General’ Category

What is AWS S3?

S3 stands for Simple Storage Service.

Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, and inexpensive data storage infrastructure.

Unlike other storage systems such as the Unix file system or HDFS (i.e. Hadoop Distributed File System), which are based on folders and files, S3 is based on the concepts of a “key” and an “object”. Amazon S3 stores data as objects within a bucket, which is a logical unit of storage. An object consists of a file and, optionally, any metadata that describes that file.

To store an object in Amazon S3, you upload the file you want to store to a bucket. When you upload a file, you can set permissions on the object as well as any metadata. Buckets are the containers: you control access per bucket, view access logs for the bucket and its objects, and choose the geographical region where Amazon S3 will store the bucket and its contents. Customers are not charged for creating buckets, but are charged for storing objects in a bucket and for transferring objects in and out of buckets.

The Amazon S3 data model is a flat structure: there is no hierarchy of sub-buckets or sub-folders. You can, however, infer a logical hierarchy using key name prefixes and delimiters, and the Amazon S3 console supports a concept of folders. For example:

documents/csv/datafeed.csv

Each Amazon S3 object has data (e.g. a file), a key, and metadata (e.g. the object creation date, or a privacy classification such as protected, sensitive, or public). A key uniquely identifies the object in a bucket. Object metadata is a set of name-value pairs. You can set object metadata when you upload the object. Metadata cannot be modified after uploading, but you can make a copy of the object and set new metadata on the copy.
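To make the bucket/key/object model concrete, here is a minimal sketch using the AWS SDK for Java (v1); the bucket name, key, and local file path are made-up examples:

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

public class S3PutGetSketch {

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // User-defined metadata is a set of name-value pairs attached at upload time.
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.addUserMetadata("classification", "public");

        // The key "documents/csv/datafeed.csv" only looks hierarchical;
        // S3 stores it as a flat key inside the bucket.
        s3.putObject(new PutObjectRequest("my-example-bucket",
                "documents/csv/datafeed.csv",
                new File("/tmp/datafeed.csv"))
                .withMetadata(metadata));

        // The bucket name plus the key uniquely identifies the object.
        ObjectMetadata stored = s3.getObjectMetadata("my-example-bucket", "documents/csv/datafeed.csv");
        System.out.println(stored.getUserMetadata());   // {classification=public}
    }
}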

Advantages of S3

  • Elasticity

If you were to use HDFS on Amazon EC2 (i.e. Elastic Compute Cloud) infrastructure and your storage requirements grew, you would need to add AWS EBS (i.e. Elastic Block Store) volumes and other resources to the EC2 infrastructure to scale up. You would also need to take additional steps for monitoring, backups, and disaster recovery.

S3 decouples compute from storage. This decoupling allows you to easily (i.e. elastically) scale your storage requirements up or down.

S3’s opt-in versioning feature automatically maintains backups of modified or deleted files, making it easy to recover from accidental data deletion.

  • Cost

S3 is 3 to 5 times cheaper than the AWS EBS (i.e. Elastic Block Store) volumes used by HDFS.

  • Performance

S3 consumers don’t have the data locally, so every read has to transfer data across the network, and S3 performance tuning itself is a black box. Since HDFS data is local to the compute, it is much faster (e.g. 3 to 5 times) than S3. S3 has a higher read/write latency than HDFS.

  • Availability & Durability & Security

Availability guarantees system uptime, and durability guarantees that data that has been written survives permanently. S3 claims 99.999999999% durability and 99.99% availability, whereas HDFS on EBS gives an availability of about 99.9%.

S3’s cross-region replication feature can be used for disaster recovery and strengthens availability by withstanding the complete outage of an AWS region.

S3 has easy-to-configure audit logging and access control capabilities. These features, along with multiple types of encryption, make it easier to meet regulatory compliance requirements such as PCI (i.e. Payment Card Industry) or HIPAA (i.e. Health Insurance Portability and Accountability Act).

  • Multipart Upload

You can now break your larger objects (e.g. > 100 MB) into chunks and upload a number of chunks in parallel. If the upload of a chunk fails, you can simply restart it.
You’ll be able to improve your overall upload speed by taking advantage of parallelism.

For example, you can break a 10 GB file into as many as 1024 separate parts and upload each one independently, as long as each part has a size of 5 MB or more.
If an upload of a part fails it can be restarted without affecting any of the other parts.
S3 will return an ETag in response to each part uploaded. Once you have uploaded all of the parts you can ask S3 to assemble the full object with another call to S3.
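As a rough sketch, the AWS SDK for Java (v1) ships a TransferManager that does exactly this: it splits large files into parts, uploads the parts in parallel, and retries failed parts before asking S3 to assemble the object. The bucket, key, and file path below are made up:

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

public class MultipartUploadSketch {

    public static void main(String[] args) throws InterruptedException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        TransferManager tm = TransferManagerBuilder.standard()
                .withS3Client(s3)
                .build();
        try {
            // Large files are automatically uploaded as parallel multipart uploads.
            Upload upload = tm.upload("my-example-bucket",
                    "backups/large-datafeed.csv",
                    new File("/tmp/large-datafeed.csv"));
            upload.waitForCompletion();   // blocks until all parts are uploaded and assembled
        } finally {
            tm.shutdownNow();             // also shuts down the underlying S3 client
        }
    }
}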

 

Characteristics of a microservices architecture
  • Large monolith architectures are broken down into many small services.
    • Each service runs in its own process.
    • The applicable cloud rule is one service per container.
  • Services are optimized for a single function.
    • There is only one business function per service.
    • The Single Responsibility Principle: A microservice should have one, and only one, reason to change.
  • Communication is through REST API and message brokers.
    • Avoid tight coupling introduced by communication through a database.
  • Continuous integration and continuous deployment (CI/CD) is defined per service.
    • Services evolve at different rates.
    • You let the system evolve but set architectural principles to guide that evolution.
  • High availability (HA) and clustering decisions are defined per service.
    • One size or scaling policy is not appropriate for all.
    • Not all services need to scale; others require auto scaling up to large numbers of instances.

Lombok

Posted: June 9, 2018 in General, Java, Java8

Let’s take a look at the following sample code.

import java.io.Serializable;
import java.util.Objects;

public class User implements Serializable {

    private long id;
    private String username;
    private String login;

    public long getId() {
        return id;
    }

    public void setId(long id) {
        this.id = id;
    }

    public String getUsername() {
        return username;
    }

    public void setUsername(String username) {
        this.username = username;
    }

    public String getLogin() {
        return login;
    }

    public void setLogin(String login) {
        this.login = login;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        User user = (User) o;
        return id == user.id &&
                Objects.equals(username, user.username) &&
                Objects.equals(login, user.login);
    }

    @Override
    public int hashCode() {

        return Objects.hash(id, username, login);
    }
}

A data class should have getters and setters for its instance variables, equals and hashCode implementations, constructors, and a toString implementation. The class above has no business logic so far, and even without it, it is already 50+ lines of code. This is insane.

Lombok is used to reduce boilerplate code in model/data objects; for example, it can generate getters and setters for those objects automatically by using Lombok annotations. The easiest way is to use the @Data annotation.

import java.io.Serializable;
import lombok.Data;

@Data
public class User implements Serializable {

    private long id;
    private String username;
    private String login;
}
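At compile time Lombok generates the same getters, setters, equals, hashCode, and toString as the hand-written version, so a hypothetical snippet like the following works against the annotated class:

public class UserDemo {

    public static void main(String[] args) {
        User user = new User();
        user.setId(1L);                        // setter generated by @Data
        user.setUsername("suhas");
        user.setLogin("suhas@example.com");

        // toString(), equals() and hashCode() are generated by @Data as well
        System.out.println(user);              // User(id=1, username=suhas, login=suhas@example.com)
    }
}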

How do you add Lombok to your Java project?

Using Gradle

dependencies {
    compileOnly('org.projectlombok:lombok:1.16.20')
    // On newer Gradle versions the annotation processor must be declared explicitly
    annotationProcessor('org.projectlombok:lombok:1.16.20')
}

Using Maven

<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <version>1.16.20</version>
    <!-- provided scope: Lombok is only needed at compile time -->
    <scope>provided</scope>
</dependency>

Tips to remember while using Lombok

  1. Don’t mix business logic with Lombok-annotated classes
  2. Use @Data for your DAOs
  3. Use @Value for immutable value objects
  4. Use @Builder when you have an object with many fields of the same type (a sketch combining @Value and @Builder follows this list)
  5. Exclude generated classes from the Sonar report. If you are using Maven and Sonar, you can do this using the sonar.exclusions property.
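As a quick illustration of tips 3 and 4, here is a hypothetical immutable value object built with @Value and @Builder (the Address class and its fields are made up for this example):

import lombok.Builder;
import lombok.Value;

// @Value makes every field private and final and generates getters,
// equals/hashCode and toString; @Builder generates a fluent builder,
// which is handy when many fields share the same type.
@Value
@Builder
public class Address {
    String street;
    String city;
    String zipCode;
}

// Usage:
// Address address = Address.builder()
//         .street("221B Baker Street")
//         .city("London")
//         .zipCode("NW1 6XE")
//         .build();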

Understanding the CAP theorem

Posted: June 3, 2018 in General

Finding the ideal database for your application is largely a choice between trade-offs. The CAP theorem is one concept that can help you understand the trade-offs between different databases. The CAP theorem was originally proposed by Eric Brewer in 2000. It was originally conceptualized around networked shared-data systems and is often used to generalize the trade-offs between different databases. The CAP theorem centers on three desirable properties: consistency (all users get the same data, no matter where they read it from), availability (users can always read from and write to the database), and partition tolerance (the database keeps working when divided across a network).

The theorem states that you can guarantee at most two of the three properties simultaneously. So you can have an available, partition-tolerant database; a consistent, partition-tolerant database; or a consistent, available database. One thing to note is that these properties are not necessarily exclusive of each other. You can have a consistent, partition-tolerant database that still emphasizes availability, but you are going to sacrifice part of either your consistency or your partition tolerance.

Relational databases trend towards consistency and availability. Partition tolerance is something that relational databases typically don’t handle very well; often you have to write custom code to handle the partitioning of a relational database. NoSQL databases, on the other hand, trend towards partition tolerance. They are designed with the idea that you’re going to be adding more nodes to your database as it grows. CouchDB, for example, is an available, partition-tolerant database.

That means the data is always available to read from and write to, and that you’re able to add partitions as your database grows. In some instances, the CAP theorem may not apply to your application. Depending on the size of your application, CAP trade-offs may be irrelevant. If you have a small or low-traffic website, partitions may be useless to you, and in some cases consistency trade-offs may not be noticeable. For instance, the votes on a comment may not show up right away for all users.

This is fine as long as all votes are displayed eventually. The CAP theorem can be used as a guide for categorizing the tradeoffs between different databases. Consistency, availability, and partition tolerance are all desirable properties in a database. While you may not be able to get all three in any single database system, you can use the CAP theorem to help you decide what to prioritize.

To build a Java application, the first step is to create a Java project. Most Java projects rely on third-party Java archive (JAR) dependencies, and these third-party archives usually have dependencies of their own. On top of that, each version of a dependency relies on particular versions of other dependencies. Managing all these dependencies is a nightmare that Java developers have nicknamed JAR hell. To avoid JAR hell, we use dependency management build systems like Maven or Gradle.

But even with Maven and Gradle, versioning between individual .jar files can be a nuisance. Spring Boot recognizes this and created the notion of a Spring Boot Starter, which bundles several dependencies into a grouping that is easier to manage. There are a lot, and I mean a lot, of Spring Boot Starter dependencies, so even cobbling together a project on your own can be difficult. This is where Spring Initializr comes to the rescue. Spring Initializr is a tool for creating Spring Boot Java projects by answering a series of questions and selecting check boxes to choose which features to include.

Initializr creates the package structure, the pom.xml (for Maven) or build.gradle (for Gradle) file, and any required Java source classes.

Let’s see how to use Spring Initializr.

Step 1: Go to https://start.spring.io/

Step 2: Choose a Java project with Maven and the latest Spring Boot version.


Step 3: If you want to see more options, click on the ‘switch to full version’ link at the bottom of the page.


Step 4: Choose Spring Starter packages

Scroll past the Generate Project button and look at all of the Spring Starter packages. From these, choose Web, and within Web, Rest Repositories.


Keep scrolling until you get to the SQL section, and choose JPA and H2.


Now go back and click the Generate Project button.

Spring Initializr will generate a zip file. Copy it to your working folder, unzip it there, and start working on your project. 🙂
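For reference, the generated project contains a main application class along the lines of the following (the package and class names depend on what you entered in Initializr):

package com.example.demo;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

// Entry point generated by Spring Initializr; @SpringBootApplication enables
// auto-configuration and component scanning for the project.
@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }
}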

Let’s say the git remote repository has the following branches

master
develop
bug-fix-3
bug-fix-4

and your local repository has the following branches

master
develop
bug-fix-1
bug-fix-2
bug-fix-3

As you can see, the branches bug-fix-1 and bug-fix-2 don’t exist on the remote; they have been deleted by someone.
Note: to cover both cases,
the bug-fix-1 branch was deleted after being merged into master, and
the bug-fix-2 branch was deleted without merging, as its changes were not required.

Now I will explain how to remove the local branches which no longer exist on the remote.

How to remove merged local branches which were deleted on the remote?

Step 1: git fetch -p
The -p (prune) option removes, after fetching, any remote-tracking references (e.g. origin/bug-fix-1) that no longer exist on the remote.
The local branches that were already merged into master can then be deleted safely with git branch -d <branch-name>, for example git branch -d bug-fix-1.

Step 2: git branch

You will see that the merged bug-fix-1 branch is removed from your local repository.
master
develop
bug-fix-2
bug-fix-3

But sometimes you will see that some local branches (in this case, bug-fix-2) are still present, because they were never merged into the master branch and git branch -d refuses to delete them.

How to delete unmerged local branches which were deleted on the remote?

Step 1: git branch -vv

This command lists all the local branches with some additional information, such as their related upstream/remote branch and the latest commit message.

master     49a9c07a71 [origin/master: behind 1] Merge branch 'bug-fix-3' into 'master'
develop    877142a45c [origin/develop: behind 4] Test commit message
bug-fix-2  cba6823909 [origin/bug-fix-2: gone] Bug fix 2 final commit
bug-fix-3  1f0a4ace9e [origin/bug-fix-3: ahead 3] Bug fix 3 final commit

You might have noticed that against the bug-fix-2 branch, the additional information says “gone”. This means the branch has been deleted on the remote. Now let’s remove it from the local repository.

Step 2: git branch -D bug-fix-2
Output: Deleted branch bug-fix-2 (was cba6823909).

This command will delete the unmerged branch from your local repository.

Step 3: git branch
You will see that the bug-fix-2 branch is removed from your local repository.
master
develop
bug-fix-3

Hope this article helps all of you!

A batch application is nothing more than a program whose goal is to process large amounts of data on a scheduled basis.

Most enterprise applications rely heavily on batch jobs. They run during the night and do all the time-consuming tasks that cannot be done during business hours. These tasks are often critical to the business, and errors can cause serious damage. Spring Batch can help us achieve these goals.

Spring Batch provides reusable functions that are essential for processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. It also provides more advanced technical services and features that enable extremely high-volume and high-performance batch jobs through optimization and partitioning techniques.

Let’s take a look at the Spring Batch architecture.

Each batch Job is composed of one or more Steps.

A JobInstance represents a given Job, parameterized with a set of typed properties called JobParameters.

Each run of a JobInstance is a JobExecution.

JobLocator: the class responsible for getting the configuration information, such as the implementation plan (job script), for a given job passed as a parameter. It works in conjunction with the JobRunner.

Launching a job with its job parameters is the responsibility of the JobLauncher, which is instantiated by the JobRunner.

JobRunner is the class responsible for executing a job on an external request. It has several implementations to support different invocation modes, such as a shell script.

Finally, various objects in the framework require a JobRepository to store runtime information related to the batch execution.
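To tie these pieces together, here is a minimal sketch of launching a job with typed JobParameters through the JobLauncher; the jobLauncher and suhasJob1 beans are assumed to be configured elsewhere:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class JobLaunchSketch {

    private final JobLauncher jobLauncher;
    private final Job suhasJob1;

    public JobLaunchSketch(JobLauncher jobLauncher, Job suhasJob1) {
        this.jobLauncher = jobLauncher;
        this.suhasJob1 = suhasJob1;
    }

    public void launch() throws Exception {
        // Each distinct set of JobParameters identifies a separate JobInstance;
        // each run of that JobInstance is a JobExecution.
        JobParameters params = new JobParametersBuilder()
                .addString("inputFile", "documents/csv/datafeed.csv")
                .addLong("launchTime", System.currentTimeMillis())
                .toJobParameters();

        JobExecution execution = jobLauncher.run(suhasJob1, params);
        System.out.println("Exit status: " + execution.getExitStatus());
    }
}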

 

 

A Spring Batch job consists of the following components:

  • The Job represents the Spring Batch job. Each job can have one or more steps.
  • The Step represents an independent logical task (e.g. importing information from an input file). Each step belongs to one job.
  • The ItemReader reads the input data and provides the found items one by one. An ItemReader belongs to one step, and each step must have exactly one ItemReader.
  • The ItemProcessor transforms items, one item at a time, into a form that is understood by the ItemWriter. An ItemProcessor belongs to one step, and each step can have one ItemProcessor (a minimal sketch follows this list).
  • The ItemWriter writes the information of an item to the output, one item at a time. An ItemWriter belongs to one step, and a step must have exactly one ItemWriter.
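Here is the minimal ItemProcessor sketch referred to in the list above; the class name and the String-to-String transformation are made up for illustration:

import org.springframework.batch.item.ItemProcessor;

// Receives one item at a time from the ItemReader and hands the transformed
// item to the ItemWriter.
public class UpperCaseItemProcessor implements ItemProcessor<String, String> {

    @Override
    public String process(String item) throws Exception {
        // Returning null would filter the item out (it never reaches the ItemWriter).
        return item.trim().toUpperCase();
    }
}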

 

Let’s see how a job with multiple steps works.

 

How to configure a job with one or more steps?

<job id="suhasJob1">
<step id="step1" next="step2"/>
<step id="step2" next="step3"/>
<step id="step3/>
</job>

 

How to configure a job referencing the job repository?

<job id="suhasJob1" job-repository="suhasJobRepo">
<step id="step1" next="step2"/>
<step id="step2" next="step3"/>
<step id="step3/>
</job>

How to add Job Listeners?

During the course of the execution of a Job, it may be useful to be notified of various events in its lifecycle so that custom code can be executed. For this, we can implement the JobExecutionListener interface.

<job id="suhasJob1" job-repository="suhasJobRepo">
<step id="step1" next="step2"/>
<step id="step2" next="step3"/>
<step id="step3/>
<listeners>
<listener ref="sampleListener"/>
</listeners>
</job>

 

public interface JobExecutionListener {
    void beforeJob(JobExecution jobExecution);
    void afterJob(JobExecution jobExecution);
}
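A hypothetical implementation matching the sampleListener bean referenced in the job configuration above could simply log the job status before and after execution:

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;

public class SampleJobListener implements JobExecutionListener {

    @Override
    public void beforeJob(JobExecution jobExecution) {
        System.out.println("Starting job: " + jobExecution.getJobInstance().getJobName());
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // afterJob is called regardless of whether the job succeeded or failed.
        System.out.println("Finished job with status: " + jobExecution.getStatus());
    }
}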

 

Inheriting from a Parent Step

<step id="parentStep">
<tasklet allow-start-if-complete="true">
<chunk reader="suhasItemReader" writer="suhasItemWriter" commit-interval="10"/>
</tasklet>
</step>

<step id=”childStep” parent=”parentStep”>
<tasklet start-limit=”5″>
<chunk processor=”suhasItemProcessor” commit-interval=”5″/>
</tasklet>
</step>

In the above configuration, the Step “childStep” will inherit from “parentStep”. It will be instantiated with ‘suhasItemReader’, ‘suhasItemProcessor’, ‘suhasItemWriter’, startLimit=5, and allowStartIfComplete=true. Additionally, the commitInterval will be 5, since it is overridden by “childStep”.

 


Abstract Step

Sometimes it may be necessary to define a parent Step that is not a complete Step configuration. If, for instance, the reader, writer, and tasklet attributes are left off of a Step configuration, then initialization will fail. If a parent must be defined without these properties, then the “abstract” attribute should be used. An “abstract” Step will not be instantiated; it is used only for extending.

In the following example, the Step “abstractParentStep” would not instantiate if it were not declared to be abstract. The Step “step1” will have ‘suhasItemReader’, ‘suhasItemWriter’, and commitInterval=10.

<step id="abstractParentStep" abstract="true">
<tasklet>
<chunk commit-interval="10"/>
</tasklet>
</step>

 

<step id="step1" parent="abstractParentStep">
<tasklet>
<chunk reader="suhasItemReader" writer="suhasItemWriter"/>
</tasklet>
</step>

 

Note: In order to allow a child to add additional listeners to the list defined by the parent, every list element has a "merge" attribute. If the element specifies merge="true", then the child’s list will be combined with the parent’s instead of overriding it.

<job id="abstractParentStep" abstract="true">
<listeners>
<listener ref="listenerOne"/>
<listeners>
</job>

 

<job id="childJob" parent="abstractParentStep">
<step id="step1"/>
<listeners merge="true">
<listener ref="listenerTwo"/>
<listeners>
</job>